Policy Gradient Methods for Reinforcement Learning With Function Approximation (1999)
April 27, 2024

Policy gradient methods represent a stochastic policy with its own function approximator, independent of any value function, and update the policy parameters according to the gradient of expected reward with respect to those parameters. Policy Gradient Methods for Reinforcement Learning With Function Approximation shows that a form of policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.
In the standard reinforcement learning framework, where a learning agent interacts with a Markov decision process, policy gradient methods update the policy parameter vector \(\theta\) approximately in proportion to the gradient of an objective \(\rho\), with learning rate \(\alpha\): $$ \Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta} $$
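As a purely illustrative instantiation of this update, the sketch below assumes a tabular softmax (Gibbs) policy and a generic gradient-ascent step, given some estimate `grad_rho` of \(\partial\rho/\partial\theta\); the parameterization and function names are assumptions, not part of the paper.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(s, .) for a tabular softmax (Gibbs) policy with preferences theta[s, a]."""
    prefs = theta[s] - theta[s].max()      # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def gradient_ascent_step(theta, grad_rho, alpha=0.1):
    """Move the policy parameters in the direction of the performance gradient:
    Delta theta = alpha * d rho / d theta, where grad_rho is any estimate of that gradient."""
    return theta + alpha * grad_rho
```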
The paper considers two formulations of the objective. The first maps a policy to its long-term average reward: $$ \rho(\pi)=\lim_{t\rightarrow\infty}\frac{1}{t}E\{r_1+r_2+\cdots+r_t|\pi\}=\sum_sd^\pi(s)\sum_a\pi(s, a)\mathcal{R}^a_s $$ where \(\mathcal{R}^a_s=E\{r_{t+1}|s_t=s, a_t=a\}\) and \(d^\pi(s) = \lim_{t\rightarrow\infty}Pr\{s_t=s|s_0, \pi\}\) is the stationary distribution of states under \(\pi\), assumed to exist and to be independent of \(s_0\) for all policies. The corresponding state-action value function is $$ Q^\pi(s, a)=\sum^\infty_{t=1}E\{r_t-\rho(\pi)|s_0=s, a_0 = a, \pi\} $$
The second formulation is the discounted-reward objective with a designated start state \(s_0\) and discount rate \(\gamma\in[0, 1]\): $$ \begin{align*} \rho(\pi) &= E\left\{\sum^\infty_{t=1}\gamma^{t-1}r_t|s_0,\pi \right\}\\ Q^\pi(s, a) &= E\left\{\sum^\infty_{k=1}\gamma^{k-1}r_{t+k}|s_t=s, a_t=a, \pi\right\} \end{align*} $$
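To make the two formulations concrete, here is a minimal numerical sketch (not from the paper) for an invented two-state, two-action MDP and a fixed stochastic policy: it computes the stationary distribution and average reward for the first objective, and solves the Bellman equations for the discounted \(Q^\pi\) of the second. The transition probabilities, rewards, and policy are arbitrary illustrative values.

```python
import numpy as np

# Invented 2-state, 2-action MDP: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.6, 0.4]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],                # a fixed stochastic policy pi[s, a]
               [0.4, 0.6]])
gamma = 0.9

P_pi = np.einsum('sa,ast->st', pi, P)     # P_pi[s, s'] = sum_a pi(s, a) P(s' | s, a)
r_pi = np.sum(pi * R, axis=1)             # expected one-step reward in each state

# Average-reward formulation: the stationary d^pi solves d = d P_pi with sum(d) = 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmax(np.real(evals))])
d_pi = d_pi / d_pi.sum()
rho_avg = np.sum(d_pi[:, None] * pi * R)  # rho(pi) = sum_s d(s) sum_a pi(s, a) R^a_s
print("average reward:", rho_avg)

# Discounted formulation: solve the Bellman equations for V^pi, then
# Q^pi(s, a) = R^a_s + gamma * sum_{s'} P(s' | s, a) V^pi(s').
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
Q = R + gamma * np.einsum('ast,t->sa', P, V)
print("discounted Q^pi:\n", Q)
```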
If, for the discounted formulation, \(d^\pi(s)\) is redefined as a discounted weighting of the states encountered when starting at \(s_0\) and then following \(\pi\), \(d^\pi(s)=\sum^\infty_{t=0}\gamma^tPr\{s_t=s|s_0,\pi\}\), then for any MDP the gradient of either objective with respect to the policy parameters is given by the policy gradient theorem: $$ \begin{equation} \frac{\partial\rho}{\partial\theta}=\sum_sd^\pi(s)\sum_a\frac{\partial\pi(s, a)}{\partial\theta}Q^\pi(s, a) \end{equation} $$
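The theorem can be checked numerically. The sketch below, again on an invented two-state MDP with a tabular softmax policy and the discounted start-state objective, evaluates the right-hand side of (1) exactly and compares it with a finite-difference estimate of \(\partial\rho/\partial\theta\). The MDP, the policy parameterization, and the helper names are all assumptions made for illustration; only the identity being checked comes from the paper.

```python
import numpy as np

gamma = 0.9
# Invented 2-state, 2-action MDP: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
nS, nA, s0 = 2, 2, 0

def policy(theta):
    """Tabular softmax policy pi[s, a] with preferences theta[s, a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def q_values(pi):
    """Exact discounted Q^pi from the Bellman equations."""
    P_pi = np.einsum('sa,ast->st', pi, P)              # state transitions under pi
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, np.sum(pi * R, axis=1))
    return R + gamma * np.einsum('ast,t->sa', P, v)

def rho(theta):
    """Start-state objective rho(pi) = V^pi(s0)."""
    pi = policy(theta)
    return float(np.sum(pi[s0] * q_values(pi)[s0]))

def policy_gradient(theta):
    """Right-hand side of the policy gradient theorem (1)."""
    pi = policy(theta)
    Q = q_values(pi)
    P_pi = np.einsum('sa,ast->st', pi, P)
    # d^pi(s) = sum_t gamma^t Pr{s_t = s | s0, pi}   (discounted state weighting)
    d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])
    grad = np.zeros_like(theta)
    for s in range(nS):
        for a in range(nA):
            dpi = np.zeros_like(theta)                 # d pi(s, a) / d theta
            dpi[s] = pi[s, a] * ((np.arange(nA) == a) - pi[s])
            grad += d[s] * dpi * Q[s, a]
    return grad

theta = np.random.default_rng(0).normal(size=(nS, nA))
fd = np.zeros_like(theta)                              # finite-difference gradient of rho
eps = 1e-5
for i in range(nS):
    for j in range(nA):
        e = np.zeros_like(theta); e[i, j] = eps
        fd[i, j] = (rho(theta + e) - rho(theta - e)) / (2 * eps)
print(np.allclose(policy_gradient(theta), fd, atol=1e-5))
```

With the unnormalized discounted weighting \(d^\pi\), (1) is exact, so the finite-difference estimate should agree with the analytic expression up to numerical precision.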
\(Q^\pi(s, a)\) is typically unknown and must be estimated. Let \(f_w: \mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}\) with parameter \(w\) be an approximation to \(Q^\pi\), and let \(\hat{Q}^\pi(s_t, a_t)\) be some unbiased estimator of \(Q^\pi (s_t, a_t)\), perhaps the return \(R_t\). Then \(f_w\) can be learned by updating \(w\) by \(\Delta w_t\propto\frac{\partial}{\partial w}[\hat{Q}^\pi(s_t, a_t)-f_w(s_t, a_t)]^2\propto [\hat{Q}^\pi(s_t, a_t)-f_w(s_t, a_t)]\frac{\partial f_w(s_t, a_t)}{\partial w}\). When this process converges to a local optimum, $$ \begin{equation} \sum_s d^\pi(s)\sum_a\pi(s, a)[Q^\pi(s, a)-f_w(s, a)]\frac{\partial f_w(s, a)}{\partial w} = 0 \end{equation} $$
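As a sketch of this learning rule, assuming for illustration a linear approximator \(f_w(s, a)=w^\top\phi(s, a)\) over some feature vector \(\phi\) (the paper only requires \(f_w\) to be differentiable in \(w\)):

```python
import numpy as np

def critic_update(w, phi_sa, q_hat, beta=0.01):
    """One stochastic gradient step on [Q_hat - f_w(s, a)]^2 for a linear f_w(s, a) = w . phi_sa.

    phi_sa : feature vector phi(s, a)  (the linear form is an assumption)
    q_hat  : unbiased estimate of Q^pi(s, a), e.g. the observed return R_t
    beta   : critic step size (hypothetical name)
    """
    error = q_hat - w @ phi_sa            # Q_hat - f_w(s, a)
    return w + beta * error * phi_sa      # d f_w / d w = phi_sa for a linear f_w
```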
To express \(\frac{\partial\rho}{\partial\theta}\) in terms of \(f_w(s, a)\), assume that \(f_w\) satisfies (2) and is compatible with the policy parameterization in the sense that $$ \begin{equation} \frac{\partial f_w(s, a)}{\partial w}=\frac{\partial \pi (s, a)}{\partial \theta}\frac{1}{\pi (s, a)} \end{equation} $$ Combining (2) and (3) gives $$ \sum_s d^\pi(s)\sum_a\frac{\partial\pi(s, a)}{\partial \theta}[Q^\pi(s, a)-f_w(s, a)] = 0 $$ Because this expression is zero, it can be subtracted from the policy gradient theorem (1) without changing its value: $$ \begin{align*} \frac{\partial\rho}{\partial\theta}&=\sum_sd^\pi(s)\sum_a\frac{\partial\pi(s, a)}{\partial\theta}Q^\pi(s, a) - \sum_s d^\pi(s)\sum_a\frac{\partial\pi(s, a)}{\partial \theta}[Q^\pi(s, a)-f_w(s, a)]\\ &=\sum_sd^\pi(s)\sum_a\frac{\partial\pi(s, a)}{\partial\theta}[Q^\pi(s, a)-Q^\pi(s, a)+f_w(s, a)] \end{align*} $$ Thus, if \(f_w\) satisfies (2) and (3), then $$ \begin{equation} \frac{\partial \rho}{\partial \theta}=\sum_s d^\pi(s)\sum_a\frac{\partial \pi(s, a)}{\partial \theta}f_w(s, a) \end{equation} $$
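For a tabular softmax (Gibbs) policy, condition (3) is satisfied by a linear approximator over the score features \(\frac{\partial \log\pi(s, a)}{\partial\theta}\), in the spirit of the Gibbs-policy example in the paper. The sketch below writes these compatible features explicitly; the tabular parameterization and function names are assumptions.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(s, .) for a tabular softmax (Gibbs) policy with preferences theta[s, a]."""
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def compatible_features(theta, s, a):
    """psi(s, a) = d log pi(s, a) / d theta = (d pi(s, a) / d theta) / pi(s, a), flattened."""
    pi_s = softmax_policy(theta, s)
    grad = np.zeros_like(theta)
    grad[s] = (np.arange(theta.shape[1]) == a) - pi_s     # delta_{ab} - pi(s, b)
    return grad.ravel()

def f_w(w, theta, s, a):
    """Compatible approximation f_w(s, a) = w . psi(s, a), which satisfies condition (3)."""
    return float(w @ compatible_features(theta, s, a))
```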
In addition to (3), suppose there is a bound \(B\) such that \(\max_{\theta, s, a, i, j}\left|\frac{\partial^2 \pi(s, a)}{\partial \theta_i\partial \theta_j}\right|<B<\infty\), which together with the policy gradient theorem ensures that \(\frac{\partial^2\rho}{\partial \theta_i\partial\theta_j}\) is also bounded. If such a \(B\) exists, the rewards are bounded, and the step sizes satisfy \(\lim_{k\rightarrow\infty}\alpha_k=0\) and \(\sum_k\alpha_k=\infty\), then the policy iteration defined by $$ \begin{align*} w_k&= w\ \text{such that}\ \sum_s d^{\pi_k}(s)\sum_a\pi_k(s, a)[Q^{\pi_k}(s, a)-f_w(s, a)]\frac{\partial f_w(s, a)}{\partial w} = 0\\ \theta_{k+1}&=\theta_k+\alpha_k\sum_sd^{\pi_k}(s)\sum_a\frac{\partial \pi_k(s, a)}{\partial\theta}f_{w_k}(s, a) \end{align*} $$ converges in the sense that \(\lim_{k\rightarrow\infty}\frac{\partial\rho(\pi_k)}{\partial\theta}=0\).
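Below is a minimal sketch of this policy-iteration scheme on an invented two-state MDP: each iteration solves the weighted least-squares condition (2) exactly for a compatible linear \(f_w\) (the critic) and then takes the parameter step above (the actor). The MDP, the tabular softmax parameterization, and the small constant step size, used instead of the decaying schedule the theorem assumes, are illustrative choices only.

```python
import numpy as np

gamma, alpha = 0.9, 0.1
# Invented 2-state, 2-action MDP: P[a, s, s'] transitions, R[s, a] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
nS, nA, s0 = 2, 2, 0

def policy(theta):
    """Tabular softmax policy pi[s, a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def d_and_q(pi):
    """Discounted state weighting d^pi (from s0) and exact Q^pi."""
    P_pi = np.einsum('sa,ast->st', pi, P)
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, np.sum(pi * R, axis=1))
    Q = R + gamma * np.einsum('ast,t->sa', P, v)
    d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])
    return d, Q

def psi(pi, s, a):
    """Compatible features d log pi(s, a) / d theta, flattened."""
    g = np.zeros((nS, nA))
    g[s] = (np.arange(nA) == a) - pi[s]
    return g.ravel()

theta = np.zeros((nS, nA))
for k in range(300):
    pi = policy(theta)
    d, Q = d_and_q(pi)
    # Critic step: w_k solves condition (2), a weighted least-squares problem for linear f_w.
    A = sum(d[s] * pi[s, a] * np.outer(psi(pi, s, a), psi(pi, s, a))
            for s in range(nS) for a in range(nA))
    b = sum(d[s] * pi[s, a] * Q[s, a] * psi(pi, s, a)
            for s in range(nS) for a in range(nA))
    w = np.linalg.lstsq(A, b, rcond=None)[0]
    # Actor step: theta_{k+1} = theta_k + alpha * sum_s d(s) sum_a (d pi(s, a) / d theta) f_w(s, a).
    grad = sum(d[s] * pi[s, a] * (w @ psi(pi, s, a)) * psi(pi, s, a)
               for s in range(nS) for a in range(nA))
    theta = theta + alpha * grad.reshape(nS, nA)

print(policy(theta))   # pi_k approaches a (locally) optimal policy as the gradient vanishes
```

The critic system is solved with `lstsq` because the compatible features of a tabular softmax policy sum to zero within each state, so their Gram matrix is rank-deficient; any minimizer of the weighted squared error gives the same \(f_w\) values on the support of the policy.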