Policy Gradients in a Nutshell

This article aims to provide a concise yet comprehensive introduction to one of the most important classes of control algorithms in Reinforcement Learning: Policy Gradients. I will discuss these algorithms in progression, arriving at well-known results from the ground up. It is aimed at readers with a reasonable background, as for any other topic in Machine Learning. By the end, I hope that you'd be able to attack a vast amount of (if not all) Reinforcement Learning literature.

Introduction

Reinforcement Learning (RL) refers to both the learning problem and the sub-field of machine learning which has lately been in the news for great reasons. RL-based systems have now beaten world champions of Go, helped operate datacenters better and mastered a wide variety of Atari games. The research community is seeing many more promising results. With enough motivation, let us now have a look at the Reinforcement Learning problem.

Reinforcement Learning is the most general description of the learning problem, where the aim is to maximize a long-term objective. The system description consists of an agent which interacts with the environment via its actions at discrete time steps and receives a reward. This transitions the agent into a new state. A canonical agent-environment feedback loop is depicted by the figure below.

The Canonical Agent-Environment Feedback Loop

The Reinforcement Learning flavor of the learning problem is strikingly similar to how humans effectively behave: experience the world, accumulate knowledge and use the learnings to handle novel situations. Like many people, this attractive nature (although a harder formulation) of the problem is what excites me, and I hope it does you too.

Background and Definitions

A large amount of the theory behind RL rests on the assumption of The Reward Hypothesis, which in summary states that all goals and purposes of an agent can be explained by a single scalar called the reward. This is still subject to debate, but has been fairly difficult to disprove so far. More formally, the reward hypothesis is given below.

The Reward Hypothesis: That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

As an RL practitioner and researcher, one's job is to find the right set of rewards for a given problem, known as reward shaping.

The agent must formally work through a theoretical framework known as a Markov Decision Process, which consists of a decision (what action to take?) to be made at each state. This gives rise to a sequence of states, actions and rewards known as a trajectory,

S_0, A_0, R_1, S_1, A_1, R_2, …

and the objective is to maximize this set of rewards. More formally, we look at the Markov Decision Process framework.

Markov Decision Process: A (Discounted) Markov Decision Process (MDP) is a tuple (S, A, R, p, γ), such that

p(s′, r | s, a) = Pr[S_(t+1) = s′, R_(t+1) = r | S_t = s, A_t = a],    G_t = Σ_(k=0)^∞ γ^k R_(t+k+1)

where S_t, S_(t+1) ∈ S (state space), A_t ∈ A (action space), R_t, R_(t+1) ∈ R (reward space), p defines the dynamics of the process and G_t is the discounted return.

In simple words, an MDP defines the probability of transitioning into a new state and getting some reward, given the current state and the execution of an action. This framework is mathematically pleasing because it is First-Order Markov. This is just a fancy way of saying that anything that happens next depends only on the present and not the past. It does not matter how one arrives at the current state, as long as one does. Another important part of this framework is the discount factor γ. Summing these rewards over time with a varying degree of importance to the rewards from the future leads to the notion of discounted returns. As one might expect, a higher γ leads to higher sensitivity to rewards from the future. At the other extreme, γ = 0 doesn't consider rewards from the future at all.
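To make the discount concrete, here is a minimal sketch (plain Python, written for this article rather than taken from any library) that computes G_t for every step of a reward sequence with a single backward pass:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_(t+k+1) for each step t, via a backward pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# With gamma = 0 only the immediate reward survives at each step.
print(discounted_returns([1.0, 2.0, 3.0], 0.0))  # [1.0, 2.0, 3.0]
# With gamma = 0.5, future rewards contribute with geometrically decaying weight.
print(discounted_returns([1.0, 2.0, 3.0], 0.5))  # [2.75, 3.5, 3.0]
```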

The dynamics of the environment p are outside the control of the agent. To internalize this, imagine standing on a field in a windy environment and taking a step in one of the four directions at each second. The winds are so strong that it is difficult for you to move in a direction perfectly aligned with north, east, west or south. The probability of landing in a new state at the next second is given by the dynamics p of the windy field. It is certainly not in your (agent's) control.

However, what if you somehow understand the dynamics of the environment and move in a direction other than north, east, west or south? This policy is what the agent controls. When an agent follows a policy π, it generates the sequence of states, actions and rewards called the trajectory.

Policy: A policy is defined as the probability distribution of actions given a state,

π(a | s) = P[A_t = a | S_t = s]

With all these definitions in mind, let us see what the RL problem looks like formally.

Policy Gradients

The objective of a Reinforcement Learning agent is to maximize the "expected" reward when following a policy π. Like any Machine Learning setup, we define a set of parameters θ (e.g. the coefficients of a complex polynomial or the weights and biases of units in a neural network) to parametrize this policy, written π_θ (or π for brevity). If we represent the total reward for a given trajectory τ as r(τ), we arrive at the following definition.

Reinforcement Learning Objective: Maximize the "expected" reward following a parametrized policy,

J(θ) = E_(τ∼π_θ)[ r(τ) ]

All finite MDPs have at least one optimal policy (which can give the maximum reward), and among all the optimal policies at least one is stationary and deterministic.

Like any other Machine Learning problem, if we can find the parameters θ⋆ which maximize J, we will have solved the task. A standard approach to solving this maximization problem in the Machine Learning literature is to use Gradient Ascent (or Descent). In gradient ascent, we keep stepping through the parameters using the following update rule,

θ_(t+1) = θ_t + α ∇_θ J(θ_t)

Here comes the challenge: how do we find the gradient of the objective above, which contains an expectation? Integrals are always bad in a computational setting. We need to find a way around them. The first step is to reformulate the gradient, starting with the expansion of the expectation (with a slight abuse of notation).

The Policy Gradient Theorem: The derivative of the expected reward is the expectation of the product of the reward and the gradient of the log of the policy π_θ,

∇_θ J(θ) = E_(τ∼π_θ)[ r(τ) ∇_θ log π_θ(τ) ]

Now, let us expand the definition of π_θ(τ),

π_θ(τ) = P(s_0) ∏_(t=1)^T π_θ(a_t | s_t) p(s_(t+1) | s_t, a_t)

To understand this computation, let us break it down: P represents the ergodic distribution of starting in some state s_0. From then onwards, we apply the product rule of probability, because each new action probability is independent of the previous one (remember Markov?). At each step, we take some action using the policy π_θ, and the environment dynamics p decide which new state to transition into. These are multiplied over T time steps, representing the length of the trajectory. Equivalently, taking the log, we have

log π_θ(τ) = log P(s_0) + Σ_(t=1)^T ( log π_θ(a_t | s_t) + log p(s_(t+1) | s_t, a_t) )

and only the policy terms survive differentiation with respect to θ,

∇_θ log π_θ(τ) = Σ_(t=1)^T ∇_θ log π_θ(a_t | s_t)

This result is beautiful in its own right, because it tells us that we don't really need to know the ergodic distribution of states P nor the environment dynamics p. This is crucial because, for most practical purposes, it is difficult to model both these variables. Getting rid of them is certainly good progress. As a consequence, all algorithms that use this result are known as "Model-Free Algorithms", because we don't "model" the environment.

The "expectation" (or equivalently an integral term) still lingers around. A simple but effective approach is to sample a large number of trajectories (I really mean LARGE!) and average them out. This is an approximation, but an unbiased one, similar to approximating an integral over a continuous space with a discrete set of points in the domain. This technique is formally known as Markov Chain Monte-Carlo (MCMC), widely used in Probabilistic Graphical Models and Bayesian Networks to approximate parametric probability distributions.
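As a toy illustration of the sampling idea, the following sketch (standard library only; the uniform distribution and the function x² are arbitrary choices, not anything from the RL setup) approximates an expectation by a sample average:

```python
import random

def expected_value(f, sampler, n=100_000):
    """Monte-Carlo estimate of E[f(X)]: average f over n samples of X."""
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
# E[X^2] for X ~ Uniform(0, 1) is exactly 1/3; the sample average converges to it.
estimate = expected_value(lambda x: x * x, random.random)
print(round(estimate, 2))  # close to 0.33
```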

One term that remains untouched in our treatment above is the reward of the trajectory r(τ). Even though the gradient of the parametrized policy does not depend on the reward, this term adds a lot of variance to the MCMC sampling. Effectively, there are T sources of variance, with each R_t contributing. However, we can instead make use of the returns G_t, because from the standpoint of optimizing the RL objective, rewards of the past don't contribute anything. Hence, if we replace r(τ) by the discounted return G_t, we arrive at the classic Policy Gradient algorithm called REINFORCE. This doesn't totally alleviate the problem, as we discuss further.

REINFORCE (and Baseline)

To reiterate, the REINFORCE algorithm computes the policy gradient as

REINFORCE Gradient: ∇_θ J(θ) = E_(τ∼π_θ)[ Σ_(t=1)^T G_t ∇_θ log π_θ(a_t | s_t) ]
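A minimal sketch of this update for a stateless two-action softmax policy (a deliberate simplification for illustration; real policies condition on the state, and real implementations use autodiff rather than hand-written gradients):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, episodes, lr=0.1, gamma=1.0):
    """One pass of REINFORCE over sampled episodes.

    For a stateless softmax policy, grad_theta log pi(a) = one_hot(a) - pi,
    and each logit moves by lr * G_t * grad (gradient *ascent*)."""
    for actions, rewards in episodes:
        # Discounted returns G_t via a backward pass.
        g, returns = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for a, G in zip(actions, returns):
            pi = softmax(theta)
            for i in range(len(theta)):
                grad_log = (1.0 if i == a else 0.0) - pi[i]
                theta[i] += lr * G * grad_log
    return theta

# Action 1 always yields reward 1: its probability should grow.
theta = reinforce_update([0.0, 0.0], [([1], [1.0])] * 20)
print(softmax(theta)[1] > 0.7)  # True
```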

We still have not solved the problem of variance in the sampled trajectories. One way to see the problem is to reimagine the RL objective defined above as Likelihood Maximization (Maximum Likelihood Estimation). In an MLE setting, it is well known that data overwhelms the prior; in simpler words, no matter how bad the initial estimates are, in the limit of data, the model will converge to the true parameters. However, in a setting where the data samples are of high variance, stabilizing the model parameters can be notoriously hard. In our context, any erratic trajectory can cause a sub-optimal shift in the policy distribution. This problem is aggravated by the scale of rewards.

Consequently, we instead try to optimize for the difference in rewards by introducing another variable called the baseline b. To keep the gradient estimate unbiased, the baseline must be independent of the policy parameters.

REINFORCE with Baseline: ∇_θ J(θ) = E_(τ∼π_θ)[ Σ_(t=1)^T (G_t − b) ∇_θ log π_θ(a_t | s_t) ]

To see why, we must show that the gradient remains unchanged with the additional term (with slight abuse of notation),

E[ b ∇_θ log π_θ(a_t | s_t) ] = b Σ_a ∇_θ π_θ(a | s_t) = b ∇_θ Σ_a π_θ(a | s_t) = b ∇_θ 1 = 0

Using a baseline, in both theory and practice, reduces the variance while keeping the gradient unbiased. A good baseline is the state value of the current state.

State Value: The state value is defined as the expected return given a state, following the policy π_θ,

V(s) = E_(π_θ)[ G_t | S_t = s ]

Actor-Critic Methods

Finding a good baseline is a challenge in itself, and computing it is another. Instead, let us approximate it as well, using parameters ω to make V̂_ω(s). All algorithms where we bootstrap the gradient using a learnable V̂_ω(s) are known as Actor-Critic algorithms, because this value function estimate behaves like a "critic" (good v/s bad values) to the "actor" (the agent's policy). However, this time we have to compute the gradients of both the actor and the critic.

One-Step Bootstrapped Return: A single-step bootstrapped return takes the immediate reward and estimates the return by using a bootstrapped value estimate of the next state in the trajectory,

G_t ≈ R_(t+1) + γ V̂_ω(S_(t+1))

Actor-Critic Policy Gradient: ∇_θ J(θ) = E[ ( R_(t+1) + γ V̂_ω(S_(t+1)) − V̂_ω(S_t) ) ∇_θ log π_θ(A_t | S_t) ]

It goes without saying that we also need to update the parameters ω of the critic. The objective there is generally taken to be the Mean Squared Loss (or the less harsh Huber loss), and the parameters are updated using Stochastic Gradient Descent.

Critic's Objective: J(ω) = ( R_(t+1) + γ V̂_ω(S_(t+1)) − V̂_ω(S_t) )²
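For a tabular stand-in for V̂_ω (a hypothetical simplification for illustration; real critics are parametric function approximators), one SGD step on the squared TD error can be sketched as:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One SGD step on the squared TD error (R + gamma*V(s') - V(s))^2 w.r.t. V(s)."""
    target = r + gamma * V[s_next]  # one-step bootstrapped return
    delta = target - V[s]           # TD error: the critic's learning signal
    V[s] += alpha * delta
    return delta

V = {"s0": 0.0, "s1": 0.0}
delta = td_update(V, "s0", 1.0, "s1")
print(V["s0"])  # 0.1
```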

Deterministic Policy Gradients

Oftentimes in robotics, a differentiable control policy is available, but the actions are not stochastic. In such environments, it is hard to build a stochastic policy as previously seen. One approach is to inject noise into the controller. Moreover, with increasing dimensionality of the controller, the previously seen algorithms start performing worse. Owing to such scenarios, instead of learning a large number of probability distributions, let us directly learn a deterministic action for a given state. Hence, in its simplest form, a greedy maximization objective is what we need,

Deterministic Actions: μ(s) = argmax_a Q(s, a)

However, for most practical purposes, this maximization operation is computationally infeasible (as there is no other way than to search the entire action space for a given action-value function). Instead, what we can aspire to do is build a function approximator to approximate this argmax, hence called the Deterministic Policy Gradient (DPG).

We sum this up with the following equations.

DPG Objective: J(θ) = E_s[ Q(s, μ_θ(s)) ]

Deterministic Policy Gradient: ∇_θ J(θ) = E_s[ ∇_a Q(s, a) |_(a=μ_θ(s)) ∇_θ μ_θ(s) ]

It shouldn't be surprising anymore that this value turned out to be another expectation, which we can again estimate using MCMC sampling.
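As an illustration, consider a hypothetical one-dimensional problem where Q(s, a) = −(a − s)² (the best action equals the state) and μ_θ(s) = θ·s, so the optimal θ is 1; the DPG chain rule can then be written out exactly:

```python
def dpg_step(theta, states, lr=0.05):
    """One deterministic policy gradient ascent step on a toy 1-D problem.

    Q(s, a) = -(a - s)^2 and mu(s) = theta * s, so
    grad_theta J = E[ dQ/da |_(a=mu(s)) * dmu/dtheta ] = E[ -2*(theta*s - s) * s ]."""
    grad = sum(-2.0 * (theta * s - s) * s for s in states) / len(states)
    return theta + lr * grad

theta = 0.0
for _ in range(200):
    theta = dpg_step(theta, [0.5, 1.0, 1.5])
print(round(theta, 3))  # 1.0 (theta converges to the optimum)
```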

Generic Reinforcement Learning Framework

We can now arrive at a generic algorithm to see where all the pieces we've learned fit together. All new algorithms are typically a variant of the algorithm given below, trying to attack one (or multiple) steps of the problem.
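The generic loop can be sketched as follows, with hypothetical stand-in agent and environment interfaces (loosely modeled on the common reset/step pattern; none of these names come from a real library):

```python
class CountEnv:
    """Toy environment: state counts up; episode ends at state 3; reward 1 per step."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += 1
        return self.s, 1.0, self.s >= 3  # next_state, reward, done

class RandomAgent:
    """Placeholder agent: fixed action; update() is where a policy-gradient step would go."""
    def act(self, state):
        return 0
    def update(self, trajectory):
        self.last = trajectory

def train(env, agent, num_episodes):
    """Generic RL skeleton: collect experience with the current policy, then update it."""
    for _ in range(num_episodes):
        state = env.reset()
        trajectory, done = [], False
        while not done:
            action = agent.act(state)              # sample from pi_theta(a | s)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        agent.update(trajectory)                   # e.g. REINFORCE / actor-critic step

agent = RandomAgent()
train(CountEnv(), agent, num_episodes=5)
print(len(agent.last))  # 3 transitions per episode
```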

Code

For readers familiar with Python, these code snippets are meant to be a more tangible representation of the above theoretical ideas. These have been taken out of the learning loop of real code.

Policy Gradients (Synchronous Actor-Critic)
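A minimal sketch of one synchronous actor-critic step, using a tabular critic and a per-state softmax actor (both simplifying assumptions made for illustration; practical versions use neural networks and autodiff):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      lr_actor=0.1, lr_critic=0.1, gamma=0.9):
    """One synchronous actor-critic step.

    The TD error delta serves double duty: the critic does SGD on the squared
    TD error, and the actor scales grad log pi by delta (its advantage estimate)."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]                          # TD error
    V[s] += lr_critic * delta                      # critic update
    pi = softmax(theta[s])
    for i in range(len(theta[s])):                 # actor update: delta * grad log pi
        grad_log = (1.0 if i == a else 0.0) - pi[i]
        theta[s][i] += lr_actor * delta * grad_log
    return delta

# One state, two actions; action 1 ends the episode with reward 1.
theta, V = {"s0": [0.0, 0.0]}, {"s0": 0.0}
for _ in range(20):
    actor_critic_step(theta, V, "s0", 1, 1.0, "s0", done=True)
print(V["s0"] > 0.5, softmax(theta["s0"])[1] > 0.5)  # True True
```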

Deep Deterministic Policy Gradients
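A full DDPG implementation is too large to sketch here, but one of its characteristic components, the slowly-moving target networks of Lillicrap et al. (2015), uses a soft ("Polyak") update that is simple enough to write out exactly (parameters shown as plain lists for illustration):

```python
def soft_update(target, source, tau=0.005):
    """DDPG target-network update: target <- tau * source + (1 - tau) * target."""
    return [(1 - tau) * t + tau * s for t, s in zip(target, source)]

# With a large tau the target parameters track the source quickly.
target, source = [0.0, 0.0], [1.0, 2.0]
for _ in range(3):
    target = soft_update(target, source, tau=0.5)
print(target)  # [0.875, 1.75]
```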

Complete Implementations

Complete modular implementations of the full pipeline can be viewed at activatedgeek/torchrl.

References

[Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.

[Dimitri, 2017] Dimitri, P. B. (2017). Dynamic Programming and Optimal Control. Athena Scientific.

[Lillicrap et al., 2015] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. ArXiv e-prints.

[Watkins and Dayan, 1992] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.

[Williams, 1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.

[Silver et al., 2014] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 387–395, Beijing, China. PMLR.


Source: https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
