- In this environment, a double-jointed arm can move to target locations.
- A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.
- The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm.
- Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.
- I decided to use the DDPG (Deep Deterministic Policy Gradient) algorithm.
- The Actor and Critic networks are almost identical and simple: linear layers followed by ReLU activation layers.
- A fixed-size cache called the replay buffer stores experience tuples <s, a, r, s'>. Batches sampled from this buffer are used to train the Actor and the Critic (a minimal buffer sketch appears after this list).
- Soft updates are used for both the Actor and the Critic: a separate target copy of each network is kept for computing the learning targets, and at each learning step its weights are nudged a small amount (controlled by tau) towards the weights of the local network being trained.
- Exploration (a major challenge in continuous action spaces) is handled by adding noise to the actions, generated with an Ornstein-Uhlenbeck process (a sketch of this process also appears below).
- The loss functions for both networks follow the DDPG paper: a mean-squared TD error for the Critic, and the negative mean Q-value of the Actor's actions for the Actor (see the learning-step sketch below).
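As a reference for the points above, here is a minimal sketch of a replay buffer, assuming a PyTorch/NumPy implementation; the `Experience` tuple and the class and method names are illustrative, not taken from the actual code.

```python
import random
from collections import deque, namedtuple

import numpy as np
import torch

# Hypothetical experience tuple; the done flag is stored alongside <s, a, r, s'>.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=100_000, batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries fall off once full
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniformly sample a batch and stack it into tensors for training.
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences])).float()
        actions = torch.from_numpy(np.vstack([e.action for e in experiences])).float()
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences])).float()
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences])).float()
        dones = torch.from_numpy(np.vstack([e.done for e in experiences]).astype(np.uint8)).float()
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```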
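A sketch of the Ornstein-Uhlenbeck noise process used for exploration; the `mu`, `theta` and `sigma` values shown are common defaults and are assumptions, not necessarily the values used in this project.

```python
import copy
import random

import numpy as np

class OUNoise:
    """Mean-reverting Ornstein-Uhlenbeck process for temporally correlated action noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        random.seed(seed)
        self.reset()

    def reset(self):
        # Reset the internal state to the mean at the start of each episode.
        self.state = copy.copy(self.mu)

    def sample(self):
        # Drift back towards the mean, plus Gaussian noise, and return the new state.
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state
```

At acting time, the sampled noise is added to the deterministic action from the Actor and the result is clipped back into [-1, 1].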
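Finally, a sketch of the learning step, showing the two loss functions from the DDPG paper and the soft updates described above. Function and variable names are hypothetical, and the networks and optimisers are assumed to be PyTorch modules.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor from the hyperparameter list below
TAU = 1e-3    # soft-update coefficient from the hyperparameter list below

def soft_update(local_model, target_model, tau=TAU):
    # theta_target <- tau * theta_local + (1 - tau) * theta_target
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

def learn(experiences, actor_local, actor_target, critic_local, critic_target,
          actor_optimizer, critic_optimizer, gamma=GAMMA):
    states, actions, rewards, next_states, dones = experiences

    # Critic update: minimise the TD error against a bootstrapped target
    # computed with the target Actor and target Critic.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets_next = critic_target(next_states, next_actions)
        q_targets = rewards + gamma * q_targets_next * (1 - dones)
    q_expected = critic_local(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Actor update: maximise Q(s, pi(s)) by minimising its negative mean.
    actions_pred = actor_local(states)
    actor_loss = -critic_local(states, actions_pred).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Soft-update the target networks towards the trained (local) networks.
    soft_update(critic_local, critic_target)
    soft_update(actor_local, actor_target)
```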
I first set up the networks for both the Actor and the Critic as described in the paper, but after experimenting I settled on the values below.
- Both the Actor and the Critic have 2 hidden layers
- Dimension of hidden layer 1 = 330, hidden layer 2 = 300
- Size of input to both networks is the state size = 33
- Actions are fed into the Critic network only at the second hidden layer
- The Actor outputs actions with action size = 4
- The Critic outputs the Q-value for a (state, action) pair (see the network sketch after this list)
- Adam optimiser for both networks
- Replay buffer size = 100000; once the buffer is full, the oldest entries are replaced
- Discount factor or Gamma = 0.99
- Learning rate for both networks = 0.001
- Tau for soft updates = 0.001
- Weight decay for Critic = 0
- Batch size = 128
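For illustration, here is a minimal PyTorch sketch of Actor and Critic networks consistent with the settings above (state size 33, action size 4, hidden layers of 330 and 300 units, actions joining the Critic at the second layer). The `tanh` output on the Actor, which keeps each action in [-1, 1], and the layer names are assumptions rather than details taken from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a deterministic action."""

    def __init__(self, state_size=33, action_size=4, fc1_units=330, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # assumed tanh output to bound actions in [-1, 1]

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size=33, action_size=4, fc1_units=330, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)  # actions join at the second layer
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat((x, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```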
- Implement the D4PG and A3C algorithms and compare their performance with this DDPG implementation
- Implement a prioritised replay buffer
- Further tweak the network parameters for faster training and a higher target score.