- In this environment, a double-jointed arm can move to target locations.
- A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.
- The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm.
- Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.
- I decided to use the DDPG (Deep Deterministic Policy Gradient) algorithm.
- The Actor and Critic networks are almost identical and simple: linear layers followed by ReLU activation layers.
- A fixed-size cache called the replay buffer stores experience tuples <s, a, r, s'>. Batches sampled from this buffer are used to train the Actor and the Critic (a minimal buffer sketch appears after this list).
- Soft updates are used for both the Actor and the Critic: a separate target copy of each network is kept for computing the learning targets, and at each learning step its weights are nudged a small amount (controlled by tau) towards the weights of the local network being trained.
- Exploration (a major challenge in continuous action spaces) is handled by adding noise to the actions, generated with an Ornstein-Uhlenbeck process (a sketch of this process also appears below).
- The loss functions for both networks follow the DDPG paper: a mean-squared TD error for the Critic, and the negative mean Q-value of the Actor's actions for the Actor (see the learning-step sketch below).
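As a reference for the points above, here is a minimal sketch of a replay buffer, assuming a PyTorch/NumPy implementation; the `Experience` tuple and the class and method names are illustrative, not taken from the actual code.

```python
import random
from collections import deque, namedtuple

import numpy as np
import torch

# Hypothetical experience tuple; the done flag is stored alongside <s, a, r, s'>.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=100_000, batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries fall off once full
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniformly sample a batch and stack it into tensors for training.
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences])).float()
        actions = torch.from_numpy(np.vstack([e.action for e in experiences])).float()
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences])).float()
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences])).float()
        dones = torch.from_numpy(np.vstack([e.done for e in experiences]).astype(np.uint8)).float()
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```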
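A sketch of the Ornstein-Uhlenbeck noise process used for exploration; the `mu`, `theta` and `sigma` values shown are common defaults and are assumptions, not necessarily the values used in this project.

```python
import copy
import random

import numpy as np

class OUNoise:
    """Mean-reverting Ornstein-Uhlenbeck process for temporally correlated action noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        random.seed(seed)
        self.reset()

    def reset(self):
        # Reset the internal state to the mean at the start of each episode.
        self.state = copy.copy(self.mu)

    def sample(self):
        # Drift back towards the mean, plus Gaussian noise, and return the new state.
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state
```

At acting time, the sampled noise is added to the deterministic action from the Actor and the result is clipped back into [-1, 1].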
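Finally, a sketch of the learning step, showing the two loss functions from the DDPG paper and the soft updates described above. Function and variable names are hypothetical, and the networks and optimisers are assumed to be PyTorch modules.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor from the hyperparameter list below
TAU = 1e-3    # soft-update coefficient from the hyperparameter list below

def soft_update(local_model, target_model, tau=TAU):
    # theta_target <- tau * theta_local + (1 - tau) * theta_target
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

def learn(experiences, actor_local, actor_target, critic_local, critic_target,
          actor_optimizer, critic_optimizer, gamma=GAMMA):
    states, actions, rewards, next_states, dones = experiences

    # Critic update: minimise the TD error against a bootstrapped target
    # computed with the target Actor and target Critic.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets_next = critic_target(next_states, next_actions)
        q_targets = rewards + gamma * q_targets_next * (1 - dones)
    q_expected = critic_local(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Actor update: maximise Q(s, pi(s)) by minimising its negative mean.
    actions_pred = actor_local(states)
    actor_loss = -critic_local(states, actions_pred).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Soft-update the target networks towards the trained (local) networks.
    soft_update(critic_local, critic_target)
    soft_update(actor_local, actor_target)
```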
I first set up the networks for both the Actor and the Critic as described in the paper, but after experimenting I settled on the values below.
- Both the Actor and the Critic have 2 hidden layers
- Dimension of hidden layer 1 = 330, hidden layer 2 = 300
- Size of input to both networks is the state size = 33
- Actions are fed into the Critic network only at the second hidden layer
- The Actor outputs actions with action size = 4
- The Critic outputs the Q-value for a (state, action) pair (see the network sketch after this list)
- Adam optimiser for both networks
- Replay buffer size = 100000; once the buffer is full, the oldest entries are replaced
- Discount factor or Gamma = 0.99
- Learning rate for both networks = 0.001
- Tau for soft updates = 0.001
- Weight decay for Critic = 0
- Batch size = 128
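For illustration, here is a minimal PyTorch sketch of Actor and Critic networks consistent with the settings above (state size 33, action size 4, hidden layers of 330 and 300 units, actions joining the Critic at the second layer). The `tanh` output on the Actor, which keeps each action in [-1, 1], and the layer names are assumptions rather than details taken from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a deterministic action."""

    def __init__(self, state_size=33, action_size=4, fc1_units=330, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # assumed tanh output to bound actions in [-1, 1]

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size=33, action_size=4, fc1_units=330, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)  # actions join at the second layer
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat((x, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```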
- Implement the D4PG and A3C algorithms and compare their performance with this DDPG implementation
- Implement a prioritised replay buffer
- Further tweak the network parameters for faster training and a higher target score.