Project Report

The latest version of this project uses the following common tricks:

- The Actor and the Critic share a state encoder layer
- Multi-step rollouts are collected (here 5 steps, i.e. 5-step TD returns) before bootstrapping from the Critic
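The multi-step collection listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `n_step_returns` and its arguments are hypothetical, and it assumes the rollout does not cross an episode boundary. Within a rollout of length 5, the first step receives a full 5-step return while the last step receives a 1-step return, all bootstrapped from the same critic estimate.

```python
from typing import List


def n_step_returns(rewards: List[float], bootstrap_value: float,
                   gamma: float = 0.95) -> List[float]:
    """Discounted return targets for one rollout, bootstrapped from the critic.

    rewards[t] is the reward at rollout step t. bootstrap_value is the
    critic's estimate for the state reached after the last step (pass 0.0
    if the episode ended there). Names are illustrative assumptions.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    # Walk the rollout backwards so each step reuses the return that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A 5-step rollout with reward 1.0 per step and a critic estimate of 10.0.
targets = n_step_returns([1.0] * 5, bootstrap_value=10.0, gamma=0.95)
```

The difference between each target and the critic's value for the same state can then serve as the advantage in the policy loss.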

Learning Algorithm

Network Architecture

The input state vector is encoded by one fully connected layer before branching into the Actor and Critic heads, i.e. the actor and the critic share the input encoder. The Actor head outputs the mean vector of the action distribution, while the Critic head outputs the state value estimate. A separate learnable vector parameter holds the standard deviations of the action distribution. Together, the mean and standard deviation vectors parameterize a multivariate normal distribution, from which actions are sampled given the current state.
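In symbols, one reading of this architecture is given below. The names of the encoder and the two heads are notation introduced here for illustration. The report does not state layer sizes or activation functions, so they are left abstract. The diagonal covariance follows from the standard deviations being stored as a single vector.

```latex
\begin{aligned}
h &= f_{\text{enc}}(s) && \text{(shared fully connected encoder)} \\
\mu &= g_{\text{actor}}(h), \qquad V(s) = g_{\text{critic}}(h) \\
a &\sim \mathcal{N}\!\left(\mu,\ \operatorname{diag}(\sigma^{2})\right)
\end{aligned}
```

Here $s$ is the input state, $h$ the shared encoding, $\mu$ the action mean, $V(s)$ the state value estimate, and $\sigma$ the learned standard deviation vector, which does not depend on the state.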

Hyper-parameters
- rollout_length = 5
- learning rate = 1e-4
- learning rate decay rate = 0.95
- gamma = 0.95
- value loss weight = 1.0
- gradient clip threshold = 5

Training Strategy
- Adam is used as the optimizer
- An early-stopping scheme ends training if the 100-episode average score keeps decreasing for 10 consecutive episodes.
- Each time the average score gets worse, the model is restored from the last best model and the learning rate of Adam is decreased: new learning rate = old learning rate * learning rate decay rate
- Gradients are clamped to the range [-5, +5] to prevent them from exploding
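The rollback, learning rate decay, early stopping, and gradient clamping described above could be organized roughly as in the sketch below. It is written in plain Python and is not the project's code: `TrainingController` and `clamp_gradients` are hypothetical names, and "gets worse" is interpreted here as "not better than the best average seen so far", which is only one possible reading of the strategy.

```python
import copy
from typing import Any, List, Optional


class TrainingController:
    """Tracks the best model and applies rollback, LR decay, and early stopping."""

    def __init__(self, lr: float = 1e-4, lr_decay: float = 0.95,
                 patience: int = 10):
        self.lr = lr
        self.lr_decay = lr_decay
        self.patience = patience
        self.best_avg = float("-inf")
        self.best_weights: Optional[Any] = None
        self.bad_episodes = 0  # consecutive episodes without a new best average

    def update(self, avg_score: float, weights: Any) -> Any:
        """Return the weights to continue training from after this episode."""
        if avg_score > self.best_avg:
            # New best: snapshot the weights and reset the counter.
            self.best_avg = avg_score
            self.best_weights = copy.deepcopy(weights)
            self.bad_episodes = 0
            return weights
        # Not better than the best: roll back and shrink the learning rate.
        self.bad_episodes += 1
        self.lr *= self.lr_decay
        return copy.deepcopy(self.best_weights)

    @property
    def should_stop(self) -> bool:
        return self.bad_episodes >= self.patience


def clamp_gradients(grads: List[float], limit: float = 5.0) -> List[float]:
    """Clamp each gradient component into [-limit, +limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


controller = TrainingController()
weights = {"w": [0.1, 0.2]}  # stand-in for real model parameters
for avg in [1.0, 2.0, 1.5]:
    weights = controller.update(avg, weights)
clamped = clamp_gradients([-7.0, 0.5, 9.0])
```

In a PyTorch setup (the report does not name the framework), the weights would typically be a model `state_dict`, the decayed rate would be written into the optimizer's parameter groups, and `torch.nn.utils.clip_grad_value_` would perform the element-wise clamp.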

Performance Evaluation

Training

During training, the performance stabilized from around the 240th episode after a series of fluctuations. Before that, the score first surpassed 30 at around episode 120. The per-episode scores and the average scores (over the latest 100 episodes) are plotted as follows:

Total rewards per-episode during training

Average total rewards over latest 100 episodes during training

As can be seen from the plot, the average score gradually reached and passed 30 during training, before the early-stopping scheme terminated the training process.
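For reference, the average over the latest 100 episodes can be computed as in the sketch below. The function names are hypothetical, and averaging over fewer than 100 scores during the first episodes is an assumption: the report does not say how the early part of the curve was computed.

```python
from collections import deque
from typing import Iterable, List, Optional


def running_average(scores: Iterable[float], window: int = 100) -> List[float]:
    """Average of the latest `window` scores after each episode."""
    recent = deque(maxlen=window)  # old scores drop out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages


def first_solved_episode(scores: Iterable[float], threshold: float = 30.0,
                         window: int = 100) -> Optional[int]:
    """1-based episode at which the full-window average first exceeds threshold."""
    for i, avg in enumerate(running_average(scores, window)):
        if i + 1 >= window and avg > threshold:
            return i + 1
    return None
```

Requiring a full window before declaring the task solved matches the criterion of 100 consecutive episodes; a partial-window average crossing 30 earlier would not count.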

Testing

The scores of 100 testing episodes are visualized as follows:

The model obtained an average score of 37.91 during testing, which is over 30.

Conclusion

The trained model has successfully solved the continuous task:

- it obtained an average score of 37.91 over 100 testing episodes
- the best model was trained in around 250 episodes

This meets the passing threshold for solving the problem: an average score higher than 30.00 over 100 consecutive episodes.

Ideas for Future Work

- Try methods such as GAE or PPO when computing the policy loss, to speed up or stabilize the training process.
- Check whether training the critic network separately, using techniques that improve deep Q-networks (DQN), helps within the A2C framework.
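As a pointer for the first idea, generalized advantage estimation (GAE) could replace the plain multi-step advantage along the lines of the sketch below. It is not part of the current implementation. The function name and the value `lam = 0.95` are assumptions chosen for illustration, and the rollout is assumed not to cross an episode boundary.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   bootstrap_value: float, gamma: float = 0.95,
                   lam: float = 0.95) -> List[float]:
    """GAE advantages for one rollout.

    values[t] is the critic's estimate for the state at step t and
    bootstrap_value its estimate for the state after the last step.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam = 0` reduces this to the one-step TD error, and `lam = 1` recovers the multi-step return minus the value estimate, so `lam` trades bias against variance between those two extremes.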
