Questions about the training settings #5
I don't think they use a learning rate decay; at least it is not mentioned anywhere. They do train on all tasks simultaneously (according to the first author) for 500k steps with the attention-is-all-you-need transformer architecture, which uses a 6-layer encoder and decoder with a hidden size of 512 and a dense filter size of 2048. The batch size is 1024, so you will need some serious compute in order to reproduce this. With this config trained on 4 V100 GPUs, you can do 50k steps in ~13h. They used the tensor2tensor implementation of the transformer, so technically the code is public. Good luck with that. Have you had any success @mayukuner? |
That's correct: no learning rate decay for the results reported in the paper. 6 layers in the decoder and encoder. |
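For reference, a minimal PyTorch sketch of the architecture described in the two comments above; the number of attention heads (8) and the dropout rate are assumptions taken from the standard base transformer, not stated in this thread:

```python
import torch.nn as nn

# Sketch of the encoder-decoder transformer described above.
# nhead and dropout are assumed values, not confirmed by the authors.
model = nn.Transformer(
    d_model=512,            # hidden size
    nhead=8,                # assumed, as in the base transformer
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,   # "dense filter size"
    dropout=0.1,            # assumed
)
```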
@ischlag Not even close to success. I used transformer_base_v1 as the base parameter set and modified it a little by adding a constant LR schedule, a warmup procedure, and other changes like this:
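A hypothetical sketch of what such an hparams override could look like in tensor2tensor (every value here is an illustrative assumption, not necessarily the configuration actually used):

```python
# Hypothetical tensor2tensor hparams override with a constant LR plus warmup.
# The warmup length and other values are assumptions for illustration.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_math_constant_lr():
  hparams = transformer.transformer_base_v1()
  hparams.learning_rate_schedule = "constant*linear_warmup"  # no decay after warmup
  hparams.learning_rate_constant = 6e-4
  hparams.learning_rate_warmup_steps = 8000  # assumed warmup length
  hparams.batch_size = 8192                  # tokens per batch per GPU in t2t
  return hparams
```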
Here I am using a batch size of 8192 because 8 GTX 1080 Ti GPUs are utilized, and 1024 sentences have approximately 8192 * 8 tokens, so I think this is not a problem. By the way, I changed the dataset generator a little by randomly selecting training data from ... @davidsaxton Have you used curriculum training? I don't think you have, yet I really couldn't figure out why I cannot reproduce your results. Am I missing something here? |
I'd highly recommend not deviating from the hyperparameters that are given in the paper. The transformer architecture is rather sensitive to those. Remove your schedule and set the batch_size to 1024. Then train for 500k steps. Make sure your accuracy is 1 for getting all output tokens right and 0 for getting even just one wrong (no per-symbol accuracy is reported). |
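A minimal sketch of that whole-sequence (exact-match) accuracy versus per-token accuracy, assuming padded batches of token ids in PyTorch (the pad id is an assumption):

```python
import torch

def exact_match_accuracy(pred, target, pad_id=0):
    # A sequence counts as correct only if every non-pad token matches.
    # pred, target: LongTensors of shape (batch, seq_len); pad_id is assumed.
    mask = target.ne(pad_id)
    token_ok = pred.eq(target) | ~mask               # pad positions always count as ok
    return token_ok.all(dim=1).float().mean().item()

def per_token_accuracy(pred, target, pad_id=0):
    # Per-symbol accuracy, shown only for contrast; it is not what the paper reports.
    mask = target.ne(pad_id)
    return (pred.eq(target) & mask).sum().item() / mask.sum().item()
```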
@ischlag I am trying my best to get close to the settings in the paper. As you can see, the batch_size here is the maximum number of tokens per batch per GPU, so overall each batch contains 8192*8 tokens, which is close to 1024 sentences per batch. Plus, I did not use any scheduler except for the warmup; the learning rate curve is as follows: [learning-rate curve plot]. Also, the reported ... So I guess I am not doing anything wrong here, right? |
Well, I'm not sure how I'm supposed to "see" that. If you are certain that batch_size is actually the number of tokens per GPU instead of the number of samples used for one step, then so be it.
Are you sure this is not going to skip data? The tf.data pipeline might do some caching and only go through the generator once. Unfortunately, it is virtually impossible for me to tell by looking at the t2t code. |
@ischlag Sorry, I did not explain it well because I thought you were familiar with T2T. The generator in T2T generates ... |
I'm somewhat familiar with it, but I decided not to use it due to its obscurity. I'm just trying to help you here. We are working on reproducing it ourselves with a clean PyTorch implementation, and I'll post the results once we manage. That said, you should not have 2M samples in total but n * 2M, where n is the number of modules (I think 56 or so). If that also doesn't help, then I'm out of ideas. As a dummy experiment, you could train only on numbers__place_value, which in my case takes ca. 3-5k steps to reach virtually 100% accuracy. |
@ischlag You are right, I missed ... |
@mayukuner I'm currently training 3 baselines with my PyTorch implementation. The best result so far is 50% accuracy on all interpolation data after 45k steps and improving. So this starts to look promising. However, this is with a learning rate of 1e-4, not 6e-4. The 6e-4 run is stuck at a loss of 3.15 and 0% train accuracy even after 50k steps. @davidsaxton Are you sure your learning rate in the paper is 6e-4 and not 6e-5? |
@ischlag Have you clipped the gradients of the tensors? You may also try using warmup at the beginning of the training stage. The LR of 6e-4 seems OK to me. With tensor2tensor, the model can be trained to an accuracy of 70% on the interpolation test after 300k steps. |
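A minimal PyTorch sketch of the kind of linear warmup being suggested; the warmup length of 8000 steps and the placeholder model are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the actual transformer
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, betas=(0.9, 0.995))

warmup_steps = 8000  # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear warmup, then constant
)

# In the training loop, call scheduler.step() after optimizer.step() so the
# learning rate ramps linearly from ~0 to 6e-4 over the first 8000 steps.
```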
Yes, I'm clipping the gradient norm of the parameters at 0.1. 6e-4 doesn't work at all. Even 3e-4 doesn't work at all. I've been going through my implementation very carefully several times. My parameters are initialized from U[-a,a] with a=sqrt(6/in_out_avg). I share the embedding matrix with the last layer before the softmax. I only scale the embedding by sqrt(d_model), and I scale the dot products by 1/sqrt(d_k). Beta1 0.9, beta2 0.995, with the default epsilon. I scale the embedding just like in the official transformer code, but I'm not sure why it is sqrt(d_model); the one for the keys makes sense though. @mayukuner are you doing the same? I'm still training and I'm now at 60% interpolation accuracy after 120k steps. So it looks good, just not with the right learning rate for me. |
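A condensed sketch of the choices listed above (uniform init with a = sqrt(6 / fan_avg), tied embedding and pre-softmax weights, sqrt(d_model) embedding scale, 1/sqrt(d_k) attention scale); the value of d_k and the vocabulary size are assumed for illustration:

```python
import math
import torch
import torch.nn as nn

d_model, d_k, vocab_size = 512, 64, 1000  # d_k and vocab size assumed for illustration

embedding = nn.Embedding(vocab_size, d_model)
output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = embedding.weight  # share the embedding with the pre-softmax layer

def init_uniform(weight):
    # U[-a, a] with a = sqrt(6 / fan_avg), fan_avg = (fan_in + fan_out) / 2
    fan_out, fan_in = weight.shape  # assumes a 2-D weight
    a = math.sqrt(6.0 / ((fan_in + fan_out) / 2.0))
    with torch.no_grad():
        weight.uniform_(-a, a)

init_uniform(embedding.weight)

def embed(tokens):
    return embedding(tokens) * math.sqrt(d_model)  # scale embeddings by sqrt(d_model)

def attention_scores(q, k):
    return q @ k.transpose(-2, -1) / math.sqrt(d_k)  # scale dot products by 1/sqrt(d_k)

optimizer = torch.optim.Adam(
    [embedding.weight], lr=6e-4, betas=(0.9, 0.995)  # default epsilon, as described
)
```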
@ischlag I clipped the gradient absolute value, not the norm (i.e., |g_i| <= 0.1 for every gradient index i). |
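The distinction between the two clipping strategies, using PyTorch's utilities for illustration (in practice you would use one or the other, not both):

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

model = torch.nn.Linear(4, 4)  # placeholder module
model(torch.randn(2, 4)).sum().backward()

# Clip the global gradient norm to 0.1 (what the previous comment describes):
clip_grad_norm_(model.parameters(), max_norm=0.1)

# Clip element-wise, |g_i| <= 0.1 for every gradient entry (what this comment describes):
clip_grad_value_(model.parameters(), clip_value=0.1)
```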
Hi @mayukuner , can you share your implementation with me? |
Hi! I am really interested in this fascinating work. However, I have some questions about the training methods for the transformer model.
In the paper you mention that the transformer model is trained with a learning rate of 6e-4, but you do not say which LR decay method you are using, which I am curious about. I am also curious about the number of layers in the encoder and decoder.
Could you please describe the training settings in more detail? It would be much easier for someone like me who wants to reproduce your results if you could just publish your training source code.
Thank you very much!