PyTorch implementation of Attention Modeling for Image Captioning
- A standard CNN-RNN encoder-decoder architecture for image captioning
- An implementation of visual attention for image captioning, as described in Show, Attend and Tell (Xu et al., 2015)
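For reference, below is a minimal sketch of the soft-attention step used in Show, Attend and Tell. Layer names, dimensions, and the module interface are illustrative assumptions, not the exact code in this repository.

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Additive soft attention over CNN feature maps (illustrative sketch)."""

    def __init__(self, feature_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)    # project image region features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)                  # scalar score per image region

    def forward(self, features, hidden):
        # features: (batch, num_regions, feature_dim), e.g. 14x14 = 196 CNN regions
        # hidden:   (batch, hidden_dim), current LSTM hidden state
        attn = torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(attn).squeeze(-1), dim=1)   # (batch, num_regions)
        context = (features * alpha.unsqueeze(-1)).sum(dim=1)        # (batch, feature_dim)
        return context, alpha
```

At each decoding step the context vector is combined with the previous word embedding as LSTM input, and the returned alpha weights indicate which image regions the model attends to for the current word.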
Both methods have been validated on the standard Flickr8k and MSCOCO datasets and achieve state-of-the-art accuracy. Results are as follows:
- CNN-RNN baseline
  - MSCOCO: BLEU-1: 0.705, BLEU-4: 0.265
  - Flickr8k: BLEU-1: 0.630, BLEU-4: 0.177
- Visual attention (Show, Attend and Tell)
  - MSCOCO: BLEU-1: 0.731, BLEU-4: 0.320
  - Flickr8k: BLEU-1: 0.655, BLEU-4: 0.218
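BLEU scores like the ones above can be computed with any standard BLEU implementation. Below is a minimal sketch using NLTK's `corpus_bleu`; the tokenized captions are placeholders, and this is not necessarily the exact evaluation script used for the numbers reported here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each entry in `references` holds all ground-truth captions for one image;
# each entry in `hypotheses` is the model's generated caption for that image.
references = [
    [["a", "dog", "runs", "on", "the", "grass"],
     ["a", "brown", "dog", "running", "through", "a", "field"]],
]
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]

# Smoothing only matters for tiny toy corpora like this one.
smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```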
- Add attention visualization utility
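A minimal sketch of one way such a visualization could work, assuming matplotlib and scikit-image are available: the per-word attention weights (the `alpha` returned by the attention module above) are reshaped to the CNN feature-map grid, upsampled to the image resolution, and overlaid on the input image. The helper name and arguments are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
import skimage.transform


def plot_attention(image, words, alphas, grid_size=14):
    """Overlay per-word attention maps on the image (hypothetical helper)."""
    # image:  H x W x 3 numpy array in [0, 1]
    # words:  list of generated tokens
    # alphas: one (grid_size * grid_size,) attention weight vector per word
    cols = 5
    rows = int(np.ceil(len(words) / cols))
    for t, (word, alpha) in enumerate(zip(words, alphas)):
        plt.subplot(rows, cols, t + 1)
        plt.imshow(image)
        # Upsample the coarse attention map to the image resolution and overlay it.
        alpha_img = skimage.transform.resize(
            np.asarray(alpha).reshape(grid_size, grid_size), image.shape[:2])
        plt.imshow(alpha_img, alpha=0.6, cmap="gray")
        plt.title(word, fontsize=10)
        plt.axis("off")
    plt.tight_layout()
    plt.show()
```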