Image Caption Generator Using Bahdanau Attention Mechanism

Nikhita B Gowda (1), Vaishnavi (2), Avin Skanda B N (3), Rohan M (4), Pratheek V Raikar (5)
(1)–(5) Department of Computer Science & Engineering, JSS Science and Technology University, Mysuru, India
How to cite (IJASCE): Gowda, N. B., Vaishnavi, Skanda B N, A., Rohan, M., & Raikar, P. V. (2025). Image Caption Generator Using Bahdanau Attention Mechanism. International Journal of Advanced Science Computing and Engineering, 7(3), 93–97. https://doi.org/10.62527/ijasce.7.3.264

This project proposes an image captioning system built on an encoder-decoder framework augmented with a Bahdanau attention mechanism, a combination that sits at the intersection of computer vision and natural language processing. A CNN encoder extracts visual features, and an RNN decoder with an attention layer generates contextually appropriate descriptions by dynamically weighting the relevant image regions at each decoding step. By letting the model focus on the image regions most relevant to the word being generated, attention improves contextual relevance and semantic accuracy, aligns visual features with language more effectively, and yields captions closer to human descriptions. Inspired by works such as "Show, Attend and Tell," the model employs teacher forcing during training to accelerate learning and improve fluency, and is evaluated with standard metrics such as BLEU, on which it shows significant improvement on the Flickr8k dataset. The study also examines applications across domains, including assistive devices for visually impaired users, automated content indexing, and human–computer interaction, and identifies transformer-based attention, larger training datasets, and better generalization across diverse scenes as directions for future upgrades.
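To make the attention step concrete, the sketch below implements Bahdanau (additive) attention in PyTorch as it would sit between a CNN encoder and an RNN decoder. It is a minimal sketch under assumed dimensions: the layer sizes, tensor shapes, and names (BahdanauAttention, feature_dim, attn_dim, and so on) are illustrative, not the configuration reported in this work.

```python
# A minimal sketch of Bahdanau (additive) attention for captioning.
# All dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.W_feat = nn.Linear(feature_dim, attn_dim)    # projects CNN region features
        self.W_hidden = nn.Linear(hidden_dim, attn_dim)   # projects decoder hidden state
        self.v = nn.Linear(attn_dim, 1)                   # scores each image region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feature_dim) from the CNN encoder
        # hidden:   (batch, hidden_dim), decoder state from the previous step
        scores = self.v(torch.tanh(
            self.W_feat(features) + self.W_hidden(hidden).unsqueeze(1)
        ))                                                # (batch, num_regions, 1)
        weights = torch.softmax(scores, dim=1)            # attention over regions
        context = (weights * features).sum(dim=1)         # weighted sum: (batch, feature_dim)
        return context, weights.squeeze(-1)

# One decoding step over 64 image regions of 2048-d CNN features.
attn = BahdanauAttention(feature_dim=2048, hidden_dim=512, attn_dim=256)
features = torch.randn(4, 64, 2048)   # e.g. a flattened convolutional feature map
hidden = torch.randn(4, 512)
context, weights = attn(features, hidden)
print(context.shape, weights.shape)   # torch.Size([4, 2048]) torch.Size([4, 64])
```

The additive scoring form, v·tanh(W1·f + W2·h), is what distinguishes Bahdanau attention from dot-product variants; the softmax over regions produces the dynamic weights the decoder uses to re-read the feature map before emitting each word.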
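The teacher-forcing loop mentioned above can be sketched in the same spirit: at every step the decoder is fed the ground-truth previous word rather than its own prediction, which stabilizes and speeds up training. The module and dimension names here are again hypothetical, and attn refers to the BahdanauAttention instance from the previous sketch.

```python
# A minimal teacher-forcing training step for the attention decoder.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, feat_dim = 5000, 256, 512, 2048
embed = nn.Embedding(vocab_size, embed_dim)
rnn_cell = nn.GRUCell(embed_dim + feat_dim, hidden_dim)    # input = word + context
out_proj = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def train_step(features, captions, attention):
    # features: (batch, num_regions, feat_dim); captions: (batch, T) token ids
    batch, T = captions.shape
    hidden = torch.zeros(batch, hidden_dim)
    loss = 0.0
    for t in range(T - 1):
        context, _ = attention(features, hidden)           # attend over image regions
        x = torch.cat([embed(captions[:, t]), context], dim=1)
        hidden = rnn_cell(x, hidden)                       # one decoder step
        logits = out_proj(hidden)
        loss = loss + loss_fn(logits, captions[:, t + 1])  # predict the next gold word
    return loss / (T - 1)

# Usage with the BahdanauAttention instance (attn) defined above.
captions = torch.randint(0, vocab_size, (4, 12))           # dummy tokenized captions
loss = train_step(torch.randn(4, 64, feat_dim), captions, attn)
loss.backward()
```

At inference time the ground-truth token captions[:, t] would be replaced by the model's own previous prediction (greedy argmax or beam search); teacher forcing sidesteps that feedback loop during training, which is why it accelerates convergence.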

Y. Ming et al., "Visuals to text: A comprehensive review on automatic image captioning," IEEE/CAA J. Autom. Sinica, vol. 9, no. 8, pp. 1339–1365, Aug. 2022, doi: 10.1109/JAS.2022.105734.

A. Gupta, D. S. Bhadauria, M. Atray, and I. Kaur, "Predicting relevant captions using image caption generator in social media platforms," in Computational Methods in Science and Technology. Boca Raton, FL, USA: CRC Press, 2024, pp. 423–432, doi: 10.1201/9781003501244-65.

I. D. Mienye, T. G. Swart, and G. Obaido, "Recurrent neural networks: A comprehensive review of architectures, variants, and applications," Information, vol. 15, no. 9, p. 517, 2024, doi: 10.3390/info15090517.

S. Pandey, P. Saha, and G. Sharan, "Enhancing chest X-ray analysis using encoder-decoder with GRU for report generation," in Proc. 4th Int. Conf. Adv. Electr., Comput., Commun. Sustain. Technol. (ICAECT), Bhilai, India, 2024, pp. 1–8, doi: 10.1109/ICAECT60202.2024.10469644.

H. Suh, J. Kim, J. So, and J. Jung, "A core region captioning framework for automatic video understanding in story video contents," Int. J. Eng. Bus. Manag., vol. 14, Nov. 2022, doi: 10.1177/18479790221078130.

T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), Zurich, Switzerland, Sep. 2014, pp. 740–755, doi: 10.1007/978-3-319-10602-1_48.

Adityajn, "Flickr8K dataset," Kaggle. Accessed: [Date]. [Online]. Available: https://www.kaggle.com/datasets/adityajn105/flickr8k

O. Arshi and P. Dadure, "A comprehensive review of image caption generation," Multimed. Tools Appl., vol. 84, no. 25, pp. 29419–29471, 2025, doi: 10.1007/s11042-024-20095-0.

M. Mahajan et al., "Image captioning—A comprehensive encoder-decoder approach on Flickr8K," in Proc. Int. Conf. Autom. Comput. (AUTOCOM), Dehradun, India, 2025, pp. 1310–1315, doi: 10.1109/AUTOCOM64127.2025.10956672.

K. R. Suresh, A. Jarapala, and P. V. Sudeep, "Image captioning encoder-decoder models using CNN-RNN architectures: A comparative study," Circuits, Syst., Signal Process., vol. 41, no. 10, pp. 5719–5742, Oct. 2022, doi: 10.1007/s00034-022-02050-2.

A. Raphael, S. Abisri, E. Anitha, S. Ritika, and M. Venugopalan, "Attention based CNN-RNN hybrid model for image captioning," in Proc. IEEE 5th Glob. Conf. Adv. Technol. (GCAT), Bangalore, India, 2024, pp. 1–5, doi: 10.1109/GCAT62922.2024.10923871.

H. Parmar, M. Rai, and U. K. Murari, "A novel image caption generation based on CNN and RNN," in Proc. Int. Conf. Adv. Comput. Res. Sci., Eng. Technol. (ACROSET), Kozhikode, India, 2024, pp. 1–8, doi: 10.1109/ACROSET62108.2024.10743848.

T. Zhang, T. Zhang, Y. Zhuo, and F. Ma, "CATNIC: A feature relevance based transformer model for automatic image caption generation," SSRN Electron. J., 2022, doi: 10.2139/ssrn.4272712.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 3156–3164, doi: 10.1109/CVPR.2015.7298935.

D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Shanghai, China, Mar. 2016, pp. 4945–4949, doi: 10.1109/ICASSP.2016.7472618.

M. Wang, L. Song, X. Yang, and C. Luo, "A parallel-fusion RNN-LSTM architecture for image caption generation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Phoenix, AZ, USA, Sep. 2016, pp. 4448–4452, doi: 10.1109/ICIP.2016.7533201.

S. Sheng and M.-F. Moens, "Generating captions for images of ancient artworks," in Proc. 27th ACM Int. Conf. Multimedia, Nice, France, Oct. 2019, pp. 2478–2486, doi: 10.1145/3343031.3350972.

J. R. Chowdhury and C. Caragea, "Beam tree recursive cells," in Proc. Int. Conf. Mach. Learn. (ICML), Honolulu, HI, USA, Jul. 2023, pp. 28768–28791.

S. Kudugunta et al., "MADLAD-400: A multilingual and document-level large audited dataset," Adv. Neural Inf. Process. Syst., vol. 36, pp. 67284–67296, Dec. 2023.

S. Liu and J. Zhang, "Local alignment deep network for infrared-visible cross-modal person re-identification in 6G-enabled Internet of Things," IEEE Internet Things J., vol. 8, no. 20, pp. 15170–15179, Oct. 2021, doi: 10.1109/JIOT.2020.3038794.

N. K. Kumar, D. Vigneswari, A. Mohan, K. Laxman, and J. Yuvaraj, "Detection and recognition of objects in image caption generator system: A deep learning approach," in Proc. 5th Int. Conf. Adv. Comput. Commun. Syst. (ICACCS), Coimbatore, India, Mar. 2019, pp. 107–109, doi: 10.1109/ICACCS.2019.8728516.