Mamba-caption: Long-range sequence modelling for efficient and accurate image captioning

Tariq Shahzad, Muhammad Aoun, Tehseen Mazhar, Muhammad Usman Tariq, Khmaies Ouahada, Habib Hamam

Research output: Contribution to journal › Article › peer-review

Abstract

Image captioning is a long-standing problem in vision–language research. Standard models such as recurrent neural networks (RNNs) and Transformers struggle with both long-range dependencies and computational efficiency. To address these limitations, we present Mamba-Caption, an efficient sequence processing model that replaces attention mechanisms with selective state-space modelling. The core novelty is a Mamba-based decoder that replaces self-attention with selective state-space updates, enabling linear-time caption generation while preserving long-range token dependencies; this decoder is a drop-in language-side component that conditions on a convolutional neural network (CNN) image embedding without domain-specific heuristics. The model comprises a CNN encoder, a token embedding layer, and the Mamba-based decoder, which is trained using teacher forcing with a cross-entropy objective. Evaluated on the Flickr30k dataset, our model outperforms the baselines on all standard metrics, achieving a Bilingual Evaluation Understudy (BLEU-1) score of 0.83, a Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.79, a Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence (ROUGE-L) score of 0.73, and a Consensus-based Image Description Evaluation (CIDEr) score of 1.30. We further contextualize efficiency through a qualitative and complexity-based discussion and an ablation framing that isolates decoder-side design choices, reinforcing that the efficiency gains do not sacrifice accuracy. Its efficiency and generalizability make Mamba-Caption applicable to real-world captioning tasks.
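The linear-time claim for the decoder can be illustrated with a minimal sketch of a selective state-space recurrence. This is not the authors' implementation: it is a single-channel, pure-Python toy under stated assumptions, with a diagonal state matrix, a simplified discretization (decay exp(delta * a), input term delta * B * x), and hypothetical parameter names (a_diag, w_b, w_c, w_delta, b_delta) chosen only for illustration. The step size and the input/output projections depend on the current token, which is the "selective" part, and each token costs a fixed amount of work, so a sequence of length T costs O(T).

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a_diag, w_b, w_c, w_delta, b_delta=0.0):
    """Single-channel selective state-space recurrence (illustrative only).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * B_t * x_t
    y_t = C_t . h_t
    where delta_t, B_t and C_t are all functions of the current input x_t.
    """
    n = len(a_diag)
    h = [0.0] * n          # hidden state carried across the whole sequence
    ys = []
    for x in xs:           # one pass over the sequence: O(T * n) total work
        delta = softplus(w_delta * x + b_delta)    # input-dependent step size
        y = 0.0
        for i in range(n):
            decay = math.exp(delta * a_diag[i])    # in (0, 1) when a_diag[i] < 0
            b_t = w_b[i] * x                       # input-dependent input projection
            c_t = w_c[i] * x                       # input-dependent output projection
            h[i] = decay * h[i] + delta * b_t * x
            y += c_t * h[i]
        ys.append(y)
    return ys


# Hypothetical toy inputs; real models use learned, multi-channel parameters.
tokens = [0.5, -1.0, 2.0, 0.0]
outputs = selective_scan(tokens, a_diag=[-1.0, -0.5],
                         w_b=[1.0, 0.3], w_c=[0.7, -0.2], w_delta=1.0)
print(len(outputs))
```

Because the state h summarizes all earlier tokens, each output depends only on the current and previous inputs, and no pairwise token comparison (as in self-attention) is needed.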

Original language: English
Article number: 100538
Journal: Array
Volume: 28
Publication status: Published - Dec 2025

Keywords

  • BLEU score
  • CIDEr
  • CNN
  • Decoder
  • Encoder
  • Mamba
  • METEOR
  • ResNet
  • RNN

ASJC Scopus subject areas

  • General Computer Science
