Image Caption Algorithm Based on ViLBERT and BiLSTM
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Traditional image captioning has the problems of the under-utilization of extracted image features, the lack of context information learning and too many training parameters. This study proposes an image captioning algorithm based on Vision-and-Language BERT (ViLBERT) and Bidirectional Long Short-Term Memory network (BiLSTM). The ViLBERT model is used as an encoder, which can combine image features and descriptive text information through the co-attention mechanism and output the joint feature vector of image and text. The decoder uses a BiLSTM combined with attention mechanism to generate image caption. The algorithm is trained and tested on MSCOCO2014, and the scores of evaluation criteria BLEU-4 and BLEU are 36.9 and 125.2 respectively. This indicates that the proposed algorithm is better than the image captioning based on the traditional image feature extraction combined with the attention mechanism. The comparison of generated text descriptions demonstrates that the image caption generated by this algorithm can describe the image information in more detail.

    Reference
    Related
    Cited by
Get Citation

许昊,张凯,田英杰,种法广,王子超.基于ViLBERT与BiLSTM的图像描述算法.计算机系统应用,2021,30(11):195-202

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:December 29,2020
  • Revised:February 03,2021
  • Adopted:
  • Online: October 22,2021
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-3
Address:4# South Fourth Street, Zhongguancun,Haidian, Beijing,Postal Code:100190
Phone:010-62661041 Fax: Email:csa (a) iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063