食品科学 (Food Science)

Hash Food Image Retrieval Based on Enhanced Vision Transformer

CAO Pindan, MIN Weiqing, SONG Jiajun, SHENG Guorui, YANG Yancun, WANG Lili, JIANG Shuqiang

  1. School of Information and Electrical Engineering, Ludong University, Yantai, Shandong 264025, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. School of Agricultural and Rural Development, Renmin University of China, Beijing 100872, China
  • Received: 2024-01-01  Revised: 2024-02-27  Online: 2024-03-06  Published: 2024-03-06
  • Corresponding author: WANG Lili

Abstract: Food image retrieval, a major task in food computing, has attracted extensive attention in recent years. However, it faces two primary challenges. First, food images exhibit fine-grained characteristics: visual differences between food categories can be subtle and often observable only in local regions of an image. Second, food images contain rich semantic information, such as ingredients and cooking methods, whose extraction and utilization are crucial for improving retrieval performance. To address these issues, this paper proposes an Enhanced ViT Hash Network (EVHNet) built on a pre-trained Vision Transformer (ViT) model. To handle the fine-grained nature of food images, EVHNet includes a Local Feature Enhancement Module, whose convolutional structure enables the network to learn more representative features. To better exploit the semantic information in food images, EVHNet also includes an Aggregated Semantic Feature Module, which aggregates the semantic information in food images based on class-token features. EVHNet is evaluated under three popular hash-based image retrieval frameworks, Greedy Hash (GreedyHash), Central Similarity Quantization (CSQ), and Deep Polarized Network (DPN), and compared with four mainstream network models: AlexNet, ResNet50, ViT-B_32, and ViT-B_16. Experimental results on the Food-101, Vireo Food-172, and UEC Food-256 food datasets show that EVHNet achieves the best overall retrieval accuracy among the compared models.

Key words: food image retrieval, food computing, hash retrieval, Vision Transformer network, deep hash learning
