食品科学 (Food Science)

Hash Food Image Retrieval Based on Enhanced Vision Transformer

CAO Pindan, MIN Weiqing, SONG Jiajun, SHENG Guorui, YANG Yancun, WANG Lili, JIANG Shuqiang

  1. School of Information and Electrical Engineering, Ludong University, Yantai, Shandong 264025, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. School of Agricultural and Rural Development, Renmin University of China, Beijing 100872, China
  • Received: 2024-01-01  Revised: 2024-02-27  Online: 2024-03-06  Published: 2024-03-06
  • Corresponding author: WANG Lili

Abstract: Food image retrieval, a major task in food computing, has attracted extensive attention in recent years. However, it faces two primary challenges. First, food images exhibit fine-grained characteristics: visual differences between food categories can be subtle and often observable only in local regions of an image. Second, food images contain rich semantic information, such as ingredients and cooking methods, whose extraction and utilization are crucial for improving retrieval performance. To address these issues, this paper proposes an Enhanced ViT Hash Network (EVHNet) built on a pre-trained Vision Transformer (ViT) model. To handle the fine-grained nature of food images, EVHNet includes a Local Feature Enhancement Module, whose convolutional structure enables the network to learn more representative features. To better exploit the semantic information in food images, EVHNet also includes an Aggregated Semantic Feature Module, which aggregates the semantic information in food images based on class-token features. EVHNet is evaluated under three popular hash-based image retrieval frameworks, Greedy Hash (GreedyHash), Central Similarity Quantization (CSQ), and Deep Polarized Network (DPN), and compared with four mainstream network models: AlexNet, ResNet50, ViT-B_32, and ViT-B_16. Experimental results on the Food-101, Vireo Food-172, and UEC Food-256 food datasets show that EVHNet achieves the best overall retrieval accuracy among the compared models.

Key words: food image retrieval, food computing, hash retrieval, Vision Transformer network, deep hash learning
