FOOD SCIENCE ›› 2024, Vol. 45 ›› Issue (10): 1-8. DOI: 10.7506/spkx1002-6630-20231231-270

• Machine Learning •

Hash Food Image Retrieval Based on Enhanced Vision Transformer

CAO Pindan, MIN Weiqing, SONG Jiajun, SHENG Guorui, YANG Yancun, WANG Lili, JIANG Shuqiang   

(1. School of Information and Electrical Engineering, Ludong University, Yantai 264025, China; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. School of Agricultural Economics and Rural Development, Renmin University of China, Beijing 100872, China)
• Online: 2024-05-25  Published: 2024-06-08

Abstract: Food image retrieval, a major task in food computing, has garnered extensive attention in recent years. However, it faces two primary challenges. First, food images exhibit fine-grained characteristics: visual differences between food categories may be subtle and can often be observed only in local regions of the image. Second, food images contain rich semantic information, such as ingredients and cooking methods, whose extraction and utilization are crucial for improving retrieval performance. To address these issues, this paper proposes an enhanced ViT hash network (EVHNet) built on a pre-trained Vision Transformer (ViT) model. Given the fine-grained nature of food images, a convolution-based local feature enhancement module was designed in EVHNet to enable the network to learn more representative features. To better leverage the semantic information in food images, an aggregated semantic feature module was designed to aggregate semantic information based on the class token features. The proposed EVHNet model was evaluated under three popular hash image retrieval frameworks, namely greedy hash (GreedyHash), central similarity quantization (CSQ), and deep polarized network (DPN), and compared with four mainstream network models: AlexNet, ResNet50, ViT-B_32, and ViT-B_16. Experimental results on the Food-101, Vireo Food-172, and UEC Food-256 datasets demonstrated that EVHNet outperformed the other models in overall retrieval accuracy.
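To make the described architecture concrete, the sketch below is a minimal, hypothetical PyTorch rendering of the two components named in the abstract: a convolutional local feature enhancement block applied to the ViT patch tokens, and a class-token-based semantic aggregation feeding a hash layer. The module names, dimensions, fusion by concatenation, and the tanh/sign binarization (in the style of GreedyHash-type training) are assumptions for illustration, not the paper's exact EVHNet design.

# Minimal sketch of an EVHNet-style hash head on ViT outputs.
# Assumptions: ViT-B-like token dimension 768, a 14x14 patch grid (224x224
# input, patch size 16), 64-bit hash codes, concatenation-based fusion.
import torch
import torch.nn as nn

class LocalFeatureEnhancement(nn.Module):
    """Convolutional block over patch tokens reshaped to a 2-D grid,
    intended to emphasize fine-grained local regions (hypothetical design)."""
    def __init__(self, dim: int, grid: int):
        super().__init__()
        self.grid = grid  # patches per side, e.g. 14 for ViT-B_16 at 224x224
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise conv
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) with N = grid * grid
        B, N, D = patch_tokens.shape
        x = patch_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = self.conv(x)
        return x.flatten(2).transpose(1, 2)  # back to (B, N, D)

class EVHNetHead(nn.Module):
    """Hash head fusing enhanced local features with the class token."""
    def __init__(self, dim: int = 768, grid: int = 14, hash_bits: int = 64):
        super().__init__()
        self.enhance = LocalFeatureEnhancement(dim, grid)
        self.fc = nn.Linear(2 * dim, hash_bits)

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        local = self.enhance(patch_tokens).mean(dim=1)  # pooled local features (B, D)
        fused = torch.cat([cls_token, local], dim=-1)   # aggregate with class-token semantics
        return torch.tanh(self.fc(fused))               # smooth relaxation of sign() for training

# Toy usage with random stand-ins for ViT outputs; at retrieval time the
# continuous outputs are binarized with sign() and compared by Hamming distance.
head = EVHNetHead()
cls_tok = torch.randn(2, 768)      # class token features
patches = torch.randn(2, 196, 768) # 14*14 patch tokens
codes = torch.sign(head(cls_tok, patches))  # binary hash codes in {-1, +1}
print(codes.shape)  # torch.Size([2, 64])

In a full pipeline, the pre-trained ViT backbone would supply cls_tok and patches, and the head would be trained under one of the hash frameworks named above (GreedyHash, CSQ, or DPN), each of which supplies its own loss on the relaxed codes.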

Key words: food image retrieval; food computing; hash retrieval; Vision Transformer network; deep hash learning

CLC Number: