

Title
End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning
Authors Chao Wang, Wei Luo, Jia-Rui Zhu, Ying-Chun Xia, Jin He, Li-Chuan Gu
Abstract

Visual grounding locates a target object or region in an image from a natural-language expression. Most current methods extract visual features and text embeddings independently and then apply complex fusion reasoning to locate the object mentioned in the query text. However, independently extracted visual features often include content that is irrelevant to the query text or misleading, which degrades the subsequent multimodal fusion module and, in turn, target localization. This study introduces a combined network model based on the transformer architecture that achieves more accurate visual grounding by using the query text to guide visual feature generation and by performing multi-stage fusion reasoning. Specifically, the visual feature generation module uses query text features as guidance to reduce interference from irrelevant features and to generate visual features related to the query text. The multi-stage fusion reasoning module takes these query-relevant visual features together with the query text embeddings, performs multi-stage interactive reasoning, and further infers the correlation between the target image and the query text in order to localize the object described by the query accurately. The effectiveness of the proposed model is verified experimentally on five public datasets, where it outperforms state-of-the-art methods. In terms of top-1 accuracy, it improves on the previous state-of-the-art methods by 1.04%, 2.23%, 1.00% and 2.51% on TestA and TestB of the RefCOCO and RefCOCO+ datasets, respectively.
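The idea of letting the query text guide visual feature generation can be illustrated with a toy example. The sketch below is not the authors' model: it assumes a single text embedding, scores each visual token against it with a scaled dot product, and rescales the tokens by the resulting softmax weights. The function names (`text_guided_weights`, `guide_visual_features`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_weights(visual_tokens, text_embedding):
    """Relevance of each visual token to the query text.

    Assumption: relevance is a scaled dot product followed by a softmax,
    as in standard attention. The abstract does not specify the scoring.
    """
    d = len(text_embedding)
    scores = [
        sum(v * t for v, t in zip(token, text_embedding)) / math.sqrt(d)
        for token in visual_tokens
    ]
    return softmax(scores)


def guide_visual_features(visual_tokens, text_embedding):
    """Scale each visual token by its relevance weight (hypothetical gating)."""
    weights = text_guided_weights(visual_tokens, text_embedding)
    return [[w * v for v in token] for w, token in zip(weights, visual_tokens)]


# Toy data: three 2-dimensional visual tokens and one query embedding.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embedding = [1.0, 0.0]

weights = text_guided_weights(visual_tokens, text_embedding)
guided = guide_visual_features(visual_tokens, text_embedding)
```

In this toy setting the token aligned with the query receives the largest weight and the orthogonal token the smallest, so features unrelated to the query are attenuated before any later fusion step. A transformer-based model would typically use learned projections and multi-head attention instead of this fixed scoring.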

 

Pages 083-095
Keywords visual grounding, query text guidance, Swin-transformer, attention module, multi-stage reasoning
Journal 電腦學刊 (Journal of Computers)
Issue 202402 (Vol. 35, No. 1)
 
