Compositional Visual-Linguistic Models Via Visual Markers and Counterfactual Examples
Author: UWMadison MLOPT Idea Seminar
Uploaded: 2024-03-15
Speaker: Mu Cai (UW-Madison)
Time: May 15, 2024, 12:30 PM – 1:30 PM CT
Title: Compositional Visual-Linguistic Models Via Visual Markers and Counterfactual Examples
Abstract: Vision-language models like CLIP, GPT-4, and LLaVA have made significant advances in visual recognition and reasoning, yet they still struggle with region-level visual information and complex linguistic concepts, such as distinguishing "black shirt and blue pants" from "blue shirt and black pants". Our research indicates that compositionality can enhance these models' capabilities. By overlaying visual markers on images, our refined method reaches state-of-the-art performance in region-level understanding. Moreover, we find that using counterfactual reasoning to curate compositional images and captions improves the model's understanding of complex object relationships. We also demonstrate that visual markers can be represented as Scalable Vector Graphics (SVG), allowing visual information to be expressed textually and thereby eliminating the need for a visual encoder when building vision-language models.
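To make the SVG idea concrete, here is a minimal sketch (not the speaker's actual implementation) of how a region-level visual marker, such as a bounding box with a label, could be serialized as SVG text that a language model can consume directly; the function name and coordinates are hypothetical.

```python
# Sketch: serialize a bounding-box marker as SVG text, so region-level
# visual information can be passed to a language model as plain text
# instead of through a visual encoder. Illustrative only.

def box_to_svg(x, y, w, h, img_w, img_h, label="region"):
    """Render a hypothetical bounding-box marker as an SVG string."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{img_w}" height="{img_h}">'
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
        f'fill="none" stroke="red" stroke-width="3"/>'
        f'<text x="{x}" y="{y - 4}" fill="red">{label}</text>'
        f'</svg>'
    )

svg = box_to_svg(40, 60, 120, 80, img_w=640, img_h=480, label="blue pants")
print(svg)
```

The resulting string is valid SVG markup, so the same representation can be rendered as an overlay on the image or fed verbatim into a text-only model.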
Bio: Mu Cai is a fourth-year Ph.D. student in the Computer Sciences Department at the University of Wisconsin-Madison, advised by Prof. Yong Jae Lee. His research interests lie at the intersection of deep learning and computer vision, with a particular focus on multimodal generative models and video and 3D understanding.
Location: Engineering Research Building (1500 Engineering Drive) Room 106