Can MLLMs Perform Text-to-Image In-Context Learning? (Re-recorded version)
Author: UWMadison MLOPT Idea Seminar
Uploaded: 2024-02-23
Views: 315
This video was re-recorded because the original presentation was not captured during the talk.
Speaker: Yuchen Zeng (https://yzeng58.github.io/zyc_cv/) from UW-Madison
Time: Feb 23, 2024, 12:45 PM – 1:45 PM CT
Paper Link: https://arxiv.org/abs/2402.01293
Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have primarily concentrated on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies such as fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab....
Bio: Yuchen is a PhD student in the Department of Computer Science at the University of Wisconsin-Madison, advised by Prof. Kangwook Lee. Her current research centers on large language models.
Location: Engineering Research Building (1550 Engineering Drive) Room 106