Bee: 15M CoT Data, Pipeline, and 8B MLLM
Author: AI Research Roundup
Uploaded: 2025-10-16
Views: 32
In this AI Research Roundup episode, Alex discusses the paper:
'Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs'
The work tackles the performance gap in fully open multimodal LLMs by improving SFT data quality and boosting complex Chain-of-Thought coverage. It releases Honey-Data-15M (≈12.2M short-CoT, ≈2.7M long-CoT), plus HoneyPipe/DataStudio—an automated curation pipeline with deduplication, rule/model-based filtering, CoT enrichment, and LLM-as-a-judge verification. The dual-level CoT design routes medium items to large-scale short-CoT via Qwen2.5-VL and sends hard cases to long-CoT with stronger models, all verified by Qwen2.5-VL-72B. The suite is validated by training Bee-8B, demonstrating the pipeline’s effectiveness.
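The dual-level routing described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the difficulty labels, function names, and the stubbed judge are assumptions; only the model names (Qwen2.5-VL for short-CoT, a stronger model for long-CoT, Qwen2.5-VL-72B as judge) come from the description.

```python
# Hypothetical sketch of HoneyPipe's dual-level CoT routing and verification.
# Difficulty scoring, tier names, and the judge stub are illustrative assumptions.

def route_sample(sample: dict) -> str:
    """Assign a curated sample to a CoT annotation tier by difficulty."""
    if sample["difficulty"] == "hard":
        # Hard cases are sent to a stronger model for long-CoT enrichment.
        return "long-CoT"
    # Medium items get large-scale short-CoT annotation via Qwen2.5-VL.
    return "short-CoT"

def judge_accepts(cot: str) -> bool:
    """LLM-as-a-judge verification step (stubbed here).
    The paper uses Qwen2.5-VL-72B to verify enriched CoT data."""
    return bool(cot.strip())  # placeholder: accept any non-empty CoT

# Toy batch: one medium and one hard sample.
batch = [
    {"id": 1, "difficulty": "medium", "cot": "Step 1: ..."},
    {"id": 2, "difficulty": "hard", "cot": "Step 1: ... Step 12: ..."},
]
routes = [route_sample(s) for s in batch]
verified = [s for s in batch if judge_accepts(s["cot"])]
```

At full scale this routing yields the reported split: roughly 12.2M short-CoT and 2.7M long-CoT samples in Honey-Data-15M.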
Paper URL: https://arxiv.org/abs/2510.13795
#AI #MachineLearning #DeepLearning #Multimodal #LLM #ChainOfThought #OpenSource #Dataset
Resources:
Hugging Face model: https://huggingface.co/Open-Bee/Bee-8...
Hugging Face model 2: https://huggingface.co/Open-Bee/Bee-8...