DeepOCR: Reproduction of Optical Context Compression. Vision-language model (VLM), VILA-based.
Author: AI Podcast Series. Byte Goose AI.
Uploaded: 2025-11-17
Views: 21
The podcast gives a technical overview of DeepSeek-OCR and DeepOCR. DeepSeek-OCR is a vision-language model designed to explore and validate "contexts optical compression" for long documents: large amounts of text are rendered as images and represented with far fewer vision tokens than the text itself would need. The approach achieves compression ratios between 7× and 20× while maintaining high Optical Character Recognition (OCR) accuracy. The core technology is the DeepEncoder, an architecture that combines a window-attention component (SAM-base) for high-resolution perception with a global-attention component (CLIP-large), bridged by a 16× convolutional compressor that reduces the number of vision tokens. One source details the original research and its performance metrics, reporting state-of-the-art results on benchmarks such as OmniDocBench with fewer vision tokens than competing models. The other sources present DeepOCR, an open-source reproduction of the architecture built on the VILA framework with a Qwen2-7B decoder, which supports the feasibility and efficiency of the compression hypothesis for long-context challenges in Large Language Models.
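To make the token arithmetic behind the compression claim concrete, here is a minimal Python sketch. Only the 16× compressor factor comes from the description above; the 16-pixel patch size, the 1024×1024 input resolution, and the 2,560-token page are illustrative assumptions, and the function names are hypothetical rather than part of DeepSeek-OCR or DeepOCR.

```python
def vision_tokens(image_size: int, patch_size: int = 16, compressor_factor: int = 16) -> int:
    """Vision tokens left after patching a square image and compressing the patch tokens.

    patch_size=16 is an assumption for illustration; compressor_factor=16
    matches the 16x convolutional compressor named in the description.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    patch_tokens = patches_per_side ** 2        # tokens seen by the window-attention stage
    return patch_tokens // compressor_factor    # tokens passed on to the global-attention stage


def compression_ratio(text_tokens: int, n_vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / n_vision_tokens


# Hypothetical page: 2,560 text tokens rendered into one 1024x1024 image.
n_vision = vision_tokens(1024)
ratio = compression_ratio(2560, n_vision)
print(n_vision, ratio)
```

Under these assumptions the page is represented by 256 vision tokens, a 10× ratio, which falls inside the 7× to 20× range quoted above.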