DeepSeek Unveils Multimodal AI for Efficient Text Compression
- tech360.tv
DeepSeek, a Hangzhou-based artificial intelligence start-up, released a new open-source multimodal AI model on Monday. Named DeepSeek-OCR, the model processes large documents with significantly fewer tokens by utilising visual perception to compress information.

DeepSeek-OCR, available on the developer platforms Hugging Face and GitHub, emerged from an investigation into the role vision encoders can play in compressing text for large language models. This approach enables LLMs to process extensive text without a proportional increase in computing costs.
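For readers who want to try the model, a minimal loading sketch using the Hugging Face transformers library is shown below. The repository id deepseek-ai/DeepSeek-OCR and the use of trust_remote_code are assumptions based on how such releases are typically published; the exact inference entry point is defined by the repository's own code and model card.

```python
# Minimal sketch: loading DeepSeek-OCR from Hugging Face with transformers.
# Assumptions: the repository id "deepseek-ai/DeepSeek-OCR" and the need for
# trust_remote_code=True are inferred from the release, not confirmed here.
from transformers import AutoModel, AutoTokenizer

repo_id = "deepseek-ai/DeepSeek-OCR"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

# The repository ships its own inference entry point (see its model card);
# a typical flow is: load a document page as an image, pass it to the model
# together with an OCR prompt, and decode the returned text tokens.
```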
The company said DeepSeek-OCR achieves a significant token reduction, between seven and 20 times, across different stages of historical context. This offers a promising direction for addressing long-context challenges in LLMs.
This release continues DeepSeek's efforts to enhance AI model efficiency and reduce building and usage costs. The organisation followed the same principle in developing its open-source V3 and R1 models, released in December and January, respectively.

DeepSeek-OCR comprises two main components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. DeepEncoder serves as the model's core engine, maintaining low activation under high-resolution inputs while achieving strong compression ratios.
The decoder, a 3-billion-parameter Mixture-of-Experts model with roughly 570 million parameters active per token, reconstructs the original text from the compressed visual representation. Its architecture divides the model into separate sub-networks, or experts, each specialising in a subset of the input data so that they jointly perform the task, as sketched below.
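As an illustration of the Mixture-of-Experts idea described above (not DeepSeek's actual implementation), the sketch below routes each token to a small number of expert sub-networks chosen by a learned gate and mixes their outputs; all dimensions and weights are placeholder values.

```python
# Illustrative Mixture-of-Experts routing (placeholder sizes, not DeepSeek's code).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2                 # hypothetical dimensions
gate_w = rng.standard_normal((d_model, n_experts))   # gating weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    out = np.zeros_like(x)
    for i, tok in enumerate(x):                      # x: (n_tokens, d_model)
        scores = tok @ gate_w                        # gate score per expert
        top = np.argsort(scores)[-top_k:]            # pick the top-k experts
        weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
        for w, e in zip(weights, top):
            out[i] += w * (tok @ experts[e])         # weighted expert outputs
    return out

tokens = rng.standard_normal((4, d_model))           # 4 dummy input tokens
print(moe_layer(tokens).shape)                       # (4, 64)
```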
Beyond standard vision tasks like image captioning and object detection, DeepSeek-OCR parses highly structured visual content. This includes tables, formulas, and geometric diagrams, benefiting applications in finance and science.
Benchmark tests showed DeepSeek-OCR reached 97% decoding accuracy when the number of text tokens was within ten times the number of vision tokens, that is, at compression ratios below 10x. Even at a 20x ratio, the model maintained around 60% accuracy, preserving much of the information despite the extreme compression.
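The compression ratio quoted above is simply the number of text tokens a passage would normally require divided by the number of vision tokens the rendered page is encoded into. The snippet below works through that arithmetic with hypothetical token counts chosen only to mirror the two regimes reported.

```python
# Back-of-envelope compression ratio: text tokens per vision token.
# Token counts here are hypothetical, picked to mirror the reported regimes.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

print(compression_ratio(1000, 100))   # 10.0 -> near-lossless regime (~97% accuracy)
print(compression_ratio(2000, 100))   # 20.0 -> lossy regime (~60% accuracy)
```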
On OmniDocBench, a benchmark for diverse document understanding, DeepSeek-OCR surpassed major OCR models, including GOT-OCR 2.0 and MinerU 2.0. It accomplished this while using far fewer tokens.
The new model can also generate over 200,000 pages of training data daily on a computing system powered by a single Nvidia A100-40G graphics processing unit.
DeepSeek-OCR also points to a scalable approach to ultra-long context processing: recent content is preserved at high resolution, while older context consumes progressively fewer computing resources.
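One way to picture this tiered treatment of context (an illustrative sketch, not DeepSeek's published mechanism) is to give each chunk of history a vision-token budget that shrinks with age, so recent pages keep high resolution while older pages are compressed harder.

```python
# Illustrative recency-based token budgeting for ultra-long contexts.
# The budgets, decay factor and floor are hypothetical, not DeepSeek's numbers.
def token_budget(age: int, base_tokens: int = 400, decay: float = 0.5,
                 floor: int = 25) -> int:
    """Return the vision-token budget for a context chunk `age` steps old."""
    return max(floor, int(base_tokens * decay ** age))

for age in range(5):
    print(f"chunk age {age}: {token_budget(age)} vision tokens")
# chunk age 0 keeps 400 tokens; older chunks get progressively fewer, down to the floor.
```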
Such a tiered scheme suggests DeepSeek-OCR could support theoretically unlimited context architectures that balance information retention with efficiency. Separately, in late September, the company launched DeepSeek V3.2-Exp, an experimental version of its V3 model.
DeepSeek V3.2-Exp improves training and inference efficiency, while sharply reducing application programming interface costs.
- DeepSeek released DeepSeek-OCR, a new open-source multimodal AI model, on Monday.
- The model uses visual perception to compress text input, significantly reducing tokens for large language models.
- DeepSeek-OCR improves AI model efficiency, lowers computing costs, and enhances long-context processing.
Source: SCMP