OCR's Cambrian Explosion 1 - Introduction

Strong, open VLMs have enabled an explosion of open OCR model releases, with little sign of slowing down. In this survey, I detail the models, their evaluation, research trends, and open questions.

By Joe Barrow, 2026-05-12

Until the past few years, incorporating OCR largely meant one of three things: forking over money to a cloud provider, figuring out how to run a research codebase, or accepting the poor quality of something like Tesseract. Recent strong, open vision-language models (VLMs) like the Qwen series ^[1,2] have led to a huge uptick in the number, diversity, and accuracy of OCR models. Now, if you want to run your own OCR model, you have many more decisions to make. This series is a guide to how these models are built, with an eye towards where the field is going and why you might choose one model over another.

The chart below records every named OCR-capable VLM release I have catalogued from September 2021 through May 2026.

Fig. 01 / OCR releases

VLMs with explicit OCR or document-parsing capabilities, by release month

n = 43 releases

October 2025 alone saw six new releases.

Dating Releases

Releases are dated by their publication date (arXiv or blog post). It’s not an exact science, but the overall trends hold.

The Phylogenetic Tree

VLM-based OCR approaches can be split relatively cleanly along three decision points: grounded v. ungrounded, layout v. blocks, and pipelined v. single pass. Figure 02 shows how these decision points break down, and Figure 03 shows the output of three different models on the same page.

Grounded v. Ungrounded – Ungrounded approaches, like LightOnOCR ^[3] or olmOCR 2 ^[4], convert a whole page image into pure text, such as markdown with TeX formulas. Grounded approaches, like Chandra and Kosmos-2.5, output coordinates along with the text.
Layout v. Blocks – Within grounded approaches, the model creators must decide whether to output labeled layout elements like “Heading” and “Table” or just text runs.
Single Pass v. Pipeline – A single pass approach consumes a whole page at once and outputs the full OCR of the page. This has the benefit of being globally coherent, but increases latency for autoregressive models. A pipelined approach splits this into two passes: first, get the coordinates of layout elements like paragraphs, formulas, and tables; second, run OCR on each of those elements separately.
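
The single pass v. pipeline distinction can be sketched in a few lines of Python. This is a toy illustration, not any particular model's code: the "page" is a list of text rows instead of an image, and `Region`, `toy_detect_layout`, and `toy_ocr` are hypothetical stand-ins for a layout detector and an OCR model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy stand-in for a page image: one string per text row, "" for whitespace.
Page = Sequence[str]


@dataclass
class Region:
    start: int  # first row of the region (inclusive)
    end: int    # last row of the region (exclusive)


def single_pass(page: Page, ocr_page: Callable[[Page], str]) -> str:
    """One model call sees the whole page and emits the whole transcription."""
    return ocr_page(page)


def pipelined(
    page: Page,
    detect_layout: Callable[[Page], List[Region]],
    ocr_region: Callable[[Page], str],
) -> str:
    """Pass 1 finds regions; pass 2 runs OCR on each region independently."""
    regions = detect_layout(page)
    chunks = [ocr_region(page[r.start:r.end]) for r in regions]
    return "\n\n".join(chunks)


def toy_detect_layout(page: Page) -> List[Region]:
    """Hypothetical detector: a region is a run of consecutive non-empty rows."""
    regions: List[Region] = []
    start = None
    for i, row in enumerate(page):
        if row and start is None:
            start = i
        elif not row and start is not None:
            regions.append(Region(start, i))
            start = None
    if start is not None:
        regions.append(Region(start, len(page)))
    return regions


def toy_ocr(rows: Page) -> str:
    """Hypothetical OCR model: joins whatever rows it is shown."""
    return " ".join(row for row in rows if row)


page = ["Title", "", "First paragraph,", "continued.", "", "x = 1"]
print(single_pass(page, toy_ocr))
print(pipelined(page, toy_detect_layout, toy_ocr))
```

In the pipelined version the per-region calls are independent of one another, so they could be batched or run in parallel. The cost is that no single call sees the whole page.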

Fig. 02 / Phylogeny

A rough phylogeny of VLM-based OCR approaches

Comparing OCR Outputs

To make the differences between approaches concrete, the figures below run three open models against the same input pages: LightOnOCR 2 (1B, ungrounded markdown), Kosmos-2.5 (line-level blocks, no roles), and Chandra (block-level with role labels). Each tab overlays one model’s blocks on the page, with the raw output next to it.

Fig. 03 / OCR comparison — page 1

Three OCR models, three types of outputs
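
One way to think about the three kinds of output is as a single container type with optional fields. The sketch below is an assumption for illustration only: `Block`, `PageOCR`, the `(x0, y0, x1, y1)` box convention, and the sample values are hypothetical, and none of them reflect the actual output formats of LightOnOCR, Kosmos-2.5, or Chandra.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed convention: (x0, y0, x1, y1) in page pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class Block:
    bbox: BBox
    text: str
    role: Optional[str] = None  # e.g. "Heading" or "Table"; None for bare text runs


@dataclass
class PageOCR:
    markdown: Optional[str] = None       # set by ungrounded models
    blocks: Optional[List[Block]] = None  # set by grounded models

    @property
    def grounded(self) -> bool:
        """True when the output carries coordinates."""
        return self.blocks is not None

    @property
    def has_layout(self) -> bool:
        """True when every block also carries a role label."""
        return bool(self.blocks) and all(b.role is not None for b in self.blocks)


def to_text(result: PageOCR) -> str:
    """Flatten any of the three output kinds to plain text."""
    if result.markdown is not None:
        return result.markdown
    return "\n".join(b.text for b in (result.blocks or []))


# Made-up sample outputs, one per category.
ungrounded = PageOCR(markdown="# Title\n\nBody")
text_runs = PageOCR(blocks=[
    Block((0, 0, 100, 20), "Title"),
    Block((0, 30, 100, 50), "Body"),
])
with_roles = PageOCR(blocks=[
    Block((0, 0, 100, 20), "Title", role="Heading"),
    Block((0, 30, 100, 50), "Body", role="Text"),
])

for name, result in [("ungrounded", ungrounded),
                     ("text runs", text_runs),
                     ("with roles", with_roles)]:
    print(name, result.grounded, result.has_layout)
```

A downstream consumer that only needs text can call `to_text` on any of the three, while one that needs to highlight regions on the page has to check `grounded` first.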

References

[1] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2024). https://arxiv.org/abs/2409.12191
[2] Qwen2.5-VL Technical Report (2025). https://arxiv.org/abs/2502.13923
[3] LightOnOCR: An Efficient Open OCR Model. LightOn AI, Hugging Face Blog (2025). https://huggingface.co/blog/lightonai/lightonocr
[4] olmOCR 2: Unit Test Rewards for Document OCR (2025). https://arxiv.org/abs/2510.19817

Series

OCR's Cambrian Explosion

OCR's Cambrian Explosion 1 - Introduction
OCR's Cambrian Explosion 2 - Models and Pipelines (upcoming)
OCR's Cambrian Explosion 3 - Training and Data (upcoming)
OCR's Cambrian Explosion 4 - Evaluation and Benchmarking (upcoming)
OCR's Cambrian Explosion 5 - Model Efficiency (upcoming)
OCR's Cambrian Explosion 6 - Backbones (upcoming)
OCR's Cambrian Explosion 7 - Open Questions (upcoming)