OCR's Cambrian Explosion · part 1 of 7
OCR's Cambrian Explosion 1 - Introduction
Strong, open VLMs enabled an explosion of open OCR model releases, with little sign of things letting up. In this survey, I detail the models, their evaluation, research trends, and open questions.
Until the past few years, incorporating OCR largely meant one of three things: paying a cloud provider, figuring out how to run a research codebase, or accepting poor quality from something like Tesseract. Recent strong, open vision-language models (VLMs) like the Qwen series [1,2] have led to a huge uptick in the number, diversity, and accuracy of OCR models. Now, if you want to run your own OCR model, you have many more decisions to make. This series is a guide to how these models are built, with an eye towards where the field is going and why you might choose one model over another.
The chart below records every named OCR-capable VLM release I have catalogued from September 2021 through May 2026.
VLMs with explicit OCR or document-parsing capabilities, by release month
October 2025 alone saw six new releases.
Dating Releases
Note that the releases are dated by their publication (arXiv or blog) date. It’s not an exact science, but the trends are still there.
The Phylogenetic Tree
VLM-based OCR approaches can be split relatively cleanly along three decision points: grounded v. ungrounded, layout v. blocks, and pipelined v. single pass. The breakdown of these decision points is shown in Figure 02. An example of three different model outputs on the same page is shown in Figure 03.
- Grounded v. Ungrounded – Ungrounded approaches, like LightOnOCR [3] or olmOCR 2 [4], convert a whole page image into pure text, such as markdown with TeX formulas. Grounded approaches, like Chandra and Kosmos-2.5, output coordinates along with the text.
- Layout v. Blocks – Within grounded approaches, the model creators must decide whether to output layout elements like “Heading” and “Table” or just text runs.
- Single Pass v. Pipeline – A single-pass approach consumes a whole page at once and outputs the full OCR of the page. This has the benefit of being globally coherent, but increases latency for autoregressive models, since the entire page is decoded as one long sequence. A pipelined approach splits the work into two passes: first, get the coordinates of layout elements like paragraphs, formulas, and tables; second, run OCR on each element separately.
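The control flow behind the single-pass and pipelined options can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: `Region`, `single_pass`, `pipelined`, and every `toy_*` helper are hypothetical names, and the "page" is a plain dictionary standing in for an image so the sketch runs without a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (x0, y0, x1, y1) in pixels; an assumed convention, not a standard.
BBox = Tuple[int, int, int, int]

# Toy stand-in for a page image: maps each region's box to (role, text).
Page = Dict[BBox, Tuple[str, str]]


@dataclass
class Region:
    bbox: BBox
    role: str  # e.g. "Heading", "Paragraph", "Table"


def single_pass(page: Page, ocr_page: Callable[[Page], str]) -> str:
    """One model call consumes the whole page and emits the full text."""
    return ocr_page(page)


def pipelined(page: Page, detect_layout, crop, ocr_region) -> str:
    """Pass 1 finds layout regions; pass 2 transcribes each one separately."""
    regions = detect_layout(page)
    # Each region is transcribed independently of the others.
    return "\n\n".join(ocr_region(crop(page, r.bbox)) for r in regions)


# --- Hypothetical stubs that replace real models for the demo ---

def toy_detect_layout(page: Page) -> List[Region]:
    # Order regions top to bottom by their y0 coordinate.
    ordered = sorted(page.items(), key=lambda kv: kv[0][1])
    return [Region(bbox, role) for bbox, (role, _text) in ordered]


def toy_crop(page: Page, bbox: BBox) -> str:
    # A real crop would slice pixels; here it just looks up the text.
    return page[bbox][1]


def toy_ocr_region(crop: str) -> str:
    return crop


def toy_ocr_page(page: Page) -> str:
    ordered = sorted(page.items(), key=lambda kv: kv[0][1])
    return "\n\n".join(text for _bbox, (_role, text) in ordered)


toy_page: Page = {
    (0, 30, 100, 80): ("Paragraph", "OCR models are multiplying."),
    (0, 0, 100, 20): ("Heading", "Introduction"),
}

print(single_pass(toy_page, toy_ocr_page))
print(pipelined(toy_page, toy_detect_layout, toy_crop, toy_ocr_region))
```

With these stubs both paths produce the same text. The structural difference is what matters: the pipelined version makes one call per region, and since those calls do not depend on each other they could in principle be batched, whereas the single-pass version sees the whole page in one call.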
A rough phylogeny of VLM-based OCR approaches
Comparing OCR Outputs
To make the differences between approaches concrete, the figures below run three open models against the same input pages: LightOnOCR 2 (1B, ungrounded markdown), Kosmos-2.5 (line-level blocks, no roles), and Chandra (block-level with role labels). For each model, the detected blocks are overlaid on the page, with the raw output alongside.
Three OCR models, three types of outputs
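The three output types can also be written down as data shapes. The sketch below is illustrative only: the class names, field names, and sample values are assumptions made for this example, not the actual output formats of LightOnOCR, Kosmos-2.5, or Chandra.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1); the coordinate convention is assumed for illustration.
BBox = Tuple[float, float, float, float]


@dataclass
class TextLine:
    """Grounded, blocks only: a box and its text, with no role."""
    bbox: BBox
    text: str


@dataclass
class LayoutBlock:
    """Grounded with layout: a box, its text, and a role label."""
    bbox: BBox
    role: str  # e.g. "Heading", "Paragraph", "Table"
    text: str


def to_plain_text(lines: List[TextLine]) -> str:
    """Drop the coordinates and keep the text in the order given."""
    return "\n".join(line.text for line in lines)


def to_markdown(blocks: List[LayoutBlock]) -> str:
    """Use role labels to render headings; other roles pass through as-is."""
    parts = []
    for block in blocks:
        if block.role == "Heading":
            parts.append(f"# {block.text}")
        else:
            parts.append(block.text)
    return "\n\n".join(parts)


# Ungrounded output is just a string of markdown, with no coordinates.
ungrounded = "# Introduction\n\nOCR models are multiplying."

# Line-level blocks: the paragraph is split wherever the line breaks fall.
lines = [
    TextLine((10, 10, 200, 30), "Introduction"),
    TextLine((10, 40, 200, 55), "OCR models are"),
    TextLine((10, 56, 200, 70), "multiplying."),
]

# Block-level output with roles: one entry per layout element.
blocks = [
    LayoutBlock((10, 10, 200, 30), "Heading", "Introduction"),
    LayoutBlock((10, 40, 200, 70), "Paragraph", "OCR models are multiplying."),
]

print(to_plain_text(lines))
print(to_markdown(blocks))
```

In this toy case, the role-labelled blocks can be rendered back into the same markdown an ungrounded model would emit, while the role-free lines keep the text and positions but lose the information that the first line is a heading.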
References
- [1] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2024). https://arxiv.org/abs/2409.12191
- [2] Qwen2.5-VL Technical Report (2025). https://arxiv.org/abs/2502.13923
- [3] LightOn AI. LightOnOCR: An Efficient Open OCR Model (2025). Hugging Face Blog. https://huggingface.co/blog/lightonai/lightonocr
- [4] olmOCR 2: Unit Test Rewards for Document OCR (2025). https://arxiv.org/abs/2510.19817