Joe Barrow VOL_03 / N17

computer_vision / briefing

The OCR Cambrian · part 1 of 2

The OCR Cambrian

For most of a decade, optical character recognition felt like a settled engineering problem. In the last twelve months, it has become the most crowded frontier in vision–language modelling.

By Your Name

Until very recently, the working list of OCR models worth knowing about fit comfortably on a postcard. TrOCR, DONUT, Pix2Struct, Nougat — a paper a year, give or take, mostly from a small set of labs revisiting the same encoder–decoder template. Then, somewhere around the start of 2025, the trickle became a flood.

The chart below records every named OCR-capable VLM release I have catalogued from September 2021 through November 2025. The shape of it is the story.

Fig. 01 / OCR releases

Models with explicit OCR or document-parsing capabilities, by release month

Each bar counts named model releases in a given month. Hover any column to see which models appeared.

Source / Author's catalogue · 2021-09 → 2025-11 · n = 29 releases
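The aggregation behind the chart is deliberately simple: each (model, release month) pair in the catalogue increments one monthly bucket. A minimal sketch, using an illustrative excerpt of entries rather than the full 29-release catalogue:

```python
from collections import Counter

# Illustrative excerpt of the catalogue: (model name, "YYYY-MM" release month).
# The real catalogue spans 2021-09 through 2025-11 with n = 29 entries.
releases = [
    ("TrOCR", "2021-09"),
    ("Donut", "2021-11"),
    ("Pix2Struct", "2022-10"),
    ("Nougat", "2023-08"),
]

# Each bar in Fig. 01 is the count of named releases in one month.
counts = Counter(month for _, month in releases)
for month in sorted(counts):
    print(month, counts[month])
```

Sorting lexically on "YYYY-MM" strings orders the months chronologically, which is why the catalogue keys are stored zero-padded.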

October 2025 alone saw six new releases — more than the cumulative total for 2021 through 2023. The acceleration cannot be explained by any single architectural breakthrough; it is better read as a downstream effect of strong general-purpose VLM bases becoming widely available, with OCR-specific finetunes and small pretrains following quickly behind. The follow-up post, Tokenizers and Layout, looks at one slice of why that base-model shift mattered.

The interesting question is no longer whether a VLM can read a page, but which lab’s reading you trust.

What I find most striking is not the volume but the diversity. The recent cohort spans labs across three continents, parameter counts from sub-billion to flagship-scale, and licensing from fully open to closed-API. Subsequent figures in this series will compare, side by side, how each model handles a standard test page, what its tokenizer does to layout information, and how its patch-level attention behaves on dense documents.