Paper Notes: LightOnOCR

last updated 2026-05-12

vlm-ocr

Notes: 400M-param NaViT vision encoder, 600M-param language decoder (Qwen3). They also show that two-stage training is roughly on par with (and maybe even a little worse than) single-stage training! They cite FineVision as another work that found something similar. Huge improvements from using the 72B Qwen rather than the 7B Qwen as the distillation teacher.

Another paper that shows that resolution matters!

Training data is distilled from Qwen2-VL-72B-Instruct, prompted to transcribe pages to Markdown with LaTeX formatting. The outputs are compared against legacy OCR systems to filter out hallucinations. The full dataset comes to 17.6M pages and 45.5B tokens.
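To make the hallucination filter concrete for myself, here is a minimal sketch of what "compare against legacy OCR" could look like. This is my own guess at the shape of the check, not the paper's procedure: the function names (`normalize`, `agreement`, `keep_page`), the use of a character-level similarity ratio, and the 0.8 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace Markdown/LaTeX punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def agreement(teacher_md: str, legacy_text: str) -> float:
    """Similarity in [0, 1] between the teacher transcription and legacy OCR text."""
    matcher = SequenceMatcher(
        None, normalize(teacher_md), normalize(legacy_text), autojunk=False
    )
    return matcher.ratio()


def keep_page(teacher_md: str, legacy_text: str, threshold: float = 0.8) -> bool:
    """Keep a page only if the teacher output broadly agrees with legacy OCR.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return agreement(teacher_md, legacy_text) >= threshold


if __name__ == "__main__":
    legacy = "Invoice Total: 42 USD"
    faithful = "# Invoice\n\nTotal: **42** USD"
    hallucinated = "The quick brown fox jumps over the lazy dog"
    print(keep_page(faithful, legacy))
    print(keep_page(hallucinated, legacy))
```

The idea is that the legacy OCR output is noisy but rarely invents text, so low agreement with it is a cheap signal that the teacher has drifted from the page. Normalizing away the markup first matters because the teacher emits Markdown and LaTeX while legacy OCR emits plain text.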