Paper Notes: OlmOCR 2

last updated 2026-05-27

Dataset: added olmocr-synthmix; synthetic tests for unit test rewards. Iteratively prompt a VLM to generate HTML layout of a page:

1. layout analysis (number of cols, presence of images/tables, headers/footers, etc.); generates guidance HTML
2. prompt with “generate semantic HTML to match the original”
3. render the HTML from (2), give original image, new image, and HTML and ask for a revised layout
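
Rough sketch of the three steps above as a loop, mostly to pin down the data flow for myself. Everything here is an assumption: `vlm` and `render` are hypothetical callables (the notes don't record which VLM or HTML renderer is used), the prompt wording other than the quoted one is made up, and `rounds` is my own knob for how many revision passes to run.

```python
from typing import Callable, List

# Hypothetical interfaces, not from the paper:
#   vlm(prompt, images) -> text reply
#   render(html)        -> rendered page image as bytes
VLM = Callable[[str, List[bytes]], str]
Renderer = Callable[[str], bytes]


def synthesize_page_html(page_image: bytes, vlm: VLM, render: Renderer,
                         rounds: int = 1) -> str:
    # Step 1: layout analysis, returned as guidance HTML.
    guidance = vlm(
        "Analyze the layout (number of columns, images/tables, "
        "headers/footers) and return guidance HTML.",
        [page_image],
    )

    # Step 2: generate semantic HTML, conditioned on the guidance.
    html = vlm(
        "Generate semantic HTML to match the original.\n"
        "Layout guidance:\n" + guidance,
        [page_image],
    )

    # Step 3: render the HTML, show original + render + HTML, ask for a revision.
    for _ in range(rounds):
        rendered = render(html)
        html = vlm(
            "Image 1 is the original page, image 2 is a render of the HTML "
            "below. Return a revised layout.\n" + html,
            [page_image, rendered],
        )
    return html
```

With `rounds=1` this makes three VLM calls per page, one per step.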

Cost of $0.12/document page. olmocr-synthmix contains 2,186 pages with 30,381 test cases.
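
Quick back-of-envelope on those numbers. Assumption: the $0.12 figure applies uniformly to each of the 2,186 kept pages; any pages generated and then discarded are not counted, so the true spend may be higher.

```python
pages = 2_186
test_cases = 30_381
cost_per_page_usd = 0.12  # from the notes; assumed uniform across pages

tests_per_page = test_cases / pages
total_cost_usd = pages * cost_per_page_usd

print(f"{tests_per_page:.1f} test cases per page on average")    # 13.9
print(f"about ${total_cost_usd:,.2f} for the kept pages")        # $262.32
```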

Update olmocr-mix-1025 (from olmocr-mix-0225): use GPT-4.1 to

Notes: RL from unit test rewards. Improvements come from:

dynamic temperature scaling (start at temperature 0.1 and raise it to 0.2, etc., whenever the model fails to generate an EOS token)
switching from JSON output to YAML output (wtf??)
bigger images (1288px instead of 1024px) (resolution matters!)
updating base model from Qwen2-VL to Qwen2.5-VL
fix missing/blank pages
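
Sketch of how I read the dynamic temperature item in the list above: retry the same page at a higher temperature whenever decoding hits the token limit without an EOS. Assumptions: `generate` is a hypothetical callable that returns the text, or `None` when no EOS was produced; the notes only give 0.1 and 0.2, so the rest of the ladder is my guess at what "etc." means.

```python
from typing import Callable, Optional, Sequence, Tuple


def generate_with_temperature_ladder(
    generate: Callable[[float], Optional[str]],
    temperatures: Sequence[float] = (0.1, 0.2, 0.3, 0.4),  # values past 0.2 are assumed
) -> Tuple[Optional[str], float]:
    """Try each temperature in order; stop at the first output that ended with EOS."""
    for temperature in temperatures:
        output = generate(temperature)
        if output is not None:  # model emitted EOS, accept this attempt
            return output, temperature
    # Every attempt ran out of tokens; report failure at the last temperature tried.
    return None, temperatures[-1]
```

The intuition, as far as I can tell: low temperature gives the most faithful transcription but is more prone to repetition loops, and a bit more randomness on retry helps break out of them.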

Formula evaluations are rendering-based (just like olmocr)