Paper Notes: olmOCR 2
last updated 2026-05-10
Dataset: adds olmocr-synthmix, a set of synthetic pages with test cases used for unit-test rewards. Built by iteratively prompting a VLM to generate an HTML layout of a page:
- layout analysis (number of cols, presence of images/tables, headers/footers, etc.); generates guidance HTML
- prompt with “generate semantic html to match the original”
- render the HTML from the previous step; give the VLM the original image, the rendered image, and the HTML, and ask for a revised layout
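A minimal sketch of the three steps above, to make the control flow concrete. Everything here is hypothetical: the `vlm` and `render` callables, the prompt wording (apart from the quoted step-2 prompt), and the single refinement pass are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List

# Hypothetical interfaces: the VLM and the HTML renderer are injected as
# plain callables so the flow can be exercised without a real model.
VLM = Callable[[str, List[object]], str]   # (prompt, attachments) -> text
Renderer = Callable[[str], bytes]          # html -> rendered page image


def synthesize_page(page_image: bytes, vlm: VLM, render: Renderer) -> str:
    """Run one analyze -> generate -> render-and-revise pass for a page."""
    # Step 1: layout analysis (columns, images/tables, headers/footers).
    layout = vlm(
        "Describe the page layout: columns, images, tables, headers, footers.",
        [page_image],
    )
    # Step 2: semantic HTML that matches the original page.
    html = vlm(
        "Generate semantic HTML to match the original.\nLayout notes:\n" + layout,
        [page_image],
    )
    # Step 3: render the HTML, then ask for a revision given both images.
    rendered = render(html)
    revised = vlm(
        "Compare the original and rendered pages and return revised HTML.",
        [page_image, rendered, html],
    )
    return revised
```

Passing the model and renderer in as arguments keeps the sketch independent of any particular VLM client or browser-rendering library.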
Cost of $0.12/document page. olmocr-synthmix contains 2,186 pages with 30,381 test cases.
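Quick arithmetic on those figures, as a sanity check. The total assumes every page in the final set cost $0.12 and ignores any pages that were generated and then discarded, which these notes do not cover.

```python
pages = 2186
test_cases = 30381
cost_per_page_usd = 0.12

# Average number of test cases attached to each synthetic page.
tests_per_page = test_cases / pages

# Generation cost if each retained page cost $0.12 (assumption, see above).
total_cost_usd = pages * cost_per_page_usd

print(f"{tests_per_page:.1f} test cases per page")  # 13.9 test cases per page
print(f"${total_cost_usd:.2f} total")               # $262.32 total
```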
Update olmocr-mix-1025 (from olmocr-mix-0225): use GPT-4.1 to
Notes: RL from unit test rewards. Improvements come from:
- dynamic temperature scaling (start at 0.1 and raise to 0.2, etc., whenever the model fails to generate an EOS token)
- switching from JSON out to YAML out (wtf??)
- bigger images (1288px instead of 1024px) (resolution matters!)
- updating base model from Qwen2-VL to Qwen2.5-VL
- fix missing/blank pages
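The dynamic temperature item can be sketched as a retry loop. This is only an illustration: the `generate` callable is hypothetical, and the step size and retry cap are assumptions, since the notes give only the 0.1 starting point and the first increase to 0.2.

```python
from typing import Callable, Tuple

# Hypothetical generator: takes a sampling temperature and returns
# (text, finished), where finished is True if an EOS token was emitted.
Generate = Callable[[float], Tuple[str, bool]]


def generate_with_rising_temperature(
    generate: Generate,
    start: float = 0.1,     # starting temperature, from the notes
    step: float = 0.1,      # assumption: 0.1 -> 0.2 -> 0.3 ...
    max_attempts: int = 4,  # assumption: the retry cap is not in the notes
) -> Tuple[str, float, bool]:
    """Retry at increasing temperatures until the output ends with EOS."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    text, temperature, finished = "", start, False
    for attempt in range(max_attempts):
        # Round to avoid float drift such as 0.30000000000000004.
        temperature = round(start + attempt * step, 2)
        text, finished = generate(temperature)
        if finished:
            break
    return text, temperature, finished
```

The function returns the last attempt even when no attempt finishes, so the caller can decide whether to keep or drop a page that never produced EOS.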
Formula evaluations are rendering-based (just like the original olmOCR)
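For the "RL from unit test rewards" note, a rough sketch of how a per-page reward could be computed from test cases. It assumes the reward is simply the fraction of a page's tests that pass, which these notes do not confirm; the `UnitTest` type and the example checks are made up for illustration.

```python
from typing import Callable, Sequence

# Hypothetical test type: each unit test is a predicate over the model's
# text output for one page and returns True when the check passes.
UnitTest = Callable[[str], bool]


def unit_test_reward(output: str, tests: Sequence[UnitTest]) -> float:
    """Fraction of a page's unit tests that pass (assumed reward shape)."""
    if not tests:
        return 0.0  # assumption: a page without tests earns no reward
    passed = sum(1 for test in tests if test(output))
    return passed / len(tests)


# Illustrative checks only; the real test types are not listed in these notes.
tests = [
    lambda out: "Introduction" in out,      # expected text is present
    lambda out: "Page 3 of 10" not in out,  # footer text is absent
    # "Abstract" must exist and come before "Introduction" (reading order).
    lambda out: 0 <= out.find("Abstract") < out.find("Introduction"),
]

example_reward = unit_test_reward("Abstract\n...\nIntroduction\n...", tests)
```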