Joe Barrow field_notes

Paper Notes: OlmOCR 2

last updated 2026-05-10

Dataset: added olmocr-synthmix, a set of synthetic pages that supplies the test cases for unit test rewards. Pages are built by iteratively prompting a VLM to generate the HTML layout of a page:

Cost: $0.12 per document page. The resulting olmocr-synthmix contains 2,186 pages with 30,381 test cases.
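As a quick sanity check on those figures, the arithmetic below derives a few quantities that the notes do not state directly. It assumes the $0.12 rate applies uniformly to each of the 2,186 pages, which the notes do not confirm. The variable names are mine.

```python
# Figures copied from the notes above.
COST_PER_PAGE_USD = 0.12
NUM_PAGES = 2_186
NUM_TEST_CASES = 30_381

# Derived quantities (simple arithmetic, not reported in the notes).
# Assumption: every page in the dataset costs the same $0.12 to generate.
total_cost_usd = COST_PER_PAGE_USD * NUM_PAGES
tests_per_page = NUM_TEST_CASES / NUM_PAGES
cost_per_test_usd = total_cost_usd / NUM_TEST_CASES

print(f"total generation cost: ${total_cost_usd:.2f}")    # $262.32
print(f"average tests per page: {tests_per_page:.1f}")     # 13.9
print(f"cost per test case: ${cost_per_test_usd:.4f}")     # $0.0086
```

Under that assumption the whole dataset costs roughly $262, at about 14 test cases per page.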

Update olmocr-mix-1025 (from olmocr-mix-0225): use GPT-4.1 to

Notes: RL from unit test rewards. Improvements come from:

Formula evaluations are rendering-based (just like olmocr).
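To make "unit test rewards" concrete, here is a minimal sketch of how per-page test cases could be turned into a scalar reward. Everything in it is an assumption for illustration: the `PageTest` type, the two test kinds (text presence and reading order), and the fraction-passed scoring are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageTest:
    """One binary check on the OCR output of a page (hypothetical type)."""
    name: str
    check: Callable[[str], bool]


def text_present(snippet: str) -> Callable[[str], bool]:
    """Passes when the snippet appears verbatim in the output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> Callable[[str], bool]:
    """Passes when both snippets appear and `first` precedes `second`."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def unit_test_reward(output: str, tests: List[PageTest]) -> float:
    """Assumed scoring rule: fraction of tests that pass, 0.0 if none exist."""
    if not tests:
        return 0.0
    passed = sum(1 for t in tests if t.check(output))
    return passed / len(tests)


tests = [
    PageTest("title", text_present("Quarterly Report")),
    PageTest("figure", text_present("Revenue: 4.2M")),
    PageTest("order", appears_before("Introduction", "Conclusion")),
]

good = "Quarterly Report\nIntroduction\nRevenue: 4.2M\nConclusion"
bad = "Quarterly Report\nConclusion\nIntroduction"

print(unit_test_reward(good, tests))  # all three tests pass
print(unit_test_reward(bad, tests))   # only the title test passes
```

Because each test is a cheap binary predicate, a reward like this can be computed for every sampled completion during RL without a learned reward model.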