Paper Notes: OlmOCR
last updated 2026-05-10
Dataset: OlmOCR Mix – 260k pages sampled from 100k PDFs crawled from the internet, with OCR targets generated by GPT-4o. Filtered to English documents (using the Lingua package); up to 3 pages are sampled from each of the 100k PDFs. GPT-4o produces the OCR output given the page image plus the extracted text blocks and their locations (a prompting strategy the authors call DOCUMENT-ANCHORING).
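A rough sketch of what a document-anchoring prompt could look like: the page image is sent alongside "anchor text" built from raw PDF text blocks and their coordinates. The block format and wording here are my assumptions, not the paper's exact prompt.

```python
# Hypothetical document-anchoring prompt construction.
# The coordinate format and instruction text are illustrative assumptions.

def build_anchor_text(text_blocks):
    """text_blocks: list of (x, y, text) tuples extracted from the PDF."""
    return "\n".join(f"[{x:.1f}x{y:.1f}] {text}" for x, y, text in text_blocks)

def build_prompt(anchor_text):
    # The page image would be attached separately as a vision input.
    return (
        "Below is the image of one page of a PDF, together with raw text "
        "blocks and their positions extracted from the PDF. Return the "
        "natural reading-order text of the page.\n\n" + anchor_text
    )

blocks = [(72.0, 700.5, "Paper Notes: OlmOCR"), (72.0, 680.0, "Abstract")]
prompt = build_prompt(build_anchor_text(blocks))
```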
Evaluation Data: OlmOCR-Bench – 1400 PDFs with 7000 unit tests. A super interesting and simple evaluation approach (several binary judgements per page; accuracy is the fraction that pass) rather than the usual hodgepodge of edit-distance-based metrics.
Four unit-test types:
- Text presence: a 1-3 sentence span must appear in the output, verified with fuzzy matching.
- Reading order: a given pair of text segments must appear in the correct order.
- Table: the output must contain a table cell with a specific value that sits relationally above/below/left/right of another cell with a specified value.
- Formula: symbol bounding boxes are checked after the formula is rendered with KaTeX.
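A minimal sketch of the presence and reading-order checks, assuming fuzzy matching via longest-common-substring coverage (the threshold and matching strategy are my assumptions, not the benchmark's implementation):

```python
# Sketch of two unit-test types: fuzzy text presence and reading order.
# Threshold and matching strategy are illustrative assumptions.
from difflib import SequenceMatcher

def fuzzy_find(needle, haystack, threshold=0.9):
    """Return the start index of a fuzzy match of needle in haystack, or -1."""
    sm = SequenceMatcher(None, needle, haystack, autojunk=False)
    m = sm.find_longest_match(0, len(needle), 0, len(haystack))
    # Accept if the longest shared run covers enough of the needle.
    if m.size / max(len(needle), 1) >= threshold:
        return m.b
    return -1

def presence_test(segment, ocr_text):
    return fuzzy_find(segment, ocr_text) != -1

def order_test(first, second, ocr_text):
    i, j = fuzzy_find(first, ocr_text), fuzzy_find(second, ocr_text)
    return i != -1 and j != -1 and i < j
```

Each test returns a single boolean, so per-page and overall accuracy are just the fraction of tests that pass.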
Notes: A single training run takes 16 hours on an 8xH100 node (7B Qwen model), for 1.2 epochs. Each PDF page is rendered to a max of 1024px on the longest edge, preserving aspect ratio. A typical prompt is 1k tokens for the image and 1800 tokens for the anchor text.
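The rendering constraint above amounts to a simple downscale rule; a minimal sketch (helper name is mine):

```python
# Scale page dimensions so the longest edge is at most 1024 px,
# preserving the aspect ratio; never upscale smaller pages.
def target_size(width, height, max_edge=1024):
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return round(width * scale), round(height * scale)
```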