Joe Barrow field_notes


Paper Notes: Dolphin OCR

last updated 2026-05-10

Dataset: 30MM examples, 23MM of which are formulas. Also 120k mixed docs, 4.3MM HTML pages (English + Chinese Wikipedia), 500k TeX docs, 710k Markdown docs, and 1.6MM tables. The 120k mixed docs are surprisingly high-quality.

All documents are annotated with element-level boundaries and their reading order, enabling training for both layout analysis and order prediction.
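To make the annotation structure concrete, here is a minimal sketch of what one element-level record might look like. The field names and schema are my guesses for illustration, not the paper's actual format.

```python
# Hypothetical element-level annotation: a bounding box, an element type,
# a reading-order index, and the ground-truth content for that element.
from dataclasses import dataclass

@dataclass
class Element:
    bbox: tuple   # (x0, y0, x1, y1) in page pixels
    kind: str     # e.g. "text", "formula", "table"
    order: int    # position in the page's reading order
    content: str  # transcription target (text / LaTeX / HTML)

page = [
    Element((40, 100, 560, 300), "formula", 1, r"E = mc^2"),
    Element((40, 50, 560, 90), "text", 0, "1. Introduction"),
]

# Sorting by the order field recovers the reading order,
# which is what an order-prediction objective would train against.
reading_order = [e.content for e in sorted(page, key=lambda e: e.order)]
```

With both boxes and order indices present, the same record can supervise layout analysis (predict `bbox`/`kind`) and order prediction (predict `order`).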

Benchmark: evaluated on the Fox dataset; the authors also released their own benchmark dataset (Dolphin).

Notes: inputs are 896x896px; they use shrink-then-pad (resize preserving aspect ratio, then pad to square) as opposed to stretch-to-fit.
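A quick sketch of the shrink-then-pad size math, assuming the longer side is scaled to the target and the remainder is padded (a real pipeline would do the resize with PIL or torchvision; this just computes the geometry):

```python
# Shrink-then-pad vs stretch-to-fit: scale so the longer side hits the
# target, keeping the aspect ratio, then pad the short side out to square.
def shrink_then_pad_size(w: int, h: int, target: int = 896):
    scale = target / max(w, h)            # one scale factor for both axes
    new_w, new_h = round(w * scale), round(h * scale)
    pad_right, pad_bottom = target - new_w, target - new_h
    return (new_w, new_h), (pad_right, pad_bottom)

size, pad = shrink_then_pad_size(1792, 1280)
# → resized to (896, 640), padded by (0, 256); text is not distorted
```

Stretch-to-fit would instead warp the 1792x1280 page to 896x896, changing glyph aspect ratios, which is presumably why the shrink-then-pad choice matters for OCR.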

They also distinguish an element-cropping vs. box-query approach: element-cropping removes all of the image around the predicted box, while box-query keeps the full image and just provides the bounding box as part of the prompt.
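The contrast can be sketched as two ways of building a parse request; the prompt wording and request format here are hypothetical stand-ins, not Dolphin's actual API:

```python
# Element-cropping: parse only the cropped pixels, discarding surrounding
# context. Box-query: keep all pixels and encode the box in the prompt.
def element_cropping(image, box):
    x0, y0, x1, y1 = box
    crop = [row[x0:x1] for row in image[y0:y1]]  # pixels outside box are gone
    return {"pixels": crop, "prompt": "Parse this element."}

def box_query(image, box):
    return {"pixels": image, "prompt": f"Parse the element at {box}."}

image = [[0] * 8 for _ in range(8)]  # toy 8x8 "image"
a = element_cropping(image, (2, 2, 6, 5))  # model sees a 4x3 region
b = box_query(image, (2, 2, 6, 5))         # model still sees all 8x8 pixels
```

The trade-off: cropping gives the decoder a clean, focused input but loses page context; box-query preserves context at the cost of making the model localize from coordinates.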

I also like the phrase "analyze-then-parse": layout analysis first, then parsing of the individual elements.