Paper Notes: MonkeyOCR

last updated 2026-05-10

Dataset: MonkeyDoc – 3.9M block-level instances in both Chinese and English, with annotations for layout detection, reading order, text recognition, table recognition, formula recognition, and code-block recognition. Aggregates data from M6Doc, DocLayNet, D4LA, and CDLA.

Notes: multi-stage pipeline. A YOLO model handles layout detection, followed by 2 further stages: a Qwen model that recognizes the content within each detected block, and a separate model that outputs the reading order of the blocks.
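
A minimal sketch of how the three stages described above could be wired together. This is not MonkeyOCR's actual code or API: `Block`, `run_pipeline`, and the three `fake_*` functions are hypothetical names, and the stand-in stages are trivial placeholders (the reading-order stand-in just sorts top to bottom) so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Block:
    bbox: BBox
    kind: str          # e.g. "title", "text", "table", "formula"
    content: str = ""  # filled in by the recognition stage


def run_pipeline(
    page: object,
    detect_layout: Callable[[object], List[Block]],
    recognize_block: Callable[[object, Block], str],
    predict_order: Callable[[List[Block]], List[int]],
) -> List[Block]:
    """Layout detection -> per-block recognition -> reading order."""
    blocks = detect_layout(page)                   # stage 1: find blocks
    for block in blocks:
        block.content = recognize_block(page, block)  # stage 2: read each block
    order = predict_order(blocks)                  # stage 3: indices in reading order
    return [blocks[i] for i in order]


# Hypothetical stand-ins for the real models (assumptions, not the paper's models).
def fake_detect(page: object) -> List[Block]:
    return [
        Block((0, 200, 100, 300), "text"),
        Block((0, 0, 100, 50), "title"),
        Block((0, 60, 100, 180), "table"),
    ]


def fake_recognize(page: object, block: Block) -> str:
    return f"<{block.kind}>"


def fake_order(blocks: List[Block]) -> List[int]:
    # Placeholder heuristic: sort by top edge, then left edge.
    return sorted(range(len(blocks)), key=lambda i: (blocks[i].bbox[1], blocks[i].bbox[0]))


result = run_pipeline(None, fake_detect, fake_recognize, fake_order)
print([b.kind for b in result])  # ['title', 'table', 'text']
```

Passing the stages in as callables keeps each one swappable, which matches the note that detection, recognition, and reading order are handled by separate models.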