Joe Barrow field_notes


Paper Notes: MinerU2.5

last updated 2026-05-10

Dataset: “model-labeled data and public datasets.” Stage (1) data: 2.3MM samples for layout analysis, 2.4MM for text blocks, 1.1MM for formula blocks, 1.1MM for table blocks. The stage-(1) model is then used to sample hard examples for human annotation in stage (2). Stage (2) data: 43k layout analysis, 300k text blocks, 147k formula blocks, 140k table blocks.
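The stage-(1) → stage-(2) hard-example loop might look like this sketch: score unlabeled pages with the stage-(1) model and send only the lowest-confidence ones to annotators. The function name, the confidence interface, and the keep fraction are my assumptions for illustration, not from the paper.

```python
# Sketch of hard-example mining between the two dataset stages.
# ASSUMPTIONS: the paper doesn't describe the selection criterion;
# "lowest model confidence" and the 5% keep fraction are illustrative.

def mine_hard_examples(pages, confidence, keep_fraction=0.05):
    """Return the lowest-confidence pages for human annotation."""
    scored = sorted(pages, key=confidence)  # least confident first
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]
```

With 100 candidate pages and `keep_fraction=0.05`, the five pages the model is least sure about come back for labeling.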

Notes: two-stage model; 500MM-parameter decoder, 675MM-parameter NativeResViT image encoder. The first stage resizes the page to 1036x1036 and detects layout elements (a Kosmos-style coordinate-token scheme, but also predicting orientation and element type). The second stage crops the detected boxes from the original image and runs recognition on them at native resolution.
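The coordinate bookkeeping between the two stages reduces to mapping boxes into and out of the fixed 1036x1036 stage-one frame; the crop at original resolution is what preserves detail for recognition. A minimal sketch (helper names and the corner-tuple box format are my assumptions):

```python
# Sketch of the coordinate mapping between the 1036x1036 stage-1 frame and
# the original page. Box format (x0, y0, x1, y1) is an assumption.

STAGE1_SIZE = 1036  # fixed resize used for layout detection

def to_stage1_coords(box, orig_w, orig_h, target=STAGE1_SIZE):
    """Map a box from original-page pixels into the stage-1 frame."""
    sx, sy = target / orig_w, target / orig_h
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

def from_stage1_coords(box, orig_w, orig_h, target=STAGE1_SIZE):
    """Map a stage-1 box back to original-page pixels for native-res crops."""
    sx, sy = orig_w / target, orig_h / target
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

Stage two then crops `from_stage1_coords(...)` regions out of the untouched original image, so small text and formulas never pass through the lossy 1036x1036 downsample.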

Like GOT-2.0, they use a multi-stage training pipeline: (0) modality alignment, training only the 2-layer MLP in the patch merger on image captions (558k), then unfreezing the full model and finetuning on image captioning + OCR + VQA (665k); (1) layout and OCR pre-training (6.9MM samples, 2 epochs); (2) layout and OCR finetuning (630k samples, 3 epochs).
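The schedule above, written out as a config list for reference (field names are mine; sample/epoch counts come from the notes, and epochs not stated there are left as `None`):

```python
# The multi-stage training schedule as a plain config list.
# ASSUMPTIONS: dict field names are illustrative; epochs=None means
# "not stated in my notes", not "zero".

TRAINING_STAGES = [
    {"stage": "0a", "task": "modality alignment (image captions)",
     "trainable": "2-layer MLP in patch merger", "samples": 558_000, "epochs": None},
    {"stage": "0b", "task": "captioning + OCR + VQA finetune",
     "trainable": "full model", "samples": 665_000, "epochs": None},
    {"stage": "1", "task": "layout + OCR pre-training",
     "trainable": "full model", "samples": 6_900_000, "epochs": 2},
    {"stage": "2", "task": "layout + OCR finetuning",
     "trainable": "full model", "samples": 630_000, "epochs": 3},
]
```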

Separate prompts for layout detection, table recognition, formula recognition, and text recognition.
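Operationally, per-task prompts amount to a routing table keyed by task. A sketch (the prompt strings are placeholders I made up; the notes only say the prompts are separate):

```python
# Sketch of per-task prompt routing. Prompt strings are placeholders,
# not the actual prompts from the paper.

TASK_PROMPTS = {
    "layout": "<layout detection prompt>",
    "table": "<table recognition prompt>",
    "formula": "<formula recognition prompt>",
    "text": "<text recognition prompt>",
}

def build_request(task, image):
    """Pair a cropped region with its task-specific prompt."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return {"prompt": TASK_PROMPTS[task], "image": image}
```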

Data Augmentation: Spatial transformations were not applied to layout analysis samples.