Paper Notes: PaddleOCR-VL

last updated 2026-05-10

Dataset:

Notes: two-stage pipeline, like Dolphin, MinerU2.5, etc. The authors claim that end-to-end approaches rely on very long autoregressive sequences (true), which can lead to memory and latency issues (true) and risk hallucinations/decoding issues (probably true).

They train an RT-DETR on document layout (PP-DocLayoutV2), and then do the content recognition on the detected regions with an encoder/decoder VLM.
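
Rough sketch of the two-stage control flow, to make the split concrete. This is not the paper's code: `Region`, `crop`, `parse_page`, `fake_detector`, and `fake_recognizer` are all made-up names, the detector and recognizer are stubs standing in for the layout model and the VLM, and the `order` field assumes the layout stage also gives a reading order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
Image = List[List[int]]           # toy grayscale image: rows of pixel values


@dataclass
class Region:
    box: Box
    label: str   # e.g. "text", "table", "formula"
    order: int   # reading-order index (assumed to come from the layout stage)


def crop(image: Image, box: Box) -> Image:
    """Cut the region out of the page so stage 2 only sees a small input."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def parse_page(
    image: Image,
    detect_layout: Callable[[Image], List[Region]],
    recognize: Callable[[Image, str], str],
) -> str:
    """Stage 1: detect regions. Stage 2: recognize each crop independently."""
    regions = sorted(detect_layout(image), key=lambda r: r.order)
    # Each call decodes a short sequence for one region, instead of one
    # very long sequence for the whole page.
    parts = [recognize(crop(image, r.box), r.label) for r in regions]
    return "\n\n".join(parts)


# --- Stubs standing in for the real models (hypothetical) ---
def fake_detector(image: Image) -> List[Region]:
    h, w = len(image), len(image[0])
    # Returned out of order on purpose; parse_page sorts by reading order.
    return [
        Region((0, h // 2, w, h), "table", 1),
        Region((0, 0, w, h // 2), "text", 0),
    ]


def fake_recognizer(region_image: Image, label: str) -> str:
    return f"[{label} {len(region_image[0])}x{len(region_image)}]"


page = [[0] * 8 for _ in range(6)]   # 8 px wide, 6 px tall dummy page
result = parse_page(page, fake_detector, fake_recognizer)
print(result)
```

The point of the structure: region crops are independent, so the recognition calls could be batched or parallelized, and a decoding failure in one region does not corrupt the rest of the page.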