Paper Notes: PaddleOCR-VL
last updated 2026-05-10
Dataset:
Notes: two-stage pipeline, like Dolphin, MinerU2.5, etc. The paper claims that end-to-end approaches rely on autoregressive decoding over very long sequences (true), which can cause memory and latency problems (true) and risks hallucinations or decoding errors (probably true).
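Back-of-the-envelope check on the memory/latency point. This is a rough sketch under simplifying assumptions: decode-time attention work at step t is proportional to the t tokens in context, KV-cache size is proportional to the longest sequence held at once, regions are decoded one after another, and image/prompt tokens are ignored. The token counts and the function names are made up for illustration and are not from the paper.

```python
def decode_attention_cost(n_tokens: int) -> int:
    """Total attended positions when generating n_tokens one at a time.

    Step t attends to t positions (itself plus the t - 1 earlier tokens),
    so the total is 1 + 2 + ... + n.
    """
    return n_tokens * (n_tokens + 1) // 2


def compare(page_tokens: int, n_regions: int) -> dict:
    """Compare one page-long decode with n_regions equal-length region decodes."""
    per_region = page_tokens // n_regions
    return {
        "end_to_end_cost": decode_attention_cost(page_tokens),
        "two_stage_cost": n_regions * decode_attention_cost(per_region),
        # Peak KV-cache length, assuming regions are decoded sequentially.
        "end_to_end_peak_kv": page_tokens,
        "two_stage_peak_kv": per_region,
    }


# Illustrative numbers only: an 8000-token page split into 20 regions.
stats = compare(page_tokens=8000, n_regions=20)
print(stats)
```

With these made-up numbers the per-region decode does roughly 20x less attention work and holds a 20x shorter cache, which is the shape of the argument; the real gap depends on the model, batching, and how uneven the regions are.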
The authors train an RT-DETR detector for document layout (PP-DocLayoutV2), then run content recognition on the detected regions with an encoder/decoder VLM.
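Minimal sketch of the two-stage control flow described above, to make the structure concrete. Everything here is hypothetical: `Region`, `parse_page`, `fake_detector`, and `fake_vlm` are illustrative names, not the PaddleOCR API, and the stand-in functions return canned values so the sketch runs without model weights. The reading-order field is an assumption about what stage 1 provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    bbox: BBox
    label: str  # e.g. "title", "text", "table"
    order: int  # reading-order index (assumed to come from stage 1)


def parse_page(
    page,
    detect_layout: Callable[[object], List[Region]],
    recognize: Callable[[object, Region], str],
) -> List[Tuple[str, str]]:
    """Stage 1: detect layout regions. Stage 2: decode each region on its own."""
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    # Each call to recognize() is a separate, short decode for one region.
    return [(r.label, recognize(page, r)) for r in regions]


# Stand-in for the layout detector (hypothetical, returns fixed boxes).
def fake_detector(page) -> List[Region]:
    return [
        Region((0, 120, 600, 400), "text", order=1),
        Region((0, 0, 600, 100), "title", order=0),
    ]


# Stand-in for the recognition VLM (hypothetical, returns a placeholder string).
def fake_vlm(page, region: Region) -> str:
    return f"<{region.label} content from {region.bbox}>"


result = parse_page("page.png", fake_detector, fake_vlm)
for label, content in result:
    print(label, content)
```

Passing the detector and recognizer in as callables keeps the two stages swappable, which matches how these pipelines pair a layout model with a separate recognition model.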