Paper Notes: DONUT

last updated 2026-05-11

vlm-ocr

Dataset: IIT-CDIP, 11M scanned English document images (re-OCR’d with the CLOVA OCR API to get pseudo text labels)

Synthetic Data: generated with the Synthetic Document Generator (SynthDoG, an extension of SynthTIGER); 500k samples each for Chinese, Japanese, Korean, and English. Each sample combines four components: background (from ImageNet), document texture (paper photos), text (from Wikipedia), and layout (an algorithm that stacks grids)
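A rough sketch of how the four components might be combined, to make the "stacks grids" layout idea concrete. This is not the SynthDoG code: `stack_grid_layout`, `sample_document_spec`, the file names, the page size, and the row/column limits are all hypothetical, and the sketch only produces a sample description (no rendering).

```python
import random


def stack_grid_layout(width, height, rng, max_rows=4, max_cols=3):
    """Split a page into stacked rows, then split each row into columns.

    Returns a list of (x, y, w, h) text regions that tile the page.
    Assumed stand-in for the layout step; limits are arbitrary.
    """
    # Random cut points, so the rows stack to exactly `height`.
    n_rows = rng.randint(1, max_rows)
    row_cuts = sorted(rng.sample(range(1, height), n_rows - 1))
    row_edges = [0] + row_cuts + [height]

    boxes = []
    for top, bottom in zip(row_edges, row_edges[1:]):
        # Each row gets its own random number of columns.
        n_cols = rng.randint(1, max_cols)
        col_cuts = sorted(rng.sample(range(1, width), n_cols - 1))
        col_edges = [0] + col_cuts + [width]
        for left, right in zip(col_edges, col_edges[1:]):
            boxes.append((left, top, right - left, bottom - top))
    return boxes


def sample_document_spec(backgrounds, papers, sentences, rng, size=(960, 1280)):
    """Pick one background, one paper texture, and text for each layout region."""
    width, height = size
    regions = stack_grid_layout(width, height, rng)
    return {
        "background": rng.choice(backgrounds),  # e.g. an ImageNet image
        "paper": rng.choice(papers),            # e.g. a paper photo
        "regions": [
            {"box": box, "text": rng.choice(sentences)}  # e.g. Wikipedia text
            for box in regions
        ],
    }


# Usage with placeholder inputs (hypothetical file names and sentences).
rng = random.Random(0)
spec = sample_document_spec(
    backgrounds=["bg_0.jpg", "bg_1.jpg"],
    papers=["paper_0.jpg"],
    sentences=["Placeholder sentence one.", "Placeholder sentence two."],
    rng=rng,
)
print(len(spec["regions"]), "text regions")
```

Because the cut points partition the page, the regions never overlap and always cover it fully; a real generator would add margins, fonts, and image effects on top.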

Model size: 143M params

Tasks: pre-trained on text reading (OCR-style), then fine-tuned per task for VQA, classification, and information extraction