Paper Notes: DONUT
last updated 2026-05-10
Dataset: IIT-CDIP, 11MM scanned English document images (then re-OCR’d with the CLOVA OCR API)
Synthetic Data: generated with the Synthetic Document Generator (SynthDoG; an extension of SynthTIGER); 500k samples per language for Chinese, Japanese, Korean, and English. Each sample is built by choosing four components: background (from ImageNet); document texture (paper photos); text (from Wikipedia); layout (algorithm that stacks grids)
Model size: 143MM params
Tasks: jointly trained model for VQA, classification, and OCR
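Sketch for the Synthetic Data note: a rough illustration of how one sample recipe could be assembled from the four components (background, paper, text, layout). This is not SynthDoG's actual code. The names `Box`, `stack_grid_layout`, and `make_sample`, the page size, and the row/column limits are all assumptions made up for these notes, and the "stack grids" rule is guessed as "split the page into rows, then split each row into columns".

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int


def stack_grid_layout(width, height, rng, max_rows=4, max_cols=3):
    """Hypothetical layout rule: cut the page into rows, then each row into columns."""
    n_rows = rng.randint(1, max_rows)
    row_h = height // n_rows
    boxes = []
    for r in range(n_rows):
        # Each row gets its own random number of equal-width columns.
        n_cols = rng.randint(1, max_cols)
        col_w = width // n_cols
        for c in range(n_cols):
            boxes.append(Box(c * col_w, r * row_h, col_w, row_h))
    return boxes


def make_sample(backgrounds, papers, sentences, rng, width=960, height=1280):
    """Pick one asset per component and assign one text snippet per layout box."""
    boxes = stack_grid_layout(width, height, rng)
    return {
        "background": rng.choice(backgrounds),  # e.g. an ImageNet image path
        "paper": rng.choice(papers),            # e.g. a paper-photo texture path
        "regions": [(box, rng.choice(sentences)) for box in boxes],
    }
```

Usage: `make_sample(["bg.jpg"], ["paper.jpg"], ["some Wikipedia sentence"], random.Random(0))` returns a dict describing one sample; rendering the text into the boxes and compositing the images is left out.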