Paper Notes: TrOCR
last updated 2026-05-10
Data augmentations: random rotation (-10 to 10 degrees); Gaussian blurring; image dilation; image erosion; downscaling; underlining
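Sketch of four of the augmentations above (dilation, erosion, downscaling, underlining) in plain Python, to make the operations concrete. Rotation and Gaussian blurring are left out to keep it short. Assumptions not in these notes: the image is a grayscale nested list with 0 = ink and 255 = paper, the kernel is 3x3, the downscale factor is 2, and one augmentation (or none) is picked uniformly per sample. All function names are hypothetical, not from the TrOCR code.

```python
import random


def _neighborhood_filter(img, pick):
    """Apply pick (min or max) over each pixel's 3x3 neighborhood, clipped at borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def dilate(img):
    # Grayscale dilation is a local max: bright regions grow,
    # so dark-on-light strokes get thinner.
    return _neighborhood_filter(img, max)


def erode(img):
    # Grayscale erosion is a local min: dark regions grow,
    # so dark-on-light strokes get thicker.
    return _neighborhood_filter(img, min)


def downscale(img, factor=2):
    """Average non-overlapping factor x factor blocks; ragged edges are dropped."""
    h, w = len(img) // factor, len(img[0]) // factor
    area = factor * factor
    return [
        [
            sum(
                img[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) // area
            for x in range(w)
        ]
        for y in range(h)
    ]


def underline(img, ink=0):
    """Return a copy with a 1-pixel horizontal line drawn on the bottom row."""
    out = [row[:] for row in img]
    out[-1] = [ink] * len(out[-1])
    return out


AUGMENTATIONS = [dilate, erode, downscale, underline]


def augment(img, rng):
    """Pick one augmentation uniformly at random, or keep the image unchanged."""
    choice = rng.choice(AUGMENTATIONS + [None])  # None = keep original (assumed)
    return img if choice is None else choice(img)
```

Usage: `augment(img, random.Random(0))` returns either the original image or one transformed copy. In practice an image library would replace the nested-list loops; this version only shows what each operation does to pixel values.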
Dataset: 2 million PDF pages randomly sampled from available PDFs on the internet (born-digital only), totaling 684MM textlines
Handwriting Data: Synthetic via 5,427 handwritten fonts + text from Wikipedia; IIIT-HWS (17.9MM textlines); 53k receipt images OCR’d with commercial engines
Model sizes: 62MM params for small, 334MM params for base, 558MM params for large
Tasks: just textline recognition