Joe Barrow field_notes

Paper Notes: TrOCR

last updated 2026-05-10

Data augmentations: random rotation (-10 to 10 degrees); Gaussian blurring; image dilation; image erosion; downscaling; underlining
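A rough sketch of how the augmentations listed above could be applied to a textline image. Everything here is an assumption for illustration rather than the paper's implementation: the function names, the nested-list grayscale image format, the 3x3 kernels, the "dark ink on light background" convention for dilation and erosion, and the choice to pick one transform (or none) uniformly at random per sample. A real pipeline would use an image library instead of pure Python.

```python
import math
import random

WHITE = 255  # assumed background value; ink is assumed to be dark (near 0)


def _get(img, y, x):
    """Pixel lookup that treats everything outside the image as background."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return WHITE


def _window(img, y, x):
    """3x3 neighbourhood around (y, x) in row-major order."""
    return [_get(img, y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _map(img, fn):
    """Build a new image of the same size by evaluating fn(y, x) per pixel."""
    return [[fn(y, x) for x in range(len(img[0]))] for y in range(len(img))]


def identity(img, rng):
    return img


def rotate(img, rng):
    """Rotate about the centre by a random angle in [-10, 10] degrees."""
    theta = math.radians(rng.uniform(-10, 10))
    cy, cx = (len(img) - 1) / 2, (len(img[0]) - 1) / 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def source(y, x):
        # Inverse mapping: find which source pixel lands on (y, x).
        sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
        sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
        return _get(img, round(sy), round(sx))

    return _map(img, source)


GAUSS_3X3 = (1, 2, 1, 2, 4, 2, 1, 2, 1)  # weights sum to 16


def blur(img, rng):
    """Gaussian blur with a fixed 3x3 kernel."""
    return _map(img, lambda y, x: sum(
        k * v for k, v in zip(GAUSS_3X3, _window(img, y, x))) // 16)


def dilate(img, rng):
    """Thicken dark strokes: take the darkest value in each 3x3 window."""
    return _map(img, lambda y, x: min(_window(img, y, x)))


def erode(img, rng):
    """Thin dark strokes: take the lightest value in each 3x3 window."""
    return _map(img, lambda y, x: max(_window(img, y, x)))


def downscale(img, rng):
    """Halve the resolution, then upsample back (nearest neighbour)."""
    return _map(img, lambda y, x: img[y - y % 2][x - x % 2])


def underline(img, rng):
    """Draw a dark line along the bottom row of a copy of the image."""
    out = [row[:] for row in img]
    out[-1] = [0] * len(img[0])
    return out


AUGMENTATIONS = (identity, rotate, blur, dilate, erode, downscale, underline)


def augment(img, rng=random):
    """Apply one transform chosen uniformly at random (assumed policy)."""
    return rng.choice(AUGMENTATIONS)(img, rng)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible; every transform returns an image of the same size as its input.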

Dataset: 2 million PDF pages randomly sampled from available PDFs on the internet (born digital only); totaling 684MM textlines

Handwriting Data: Synthetic via 5,427 handwritten fonts + text from Wikipedia; IIIT-HWS (17.9MM textlines); 53k receipt images OCR’d with commercial engines

Model sizes: 62MM params for small, 334MM params for base, 558MM params for large
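For a sense of scale, a back-of-the-envelope sketch of the weights-only footprint implied by those parameter counts. The bytes-per-parameter figures are the standard fp32/fp16 widths; the helper name is hypothetical, and the estimate ignores activations, optimizer state, and any framework overhead.

```python
# Parameter counts from the notes above, in millions.
PARAMS_MILLIONS = {"small": 62, "base": 334, "large": 558}
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_megabytes(params_millions, dtype):
    """Weights-only size in MB (10^6 bytes): millions of params x bytes each."""
    return params_millions * BYTES_PER_PARAM[dtype]


for name, n in PARAMS_MILLIONS.items():
    fp32 = weight_megabytes(n, "fp32")
    fp16 = weight_megabytes(n, "fp16")
    print(f"{name}: {fp32} MB fp32, {fp16} MB fp16")
# small: 248 MB fp32, 124 MB fp16
# base: 1336 MB fp32, 668 MB fp16
# large: 2232 MB fp32, 1116 MB fp16
```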

Tasks: just textline recognition