Paper Notes: Nougat
last updated 2026-05-10
Dataset: focused on scientific documents and textbooks. Pages are rendered at 96 DPI and fed to the Swin Transformer encoder at (896, 672) resolution (height, width), an aspect ratio between US Letter and A4; pages are resized and padded to fit. Ground truth: TeX source is converted to HTML via LaTeXML, then to Markdown. 1,748,201 arXiv articles, plus PMC (PubMed Central) and IDL (Industry Documents Library) articles; IDL has OCR text only.
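A minimal sketch of the resize+pad arithmetic, to check the aspect-ratio claim above. The function name `fit_and_pad` and the centered padding are assumptions made for illustration; these notes do not record how the paper distributes the padding.

```python
def fit_and_pad(width, height, target_h=896, target_w=672):
    """Return (new_w, new_h, (left, top, right, bottom)) for resize + pad.

    Scales the page to fit inside (target_h, target_w) without changing its
    aspect ratio, then pads the remainder. Centered padding is an assumption.
    """
    # Compare aspect ratios with integer math to find the binding dimension.
    if width * target_h >= height * target_w:
        # Page is relatively wider than the target: width is the limit.
        new_w = target_w
        new_h = round(height * target_w / width)
    else:
        # Page is relatively taller than the target: height is the limit.
        new_h = target_h
        new_w = round(width * target_h / height)
    pad_w = target_w - new_w
    pad_h = target_h - new_h
    left, top = pad_w // 2, pad_h // 2
    return new_w, new_h, (left, top, pad_w - left, pad_h - top)


# Page sizes in pixels at 96 DPI.
us_letter = (816, 1056)  # 8.5 in x 11 in
a4 = (794, 1123)         # 210 mm x 297 mm, rounded to whole pixels

letter_fit = fit_and_pad(*us_letter)
a4_fit = fit_and_pad(*a4)
print(letter_fit)
print(a4_fit)
```

The target ratio 896/672 ≈ 1.33 sits between US Letter (≈ 1.29) and A4 (≈ 1.41), so US Letter pages end up padded vertically and A4 pages horizontally, and neither needs much padding.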
Dataset size: 8.2M pages total: 7.5M from arXiv, 536k from PMC, and 447k from IDL
Augmentations: Bitmap, Erosion, Dilation, Affine transformation, Shift scale rotate, Grid distortion, Elastic transform, Random brightness contrast, Image compression, Gaussian noise, Gaussian blur; all transformations performed via albumentations
Model sizes: 350M parameters
Tasks: image to markdown
Note: trained for 3 epochs, so roughly 24 million pages seen in total (3 × 8.2M ≈ 24.6M)