Joe Barrow field_notes


Paper Notes: Pix2Struct

last updated 2026-05-10

Dataset: HTML screenshots at 1024x1024 px; the DOM is simplified to keep only the elements visible in the screenshot. Text rendered from BooksCorpus is used as a reading warmup; the screenshots themselves come from C4.
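The "simplify the DOM to visible elements" step can be sketched as a recursive filter. This is a toy illustration, not the paper's actual preprocessing code; the dict-based node representation (`tag`, `text`, `visible`, `children`) is a stand-in for a real DOM.

```python
def simplify_dom(node):
    """Keep only nodes marked visible, mimicking how the parsing targets
    drop DOM elements that don't appear in the rendered screenshot.
    `node` is a hypothetical dict: {tag, text, visible, children}."""
    if not node.get("visible", False):
        return None
    kept = [c for c in (simplify_dom(ch) for ch in node.get("children", [])) if c]
    return {"tag": node["tag"], "text": node.get("text", ""), "children": kept}

dom = {"tag": "div", "visible": True, "children": [
    {"tag": "script", "text": "...", "visible": False, "children": []},
    {"tag": "p", "text": "Hello", "visible": True, "children": []},
]}
simplified = simplify_dom(dom)  # keeps only the visible <p> child
```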

Model sizes: 282M params (Base) and 1.3B params (Large)

Tasks: screenshot parsing with OCR — predicting both the visible text in a partially masked screenshot and the masked text (recovered via the DOM), plus alt text for figures; the model is then finetuned for downstream tasks like VQA.
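A toy sketch of that pretraining objective: some spans are "masked" in the rendered image, but the target string (built from the DOM) keeps all the text, so the model has to read the visible text and predict the masked text. The `<tag> text` serialization and `make_parsing_example` helper are illustrative assumptions, not the paper's exact target format.

```python
import random

def make_parsing_example(nodes, mask_rate=0.3, seed=0):
    """Toy masked screenshot-parsing example.
    `nodes`: hypothetical flat list of (tag, text) spans from a simplified DOM.
    Returns (visible, target): `visible` is what the model could OCR from the
    masked screenshot; `target` always contains every span."""
    rng = random.Random(seed)
    masked_flags = [rng.random() < mask_rate for _ in nodes]
    target = " ".join(f"<{tag}> {text}" for tag, text in nodes)
    visible = " ".join(
        f"<{tag}> {text}" for (tag, text), m in zip(nodes, masked_flags) if not m
    )
    return visible, target

nodes = [("h1", "Pix2Struct"), ("p", "Screenshot parsing"), ("a", "Read more")]
visible, target = make_parsing_example(nodes)
# `target` contains all three spans; `visible` drops any masked ones
```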

Note: both DONUT and Pix2Struct show that these models are sensitive to input resolution, and Pix2Struct further shows sensitivity to aspect-ratio handling for OCR/VQA: variable-resolution > stretched > padded (perhaps because padding wastes resolution?)

In addition, both DONUT and Pix2Struct use reading/OCR-style pretraining to arrive at a strong document-specific base model, while the actual downstream tasks are OCR-free, e.g. OCR-free DocVQA.