
Field Notes

Paper Notes: Vary

last updated 2026-05-10

Dataset: Two phases.

First “generation” phase data engine: PDFs from arXiv and CC-MAIN-2021-31-PDF-UNTRUNCATED, plus Chinese eBooks. PyMuPDF is used to extract all the text and render the PDFs: 1MM English and 1MM Chinese pages. Charts: 250k in Chinese and English using matplotlib, and 500k in Chinese and English using pyecharts, with the title, y-axis, x-axis, etc. randomly selected from an NLP corpus.
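
A minimal sketch of that chart-synthesis idea: render a chart whose title and axis labels are sampled from a text corpus, keeping the ground-truth data as the label. The corpus, chart type, and styling here are placeholders, not the authors' exact setup.

```python
import json
import random

import matplotlib.pyplot as plt

corpus = ["revenue", "latency", "users", "error rate", "throughput"]  # stand-in NLP corpus

def render_chart(path: str) -> dict:
    # Sample the title and axis labels from the corpus, and fake the data.
    title, xlab, ylab = random.sample(corpus, 3)
    categories = random.sample(corpus, 4)
    values = [round(random.uniform(1, 100), 1) for _ in categories]

    fig, ax = plt.subplots(figsize=(4, 3), dpi=150)
    ax.bar(categories, values)
    ax.set_title(title)
    ax.set_xlabel(xlab)
    ax.set_ylabel(ylab)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)

    # The structured chart content becomes the text target the model must emit.
    return {"title": title, "x": categories, "y": values}

print(json.dumps(render_chart("chart_000.png")))
```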

Second “scaling up” phase data engine: .tex files from arXiv, rendered onto single pages (500k English pages and 400k Chinese pages), plus 200k GPT-4-generated charts.
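
A hedged sketch of the page rendering for this phase: compile a standalone .tex file with pdflatex, then rasterize the single page with PyMuPDF, matching the phase-one tooling. The output size and file layout (tex file in the working directory) are assumptions.

```python
import subprocess

import fitz  # PyMuPDF

def render_tex_page(tex_path: str, out_png: str, size: int = 1024) -> None:
    # -interaction=nonstopmode keeps batch jobs from hanging on LaTeX errors;
    # pdflatex writes the PDF to the current working directory.
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", tex_path],
        check=True, capture_output=True,
    )
    pdf_path = tex_path.replace(".tex", ".pdf")
    with fitz.open(pdf_path) as doc:
        page = doc[0]  # single-page documents by construction
        zoom = size / max(page.rect.width, page.rect.height)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_png)

render_tex_page("sample_page.tex", "sample_page.png")
```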

DocVQA and ChartQA data are used for the conversation format.
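
For concreteness, a conversation-format sample might look like the following; this schema is an assumption in the common LLaVA style, not taken from the paper.

```python
# Hypothetical conversation-format sample built from a DocVQA-style QA pair.
sample = {
    "image": "docvqa_00042.png",  # hypothetical filename
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the invoice total?"},
        {"from": "gpt", "value": "$1,280.00"},
    ],
}
```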

Model sizes:

- vary-tiny: 136MM (ViTDet) + 125MM (OPT-125M) = 261MM
- vary-base: 136MM (ViTDet) + 250MM (CLIP-L) + 7B decoder ~= 7B

Notes: “Scaling up vision vocabulary”: the authors train a second, high-resolution (1024x1024) ViT on document images and use its outputs to augment the frozen CLIP embeddings (computed from a 224x224 image). Training happens in two stages: first with OPT-125M as the decoder, then scaling up to a 7B-parameter decoder.
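
A sketch of that “new + old vocabulary” fusion, assuming both encoders emit 256 tokens of width 1024 that are concatenated along the sequence axis before the language decoder; the module names are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class VaryStyleVision(nn.Module):
    """Frozen CLIP 'old' vocabulary plus trainable high-res 'new' vocabulary."""

    def __init__(self, clip_encoder: nn.Module, new_encoder: nn.Module):
        super().__init__()
        self.clip_encoder = clip_encoder  # frozen, fed the 224x224 view
        self.new_encoder = new_encoder    # trainable, fed the 1024x1024 view
        for p in self.clip_encoder.parameters():
            p.requires_grad = False

    def forward(self, img_hi: torch.Tensor, img_lo: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            old_tokens = self.clip_encoder(img_lo)  # (B, 256, 1024)
        new_tokens = self.new_encoder(img_hi)       # (B, 256, 1024)
        # Concatenate along the token dimension: (B, 512, 1024) goes to the LLM.
        return torch.cat([new_tokens, old_tokens], dim=1)
```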

They do a token resampling step using a small convolutional network (really just to align the ViTDet 64x64x256 outputs to the CLIP 256x1024 outputs): 64x64x256 -> 32x32x512 -> 16x16x1024.
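
A literal reading of that resampling step: two stride-2 convolutions take the ViTDet feature map from 64x64x256 down to 16x16x1024, which flattens to the same 256-token, 1024-dim shape as the CLIP output. The kernel sizes are assumptions; the note only pins down the channel and spatial dims.

```python
import torch
import torch.nn as nn

# Two stride-2 convs halve the spatial grid twice while doubling channels.
resampler = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
)

x = torch.randn(1, 256, 64, 64)  # ViTDet output, channels-first
tokens = resampler(x).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 256, 1024])
```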