
Field Notes

Paper Notes: Fox

last updated 2026-05-10

Datasets: CC-MAIN, arXiv, and e-books. Effectively the same mix as Vary (same authors; this is follow-on work).

BLIP559k + Laion-COCO + Region-Chat for figure+text interleaved data. They use GPT-3.5 to generate translations of the content inside boxes. 1.6MM natural images rendered as docs (rough sketch of what I think that looks like below).
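
To make the "rendered as docs" step concrete, here's a minimal sketch of my own guess at that pipeline: paste a natural image onto a blank page so it looks like a document figure. The page size, margins, and random placement are assumptions, not details from the paper.

```python
# Hypothetical "natural image -> document page" rendering.
# Page size, margins, and placement are my guesses, not the paper's pipeline.
import random
from pathlib import Path
from PIL import Image

PAGE_W, PAGE_H = 1654, 2339  # roughly A4 at 200 DPI

def render_as_doc(image_path: Path, out_path: Path) -> None:
    page = Image.new("RGB", (PAGE_W, PAGE_H), "white")
    img = Image.open(image_path).convert("RGB")
    img.thumbnail((PAGE_W - 200, PAGE_H // 2))            # shrink to leave margins
    x = random.randint(100, PAGE_W - img.width - 100)      # random placement on the page
    y = random.randint(100, PAGE_H - img.height - 100)
    page.paste(img, (x, y))
    page.save(out_path)

# render_as_doc(Path("photo.jpg"), Path("photo_as_doc.png"))
```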

4.6MM document image/text pairs, with lots of box/translation/etc. examples. 800k multipage examples (OCR + QA).

1MM conversational examples from Laion-COCO, Alpaca, Baize, and ShareGPT.

They use GPT-3.5 to write chat prompts for 10k samples from each of the above, plus LLaVA80k.
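
Roughly what that generation step could look like; the instruction wording and sampling below are my own, since the paper doesn't publish its prompt:

```python
# Sketch: rewrite a bare task description as a conversational prompt with GPT-3.5.
# The system instruction here is an assumption, not the paper's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_chat_prompt(task_description: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Rewrite this task as a natural user request to a document assistant."},
            {"role": "user", "content": task_description},
        ],
    )
    return resp.choices[0].message.content

# to_chat_prompt("OCR the text inside the box (0.12, 0.30, 0.55, 0.42) on page 3.")
```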

Notes: from (largely) the same authors as the Vary paper. The focus is on point- and box-based prompting (i.e. asking the LLM to translate, OCR, etc. a specific line, the contents of a box, a figure, and so on). They use the same architecture as Vary – two vision encoders feeding into an LLM – plus an additional prompt that can include points and boxes.
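
My mental model of the Vary/Fox-style setup, as a rough PyTorch sketch: two vision encoders whose patch features are projected and prepended to the text embeddings going into the LLM, with boxes/points serialized into the text prompt. Module names, dimensions, and the LLM interface are placeholders, not the paper's code.

```python
# Two-encoder vision-language model sketch (my reading of the architecture).
import torch
import torch.nn as nn

class TwoEncoderVLM(nn.Module):
    def __init__(self, clip_encoder, hires_encoder, llm,
                 d_clip=1024, d_hires=1024, d_llm=4096):
        super().__init__()
        self.clip_encoder = clip_encoder    # e.g. a CLIP-style ViT for natural images
        self.hires_encoder = hires_encoder  # e.g. a SAM-style ViT for dense document text
        self.proj_clip = nn.Linear(d_clip, d_llm)
        self.proj_hires = nn.Linear(d_hires, d_llm)
        self.llm = llm                      # placeholder: takes a sequence of embeddings

    def forward(self, image, text_embeds):
        # Each encoder returns patch features of shape (batch, n_patches, dim).
        v1 = self.proj_clip(self.clip_encoder(image))
        v2 = self.proj_hires(self.hires_encoder(image))
        # Points/boxes live in the *text* prompt (serialized coordinates), so the
        # visual side is just two token streams prepended to the text embeddings.
        inputs = torch.cat([v1, v2, text_embeds], dim=1)
        return self.llm(inputs)
```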