Working notes — things I'm reading, thinking about, and trying to figure out. Less polished than the long-form posts, sometimes revised in place.
Looking at a 2010 paper with fresh eyes: what does search look like on Mars?
What happens when you pair a huge vision encoder (600M params) with a tiny text decoder (250M params)? Let's find out!
1000 difficult-to-OCR pages, used as a canonical torture test.
RL techniques for training a surprisingly powerful small prover.
A 650M param OCR model that's ~on par with LightOnOCR-2, and outputs boxes as well.
An RLVR approach for training OCR.
A 1400-PDF benchmark that uses unit test rewards to compute accuracy.
A performant 1B parameter OCR model, built on Hunyuan Large 0.5B.
A 1.3B VLM trained on over 350M pages to output text block coordinates and their text.
A training-free speedup for document parsing, achieved by finding a good speculator.
A 3B OCR model that also derenders charts, tables, and other graphical elements.
A lightweight speculative decoding method from NVIDIA that decodes multiple drafts in parallel.
A diffusion OCR model.
Adding a cache directory for vLLM docker can reduce start times to ~11s.
A 1B VLM, stitching together Qwen3-0.6B and SigLIP-400M.
A benchmark for the current frontier of retrievers: possible to verify with reasoning models, difficult to retrieve.
Can you perform tasks over documents purely using the document image?
In which DeepSeek argues that document images can be more dense, lossless input representations.
A tiny, two-stage, Donut-based OCR model.
TODO
Evolving rubrics for training a small, powerful deep research model.
TODO
TODO
TODO
An update to LightOnOCR, plus a bbox model for figures.
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
A finetune of Qwen2.5-7B for OCR from Reducto.
TODO
TODO
TODO
A truly open OCR model (dataset, model, code) based on Qwen2-VL-7B.