vlm ocr survey / introduction
Strong, open VLMs have enabled an explosion of open OCR model releases, with little sign of the pace letting up. In this survey, I detail the models, their evaluation, research trends, and open questions.
vlm ocr survey / core approaches
Butterfly-collecting more than 40 open model releases. Core approaches (single-pass vs. pipelined), task decisions (general VLM vs. OCR-specific), and output formats let us fingerprint each model.
vlm ocr survey / training
There is a convergent evolution in how the labs source their training data, which points to blind spots in the models. Unfortunately, most training datasets are not released. We collect the ones that are.
vlm ocr survey / benchmarking
Two benchmarks have become the reporting standard: OlmOCR bench and OmniDocBench. What is contained in these benchmarks, how do the various models stack up, and what other benchmarks have been used to date?
vlm ocr survey / efficiency
VLMs as OCR models lift accuracy, but are expensive to run at inference time. We examine different approaches labs have used to improve efficiency, including diffusion language modeling, two-stage pipelines, and architectural innovations.
vlm ocr survey / backbones
An OCR model is only as strong as its vision backbone. The choice of backbone, the input image resolution, and how the backbone passes information to the LLM all have a significant effect on downstream efficiency and quality.
vlm ocr survey / open questions
In which we explore open research questions around improving OCR, what the limitations of the data tell us, and where the space might be heading over the next few years.
Probing the supported output types of Gemini.
Navigating Gemini's API for object detection with vision and Structured Outputs.
Thoughts on averaged benchmarks and hidden correlations.
tinyhnsw / introduction
The first post in the TinyHNSW series, introducing the tutorial and the library.