Why VLMs Fail at Tables
last updated 2026-05-20
Working note on what goes wrong when you point a vision-language model at a financial filing or a research paper full of tables.
The failure modes I keep seeing
- Cell drift. The model reads cells in roughly the right order but loses the row–column structure. Data ends up in the wrong column at row boundaries, especially for sparse cells.
- Header collapse. Multi-row headers get flattened into a single row of text, losing hierarchy (“Q1” / “2024” become one cell).
- Phantom rows. The model invents rows where the table has visual rules but no actual data.
- Pagination amnesia. Tables that span pages lose their column alignment between pages.
A toy diagram of cell drift in a sparse table — row B's empty Q1 cell pulls the Q2 value one column to the left:

    ground truth               model output (drifted)
    item | Q1 | Q2             item | Q1 | Q2
    A    | 10 |                A    | 10 |
    B    |    | 20             B    | 20 |

The hypothesis
Most VLMs see tables as images of text and rely on the prose-style attention mechanism to recover structure. But a table is fundamentally 2-D — a row is a thing, not a horizontal slice of pixels.
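One way to make the mismatch concrete: in raster (reading) order, the cell directly below a given cell is a full row-width of tokens away, so "same column" is never a local relationship. A minimal sketch, with hypothetical names (`raster_position` is not any real model's tokenizer):

```python
# Illustrative only: compare the 1-D reading-order distance between two
# vertically adjacent cells with their 2-D grid offset.

def raster_position(row, col, n_cols):
    # the 1-D token index a prose-style model effectively sees
    return row * n_cols + col

n_cols = 12  # a wide financial table
a = raster_position(3, 5, n_cols)  # cell (3, 5)
b = raster_position(4, 5, n_cols)  # cell directly below it

print(b - a)           # 12: one row apart in 2-D, twelve cells apart in 1-D
print((4 - 3, 5 - 5))  # (1, 0): the relationship the model actually needs
```

The wider the table, the farther apart vertically adjacent cells sit in reading order, which is consistent with drift getting worse on wide layouts.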
For a longer take on the position-encoding side of this, see Tokenizers and Layout (part 2 of The OCR Cambrian).
Quick experiment to run
Take The OCR Cambrian release list, render it as a table at three different aspect ratios (tall-and-narrow, wide-and-short, square), and measure cell-extraction error rate on each. If error scales with aspect ratio rather than with text density, the problem is geometric.
# placeholder for the eval harness
def cell_error_rate(model, table_image, ground_truth):
    # fraction of ground-truth cells the model gets wrong;
    # predictions the model never emits count as errors
    pred = model.extract_cells(table_image)
    wrong = sum(p != g for p, g in zip(pred, ground_truth))
    wrong += max(0, len(ground_truth) - len(pred))
    return wrong / len(ground_truth)
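A driving loop for the sweep might look like the sketch below. `render_fn` and `model` are hypothetical interfaces you would wire up yourself (`render_fn(rows, width, height) -> image`, `model.extract_cells(image) -> list of cell strings`); the error metric is inlined so the sketch is self-contained.

```python
# Sketch of the aspect-ratio experiment; all interfaces here are assumed,
# not from any particular library.

ASPECT_RATIOS = {"tall": (400, 1200), "wide": (1200, 400), "square": (800, 800)}

def cell_errors(pred, ground_truth):
    # fraction of ground-truth cells the prediction gets wrong;
    # missing predictions count as errors
    wrong = sum(p != g for p, g in zip(pred, ground_truth))
    wrong += max(0, len(ground_truth) - len(pred))
    return wrong / len(ground_truth)

def run_sweep(model, render_fn, rows, ground_truth):
    # render the same rows at each aspect ratio, score each rendering
    results = {}
    for name, (width, height) in ASPECT_RATIOS.items():
        image = render_fn(rows, width=width, height=height)
        results[name] = cell_errors(model.extract_cells(image), ground_truth)
    return results
```

If the "tall" and "wide" error rates diverge while "square" stays low, that is the geometric signature the hypothesis predicts; flat error across all three points back at text density.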