Joe Barrow

Field Notes

Why VLMs Fail at Tables

last updated 2026-05-20

Working note on what goes wrong when you point a vision-language model at a financial filing or a research paper full of tables.

The failure modes I keep seeing

  1. Cell drift. The model reads cells in roughly the right order but loses the row–column structure: data lands in the wrong column at row boundaries, especially when a row contains empty cells.
  2. Header collapse. Multi-row headers get flattened into a single row of text, losing hierarchy (“Q1” / “2024” become one cell).
  3. Phantom rows. The model invents rows where the table has visual rules but no actual data.
  4. Pagination amnesia. Tables that span pages lose their column alignment between pages.
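To make failure mode 1 concrete, here is a minimal sketch (table values invented for illustration) of how a sparse row causes drift when cells are serialized in reading order and then naively re-gridded:

```python
headers = ["item", "Q1", "Q2"]
rows = [
    ["Revenue", "", "1200"],  # Q1 blank; 1200 is really a Q2 figure
    ["Costs", "800", ""],     # Q2 blank
]

# A model reading cells as a left-to-right text stream drops the empties:
stream = [c for row in rows for c in row if c]
# stream is ["Revenue", "1200", "Costs", "800"]

# Naive re-gridding fills each row left to right, so the Q2 value "1200"
# drifts into the Q1 column:
regridded = [stream[0:2] + [""], stream[2:4] + [""]]
assert regridded[0][headers.index("Q1")] == "1200"  # wrong column: it was Q2
```

The text stream is locally plausible ("Revenue 1200" reads fine), which is exactly why the error survives a casual eyeball check.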

A toy diagram of cell drift in a sparse table:

[image: cell-grid.png]

The hypothesis

Most VLMs see tables as images of text and rely on the prose-style attention mechanism to recover structure. But a table is fundamentally 2-D — a row is a thing, not a horizontal slice of pixels.
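One way to see the 1-D vs 2-D mismatch with a toy calculation: in reading order, the distance between a cell and the cell directly below it grows with table width, even though the 2-D relationship ("same column, next row") never changes.

```python
# Reading-order position of a cell in an n_cols-wide table.
def reading_order_index(row, col, n_cols):
    return row * n_cols + col

# The token distance between vertically adjacent cells equals the table
# width, so a prose-style position encoding has to learn "same column"
# at an ever-larger, table-dependent offset.
for n_cols in (3, 10, 40):
    gap = reading_order_index(1, 0, n_cols) - reading_order_index(0, 0, n_cols)
    print(f"{n_cols} columns -> vertical neighbour is {gap} positions away")
```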

For a longer take on the position-encoding side of this, see Tokenizers and Layout (part 2 of The OCR Cambrian).

Quick experiment to run

Take The OCR Cambrian release list, render it as a table at three different aspect ratios (tall-and-narrow, wide-and-short, square), and measure the cell-extraction error rate on each. If error scales with aspect ratio rather than with text density, the problem is geometric.
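A minimal sketch of the geometry half of this experiment (the `layout` function and the canvas sizes are my own placeholders, not an existing API): the same cell grid, three canvas shapes, and only the cell bounding boxes change.

```python
# Lay out an n_rows x n_cols grid of cell bounding boxes on a canvas.
def layout(n_rows, n_cols, width, height):
    cw, ch = width / n_cols, height / n_rows
    return [
        (round(c * cw), round(r * ch), round(cw), round(ch))  # x, y, w, h
        for r in range(n_rows)
        for c in range(n_cols)
    ]

shapes = {"tall": (400, 1200), "wide": (1200, 400), "square": (700, 700)}
grids = {name: layout(12, 4, w, h) for name, (w, h) in shapes.items()}
```

Render the same text into each grid, run extraction, and compare error rates; if tall and wide diverge while text density is held constant, the failure is geometric rather than linguistic.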

# eval harness sketch: fraction of ground-truth cells the model gets wrong
def cell_error_rate(model, table_image, ground_truth):
    pred = model.extract_cells(table_image)  # list of cell strings in reading order
    wrong = sum(p != g for p, g in zip(pred, ground_truth))
    wrong += abs(len(pred) - len(ground_truth))  # missing or phantom cells are errors too
    return wrong / len(ground_truth)