Joe Barrow VOL_03 / N18

computer_vision / briefing

The OCR Cambrian · part 2 of 2

Tokenizers and Layout

The second note in the OCR Cambrian series. How modern document VLMs encode position — and why it matters more than the parameter count.

By Joe Barrow

A page is not a sequence. The text on a printed form, a research paper, or an old newspaper carries information in two dimensions — left-to-right within a line, but also top-to-bottom across columns, with figures and captions drifting between them. Most language models flatten that geometry into a 1-D token stream and hope the prose itself encodes the spatial cues.

Most document VLMs do something more deliberate.

Why “tokenizer” is the wrong word here. What we’re really discussing is the visual patch encoder plus its position embedding scheme — the text tokenizer in these models is mostly inherited from the language backbone. I’ll keep saying tokenizer because the field does, but mentally substitute patch+position pipeline.

The simplest scheme is to crop the page into a fixed grid of patches and let the encoder learn 2-D position embeddings. Qwen2-VL pushed this in an interesting direction with M-RoPE, factorising rotary embeddings into temporal, height, and width components, so the same underlying mechanism that encodes word order in text also encodes patch geometry in images.

A minimal sketch of the idea:

import torch

def rotate(x, pos, base=10000):
    # Rotary embedding along one axis: pair adjacent channels and rotate
    # each pair by an angle proportional to the position index.
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[..., None].float() * freqs        # [B, T, d/2]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

def m_rope(x, t, h, w, dims=(128, 64, 64)):
    # x: [B, T, D] token features, D = sum(dims);
    # t, h, w: [B, T] per-token (time, row, col) indices.
    dt, dh, dw = dims
    x_t = rotate(x[..., :dt],      t)   # temporal channels
    x_h = rotate(x[..., dt:dt+dh], h)   # height (row) channels
    x_w = rotate(x[..., dt+dh:],   w)   # width (col) channels
    return torch.cat([x_t, x_h, x_w], dim=-1)

The decision to split the embedding channels rather than interleaving the rotations is what lets a sequence-only attention mechanism reason about page geometry without any architectural change downstream.
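To see how the (t, h, w) indices might be built, here is a toy construction in the spirit of Qwen2-VL's scheme (the offset bookkeeping is simplified relative to the paper): text tokens advance all three axes in lockstep, while the patches of an image share one temporal index and vary by row and column.

# Toy (t, h, w) index construction, simplified from the Qwen2-VL paper's
# description: text tokens advance every axis together; the patches of one
# image share a temporal index and vary over (row, col).
n_text, rows, cols = 4, 2, 3          # 4 text tokens, then a 2x3 patch grid
T = n_text + rows * cols

t = torch.cat([torch.arange(n_text), torch.full((rows * cols,), n_text)])
h = torch.cat([torch.arange(n_text), n_text + torch.arange(rows).repeat_interleave(cols)])
w = torch.cat([torch.arange(n_text), n_text + torch.arange(cols).repeat(rows)])

x = torch.randn(1, T, 256)
out = m_rope(x, t.unsqueeze(0), h.unsqueeze(0), w.unsqueeze(0))   # [1, 10, 256]

Because the whole image shares one temporal index, the time channels treat it as a single step in the sequence, and it is the height and width channels that carry the page geometry.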

The published M-RoPE allocations differ slightly between Qwen2-VL and Qwen2.5-VL — the 2.5 release rebalances toward the spatial axes for high-resolution document inputs.

Other recent work goes further. Granite Docling threads explicit layout tokens into the input — every block of text carries a structured tag identifying it as a heading, paragraph, table cell, or caption — letting the decoder treat the document as a tree rather than a stream. The cost is annotation: someone (or some system) had to label all that training data with structural roles.
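To make the idea concrete, here is a hypothetical tagged serialization. The tag names below are mine, purely illustrative; Granite Docling's real format (DocTags) has its own richer vocabulary.

# Hypothetical layout-tagged serialization. Tag names are illustrative,
# not Granite Docling's actual DocTags vocabulary.
blocks = [
    ("heading",    "3. Method"),
    ("paragraph",  "We factorise position embeddings into three axes."),
    ("table_cell", "92.4"),
    ("caption",    "Table 1: accuracy by layout supervision."),
]

def serialize(blocks):
    # Each block is wrapped in a role tag, so the decoder sees structure
    # as tokens instead of having to infer it from pixel geometry.
    return " ".join(f"<{role}>{text}</{role}>" for role, text in blocks)

print(serialize(blocks))
# <heading>3. Method</heading> <paragraph>We factorise ...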

The interesting axis is no longer scale, but what kind of layout information the model is allowed to see at training time.

The figure for this post — a comparison of how five recent models tokenize the same densely formatted research page — will land in the next note. For now, the pattern to watch is the gap between models that learn geometry implicitly from pixels and those that get explicit layout supervision: the latter are quietly winning on tables.

Related working notes: M-RoPE channel allocation across Qwen versions and Why VLMs Fail at Tables.