Joe Barrow field_notes

Paper Notes: LocateAnything

last updated 2026-06-08

N.B. Tagging this as vlm-ocr because it’s a VLM that can do OCR, but it’s not really an OCR model.

The idea: use a VLM to localize the referents named in a prompt, emitting a bounding box for each one. For instance:

Input:

Select the crop tool

Model output:

<ref>crop tool</ref><box><130><647><190><707></box>
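To make the output format concrete, here is a minimal parsing sketch. It assumes the four numbers are ordered x1, y1, x2, y2 (upper-left then lower-right) and that each <ref> is followed by exactly one <box>; the function name `parse_grounding` is hypothetical, not something from the paper.

```python
import re

# One "<ref>label</ref><box><x1><y1><x2><y2></box>" group.
# Assumption: coordinates are ordered x1, y1, x2, y2.
GROUNDING_RE = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
)

def parse_grounding(text):
    """Return a list of (label, (x1, y1, x2, y2)) tuples found in `text`."""
    results = []
    for match in GROUNDING_RE.finditer(text):
        label = match.group("label")
        # The named group is group 1, so the coordinates are groups 2..5.
        box = tuple(int(match.group(i)) for i in range(2, 6))
        results.append((label, box))
    return results

output = "<ref>crop tool</ref><box><130><647><190><707></box>"
print(parse_grounding(output))  # [('crop tool', (130, 647, 190, 707))]
```

This sketch does not handle several boxes grouped under one label; that would need a second pass over the text following each </ref>.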

The model is trained on 138 million object detection samples. They hit ~13-16 boxes per second on a single H100 using Qwen2.5-VL-3B as the backbone model.

Parallel Box Decoding

Perhaps the most important contribution of this paper is “parallel box decoding,” where you generate an entire box in a single step.

With coordinate-token approaches (e.g. Kosmos-2.5, where every coordinate such as <130> is its own vocabulary token, alongside special tokens like <box> and <ref>), you already get a pretty compact representation for the object:

[1:<ref>][2:crop][3: tool][4:</ref>][5:<box>][6:<130>]...

In total, you would get 10 tokens for this generation, meaning 10 forward passes.

Instead, we can generate the entire box in parallel, meaning we can reduce this to two steps:

[1:<ref>crop tool</ref>][2:<box><130><647><190><707></box>]

So the first step determines the box referent and the second step determines the box coordinates.
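The step counts above can be checked with a bit of arithmetic. This is a rough sketch that assumes the label fits in a single parallel step and ignores the end-of-generation token, as the example above does; the function names are made up for illustration.

```python
BOX_TOKENS = 6  # <box>, four coordinate tokens, </box>

def autoregressive_steps(label_tokens, num_boxes=1):
    """One forward pass per token: <ref>, the label, </ref>, then every box token."""
    return (label_tokens + 2) + num_boxes * BOX_TOKENS

def parallel_steps(num_boxes=1):
    """One pass for the referent plus one pass per box.

    Assumption: the whole label fits in a single parallel step.
    """
    return 1 + num_boxes

# "crop tool" is two label tokens in the example above.
print(autoregressive_steps(label_tokens=2))  # 10
print(parallel_steps())                      # 2
```

The gap widens with more boxes per label: five boxes under the same two-token label would be 34 passes token by token versus 6 in parallel.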

They enable this by converting all of their data into 6-token blocks during training. This looks like:

[1: <ref>    crop   tool   </ref> [null] [null]]
[2: <box>    <130>  <647>  <190>  <707>  </box>]
[3: <im_end> [null] [null] [null] [null] [null]]

Note the inserted [null] tokens, which ensure that the multi-token prediction (MTP) head predicts the same number of tokens at every timestep. That is how they chunk variable-length sequences into fixed-length blocks.
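The padding scheme can be sketched in a few lines. This assumes labels are already tokenized and that anything longer than one block is simply cut every 6 tokens; how the paper actually splits long labels is not covered in these notes, and `chunk` and `to_blocks` are hypothetical helper names.

```python
BLOCK_SIZE = 6
NULL = "[null]"

def chunk(tokens, block_size=BLOCK_SIZE):
    """Split `tokens` into fixed-size blocks, padding the last one with [null]."""
    blocks = []
    for start in range(0, len(tokens), block_size):
        piece = tokens[start:start + block_size]
        blocks.append(piece + [NULL] * (block_size - len(piece)))
    return blocks

def to_blocks(label_tokens, box):
    """Build the semantic, box, and end blocks for one labelled box."""
    semantic = ["<ref>", *label_tokens, "</ref>"]
    box_tokens = ["<box>", *[f"<{c}>" for c in box], "</box>"]
    return chunk(semantic) + chunk(box_tokens) + chunk(["<im_end>"])

for block in to_blocks(["crop", "tool"], (130, 647, 190, 707)):
    print(block)
```

For the running example this yields the three 6-token rows shown above, so every training target has the same width regardless of label length.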

Hybrid Mode

Super interesting: the parallel decoding step can be unreliable (e.g., perhaps when your label is more than 4 tokens, so it no longer fits in one block alongside <ref> and </ref>, and the model screws up). The MTP head is then forced to predict cross-box tokens, which it’s not trained to do. To account for this, they build a “hybrid mode” as a sort of pressure relief valve, which lets the model fall back to slow, fully autoregressive decoding for as little as a single block.
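A rough sketch of what such a fallback loop might look like is below. Everything here is an assumption for illustration: these notes do not say how the model decides a block is unreliable, so the check is passed in as a caller-supplied `accept` predicate, and `predict_block` / `predict_token` are hypothetical stand-ins for the MTP head and the ordinary next-token head.

```python
BLOCK_SIZE = 6
NULL = "[null]"

def hybrid_decode(predict_block, predict_token, accept, max_blocks=16):
    """Decode block by block, falling back to token-by-token when needed.

    predict_block(prefix) -> list of BLOCK_SIZE tokens (parallel step)
    predict_token(prefix) -> one token (autoregressive step)
    accept(block)         -> True if the parallel block looks usable
    All three callables are hypothetical; the acceptance rule is an assumption.
    """
    output = []
    for _ in range(max_blocks):
        block = predict_block(output)
        if not accept(block):
            # Fall back for this block only: regenerate it one token at a time.
            block = []
            for _ in range(BLOCK_SIZE):
                block.append(predict_token(output + block))
        output.extend(block)
        if "<im_end>" in block:
            break
    return output
```

The point of the design is that the slow path is scoped to a single block, so one bad parallel prediction costs six extra forward passes rather than forcing the whole response back to token-by-token decoding.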

Model and Training Details

The core model is based on Qwen2.5 with Moon-ViT. The multitoken blocks have bidirectional attention.
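One way to picture bidirectional attention inside blocks is as a block-causal mask. The sketch below is an interpretation, not the paper's stated implementation: it assumes the prompt and image prefix stays causal, and that each position in an output block can see its whole block plus everything before it. The function name is hypothetical.

```python
def block_causal_mask(num_prefix, num_blocks, block_size=6):
    """Build mask[i][j] == True when position i may attend to position j.

    Assumption: prefix positions are causal; positions inside an output
    block attend to the full block and to everything that precedes it.
    """
    # Each prefix position is its own group; all positions in a block share one.
    groups = list(range(num_prefix))
    for b in range(num_blocks):
        groups.extend([num_prefix + b] * block_size)
    # A position may attend to any position whose group is not later than its own.
    return [[gj <= gi for gj in groups] for gi in groups]

mask = block_causal_mask(num_prefix=2, num_blocks=2, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the prefix rows form the usual lower triangle, while each block shows up as a filled square on the diagonal.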

Block Representation

  1. Normalize the box coordinates to the range [0, 1000]
  2. Boxes are represented by upper-left and lower-right coordinates
  3. Group boxes under the same label (only predict the <ref> block once)
  4. Convert to a sequence of blocks
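The four steps above can be sketched as follows. This assumes pixel-space input boxes in (x1, y1, x2, y2) order and plain rounding to the nearest integer; the rounding rule and the helper names are guesses rather than details from the paper.

```python
def normalize_box(box, width, height, scale=1000):
    """Map pixel coordinates to integers in [0, scale] (rounding is an assumption)."""
    x1, y1, x2, y2 = box
    return (round(x1 / width * scale), round(y1 / height * scale),
            round(x2 / width * scale), round(y2 / height * scale))

def group_by_label(annotations, width, height):
    """Collect normalized boxes under each label, keeping first-seen label order."""
    grouped = {}
    for label, box in annotations:
        grouped.setdefault(label, []).append(normalize_box(box, width, height))
    return grouped

def to_sequence(grouped):
    """Emit one <ref> entry per label, followed by all of its boxes."""
    parts = []
    for label, boxes in grouped.items():
        parts.append(f"<ref>{label}</ref>")
        for box in boxes:
            parts.append("<box>" + "".join(f"<{c}>" for c in box) + "</box>")
    return parts

annotations = [
    ("cat", (0, 0, 320, 240)),
    ("dog", (320, 240, 640, 480)),
    ("cat", (160, 120, 640, 480)),
]
for part in to_sequence(group_by_label(annotations, width=640, height=480)):
    print(part)
```

Both "cat" boxes end up under a single <ref>cat</ref> entry, which is what saves the repeated label tokens.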

They also have 4 block types:

  1. Semantic Block: holds the label, e.g. <ref> crop tool </ref> [null] [null]; long labels are split across multiple blocks
  2. Box Block: indicates the box itself: <box> <130> <647> <190> <707> </box>
  3. Negative Block: indicates the absence of an object ``
  4. End Block: indicates the end of generation: <im_end> [null] [null] [null] [null] [null]

Dataset

12M unique images with 138M natural language queries, totaling 785M bounding boxes. They combine several open-source datasets such as Flickr30k Entities, gRefCOCO, and RefCOCO, among others.