Paper Notes: LocateAnything

last updated 2026-06-08

Use a VLM to identify referents from a prompt using bounding boxes. For instance:

Input:

Select the crop tool

Model output:

<ref>crop tool</ref><box><130><647><190><707></box>
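To make the output format concrete, here is a minimal sketch of a parser for strings like the one above. The function name parse_grounding is made up for these notes, and reading the four numbers as (x1, y1, x2, y2) is an assumption: the notes only say boxes are upper-left plus lower-right corners.

```python
import re

# One grounded referent: a label followed by a single box with four integer coordinates.
_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref><box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
)

def parse_grounding(text):
    """Return a list of (label, (x1, y1, x2, y2)) tuples found in `text`.

    Assumption: coordinates are ordered x1, y1, x2, y2.
    """
    results = []
    for m in _PATTERN.finditer(text):
        # Group 1 is the label; groups 2..5 are the four coordinates.
        coords = tuple(int(m.group(i)) for i in range(2, 6))
        results.append((m.group("label"), coords))
    return results

output = "<ref>crop tool</ref><box><130><647><190><707></box>"
print(parse_grounding(output))  # [('crop tool', (130, 647, 190, 707))]
```

This only handles the one-label, one-box pattern shown above; grouped boxes and negative boxes would need a slightly richer grammar.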

The model is trained on 138 million object detection samples. They hit ~13-16 boxes per second on a single H100 using Qwen2.5-VL-3B as the backbone model.

Parallel Box Decoding

Perhaps the most important contribution of this paper is “parallel box decoding,” where you generate an entire box in a single step.

With coordinate tokens (e.g. Kosmos-2.5, where every coordinate like <130> is its own token, alongside special tokens like <box> and <ref>), you get a pretty compact representation for the object:

[1:<ref>][2:crop][3: tool][4:</ref>][5:<box>][6:<130>]...

In total, you would get 10 tokens for this generation, meaning 10 forward passes.

Instead, we can generate the entire box in parallel, meaning we can reduce this to two steps:

[1:<ref>crop tool</ref>][2:<box><130><647><190><707></box>]

So the first step determines the box referent and the second step determines the box coordinates.
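The step counts above can be checked with a bit of arithmetic. This is a back-of-the-envelope sketch, not anything from the paper: decoding_steps is a made-up helper, it assumes each label word is one token, and it ignores the end token and prompt processing.

```python
import math

def decoding_steps(label_tokens, num_boxes, block_size=6):
    """Return (sequential_steps, parallel_steps) for one label and its boxes."""
    semantic_len = label_tokens + 2   # <ref> ... </ref>
    box_len = 6                       # <box> x1 y1 x2 y2 </box>
    # Token-by-token decoding: one forward pass per token.
    sequential = semantic_len + box_len * num_boxes
    # Block decoding: one pass per semantic block, one pass per box block.
    parallel = math.ceil(semantic_len / block_size) + num_boxes
    return sequential, parallel

print(decoding_steps(label_tokens=2, num_boxes=1))  # (10, 2)
```

The gap widens as more boxes share a label, since each extra box costs one step instead of six.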

They enable this by converting all of their data into 6-token blocks during training. This looks like:

tok:  0        1        2        3        4        5
[0:   <ref>    crop     tool     </ref>   [null]   [null]]
[1:   <box>    <130>    <647>    <190>    <707>    </box>]
[2:   <ref>    face     in       image    to       crop  ]
[3:   </ref>   [null]   [null]   [null]   [null]   [null]]
[4:   <box>    None     </box>   [null]   [null]   [null]]
[5:   <im_end> [null]   [null]   [null]   [null]   [null]]

Note the inserted [null] tokens, which ensure that the MTP (multi-token prediction) head predicts a consistent number of tokens at each step. This lets them chunk variable-length sequences into fixed-length blocks.
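The padding scheme in the table can be reproduced with a few lines. This is a sketch under my own assumptions: the names pack_unit and pack_sequence are hypothetical, and the input is already split into "units" (one label, one box, one negative box, or the end token) with one token per word.

```python
NULL = "[null]"
BLOCK_SIZE = 6

def pack_unit(tokens, block_size=BLOCK_SIZE):
    """Chunk one unit into fixed-size blocks, padding the last block with [null]."""
    blocks = []
    for start in range(0, len(tokens), block_size):
        chunk = tokens[start:start + block_size]
        blocks.append(chunk + [NULL] * (block_size - len(chunk)))
    return blocks

def pack_sequence(units):
    """Pack every unit separately so that no block mixes two units."""
    blocks = []
    for unit in units:
        blocks.extend(pack_unit(unit))
    return blocks

units = [
    ["<ref>", "crop", "tool", "</ref>"],
    ["<box>", "<130>", "<647>", "<190>", "<707>", "</box>"],
    ["<ref>", "face", "in", "image", "to", "crop", "</ref>"],
    ["<box>", "None", "</box>"],
    ["<im_end>"],
]
for i, block in enumerate(pack_sequence(units)):
    print(i, block)
```

Packing per unit rather than over the flat token stream is what keeps a box from straddling two blocks; the 7-token label spills into a second, mostly-null block instead.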

Hybrid Mode

Super interesting: the parallel decoding step can be unreliable (e.g., perhaps when a label is longer than 4 tokens and the model screws up the split). In that case the MTP head is forced to predict cross-box tokens, which it’s not trained to do. To account for this, they build a “hybrid mode” as a sort of pressure relief valve: the model can fall back to slow, fully autoregressive decoding for as little as a single block.
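A toy sketch of what such a fallback loop could look like. Everything here is an assumption for illustration: the notes don't say how the fallback is triggered, so this version uses a hypothetical well-formedness check, and the three "model" callables are scripted stand-ins rather than real model calls.

```python
NULL = "[null]"

def hybrid_decode(predict_block, predict_token, is_well_formed,
                  block_size=6, max_blocks=32):
    """Decode block by block; redo a suspicious block one token at a time."""
    blocks = []
    for _ in range(max_blocks):
        block = predict_block(blocks)            # fast path: one step per block
        if not is_well_formed(block):            # assumed trigger, not from the paper
            block = []
            for _ in range(block_size):          # slow path, for this block only
                block.append(predict_token(blocks, block))
        blocks.append(block)
        if block[0] == "<im_end>":
            break
    return blocks

# Scripted stand-ins for the model, so the loop can actually run.
TARGET = [
    ["<ref>", "crop", "tool", "</ref>", NULL, NULL],
    ["<box>", "<130>", "<647>", "<190>", "<707>", "</box>"],
    ["<im_end>", NULL, NULL, NULL, NULL, NULL],
]

def toy_predict_block(blocks):
    block = list(TARGET[len(blocks)])
    if len(blocks) == 1:
        block[-1] = "<ref>"                      # simulate a garbled box block
    return block

def toy_predict_token(blocks, partial):
    return TARGET[len(blocks)][len(partial)]

def toy_is_well_formed(block):
    # A block that opens a box must also close it.
    return block[0] != "<box>" or "</box>" in block

decoded = hybrid_decode(toy_predict_block, toy_predict_token, toy_is_well_formed)
print(decoded == TARGET)  # True
```

The point of the sketch is only the control flow: the slow path costs six steps, but it is paid for one block and then decoding returns to the fast path.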

Model and Training Details

The core model is based on Qwen2.5 with Moon-ViT. The multitoken blocks have bidirectional attention.
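One way to picture bidirectional attention inside blocks is as a block-wise mask. This is a sketch with an assumption baked in: the notes only say attention is bidirectional within a block, so treating attention across blocks as causal (earlier blocks visible, later ones hidden) is my guess. Prompt and image tokens are left out.

```python
def block_attention_mask(num_blocks, block_size=6):
    """mask[i][j] is True when token i may attend to token j.

    Tokens see everything in their own block (bidirectional) and in
    earlier blocks (assumed causal across blocks).
    """
    n = num_blocks * block_size
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]

# Small example: 2 blocks of 2 tokens each.
for row in block_attention_mask(num_blocks=2, block_size=2):
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 1 0 0
# 1 1 0 0
# 1 1 1 1
# 1 1 1 1
```

Compared with a plain causal mask, the only change is the upper triangle inside each diagonal block, which is what lets all tokens of a block be predicted together.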

Block Representation

Normalize the box coordinates to [0, 1000]
Boxes are represented by upper-left and lower-right coordinates
Group boxes under the same label (only predict the <ref> block once)
Convert to a sequence of blocks
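The four steps above can be sketched as a small preprocessing function. Several things here are assumptions: the helper names, rounding to the nearest integer, the (x1, y1, x2, y2) ordering, whitespace splitting as a stand-in for the real tokenizer, and the 1920x1080 image size and pixel box in the example, which are made up so that the result matches the crop tool box from earlier.

```python
def normalize_box(box, width, height, scale=1000):
    """Map a pixel box (x1, y1, x2, y2) to integer coordinates in [0, scale]."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * scale / width),
        round(y1 * scale / height),
        round(x2 * scale / width),
        round(y2 * scale / height),
    )

def to_units(annotations, width, height):
    """Turn (label, box_or_None) pairs into token units, one <ref> per label."""
    grouped = {}
    for label, box in annotations:
        grouped.setdefault(label, []).append(box)   # group boxes under the same label

    units = []
    for label, boxes in grouped.items():
        units.append(["<ref>"] + label.split() + ["</ref>"])
        for box in boxes:
            if box is None:                         # no matching object in the image
                units.append(["<box>", "None", "</box>"])
            else:
                coords = normalize_box(box, width, height)
                units.append(["<box>"] + [f"<{c}>" for c in coords] + ["</box>"])
    units.append(["<im_end>"])
    return units

# Hypothetical 1920x1080 screenshot with one annotated box in pixels.
print(to_units([("crop tool", (250, 699, 365, 764))], width=1920, height=1080))
```

Each unit would then be padded out to fixed-size blocks as described earlier.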

They also have 4 block types:

Semantic Block: holds the label, e.g. <ref> crop tool </ref> [null] [null]; long labels are split across multiple blocks
Box Block: holds the box itself: <box> <130> <647> <190> <707> </box>
Negative Block: indicates the absence of an object: <box> None </box> [null] [null] [null]
End Block: indicates the end of generation: <im_end> [null] [null] [null] [null] [null]
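The four block types can be told apart with a simple classifier. This is an illustrative sketch, not the paper's code: block_type and the "invalid" category are my own, and the rule that anything without box or end tokens counts as a semantic block is an assumption made so that continuation blocks of long labels are accepted.

```python
import re

NULL = "[null]"
COORD = re.compile(r"<\d+>")
STRUCTURAL = {"<box>", "</box>", "<im_end>"}

def block_type(block, block_size=6):
    """Classify one block as 'semantic', 'box', 'negative', 'end', or 'invalid'."""
    if len(block) != block_size:
        return "invalid"
    body = [t for t in block if t != NULL]
    # Padding must be trailing only, and a block cannot be all padding.
    if not body or block[:len(body)] != body:
        return "invalid"
    if body == ["<im_end>"]:
        return "end"
    if body == ["<box>", "None", "</box>"]:
        return "negative"
    if (len(body) == 6 and body[0] == "<box>" and body[-1] == "</box>"
            and all(COORD.fullmatch(t) for t in body[1:5])):
        return "box"
    # Assumption: label pieces never contain box or end tokens.
    if not STRUCTURAL.intersection(body):
        return "semantic"
    return "invalid"

print(block_type(["<ref>", "crop", "tool", "</ref>", NULL, NULL]))             # semantic
print(block_type(["<box>", "<130>", "<647>", "<190>", "<707>", "</box>"]))     # box
print(block_type(["<box>", "None", "</box>", NULL, NULL, NULL]))               # negative
print(block_type(["<im_end>", NULL, NULL, NULL, NULL, NULL]))                  # end
```

A check like this is also a natural place to detect a garbled block, for example a box block that never closes.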

Dataset

12M unique images with 138M natural language queries, totaling 785M bounding boxes. They combine several open-source datasets like Flickr30k Entities, gRefCOCO, RefCOCO, etc.

N.B. Tagging this as vlm-ocr because it’s a VLM that can do OCR, but it’s not really an OCR model.