Paper Notes: LocateAnything
last updated 2026-06-08
N.B. Tagging this as vlm-ocr because it’s a VLM that can do OCR, but it’s not really an OCR model.
Use a VLM to identify referents from a prompt using bounding boxes. For instance:
Input:
Select the crop tool
Model output:
<ref>crop tool</ref><box><130><647><190><707></box>
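A minimal sketch of turning that output string back into structured data. The function name `parse_output` is hypothetical, and two things are assumptions rather than anything the paper states: that coordinates come in x1, y1, x2, y2 order, and that several boxes may follow a single label.

```python
import re

# One "<ref>label</ref>" followed by one or more "<box>...</box>" groups.
REF_RE = re.compile(r"<ref>(.*?)</ref>((?:<box>(?:<\d+>){4}</box>)+)")
BOX_RE = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")

def parse_output(text):
    """Return a list of (label, [(x1, y1, x2, y2), ...]) pairs."""
    results = []
    for label, boxes in REF_RE.findall(text):
        # Each match is a tuple of four digit strings; convert to ints.
        coords = [tuple(int(v) for v in m) for m in BOX_RE.findall(boxes)]
        results.append((label.strip(), coords))
    return results

print(parse_output("<ref>crop tool</ref><box><130><647><190><707></box>"))
# [('crop tool', [(130, 647, 190, 707)])]
```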
The model is trained on 138 million object detection samples. They hit ~13-16 boxes per second on a single H100 using Qwen2.5-VL-3B as the backbone model.
Parallel Box Decoding
Perhaps the most important contribution of this paper is “parallel box decoding,” where you generate an entire box in a single step.
Using approaches like adding coordinate tokens (e.g. Kosmos-2.5, where every coordinate like <130> is a unique token, along with tokens like <box> and <ref>) you get a pretty compact representation for the object:
[1:<ref>][2:crop][3: tool][4:</ref>][5:<box>][6:<130>]...
In total, you would get 10 tokens for this generation, meaning 10 forward passes.
Instead, we can generate the entire box in parallel, meaning we can reduce this to two steps:
[1:<ref>crop tool</ref>][2:<box><130><647><190><707></box>]
So the first step determines the box referent and the second step determines the box coordinates.
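The step counts above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are made up, the end-of-sequence token is ignored, and it assumes each label or box fits in a single parallel step.

```python
def autoregressive_steps(groups):
    # Standard decoding: one forward pass per generated token.
    return sum(len(g) for g in groups)

def parallel_steps(groups):
    # Parallel box decoding: one forward pass per group (label or box).
    # Assumes every group fits in a single step.
    return len(groups)

label = ["<ref>", "crop", " tool", "</ref>"]
box = ["<box>", "<130>", "<647>", "<190>", "<707>", "</box>"]

print(autoregressive_steps([label, box]))  # 10
print(parallel_steps([label, box]))        # 2
```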
They enable this by converting all of their data into 6-token blocks during training. This looks like:
[1: <ref> crop tool </ref> [null] [null]]
[2: <box> <130> <647> <190> <707> </box>]
[3: <im_end> [null] [null] [null] [null] [null]]
Note the inserted [null] tokens, which ensure that the MTP (multi-token prediction) head predicts the same number of tokens at every timestep. This lets them chunk variable-length sequences into fixed-length blocks.
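A minimal sketch of the padding step, reproducing the three blocks above. `to_blocks` is a hypothetical helper, and naively splitting an over-long segment at the block boundary is my assumption. These notes don’t record how the paper actually splits long labels.

```python
BLOCK_SIZE = 6
NULL = "[null]"

def to_blocks(tokens, block_size=BLOCK_SIZE):
    """Split tokens into fixed-size blocks, right-padding the last one with NULL."""
    blocks = []
    for i in range(0, len(tokens), block_size):
        chunk = tokens[i:i + block_size]
        blocks.append(chunk + [NULL] * (block_size - len(chunk)))
    return blocks

segments = [
    ["<ref>", "crop", " tool", "</ref>"],
    ["<box>", "<130>", "<647>", "<190>", "<707>", "</box>"],
    ["<im_end>"],
]
# Pad each segment on its own so a block never mixes label and box tokens.
sequence = [block for seg in segments for block in to_blocks(seg)]
for block in sequence:
    print(block)
```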
Hybrid Mode
Super interesting: the parallel decoding step can be unreliable (e.g., perhaps when a label is longer than 4 tokens and the model screws up). In that case the MTP head is forced to predict tokens that cross box boundaries, which it’s not trained to do. To account for this, they build a “hybrid mode” as a sort of pressure relief valve, which lets the model fall back to slow, fully autoregressive prediction for as little as a single block.
Model and Training Details
The core model is based on Qwen2.5 with Moon-ViT. The multitoken blocks have bidirectional attention.
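One way to picture bidirectional attention inside a block is as a block-wise causal mask. The sketch below is my own illustration, not the paper’s implementation: it assumes the prompt prefix is plain causal and that each block sees everything before it plus all tokens in its own block. The name `block_attention_mask` is hypothetical.

```python
def block_attention_mask(prefix_len, num_blocks, block_size=6):
    """Return mask[i][j], True when position i may attend to position j."""
    n = prefix_len + num_blocks * block_size

    def group_id(pos):
        # Prefix tokens each get their own id (plain causal attention);
        # all tokens of one block share an id (bidirectional inside the block).
        if pos < prefix_len:
            return pos
        return prefix_len + (pos - prefix_len) // block_size

    return [[group_id(j) <= group_id(i) for j in range(n)] for i in range(n)]

# Tiny example: 2 prefix tokens, then 2 blocks of 3 tokens each.
mask = block_attention_mask(prefix_len=2, num_blocks=2, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the prefix rows form the usual lower triangle, while each block shows up as a filled square on the diagonal.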
Block Representation
- Normalize the token coordinates to [0, 1000]
- Boxes are represented by upper-left and lower-right coordinates
- Group boxes under the same label (only predict the <ref> block once)
- Convert to a sequence of blocks
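A rough sketch of the first three steps, assuming pixel-space input boxes in (x1, y1, x2, y2) order. The rounding and clamping scheme and both function names are my assumptions, not details from the paper.

```python
from collections import defaultdict

def normalize_box(box, width, height, scale=1000):
    """Map a pixel box (x1, y1, x2, y2) to integer coordinates in [0, scale]."""
    x1, y1, x2, y2 = box

    def norm(value, size):
        # Scale to [0, scale], round to an integer, and clamp to the range.
        return min(scale, max(0, round(value / size * scale)))

    return (norm(x1, width), norm(y1, height), norm(x2, width), norm(y2, height))

def group_by_label(annotations, width, height):
    """Collect normalized boxes under their label so each label appears once."""
    grouped = defaultdict(list)
    for label, box in annotations:
        grouped[label].append(normalize_box(box, width, height))
    return dict(grouped)

annotations = [
    ("icon", (192, 108, 384, 216)),
    ("menu", (0, 0, 960, 54)),
    ("icon", (960, 540, 1920, 1080)),
]
print(group_by_label(annotations, width=1920, height=1080))
# {'icon': [(100, 100, 200, 200), (500, 500, 1000, 1000)], 'menu': [(0, 0, 500, 50)]}
```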
They also have 4 block types:
- Semantic Block: indicates the type: <ref> crop tool </ref> [null] [null], splitting across blocks for long labels
- Box Block: indicates the box itself: <box> <130> <647> <190> <707> </box>
- Negative Block: indicates the absence of an object ``
- End Block: indicates the end of generation: <im_end> [null] [null] [null] [null] [null]
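Going the other way, here is a sketch of decoding a block sequence back into labels and boxes. `decode_blocks` is a hypothetical helper that covers the semantic, box, and end blocks only. The negative block is left out because its token layout isn’t recorded in these notes.

```python
NULL = "[null]"

def decode_blocks(blocks):
    """Turn a list of 6-token blocks into {label: [(x1, y1, x2, y2), ...]}."""
    detections, label_parts, label = {}, [], None
    for block in blocks:
        tokens = [t for t in block if t != NULL]  # drop padding
        if not tokens or tokens[0] == "<im_end>":
            break  # end block: stop decoding
        if tokens[0] == "<box>" and tokens[-1] == "</box>":
            # Box block: strip the angle brackets from each coordinate token.
            coords = tuple(int(t.strip("<>")) for t in tokens[1:-1])
            detections.setdefault(label, []).append(coords)
        else:
            # Semantic block; a long label may continue over several blocks.
            label_parts += [t for t in tokens if t not in ("<ref>", "</ref>")]
            if tokens[-1] == "</ref>":
                label = "".join(label_parts).strip()
                label_parts = []
                detections.setdefault(label, [])
    return detections

blocks = [
    ["<ref>", "crop", " tool", "</ref>", NULL, NULL],
    ["<box>", "<130>", "<647>", "<190>", "<707>", "</box>"],
    ["<im_end>", NULL, NULL, NULL, NULL, NULL],
]
print(decode_blocks(blocks))
# {'crop tool': [(130, 647, 190, 707)]}
```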
Dataset
12M unique images with 138M natural language queries, totaling 785M bounding boxes. They combine several open-source datasets, including Flickr30k Entities, gRefCOCO, and RefCOCO.