M-RoPE channel allocation across Qwen versions
last updated 2026-05-22
Pulling apart how recent Qwen-family VLMs split M-RoPE channels across the temporal, height, and width axes. This note exists because I keep forgetting the exact numbers and having to re-read the papers.
Qwen2-VL
The split is roughly even across the three axes, with a tilt toward the temporal axis because the model is also used for video.
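The mechanism behind the split: the per-head RoPE frequency channels are partitioned into contiguous sections, and each section takes its position index from a different axis of the token's (t, h, w) coordinate. A minimal sketch, assuming a 128-dim head and a hypothetical (temporal, height, width) split of (16, 24, 24) frequency pairs; the real values live in each model's config:

```python
import numpy as np

def mrope_angles(pos_t, pos_h, pos_w, head_dim=128,
                 sections=(16, 24, 24), base=10000.0):
    """Rotary angles for one token at 3D position (pos_t, pos_h, pos_w).

    `sections` is a hypothetical split of the head_dim // 2 frequency
    channels: the first block rotates by the temporal index, the next
    by height, the last by width.
    """
    half = head_dim // 2
    assert sum(sections) == half
    # Standard RoPE inverse frequencies, one per channel pair.
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
    pos = np.empty(half)
    t_end = sections[0]
    h_end = t_end + sections[1]
    pos[:t_end] = pos_t        # temporal channels
    pos[t_end:h_end] = pos_h   # height channels
    pos[h_end:] = pos_w        # width channels
    # cos/sin of these angles is what rotates q and k per channel pair.
    return pos * inv_freq
```

A useful sanity check: a plain text token, where t = h = w = sequence position, collapses this back to vanilla 1D RoPE, which is why the scheme stays compatible with text-only pretraining.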
Qwen2.5-VL
Rebalanced toward the spatial axes for high-resolution document inputs: the temporal allocation shrinks and the height/width budget grows. This is the change that matters for document OCR, since a page is a single frame and the temporal channels would otherwise be mostly wasted.
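The "wasted channels" claim can be made concrete: RoPE attention depends only on the *difference* of rotary angles between two tokens, and for patches of the same page the temporal coordinate is identical, so every temporal channel contributes zero relative signal. A sketch with the same hypothetical (16, 24, 24) split as above:

```python
import numpy as np

def angle_diff(pos_a, pos_b, sections=(16, 24, 24),
               head_dim=128, base=10000.0):
    """Per-channel rotary angle difference between two tokens at 3D
    positions pos_a and pos_b; RoPE attention scores depend only on
    this difference. `sections` is a hypothetical (t, h, w) split."""
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)

    def angles(p):
        # Repeat each axis coordinate across its channel section.
        pos = np.concatenate([np.full(s, c) for s, c in zip(sections, p)])
        return pos * inv_freq

    return angles(pos_a) - angles(pos_b)

# Two patches of the same page: t is equal, only h and w differ,
# so the first sections[0] (temporal) channels carry no relative signal.
d = angle_diff((0, 5, 7), (0, 2, 3))
```

For single-image OCR, those channels are effectively dead weight, which is the intuition behind shifting budget to height and width.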
What I want to test
Whether the document-OCR gain can be recovered in Qwen2-VL by simply rerouting temporal channels into the spatial axes at finetune time, without retraining the position encoding from scratch.
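One way to frame the experiment: rerouting is just moving the section boundaries, so the finetune only needs to adapt the channels whose driving axis changed, and everything else keeps its pretrained behavior. A sketch that enumerates those channels, with both splits hypothetical (the old split mirrors the earlier examples, the new one is an arbitrary spatial-heavy choice):

```python
def reassigned_channels(old=(4 + 12, 24, 24), new=(4, 30, 30)):
    """Frequency channels whose driving axis changes when the
    (temporal, height, width) split is rerouted from `old` to `new`.
    Both splits here are hypothetical, not taken from any released
    config."""
    def axis_of(i, sections):
        # Map a flat channel index to its axis (0=t, 1=h, 2=w).
        for axis, width in enumerate(sections):
            if i < width:
                return axis
            i -= width
        raise IndexError("channel index out of range")

    assert sum(old) == sum(new), "total channel budget must be preserved"
    return [i for i in range(sum(old))
            if axis_of(i, old) != axis_of(i, new)]
```

The returned list is the set of channels whose rotary inputs change meaning at finetune time; a small list would support the hypothesis that full retraining of the position encoding is unnecessary.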
Related: "Why VLMs Fail at Tables" and "Tokenizers and Layout".