M-RoPE channel allocation across Qwen versions
last updated 2026-05-22
Pulling apart how recent Qwen-family VLMs split M-RoPE channels across the temporal, height, and width axes. This note exists because I keep forgetting the exact numbers and having to re-read the papers.
Qwen2-VL
The split is roughly even across the three axes, with a tilt toward the temporal axis because the model is also used for video.
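The mechanism behind the split: the per-head RoPE frequency channels are partitioned into contiguous sections, and each section takes its position index from a different axis of the token's (t, h, w) coordinate. A minimal sketch, assuming a 128-dim head and a hypothetical (temporal, height, width) split of (16, 24, 24) frequency pairs; the real values live in each model's config:

```python
import numpy as np

def mrope_angles(pos_t, pos_h, pos_w, head_dim=128,
                 sections=(16, 24, 24), base=10000.0):
    """Rotary angles for one token at 3D position (pos_t, pos_h, pos_w).

    `sections` is a hypothetical split of the head_dim // 2 frequency
    channels: the first block rotates by the temporal index, the next
    by height, the last by width.
    """
    half = head_dim // 2
    assert sum(sections) == half
    # Standard RoPE inverse frequencies, one per channel pair.
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
    pos = np.empty(half)
    t_end = sections[0]
    h_end = t_end + sections[1]
    pos[:t_end] = pos_t        # temporal channels
    pos[t_end:h_end] = pos_h   # height channels
    pos[h_end:] = pos_w        # width channels
    # cos/sin of these angles is what rotates q and k per channel pair.
    return pos * inv_freq
```

A useful sanity check: a plain text token, where t = h = w = sequence position, collapses this back to vanilla 1D RoPE, which is why the scheme stays compatible with text-only pretraining.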
Qwen2.5-VL
Rebalanced toward the spatial axes for high-resolution document inputs: the temporal allocation shrinks and the height/width budget grows. This is the change that matters for document OCR, since a page is a single frame and the temporal channels would otherwise be mostly wasted.
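The "wasted channels" claim can be made concrete: RoPE attention depends only on the *difference* of rotary angles between two tokens, and for patches of the same page the temporal coordinate is identical, so every temporal channel contributes zero relative signal. A sketch with the same hypothetical (16, 24, 24) split as above:

```python
import numpy as np

def angle_diff(pos_a, pos_b, sections=(16, 24, 24),
               head_dim=128, base=10000.0):
    """Per-channel rotary angle difference between two tokens at 3D
    positions pos_a and pos_b; RoPE attention scores depend only on
    this difference. `sections` is a hypothetical (t, h, w) split."""
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)

    def angles(p):
        # Repeat each axis coordinate across its channel section.
        pos = np.concatenate([np.full(s, c) for s, c in zip(sections, p)])
        return pos * inv_freq

    return angles(pos_a) - angles(pos_b)

# Two patches of the same page: t is equal, only h and w differ,
# so the first sections[0] (temporal) channels carry no relative signal.
d = angle_diff((0, 5, 7), (0, 2, 3))
```

For single-image OCR, those channels are effectively dead weight, which is the intuition behind shifting budget to height and width.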
What I want to test
Whether the document-OCR gain can be recovered in Qwen2-VL by simply rerouting temporal channels into the spatial axes at finetune time, without retraining the position encoding from scratch.
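One way to frame the experiment: rerouting is just moving the section boundaries, so the finetune only needs to adapt the channels whose driving axis changed, and everything else keeps its pretrained behavior. A sketch that enumerates those channels, with both splits hypothetical (the old split mirrors the earlier examples, the new one is an arbitrary spatial-heavy choice):

```python
def reassigned_channels(old=(4 + 12, 24, 24), new=(4, 30, 30)):
    """Frequency channels whose driving axis changes when the
    (temporal, height, width) split is rerouted from `old` to `new`.
    Both splits here are hypothetical, not taken from any released
    config."""
    def axis_of(i, sections):
        # Map a flat channel index to its axis (0=t, 1=h, 2=w).
        for axis, width in enumerate(sections):
            if i < width:
                return axis
            i -= width
        raise IndexError("channel index out of range")

    assert sum(old) == sum(new), "total channel budget must be preserved"
    return [i for i in range(sum(old))
            if axis_of(i, old) != axis_of(i, new)]
```

The returned list is the set of channels whose rotary inputs change meaning at finetune time; a small list would support the hypothesis that full retraining of the position encoding is unnecessary.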
Related: "Why VLMs Fail at Tables" and "Tokenizers and Layout".