Be Careful Interpreting Averaged Benchmarks
Thoughts on averaged benchmarks and hidden correlations.
New LLM releases come with fancy graphs comparing the averaged benchmark performance of the latest model against previous releases. For small/open models, the x-axis is often parameters or tokens/$ (Mistral invented the “Upper Left Triangle” graph). But there’s a problem with these graphs: averaged benchmark scores are misleading.
Let’s look at the specific case of multimodal models (which have been on my mind a lot lately, as you might guess). Commonly reported benchmarks include DocVQA, ChartQA, and OCRBench.
Here’s the catch: OCRBench is itself composed of instances pulled from DocVQA, ChartQA, and other existing datasets. So when you add OCRBench to the average alongside DocVQA and ChartQA, you’re counting those instances twice, and the average silently over-weights whatever skill those overlapping datasets share.
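To see how that skews things, here’s a toy Python sketch. Every number in it is made up for illustration, and the `overlap` fraction is an assumption, not OCRBench’s actual composition:

```python
# Toy sketch: why averaging overlapping benchmarks over-weights the shared skill.
# All scores are invented; `overlap` is a hypothetical fraction of OCRBench
# instances sourced from DocVQA.

def naive_average(docvqa: float, chartqa: float,
                  overlap: float = 0.5, ocrbench_rest: float = 0.55) -> float:
    """Average of three reported benchmarks, modeling OCRBench as a mix of
    DocVQA-sourced items and unrelated items."""
    ocrbench = overlap * docvqa + (1 - overlap) * ocrbench_rest
    return (docvqa + chartqa + ocrbench) / 3

base          = naive_average(docvqa=0.70, chartqa=0.70)
docvqa_boost  = naive_average(docvqa=0.80, chartqa=0.70)  # +10 pts on DocVQA
chartqa_boost = naive_average(docvqa=0.70, chartqa=0.80)  # +10 pts on ChartQA

print(f"+10 DocVQA  moves the average by {docvqa_boost - base:.3f}")   # ~0.050
print(f"+10 ChartQA moves the average by {chartqa_boost - base:.3f}")  # ~0.033
```

Under these made-up numbers, a 10-point DocVQA gain moves the three-benchmark average 1.5x as far as the same gain on ChartQA, because the DocVQA improvement gets counted again through OCRBench.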
So if you want to juice your model’s averaged score, just train it to be really good at DocVQA. Actually, just get really good at DocVQA regardless; it’d make my life easier.