¹State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
²LLM-Core Xiaomi   ³The University of Hong Kong   ⁴Renmin University of China
*Work done during internship at Xiaomi Corporation. ◊Co-corresponding authors.
GroundingME is a challenging visual grounding benchmark designed to rigorously evaluate MLLMs' ability to localize objects from natural language descriptions. It systematically tests models across four critical dimensions: Discriminative (distinguishing similar objects), Spatial (complex relational reasoning), Limited (handling occlusions/tiny objects), and Rejection (recognizing ungroundable queries). The leaderboard below presents model performance on our 1,005 challenging examples.
Open-weight models:
| Rank | Model | Total | Dis. | Spa. | Lim. | Rej. |
|---|---|---|---|---|---|---|
| 🥇 | Qwen3-VL-A22B (Thinking) | 49.8 | 65.2 | 73.7 | 45.0 | 5.5 |
| 🥈 | Qwen3-VL-32B (Thinking) | 46.9 | 65.7 | 70.0 | 36.0 | 9.5 |
| 🥉 | Qwen3-VL-A22B | 45.1 | 69.6 | 49.7 | 54.0 | 0.0 |
| 4 | Qwen3-VL-32B | 39.5 | 75.0 | 47.3 | 34.0 | 0.0 |
| 5 | Qwen3-VL-A3B (Thinking) | 39.2 | 53.4 | 53.3 | 38.0 | 5.5 |
| 6 | Qwen3-VL-A3B | 35.7 | 63.2 | 30.0 | 46.7 | 0.0 |
| 7 | Qwen3-VL-8B (Thinking) | 34.3 | 52.5 | 43.0 | 33.3 | 4.5 |
| 8 | GLM-4.5V (Thinking) | 34.0 | 52.5 | 45.3 | 30.3 | 4.0 |
| 9 | Qwen3-VL-4B | 33.9 | 56.4 | 28.3 | 47.0 | 0.0 |
| 10 | GLM-4.5V | 32.1 | 52.9 | 42.0 | 29.3 | 0.5 |
| 11 | Qwen3-VL-8B | 31.0 | 61.3 | 26.3 | 36.0 | 0.0 |
| 12 | Qwen2.5-VL-72B | 29.6 | 48.5 | 40.3 | 23.7 | 3.0 |
| 13 | Qwen2.5-VL-32B | 26.9 | 47.5 | 40.0 | 17.7 | 0.0 |
| 14 | MiMo-VL-7B-RL-2508 (Thinking) | 24.1 | 46.6 | 28.7 | 17.0 | 5.0 |
| 15 | Qwen3-VL-2B | 21.1 | 44.6 | 11.7 | 28.7 | 0.0 |
| 16 | MiMo-VL-7B-RL-2508 | 18.6 | 44.1 | 19.3 | 13.0 | 0.0 |
| 17 | InternVL3.5-A28B | 17.1 | 28.4 | 25.0 | 13.0 | 0.0 |
| 18 | Qwen2.5-VL-7B | 15.1 | 31.9 | 14.3 | 14.3 | 0.5 |
| 19 | Llama-4-Maverick | 13.0 | 18.1 | 22.3 | 5.0 | 6.0 |
| 20 | Llama-Nemotron-8B | 10.4 | 25.0 | 6.0 | 8.3 | 5.5 |
| 21 | Llama-4-Scout | 8.9 | 17.6 | 12.3 | 3.7 | 2.5 |
| 22 | Keye-VL-1.5-8B | 8.5 | 21.6 | 8.0 | 5.7 | 0.0 |
| 23 | LLaVA-OneVision-1.5-8B | 4.4 | 9.8 | 4.7 | 3.3 | 0.0 |
| 24 | MiniCPM-V-4.5 | 4.0 | 7.8 | 4.0 | 4.0 | 0.0 |
| 25 | InternVL3.5-8B | 3.3 | 6.4 | 4.0 | 1.7 | 1.5 |
| 26 | Mistral-3.2-24B | 1.7 | 4.4 | 2.7 | 0.0 | 0.0 |
| 27 | Phi-4-Multimodal | 0.4 | 1.0 | 0.7 | 0.0 | 0.0 |
| 28 | Gemma-3-27B | 0.4 | 1.5 | 0.3 | 0.0 | 0.0 |
Proprietary models:
| Rank | Model | Total | Dis. | Spa. | Lim. | Rej. |
|---|---|---|---|---|---|---|
| 🥇 | Seed-1.6-Vision-250815 (Thinking) | 46.5 | 59.3 | 72.7 | 41.7 | 1.5 |
| 🥈 | Seed-1.6-Vision-250815 | 42.6 | 59.8 | 58.7 | 42.7 | 1.0 |
| 3 | Gemini-2.5-Pro | 20.7 | 34.8 | 34.0 | 7.0 | 7.0 |
| 4 | Gemini-2.5-Flash | 18.7 | 36.3 | 25.0 | 13.0 | 0.0 |
All metrics reported are Accuracy@0.5. Dis. = Discriminative, Spa. = Spatial, Lim. = Limited, Rej. = Rejection.
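For reference, Accuracy@0.5 counts a prediction as correct when its bounding box overlaps the ground-truth box with an IoU of at least 0.5; for Rejection samples, which have no ground-truth box, a response presumably counts as correct only when the model declines to produce one. Below is a minimal sketch of the scoring logic, assuming boxes in (x1, y1, x2, y2) pixel format (helper names are ours, not the benchmark's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def accuracy_at_05(predictions, ground_truths):
    """Fraction of samples whose predicted box reaches IoU >= 0.5 with the ground truth."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```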
Visual grounding—localizing objects from natural language descriptions—represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets?
Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative—distinguishing highly similar objects, (2) Spatial—understanding complex relational descriptions, (3) Limited—handling occlusions or tiny objects, and (4) Rejection—recognizing ungroundable queries.
Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks—reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment.
We explore two strategies for improvement: (1) test-time scaling, which selects the best response by judging thinking trajectories and improves complex grounding by up to 2.9%, and (2) data-mixture training, which teaches models to recognize ungroundable queries and boosts rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations of MLLMs and a roadmap toward human-level visual grounding.
Subtask Distribution. Our benchmark comprises 1,005 samples distributed across four level-1 (L-1) categories and twelve level-2 (L-2) subcategories.
Examples from different visual grounding benchmarks. ■ Green bounding boxes indicate the ground-truth object, while ■ red bounding boxes show the prediction of Qwen3-VL-30B-A3B-Instruct.
Three-stage human-in-the-loop annotation pipeline: (1) Bounding Box Annotation using automated tools, (2) Description Generation with MLLMs, and (3) Manual Selection and Refinement to ensure quality and challenge.
The best model (Qwen3-VL-A22B) achieves only 45.1% accuracy, with most models scoring between 10% and 40%.
Most models score 0% on rejection tasks, reflexively hallucinating non-existent objects.
Performance improves consistently with increased model size across model families.
Main evaluation results on GroundingME. All metrics reported are Accuracy@0.5. All models in this table are evaluated in no-thinking mode where supported.
Thinking mode improves performance across all tested models, with gains ranging from 1.9% to 7.4%. Models show notable improvements on reasoning-intensive tasks (Spatial and Rejection) and can learn basic rejection behavior through thinking.
By generating multiple thinking trajectories and selecting the best one using an LLM judge, we achieve significant performance improvements:
Text-only LLMs evaluating thinking trajectory quality prove effective for reasoning-intensive categories.
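A minimal sketch of this best-of-N selection, assuming a multimodal `model` interface that returns a (thinking, box) pair and a text-only `judge` model; both interfaces and the judging prompt are illustrative placeholders rather than the paper's exact setup:

```python
import re

def best_of_n_grounding(model, judge, image, query, n=8):
    """Sample several thinking trajectories and keep the box whose reasoning scores highest."""
    candidates = []
    for _ in range(n):
        # Assumed interface: returns the thinking trace and the predicted bounding box.
        thinking, box = model.generate(image, query, temperature=1.0)
        candidates.append((thinking, box))

    def judge_score(thinking):
        # The text-only judge never sees the image; it rates only the written reasoning.
        prompt = (
            f"Rate from 1 to 10 how coherent and well-supported this reasoning about "
            f"the query '{query}' is. Reply with a single number.\n\n{thinking}"
        )
        reply = judge.generate(prompt)
        match = re.search(r"\d+", reply)
        return int(match.group()) if match else 0

    best = max(candidates, key=lambda c: judge_score(c[0]))
    return best[1]  # the selected bounding box
```

Because the judge never sees the image, it can only reward coherent reasoning, which is consistent with the observation that this selection helps most on reasoning-intensive categories.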
Fine-tuning on RefCOCOg augmented with negative samples teaches models to reject ungroundable queries:
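One way such a mixture could be built is to pair an image with a referring expression drawn from a different image, so that the described object is absent and the training target becomes an explicit rejection. The sketch below follows that recipe; field names, the rejection string, and the mixing ratio are assumptions, not the authors' exact procedure:

```python
import random

REJECT_ANSWER = "The described object does not appear in the image."

def build_mixture(refcocog_samples, negative_ratio=0.2, seed=0):
    """Mix standard grounding targets with rejection targets built from mismatched descriptions."""
    rng = random.Random(seed)
    mixture = []
    for sample in refcocog_samples:
        # Positive: ground the original description to its annotated box.
        mixture.append({
            "image": sample["image"],
            "query": sample["description"],
            "target": sample["box"],  # e.g. (x1, y1, x2, y2)
        })
        # Negative: with some probability, borrow a description from another image.
        if rng.random() < negative_ratio:
            other = rng.choice(refcocog_samples)
            # In practice the borrowed object should be verified as truly absent
            # (e.g., via category annotations) to avoid mislabeled rejections.
            if other["image"] != sample["image"]:
                mixture.append({
                    "image": sample["image"],
                    "query": other["description"],
                    "target": REJECT_ANSWER,  # teach the model to decline instead of hallucinating
                })
    rng.shuffle(mixture)
    return mixture
```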
@article{li2025groundingme,
title={GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation},
author={Rang Li and Lei Li and Shuhuai Ren and Hao Tian and Shuhao Gu and Shicheng Li and Zihao Yue and Yudong Wang and Wenhan Ma and Zhe Yang and Jingyuan Ma and Zhifang Sui and Fuli Luo},
journal={arXiv preprint arXiv:2512.17495},
year={2025}
}