MMIE

Massive Multimodal Interleaved Comprehension Benchmark For Large Vision-Language Models

UNC-Chapel Hill, University of Chicago, Microsoft Research, NUS
*Equal Contribution

πŸ”₯ [NEW!] We introduce MMIE, a large-scale, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).

Abstract

We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark, designed for Large Vision-Language Models (LVLMs). MMIE offers a robust framework for evaluating the interleaved comprehension and generation capabilities of LVLMs across diverse fields, supported by reliable automated metrics.

🌟 Key Features


πŸ—‚ Dataset

  • Comprehensive: 20K+ examples in interleaved multimodal format, consolidated into one JSON file for easy access (see the loading sketch after this list).
  • Diverse: Spanning 12 fields and 102 subfields, offering broad and deep evaluation across domains.
  • Ground Truth Reference: Each question comes paired with a reference, ensuring accurate evaluations of model performance.
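
As a quick start, the sketch below loads the consolidated JSON file and prints one example. The file name and the record fields printed here are assumptions for illustration; check the released dataset for the actual schema.

```python
# Minimal sketch: load the consolidated MMIE JSON file and inspect one example.
# NOTE: the file name "MMIE.json" and the record fields are assumptions for
# illustration; consult the released dataset for the real schema.
import json

with open("MMIE.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"Loaded {len(data)} interleaved questions")

example = data[0]
for key, value in example.items():
    preview = str(value)
    # Truncate long fields so the printout stays readable.
    print(f"{key}: {preview[:80]}{' ...' if len(preview) > 80 else ''}")
```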

βš™οΈ Metric

  • Automated Scoring: Evaluate your model’s results with our scoring model, MMIE-Score, powered by InternVL-2-4B.
  • Bias Mitigation: Fine-tuned to reduce bias and ensure objective evaluations.
  • Multimodal Capability: Tailored for interleaved inputs and outputs, evaluating both text and image comprehension.
  • High Correlation with Human Scores: Outperforms alternative metrics such as GPT-4o in multimodal tasks, ensuring reliable benchmarking results.

MMIE Datasets

MMIE is curated from four multimodal datasets, encompassing:

  • 3 categories: Situational analysis, project-based learning, and multi-step reasoning.
  • 12 fields: Mathematics, physics, coding, statistics, literature, philosophy, education, finance, health, sports, art, and Electrical Engineering and Computer Science (EECS).
  • 102 subfields: Offering in-depth coverage across multiple domains.

The dataset contains 20,103 multimodal questions that support both interleaved inputs and outputs. It includes a mix of multiple-choice and open-ended questions, evaluating a wide range of competencies and reasoning skills. Each query is paired with a ground truth reference, enabling effective evaluation.

In addition, we propose an automated evaluation metric powered by a fine-tuned scoring model, available as MMIE-Score. This evaluation tool provides a streamlined way to assess your model's performance on the benchmark dataset.
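A hedged usage sketch is shown below. It assumes the checkpoint exposes the standard InternVL2 chat interface through `trust_remote_code`; the repository id, prompt wording, and text-only call are placeholders, so please follow the MMIE-Score release for the exact invocation and for scoring interleaved images.

```python
# Hypothetical usage sketch for an InternVL-2-4B-based scoring model.
# The repository id, prompt wording, and scoring setup are placeholders, and this
# text-only call (pixel_values=None) skips image preprocessing; see the MMIE-Score
# release for the actual checkpoint, prompt template, and multimodal inputs.
import torch
from transformers import AutoModel, AutoTokenizer

path = "MMIE/MMIE-Score"  # placeholder repository id
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

question = "An example interleaved question."              # placeholder
prediction = "The evaluated model's interleaved answer."   # placeholder
reference = "The ground-truth reference answer."           # placeholder

prompt = (
    "Rate the candidate answer against the reference, considering text quality, "
    "image quality, text-image coherence, and stylistic consistency.\n"
    f"Question: {question}\nCandidate: {prediction}\nReference: {reference}"
)
generation_config = dict(max_new_tokens=256, do_sample=False)

# Assumes the checkpoint exposes the standard InternVL2 `chat` interface via
# trust_remote_code; pass pixel values instead of None for image inputs.
response = model.chat(tokenizer, None, prompt, generation_config)
print(response)
```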

| Statistic | Number | Percentage |
|---|---|---|
| Questions | 20,103 | - |
| - Situational analysis | 5,005 | 24.89% |
| - Project-based learning | 11,482 | 57.12% |
| - Multi-step reasoning | 3,616 | 17.99% |
| Total categories / fields / subfields | 3 / 12 / 102 | - |
| Formats | | |
| - Multiple-choice questions | 663 | 3.40% |
| - Open-ended questions | 19,340 | 96.60% |
| Questions with images | 20,103 | 100% |
| Questions with answer label | 20,103 | 100% |
| Average question length | 76.0 | - |
| Average images per question | 1.32 | - |
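
The headline statistics above can be reproduced from the JSON file with a few lines of Python. As before, the field names ("category", "images", "question") and the file name are assumptions for illustration.

```python
# Sketch: recompute the summary statistics from the consolidated JSON file.
# Field names ("category", "images", "question") and the word-based length
# measure are assumptions for illustration.
import json
from collections import Counter

with open("MMIE.json", "r", encoding="utf-8") as f:
    data = json.load(f)

n = len(data)
by_category = Counter(item["category"] for item in data)
avg_images = sum(len(item["images"]) for item in data) / n
avg_len = sum(len(item["question"].split()) for item in data) / n

print(f"Questions: {n}")
for category, count in by_category.items():
    print(f"  {category}: {count} ({100 * count / n:.2f}%)")
print(f"Average images per question: {avg_images:.2f}")
print(f"Average question length: {avg_len:.1f}")
```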

πŸ”§ Benchmark Details


Distribution of categories and fields in MMIE.

πŸ—‚ Dataset

MMIE evaluates LVLMs across interleaved multimodal comprehension and generation tasks. The dataset is carefully curated to cover a wide range of examples across fields, providing balanced coverage for comprehensive evaluation. These examples test reasoning, cognitive skills, and multimodal alignment, yielding detailed insight into model performance.

βš™οΈ Metric

Pipeline of the scoring model.

The MMIE evaluation metric is built on InternVL-2-4B, a high-performing vision-language model fine-tuned for multimodal reasoning. The pipeline evaluates model outputs along four criteria (a toy aggregation sketch follows the list):

  • Text Quality: Clarity, coherence, and grammar.
  • Image Quality: Vividness and accuracy of image descriptions.
  • Text-Image Coherence: How well visual descriptions support the narrative.
  • Stylistic Consistency: Consistent style and structure throughout text and images.
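
As an illustration of how these four criteria could be combined, the sketch below averages per-criterion sub-scores into an overall score. The equal weights and 0-10 scale are assumptions; the released MMIE-Score pipeline defines the actual rubric and aggregation.

```python
# Toy sketch: combine per-criterion sub-scores into an overall score.
# The criteria mirror the list above; the equal weights and 0-10 scale are
# assumptions, not the released MMIE-Score rubric.
CRITERIA = ("text_quality", "image_quality", "text_image_coherence", "stylistic_consistency")

def aggregate(sub_scores: dict, weights: dict = None) -> float:
    """Return the weighted mean of the four criterion scores."""
    weights = weights or {c: 1.0 for c in CRITERIA}
    total = sum(weights[c] for c in CRITERIA)
    return sum(weights[c] * sub_scores[c] for c in CRITERIA) / total

# Example: hypothetical sub-scores produced by the scoring model for one answer.
print(aggregate({
    "text_quality": 7.0,
    "image_quality": 6.5,
    "text_image_coherence": 8.0,
    "stylistic_consistency": 7.5,
}))  # -> 7.25
```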

Results

Note: higher values indicate better performance for Pearson correlation and cosine similarity, while lower values are better for MSE and MAE.

The MMIE evaluation metric achieves high correlation with human annotations across all aspects of multimodal comprehension and generation. It consistently outperforms alternative metrics such as GPT-4o, making it well suited for large-scale model benchmarking and comparison.
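
For reference, the four agreement statistics mentioned above can be computed as follows, given automated scores and human scores for the same set of responses; the score arrays below are made-up examples.

```python
# Sketch: compare automated scores against human annotations using the four
# statistics reported for the metric (Pearson, cosine similarity, MSE, MAE).
# The score arrays are made-up examples.
import numpy as np
from scipy.stats import pearsonr

auto = np.array([7.0, 5.5, 8.0, 4.0, 6.5])
human = np.array([6.5, 5.0, 8.5, 4.5, 6.0])

pearson, _ = pearsonr(auto, human)
cosine = float(auto @ human / (np.linalg.norm(auto) * np.linalg.norm(human)))
mse = float(np.mean((auto - human) ** 2))
mae = float(np.mean(np.abs(auto - human)))

print(f"Pearson: {pearson:.3f}  Cosine: {cosine:.3f}  MSE: {mse:.3f}  MAE: {mae:.3f}")
```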

πŸ† Leaderboard

MMIE provides a systematic evaluation of existing open-source LVLMs that support interleaved multimodal input and output (interleaved LVLMs), as well as pipelines that pair state-of-the-art LVLMs with text-to-image generative models (integrated LVLMs). For detailed results, please see the paper. The leaderboard is also available on Hugging Face.

Scores on MMIE benchmark.

| Model | Model Type | Situational analysis | Project-based learning | Multi-step reasoning | AVG |
|---|---|---|---|---|---|
| MiniGPT-5 | Interleaved LVLM | 47.63 | 55.12 | 42.17 | 50.92 |
| EMU-2 | Interleaved LVLM | 39.65 | 46.12 | 50.75 | 45.33 |
| GILL | Interleaved LVLM | 46.72 | 57.57 | 39.33 | 51.58 |
| Anole | Interleaved LVLM | 48.95 | 59.05 | 51.72 | 55.22 |
| GPT-4o + Openjourney | Integrated LVLM | 53.05 | 71.40 | 53.67 | 63.65 |
| GPT-4o + SD-3 | Integrated LVLM | 53.00 | 71.20 | 53.67 | 63.52 |
| GPT-4o + SD-XL | Integrated LVLM | 56.12 | 73.25 | 53.67 | 65.47 |
| GPT-4o + Flux | Integrated LVLM | 54.97 | 68.80 | 53.67 | 62.63 |
| Gemini-1.5 + Openjourney | Integrated LVLM | 48.08 | 67.93 | 60.05 | 61.57 |
| Gemini-1.5 + SD-3 | Integrated LVLM | 47.48 | 68.70 | 60.05 | 61.87 |
| Gemini-1.5 + SD-XL | Integrated LVLM | 49.43 | 71.85 | 60.05 | 64.15 |
| Gemini-1.5 + Flux | Integrated LVLM | 47.07 | 68.33 | 60.05 | 61.55 |
| LLAVA-34b + Openjourney | Integrated LVLM | 54.12 | 73.47 | 47.28 | 63.93 |
| LLAVA-34b + SD-3 | Integrated LVLM | 54.72 | 72.55 | 47.28 | 63.57 |
| LLAVA-34b + SD-XL | Integrated LVLM | 55.97 | 74.60 | 47.28 | 65.05 |
| LLAVA-34b + Flux | Integrated LVLM | 54.23 | 71.32 | 47.28 | 62.73 |
| Qwen-VL-70b + Openjourney | Integrated LVLM | 52.73 | 71.63 | 55.63 | 64.05 |
| Qwen-VL-70b + SD-3 | Integrated LVLM | 54.98 | 71.87 | 55.63 | 64.75 |
| Qwen-VL-70b + SD-XL | Integrated LVLM | 52.58 | 73.57 | 55.63 | 65.12 |
| Qwen-VL-70b + Flux | Integrated LVLM | 54.23 | 69.47 | 55.63 | 63.18 |
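
The AVG column appears consistent with a mean of the three category scores weighted by each category's share of questions (24.89% / 57.12% / 17.99% from the statistics table) rather than an unweighted mean; the quick check below reproduces the reported values up to rounding of the displayed per-category scores. This is an observation from the published numbers, not an official definition.

```python
# Sanity check (an observation, not an official definition): the AVG column is
# close to the mean of the three category scores weighted by each category's
# share of questions, up to rounding of the displayed per-category scores.
WEIGHTS = (5005 / 20103, 11482 / 20103, 3616 / 20103)  # situational / project / multi-step

def weighted_avg(situational: float, project: float, multistep: float) -> float:
    return sum(w * s for w, s in zip(WEIGHTS, (situational, project, multistep)))

rows = {
    "MiniGPT-5": ((47.63, 55.12, 42.17), 50.92),
    "Anole": ((48.95, 59.05, 51.72), 55.22),
    "GPT-4o + SD-XL": ((56.12, 73.25, 53.67), 65.47),
}
for name, (scores, reported) in rows.items():
    est = weighted_avg(*scores)
    print(f"{name}: reported {reported:.2f}, weighted mean {est:.2f}")
```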

BibTeX


@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

We would like to express our sincere gratitude to the teams behind InternVL, MiniGPT, EMU, GILL, Anole, LLaVA, Qwen2-VL, Openjourney, Stable Diffusion and Flux for providing open-source models.

License: MIT