M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

CVPR 2026
Hyeongcheol Park1, Jiyoung Seo1, Jaewon Mun1, Hogun Park2,
Wonmin Byeon3, Sung June Kim1, Hyeonsoo Im4, JeungSub Lee4, Sangpil Kim1

1Korea University 2Sungkyunkwan University 3NVIDIA Research 4Hanwha Systems

Overview

An overview of the M$^3$KG construction pipeline. The pipeline consists of three steps: (i) Context-Enriched Triplet Extraction, which rewrites multimodal captions into knowledge-intensive text and extracts entity–relation triplets; (ii) Knowledge Grounding, which links normalized entities to open knowledge bases to obtain candidate descriptions; and (iii) Context-Aware Description Refinement, which selects and rewrites the most context-relevant descriptions for each entity. A Self-Reflection Loop, in which an inspector agent validates or re-runs uncertain outputs, wraps the pipeline to ensure graph quality.
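The three construction steps plus the self-reflection loop can be sketched end to end. This is a minimal illustrative stand-in, not the paper's implementation: the toy triplet extractor, the token-overlap relevance scorer, and the `kb` dictionary all substitute for the actual LLM agents and open knowledge bases.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str

def extract_triplets(caption: str) -> list:
    # Step (i): Context-Enriched Triplet Extraction. In the paper an LLM agent
    # first rewrites the caption into knowledge-intensive text; a naive
    # "subject relation object" split stands in for it here.
    words = caption.rstrip(".").split()
    if len(words) < 3:
        return []
    return [Triplet(words[0], " ".join(words[1:-1]), words[-1])]

def ground_entity(entity: str, kb: dict) -> list:
    # Step (ii): Knowledge Grounding. Link the normalized entity to an open
    # knowledge base and collect candidate descriptions.
    return kb.get(entity.lower(), [])

def refine_description(candidates: list, context: str) -> Optional[str]:
    # Step (iii): Context-Aware Description Refinement. Select the candidate
    # most relevant to the caption context (token overlap as a toy proxy).
    ctx = set(context.lower().split())
    scored = sorted(candidates,
                    key=lambda d: len(ctx & set(d.lower().split())),
                    reverse=True)
    return scored[0] if scored else None

def build_graph(caption: str, kb: dict, max_retries: int = 2) -> list:
    # Self-Reflection Loop: an inspector agent validates the output and
    # re-runs uncertain steps; here "uncertain" means no triplet was found.
    triplets = []
    for _ in range(max_retries + 1):
        triplets = extract_triplets(caption)
        if triplets:  # inspector accepts the output
            break
    return [(t, refine_description(ground_entity(t.head, kb), caption))
            for t in triplets]
```

In the real pipeline each function is a prompted agent call, but the data flow (extract, ground, refine, inspect) is the same.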
Overview of the Multimodal RAG framework. The framework consists of two components: (a) Modality-Wise Retrieval, which retrieves multi-hop triplets aligned with the query from the M$^3$KG; and (b) GRASP (Grounded Retrieval And Selective Pruning), which uses visual and/or audio grounding models to check entity presence and prunes triplets that are off-topic or non-informative. The resulting subgraph is then provided to an MLLM for query-relevant, evidence-grounded audio-visual reasoning.
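The retrieve-then-prune flow of (a) and (b) can likewise be sketched with plain cosine similarity and a stubbed grounding check; `is_grounded` is a hypothetical stand-in for the visual/audio grounding models, and the triplet embeddings are toy vectors rather than real modality encoders.

```python
import math

def cosine(u, v):
    # Similarity in a shared embedding space (stage (a) ranks by this).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_emb, graph, top_k=3):
    # (a) Modality-Wise Retrieval: rank stored triplets by similarity
    # between the query embedding and each triplet's embedding.
    ranked = sorted(graph, key=lambda item: cosine(query_emb, item["emb"]),
                    reverse=True)
    return ranked[:top_k]

def grasp_prune(candidates, is_grounded, query_emb=None, min_score=0.2):
    # (b) GRASP: keep a triplet only if a grounding model confirms its
    # entity is actually present in the input, and drop off-topic or
    # non-informative leftovers below a relevance threshold.
    kept = []
    for item in candidates:
        if not is_grounded(item["head"]):
            continue  # entity not found in the video/audio -> prune
        if query_emb is not None and cosine(query_emb, item["emb"]) < min_score:
            continue  # not answer-supporting for this query -> prune
        kept.append(item)
    return kept
```

The surviving subgraph is what would be serialized into the MLLM prompt as evidence.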

Abstract

Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite this recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) the limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct a multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only the knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding over existing approaches.



Question-Answering Examples



Overall Performance & Win-rate

| MLLM | Method | Audio QA (AudioCaps-QA) | Video QA (VCGPT) | Audio-Visual QA (VALOR) |
| --- | --- | --- | --- | --- |
| VideoLLaMA2 | None | 43.13 | 39.09 | 25.66 |
| VideoLLaMA2 | Wikidata | 43.58 | 38.58 | 26.43 |
| VideoLLaMA2 | VTKG | 43.02 | 38.88 | 25.92 |
| VideoLLaMA2 | M2ConceptBase | 42.19 | 39.31 | 25.93 |
| VideoLLaMA2 | VAT-KG | 44.60 | 39.42 | 28.30 |
| VideoLLaMA2 | M$^3$KG-RAG (Ours) | 53.23 | 39.92 | 29.25 |
| Qwen2.5-Omni | None | 49.00 | 42.21 | 32.42 |
| Qwen2.5-Omni | Wikidata | 49.78 | 40.82 | 30.28 |
| Qwen2.5-Omni | VTKG | 48.95 | 42.96 | 32.70 |
| Qwen2.5-Omni | M2ConceptBase | 49.78 | 42.78 | 32.31 |
| Qwen2.5-Omni | VAT-KG | 51.30 | 43.50 | 35.44 |
| Qwen2.5-Omni | M$^3$KG-RAG (Ours) | 60.77 | 44.35 | 44.67 |
Table 1. Overall performance. We report Model-as-Judge (M.J.) scores (higher is better).
| Baseline | Criterion | AudioCaps-QA (Baseline / Ours) | VCGPT (Baseline / Ours) | VALOR (Baseline / Ours) |
| --- | --- | --- | --- | --- |
| Base | Comprehensiveness | 15.9% / 84.1% | 47.6% / 52.4% | 39.8% / 60.2% |
| Base | Diversity | 20.3% / 79.7% | 37.8% / 62.2% | 45.5% / 54.5% |
| Base | Empowerment | 14.0% / 86.0% | 42.1% / 57.9% | 40.1% / 59.9% |
| Base | Overall | 15.2% / 84.8% | 47.0% / 53.0% | 39.8% / 60.2% |
| Wikidata | Comprehensiveness | 14.9% / 85.1% | 48.3% / 51.7% | 40.3% / 59.7% |
| Wikidata | Diversity | 22.4% / 77.6% | 47.4% / 52.6% | 55.5% / 44.5% |
| Wikidata | Empowerment | 12.0% / 88.0% | 39.6% / 60.4% | 40.8% / 59.2% |
| Wikidata | Overall | 13.7% / 86.3% | 44.5% / 55.5% | 40.8% / 59.2% |
| VTKG | Comprehensiveness | 20.8% / 79.2% | 49.1% / 50.9% | 39.1% / 60.9% |
| VTKG | Diversity | 33.8% / 66.2% | 45.9% / 54.1% | 45.2% / 54.8% |
| VTKG | Empowerment | 21.2% / 78.8% | 46.6% / 53.4% | 39.2% / 60.8% |
| VTKG | Overall | 21.2% / 78.8% | 49.1% / 50.9% | 39.4% / 60.6% |
| M2ConceptBase | Comprehensiveness | 21.2% / 78.8% | 41.8% / 58.2% | 38.2% / 61.8% |
| M2ConceptBase | Diversity | 28.3% / 71.7% | 43.9% / 56.1% | 45.3% / 54.7% |
| M2ConceptBase | Empowerment | 19.7% / 80.3% | 44.6% / 55.4% | 38.6% / 61.4% |
| M2ConceptBase | Overall | 21.0% / 79.0% | 44.3% / 55.7% | 38.3% / 61.7% |
| VAT-KG | Comprehensiveness | 26.1% / 73.9% | 48.4% / 51.6% | 41.4% / 58.6% |
| VAT-KG | Diversity | 34.8% / 65.2% | 46.6% / 53.4% | 48.3% / 51.7% |
| VAT-KG | Empowerment | 24.3% / 75.7% | 43.5% / 56.5% | 42.1% / 57.9% |
| VAT-KG | Overall | 25.6% / 74.4% | 47.6% / 52.4% | 41.8% / 58.2% |
Table 2. Win-rate comparison. Pairwise win rates (%) of each baseline versus M$^3$KG-RAG across three benchmarks and four criteria; each cell shows Baseline % / Ours %.
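For reference, the percentages in Table 2 are plain pairwise win rates. A minimal sketch of how such numbers derive from model-as-judge verdicts (the `judgments` list is a hypothetical input, one verdict per test question):

```python
def win_rates(judgments):
    # judgments: list of "ours" / "baseline" verdicts from a pairwise
    # model-as-judge comparison on a single criterion.
    n = len(judgments)
    ours = sum(1 for j in judgments if j == "ours")
    return {"ours": 100.0 * ours / n, "baseline": 100.0 * (n - ours) / n}
```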
