M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

CVPR 2026
Hyeongcheol Park1, Jiyoung Seo1, Jaewon Mun1, Hogun Park2,
Wonmin Byeon3, Sung June Kim1, Hyeonsoo Im4, JeungSub Lee4, Sangpil Kim1

1Korea University 2Sungkyunkwan University 3NVIDIA Research 4Hanwha Systems

Overview

An overview of the M$^3$KG construction pipeline. The pipeline consists of three steps: (i) Context-Enriched Triplet Extraction, which rewrites multimodal captions into knowledge-intensive text and extracts entity–relation triplets; (ii) Knowledge Grounding, which links normalized entities to open knowledge bases to obtain candidate descriptions; and (iii) Context-Aware Description Refinement, which selects and rewrites the most context-relevant descriptions for each entity. A Self-Reflection Loop, in which an inspector agent validates or re-runs uncertain outputs, wraps the pipeline to ensure graph quality.
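The three construction steps plus the self-reflection loop can be sketched end to end. This is a minimal illustrative stand-in, not the paper's implementation: the toy triplet extractor, the token-overlap relevance scorer, and the `kb` dictionary all substitute for the actual LLM agents and open knowledge bases.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str

def extract_triplets(caption: str) -> list:
    # Step (i): Context-Enriched Triplet Extraction. In the paper an LLM agent
    # first rewrites the caption into knowledge-intensive text; a naive
    # "subject relation object" split stands in for it here.
    words = caption.rstrip(".").split()
    if len(words) < 3:
        return []
    return [Triplet(words[0], " ".join(words[1:-1]), words[-1])]

def ground_entity(entity: str, kb: dict) -> list:
    # Step (ii): Knowledge Grounding. Link the normalized entity to an open
    # knowledge base and collect candidate descriptions.
    return kb.get(entity.lower(), [])

def refine_description(candidates: list, context: str) -> Optional[str]:
    # Step (iii): Context-Aware Description Refinement. Select the candidate
    # most relevant to the caption context (token overlap as a toy proxy).
    ctx = set(context.lower().split())
    scored = sorted(candidates,
                    key=lambda d: len(ctx & set(d.lower().split())),
                    reverse=True)
    return scored[0] if scored else None

def build_graph(caption: str, kb: dict, max_retries: int = 2) -> list:
    # Self-Reflection Loop: an inspector agent validates the output and
    # re-runs uncertain steps; here "uncertain" means no triplet was found.
    triplets = []
    for _ in range(max_retries + 1):
        triplets = extract_triplets(caption)
        if triplets:  # inspector accepts the output
            break
    return [(t, refine_description(ground_entity(t.head, kb), caption))
            for t in triplets]
```

In the real pipeline each function is a prompted agent call, but the data flow (extract, ground, refine, inspect) is the same.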
Overview of the Multimodal RAG framework. The framework consists of two components: (a) Modality-Wise Retrieval, which retrieves multi-hop triplets aligned with the query from the M$^3$KG; and (b) GRASP (Grounded Retrieval And Selective Pruning), which uses visual and/or audio grounding models to check entity presence and prunes triplets that are off-topic or non-informative. The resulting subgraph is then provided to an MLLM for query-relevant, evidence-grounded audio-visual reasoning.
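The retrieve-then-prune flow of (a) and (b) can likewise be sketched with plain cosine similarity and a stubbed grounding check; `is_grounded` is a hypothetical stand-in for the visual/audio grounding models, and the triplet embeddings are toy vectors rather than real modality encoders.

```python
import math

def cosine(u, v):
    # Similarity in a shared embedding space (stage (a) ranks by this).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_emb, graph, top_k=3):
    # (a) Modality-Wise Retrieval: rank stored triplets by similarity
    # between the query embedding and each triplet's embedding.
    ranked = sorted(graph, key=lambda item: cosine(query_emb, item["emb"]),
                    reverse=True)
    return ranked[:top_k]

def grasp_prune(candidates, is_grounded, query_emb=None, min_score=0.2):
    # (b) GRASP: keep a triplet only if a grounding model confirms its
    # entity is actually present in the input, and drop off-topic or
    # non-informative leftovers below a relevance threshold.
    kept = []
    for item in candidates:
        if not is_grounded(item["head"]):
            continue  # entity not found in the video/audio -> prune
        if query_emb is not None and cosine(query_emb, item["emb"]) < min_score:
            continue  # not answer-supporting for this query -> prune
        kept.append(item)
    return kept
```

The surviving subgraph is what would be serialized into the MLLM prompt as evidence.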

Abstract

Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite this recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) the limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct a multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only the knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding over existing approaches.



Question-Answering Examples



Overall Performance & Win-rate

| MLLM | Method | Audio QA (AudioCaps-QA) | Video QA (VCGPT) | Audio-Visual QA (VALOR) |
| --- | --- | --- | --- | --- |
| VideoLLaMA2 | None | 43.13 | 39.09 | 25.66 |
| VideoLLaMA2 | Wikidata | 43.58 | 38.58 | 26.43 |
| VideoLLaMA2 | VTKG | 43.02 | 38.88 | 25.92 |
| VideoLLaMA2 | M2ConceptBase | 42.19 | 39.31 | 25.93 |
| VideoLLaMA2 | VAT-KG | 44.60 | 39.42 | 28.30 |
| VideoLLaMA2 | M$^3$KG-RAG (Ours) | 53.23 | 39.92 | 29.25 |
| Qwen2.5-Omni | None | 49.00 | 42.21 | 32.42 |
| Qwen2.5-Omni | Wikidata | 49.78 | 40.82 | 30.28 |
| Qwen2.5-Omni | VTKG | 48.95 | 42.96 | 32.70 |
| Qwen2.5-Omni | M2ConceptBase | 49.78 | 42.78 | 32.31 |
| Qwen2.5-Omni | VAT-KG | 51.30 | 43.50 | 35.44 |
| Qwen2.5-Omni | M$^3$KG-RAG (Ours) | 60.77 | 44.35 | 44.67 |
Table 1. Overall performance. We report Model-as-Judge (M.J.) scores (higher is better).
| Baseline | Criterion | AudioCaps-QA (Baseline / Ours) | VCGPT (Baseline / Ours) | VALOR (Baseline / Ours) |
| --- | --- | --- | --- | --- |
| Base | Comprehensiveness | 15.9% / 84.1% | 47.6% / 52.4% | 39.8% / 60.2% |
| Base | Diversity | 20.3% / 79.7% | 37.8% / 62.2% | 45.5% / 54.5% |
| Base | Empowerment | 14.0% / 86.0% | 42.1% / 57.9% | 40.1% / 59.9% |
| Base | Overall | 15.2% / 84.8% | 47.0% / 53.0% | 39.8% / 60.2% |
| Wikidata | Comprehensiveness | 14.9% / 85.1% | 48.3% / 51.7% | 40.3% / 59.7% |
| Wikidata | Diversity | 22.4% / 77.6% | 47.4% / 52.6% | 55.5% / 44.5% |
| Wikidata | Empowerment | 12.0% / 88.0% | 39.6% / 60.4% | 40.8% / 59.2% |
| Wikidata | Overall | 13.7% / 86.3% | 44.5% / 55.5% | 40.8% / 59.2% |
| VTKG | Comprehensiveness | 20.8% / 79.2% | 49.1% / 50.9% | 39.1% / 60.9% |
| VTKG | Diversity | 33.8% / 66.2% | 45.9% / 54.1% | 45.2% / 54.8% |
| VTKG | Empowerment | 21.2% / 78.8% | 46.6% / 53.4% | 39.2% / 60.8% |
| VTKG | Overall | 21.2% / 78.8% | 49.1% / 50.9% | 39.4% / 60.6% |
| M2ConceptBase | Comprehensiveness | 21.2% / 78.8% | 41.8% / 58.2% | 38.2% / 61.8% |
| M2ConceptBase | Diversity | 28.3% / 71.7% | 43.9% / 56.1% | 45.3% / 54.7% |
| M2ConceptBase | Empowerment | 19.7% / 80.3% | 44.6% / 55.4% | 38.6% / 61.4% |
| M2ConceptBase | Overall | 21.0% / 79.0% | 44.3% / 55.7% | 38.3% / 61.7% |
| VAT-KG | Comprehensiveness | 26.1% / 73.9% | 48.4% / 51.6% | 41.4% / 58.6% |
| VAT-KG | Diversity | 34.8% / 65.2% | 46.6% / 53.4% | 48.3% / 51.7% |
| VAT-KG | Empowerment | 24.3% / 75.7% | 43.5% / 56.5% | 42.1% / 57.9% |
| VAT-KG | Overall | 25.6% / 74.4% | 47.6% / 52.4% | 41.8% / 58.2% |
Table 2. Win-rate comparison. Pairwise win rates (%) of each baseline versus M$^3$KG-RAG across three benchmarks and four criteria; each cell shows Baseline % / Ours %.
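For reference, the percentages in Table 2 are plain pairwise win rates. A minimal sketch of how such numbers derive from model-as-judge verdicts (the `judgments` list is a hypothetical input, one verdict per test question):

```python
def win_rates(judgments):
    # judgments: list of "ours" / "baseline" verdicts from a pairwise
    # model-as-judge comparison on a single criterion.
    n = len(judgments)
    ours = sum(1 for j in judgments if j == "ours")
    return {"ours": 100.0 * ours / n, "baseline": 100.0 * (n - ours) / n}
```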
