Multimodal large language models (MLLMs) rely heavily on vision encoders to understand diverse image content. While recent approaches have explored combining multiple vision experts to address the limitations of single encoders, they typically perform image-level expert selection and fusion, ignoring the spatial heterogeneity within images where different regions may benefit from different experts.
In this paper, we propose ViMoE (Vision Mixture of Experts with Multimodal Context Awareness), a novel MLLM that introduces three key innovations: (1) Token-Level Sparse Expert Activation (TLSEA), which lets each spatial token select and weight its own experts; (2) Hierarchical Context Aggregation (HCA), which supplies multi-scale visual and textual context for routing; and (3) Expert Confidence Calibration (ECC), which down-weights unreliable expert features.
Extensive experiments demonstrate that ViMoE achieves significant improvements over state-of-the-art methods across challenging multimodal benchmarks including MME, MMBench, and various VQA tasks, while maintaining computational efficiency through sparse activation patterns.
Unlike image-level expert selection, TLSEA enables each spatial token to independently select and weight its expert contributions. For each token, we compute routing logits combining local token features and global context, then select top-k experts per token for computational efficiency. This allows different image regions to utilize different expert combinations based on their local content without processing unnecessary experts.
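To make the per-token routing concrete, the following PyTorch-style sketch shows one way local token features and a global context vector could be combined into routing logits with top-k selection. The module name, layer shapes, and the additive combination of local and global logits are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelRouter(nn.Module):
    """Minimal sketch of per-token expert routing (names and dimensions are illustrative)."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        self.local_proj = nn.Linear(dim, num_experts)   # logits from each token's own features
        self.global_proj = nn.Linear(dim, num_experts)  # logits from a shared context vector

    def forward(self, tokens: torch.Tensor, context: torch.Tensor):
        # tokens:  (B, N, dim) spatial token features from the base encoder
        # context: (B, dim)    global image-instruction context
        logits = self.local_proj(tokens) + self.global_proj(context).unsqueeze(1)  # (B, N, E)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)          # renormalize over the selected experts
        # Scatter back to a dense (B, N, E) weight map; unselected experts get zero weight.
        dense = torch.zeros_like(logits).scatter(-1, topk_idx, weights)
        return dense, topk_idx

# Example: 8 experts, route each of 576 tokens to its top-3 experts.
router = TokenLevelRouter(dim=1024, num_experts=8, top_k=3)
weights, idx = router(torch.randn(2, 576, 1024), torch.randn(2, 1024))
```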
Expert routing decisions should consider both local details and global semantics. HCA aggregates visual context at multiple scales ({1, 2, 4}) to capture information at different granularities—from global image context to fine-grained region patterns. It further fuses multi-scale visual context with textual context from the user instruction through a learned gating mechanism, providing rich, multi-granular information for routing decisions.
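A hedged sketch of multi-scale context aggregation follows. The pooling scales {1, 2, 4} come from the description above; the specific pooling operator, the gated fusion with the instruction embedding, and all layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalContextAggregator(nn.Module):
    """Minimal sketch: multi-scale visual context fused with text context via a learned gate."""

    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.visual_proj = nn.Linear(dim * len(scales), dim)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, text_ctx: torch.Tensor):
        # tokens:   (B, N, dim) with N = H*W spatial tokens
        # text_ctx: (B, dim)    pooled instruction embedding
        B, N, D = tokens.shape
        side = int(N ** 0.5)
        grid = tokens.transpose(1, 2).reshape(B, D, side, side)
        # Pool the token grid to 1x1, 2x2, and 4x4, then average each scale to one vector.
        pooled = [F.adaptive_avg_pool2d(grid, s).flatten(2).mean(-1) for s in self.scales]
        visual_ctx = self.visual_proj(torch.cat(pooled, dim=-1))      # (B, dim)
        g = self.gate(torch.cat([visual_ctx, text_ctx], dim=-1))      # (B, dim), values in [0, 1]
        return g * visual_ctx + (1 - g) * text_ctx                    # fused routing context

ctx = HierarchicalContextAggregator(dim=1024)(torch.randn(2, 576, 1024), torch.randn(2, 1024))
```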
Not all expert contributions are equally reliable—some experts may produce noisy or inconsistent features for certain inputs. ECC learns to estimate the confidence of each expert's features based on two factors: (1) intrinsic feature quality measured by a learned estimator, and (2) consistency with the base encoder's features. The combined confidence score is then used to calibrate routing weights, adaptively reducing the influence of low-confidence expert features while preserving high-confidence contributions.
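The sketch below illustrates one plausible form of the calibration step: a learned quality head scores each expert's features, cosine similarity to the base encoder measures consistency, and their product rescales the routing weights. The head architecture and the way the two scores are combined are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertConfidenceCalibrator(nn.Module):
    """Minimal sketch of confidence-based calibration of routing weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.quality_head = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                          nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, expert_feats: torch.Tensor, base_feats: torch.Tensor,
                routing_weights: torch.Tensor):
        # expert_feats:    (B, N, E, dim) per-token features from each expert, aligned to dim
        # base_feats:      (B, N, dim)    features from the base encoder
        # routing_weights: (B, N, E)      weights produced by the token-level router
        quality = self.quality_head(expert_feats).squeeze(-1)                # (B, N, E)
        consistency = F.cosine_similarity(expert_feats,
                                          base_feats.unsqueeze(2), dim=-1)   # (B, N, E)
        confidence = quality * consistency.clamp(min=0.0)                    # combined score
        calibrated = routing_weights * confidence
        # Renormalize so per-token weights still sum to one over the selected experts.
        return calibrated / calibrated.sum(-1, keepdim=True).clamp(min=1e-6)
```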
The overall architecture of ViMoE integrates three novel components for precise, content-aware expert utilization in multimodal understanding. Given an input image and user instruction, ViMoE extracts features from a base vision encoder (CLIP ViT-L-336px) and task-specific vision experts including DINOv2 (self-supervised local features), Pix2Struct (document understanding), Deplot (chart understanding), SAM (segmentation), Co-DETR (object detection), Vary (OCR), and BiomedCLIP (medical images). The ViMoE-Adapter then applies hierarchical context aggregation to capture multi-scale visual and textual information, enables token-level expert routing for spatially-adaptive feature fusion, and calibrates expert contributions based on confidence scores before projecting to the LLM embedding space.
Figure 2. The ViMoE framework comprises: (1) a base vision encoder (CLIP ViT-L), (2) task-specific vision experts (DINOv2, Pix2Struct, Deplot, SAM, etc.), (3) the ViMoE-Adapter with our three novel modules, and (4) a large language model.
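For the adapter's final step, a brief sketch of how calibrated per-token weights could mix expert features with the base features before projection to the LLM embedding space is shown below. The residual-style fusion and the dimensions are assumptions; they are not taken from the paper.

```python
import torch
import torch.nn as nn

def fuse_and_project(base_feats, expert_feats, calibrated_weights, proj: nn.Linear):
    """Minimal sketch of the fusion and projection step.
    base_feats:         (B, N, dim)     base-encoder token features
    expert_feats:       (B, N, E, dim)  expert features aligned to the base dimension
    calibrated_weights: (B, N, E)       confidence-calibrated routing weights
    proj:               linear map into the LLM embedding space
    """
    mixture = (calibrated_weights.unsqueeze(-1) * expert_feats).sum(dim=2)  # (B, N, dim)
    return proj(base_feats + mixture)  # residual-style fusion, then projection (an assumption)

proj = nn.Linear(1024, 4096)           # e.g., ViT width -> LLM hidden size (illustrative)
vision_tokens = fuse_and_project(torch.randn(2, 576, 1024),
                                 torch.randn(2, 576, 8, 1024),
                                 torch.rand(2, 576, 8),
                                 proj)
```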
TLSEA, the core innovation of ViMoE, enables each spatial token to independently route to different vision experts based on its local content. Unlike image-level approaches that apply uniform expert weights across all spatial tokens, TLSEA recognizes that different image regions contain diverse content requiring different expertise. For each token, local routing logits capture content-specific preferences (e.g., text regions prefer document experts), while global routing logits provide a consistent bias based on the overall image-instruction context. We apply a load-balancing objective to encourage uniform expert utilization and prevent routing collapse (sketched after Figure 3). With top-k selection (k=3), we maintain sparsity for computational efficiency while enabling fine-grained, spatially-adaptive expert utilization.
Figure 3. Each token computes routing scores based on its local features and global context. Top-k experts are selected per token, enabling spatially-adaptive expert utilization while maintaining computational efficiency through sparsity.
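The load-balancing term is not spelled out in this section; a common choice, shown below purely as an assumption, is the Switch-Transformer-style auxiliary loss that pairs each expert's average routing probability with the fraction of tokens routed to it.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, topk_idx: torch.Tensor, num_experts: int):
    """Sketch of a standard Switch-style balancing loss to discourage routing collapse.
    router_logits: (B, N, E) dense routing logits
    topk_idx:      (B, N, k) indices of the experts selected per token
    """
    probs = F.softmax(router_logits, dim=-1)                       # (B, N, E)
    mean_prob = probs.mean(dim=(0, 1))                             # average routing prob per expert
    counts = F.one_hot(topk_idx, num_experts).sum(dim=2).float()   # (B, N, E) selection mask
    load = counts.mean(dim=(0, 1))                                 # fraction of tokens per expert
    return num_experts * torch.sum(mean_prob * load)
```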
The motivation for token-level expert activation stems from a critical observation: single vision encoders have inherent limitations. CLIP excels at general image understanding but struggles with fine-grained tasks like document parsing and chart comprehension. Recent approaches like MoVA address this by combining multiple task-specific experts, but they make a limiting assumption: all spatial tokens in an image should use the same set of experts.
In reality, documents contain both text paragraphs and embedded charts. Natural images have both foreground objects and background scenes. Medical images combine anatomical structures with text annotations. ViMoE recognizes this spatial heterogeneity within images and routes different regions to different experts. Text regions route to document experts (Pix2Struct, Vary), chart regions route to visualization experts (Deplot), and scene regions route to general-purpose encoders (DINOv2). This fine-grained routing enables precise knowledge extraction without "averaging out" expert strengths and weaknesses.
Empirical results validate this approach: ViMoE achieves +1.6% on ChartQA, +1.4% on DocVQA, and +1.3% on GQA compared to MoVA, with the largest improvements on tasks involving diverse content types. The overhead is minimal—token-level routing adds less than 2% latency while enabling significantly better expert utilization.
ViMoE achieves state-of-the-art performance across diverse multimodal benchmarks.
| Method | LLM | MME-P | MME-C | MMBench | Q-Bench | MathVista | POPE |
|---|---|---|---|---|---|---|---|
| GPT-4V | - | 1409 | 517 | 75.1 | 73.5 | 47.8 | - |
| LLaVA-1.5 | Vicuna-7B | 1510 | 316 | 64.3 | 58.7 | 25.5 | 85.9 |
| InternVL-1.5 | InternLM-7B | 1563 | 345 | 72.5 | 68.4 | 36.7 | 88.5 |
| MoVA | Llama3-8B | 1595.8 | 347.5 | 75.3 | 70.8 | 37.7 | 89.3 |
| ViMoE (Ours) | Llama3-8B | 1612.3 | 358.2 | 76.8 | 72.3 | 39.2 | 90.1 |
Compared to MoVA, ViMoE gains +16.5 on MME-P, +1.5 on MMBench, and +1.5 on MathVista.

| Method | LLM | VQAv2 | GQA | TextVQA | ChartQA | DocVQA | AI2D |
|---|---|---|---|---|---|---|---|
| LLaVA-NeXT | Vicuna-7B | 81.8 | 64.2 | 64.9 | 54.2 | 74.4 | 66.9 |
| CogAgent | Vicuna-7B | - | - | 76.1 | 68.4 | 81.6 | - |
| InternVL-1.5 | InternLM-7B | 82.1 | 64.5 | 72.5 | 68.2 | 82.1 | 74.5 |
| MoVA | Llama3-8B | 83.5 | 65.2 | 77.1 | 70.5 | 83.8 | 77.0 |
| ViMoE (Ours) | Llama3-8B | 84.1 | 66.5 | 78.3 | 72.1 | 85.2 | 78.4 |
Compared to MoVA, ViMoE gains +1.6% on ChartQA, +1.4% on DocVQA, and +1.3% on GQA.
Visualizing token-level routing decisions shows how ViMoE adaptively assigns different image regions to appropriate experts based on content type.
Figure 4. Visualization of token-level expert routing. Different image regions are routed to different experts based on content: text regions prefer Pix2Struct, chart regions prefer Deplot, and natural scenes prefer DINOv2.