Multimodal large language models (MLLMs) rely heavily on vision encoders to understand diverse image content. While recent approaches have explored combining multiple vision experts to address the limitations of single encoders, they typically perform image-level expert selection and fusion, ignoring the spatial heterogeneity within images where different regions may benefit from different experts.
In this paper, we propose ViMoE (Vision Mixture of Experts with Multimodal Context Awareness), a novel MLLM that introduces three key innovations: (1) Token-Level Sparse Expert Activation (TLSEA), which lets each spatial token select and weight its own experts; (2) Hierarchical Context Aggregation (HCA), which supplies multi-scale visual and textual context for routing; and (3) Expert Confidence Calibration (ECC), which down-weights unreliable expert features.
Extensive experiments demonstrate that ViMoE achieves significant improvements over state-of-the-art methods across challenging multimodal benchmarks including MME, MMBench, and various VQA tasks, while maintaining computational efficiency through sparse activation patterns.
Unlike image-level expert selection, TLSEA enables each spatial token to independently select and weight its expert contributions. For each token, we compute routing logits combining local token features and global context, then select top-k experts per token for computational efficiency. This allows different image regions to utilize different expert combinations based on their local content without processing unnecessary experts.
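To make the per-token routing concrete, the following PyTorch-style sketch shows one way local token features and a global context vector could be combined into routing logits with top-k selection. The module name, layer shapes, and the additive combination of local and global logits are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelRouter(nn.Module):
    """Minimal sketch of per-token expert routing (names and dimensions are illustrative)."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        self.local_proj = nn.Linear(dim, num_experts)   # logits from each token's own features
        self.global_proj = nn.Linear(dim, num_experts)  # logits from a shared context vector

    def forward(self, tokens: torch.Tensor, context: torch.Tensor):
        # tokens:  (B, N, dim) spatial token features from the base encoder
        # context: (B, dim)    global image-instruction context
        logits = self.local_proj(tokens) + self.global_proj(context).unsqueeze(1)  # (B, N, E)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)          # renormalize over the selected experts
        # Scatter back to a dense (B, N, E) weight map; unselected experts get zero weight.
        dense = torch.zeros_like(logits).scatter(-1, topk_idx, weights)
        return dense, topk_idx

# Example: 8 experts, route each of 576 tokens to its top-3 experts.
router = TokenLevelRouter(dim=1024, num_experts=8, top_k=3)
weights, idx = router(torch.randn(2, 576, 1024), torch.randn(2, 1024))
```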
Expert routing decisions should consider both local details and global semantics. HCA aggregates visual context at multiple scales ({1, 2, 4}) to capture information at different granularities—from global image context to fine-grained region patterns. It further fuses multi-scale visual context with textual context from the user instruction through a learned gating mechanism, providing rich, multi-granular information for routing decisions.
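A hedged sketch of multi-scale context aggregation follows. The pooling scales {1, 2, 4} come from the description above; the specific pooling operator, the gated fusion with the instruction embedding, and all layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalContextAggregator(nn.Module):
    """Minimal sketch: multi-scale visual context fused with text context via a learned gate."""

    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.visual_proj = nn.Linear(dim * len(scales), dim)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, text_ctx: torch.Tensor):
        # tokens:   (B, N, dim) with N = H*W spatial tokens
        # text_ctx: (B, dim)    pooled instruction embedding
        B, N, D = tokens.shape
        side = int(N ** 0.5)
        grid = tokens.transpose(1, 2).reshape(B, D, side, side)
        # Pool the token grid to 1x1, 2x2, and 4x4, then average each scale to one vector.
        pooled = [F.adaptive_avg_pool2d(grid, s).flatten(2).mean(-1) for s in self.scales]
        visual_ctx = self.visual_proj(torch.cat(pooled, dim=-1))      # (B, dim)
        g = self.gate(torch.cat([visual_ctx, text_ctx], dim=-1))      # (B, dim), values in [0, 1]
        return g * visual_ctx + (1 - g) * text_ctx                    # fused routing context

ctx = HierarchicalContextAggregator(dim=1024)(torch.randn(2, 576, 1024), torch.randn(2, 1024))
```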
Not all expert contributions are equally reliable—some experts may produce noisy or inconsistent features for certain inputs. ECC learns to estimate the confidence of each expert's features based on two factors: (1) intrinsic feature quality measured by a learned estimator, and (2) consistency with the base encoder's features. The combined confidence score is then used to calibrate routing weights, adaptively reducing the influence of low-confidence expert features while preserving high-confidence contributions.
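The sketch below illustrates one plausible form of the calibration step: a learned quality head scores each expert's features, cosine similarity to the base encoder measures consistency, and their product rescales the routing weights. The head architecture and the way the two scores are combined are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertConfidenceCalibrator(nn.Module):
    """Minimal sketch of confidence-based calibration of routing weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.quality_head = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                          nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, expert_feats: torch.Tensor, base_feats: torch.Tensor,
                routing_weights: torch.Tensor):
        # expert_feats:    (B, N, E, dim) per-token features from each expert, aligned to dim
        # base_feats:      (B, N, dim)    features from the base encoder
        # routing_weights: (B, N, E)      weights produced by the token-level router
        quality = self.quality_head(expert_feats).squeeze(-1)                # (B, N, E)
        consistency = F.cosine_similarity(expert_feats,
                                          base_feats.unsqueeze(2), dim=-1)   # (B, N, E)
        confidence = quality * consistency.clamp(min=0.0)                    # combined score
        calibrated = routing_weights * confidence
        # Renormalize so per-token weights still sum to one over the selected experts.
        return calibrated / calibrated.sum(-1, keepdim=True).clamp(min=1e-6)
```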
The overall architecture of ViMoE integrates three novel components for precise, content-aware expert utilization in multimodal understanding. Given an input image and user instruction, ViMoE extracts features from a base vision encoder (CLIP ViT-L-336px) and task-specific vision experts including DINOv2 (self-supervised local features), Pix2Struct (document understanding), Deplot (chart understanding), SAM (segmentation), Co-DETR (object detection), Vary (OCR), and BiomedCLIP (medical images). The ViMoE-Adapter then applies hierarchical context aggregation to capture multi-scale visual and textual information, enables token-level expert routing for spatially-adaptive feature fusion, and calibrates expert contributions based on confidence scores before projecting to the LLM embedding space.
Figure 2. The ViMoE framework comprises: (1) a base vision encoder (CLIP ViT-L), (2) task-specific vision experts (DINOv2, Pix2Struct, Deplot, SAM, etc.), (3) the ViMoE-Adapter with our three novel modules, and (4) a large language model.
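For the adapter's final step, a brief sketch of how calibrated per-token weights could mix expert features with the base features before projection to the LLM embedding space is shown below. The residual-style fusion and the dimensions are assumptions; they are not taken from the paper.

```python
import torch
import torch.nn as nn

def fuse_and_project(base_feats, expert_feats, calibrated_weights, proj: nn.Linear):
    """Minimal sketch of the fusion and projection step.
    base_feats:         (B, N, dim)     base-encoder token features
    expert_feats:       (B, N, E, dim)  expert features aligned to the base dimension
    calibrated_weights: (B, N, E)       confidence-calibrated routing weights
    proj:               linear map into the LLM embedding space
    """
    mixture = (calibrated_weights.unsqueeze(-1) * expert_feats).sum(dim=2)  # (B, N, dim)
    return proj(base_feats + mixture)  # residual-style fusion, then projection (an assumption)

proj = nn.Linear(1024, 4096)           # e.g., ViT width -> LLM hidden size (illustrative)
vision_tokens = fuse_and_project(torch.randn(2, 576, 1024),
                                 torch.randn(2, 576, 8, 1024),
                                 torch.rand(2, 576, 8),
                                 proj)
```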
TLSEA, the core innovation of ViMoE, enables each spatial token to independently route to different vision experts based on its local content. Unlike image-level approaches that apply uniform expert weights across all spatial tokens, TLSEA recognizes that different image regions contain diverse content requiring different expertise. For each token, local routing logits capture content-specific preferences (e.g., text regions prefer document experts), while global routing logits provide a consistent bias based on the overall image-instruction context. We apply a load-balancing objective to encourage uniform expert utilization and prevent routing collapse (sketched after Figure 3). With top-k selection (k=3), we maintain sparsity for computational efficiency while enabling fine-grained, spatially-adaptive expert utilization.
Figure 3. Each token computes routing scores based on its local features and global context. Top-k experts are selected per token, enabling spatially-adaptive expert utilization while maintaining computational efficiency through sparsity.
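The load-balancing term is not spelled out in this section; a common choice, shown below purely as an assumption, is the Switch-Transformer-style auxiliary loss that pairs each expert's average routing probability with the fraction of tokens routed to it.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, topk_idx: torch.Tensor, num_experts: int):
    """Sketch of a standard Switch-style balancing loss to discourage routing collapse.
    router_logits: (B, N, E) dense routing logits
    topk_idx:      (B, N, k) indices of the experts selected per token
    """
    probs = F.softmax(router_logits, dim=-1)                       # (B, N, E)
    mean_prob = probs.mean(dim=(0, 1))                             # average routing prob per expert
    counts = F.one_hot(topk_idx, num_experts).sum(dim=2).float()   # (B, N, E) selection mask
    load = counts.mean(dim=(0, 1))                                 # fraction of tokens per expert
    return num_experts * torch.sum(mean_prob * load)
```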
The motivation for token-level expert activation stems from a critical observation: single vision encoders have inherent limitations. CLIP excels at general image understanding but struggles with fine-grained tasks like document parsing and chart comprehension. Recent approaches like MoVA address this by combining multiple task-specific experts, but they make a limiting assumption: all spatial tokens in an image should use the same set of experts.
In reality, documents contain both text paragraphs and embedded charts. Natural images have both foreground objects and background scenes. Medical images combine anatomical structures with text annotations. ViMoE recognizes this spatial heterogeneity within images and routes different regions to different experts. Text regions route to document experts (Pix2Struct, Vary), chart regions route to visualization experts (Deplot), and scene regions route to general-purpose encoders (DINOv2). This fine-grained routing enables precise knowledge extraction without "averaging out" expert strengths and weaknesses.
Empirical results validate this approach: ViMoE achieves +1.6% on ChartQA, +1.4% on DocVQA, and +1.3% on GQA compared to MoVA, with the largest improvements on tasks involving diverse content types. The overhead is minimal—token-level routing adds less than 2% latency while enabling significantly better expert utilization.
ViMoE achieves state-of-the-art performance across diverse multimodal benchmarks.
| Method | LLM | MME-P | MME-C | MMBench | Q-Bench | MathVista | POPE |
|---|---|---|---|---|---|---|---|
| GPT-4V | - | 1409 | 517 | 75.1 | 73.5 | 47.8 | - |
| LLaVA-1.5 | Vicuna-7B | 1510 | 316 | 64.3 | 58.7 | 25.5 | 85.9 |
| InternVL-1.5 | InternLM-7B | 1563 | 345 | 72.5 | 68.4 | 36.7 | 88.5 |
| MoVA | Llama3-8B | 1595.8 | 347.5 | 75.3 | 70.8 | 37.7 | 89.3 |
| ViMoE (Ours) | Llama3-8B | 1612.3 | 358.2 | 76.8 | 72.3 | 39.2 | 90.1 |
Compared to MoVA, ViMoE gains +16.5 on MME-P, +1.5 on MMBench, and +1.5 on MathVista.

| Method | LLM | VQAv2 | GQA | TextVQA | ChartQA | DocVQA | AI2D |
|---|---|---|---|---|---|---|---|
| LLaVA-NeXT | Vicuna-7B | 81.8 | 64.2 | 64.9 | 54.2 | 74.4 | 66.9 |
| CogAgent | Vicuna-7B | - | - | 76.1 | 68.4 | 81.6 | - |
| InternVL-1.5 | InternLM-7B | 82.1 | 64.5 | 72.5 | 68.2 | 82.1 | 74.5 |
| MoVA | Llama3-8B | 83.5 | 65.2 | 77.1 | 70.5 | 83.8 | 77.0 |
| ViMoE (Ours) | Llama3-8B | 84.1 | 66.5 | 78.3 | 72.1 | 85.2 | 78.4 |
Compared to MoVA, ViMoE gains +1.6% on ChartQA, +1.4% on DocVQA, and +1.3% on GQA.
Visualizing token-level routing decisions shows how ViMoE adaptively assigns different image regions to appropriate experts based on content type.
Figure 4. Visualization of token-level expert routing. Different image regions are routed to different experts based on content: text regions prefer Pix2Struct, chart regions prefer Deplot, and natural scenes prefer DINOv2.