AMD × vLLM Semantic Router: Building Trustworthy, Evolvable Mixture-of-Models AI Infrastructure

December 10, 2025 · 8 min read

AI Networking @ Tencent

AMD is a long-term technology partner for the vLLM community: from accelerating the vLLM engine on AMD GPUs and ROCm™ Software to now exploring the next layer of the stack with vLLM Semantic Router (VSR).

As AI moves from single models to Mixture-of-Models (MoM) architectures, the challenge shifts from "how big is your model" to how intelligently you orchestrate many models together.

VSR is the intelligent brain for multi-model AI. Together with AMD, we are building this layer to be:

System Intelligence – semantic understanding, intent classification, and dynamic decision-making.
Trustworthy – hallucination-aware, risk-sensitive, and observable.
Evolvable – continuously improving via data, feedback, and experimentation.
Production-ready on AMD – from CI/CD → online playgrounds → large-scale production.

Semantic Tool Selection: Building Smarter AI Agents with Context-Aware Routing

November 7, 2025 · 11 min read

Xunzhuo Liu

AI Networking @ Tencent

Huamin Chen

Distinguished Engineer @ Red Hat

Anthropic recently published an insightful blog post on code execution with MCP, highlighting a critical challenge in modern AI systems: as agents connect to more tools, loading all tool definitions upfront becomes increasingly inefficient. Their solution—using code execution to load tools on-demand—demonstrates how established software engineering patterns can dramatically improve agent efficiency.

This resonates deeply with our experience building the vLLM Semantic Router. We've observed the same problem from a different angle: when AI agents have access to hundreds or thousands of tools, how do they know which tools are relevant for a given task?

Our solution: semantic tool selection—using semantic similarity to automatically select the most relevant tools for each user query before the request even reaches the LLM.
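The idea can be sketched in a few lines of Python. This is an illustration only, not the router's actual implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `select_tools`, `top_k`, `threshold`, and the tool names are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_tools(query, tools, top_k=2, threshold=0.1):
    """Return up to top_k tool names whose description is similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in tools.items()]
    # Drop tools below the similarity threshold, then rank best-first.
    scored = [(s, n) for s, n in scored if s >= threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical tool catalog: name -> natural-language description.
TOOLS = {
    "get_weather": "get the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_flights": "search flights between two airports on a date",
}

selected = select_tools("what is the weather forecast in Paris", TOOLS, top_k=1)
print(selected)
```

Only the tools that pass the filter would be attached to the request, so the LLM never sees the definitions of irrelevant tools. A production system would swap the toy `embed` for a real sentence-embedding model and precompute the tool vectors once.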

From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA

October 25, 2025 · 9 min read

Ivar Flakstad

Machine Learning @ Hugging Face

OneZero-Y

LLM Inference

Huamin Chen

Distinguished Engineer @ Red Hat

Xunzhuo Liu

AI Networking @ Tencent

Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models. This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization.
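To see why LoRA can soften the linear cost growth, consider a single linear layer. With LoRA, each task keeps the frozen base weight W and adds a low-rank update, so the output is Wx + B(Ax). The sketch below is a deliberately simplified, pure-Python illustration under that assumption: it computes the shared Wx once and applies one small adapter per task. The function names, the task names, and the toy matrices are hypothetical and are not taken from the router's Rust code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_outputs(base_weight, adapters, x, scale=1.0):
    """Compute W x once, then add each task's low-rank update B (A x)."""
    base = matvec(base_weight, x)  # shared across all tasks
    outputs = {}
    for task, (A, B) in adapters.items():
        delta = matvec(B, matvec(A, x))  # rank-r path: cheap when r is small
        outputs[task] = [b + scale * d for b, d in zip(base, delta)]
    return outputs


# Frozen 2x3 base weight shared by every task.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# One rank-1 adapter per task: A is 1x3, B is 2x1.
adapters = {
    "intent": ([[1.0, 1.0, 1.0]], [[0.5], [0.0]]),
    "jailbreak": ([[0.0, 0.0, 1.0]], [[0.0], [2.0]]),
}

x = [1.0, 2.0, 3.0]
outputs = lora_outputs(W, adapters, x)
print(outputs)
```

Adding a task here means adding one small (A, B) pair rather than a full copy of the model. How much of the forward pass can actually be shared in a real transformer depends on which layers carry adapters.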

Synced from the official vLLM blog.

Background: From BERT to a Modular System

The previous implementation relied primarily on BERT and ModernBERT for intent and jailbreak classification. While ModernBERT performs well for English text classification tasks, it has the following limitations:

Language Coverage: The original ModernBERT's multilingual support is limited compared to models trained on more diverse datasets. (Note: mmBERT, a massively multilingual variant of ModernBERT supporting 1800+ languages, was released after this refactoring began and represents an alternative approach to the multilingual challenge.)
Context Length: ModernBERT extends context to 8,192 tokens using RoPE (source), whereas models like Qwen3-Embedding support up to 32,768 tokens, which helps when processing very long documents.
Model Coupling: Classification logic was tightly coupled to specific model architectures, making it difficult to add new models.

These constraints motivated a broader refactoring that would enable the system to support multiple model types while maintaining performance. The modular architecture means that newer models like mmBERT can be integrated alongside Qwen3-Embedding and EmbeddingGemma, allowing the router to select the most appropriate model for each task.

Architectural Restructuring

Semantic Router Q4 2025 Roadmap: Journey to Iris

October 20, 2025 · 15 min read

Xunzhuo Liu

AI Networking @ Tencent

Huamin Chen

Distinguished Engineer @ Red Hat

Chen Wang

Senior Staff Research Scientist @ IBM

Yue Zhu

Staff Research Scientist @ IBM

As we approach the end of 2025, we're excited to share our Q4 2025 roadmap for vLLM Semantic Router. This quarter marks a significant milestone in our project's evolution as we prepare for our first major release: v0.1, codename "Iris", expected in late 2025 to early 2026.

vLLM Semantic Router: Next Phase in LLM inference

September 6, 2025 · 5 min read

Huamin Chen

Distinguished Engineer @ Red Hat

Chen Wang

Senior Staff Research Scientist @ IBM

Yue Zhu

Staff Research Scientist @ IBM

Xunzhuo Liu

AI Networking @ Tencent

Synced from official vLLM Blog: vLLM Semantic Router: Next Phase in LLM inference