Semantic Router Q4 2025 Roadmap: Journey to Iris
As we approach the end of 2025, we're excited to share our Q4 2025 roadmap for vLLM Semantic Router. This quarter marks a significant milestone in our project's evolution as we prepare for our first major release: v0.1, codename "Iris", expected in late 2025 to early 2026.

About Our Release Naming Convention
Starting with v0.1, each major release of vLLM Semantic Router will carry a codename inspired by figures from Greek mythology. These names reflect the essence and purpose of each release, connecting ancient wisdom with modern AI infrastructure.
Our inaugural release is named Iris (Ἶρις), after the Greek goddess of the rainbow and divine messenger of the Olympian gods. In mythology, Iris served as the swift-footed messenger who bridged the gap between gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. She personified the connection between heaven and earth, ensuring that communication flowed seamlessly across different realms.
This symbolism perfectly captures the essence of vLLM Semantic Router: a system that bridges the gap between users and diverse AI models, intelligently routing requests across different LLM providers and architectures. Just as Iris connected different worlds through her rainbow bridge, our router connects applications to the right models through intelligent semantic understanding. The rainbow itself—a spectrum of colors working in harmony—mirrors our vision of orchestrating multiple models in a unified, efficient system.
With the Iris release, we're establishing the foundation for reliable, intelligent, and secure AI model routing that will serve as the bridge for modern AI applications.
Q4 2025 Focus Areas
Our Q4 roadmap centers on seven critical pillars that will transform vLLM Semantic Router from an experimental project into a production-ready platform. These initiatives address the most pressing needs identified by our community and represent the essential groundwork for v0.1.
1. Semantic Chain for Fusion Intelligent Routing
The Challenge
Current routing relies exclusively on ModernBERT classification for semantic understanding. While powerful, this approach has limitations: it cannot perform deterministic routing based on specific keywords, lacks pattern-based detection for safety and compliance, and misses opportunities for specialized domain classification that could enhance routing accuracy and flexibility.
The Innovation
We're introducing a unified content scanning and routing framework that extends semantic routing with four complementary signal sources, all integrated through a Signal Fusion Layer:
1. Keyword-Based Routing
- Deterministic, fast Boolean logic for exact term matching
- Route queries containing "kubernetes" or "CVE-" patterns directly to specialized models
- Eliminate unnecessary ML inference for technology-specific queries
2. Regex Content Scanning
- Pattern-based detection for safety, compliance, and structured data
- Guaranteed blocking of PII patterns (SSN, credit cards) with no ML false negatives
- RE2 engine, whose linear-time matching guards against ReDoS in security-critical applications
3. Embedding Similarity Scanning
- Semantic concept detection robust to paraphrasing
- Detect "multi-step reasoning" intent even when phrased as "explain thoroughly"
- Reuses existing BERT embedder for zero additional model overhead
4. Domain Classification
- In-Tree BERT Classification: Lightweight BERT-based domain classifiers running directly in the router process for low-latency intent detection
- Out-of-Tree MCP Classification: Advanced domain-specific classifiers deployed as MCP servers for specialized routing scenarios (legal, medical, financial domains)
- Hierarchical classification with confidence scoring for multi-domain queries
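To make the first two signal sources concrete, here is a minimal Go sketch of a keyword provider and a regex provider. The names (`Signal`, `keywordSignal`, `regexSignal`) and the SSN pattern are illustrative assumptions, not the project's actual API; note that Go's standard `regexp` package is RE2-based, which is exactly the linear-time guarantee described above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Signal is a hypothetical result emitted by a provider in the chain.
type Signal struct {
	Source string // which provider fired
	Match  string // what it matched on
}

// keywordSignal does deterministic, case-insensitive term matching —
// no ML inference needed for technology-specific queries.
func keywordSignal(query string, terms []string) *Signal {
	lower := strings.ToLower(query)
	for _, t := range terms {
		if strings.Contains(lower, strings.ToLower(t)) {
			return &Signal{Source: "keyword", Match: t}
		}
	}
	return nil
}

// ssnPattern: Go's regexp package uses RE2, whose linear-time matching
// avoids the catastrophic backtracking behind ReDoS attacks.
var ssnPattern = regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`)

// regexSignal flags structured PII such as US SSNs: for the pattern it
// encodes, there are no ML-style false negatives.
func regexSignal(query string) *Signal {
	if m := ssnPattern.FindString(query); m != "" {
		return &Signal{Source: "regex-pii", Match: m}
	}
	return nil
}

func main() {
	q := "How do I rotate secrets in kubernetes? My SSN is 123-45-6789."
	if s := keywordSignal(q, []string{"kubernetes", "CVE-"}); s != nil {
		fmt.Printf("%s signal fired on %q\n", s.Source, s.Match)
	}
	if s := regexSignal(q); s != nil {
		fmt.Printf("%s signal fired on %q\n", s.Source, s.Match)
	}
}
```

Both providers return a uniform `Signal`, which is what lets the fusion layer treat deterministic and ML-derived evidence the same way.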
Dual Execution Paths
- In-Tree Path: Low-latency signal providers running directly in the router process
- Out-of-Tree Path: MCP (Model Context Protocol) servers for massive rule sets, custom matching engines (Aho-Corasick, Hyperscan), and domain-specific algorithms
Signal Fusion Layer
The decision-making engine that combines all signals into actionable routing decisions:
- Priority-based policy evaluation: Safety blocks (200) → Routing overrides (150) → Category boosting (100) → Consensus (50) → Default (0)
- Boolean expressions: Combine multiple signals with AND, OR, NOT operators
- Flexible actions: Block, route to specific models, boost category weights, or fall through to BERT classification
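The priority-ordered evaluation described above can be sketched as follows. This is a hedged illustration, not the router's implementation: the `Policy` type, the predicate signature, and the policy names are all hypothetical, while the priority tiers (200/150/…) follow the scheme listed in the bullets.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is what a matching policy does with the request.
type Action int

const (
	Block Action = iota
	RouteTo
	BoostCategory
	Fallthrough // hand the decision to BERT classification
)

// Policy pairs a Boolean predicate over the fired signals with a
// priority tier and an action. Higher priority wins.
type Policy struct {
	Name     string
	Priority int // safety blocks 200, routing overrides 150, ...
	Matches  func(signals map[string]bool) bool
	Action   Action
	Target   string
}

// Evaluate walks policies in descending priority order and returns the
// first whose Boolean expression over the signal set is satisfied.
func Evaluate(policies []Policy, signals map[string]bool) Policy {
	sort.Slice(policies, func(i, j int) bool {
		return policies[i].Priority > policies[j].Priority
	})
	for _, p := range policies {
		if p.Matches(signals) {
			return p
		}
	}
	return Policy{Name: "default", Priority: 0, Action: Fallthrough}
}

func main() {
	policies := []Policy{
		{
			Name:     "pii-safety-block",
			Priority: 200, // safety blocks outrank everything else
			Matches:  func(s map[string]bool) bool { return s["regex-pii"] },
			Action:   Block,
		},
		{
			Name:     "k8s-routing-override",
			Priority: 150, // Boolean AND/NOT over two signals
			Matches:  func(s map[string]bool) bool { return s["keyword-k8s"] && !s["regex-pii"] },
			Action:   RouteTo,
			Target:   "devops-model",
		},
	}
	winner := Evaluate(policies, map[string]bool{"keyword-k8s": true})
	fmt.Println(winner.Name, "->", winner.Target)
}
```

Because unmatched requests fall through to the default policy, existing BERT-only routing keeps working unchanged when no policies are configured.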
Impact
This framework enables:
- Fast deterministic routing for technology-specific queries
- Guaranteed compliance with safety and regulatory requirements
- Semantic intent detection that complements BERT classification
- Specialized domain classification for vertical-specific routing (legal, medical, financial)
- Flexible deployment options with both in-tree and out-of-tree execution paths
- Graceful degradation and backward compatibility with existing routing
The Semantic Chain for Fusion Intelligent Routing represents a fundamental shift from pure ML-based routing to a hybrid approach that leverages the best of deterministic, pattern-based, semantic, and domain-specific classification methods.
2. Extensible Serving Architecture: Modular Candle-Binding for MoM Family
The Challenge
Our Rust-based candle-binding codebase has grown organically into a 2,600+ line monolithic structure. This architecture was designed for a handful of models, but now faces a critical challenge: supporting the entire MoM (Mixture of Models) Family with its diverse model architectures, specialized classifiers, and LoRA-adapted variants. The current monolithic design makes it nearly impossible to efficiently serve multiple model types simultaneously.
The Vision
We're restructuring the candle-binding into an extensible serving architecture specifically designed to support the MoM Family's diverse model ecosystem. This modular design enables seamless addition of new MoM models without code changes, efficient multi-model serving, and clear separation between model architectures and serving logic.
Layered Architecture for MoM Models
- Core Layer: Unified error handling, configuration management, device initialization, and weight loading shared across all MoM models
- Model Architectures Layer: Modular implementations of BERT (mom-similarity-flash, mom-pii-flash, mom-jailbreak-flash), ModernBERT, and Qwen3 (mom-brain-pro/max, mom-expert-* series) with extensible traits for future MoM additions
- Classifiers Layer: Specialized implementations for sequence classification (intent routing), token classification (PII/jailbreak detection), and LoRA support (fine-tuned MoM experts)
- FFI Layer: Centralized memory safety checks and C-compatible interfaces for Go integration
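The candle-binding itself is Rust, but the extensibility idea — every model architecture implements a shared contract and registers itself, so serving logic never changes — can be illustrated with an analogous Go interface sketch. The `MoMModel` interface, the registry, and the stub return values below are assumptions for illustration, not the actual binding API.

```go
package main

import (
	"fmt"
	"sort"
)

// MoMModel is the contract every model architecture implements —
// a Go analogue of the extensible traits in the modular Rust binding.
type MoMModel interface {
	Name() string
	Classify(text string) (label string, confidence float64)
}

// registry maps model names to constructors, so adding a new MoM model
// is a registration, not a change to the serving layer.
var registry = map[string]func() MoMModel{}

func Register(name string, ctor func() MoMModel) { registry[name] = ctor }

// similarityFlash is a stand-in for mom-similarity-flash.
type similarityFlash struct{}

func (similarityFlash) Name() string { return "mom-similarity-flash" }
func (similarityFlash) Classify(text string) (string, float64) {
	// A real implementation would call the candle-binding over FFI;
	// here we return a fixed stub result.
	return "similar", 0.92
}

func main() {
	Register("mom-similarity-flash", func() MoMModel { return similarityFlash{} })

	// Serve every registered model through the same code path.
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		m := registry[n]()
		label, conf := m.Classify("example query")
		fmt.Printf("%s -> %s (%.2f)\n", m.Name(), label, conf)
	}
}
```

Under this shape, deploying a new family member like mom-expert-math-flash means adding one type and one `Register` call; the serving loop, error handling, and FFI surface stay untouched.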
Impact
This extensible architecture enables:
- Rapid MoM Model Deployment: Add new MoM models (mom-expert-math-flash, mom-brain-max) by implementing standard traits
- Efficient Multi-Model Serving: Serve multiple MoM models simultaneously with shared infrastructure
- LoRA Support: Native support for LoRA-adapted MoM experts with high-confidence routing
- Backward Compatibility: Existing Go bindings continue to work without changes
This transformation positions the serving layer as a scalable foundation for the entire MoM Family ecosystem, enabling us to rapidly expand our model offerings while maintaining performance and reliability.
3. Model Unification: The MoM (Mixture of Models) Family
The Challenge
Although we have developed a comprehensive family of specialized routing models, our codebase still references legacy models scattered across configuration files. This fragmentation creates confusion, inconsistent performance, and a steep learning curve for new users.
The Solution
We're migrating the entire system to use the MoM Family as the primary built-in models:
- 🧠 Intelligent Routing: mom-brain-flash/pro/max for intent classification with clear latency-accuracy trade-offs



