
MixLLM: Dynamic Routing in Mixed Large Language Models

North American Chapter of the Association for Computational Linguistics (2025)

Abstract
Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and their response latency is high. Given mixed LLMs, each with its own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream so as to maximize response quality while minimizing cost and latency. The challenges are: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying set of LLM candidates over time (e.g., new LLMs added or old LLMs removed). To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response quality and cost of each query on each LLM. We then devise a meta-decision maker that chooses the query-LLM assignment which best trades off response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% under the time constraint).
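The routing step described above — predict per-LLM quality and cost, then let a meta-decision maker pick the assignment under a latency constraint — can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: all names (`LLMCandidate`, `route_query`, `cost_weight`, `latency_budget`) and the linear quality-minus-cost score are assumptions for the sketch.

```python
# Minimal sketch of the query-LLM assignment step described in the abstract.
# The predictors themselves are abstracted away: we assume each candidate
# already carries its predicted quality, cost, and expected latency.
from dataclasses import dataclass

@dataclass
class LLMCandidate:
    name: str
    predicted_quality: float   # estimated response quality in [0, 1]
    predicted_cost: float      # estimated cost per query (arbitrary units)
    expected_latency: float    # expected response time in seconds

def route_query(candidates, cost_weight=0.5, latency_budget=5.0):
    """Stand-in for the meta-decision maker: among candidates meeting the
    latency budget, pick the one with the best quality/cost trade-off."""
    feasible = [c for c in candidates if c.expected_latency <= latency_budget]
    if not feasible:  # if every model is too slow, fall back to all of them
        feasible = candidates
    return max(feasible,
               key=lambda c: c.predicted_quality - cost_weight * c.predicted_cost)

candidates = [
    LLMCandidate("large-llm", predicted_quality=0.95,
                 predicted_cost=1.0, expected_latency=8.0),
    LLMCandidate("small-llm", predicted_quality=0.80,
                 predicted_cost=0.1, expected_latency=1.0),
]
print(route_query(candidates).name)  # the large model misses the latency budget
```

In this toy run the strong-but-slow model is filtered out by the latency budget, so the cheaper model is selected; with a larger budget, the weighted quality-cost score would decide instead.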

【Highlights】: The paper proposes MixLLM, a dynamic routing system that optimizes the trade-off among response quality, cost, and latency of mixed large language models through query-LLM assignment.

【Method】: The paper adopts a contextual-bandit approach: query tags enhance query embeddings, lightweight prediction models estimate response quality and cost, and a meta-decision maker performs the optimal query-LLM assignment.

【Experiments】: Extensive experiments with the MixLLM system show that it achieves the best trade-off among response quality, cost, and latency (97.25% under the time constraint); the datasets used for validation are not named in the abstract.