
Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

CoRR (2025)

Abstract
We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
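The throughput and batch-size advantage stated in the abstract follows from the memory profile of recurrent decoding: a Transformer's KV cache grows linearly with context length, whereas a Mamba-style model keeps a fixed-size state per layer. The sketch below makes the comparison concrete; all dimensions (layer count, head sizes, state size) are illustrative assumptions, not Llamba's actual configuration.

```python
# Illustrative sketch (not the paper's code): per-sequence decoding memory of a
# Transformer KV cache versus the fixed-size recurrent state of an SSM model.
# Every dimension below is an assumption chosen only to make the point concrete.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Transformer decoding: keys and values are cached for every past token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def ssm_state_bytes(n_layers=32, d_model=4096, d_state=16, dtype_bytes=2):
    """Mamba-style decoding: one fixed-size state per layer, independent of seq_len."""
    return n_layers * d_model * d_state * dtype_bytes

for seq_len in (1_024, 8_192, 65_536):
    kv = kv_cache_bytes(seq_len) / 2**20
    ssm = ssm_state_bytes() / 2**20
    print(f"seq_len={seq_len:>6}: KV cache ~= {kv:8.1f} MiB, SSM state ~= {ssm:6.1f} MiB")
```

Because the recurrent state does not scale with context length, the memory freed up can be spent on larger decoding batches, which is where the throughput gain comes from.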
Chat Paper

[Key Points]: This paper introduces the Llamba family, a set of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The models match the benchmark performance of Transformer-based models while delivering higher inference throughput and handling much larger batch sizes, and the cross-architecture distillation keeps resource usage low.

[Method]: The authors apply the MOHAWK distillation technique to compress Llama-3.x models into the Mamba architecture, yielding the Llamba model family.
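For orientation, MOHAWK (Bick et al., 2024) proceeds in three stages: aligning the student's sequence-mixing matrices with the teacher's attention matrices, aligning per-layer hidden states, and finally end-to-end knowledge distillation after transferring the remaining weights. The sketch below writes these stages as PyTorch losses; the tensor arguments, their shapes, and the exact loss choices are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of MOHAWK-style distillation losses (hypothetical, simplified).
import torch
import torch.nn.functional as F

def matrix_orientation_loss(attn_matrix, ssm_mixing_matrix):
    # Stage 1: align the student's sequence-mixing matrix with the teacher's
    # attention matrix computed on the same input (Frobenius distance).
    return torch.norm(attn_matrix - ssm_mixing_matrix, p="fro") ** 2

def hidden_state_loss(teacher_hidden, student_hidden):
    # Stage 2: match per-layer hidden states between teacher and student.
    return F.mse_loss(student_hidden, teacher_hidden)

def end_to_end_loss(teacher_logits, student_logits, temperature=1.0):
    # Stage 3: standard knowledge distillation on the output distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In practice the stages are run in sequence, with earlier stages initializing the student before the full end-to-end distillation pass; this staged alignment is what lets the method reach Transformer-level quality with a small fraction of the usual training data.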

[Experiments]: The authors evaluate Llamba-1B, Llamba-3B, and Llamba-8B. The evaluation datasets are not specified here, but the results indicate that the models run practically and memory-efficiently on resource-constrained devices while matching the overall performance of Transformer-based models.