Smart-Infinity: Fast Large Language Model Training Using Near-Storage Processing on a Real System

2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2024)

Seoul National University | University of Texas at Austin | Yonsei University

Abstract
The recent rapid advance of Large Language Models (LLMs) has been driven mainly by the increase in the number of parameters. This has led to substantial memory capacity requirements, necessitating dozens of GPUs just to meet the capacity. One popular solution is storage-offloaded training, which uses host memory and storage as an extended memory hierarchy. However, this comes at the cost of a storage bandwidth bottleneck, because storage devices have orders of magnitude lower bandwidth than GPU device memory. Our work, Smart-Infinity, addresses the storage bandwidth bottleneck of storage-offloaded LLM training using near-storage processing devices on a real system. The main component of Smart-Infinity is SmartUpdate, which performs parameter updates on custom near-storage accelerators. We identify that moving parameter updates to the storage side removes most of the storage traffic. In addition, we propose an efficient data transfer handler structure to address the system integration issues of Smart-Infinity. The handler allows overlapping data transfers with fixed memory consumption by reusing the device buffer. Lastly, we propose accelerator-assisted gradient compression/decompression to enhance the scalability of Smart-Infinity. When scaling to multiple near-storage processing devices, the write traffic on the shared channel becomes the bottleneck. To alleviate this, we compress the gradients on the GPU and decompress them on the accelerators, which provides further acceleration from the reduced traffic. As a result, Smart-Infinity achieves a significant speedup over the baseline. Notably, Smart-Infinity is a ready-to-use approach that is fully integrated into PyTorch on a real system. The implementation of Smart-Infinity is available at https://github.com/AIS-SNU/smart-infinity.
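
To see why moving the parameter update to the storage side removes most of the traffic, consider a back-of-the-envelope per-step traffic model for mixed-precision Adam training with fully offloaded optimizer state. The byte counts below (fp16 parameters/gradients, fp32 master weights plus two fp32 Adam moments) are common ZeRO-style assumptions used here for illustration, not figures taken from the paper:

```python
# Back-of-the-envelope storage traffic per training step for a model
# with n parameters. Assumes fp16 params/grads (2 bytes each) and fp32
# optimizer state (4 bytes each for master weights, Adam m and v).
# Illustrative constants only -- not measurements from the paper.

def baseline_traffic(n: int) -> int:
    # Host/GPU performs the update: the optimizer state must
    # round-trip over the host-storage link every step.
    read_state  = 3 * 4 * n  # master weights + m + v in
    write_state = 3 * 4 * n  # master weights + m + v out
    write_grads = 2 * n      # fp16 gradients out to storage
    read_params = 2 * n      # updated fp16 parameters back in
    return read_state + write_state + write_grads + read_params

def smartupdate_traffic(n: int) -> int:
    # Near-storage accelerator performs the update: the optimizer
    # state stays on the storage side and never crosses the link.
    write_grads = 2 * n      # fp16 gradients out to storage
    read_params = 2 * n      # updated fp16 parameters back in
    return write_grads + read_params

n = 1_000_000_000  # a 1B-parameter model
print(baseline_traffic(n) / smartupdate_traffic(n))  # -> 7.0x less link traffic
```

Under these assumptions the host-storage link moves roughly 7x fewer bytes per step, matching the intuition that the fp32 optimizer state, not the fp16 gradients, dominates the storage traffic.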
Key words
Processing in-memory/near-memory/in-cache; FPGA: Architectures and accelerators; Large Language Models (LLMs)

Key points: The paper proposes Smart-Infinity, which uses near-storage processing devices on a real system to train large language models quickly, effectively addressing the storage bandwidth bottleneck.

Method: Smart-Infinity uses SmartUpdate to perform parameter updates on custom near-storage accelerators, employs a data transfer handler structure to ease system integration, and introduces accelerator-assisted gradient compression/decompression to improve scalability.
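
The handler's fixed memory consumption comes from reusing a small set of buffers rather than allocating one per transfer. Below is a minimal double-buffering sketch of that idea in PyTorch, using two pinned host buffers and a producer/consumer thread pair; the buffer count, chunk size, and all names are illustrative assumptions, not Smart-Infinity's actual handler.

```python
import queue
import threading

import torch

# Two reusable pinned host buffers are cycled (double buffering), so
# staging the next chunk overlaps with consuming the previous one and
# no new buffers are allocated mid-training.

CHUNK_ELEMS = 16 * 1024 * 1024  # 64 MiB of fp32 per chunk (assumed)

free_bufs: queue.Queue = queue.Queue()
for _ in range(2):  # the whole pipeline reuses exactly two buffers
    free_bufs.put(torch.empty(CHUNK_ELEMS,
                              pin_memory=torch.cuda.is_available()))

staged: queue.Queue = queue.Queue()

def producer(chunks):
    # Stage each chunk (e.g., data read back from storage) into a free
    # pinned buffer; blocks whenever both buffers are still in flight.
    for chunk in chunks:
        buf = free_bufs.get()
        buf[: chunk.numel()].copy_(chunk)
        staged.put((buf, chunk.numel()))
    staged.put(None)  # sentinel: no more chunks

def consumer(device):
    # Consume staged buffers (e.g., host-to-GPU copy) and recycle them
    # instead of allocating new memory.
    while (item := staged.get()) is not None:
        buf, n = item
        _ = buf[:n].to(device)  # a real handler would overlap this
                                # copy using CUDA streams and events
        free_bufs.put(buf)      # return the buffer for reuse

chunks = [torch.randn(CHUNK_ELEMS) for _ in range(8)]
t = threading.Thread(target=producer, args=(chunks,))
t.start()
consumer("cuda" if torch.cuda.is_available() else "cpu")
t.join()
```

Because the producer blocks whenever both buffers are in flight, memory consumption stays constant no matter how many chunks stream through.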

Experiments: The experiments were conducted on a real system (the abstract does not specify the datasets); the results show that Smart-Infinity achieves a significant speedup over the baseline, and the method is fully integrated into PyTorch.
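
To illustrate the accelerator-assisted gradient compression/decompression mentioned in the method summary, the sketch below uses a generic top-k scheme: the GPU keeps only the largest-magnitude gradient entries so that less data crosses the shared write channel, and the receiving side scatters the payload back into a dense tensor before the update. The top-k choice, the 1% ratio, and both function names are illustrative assumptions; the paper's actual compression algorithm and its FPGA decompressor are not reproduced here.

```python
import torch

def compress_topk(grad: torch.Tensor, ratio: float = 0.01):
    # On the GPU: keep only the largest-magnitude entries so that far
    # fewer bytes travel over the shared storage channel.
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx, grad.shape

def decompress_topk(vals, idx, shape, dtype=torch.float32):
    # On the accelerator side: scatter the sparse payload back into a
    # dense tensor before the parameter update consumes it.
    flat = torch.zeros(shape.numel(), dtype=dtype)
    flat[idx] = vals
    return flat.view(shape)

g = torch.randn(1024, 1024)
vals, idx, shape = compress_topk(g)
g_hat = decompress_topk(vals, idx, shape)
print(vals.numel() / g.numel())  # ~1% of entries cross the channel
```

The decompressed tensor is a lossy approximation of the original gradient, which is the usual trade-off of top-k schemes: the traffic reduction on the shared channel is bought with a small amount of gradient error.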