
OpenGeMM: A Highly-Efficient GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

Xiaoling Yi, Ryan Antonio, Joren Dumoulin, Jiacong Sun, Josse Van Delm, Guilherme Pereira Paim, Marian Verhelst

Asia and South Pacific Design Automation Conference (2025)

KU Leuven

Abstract
Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system-level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58× to 16.40× speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.
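The abstract's "programmable strided memory access" mechanism can be illustrated with a small sketch. This is not the paper's hardware or software, and the function and parameter names are hypothetical; it only shows how a two-level strided address generator can map a tiled GeMM operand onto a flat multi-banked scratchpad, which is the general idea such a mechanism implements.

```python
# Illustrative sketch (hypothetical names, not the paper's API): a two-level
# programmable strided address generator for fetching a GeMM tile from a
# flat scratchpad address space.

def strided_addresses(base, outer_len, outer_stride, inner_len, inner_stride):
    """Generate addresses for a 2-level strided access pattern:
    `outer_len` rows, each of `inner_len` elements, with independent strides."""
    return [base + o * outer_stride + i * inner_stride
            for o in range(outer_len)
            for i in range(inner_len)]

# Example: fetch a 2x4 tile of 8-byte elements from a row-major matrix
# whose row pitch in the scratchpad is 64 bytes.
addrs = strided_addresses(base=0, outer_len=2, outer_stride=64,
                          inner_len=4, inner_stride=8)
```

Because both strides are runtime parameters, the same generator covers row-major, column-major, and tiled layouts without reshaping data in memory, which is what lets such an accelerator sustain high utilization across differently shaped GeMM workloads.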

Key points: This paper presents OpenGeMM, an efficient GeMM accelerator generator that combines lightweight RISC-V control with tight memory coupling to improve the efficiency and utilization of DNN deployment on edge devices, while retaining configurability and programmability.

Methods: OpenGeMM comprises a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. Three mechanisms boost core utilization and system efficiency: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access.

Experiments: Evaluated on diverse CNN and Transformer workloads, OpenGeMM achieves hardware utilization of 81.89% to 99.34%. Compared with the state-of-the-art open-source Gemmini accelerator, it demonstrates a 3.58× to 16.40× speedup in normalized throughput across a wide range of GeMM workloads, while reaching 4.68 TOPS/W system efficiency.
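The input pre-fetching with output buffering mechanism mentioned in the abstract and summary amounts to classic double buffering: while the GeMM core computes on the current input tile, the next tile is fetched into a second buffer so the core never stalls on memory. A minimal software analogy of that ping-pong schedule, with purely illustrative function names (not the paper's implementation):

```python
# Hypothetical sketch of double buffering: overlap the fetch of tile i+1
# with the compute on tile i using two ping-pong buffers. In hardware the
# fetch runs concurrently with compute; here the ordering just mirrors it.

def process_with_double_buffering(tiles, fetch, compute):
    """Process a stream of input tiles, pre-fetching the next tile
    while (conceptually) computing on the current one."""
    results = []
    if not tiles:
        return results
    buf = fetch(tiles[0])           # prime the first buffer
    for nxt in tiles[1:]:
        nxt_buf = fetch(nxt)        # pre-fetch next tile (overlapped in HW)
        results.append(compute(buf))  # compute on current tile
        buf = nxt_buf               # swap ping-pong buffers
    results.append(compute(buf))    # drain the last buffered tile
    return results
```

With output buffering added on the store side, compute, load, and store phases all overlap, which is how such designs keep the GeMM array busy and sustain the high utilization figures reported above.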