StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue Through Event-Gated Cognition

Xin Ding, Hao Wu,Yifan Yang,Shiqi Jiang, Donglin Bai, Zhibo Chen,Ting Cao

CoRR（2025）

Cited 0|Views1

Abstract

With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce , a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To solve the key challenge of the contradiction between linear video streaming speed and quadratic transformer computation cost, we propose a novel perception-cognition interleaving paradigm named ”event-gated LLM invocation”, in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, LLM is only invoked when relevant events occur. To realize the event feature extraction with constant cost, we propose Event-Preserving Feature Extractor (EPFE) based on state-space method, generating a single perception token for spatiotemporal features. These techniques enable the video LLM with full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications, such as Game AI Copilot and interactive media.

Translated text

Bibtex

AI Read Science

AI Summary

AI Summary is the key point extracted automatically understanding the full text of the paper, including the background, methods, results, conclusions, icons and other key content, so that you can get the outline of the paper at a glance.

Example

Background

Key content

Introduction

Methods

Results

Related work

Fund

Key content

Pretraining has recently greatly promoted the development of natural language processing (NLP)
We show that M6 outperforms the baselines in multimodal downstream tasks, and the large M6 with 10 parameters can reach a better performance
We propose a method called M6 that is able to process information of multiple modalities and perform both single-modal and cross-modal understanding and generation
The model is scaled to large model with 10 billion parameters with sophisticated deployment, and the 10 -parameter M6-large is the largest pretrained model in Chinese
Experimental results show that our proposed M6 outperforms the baseline in a number of downstream tasks concerning both single modality and multiple modalities We will continue the pretraining of extremely large models by increasing data to explore the limit of its performance

Try using models to generate summary,it takes about 60s

Must-Reading Tree

Example

Generate MRT to find the research sequence of this paper

Data Disclaimer

The page data are from open Internet sources, cooperative publishers and automatic analysis results through AI technology. We do not make any commitments and guarantees for the validity, accuracy, correctness, reliability, completeness and timeliness of the page data. If you have any questions, please contact us by email: report@aminer.cn

Chat Paper

【要点】：本文提出了一种名为StreamMind的视频LLM框架，通过事件驱动的认知机制实现全帧率视频流处理和实时无干预响应，解决了传统视频处理速度与计算成本之间的矛盾。

【方法】：文章采用了一种新颖的感知-认知交织范式“事件门控LLM调用”，通过在视频编码器与LLM之间引入认知门控网络，仅当相关事件发生时调用LLM，并提出了基于状态空间方法的Event-Preserving Feature Extractor (EPFE)以恒定成本提取事件特征。

【实验】：实验在Ego4D和SoccerNet流任务以及标准离线基准上进行了测试，证明了StreamMind在模型能力和实时效率方面均达到了最先进的性能水平。

去 AI 文献库对话