Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan
ICLR 2024
Keywords: multi-modal learning, video and language