TD²-Net: Toward Denoising and Debiasing for Video Scene Graph Generation

Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao

AAAI (2024)

Guangzhou University | JD Explore Academy | The University of Sydney


Abstract

Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from two issues: 1) contextual noise, as some frames may contain occluded or blurred objects; and 2) label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones, compounded by a long-tailed distribution of relationships. To address these problems, we introduce a network named TD²-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representations with robust contextual information. This is achieved by designing a differentiable Top-K object selector that uses the Gumbel-softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to relieve the issue of label bias. This loss function integrates asymmetry focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of the proposed TD²-Net over existing state-of-the-art approaches on the Action Genome dataset. In more detail, TD²-Net outperforms the second-best competitor by 12.7% on mean Recall@10 for predicate classification.
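To make the idea of a differentiable Top-K selector more concrete, here is a minimal NumPy sketch of one well-known relaxation: Gumbel noise is added to the scores, and a temperature-scaled softmax is applied k times while softly masking what has already been picked. This is an illustration under stated assumptions, not the selector defined in the paper; the function name `relaxed_top_k`, the temperature `tau`, and the score matrix are all hypothetical. NumPy only shows the forward computation, so gradients would require an autodiff framework.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def relaxed_top_k(scores, k, tau=0.5, rng=None, eps=1e-20):
    """Soft k-hot selection weights over the last axis (illustrative only).

    scores: array of shape (..., N), e.g. one row of neighbor scores per object.
    Returns an array of the same shape whose rows each sum to k.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)

    # Perturb the scores with Gumbel(0, 1) noise.
    u = rng.uniform(low=1e-10, high=1.0, size=scores.shape)
    keys = scores - np.log(-np.log(u))

    khot = np.zeros_like(scores)
    onehot = np.zeros_like(scores)
    for _ in range(k):
        # Softly suppress entries that the previous step already selected.
        keys = keys + np.log(np.maximum(1.0 - onehot, eps))
        onehot = softmax(keys / tau, axis=-1)
        khot += onehot
    return khot


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical pairwise relevance scores: 4 objects, 6 candidate neighbors.
    scores = rng.normal(size=(4, 6))
    weights = relaxed_top_k(scores, k=2, tau=0.5, rng=rng)
    print(weights.round(3))
```

Lower values of `tau` push each row toward a hard k-hot mask, while higher values spread the weight across more neighbors; the resulting weights could then be used to scale attention over each object's context.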


Key words

Scene Graph Generation, Event Detection, Key Frame Extraction, Spatiotemporal Features, Video Summarization




Chat Paper

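Key points: This paper introduces a network named TD²-Net that targets denoising and debiasing for dynamic scene graph generation. Specifically, the authors first propose a denoising spatio-temporal transformer module; by designing a differentiable Top-K object selector that uses the Gumbel-softmax sampling strategy to select the relevant neighborhood for each object, the module enhances object representations with robust contextual information. Second, an asymmetrical reweighting loss is introduced to alleviate label bias. This loss function integrates asymmetry focusing factors and the number of samples to adjust the weights assigned to individual samples. Systematic experimental results show that, on the Action Genome dataset, the proposed TD²-Net outperforms the second-best competitor among existing state-of-the-art methods by 12.7% on mean Recall@10 for predicate classification.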

Method: A denoising spatio-temporal transformer module and an asymmetrical reweighting loss are introduced.

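Experiments: On the Action Genome dataset, TD²-Net outperforms the second-best competitor among existing state-of-the-art methods by 12.7% on mean Recall@10 for predicate classification.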
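The asymmetrical reweighting loss mentioned in the summary above combines focusing factors with sample counts. The sketch below shows one plausible way such a loss could be assembled; it is not the formula from the paper. It assumes a multi-label binary cross-entropy with separate focusing exponents for positives and negatives, plus inverse-frequency class weights. The function name, the exponents `gamma_pos` and `gamma_neg`, and the weighting scheme are all hypothetical choices made for illustration.

```python
import numpy as np


def asymmetric_reweighted_bce(probs, targets, class_counts,
                              gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    """Illustrative asymmetric, frequency-reweighted BCE (not the paper's loss).

    probs:        (batch, classes) predicted probabilities in [0, 1].
    targets:      (batch, classes) binary labels.
    class_counts: (classes,) number of training samples per class.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)

    # Assumed reweighting: inverse class frequency, normalized to mean 1,
    # so rare (tail) classes contribute more per sample.
    inv = 1.0 / np.asarray(class_counts, dtype=float)
    class_w = inv / inv.mean()

    # Asymmetric focusing: a larger gamma_neg down-weights easy negatives
    # more aggressively than gamma_pos down-weights easy positives.
    pos_term = targets * (1.0 - probs) ** gamma_pos * np.log(probs)
    neg_term = (1.0 - targets) * probs ** gamma_neg * np.log(1.0 - probs)

    loss = -(class_w * (pos_term + neg_term))
    return loss.sum(axis=-1).mean()


if __name__ == "__main__":
    # Hypothetical predictions for 2 object pairs over 3 predicate classes.
    probs = np.array([[0.9, 0.2, 0.1], [0.3, 0.7, 0.05]])
    targets = np.array([[1, 0, 0], [0, 1, 0]])
    counts = [1000, 100, 10]  # long-tailed class frequencies (made up)
    print(asymmetric_reweighted_bce(probs, targets, counts))
```

With `gamma_neg` set to 0 the expression reduces to a class-weighted binary cross-entropy, which makes the effect of the asymmetric focusing term easy to compare in isolation.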
