
TD²-Net: Toward Denoising and Debiasing for Video Scene Graph Generation

AAAI (2024)

Guangzhou University | JD Explore Academy | The University of Sydney

Abstract
Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues: 1) contextual noise, as some frames contain occluded or blurred objects; and 2) label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones. Additionally, the distribution of relationships exhibits a long-tailed pattern. To address these problems, in this paper we introduce a network named TD²-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representations with robust contextual information. This is achieved by designing a differentiable Top-K object selector that utilizes the Gumbel-Softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to mitigate label bias. This loss function integrates asymmetric focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of our proposed TD²-Net over existing state-of-the-art approaches on the Action Genome dataset. In more detail, TD²-Net outperforms the second-best competitor by 12.7% on mean-Recall@10 for predicate classification.
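The differentiable Top-K selection described above can be illustrated with a Gumbel-perturbation sketch. This is a hypothetical simplification, not the paper's implementation: the actual selector operates inside a spatio-temporal transformer with a relaxed (soft) Gumbel-Softmax; here, Gumbel noise is simply added to relevance scores and the k largest are taken, which yields one sample from the corresponding Top-K distribution.

```python
import numpy as np

def gumbel_topk_select(scores, k, tau=1.0, seed=None):
    """Sample k neighbor indices via Gumbel-perturbed Top-K.

    Hypothetical sketch of the sampling idea behind a differentiable
    Top-K object selector: perturbing scores with Gumbel(0, 1) noise
    and taking the k largest draws a Top-K sample without replacement.
    """
    rng = np.random.default_rng(seed)
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    u = rng.uniform(1e-9, 1.0, size=scores.shape)
    gumbel = -np.log(-np.log(u))
    perturbed = scores / tau + gumbel
    # Indices of the k largest perturbed scores = one Top-K sample.
    return np.argsort(perturbed)[::-1][:k]

# Toy relevance scores for 6 candidate neighbor objects of one object.
scores = np.array([2.0, 0.1, 1.5, -0.3, 0.8, 1.2])
neighbors = gumbel_topk_select(scores, k=3, seed=0)
print(neighbors)
```

In training, the hard `argsort` would be replaced by a softmax relaxation so gradients can flow back to the scores; the hard version above only shows the forward sampling behavior.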
Keywords
Scene Graph Generation, Event Detection, Key Frame Extraction, Spatiotemporal Features, Video Summarization

Key points: This paper introduces a network named TD²-Net that aims at denoising and debiasing for dynamic scene graph generation. Specifically, the authors first propose a denoising spatio-temporal transformer module that enhances object representations with robust contextual information by designing a differentiable Top-K object selector, which uses Gumbel-Softmax sampling to select the relevant neighborhood for each object. Second, an asymmetrical reweighting loss is introduced to mitigate label bias; it integrates asymmetric focusing factors and sample volume to adjust the weights assigned to individual samples. Systematic experiments show that on the Action Genome dataset, the proposed TD²-Net outperforms the second-best state-of-the-art competitor by 12.7% on mean-Recall@10 for predicate classification.

Methods: A denoising spatio-temporal transformer module and an asymmetrical reweighting loss are introduced.
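The asymmetrical reweighting loss is described only at a high level here (asymmetric focusing factors combined with per-class sample volume). A hedged sketch of one plausible instantiation, not the paper's exact formulation: a binary cross-entropy with separate focusing exponents for positives and negatives, scaled by an inverse-log frequency weight so long-tailed predicates count more.

```python
import numpy as np

def asymmetric_reweighted_bce(p, y, n_pos, gamma_pos=0.0, gamma_neg=2.0):
    """Hypothetical asymmetrically re-weighted BCE loss.

    Assumptions (illustrative, not the paper's definition):
    - gamma_pos / gamma_neg are separate focusing factors that
      down-weight easy positives and easy negatives asymmetrically;
    - class_w ~ 1 / log(1 + n_pos) compensates for sample volume,
      so rare (tail) predicates receive larger weights.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    class_w = 1.0 / np.log(1.0 + n_pos)
    pos_term = y * (1 - p) ** gamma_pos * np.log(p)
    neg_term = (1 - y) * p ** gamma_neg * np.log(1 - p)
    return -np.mean(class_w * (pos_term + neg_term))

# Toy example: two head predicates (500 samples) and one tail predicate (20).
p = np.array([0.9, 0.3, 0.2])        # predicted probabilities
y = np.array([1.0, 1.0, 0.0])        # ground-truth labels
n_pos = np.array([500.0, 20.0, 500.0])
print(float(asymmetric_reweighted_bce(p, y, n_pos)))
```

The larger `gamma_neg` suppresses the contribution of the many easy negatives, while the frequency weight lifts the hard tail positive, which is the general behavior the paper's loss targets.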

Experiments: On the Action Genome dataset, TD²-Net outperforms the second-best state-of-the-art competitor by 12.7% on mean-Recall@10 for predicate classification.