Multiple Scales Fusion and Query Matching Stabilization for Detection with Transformer
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2025)
China Univ Petr East China | Lenovo China
Abstract
Recent advances in object detection with Transformer-based models such as Detection with Transformer (DETR) have improved performance, but challenges remain with multi-scale fused features. These features introduce redundant tokens and a bias toward larger objects, both of which slow down training. To overcome these issues, we propose two novel encoders: the Similarity-based Deduplication Encoder (SDE) and the Hybrid Multi-object Encoder (HMoE). HMoE employs an offset-based attention window to enhance local attention for objects of varying sizes across feature maps, while SDE reduces redundancy by calculating attention scores across multiple scales. Additionally, we introduce a One-to-many Positive Matching (OmPM) strategy to improve query stability. OmPM generates query vectors from multiple positive samples, resulting in more diverse and semantically meaningful queries. Our model demonstrates substantial performance improvements. On the Visual Object Classes Challenge 2007 dataset, it gains +5.04 mean Average Precision (mAP) and +5.1 Average Precision for small objects (APs) in just 24 epochs. On the Microsoft Common Objects in Context (COCO) dataset, the model reaches 50.1 mAP and 34.2 APs in only 8 epochs, and 52.4 mAP and 35.6 APs in 24 epochs. This significantly accelerates convergence, reducing training time by 66% compared to benchmark methods while maintaining or exceeding their detection accuracy. Furthermore, our model achieves 27 Frames Per Second (FPS) on the COCO dataset, setting a new record among DETR-like methods with high detection accuracy.
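The abstract does not spell out how SDE identifies redundant tokens, so the following is only a minimal sketch of the general idea of similarity-based deduplication, not the authors' encoder. It assumes tokens from several scales have already been flattened into one list of feature vectors, and it swaps in plain cosine similarity with a greedy keep/drop rule in place of the attention scores the paper uses. The function names, the threshold value, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate_tokens(tokens, threshold=0.95):
    """Greedily keep a token only if it is not too similar to any token kept so far.

    Returns the indices of the kept tokens. The threshold is a hypothetical
    hyperparameter chosen for illustration.
    """
    kept = []
    for i, token in enumerate(tokens):
        if all(cosine(token, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy tokens, imagined as flattened from two feature-map scales.
tokens = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],  # near-duplicate of token 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.98, 0.02],  # near-duplicate of token 2
]

kept = deduplicate_tokens(tokens)
print(kept)  # [0, 2, 3]
```

Dropping near-duplicate tokens before attention shrinks the sequence the encoder has to process, which is the kind of saving the abstract attributes to removing redundant multi-scale tokens.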
Key words
Deep learning, Object detection, Multiple scales fusion, Vision transformer