
SpotActor: Training-Free Layout-Controlled Consistent Image Generation

AAAI 2025

School of Computer Science and Technology | SGIT AI Lab | China Telecom Corporation Ltd. Data&AI Technology Company | Baidu Inc

Abstract
Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios such as comic book production, they can neither place each subject in its expected spot nor keep each subject's appearance consistent across images. To address these issues, we introduce a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and propose a training-free pipeline, SpotActor, which features a layout-conditioned optimizing stage and a consistent sampling stage. In the optimizing stage, we devise a nuanced layout energy function that mimics the attention activations with a sigmoid-like objective. In the sampling stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. For evaluation, we present ActorBench, a dedicated benchmark with hundreds of reasonable prompt-box pairs derived from object detection datasets. Comprehensive experiments demonstrate the effectiveness of our method. The results show that SpotActor fulfills the expectations of this task and shows potential for practical applications, with superior layout alignment, subject consistency, prompt conformity, and background diversity.
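The abstract does not give the form of the layout energy function, so the following is only an illustrative sketch of what a sigmoid-like, box-conditioned energy over an attention map could look like. The function name `layout_energy`, the parameters `sharpness` and `threshold`, and the exact combination of inside and outside terms are all assumptions made for this example, not the paper's definition. In an energy-guided pipeline the gradient of such a scalar would be used to update the latent; this sketch only computes the scalar.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def layout_energy(attn, box, sharpness=10.0, threshold=0.5):
    """Hypothetical layout energy (not the paper's formula).

    attn: 2D list of non-negative attention values for one subject token.
    box:  (top, left, bottom, right) in cell indices, bottom/right exclusive.
    Returns a scalar that is low when sigmoid-squashed attention is high
    inside the box and low outside it.
    """
    top, left, bottom, right = box
    peak = max(max(row) for row in attn)
    if peak <= 0:
        raise ValueError("attention map must contain a positive value")

    inside, outside = [], []
    for r, row in enumerate(attn):
        for c, value in enumerate(row):
            # Normalize by the peak, then squash with a sigmoid so each
            # cell is softly classified as "activated" or "not activated".
            activation = sigmoid(sharpness * (value / peak - threshold))
            if top <= r < bottom and left <= c < right:
                inside.append(activation)
            else:
                outside.append(activation)

    if not inside:
        raise ValueError("box does not cover any cell of the attention map")

    mean_inside = sum(inside) / len(inside)
    mean_outside = sum(outside) / len(outside) if outside else 0.0
    # Penalize missing activation inside the box and stray activation outside.
    return (1.0 - mean_inside) + mean_outside


# Toy 4x4 attention maps: one focused on the target box, one focused elsewhere.
box = (0, 0, 2, 2)
aligned = [[1.0, 1.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
misaligned = [[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]]

energy_aligned = layout_energy(aligned, box)
energy_misaligned = layout_energy(misaligned, box)
```

With these toy maps, the attention map concentrated inside the box receives a lower energy than the one concentrated outside it, which is the behavior a layout-guidance objective of this kind is meant to reward.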

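[Key points]: The paper proposes SpotActor, a training-free image generation method that produces consistent, compositional images according to given layout conditions, addressing the shortcomings of conventional text-to-image models in layout control and subject consistency.

[Method]: The authors use dual energy guidance with optimization in a dual semantic-latent space to build a training-free generation pipeline, consisting of a layout-conditioned backward update stage and a consistent forward sampling stage.

[Experiments]: The authors design a dedicated benchmark, ActorBench, containing hundreds of reasonable prompt-box pairs derived from object detection datasets. The results show that SpotActor performs well in layout alignment, subject consistency, prompt conformity, and background diversity.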