Overview
The key problem addressed was the accurate generation of images that represent complex spatial relationships and interactions between multiple objects, a task where existing models reveal a weak understanding of spatial concepts. I identified compositionality as the key to generalization and developed a novel two-stage compositional diffusion model, Spatially Composable Diffusion, to address this. Our model outperforms existing baselines on key metrics such as object accuracy, relational accuracy, and FID score. By leveraging compositionality, it successfully generalizes to scenes with four or more objects despite being trained only on scenes with two objects. As a co-first author, I played the primary role in the overall success of the project, from implementing models and building data pipelines to paper writing and poster design.
Timeline
Jun 2023 - Mar 2024
Context
Deep generative models have revolutionized text-to-image generation, yielding highly realistic results. However, they often struggle to generate accurate compositions, producing missing objects or incorrect spatial relationships. While some solutions rely on the spatial reasoning abilities of large language models (LLMs) or on human input, these issues remain unaddressed within the diffusion model itself. Compositionality and concept learning are powerful: a human who understands "left" can picture any pair of objects in that spatial arrangement. We aimed to embed this compositional structure directly into the generative process, ensuring that our model accurately generalizes the spatial relationships and interactions specified by input scene graphs.
How can we enhance diffusion models to accurately represent complex spatial relationships?
How can we incorporate compositionality and concept learning into the diffusion process?
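The paper describes the full two-stage model; purely to illustrate the compositional idea raised in these questions, here is a minimal sketch of one common way to compose diffusion models at sampling time: combining classifier-free-guided noise estimates, one per scene-graph relation (in the spirit of composable diffusion). The function names, per-relation embeddings, and stand-in denoiser below are assumptions for illustration, not the project's actual implementation.

```python
import torch

# Illustrative sketch only, NOT the Spatially Composable Diffusion code.
# Idea: each scene-graph relation (e.g., "cup left of book") contributes its own
# guidance delta, and the deltas are summed on top of the unconditional estimate.

def composed_noise_estimate(denoiser, x_t, t, relation_embs, null_emb, guidance=7.5):
    """Combine per-relation classifier-free-guided noise estimates."""
    eps_uncond = denoiser(x_t, t, null_emb)                      # unconditional branch
    deltas = [denoiser(x_t, t, c) - eps_uncond for c in relation_embs]
    return eps_uncond + guidance * torch.stack(deltas).sum(dim=0)

if __name__ == "__main__":
    # Stand-in for a trained conditional denoiser (e.g., a UNet); returns a
    # tensor shaped like x_t so the composition above can be demonstrated.
    def dummy_denoiser(x_t, t, cond):
        return 0.1 * x_t + 0.0 * cond.mean()

    x_t = torch.randn(1, 3, 64, 64)                    # noisy sample at step t
    rels = [torch.randn(1, 128) for _ in range(2)]     # two relation embeddings
    eps = composed_noise_estimate(dummy_denoiser, x_t, torch.tensor([10]),
                                  rels, torch.zeros(1, 128))
    print(eps.shape)  # torch.Size([1, 3, 64, 64])
```

In such a scheme, a model trained on two-object scenes can in principle be queried with any number of relations at once, which is one way compositionality can support generalization to larger scenes.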
Outcome
(Read the paper on the cover for a detailed explanation)
Ideation