Consistent Scene Diffusion for Zero-Shot
Social Story Generation

Enhancing zero-shot text-to-image generation for consistent, personalized social stories

The goal of this project is to use LLMs and text-to-image models to generate social stories, a proven learning tool for children with autism spectrum disorder (ASD). Despite their usefulness, social stories are underused because of the time burden that creating them by hand places on therapists.

On the technical side, the project focuses on enhancing zero-shot image generation to ensure consistent scene and character generation. Our approach combines Stable Diffusion, DreamBooth, and textual inversion with cross-attention control and ChatGPT prompting. The results outperform the existing state-of-the-art StoryDALL-E model, both quantitatively and qualitatively, in consistency and interpretability, while staying lightweight, expressive, and personalizable for therapists.
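To make the idea concrete, here is a minimal sketch of the prompt-assembly step of such a pipeline. It assumes a subject token (the hypothetical name `<sks-child>`) has already been learned via DreamBooth or textual inversion; each LLM-generated scene sentence is rewritten to reuse that token plus a fixed style suffix, one simple way to push per-scene generations toward character and style consistency. The token and style strings are illustrative, not the project's actual values.

```python
# Hypothetical learned embedding token for the personalized subject (the child).
SUBJECT_TOKEN = "<sks-child>"
# Fixed style suffix shared across every scene to stabilize the look.
STYLE_SUFFIX = "children's book illustration, soft watercolor"

def build_scene_prompts(scene_sentences, subject_name="the child"):
    """Turn LLM story sentences into diffusion prompts.

    Replaces the generic subject phrase with the learned token and
    appends the shared style suffix to each scene.
    """
    prompts = []
    for sentence in scene_sentences:
        personalized = sentence.replace(subject_name, SUBJECT_TOKEN)
        prompts.append(f"{personalized}, {STYLE_SUFFIX}")
    return prompts

# Example: a two-scene story produced by the LLM.
story = [
    "the child waits in line at school",
    "the child raises a hand to ask a question",
]
prompts = build_scene_prompts(story)
```

In the full pipeline, each prompt would then be fed to the fine-tuned Stable Diffusion model (with cross-attention control constraining how scenes change between frames); this sketch only shows the personalization/consistency scaffolding around the prompts.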

I contributed to all aspects of the project, from ideation and implementation to developing evaluation metrics.
  • Timeline: Mar - Jun 2023
  • Skills: Research, Applied AI
  • Team: Ryan Lian, Will Li, Claire Shao
  • Tools: PyTorch, ChatGPT API

Context

Social stories have been shown to be a highly effective learning tool for children with ASD. They are underused, however, because therapists must create each story by hand, which is not only time-consuming but also difficult to do at consistently high quality.

LLMs and text-to-image generation models like Stable Diffusion show potential to automate this process, but several serious challenges remain. Because of their black-box nature, using these models directly produces a series of images that can barely be called a story: the same character looks completely different in every scene, and the scene itself changes dramatically even with a fixed random seed. This is not only hard to comprehend but especially distracting for children with ASD. There is also no control over the subject; ideally, the subject would be the child, and the story would be rendered in a style that engages that child.

How can we leverage existing LLMs and text-to-image models to generate personalized, expressive social stories for children with Autism Spectrum Disorder (ASD)?

How can we improve scene and character consistency while keeping the system lightweight, personalizable, and expressive for therapists?

Process

Presentation

I presented at the poster session, attended by 500+ students and faculty members.

Outcome

(Read the paper on the cover for a detailed explanation)

Therapist Survey Results

Overall, 0 therapists preferred the social story generated by StoryDALL-E, 2 preferred the human-generated stories, and 9 preferred ours. The two therapists who preferred the human-made social story reported that this was because they were more accustomed to working with that story style, not because of a failure in our story generation.

Award

We were awarded "Best Project" in Stanford's hallmark graduate-level deep learning class, CS 231N (~500+ BS, MS, PhD, and professional students).