Why Sora Struggles with Real-World Physics

By Ben Jones, co-founder and CEO of Data Literacy, and author of AI Literacy Fundamentals

Before reading this article, watch this 5-second AI-generated video clip:

What happens when AI tries to generate videos that mimic our physical world? In this article, I’d like to explore this topic through the lens of Sora, OpenAI’s new video generation model.

TL;DR: As stunningly realistic as the outputs of this model seem to be at first blush, the early version of Sora really struggles to follow the basic laws of physics that govern our world. This is not an exposé piece, as OpenAI themselves have admitted as much in writing. But I do think we can learn a lot about these models and how they work by examining their outputs.

What is Sora?

First, what is Sora? This month, OpenAI made their video generation model, Sora, available as a standalone product for ChatGPT Plus and ChatGPT Pro subscribers. The Pro plan allows users to create more, longer, and higher resolution videos than users of the Plus plan. The version of the model that subscribers are able to use is called Sora Turbo, and it’s a diffusion model designed to turn text, image, and video prompts into video outputs. More on that to come.

The Good: Impressive Capabilities

So what is it capable of creating? The video above is a 5-second video clip in 720p resolution, created from my prompt, “A seagull flies past Seattle’s Space Needle, with a view of the downtown skyline as seen from a ferry on the Puget Sound.” If you didn’t play the video before starting to read, go back and watch it now.

In this video, the movement of the birds might be somewhat abrupt, but that isn’t so abnormal, really. To my eye, the video doesn’t seem to defy the laws of physics in any drastic way. The ferry bobs in the water and the camera pans side to side as if held in the hand of a person riding the ferry as the birds fly around the boat. The city and clouds in the background remain fixed in place appropriately. I’ve personally ridden many ferries in the Seattle area, and in my opinion this video feels quite realistic.

So far so good! I went ahead and added this example, along with a description of Sora, to our course, Harnessing Generative AI.

What Are the Claims Being Made?

There’s a lot to say about this new platform! It’s quite impressive, of course. This one video alone might nudge us toward accepting two specific claims that OpenAI has made about Sora recently:

  1. In their December 9, 2024 product launch announcement, “Sora is here,” OpenAI described Sora as “a foundation for AI that understands and simulates reality—an important step towards developing models that can interact with the physical world.”
  2. On another page on their site, OpenAI claims that Sora “understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”

As powerful as it can be, I don’t believe that these claims are defensible quite yet. In actuality, this early version of the Sora platform still has many flaws. Specifically, in its current version, it often struggles with real-world physics. I’ll demonstrate that now.

The Bad: Defying Laws of Physics

Watch this AI-generated video of a bouncing basketball. Sora created this short video clip for me on December 14, 2024 in response to my prompt, “Create a video of a basketball bouncing on a court in a city.” 

Notice how the ball suddenly bounces without cause, stops bouncing on its own, and then rolls around in different directions, all without any external force being applied to it. The ball doesn’t exactly obey the law of gravity, nor does it follow Isaac Newton’s First Law of Motion: “An object at rest stays at rest and an object in motion stays in motion with the same speed and in the same direction unless acted upon by an unbalanced force.”
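To make the contrast concrete, here is a minimal toy simulation of the physics Sora’s ball violates: in a simple model, velocity changes only when a force acts (gravity, or the floor’s impulse at a bounce), and a ball at rest with no net force stays at rest. This is purely illustrative — it has nothing to do with how Sora works internally.

```python
# Toy 1-D bouncing-ball simulation: velocity changes only when a force acts.
# Purely illustrative -- this is the physics Sora's ball fails to obey.

G = -9.8           # gravity (m/s^2), the only external force here
RESTITUTION = 0.7  # fraction of speed kept after each bounce
DT = 0.01          # time step (s)

def simulate(height, steps):
    """Drop a ball from `height` (at rest) and return its height over time."""
    y, v = height, 0.0
    trajectory = []
    for _ in range(steps):
        v += G * DT            # gravity accelerates the ball downward
        y += v * DT
        if y <= 0.0:           # the floor applies an impulse: the bounce
            y = 0.0
            v = -v * RESTITUTION
        trajectory.append(y)
    return trajectory

def simulate_at_rest(steps):
    """A ball on the floor with no net force never moves -- Newton's first law."""
    return [0.0] * steps
```

Each bounce peak is lower than the last because the floor’s impulse dissipates energy — the ball never spontaneously starts or stops moving, unlike in the generated clip.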

The Ugly: Anatomy Gone Awry

And it isn’t just the way objects move; sometimes the problem is with the objects themselves. Look again at how the seams of the basketball morph and shift over the course of the video. The ball keeps its general spherical shape, but the seams don’t maintain consistency over time. This can be even more extreme – and somewhat disturbing – when Sora creates videos of the human body in motion.

Watch this AI-generated video of a gymnast. Sora created this short video clip for me on December 14, 2024 in response to my prompt, “A sports telecast scene of a gymnast performing an aerial flip during a floor routine.” The transformer portion of the model created the following title for the video: “Flawless Aerial Flip.”

Not quite “flawless,” I’d say. Not only does the gymnast move through the air in a way that’s physically impossible, but the body also contorts in unnatural ways, and limbs twist, move, and appear in bizarre places.

What’s Going On Here?

So why did Sora create videos with such glaring and disturbing glitches? The issues with these two videos go well beyond the so-called “uncanny valley” – it’s not that they’re slightly off in a way that’s unsettling but hard to pin down; they’re obviously wrong. And my prompts were in no way designed to elicit such effects. I provided Sora with simple and brief descriptions of real-world situations: no trickery involved.

The Nature of the Unnatural Outputs

To be clear, OpenAI themselves have acknowledged that Sora, in its current state, is susceptible to these kinds of quirks. In their launch post itself, they admit that, “The version of Sora we are deploying has many limitations. It often generates unrealistic physics and struggles with complex actions over long durations.”

Elsewhere on their site, they make it clear that:

“The current model still has room for improvement. It may struggle to simulate the physics of a complex scene, and may not comprehend specific instances of cause and effect (for example: a cookie might not show a mark after a character bites it). The model may also confuse spatial details included in a prompt, such as discerning left from right, or struggle with precise descriptions of events that unfold over time, like specific camera trajectories.”

But why does it tend to do this?

What is a Diffusion Transformer?

Image generators like DALL·E and Midjourney are diffusion models that gradually transform random noise into a complex output like an image. Large language models like GPT-4 are transformers that are able to pay attention to all of the words (actually “tokens,” which can be words or fragments of words and punctuation) in a sequence in order to generate new text.

Then what is a diffusion transformer? In simple terms, it combines both approaches: it uses gradual refinement like a diffusion model, starting with random noise and progressively turning it into meaningful data over many steps, and it uses self-attention like a transformer to focus on key patterns and relationships in the data.
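The diffusion half of that combination can be sketched in a few lines: start from pure noise and repeatedly refine it toward a coherent output. In a real diffusion transformer, the refinement step is a trained transformer operating over patch tokens; the `denoise_step` below is just a stand-in that nudges the sample toward a toy target, to show the shape of the loop.

```python
import numpy as np

# Highly simplified diffusion-style refinement loop.
# In a real model, denoise_step would be a trained network predicting
# (and removing) noise; here it is a stand-in that nudges the sample
# toward a known target so the loop is easy to follow.

rng = np.random.default_rng(0)

def denoise_step(x, target, strength=0.1):
    # Stand-in "denoiser": move a fraction of the way toward the target.
    return x + strength * (target - x)

def generate(target, steps=200):
    x = rng.standard_normal(target.shape)  # start from pure random noise
    for _ in range(steps):                 # gradual, step-by-step refinement
        x = denoise_step(x, target)
    return x

target = np.linspace(0.0, 1.0, 16)  # toy "image": a 16-pixel gradient
sample = generate(target)           # after many steps, noise becomes signal
```

The key idea is that the output emerges over many small steps rather than in one shot — which is also why inconsistencies can creep in and compound across a long video.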

In their research paper, “Scalable Diffusion Models with Transformers,” Peebles and Xie describe diffusion transformers as “a simple transformer-based backbone for diffusion models that…inherits the excellent scaling properties of the transformer model class.” Essentially, the diffusion model part creates structure step by step, while the transformer part keeps track of relationships and ensures consistency, even across complex sequences like images or videos.


Image Source: OpenAI

OpenAI explains that Sora uses “spacetime patches” to process videos. These patches have a space dimension and a time dimension – the different areas within a frame are the “space” part, and the different frames in a sequence are the “time” part. This approach has many merits, and basically it’s what allows Sora to keep track of patterns and movements within individual frames and across many frames, giving its outputs some level of coherence and continuity.
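The patching idea is easy to see with a small array example: each spacetime patch covers a small region of the frame (the space part) and a short run of consecutive frames (the time part). The patch sizes below are illustrative, not Sora’s actual configuration.

```python
import numpy as np

# Sketch of cutting a video tensor into "spacetime patches": each patch
# spans a small region of the frame (space) AND a short run of
# consecutive frames (time). Patch sizes here are illustrative only.

def spacetime_patches(video, pt=2, ph=8, pw=8):
    """Split a video of shape (T, H, W, C) into patches of shape (pt, ph, pw, C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # group the patch-grid indices first
    return v.reshape(-1, pt, ph, pw, C)   # one row per spacetime patch

video = np.zeros((8, 32, 32, 3))          # 8 frames of 32x32 RGB
patches = spacetime_patches(video)
print(patches.shape)                      # (64, 2, 8, 8, 3): 4*4*4 patches
```

Because each patch bundles space and time together, attention over these patches is what lets the model relate a region of one frame to the same region a few frames later — the source of whatever temporal coherence the outputs have.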

It’s important to understand, though, that at present, diffusion transformers like Sora will generate scenes with twisted limbs or gravity-defying movements and such from time to time because they don’t have internal representations of the way the world actually works. This may be what OpenAI is working towards, but their current model isn’t there yet.

In a nutshell, working with “spacetime patches” of video data in this way hasn’t yet amounted to a reliable ability to recreate consistent and natural objects or movements. Here’s how reporter Kyle Wiggers put it in his recent TechCrunch article, “What are AI ‘world models,’ and why do they matter?”:

“While a generative model trained on years of video might accurately predict that a basketball bounces, it doesn’t actually have any idea why — just like language models don’t really understand the concepts behind words and phrases. But a world model with even a basic grasp of why the basketball bounces like it does will be better at showing it do that thing.”

What’s Next? Large World Models (LWMs)

An inherent limitation of diffusion models is that they’re trained on hours and hours of video, but they lack any physical embodiment, so they cannot directly experience or come to understand the physical world as we can, or at all, for that matter. And while their training may result in high-level abstractions of objects and how they move in the training videos, they don’t include, at present, any sensory inputs or programming that allow them to accurately simulate the laws of physics.

Update: And, as Nick Desbarats pointed out in a thread about this post on LinkedIn, it would seem that some notion of object affordances is relevant, both conventional (what is a basketball for? why is throwing it in a basketball hoop a good idea?) and unconventional (what else could you do with a basketball? what else could you do with a hoop?). There’s so much to learn about the world – things we, as humans, learn over time by living in it.

Large world models (LWMs), such as those being built by Fei-Fei Li’s company World Labs, promise different results because they aim to incorporate various sensory inputs to develop rules about the environment and its dynamics. It will be interesting to watch both the evolution of video generating diffusion models like Sora, as well as the research, development, and potential emergence of LWMs, and their impacts.

I believe we can expect Sora’s performance to improve over time, and it’s entirely possible that the videos in this article will come to seem as dated as movies like The Polar Express. But for the time being, it’s important to recognize the limitations of these tools, and it’s critical to educate ourselves about their capabilities and drawbacks, as well as the underlying reasons that they behave the way that they do.

How to Learn More

If you’re interested in learning more about AI – what it is, its history, and how it is being used – consider enrolling in our training course, AI Literacy Fundamentals. If you’d like to focus on Generative AI, you can take the follow-up course, Harnessing Generative AI. If you think your organization or group could benefit from training about these critical technologies, just fill out our team training inquiry form, and we’ll put a meeting on the calendar to discuss more. My wife and co-founder, Becky Jones, has spoken with hundreds of organizations of various types and sizes all around the world, and she’ll listen to your situation and provide some options that are tailored to your needs. Hope to talk to you soon!