AI video: more than just “slop”

The Economist
Updated on: Oct 07, 2025 12:18 pm IST

Video models work by taking randomly generated visual static and progressively “de-noising” it by adding order to the chaos

SCROLLING THROUGH the feed on Sora, a new video app from chatbot developer OpenAI, is a hallucinatory experience. A woman in a judo jacket bows to an elephant before flipping it over her shoulder. A young figure-skater races across the rings of Saturn. Grainy security-camera footage captures Sam Altman, OpenAI’s founder and boss, attempting to shoplift a graphics card.


The TikTok-like service would be an odd project for the AI lab, were it not for the fact that the videos on Sora are all AI-generated. There is no option to upload your own footage, nor even to turn your camera on (save for activating a feature which inserts your own likeness into the AI video-generator). The Sora feed is all slop—AI-generated pabulum—all of the time. Video models, like the Sora model on which the app is built, are what excites the AI industry now that the star of text models is fading, and not only because of their impact on mass media.

Not that that impact is small. Despite being invitation-only, the app is perched at the top of the American and Canadian app-store charts, its initial launch sites. “Invite” codes themselves have become valuable commodities, selling on eBay for $5 to $35. At launch it was followed in the charts by Google’s Gemini app, itself seeing a slop-fuelled uplift thanks to the company’s “Nano Banana” image generator. Users ask the system for their photo in the style of the lead character of a ’90s slasher flick, or hugging themselves as a child, or something equally improbable, and it dutifully complies.

Success comes at a cost. For those lucky enough to cop an invite, Sora is free to use. But it certainly is not free to run. Each video generated on its site is estimated to cost OpenAI around $1 in computing power, based on pricing for the first version of Sora, and users can generate 100 a day. The genius of social media was that users would post content without needing to be paid and advertisers would pay for space alongside them. The economics of a video app are somewhat less promising if the company loses money with every post.

But the true value of Sora, and similar video models like Google’s Veo 3, is unlikely to lie in the slop they can generate—even if it captures users’ attention. Instead, a new paper from researchers at Google DeepMind argues, such systems are able to solve an array of visual and spatial problems without any specific training at all.

Video models work by taking randomly generated visual static and progressively “de-noising” it by adding order to the chaos. At each step the model asks itself “what would make this look more like the prompt I have been given?” If that prompt is a description of shareable content, then shareable content is what the model will spit out. If it is a description of a visual task, such as manipulating images or solving problems in the real world, it turns out that the latest generation of video models can solve that too.
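The shape of that process can be sketched in a few lines of code. The toy loop below is purely illustrative, not OpenAI’s method: the “target” image and the blending rule stand in for what a trained neural network would predict at each step. Real models learn the de-noising step from data, but the overall loop, start with static and repeatedly nudge it toward something that better matches the prompt, is the same.

```python
# Toy illustration of iterative de-noising (an assumption-laden sketch,
# not how Sora or Veo actually work). A real diffusion model replaces
# the hand-written "target" and blending rule with a learned network.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: a simple 32x32 gradient standing in for "the prompt".
target = np.linspace(0.0, 1.0, 32 * 32).reshape(32, 32)

# Start from random visual static.
frame = rng.normal(loc=0.5, scale=1.0, size=(32, 32))

steps = 50
for t in range(steps):
    # Each step removes a little noise by blending toward the target,
    # analogous to asking "what would make this look more like the prompt?"
    alpha = (t + 1) / steps
    frame = (1 - alpha) * frame + alpha * target

print("remaining deviation from target:", float(np.abs(frame - target).mean()))
```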

Give it an image of a parrot on a tree and a prompt demanding the model produce a video showing all colour and detail fading away, leaving only the edges visible, and the model will gamely comply—doing a competent job of edge detection, a task that previously required specialised systems. Give it an unfinished sudoku puzzle and a prompt describing a video of the puzzle being finished, and the model will complete it. A photo of robot hands holding a jar can be extended into a full video of the motions the hands would take to open that jar.
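For a sense of what such a “specialised system” looks like, the sketch below implements a classical Sobel edge detector in plain NumPy; the synthetic square image is a stand-in for the parrot photo, chosen so the example is self-contained.

```python
# Classical edge detection of the kind video models now replicate without
# being trained for it: a Sobel filter written with plain NumPy.
import numpy as np

def sobel_edges(img: np.ndarray) -> np.ndarray:
    """Return per-pixel edge strength using the Sobel operator."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 3, j:j + 3]
            gx[i, j] = (window * kx).sum()
            gy[i, j] = (window * ky).sum()
    return np.hypot(gx, gy)

# Synthetic test image: a white square on black; edges should trace its outline.
image = np.zeros((16, 16))
image[4:12, 4:12] = 1.0
edges = sobel_edges(image)
print((edges > 1.0).astype(int))  # 1s mark detected edge pixels
```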

The broad range of tasks such models can perform makes them, the paper argues, “zero-shot reasoners”. Zero-shot because the video systems can solve tasks they have never seen before, and were not explicitly trained to do. Reasoners because, at least sometimes, they seem to benefit from what the researchers call “chain-of-frames visual reasoning”, solving tasks like finding the exit to a maze one step at a time.

Promisingly, the paper notes, new systems are significantly better than previous-generation video models at this generalised problem-solving. This, the authors suggest, means video models “will become general-purpose foundation models for vision” in the near future, ultimately able to solve any visual challenge put to them without special training. It is a bold claim, but has a historical echo. In 2022 a team of researchers from Google and the University of Tokyo published a paper noting that “large language models are zero-shot reasoners”, arguing that the then-nascent field of LLMs had “untapped and understudied fundamental zero-shot capabilities”.

Six months later, ChatGPT arrived and the AI boom began. The hope is that video models will mature with a similar wave of excitement—and that the slop phase of Sora will thus turn out to be an interesting footnote in their development, rather than the real McCoy.
