Why Generative Video Was Almost-There for Two Years Before Sora

From Runway Gen-1 in early 2023 to Sora's unveiling in February 2024, generative video was a category that kept looking like it was about to break and never quite did. A look at why, and what changed.

Published May 20, 2024 · By CampaignsLive · Insights

The two-year stretch between early 2023 and the public unveiling of OpenAI’s Sora in early 2024 produced an unusual pattern in generative AI: a category that kept looking like it was about to be production-ready, and kept not quite being.

The pattern was consistent enough to be a running joke among practitioners. Every few months a new video model would launch with a striking demo reel: Runway Gen-1 and Gen-2, Pika Labs, Stable Video Diffusion, Google’s Lumiere, Meta’s Emu Video, and several smaller players. Each reel suggested that production-grade work was just around the corner. Actual production use, in each case, stayed thin.

This piece is about why that gap held for two years, and what changed when it broke.

What the gap actually was

The demos showed short clips — usually two to four seconds — at moderate resolution, with motion that was coherent, environments that were plausible, and lighting that mostly held. A reasonable viewer watching the demo could imagine the technology being usable for brand creative within months. The hero clip in any given demo reel looked like the kind of footage a brand might use.

The production gap was in three places the demos did not show.

The first was duration. Most generative video models in 2023 could produce three to four seconds of coherent motion before the model lost track of what was supposed to be happening. Faces drifted. Bodies morphed. Environmental detail rearranged itself between frames. Beyond four seconds, the failure modes compounded. A thirty-second commercial requires either a continuous take of thirty seconds (which the models could not produce) or a sequence of multiple shots that maintain identity, environment, and brand consistency across cuts (which the models could not handle either).

The second was control. The 2023 generation of video models accepted text prompts and, in some cases, image inputs. They did not accept the kind of fine-grained control that production work requires: this specific camera move, this specific composition, this specific motion arc for the talent. Working video production runs on storyboards and shot lists. The models worked on text and hope.

The third was identity. Across cuts, across shots, across formats, brand creative requires that talent looks like talent, product looks like product, environment is recognizably the same environment. The 2023 video models could not maintain identity reliably even within a single clip, let alone across a sequence.

The aggregate effect was that the demos were demos. A brand looking to produce a thirty-second spot could not get one out of the available tools, even though the available tools could produce a three-second clip that looked almost spot-grade.
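The duration gap described above can be put in rough numbers. The sketch below is a back-of-the-envelope calculation, not a production planning tool: it assumes every shot uses the model's full coherent window and ignores transitions, and the function name min_shots is hypothetical, introduced only for illustration.

```python
import math

def min_shots(spot_seconds: float, max_coherent_seconds: float) -> int:
    """Fewest separate clips needed to fill a spot of the given length,
    assuming each clip runs for the full coherent window."""
    return math.ceil(spot_seconds / max_coherent_seconds)

# A thirty-second spot built from clips that hold together for 3 or 4 seconds.
for window in (3, 4):
    print(f"{window}s clips: at least {min_shots(30, window)} shots")
```

Under these assumptions a thirty-second spot needs at least eight to ten separately generated shots, each of which has to match the others on identity, environment, and brand consistency.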

What the brands tried in the meantime

In the meantime, brands used the available tools for what they could do, not for what the demos suggested. Three use cases stuck.

Concept clips and animatics. A model could produce four seconds of moving reference for a stakeholder meeting, in place of a static storyboard, at a cost low enough to justify the exercise. The clip was not the final asset; it was a working tool that let stakeholders react to motion earlier in the workflow than traditional animatics allowed.

Environmental and atmospheric inserts. Sky replacements, weather effects, water motion, foliage, ambient environmental movement. These are short, low-identity, low-narrative uses where the model’s failures on extended motion did not surface.

Image-to-motion on existing campaign material. Taking a static hero shot and applying small motion — a wisp of hair, a gaze shift, a small environmental movement — to extend the asset’s life on social platforms that favored motion. This was the most commercially useful application of the 2023 generation of tools, and the one that quietly drove the most adoption.

What did not stick was full spot production. Several brands tried; the production processes were public; the outputs ranged from passable to embarrassing; none of them set a precedent that the industry followed.

What Sora actually changed

Sora’s public unveiling in February 2024 was a step-change demonstration of three things the prior generation could not do.

The first was duration. Sora produced clips of up to sixty seconds with coherent identity, environment, and motion across the entire length. The technical accomplishment was the most-discussed part of the announcement, and rightly so. Sixty seconds of coherent video is a different category of asset from three seconds.

The second was prompt fidelity. Sora followed detailed prompts with a fidelity the prior tools did not approach. A prompt that specified camera direction, talent direction, environmental detail, and narrative arc produced output that addressed each of the specified elements rather than collapsing into the model’s defaults.

The third was the perceived ceiling. The Sora demos suggested that the path from “this tool can produce coherent video” to “this tool can produce production-grade commercial video” was substantially shorter than the path the prior tools had implied. Whether or not this turned out to be true in 2024 — and the reality was more nuanced than the demos suggested — the announcement reset the industry’s expectations about how close the production-grade moment was.

What Sora did not solve, at announcement, was production-workflow integration. The release was a closed preview. Brand teams could not, in the normal commercial sense, use the tool. The first widely distributed brand commercial produced with Sora (Toys R Us, in June 2024) was a one-off collaboration, not a standard production-stack capability.

The next eighteen months would be about closing the gap between the demo-grade capability and the production-grade workflow. That part is still being worked through.

What this teaches about the broader pattern

The two-year video gap is a useful study in a pattern that applies to most generative AI categories. The demo capability runs ahead of the production capability by twelve to twenty-four months. The production capability runs ahead of the workflow integration by another six to twelve. Between any new technical capability and the point at which a brand can routinely produce work with it, there is a multi-year integration period that the demos do not foreshadow.
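The two lag ranges above add up to the multi-year figure. The following sketch simply sums them; the constant and function names are hypothetical, and the ranges are the ones stated in the paragraph, not measured data.

```python
# Lag ranges in months (low, high), as stated in the paragraph above.
DEMO_TO_PRODUCTION = (12, 24)
PRODUCTION_TO_WORKFLOW = (6, 12)

def total_lag(*stages):
    """Sum (low, high) month ranges across consecutive stages."""
    low = sum(stage[0] for stage in stages)
    high = sum(stage[1] for stage in stages)
    return low, high

low, high = total_lag(DEMO_TO_PRODUCTION, PRODUCTION_TO_WORKFLOW)
print(f"Demo to routine brand use: {low} to {high} months "
      f"({low / 12:.1f} to {high / 12:.1f} years)")
```

On these figures, the span from a striking demo to routine brand use comes to eighteen to thirty-six months, or one and a half to three years.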

This pattern is visible in image generation (the 2022-2024 trajectory from striking single images to production-stable brand assets), in audio (the 2023-2025 trajectory from talking-head voice synthesis to production-grade voiceover replacement), and now in video (the 2023-2026 trajectory still in progress).

For brand-side teams, the practical implication is to discount demo timelines. The category becomes useful when the workflow integration catches up, not when the technical capability is announced. The teams that have done well in each category are the ones that waited for the workflow gap to close before committing production volume, while staying close enough to the technical work to commit on the right cycle once it did.

For the next chapter in this story, see Toys R Us and OpenAI Sora.
