Someone said that a long, long time ago.
This video, and the dogs and the snow in it, was created without a single photon bouncing off a puppy or a snowflake and into a sensor. Instead, someone entered the following oddly incomplete-sounding prompt into Sora, the new text-to-video tool from OpenAI: “A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.”
Seeing is believing was never fully true; magicians and optical illusions have always inhabited the cracks in our perception. It was called trick photography when I was growing up. Photoshop brought that capability to the general public in the 1990s, and it was used to airbrush ads and photos of celebrities. CGI became a thing at the movies, and we marveled at Gollum in the first Lord of the Rings at the dawn of this century. The same technology allowed Marvel to become an entertainment powerhouse for the next two decades, and for three hours at a time we inhabited make-believe worlds filled with make-believe characters.
But it wasn’t cheap, easy, or fast. You know this if you’ve ever waited till the end of a Marvel movie for that post-credits scene. Before it arrives, you have to sit through many minutes of screenfuls of people’s names from Weta or some other FX studio. It literally took a studio full of people and computers to trick you into believing what you saw.
About a year ago, you could create almost photorealistic images of a monkey in an astronaut suit riding a horse (many of us did). DALL-E, Midjourney, and Stable Diffusion invaded our imaginations and our screens.
More recently, Google, Runway, Pika, and others let us make a few seconds of video of anything we could put into words. The motion was jerky and the pixelation crude, but it was fascinating. Today OpenAI announced Sora. The era of seeing is believing is over.
Close your eyes and imagine three fluffy goldens playing in the soft snow. Puppies wrestle, snow flies, ears flop, deep brown eyes stare. Snow sticks to wet black muzzles. If I could now pull that visual image out of your brain and put it in a 20-second video, it would probably not look very different from Sora’s video. Neither is the process inside Sora all that different. It has been trained on billions of visuals of dogs and puppies and snow, and perhaps even of puppies playing in the snow, along with sentences containing those words. From that composite image in its artificial mind, it creates this video.

Training Sora is very expensive. Only a handful of companies worldwide have the deep pockets, the technology, the access to the needed silicon, and the legal muscle (the ethics are being questioned in courts) to pull this off today. But the rendering, the “inference” as it is called, is almost effortless compared to the cost of producing the same video with the traditional FX techniques used to make the Marvel movies. You prefer Dalmatian puppies in the mud? Just retype the prompt and hit return.
Those of us who lived on both sides of this day will remember when seeing was believing. Hereafter, there won’t be any expectation that reality and video are, or should be, related. In time, that will seem like a quaint idea. We will go to the cinema, in the words of my friend Professor Ghosh, to watch Oscar-winning movies fully generated by AIs, with nary a human actor or cinematographer in sight. In the words of Benj Edwards of Ars Technica, “Even when the kid jumped over the lava, there was at least a kid and a room.” Tomorrow, there will be no kid or room, and certainly no lava.
Check out Sora at https://openai.com/sora