This horse-riding astronaut is a milestone in AI’s journey to make sense of the world
When OpenAI revealed its picture-making neural network DALL-E in early 2021, the program’s human-like ability to combine different concepts in new ways was striking. The series of images DALL-E produced on demand were cartoonish and surreal, but they revealed that the AI had learned key lessons about how the world fits together. DALL-E’s avocado chair had all the essential features of both chairs and avocados; its dog-walking daikons wore their tutus around their waists and held the leashes in their hands.
Today the San Francisco-based lab announced DALL-E’s successor, DALL-E 2. It produces better images, is easier to use, and will (eventually) be released to the public. DALL-E 2 may even stretch current definitions of artificial intelligence, forcing us to examine the concept and decide what it really means.
Image-generation models like DALL-E have come a long way in just a few years. In 2020, AI2 showed off a neural network that could generate images from prompts such as “Three people play video games on a couch.” The results were distorted and blurry, but just about recognizable. Baidu, the Chinese tech giant, has since improved on the original DALL-E’s image quality with a model called ERNIE-ViLG.
DALL-E 2 extends this approach. Ask it for images of astronauts riding horses, sea otters, or teddy-bear scientists, and it renders them with near photorealism. OpenAI’s examples (see below) and the ones I saw in a demo last week were cherry-picked, but the quality of the results is often remarkable all the same.
“One way to think about this neural network is transcendent beauty as a service,” says Ilya Sutskever, cofounder and chief scientist of OpenAI. “Every now and then it generates something that just makes me gasp.”
DALL-E 2’s improved performance comes from a complete redesign. The original version was an extension of GPT-3, which can be thought of as a supercharged autocomplete: start it off with a few words or sentences and it predicts the next few hundred. DALL-E worked in much the same way, but swapped words for pixels: it “completed” a text prompt by predicting the string of pixels it thought should come next, producing an image.
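The autocomplete analogy can be sketched in a few lines of code. The bigram model below is a hypothetical, vastly simplified stand-in for GPT-3: it merely counts which token tends to follow which, while GPT-3 uses a 175-billion-parameter transformer. But the generation loop has the same shape: predict the next token, append it, repeat.

```python
# Toy sketch of the autoregressive idea behind GPT-3 and the original DALL-E.
# A bigram table is a hypothetical stand-in for a real language model.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which token follows which in the training sentences."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for a, b in zip(tokens, tokens[1:]):
            follows[a][b] += 1
    return follows

def complete(follows, prompt, n=3):
    """Greedily append the most frequent next token, n times."""
    tokens = prompt.split()
    for _ in range(n):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # nothing ever followed this token in training
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

corpus = [
    "an astronaut rides a horse on the moon",
    "a corgi rides a horse on the beach",
]
model = train_bigrams(corpus)
print(complete(model, "an astronaut rides", n=4))
```

Swap the word tokens for pixel (or image-patch) tokens and you have, in caricature, how the first DALL-E “completed” a text prompt into a picture.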
DALL-E 2 is not based on GPT-3. Under the hood, it works in two stages. First it uses OpenAI’s CLIP model, which can match written descriptions with images, to translate the text prompt into an intermediate form that captures the key characteristics an image should have to fit the prompt (according to CLIP). Then DALL-E 2 runs a type of neural network known as a diffusion model to generate an image that satisfies that intermediate form.
Diffusion models are trained on images that have been completely distorted with random pixels, and they learn to convert such images back into their original form. In DALL-E 2 there are no existing images to restore: the diffusion model takes the random pixels and, guided by CLIP, turns them into a brand-new image, created from scratch, that matches the text prompt.
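That loop, start from pure noise and repeatedly denoise toward something a guide approves of, can be sketched with a toy. Everything here is a hypothetical stand-in: the “image” is four numbers, and the hand-written guide simply measures distance to a fixed target, in place of the learned denoiser and CLIP guidance a real system uses. Only the loop structure carries over.

```python
# Toy caricature of the reverse-diffusion loop: noise in, image out.
import random

# Hypothetical stand-in for "the image CLIP says matches the prompt".
TARGET = [0.9, 0.1, 0.1, 0.9]

def guide_score(image):
    """Pretend guidance score: higher when closer to the target."""
    return -sum((p - t) ** 2 for p, t in zip(image, TARGET))

def denoise_step(image, strength=0.2):
    """Nudge every pixel a little toward what the guide prefers.
    In a real diffusion model this step is a trained neural network."""
    return [p + strength * (t - p) for p, t in zip(image, TARGET)]

random.seed(0)
image = [random.random() for _ in range(4)]  # start from pure noise

for _ in range(30):  # the reverse-diffusion loop
    image = denoise_step(image)
```

After the loop, the random starting pixels have converged on the target: a brand-new “image” that was never in any training set, which is the property the article describes.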
The diffusion model lets DALL-E 2 produce high-resolution images more quickly than the original DALL-E could. In the demo, Aditya Ramesh, one of DALL-E 2’s creators, showed me images of a hedgehog using a calculator, a corgi and a panda playing chess, and a cat dressed as Napoleon holding a piece of cheese. I remark on the weirdness of the subjects. “It’s easy to spend a whole day thinking up prompts,” he says.
DALL-E 2 still slips up. It can struggle with prompts that ask it to combine objects with multiple attributes, such as “A red cube on top of a blue cube.” OpenAI believes this is because CLIP does not always connect objects and attributes correctly.
As well as riffing off text prompts, DALL-E 2 can spin out variations of existing images. Ramesh uploads a photo of street art taken outside his apartment, and the AI instantly generates alternate versions of the scene with different art on the wall. Any of those new images can then be used to kick off its own sequence of variations. Ramesh says this feedback loop could prove very useful for designers.
One early user, the artist Holly Herndon, says she is using DALL-E 2 to create wall-sized compositions. “I can put together huge artworks piece by piece, almost like a patchwork tapestry or narrative journey,” she says. “It feels like working in a new medium.”
DALL-E 2 looks much more like a polished product than the previous version did, though Ramesh says that wasn’t the goal. OpenAI plans to release DALL-E 2 to the public after an initial rollout to a limited number of trusted users, much as it did with GPT-3. (You can sign up for access here.)
GPT-3 can produce toxic text. But OpenAI says it has used the feedback it got from users of GPT-3 to train a safer version, called InstructGPT. The company plans to follow a similar route with DALL-E 2, which will also be shaped by user feedback. OpenAI will encourage its first users to try to break the AI by tricking it into generating offensive or harmful images. As it works through these problems, OpenAI will make DALL-E 2 available to a wider group of people.
OpenAI has also released a user policy that prohibits asking the AI to generate offensive images: no violence, no pornography, and no political imagery. To prevent deepfakes, users will not be allowed to ask DALL-E 2 to create images of real people.
As well as the user policy, OpenAI has removed certain types of image from DALL-E 2’s training data, including those showing graphic violence. OpenAI also says it will pay human moderators to review every image uploaded to its platform.
“Our main goal here is to get a lot more feedback before we start sharing the system more widely,” says Prafulla Dhariwal, a researcher at OpenAI. “I hope eventually it will be available, so that developers can build apps on top of it.”
Multiskilled AIs that can view the world and work with concepts across multiple modalities, like language and vision, are a step toward more general-purpose intelligence. DALL-E 2 is one of the most impressive examples yet.
Oren Etzioni, CEO of AI2, is impressed by DALL-E 2’s images but cautious about what they imply for AI’s overall progress. This kind of improvement doesn’t bring us closer to AGI, he says: “We know deep learning is capable of solving very narrow tasks. But those tasks are still formulated by humans, who then give deep learning its marching orders.”
Mark Riedl, an AI researcher from Georgia Tech in Atlanta, believes creativity is a good indicator of intelligence. Unlike the Turing test, which requires a machine to fool a human through conversation, Riedl’s Lovelace 2.0 test judges a machine’s intelligence according to how well it responds to requests to create something, such as “A picture of a penguin in a spacesuit on Mars.”
DALL-E 2 scores well on this test. But intelligence is a sliding scale: as we build better and better machines, our tests for intelligence must adapt. Many chatbots can now mimic human conversation well enough to pass the Turing test in a narrow sense, yet they are still mindless.
But ideas about what it means to “create” and “understand” change too, says Riedl. “These terms are ill-defined and subject to debate.” A bee understands the meaning of yellow because it acts on that information, he points out. If understanding is defined in terms of human understanding, then AI systems are a long way off. “But I would also argue that these art-generation systems have some basic understanding that overlaps with human understanding,” he says. “They can put a tutu on a radish in exactly the place a human would.”
In that sense, DALL-E 2 also acts on the information it is given, producing images that meet human expectations. AIs like DALL-E force us to ask these questions and to consider what such terms really mean.
OpenAI is clear about its position. Dhariwal says that the goal is to create general intelligence. “Building models like DALL-E 2 that connect vision and language is a crucial step in our larger goal of teaching machines to perceive the world the way humans do, and eventually developing AGI.”