How DALL-E 2 Could Solve Key Computer Vision Challenges

OpenAI recently released DALL-E 2, a more advanced version of DALL-E, an ingenious multimodal AI capable of generating images based solely on textual descriptions. DALL-E 2 achieves this with advanced deep learning techniques that improve the quality and resolution of generated images and provide additional capabilities such as editing an existing image or creating new variations of it.

Many AI enthusiasts and researchers have tweeted about how good DALL-E 2 is at generating art and images out of thin air, but in this article I’d like to explore a different application for this powerful text-to-image model: generating datasets to solve some of computer vision’s toughest challenges.

Caption: An image generated by DALL-E 2. “A detective rabbit sitting on a park bench and reading a newspaper in a Victorian setting.” Source: Twitter

The shortcomings of computer vision

Computer vision AI applications can range from detecting benign tumors in CT scans to enabling self-driving cars. Yet what is common to all is the need for abundant data. One of the most important performance predictors of a deep learning algorithm is the size of the underlying data set it was trained on. For example, the JFT dataset, which is an internal Google dataset used for training image classification models, consists of 300 million images and over 375 million labels.

Consider how an image classification model works: a neural network transforms the colors of pixels into a set of numbers that represent its characteristics, also known as “embedding” an input. These features are then mapped to the output layer, which contains a probability score for each class of images the model is expected to detect. During training, the neural network tries to learn the best representations of features that discriminate between classes, for example a pointy ear feature for a Dobermann versus a Poodle.
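To make that pixels-to-embedding-to-probabilities flow concrete, here is a minimal sketch of such a classifier in PyTorch. The architecture, layer sizes, and input resolution are illustrative assumptions, not the production models discussed above:

```python
import torch
import torch.nn as nn

class SimpleImageClassifier(nn.Module):
    """Toy classifier: raw pixels -> feature embedding -> class probabilities."""

    def __init__(self, num_classes: int, embedding_dim: int = 128):
        super().__init__()
        # Convolutional backbone transforms pixel colors into features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # collapse spatial dimensions
            nn.Flatten(),
            nn.Linear(32, embedding_dim),     # the "embedding" of the input
        )
        # Output layer maps the embedding to one score per class.
        self.head = nn.Linear(embedding_dim, num_classes)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        embedding = self.backbone(pixels)
        logits = self.head(embedding)
        return logits.softmax(dim=-1)         # probability score per class

# E.g., probabilities over 10 dog breeds for one 224x224 RGB image.
model = SimpleImageClassifier(num_classes=10)
probs = model(torch.randn(1, 3, 224, 224))
```

During training, the backbone’s weights are adjusted so that the embeddings it produces separate the classes, which is exactly where a spurious feature like “blue background” can sneak in.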

Ideally, a machine learning model would learn to generalize across different lighting conditions, angles, and background environments. Yet more often than not, deep learning models learn the wrong representations. For example, a neural network might infer that blue pixels are a feature of the “frisbee” class because all the images of a frisbee it saw during training were taken on a beach.

A promising way to solve these problems is to increase the size of the training set, for example by adding more images of frisbees with different backgrounds. However, this exercise can prove long and costly.

First, you will need to collect all the required samples, for example by searching online or capturing new images. Next, you need to make sure each class has enough labeled samples to prevent the model from overfitting to some classes or underfitting others. Finally, you will need to label each image, indicating which class it corresponds to. In a world where more training data translates into a better-performing model, these three steps act as a bottleneck to achieving peak performance.

But even then, computer vision models are easily fooled, especially when attacked with adversarial examples. Guess what another way to mitigate adversarial attacks is? You guessed right – more labeled, well-curated, and diverse data.

Caption: OpenAI’s CLIP misclassified an apple as an iPod due to a text label. Source: OpenAI

Enter DALL-E 2

Let’s take the example of a dog breed classifier and a class for which images are a bit harder to find: Dalmatian dogs. Can we use DALL-E to solve this data scarcity problem?

Consider applying the following techniques, all powered by DALL-E 2:

  • Vanilla use. Send the class name as part of a text prompt to DALL-E and add the generated images to that class’s labeled samples, as shown in the sketch after this list. For example, “A Dalmatian dog in the park chasing a bird.”
  • Different environments and styles. To improve the generalizability of the model, use prompts with different environments while maintaining the same class. For example, “A Dalmatian dog on the beach chasing a bird.” The same applies to the style of the generated image, for example “A Dalmatian dog in the park chasing a bird, in a cartoon style.”
  • Adversarial samples. Use the class name to create an adversarial example dataset. For example, “A Dalmatian-style car.”
  • Variations. One of DALL-E 2’s new features is the ability to generate multiple variations of an input image. It can also take a second image and merge the two by combining the most relevant aspects of each. One can then write a script that feeds all the existing images of a dataset into DALL-E to generate dozens of variations per class.
  • Inpainting. DALL-E 2 can also make realistic edits to existing images, adding and removing elements while accounting for shadows, reflections, and textures. This can serve as a powerful data augmentation technique for training and further improving the underlying model.
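Here is a minimal sketch in Python of the first three techniques, assuming access to OpenAI’s Python client and an image-generation endpoint such as openai.Image.create. Since DALL-E 2 is not yet publicly available through an API, the exact call, key placeholder, and prompt wording below are assumptions modeled on the client’s interface:

```python
import openai  # assumes `pip install openai` and a valid API key

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

CLASS_NAME = "Dalmatian dog"
ENVIRONMENTS = ["in the park", "on the beach", "in the snow"]
STYLES = ["", ", in a cartoon style"]
ADVERSARIAL_PROMPT = "A Dalmatian-style car"  # adversarial sample

def generate_labeled_images(prompt: str, n: int = 4) -> list[str]:
    """Request n images for a prompt. Each returned URL is already
    'labeled' by construction, since the class name is in the prompt."""
    # NOTE: assumed endpoint; DALL-E 2 access may differ in practice.
    response = openai.Image.create(prompt=prompt, n=n, size="1024x1024")
    return [item["url"] for item in response["data"]]

dataset: dict[str, list[str]] = {}
for env in ENVIRONMENTS:
    for style in STYLES:
        prompt = f"A {CLASS_NAME} {env} chasing a bird{style}"
        dataset.setdefault(CLASS_NAME, []).extend(generate_labeled_images(prompt))

# Adversarial samples go into their own bucket for robustness training.
dataset["adversarial"] = generate_labeled_images(ADVERSARIAL_PROMPT)
```

The same client also exposes openai.Image.create_variation and openai.Image.create_edit, which would map naturally onto the variations and inpainting techniques above.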

Beyond generating more training data, the huge advantage of all the above techniques is that the newly generated images are already labeled, eliminating the need for human labeling work.

While image generation techniques such as generative adversarial networks (GANs) have been around for some time, DALL-E 2 stands out for its high-resolution 1024×1024 generations, its multimodal nature of transforming text into images, and its strong semantic consistency, i.e., understanding the relationships between different objects in a given image.

Automating dataset creation using GPT-3 + DALL-E

The input to DALL-E is a textual prompt for the image we want to generate. We can leverage GPT-3, a text-generation model, to generate dozens of text prompts per class, which are then fed into DALL-E, which in turn creates dozens of images that are stored per class.

For example, we could generate prompts that include different environments for which we would like DALL-E to create images of dogs.

Caption: A GPT-3 generated prompt to use as input for DALL-E. Source: author

Using this example and a template sentence such as “A [class_name] [gpt3_generated_action]”, we could feed DALL-E the following prompt: “A Dalmatian lying on the ground.” This can be further optimized by fine-tuning GPT-3 to produce dataset captions such as the one in the OpenAI Playground example above.
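A sketch of this two-model pipeline, again assuming the OpenAI Python client. The engine name, prompt template, and sampling parameters are illustrative assumptions:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

def generate_actions(class_name: str, n: int = 10) -> list[str]:
    """Ask GPT-3 for short action phrases to diversify image prompts."""
    completion = openai.Completion.create(
        engine="text-davinci-002",  # assumed engine name
        prompt=f"Write a short action a {class_name} could be doing:",
        max_tokens=16,
        n=n,
        temperature=0.9,
    )
    return [choice["text"].strip() for choice in completion["choices"]]

def build_class_dataset(class_name: str) -> list[str]:
    """Chain GPT-3 prompts into DALL-E image generations for one class."""
    urls = []
    for action in generate_actions(class_name):
        # e.g., "A Dalmatian lying on the ground"
        prompt = f"A {class_name} {action}"
        image = openai.Image.create(prompt=prompt, n=2, size="1024x1024")
        urls.extend(item["url"] for item in image["data"])
    return urls

dalmatian_images = build_class_dataset("Dalmatian")
```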

To further increase confidence in the newly added samples, one can define a certainty threshold and select only the generations that score above it, since each generated image is ranked against its prompt by an image-text model called CLIP.
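A sketch of such a filter using the open-source CLIP package; the threshold value is an arbitrary assumption to be tuned per dataset:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its generating caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()

THRESHOLD = 0.28  # arbitrary certainty threshold; tune per dataset
keep = clip_score("generated.png", "A Dalmatian lying on the ground") > THRESHOLD
```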

Limitations and mitigations

If not used with care, DALL-E may generate inaccurate or narrow-scope images that exclude specific ethnic groups or ignore certain traits, which could lead to bias. A simple example would be a face detector that has only been trained on images of men. Moreover, using images generated by DALL-E could present a significant risk in specific fields such as pathology or self-driving cars, where the cost of a false negative is extreme.

DALL-E 2 still has some limitations, compositionality being one of them. Relying on prompts that, for example, assume the correct positioning of objects can be risky.

Caption: DALL-E is still having trouble with some prompts. Source: Twitter

Ways to mitigate this include human sampling, where a human expert will randomly select samples to check their validity. To optimize such a process, one can follow an active learning approach where the images that obtained the lowest CLIP ranking for a given caption are prioritized for review.
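Reusing the clip_score helper from the sketch above, a minimal version of this review queue could look as follows; the downstream review step is a hypothetical stand-in for whatever labeling tool is in use:

```python
def build_review_queue(samples: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Order (image_path, caption) pairs so the lowest CLIP scores --
    the most suspect generations -- reach a human reviewer first."""
    return sorted(samples, key=lambda pair: clip_score(*pair))

samples = [
    ("gen_001.png", "A Dalmatian dog in the park chasing a bird"),
    ("gen_002.png", "A Dalmatian dog on the beach chasing a bird"),
]
for image_path, caption in build_review_queue(samples)[:100]:
    # Hypothetical hand-off point; plug in your labeling tool here.
    print(f"Review {image_path}: '{caption}'")
```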

Last words

DALL-E 2 is another exciting research result from OpenAI that opens the door to new types of applications. Generating huge datasets to solve one of computer vision’s biggest bottlenecks – data – is just one example.

OpenAI reports that it will release DALL-E 2 sometime this coming summer, most likely in a gradual rollout with a waitlist for interested users. Those who can’t wait, or who can’t pay for such a service, can tinker with open-source alternatives such as DALL-E Mini (Interface, Playground repository).

While the business case for many DALL-E-based applications will depend on the pricing and policy OpenAI sets for its API users, they are all certain to take image generation a big leap forward.

Sahar Mor has 13 years of engineering and product management experience focused on AI products. He is currently a product manager at Stripe, leading strategic data initiatives. Previously, he founded AirPaper, a document intelligence API powered by GPT-3, and was a founding product manager at Zeitgold (acquired by Deel), a B2B AI accounting software company, where he built and scaled its human-in-the-loop product, and at Levity.ai, a no-code AutoML platform. He has also worked as an engineering manager at early-stage startups and at 8200, the elite Israeli intelligence unit.
