The world of AI is still coming to grips with the remarkable feat that is DALL-E 2’s ability to draw/paint/imagine just about anything… but OpenAI isn’t the only one working on something like that. Google Research has rushed to publicize a similar model it’s been working on, which it claims is even better.
Imagen (get it?) is a text-to-image diffusion-based generator built on large transformer language models that… okay, let’s slow down and unpack that real quick.
Text-to-image models take text inputs like “a dog on a bike” and produce a corresponding image, something that’s been done for years but has recently seen huge jumps in quality and accessibility.
Part of that is the use of diffusion techniques, which basically start with a pure noise image and slowly refine it bit by bit until the model thinks it can’t make it look any more like a dog on a bike than it already does. This was an improvement over top-to-bottom generators that could get it hilariously wrong on the first guess, and others that could easily be led astray.
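To make that noise-then-refine loop concrete, here’s a toy sketch. To be clear, this is not Imagen’s actual sampler: the `toy_denoise` stand-in, the step count, and the blend rate are all invented for illustration. A real diffusion model replaces `toy_denoise` with a neural net trained to predict and subtract the noise at each timestep.

```python
import numpy as np

def sample_by_refinement(denoise_step, shape, n_steps=50, seed=0):
    """Start from pure Gaussian noise and repeatedly apply a denoising
    step -- the core loop of a diffusion-style generator."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)       # pure noise
    for t in reversed(range(n_steps)):   # refine from the noisiest step down
        x = denoise_step(x, t)
    return x

# Stand-in "model": nudge every pixel a little toward a fixed gray target
# each step. A trained model instead nudges the image toward whatever
# matches the text prompt.
TARGET = np.full((8, 8), 0.5)

def toy_denoise(x, t):
    return x + 0.1 * (TARGET - x)

img = sample_by_refinement(toy_denoise, (8, 8))
```

After enough small refinement steps, almost nothing of the starting noise survives, which is the point: the output is shaped by the model, not the seed.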
The other part is improved language understanding through large language models using the transformer approach, the technical aspects of which I won’t (and can’t) get into here, but it and a few other recent advances have led to convincing language models like GPT-3 and others.
Imagen starts by generating a small (64 x 64 pixel) image and then does two “super-resolution” passes on it to bring it up to 1024 x 1024. This isn’t like normal upscaling, though, because the AI’s super-resolution creates new details in harmony with the smaller image, using the original as a basis.
Say, for instance, you have a dog on a bike and the dog’s eye is 3 pixels across in the first image. Not a lot of room for expression! But in the second image, it’s 12 pixels across. Where does the detail needed for this come from? Well, the AI knows what a dog’s eye looks like, so it generates more detail as it draws. Then this happens again when the eye is done again, but at 48 pixels across. But at no point did the AI have to just produce 48 pixels’ worth of dog eye from… let’s say its magic bag. Like many artists, it started with the equivalent of a rough sketch, filled it out in a study, then really went to town on the final canvas.
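The cascade of stages can be sketched like this. Again, a toy illustration under stated assumptions: `toy_detail` stands in for a trained super-resolution model, the noise it adds is a crude proxy for the plausible texture a real model would invent, and the 4x-per-stage factors simply reproduce the 64 → 256 → 1024 progression described above.

```python
import numpy as np

def super_resolve(img, factor, detail_model):
    """One super-resolution stage: enlarge the image, then let a model
    add new fine detail conditioned on the enlarged base."""
    enlarged = np.kron(img, np.ones((factor, factor)))  # crude nearest-neighbor upscale
    return detail_model(enlarged)

# Stand-in for a trained model: sprinkle in high-frequency "texture".
# A real model would hallucinate plausible structure (fur, spokes, eyes).
rng = np.random.default_rng(0)

def toy_detail(img):
    return img + 0.05 * rng.standard_normal(img.shape)

base = rng.random((64, 64))                 # the initial 64 x 64 generation
mid = super_resolve(base, 4, toy_detail)    # 64 -> 256
final = super_resolve(mid, 4, toy_detail)   # 256 -> 1024
```

The key design point is that each stage is conditioned on the previous one, so the invented detail stays consistent with the small image instead of drifting off into something new.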
It’s not unprecedented; in fact, artists working with AI models already use this technique to create pieces much larger than the AI can handle in one go. If you split a canvas into several pieces and do super-resolution on them all separately, you end up with something much larger and more intricately detailed; you can even do it repeatedly. An interesting example from an artist I know:
The advances the Google researchers claim with Imagen are several. They say that existing text models can be used for the text-encoding portion, and that their quality matters more than simply increasing visual fidelity. That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for.
For instance, in the paper describing Imagen, they compare its results with DALL-E 2’s for the prompt “a panda making latte art.” In all of the latter’s images, it’s latte art of a panda; in most of Imagen’s, it’s a panda making the art. (Neither was able to render a horse riding an astronaut, showing the opposite in all attempts. It’s a work in progress.)
In Google’s tests, Imagen came out ahead in human-evaluation tests, both for accuracy and fidelity. This is obviously quite subjective, but even matching the perceived quality of DALL-E 2, which until today was considered a huge leap ahead of everything else, is pretty impressive. I’ll only add that while they’re pretty good, none of these images (from any generator) will withstand more than a cursory scrutiny before people notice they’re generated or have serious suspicions.
OpenAI is a step or two ahead of Google in a couple of ways, though. DALL-E 2 is more than a research paper; it’s a private beta with people using it, just as they used its predecessor and GPT-2 and 3. Ironically, the company with “open” in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.
That’s nowhere more evident than in the choice DALL-E 2’s researchers made, to curate the training dataset ahead of time and remove any content that might violate their own guidelines. The model couldn’t make anything NSFW if it tried. Google’s team, however, used large datasets known to include inappropriate material. In an insightful section of the Imagen site describing “Limitations and Societal Impact,” the researchers write:
The downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo.

The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset, which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.
While some may carp at this, saying Google fears its AI isn’t sufficiently politically correct, that’s an uncharitable and short-sighted view. An AI model is only as good as the data it’s trained on, and not every team can spend the time and effort it takes to remove the really awful stuff these scrapers pick up as they assemble multimillion- or multibillion-item image and word datasets.
Such biases are meant to surface during the research process, which exposes how the systems work and provides an unfettered testing ground for identifying these and other limitations. How else would we know that an AI can’t draw hairstyles common among Black people, hairstyles any kid could draw? Or that, when prompted to write stories about workplaces, the AI invariably makes the boss a man? In these cases an AI model is working perfectly and as designed: it has successfully learned the biases that pervade the media it was trained on. Not unlike people!
But while unlearning systemic bias is a lifelong project for many humans, an AI has it easier: its creators can simply remove the content that caused it to behave badly in the first place. Perhaps one day there will be a need for an AI to write in the style of a racist, sexist pundit from the ’50s, but for now the benefits of including that data are small and the risks large.
Either way, it’s clear that Imagen, like the others, is still very much in the experimental phase, ready to be used only under close human supervision. As Google gets around to making its capabilities more accessible, I’m sure we’ll learn more about how and why it works.