OmniGen


A unified image generation diffusion model that naturally supports multiple image generation tasks with high flexibility and scalability.


The advent of large language models (LLMs) has unified language generation tasks and revolutionized human-computer interaction. In the image generation domain, however, a unified model capable of handling a variety of tasks within a single framework remains largely unexplored. Recently, SmartSource released OmniGen, a new diffusion model architecture: a multimodal model for unified image generation.

OmniGen has the following features:

  1. Unification: OmniGen naturally supports a wide variety of image generation tasks, such as text-to-image generation, image editing, subject-driven generation, and visual-condition-based generation. In addition, OmniGen can handle classical computer vision tasks by converting them into image generation tasks.
  2. Simplicity: OmniGen's architecture is highly simplified. It is also more user-friendly than existing models: complex tasks can be accomplished through instructions alone, without lengthy preprocessing steps or additional modules (e.g. ControlNet or IP-Adapter), which greatly simplifies the workflow.
  3. Knowledge transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, handles unseen tasks and domains, and exhibits novel capabilities. We also explore potential applications of the model's reasoning capabilities and chain-of-thought mechanism in image generation.
  • Paper: https://arxiv.org/pdf/2409.11340
  • Code: https://github.com/VectorSpaceLab/OmniGen
  • Demo: https://huggingface.co/spaces/Shitao/OmniGen
[Figure: an example pipeline combining OmniGen's capabilities]
OmniGen's general-purpose capabilities enable more flexible image generation. The simple pipeline shown above generates an image from text, edits some elements of that image, redraws a new image based on the body pose of the generated one, and finally extracts the desired objects from another image and fuses them into the new image.

I. Introduction

In recent years, many text-to-image models have stood out in the wave of generative AI. However, these excellent proprietary models can only generate images from text. When users have more flexible, complex, or fine-grained image generation needs, additional plug-ins and operations are often required.

For example, to generate an image that follows an arbitrary pose, the conventional approach is to use a pose detector to estimate the pose from a reference image as the conditional input, load the corresponding ControlNet plug-in, and finally extract features from the conditional input to feed into the diffusion model that generates the image.
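For concreteness, this conventional multi-step workflow looks roughly like the sketch below (a minimal illustration using a diffusers-style ControlNet stack; the model IDs and file names are assumptions for demonstration only):

```python
# Conventional pose-conditioned generation: pose detector + ControlNet plug-in + diffusion model.
# Illustrative sketch using the diffusers / controlnet_aux stack; model IDs and files are examples.
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

reference = Image.open("reference.jpg")

# Step 1: estimate the pose from the reference image with a separate detector.
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_map = openpose(reference)

# Step 2: load the matching ControlNet plug-in for the pose condition.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

# Step 3: feed the conditional input into the diffusion model to generate the image.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
result = pipe("a dancer on a stage, studio lighting", image=pose_map).images[0]
result.save("pose_conditioned.png")
```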

In addition, to generate a new image based on a specific person in a group photo, the process is even more cumbersome: the photo must first be cropped to ensure that the resulting image contains only the target person.

Methods such as InstantID, on the other hand, require an additional face detector to extract facial information and a face encoder to extract features to feed into the model.

It is worth noting that different generation tasks require yet more plug-ins and operations, and such a complex, fragmented, and lengthy workflow greatly increases training and application costs. Even so, it is sometimes still difficult to meet general image generation needs, such as generating a new image based on entities specified across multiple photographs.

By contrast, in the text generation domain, models represented by ChatGPT can directly handle a wide variety of text tasks through human instructions. So, in the image generation domain, can a single model that supports multiple input types and couples multiple capabilities accomplish various generation tasks from user instructions, without all these complicated processes?

To address this challenging issue, SmartSource has released the unified image generation model OmniGen. The model is simple and easy to use, integrating a wide range of basic image generation tasks, including but not limited to: text-to-image generation, image editing, subject-consistent generation, and visual-condition-based generation. OmniGen can complete tasks from arbitrary multimodal text-image instructions, without any additional plug-ins or operations.

II. Capabilities

OmniGen combines a number of capabilities, including but not limited to:
  1. Text to Image Generation
  2. Referring Expression Generation
  3. General Image Conditional Generation
  4. Image Editing
  5. Classical computer vision tasks: image denoising, edge detection, pose estimation, etc.
  6. Certain in-context learning abilities
Some of these capabilities are briefly demonstrated below:
2.1 Text to Image Generation
[Figure: text-to-image generation examples]
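A minimal text-to-image call, following the usage pattern documented in the OmniGen repository (parameter names here mirror that README and may change between versions):

```python
# Minimal OmniGen text-to-image sketch, following the repository's documented usage.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# A plain text prompt; no plug-ins or preprocessing are required.
images = pipe(
    prompt="A curly-haired man in a red shirt is drinking tea.",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("text_to_image.png")
```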
2.2 Referring Expression Generation

Like models such as InstantID and PuLID, OmniGen can generate subject-consistent images: given an image containing a single subject, it understands and follows the instruction to produce a new image based on that subject.

At the same time, OmniGen has a higher-order capability: referring expression generation, which we define as the ability to identify the object referred to by an instruction within an image containing multiple objects and to generate a new image accordingly.

For example, OmniGen can locate the target person directly in a multi-person image based on an instruction and generate a new image that follows the instruction, without any additional modules or operations:

[Figure: referring expression generation example]
More diverse examples:
[Figure: more referring expression generation examples]
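A sketch of how such an instruction is issued, again following the repository's documented multimodal prompt format, in which "<img><|image_1|></img>" marks where an input image is referenced (the file name is a hypothetical example):

```python
# Referring expression generation sketch: the prompt refers to a specific person
# inside a multi-person input image; placeholder syntax follows the OmniGen README.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

images = pipe(
    prompt=(
        "A man in a black shirt is reading a book. "
        "The man is the right man in <img><|image_1|></img>."
    ),
    input_images=["two_men.jpg"],   # hypothetical local file
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    seed=0,
)
images[0].save("referring_expression.png")
```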
2.3 General Image Conditional Generation

OmniGen not only supports ControlNet-like generation based on explicit visual conditions, but can also handle classical computer vision tasks (e.g., human pose estimation, depth estimation) by itself.

As a result, OmniGen can perform the entire ControlNet process with a single model: OmniGen directly extracts the visual conditions from the original image and then generates an image based on the extracted conditions, without any additional processor.

OmniGen can simplify this even further and produce the image in one step: directly input the original image with the instruction "Following the human pose (or depth map) of this image, generate a new image: ...", and a new image is generated according to the human pose or depth map of the input image.

[Figure: image-conditioned generation examples]
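A sketch of the two-step variant described above, with both the condition extraction and the final generation done by the same model; the prompts and file names are illustrative, and the one-step variant simply merges the two instructions:

```python
# Two-step sketch of the ControlNet-style process done entirely with OmniGen:
# step 1 extracts the visual condition (a pose skeleton) from the original image,
# step 2 generates a new image conditioned on that extracted skeleton.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Step 1: OmniGen itself performs the classical vision task (pose estimation).
skeleton = pipe(
    prompt="Detect the skeleton of the human in this image: <img><|image_1|></img>",
    input_images=["person.jpg"],    # hypothetical local file
    height=1024, width=1024,
    guidance_scale=2.5, img_guidance_scale=1.6,
)[0]
skeleton.save("skeleton.png")

# Step 2: OmniGen generates a new image that follows the extracted pose.
images = pipe(
    prompt=(
        "Following the pose of this image <img><|image_1|></img>, "
        "generate a new image: an astronaut standing on the moon."
    ),
    input_images=["skeleton.png"],
    height=1024, width=1024,
    guidance_scale=2.5, img_guidance_scale=1.6,
)
images[0].save("pose_follow.png")
```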
2.4 Image Editing
OmniGen has good image editing capabilities and can execute multiple editing commands simultaneously in a single run, for example:
[Figure: image editing examples]
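A sketch of a single run carrying several edit instructions at once (the prompt and file name are illustrative):

```python
# Image-editing sketch: several edit instructions are issued in one run.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

images = pipe(
    prompt=(
        "<img><|image_1|></img> Remove the cup from the table, "
        "change the curtains to blue, and add a cat on the sofa."
    ),
    input_images=["living_room.jpg"],   # hypothetical local file
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
images[0].save("multi_edit.png")
```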
2.5 Additional capabilities

OmniGen also shows emerging reasoning ability: it can handle implicit instructions that require the model to understand and reason about the query rather than follow it literally.

For example, if the model is asked to delete the item in an image that can hold water, it is able to understand the instruction, infer which object in the image is meant, and delete it:

[Figure: reasoning-based editing example]

On the other hand, OmniGen has a certain degree of in-context learning ability and can process images based on reference examples. For example, given an input-output example of segmenting the queen chess piece, the model can identify and segment the corresponding object in a new input image:

[Figure: in-context learning example]

The Chain-of-Thought (CoT) approach significantly improves the performance of LLMs by decomposing a task into multiple steps and solving each step sequentially to reach an accurate final answer. We consider whether something similar can be applied to image generation. Inspired by the way humans paint, we wanted to mimic the step-by-step painting process by iteratively generating images from a blank canvas. We conducted a preliminary exploration and fine-tuned the model to mimic this step-by-step behavior, leaving further optimization for future research.

[Figure: step-by-step generation example]
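A purely conceptual sketch of this step-by-step idea: start from a blank canvas and repeatedly feed the current canvas back to the model. The prompt format used by the fine-tuned step-by-step variant is not specified here, so everything below is hypothetical:

```python
# Conceptual step-by-step generation loop: refine a blank canvas over several passes.
# The prompt wording and the number of steps are hypothetical illustrations.
from PIL import Image
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Start from a blank white canvas.
Image.new("RGB", (1024, 1024), "white").save("canvas.png")

for step in range(4):
    images = pipe(
        prompt=(
            "Continue painting this canvas <img><|image_1|></img> one step further "
            "toward: a watercolor landscape with mountains and a lake."
        ),
        input_images=["canvas.png"],
        height=1024, width=1024,
        guidance_scale=2.5, img_guidance_scale=1.6,
    )
    images[0].save("canvas.png")   # feed the result back as the next step's input
```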

OmniGen's capabilities include, but are not limited to, the above; it also supports basic image denoising, edge extraction, and other tasks. The model weights and code have been open-sourced so that users can explore more of OmniGen's capabilities on their own.

III. Models

The core design principles of OmniGen are simplicity and efficiency. Its basic architecture consists of a Transformer model with 3.8B parameters and a VAE module. The Transformer is initialized from the Phi3-mini model, and bidirectional attention is used within image tokens to match the characteristics of image data. The overall architecture is shown below:

[Figure: OmniGen architecture]
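The key attention rule can be sketched as follows: causal attention over the whole sequence, but fully bidirectional attention among tokens that belong to the same image. The helper below is an illustrative reconstruction, not OmniGen's actual implementation:

```python
# Sketch of the attention rule described above: standard causal attention over the
# sequence, with bidirectional attention opened up inside each image's token span.
# `image_spans` lists (start, end) token index ranges occupied by each image;
# the masking convention here (True = attention allowed) is an assumption.
import torch

def build_attention_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    # Start from a lower-triangular (causal) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow full bidirectional attention within each image's token span.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Example: a 10-token sequence whose tokens 3..7 come from one input image.
print(build_attention_mask(10, [(3, 8)]).int())
```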

To achieve strong generality and generalization, models need to be trained on large and diverse datasets. However, no general-purpose dataset was available in the field of image generation. To this end, we constructed the first large-scale, diverse, unified image generation dataset, X2I, which stands for "Anything to Image". The X2I dataset contains about 100 million images and will be open-sourced after review to further advance the field of general-purpose image generation. The figure below briefly shows some examples from the X2I dataset:

[Figure: X2I dataset examples]
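To make the "anything to image" idea concrete, a unified sample might look like the sketch below, where every task reduces to an instruction with image placeholders, a list of input images, and a target image; the field names are hypothetical and do not describe the released X2I schema:

```python
# Illustrative sketch of a unified "anything to image" sample format.
# Field names and file paths are hypothetical examples only.
samples = [
    {   # text-to-image
        "instruction": "A red bicycle leaning against a brick wall.",
        "input_images": [],
        "target_image": "t2i/000001.png",
    },
    {   # image editing
        "instruction": "<img><|image_1|></img> Replace the sky with a starry night.",
        "input_images": ["edit/src_000001.png"],
        "target_image": "edit/tgt_000001.png",
    },
    {   # classical vision task cast as image generation
        "instruction": "Detect the human pose in <img><|image_1|></img>.",
        "input_images": ["pose/src_000001.png"],
        "target_image": "pose/skeleton_000001.png",
    },
]
```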

IV. Summary and outlook

In summary, OmniGen's unified image generation paradigm not only facilitates the execution of various downstream tasks, but also makes it easy to combine capabilities to meet more general needs. OmniGen's report, weights, and code are now open source, and the community is welcome to explore OmniGen's potential capabilities, improve its basic performance, and build broad applications.

The OmniGen model is an initial attempt at unified image generation, and there is still much room for improvement. In the future, we will further improve the model's basic capabilities and add more interesting functions. Meanwhile, the fine-tuning code has been released, and users can easily fine-tune the model. Since OmniGen's input format is very flexible, users can define all kinds of fine-tuning tasks themselves, giving the model more interesting capabilities.
