torch>=2.6.0
torchvision>=0.21.0
timm
einops
accelerate
transformers>=4.51.3
diffusers
opencv-python-headless
scipy
wandb
matplotlib
Pillow
tqdm
omegaconf
python-dotenv
ninja
ipykernel
numpy
OmniGen2 is now available in ComfyUI through the ComfyUI-OmniGen2 custom node. OmniGen2 is a powerful and efficient unified multimodal model built from two key components: a 3B Vision-Language Model (VLM) and a 4B diffusion model.
cd ComfyUI/custom_nodes
git clone https://github.com/Yuan-ManX/ComfyUI-OmniGen2.git
# 1. Enter the cloned repository
cd ComfyUI-OmniGen2
# 2. (Optional) Create a clean Python environment
conda create -n omnigen2 python=3.11
conda activate omnigen2
# 3. Install dependencies
# 3.1 Install PyTorch (use the wheel index that matches your CUDA version; cu124 is shown here)
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
# 3.2 Install other required packages
pip install -r requirements.txt
# Note: flash-attn is pinned to 2.7.4.post1 for compatibility with CUDA 12.4.
# Feel free to use a newer version if you are on CUDA 12.6 or once the compatibility issue is fixed upstream.
pip install flash-attn==2.7.4.post1 --no-build-isolation
# Alternative for users in Mainland China: install PyTorch from a domestic mirror
pip install torch==2.6.0 torchvision --index-url https://mirror.sjtu.edu.cn/pytorch-wheels/cu124
# Install the other dependencies from the Tsinghua mirror
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
# Note: flash-attn is pinned to 2.7.4.post1 for compatibility with CUDA 12.4.
# Feel free to use a newer version if you are on CUDA 12.6 or once the compatibility issue is fixed upstream.
pip install flash-attn==2.7.4.post1 --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple
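After either install path, it can help to confirm that the pinned packages actually resolved to acceptable versions. The following is a minimal sketch, not part of ComfyUI-OmniGen2: the helper names `version_tuple` and `check_environment` are hypothetical, and the minimum versions are copied from the requirements list. It uses only the Python standard library, so it runs even when nothing is installed yet.

```python
import re
from importlib import metadata

# Minimum versions taken from the requirements list; unpinned packages are omitted.
MINIMUMS = {"torch": "2.6.0", "torchvision": "0.21.0", "transformers": "4.51.3"}


def version_tuple(version):
    """Turn '2.6.0+cu124' into (2, 6, 0).

    Local build tags (after '+') and pre/post-release suffixes are ignored,
    which is a simplification compared with full PEP 440 ordering.
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)  # pad so that '2.6' compares equal to '2.6.0'
    return tuple(parts)


def check_environment(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "needs attention"
        print(f"{name}: {installed or 'not installed'} ({status})")
```

Run it inside the activated `omnigen2` environment. A package reported as "needs attention" is either missing or older than the minimum listed in the requirements.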
The model weights for OmniGen2, a multimodal generation model, are available on Hugging Face and ModelScope.
To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case.
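Among the parameters described in this section, max_pixels is the most mechanical: an image whose width × height exceeds the limit is scaled down with its aspect ratio preserved. The sketch below illustrates that arithmetic only. It is an assumption about how such a pixel budget can be applied, not OmniGen2's actual resizing code (which may, for example, snap sizes to a multiple of some block size), and the function name `fit_to_pixel_budget` is hypothetical.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels):
    """Return (new_width, new_height) with at most max_pixels pixels.

    Images already within the budget are returned unchanged. Larger images
    are scaled by sqrt(max_pixels / (width * height)) on both sides, so the
    aspect ratio is preserved up to rounding down to whole pixels.
    """
    if width <= 0 or height <= 0 or max_pixels <= 0:
        raise ValueError("width, height and max_pixels must be positive")

    total = width * height
    if total <= max_pixels:
        return width, height

    scale = math.sqrt(max_pixels / total)
    # Floor (not round) so the result never exceeds the budget.
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    # A 2048x2048 input under a 1024*1024 budget is halved on each side.
    print(fit_to_pixel_budget(2048, 2048, 1024 * 1024))
    # An 800x600 input is already within the budget and is left as-is.
    print(fit_to_pixel_budget(800, 600, 1024 * 1024))
```

Because both sides shrink by the square root of the pixel ratio, halving the budget reduces each side by roughly 29%, not 50%. This is worth keeping in mind when trading output resolution against memory use.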
text_guidance_scale: Controls how strictly the output adheres to the text prompt (Classifier-Free Guidance).
image_guidance_scale: Controls how closely the final image should resemble the input reference image.
max_pixels: Automatically resizes images whose total pixel count (width × height) exceeds this limit, while maintaining their aspect ratio. This helps manage performance and memory usage.
max_input_image_side_length: Maximum side length for input images.
negative_prompt: Tells the model what you don’t want to see in the image.
enable_model_cpu_offload: Reduces VRAM usage by nearly 50% with a negligible impact on speed.
enable_sequential_cpu_offload: Minimizes VRAM usage to less than 3GB, but at the cost of significantly slower performance.

Some suggestions for improving generation quality: