ComfyUI-QwenVL

★ 710

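Multimodal AI · Image and video understanding · Text generation · Automatic model download

Integrates the Qwen-VL series (including Qwen2.5-VL and Qwen3-VL) into ComfyUI, providing multimodal nodes for image/video understanding and text generation, with preset prompts, advanced controls, automatic model download, and quantization.

💡 Use Qwen-VL in ComfyUI to run multimodal analysis on images or videos and generate text descriptions.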

🍴 93 Forks · 💻 Python · 🔄 2026-02-10

🔗 GitHub source

📦 Netdisk download

Copy the link, then download from Quark Netdisk:

https://pan.quark.cn/s/e98a62d17551

📦 requirements.txt

# GGUF backend requirements (vision-capable llama-cpp-python).
# Follow the platform-specific install guide:
#   docs/LLAMA_CPP_PYTHON_VISION_INSTALL.md
#
# Linux (CUDA) example:
#   pip install --upgrade --force-reinstall --no-cache-dir \
#     llama-cpp-python==<version> --extra-index-url <vision-wheel-index>
#
# Windows example:
#   1) Install a vision-capable wheel from the guide.
#   2) Then verify with:
#      python -c "import llama_cpp; print(llama_cpp.__version__)"
#
# Placeholder only; do not rely on this line alone:
# llama-cpp-python


📄 README

QwenVL for ComfyUI

The ComfyUI-QwenVL custom node integrates the powerful Qwen-VL series of vision-language models (LVLMs) from Alibaba Cloud, including the latest Qwen3-VL and Qwen2.5-VL, plus GGUF backends and text-only Qwen3 support. This advanced node enables seamless multimodal AI capabilities within your ComfyUI workflows, allowing for efficient text generation, image understanding, and video analysis.

📰 News & Updates

2026/02/08: v2.1.1 Fixed compatibility for Transformers 4.x and 5.x [Update]

2026/02/05: v2.1.0 Added SageAttention support with per-GPU architecture optimization, improved FP8 model handling, and automatic attention mode selection. [Update]

SageAttention Support: New attention mode with per-GPU optimized kernels (SM80, SM89, SM90, SM120)

Improved FP8 Handling: Better support for pre-quantized FP8 models with automatic SDPA fallback

Smart Attention Selection: Auto mode now tries Sage → Flash → SDPA for optimal performance

Progress Bar: Added ComfyUI progress bar for model loading and generation stages

Better Memory Management: Improved cache clearing when changing attention modes or quantization

2025/12/22: v2.0.0 Added GGUF supported nodes and Prompt Enhancer nodes. [Update]

[!IMPORTANT]

Install llama-cpp-python before running the GGUF nodes; see docs/LLAMA_CPP_PYTHON_VISION_INSTALL.md for instructions.

2025/11/10: v1.1.0 Runtime overhaul with attention-mode selector, flash-attn auto detection, smarter caching, and quantization/torch.compile controls in both nodes. [Update]

2025/10/31: v1.0.4 Custom Models Supported [Update]

2025/10/22: v1.0.3 Models list updated [Update]

2025/10/17: v1.0.0 Initial Release

Support for Qwen3-VL and Qwen2.5-VL series models.

Automatic model downloading from Hugging Face.

On-the-fly quantization (4-bit, 8-bit, FP16).

Preset and Custom Prompt system for flexible and easy use.

Includes both a standard and an advanced node for users of all levels.

Hardware-aware safeguards for FP8 model compatibility.

Image and Video (frame sequence) input support.

“Keep Model Loaded” option for improved performance on sequential runs.

Seed parameter for reproducible generation.

Example workflow: https://github.com/1038lab/ComfyUI-QwenVL/blob/main/example_workflows/QWenVL.json

✨ Features

Standard & Advanced Nodes: Includes a simple QwenVL node for quick use and a QwenVL (Advanced) node with fine-grained control over generation.

Prompt Enhancers: Dedicated text-only prompt enhancers for both HF and GGUF backends.

Preset & Custom Prompts: Choose from a list of convenient preset prompts or write your own for full control.

Multi-Model Support: Easily switch between various official Qwen-VL models.

Automatic Model Download: Models are downloaded automatically on first use.

Smart Quantization: Balance VRAM and performance with 4-bit, 8-bit, and FP16 options.

Hardware-Aware: Automatically detects GPU capabilities and prevents errors with incompatible models (e.g., FP8).

Reproducible Generation: Use the seed parameter to get consistent outputs.

Memory Management: “Keep Model Loaded” option to retain the model in VRAM for faster processing.

Image & Video Support: Accepts both single images and video frame sequences as input.

Robust Error Handling: Provides clear error messages for hardware or memory issues.

Clean Console Output: Minimal and informative console logs during operation.

SageAttention Support: GPU-optimized attention mechanism with per-architecture kernels (Ampere, Ada, Hopper, Blackwell).

Progress Bar: Visual feedback during model loading and generation stages.

Intelligent Cache Management: Automatically clears VRAM when changing attention modes or quantization settings.
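The “Keep Model Loaded” and cache-clearing features above can be pictured as a one-slot cache keyed by the load settings. The block below is only a minimal sketch of that idea, not the node's actual code: `ModelCache`, `load_fn`, and `release_fn` are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Optional, Tuple


class ModelCache:
    """One-slot cache: reuse the loaded model until the load settings change."""

    def __init__(self, load_fn: Callable[..., Any], release_fn: Callable[[Any], None]):
        self._load_fn = load_fn        # hypothetical loader for a model
        self._release_fn = release_fn  # hypothetical cleanup hook (e.g. free VRAM)
        self._key: Optional[Tuple[str, str, str]] = None
        self._model: Any = None

    def get(self, model_name: str, quantization: str, attention_mode: str) -> Any:
        key = (model_name, quantization, attention_mode)
        if self._model is not None and key == self._key:
            return self._model  # settings unchanged: reuse the resident model
        self.clear()  # settings changed: release the old model before loading
        self._model = self._load_fn(model_name, quantization, attention_mode)
        self._key = key
        return self._model

    def clear(self) -> None:
        """Release the resident model, if any."""
        if self._model is not None:
            self._release_fn(self._model)
        self._model, self._key = None, None
```

In this sketch, a caller with “Keep Model Loaded” disabled would call `clear()` after each run, while a caller with it enabled would simply keep calling `get()`.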

🚀 Installation

Clone this repository to your ComfyUI/custom_nodes directory:

```
cd ComfyUI/custom_nodes
git clone https://github.com/1038lab/ComfyUI-QwenVL.git
```

Install the required dependencies:

```
cd ComfyUI/custom_nodes/ComfyUI-QwenVL
pip install -r requirements.txt
```

Restart ComfyUI.

Optional: SageAttention Support

For optimal performance on supported GPUs, install SageAttention:

pip install sageattention

🧭 Node Overview

Transformers (HF) Nodes

QwenVL: Quick vision-language inference (image/video + preset/custom prompts).

QwenVL (Advanced): Full control over sampling, device, and performance settings.

QwenVL Prompt Enhancer: Text-only prompt enhancement (supports both Qwen3 text models and QwenVL models in text mode).

GGUF (llama.cpp) Nodes

QwenVL (GGUF): GGUF vision-language inference.

QwenVL (GGUF Advanced): Extended GGUF controls (context, GPU layers, etc.).

QwenVL Prompt Enhancer (GGUF): GGUF text-only prompt enhancement.

🧩 GGUF Nodes (llama.cpp backend)

This repo includes GGUF nodes powered by llama-cpp-python (separate from the Transformers-based nodes).

Nodes: QwenVL (GGUF), QwenVL (GGUF Advanced), QwenVL Prompt Enhancer (GGUF)

Model folder (default): ComfyUI/models/llm/GGUF/ (configurable via gguf_models.json)

Vision requirement: install a vision-capable llama-cpp-python wheel that provides Qwen3VLChatHandler / Qwen25VLChatHandler

See docs/LLAMA_CPP_PYTHON_VISION_INSTALL.md
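To check whether an installed llama-cpp-python build is vision-capable, one option is to look for the two handler classes named above. This is a rough diagnostic sketch: it assumes the handlers live in the `llama_cpp.llama_chat_format` module (where upstream llama-cpp-python keeps its chat handlers), which may not hold for every build, and `check_vision_handlers` is a hypothetical helper name.

```python
import importlib


def check_vision_handlers(handler_names=("Qwen3VLChatHandler", "Qwen25VLChatHandler")):
    """Return {handler_name: bool}, or None if llama_cpp cannot be imported."""
    try:
        # Assumption: chat handlers are exposed by this submodule.
        chat_format = importlib.import_module("llama_cpp.llama_chat_format")
    except Exception:
        # Covers a missing package as well as a native library that fails to load.
        return None
    return {name: hasattr(chat_format, name) for name in handler_names}


if __name__ == "__main__":
    result = check_vision_handlers()
    if result is None:
        print("llama-cpp-python could not be imported")
    else:
        for name, present in result.items():
            print(f"{name}: {'found' if present else 'missing'}")
```

If either handler is reported missing, the install guide in docs/LLAMA_CPP_PYTHON_VISION_INSTALL.md is the place to look.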

🗂️ Config Files

HF models: hf_models.json

hf_vl_models: vision-language models (used by QwenVL nodes).

hf_text_models: text-only models (used by Prompt Enhancer).

GGUF models: gguf_models.json

System prompts: AILab_System_Prompts.json (includes both VL prompts and prompt-enhancer styles).
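As an illustration of how the two sections of hf_models.json might be read, here is a small sketch. The README does not show the file's schema, so the code assumes each section is either a mapping from model name to settings or a plain list of names; `load_config` and `list_models` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_config(path) -> dict:
    """Parse a JSON config file such as hf_models.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def list_models(config: dict, section: str) -> list:
    """Return the sorted model names found in one section of the parsed config."""
    entries = config.get(section, {})
    # Assumption: a section is either {name: settings} or a list of names.
    if isinstance(entries, dict):
        return sorted(entries)
    return sorted(str(entry) for entry in entries)


if __name__ == "__main__":
    # Illustrative data only; the real file may be structured differently.
    sample = {
        "hf_vl_models": {"Qwen3-VL-4B-Instruct": {}, "Qwen2.5-VL-3B-Instruct": {}},
        "hf_text_models": ["Qwen3-0.6B"],
    }
    print(list_models(sample, "hf_vl_models"))
    print(list_models(sample, "hf_text_models"))
```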

📥 Download Models

The models will be automatically downloaded on first use. If you prefer to download them manually, place them in the ComfyUI/models/LLM/Qwen-VL/ directory.
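For a manual download, one possible approach is `snapshot_download` from the `huggingface_hub` package. The sketch below makes two assumptions the README does not confirm: that each model sits in its own subfolder named after the repository, and that the repository id follows the `Qwen/<model name>` pattern. `model_dir` and `download_model` are hypothetical helpers.

```python
from pathlib import Path


def model_dir(comfyui_root: str, model_name: str) -> Path:
    """Folder for a manually downloaded model under ComfyUI/models/LLM/Qwen-VL/."""
    # Assumption: one subfolder per model, named after the model.
    return Path(comfyui_root) / "models" / "LLM" / "Qwen-VL" / model_name


def download_model(comfyui_root: str, repo_id: str) -> Path:
    """Download a Hugging Face repository into the folder returned by model_dir."""
    # Imported lazily so model_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = model_dir(comfyui_root, repo_id.split("/")[-1])
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    # Assumed repository id; check the model's Hugging Face page before use.
    print(download_model("ComfyUI", "Qwen/Qwen3-VL-4B-Instruct"))
```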

HF Vision Models (Qwen-VL)

| Model | Link |
| :--- | :--- |
| Qwen3-VL-2B-Instruct | Download |
| Qwen3-VL-2B-Thinking | Download |
| Qwen3-VL-2B-Instruct-FP8 | Download |
| Qwen3-VL-2B-Thinking-FP8 | Download |
| Qwen3-VL-4B-Instruct | Download |
| Qwen3-VL-4B-Thinking | Download |
| Qwen3-VL-4B-Instruct-FP8 | Download |
| Qwen3-VL-4B-Thinking-FP8 | Download |
| Qwen3-VL-8B-Instruct | Download |
| Qwen3-VL-8B-Thinking | Download |
| Qwen3-VL-8B-Instruct-FP8 | Download |
| Qwen3-VL-8B-Thinking-FP8 | Download |
| Qwen3-VL-32B-Instruct | Download |
| Qwen3-VL-32B-Thinking | Download |
| Qwen3-VL-32B-Instruct-FP8 | Download |
| Qwen3-VL-32B-Thinking-FP8 | Download |
| Qwen2.5-VL-3B-Instruct | Download |
| Qwen2.5-VL-7B-Instruct | Download |

HF Text Models (Qwen3)

| Model | Link |
| :--- | :--- |
| Qwen3-0.6B | Download |
| Qwen3-4B-Instruct-2507 | Download |
| qwen3-4b-Z-Image-Engineer | Download |

GGUF Models (Manual Download)

| :– | :– | :– | :– | :– | :– |

📖 Usage

Basic Usage

Add the “QwenVL” node from the 🧪AILab/QwenVL category.

Select the model_name you wish to use.

Connect an image or video (image sequence) source to the node.

Write your prompt using the preset or custom field.

Run the workflow.

Advanced Usage

For more control, use the “QwenVL (Advanced)” node. This gives you access to detailed generation parameters like temperature, top_p, beam search, and device selection.

⚙️ Parameters

| :—- | :—- | :—- | :—- | :—- |

💡 Quantization Options

| :—- | :—- | :—- | :—- | :—- | :—- |

* Note on 4-bit Speed: 4-bit quantization significantly reduces VRAM usage but may result in slower performance on some systems due to the computational overhead of real-time dequantization.

🎯 Attention Mode Guide

| Mode | Description | Best For |
| :--- | :--- | :--- |
| auto | Automatically selects best available: Sage → Flash → SDPA | Most users (recommended) |
| sage | SageAttention with GPU-optimized kernels | Speed on modern GPUs (RTX 40 series, Hopper, Blackwell) |
| flash_attention_2 | Flash Attention 2 | Speed when Sage unavailable |
| sdpa | PyTorch SDPA (default) | Compatibility, FP8/BitsAndBytes models |

Note: FP8 models and BitsAndBytes quantization automatically use SDPA regardless of selection.
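The selection order described above (Sage → Flash → SDPA, with SDPA forced for FP8 and BitsAndBytes) could be sketched as follows. This is an illustrative approximation only: it checks whether the optional packages are installed under their usual import names (`sageattention`, `flash_attn`), does not inspect the GPU architecture, and assumes an explicitly requested mode falls back to SDPA when its package is missing. `resolve_attention` is a hypothetical helper, not the node's API.

```python
import importlib.util
from typing import Callable

# (mode name from the table above, Python import name of the optional package)
_OPTIONAL_BACKENDS = (("sage", "sageattention"), ("flash_attention_2", "flash_attn"))


def _installed(module_name: str) -> bool:
    """True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_attention(requested: str = "auto", fp8_or_bnb: bool = False,
                      installed: Callable[[str], bool] = _installed) -> str:
    """Pick an attention mode following the documented preference order."""
    if fp8_or_bnb or requested == "sdpa":
        return "sdpa"  # FP8 and BitsAndBytes always use SDPA
    if requested == "auto":
        for mode, module in _OPTIONAL_BACKENDS:
            if installed(module):
                return mode
        return "sdpa"
    backends = dict(_OPTIONAL_BACKENDS)
    if requested not in backends:
        raise ValueError(f"unknown attention mode: {requested!r}")
    # Assumption: fall back to SDPA if the requested backend is not installed.
    return requested if installed(backends[requested]) else "sdpa"


if __name__ == "__main__":
    print(resolve_attention("auto"))
```

The `installed` parameter exists only so the availability check can be swapped out, for example in tests.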

🤔 Setting Tips

| Setting | Recommendation |
| :--- | :--- |
| Model Choice | For most users, Qwen3-VL-4B-Instruct is a great starting point. If you have a 40-series GPU, try the -FP8 version for better performance. |
| Memory Mode | Keep keep_model_loaded enabled (True) for the best performance if you plan to run the node multiple times. Disable it only if you are running out of VRAM for other nodes. |
| Quantization | Start with the default 8-bit. If you have plenty of VRAM (>16 GB), switch to None (FP16) for the best speed and quality. If you are low on VRAM, use 4-bit. |
| Attention Mode | Use “auto” for best performance. SageAttention provides the fastest inference on supported GPUs. |
| Performance | The first time a model is loaded with a specific quantization, it may be slow. Subsequent runs (with keep_model_loaded enabled) will be much faster. |
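The quantization tip above can be turned into a small rule of thumb. In this sketch only the 16 GB threshold comes from the table; the `low_vram_gb` cutoff of 8 GB is an assumption (the README does not define "low on VRAM"), the returned strings are descriptive labels rather than confirmed UI option names, and both function names are hypothetical.

```python
def recommend_quantization(vram_gb: float, low_vram_gb: float = 8.0) -> str:
    """Suggest a quantization level from total VRAM in GB."""
    if vram_gb > 16:
        return "None (FP16)"  # plenty of VRAM: best speed and quality
    if vram_gb < low_vram_gb:  # assumed cutoff for "low on VRAM"
        return "4-bit"
    return "8-bit"  # the documented default


def detect_vram_gb(device_index: int = 0):
    """Return total VRAM of a CUDA device in GB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory / 1024**3


if __name__ == "__main__":
    vram = detect_vram_gb()
    if vram is None:
        print("No CUDA device detected")
    else:
        print(f"{vram:.1f} GB -> {recommend_quantization(vram)}")
```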

🧠 About Model

This node utilizes the Qwen-VL series of models, developed by the Qwen Team at Alibaba Cloud. These are powerful, open-source large vision-language models (LVLMs) designed to understand and process both visual and textual information, making them ideal for tasks like detailed image and video description.

🗺️ Roadmap

✅ Completed (v2.1.0)

✅ SageAttention support with per-GPU architecture optimization

✅ Improved FP8 model handling with automatic SDPA fallback

✅ Smart attention selection (auto: Sage → Flash → SDPA)

✅ Progress bar for model loading and generation

✅ Better memory management and cache clearing

✅ Completed (v2.0.0)

✅ GGUF model support via llama.cpp backend

✅ Prompt Enhancer nodes for text-only optimization

✅ Completed (v1.0.0)

✅ Support for Qwen3-VL and Qwen2.5-VL models.

✅ Automatic model downloading and management.

✅ On-the-fly 4-bit, 8-bit, and FP16 quantization.

✅ Hardware compatibility checks for FP8 models.

✅ Image and Video (frame sequence) input support.

🙏 Credits

Qwen Team: Alibaba Cloud - For developing and open-sourcing the powerful Qwen-VL models.

ComfyUI: comfyanonymous - For the incredible and extensible ComfyUI platform.

llama-cpp-python: JamePeng/llama-cpp-python - GGUF backend with vision support used by the GGUF nodes.

SageAttention: SageAttention - Efficient attention implementation with GPU-optimized kernels.

ComfyUI Integration: 1038lab \- Developer of this custom node.

📜 License

This repository’s code is released under the GPL-3.0 License.