torch
torchvision
torchaudio
numpy
pillow
huggingface_hub
accelerate
optimum
av
transformers>=4.57.1
qwen-vl-utils
opencv-python
bitsandbytes
triton-windows; sys_platform == 'win32'
triton; sys_platform == 'linux'




This is a ComfyUI implementation of Qwen3-VL-Instruct. It supports, among other things, text-only queries, video queries, single-image queries, and multi-image queries for generating captions or responses.
> [!IMPORTANT]
> Important Notes for Using the Workflow
> - Please ensure that the “Display Text node” is available in your ComfyUI setup. If this node is missing, you can find it in the ComfyUI_MiniCPM-V-4_5 repository; installing that add-on makes the “Display Text node” available.
Place this repository in the ComfyUI\custom_nodes\ directory and run: pip install -r requirements.txt
Any model that is not found in the ComfyUI\models\prompt_generator\ directory is downloaded automatically when the workflow runs.
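
The check-then-download behavior described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual code: the function name `ensure_model`, the one-folder-per-model layout, and the injectable `download` parameter are all assumptions made for the example. The only real library call is `snapshot_download` from `huggingface_hub`, which is already listed in the requirements.

```python
from pathlib import Path
from typing import Callable, Optional


def ensure_model(repo_id: str, models_root: Path,
                 download: Optional[Callable[..., str]] = None) -> Path:
    """Return the local folder for repo_id, downloading it only if missing.

    Hypothetical helper: the real node may organize its files differently.
    """
    # Assumed layout: one sub-folder per model, named after the repo.
    target = models_root / repo_id.split("/")[-1]

    # A non-empty folder is treated as "model already present".
    if target.is_dir() and any(target.iterdir()):
        return target

    if download is None:
        # Imported lazily so the local check works without the package.
        from huggingface_hub import snapshot_download
        download = snapshot_download

    target.mkdir(parents=True, exist_ok=True)
    # Fetch every file of the repo into the target folder.
    download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `ensure_model("Qwen/Qwen3-VL-4B-Instruct", Path("ComfyUI/models/prompt_generator"))` would download on the first run and return immediately afterwards; the repo id here is only an example. Treating any non-empty folder as complete is a simplification, so an interrupted download would need to be deleted by hand before retrying.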