numpy loguru tqdm accelerate transformers>=4.37.0 safetensors requests opencv-python diffusers pyyaml einops omegaconf packaging pytorch-lightning descript-audio-codec scipy soxr ffmpy audiocraft

This is a ComfyUI custom node wrapper for the HunyuanVideo-Foley model, which generates realistic audio from video and text descriptions.
Models are downloaded to `ComfyUI/models/foley/` when needed.

Clone the repository into your ComfyUI `custom_nodes` directory:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/if-ai/ComfyUI_HunyuanVideoFoley.git
```
Install the Python dependencies:
```bash
cd ComfyUI_HunyuanVideoFoley
pip install -r requirements.txt
```
Run the installation script:
```bash
python install.py
```
The models can be obtained in two ways:
- Automatic download: the models are downloaded to `ComfyUI/models/foley/` when you first run the node.
- Manual download: place the model files in `ComfyUI/models/foley/` (recommended) or in the `./pretrained_models/` directory.

The configuration file is expected at `configs/hunyuanvideo-foley-xxl.yaml`.

This custom node package is designed in a modular way for maximum flexibility and efficiency. Here is the recommended workflow and an explanation of what each node does.
The most powerful and efficient way to use these nodes is to chain them together in the following order:
Model Loader → Dependencies Loader → Torch Compile → Generator (Advanced)
This setup allows you to load the models only once, apply performance optimizations, and then run the generator multiple times without reloading, saving significant time and VRAM.
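The benefit of loading once and generating many times can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `load_model` and `generate` are stand-ins, not functions from this package, and the cache simply mimics what keeping a loader node's output alive achieves in a workflow.

```python
from functools import lru_cache

load_count = 0  # counts how often the stand-in "expensive" load actually runs


@lru_cache(maxsize=1)
def load_model(model_name: str) -> dict:
    """Hypothetical stand-in for an expensive model load."""
    global load_count
    load_count += 1
    return {"name": model_name}


def generate(prompt: str, model_name: str = "hunyuanvideo-foley-xxl") -> str:
    """Hypothetical generator step that reuses the cached model."""
    model = load_model(model_name)  # served from the cache after the first call
    return f"{model['name']} -> audio for {prompt!r}"


for prompt in ("footsteps on gravel", "rain on a window", "door slam"):
    generate(prompt)

print(load_count)  # the loader ran once for three generations
```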
The Model Loader is the starting point. It loads the main (and very large) audio generation model into memory.
- `none`: Loads the model in its original format (highest VRAM usage).
- `fp8_e5m2` / `fp8_e4m3fn`: Use FP8 quantization, which stores the model's weights in a much smaller format. This can save several gigabytes of VRAM with a minimal impact on audio quality, making it possible to run on GPUs with less memory.
- Loader offload setting: if `True`, the model is kept in your regular RAM instead of VRAM. This is not the same as the generator's offload setting; use it if you are loading multiple different models in your workflow and need to conserve VRAM.

The Dependencies Loader takes the main model from the Model Loader and then loads all the smaller auxiliary models required for the process (the VAE, text encoder, and visual feature extractors).
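To get a feel for why the Model Loader's FP8 options save VRAM, the weight storage alone can be estimated from bytes per weight. This is a rough sketch under stated assumptions: the parameter count is a made-up placeholder (not the real model's size), the "original format" is assumed to be a 16-bit or 32-bit float, and activations and auxiliary models are ignored.

```python
# Bytes needed to store one weight in each format.
BYTES_PER_WEIGHT = {
    "fp32": 4,
    "fp16": 2,
    "bf16": 2,
    "fp8_e5m2": 1,    # 5 exponent bits, 2 mantissa bits
    "fp8_e4m3fn": 1,  # 4 exponent bits, 3 mantissa bits
}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1024**3


# Hypothetical parameter count, chosen only for illustration.
num_params = 5_000_000_000

for dtype in ("fp32", "fp16", "fp8_e4m3fn"):
    print(f"{dtype:>11}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Whatever the real parameter count is, FP8 weights take half the space of 16-bit weights and a quarter of 32-bit weights.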
Torch Compile is an optional but highly recommended performance-enhancing node. It uses `torch.compile` to optimize the model's code for your specific hardware.
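As a rough sketch of what a node like this might do internally, the helper below forwards a compile mode and backend to `torch.compile`. The helper name `compile_model` and the `VALID_MODES` tuple are assumptions made for illustration and are not taken from this package; only the `torch.compile(model, mode=..., backend=...)` call itself is the PyTorch API.

```python
VALID_MODES = ("default", "reduce-overhead", "max-autotune")


def compile_model(model, compile_mode: str = "default", backend: str = "inductor"):
    """Hypothetical helper: wrap a torch.nn.Module with torch.compile."""
    if compile_mode not in VALID_MODES:
        raise ValueError(
            f"Unknown compile mode {compile_mode!r}; expected one of {VALID_MODES}"
        )
    # Imported lazily so the argument check above works without PyTorch installed.
    import torch

    # torch.compile returns a wrapped module; the actual compilation happens
    # on the first forward pass, which is why the first run is slow.
    return torch.compile(model, mode=compile_mode, backend=backend)


# Example usage (requires PyTorch and a working compiler toolchain):
# compiled = compile_model(my_module, compile_mode="max-autotune")
# output = compiled(inputs)
```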
- `compile_mode`: Controls the trade-off between compilation time and the amount of performance gain.
  - `default`: The best balance. It provides a good speedup with a reasonable initial compile time.
  - `reduce-overhead`: Compiles more slowly but can reduce the overhead of running the model, which might be faster for very small audio generations.
  - `max-autotune`: Takes the longest to compile initially, but it tries many different optimizations to find the fastest option for your specific hardware.
- `backend`: An advanced setting that changes the underlying compiler used by PyTorch. For most users, the default `inductor` is the best choice.

The Generator (Advanced) node is the main workhorse, where the audio generation happens.
- When the node's enable input is `False`, the node does nothing and passes through an empty/silent output. This is useful for disabling parts of a complex workflow without having to disconnect them.
- When silent output is enabled (`True`), the node outputs a valid, silent audio clip, which prevents downstream nodes (like video combiners) from failing. If `False`, it outputs `None`.

The two memory-related checkboxes on the Generator node are crucial for managing your GPU's resources. Here is exactly what they do:
- `cpu_offload`: If `True`, the node will always move the models to your regular RAM (CPU) after the generation is complete. This is the best option for freeing up VRAM for other nodes in your workflow while still keeping the models ready for the next run without having to reload them from disk.
- `memory_efficient`: If `True`, the node will completely unload the models from memory (both VRAM and RAM) after the generation is finished. If the model was provided by `HunyuanVideoFoleyModelLoader` (the advanced workflow), it will not unload it, respecting the fact that you may want to reuse the pre-loaded model for another generation.

The most memory-intensive part of the process is visual feature extraction. We've implemented batched processing to prevent out-of-memory errors with longer videos or on GPUs with less VRAM. You can control this with two settings on the Generator (Advanced) node:
- `feature_extraction_batch_size`: Determines how many video frames are processed by the feature extractor models at once.
- `enable_profiling`: If you check this box, the node will print detailed performance timings and peak VRAM usage for the feature extraction step to the console. This is highly recommended for finding the optimal batch size for your specific hardware.

The batch sizes below are general starting points. The optimal value can vary based on your exact GPU, driver version, and other running processes.
| VRAM Tier | Video Resolution | Recommended Batch Size | Notes |
| :--- | :--- | :--- | :--- |
| ≤ 8 GB | 480p | 4 – 8 | Start with 4. If successful, you can try increasing it. |
| | 720p | 2 – 4 | Start with 2. 720p videos are demanding on low VRAM cards. |
| 12-16 GB | 480p | 16 – 32 | The default of 16 should work well. Can be increased for more speed. |
| | 720p | 8 – 16 | Start with 8 or 16. |
| ≥ 24 GB| 480p | 32 – 64 | You can safely increase the batch size for maximum performance. |
| | 720p | 16 – 32 | A batch size of 32 should be easily achievable. |
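The idea behind batched feature extraction, and how the table above can seed a first batch size, can be sketched as follows. This is an illustration only: `starting_batch_size`, `iter_batches`, and `extract_features` are hypothetical names, the extractor is a stand-in lambda, and VRAM sizes that fall between the table's tiers are assumed to use the lower tier as the conservative choice.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple

# (minimum VRAM in GB, {resolution: (low, high)}) copied from the table above.
TIERS = [
    (24, {"480p": (32, 64), "720p": (16, 32)}),
    (12, {"480p": (16, 32), "720p": (8, 16)}),
    (0, {"480p": (4, 8), "720p": (2, 4)}),
]


def starting_batch_size(vram_gb: float, resolution: str) -> int:
    """Return the low end of the recommended range for this VRAM tier."""
    for min_vram, ranges in TIERS:
        if vram_gb >= min_vram:
            return ranges[resolution][0]
    raise ValueError("vram_gb must be non-negative")


def iter_batches(frames: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(frames), batch_size):
        yield frames[start:start + batch_size]


def extract_features(
    frames: Sequence, batch_size: int, extractor: Callable[[Sequence], List]
) -> Tuple[List, float]:
    """Run the extractor batch by batch and return (features, seconds elapsed)."""
    features: List = []
    start_time = time.perf_counter()
    for batch in iter_batches(frames, batch_size):
        features.extend(extractor(batch))
    return features, time.perf_counter() - start_time


frames = list(range(50))  # stand-in for 50 decoded video frames
batch_size = starting_batch_size(vram_gb=8, resolution="480p")
features, elapsed = extract_features(
    frames, batch_size, lambda batch: [frame * 2 for frame in batch]
)
print(f"batch_size={batch_size}, features={len(features)}, seconds={elapsed:.4f}")
```

Only one batch of frames is in flight at a time, so peak memory scales with the batch size rather than with the length of the video.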
Main node for generating audio from video and text.
Inputs:
`None`. (BOOLEAN, default: True)

Outputs:
For videos longer than 15 seconds:
The node expects the following model structure:
ComfyUI\models\foley\hunyuanvideo-foley-xxl
├── hunyuanvideo_foley.pth # Main Foley model
├── vae_128d_48k.pth # DAC VAE model
└── synchformer_state_dict.pth # Synchformer model
configs/
└── hunyuanvideo-foley-xxl.yaml # Configuration file
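Before running the node with manually placed models, it can help to verify that the files listed above are present. The following is a minimal sketch; the helper `missing_model_files` is hypothetical, and the relative `model_dir` path assumes the script is run from the directory that contains `ComfyUI`.

```python
from pathlib import Path
from typing import List

# File names taken from the expected model structure.
EXPECTED_MODEL_FILES = (
    "hunyuanvideo_foley.pth",
    "vae_128d_48k.pth",
    "synchformer_state_dict.pth",
)


def missing_model_files(model_dir: Path) -> List[str]:
    """Return the expected model files that are not present in model_dir."""
    return [name for name in EXPECTED_MODEL_FILES if not (model_dir / name).is_file()]


# Adjust this path to match your installation.
model_dir = Path("ComfyUI") / "models" / "foley" / "hunyuanvideo-foley-xxl"

missing = missing_model_files(model_dir)
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All model files found in", model_dir)
```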
If you find this tool useful, please consider supporting my work by:
You can also support by reporting issues or suggesting features. Your contributions help me bring updates and improvements to the project.
This custom node is based on the HunyuanVideo-Foley project. Please check the original project’s license terms.
Based on the HunyuanVideo-Foley project by Tencent. Original paper and code available at: