ComfyUI_HunyuanVideoFoley

★ 168

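Video-to-audio · Text-guided · Multiple samples · Automatic download · Caching

A ComfyUI node based on HunyuanVideo-Foley: it generates realistic sound effects that match the picture from a video and an optional text prompt, with support for multiple samples, parameter control, and model caching.

💡 Quickly generate matching Foley sound effects for a video and export multiple variations.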

🍴 12 Forks · 💻 Python · 🔄 2025-09-08

📦 requirements.txt

numpy
loguru
tqdm
accelerate
transformers>=4.37.0
safetensors
requests
opencv-python
diffusers
pyyaml
einops
omegaconf
packaging
pytorch-lightning
descript-audio-codec
scipy
soxr
ffmpy
audiocraft

📄 README

ComfyUI HunyuanVideo-Foley Custom Node

This is a ComfyUI custom node wrapper for the HunyuanVideo-Foley model, which generates realistic audio from video and text descriptions.

Features

Text-Video-to-Audio Synthesis: Generate realistic audio that matches your video content

Flexible Text Prompts: Use optional text descriptions to guide audio generation

Multiple Samples: Generate up to 6 different audio variations per inference

Configurable Parameters: Control guidance scale, inference steps, and sampling

Seed Control: Reproducible results with seed parameter

Model Caching: Efficient model loading and reuse across generations

Automatic Model Downloads: Models are automatically downloaded to ComfyUI/models/foley/ when needed

Installation

Clone this repository into your ComfyUI custom_nodes directory:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/if-ai/ComfyUI_HunyuanVideoFoley.git
```

Install dependencies:

```bash
cd ComfyUI_HunyuanVideoFoley
pip install -r requirements.txt
```

Run the installation script (recommended):

```bash
python install.py
```

Restart ComfyUI to load the new nodes.
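If the nodes do not appear after restarting, one quick thing to rule out is a dependency that failed to install. The sketch below is not part of this repository; it checks a subset of the packages from requirements.txt, and the mapping from package name to import name (for example `opencv-python` to `cv2`) is an assumption based on how these packages are commonly imported.

```python
import importlib.util

# Package name -> module name. The module names are assumptions based on
# common usage, not something this repository documents.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "opencv-python": "cv2",
    "pyyaml": "yaml",
    "einops": "einops",
    "omegaconf": "omegaconf",
    "descript-audio-codec": "dac",
}


def find_missing(modules):
    """Return the package names whose top-level module cannot be located."""
    missing = []
    for package, module in modules.items():
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the same Python interpreter that ComfyUI uses, otherwise the result may not reflect ComfyUI's environment.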

Model Setup

The models can be obtained in two ways:

Option 1: Automatic Download (Recommended)

Models will be automatically downloaded to ComfyUI/models/foley/ when you first run the node

No manual setup required

Progress will be shown in the ComfyUI console

Option 2: Manual Download

Download models from HuggingFace

Place the models in ComfyUI/models/foley/ (recommended) or in the ./pretrained_models/ directory

Ensure the config file is at configs/hunyuanvideo-foley-xxl.yaml

Operation Guide: How to Use the Nodes

This custom node package is designed in a modular way for maximum flexibility and efficiency. Here is the recommended workflow and an explanation of what each node does.

Recommended Workflow

The most powerful and efficient way to use these nodes is to chain them together in the following order:

Model Loader → Dependencies Loader → Torch Compile → Generator (Advanced)

This setup allows you to load the models only once, apply performance optimizations, and then run the generator multiple times without reloading, saving significant time and VRAM.

Node Details

1. HunyuanVideo-Foley Model Loader (FP8)

This is the starting point. It loads the main (and very large) audio generation model into memory.

quantization: This is the most important setting for saving VRAM.

none: Loads the model in its original format (highest VRAM usage).

fp8_e5m2 / fp8_e4m3fn: These options use FP8 quantization, a technique that stores the model’s weights in a much smaller format. This can save several gigabytes of VRAM with a minimal impact on audio quality, making it possible to run on GPUs with less memory.

cpu_offload: If True, the model will be kept in your regular RAM instead of VRAM. This is not the same as the generator’s offload setting; use this if you are loading multiple different models in your workflow and need to conserve VRAM.
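To get a rough feel for why FP8 saves memory, the sketch below estimates weight storage from bytes per parameter. It is only illustrative arithmetic: the parameter count in the usage line is hypothetical, the original weight format is assumed to be 16-bit, and real VRAM usage also includes activations and the auxiliary models.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {
    "float32": 4,
    "float16": 2,
    "bfloat16": 2,
    "fp8_e5m2": 1,
    "fp8_e4m3fn": 1,
}


def weight_memory_gib(num_params, dtype):
    """Approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def fp8_savings_gib(num_params, original_dtype="bfloat16"):
    """Weight memory saved by storing the same parameters in FP8.

    The 16-bit default for original_dtype is an assumption.
    """
    return weight_memory_gib(num_params, original_dtype) - weight_memory_gib(
        num_params, "fp8_e4m3fn"
    )


if __name__ == "__main__":
    hypothetical_params = 5_000_000_000  # illustrative value, not the real model size
    print(f"Estimated saving: {fps if False else fp8_savings_gib(hypothetical_params):.2f} GiB")
```

Under these assumptions FP8 halves the weight storage relative to a 16-bit format, which is consistent with the "several gigabytes" described above for a large model.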

2. HunyuanVideo-Foley Dependencies

This node takes the main model from the loader and then loads all the smaller, auxiliary models required for the process (the VAE, text encoder, and visual feature extractors).

3. HunyuanVideo-Foley Torch Compile

This is an optional but highly recommended performance-enhancing node. It uses torch.compile to optimize the model’s code for your specific hardware.

Note: The very first time you run a workflow with this node, it will take a minute or two to perform the compilation. However, every subsequent run will be significantly faster (often 20-30%).

compile_mode: This controls the trade-off between compilation time and the amount of performance gain.

default: The best balance. It provides a good speedup with a reasonable initial compile time.

reduce-overhead: Compiles more slowly than default but reduces the per-call overhead of running the model, which can make very short audio generations faster.

max-autotune: Takes the longest to compile initially, but it tries many different optimizations to find the absolute fastest option for your specific hardware.

backend: This is an advanced setting that changes the underlying compiler used by PyTorch. For most users, the default inductor is the best choice.
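For readers unfamiliar with `torch.compile`, the two settings above correspond to its `mode` and `backend` keyword arguments. The following is a minimal sketch of how such a call could be wired up; the helper names are hypothetical and this is not the node's actual implementation.

```python
# Modes listed in this guide; PyTorch may accept others.
VALID_MODES = ("default", "reduce-overhead", "max-autotune")


def build_compile_kwargs(compile_mode="default", backend="inductor"):
    """Validate the mode and return keyword arguments for torch.compile."""
    if compile_mode not in VALID_MODES:
        raise ValueError(
            f"compile_mode must be one of {VALID_MODES}, got {compile_mode!r}"
        )
    return {"mode": compile_mode, "backend": backend}


def compile_module(module, compile_mode="default", backend="inductor"):
    """Wrap a torch.nn.Module with torch.compile.

    torch is imported lazily so build_compile_kwargs works without PyTorch.
    """
    import torch

    return torch.compile(module, **build_compile_kwargs(compile_mode, backend))
```

Compilation is lazy: the actual work happens on the first forward pass, which is why the first run is slow and later runs are faster.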

4. HunyuanVideo-Foley Generator (Advanced)

This is the main workhorse node where the audio generation happens.

video / images: Your visual input. You can provide either a video file or a batch of images from another node.

compiled_model: The input for the model prepared by the upstream nodes.

text_prompt / negative_prompt: Your descriptions of the sound you want (and don’t want).

guidance_scale / num_inference_steps / seed: Standard diffusion model controls for creativity vs. prompt adherence, quality vs. speed, and reproducibility.

enabled: A simple switch. If False, the node does nothing and passes through an empty/silent output. This is useful for disabling parts of a complex workflow without having to disconnect them.

silent_audio: Controls what happens when the node is disabled or fails. If True, it outputs a valid, silent audio clip, which prevents downstream nodes (like video combiners) from failing. If False, it outputs None.
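The interaction between `enabled` and `silent_audio` can be summarized as a small decision function. This is a simplified sketch, not the node's code: the function name is hypothetical, and silence is represented as a plain list of zeros rather than a ComfyUI audio object.

```python
def resolve_audio_output(enabled, silent_audio, generate, num_samples):
    """Return generated audio, silence, or None.

    generate is a zero-argument callable that returns audio or raises.
    num_samples is the length of the silent fallback clip.
    """
    if enabled:
        try:
            return generate()
        except Exception:
            # Generation failed: fall through to the fallback below.
            pass
    # Node disabled or generation failed.
    return [0.0] * num_samples if silent_audio else None
```

Returning a valid silent clip instead of `None` is what keeps downstream nodes, such as video combiners, from failing on a missing input.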

Understanding the Memory Options

The two memory-related checkboxes on the Generator node are crucial for managing your GPU’s resources. Here is exactly what they do:

cpu_offload:

What it does: If this is True, the node will always move the models to your regular RAM (CPU) after the generation is complete. This is the best option for freeing up VRAM for other nodes in your workflow while still keeping the models ready for the next run without having to reload them from disk.

Use this when: You want to run other VRAM-intensive nodes after this one and plan to come back to the Foley generator later.

memory_efficient:

What it does: This is a more aggressive option. If True, the node will completely unload the models from memory (both VRAM and RAM) after the generation is finished.

Important Distinction: This process is smart. It will only unload the model if it was loaded by the generator node itself (the simple workflow). If the model was passed in from the HunyuanVideoFoleyModelLoader (the advanced workflow), it will not unload it, respecting the fact that you may want to reuse the pre-loaded model for another generation.

Use this when: You are finished with audio generation and want to free up as much memory as possible for completely different tasks.
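The rules above can be condensed into a short decision function. This is a sketch with hypothetical names; in particular, the source does not say which option wins when both are enabled, so giving `memory_efficient` precedence here is an assumption.

```python
def post_generation_action(cpu_offload, memory_efficient, loaded_by_generator):
    """Decide what happens to the models once generation has finished.

    loaded_by_generator is True in the simple workflow, where the generator
    node loaded the model itself, and False when a loader node supplied it.
    """
    # Assumption: memory_efficient takes precedence over cpu_offload.
    if memory_efficient and loaded_by_generator:
        return "unload"
    if cpu_offload:
        return "move_to_cpu"
    return "keep_on_gpu"
```

Note that `memory_efficient` has no effect on a model supplied by the loader node, so in the advanced workflow only `cpu_offload` changes where the model lives after a run.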

Performance Tuning & VRAM Usage

The most memory-intensive part of the process is visual feature extraction. We’ve implemented batched processing to prevent out-of-memory errors with longer videos or on GPUs with less VRAM. You can control this with two settings on the Generator (Advanced) node:

feature_extraction_batch_size: This determines how many video frames are processed by the feature extractor models at once.

Lower values significantly reduce peak VRAM usage at the cost of slightly slower processing.

Higher values speed up processing but require more VRAM.

enable_profiling: If you check this box, the node will print detailed performance timings and peak VRAM usage for the feature extraction step to the console. This is highly recommended for finding the optimal batch size for your specific hardware.
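Batched processing of this kind generally means slicing the frame sequence into fixed-size chunks so that only one chunk is in flight at a time. A minimal sketch of the idea, using a hypothetical helper rather than the node's own code:

```python
def iter_batches(frames, batch_size):
    """Yield consecutive slices of frames with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(frames), batch_size):
        yield frames[start:start + batch_size]


if __name__ == "__main__":
    frames = list(range(10))  # stand-in for 10 video frames
    for batch in iter_batches(frames, batch_size=4):
        print(len(batch), "frames in this batch")
```

Peak memory scales with the largest batch rather than with the full video, which is why lowering the batch size reduces peak VRAM at the cost of more extractor calls.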

Recommended Batch Sizes

These are general starting points. The optimal value can vary based on your exact GPU, driver version, and other running processes.

| GPU VRAM | Video Resolution | Recommended Batch Size | Notes |
| :--- | :--- | :--- | :--- |
| ≤ 8 GB | 480p | 4 – 8 | Start with 4. If successful, you can try increasing it. |
| | 720p | 2 – 4 | Start with 2. 720p videos are demanding on low VRAM cards. |
| 12-16 GB | 480p | 16 – 32 | The default of 16 should work well. Can be increased for more speed. |
| | 720p | 8 – 16 | Start with 8 or 16. |
| ≥ 24 GB | 480p | 32 – 64 | You can safely increase the batch size for maximum performance. |
| | 720p | 16 – 32 | A batch size of 32 should be easily achievable. |
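The table can be turned into a small lookup for choosing a first value to try. This is a sketch with hypothetical names: it uses the lower end of each range, and because the table does not cover cards between the listed tiers (for example 10 GB), rounding those down to the lower tier is an assumption.

```python
# Conservative starting values taken from the lower end of each table range.
STARTING_BATCH_SIZE = {
    ("low", "480p"): 4,
    ("low", "720p"): 2,
    ("mid", "480p"): 16,
    ("mid", "720p"): 8,
    ("high", "480p"): 32,
    ("high", "720p"): 16,
}


def vram_tier(vram_gb):
    """Map a VRAM size to a table tier, rounding down between tiers."""
    if vram_gb >= 24:
        return "high"
    if vram_gb >= 12:
        return "mid"
    return "low"


def starting_batch_size(vram_gb, resolution):
    """Return a first feature_extraction_batch_size to try."""
    return STARTING_BATCH_SIZE[(vram_tier(vram_gb), resolution)]
```

Treat the result as a starting point only, then use `enable_profiling` to see how much headroom remains before raising it.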

Usage

Node Types

1. HunyuanVideo-Foley Generator

Main node for generating audio from video and text.

Inputs:

video: Video input (VIDEO type)

text_prompt: Text description of desired audio (STRING)

guidance_scale: CFG scale for generation control (1.0-10.0, default: 4.5)

num_inference_steps: Number of denoising steps (10-100, default: 50)

sample_nums: Number of audio samples to generate (1-6, default: 1)

seed: Random seed for reproducibility (INT)

model_path: Path to pretrained models (optional, leave empty for auto-download)

enabled: Enable or disable the entire node. If disabled, it will pass through a silent or null audio output without processing. (BOOLEAN, default: True)

silent_audio: Controls the output when the node is disabled or fails. If true, it outputs a silent audio clip. If false, it outputs None. (BOOLEAN, default: True)
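When driving the node from a script, it can help to check values against the documented ranges before queueing a run. The dataclass below is a hypothetical helper that mirrors the ranges and defaults listed above; the `seed` default of 0 is an assumption, since no default is documented.

```python
from dataclasses import dataclass


@dataclass
class GeneratorSettings:
    text_prompt: str = ""
    guidance_scale: float = 4.5
    num_inference_steps: int = 50
    sample_nums: int = 1
    seed: int = 0  # assumed default; the documentation does not state one

    def validate(self):
        """Return a list of human-readable problems, empty if all is well."""
        problems = []
        if not 1.0 <= self.guidance_scale <= 10.0:
            problems.append("guidance_scale must be between 1.0 and 10.0")
        if not 10 <= self.num_inference_steps <= 100:
            problems.append("num_inference_steps must be between 10 and 100")
        if not 1 <= self.sample_nums <= 6:
            problems.append("sample_nums must be between 1 and 6")
        return problems


if __name__ == "__main__":
    settings = GeneratorSettings(text_prompt="A person walks on frozen ice", sample_nums=3)
    print(settings.validate() or "settings are within the documented ranges")
```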

Outputs:

video_with_audio: Video with generated audio merged (VIDEO)

audio_only: Generated audio file (AUDIO)

status_message: Generation status and info (STRING)

⚠ Important Limitations

Frame Count & Duration Limits

Maximum Frames: 450 frames (hard limit)

Maximum Duration: 15 seconds at 30fps

Recommended: Keep videos ≤15 seconds for best results

FPS Recommendations

30fps: Max 15 seconds (450 frames)

24fps: Max 18.75 seconds (450 frames)

15fps: Max 30 seconds (450 frames)
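These figures all follow from dividing the 450-frame limit by the frame rate. A short sketch of that arithmetic, with hypothetical function names:

```python
MAX_FRAMES = 450  # hard frame limit stated above


def max_duration_seconds(fps):
    """Longest clip, in seconds, that fits the frame limit at this frame rate."""
    return MAX_FRAMES / fps


def fits_frame_limit(duration_seconds, fps):
    """True if a clip of this length and frame rate stays within the limit."""
    return duration_seconds * fps <= MAX_FRAMES


if __name__ == "__main__":
    for fps in (30, 24, 15):
        print(f"{fps} fps: up to {max_duration_seconds(fps)} seconds")
```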

Long Video Solutions

For videos longer than 15 seconds:

Reduce FPS: Lower FPS allows longer duration within frame limit

Segment Processing: Split long videos into 15s segments

Audio Merging: Combine generated audio segments in post-processing
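Both approaches can be planned ahead of time. The sketch below, using hypothetical helpers, computes segment boundaries for the splitting approach and the highest frame rate that keeps a whole clip within 450 frames for the FPS approach. It only plans the cuts; splitting the video and merging the audio are left to your usual tools.

```python
def plan_segments(total_seconds, segment_seconds=15.0):
    """Return (start, end) pairs in seconds that cover the whole clip."""
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


def max_fps_for_duration(total_seconds, max_frames=450):
    """Highest frame rate at which the whole clip fits the frame limit."""
    return max_frames / total_seconds


if __name__ == "__main__":
    print(plan_segments(40))          # three segments, the last one shorter
    print(max_fps_for_duration(30))   # frame rate needed for a 30-second clip
```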

Example Workflow

Load Video: Use a “Load Video” node to input your video file

Add Generator: Add the “HunyuanVideo-Foley Generator” node

Connect Video: Connect the video output to the generator’s video input

Set Prompt: Enter a text description (e.g., “A person walks on frozen ice”)

Adjust Settings: Configure guidance scale, steps, and sample count as needed

Generate: Run the workflow to generate audio

Model Requirements

The node expects the following model structure:

ComfyUI/models/foley/hunyuanvideo-foley-xxl/
├── hunyuanvideo_foley.pth          # Main Foley model
├── vae_128d_48k.pth                # DAC VAE model
└── synchformer_state_dict.pth      # Synchformer model

configs/
└── hunyuanvideo-foley-xxl.yaml     # Configuration file
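After a manual download, a quick way to confirm the layout is to check for the three weight files listed above. This is a hypothetical helper, not something shipped with the node, and the path in the usage example assumes you run it from the directory that contains ComfyUI.

```python
from pathlib import Path

# File names taken from the model structure listed above.
EXPECTED_FILES = (
    "hunyuanvideo_foley.pth",
    "vae_128d_48k.pth",
    "synchformer_state_dict.pth",
)


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI/models/foley/hunyuanvideo-foley-xxl")
    print("Missing files:", ", ".join(missing) if missing else "none")
```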

TODO

[x] ADD VHS INPUT/OUTPUTS (Thanks to YC)

[x] NEGATIVE PROMPT (Thanks to YC)

[x] MODEL OFFLOADING OPS

[x] TORCH COMPILE

[ ] QUANTISE MODEL

Support

If you find this tool useful, please consider supporting my work by:

Starring this repository on GitHub

Subscribing to my YouTube channel: Impact Frames

Following on X: @ImpactFrames

You can also help by reporting issues or suggesting features. Your contributions help me bring updates and improvements to the project.

License

This custom node is based on the HunyuanVideo-Foley project. Please check the original project’s license terms.

Credits

Based on the HunyuanVideo-Foley project by Tencent. Original paper and code available at:

Paper: HunyuanVideo-Foley: Text-Video-to-Audio Synthesis

Code: https://github.com/tencent/HunyuanVideo-Foley