ComfyUI-ThinkSound_Wrapper

★ 21

Text-to-audio · Video-to-audio · Chain-of-Thought · ComfyUI integration

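A wrapper node pack that brings ThinkSound into ComfyUI. It uses Chain-of-Thought reasoning to generate high-quality audio, synchronized with video, from text or video input, and makes it easy to fine-tune the result inside a visual workflow.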

💡 Generate high-quality, synchronized audio from text or video inside a ComfyUI workflow.

🍴 1 fork · 💻 Python · 🔄 2025-07-24

🔗 Original on GitHub

📦 Network drive download

Copy the link below and open it in Quark Netdisk to download:

https://pan.quark.cn/s/8f9eee5e2cdb

📦 requirements.txt

# ThinkSound ComfyUI Requirements
# Core dependencies needed for ThinkSound functionality

# Critical ThinkSound dependencies (these were missing and caused errors)
alias-free-torch==0.0.6
descript-audio-codec==1.0.0
vector-quantize-pytorch==1.9.14

# Essential for functionality
einops==0.7.0
open-clip-torch>=2.20.0
huggingface_hub
safetensors
sentencepiece>=0.1.99
tqdm

# Optional but recommended for full compatibility
auraloss==0.4.0
encodec==0.1.1
lightning>=2.0.0
einops-exts==0.0.4
ema-pytorch==0.2.3
k-diffusion==0.1.1
PyWavelets==1.4.1
pandas>=2.0.0
importlib-resources>=5.0.0

📄 README

ComfyUI-ThinkSound_Wrapper

A ComfyUI wrapper implementation of ThinkSound – an advanced AI model for generating high-quality audio from text descriptions and video content using Chain-of-Thought (CoT) reasoning.

https://github.com/user-attachments/assets/b3f090a7-fe58-4bb0-8e21-cb19377aa9cf

14.02.25 – Added the ability to use the large model, thinksound.ckpt.

You can download it from: https://huggingface.co/FunAudioLLM/ThinkSound/resolve/main/thinksound.ckpt?download=true

🎵 Features

Text-to-Audio Generation: Create audio from detailed text descriptions

Video-to-Audio Generation: Generate synchronized audio that matches video content

Chain-of-Thought Reasoning: Use detailed CoT prompts for precise audio control

Multimodal Understanding: Combines visual and textual information for better results

ComfyUI Integration: Easy-to-use nodes that integrate seamlessly with ComfyUI workflows

🎬 What Makes ThinkSound Special

ThinkSound uses multimodal AI to understand both text and video:

MetaCLIP for visual scene understanding

Synchformer for temporal motion analysis

T5 for detailed language understanding

Advanced diffusion model for high-quality audio synthesis

📋 Requirements

System Requirements

NVIDIA GPU with at least 12GB VRAM (24GB+ recommended)

Python 3.8+

ComfyUI installed and working

Windows/Linux (tested on Windows)

Dependencies

The following Python packages will be installed automatically:

torch>=2.0.1
torchaudio>=2.0.2
torchvision>=0.15.0
transformers>=4.20.0
accelerate>=0.20.0
alias-free-torch==0.0.6
descript-audio-codec==1.0.0
vector-quantize-pytorch==1.9.14
einops==0.7.0
open-clip-torch>=2.20.0
huggingface_hub
safetensors
sentencepiece>=0.1.99
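
If a later import fails, it can help to confirm which of these packages are actually present in the Python environment ComfyUI runs in. The following is a minimal sketch using only the standard library; `REQUIRED` and `check_packages` are illustrative names that are not part of this wrapper, and only exact (`==`) pins are compared while unpinned packages are checked for presence.

```python
from importlib import metadata

# A subset of the dependency list above; None means "any version is accepted".
REQUIRED = {
    "alias-free-torch": "0.0.6",
    "descript-audio-codec": "1.0.0",
    "vector-quantize-pytorch": "1.9.14",
    "einops": "0.7.0",
    "huggingface_hub": None,
    "safetensors": None,
}


def check_packages(required):
    """Return {name: problem} for packages that are missing or differ from an exact pin."""
    problems = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if pinned is not None and installed != pinned:
            problems[name] = f"installed {installed}, expected {pinned}"
    return problems


if __name__ == "__main__":
    # Prints nothing when every listed package is present and matches its pin.
    for name, problem in check_packages(REQUIRED).items():
        print(f"{name}: {problem}")
```

Run it with the same interpreter that launches ComfyUI, since a portable or virtual-environment install may not share packages with the system Python.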

🚀 Installation

Step 1: Install ComfyUI Custom Node

Navigate to your ComfyUI custom nodes folder:

```bash
cd ComfyUI/custom_nodes/
```

Clone this repository:

```bash
git clone https://github.com/ShmuelRonen/ComfyUI-ThinkSound_Wrapper.git
cd ComfyUI-ThinkSound_Wrapper
```

Your folder structure should look like:

```
ComfyUI-ThinkSound_Wrapper/
├── __init__.py
├── nodes.py
├── requirements.txt
├── thinksound/
│   ├── data/
│   ├── models/
│   ├── inference/
│   └── …
└── README.md
```

Step 3: Install Dependencies

Option A: Install all dependencies (recommended)

pip install -r requirements.txt

Option B: Install minimal dependencies

pip install torch torchaudio torchvision transformers accelerate
pip install alias-free-torch==0.0.6 descript-audio-codec==1.0.0 vector-quantize-pytorch==1.9.14
pip install einops open-clip-torch huggingface_hub safetensors sentencepiece

Step 4: Download Models

Download the models pack from Google Drive:

🔗 Download Models (Google Drive)

Extract the downloaded file and place models in:

```
ComfyUI/models/thinksound/
├── thinksound_light.ckpt
├── vae.ckpt
├── synchformer_state_dict.pth
└── (other model files)
```

Create the thinksound models folder if it doesn’t exist:

```bash
mkdir -p ComfyUI/models/thinksound
```
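
Before restarting, a quick check that the files landed in the right folder can save a debugging round. This is a small sketch, not part of the wrapper: `missing_model_files` is a hypothetical helper, and the file names are simply the three listed above, so adjust them if your download differs.

```python
from pathlib import Path

# File names taken from the listing above; adjust to match your download.
EXPECTED_FILES = ("thinksound_light.ckpt", "vae.ckpt", "synchformer_state_dict.pth")


def missing_model_files(models_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI/models/thinksound")
    print("missing:", missing if missing else "none")
```

The path in the `__main__` block is relative, so run the script from the directory that contains your `ComfyUI` folder or pass an absolute path instead.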

Step 5: Restart ComfyUI

Restart ComfyUI completely

Check the console for successful loading messages:

```
🎉 ThinkSound modules imported successfully!
✅ SUCCESS: Found FeaturesUtils in thinksound.data.v2a_utils.feature_utils_224
```

🎛️ Usage

Available Nodes

After installation, you’ll find these nodes in ComfyUI:

ThinkSound Model Loader

Loads the main ThinkSound diffusion model

Input: thinksound_model (select your .ckpt file)

Output: thinksound_model

ThinkSound Feature Utils Loader

Loads VAE and Synchformer models

Inputs: vae_model, synchformer_model

Output: feature_utils

ThinkSound Sampler

Generates audio from text and/or video

Main generation node

Basic Workflow

ThinkSound Model Loader ─────────┐
                                 ├── ThinkSound Sampler ── Audio Output
ThinkSound Feature Utils Loader ─┘

Sampler Node Parameters

Duration: Audio length in seconds (1.0 – 30.0)

Steps: Denoising steps (30 recommended)

CFG Scale: Guidance strength (5.0 recommended)

Seed: Random seed for reproducibility

Caption: Short audio description

CoT Description: Detailed Chain-of-Thought prompt

Video: Optional video input for video-to-audio generation
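
When driving the sampler from a script or preparing batches of prompts, it can be convenient to keep these values together and check them up front. The sketch below is illustrative only: `SamplerSettings` is a hypothetical container, not a class from this wrapper, and the default duration of 8 seconds is an assumption. The 1.0 to 30.0 second range and the recommended steps and CFG values come from the list above.

```python
from dataclasses import dataclass


@dataclass
class SamplerSettings:
    """Hypothetical container mirroring the sampler parameters listed above."""

    caption: str
    cot_description: str = ""
    duration: float = 8.0   # seconds; assumed default, documented range is 1.0 to 30.0
    steps: int = 30         # recommended value from the parameter list
    cfg_scale: float = 5.0  # recommended value from the parameter list
    seed: int = 0

    def validate(self):
        """Raise ValueError if a value falls outside the documented range."""
        if not 1.0 <= self.duration <= 30.0:
            raise ValueError(f"duration must be within 1.0-30.0 s, got {self.duration}")
        if self.steps < 1:
            raise ValueError(f"steps must be at least 1, got {self.steps}")
        return self


if __name__ == "__main__":
    settings = SamplerSettings(caption="Dog barking", duration=10.0).validate()
    print(settings)
```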

🎵 Examples

Text-to-Audio Examples

Example 1: Simple Audio

Caption: "Dog barking"
CoT Description: "Generate the sound of a medium-sized dog barking outdoors. The barking should be natural and energetic, with slight echo to suggest an open space. Include 3-4 distinct barks with realistic timing between them."

Example 2: Complex Scene

Caption: "Ocean waves at beach"
CoT Description: "Create gentle ocean waves lapping against the shore. Add subtle sounds of water receding over sand and pebbles. Include distant seagull calls and a light ocean breeze for natural ambiance."

Example 3: Musical Content

Caption: "Jazz piano"
CoT Description: "Generate a smooth jazz piano melody in a minor key. Include syncopated rhythms, bluesy chord progressions, and subtle improvisation. The tempo should be moderate and relaxing, perfect for a late-night cafe atmosphere."

Video-to-Audio Generation

Load a video using ComfyUI’s video loader nodes

Connect the video to the ThinkSound Sampler’s video input

Add descriptive text to guide the audio generation

Generate audio that syncs with the video content

⚠️ Important Notes

Model Precision

ThinkSound requires fp32 precision for stable operation

The nodes automatically use fp32 (no precision selection needed)

Do not force fp16 as it may cause tensor dimension errors

Memory Requirements

8GB VRAM minimum for basic operation

12GB+ VRAM recommended for longer audio generation

Enable “force_offload” to save VRAM (enabled by default)

Video Input Format

Supported: MP4, AVI, MOV (any format ComfyUI can load)

Recommended: 8-30 seconds duration

Processing: Automatically handled by the node
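
A pre-flight check along these lines can flag clips that fall outside the guidance above before a long generation run. This is a sketch under assumptions: `video_input_warnings` is a hypothetical helper, and the clip duration is supplied by the caller rather than read from the file, which would need a video library.

```python
from pathlib import Path

LISTED_EXTENSIONS = {".mp4", ".avi", ".mov"}  # formats named above
RECOMMENDED_SECONDS = (8.0, 30.0)             # recommended clip length


def video_input_warnings(path, duration_seconds):
    """Return human-readable warnings; an empty list means the clip matches the guidance."""
    warnings = []
    ext = Path(path).suffix.lower()
    if ext not in LISTED_EXTENSIONS:
        label = ext if ext else "a file without an extension"
        warnings.append(f"{label} is not a listed format; it may still work if ComfyUI can load it")
    low, high = RECOMMENDED_SECONDS
    if not low <= duration_seconds <= high:
        warnings.append(f"{duration_seconds} s is outside the recommended {low}-{high} s range")
    return warnings


if __name__ == "__main__":
    for message in video_input_warnings("clip.webm", 3.0):
        print("warning:", message)
```

The function only warns and never rejects a clip, because the list above says any format ComfyUI can load is supported.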

🐛 Troubleshooting

Common Issues

Issue: “ThinkSound source code not installed”

Solution: Ensure you've downloaded the ThinkSound repository to the 'thinksound' folder

Issue: “ImportError: No module named 'alias_free_torch'”

Solution: Install missing dependencies:
pip install alias-free-torch==0.0.6 descript-audio-codec==1.0.0 vector-quantize-pytorch==1.9.14

Issue: “Input type (float) and bias type (struct c10::Half) should be the same”

Solution: This is resolved automatically with fp32 precision. Restart ComfyUI if you see this error.

Issue: “Tensors must have same number of dimensions”

Solution: Update to the latest version of the nodes. This was fixed in recent updates.

Issue: Models not loading

Solution: 
1. Check that models are in ComfyUI/models/thinksound/
2. Verify model file names match the dropdown options
3. Check ComfyUI console for specific error messages

Performance Tips

Start with shorter durations (8-10 seconds) for testing

Use lower step counts (12-16) for faster generation during testing

Enable force_offload to manage VRAM usage

Close other GPU-intensive applications while generating

📊 Expected Performance

Generation Times (approximate)

8 seconds audio: 30-60 seconds on RTX 3080

15 seconds audio: 60-120 seconds on RTX 3080

Video analysis: Additional 10-20 seconds
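
For durations between the two quoted data points, a rough estimate can be read off by linear interpolation. The sketch below is arithmetic on the figures above and nothing more: it assumes generation time scales linearly between 8 and 15 seconds of audio, which has not been measured, and `estimate_generation_seconds` is a hypothetical helper.

```python
# Approximate RTX 3080 figures quoted above: audio seconds -> (low, high) generation seconds.
QUOTED = {8.0: (30.0, 60.0), 15.0: (60.0, 120.0)}
VIDEO_OVERHEAD = (10.0, 20.0)  # extra seconds quoted for video analysis


def estimate_generation_seconds(audio_seconds, with_video=False):
    """Linearly interpolate between the two quoted data points (assumed, not measured)."""
    (x0, (lo0, hi0)), (x1, (lo1, hi1)) = sorted(QUOTED.items())
    if not x0 <= audio_seconds <= x1:
        raise ValueError(f"only durations between {x0} and {x1} s are covered by the quoted figures")
    t = (audio_seconds - x0) / (x1 - x0)
    low = lo0 + (lo1 - lo0) * t
    high = hi0 + (hi1 - hi0) * t
    if with_video:
        low += VIDEO_OVERHEAD[0]
        high += VIDEO_OVERHEAD[1]
    return low, high


if __name__ == "__main__":
    low, high = estimate_generation_seconds(11.5, with_video=True)
    print(f"roughly {low:.0f}-{high:.0f} s")
```

Durations outside the 8 to 15 second range raise an error rather than extrapolate, since no figures are given for them.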

Quality Settings

Steps 12-16: Fast, good quality

Steps 24: Recommended balance

Steps 32+: High quality, slower

🔄 Updates

To update the project:

Pull latest changes: git pull origin main

Update ThinkSound source: cd thinksound && git pull

Restart ComfyUI

📄 License

This project is a wrapper implementation based on ThinkSound by FunAudioLLM. Please refer to the original ThinkSound repository for licensing information.

🤝 Contributing

Contributions are welcome! Please:

Fork the repository

Create a feature branch

Submit a pull request

📞 Support

If you encounter issues:

Check the troubleshooting section above

Review ComfyUI console output for error messages

Open an issue on GitHub with detailed error information

🎉 Acknowledgments

ThinkSound Team for the original model and research

ComfyUI Community for the excellent framework

Contributors who helped test and improve this wrapper implementation

Enjoy creating amazing audio with ThinkSound! 🎵✨