ComfyUI-ThinkSound_Wrapper

★ 21

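Text-to-audio · Video-to-audio · Chain-of-Thought · ComfyUI integration
A wrapper node that brings ThinkSound into ComfyUI. It uses Chain-of-Thought reasoning to generate high-quality, video-synchronized audio from text or video, and gives fine-grained control over the result inside a visual workflow.
💡 Generate synchronized, high-quality audio from text or video in a ComfyUI workflow.
🍴 1 fork · 💻 Python · 🔄 Updated 2025-07-24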
📦 requirements.txt
# ThinkSound ComfyUI Requirements
# Core dependencies needed for ThinkSound functionality

# Critical ThinkSound dependencies (these were missing and caused errors)
alias-free-torch==0.0.6
descript-audio-codec==1.0.0
vector-quantize-pytorch==1.9.14

# Essential for functionality
einops==0.7.0
open-clip-torch>=2.20.0
huggingface_hub
safetensors
sentencepiece>=0.1.99
tqdm

# Optional but recommended for full compatibility
auraloss==0.4.0
encodec==0.1.1
lightning>=2.0.0
einops-exts==0.0.4
ema-pytorch==0.2.3
k-diffusion==0.1.1
PyWavelets==1.4.1
pandas>=2.0.0
importlib-resources>=5.0.0
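
Because several of these packages are pinned to exact versions, it can help to confirm what is actually installed before starting ComfyUI. The following is a minimal sketch using only the standard library; `PINNED` and `check_pins` are illustrative names, not part of this wrapper, and only a subset of the pins is listed.

```python
from importlib import metadata

# Exact pins copied from the requirements.txt listing above (subset).
PINNED = {
    "alias-free-torch": "0.0.6",
    "descript-audio-codec": "1.0.0",
    "vector-quantize-pytorch": "1.9.14",
    "einops": "0.7.0",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: problem} for every pin that is missing or mismatched."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    # Prints nothing when every pinned package matches.
    for package, problem in check_pins(PINNED).items():
        print(f"{package}: {problem}")
```

Run it with the same Python interpreter that ComfyUI uses, otherwise the result describes a different environment.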
📄 README

ComfyUI-ThinkSound_Wrapper

A ComfyUI wrapper implementation of ThinkSound – an advanced AI model for generating high-quality audio from text descriptions and video content using Chain-of-Thought (CoT) reasoning.

https://github.com/user-attachments/assets/b3f090a7-fe58-4bb0-8e21-cb19377aa9cf

14.02.25 – Added support for the large model, thinksound.ckpt

  • Download it from: https://huggingface.co/FunAudioLLM/ThinkSound/resolve/main/thinksound.ckpt?download=true
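
The checkpoint can also be fetched from a script. This is a sketch, assuming the `huggingface_hub` package from requirements.txt is installed; `parse_resolve_url` and `download_checkpoint` are hypothetical helper names, and the target folder is whichever models directory you use (Step 4 below uses `ComfyUI/models/thinksound`).

```python
from pathlib import Path
from urllib.parse import urlparse


def parse_resolve_url(url):
    """Split a Hugging Face 'resolve' URL into (repo_id, revision, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/resolve/<revision>/<path inside repo>
    if len(parts) < 5 or parts[2] != "resolve":
        raise ValueError(f"not a resolve URL: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def download_checkpoint(url, target_dir):
    """Download the file behind `url` into `target_dir` and return its path."""
    # Imported lazily so the URL helper works without the package installed.
    from huggingface_hub import hf_hub_download

    repo_id, revision, filename = parse_resolve_url(url)
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    return hf_hub_download(
        repo_id=repo_id, filename=filename, revision=revision, local_dir=target_dir
    )
```

For example, `download_checkpoint(url, "ComfyUI/models/thinksound")` with the URL above would place `thinksound.ckpt` next to the other model files.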
    🎵 Features

  • Text-to-Audio Generation: Create audio from detailed text descriptions
  • Video-to-Audio Generation: Generate synchronized audio that matches video content
  • Chain-of-Thought Reasoning: Use detailed CoT prompts for precise audio control
  • Multimodal Understanding: Combines visual and textual information for better results
  • ComfyUI Integration: Easy-to-use nodes that integrate seamlessly with ComfyUI workflows
    🎬 What Makes ThinkSound Special

    ThinkSound uses multimodal AI to understand both text and video:

  • MetaCLIP for visual scene understanding
  • Synchformer for temporal motion analysis
  • T5 for detailed language understanding
  • Advanced diffusion model for high-quality audio synthesis
    📋 Requirements

    System Requirements

  • NVIDIA GPU with at least 12GB VRAM (24GB+ recommended)
  • Python 3.8+
  • ComfyUI installed and working
  • Windows/Linux (tested on Windows)
    Dependencies

    The following Python packages are required (see Step 3 under Installation for how to install them):

    torch>=2.0.1
    torchaudio>=2.0.2
    torchvision>=0.15.0
    transformers>=4.20.0
    accelerate>=0.20.0
    alias-free-torch==0.0.6
    descript-audio-codec==1.0.0
    vector-quantize-pytorch==1.9.14
    einops==0.7.0
    open-clip-torch>=2.20.0
    huggingface_hub
    safetensors
    sentencepiece>=0.1.99

    🚀 Installation

    Step 1: Install ComfyUI Custom Node

  • Navigate to your ComfyUI custom nodes folder:

    ```bash
    cd ComfyUI/custom_nodes/
    ```

  • Clone this repository:

    ```bash
    git clone https://github.com/ShmuelRonen/ComfyUI-ThinkSound_Wrapper.git
    cd ComfyUI-ThinkSound_Wrapper
    ```

  • Your folder structure should look like:

    ```
    ComfyUI-ThinkSound_Wrapper/
    ├── __init__.py
    ├── nodes.py
    ├── requirements.txt
    ├── thinksound/
    │   ├── data/
    │   ├── models/
    │   ├── inference/
    │   └── …
    └── README.md
    ```

    Step 3: Install Dependencies

    Option A: Install all dependencies (recommended)

    pip install -r requirements.txt

    Option B: Install minimal dependencies

    pip install torch torchaudio torchvision transformers accelerate
    pip install alias-free-torch==0.0.6 descript-audio-codec==1.0.0 vector-quantize-pytorch==1.9.14
    pip install einops open-clip-torch huggingface_hub safetensors sentencepiece

    Step 4: Download Models

  • Download the models pack from Google Drive:
  • 🔗 Download Models (Google Drive)

  • Extract the downloaded file and place models in:

    ```
    ComfyUI/models/thinksound/
    ├── thinksound_light.ckpt
    ├── vae.ckpt
    ├── synchformer_state_dict.pth
    └── (other model files)
    ```

  • Create the thinksound models folder if it doesn't exist:

    ```bash
    mkdir -p ComfyUI/models/thinksound
    ```
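
Before restarting, it may be worth confirming that the files landed in the right folder and are not empty (an interrupted download often leaves a zero-byte or truncated file). The sketch below checks only the three file names shown in the listing above; `REQUIRED_MODELS` and `model_report` are illustrative names, not part of the wrapper.

```python
from pathlib import Path

# File names from the listing above; the pack may contain other files too.
REQUIRED_MODELS = ("thinksound_light.ckpt", "vae.ckpt", "synchformer_state_dict.pth")


def model_report(models_dir):
    """Map each required model file to its size in bytes, or None if absent."""
    root = Path(models_dir)
    report = {}
    for name in REQUIRED_MODELS:
        path = root / name
        report[name] = path.stat().st_size if path.is_file() else None
    return report
```

Calling `model_report("ComfyUI/models/thinksound")` returns `None` for any file that is missing, which points to a wrong folder or an incomplete extraction.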

    Step 5: Restart ComfyUI

  • Restart ComfyUI completely
  • Check the console for successful loading messages:

    ```
    🎉 ThinkSound modules imported successfully!
    ✅ SUCCESS: Found FeaturesUtils in thinksound.data.v2a_utils.feature_utils_224
    ```

    🎛️ Usage

    Available Nodes

    After installation, you’ll find these nodes in ComfyUI:

  • ThinkSound Model Loader: loads the main ThinkSound diffusion model
      • Input: thinksound_model (select your .ckpt file)
      • Output: thinksound_model
  • ThinkSound Feature Utils Loader: loads the VAE and Synchformer models
      • Inputs: vae_model, synchformer_model
      • Output: feature_utils
  • ThinkSound Sampler: the main generation node; generates audio from text and/or video
    Basic Workflow

    ThinkSound Model Loader ─────────┐
                                     ├── ThinkSound Sampler ── Audio Output
    ThinkSound Feature Utils Loader ─┘

    Sampler Node Parameters

  • Duration: Audio length in seconds (1.0 – 30.0)
  • Steps: Denoising steps (30 recommended)
  • CFG Scale: Guidance strength (5.0 recommended)
  • Seed: Random seed for reproducibility
  • Caption: Short audio description
  • CoT Description: Detailed Chain-of-Thought prompt
  • Video: Optional video input for video-to-audio generation
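
If you keep sampler settings in a script or preset file, a small container with range checks can catch mistakes before a long generation run. This is a hypothetical sketch, not the node's actual interface: the field names, the default duration of 8 seconds, and the rules that the caption must be non-empty and CFG scale positive are all assumptions; only the duration range and the recommended steps and CFG values come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerSettings:
    """Hypothetical container mirroring the parameters listed above."""

    caption: str
    cot_description: str = ""
    duration: float = 8.0      # seconds; documented range is 1.0 to 30.0
    steps: int = 30            # recommended value from the list above
    cfg_scale: float = 5.0     # recommended value from the list above
    seed: int = 0
    video_path: Optional[str] = None  # None means text-to-audio only

    def validate(self):
        """Return a list of human-readable problems (empty if all is well)."""
        problems = []
        if not 1.0 <= self.duration <= 30.0:
            problems.append("duration must be between 1.0 and 30.0 seconds")
        if self.steps < 1:
            problems.append("steps must be a positive integer")
        if self.cfg_scale <= 0:
            problems.append("cfg_scale must be positive")
        if not self.caption.strip():
            problems.append("caption must not be empty")
        return problems
```

For example, `SamplerSettings(caption="Dog barking").validate()` returns an empty list, while a 45-second duration is reported as out of range.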
    🎵 Examples

    Text-to-Audio Examples

    Example 1: Simple Audio

    Caption: "Dog barking"
    CoT Description: "Generate the sound of a medium-sized dog barking outdoors. The barking should be natural and energetic, with slight echo to suggest an open space. Include 3-4 distinct barks with realistic timing between them."

    Example 2: Complex Scene

    Caption: "Ocean waves at beach"
    CoT Description: "Create gentle ocean waves lapping against the shore. Add subtle sounds of water receding over sand and pebbles. Include distant seagull calls and a light ocean breeze for natural ambiance."

    Example 3: Musical Content

    Caption: "Jazz piano"
    CoT Description: "Generate a smooth jazz piano melody in a minor key. Include syncopated rhythms, bluesy chord progressions, and subtle improvisation. The tempo should be moderate and relaxing, perfect for a late-night cafe atmosphere."

    Video-to-Audio Generation

  • Load a video using ComfyUI’s video loader nodes
  • Connect the video to the ThinkSound Sampler’s video input
  • Add descriptive text to guide the audio generation
  • Generate audio that syncs with the video content
    ⚠️ Important Notes

    Model Precision

  • ThinkSound requires fp32 precision for stable operation
  • The nodes automatically use fp32 (no precision selection needed)
  • Do not force fp16 as it may cause tensor dimension errors
    Memory Requirements

  • 8GB VRAM minimum for basic operation
  • 12GB+ VRAM recommended for longer audio generation
  • Enable “force_offload” to save VRAM (enabled by default)
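
To see where a given GPU falls relative to these figures, the following sketch queries PyTorch for the total device memory. `vram_tier` and `detect_vram_gb` are illustrative names; the thresholds are simply the 8GB and 12GB values from the list above, and drivers often report slightly less than the nominal size, so rounding before classifying is a reasonable assumption.

```python
def vram_tier(total_gb):
    """Classify a VRAM size against the figures in the list above."""
    if total_gb < 8:
        return "below minimum"
    if total_gb < 12:
        return "basic operation"
    return "recommended"


def detect_vram_gb(device_index=0):
    """Return total VRAM in GiB for a CUDA device, or None if unavailable."""
    try:
        import torch  # imported lazily so vram_tier works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3
```

A typical use is `gb = detect_vram_gb()` followed by `vram_tier(round(gb))` when `gb` is not `None`.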
    Video Input Format

  • Supported: MP4, AVI, MOV (any format ComfyUI can load)
  • Recommended: 8-30 seconds duration
  • Processing: Automatically handled by the node
    🐛 Troubleshooting

    Common Issues

    Issue: “ThinkSound source code not installed”

    Solution: Ensure you've downloaded the ThinkSound repository to the 'thinksound' folder

    Issue: “ImportError: No module named 'alias_free_torch'”

    Solution: Install missing dependencies:
    pip install alias-free-torch==0.0.6 descript-audio-codec==1.0.0 vector-quantize-pytorch==1.9.14
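
To find out which modules are missing without launching ComfyUI, a quick standard-library check along these lines may help. `alias_free_torch` is the import name shown in the error above; `dac` and `vector_quantize_pytorch` are assumed to be the import names of descript-audio-codec and vector-quantize-pytorch, and `missing_modules` is an illustrative helper.

```python
from importlib.util import find_spec

# "alias_free_torch" comes from the error message above; the other two
# import names are assumptions about how the pinned packages are imported.
MODULES = ("alias_free_torch", "dac", "vector_quantize_pytorch")


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]
```

`missing_modules(MODULES)` returns an empty list once the three packages are installed in the interpreter ComfyUI runs with.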

    Issue: “Input type (float) and bias type (struct c10::Half) should be the same”

    Solution: This is resolved automatically with fp32 precision. Restart ComfyUI if you see this error.

    Issue: “Tensors must have same number of dimensions”

    Solution: Update to the latest version of the nodes. This was fixed in recent updates.

    Issue: Models not loading

    Solution: 
    1. Check that models are in ComfyUI/models/thinksound/
    2. Verify model file names match the dropdown options
    3. Check ComfyUI console for specific error messages

    Performance Tips

  • Start with shorter durations (8-10 seconds) for testing
  • Use lower step counts (12-16) for faster generation during testing
  • Enable force_offload to manage VRAM usage
  • Close other GPU-intensive applications while generating
    📊 Expected Performance

    Generation Times (approximate)

  • 8 seconds audio: 30-60 seconds on RTX 3080
  • 15 seconds audio: 60-120 seconds on RTX 3080
  • Video analysis: Additional 10-20 seconds
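
For durations between the two reference points, a rough range can be interpolated. The sketch below assumes generation time scales linearly with audio length, which the figures above suggest but do not confirm, and applies only to the RTX 3080 numbers quoted here; `estimate_seconds` is an illustrative name.

```python
# Reference points from the list above:
# (audio seconds, (low, high) wall-clock seconds on an RTX 3080).
POINT_A = (8.0, (30.0, 60.0))
POINT_B = (15.0, (60.0, 120.0))
VIDEO_OVERHEAD = (10.0, 20.0)  # extra seconds when a video is analysed


def estimate_seconds(audio_seconds, with_video=False):
    """Linearly interpolate (or extrapolate) a rough (low, high) time range."""
    (x0, (lo0, hi0)), (x1, (lo1, hi1)) = POINT_A, POINT_B
    t = (audio_seconds - x0) / (x1 - x0)
    low = lo0 + t * (lo1 - lo0)
    high = hi0 + t * (hi1 - hi0)
    if with_video:
        low += VIDEO_OVERHEAD[0]
        high += VIDEO_OVERHEAD[1]
    return low, high
```

Both reference points work out to roughly 4 to 8 seconds of compute per second of audio, so treat results far outside the 8 to 15 second range with extra caution.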
    Quality Settings

  • Steps 12-16: Fast, good quality
  • Steps 24: Recommended balance
  • Steps 32+: High quality, slower
    🔄 Updates

    To update the project:

  • Pull latest changes: git pull origin main
  • Update ThinkSound source: cd thinksound && git pull
  • Restart ComfyUI
    📄 License

    This project is a wrapper implementation based on ThinkSound by FunAudioLLM. Please refer to the original ThinkSound repository for licensing information.

    🤝 Contributing

    Contributions are welcome! Please:

  • Fork the repository
  • Create a feature branch
  • Submit a pull request
    📞 Support

    If you encounter issues:

  • Check the troubleshooting section above
  • Review ComfyUI console output for error messages
  • Open an issue on GitHub with detailed error information
    🎉 Acknowledgments

  • ThinkSound Team for the original model and research
  • ComfyUI Community for the excellent framework
  • Contributors who helped test and improve this wrapper implementation

    Enjoy creating amazing audio with ThinkSound! 🎵✨