ComfyUI-HiggsAudio_Wrapper

★ 27

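Text-to-speech · Voice cloning · ComfyUI integration · GPU acceleration
A high-quality text-to-speech wrapper for HiggsAudio v2 in ComfyUI, supporting voice cloning from multiple presets or custom reference audio, with optional GPU acceleration.
💡 Quickly generate high-quality speech in ComfyUI from reference audio or a preset.
🍴 4 Forks · 💻 Python · 🔄 2025-07-26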
📦 requirements.txt
descript-audio-codec
torch
transformers==4.45.2
librosa
dacite
boto3==1.35.36
s3fs
soundfile
torchvision
torchaudio
json_repair
pandas
pydantic
vector_quantize_pytorch
loguru
pydub
ruff==0.12.2
omegaconf
click
langid
jieba
accelerate>=0.26.0
📄 README

ComfyUI-HiggsAudio_Wrapper

A comprehensive ComfyUI wrapper for HiggsAudio v2, enabling high-quality text-to-speech generation with advanced voice cloning capabilities.

Features

  • High-Quality Audio Generation: Leverages the powerful HiggsAudio v2 3B parameter model
  • Voice Cloning: Clone voices using reference audio or built-in voice presets
  • Multiple Voice Presets: Includes pre-configured voices (belinda, en_woman, en_man, etc.)
  • Flexible Audio Prioritization: Control whether to use voice presets or custom reference audio
  • Customizable System Prompts: Fine-tune audio generation with scene descriptions and style control
  • GPU Acceleration: Supports CUDA for faster generation
  • ComfyUI Integration: Seamless integration with ComfyUI workflows

Installation

    Prerequisites

  • Python 3.8+
  • ComfyUI
  • CUDA-compatible GPU (recommended)

    ComfyUI Installation

  • Clone this repository into your ComfyUI custom_nodes directory:

    cd ComfyUI/custom_nodes
    git clone https://github.com/ShmuelRonen/ComfyUI-HiggsAudio_Wrapper.git

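    Install Dependencies

    Run this from inside the cloned repository, where requirements.txt lives:

    cd ComfyUI-HiggsAudio_Wrapper
    pip install -r requirements.txt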

  • Restart ComfyUI
  • The nodes will appear under the “Higgs Audio” category

Usage

    Basic Workflow

    The wrapper provides several nodes that can be chained together:

  • Load Higgs Audio Model – Loads the generation model
  • Load Higgs Audio Tokenizer – Loads the audio tokenizer
  • Load Higgs Audio System Prompt – Configures generation style
  • Load Higgs Audio Prompt – Sets the text to convert to speech
  • Higgs Audio Generator – Performs the actual audio generation

    Voice Cloning Options

    Using Voice Presets

    The wrapper includes several built-in voice presets:

  • belinda – Female voice
  • en_woman – English female voice
  • en_man – English male voice
  • mabel – Alternative female voice
  • vex – Character voice
  • chadwick – Male voice
  • broom_salesman – Character voice
  • zh_man_sichuan – Chinese male voice (Sichuan dialect)
  • voice_clone – Use your own reference audio (a clip of about 30 seconds)

    Using Custom Reference Audio

  • Set voice preset to voice_clone
  • Connect reference audio to the reference_audio input
  • Optionally provide reference text that describes the audio

    Audio Priority Settings

    Control which audio source takes precedence:

  • auto (default) – Uses voice preset if selected, otherwise reference audio
  • preset_dropdown – Always prioritizes dropdown selection over reference audio
  • reference_input – Always prioritizes reference audio over dropdown
  • force_preset – Forces use of preset, ignoring reference audio completely
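
The four priority modes can be read as a small decision function. The sketch below is illustrative only and is not taken from nodes.py: `resolve_voice_source` and its return values are hypothetical names, and it assumes that "preset selected" means the dropdown is set to anything other than voice_clone. Under that reading, auto and preset_dropdown resolve the same way; the real node may distinguish them.

```python
from typing import Optional

VOICE_CLONE = "voice_clone"  # dropdown entry meaning "use the reference audio"


def resolve_voice_source(
    voice_preset: str,
    has_reference_audio: bool,
    audio_priority: str = "auto",
) -> Optional[str]:
    """Return "preset", "reference", or None when no usable source exists.

    Hypothetical helper that mirrors the documented priority rules.
    """
    preset_selected = voice_preset != VOICE_CLONE

    if audio_priority == "force_preset":
        # Reference audio is ignored completely in this mode.
        return "preset" if preset_selected else None

    if audio_priority == "reference_input":
        # Reference audio wins whenever it is connected.
        if has_reference_audio:
            return "reference"
        return "preset" if preset_selected else None

    if audio_priority in ("auto", "preset_dropdown"):
        # A selected preset wins; otherwise fall back to the reference audio.
        if preset_selected:
            return "preset"
        return "reference" if has_reference_audio else None

    raise ValueError(f"unknown audio_priority: {audio_priority!r}")
```

For example, with belinda selected and reference audio connected, reference_input yields "reference" while the other three modes yield "preset".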

Configuration

    What Actually Affects Audio Quality

    Important: System prompts and scene descriptions have minimal effect on HiggsAudio output. Focus instead on the factors below, which do affect the result:

    Voice Quality Control

  • Reference Audio: High-quality voice samples (24kHz+) with clear articulation
  • Voice Presets: Different presets have distinct characteristics – test to find the best fit
  • Reference Text: Clear, well-punctuated text that matches the reference audio
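
To check whether a reference clip meets the 24kHz+ recommendation before connecting it, something like the following may help. It is a minimal sketch using only Python's standard wave module, so it handles uncompressed PCM WAV files only; `describe_wav` and `meets_recommendation` are hypothetical helper names, not part of the wrapper.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read sample rate, channel count, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": frames / sample_rate,
    }


def meets_recommendation(path: str, min_rate: int = 24_000) -> bool:
    """True when the file's sample rate is at least min_rate Hz."""
    return describe_wav(path)["sample_rate"] >= min_rate
```

For other formats, the soundfile package listed in requirements.txt exposes similar metadata through soundfile.info(path).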

    System Prompt (Minimal Impact)

    Keep system prompts simple since complex scene descriptions are largely ignored:

    Generate audio following instruction.

    Generation Parameters

  • max_new_tokens (128-4096): Controls audio length and pacing
  • temperature (0.0-2.0): Controls voice consistency (0.8 = more stable, 1.2 = more varied)
  • top_p (0.1-1.0): Affects pronunciation variation (0.9-0.95 recommended)
  • top_k (-1 to 100): Fine-tunes voice characteristics (50 = default)
  • device: auto/cuda/cpu (auto = recommended)
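
For intuition about how temperature, top_k, and top_p interact, here is a generic sketch of the usual sampling filters applied to a list of logits. It is not HiggsAudio's implementation: `filtered_distribution` is a hypothetical name, and treating top_k <= 0 as "no top-k cut" and temperature 0 as greedy decoding are assumptions about what the -1 and 0.0 ends of the documented ranges mean.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p.

    Illustrative only: a generic sampling filter, not HiggsAudio's code.
    """
    if temperature <= 0:
        # Assumption: zero temperature means greedy decoding (best token only).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature: divide logits before the softmax. Lower values sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(
        ((w / total, i) for i, w in enumerate(weights)), reverse=True
    )

    # Top-k: keep only the k most likely tokens (skipped when top_k <= 0).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}
```

With the same logits, a temperature of 0.8 puts more probability on the top token than 1.2 does, which matches the "more stable" versus "more varied" guidance above.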

File Structure

    ComfyUI-HiggsAudio_Wrapper/
    ├── __init__.py                 # Node registration
    ├── nodes.py                    # Main node implementations
    ├── requirements.txt            # Python dependencies
    ├── voice_examples/             # Voice preset files
    │   ├── config.json            # Voice preset configuration
    │   ├── en_woman.wav           # Female English voice
    │   ├── en_man.wav             # Male English voice
    │   └── ...                    # Other voice presets
    └── boson_multimodal/          # HiggsAudio engine
        └── ...

    Realistic Expectations

    What HiggsAudio Does Well

  • Voice Cloning: Excellent at replicating voice characteristics from reference audio
  • Speech Quality: Generates natural-sounding speech with good pronunciation
  • Multiple Voices: Built-in voice presets for different character types
  • Consistency: Maintains voice characteristics across longer text

    Current Limitations

  • Scene Control: System prompts for acoustic environments (reverb, background sounds) have minimal effect
  • Emotional Control: Limited ability to control emotional expression through text prompts
  • Background Audio: Cannot generate environmental sounds or music
  • Real-time: Requires processing time, not suitable for real-time applications

    Best Use Cases

  • Voice-over generation with consistent character voices
  • Audiobook narration with cloned voices
  • Character voices for games or animations
  • Text-to-speech with specific voice characteristics

    For acoustic effects like reverb or background sounds, consider post-processing with audio editing software.

    Troubleshooting

    Common Issues

    Poor Audio Quality

  • Use higher quality reference audio (24kHz+ recommended)
  • Try different voice presets to find the best match
  • Adjust temperature (0.8 for stability, 1.2 for variation)
  • Ensure reference text matches the reference audio content

    “audio_base64 is None” Error

  • Ensure reference audio is properly formatted
  • Check that voice preset files exist in voice_examples/
  • Verify audio file is not corrupted
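
A quick way to confirm that preset files are present is to look for them on disk. This sketch assumes each preset is stored as "<preset>.wav" inside voice_examples/, which matches en_woman.wav and en_man.wav in the file structure but is only a guess for the other presets; `missing_preset_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_preset_files(voice_dir, presets):
    """Return the presets that have no matching .wav file in voice_dir.

    Assumption: a preset named "x" is stored as "x.wav".
    """
    folder = Path(voice_dir)
    return [name for name in presets if not (folder / f"{name}.wav").is_file()]
```

Calling it with the path to voice_examples/ and the preset names you use returns an empty list when every file is found.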

    Inconsistent Voice Output

  • Lower the temperature parameter (try 0.8)
  • Use higher quality reference audio
  • Ensure reference audio has consistent background noise levels

    CUDA Out of Memory

  • Reduce max_new_tokens
  • Use device: cpu instead of auto/cuda
  • Close other GPU-intensive applications
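
When deciding between auto, cuda, and cpu, it can help to see what auto would likely pick on your machine. The sketch below shows one plausible interpretation (use CUDA when PyTorch reports it, otherwise fall back to CPU); how the wrapper itself resolves auto is an assumption here, and `resolve_device` is a hypothetical name.

```python
def resolve_device(requested: str = "auto") -> str:
    """Map "auto" to "cuda" when PyTorch reports a usable GPU, else "cpu".

    Explicit choices ("cuda", "cpu") are returned unchanged.
    """
    if requested != "auto":
        return requested
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, so CUDA is not an option
    return "cuda" if torch.cuda.is_available() else "cpu"
```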

    Model Loading Issues

  • Ensure stable internet connection for model download
  • Check available disk space (models are several GB)
  • Verify transformers version compatibility
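
One way to verify the transformers version is to compare the installed release against the range given in the Requirements section (>=4.45.1,<4.47.0). This is a minimal standard-library sketch; the helper names are hypothetical and the version parsing is deliberately simple (it keeps only the leading digits of the first three components).

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn "4.45.2" into (4, 45, 2), keeping only leading digits per part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_supported_range(version, low="4.45.1", high="4.47.0"):
    """True when low <= version < high."""
    return parse_version(low) <= parse_version(version) < parse_version(high)


def check_transformers():
    """Return (installed_version, ok), or (None, False) if not installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None, False
    return installed, in_supported_range(installed)
```

Running check_transformers() inside the Python environment that ComfyUI uses shows whether that environment's install falls in the range.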

Performance Tips

  • First Run: Model downloading may take time
  • GPU Memory: 8GB+ VRAM recommended for optimal performance
  • Caching: Models are cached after first load for faster subsequent runs
  • Voice Quality: Use high-quality reference audio for best results
  • Parameter Tuning: Lower temperature (0.8) for consistent voice, higher (1.2) for variation
  • Text Formatting: Use proper punctuation for natural speech rhythm

API Reference

    HiggsAudio Node Inputs

    Required

  • MODEL_PATH: Path to HiggsAudio model
  • AUDIO_TOKENIZER_PATH: Path to audio tokenizer
  • system_prompt: System prompt for generation control
  • prompt: Text to convert to speech
  • max_new_tokens: Maximum tokens to generate
  • temperature: Sampling temperature
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • device: Computation device

    Optional

  • voice_preset: Voice preset selection
  • reference_audio: Custom reference audio
  • reference_text: Text corresponding to reference audio
  • audio_priority: Audio source prioritization

    Output

  • output: Generated audio in ComfyUI format
  • used_voice_info: Information about which voice source was used
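
ComfyUI's AUDIO type is a dictionary with a "waveform" tensor shaped [batch, channels, samples] and an integer "sample_rate". Assuming the output field follows that convention, a downstream custom node could compute clip length as sketched below; `audio_duration_seconds` is a hypothetical helper, not part of this wrapper.

```python
def audio_duration_seconds(audio: dict) -> float:
    """Length of a ComfyUI-style AUDIO dict in seconds.

    Assumes audio["waveform"] has a .shape whose last axis is the sample
    count, and audio["sample_rate"] is the rate in Hz.
    """
    samples = audio["waveform"].shape[-1]
    return samples / audio["sample_rate"]
```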

Requirements

    See requirements.txt for the complete list:

  • torch==2.5.1
  • torchaudio==2.5.1
  • transformers>=4.45.1,<4.47.0
  • librosa
  • And others…

Third-Party Licenses

    The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE in that directory for complete attribution and licensing information.

    Contributing

  • Fork the repository
  • Create a feature branch
  • Make your changes
  • Add tests if applicable
  • Submit a pull request

Support

    For issues and questions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Provide detailed error messages and system information

Acknowledgments

  • HiggsAudio team for the underlying model
  • ComfyUI community for the framework
  • Contributors and testers

    Note: This wrapper requires significant computational resources. A CUDA-compatible GPU with 8GB+ VRAM is recommended for optimal performance.