ComfyUI-HiggsAudio_Wrapper
A comprehensive ComfyUI wrapper for HiggsAudio v2, enabling high-quality text-to-speech generation with advanced voice cloning capabilities.
Features
High-Quality Audio Generation: Leverages the powerful HiggsAudio v2 3B parameter model
Voice Cloning: Clone voices using reference audio or built-in voice presets
Multiple Voice Presets: Includes pre-configured voices (belinda, en_woman, en_man, etc.)
Flexible Audio Prioritization: Control whether to use voice presets or custom reference audio
Customizable System Prompts: Fine-tune audio generation with scene descriptions and style control
GPU Acceleration: Supports CUDA for faster generation
ComfyUI Integration: Seamless integration with ComfyUI workflows
Installation
Prerequisites
Python 3.8+
ComfyUI
CUDA-compatible GPU (recommended)
ComfyUI Installation
Clone this repository into your ComfyUI custom_nodes directory:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-HiggsAudio_Wrapper.git
Install Dependencies
pip install -r requirements.txt
Restart ComfyUI
The nodes will appear under the “Higgs Audio” category
Usage
Basic Workflow
The wrapper provides several nodes that can be chained together:
Load Higgs Audio Model – Loads the generation model
Load Higgs Audio Tokenizer – Loads the audio tokenizer
Load Higgs Audio System Prompt – Configures generation style
Load Higgs Audio Prompt – Sets the text to convert to speech
Higgs Audio Generator – Performs the actual audio generation
Voice Cloning Options
Using Voice Presets
The wrapper includes several built-in voice presets:
belinda – Female voice
en_woman – English female voice
en_man – English male voice
mabel – Alternative female voice
vex – Character voice
chadwick – Male voice
broom_salesman – Character voice
zh_man_sichuan – Chinese male voice (Sichuan dialect)
voice_clone – Use custom reference 30 sec audio
Using Custom Reference Audio
Set voice preset to voice_clone
Connect reference audio to the reference_audio input
Optionally provide reference text that describes the audio
Audio Priority Settings
Control which audio source takes precedence:
auto (default) – Uses voice preset if selected, otherwise reference audio
preset_dropdown – Always prioritizes dropdown selection over reference audio
reference_input – Always prioritizes reference audio over dropdown
force_preset – Forces use of preset, ignoring reference audio completely
Configuration
What Actually Affects Audio Quality
Important: System prompts and scene descriptions have minimal effect on HiggsAudio output. Focus on these factors that actually work:
Voice Quality Control
Reference Audio: High-quality voice samples (24kHz+) with clear articulation
Voice Presets: Different presets have distinct characteristics – test to find the best fit
Reference Text: Clear, well-punctuated text that matches the reference audio
System Prompt (Minimal Impact)
Keep system prompts simple since complex scene descriptions are largely ignored:
Generate audio following instruction.
Generation Parameters
max_new_tokens (128-4096): Controls audio length and pacing
temperature (0.0-2.0): Controls voice consistency (0.8 = more stable, 1.2 = more varied)
top_p (0.1-1.0): Affects pronunciation variation (0.9-0.95 recommended)
top_k (-1-100): Fine-tunes voice characteristics (50 = default)
device: auto/cuda/cpu (auto = recommended)
File Structure
ComfyUI-HiggsAudio_Wrapper/
├── __init__.py # Node registration
├── nodes.py # Main node implementations
├── requirements.txt # Python dependencies
├── voice_examples/ # Voice preset files
│ ├── config.json # Voice preset configuration
│ ├── en_woman.wav # Female English voice
│ ├── en_man.wav # Male English voice
│ └── ... # Other voice presets
└── boson_multimodal/ # HiggsAudio engine
└── ...
Realistic Expectations
What HiggsAudio Does Well
Voice Cloning: Excellent at replicating voice characteristics from reference audio
Speech Quality: Generates natural-sounding speech with good pronunciation
Multiple Voices: Built-in voice presets for different character types
Consistency: Maintains voice characteristics across longer text
Current Limitations
Scene Control: System prompts for acoustic environments (reverb, background sounds) have minimal effect
Emotional Control: Limited ability to control emotional expression through text prompts
Background Audio: Cannot generate environmental sounds or music
Real-time: Requires processing time, not suitable for real-time applications
Best Use Cases
Voice-over generation with consistent character voices
Audiobook narration with cloned voices
Character voices for games or animations
Text-to-speech with specific voice characteristics
For acoustic effects like reverb or background sounds, consider post-processing with audio editing software.
Troubleshooting
Common Issues
Poor Audio Quality
Use higher quality reference audio (24kHz+ recommended)
Try different voice presets to find the best match
Adjust temperature (0.8 for stability, 1.2 for variation)
Ensure reference text matches the reference audio content
“audio_base64 is None” Error
Ensure reference audio is properly formatted
Check that voice preset files exist in voice_examples/
Verify audio file is not corrupted
Inconsistent Voice Output
Lower the temperature parameter (try 0.8)
Use higher quality reference audio
Ensure reference audio has consistent background noise levels
CUDA Out of Memory
Reduce max_new_tokens
Use device: cpu instead of auto/cuda
Close other GPU-intensive applications
Model Loading Issues
Ensure stable internet connection for model download
Check available disk space (models are several GB)
Verify transformers version compatibility
Performance Tips
First Run: Model downloading may take time
GPU Memory: 8GB+ VRAM recommended for optimal performance
Caching: Models are cached after first load for faster subsequent runs
Voice Quality: Use high-quality reference audio for best results
Parameter Tuning: Lower temperature (0.8) for consistent voice, higher (1.2) for variation
Text Formatting: Use proper punctuation for natural speech rhythm
API Reference
HiggsAudio Node Inputs
Required
MODEL_PATH: Path to HiggsAudio model
AUDIO_TOKENIZER_PATH: Path to audio tokenizer
system_prompt: System prompt for generation control
prompt: Text to convert to speech
max_new_tokens: Maximum tokens to generate
temperature: Sampling temperature
top_p: Nucleus sampling parameter
top_k: Top-k sampling parameter
device: Computation device
Optional
voice_preset: Voice preset selection
reference_audio: Custom reference audio
reference_text: Text corresponding to reference audio
audio_priority: Audio source prioritization
Output
output: Generated audio in ComfyUI format
used_voice_info: Information about which voice source was used
Requirements
See requirements.txt for complete list:
torch==2.5.1
torchaudio==2.5.1
transformers>=4.45.1,<4.47.0
librosa
And others…
Third-Party Licenses
The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE in that directory for complete attribution and licensing information.
Contributing
Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request
Support
For issues and questions:
Open an issue on GitHub
Check existing issues for solutions
Provide detailed error messages and system information
Acknowledgments
HiggsAudio team for the underlying model
ComfyUI community for the framework
Contributors and testers
Note: This wrapper requires significant computational resources. A CUDA-compatible GPU with 8GB+ VRAM is recommended for optimal performance.