ComfyUI-HiggsAudio_Wrapper

★ 27

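Text-to-speech · Voice cloning · ComfyUI integration · GPU acceleration

A high-quality text-to-speech wrapper that brings HiggsAudio v2 to ComfyUI, with voice cloning from multiple presets or custom reference audio and optional GPU acceleration.

💡 Quickly generate high-quality speech in ComfyUI from reference audio or a preset.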

🍴 4 Forks · 💻 Python · 🔄 2025-07-26

📦 requirements.txt

descript-audio-codec
torch
transformers==4.45.2
librosa
dacite
boto3==1.35.36
s3fs
soundfile
torchvision
torchaudio
json_repair
pandas
pydantic
vector_quantize_pytorch
loguru
pydub
ruff==0.12.2
omegaconf
click
langid
jieba
accelerate>=0.26.0

📄 README

ComfyUI-HiggsAudio_Wrapper

A comprehensive ComfyUI wrapper for HiggsAudio v2, enabling high-quality text-to-speech generation with advanced voice cloning capabilities.

Features

High-Quality Audio Generation: Leverages the powerful HiggsAudio v2 3B parameter model

Voice Cloning: Clone voices using reference audio or built-in voice presets

Multiple Voice Presets: Includes pre-configured voices (belinda, en_woman, en_man, etc.)

Flexible Audio Prioritization: Control whether to use voice presets or custom reference audio

Customizable System Prompts: Fine-tune audio generation with scene descriptions and style control

GPU Acceleration: Supports CUDA for faster generation

ComfyUI Integration: Seamless integration with ComfyUI workflows

Installation

Prerequisites

Python 3.8+

ComfyUI

CUDA-compatible GPU (recommended)

ComfyUI Installation

Clone this repository into your ComfyUI custom_nodes directory:

cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-HiggsAudio_Wrapper.git

Install Dependencies

cd ComfyUI-HiggsAudio_Wrapper
pip install -r requirements.txt

Restart ComfyUI

The nodes will appear under the “Higgs Audio” category
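If the nodes do not appear after a restart, a quick import check can show whether the main dependencies were installed into the Python environment that ComfyUI uses. The helper below is only a sketch and is not part of the wrapper: `missing_modules` is a hypothetical name, and `CORE_MODULES` is a hand-picked subset of requirements.txt written as import names (some packages import under a different name than they install under).

```python
import importlib.util

# Hand-picked subset of requirements.txt, written as import names.
# Assumption: each package imports under the name shown here.
CORE_MODULES = ["torch", "torchaudio", "transformers", "librosa", "soundfile"]


def missing_modules(names):
    """Return the top-level module names the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(CORE_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All core modules are importable.")
```

Run it with the same interpreter that launches ComfyUI; a different interpreter may report a different result.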

Usage

Basic Workflow

The wrapper provides several nodes that can be chained together:

Load Higgs Audio Model – Loads the generation model

Load Higgs Audio Tokenizer – Loads the audio tokenizer

Load Higgs Audio System Prompt – Configures generation style

Load Higgs Audio Prompt – Sets the text to convert to speech

Higgs Audio Generator – Performs the actual audio generation
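The chain above can be pictured as a ComfyUI API-format graph, in which each node has a `class_type` and `inputs`, and a link to another node's output is written as `[node_id, output_index]`. The sketch below is illustrative only: every `class_type` and the `text` input name are hypothetical placeholders, and only the generator input names are taken from the API Reference section of this README. The prompt text and sampling values are example values, not recommended settings.

```python
# ASSUMPTION: all class_type values and the "text" input name are hypothetical
# placeholders. Only the generator's input names come from the README.
workflow = {
    "1": {"class_type": "LoadHiggsAudioModel", "inputs": {}},
    "2": {"class_type": "LoadHiggsAudioTokenizer", "inputs": {}},
    "3": {"class_type": "LoadHiggsAudioSystemPrompt",
          "inputs": {"text": "Generate audio following instruction."}},
    "4": {"class_type": "LoadHiggsAudioPrompt",
          "inputs": {"text": "Hello from ComfyUI."}},
    "5": {"class_type": "HiggsAudioGenerator",
          "inputs": {
              "MODEL_PATH": ["1", 0],            # link: node 1, output 0
              "AUDIO_TOKENIZER_PATH": ["2", 0],
              "system_prompt": ["3", 0],
              "prompt": ["4", 0],
              "max_new_tokens": 1024,
              "temperature": 0.8,
              "top_p": 0.95,
              "top_k": 50,
              "device": "auto",
          }},
}


def dangling_links(graph):
    """Return (node_id, input_name) pairs whose link targets a missing node."""
    problems = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] not in graph:
                problems.append((node_id, name))
    return problems
```

To find the real node and input names, export a working graph from ComfyUI in API format and compare it with this shape.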

Voice Cloning Options

Using Voice Presets

The wrapper includes several built-in voice presets:

belinda – Female voice

en_woman – English female voice

en_man – English male voice

mabel – Alternative female voice

vex – Character voice

chadwick – Male voice

broom_salesman – Character voice

zh_man_sichuan – Chinese male voice (Sichuan dialect)

voice_clone – Use a custom reference audio clip (30 seconds)

Using Custom Reference Audio

Set voice preset to voice_clone

Connect reference audio to the reference_audio input

Optionally provide reference text that describes the audio

Audio Priority Settings

Control which audio source takes precedence:

auto (default) – Uses the voice preset if one is selected, otherwise falls back to the reference audio

preset_dropdown – Always prioritizes dropdown selection over reference audio

reference_input – Always prioritizes reference audio over dropdown

force_preset – Forces use of preset, ignoring reference audio completely
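One way to read the four modes above is as a small decision function. This is a sketch of the documented behaviour, not the wrapper's actual code: `resolve_voice_source` is a hypothetical name, "a preset is selected" is assumed to mean the dropdown is set to anything other than voice_clone, and the fallback cases (for example reference_input with no audio connected) are guesses. The README does not say how auto and preset_dropdown differ, so the sketch treats them the same.

```python
def resolve_voice_source(audio_priority, voice_preset, has_reference_audio):
    """Return 'preset', 'reference', or 'none' for the given settings."""
    # Assumption: any dropdown value other than "voice_clone" is a real preset.
    preset_selected = voice_preset != "voice_clone"

    if audio_priority == "force_preset":
        # Reference audio is ignored completely.
        return "preset" if preset_selected else "none"

    if audio_priority == "reference_input":
        if has_reference_audio:
            return "reference"
        # Assumed fallback when no reference audio is connected.
        return "preset" if preset_selected else "none"

    if audio_priority in ("auto", "preset_dropdown"):
        if preset_selected:
            return "preset"
        return "reference" if has_reference_audio else "none"

    raise ValueError(f"unknown audio_priority: {audio_priority!r}")
```

For example, `resolve_voice_source("reference_input", "belinda", True)` returns `"reference"`, while the same call with `"force_preset"` returns `"preset"`.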

Configuration

What Actually Affects Audio Quality

Important: System prompts and scene descriptions have minimal effect on HiggsAudio output. Focus on these factors that actually work:

Voice Quality Control

Reference Audio: High-quality voice samples (24kHz+) with clear articulation

Voice Presets: Different presets have distinct characteristics – test to find the best fit

Reference Text: Clear, well-punctuated text that matches the reference audio
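Before using a clip as reference audio, it can help to check its sample rate and length. The sketch below uses only the Python standard library and therefore handles uncompressed PCM WAV files only; `inspect_wav` is a hypothetical helper name, and the 24 kHz threshold simply mirrors the recommendation above.

```python
import wave


def inspect_wav(path, min_rate=24_000):
    """Read a PCM WAV header and report sample rate, duration, and a rate check."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "duration_sec": frames / rate,
        "meets_min_rate": rate >= min_rate,
    }
```

Calling `inspect_wav("my_reference.wav")` (a placeholder file name) returns a small dictionary that can be checked before the clip is wired into the reference_audio input.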

System Prompt (Minimal Impact)

Keep system prompts simple since complex scene descriptions are largely ignored:

Generate audio following instruction.

Generation Parameters

max_new_tokens (128-4096): Controls audio length and pacing

temperature (0.0-2.0): Controls voice consistency (0.8 = more stable, 1.2 = more varied)

top_p (0.1-1.0): Affects pronunciation variation (0.9-0.95 recommended)

top_k (-1 to 100): Fine-tunes voice characteristics (50 = default)

device: auto/cuda/cpu (auto = recommended)
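The documented ranges can be checked before a generation run. The following is a minimal sketch, not the wrapper's own validation: `GenerationParams` and its `problems` method are hypothetical names, and only the top_k default of 50 comes from this README; the `device` default of "auto" follows the recommendation above.

```python
from dataclasses import dataclass

VALID_DEVICES = ("auto", "cuda", "cpu")


@dataclass
class GenerationParams:
    max_new_tokens: int
    temperature: float
    top_p: float
    top_k: int = 50        # the README lists 50 as the default
    device: str = "auto"   # the README recommends auto

    def problems(self):
        """Return a list of range violations (empty when all values are valid)."""
        issues = []
        if not 128 <= self.max_new_tokens <= 4096:
            issues.append("max_new_tokens must be in 128-4096")
        if not 0.0 <= self.temperature <= 2.0:
            issues.append("temperature must be in 0.0-2.0")
        if not 0.1 <= self.top_p <= 1.0:
            issues.append("top_p must be in 0.1-1.0")
        if not -1 <= self.top_k <= 100:
            issues.append("top_k must be in -1 to 100")
        if self.device not in VALID_DEVICES:
            issues.append("device must be one of auto, cuda, cpu")
        return issues
```

For instance, `GenerationParams(1024, 0.8, 0.95).problems()` returns an empty list, while a temperature of 2.5 would be reported as out of range.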

File Structure

ComfyUI-HiggsAudio_Wrapper/
├── __init__.py                 # Node registration
├── nodes.py                    # Main node implementations
├── requirements.txt            # Python dependencies
├── voice_examples/             # Voice preset files
│   ├── config.json            # Voice preset configuration
│   ├── en_woman.wav           # Female English voice
│   ├── en_man.wav             # Male English voice
│   └── ...                    # Other voice presets
└── boson_multimodal/          # HiggsAudio engine
    └── ...

Realistic Expectations

What HiggsAudio Does Well

Voice Cloning: Excellent at replicating voice characteristics from reference audio

Speech Quality: Generates natural-sounding speech with good pronunciation

Multiple Voices: Built-in voice presets for different character types

Consistency: Maintains voice characteristics across longer text

Current Limitations

Scene Control: System prompts for acoustic environments (reverb, background sounds) have minimal effect

Emotional Control: Limited ability to control emotional expression through text prompts

Background Audio: Cannot generate environmental sounds or music

Real-time: Requires processing time, not suitable for real-time applications

Best Use Cases

Voice-over generation with consistent character voices

Audiobook narration with cloned voices

Character voices for games or animations

Text-to-speech with specific voice characteristics

For acoustic effects like reverb or background sounds, consider post-processing with audio editing software.

Troubleshooting

Common Issues

Poor Audio Quality

Use higher quality reference audio (24kHz+ recommended)

Try different voice presets to find the best match

Adjust temperature (0.8 for stability, 1.2 for variation)

Ensure reference text matches the reference audio content

“audio_base64 is None” Error

Ensure reference audio is properly formatted

Check that voice preset files exist in voice_examples/

Verify audio file is not corrupted
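The preset-file check above can be automated with a few lines. This sketch assumes each preset is stored as `<preset>.wav` inside voice_examples/, which matches en_woman.wav and en_man.wav in the file tree but is not confirmed for every preset; `missing_preset_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_preset_files(voice_dir, presets):
    """Return the presets that have no '<preset>.wav' file in voice_dir."""
    folder = Path(voice_dir)
    return [name for name in presets if not (folder / f"{name}.wav").is_file()]
```

A call such as `missing_preset_files("ComfyUI/custom_nodes/ComfyUI-HiggsAudio_Wrapper/voice_examples", ["en_woman", "en_man"])` returns an empty list when both files are present.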

Inconsistent Voice Output

Lower the temperature parameter (try 0.8)

Use higher quality reference audio

Ensure reference audio has consistent background noise levels

CUDA Out of Memory

Reduce max_new_tokens

Use device: cpu instead of auto/cuda (slower, but not limited by GPU memory)

Close other GPU-intensive applications
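To see what the device setting would resolve to on a given machine, a small check like the one below can help. It is a sketch under assumptions: `pick_device` is a hypothetical helper, and treating auto as "CUDA if available, otherwise CPU" is a common convention rather than something this README states. The function works without torch installed, in which case CUDA is treated as unavailable.

```python
def pick_device(requested="auto"):
    """Resolve 'auto', 'cuda', or 'cpu' to a concrete device string."""
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")

    try:
        import torch  # imported lazily so the check also runs without torch
        cuda_ok = torch.cuda.is_available()
    except ImportError:
        cuda_ok = False

    if requested == "cuda" and not cuda_ok:
        raise RuntimeError("CUDA was requested but is not available")
    if requested == "auto":
        # Assumption: auto prefers the GPU and falls back to the CPU.
        return "cuda" if cuda_ok else "cpu"
    return requested
```

If `pick_device("auto")` returns `"cpu"` on a machine with a GPU, the installed torch build probably lacks CUDA support.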

Model Loading Issues

Ensure stable internet connection for model download

Check available disk space (models are several GB)

Verify transformers version compatibility

Performance Tips

First Run: Model downloading may take time

GPU Memory: 8GB+ VRAM recommended for optimal performance

Caching: Models are cached after first load for faster subsequent runs

Voice Quality: Use high-quality reference audio for best results

Parameter Tuning: Lower temperature (0.8) for consistent voice, higher (1.2) for variation

Text Formatting: Use proper punctuation for natural speech rhythm

API Reference

HiggsAudio Node Inputs

Required

MODEL_PATH: Path to HiggsAudio model

AUDIO_TOKENIZER_PATH: Path to audio tokenizer

system_prompt: System prompt for generation control

prompt: Text to convert to speech

max_new_tokens: Maximum tokens to generate

temperature: Sampling temperature

top_p: Nucleus sampling parameter

top_k: Top-k sampling parameter

device: Computation device

Optional

voice_preset: Voice preset selection

reference_audio: Custom reference audio

reference_text: Text corresponding to reference audio

audio_priority: Audio source prioritization

Output

output: Generated audio in ComfyUI format

used_voice_info: Information about which voice source was used

Requirements

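See requirements.txt for the complete list. The versions below are the ones this README names; where they differ from requirements.txt (which pins transformers==4.45.2 and leaves torch and torchaudio unpinned), requirements.txt is what pip install -r actually installs: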

torch==2.5.1

torchaudio==2.5.1

transformers>=4.45.1,<4.47.0

librosa

And others…

Third-Party Licenses

The boson_multimodal/audio_processing/ directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE in that directory for complete attribution and licensing information.

Contributing

Fork the repository

Create a feature branch

Make your changes

Add tests if applicable

Submit a pull request

Support

For issues and questions:

Open an issue on GitHub

Check existing issues for solutions

Provide detailed error messages and system information

Acknowledgments

HiggsAudio team for the underlying model

ComfyUI community for the framework

Contributors and testers

Note: This wrapper requires significant computational resources. A CUDA-compatible GPU with 8GB+ VRAM is recommended for optimal performance.