ComfyUI-VoxCPMTTS

★ 36

文本转语音语音克隆自动转录多设备支持

ComfyUI 节点，基于 VoxCPMTTS 提供高质量文本转语音与语音克隆，支持参考音频自动转录、多设备加速和音频后处理，便于可控生成与调试。

💡 在ComfyUI中用文本或参考音频生成高质量语音并克隆声线

🍴 3 Forks💻 Python🔄 2025-12-11

🔗 GitHub 原文

📦

网盘下载

复制链接后前往夸克网盘下载

https://pan.quark.cn/s/8f9eee5e2cdb

📦 requirements.txt

huggingface_hub>=0.20.0
einops>=0.6.0
pydantic>=2.0.0
wetext>=0.1.0
faster-whisper
soundfile

📄 README

ComfyUI VoxCPMTTS Node

A clean, efficient ComfyUI custom node for VoxCPMTTS (Text-to-Speech) functionality. This implementation provides high-quality speech generation and voice cloning capabilities using the VoxCPM 0.5B and 1.5 model.

Features

🎯 High-Quality TTS: Generate natural-sounding speech from text

🎭 Voice Cloning: Clone any voice using a reference audio sample

🔄 Auto-Transcription: Automatic speech recognition for reference audio

⚡ Multi-Device Support: CUDA, MPS, and CPU compatibility

🎛️ Fine-Tuned Control: Adjustable guidance scale, inference steps, and more

🔊 Audio Post-Processing: Built-in fade-in to reduce artifacts

Installation

Method 1: ComfyUI Manager (Recommended)

Open ComfyUI Manager

Search for “VoxCPMTTS”

Install the node

Method 2: Manual Installation

Navigate to your ComfyUI custom nodes directory:

cd ComfyUI/custom_nodes/

Clone this repository:

git clone https://github.com/1038lab/ComfyUI-VoxCPMTTS.git

Install dependencies:

cd ComfyUI-VoxCPMTTS
pip install -r requirements.txt

Restart ComfyUI

Dependencies

The node will automatically install required dependencies on first use:

huggingface_hub>=0.20.0

einops>=0.6.0

pydantic>=2.0.0

wetext>=0.1.0

faster-whisper

Model Download

VoxCPM1.5 (default) will be automatically downloaded to ComfyUI/models/TTS/VoxCPM1.5/ on first use. VoxCPM-0.5B is no longer used.

https://huggingface.co/openbmb/VoxCPM1.5

Usage

Text-to-Speech

Add the VoxCPMTTS node to your workflow

Input your text in the text field

Adjust parameters as needed:

cfg_value: Controls adherence to prompt (1.0-10.0, default: 2.0)

inference_steps: Quality vs speed tradeoff (1-100, default: 10)

max_length: Maximum token length (256-8192, default: 4096)

Connect the audio output to your desired destination

Voice Cloning

Connect a reference audio to the reference_audio input

Optionally provide reference_text (transcript of the reference audio)

If left empty, the node will automatically transcribe the audio

The generated speech will mimic the reference voice characteristics

Parameters

|———–|——|———|————-|

| inference_steps | INT | 10 | Diffusion steps (higher = better quality) |

| max_length | INT | 4096 | Maximum token length |

| seed | INT | -1 | Random seed (-1 for random) |

| fade_in_ms | INT | 20 | Fade-in duration (0-1000ms) |

Outputs

REFERENCE_TEXT: Transcribed or provided reference text

AUDIO: Generated speech audio with 16kHz sample rate

Environment Variables

Set these environment variables to customize behavior:

# ASR model size (tiny, small, medium, large)
export VOXCPM_ASR_MODEL=small

# Maximum retry attempts for bad cases
export VOXCPM_RETRY_MAX=2

Device Selection

auto: Automatically selects the best available device

cuda: Force CUDA if available

mps: Force MPS (Apple Silicon) if available

cpu: Force CPU processing

Example Workflows

Basic TTS Workflow

[Text Input] → [VoxCPMTTS] → [Audio Output]

Voice Cloning Workflow

[Reference Audio] → [VoxCPMTTS] ← [Target Text]
                         ↓
                   [Cloned Audio]

Batch Processing Workflow

[Text List] → [VoxCPMTTS] → [Audio Batch] → [Save Audio]

Performance Tips

Memory Optimization

Use lower inference_steps for faster generation

Choose appropriate max_length for your text

Use CPU device if GPU memory is limited

Quality Settings

Fast: cfg_value=1.5, inference_steps=5

Balanced: cfg_value=2.0, inference_steps=10 (default)

High Quality: cfg_value=3.0, inference_steps=20

Voice Cloning Tips

Use high-quality reference audio (16kHz+)

Reference audio should be 3-30 seconds long

Clear speech with minimal background noise works best

Provide accurate reference text when possible

Troubleshooting

Common Issues

Model download fails

Check internet connection

Ensure sufficient disk space (~1.2GB)

Try clearing the download cache

Out of memory errors

Reduce max_length

Lower inference_steps

Switch to CPU device

Close other GPU-intensive applications

Poor voice cloning quality

Ensure reference audio is clear and high-quality

Verify reference text accuracy

Try different cfg_value settings

Use reference audio from the same speaker

ASR transcription errors

Install faster-whisper for better performance

Provide manual reference_text instead

Use clearer reference audio

Debug Mode

Enable verbose logging by setting:

export COMFYUI_LOG_LEVEL=DEBUG

Model Information

This node uses the VoxCPM1.5 model developed by OpenBMB:

Model Size: ~500M parameters

Audio Quality: 16kHz sample rate

Languages: Primarily optimized for English and Chinese

License: Apache 2.0

Paper: VoxCPM: A High-Quality Chinese Text-to-Speech Model

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 – see the LICENSE file for details.

Acknowledgments

OpenBMB for the VoxCPM model

ComfyUI community

All contributors and users

Support

If you encounter any issues or have questions:

Open an issue on GitHub

Check the troubleshooting section above

Join the ComfyUI community discussions

Star this repository if you find it useful! ⭐