huggingface_hub>=0.20.0 einops>=0.6.0 pydantic>=2.0.0 wetext>=0.1.0 faster-whisper soundfile

A clean, efficient ComfyUI custom node for VoxCPMTTS (Text-to-Speech). This implementation provides high-quality speech generation and voice cloning using the VoxCPM 0.5B and 1.5 models.
cd ComfyUI/custom_nodes/
git clone https://github.com/1038lab/ComfyUI-VoxCPMTTS.git
cd ComfyUI-VoxCPMTTS
pip install -r requirements.txt
The node will automatically install required dependencies on first use:
- huggingface_hub>=0.20.0
- einops>=0.6.0
- pydantic>=2.0.0
- wetext>=0.1.0
- faster-whisper

VoxCPM1.5 (default) will be automatically downloaded to ComfyUI/models/TTS/VoxCPM1.5/ on first use. VoxCPM-0.5B is no longer used.
https://huggingface.co/openbmb/VoxCPM1.5
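
If the automatic download is blocked, the model files can also be fetched by hand. The following is a minimal sketch, assuming `huggingface_hub` is installed and that the node only needs the repository files to sit in the directory named above. `model_dir` and `download_model` are hypothetical helper names for illustration and are not part of this node.

```python
from pathlib import Path

REPO_ID = "openbmb/VoxCPM1.5"  # taken from the Hugging Face URL above


def model_dir(comfy_root: Path) -> Path:
    """Return the folder this README names for VoxCPM1.5 under a ComfyUI root."""
    return comfy_root / "models" / "TTS" / "VoxCPM1.5"


def download_model(comfy_root: Path) -> Path:
    """Download the whole model repository into the expected folder."""
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = model_dir(comfy_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_model(Path("ComfyUI"))` from the directory that contains your ComfyUI checkout should populate the folder. Whether the node accepts a manually populated folder without further steps is an assumption.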
- text field
- cfg_value: Controls adherence to prompt (1.0-10.0, default: 2.0)
- inference_steps: Quality vs speed tradeoff (1-100, default: 10)
- max_length: Maximum token length (256-8192, default: 4096)
- reference_audio input
- reference_text (transcript of the reference audio)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| text | STRING | "Hello, this is VoxCPMTTS." | Text to synthesize |
| cfg_value | FLOAT | 2.0 | Guidance scale (higher = more prompt adherence) |
| inference_steps | INT | 10 | Diffusion steps (higher = better quality) |
| max_length | INT | 4096 | Maximum token length |
| normalize | BOOLEAN | True | Enable text normalization |
| seed | INT | -1 | Random seed (-1 for random) |
| device | COMBO | auto | Device selection (auto/cuda/mps/cpu) |
| reference_audio | AUDIO | - | Reference audio for voice cloning |
| reference_text | STRING | "" | Reference audio transcript |
| fade_in_ms | INT | 20 | Fade-in duration (0-1000ms) |
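
When these parameters are set from a script rather than through the ComfyUI widgets, it can help to clamp them to the documented ranges first. The sketch below assumes clamping is acceptable (the node itself may reject out-of-range values instead). `TTSSettings` and `clamp` are hypothetical names, and the 32-bit range used for a random seed is also an assumption.

```python
import random
from dataclasses import dataclass


def clamp(value, low, high):
    """Limit value to the closed range [low, high]."""
    return max(low, min(high, value))


@dataclass
class TTSSettings:
    cfg_value: float = 2.0
    inference_steps: int = 10
    max_length: int = 4096
    fade_in_ms: int = 20
    seed: int = -1

    def normalized(self) -> "TTSSettings":
        """Return a copy with every field inside its documented range."""
        seed = self.seed
        if seed == -1:
            # -1 means "random"; the 32-bit range is an assumption.
            seed = random.randint(0, 2**32 - 1)
        return TTSSettings(
            cfg_value=clamp(float(self.cfg_value), 1.0, 10.0),
            inference_steps=clamp(int(self.inference_steps), 1, 100),
            max_length=clamp(int(self.max_length), 256, 8192),
            fade_in_ms=clamp(int(self.fade_in_ms), 0, 1000),
            seed=seed,
        )
```

For example, `TTSSettings(cfg_value=50, inference_steps=0).normalized()` yields a `cfg_value` of 10.0 and an `inference_steps` of 1.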
Set these environment variables to customize behavior:
# ASR model size (tiny, small, medium, large)
export VOXCPM_ASR_MODEL=small
# Maximum retry attempts for bad cases
export VOXCPM_RETRY_MAX=2
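
For illustration, here is one way such variables might be read and validated in Python. This is a sketch only: the fallback values mirror the example above, since the node's real defaults are not stated here, and `read_settings` is a hypothetical helper name.

```python
import os

VALID_ASR_SIZES = ("tiny", "small", "medium", "large")  # sizes listed above


def read_settings(env=os.environ):
    """Read the two variables, falling back to assumed defaults."""
    # The defaults mirror the example values above; the node's real
    # defaults are not documented here and may differ.
    asr_model = env.get("VOXCPM_ASR_MODEL", "small").strip().lower()
    if asr_model not in VALID_ASR_SIZES:
        raise ValueError(f"unsupported VOXCPM_ASR_MODEL: {asr_model!r}")

    retry_max = int(env.get("VOXCPM_RETRY_MAX", "2"))
    if retry_max < 0:
        raise ValueError("VOXCPM_RETRY_MAX must be zero or greater")
    return {"asr_model": asr_model, "retry_max": retry_max}
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without touching the real environment.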
- auto: Automatically selects the best available device
- cuda: Force CUDA if available
- mps: Force MPS (Apple Silicon) if available
- cpu: Force CPU processing

[Text Input] → [VoxCPMTTS] → [Audio Output]

[Reference Audio] → [VoxCPMTTS] ← [Target Text]
                         ↓
                  [Cloned Audio]

[Text List] → [VoxCPMTTS] → [Audio Batch] → [Save Audio]
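
The device options (auto/cuda/mps/cpu) could be resolved along the following lines. This is a sketch under stated assumptions, not the node's actual logic: the preference order for `auto` (CUDA, then MPS, then CPU) and the fallback to CPU when a forced backend is unavailable are both guesses, and the function names are hypothetical.

```python
def resolve_device(requested: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Map a device option to a concrete backend name."""
    if requested == "auto":
        # Assumed preference order: CUDA, then MPS, then CPU.
        if cuda_ok:
            return "cuda"
        return "mps" if mps_ok else "cpu"
    if requested == "cuda":
        return "cuda" if cuda_ok else "cpu"  # assumed CPU fallback
    if requested == "mps":
        return "mps" if mps_ok else "cpu"  # assumed CPU fallback
    if requested == "cpu":
        return "cpu"
    raise ValueError(f"unknown device option: {requested!r}")


def detect_and_resolve(requested: str = "auto") -> str:
    """Probe PyTorch for available backends, then resolve the option."""
    import torch  # imported lazily so resolve_device works without PyTorch

    return resolve_device(
        requested,
        cuda_ok=torch.cuda.is_available(),
        mps_ok=torch.backends.mps.is_available(),
    )
```

Keeping the decision in a pure function separates the policy from the hardware probe, so the policy can be checked on any machine.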
- Lower inference_steps for faster generation
- Set max_length to suit your text

- cfg_value=1.5, inference_steps=5
- cfg_value=2.0, inference_steps=10 (default)
- cfg_value=3.0, inference_steps=20

Model download fails

Out of memory errors
- Reduce max_length
- Lower inference_steps

Poor voice cloning quality
- Adjust cfg_value settings

ASR transcription errors
- Use faster-whisper for better performance
- Provide reference_text instead

Enable verbose logging by setting:
export COMFYUI_LOG_LEVEL=DEBUG
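
As an illustration of how a Python module might honor this variable, the sketch below maps its value onto the standard `logging` levels. How ComfyUI or this node actually consumes `COMFYUI_LOG_LEVEL` is not shown here, so treat both the logger name and the fallback to `INFO` as assumptions.

```python
import logging
import os


def level_from_env(env=os.environ, default: int = logging.INFO) -> int:
    """Translate COMFYUI_LOG_LEVEL (e.g. "DEBUG") into a logging level."""
    name = env.get("COMFYUI_LOG_LEVEL", "").strip().upper()
    level = getattr(logging, name, None) if name else None
    # Only accept real level constants such as logging.DEBUG (an int).
    return level if isinstance(level, int) else default


logger = logging.getLogger("voxcpm_tts")  # hypothetical logger name
logger.setLevel(level_from_env())
```

Unknown or empty values fall back to the default rather than raising, so a typo in the variable does not stop the node from loading.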
This node uses the VoxCPM1.5 model developed by OpenBMB.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 – see the LICENSE file for details.
If you encounter any issues or have questions, please open an issue on the GitHub repository.
Star this repository if you find it useful! ⭐