ComfyUI_ChatterBox_Voice

★ 23

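Text-to-Speech · Voice Conversion · Voice Capture · Emotion Control

Integrates ResembleAI Chatterbox into ComfyUI, providing high-quality TTS, voice conversion, and recording capture, with long-text chunking and emotion control.

💡 Generate high-quality speech from long text inside ComfyUI, with voice cloning support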

🍴 19 Forks · 💻 Python · 🔄 2025-06-04

🔗 GitHub source

📦 Netdisk download

Copy the link, then open it in Quark Netdisk to download:

https://pan.quark.cn/s/8f9eee5e2cdb

📦 requirements.txt

# ChatterboxTTS Dependencies for ComfyUI
# Install with: pip install -r requirements.txt
s3tokenizer>=0.1.7
resemble-perth
librosa
omegaconf
accelerate
transformers==4.46.3

📄 README

ComfyUI_ChatterBox_Voice

An unofficial ComfyUI custom node integration that provides high-quality Text-to-Speech and Voice Conversion nodes using ResembleAI’s ChatterboxTTS, with no hard limit on text length.

NEW: Audio capture node

Features

🎤 ChatterBox TTS – Generate speech from text with optional voice cloning

🔄 ChatterBox VC – Convert voice from one speaker to another

🎙️ ChatterBox Voice Capture – Record voice input with smart silence detection

⚡ Fast & Quality – Production-grade TTS that Resemble AI reports is preferred over ElevenLabs in side-by-side evaluations

🎭 Emotion Control – Unique exaggeration parameter for expressive speech

📝 Enhanced Chunking – Intelligent text splitting for long content with multiple combination methods

📦 Self-Contained – Bundled ChatterBox for zero-installation-hassle experience

Note: There are multiple ChatterBox extensions available. This implementation focuses on simplicity, ComfyUI standards, and enhanced text processing capabilities.

Installation

cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI_ChatterBox.git

Expected folder structure for bundled approach:

ComfyUI_ChatterBox_Voice/
├── __init__.py
├── nodes.py
├── nodes_audio_recorder.py
├── chatterbox/
├── web/
├── models/                  # ← Models bundled here (optional)
│   └── chatterbox/
│       ├── conds.pt
│       ├── s3gen.pt
│       ├── t3_cfg.pt
│       ├── tokenizer.json
│       └── ve.pt
└── README.md

2.3. Install Additional Dependencies

pip install -r requirements.txt

2.4. Download Models

Download the ChatterboxTTS models and place them in:

ComfyUI/models/TTS/chatterbox/

Required files:

conds.pt (105 KB)

s3gen.pt (~1 GB)

t3_cfg.pt (~1 GB)

tokenizer.json (25 KB)

ve.pt (5.5 MB)

Download from: https://huggingface.co/ResembleAI/chatterbox/tree/main

3. Install Voice Recording Dependencies (Optional)

pip install sounddevice

4. Restart ComfyUI

Enhanced Features

📝 Intelligent Text Chunking (NEW!)

Long text support with smart processing:

Character-based limits (100-1000 chars per chunk)

Sentence boundary preservation – won’t cut mid-sentence

Multiple combination methods:

auto – Smart selection based on text length

concatenate – Simple joining

silence_padding – Add configurable silence between chunks

crossfade – Smooth audio blending

Comma-based splitting for very long sentences

Backward compatible – works with existing workflows

Chunking Controls (all optional):

enable_chunking – Enable/disable smart chunking (default: True)

max_chars_per_chunk – Chunk size limit (default: 400)

chunk_combination_method – How to join audio (default: auto)

silence_between_chunks_ms – Silence duration (default: 100ms)
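The three joining strategies can be illustrated on plain lists of samples. This is a minimal sketch rather than the node's actual code: the function names, the 24 kHz sample rate, and the linear fade with a 240-sample default are all assumptions made for illustration.

```python
from typing import List

Audio = List[float]  # mono samples in the range [-1.0, 1.0]


def concatenate(chunks: List[Audio]) -> Audio:
    """Join chunks back to back with nothing in between."""
    out: Audio = []
    for chunk in chunks:
        out.extend(chunk)
    return out


def silence_padding(chunks: List[Audio], silence_ms: int = 100,
                    sample_rate: int = 24000) -> Audio:
    """Join chunks with silence_ms of zeros between neighbours.

    sample_rate is an assumed value, not taken from the README.
    """
    gap = [0.0] * (sample_rate * silence_ms // 1000)
    out: Audio = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.extend(gap)
        out.extend(chunk)
    return out


def crossfade(chunks: List[Audio], fade_samples: int = 240) -> Audio:
    """Overlap neighbouring chunks with a linear fade (assumed fade shape)."""
    out: Audio = []
    for chunk in chunks:
        # Never fade over more samples than either side actually has.
        n = min(fade_samples, len(out), len(chunk))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming chunk
            idx = len(out) - n + i
            out[idx] = out[idx] * (1.0 - w) + chunk[i] * w
        out.extend(chunk[n:])
    return out
```

Note that crossfading shortens the result by the overlap at each join, while silence padding lengthens it by the gap.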

Auto-selection logic:

Text > 1000 chars → silence_padding (natural pauses)

Text > 500 chars → crossfade (smooth blending)

Text < 500 chars → concatenate (simple joining)
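The thresholds above can be written as a small selector. This is a sketch under stated assumptions: the function name is hypothetical, and because the list does not say what happens at exactly 500 characters, this version sends that case to concatenate.

```python
def select_combination_method(text: str, requested: str = "auto") -> str:
    """Resolve 'auto' to a concrete method based on text length."""
    if requested != "auto":
        # An explicit choice always wins over the automatic rule.
        return requested
    length = len(text)
    if length > 1000:
        return "silence_padding"  # natural pauses for long text
    if length > 500:
        return "crossfade"        # smooth blending for medium text
    # Assumption: a text of exactly 500 characters is treated as short.
    return "concatenate"
```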

📦 Smart Model Loading

Priority-based model detection:

Bundled models in node folder (self-contained)

ComfyUI models in standard location

HuggingFace download with authentication

Console output shows source:

📦 Using BUNDLED ChatterBox (self-contained)
📦 Loading from bundled models: ./models/chatterbox
✅ ChatterboxTTS model loaded from bundled!
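The priority order can be sketched as a simple path check. The helper names below are hypothetical and the real loader may differ; the folder locations and file names are the ones listed in this README, and the Hugging Face step is only signalled, not performed.

```python
from pathlib import Path
from typing import Optional, Tuple

REQUIRED_FILES = ("conds.pt", "s3gen.pt", "t3_cfg.pt", "tokenizer.json", "ve.pt")


def has_all_models(folder: Path) -> bool:
    """True if every required model file exists in folder."""
    return all((folder / name).is_file() for name in REQUIRED_FILES)


def find_model_source(node_dir: Path, comfy_dir: Path) -> Tuple[str, Optional[Path]]:
    """Return (source label, folder) following the documented priority."""
    candidates = [
        ("bundled", node_dir / "models" / "chatterbox"),
        ("comfyui", comfy_dir / "models" / "TTS" / "chatterbox"),
    ]
    for label, folder in candidates:
        if has_all_models(folder):
            return label, folder
    # Nothing found locally: the caller would fall back to a download.
    return "huggingface", None
```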

Usage

Voice Recording

Add “🎤 ChatterBox Voice Capture” node

Select your microphone from the dropdown

Adjust recording settings:

Silence Threshold: How quiet to consider “silence” (0.001-0.1)

Silence Duration: How long to wait before stopping (0.5-5.0 seconds)

Sample Rate: Audio quality (8000-96000 Hz, default 44100)

Change the Trigger value to start a new recording

Connect output to TTS (for voice cloning) or VC nodes
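To show how Silence Threshold and Silence Duration interact, here is a minimal sketch of a stop-on-silence loop over pre-captured audio blocks. It is not the node's implementation: the RMS loudness measure, the function names, and the rule that silence is only counted after speech has been heard are assumptions.

```python
import math
from typing import Iterable, List, Sequence


def rms(block: Sequence[float]) -> float:
    """Root-mean-square level of one block of samples."""
    if not block:
        return 0.0
    return math.sqrt(sum(s * s for s in block) / len(block))


def record_until_silence(blocks: Iterable[Sequence[float]],
                         block_seconds: float,
                         silence_threshold: float = 0.01,
                         silence_duration: float = 2.0) -> List[float]:
    """Collect blocks until the level stays below the threshold long enough."""
    recorded: List[float] = []
    quiet_time = 0.0
    heard_speech = False
    for block in blocks:
        recorded.extend(block)
        if rms(block) >= silence_threshold:
            heard_speech = True
            quiet_time = 0.0  # any sound resets the silence timer
        elif heard_speech:
            # Assumption: leading silence before speech does not count.
            quiet_time += block_seconds
            if quiet_time >= silence_duration:
                break
    return recorded
```

In a real recorder the blocks would come from the microphone stream. A higher threshold makes more blocks count as quiet, and a longer duration tolerates longer pauses before stopping.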

Enhanced Text-to-Speech

Add “🎤 ChatterBox Voice TTS” node

Enter your text (any length – automatic chunking)

Optionally connect reference audio for voice cloning

Adjust TTS settings:

Exaggeration: Emotion intensity (0.25-2.0)

Temperature: Randomness (0.05-5.0)

CFG Weight: Guidance strength (0.0-1.0)

Configure chunking (optional):

Enable Chunking: For long texts

Max Chars Per Chunk: Chunk size (100-1000)

Combination Method: How to join chunks

Silence Between Chunks: Pause duration
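For readers curious how such ranges are usually exposed in ComfyUI, the sketch below declares them in the standard INPUT_TYPES format. The class name is hypothetical and this is not the extension's source; the ranges and chunking defaults come from this README, while the temperature default, the step sizes, and the silence range are assumptions.

```python
class ExampleChatterBoxTTSInputs:
    """Hypothetical node skeleton; only the input declaration is shown."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "text": ("STRING", {"multiline": True, "default": ""}),
                "exaggeration": ("FLOAT", {"default": 0.5, "min": 0.25, "max": 2.0, "step": 0.05}),
                # Default of 0.8 is an assumption; the README gives only the range.
                "temperature": ("FLOAT", {"default": 0.8, "min": 0.05, "max": 5.0, "step": 0.05}),
                "cfg_weight": ("FLOAT", {"default": 0.5, "min": 0.0, "max": 1.0, "step": 0.05}),
            },
            "optional": {
                "reference_audio": ("AUDIO",),
                "enable_chunking": ("BOOLEAN", {"default": True}),
                "max_chars_per_chunk": ("INT", {"default": 400, "min": 100, "max": 1000}),
                "chunk_combination_method": (
                    ["auto", "concatenate", "silence_padding", "crossfade"],
                    {"default": "auto"},
                ),
                # The 0-5000 ms range is an assumption made for illustration.
                "silence_between_chunks_ms": ("INT", {"default": 100, "min": 0, "max": 5000}),
            },
        }
```

Keeping the chunking inputs under "optional" matches the README's statement that all chunking controls are optional and that existing workflows keep working.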

Voice Conversion

Add “🔄 ChatterBox Voice Conversion” node

Connect source audio (voice to convert)

Connect target audio (voice style to copy)

Workflow Examples

Long Text with Smart Chunking:

Text Input (2000+ chars) → ChatterBox TTS (chunking enabled) → PreviewAudio

Voice Cloning with Recording:

🎤 Voice Capture → ChatterBox TTS (reference_audio) → PreviewAudio

Voice Conversion Pipeline:

🎤 Voice Capture (source) → ChatterBox VC ← 🎤 Voice Capture (target)

Complete Advanced Pipeline:

Long Text Input → ChatterBox TTS (with voice reference) → PreviewAudio
                ↘ ChatterBox VC ← 🎤 Target Voice Recording

Settings Guide

Enhanced Chunking Settings

For Long Articles/Books:

max_chars_per_chunk=600, combination_method=silence_padding, silence_between_chunks_ms=200

For Natural Speech:

max_chars_per_chunk=400, combination_method=auto (default – works well)

For Fast Processing:

max_chars_per_chunk=800, combination_method=concatenate

For Smooth Audio:

max_chars_per_chunk=300, combination_method=crossfade

Voice Recording Settings

General Recording:

silence_threshold=0.01, silence_duration=2.0 (default settings)

Noisy Environment:

Higher silence_threshold (~0.05) to ignore background noise

Longer silence_duration (~3.0) to avoid cutting off speech

Quiet Environment:

Lower silence_threshold (~0.005) for sensitive detection

Shorter silence_duration (~1.0) for quick stopping

TTS Settings

General Use:

exaggeration=0.5, cfg_weight=0.5 (default settings work well)

Expressive Speech:

Lower cfg_weight (~0.3) + higher exaggeration (~0.7)

Higher exaggeration speeds up speech; lower CFG slows it down

Text Processing Capabilities

📚 No Hard Text Limits!

Compared with other TTS systems:

OpenAI TTS: 4096 character limit

ElevenLabs: 2500 character limit

ChatterBox: No documented limits + intelligent chunking

🧠 Smart Text Splitting

Sentence Boundary Detection:

Splits on .!? with proper spacing

Preserves sentence integrity

Handles abbreviations and edge cases

Long Sentence Handling:

Splits on commas when sentences are too long

Maintains natural speech patterns

Falls back to character limits only when necessary

Examples:

Input: "This is a very long article about artificial intelligence and machine learning. It contains multiple sentences and complex punctuation, including lists, quotes, and technical terms. The enhanced chunking system will split this intelligently."

Output: 3 well-formed chunks with natural boundaries
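The splitting strategy described above (sentences first, commas for oversized sentences, hard cuts as a last resort) could look roughly like this. It is a simplified sketch with a hypothetical function name, not the extension's code, and unlike the real splitter it makes no attempt to handle abbreviations.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Split text into chunks of at most max_chars characters."""
    # 1. Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    pieces: List[str] = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces.append(sentence)
            continue
        # 2. Sentence too long: split after commas instead.
        for part in re.split(r"(?<=,)\s+", sentence):
            # 3. Still too long: fall back to a hard character cut.
            while len(part) > max_chars:
                pieces.append(part[:max_chars])
                part = part[max_chars:]
            if part:
                pieces.append(part)

    # 4. Greedily pack neighbouring pieces while they fit in one chunk.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

The number of chunks depends on max_chars, so this sketch will not necessarily reproduce the three-chunk result shown above for any particular setting.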

License

MIT License – Same as ChatterboxTTS

Credits

ResembleAI for ChatterboxTTS

ComfyUI team for the amazing framework

sounddevice library for audio recording functionality

🔗 Links

Resemble AI ChatterBox

Model Downloads (Hugging Face) ⬅️ Download models here

ChatterBox Demo

ComfyUI

Resemble AI Official Site

Note: The original ChatterBox model includes Resemble AI’s Perth watermarking system for responsible AI usage. This ComfyUI integration includes the Perth dependency but has watermarking disabled by default to ensure maximum compatibility. Users can re-enable watermarking by modifying the code if needed, while maintaining the full quality and capabilities of the underlying TTS model.