ComfyUI-Geeky-Kokoro-TTS

★ 49

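Text-to-speech · Voice changing · Voice blending · GPU acceleration

Integrates Kokoro TTS and advanced voice transformation into ComfyUI, with 54+ voices, voice blending, guided voice morphing, and pitch correction

💡 Generate natural speech and adjust it in real time inside a ComfyUI workflow

🍴 10 Forks · 💻 Python · 🔄 2026-03-22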


📦 requirements.txt

# Geeky Kokoro TTS Requirements - 2025 Edition
# Python 3.12 and 3.13 compatible
# Updated for Kokoro v0.19+ and ComfyUI v3.49+

# Core TTS dependencies
kokoro>=0.9.4
soundfile>=0.12.1
torch>=2.0.0
tqdm>=4.64.0
einops>=0.6.0

# Numeric computing (Python 3.12/3.13 compatible)
numpy>=1.24.0,<2.0.0

# Audio processing dependencies (for Voice Mod features)
librosa>=0.10.0
scipy>=1.9.0,<2.0.0
resampy>=0.4.3

# Language-specific dependencies
pyopenjtalk>=0.3.0  # For Japanese language support
ordered-set>=4.1.0  # For Chinese language support

# Optional but recommended
# espeak-ng (system package, install via: apt-get install espeak-ng or brew install espeak-ng)

📄 README

🔊 Geeky Kokoro TTS for ComfyUI – 2025 Edition (does not work with Python 3.13)

The most comprehensive Kokoro TTS implementation for ComfyUI with ALL 54+ voices across 9 languages, voice blending, and advanced voice modification effects.


🌟 What’s New in 2025 Edition

Special Note: The Advanced Voice Mod node is experimental and still under construction. It does not function as intended yet, so ignore it for now. Japanese voices do not work yet either; they require a custom wheel.

Complete Rewrite from Ground Up

🌍 54+ Voices: Complete support for all Kokoro-82M voices across 9 languages

🎯 9 Languages: US English, UK English, Japanese (not working yet, needs a custom wheel build), Mandarin Chinese, Spanish, French, Hindi, Italian, Brazilian Portuguese

🔀 Advanced Voice Blending: Mix any two voices with adjustable blend ratios

🐍 Python 3.12: Fully tested and optimized for the latest Python versions

📦 Modern Architecture: Completely rewritten following 2025 ComfyUI best practices

⚡ Improved Performance: Better memory management and processing speed

🛡️ Enhanced Reliability: Robust error handling and fallback mechanisms

Key Features

✅ ALL 54+ Kokoro-82M voices (nothing left out!)

✅ Voice blending with linear interpolation

✅ NEW: Guided Voice Morphing – Use any audio file to guide voice transformation

✅ NEW: Autotune-style Pitch Correction – Match pitch to reference audio

✅ NEW: Advanced Spectral Morphing – Match tone, timbre, and character

✅ NEW: 18 Voice Profiles – Professional presets for instant transformations

✅ Advanced voice modification effects (pitch, formant, reverb, etc.)

✅ Intelligent text chunking that preserves sentence order

✅ GPU acceleration with automatic CPU fallback

✅ Multi-language support with proper phoneme handling

✅ Professional audio processing pipeline with Dynamic Time Warping

✅ ComfyUI v3.49+ compatibility

📋 Table of Contents

Guided Voice Morphing (NEW!)

Voice Modification Effects

Technical Details

Troubleshooting

Credits

🔧 Installation

Prerequisites

ComfyUI v3.49+ (or compatible version)

Python 3.9, 3.10, 3.11, or 3.12 (3.13+ not supported)

PyTorch 2.0+ (usually included with ComfyUI)

4GB+ RAM (8GB recommended for longer texts)

Optional: CUDA-capable GPU for faster processing

Method 1: ComfyUI Manager (Recommended)

Open ComfyUI and navigate to “Manager”

Click “Install Custom Nodes”

Search for “Geeky Kokoro TTS”

Click “Install” and restart ComfyUI

Done! Nodes will appear in the “audio” category

Method 2: Manual Installation (Git Clone)

# Navigate to your ComfyUI custom nodes directory

cd ComfyUI/custom_nodes



# Clone this repository

git clone https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS.git



# Navigate into the directory

cd ComfyUI-Geeky-Kokoro-TTS



# Install Python dependencies

pip install -r requirements.txt



# Optional: Run installation verification script

python install.py

Method 3: ComfyUI Portable (Windows)

REM Navigate to custom nodes directory

cd ComfyUI_windows_portable\ComfyUI\custom_nodes



REM Clone repository

git clone https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS.git



REM Navigate into directory

cd ComfyUI-Geeky-Kokoro-TTS



REM Install with portable Python

..\..\..\python_embeded\python.exe -m pip install -r requirements.txt

System Dependencies (Optional but Recommended)

For best phoneme processing, install espeak-ng:

Ubuntu/Debian:

sudo apt-get update

sudo apt-get install espeak-ng

macOS:

brew install espeak-ng

Windows:

Download and install from: https://github.com/espeak-ng/espeak-ng/releases

🎭 Complete Voice List (54+ Voices)

🇺🇸 US English Voices (20 voices)

Female Voices (11)

Male Voices (9)

🇬🇧 UK English Voices (8 voices)

Female Voices (4)

Male Voices (4)

🇯🇵 Japanese Voices (5 voices) – Not working, needs a custom wheel

🇨🇳 Mandarin Chinese Voices (8 voices)

Female Voices (4)

Male Voices (4)

🇪🇸 Spanish Voices (3 voices)

🇫🇷 French Voice (1 voice)

🇮🇳 Hindi Voices (4 voices)

🇮🇹 Italian Voices (2 voices)

🇧🇷 Brazilian Portuguese Voices (3 voices)

🚀 Usage Guide

Basic Text-to-Speech

Add the Node: In ComfyUI, add the “🔊 Geeky Kokoro TTS (2025)” node to your workflow

Enter Text: Type or paste your text in the multiline text field

Select Voice: Choose from 54+ voices in the dropdown

Adjust Speed: Set speed from 0.5x (slower) to 2.0x (faster)

GPU Option: Enable “use_gpu” if you have a CUDA-capable GPU

Generate: Connect to audio output or preview node

Voice Blending (Creating Unique Voices)

Voice blending allows you to create unique vocal characteristics by mixing two voices:

Enable Blending: Check the “enable_blending” checkbox

Select Second Voice: Choose a second voice from the dropdown

Adjust Blend Ratio:

1.0 = 100% primary voice (no blending)

0.7 = 70% primary, 30% secondary (subtle blend)

0.5 = 50/50 mix (balanced blend)

0.3 = 30% primary, 70% secondary (secondary dominant)

0.0 = 100% secondary voice

Blending Tips:

Mix voices from the same language for best results

Blend male + female voices for androgynous effects

Try Heart + Bella at 0.6 for energetic yet warm narration

Try Michael + Adam at 0.5 for rich, authoritative voice

Experiment with ratios to find your perfect voice!
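The blend ratio above is ordinary linear interpolation between two voices. Here is a minimal sketch, assuming each voice is represented by a numeric style vector of equal length. The `blend_voices` helper and the toy vectors are hypothetical illustrations, not the node's actual code.

```python
def blend_voices(primary, secondary, ratio):
    """Linearly interpolate two equal-length voice vectors.

    ratio=1.0 returns the primary voice, ratio=0.0 the secondary voice.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    if len(primary) != len(secondary):
        raise ValueError("voice vectors must have the same length")
    return [ratio * p + (1.0 - ratio) * s for p, s in zip(primary, secondary)]


# Toy 3-dimensional vectors standing in for real voice embeddings (assumption).
heart = [1.0, 0.0, 2.0]
bella = [0.0, 1.0, 4.0]
blended = blend_voices(heart, bella, 0.5)  # equal mix of both voices
```

With this convention a ratio of 0.7 weights the primary voice at 70% and the secondary at 30%, matching the list above.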

🎵 Guided Voice Morphing (NEW!)

The game-changing feature that makes voices sing, match, and transform!

The Advanced Voice node now supports guided voice morphing – using a secondary audio file (like a song or reference voice) to guide the transformation of your TTS output. Perfect for:

Making TTS voices “sing” along to music

Matching tone and style of reference speakers

Creating autotune-style effects

Professional voice-over matching

How to Use Guided Morphing

Connect Guide Audio:

Load your guide audio (song, reference voice, etc.)

Connect it to the guide_audio input on the Advanced Voice node

Enable Morphing:

Check the enable_guided_morph checkbox

Adjust Morph Parameters (0.0 to 1.0):

Pitch Morph: Match pitch contour to guide audio (autotune effect)

Formant Morph: Match vocal character and tone

Spectral Morph: Match overall timbre and frequency balance

Amplitude Morph: Match dynamics and volume envelope

Morphing Parameters Explained

Pitch Morph (0.0 – 1.0)

0.0: No pitch change (original TTS pitch)

0.3-0.5: Subtle pitch guidance (natural autotune)

0.7-0.9: Strong pitch matching (follows melody closely)

1.0: Complete pitch matching (perfect autotune)

Use Cases:

Music: 0.7-1.0 to make voice follow melody

Speech matching: 0.3-0.5 for natural intonation

Character voice: 0.0 (use manual pitch shift instead)
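One plausible way to read the Pitch Morph amount is as an interpolation between the TTS pitch and the guide pitch, done in the log-frequency domain so that 0.5 lands halfway in semitones rather than in Hz. The sketch below works on a single frame. The `morph_pitch` function is a hypothetical illustration under that assumption, not the node's confirmed algorithm.

```python
def morph_pitch(source_hz, guide_hz, amount):
    """Move a source pitch toward a guide pitch by `amount` (0.0 to 1.0).

    amount=0.0 keeps the source pitch; amount=1.0 returns the guide pitch.
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0.0 and 1.0")
    if source_hz <= 0 or guide_hz <= 0:
        # Treat non-positive values as unvoiced frames and leave them alone.
        return source_hz
    # Geometric interpolation: equivalent to linear interpolation in semitones.
    return source_hz * (guide_hz / source_hz) ** amount
```

For example, morphing 220 Hz toward 440 Hz with `amount=0.5` gives roughly 311 Hz, which is six semitones above the source.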

Formant Morph (0.0 – 1.0)

Matches the vocal tract characteristics

Affects perceived age, gender, and character

0.0: Original voice character

0.5: Blend of both voices

1.0: Fully matched character

Use Cases:

Voice cloning: 0.6-0.9

Gender transformation: 0.5-0.7

Age adjustment: 0.4-0.6

Spectral Morph (0.0 – 1.0)

Matches overall frequency spectrum and timbre

Affects “brightness”, “warmth”, and tonal quality

Most subtle but powerful for natural matching

Use Cases:

Microphone matching: 0.5-0.7

Tone matching: 0.6-0.8

Style transfer: 0.4-0.6

Amplitude Morph (0.0 – 1.0)

Matches volume dynamics and expression

Follows the energy and intensity patterns

Great for emotional expression

Use Cases:

Dynamic speech: 0.5-0.7

Singing expression: 0.6-0.8

Whisper/shout: 0.4-0.6

Guided Morphing Examples

Example 1: Make TTS Voice Sing

Setup:

1. Generate TTS with lyrics text

2. Load instrumental or vocal track as guide_audio

3. Enable guided morph

4. Set: pitch_morph=0.8, formant_morph=0.3, spectral_morph=0.4



Result: Voice follows melody while maintaining TTS character

Example 2: Clone Speaking Style

Setup:

1. Generate TTS with script

2. Load reference speaker audio as guide_audio

3. Enable guided morph

4. Set: pitch_morph=0.4, formant_morph=0.7, spectral_morph=0.6



Result: TTS matches speaking style and voice character

Example 3: Autotune Effect

Setup:

1. Generate TTS with any text

2. Load musical scale or melody as guide_audio

3. Enable guided morph

4. Set: pitch_morph=1.0, formant_morph=0.0, spectral_morph=0.2



Result: Perfect pitch-corrected robotic singing effect

Advanced Voice Effects

Connect the TTS output to the “🎛️ Geeky Kokoro Advanced Voice (2025)” node for effects:

Preset Profiles (18 Total):

Original Profiles:

Cinematic: Deep, movie-trailer style (-3 semitones, reverb, compression)

Monster: Growling creature voice (-6 semitones, formant shift, distortion)

Robot: Mechanical, synthesized voice (band-pass filter, modulation)

Child: Young character voice (+3 semitones, formant shift)

Darth Vader: Deep, breathing villain voice (-4 semitones, echo, modulation)

Singer: Optimized for vocal content (compression, EQ, reverb)

NEW Profiles:

Alien: Otherworldly voice (-8 semitones, extreme formant shift, modulation)

Deep Voice: Professional bass voice (-5 semitones, bass boost)

Chipmunk: High-pitched cartoon voice (+6 semitones, formant shift up)

Telephone: Classic phone quality (300-3400Hz bandpass, compression)

Radio: Broadcast radio sound (100-5000Hz, compression, EQ)

Cathedral: Large reverberant space (heavy reverb, echo)

Cave: Echo chamber effect (reverb, echo with feedback)

Metallic: Robotic metallic sound (ring modulation, bandpass)

Whisper: Quiet breathy voice (noise, reduced bass)

Shout: Loud emphasized voice (compression, distortion, mid boost)

Custom: Full manual control of all parameters

Manual Controls:

Pitch Shift: ±12 semitones (0.1 step precision)

Formant Shift: Vocal tract size adjustment (-5 to +5)

Time Stretch: Speed without pitch change (0.5x to 2.0x)

Reverb: Room ambiance with room size control

Echo: Discrete repeats with adjustable feedback

Distortion: Harmonic saturation (0.0 to 1.0)

Compression: Dynamic range control

3-Band EQ: Bass, Mid, Treble (-1.0 to +1.0)

Brightness: High-frequency emphasis (-1.0 to +1.0)

Warmth: Low-frequency emphasis (-1.0 to +1.0)

Effect Blend: Mix with original audio (0.0 to 1.0)

Output Volume: -60dB to +60dB
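The units in the manual controls map to simple formulas. This is a small sketch of the standard conversions: semitones to a frequency ratio, decibels to a linear gain, and a dry/wet mix for Effect Blend. The function names are hypothetical, and the node's internal implementation may differ.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift; +12 semitones doubles the frequency."""
    return 2.0 ** (semitones / 12.0)


def db_to_gain(db):
    """Linear amplitude gain for a level change in decibels."""
    return 10.0 ** (db / 20.0)


def mix_effect(dry, wet, effect_blend):
    """Blend processed (wet) audio with the original (dry) signal.

    effect_blend=0.0 returns the original, 1.0 returns the fully processed audio.
    """
    return [(1.0 - effect_blend) * d + effect_blend * w for d, w in zip(dry, wet)]
```

Note that +60 dB corresponds to a gain of 1000, so settings near the top of the Output Volume range will clip almost any signal.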

⚙️ Technical Details

Model Information

Model: Kokoro-82M v0.19

Parameters: 82 million

Architecture: Decoder-only based on StyleTTS 2 + ISTFTNet

Sample Rate: 24kHz

License: Apache 2.0

Repository: hexgrad/Kokoro-82M

Performance Benchmarks

Processing Speed (Python 3.12, CUDA GPU):

Short text (< 200 chars): ~2-3 seconds

Medium text (200-800 chars): ~5-10 seconds

Long text (800+ chars): ~15-30 seconds

Voice blending: +20% processing time

Voice effects: +5-15% processing time

Guided morphing: +30-50% processing time (feature extraction + morphing)

Memory Usage:

Base model: ~2GB VRAM/RAM

With GPU acceleration: ~3GB VRAM

Voice effects processing: +500MB

Voice blending: +200MB temporary

Guided morphing: +800MB-1.5GB (feature extraction + DTW alignment)

Guided Morphing Technology

Feature Extraction:

Pitch Tracking: PYIN algorithm with autocorrelation fallback

Formant Analysis: LPC (Linear Predictive Coding) with Levinson-Durbin recursion

Spectral Envelope: Cepstral smoothing with liftering

Amplitude Envelope: RMS energy tracking

MFCC: 13-coefficient mel-frequency cepstral analysis
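Of the features listed, the amplitude envelope is the simplest to illustrate. The sketch below computes frame-wise RMS energy in plain Python. The frame and hop sizes are assumed defaults chosen for illustration, and a real pipeline would more likely use a library routine from librosa.

```python
import math


def rms_envelope(samples, frame_length=1024, hop_length=256):
    """Return the RMS energy of each frame of a mono signal."""
    if not samples:
        return []
    envelope = []
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        envelope.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return envelope
```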

Morphing Algorithms:

Dynamic Time Warping (DTW): Aligns feature sequences between source and guide

Phase Vocoder: Time-varying pitch shifting with STFT

Spectral Transfer: Magnitude envelope morphing with phase preservation

Volume Matching: RMS-based amplitude envelope transfer
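Dynamic Time Warping is what lets a guide track of a different length be lined up with the TTS output. Here is a minimal textbook sketch for one-dimensional feature sequences (for example, pitch per frame), using absolute difference as the local cost. The `dtw_path` function is illustrative only and is not taken from the node's source.

```python
def dtw_path(source, guide):
    """Align two 1-D sequences; return (total_cost, list of (i, j) index pairs)."""
    n, m = len(source), len(guide)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning source[:i] with guide[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(source[i - 1] - guide[j - 1])
            cost[i][j] = dist + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Each pair `(i, j)` in the path says that source frame `i` should take its target feature from guide frame `j`. The table is O(n·m) in memory, which is consistent with the extra memory noted for guided morphing.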

Supported Guide Audio:

Any sample rate (auto-resampling to 24kHz)

Mono or stereo (auto-converted to mono)

WAV, MP3, FLAC, OGG formats

Recommended: 16kHz+ sample rate for best results
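The guide-audio preparation above (downmix to mono, resample to 24kHz) can be sketched in plain Python. This is a deliberately naive stand-in: it uses linear interpolation with no anti-aliasing filter, whereas a real pipeline would presumably rely on a library such as resampy or librosa. Both helper names are hypothetical.

```python
def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, source_rate, target_rate=24000):
    """Resample a mono signal by linear interpolation (no low-pass filtering)."""
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * target_rate / source_rate)
    step = source_rate / target_rate
    out = []
    for k in range(out_len):
        pos = k * step
        i = min(int(pos), len(samples) - 2)
        frac = min(pos - i, 1.0)  # clamp so the tail holds the last sample
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out
```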

Text Processing

Intelligent Chunking: Automatically splits long texts while preserving sentence order

Chunk Size: 350 characters (configurable)

Gap Insertion: 150ms natural pauses between chunks

Paragraph Awareness: Respects paragraph breaks and structure

Punctuation Handling: Proper sentence boundary detection
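The chunking behaviour described above can be approximated as follows: split on sentence-ending punctuation, then pack sentences in their original order into chunks of at most 350 characters. This is a simplified sketch with hypothetical names; it ignores paragraph awareness, and the node's real splitter may differ.

```python
import re

SAMPLE_RATE = 24000                      # Kokoro output sample rate
GAP_SAMPLES = SAMPLE_RATE * 150 // 1000  # 150 ms of silence between chunks


def chunk_text(text, max_chars=350):
    """Split text into chunks of whole sentences, preserving sentence order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            # A single sentence longer than max_chars stays in one oversize chunk.
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

At 24kHz a 150ms gap is 3600 samples of silence, which would be inserted between the audio rendered for consecutive chunks.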

Supported Languages & Codes

a – American English

b – British English

j – Japanese

z – Mandarin Chinese

e – Spanish

f – French

h – Hindi

i – Italian

p – Brazilian Portuguese
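The codes above lend themselves to a simple lookup table. The sketch below also assumes that a voice ID starts with its language code (as in an ID like `af_heart`). That naming convention is an assumption here, since the voice tables are not reproduced in this document, and `language_for_voice` is a hypothetical helper.

```python
LANGUAGE_CODES = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "p": "Brazilian Portuguese",
}


def language_for_voice(voice_id):
    """Return the language name for a voice ID, using its first letter as the code."""
    code = voice_id[:1].lower()
    if code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code in voice ID: {voice_id!r}")
    return LANGUAGE_CODES[code]
```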

🔧 Troubleshooting

Common Issues

“Kokoro import error”

Solution:

pip install --upgrade "kokoro>=0.9.4"

Voice not loading

Solution:

Restart ComfyUI completely

Check console for specific error messages

Ensure all dependencies are installed

Try reinstalling: pip install --force-reinstall kokoro

GPU out of memory

Solutions:

Disable “use_gpu” option

Reduce text length

Close other GPU-intensive applications

Use CPU mode for very long texts
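The "use_gpu" option with automatic CPU fallback can be pictured as a small device-selection step. Here is a minimal sketch, assuming PyTorch is the backend. `pick_device` is a hypothetical helper rather than the node's actual function, while `torch.cuda.is_available()` is the standard PyTorch check.

```python
def pick_device(use_gpu):
    """Return "cuda" only when requested and available; otherwise "cpu"."""
    if not use_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

When an out-of-memory error occurs on the GPU, calling the same code path with `use_gpu=False` is the equivalent of disabling the option in the node.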

Audio sounds distorted

Solutions:

Reduce “output_volume” in Voice Mod node

Lower “effect_blend” ratio (start at 0.3-0.5)

Reduce distortion and compression amounts

Check that input audio isn’t already clipping
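To check whether audio is already clipping, measure its peak level. The sketch below assumes floating-point samples normalized to the range -1.0 to 1.0. The threshold value and function names are illustrative choices.

```python
def peak_level(samples):
    """Largest absolute sample value (0.0 for empty input)."""
    return max((abs(x) for x in samples), default=0.0)


def is_clipping(samples, threshold=0.999):
    """True when any sample reaches the assumed full-scale threshold."""
    return peak_level(samples) >= threshold
```

If `is_clipping` returns True before any effects are applied, lowering the output volume or effect blend will not remove the distortion; the source audio itself needs to be regenerated or attenuated.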

Python version issues

Solution:

python --version  # Check your version

# Must be 3.9, 3.10, 3.11, or 3.12 (3.13+ is not supported)

pip install --upgrade pip

pip install -r requirements.txt --force-reinstall
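The same version check can be scripted. This sketch encodes the range given in the Prerequisites section (Python 3.9 through 3.12, with 3.13+ unsupported). The `is_supported` helper is hypothetical and is not part of the repository's `install.py`.

```python
import sys

SUPPORTED_MINORS = range(9, 13)  # Python 3.9 through 3.12


def is_supported(version_info=sys.version_info):
    """Return True when the interpreter version is in the supported range."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


if __name__ == "__main__":
    status = "supported" if is_supported() else "NOT supported"
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status}")
```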

espeak-ng not found

Solution:

Ubuntu/Debian: sudo apt-get install espeak-ng

macOS: brew install espeak-ng

Windows: Download from espeak-ng GitHub releases

Performance Tips

For long texts: Enable GPU acceleration

For short texts: CPU mode is often faster

Memory management: Process texts in batches if needed

Effect intensity: Start low (30-50%) and increase gradually

Voice blending: Keep both voices in the same language family

🤝 Contributing

Contributions are welcome! Areas where help is appreciated:

Additional voice profile presets

Performance optimizations

Bug reports and fixes

Documentation improvements

Testing on different platforms

📄 License & Credits

This Project

License: MIT License

Author: GeekyGhost

Repository: https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS

Kokoro TTS Model

License: Apache 2.0

Author: hexgrad

Model: https://huggingface.co/hexgrad/Kokoro-82M

Dependencies

librosa: Audio processing (ISC License)

scipy: Scientific computing (BSD License)

PyTorch: Deep learning framework (BSD License)

soundfile: Audio I/O (BSD License)

Special Thanks

hexgrad for the incredible Kokoro-82M model

ComfyUI Team for the amazing framework

Community testers and contributors

Audio processing library developers

📚 Research & Resources

Useful Links

Kokoro Model Page: https://huggingface.co/hexgrad/Kokoro-82M

ComfyUI Documentation: https://docs.comfy.org

Issue Tracker: https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS/issues

Discussions: https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS/discussions

Research Papers & References

StyleTTS 2 architecture

ISTFTNet vocoder

Phase vocoder techniques

Voice morphing and blending

🌟 Quick Start Examples

Example 1: Basic Narration

Node: Geeky Kokoro TTS (2025)

Text: "Welcome to my tutorial on advanced AI techniques."

Voice: 🇺🇸 🚺 Nicole 🎧

Speed: 1.0

GPU: true

Example 2: Character Voice with Effects

Node 1: Geeky Kokoro TTS (2025)

Voice: 🇺🇸 🚹 Puck 🎭

Text: "The villain laughed menacingly."



Node 2: Geeky Kokoro Advanced Voice

Profile: Monster

Intensity: 0.7

Example 3: Blended Voice for Unique Sound

Node: Geeky Kokoro TTS (2025)

Voice: 🇺🇸 🚺 Heart ❤️

Enable Blending: true

Second Voice: 🇺🇸 🚺 Bella 🔥

Blend Ratio: 0.6

Text: "This creates a warm yet energetic voice perfect for marketing."

Made with ❤️ for the ComfyUI community

Enjoy natural, high-quality text-to-speech with 54+ voices and unlimited creative possibilities! 🎉