ComfyUI-MegaTTS

★ 50

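Text-to-speech · Voice cloning · Chinese-English bilingual · VRAM optimization

A ComfyUI node based on ByteDance MegaTTS3. It supports high-quality Chinese and English text-to-speech, voice cloning from short samples, fine-grained parameter control, and VRAM optimization, and it downloads models automatically.

💡 Turn text into natural speech and quickly clone a specific voice from a short sample.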

🍴 6 Forks · 💻 Python · 🔄 2025-06-19

🔗 GitHub source

📦 Netdisk download

Copy the link below, then download from Quark Netdisk:

https://pan.quark.cn/s/8f9eee5e2cdb

📦 requirements.txt

setproctitle>=1.2.0
attrdict>=2.0.1
librosa>=0.9.2
pydub>=0.25.1
pyloudnorm>=0.1.0
x-transformers>=0.25.0
torchdiffeq>=0.2.2
openai-whisper>=20231117
tqdm>=4.65.0
torch>=2.0.0
numpy>=1.20.0
requests>=2.28.0
pyyaml>=6.0

📄 README

ComfyUI-MegaTTS

(English / 中文)

A ComfyUI custom node based on ByteDance MegaTTS3, enabling high-quality text-to-speech synthesis with voice cloning capabilities for both Chinese and English.

Update Logs

Version 1.0.2

Refactored the code and custom nodes for better performance and GPU resource management.

Added enhanced memory management features to help low-VRAM users avoid running out of memory.

Added i18n support for English and Chinese.

Version 1.0.1

Bug fixes

Features

High-Quality Voice Synthesis: Generate natural-sounding speech from text input

Voice Cloning: Clone any voice with just a short sample (requires both WAV and NPY files)

Bilingual Support: Works with both Chinese and English text, with code-switching capabilities

Advanced Parameter Control: Fine-tune generation quality, pronunciation accuracy, and voice similarity

Memory Management: Built-in functionality to optimize GPU resource usage

Automatic Model Download: Models are downloaded automatically when required

Installation

Prerequisites

ComfyUI installed and working

Python 3.10+ recommended

CUDA-compatible GPU with at least 4GB VRAM (8GB+ recommended for higher quality)

Steps

Clone this repository to ComfyUI’s custom_nodes directory:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/1038lab/ComfyUI-MegaTTS.git
```

Install required dependencies:

```bash
cd ComfyUI-MegaTTS
pip install -r requirements.txt
```

The node automatically downloads the required models on first use. You can also download them manually, as described in the next section.

Models and Manual Download

This extension uses modified versions of ByteDance’s MegaTTS3 models. While the models are automatically downloaded during first use, you can manually download them from Hugging Face:

Model Structure

The models are organized in the following structure:

model_path/TTS/MegaTTS3/
  ├── diffusion_transformer/
  │   ├── config.yaml
  │   └── model_only_last.ckpt
  ├── wavvae/
  │   ├── config.yaml
  │   └── decoder.ckpt
  ├── duration_lm/
  │   ├── config.yaml
  │   └── model_only_last.ckpt
  ├── aligner_lm/
  │   ├── config.yaml
  │   └── model_only_last.ckpt
  └── g2p/
      ├── config.json
      ├── model.safetensors
      ├── generation_config.json
      ├── tokenizer_config.json
      ├── special_tokens_map.json
      ├── tokenizer.json
      ├── vocab.json
      └── merges.txt
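
After a manual download, it can help to confirm that every file in the tree above is in place. The following is a minimal sketch using only the standard library; the function name `find_missing_model_files` and the example path are assumptions for illustration and are not part of the extension.

```python
from pathlib import Path

# Expected files per subfolder, copied from the tree above.
EXPECTED_FILES = {
    "diffusion_transformer": ["config.yaml", "model_only_last.ckpt"],
    "wavvae": ["config.yaml", "decoder.ckpt"],
    "duration_lm": ["config.yaml", "model_only_last.ckpt"],
    "aligner_lm": ["config.yaml", "model_only_last.ckpt"],
    "g2p": [
        "config.json", "model.safetensors", "generation_config.json",
        "tokenizer_config.json", "special_tokens_map.json",
        "tokenizer.json", "vocab.json", "merges.txt",
    ],
}


def find_missing_model_files(model_root):
    """Return relative paths of expected files that are absent under model_root."""
    root = Path(model_root)
    missing = []
    for subfolder, filenames in EXPECTED_FILES.items():
        for name in filenames:
            if not (root / subfolder / name).is_file():
                missing.append(f"{subfolder}/{name}")
    return missing


if __name__ == "__main__":
    # Hypothetical location; adjust to your own ComfyUI models directory.
    for path in find_missing_model_files("comfyui/models/TTS/MegaTTS3"):
        print("missing:", path)
```

An empty result means the layout matches the tree; it does not verify that the files themselves are complete or uncorrupted.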

Manual Download Options

Direct Download from Hugging Face:

Visit the ByteDance/MegaTTS3 repository

Download each subfolder from the repository:

aligner_lm

diffusion_transformer

duration_lm

g2p

wavvae

Place the downloaded files in the corresponding directories under comfyui/models/TTS/MegaTTS3/

Using Hugging Face CLI:

```bash
# Install huggingface_hub if you don't have it
pip install huggingface_hub

# Download all models
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/MegaTTS3', local_dir='comfyui/models/TTS/MegaTTS3/')"
```

Voice Folder and Voice Maker

[!IMPORTANT]

The WaveVAE encoder is currently not available.

For security reasons, ByteDance has not uploaded the WaveVAE encoder.

You can only use pre-extracted latents (.npy files) for inference.

To synthesize speech for a specific speaker, ensure both the corresponding WAV and NPY files are in the same directory.

Refer to the ByteDance MegaTTS3 repository for details on obtaining necessary files or submitting your voice samples.

Voice Folder Structure

The extension requires a Voices folder to store reference voice samples and their extracted features:

Voices/
├── sample1.wav     # Reference audio file
├── sample1.npy     # Extracted features from the audio file
├── sample2.wav     # Another reference audio
└── sample2.npy     # Corresponding features
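
Because each voice needs both files, a quick pairing check can catch a missing NPY before you open ComfyUI. This is a small sketch under the assumption that file pairs share the same base name, as in the layout above; the function names are hypothetical and not part of the extension.

```python
from pathlib import Path


def list_usable_voices(voices_dir):
    """Return sorted voice names that have both a .wav and a .npy file."""
    folder = Path(voices_dir)
    wav_names = {p.stem for p in folder.glob("*.wav")}
    npy_names = {p.stem for p in folder.glob("*.npy")}
    return sorted(wav_names & npy_names)


def list_unpaired_wavs(voices_dir):
    """Return sorted names of .wav files that lack a matching .npy file."""
    folder = Path(voices_dir)
    wav_names = {p.stem for p in folder.glob("*.wav")}
    npy_names = {p.stem for p in folder.glob("*.npy")}
    return sorted(wav_names - npy_names)


if __name__ == "__main__":
    # "Voices" is the folder name used in this README; adjust the path as needed.
    if Path("Voices").is_dir():
        print("usable:", list_usable_voices("Voices"))
        print("missing NPY:", list_unpaired_wavs("Voices"))
```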

Getting Voice Samples and NPY Files

Download pre-extracted samples:

Sample voice WAV and NPY files can be found in this Google Drive folder: Voice Samples and NPY Files

This folder contains pre-extracted NPY files and their corresponding WAV samples organized in subfolders

Submit your own voice samples:

If you want to use your own voice, you can submit samples to this Google Drive folder: Voice Submission Queue

Your samples should be clear audio with minimal background noise and no longer than 24 seconds

After verification for safety, the ByteDance team will extract and provide NPY files for your samples

Generate NPY files with Voice Maker:

Use the Voice Maker node to automatically process your audio and generate NPY files

While this method is convenient, the quality may not match officially extracted NPY files

Best for quick testing and experimentation with your own voice samples

Voice Maker Node

This extension includes a Voice Maker custom node that helps you prepare voice samples:

Voice Maker Node Features:

Convert any audio file to the required 24kHz WAV format

Extract NPY feature files from WAV samples

Process and optimize voice samples for better quality

Save processed files to the Voices folder automatically

How to use the Voice Maker:

Add the “Voice Maker” node from the 🧪AILab/🔊Audio category

Connect an audio input or select a file from your computer

Configure processing options (normalization, trimming, etc.)

Run the node to generate a ready-to-use voice sample with its NPY file

About WAV and NPY Files

WAV files: These are the actual voice samples you want to clone (24kHz recommended)

NPY files: These contain extracted features necessary for voice cloning

Voice Format Requirements

For best results:

Sample rate: 24kHz (will be automatically converted if different)

Audio format: WAV recommended, but MP3, M4A, and other formats are supported

Duration: 5-24 seconds of clear speech

Quality: Clean recording with minimal background noise
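
The sample-rate and duration guidelines above can be checked before a file goes into the Voices folder. The sketch below uses Python's standard `wave` module, which reads PCM WAV files only, so other formats would need converting first. The function name `inspect_wav` is an assumption for illustration and is not part of the extension.

```python
import wave

TARGET_SAMPLE_RATE = 24000            # 24 kHz, as recommended above
MIN_SECONDS, MAX_SECONDS = 5.0, 24.0  # duration guideline from the list above


def inspect_wav(path):
    """Return (sample_rate, duration_seconds, warnings) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    warnings = []
    if sample_rate != TARGET_SAMPLE_RATE:
        warnings.append(
            f"sample rate is {sample_rate} Hz, not {TARGET_SAMPLE_RATE} Hz"
        )
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        warnings.append(
            f"duration {duration:.1f}s is outside "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s"
        )
    return sample_rate, duration, warnings
```

A sample-rate warning is informational, since the text above says other rates are converted automatically. The check cannot judge recording quality or background noise.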

Parameter Tuning

Controlling Voice Accent

This model offers excellent control over accents and pronunciation:

For preserving the speaker’s accent:

Set pronunciation_strength (p_w) to a lower value (1.0-1.5)

This is useful for cross-lingual TTS where you want to preserve the accent

For standard pronunciation:

Set pronunciation_strength (p_w) to a higher value (2.5-4.0)

This helps produce more standard pronunciation regardless of the source accent

For emotional or expressive speech:

Increase the voice_similarity (t_w) parameter (2.0-5.0)

Keep pronunciation_strength (p_w) at a moderate level (1.5-2.5)

Recommended Parameter Combinations

| Use Case | p_w (pronunciation_strength) | t_w (voice_similarity) |
|---|---|---|
| Standard TTS | 2.0 | 3.0 |
| Preserve Accent | 1.0-1.5 | 3.0-5.0 |
| Cross-lingual (standard) | 3.0-4.0 | 3.0-5.0 |
| Emotional Speech | 1.5-2.5 | 3.0-5.0 |
| Noisy Reference Audio | 3.0-5.0 | 3.0-5.0 |
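
If you drive the nodes from a script or simply want a first value to try, the table can be kept as a small lookup. The sketch below copies the ranges from the table; taking the midpoint of each range as a starting value is an arbitrary choice made here for illustration, not a recommendation from the MegaTTS3 authors, and `starting_point` is a hypothetical helper.

```python
# (low, high) ranges copied from the table above; a single value has low == high.
RECOMMENDED_RANGES = {
    "Standard TTS": {"p_w": (2.0, 2.0), "t_w": (3.0, 3.0)},
    "Preserve Accent": {"p_w": (1.0, 1.5), "t_w": (3.0, 5.0)},
    "Cross-lingual (standard)": {"p_w": (3.0, 4.0), "t_w": (3.0, 5.0)},
    "Emotional Speech": {"p_w": (1.5, 2.5), "t_w": (3.0, 5.0)},
    "Noisy Reference Audio": {"p_w": (3.0, 5.0), "t_w": (3.0, 5.0)},
}


def starting_point(use_case):
    """Pick the midpoint of each recommended range as a first value to try."""
    ranges = RECOMMENDED_RANGES[use_case]
    return {name: (low + high) / 2 for name, (low, high) in ranges.items()}


if __name__ == "__main__":
    print(starting_point("Preserve Accent"))
```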

Nodes

This extension provides three main nodes:

1. Mega TTS (Advanced)

Full-featured TTS node with complete parameter control.

Inputs:

input_text – Text to convert to speech

language – Language selection (en: English, zh: Chinese)

generation_quality – Controls the number of diffusion steps (higher = better quality but slower)

pronunciation_strength (p_w) – Controls pronunciation accuracy (higher values produce more standard pronunciation)

voice_similarity (t_w) – Controls similarity to reference voice (higher values produce speech more similar to reference)

reference_voice – Reference voice file from Voices folder

Outputs:

AUDIO – Generated audio in WAV format

LATENT – Audio latent representation for further processing

2. Mega TTS (Simple)

Simplified TTS node with default parameters for quick usage.

Inputs:

input_text – Text to convert to speech

language – Language selection (en: English, zh: Chinese)

reference_voice – Reference voice file from Voices folder

Outputs:

AUDIO – Generated audio in WAV format

3. Mega TTS (Clean Memory)

Utility node to free GPU memory after TTS processing.

Parameter Descriptions

| Parameter | Description | Recommended Values |
|---|---|---|
| generation_quality | Controls the number of diffusion steps. Higher values produce better quality but increase generation time. | Default: 10. Range: 1-50. For quick tests: 1-5, for final output: 15-30. |
| pronunciation_strength (p_w) | Controls how closely the output follows standard pronunciation. | Default: 2.0. Range: 1.0-5.0. For accent preservation: 1.0-1.5, for standard pronunciation: 2.5-4.0. |
| voice_similarity (t_w) | Controls how similar the output is to the reference voice. | Default: 3.0. Range: 1.0-5.0. For more expressive output with preserved voice characteristics: 3.0-5.0. |
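
When parameter values come from user input or a config file, clamping them to the documented ranges avoids out-of-range settings. This is a minimal sketch: the `TTSSettings` class is a hypothetical container, not the node's actual interface, and only the defaults and ranges are taken from the table above.

```python
from dataclasses import dataclass


def _clamp(value, low, high):
    """Force value into the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class TTSSettings:
    """Hypothetical container; defaults and ranges come from the table above."""

    generation_quality: int = 10         # range 1-50
    pronunciation_strength: float = 2.0  # p_w, range 1.0-5.0
    voice_similarity: float = 3.0        # t_w, range 1.0-5.0

    def clamped(self):
        """Return a copy with every value forced into its documented range."""
        return TTSSettings(
            generation_quality=int(_clamp(self.generation_quality, 1, 50)),
            pronunciation_strength=float(
                _clamp(self.pronunciation_strength, 1.0, 5.0)
            ),
            voice_similarity=float(_clamp(self.voice_similarity, 1.0, 5.0)),
        )


if __name__ == "__main__":
    print(TTSSettings(generation_quality=80, voice_similarity=9.0).clamped())
```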

Voice Cloning

Adding Reference Voices

Place your voice WAV files in the Voices folder

Each voice requires two files:

voice_name.wav – Voice sample file (24kHz sample rate recommended, 5-10 seconds of clear speech)

voice_name.npy – Corresponding voice feature file (generated automatically if voice extraction is enabled)

How to Clone a Voice

Add your sample WAV file to the Voices folder

The first time you select the voice, the system will extract feature files and save them

Select your voice in the node’s “reference_voice” dropdown

Adjust the “voice_similarity” parameter to control the intensity of voice cloning:

Lower values (1.0-2.0): More natural but less similar to reference

Higher values (3.0-5.0): More similar to reference but potentially less natural

Advanced Usage

Cross-Language Voice Cloning

For cloning a voice across languages (e.g., making an English speaker speak Chinese):

Use a clean voice sample in the original language

Set language to the target language (e.g., “zh” for Chinese)

Increase the pronunciation_strength (p_w) parameter (3.0-4.0) for more standard pronunciation

Set voice_similarity (t_w) parameter higher (3.0-5.0) to maintain voice characteristics

Handling Accents

For preserving accents: Lower pronunciation_strength (p_w) value (1.0-1.5)

For standard pronunciation: Higher pronunciation_strength (p_w) value (2.5-4.0)

Credits

Original MegaTTS3 model by ByteDance

MegaTTS3 Hugging Face model: ByteDance/MegaTTS3

License

GPL-3.0 License

References

ByteDance MegaTTS3 GitHub Repository

ByteDance MegaTTS3 Hugging Face Model

Original papers:

“Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis”

“WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling”