ComfyUI-XTTS

★ 68

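Speech synthesis · Voice cloning · Multilingual support · SRT subtitles
ComfyUI-XTTS integrates the xtts module of Coqui TTS into ComfyUI. It supports speech synthesis and voice cloning in 17 languages, and uses SRT files for multi-speaker and subtitle support in both fine-tuning and inference.
💡 Multilingual TTS and speaker cloning inside ComfyUI workflows.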
🍴 19 Forks · 💻 Python · 🔄 2024-06-24
📦 Cloud drive download
Copy the link, then open Quark Netdisk to download:
https://pan.quark.cn/s/b6135d9bd930
📦 requirements.txt
umap-learn
numpy>=1.17.0
📄 README

ComfyUI-XTTS

A custom ComfyUI node for the xtts module of coqui-ai/TTS. It supports voice cloning and TTS in 17 languages:

English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi)
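
When scripting around the node, it can help to validate a language code before passing it on. The following is a minimal sketch built only from the list above; `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names for illustration and are not part of ComfyUI-XTTS or coqui-ai/TTS.

```python
# Language codes copied from the README list above.
# The dict and helper below are illustrative names, not part of the node's API.
SUPPORTED_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}


def normalize_language(code: str) -> str:
    """Return the canonical lowercase code, or raise ValueError if unsupported."""
    key = code.strip().lower().replace("_", "-")
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return key
```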

Disclaimer

We take no responsibility for any illegal use of this codebase. Please consult your local laws regarding the DMCA (Digital Millennium Copyright Act) and other related legislation.

Features

  • SRT files are supported for subtitles
  • Multiple speakers are supported in fine-tuning and inference via SRT files
  • The many existing ComfyUI custom nodes can be combined with xtts
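
Since SRT files drive both subtitles and multi-speaker input, here is a minimal sketch of reading SRT cues with the Python standard library. It only parses the generic SRT layout (index, time range, text). The README does not say how speakers are marked inside the file, so no speaker convention is assumed here; `Cue` and `parse_srt` are hypothetical names and not the node's own code.

```python
import re
from dataclasses import dataclass
from typing import List

# HH:MM:SS,mmm (some files use a dot instead of a comma)
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def _to_seconds(stamp: str) -> float:
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a time line are skipped."""
    cleaned = content.replace("\r\n", "\n").lstrip("\ufeff").strip()
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", cleaned):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->", 1)
        # Ignore optional positioning data after the end timestamp.
        end = end.split()[0]
        text = "\n".join(lines[2:]).strip()
        cues.append(Cue(int(lines[0].strip()), _to_seconds(start), _to_seconds(end), text))
    return cues
```

Each cue's `start` and `end` can then be used to align a synthesized clip with its subtitle line.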

How to use

    Make sure ffmpeg works on your command line.

    On Linux:

    apt update
    apt install ffmpeg

    On Windows, you can install ffmpeg automatically with WingetUI.
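
To confirm that ffmpeg is reachable from the same environment that runs ComfyUI, a small standard-library check like the one below can be used. This is a sketch, not part of the node; `ffmpeg_version` is a hypothetical helper name.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version() -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is unavailable."""
    exe = shutil.which("ffmpeg")  # looks ffmpeg up on PATH
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0 or not result.stdout:
        return None
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    print(ffmpeg_version() or "ffmpeg not found on PATH")
```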

    Then:

    git clone https://github.com/AIFSH/ComfyUI-XTTS.git
    cd ComfyUI-XTTS
    pip install -r requirements.txt

    Weights are downloaded from Hugging Face automatically. If you are in China, make sure your network can reach Hugging Face.

    If you still have trouble reaching Hugging Face, you can follow the hf-mirror instructions to configure your environment.
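
As a sketch of the mirror setup: the `huggingface_hub` library reads the `HF_ENDPOINT` environment variable, and the hf-mirror project documents `https://hf-mirror.com` as its endpoint. The snippet assumes the node's download goes through `huggingface_hub`, which the README does not confirm; `use_hf_mirror` is a hypothetical helper name.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Point Hugging Face downloads at a mirror unless HF_ENDPOINT is already set.

    Assumption: the download code relies on huggingface_hub, which reads
    HF_ENDPOINT when it is first imported, so call this before that import.
    """
    # setdefault keeps any value the user has already exported.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)
```

Setting the variable in the shell before launching ComfyUI has the same effect and avoids any import-order concerns.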

    Alternatively, download the weights archive, extract it, and place the entire pretrained_models folder in the ComfyUI-XTTS directory.
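
For the manual route, a quick check that the folder landed in the right place might look like this. It is only a sketch: it verifies that `pretrained_models` exists and is non-empty inside the given directory, and makes no assumption about which files the node expects inside it. `has_local_weights` is a hypothetical helper name.

```python
from pathlib import Path


def has_local_weights(node_dir: str) -> bool:
    """True if <node_dir>/pretrained_models exists and contains at least one entry."""
    folder = Path(node_dir) / "pretrained_models"
    return folder.is_dir() and any(folder.iterdir())
```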

    Tutorial

    Demo

    Params

  • temperature: The softmax temperature of the autoregressive model. Defaults to 0.65.
  • length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs. Defaults to 1.0.
  • repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or “uhhhhhhs”, etc. Defaults to 2.0.
  • top_k: Lower values mean the decoder produces more “likely” (aka boring) outputs. Defaults to 50.
  • top_p: Lower values mean the decoder produces more “likely” (aka boring) outputs. Defaults to 0.8.
  • speed: The speed rate of the generated audio. Defaults to 1.0. (can produce artifacts if far from 1.0)
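
To make the sampling parameters above more concrete, the sketch below collects the documented defaults in a dataclass and shows, on a toy vocabulary, how temperature, top_k and top_p narrow a next-token distribution. This is a generic illustration of those techniques in pure Python, not the code XTTS runs; `XttsParams` and `filtered_distribution` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class XttsParams:
    # Defaults as listed in the README above.
    temperature: float = 0.65
    length_penalty: float = 1.0
    repetition_penalty: float = 2.0
    top_k: int = 50
    top_p: float = 0.8
    speed: float = 1.0


def filtered_distribution(
    logits: Dict[str, float], temperature: float, top_k: int, top_p: float
) -> Dict[str, float]:
    """Apply temperature, then top-k, then top-p (nucleus) filtering."""
    if temperature <= 0 or top_k < 1 or not 0 < top_p <= 1:
        raise ValueError("need temperature > 0, top_k >= 1, 0 < top_p <= 1")
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(w for _, w in ranked)
    ranked = [(tok, w / total) for tok, w in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}
```

With toy logits such as `{"a": 2.0, "b": 1.0, "c": 0.0}`, lowering any of the three values shrinks the set of tokens that can still be sampled, which is why lower settings give more "likely" output.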

    WeChat Group && Donate

    Thanks

    coqui-ai/TTS