ComfyUI_Step_Audio_EditX_SM

★ 28

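Audio editing · Speech synthesis · Emotion and style · Zero-shot TTS

An open-source, LLM-based audio editing and zero-shot TTS node. It supports iterative, fine-grained editing of emotion, speaking style, and paralinguistics, making it easy to experiment with and generate speech in ComfyUI.

💡 Iteratively edit speech emotion and speaking style inside ComfyUI, or generate speech with zero-shot TTS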

🍴 2 Forks · 💻 Python · 🔄 2025-11-15


📦 requirements.txt

torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
transformers==4.48.3
accelerate==1.3.0
openai-whisper==20240930
onnxruntime-gpu==1.17.0
onnxruntime
omegaconf==2.3.0
librosa==0.10.2.post1
sox==1.5.0
modelscope
numpy==1.26.4
six==1.16.0
hyperpyyaml
conformer==0.3.2
diffusers
pillow
sentencepiece
funasr>=1.1.3
protobuf==5.29.3
gradio>=5.16.0
nvidia-cuda-nvrtc-cu12==12.1.105
spaces==0.42.1

📄 README

ComfyUI_Step_Audio_EditX_SM

Step-Audio-EditX: the first open-source LLM-based audio model excelling at expressive and iterative audio editing (emotion, speaking style, and paralinguistics), alongside robust zero-shot text-to-speech (TTS) capabilities. Try it in ComfyUI.

Update

In paralinguistic mode, insert the paralinguistic tag into the prompt on the second line. For example, if the audio says 早上好 ,吃了没 ("Good morning, have you eaten?"), enter 早上好,[Suprise-ah] ,吃了没 on the second line.

For emotion and style edits, the audio content should match the emotion or delivery named in the info field. For example, use happy only when the text itself is cheerful. The model is not that powerful: at most it strengthens an emotion that is already present, and it cannot create a mixed emotion such as crying and laughing at once.

The audio normalization peak (max_amplitude) and the LLM temperature are now exposed as node inputs. Input audio whose peak exceeds max_amplitude is normalized so that max_amplitude becomes its new peak, which avoids clipping. The default temperature is 7; put plainly, higher values are more creative and lower values are more conservative.
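The peak rule described above can be sketched in a few lines. This is only an illustration of the stated behaviour, not the node's actual code: the function name normalize_peak, the plain-list representation of samples, and the 0.9 default are all assumptions made for the example.

```python
def normalize_peak(samples, max_amplitude=0.9):
    """Rescale samples only when their peak exceeds max_amplitude.

    Hypothetical helper: the name, signature, and default are illustrative.
    """
    # Largest absolute sample value; 0.0 for an empty input.
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= max_amplitude:
        # Already at or below the ceiling: return an unchanged copy.
        return list(samples)
    # Scale everything so the loudest sample lands exactly on max_amplitude.
    scale = max_amplitude / peak
    return [s * scale for s in samples]
```

Audio that is already quieter than max_amplitude passes through unchanged, so the setting acts as a ceiling rather than a target loudness.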

When clone is selected in the sampler menu, the node performs zero-shot voice cloning: the upper prompt must match what is said in the input audio, and the lower prompt is the target text to synthesize.

When clone is not selected, the node runs in edit mode and the lower prompt is ignored. Follow the note in the workflow and enter a tag in edit_info for the style or emotion you want.

Enable offload when VRAM is below 16 GB.

n_edit_iter is the number of editing rounds; 2 or 3 usually gives good results.
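The rules above can be summarized as a small decision sketch. Everything here is hypothetical and exists only to restate the notes: RunPlan, plan_run, and the "emotion" sampler value in the usage line are invented for illustration and are not part of the node's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    mode: str                 # "clone" (zero-shot voice cloning) or "edit"
    uses_target_prompt: bool  # the lower prompt only matters in clone mode
    uses_edit_info: bool      # edit_info tags only matter in edit mode
    offload: bool             # suggested when VRAM is below 16 GB
    n_edit_iter: int          # number of editing rounds


def plan_run(sampler: str, vram_gb: float, n_edit_iter: int = 2) -> RunPlan:
    """Restate the usage notes as data (illustrative helper, not the node API)."""
    is_clone = sampler == "clone"
    return RunPlan(
        mode="clone" if is_clone else "edit",
        uses_target_prompt=is_clone,
        uses_edit_info=not is_clone,
        offload=vram_gb < 16,
        n_edit_iter=n_edit_iter,
    )
```

For instance, plan_run("emotion", vram_gb=12, n_edit_iter=3) describes an edit-mode run with offload enabled and three editing rounds.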

1 Installation

In the ./ComfyUI/custom_nodes directory, run the following:

git clone https://github.com/smthemex/ComfyUI_Step_Audio_EditX_SM

2 Requirements

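sox and hyperpyyaml have been dropped from the requirements. Because of a tokens issue, use transformers==4.53.3 or lower; otherwise the output is silent.

Even when funasr installs cleanly, watch the console output: one more library may still be needed (the author does not recall which).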

pip install -r requirements.txt
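After installing, a quick check can confirm that the installed transformers version respects the ceiling mentioned above. This is a minimal sketch using only the standard library; the helper names (version_tuple, is_supported, check_installed) are made up for this example, and the parser deliberately handles only simple dotted versions.

```python
from importlib import metadata

MAX_TRANSFORMERS = (4, 53, 3)  # ceiling quoted in the requirements note


def version_tuple(version: str) -> tuple:
    """Parse leading numeric parts, e.g. '4.53.3' or '4.54.0.dev0'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def is_supported(version: str) -> bool:
    """True when the version is at or below the stated ceiling."""
    return version_tuple(version) <= MAX_TRANSFORMERS


def check_installed() -> str:
    """Report on the transformers package in the current environment."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    verdict = "ok" if is_supported(installed) else "too new, output may be silent"
    return f"transformers {installed}: {verdict}"


if __name__ == "__main__":
    print(check_installed())
```

Run it with the same Python interpreter that ComfyUI uses, since a different environment may report a different version.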

3 Checkpoints

If running offline (offline mode requires the models to be downloaded in advance):

1. main: Step-Audio-EditX, or Step-Audio-EditX on ModelScope

2. tokens: Step-Audio-Tokenizer, or Step-Audio-Tokenizer on ModelScope

├── ComfyUI/models/SAEditX
|     ├── Step-Audio-EditX
|          ├── all files  # every file, including files in subdirectories
|     ├── Step-Audio-Tokenizer
|          ├── all files  # every file, including files in subdirectories
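Before starting ComfyUI offline, the layout above can be verified with a short script. This is a sketch under assumptions: missing_checkpoints is a hypothetical helper, and it only checks that each folder exists and is non-empty, not that every model file is present.

```python
from pathlib import Path

# Folder names taken from the directory tree above.
REQUIRED = ("Step-Audio-EditX", "Step-Audio-Tokenizer")


def missing_checkpoints(comfy_root) -> list:
    """Return the required folders that are absent or empty.

    comfy_root is the path to the ComfyUI directory itself.
    """
    base = Path(comfy_root) / "models" / "SAEditX"
    missing = []
    for name in REQUIRED:
        folder = base / name
        # A folder counts as present only if it exists and holds something.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

An empty list means both folders are in place; any names returned still need to be downloaded.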

4 Example

5 Usage Disclaimer

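Do not use this method or plugin for any illegal activity. The law's net is wide and nothing slips through it, so do not put it to the test!!

This plugin was made only as a test and demonstration for open-source developers. No payment was received for it; it is purely a labor of love.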

Do not use this model for any unauthorized activities, including but not limited to:

Voice cloning without permission

Identity impersonation

Fraud

Deepfakes or any other illegal purposes

Ensure compliance with local laws and regulations, and adhere to ethical guidelines when using this model.

The model developers are not responsible for any misuse or abuse of this technology.

6 Citation

@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report}, 
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu and Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601}, 
}
