torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
transformers==4.48.3
accelerate==1.3.0
openai-whisper==20240930
onnxruntime-gpu==1.17.0
onnxruntime
omegaconf==2.3.0
librosa==0.10.2.post1
sox==1.5.0
modelscope
numpy==1.26.4
six==1.16.0
hyperpyyaml
conformer==0.3.2
diffusers
pillow
sentencepiece
funasr>=1.1.3
protobuf==5.29.3
gradio>=5.16.0
nvidia-cuda-nvrtc-cu12==12.1.105
spaces==0.42.1
Step-Audio-EditX: the first open-source LLM-based audio model excelling at expressive and iterative audio editing (covering emotion, speaking style, and paralinguistics), alongside robust zero-shot text-to-speech (TTS) capabilities. You can try it in ComfyUI.
1. Installation
In the ./ComfyUI/custom_nodes directory, run the following:
git clone https://github.com/smthemex/ComfyUI_Step_Audio_EditX_SM
2. Requirements
pip install -r requirements.txt
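After installing, it can help to confirm that the exact pins were actually applied, since other custom nodes may have pulled in different versions. The following is a minimal sketch, not part of this repository: the helper names parse_pins and find_mismatches are hypothetical, and it only checks "name==version" lines, skipping unpinned entries and ">=" specifiers.

```python
import os
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for every 'name==version' line.

    Unpinned names and other specifiers (such as '>=') are skipped.
    """
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (wanted, installed_or_None)} for unsatisfied pins."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Builds such as '2.4.1+cu121' carry a local suffix; compare without it.
        base = installed.split("+", 1)[0] if installed else None
        if base != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__" and os.path.isfile("requirements.txt"):
    with open("requirements.txt", encoding="utf-8") as fh:
        for pkg, (wanted, found) in find_mismatches(parse_pins(fh)).items():
            print(f"{pkg}: wanted {wanted}, found {found}")
```

Run it from the cloned node directory with the same Python interpreter that ComfyUI uses; it prints nothing when every exact pin matches.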
3. Checkpoints
1. Main model: Step-Audio-EditX (also available on ModelScope)
2. Tokenizer: Step-Audio-Tokenizer (also available on ModelScope)
ComfyUI/models/SAEditX
├── Step-Audio-EditX
│   └── all files  # every file, including files in subdirectories
└── Step-Audio-Tokenizer
    └── all files  # every file, including files in subdirectories
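To check that both checkpoint folders ended up in the expected place, a short script along these lines may be useful. It is a sketch under assumptions: the function name missing_checkpoints is hypothetical, and it only checks that each folder exists under ComfyUI/models/SAEditX and contains at least one file at any depth. It does not verify that the download is complete.

```python
from pathlib import Path

# Folder names taken from the layout above.
EXPECTED_DIRS = ("Step-Audio-EditX", "Step-Audio-Tokenizer")


def missing_checkpoints(comfy_root):
    """Return the expected checkpoint folders that are absent or empty."""
    base = Path(comfy_root) / "models" / "SAEditX"
    missing = []
    for name in EXPECTED_DIRS:
        folder = base / name
        # Present means: the folder exists and holds at least one file,
        # possibly inside a subdirectory.
        has_files = folder.is_dir() and any(
            p.is_file() for p in folder.rglob("*")
        )
        if not has_files:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumes the script is started from the directory containing ComfyUI.
    for name in missing_checkpoints("ComfyUI"):
        print(f"missing or empty: ComfyUI/models/SAEditX/{name}")
```

An empty result means both folders were found with content.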
Do not use this project for deepfakes or any other illegal purposes.
@misc{yan2025stepaudioeditxtechnicalreport,
title={Step-Audio-EditX Technical Report},
author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
year={2025},
eprint={2511.03601},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.03601},
}