ComfyUI-speech-dataset-toolkit

★ 23

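Tags: audio processing, dataset creation, visualization, ASR/TTS tools
A torchaudio-based audio toolkit for ComfyUI that assists in building speech datasets for ASR, TTS, and similar tasks, providing load/save, editing, visualization, and AI processing capabilities.
💡 Quickly load, edit, and visualize speech in ComfyUI to build ASR/TTS datasets
🍴 5 Forks · 💻 Python · 🔄 Updated 2025-06-17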
📦 requirements.txt
faster-whisper==1.0.1
git+https://github.com/kale4eat/demucs@no-torch-install#egg=demucs
# nue-asr
git+https://github.com/rinnakk/nue-asr
# nemo-asr
Cython
git+https://github.com/reazon-research/ReazonSpeech#subdirectory=pkg/nemo-asr
# kotoba-whisper
transformers
accelerate
# BS-RoFormer
git+https://github.com/kale4eat/BS-RoFormer@mss-training#egg=BS-RoFormer
ml_collections
[Screenshots: Load Audio node, Apply demucs node, Save Audio node]
📄 README

Overview

Basic audio tools for ComfyUI, built on torchaudio. They are intended to assist in creating speech datasets for ASR, TTS, and similar tasks.

Note: The AUDIO type in this repository is compatible with the official ComfyUI implementation (as of February 7, 2025).

Features

  • Basic
      • Load & Save audio
  • Edit
      • Cut and Trim
      • Split and Join
      • Silence
      • Resample
  • Visualization
      • WaveForm
      • Specgram
      • Spectrogram
      • MelFilterBank
      • Pitch
  • AI
      • Demucs (facebook)
      • faster-whisper (OpenAI’s Whisper model using CTranslate2)
      • silero-vad (Silero Team)
      • nue-asr (rinna Co., Ltd.)
      • ReazonSpeech nemo-asr (Reazon Human Interaction Lab)
      • SpeechMOS (UTokyo-SaruLab’s MOS prediction system)
      • kotoba-whisper (Kotoba Technologies)
      • BS-RoFormer

Requirement

    Install torchaudio according to your environment.

    cd custom_nodes
    git clone https://github.com/kale4eat/ComfyUI-speech-dataset-toolkit.git
    cd ComfyUI-speech-dataset-toolkit
    pip3 install torchaudio --index-url https://download.pytorch.org/whl/cu121
    pip3 install -r requirements.txt

    If you use silero-vad, install onnxruntime according to your environment.

    pip install onnxruntime-gpu

Usage

    At first startup, the audio_input and audio_output folders are created.

    ComfyUI
    ├── input
    │   └── audio_input
    ├── output
    │   └── audio_output
    ├── custom_nodes
    │   └── ComfyUI-speech-dataset-toolkit
    ...

    First, use a Load Audio node to load audio.

    Put the audio files you wish to process in the audio_input folder in advance.

    If you’ve added files while the app is running, please reload the page (press F5).
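To check which files are in place before starting ComfyUI, the folder layout above can be inspected with a few lines of Python. This is only a minimal sketch: `list_audio_files` and the `AUDIO_EXTENSIONS` set are hypothetical names introduced for illustration, not part of this toolkit, and the extensions the Load Audio node actually accepts may differ.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions; the toolkit's real filter may differ.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg"}


def list_audio_files(comfy_root: Path) -> list[str]:
    """Return sorted names of audio files in <comfy_root>/input/audio_input."""
    folder = comfy_root / "input" / "audio_input"
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Demo against a throwaway directory that mimics the layout shown earlier.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    audio_dir = root / "input" / "audio_input"
    audio_dir.mkdir(parents=True)
    (audio_dir / "speaker01.wav").touch()
    (audio_dir / "notes.txt").touch()
    found = list_audio_files(root)

print(found)  # ['speaker01.wav']
```

In a real setup, `comfy_root` would point at the ComfyUI installation directory instead of a temporary folder.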

    You can also use LoadAudio, the official ComfyUI implementation.

    audio, the data type passed between nodes in a ComfyUI workflow, consists of a waveform and a sample rate.

    Many nodes in this extension operate on this data.

    Note that the waveform is a torch.Tensor and has a batch dimension.

    For example, Demucs separates audio into drums, bass, vocals, and other stems, each of which is audio data.
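The audio structure described above can be illustrated with a short sketch. It assumes the data is a dict with `"waveform"` and `"sample_rate"` keys and that the tensor is laid out as (batch, channels, samples), which is how the official ComfyUI AUDIO type is commonly represented; verify this against your ComfyUI version. `trim_audio` is a hypothetical helper written for this example, not a node from this repository.

```python
import torch


def trim_audio(audio: dict, start_sec: float, end_sec: float) -> dict:
    """Return a new audio dict cut to the interval [start_sec, end_sec)."""
    sample_rate = audio["sample_rate"]
    start = int(start_sec * sample_rate)
    end = int(end_sec * sample_rate)
    # Assumed layout: (batch, channels, samples). Slicing the last axis
    # keeps the batch and channel dimensions intact.
    return {
        "waveform": audio["waveform"][..., start:end],
        "sample_rate": sample_rate,
    }


# One second of mono silence at 16 kHz, with a batch dimension of 1.
audio = {"waveform": torch.zeros(1, 1, 16000), "sample_rate": 16000}
clip = trim_audio(audio, 0.25, 0.75)

print(clip["waveform"].shape)  # torch.Size([1, 1, 8000])
```

Because the batch dimension is always present, code that works on a single clip should index or slice the last axis rather than assume a 1-D or 2-D tensor.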

    Finally, use a Save Audio node to save audio. The audio is saved to the audio_output folder. You can also use SaveAudio, which is implemented by ComfyUI.

Note

    Some design decisions are not yet settled, so breaking changes may be made.

    This repository does not include nodes for general-purpose tasks such as numerical operations and string processing.

Inspiration

  • ComfyUI-audio
  • ComfyUI-AudioScheduler