ComfyUI-Mana-Nodes

★ 246

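Font animation · Speech recognition · Subtitle generation · Text-to-speech

Adds font animation, speech recognition, subtitle generation, and TTS nodes to ComfyUI, simplifying the creation and synchronization of audio and text.

💡 Create multimedia clips with speech, subtitles, and font animation in ComfyUI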

🍴 18 Forks · 💻 Python · 🔄 2024-05-29


📦 requirements.txt

numpy
torch
torchvision
Pillow
moviepy
librosa
transformers
opencv-python-headless
requests
matplotlib
scipy
pyspellchecker


📄 README


Welcome to the ComfyUI-Mana-Nodes project!

This collection of custom nodes is designed to supercharge text-based content creation within the ComfyUI environment.

Whether you’re working on dynamic captions, transcribing audio, or crafting engaging visual content, Mana Nodes has got you covered.

If you like Mana Nodes, give our repo a ⭐ Star and 👀 Watch our repository to stay updated.

Installation

You can install Mana Nodes via the ComfyUI-Manager.

Or simply clone the repo into the custom_nodes directory with this command:

git clone https://github.com/ForeignGods/ComfyUI-Mana-Nodes.git

and install the requirements using:

.\python_embed\python.exe -s -m pip install -r requirements.txt --user

If you are using a venv, make sure you have it activated before installation and use:

pip install -r requirements.txt

Nodes

✒️ Text to Image Generator

Required Inputs

`font`

Connect a 🆗 Font Properties node here to set the font and its styling.

`canvas`

Connect a 🖼️ Canvas Properties node here to configure the canvas.

`text`

Specifies the text to be rendered on the images. Supports multiline text input for rendering on separate lines.

For simple text: Input the text directly as a string.

For frame-specific text: Use a JSON-like format where each line specifies a frame number and the corresponding text. Example:

```
"1": "Hello",
"10": "World",
"20": "End"
```
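As an illustration of the frame-specific format, here is a minimal Python sketch of how such input could be parsed and queried. It is not the node's actual implementation: the helper names `parse_frame_text` and `text_at_frame` are hypothetical, and it assumes that wrapping the input in braces yields valid JSON and that each entry stays active until the next listed frame.

```python
import json


def parse_frame_text(text: str) -> dict[int, str]:
    """Parse '"1": "Hello", "10": "World"' style input into {frame: text}."""
    # Assumption: adding surrounding braces turns the input into valid JSON.
    stripped = text.strip().rstrip(",")
    parsed = json.loads("{" + stripped + "}")
    return {int(frame): value for frame, value in parsed.items()}


def text_at_frame(schedule: dict[int, str], frame: int) -> str:
    """Return the latest entry at or before `frame`, or '' if there is none."""
    # Assumption: an entry remains active until the next listed frame.
    active = [f for f in schedule if f <= frame]
    return schedule[max(active)] if active else ""


schedule = parse_frame_text('"1": "Hello",\n"10": "World",\n"20": "End"')
print(text_at_frame(schedule, 12))  # prints: World
```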

`frame_count`

Sets the number of frames this node will output.

Optional Inputs

`transcription`

Input the transcription output from the 🎤 Speech Recognition node here.

Based on this transcription data, the 🖼️ Canvas Properties, and the 🆗 Font Properties, the text is laid out by adding words to a line until no space is left on the canvas (transcription_mode: fill, line).

`highlight_font`

Input a secondary 🆗 Font Properties node that is used to highlight the active caption (transcription_mode: fill, line). When setting the text manually, the following syntax can be used to define which word or character is highlighted:

Hello <tag>World</tag>
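To make the tag syntax concrete, the following Python sketch splits a string into plain and highlighted segments. The function name `split_highlight` and the regular expression are illustrative assumptions; the node's own parser may handle nesting or malformed tags differently.

```python
import re

# Matches the <tag>...</tag> markers shown in the syntax example above.
TAG_PATTERN = re.compile(r"<tag>(.*?)</tag>", re.DOTALL)


def split_highlight(text: str) -> list[tuple[str, bool]]:
    """Split text into (segment, is_highlighted) pairs, in reading order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        if match.start() > pos:
            # Plain text between the previous tag and this one.
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(1), True))
        pos = match.end()
    if pos < len(text):
        # Trailing plain text after the last tag.
        segments.append((text[pos:], False))
    return segments


print(split_highlight("Hello <tag>World</tag>"))
# prints: [('Hello ', False), ('World', True)]
```

Segments flagged `True` would be drawn with `highlight_font`, the rest with `font`.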

Outputs

`images`

The generated images with the specified text and configurations, in common ComfyUI format (compatible with other nodes).

`transcription_framestamps`

Framestamps formatted based on canvas, font and transcription settings.

Can be useful for manually correcting errors made by the 🎤 Speech Recognition node.

Example: Save this output with 📝 Save/Preview Text -> manually correct mistakes -> remove transcription input from ✒️ Text to Image Generator node -> paste corrected framestamps into text input field of ✒️ Text to Image Generator node.

🆗 Font Properties

Required Inputs

`font_file`

Font file located in custom_nodes\ComfyUI-Mana-Nodes\font_files (for example, example_font.ttf) or in a system font directory. Supported formats: .ttf, .otf, .woff, .woff2.

`font_size`

Either set a single font_size value or input an animation definition via the ⏰ Scheduled Values node. (Convert font_size to input)

`font_color`

Either set a single color value (CSS3/Color/Extended color keywords) or input an animation definition via the 🌈 Preset Color Animations node. (Convert font_color to input)

`x_offset`, `y_offset`

Either set single horizontal and vertical offset values or input an animation definition via the ⏰ Scheduled Values node. (Convert x_offset/y_offset to input)

`rotation`

Either set a single rotation value or input an animation definition via the ⏰ Scheduled Values node. (Convert rotation to input)

`rotation_anchor_x`, `rotation_anchor_y`

Horizontal and vertical offsets of the rotation anchor point, relative to the text's initial position.

`kerning`

Spacing between characters of font.

`border_width`

Width of the text border.

`border_color`

Either set a single color value (CSS3/Color/Extended color keywords) or input an animation definition via the 🌈 Preset Color Animations node. (Convert border_color to input)

`shadow_color`

Either set a single color value (CSS3/Color/Extended color keywords) or input an animation definition via the 🌈 Preset Color Animations node. (Convert shadow_color to input)

`shadow_offset_x`, `shadow_offset_y`

Horizontal and vertical offset of the text shadow.

Outputs

`font`

Used as input on ✒️ Text to Image Generator node for the font and highlight_font.

🖼️ Canvas Properties

Required Inputs

`height`, `width`

Dimensions of the canvas.

`background_color`

Background color of the canvas. (CSS3/Color/Extended color keywords)

`padding`

Padding between image border and font.

`line_spacing`

Spacing between lines of text on the canvas.

Optional Inputs

`images`

Can be used to input images instead of using background_color.

Outputs

`canvas`

Used as input on ✒️ Text to Image Generator node to define the canvas settings.

⏰ Scheduled Values

Required Inputs

`frame_count`

Sets the range of the x axis of the chart. (always starts at 1)

`value_range`

Sets the range of the y axis of the chart. (Example: a value of 25 gives a range from -25 to 25)

This can be changed by zooming via the mousewheel and will reset to the specified value if changed.

`easing_type`

Used to generate the values in between the keyframes the user added manually. The values are generated when the Generate Values button is clicked.

The available easing functions are:

linear

easeInQuad

easeOutQuad

easeInOutQuad

easeInCubic

easeOutCubic

easeInOutCubic

easeInQuart

easeOutQuart

easeInOutQuart

easeInQuint

easeOutQuint

easeInOutQuint

exponential
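The interpolation between keyframes can be sketched in Python as follows. This is an assumption-laden illustration, not the node's code: it uses the commonly published formulas for four of the listed easings, and `EASINGS` and `generate_values` are hypothetical names. The exact curves the node uses (in particular for `exponential`) are not specified here.

```python
from typing import Callable

# Common textbook definitions; t runs from 0.0 (start keyframe) to 1.0 (end keyframe).
EASINGS: dict[str, Callable[[float], float]] = {
    "linear": lambda t: t,
    "easeInQuad": lambda t: t * t,
    "easeOutQuad": lambda t: 1 - (1 - t) ** 2,
    "easeInOutQuad": lambda t: 2 * t * t if t < 0.5 else 1 - (-2 * t + 2) ** 2 / 2,
}


def generate_values(keyframes: dict[int, float], easing: str) -> dict[int, float]:
    """Fill every frame between consecutive keyframes using the chosen easing."""
    ease = EASINGS[easing]
    frames = sorted(keyframes)
    result = dict(keyframes)
    for start, end in zip(frames, frames[1:]):
        span = keyframes[end] - keyframes[start]
        for frame in range(start + 1, end):
            t = (frame - start) / (end - start)
            result[frame] = keyframes[start] + span * ease(t)
    return dict(sorted(result.items()))


print(generate_values({1: 0.0, 5: 20.0}, "linear"))
# prints: {1: 0.0, 2: 5.0, 3: 10.0, 4: 15.0, 5: 20.0}
```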

`step_mode`

The option single forces the chart to display every single tick/step.

The option auto automatically removes ticks/steps to prevent overlapping.

`animation_reset`

Used to specify the reset behaviour of the animation.

word: animation will be reset when a new word is displayed, stays on last value when animation finished before word change.

line: animation will be reset when a new line is displayed, stays on last value when animation finished before line change.

never: animation will just run once and stop on last value. (Not affected by word or line change)

looped: animation will endlessly loop. (Not affected by word or line change)

pingpong: animation will first play forward then back and so on. (Not affected by word or line change)
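The three modes that do not depend on word or line changes can be illustrated with a small Python sketch that maps an output frame to a position in the animation. The function `animation_index` is hypothetical, and whether the node repeats the end values in pingpong mode is an assumption (this sketch does not repeat them). The word and line modes are omitted because they depend on transcription timing.

```python
def animation_index(frame: int, length: int, mode: str) -> int:
    """Map a 0-based output frame to a 0-based index into `length` animation values."""
    if length <= 1:
        return 0
    if mode == "never":
        # Run once, then hold the last value.
        return min(frame, length - 1)
    if mode == "looped":
        # Restart from the first value after the last one.
        return frame % length
    if mode == "pingpong":
        # Forward, then backward, without repeating the end values.
        period = 2 * (length - 1)
        pos = frame % period
        return pos if pos < length else period - pos
    raise ValueError(f"unsupported mode: {mode}")


for mode in ("never", "looped", "pingpong"):
    print(mode, [animation_index(f, 4, mode) for f in range(8)])
# prints:
# never [0, 1, 2, 3, 3, 3, 3, 3]
# looped [0, 1, 2, 3, 0, 1, 2, 3]
# pingpong [0, 1, 2, 3, 2, 1, 0, 1]
```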

`scheduled_values`

Adding Values: Click on the chart to add keyframes at specific points.

Editing Values: Double-click on a keyframe to edit its frame and value.

Deleting Values: Click on the delete button associated with each keyframe to remove it.

Generating Values: Click on the “Generate Values” button to interpolate values between existing keyframes.

Deleting Generated Values: Click on the “Delete Generated” button to remove all interpolated values.

Outputs

`scheduled_values`

Outputs a list of frame and value pairs and the animation_reset option.

At the moment this output can be used to animate the following widgets (Convert property to input) of the 🆗 Font Properties node:

font_size (font, highlight_font)

x_offset (font)

y_offset (font)

rotation (font)

🌈 Preset Color Animations

Required Inputs

`color_preset`

Currently the following color animation presets are available:

rainbow

sunset

grey

ocean

forest

fire

sky

earth

`animation_duration`

Sets the length of the animation measured as frames.

`animation_reset`

Used to specify the reset behaviour of the animation.

word: animation will be reset when a new word is displayed, stays on last value when animation finished before word change.

line: animation will be reset when a new line is displayed, stays on last value when animation finished before line change.

never: animation will just run once and stop on last value. (Not affected by word or line change)

looped: animation will endlessly loop. (Not affected by word or line change)

pingpong: animation will first play forward then back and so on. (Not affected by word or line change)

Outputs

`scheduled_colors`

Outputs a list of frame and color definitions and the animation_reset option.

At the moment this output can be used to animate the following widgets (Convert property to input) of the 🆗 Font Properties node:

font_color (font, highlight_font)

border_color (font, highlight_font)

shadow_color (font, highlight_font)

🎤 Speech Recognition

Converts spoken words in an audio file to text using a deep learning model.

Required Inputs

`audio`

Audio file path or URL.

`wav2vec2_model`

The Wav2Vec2 model used for speech recognition. (https://huggingface.co/models?search=wav2vec2)

`spell_check_language`

Language for the spell checker.

`framestamps_max_chars`

Maximum characters allowed until new framestamp line is created.

Optional Inputs

`fps`

Frames per second, used for synchronizing with video. (Default set to 30)

Outputs

`transcription`

Text transcription of the audio. (Should only be used as font2img transcription input)

`raw_string`

Raw string of the transcription without timestamps.

`framestamps_string`

Frame-stamped transcription.

`timestamps_string`

Transcription with timestamps.

Example Outputs

`raw_string`

Returns the transcribed text as one line.

THE GREATEST TRICK THE DEVIL EVER PULLED WAS CONVINCING THE WORLD HE DIDN'T EXIST

`framestamps_string`

Depending on the framestamps_max_chars parameter, the sentence is cleared and starts to build up again each time max_chars would be exceeded.

In this example framestamps_max_chars is set to 25.

"27": "THE",
"31": "THE GREATEST",
"43": "THE GREATEST TRICK",
"73": "THE GREATEST TRICK THE",
"77": "DEVIL",
"88": "DEVIL EVER",
"94": "DEVIL EVER PULLED",
"127": "DEVIL EVER PULLED WAS",
"133": "CONVINCING",
"150": "CONVINCING THE",
"154": "CONVINCING THE WORLD",
"167": "CONVINCING THE WORLD HE",
"171": "DIDN'T",
"178": "DIDN'T EXIST",

`timestamps_string`

Returns all transcribed words with their start_time and end_time as a JSON-formatted string.

[
  {
    "word": "THE",
    "start_time": 0.9,
    "end_time": 0.98
  },
  {
    "word": "GREATEST",
    "start_time": 1.04,
    "end_time": 1.36
  },
  {
    "word": "TRICK",
    "start_time": 1.44,
    "end_time": 1.68
  },
...
]
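The relationship between the timestamp and framestamp outputs can be sketched in Python. This is a reconstruction from the example outputs, not the node's actual code: `build_framestamps` is a hypothetical name, and rounding `start_time * fps` to the nearest frame is an assumption that happens to reproduce the frames shown (0.9 s at 30 fps gives frame 27).

```python
def build_framestamps(words: list[dict], fps: int = 30, max_chars: int = 25) -> dict[str, str]:
    """Build {frame: accumulated line} from word timestamps."""
    framestamps: dict[str, str] = {}
    line = ""
    for entry in words:
        candidate = f"{line} {entry['word']}" if line else entry["word"]
        if len(candidate) > max_chars:
            # The line would exceed max_chars: clear it and start over with this word.
            candidate = entry["word"]
        line = candidate
        # Assumption: the frame is the word's start time in seconds times fps, rounded.
        framestamps[str(round(entry["start_time"] * fps))] = line
    return framestamps


words = [
    {"word": "THE", "start_time": 0.9, "end_time": 0.98},
    {"word": "GREATEST", "start_time": 1.04, "end_time": 1.36},
    {"word": "TRICK", "start_time": 1.44, "end_time": 1.68},
]
print(build_framestamps(words))
# prints: {'27': 'THE', '31': 'THE GREATEST', '43': 'THE GREATEST TRICK'}
```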

🎞️ Split Video

Required Inputs

`video`

Path to the video file.

`frame_limit`

Maximum number of frames to extract from the video.

`frame_start`

Starting frame number for extraction.

`filename_prefix`

Prefix for naming the extracted audio file. (relative to .\ComfyUI\output)

Outputs

`frames`

Extracted frames as image tensors.

`frame_count`

Total number of frames extracted.

`audio_file`

Path of the extracted audio file.

`fps`

Frames per second of the video.

`height`, `width`

Dimensions of the extracted frames.

🎥 Combine Video

Required Inputs

`frames`

Sequence of images to be used as video frames.

`filename_prefix`

Prefix for naming the video file. (relative to .\ComfyUI\output)

`fps`

Frames per second for the video.

Optional Inputs

`audio_file`

Audio file path or URL.

Outputs

`video_file`

Path to the created video file.

📣 Generate Audio (experimental)

Converts text to speech and saves the output as an audio file.

Required Inputs

`text`

The text to be converted into speech.

`filename_prefix`

Prefix for naming the audio file. (relative to .\ComfyUI\output)

This node uses a text-to-speech pipeline to convert input text into spoken words, saving the result as a WAV file. The generated audio file is named using the provided filename prefix and is stored relative to the .\ComfyUI-Mana-Nodes directory.

Model: https://huggingface.co/spaces/suno/bark

Foreign Language

Bark supports various languages out-of-the-box and automatically determines language from input text. When prompted with code-switched text, Bark will even attempt to employ the native accent for the respective languages in the same voice.

Example:

Buenos días Miguel. Tu colega piensa que tu alemán es extremadamente malo. But I suppose your english isn't terrible.

Non-Speech Sounds

Below is a list of some known non-speech sounds, but we are finding more every day.


[laughter]
[laughs]
[sighs]
[music]
[gasps]
[clears throat]
— or … for hesitations
♪ for song lyrics
capitalization for emphasis of a word
MAN/WOMAN: for bias towards speaker

Example:

" [clears throat] Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as... ♪ singing ♪."

Music

Bark can generate all types of audio, and, in principle, doesn’t see a difference between speech and music. Sometimes Bark chooses to generate text as music, but you can help it out by adding music notes around your lyrics.

Example:

♪ In the jungle, the mighty jungle, the lion barks tonight ♪

Speaker Prompts

You can provide certain speaker prompts such as NARRATOR, MAN, WOMAN, etc. Please note that these are not always respected, especially if a conflicting audio history prompt is given.

Example:

WOMAN: I would like an oatmilk latte please.

MAN: Wow, that's expensive!

📝 Save/Preview Text

Required Inputs

`string`

The string to be written to the file.

`filename_prefix`

Prefix for naming the text file. (relative to .\output)

Example Workflows

LCM AnimateDiff Text Animation

Demo

| Demo 1 | Demo 2 | Demo 3 |
| --- | --- | --- |
| | | |

Workflow

example_workflow_1.json

The values for the ⏰ Scheduled Values node cannot be imported yet (you have to add them yourself).

Speech Recognition Caption Generator

Demo

Turn on audio.

https://github.com/ForeignGods/ComfyUI-Mana-Nodes/assets/78089013/e5a39327-db61-46ad-abea-10e27e4551c1

Workflow

example_workflow_2.json

To-Do

[ ] Improve Speech Recognition

[ ] Improve Text to Speech

[ ] Node to download fonts from DaFont.com

[ ] SVG Loader/Animator

[ ] Text to Image Generator Alpha Channel

[ ] Add Font Support for non Latin Characters

[ ] 3D Effects, Bevel/Emboss, Inner Shading, Fade in/out

[ ] Find a better way to define color animations

[ ] Make more Font Properties animatable

Contributing

Your contributions to improve Mana Nodes are welcome!

If you have suggestions or enhancements, feel free to fork this repository, apply your changes, and create a pull request. For significant modifications or feature requests, please open an issue first to discuss what you’d like to change.
