exllamav2>=0.1.5; platform_system == "Linux"
flash-attn>=2.5.7; platform_system == "Linux"

A simple local text generator for ComfyUI using ExLlamaV2.
Clone the repository to custom_nodes and install the requirements:
cd custom_nodes
git clone https://github.com/Zuellni/ComfyUI-ExLlama-Nodes
pip install -r ComfyUI-ExLlama-Nodes/requirements.txt
On Windows, install ExLlamaV2 and FlashAttention from prebuilt wheels that match your Python, CUDA, and PyTorch versions:
pip install exllamav2-X.X.X+cuXXX.torch2.X.X-cp3XX-cp3XX-win_amd64.whl
pip install flash_attn-X.X.X+cuXXX.torch2.X.X-cp3XX-cp3XX-win_amd64.whl
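
The placeholders in the wheel filenames stand for the CUDA version (cuXXX), the PyTorch version (torch2.X.X), and the Python version (cp3XX). As a minimal sketch, the following prints the matching tags for the current environment. The helper names are hypothetical, and the PyTorch part assumes a CUDA build of PyTorch is already installed; otherwise it returns None.

```python
from __future__ import annotations

import sys


def python_tag() -> str:
    """Return the CPython wheel tag for the running interpreter, e.g. cp311."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def torch_tags() -> tuple[str, str] | None:
    """Return (torch tag, CUDA tag), or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if torch.version.cuda is None:  # PyTorch build without CUDA support
        return None
    # torch.__version__ may carry a local suffix such as "+cu121"; drop it.
    torch_tag = "torch" + torch.__version__.split("+")[0]
    cuda_tag = "cu" + torch.version.cuda.replace(".", "")
    return torch_tag, cuda_tag


if __name__ == "__main__":
    print("Python tag:", python_tag())
    print("PyTorch/CUDA tags:", torch_tags())
```

Pick the wheel whose filename contains all three tags.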
Only EXL2, 4-bit GPTQ and FP16 models are supported. You can find them on Hugging Face.
To use a model with the nodes, you should clone its repository with git or manually download all the files and place them in a folder in models/llm.
For example, if you’d like to download the 4-bit Llama-3.1-8B-Instruct:
cd models
mkdir llm
cd llm
git lfs install
git clone https://huggingface.co/turboderp/Llama-3.1-8B-Instruct-exl2 -b 4.0bpw
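
After downloading, it can be useful to confirm that each folder in models/llm actually contains a model. The sketch below is a heuristic and not the check the nodes themselves perform. It assumes a usable model folder has a config.json and at least one .safetensors file, and that a weight file under 1 KiB is a Git LFS pointer left behind when LFS was not set up. The function names and the threshold are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: Git LFS pointer files are far smaller than real weight files.
MIN_WEIGHT_BYTES = 1024


def looks_complete(folder: Path) -> bool:
    """Return True if folder has a config.json and a non-trivial weight file."""
    has_config = (folder / "config.json").is_file()
    has_weights = any(
        f.stat().st_size >= MIN_WEIGHT_BYTES for f in folder.glob("*.safetensors")
    )
    return has_config and has_weights


def list_models(llm_dir: Path) -> dict[str, bool]:
    """Map each subfolder of llm_dir to whether it looks like a complete model."""
    if not llm_dir.is_dir():
        return {}
    folders = sorted(p for p in llm_dir.iterdir() if p.is_dir())
    return {folder.name: looks_complete(folder) for folder in folders}


if __name__ == "__main__":
    # Run from the ComfyUI root directory.
    for name, ok in list_models(Path("models/llm")).items():
        print(f"{name}: {'ok' if ok else 'incomplete'}")
```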
[!TIP]
You can add your own llm path to the extra_model_paths.yaml file and put the models there instead.
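
As an illustration of such an entry, the helper below builds the text to paste into extra_model_paths.yaml. It assumes ComfyUI's usual layout for that file: a named top-level section with an optional base_path and one key per model type. The section name my_models, the example path, and the helper itself are hypothetical.

```python
def llm_path_entry(
    base_path: str, subdir: str = "llm/", section: str = "my_models"
) -> str:
    """Build a YAML snippet that registers an extra folder for llm models."""
    return (
        f"{section}:\n"
        f"    base_path: {base_path}\n"
        f"    llm: {subdir}\n"
    )


if __name__ == "__main__":
    # Paste the printed text into extra_model_paths.yaml and restart ComfyUI.
    print(llm_path_entry("/data/models/"))
```

With the example values, models would then be looked up in /data/models/llm/ in addition to the default folder.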
| ExLlama Nodes | Parameter | Description |
|---|---|---|
| Loader | | Loads models from the llm directory. |
| | cache_bits | A lower value reduces VRAM usage, but also affects generation speed and quality. |
| | flash_attention | Enabling reduces VRAM usage, not supported on cards with compute capability lower than 8.0. |
| | max_seq_len | Max context, higher value equals higher VRAM usage. 0 will default to model config. |
| Formatter | | Formats messages using the model’s chat template. |
| | add_assistant_role | Appends assistant role to the formatted output. |
| Tokenizer | | Tokenizes input text using the model’s tokenizer. |
| | add_bos_token | Prepends the input with a bos token if enabled. |
| | encode_special_tokens | Encodes special tokens such as bos and eos if enabled, otherwise treats them as normal strings. |
| Settings | | Optional sampler settings node. Refer to SillyTavern for parameters. |
| Generator | | Generates text based on the given input. |
| | unload | Unloads the model after each generation to reduce VRAM usage. |
| | stop_conditions | A list of strings to stop generation on, e.g. "\n" to stop on newline. Leave empty to only stop on eos. |
| | max_tokens | Max new tokens to generate. 0 will use available context. |

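
To get a feel for how cache_bits and max_seq_len affect VRAM usage, here is a rough estimate of the key/value cache size. It is a sketch under simplifying assumptions: it counts only the cached keys and values for a single sequence and ignores quantization scales, model weights, and other runtime overhead. The function name is hypothetical; the example figures are those of Llama-3.1-8B (32 layers, 8 key/value heads, head dimension 128).

```python
def kv_cache_bytes(
    seq_len: int, cache_bits: int, layers: int, kv_heads: int, head_dim: int
) -> int:
    """Approximate key/value cache size in bytes for one sequence.

    Keys and values are both stored for every layer, each holding
    kv_heads * head_dim numbers per token.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * cache_bits // 8


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = kv_cache_bytes(8192, bits, layers=32, kv_heads=8, head_dim=128)
        print(f"{bits}-bit cache, 8192 tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the estimate scales linearly in both settings: halving cache_bits halves the cache, and doubling max_seq_len doubles it.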

| Text Nodes | Description |
|---|---|
| Clean | Strips punctuation, fixes whitespace, and changes case for input text. |
| Message | A message for the Formatter node. Can be chained to create a conversation. |
| Preview | Displays generated text in the UI. |
| Replace | Replaces variable names in curly brackets, e.g. {a}, with their values. |
| String | A string constant. |
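
The substitution performed by the Replace node can be pictured with a short sketch. This is an illustration only, not the node's actual implementation: the function name is hypothetical, and leaving unknown variable names untouched is an assumption.

```python
import re


def replace_variables(text: str, values: dict) -> str:
    """Replace {name} placeholders in text with the matching entries of values."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Assumption: placeholders without a value are kept unchanged.
        return str(values.get(name, match.group(0)))

    return re.sub(r"\{(\w+)\}", substitute, text)


if __name__ == "__main__":
    prompt = "A photo of {a} in {b}."
    print(replace_variables(prompt, {"a": "a cat", "b": "a garden"}))
```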
An example workflow is embedded in the image below and can be opened in ComfyUI.