ComfyUI-LLaVA-Captioner

★ 143

多模态离线运行图像聊天批量处理

ComfyUI-LLaVA-Captioner 节点：在本地用 a/LLaVA 多模态 LLM 与图片对话，支持离线运行、无外部服务、批处理和标签输出。

💡 本地对图片进行自然语言问答与自动标注。

🍴 17 Forks💻 Python🔄 2024-08-03

🔗 GitHub 原文

📦

网盘下载

复制链接后前往夸克网盘下载

https://pan.quark.cn/s/2df45d172dc1

📄 README

ComfyUI LLaVA Captioner

A ComfyUI extension for chatting with your images. Runs on your own system, no external services used, no filter.

Uses the LLaVA multimodal LLM so you can give instructions or ask questions in natural language.

It’s maybe as smart as GPT3.5, and it can see.

Try asking for:

captions or long descriptions

whether a person or object is in the image, and how many

lists of keywords or tags

a description of the opposite of the image

NSFWness (FAQ #1 apparently)

The model is quite capable of analysing NSFW images and returning NSFW replies.

It is unlikely to return an NSFW response to a SFW image, in my experience.

It seems like this is because (1) the model’s output is strongly conditioned on

the contents of the image so it’s hard

to activate concepts that aren’t pictured and

(2) the LLM has had a hefty dose of safety-training.

This is probably for the best in general. But you will not have much success asking NSFW questions about SFW images.

Installation

git clone https://github.com/ceruleandeep/ComfyUI-LLaVA-Captioner into your custom_nodes folder

e.g. custom_nodes\ComfyUI-LLaVA-Captioner

Open a console/Command Prompt/Terminal etc

Change to the custom_nodes/ComfyUI-LLaVA-Captioner folder you just created

e.g. cd C:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-LLaVA-Captioner or wherever you have it installed

Run python install.py

Download models from 🤗 into models\llama:

llava-v1.5-7b-Q4_K.gguf

llava-v1.5-7b-mmproj-Q4_0.gguf

Usage

Add the node via image -> LlavaCaptioner

Supports tagging and outputting multiple batched inputs.

model: The multimodal LLM model to use. People are most familiar with LLaVA but there’s also Obsidian or BakLLaVA or ShareGPT4

mmproj: The multimodal projection that goes with the model

prompt: Question to ask the LLM

max_tokens Maximum length of response, in tokens. A token is approximately half a word.

temperature How much randomness to allow in the result. While a lot of people are using the text-only Llama series models with temperatures up around 0.7 and enjoying the creativity, LLaVA’s accuracy seems to benefit greatly from temperatures less than 0.2.

Requirements

llama-cpp-python

This is easy to install but getting it to use the GPU can be a saga.

GPU inference time is 4 secs per image on a RTX 4090 with 4GB of VRAM to spare, and 8 secs per image on a Macbook Pro M1.

CPU inference time is 25 secs per image. If your inference times are closer to 25 than to 5, you’re probably doing CPU inference.

Unfortunately the multimodal models in the Llama family need about a 4x larger context size than the text-only ones,

so the llama.cpp promise of doing fast LLM inference on their CPUs hasn’t quite