img2txt-comfyui-nodes

★ 99

自动描述视觉问答支持中文img2img自动化

为图像自动生成通用描述或指定问题答案，支持中文问答，可用BLIP、Llava和MiniCPM模型辅助 img2img 自动化流程。

💡 自动生成图片描述或指定问答，用于 img2img 流程自动化。

🍴 13 Forks💻 Python🔄 2025-03-14

🔗 GitHub 原文

📦

网盘下载

复制链接后前往夸克网盘下载

https://pan.quark.cn/s/8b0992d318c3

📦 requirements.txt

transformers<=4.41.2
bitsandbytes>=0.43.0
timm>=1.0.7
sentencepiece
accelerate>=0.3.0
TensorImgUtils

📄 README

Auto-generate caption (BLIP):

Using to automate img2img process (BLIP and Llava)

Requirements/Dependencies

For Llava

bitsandbytes>=0.43.0
accelerate>=0.3.0

For MiniCPM

transformers<=4.41.2
timm>=1.0.7
sentencepiece

Installation

cd into ComfyUI/custom_nodes directory

git clone this repo

cd img2txt-comfyui-nodes

pip install -r requirements.txt

Models will be automatically downloaded per-use. If you never toggle a model on in the UI, it will never be downloaded.

To ask a list of specific questions about the image, use the Llava or MiniPCM models. The questions are separated by line in the multiline text input box.

Support for Chinese

The MiniCPM model works with Chinese text input without any additional configuration. The output will also be in Chinese.

“MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM”

Please support the creators of MiniCPM here

Tips

The multi-line input can be used to ask any type of questions. You can even ask very specific or complex questions about images.

To get best results for a prompt that will be fed back into a txt2img or img2img prompt, usually it’s best to only ask one or two questions, asking for a general description of the image and the most salient features and styles.

Model Locations/Paths

Models are downloaded automatically using the Huggingface cache system and the transformers from_pretrained method so no manual installation of models is necessary.

If you really want to manually download the models, please refer to Huggingface’s documentation concerning the cache system. Here is the relevant except:

> Pretrained models are downloaded and locally cached at ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below – in order of priority – to specify a different cache directory:

> – Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.

> – Shell environment variable: HF_HOME.

> – Shell environment variable: XDG_CACHE_HOME + /huggingface.

Models

MiniCPM (Chinese & English)

Title: MiniCPM-V-2 – Strong multimodal large language model for efficient end-side deployment

Datasets: HuggingFaceM4VQAv2, RLHF-V-Dataset, LLaVA-Instruct-150K

Size: ~ 6.8GB

Salesforce – blip-image-captioning-base

Title: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Size: ~ 2GB

Dataset: COCO (The MS COCO dataset is a large-scale object detection, image segmentation, and captioning dataset published by Microsoft)

llava – llava-1.5-7b-hf

Title: LLava: Large Language Models for Vision and Language Tasks

Size: ~ 15GB

Dataset: 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP, 158K GPT-generated multimodal instruction-following data, 450K academic-task-oriented VQA data mixture, 40K ShareGPT data.

Prompts

This is the guide for the format of an “ideal” txt2img prompt (using BLIP). Use as the basis for the questions to ask the img2txt models.

Subject – you can specify region, write the most about the subject

Medium – material used to make artwork. Some examples are illustration, oil painting, 3D rendering, and photography. Medium has a strong effect because one keyword alone can dramatically change the style.

Style – artistic style of the image. Examples include impressionist, surrealist, pop art, etc.

Artists – Artist names are strong modifiers. They allow you to dial in the exact style using a particular artist as a reference. It is also common to use multiple artist names to blend their styles. Now let’s add Stanley Artgerm Lau, a superhero comic artist, and Alphonse Mucha, a portrait painter in the 19th century.

Website – Niche graphic websites such as Artstation and Deviant Art aggregate many images of distinct genres. Using them in a prompt is a sure way to steer the image toward these styles.

Resolution – Resolution represents how sharp and detailed the image is. Let’s add keywords highly detailed and sharp focus

Enviornment

Additional Details and objects – Additional details are sweeteners added to modify an image. We will add sci-fi, stunningly beautiful and dystopian to add some vibe to the image.

Composition – camera type, detail, cinematography, blur, depth-of-field

Color/Warmth – You can control the overall color of the image by adding color keywords. The colors you specified may appear as a tone or in objects.

Lighting – Any photographer would tell you lighting is a key factor in creating successful images. Lighting keywords can have a huge effect on how the image looks. Let’s add cinematic lighting and dark to the prompt.