ComfyUI-Gemini_Flash_2.0_Exp
Support My Work
If you find this project helpful, consider buying me a coffee:
https://buymeacoffee.com/shmuelronen
A ComfyUI custom node that integrates Google’s Gemini Flash 2.0 Experimental model, enabling multimodal analysis of text, images, video frames, and audio directly within ComfyUI workflows. Now with image generation capabilities!
Features
Multimodal input support:
Text analysis
Image analysis
Video frame analysis
Audio analysis
NEW! Image generation using the gemini-2.0-flash-exp-image-generation model
Chat mode with conversation history
Voice chat with the smart Audio Recorder node
Structured output option
Temperature and token limit controls
Proxy support
Configurable API settings via config.json
Installation
Install via ComfyUI manager
or
Clone this repository into your ComfyUI custom_nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-Gemini_Flash_2.0_Exp.git
Install required dependencies:
# Install BOTH packages (both are required)
pip install google-genai
pip install google-generativeai
# OR
python -m pip install google-genai
python -m pip install google-generativeai
# Other dependencies
pip install pillow
pip install torchaudio
Important: on Ubuntu/Debian-based systems, also install the PortAudio library:
sudo apt-get install libportaudio2
Get your free API key from Google AI Studio:
Visit Google AI Studio
Log in with your Google account
Click on “Get API key” or go to settings
Create a new API key
Copy the API key for use in config.json
Set up your API key in the config.json file (will be created automatically on first run)
Configuration
API Key Setup
Create a config.json file in the node's main folder:
{
"GEMINI_API_KEY": "your_api_key_here"
}
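As a rough illustration of how a key could be resolved from this file, here is a minimal sketch. It is not the node's actual code: the function name load_api_key and its override argument are hypothetical, and only the GEMINI_API_KEY field comes from the example above.

```python
import json
from pathlib import Path


def load_api_key(config_path, override=""):
    """Return an API key, preferring an explicit override over config.json.

    Hypothetical helper for illustration; not taken from the node's source.
    """
    # A key passed in directly (for example, typed into the node) wins.
    if override and override.strip():
        return override.strip()

    path = Path(config_path)
    if path.is_file():
        with path.open("r", encoding="utf-8") as fh:
            data = json.load(fh)
        key = str(data.get("GEMINI_API_KEY", "")).strip()
        # Treat the template placeholder as "not set".
        if key and key != "your_api_key_here":
            return key

    raise ValueError(
        "No Gemini API key found: set GEMINI_API_KEY in config.json "
        "or enter the key in the node."
    )
```

Preferring the directly entered key over the file matches the advice elsewhere in this README to enter the key in the node on WSL/Ubuntu.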
WSL 2 Ubuntu Users
Note: always enter the API key directly in the Gemini Flash 2.0 node (the api_key input) rather than relying on config.json.
Node Inputs
Required Inputs:
prompt: Main text prompt for analysis or generation
input_type: Select from [“text”, “image”, “video”, “audio”]
model_version: Select model including the new image generation model
operation_mode: Select between “analysis” or “generate_images” mode
chat_mode: Boolean to enable/disable chat functionality
clear_history: Boolean to reset chat history
Optional Inputs:
Additional_Context: Additional text input for context
images: Multiple image inputs (IMAGE type with list=True)
video: Video frame sequence input (IMAGE type)
audio: Audio input (AUDIO type)
api_key: Directly enter your API key (recommended for WSL/Ubuntu)
max_output_tokens: Set maximum output length (1-8192)
temperature: Control response randomness (0.0-1.0)
structured_output: Enable structured response format
max_images: Maximum number of images to process (1-16)
batch_count: Number of images to generate (for image generation mode)
seed: Random seed for reproducible image generation
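To make the documented ranges concrete, the sketch below clamps out-of-range values to the limits listed above (1-8192 tokens, 0.0-1.0 temperature, 1-16 images). The helper names are hypothetical, and whether the node itself clamps or rejects such values is an assumption, not something this README states.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def normalize_settings(max_output_tokens, temperature, max_images):
    """Clamp the optional inputs to the ranges documented in this README.

    Hypothetical helper; the real node may validate differently.
    """
    return {
        "max_output_tokens": int(clamp(max_output_tokens, 1, 8192)),
        "temperature": float(clamp(temperature, 0.0, 1.0)),
        "max_images": int(clamp(max_images, 1, 16)),
    }
```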
Usage Examples
Basic Text Analysis:
Text Input Node -> Gemini Flash Node [input_type: "text", operation_mode: "analysis"]
Image Analysis:
Load Image Node -> Gemini Flash Node [input_type: "image", operation_mode: "analysis"]
Video Analysis:
Load Video Node -> Gemini Flash Node [input_type: "video", operation_mode: "analysis"]
Audio Analysis:
Load Audio Node -> Gemini Flash Node [input_type: "audio", operation_mode: "analysis"]
Image Generation:
Text Input Node -> Gemini Flash Node [model_version: "gemini-2.0-flash-exp-image-generation", operation_mode: "generate_images"]
Image Generation with Reference:
Load Image Node -> Gemini Flash Node [model_version: "gemini-2.0-flash-exp-image-generation", operation_mode: "generate_images"]
Chat Mode
Chat mode maintains conversation history and provides a more interactive experience:
Enable chat mode by setting chat_mode: true
Chat history format:
=== Chat History ===
USER: your message
ASSISTANT: Gemini's response
=== End History ===
Use clear_history: true to start a new conversation
Chat history persists between calls until cleared
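The history layout shown above can be reproduced with a few lines of Python. This is only a sketch of the text format: the format_history function and the (role, text) tuple representation are assumptions, not the node's internal data structures.

```python
def format_history(turns):
    """Render (role, text) pairs in the chat history layout shown above.

    `turns` is a hypothetical list such as [("user", "hi"), ("assistant", "hello")].
    """
    lines = ["=== Chat History ==="]
    for role, text in turns:
        lines.append(f"{role.upper()}: {text}")
    lines.append("=== End History ===")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_history([
        ("user", "your message"),
        ("assistant", "Gemini's response"),
    ]))
```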
Chat Mode Tips:
Works with all input types (text, image, video, audio)
History is displayed in the output
Maintains context across multiple interactions
Clear history when switching topics
Video Frame Handling
When processing videos:
Automatically samples frames evenly throughout the video
Resizes frames for efficient processing
Works with both chat and non-chat modes
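Even sampling can be pictured as picking frame indices spread uniformly from the first frame to the last. The sketch below shows one way to do that with integer arithmetic. The function name and the max_frames parameter are illustrative assumptions, and the node's actual sampling strategy may differ.

```python
def sample_frame_indices(total_frames, max_frames):
    """Pick up to max_frames indices spread evenly across a clip.

    Hypothetical helper: always includes the first and last frame
    when more than one frame is requested.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    last = total_frames - 1
    # Integer division keeps the indices exact and monotonically increasing.
    return [(i * last) // (max_frames - 1) for i in range(max_frames)]
```

For a 10-frame clip limited to 4 frames, this selects indices 0, 3, 6, and 9.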
Image Generation
The new image generation capabilities allow you to:
Generate images from text descriptions
Generate variations based on reference images
Control the generation with seed and temperature parameters
Generate multiple images with batch_count
Image Generation Tips:
For best results, use the “gemini-2.0-flash-exp-image-generation” model
Use “generate_images” operation mode
Provide clear, detailed prompts for better results
Connect reference images for style guidance
Use seed parameter for reproducible results
Troubleshooting Cross-Platform Issues
Windows vs. Ubuntu/WSL Differences
On Windows, both config file and GUI methods work well
On Ubuntu/WSL, entering the API key directly in the GUI is more reliable
If using lowercase filenames on Ubuntu (e.g., gemini_flash_node.py instead of Gemini_Flash_Node.py), the node will still work properly
Common Issues on Ubuntu/WSL:
If you get “400 Bad Request” errors, try entering your API key directly in the GUI
Make sure binary data (images, audio) is properly base64 encoded
Check network connectivity and proxy settings
Ensure proper file permissions for config files
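For the base64 point above, the following sketch shows how raw bytes can be encoded with Python's standard library before being placed in a request. The dictionary layout with mime_type and data keys is an assumption for illustration and is not taken from the node's source.

```python
import base64


def encode_inline_data(raw_bytes, mime_type):
    """Base64-encode binary data and pair it with its MIME type.

    The returned dict shape is hypothetical; adapt it to whatever
    payload format your request actually needs.
    """
    if not isinstance(raw_bytes, (bytes, bytearray)):
        raise TypeError("raw_bytes must be bytes, not text")
    encoded = base64.b64encode(bytes(raw_bytes)).decode("ascii")
    return {"mime_type": mime_type, "data": encoded}


def decode_inline_data(payload):
    """Reverse encode_inline_data, useful for checking a round trip."""
    return base64.b64decode(payload["data"])
```

A round trip through both functions should return the original bytes unchanged, which is a quick way to rule out encoding problems.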
Error Handling
The node provides clear error messages for common issues:
Invalid API key
Rate limit exceeded
Invalid input formats
Network/proxy issues
Rate Limits
Default rate limits (from config.json):
10 requests per minute (RPM_LIMIT)
4 million tokens per minute (TPM_LIMIT)
1,500 requests per day (RPD_LIMIT)
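If you call the API from your own scripts, a client-side limiter can help stay under the request limits above. The sketch below is a simple sliding-window counter for RPM_LIMIT and RPD_LIMIT only. It does not track tokens, and the class name and clock parameter are hypothetical rather than part of this package.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Sliding-window limiter for requests per minute and per day.

    Hypothetical sketch; defaults mirror the limits listed above.
    """

    def __init__(self, rpm_limit=10, rpd_limit=1500, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.rpd_limit = rpd_limit
        self._clock = clock          # injectable so the limiter can be tested
        self._minute = deque()       # timestamps within the last 60 seconds
        self._day = deque()          # timestamps within the last 24 hours

    def _prune(self, now):
        # Drop timestamps that have aged out of each window.
        while self._minute and now - self._minute[0] >= 60:
            self._minute.popleft()
        while self._day and now - self._day[0] >= 86400:
            self._day.popleft()

    def try_acquire(self):
        """Record a request and return True, or return False if over a limit."""
        now = self._clock()
        self._prune(now)
        if len(self._minute) >= self.rpm_limit or len(self._day) >= self.rpd_limit:
            return False
        self._minute.append(now)
        self._day.append(now)
        return True
```

A caller would check try_acquire() before each request and wait or skip when it returns False.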
Audio Analysis with Smart Recording:
The package includes two nodes for audio handling:
Audio Recorder Node: Smart audio recording with silence detection
Gemini Flash Node: Audio content analysis
Audio Recorder Node Features:
Live microphone recording with automatic silence detection
Smart recording termination after detecting silence
Configurable silence threshold and duration
Compatible with most input devices
Visual recording status indicator (10-second auto-reset)
Seamless integration with Gemini Flash analysis
Audio Recording Setup:
Audio Recorder Node -> Gemini Flash Node [input_type: "audio"]
Audio Recorder Controls:
device: Select input device (microphone)
sample_rate: Audio quality setting (default: 44100 Hz)
silence_threshold: Sensitivity for silence detection (0.001-0.1)
silence_duration: Required silence duration to stop recording (0.5-5.0 seconds)
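To show how silence_threshold and silence_duration might interact, here is a sketch of one common approach: measure the RMS level of short chunks and stop once enough consecutive quiet chunks follow some sound. This is an assumed technique for illustration. The recorder's real detection logic, the chunk_seconds parameter, and the choice to ignore silence before any sound is heard are not confirmed by this README.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_stop_index(samples, sample_rate, silence_threshold=0.01,
                    silence_duration=2.0, chunk_seconds=0.1):
    """Return the sample index where recording would stop, or None.

    Hypothetical sketch: silence only counts after some sound was heard.
    """
    chunk = max(1, round(sample_rate * chunk_seconds))
    needed = round(silence_duration * sample_rate)  # quiet samples required
    silent_samples = 0
    heard_sound = False
    for start in range(0, len(samples), chunk):
        block = samples[start:start + chunk]
        if rms(block) < silence_threshold:
            if heard_sound:
                silent_samples += len(block)
                if silent_samples >= needed:
                    return start + len(block)
        else:
            heard_sound = True
            silent_samples = 0  # any sound resets the silence counter
    return None
```

Raising silence_threshold makes the detector treat louder background noise as silence, while raising silence_duration tolerates longer pauses in speech.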
Record Button:
Click to start recording
Records until silence is detected
Button resets after 10 seconds automatically
Visual feedback during recording (red indicator)
Using Voice Commands/Audio Analysis:
Add Audio Recorder node to your workflow
Connect it to Gemini Flash node
Configure recording settings:
Choose input device
Adjust silence detection parameters
Set sample rate if needed
Click “Start Recording” to begin
Speak your message
Recording automatically stops after detecting silence
The recorded audio is processed and sent to Gemini for analysis
Recording button resets after 10 seconds, ready for next recording
Example Audio Analysis Workflow:
Audio Recorder Node [silence_duration: 2.0, silence_threshold: 0.01] ->
Gemini Flash Node [input_type: "audio", prompt: "Transcribe and analyze this audio"]
Contributing
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
License
MIT License
Acknowledgments
Google’s Gemini API
ComfyUI Community
All contributors
Note: This node is experimental and is based on the Gemini 2.0 Flash Experimental model. Features and capabilities may change as the model evolves.