ComfyUI-TF32-Enabler

★ 9

Performance · TF32 · NVIDIA RTX · Zero configuration
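Automatically enables TF32 acceleration in ComfyUI on RTX 30/40/50 series GPUs. It works out of the box with zero configuration, speeds up diffusion-model inference by 1.5-2x, and has minimal impact on precision.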
💡 Automatically turns on TF32 in ComfyUI to significantly speed up diffusion-model inference.
🍴 2 Forks · 💻 Python · 🔄 2026-03-10

ComfyUI TF32 Enabler

Automatically enables TensorFloat-32 (TF32) acceleration for NVIDIA RTX 30/40/50 series GPUs in ComfyUI.

Note: to use torch.compile, you must disable cudaMallocAsync (for example, by launching ComfyUI with --disable-cuda-malloc).
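One way to check whether cudaMallocAsync is active is to inspect PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable, which takes comma-separated key:value options such as backend:cudaMallocAsync. The sketch below assumes ComfyUI selects the async allocator through that variable; parse_alloc_conf and uses_cuda_malloc_async are hypothetical helpers, not part of this node.

```python
import os
from typing import Dict, Optional


def parse_alloc_conf(conf: str) -> Dict[str, str]:
    """Split a PYTORCH_CUDA_ALLOC_CONF string ("key:value,key:value") into a dict."""
    options: Dict[str, str] = {}
    for item in conf.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        key, _, value = item.partition(":")
        options[key.strip()] = value.strip()  # later entries override earlier ones
    return options


def uses_cuda_malloc_async(conf: Optional[str] = None) -> bool:
    """True if the allocator config selects the cudaMallocAsync backend.

    Hypothetical helper: reads the live environment when no string is given.
    """
    if conf is None:
        conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    return parse_alloc_conf(conf).get("backend") == "cudaMallocAsync"
```

If uses_cuda_malloc_async() returns True inside ComfyUI's environment, disable cudaMallocAsync before relying on torch.compile.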

🚀 Performance Benefits

  • 1.5-2x speedup for diffusion models on Ampere/Ada/Blackwell GPUs
  • Minimal precision impact (maintains quality)
  • Automatic activation on ComfyUI startup
  • Zero configuration required

📋 Requirements

  • NVIDIA GPU with compute capability 8.0 or higher:
      • RTX 30 series (Ampere)
      • RTX 40 series (Ada Lovelace)
      • RTX 50 series (Blackwell)
      • A100, RTX A6000, and other Ampere-or-newer GPUs
  • PyTorch with CUDA support
  • ComfyUI
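The compute-capability requirement can be checked with torch.cuda.get_device_capability(), which returns a (major, minor) tuple; for reference, RTX 30 cards report 8.6, RTX 40 cards 8.9, and the A100 8.0. This is a minimal sketch with hypothetical function names, not the node's own code; the decision is kept in a pure function so it can be tested without a GPU.

```python
from typing import Tuple


def supports_tf32(capability: Tuple[int, int]) -> bool:
    """TF32 tensor-core math needs compute capability 8.0 (Ampere) or newer."""
    return capability >= (8, 0)  # tuple comparison: major first, then minor


def current_gpu_supports_tf32() -> bool:
    """Check CUDA device 0; returns False when PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so this module loads without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return supports_tf32(torch.cuda.get_device_capability(0))
```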

📦 Installation

    cd ComfyUI/custom_nodes
    git clone https://github.com/marduk191/ComfyUI-TF32-Enabler.git
    # Or download and extract the zip file

✅ Verification

When ComfyUI starts, the console should show:

    ============================================================
    🚀 ComfyUI TF32 Acceleration Enabled
    ============================================================
       Matmul TF32: True
       cuDNN TF32:  True
       CUDA Allocator: expandable_segments:True
       GPU: NVIDIA GeForce RTX 5090
       Compute Capability: 10.0
       ✅ torch.compile CUDA allocator fix applied
    ============================================================

🧪 Testing

Run the included test script to verify that torch.compile works:

    cd ComfyUI/custom_nodes/ComfyUI-TF32-Enabler
    python test_torch_compile.py

🔧 Technical Details

This custom node enables:

  • torch.backends.cuda.matmul.allow_tf32 = True
  • torch.backends.cudnn.allow_tf32 = True
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

TF32 uses a 10-bit mantissa (vs. FP32's 23-bit) while keeping FP32's 8-bit exponent, which provides:

  • Faster computation on tensor cores
  • Same dynamic range as FP32
  • Negligible quality loss for AI inference
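To see what dropping from 23 to 10 mantissa bits means numerically, this pure-Python sketch emulates TF32 by rounding a float32 value's mantissa to 10 bits (round half to even on the bit pattern, NaN/Inf ignored). It only illustrates the number format; it is not how PyTorch or the tensor cores are invoked, and to_tf32 is a hypothetical name.

```python
import struct


def to_tf32(x: float) -> float:
    """Round a float32-representable value to TF32 precision.

    TF32 keeps FP32's sign bit and 8-bit exponent but only the top 10 of the
    23 mantissa bits, so the low 13 bits are rounded away (half to even).
    NaN/Inf handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 13                           # 23 - 10 mantissa bits discarded
    lsb = (bits >> drop) & 1            # lowest kept bit, used for ties-to-even
    bits += (1 << (drop - 1)) - 1 + lsb  # add just under half an ulp (+1 on odd)
    bits &= ~((1 << drop) - 1)          # clear the discarded bits
    bits &= 0xFFFFFFFF                  # stay within 32 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, to_tf32(1/3) gives 0.333251953125, a relative error of about 2.4e-4, while float32 alone is accurate to roughly 6e-8. The exponent bits are untouched, which is why the dynamic range matches FP32.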

The expandable_segments allocator setting resolves memory-allocation issues when torch.compile is used with CUDA operations.
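The settings listed above can be applied roughly as follows. This is a minimal sketch, not the node's actual source: merge_alloc_option and enable_tf32 are hypothetical names, and it assumes the code runs before PyTorch first allocates CUDA memory, since the allocator configuration is read when the caching allocator starts up.

```python
import os


def merge_alloc_option(conf: str, key: str, value: str) -> str:
    """Add or overwrite one key:value entry in a PYTORCH_CUDA_ALLOC_CONF string."""
    entries = [e.strip() for e in conf.split(",") if e.strip()]
    kept = [e for e in entries if e.partition(":")[0].strip() != key]
    kept.append(f"{key}:{value}")
    return ",".join(kept)


def enable_tf32() -> bool:
    """Set the allocator option, then the TF32 flags.

    Returns False, leaving the TF32 flags untouched, when PyTorch or CUDA
    is unavailable.
    """
    # Set before any CUDA allocation so the caching allocator picks it up.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = merge_alloc_option(
        os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""), "expandable_segments", "True"
    )
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.backends.cuda.matmul.allow_tf32 = True  # matmuls; PyTorch default is False
    torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions; default is True
    return True
```

Merging, rather than overwriting, keeps any options another component already placed in PYTORCH_CUDA_ALLOC_CONF.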

📊 Benchmarks

Typical speedups on an RTX 5090:

  • SDXL: ~1.8x faster
  • Flux: ~1.9x faster
  • SD3: ~1.7x faster
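To get a rough number on your own card, this sketch times a large float32 matrix multiply with the TF32 matmul flag off and then on. It measures a single operation rather than a full diffusion pipeline, so results will differ from the end-to-end figures above; TF32 only affects float32 matmuls and convolutions. The function names and default sizes are illustrative assumptions.

```python
import time


def speedup(fp32_seconds: float, tf32_seconds: float) -> float:
    """Ratio of FP32 time to TF32 time; values above 1.0 mean TF32 was faster."""
    if tf32_seconds <= 0:
        raise ValueError("tf32_seconds must be positive")
    return fp32_seconds / tf32_seconds


def time_matmul(allow_tf32: bool, n: int = 4096, iters: int = 50) -> float:
    """Average seconds per n x n float32 matmul on the current CUDA device."""
    import torch  # lazy import: speedup() above works without PyTorch

    previous = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    try:
        a = torch.randn(n, n, device="cuda")  # float32 by default
        b = torch.randn(n, n, device="cuda")
        for _ in range(3):        # warm-up so one-time cuBLAS setup is not timed
            a @ b
        torch.cuda.synchronize()  # kernels run asynchronously; wait before timing
        start = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters
    finally:
        torch.backends.cuda.matmul.allow_tf32 = previous  # restore the global flag
```

On a machine with a CUDA GPU: print(f"{speedup(time_matmul(False), time_matmul(True)):.2f}x").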

🛠️ Compatibility

Works with all ComfyUI workflows and custom nodes. No conflicts expected.

📝 License

MIT License – See the LICENSE file for details.

🤝 Contributing

Issues and pull requests welcome!

🔗 Links

  • GitHub Repository: https://github.com/marduk191/ComfyUI-TF32-Enabler
  • ComfyUI

Note: If your GPU doesn't support TF32 (compute capability below 8.0, i.e. older than the RTX 30 series), this node safely does nothing and won't cause errors.