ComfyUI-sudo-latent-upscale

★ 40

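Latent-space upscaling · hand-drawn illustration optimization · SD1.5/SDXL · high-quality super-resolution

Upscales images directly in latent space. Trained on drawn content for SD1.5/SDXL to preserve detail, reduce artifacts, and improve upscale quality.

💡 High-quality latent-space upscaling for drawn SD images.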

🍴 6 Forks · 💻 Python · 🔄 2024-05-22

📦 requirements.txt

wget
timm

📄 README

ComfyUI-sudo-latent-upscale

This took heavy inspiration from city96/SD-Latent-Upscaler and Ttl/ComfyUi_NNLatentUpscale. It upscales directly inside the latent space.

Some models are for 1.5 and some models are for SDXL. All models are trained for drawn content. Might add new architectures or update models at some point. I recommend the SwinFIR or DRCT models.
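
Every model name listed in the Training Details section ends in `_1.5` or `_sdxl`, so a small helper can pick the checkpoints that match a given base model. This is a minimal sketch that assumes the suffix convention holds for every file; `models_for` and `recommended` are hypothetical helpers, not part of this repository.

```python
# Model names copied from the Training Details section of this README.
MODEL_NAMES = [
    "DAT12x6_l1_eV2-b0_contextual_315k_1.5",
    "DAT6x6_l1_eV2-b0_265k_1.5",
    "CRAFT7x6_l1_eV2-b0_150k_1.5",
    "SwinFIR4x6_mse_200k_1.5",
    "SwinFIR4x6_fft_l1_94k_sdxl",
    "SwinFIR4x6_mse_64k_sdxl",
    "DRCT-l_12x6_325k_l1_sdxl",
    "DRCTFIR-l_12x6_215k_l1_sdxl",
    "DRCT-l_12x6_160k_l1_vaeDecode_l1_hfen_sdxl",
    "DRCT-l_12x6_170k_l1_vaeDecode_l1_fft_sdxl",
]


def models_for(base: str, names=MODEL_NAMES) -> list[str]:
    """Return the names whose suffix matches `base` ("1.5" or "sdxl")."""
    if base not in ("1.5", "sdxl"):
        raise ValueError(f"unknown base model: {base!r}")
    return [n for n in names if n.endswith("_" + base)]


def recommended(base: str) -> list[str]:
    """Narrow the match to the SwinFIR and DRCT families recommended above."""
    return [n for n in models_for(base) if n.startswith(("SwinFIR", "DRCT"))]
```

Calling `recommended("1.5")` leaves only the SwinFIR checkpoint, since no DRCT model is listed for 1.5.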

1.5 comparison:

SDXL comparison:

First row: the upscaled image. For RGB models this is the upscaled RGB image before it goes into VAE encode; for latent models it is the VAE-decoded image. Second row: the final output after the second KSampler.

Training Details

I tried to take promising networks from already existing papers and apply more exotic loss functions.

DAT12x6_l1_eV2-b0_contextual_315k_1.5 / DAT6x6_l1_eV2-b0_265k_1.5

4 channel EfficientnetV2-b0 as a discriminator

Prodigy with 0.1

bf16

batch size 32 for the normal model and 16 for the large model

L1 with 0.08 weight

~22-24gb vram
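
To make the "L1 with 0.08 weight" setting concrete, here is a rough NumPy sketch of a weighted L1 term on 4-channel latents. It covers only the L1 term; how that term is combined with the discriminator loss is not stated here, so that part is left out. The function name and tensor shapes are illustrative assumptions, and real training would use the torch equivalent.

```python
import numpy as np


def weighted_l1(pred: np.ndarray, target: np.ndarray, weight: float = 0.08) -> float:
    """Mean absolute error over all latent elements, scaled by `weight`."""
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(weight * np.mean(np.abs(pred - target)))


# Toy batch of latents: (batch, 4 channels, height, width).
rng = np.random.default_rng(0)
target = rng.standard_normal((2, 4, 16, 16)).astype(np.float32)
pred = target + 0.5  # constant offset, so the unweighted L1 is about 0.5
loss = weighted_l1(pred, target)
```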

CRAFT7x6_l1_eV2-b0_150k_1.5

Similar settings

batch size 16

DAT12x6_l1_eV2-b0_contextual_315k_1.5

Same as previous DAT, but with contextual loss which used a self-made 4-channel latent classification network as a feature extractor. Training with contextual loss from scratch takes too long to converge, so I only used it at the very end. I can’t really recommend the usage of contextual loss though.

SwinFIR4x6_mse_200k_1.5

lamb with 3e-4

bf16

batch size 150

MSE with 0.08 weight

The model was trained on 2×4090 with DDP and gloo, 100k steps on each GPU

SwinFIR4x6_fft_l1_94k_sdxl / SwinFIR4x6_mse_64k_sdxl

Prodigy with 0.1

bf16

batch size 140

One model was trained with MSE and the other was trained with FFT and L1 with weight 1 everywhere
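
As an illustration of the "FFT and L1 with weight 1" combination, the sketch below uses one common FFT loss formulation: an L1 distance between the real and imaginary parts of the 2D spectra. Whether this matches the exact FFT loss used for these models is an assumption. NumPy is used so the snippet runs without a GPU, and the function names are hypothetical.

```python
import numpy as np


def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Plain mean absolute error."""
    return float(np.mean(np.abs(pred - target)))


def fft_l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """L1 distance between the real/imaginary parts of the 2D spectra."""
    pf = np.fft.rfft2(pred, axes=(-2, -1))
    tf = np.fft.rfft2(target, axes=(-2, -1))
    # Stack real and imaginary parts so the L1 runs over real numbers only.
    pf = np.stack([pf.real, pf.imag], axis=-1)
    tf = np.stack([tf.real, tf.imag], axis=-1)
    return float(np.mean(np.abs(pf - tf)))


def total_loss(pred, target, w_fft: float = 1.0, w_l1: float = 1.0) -> float:
    """Weighted sum of the two terms; both weights default to 1."""
    return w_fft * fft_l1_loss(pred, target) + w_l1 * l1_loss(pred, target)
```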

DRCT-l_12x6_325k_l1_sdxl / DRCTFIR-l_12x6_215k_l1_sdxl

AdamW 1e-4

bf16

batch size 40

l1 with weight 0.08

~35gb vram

DRCT-l_12x6_160k_l1_vaeDecode_l1_hfen_sdxl

used DRCT-l_12x6_325k_l1_sdxl as pretrain

AdamW 1e-4

bf16

batch size 3, because training with vae gradients requires a lot of vram

l1 with weight 0.1

vae decode loss similar to nnlatent (HFEN with weight 0.1 and l1 with weight 1 on decoded image)

~22gb vram

DRCT-l_12x6_170k_l1_vaeDecode_l1_fft_sdxl

similar to the previous model, but with fft loss with weight 1

Further Ideas

Ideas I might test in the future:

Huber

Different Conv2D (for example MBConv)

Dropout prior to final conv
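
For reference, the Huber loss from the list above is quadratic for small residuals (like MSE) and linear for large ones (like L1). A minimal NumPy sketch follows; `delta = 1.0` is an arbitrary illustrative threshold, and nothing here has been tried in this project's training.

```python
import numpy as np


def huber_loss(pred: np.ndarray, target: np.ndarray, delta: float = 1.0) -> float:
    """Mean Huber loss: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = np.abs(pred - target)
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))
```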

Failure cases

Any kind of SSIM introduces instability. I tried 4-channel SSIM and MS-SSIM, as well as SSIM on the VAE-decoded image, and nothing works. nonnegative_ssim=True does not seem to help either. Avoid SSIM to retain stability.

Three dataset-creation mistakes to avoid: setting vae.config.scaling_factor = 0.13025 (do not set a scaling factor; nnlatent used it and city96 didn’t, and I do not recommend using it), using an image range of 0 to 1 (the image tensor is supposed to be -1 to 1 prior to encoding with the VAE), and not using torch.inference_mode() while creating the dataset. A combination of these can make training a lot less stable. Even if the loss goes down during training and seemingly converges, the final model won’t be able to generate properly. Here is a correct example:

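import cv2
import torch
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
f = "image.png"  # placeholder: path to one dataset image

vae = AutoencoderKL.from_single_file("vae.pt").to(device)
vae.eval()

with torch.inference_mode():
  image = cv2.imread(f)  # uint8, HxWxC, BGR channel order
  image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
  # HxWxC -> 1xCxHxW float tensor in [0, 1]
  image = (
      torch.from_numpy(image.transpose(2, 0, 1))
      .float()
      .unsqueeze(0)
      .to(device)
      / 255.0
  )
  # Rescale to [-1, 1] before encoding; the latent is not multiplied by a scaling factor
  image = vae.encode(image*2.0-1.0).latent_dist.sample()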
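
The range mistake described above is easy to catch before encoding. Below is a minimal sketch of a guard, assuming images arrive as float arrays in [0, 1]. `to_vae_range` is a hypothetical helper name, and NumPy stands in for torch so the snippet runs without a GPU or a VAE checkpoint.

```python
import numpy as np


def to_vae_range(image: np.ndarray) -> np.ndarray:
    """Map a float image from [0, 1] to the [-1, 1] range the VAE expects."""
    lo, hi = float(image.min()), float(image.max())
    if lo < 0.0 or hi > 1.0:
        # Typical causes: raw 0..255 data, or an image that was already rescaled.
        raise ValueError(f"expected values in [0, 1], got [{lo}, {hi}]")
    return image * 2.0 - 1.0
```

Raising on out-of-range input is a deliberate choice in this sketch: a silently double-rescaled image would still encode, and the problem would only show up later as unstable training.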

Outputs from DITN and OmniSR at their official sizes looked like liquid. I do not recommend using small or efficient networks.

HAT looked promising, but seemingly always had some kind of blur effect. I didn’t manage to get a proper model yet.

I tried to use Fourier as the first and last conv in DAT, but I didn’t manage to properly train it yet. Making the loss converge seems hard.

GRL did not converge.

SwinFIR with Prodigy 1 and Prodigy 0.1 caused massive instability. Images from my Prodigy 1, l1 and EfficientnetV2-b0 attempt.