ComfyUI_InteractAvatar

★ 22

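Text-driven human-object interaction for controllable avatars with lip sync

Turns text instructions into controllable talking-avatar behavior and expressions, generating interactions with objects while keeping the lips in sync

💡 Control how an avatar interacts with objects through text descriptions, with synchronized speech and expressions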

🍴 1 Fork · 💻 Python · 🔄 2026-03-26


📦 Netdisk download

Copy the link below and open it in Quark Netdisk to download:

https://pan.quark.cn/s/c1eafc754fbb

📦 requirements.txt

toml
transformers
diffusers
datasets
pillow
sentencepiece
protobuf
peft
torch-optimi
tensorboard
tqdm
safetensors
bitsandbytes
imageio[ffmpeg]
av
einops
accelerate
loguru
easydict
ftfy
decord
pyloudnorm
#deepspeed
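Before launching ComfyUI it can help to confirm that the packages listed above are actually installed. The following is a minimal sketch, not part of the repository: `parse_requirements` and `missing_packages` are hypothetical helper names, and the parser only handles the simple entries seen in this file (plain names, extras such as `imageio[ffmpeg]`, and `#` comments).

```python
import re
from importlib import metadata


def parse_requirements(text):
    """Return distribution names from requirements.txt content.

    Comment lines (for example "#deepspeed") and blank lines are skipped;
    extras such as "[ffmpeg]" and any version specifiers are dropped.
    """
    names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that importlib.metadata cannot find in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

To use it, read `requirements.txt` into a string, pass it to `parse_requirements`, and pass the result to `missing_packages`; an empty list means every listed distribution was found.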

📄 README

Making Avatars Interact
Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

InteractAvatar is a novel dual-stream DiT framework that enables talking avatars to perform Grounded Human-Object Interaction (GHOI). Unlike previous methods restricted to simple gestures, our model can perceive the environment from a static reference image and generate complex, text-guided interactions with objects while maintaining high-fidelity lip synchronization.

ComfyUI_InteractAvatar


Update

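Added a DWPose node for easier use. When its model is set to none, the model is downloaded automatically. Object input is also simplified: the object can now be taken directly from a mask.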

Bug fix: the short side of the output video must now be 512 or 704.
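To pick an output size that satisfies the short-side constraint, the target dimensions can be computed from the reference image size. This is an illustrative sketch only: `fit_short_side` is a hypothetical helper, and rounding the long side to a multiple of 16 is an assumption of this example, not something the node documents.

```python
ALLOWED_SHORT_SIDES = (512, 704)  # the two values named in the update note


def fit_short_side(width, height, short_side=512, multiple=16):
    """Scale (width, height) so the shorter side equals short_side.

    The longer side keeps the aspect ratio and is rounded to the nearest
    multiple of `multiple` (an assumption made for this sketch).
    """
    if short_side not in ALLOWED_SHORT_SIDES:
        raise ValueError(f"short side must be one of {ALLOWED_SHORT_SIDES}")
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = short_side / min(width, height)
    long_side = round(max(width, height) * scale / multiple) * multiple
    long_side = max(long_side, short_side)  # never let the long side drop below the short one
    # Preserve orientation: portrait and square inputs keep width as the short side.
    return (short_side, long_side) if width <= height else (long_side, short_side)
```

For example, a 1080x1920 portrait reference with `short_side=512` maps to 512x912, and a 1920x1080 landscape reference with `short_side=704` maps to 1248x704.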

If you have less than 24 GB of VRAM, turn on 'offload'. ActionAndSong mode requires the 'long' DiT model with mode '2' selected; otherwise use the regular model. Example images, videos, and audio are in the "InterDemo" directory.

Test environment: 64 GB RAM, 12 GB VRAM, Windows 11. Roughly 40 GB of RAM plus 8 GB of VRAM is enough to run normal mode; long-duration singing may be difficult on such a setup.

The prompts for singing mode and the action prompts must have the same number of lines.
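The requirement that the singing prompts and the action prompts have the same number of lines is easy to check before queueing a workflow. The sketch below is an assumption-laden illustration: `prompt_lines` and `check_prompt_alignment` are hypothetical names, and ignoring blank lines is a choice made here, not documented behavior of the node.

```python
def prompt_lines(text):
    """Split a multi-line prompt into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_prompt_alignment(song_prompt, action_prompt):
    """Pair each singing-prompt line with its action-prompt line.

    Raises ValueError when the two prompts differ in line count,
    since the README requires them to match.
    """
    song = prompt_lines(song_prompt)
    action = prompt_lines(action_prompt)
    if len(song) != len(action):
        raise ValueError(
            f"singing prompt has {len(song)} lines, "
            f"action prompt has {len(action)} lines"
        )
    return list(zip(song, action))
```

Calling `check_prompt_alignment` with two prompts of equal line count returns the paired lines, which makes it easy to see which action goes with which segment of the song.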

1. Installation

In the ./ComfyUI/custom_nodes directory, run the following:

git clone https://github.com/smthemex/ComfyUI_InteractAvatar.git

2. Requirements

pip install -r requirements.txt

3. Models

Wan 2.2 VAE/CLIP: Comfy-Org/Wan_2.2_ComfyUI_Repackaged

InteractAvatar DiT: youliang1233214/InteractAvatar

--  ComfyUI/models/vae
    |-- wan2.2_vae.safetensors  # or the original Wan2.2_VAE.pth
--  ComfyUI/models/clip
    |-- umt5_xxl_fp8_e4m3fn_scaled.safetensors  # or the fp16 version
--  ComfyUI/models/diffusion_models
    |-- interact-avatar-long.safetensors  # renamed from diffusion_pytorch_model.safetensors (long or normal variant)
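A quick way to confirm the layout above is to look for each file under the ComfyUI models folder. This is a minimal sketch: `missing_models` and the `EXPECTED` table are hypothetical, and only the filenames spelled out in this README are checked, so the fp16 CLIP and the normal DiT variant (whose filenames are not given here) are not covered.

```python
from pathlib import Path

# Folder name -> acceptable filenames, taken from the layout in this README.
EXPECTED = {
    "vae": ["wan2.2_vae.safetensors", "Wan2.2_VAE.pth"],
    "clip": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "diffusion_models": ["interact-avatar-long.safetensors"],
}


def missing_models(comfy_root):
    """Return "folder/filename" entries for folders where no expected file exists."""
    models = Path(comfy_root) / "models"
    missing = []
    for folder, candidates in EXPECTED.items():
        if not any((models / folder / name).is_file() for name in candidates):
            # Report the first candidate as the representative filename.
            missing.append(f"{folder}/{candidates[0]}")
    return missing
```

Passing the path of the ComfyUI installation returns an empty list when one accepted file is present in each of the three folders.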

4. Example

Long song

object

ap2v (audio- and pose-driven)

5. Citation

@article{zhang2026making,
  title={Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars},
  author={Zhang, Youliang and Zhou, Zhengguang and Yu, Zhentao and Huang, Ziyao and Hu, Teng and Liang, Sen and Zhang, Guozhen and Peng, Ziqiao and Li, Shunkai and Chen, Yi and Zhou, Zixiang and Zhou, Yuan and Lu, Qinglin and Li, Xiu},
  journal={arXiv preprint arXiv:2602.01538},
  year={2026}
}

🙏 Acknowledgements

We sincerely thank the contributors to the following projects:
