Hierarchical Pre-Training of Vision Encoders with Large Language Models

arXiv cs.CV · 2026-04-02
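HIVE proposes hierarchical cross-attention between the vision encoder and the large language model (LLM), fusing the encoder's multi-layer features in a structured way instead of relying on flattened image embeddings. Combined with a three-stage pre-training strategy, this improves gradient flow and representation learning and yields stable optimization. Experiments show that HIVE outperforms self-attention-based methods on image classification and on vision-language benchmarks such as MME, GQA, OK-VQA, and ScienceQA, indicating that hierarchical fusion can improve multimodal alignment and expressiveness.
Original text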
arXiv:2604.00086v1 Announce Type: new
Abstract: The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.
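The abstract does not give the exact layer wiring, so the following is only a minimal sketch of the general idea of hierarchical cross-attention, not the HIVE implementation. It assumes that each fusion step lets the text tokens attend to features from one vision-encoder level and adds the result back as a residual. The function names (`cross_attention`, `hierarchical_fusion`), the one-level-per-step pairing, and the absence of learned projections are all illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text, visual):
    """Single-head cross-attention: text tokens query visual tokens.

    Assumption: no learned Q/K/V projections, to keep the sketch short.
    """
    dim = len(text[0])
    visual_t = [list(col) for col in zip(*visual)]  # transpose of visual
    scores = matmul(text, visual_t)                 # shape: text x visual
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, visual)                  # shape: text x dim


def hierarchical_fusion(text, visual_levels):
    """Hypothetical fusion: step i attends to vision level i, with a residual add."""
    hidden = text
    for visual in visual_levels:
        attended = cross_attention(hidden, visual)
        hidden = [[h + a for h, a in zip(h_row, a_row)]
                  for h_row, a_row in zip(hidden, attended)]
    return hidden


# Toy data: 4 text tokens and 3 vision levels of 6 tokens each, all of width 8.
rng = random.Random(0)


def random_tokens(count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


text = random_tokens(4, 8)
visual_levels = [random_tokens(6, 8) for _ in range(3)]
fused = hierarchical_fusion(text, visual_levels)
print(len(fused), len(fused[0]))  # 4 8
```

A flattened-embedding baseline would correspond to passing only the last level, `hierarchical_fusion(text, visual_levels[-1:])`, so the text tokens never see intermediate visual features.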