OptiMer: Composing Distribution Vectors Beats Data Mixing

arXiv cs.CL, 2026-04-01
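Proposes OptiMer: run continual pre-training separately on each dataset, extract a distribution vector representing the parameter shift that dataset induces, then use Bayesian optimization to search post hoc over the vector pool for the optimal composition weights. Experiments with Gemma 3 27B on Japanese, Chinese, and the math and code domains show that OptiMer outperforms data mixture and model averaging baselines at 15 to 35 times lower search cost. The study finds that the optimized weights can be interpreted as data mixture ratios and used for retraining, and that the same vector pool can be reused to produce target-tailored models on demand without retraining, offering a more flexible and efficient paradigm for continual pre-training.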
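The extract-then-compose idea can be illustrated with a toy sketch. This is not the authors' implementation. It assumes parameters are stored as a dict of NumPy arrays, uses random search as a simple stand-in for Bayesian optimization, and picks an arbitrary weight range of [0, 1.5]. The function names and `score_fn` (a hypothetical objective that scores a merged model) are illustrative only.

```python
import numpy as np

def distribution_vector(base, cpt):
    """Parameter shift induced by one dataset: CPT weights minus base weights."""
    return {name: cpt[name] - base[name] for name in base}

def compose(base, vectors, weights):
    """Add the weighted sum of distribution vectors to the base parameters."""
    merged = {name: value.copy() for name, value in base.items()}
    for vec, w in zip(vectors, weights):
        for name in merged:
            merged[name] += w * vec[name]
    return merged

def search_weights(base, vectors, score_fn, n_trials=200, seed=0):
    """Random search over weights (stand-in for Bayesian optimization)."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        # Assumed search range; the source does not state one.
        w = rng.uniform(0.0, 1.5, size=len(vectors))
        score = score_fn(compose(base, vectors, w))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy setup: one 3-parameter "model" and two per-dataset CPT checkpoints.
base = {"w": np.zeros(3)}
cpt_a = {"w": np.array([1.0, 0.0, 0.0])}
cpt_b = {"w": np.array([0.0, 1.0, 0.0])}
vectors = [distribution_vector(base, cpt_a), distribution_vector(base, cpt_b)]

# Hypothetical objective: negative distance to a made-up target parameter vector.
target = np.array([0.5, 1.0, 0.0])
score_fn = lambda params: -float(np.linalg.norm(params["w"] - target))

merged = compose(base, vectors, [0.5, 1.0])
best_w, best_score = search_weights(base, vectors, score_fn)
```

Because the vectors are computed once, each search trial costs only a weighted sum and an evaluation, with no further training. That is the property that lets the same pool be reused for different targets.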
Original content
arXiv:2603.28858v1 Announce Type: new
Abstract: Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model’s distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2)