Falcon Perception: An Early-Fusion Vision-Language Model

HuggingFace Blog · Wed, 01 Ap

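Introducing Falcon Perception, a 0.6B-parameter early-fusion Transformer that takes image patches and text as a single sequence, processes them with a hybrid attention mask, and exposes a lightweight output interface that yields a variable number of instances. It reaches 68.0 Macro-F1 on SA-Co (ahead of SAM 3's 62.3); the main remaining gap is presence calibration (MCC 0.64 vs. 0.82). The post also proposes the diagnostic benchmark PBench and releases the 0.3B-parameter Falcon OCR, which scores well on olmOCR and OmniDocBench with the highest throughput. The article summarizes the design approach and the lessons learned.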

Original content

Falcon Perception
Team Article
Published April 1, 2026
FalconPerception
tiiuae
TL;DR: Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3) with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense long-context crowded scenes.
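To make the "one sequence, hybrid attention mask" idea concrete, here is a minimal sketch. The text above does not spell out the exact masking rule, so the rule used here is an assumption for illustration only: image patches attend to each other bidirectionally, and text tokens attend to all image patches plus earlier text (causal). The function name `hybrid_attention_mask` and the image-first token ordering are likewise hypothetical, not taken from the Falcon Perception code.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (illustrative): image patch tokens come first, then text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Assumption: image patches see every image patch (bidirectional),
                # but never look ahead at text tokens.
                mask[q][k] = k < num_image_tokens
            else:
                # Assumption: text tokens see all image patches and are causal
                # over text, i.e. they see positions up to and including q.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # 3 image patches followed by 2 text tokens.
    for row in hybrid_attention_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position; an `x` marks a key position it may attend to. The top-left block is fully dense (image to image), while the text rows form a lower-triangular pattern on top of full access to the image block.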
We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while having the highest throughput of any open-source OCR model.
This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.
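The presence-calibration gap in the TL;DR is reported as MCC (Matthews correlation coefficient). As a reminder of what that number measures, here is a small sketch of the standard binary MCC computed from confusion-matrix counts. Treating "presence" as a binary decision per image and prompt (is the prompted concept present or not) is an assumption here; the exact evaluation protocol on SA-Co is not described in this section.

```python
import math


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Standard binary MCC from confusion counts.

    Returns a value in [-1, 1]: 1 is perfect agreement, 0 is no better than
    chance, -1 is total disagreement. By convention, 0.0 is returned when any
    marginal count is zero and the ratio is undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not results from the post):
    # the model says "present" for 50 prompts, 40 of them correctly.
    print(round(matthews_corrcoef(tp=40, fp=10, fn=5, tn=45), 3))
```

Unlike F1, MCC also rewards correct "not present" decisions (the `tn` term), which is why a model can lead on Macro-F1 and still trail on presence MCC.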
The problem: why do perception systems end up as pipelines?
Many open-vocabulary perception systems are built as modular pipelines: a (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as we add a new fix for each failure mode.
We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?
In our experiments, the answer is largely yes. The rest of this pos