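Falcon Perception: An Early-Fusion Model
Falcon Perception is a 0.6B-parameter early-fusion Transformer that treats image patches and text as a single sequence processed with a hybrid attention mask, enabling open-vocabulary grounding and segmentation from natural language prompts. On SA-Co it reaches 68.0 Macro-F1 (ahead of SAM 3's 62.3); its main weakness is presence calibration (MCC 0.64 vs. 0.82). The authors also introduce PBench, a diagnostic benchmark that breaks down capabilities such as attributes, OCR-guided disambiguation, spatial constraints, and relations. They also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on olmOCR and 88.6 on OmniDocBench, with throughput ahead of other open-source OCR models. The article summarizes the design and the lessons learned.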
Original content
Falcon Perception
Team Article · Published April 1, 2026 · FalconPerception (tiiuae)
TL;DR: Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3), with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense long-context crowded scenes.
We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while having the highest throughput of any open-source OCR model.
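To make the "one sequence, hybrid attention mask" idea concrete, here is a minimal sketch of one common way such a mask can be built. The TL;DR does not spell out the exact rule, so the layout (image patch tokens first, then text) and the rule itself (bidirectional attention among image tokens, causal attention for text tokens) are assumptions for illustration, and the function name `hybrid_attention_mask` is hypothetical rather than part of any released code.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed layout: image patch tokens first, followed by text tokens.
    Assumed rule (not confirmed by the post):
      - image tokens attend to every image token (bidirectional),
      - text tokens attend to every image token and to text tokens
        at or before their own position (causal).
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Image query: sees the whole image block, never the text.
                mask[q][k] = k < num_image_tokens
            else:
                # Text query: all image tokens precede it, so a plain
                # causal rule covers "whole image + earlier text".
                mask[q][k] = k <= q
    return mask
```

With 3 image tokens and 2 text tokens, the first three rows allow attention only within the image block, while the last two rows see the full image plus the text prefix up to their own position. In a real model the same boolean pattern would typically be passed to the attention kernel as a mask tensor.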
This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.
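For readers unfamiliar with the presence metric quoted in the TL;DR, the sketch below computes a Matthews correlation coefficient over binary "is the prompted concept present in this image?" decisions. It assumes that "MCC" in the post refers to this standard coefficient applied per image; the helper names and the toy data are hypothetical and are not taken from the SA-Co evaluation code.

```python
import math


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Standard Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0


def presence_mcc(predicted: list[bool], actual: list[bool]) -> float:
    """MCC over per-image presence decisions (True = concept is present)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    return matthews_corrcoef(tp, fp, fn, tn)


# Toy data: six images, one false positive and one false negative.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, False, True, True]
score = presence_mcc(predicted, actual)  # (2*2 - 1*1) / sqrt(3*3*3*3) = 1/3
```

Because MCC uses all four cells of the confusion matrix, a model that predicts instances in images where the concept is absent is penalized even when its masks are accurate on positive images, which is why it serves as a calibration signal separate from Macro-F1.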
The problem: why do perception systems end up as pipelines?
Many open-vocabulary perception systems are built as modular pipelines: a (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as we add a new fix for each failure mode.
We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?
In our experiments, the answer is largely yes. The rest of this po