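Falcon Perception: Early-Fusion Segmentation
Falcon Perception is a 0.6B-parameter early-fusion Transformer that processes image patches and text in a single sequence with a hybrid attention mask, enabling open-domain object grounding and segmentation. On SA-Co it reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3); the main remaining gap is presence calibration (MCC 0.64 vs. 0.82). The team also introduces PBench, a diagnostic benchmark that breaks performance down by attributes, OCR-guided disambiguation, spatial constraints, and relations, and releases the 0.3B-parameter Falcon OCR, which scores 80.3 on olmOCR and 88.6 on OmniDocBench with leading throughput.
Original content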
Falcon Perception
Team Article. Published April 1, 2026. FalconPerception (tiiuae)
TL;DR: Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3), with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense long-context crowded scenes.
We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while having the highest throughput of any open-source OCR model.
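To make the "one sequence, hybrid attention mask" idea concrete, here is a minimal sketch of one plausible mask layout. It is an illustration under stated assumptions, not the model's actual implementation: we assume image patches come first in the sequence and attend to each other bidirectionally, while text tokens attend causally to everything before them. The function name `hybrid_attention_mask` and this exact layout are hypothetical.

```python
def hybrid_attention_mask(num_image: int, num_text: int) -> list[list[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (not confirmed by the post): positions [0, num_image) are
    image patches, followed by num_text text tokens.
    """
    total = num_image + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image:
                # Image-patch query: bidirectional over image patches only.
                mask[q][k] = k < num_image
            else:
                # Text query: causal, sees all image patches and earlier text.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # Print the mask for 3 image patches followed by 2 text tokens.
    for row in hybrid_attention_mask(3, 2):
        print("".join("1" if allowed else "." for allowed in row))
```

With a layout like this, a single Transformer could keep dense, order-free attention over the image while still decoding text (and any structured output tokens) autoregressively.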
This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.
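For readers unfamiliar with the presence-calibration number quoted in the TL;DR, the sketch below shows how such a score could be computed, assuming MCC refers to the Matthews correlation coefficient over binary "is the prompted concept present in this image?" decisions. The helper name `mcc` and the counts in the usage example are hypothetical and are not taken from the benchmark.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; 0.0 is returned when the denominator is zero
    (a common convention when a row or column of the matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts: a model that over-predicts presence (many false positives).
    print(round(mcc(tp=90, fp=40, fn=10, tn=60), 3))
```

Unlike F1, this score also rewards correct "not present" decisions, which is why a model can lead on Macro-F1 and still trail on MCC.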
The problem: why do perception systems end up as pipelines?
Many open-vocabulary perception systems are built as modular pipelines: a (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as we add a new fix for each failure mode.
We asked a simpler question:
can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?
In our experiments, the answer is largely yes. The rest of this po