Aperture Lab — Visual Agents Research

PERCEPTION · REASONING · ACTION · WORLD MODELS · EMBODIED AI · MULTIMODAL · AUTONOMOUS AGENTS · INTERPRETABILITY

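Our Belief

We believe that real intelligence begins with seeing and is completed through action. Agents should do more than answer: they should see, think, and change the real world.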

Founded in 2024, Aperture Lab is a small research team devoted to visual agents. We focus on closing the loop between perception and decision: how a model can understand the world from pixels and take meaningful action based on what it sees. We're committed to open research: open code, transparent thinking.

RESEARCH · 04 directions

What We Research

From low-level visual perception to high-level autonomous decision-making, we work along one thread: see → understand → act.

👁️

Visual Perception

VISUAL PERCEPTION

Help models understand scenes the way people do: open-vocabulary detection, temporal video understanding, and visual memory that stays stable over long contexts.

Open-Vocabulary
Video Understanding
3D Grounding

🧠

Autonomous Agents

AUTONOMOUS AGENTS

Wire perception to decision: agent frameworks and evaluation benchmarks for agents that plan multi-step tasks, call tools, and reflect when they fail.

Planning
Tool Use
Self-Reflection

🌐

World Models

WORLD MODELS

Let agents rehearse the future in their heads: interactive video world models for imagination, planning, and risk-free policy trial and error.

Interactive Video
Latent Dynamics
Imagination

🤖

Embodied Interaction

EMBODIED INTERACTION

Put agents into real bodies: vision-language-action (VLA) models that drive robot arms and desktop GUIs alike.

VLA Models
GUI Agents
Manipulation

RESEARCH MAP · 04 fields that meet

Research Map

Our work doesn't sit on a single point, but where several fields meet: vision, agents, inference engines, and interpretability, each pulling on the others.

Computer Vision

COMPUTER VISION

Detection, segmentation, video understanding, and generation: making machines truly see from pixels. This is the starting point of everything.

Agent

AUTONOMOUS AGENTS

Planning, tool use, memory, and self-reflection: wiring perception to action so models become agents that get things done.

vLLM · Serving

EFFICIENT SERVING

High-throughput, low-latency LLM inference and serving, so research results actually run and stay affordable to use.

Interpretability

INTERPRETABILITY

Open the black box: understand why a model decides the way it does, so agents are not only effective but worthy of trust.

3 Global Branches

4 Research Areas

2+ Products Coming

2024 Founded

LOCATIONS · Three cities, one timezone-agnostic lab

Global Locations

A distributed research team working in relay along the path of the sun: headquartered in Los Angeles, reaching into Europe and Asia-Pacific.

Global HQ

🇺🇸

Los Angeles

LOS ANGELES · USA

Global HQ: the hub of research, engineering, and coordination. Most of our core work starts here.

34.05°N, 118.24°W · PST

🇬🇧

London

LONDON · UK

Europe Branch: interpretability and foundational research, linked into Europe's academic network.

51.51°N, 0.13°W · GMT

🇸🇬

Singapore

SINGAPORE · SG

APAC Branch: embodied intelligence and efficient inference, connected to Asia-Pacific industry and compute.

1.35°N, 103.82°E · SGT

WHAT WE'RE BUILDING · Coming soon

What We're Building

Two works are on the way. One gives agents real long-term memory; the other perfects a single vertical scenario.

SOTA · COMING SOON

🧠

Engram

AGENT MEMORY FRAMEWORK

A SOTA-level memory framework for agents. It gives agents retrievable, forgettable, evolving long-term memory: remembering you across sessions, never going blank mid-task, and making "remembering" a first-class citizen.

Long-term Memory
Retrieval-Augmented
Cross-session
SOTA

Progress: 88% · Internal beta

NEW · IN DEVELOPMENT

🫥

Vanish

PASSERBY REMOVAL · VERTICAL

A vertical algorithm framework for passerby removal. Built for travel shots, street scenes, and footage cleanup: erase passersby and clutter in one tap and auto-inpaint a clean background. Works on photos and video.

Video Inpainting
Instance Segmentation
Background Inpainting
Vertical

Progress: 32% · Prototype

VANISH · LIVE PREVIEW · Drag to compare

Drag the slider to see the same shot before and after passerby removal.

JOIN APERTURE

Come teach machines to see, with us.

We're not accepting applications at the moment, but the lab is growing fast. Follow along, and new openings will appear right here.

Explore Research · Applications closed for now