Om AI Lab Blogs

July 6, 2026
VLX-Seek 1.5: Enhanced Fine-grained Perception for Embodied Scenarios

VLX-Seek 1.5 strengthens fine-grained visual grounding for embodied scenarios with a multi-scale model family, faster region-reference inference, stronger detection, and more reliable absent-target rejection.
Read more →
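July 6, 2026
VLX-Seek 1.5: Fine-Grained Perception Enhancement for Embodied Scenarios

Aimed at on-device embodied scenarios, VLX-Seek 1.5 further strengthens fine-grained visual grounding and improves real-world deployment performance through a multi-scale model family, faster region-reference inference, stronger detection, and more reliable rejection of absent targets.
Read more →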
June 28, 2026
VLX-Go: Vision-Language Short-Horizon Waypoint Prediction for Embodied Navigation

VLX-Go is a lightweight 0.6B vision-language waypoint planner for embodied navigation. It maps recent monocular frames, the current observation, and natural-language instructions into short-horizon local waypoints for closed-loop navigation.
Read more →
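June 28, 2026
VLX-Go: A Vision-Language Short-Horizon Waypoint Prediction Model for Embodied Navigation

VLX-Go targets the closed-loop planning problem in embodied navigation. It maps recent visual history, the current observation, and natural-language instructions directly into local waypoints over a short time window, for target following, local navigation, dynamic obstacle avoidance, and real-robot deployment.
Read more →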
June 27, 2026
VLX-Seek: Improving VLM Fine-Grained Perception via Region Reference Instead of Coordinate Generation

VLX-Seek improves fine-grained VLM perception by turning fragile coordinate generation into region reference. It introduces addressable region tokens, a hybrid fine-grained region encoder, and compact object-centric reasoning for detection, counting, and open-vocabulary localization.
Read more →
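June 27, 2026
VLX-Seek: Enhancing VLM Fine-Grained Perception, from "Coordinate Generation" to "Region Reference"

Aimed at on-device embodied vision, VLX-Seek recasts fine-grained perception tasks from coordinate generation to region reference. With region tokens, a hybrid fine-grained region encoder, and more compact object-level reasoning, it lets VLMs see more accurately in detection, counting, and open-vocabulary localization.
Read more →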
June 26, 2026
VLX-Flow: Continuous Video Understanding for Real-Time Multimodal Interaction

VLX-Flow turns live video streams into reusable model memory. It processes continuous chunks, maintains visual cache and semantic memory, and supports low-latency interaction without reprocessing the full video history for every query.
Read more →
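June 26, 2026
VLX-Flow: Letting Multimodal Models Continuously See the Video World

VLX-Flow addresses an increasingly important question: when video shifts from offline files to real-time input, how should a multimodal model keep observing, keep remembering, and handle interaction at any moment? VLX-Flow splits video into continuous chunks and incrementally updates visual context and semantic memory, so the model does not have to reprocess the full history for every question.
Read more →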
August 15, 2025
VLM-FO1: From Coarse to Precise - Revolutionizing VLM Perception with Fine-Grained Objects

Existing Vision-Language Models (VLMs) excel at holistic scene understanding but fail at precise, object-centric tasks like detection, primarily because they cannot generate accurate coordinates. We propose VLM-FO1, an approach that addresses this by recasting object detection from a generation problem into a retrieval problem. We treat bounding boxes as visual prompts, extract their features into unique "object tokens", and feed them directly to the model. With this method, VLM-FO1-3B reaches 44.4 mAP on COCO, rivaling specialized detectors, and demonstrates strong capabilities on other region-based perception tasks.
Read more →
July 17, 2025
OmAgent - A Reinforcement Learning-based Multimodal Agent Framework

With the rapid advancement of Large Language Models (LLMs) and Vision Language Models (VLMs), AI technology is shifting from exam-oriented task completion to solving complex problems in practical scenarios. Using LLMs and VLMs to tackle more realistic and intricate problems, rather than simply passing exams, is not only an inevitable direction of technological evolution but also a key requirement for industrial applications. We launched OmAgent, the first embodied AI agent: a reinforcement learning-based multimodal agent framework whose feasibility has been verified in practical applications.
Read more →
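July 17, 2025
OmAgent - A Reinforcement Learning-based Multimodal Agent

As the capabilities of large language models (LLMs) and vision-language models (VLMs) advance rapidly, AI technology is shifting from "exam-style" task completion toward "real-world" complex problem solving. Using LLMs and VLMs to solve more practical and complex problems, rather than simply passing an "exam", is both the inevitable direction of technical evolution and a core requirement for industrial deployment. We are launching OmAgent, the world's first embodied AI agent: a reinforcement learning-based multimodal agent framework whose feasibility we have validated in real applications.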
Read more →
March 24, 2025
Trials, Errors, and Breakthroughs: Our Rocky Road to OVD SOTA with Reinforcement Learning

Key insights from our extensive experimentation with Reinforcement Learning for object detection in Vision Language Models, focusing on training methodologies, reward functions, and prompt engineering.
Read more →
March 20, 2025
Improving Object Detection through Reinforcement Learning with VLM-R1

A detailed exploration of how reinforcement learning enhances object detection performance compared to supervised fine-tuning in vision-language models.
Read more →