Om AI Lab Blogs

August 15, 2025
VLM-FO1: From Coarse to Precise - Revolutionizing VLM Perception with Fine-Grained Objects

Existing Vision-Language Models (VLMs) excel at holistic scene understanding but fail at precise, object-centric tasks like detection, primarily because they cannot generate accurate coordinates. We propose VLM-FO1, an approach that addresses this by recasting object detection from a generation problem into a retrieval problem. We treat bounding boxes as visual prompts, extract their features into unique "object tokens", and feed these tokens directly to the model. With this method, VLM-FO1-3B reaches 44.4 mAP on COCO, rivaling specialized detectors, and shows strong capabilities on other region-based perception tasks.
Read more →
July 17, 2025
OmAgent - A Reinforcement Learning-based Multimodal Agent Framework

With the rapid advancement of Large Language Models (LLMs) and Vision Language Models (VLMs), AI technology is shifting from exam-oriented task completion to solving complex problems in practical scenarios. Using LLMs and VLMs to tackle more realistic and intricate problems, rather than simply passing exams, is not only an inevitable direction of technological evolution but also a key requirement for industrial applications. We have launched OmAgent, the first embodied AI agent and a reinforcement learning-based multimodal agent framework. Its feasibility has been verified in practical applications.
Read more →
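July 17, 2025
OmAgent - A Reinforcement Learning-based Multimodal Agent

As the capabilities of Large Language Models (LLMs) and Vision Language Models (VLMs) advance rapidly, AI technology is shifting from "exam-style" benchmark attainment to "real-world" complex problem-solving. Using LLMs and VLMs to solve more practical and more complex problems, rather than simply passing "exams", is both the inevitable direction of technical evolution and a core requirement for industrial deployment. We are releasing OmAgent, the world's first embodied AI agent and a reinforcement learning-based multimodal agent framework, and we have verified the feasibility of this approach in real applications.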
Read more →
March 24, 2025
Trials, Errors, and Breakthroughs: Our Rocky Road to OVD SOTA with Reinforcement Learning

Key insights from our extensive experimentation with Reinforcement Learning for object detection in Vision Language Models, focusing on training methodologies, reward functions, and prompt engineering.
Read more →
March 20, 2025
Improving Object Detection through Reinforcement Learning with VLM-R1

A detailed exploration of how reinforcement learning enhances object detection performance compared to supervised fine-tuning in vision-language models.
Read more →
