OmAgent - A Reinforcement Learning-based Multimodal Agent Framework

Published on July 17, 2025

1. Introduction

With the rapid advancement of Large Language Models (LLMs) and Vision Language Models (VLMs), AI technology is shifting from exam-oriented task completion to practical scenario-based complex problem-solving. Using LLMs and VLMs to tackle more realistic and intricate problems—rather than simply passing exams—is not only an inevitable direction of technological evolution but also a key requirement for industrial applications. Meanwhile, the wave of AI agents driven by LLMs continues to expand AI’s application boundaries in the physical world: from GUI-based online shopping operations to physically embodied robots performing household chores. Enabling agents to perceive environments, plan, make decisions, and interact like humans has become a shared challenge for academia and industry.

What users truly need are general-purpose agents capable of delivering results and completing tangible tasks in the physical world. Guided by this goal, we focus on a practical technical roadmap: building AI agents that can solve a wide range of problems in the physical world and can be deployed on devices as the core brain component. In the future, devices such as smartphones, cameras, robots, and drones are expected to become embodied AI agents applied across diverse fields such as industrial management, medical diagnosis, personal assistance, and media creation. Realizing such embodied AI agents requires four core capabilities: visual perception, decision-making & execution, semantic interaction, and spatiotemporal memory. Of these, semantic interaction has been initially addressed by current LLMs, while the other three remain open challenges and opportunities for technological innovation. In February of this year, we released the reinforcement learning-driven VLM-R1 model, which received widespread attention. By extending DeepSeek's reasoning capabilities from natural language to vision-language scenarios, we validated the effectiveness of reinforcement learning in enhancing VLMs' visual perception and reasoning in complex environments. Recently, we further extended this work to decision-making & execution and, combining it with a multimodal agent framework, launched our first embodied AI agent: OmAgent, a reinforcement learning-based multimodal agent framework whose feasibility has been verified in practical applications.

OmAgent Roadmap

2. Method

OmAgent is a reinforcement learning-based multimodal agent framework. The core philosophy of this framework is simplifying complexity through abstraction. It encapsulates complex engineering implementations (such as spatiotemporal memory management, workflow orchestration, task queues, and node optimization) in the background, providing developers with a highly streamlined and user-friendly Python interface. It features the following characteristics:

OmAgent Architecture

Native Multimodal Support

Reusable Component Abstraction

Zero-Complexity Development Experience

OmAgent serves as a core component for smart devices. With native multimodal support and a zero-complexity development experience, it can be easily applied to a wide range of devices. In addition, built-in algorithm modules and toolkits enable rapid resolution of challenges related to environmental perception, interaction, decision-making & execution, and memory storage. To meet the requirements for visual perception and decision-making, we integrate reinforcement learning models into OmAgent, enabling devices to maintain robust environmental perception and effective decision-making in dynamic, complex environments.
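To give a feel for the intended developer experience, the toy sketch below mimics the shape of such an interface: small multimodal "workers" are registered on an agent object, while orchestration, task queues, and memory stay hidden behind the scenes. The class, decorator, and method names here are illustrative assumptions, not the actual OmAgent API; the real interface is documented in the open-source repository.

```python
# Hypothetical sketch of a minimal agent interface (names are illustrative, not the OmAgent API).
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Agent:
    """Toy stand-in for an agent that chains multimodal workers over shared state."""
    workers: List[Callable[[Dict[str, Any]], Dict[str, Any]]] = field(default_factory=list)
    memory: Dict[str, Any] = field(default_factory=dict)

    def worker(self, fn: Callable[[Dict[str, Any]], Dict[str, Any]]):
        """Register a processing step; orchestration details stay hidden from the developer."""
        self.workers.append(fn)
        return fn

    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        state = {**self.memory, **inputs}
        for fn in self.workers:
            state.update(fn(state))
        self.memory.update(state)  # naive stand-in for spatiotemporal memory: persist state across runs
        return state

agent = Agent()

@agent.worker
def perceive(state: Dict[str, Any]) -> Dict[str, Any]:
    # A real deployment would run a perception model (e.g. OmDet/OmR1) on state["frame"].
    return {"objects": ["person", "forklift"]}

@agent.worker
def decide(state: Dict[str, Any]) -> Dict[str, Any]:
    # Decision-making & execution step: raise an alert when a risky combination is seen.
    return {"alert": "person" in state["objects"] and "forklift" in state["objects"]}

print(agent.run({"frame": "camera_frame.jpg"}))  # -> {..., 'objects': [...], 'alert': True}
```

The point of the sketch is the division of labor: developers write small, reusable steps, and the framework is responsible for everything the steps do not mention.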

2.1 Breaking Through Visual Perception Capabilities – Reinforcement Learning-Driven Advanced Environmental Perception

Our technical expertise in visual perception dates back to the release of the OmDet model series in 2021. Through multiple iterations, the model evolved from early attribute- and relation-based general perception and detection to an efficient recognition mode driven by natural language instructions. Combined with lightweight training and deployment solutions, OmDet achieves open-domain detection and understanding of the surrounding environment. In 2023, we launched OmChat, continuously enhancing its perception and interaction capabilities in vision-language hybrid environments. In early 2025, leveraging technical breakthroughs from DeepSeek, we successfully introduced reinforcement learning into VLMs and released VLM-R1. VLM-R1 significantly outperforms traditional supervised learning methods on various visual perception tasks such as object detection, marking an "aha moment" of cognitive breakthrough in the visual domain similar to the one observed in language models. Notably, through training and validation across multiple tasks, our reinforcement learning-based OmR1 model demonstrates excellent generalization in cross-task scenarios, providing flexible technical support for visual perception and decision-making in complex environments.

VLM-R1 Technical Innovations

VLM-R1's core technical innovations include:

In our research, we observed the emergence of an "OD aha moment" in VLMs, an intelligent behavior that develops spontaneously during reinforcement learning training.
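As a concrete illustration of the rule-based rewards that drive this kind of training, the sketch below shows a detection reward of the sort used in R1-style reinforcement learning for VLMs: a format term that checks the think/answer structure of the output, plus an accuracy term based on the IoU between the predicted and ground-truth boxes. The tag format, regular expressions, and equal weighting are illustrative assumptions and may differ from the exact rewards used in VLM-R1 and OmR1.

```python
# Sketch of a rule-based reward for RL training of a detection VLM (assumptions noted above).
import re
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(completion: str, gt_box: Box) -> float:
    """Format reward (did the model emit <think>/<answer> tags?) plus IoU accuracy reward."""
    format_ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.S) is not None
    format_reward = 1.0 if format_ok else 0.0

    m = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", completion)
    accuracy_reward = iou(tuple(map(float, m.groups())), gt_box) if m else 0.0

    # In GRPO-style training, these per-completion rewards are converted into
    # group-relative advantages over a batch of sampled completions.
    return format_reward + accuracy_reward
```

Because the reward depends only on verifiable rules (output structure and box overlap), no learned reward model is needed, which is part of what makes this recipe practical for visual tasks.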

2.2 Embodied Intelligent Decision-Making & Execution – Simulating Human-Environment Interaction

Autonomous decision-making and execution based on environmental perception are the second major challenge in embodied intelligence. In the OmAgent framework, in addition to relying on VLMs for task decomposition, planning, and invoking basic capabilities via MCP, we further simulate the logic of human interaction with the external environment and propose the ZoomEye algorithm, designed to enhance VLMs' interaction capabilities in high-resolution environments. Its core idea is to replicate the zooming behavior humans use when observing an environment: just as human eyes first scan the overall scene and then focus on details, the model progressively explores and deeply analyzes key information in the environment through similar step-by-step exploration. Its core innovations include:

ZoomEye Tree Search Algorithm
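The figure above illustrates the tree search. As a simplified, concrete illustration, the sketch below implements a best-first zoom over image crops: the VLM's confidence that the queried detail is visible in a crop decides which region to zoom into next, and the search stops once the model is confident enough to answer from the current crop. The quadrant split, scoring signal, and stopping rule are assumptions for illustration; the actual ZoomEye expansion and ranking rules may differ.

```python
# Simplified sketch of a ZoomEye-style best-first zoom over image regions (assumptions noted above).
import heapq
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def quadrants(box: Box) -> List[Box]:
    """Split a region into four quadrants: one zoom-in step."""
    l, t, r, b = box
    cx, cy = (l + r) // 2, (t + b) // 2
    return [(l, t, cx, cy), (cx, t, r, cy), (l, cy, cx, b), (cx, cy, r, b)]

def zoom_search(image_box: Box,
                vlm_score: Callable[[Box], float],  # VLM confidence that the queried detail is visible in this crop
                threshold: float = 0.8,
                max_depth: int = 4) -> Box:
    """Best-first search: keep expanding the most promising crop until the
    model is confident enough (or the tree is deep enough) to answer from it."""
    frontier = [(-vlm_score(image_box), 0, image_box)]  # max-heap via negated scores
    while frontier:
        neg_score, depth, box = heapq.heappop(frontier)
        if -neg_score >= threshold or depth >= max_depth:
            return box
        for child in quadrants(box):
            heapq.heappush(frontier, (-vlm_score(child), depth + 1, child))
    return image_box
```

A caller would implement `vlm_score` by prompting the VLM with the question and the cropped region, then crop the original high-resolution image to the returned box before producing the final answer.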

3. Performance Evaluation of OmAgent in Industrial Applications

To verify the practical performance of OmAgent, we conducted comparative tests between OmAgent and mainstream VLMs on three industrial scenarios—open detection, visual cognition (complex event detection), and doc parsing (complex multimedia document understanding)—using a hardware environment equipped with 8 × 80G A100 GPUs.

Industry Evaluation Comparison

Open Detection Scenario

| Model | Vendor | mAP | Latency (s/frame) | QPS | Cost (RMB/1000 frames) | Avg Output Tokens |
|---|---|---|---|---|---|---|
| OmDet (1B) | Om AI | 30.80 | 0.01 | 800 | 0.02 | - |
| OmR1 (3B) | Om AI | 34.81 | 1.51 | 45.73 | 0.42 | 149.25 |
| GPT-4o | OpenAI | 1.26 | 2.73 | 4.81 | 16.5 | 58.67 |
| Qwen2.5VL-32B | Alibaba | 32.30 | 3.31 | 9.68 | 1.99 | 127.39 |

In the performance evaluation of open detection tasks, we used OVDEval as the evaluation dataset, which covers diverse general detection capabilities in open scenarios, including object attributes, small objects, and non-existent objects. First, as our ultra-lightweight solution, OmDet achieves an excellent 30.80 mAP with only 1B parameters, while delivering a latency of 0.01 seconds and a QPS of up to 800, providing an efficient solution for real-time scenarios. By introducing reinforcement learning into VLMs, OmR1 can recognize more complex objects and categories through reasoning, reaching 34.81 mAP, significantly outperforming the other models and verifying the potential of reinforcement learning in VLMs. Another notable achievement is our breakthrough in cost control: OmDet's processing cost is only 0.02 yuan per 1000 frames, 825 times lower than that of GPT-4o, and OmR1, at a 3B model size, is about 38 times cheaper than GPT-4o.

Complex Event Judgment

| Model | Vendor | Precision | Latency (s/frame) | QPS | Cost (RMB/1000 frames) | Avg Output Tokens |
|---|---|---|---|---|---|---|
| OmR1 (3B) | Om AI | 80.74% | 3.02 | 6.56 | 2.94 | 174.45 |
| Qwen2.5VL-32B | Alibaba | 74.01% | 3.77 | 2.08 | 8.46 | 31.38 |
| GPT-4o | OpenAI | 67.29% | 4.68 | 4.09 | 28.4 | 32.68 |

Visual cognition (complex event detection) is a general event-judgment task for surveillance scenarios, focusing on intelligent analysis across different environments. In this task, users can customize management rules for different cases and flexibly define complex abnormal events through instructions. Based on these definitions, the built-in agent must understand the environment, analyze anomalies, and accurately mark the abnormal areas in images. The reinforcement learning-based OmR1 model also performs strongly in this industrial application, achieving 80.74% precision and significantly outperforming much larger models. Through reasoning, OmR1 outputs an average of 174.45 tokens, enabling more detailed and in-depth analysis of complex events. From a cost-effectiveness perspective, OmR1 reduces processing costs by nearly 90% compared to GPT-4o, demonstrating strong practical value in real-world applications.
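To make the task setup concrete, the hypothetical example below shows the kind of user-defined rule and structured result involved; the rule wording, field names, and coordinates are illustrative and do not reflect the exact schema used in our evaluation.

```python
# Hypothetical rule definition and the kind of structured judgment the agent returns.
rule = (
    "Raise an alert if a person enters the loading zone while a forklift is moving, "
    "unless the person stays inside the marked pedestrian walkway."
)

result = {
    "violation": True,  # the binary judgment scored as precision in the table above
    "reasoning": "A worker is standing beside a moving forklift, outside the walkway.",
    "regions": [{"label": "worker", "bbox": [412, 230, 498, 470]}],  # abnormal area marked on the frame
}
```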

Complex Multimedia Document Understanding

| Model | Vendor | Accuracy | Latency (s/page) | QPS | Cost (RMB/1000 pages) |
|---|---|---|---|---|---|
| OmDoc (1B) | Om AI | 77.83% | 0.27 | 299.2 | 0.06 |
| Qwen2.5VL-32B | Alibaba | 74.40% | 4.56 | 3.71 | 5.19 |
| GPT-4o | OpenAI | 76.46% | 8.16 | 0.836 | 71.2 |

Doc parsing (complex multimedia document understanding) tasks focus on parsing, memory storage, and question answering for long documents with complex structural relationships, including tables, figures, and charts. OmDoc, our document agent application, demonstrates significant technical advantages. In terms of accuracy, OmDoc reaches 77.83%, outperforming much larger models. In terms of efficiency, OmDoc keeps processing latency at 0.27 seconds per page, 17 times faster than Qwen2.5VL-32B and 30 times faster than GPT-4o; this sub-second response speed provides a solid technical foundation for real-time document analysis applications. In terms of throughput, OmDoc reaches a QPS of 299.2, offering strong support for large-scale batch processing scenarios. Most notably, OmDoc excels in cost control, with a processing cost of only 0.06 yuan per 1000 pages, 1187 times lower than GPT-4o's 71.2 yuan per 1000 pages.

4. Open-Source Contributions

We have fully opened our core technology system to the open-source community, receiving enthusiastic responses with over 9K stars accumulated on GitHub.

5. Future Directions

The evolution of smart devices is never straightforward, and no single model can solve it alone; it must balance environmental complexity, task diversity, and interaction relevance. Our vision is to inject a complete intelligent persona into every future smart device, with OmAgent as the technical core. We look forward to seeing all devices in the physical world break through their current functional boundaries and transform into embodied agents capable of autonomous perception, proactive decision-making, and continuous evolution. We want agents to play a vital role in fields such as industrial safety management and medical diagnosis, enabling AI to step out of data centers, integrate deeply into the physical world, and become a core driver of industrial upgrading and the transformation of everyday life.

For more technical details and open-source projects, please visit: Om AI Lab GitHub

For technical exchanges and cooperation, please contact us.