Trials, Errors, and Breakthroughs: Our Rocky Road to OVD SOTA with Reinforcement Learning

Published on March 24, 2025

TL;DR

This blog presents comprehensive experimental findings on optimizing Vision Language Models (VLMs) for object detection through reinforcement learning, focusing on the Qwen2.5-VL-3B and 7B models.

The findings provide practical guidelines for VLM training optimization, particularly highlighting the importance of reward engineering and model-specific prompt strategies. Results demonstrate that careful tuning of these elements can significantly enhance object detection performance while maintaining model efficiency.

Introduction

This blog post summarizes our key findings and insights from extensive experimentation with Reinforcement Learning (RL) for object detection in Vision Language Models (VLMs), specifically focusing on Qwen2.5-VL-3B and 7B. Our experiments covered various aspects including training methodologies, data preparation strategies, reward functions, and prompt engineering.

1. Training Options Overview

1.1 Training Datasets

1.2 Evaluation Datasets

1.3 RL Reward Options

1.4 Training Methods

1.5 Prompt Formats

2. Evaluation Datasets Options

2.1 COCO Dataset Evaluation Modes

COCO_filtered (COCO_pos_1) Dataset
The COCO_filtered dataset is created from the COCO dataset's instances_val2017.json file. It filters out categories with more than 10 annotation boxes, so that only categories with 10 or fewer boxes are included.

COCO_pos_2 Dataset
The COCO_pos_2 dataset focuses on images where each category has only one bounding box.

COCO_pos_3 Dataset
The COCO_pos_3 dataset is also derived from the COCO dataset's instances_val2017.json file. It ensures that the total number of boxes across all categories does not exceed 10.

In summary, COCO evaluation uses these three filtered subsets: COCO_pos_1 (COCO_filtered), COCO_pos_2, and COCO_pos_3.
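As an illustration, here is a minimal sketch of how such filtered subsets could be built from instances_val2017.json; the per-image grouping and exact filtering details are assumptions based on the descriptions above, not our actual preprocessing script.

```python
import json
from collections import defaultdict

# Minimal sketch; per-image grouping and thresholds are assumptions.
with open("instances_val2017.json") as f:
    coco = json.load(f)

# Group annotation boxes per image and per category.
boxes = defaultdict(lambda: defaultdict(list))  # image_id -> category_id -> [bbox, ...]
for ann in coco["annotations"]:
    boxes[ann["image_id"]][ann["category_id"]].append(ann["bbox"])

coco_pos_1, coco_pos_2, coco_pos_3 = [], [], []
for image_id, cats in boxes.items():
    # COCO_pos_1 / COCO_filtered: drop categories with more than 10 boxes.
    kept = {c: b for c, b in cats.items() if len(b) <= 10}
    if kept:
        coco_pos_1.append((image_id, kept))
    # COCO_pos_2: images where each present category has exactly one box.
    if all(len(b) == 1 for b in cats.values()):
        coco_pos_2.append((image_id, cats))
    # COCO_pos_3: total number of boxes across all categories is at most 10.
    if sum(len(b) for b in cats.values()) <= 10:
        coco_pos_3.append((image_id, cats))
```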

2.2 OVD Dataset Evaluation Modes

3. Training Dataset Findings

3.1 Small Object Detection Trade-offs

When training with the mAP reward on small objects (area < 0.5% of the image) versus regular single-box data:

| Training Data | COCO_filtered (mAP) | Precision (IoU=0.5) | Recall (IoU=0.5) |
|---|---|---|---|
| Small objects only | 22.3 | 71.39 | 43.36 |
| Regular single-box | 21.1 | 55.02 | 49.46 |

Key findings: training only on small objects yields a modest mAP gain (22.3 vs. 21.1) and a large precision gain (71.39 vs. 55.02), but at the cost of recall (43.36 vs. 49.46), a clear precision-versus-recall trade-off.

3.2 Single Region Label Impact

Descriptions of the Single Region Label and OD (Object Detection) formulations:

Single Region Label:
The Pure Single Region method takes a different approach. Instead of detecting an object and its boundaries, it provides the model with a coordinate box and asks it to classify which object, from a list of predefined options, is located within that region. Essentially, this is a multiple-choice question: given a location (a coordinate box) in the image, the model must choose the correct object category from a set of possible options. This method emphasizes classification at specific coordinates, matching the region to the appropriate object class.
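For illustration, a hypothetical Single Region Label prompt might look like the following (our own sketch, not the exact prompt used in training):

Given the region with coordinates [x1, y1, x2, y2] = [120, 85, 240, 310] in the image, which object is located inside this box? Options: (A) plastic chair (B) iron door (C) woodpecker (D) person. Answer with the correct option.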

OD (Object Detection):
The OD approach focuses on detecting the position and boundaries of an object within an image. The model identifies the object and its location by drawing bounding boxes, and performance is measured using metrics like mAP and IoU. OD not only classifies the object but also determines its position in the image.

Comparison of different training approaches:

| Training Approach | COCO_filtered (mAP) | Precision (IoU=0.5) | Recall (IoU=0.5) |
|---|---|---|---|
| Pure Single Region | 18.4 | 66.01 | 34.57 |
| Mixed OD + Single Region | 21.7 | 63.98 | 46.75 |
| Pure OD (Single class) | 23.8 | 65.58 | 48.92 |

Key findings: pure OD training on single-class data performs best (23.8 mAP, 48.92 recall); mixing in Single Region Label data does not help, and pure Single Region training lags well behind on detection metrics (18.4 mAP).

4. RL Reward Function Optimization

4.1 Comparison of Different Reward Functions

| Training data | Reward method | COCO_pos_3 (mAP) | Precision (IoU=0.5) | Recall (IoU=0.5) |
|---|---|---|---|---|
| OVDEval | AP50 | 27.1 | 70.17 | 58.05 |
| OVDEval | AP | 27.4 | 71.11 | 57.86 |
| OVDEval | weighted_sum | 27.5 | 71.32 | 57.72 |
| OVDEval | weighted_sum + cosine reward | 29.3 | 76.95 | 59.11 |
| OVDEval | weighted_sum + cosine reward + repetition reward | 28.8 | 77.08 | 57.98 |

1. AP vs. AP50 for the bounding box reward

In our experiments, we used OVDEval as the training dataset and COCO_pos_3 for validation. Switching the reward from AP50 to AP improved mAP from 27.1 to 27.4. This comparison suggests that AP outperforms AP50: AP50, with its binary thresholding, offers limited feedback, whereas AP provides a more granular, continuous spectrum of feedback. This continuous signal lets the model refine its bounding box predictions, leading to better precision and overall performance.
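As a simplified illustration of the difference in feedback granularity (not the full COCO AP computation, which also ranks predictions by confidence and integrates the precision-recall curve), assume matched_ious holds the IoUs of matched predicted/ground-truth pairs:

```python
import numpy as np

def ap50_style_reward(matched_ious):
    """Binary feedback: a matched box only counts if its IoU >= 0.5."""
    if not matched_ious:
        return 0.0
    return float(np.mean([iou >= 0.5 for iou in matched_ious]))

def ap_style_reward(matched_ious, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Graded feedback: average the hit rate over IoU thresholds 0.50-0.95,
    so a box with IoU 0.85 earns more reward than one with IoU 0.55."""
    if not matched_ious:
        return 0.0
    per_thr = [np.mean([iou >= t for iou in matched_ious]) for t in thresholds]
    return float(np.mean(per_thr))
```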

2. Reward rule design

During our exploration, we designed a customized reward rule for the object detection task. It focuses on two aspects: the spatial overlap between predicted and ground-truth boxes, and the overall detection completeness. Spatial overlap is measured with Intersection over Union (IoU), the ratio of the area of overlap to the area of union between two boxes; a higher IoU indicates that the predicted box is spatially closer to the ground-truth box. Detection completeness accounts for both missed detections and false alarms. Missed detections are real targets that were not detected, while false alarms are non-target areas that were incorrectly detected. Completeness is assessed by calculating the miss rate and false alarm rate, with the completeness score computed as

completeness = 1 - (miss rate + false alarm rate)/2

This reflects the overall thoroughness of the detection. To form a comprehensive score, the reward combines position accuracy and detection completeness through a weighted average. When matching predicted and ground-truth boxes, a greedy matching strategy is employed: the pair of boxes with the highest IoU is selected iteratively until no more satisfactory matches can be made. Note that if the predicted label does not match the ground-truth label, the IoU for that pair is set to 0 regardless of the spatial overlap, which lowers the final score. This multi-dimensional reward, which we call weighted_sum, provides a comprehensive metric for object detection performance. We used this reward in training and compared it against the AP reward; the results show that the weighted_sum reward further boosts the mAP score to 27.5.
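Below is a minimal sketch of this weighted_sum reward; the 50/50 weighting between position accuracy and completeness is an assumption, and the matching details in our actual implementation may differ.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def weighted_sum_reward(preds, gts, pos_weight=0.5):
    """Sketch of the weighted_sum reward.
    preds / gts: lists of {"bbox_2d": [x1, y1, x2, y2], "label": str}.
    pos_weight is an assumed weighting between position accuracy and completeness."""
    if not gts:
        return 1.0 if not preds else 0.0
    if not preds:
        return 0.0
    # Pairwise IoU, forced to 0 when the labels disagree.
    ious = np.zeros((len(preds), len(gts)))
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            if p["label"] == g["label"]:
                ious[i, j] = iou(p["bbox_2d"], g["bbox_2d"])
    # Greedy matching: repeatedly take the highest-IoU remaining pair.
    matched_ious, used_p, used_g = [], set(), set()
    while True:
        best, bi, bj = 0.0, -1, -1
        for i in range(len(preds)):
            for j in range(len(gts)):
                if i not in used_p and j not in used_g and ious[i, j] > best:
                    best, bi, bj = ious[i, j], i, j
        if bi < 0:
            break
        matched_ious.append(best)
        used_p.add(bi); used_g.add(bj)
    position = float(np.mean(matched_ious)) if matched_ious else 0.0
    miss_rate = (len(gts) - len(used_g)) / len(gts)
    false_alarm_rate = (len(preds) - len(used_p)) / len(preds)
    completeness = 1 - (miss_rate + false_alarm_rate) / 2
    return pos_weight * position + (1 - pos_weight) * completeness
```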

3. Cosine reward and repetition reward

Inspired by the paper "Demystifying Long Chain-of-Thought Reasoning in LLMs" (https://arxiv.org/abs/2502.03373), we implemented cosine and repetition rewards in our experiments. The cosine reward encourages a more efficient reasoning process by promoting shorter responses when a correct answer has already been reached. Conversely, for incorrect answers, the cosine reward incentivizes further exploration until the correct answer is found. This approach helps the response length converge to a reasonable size while avoiding unnecessary reasoning.
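For reference, here is a minimal sketch of a cosine length reward in this spirit, following the cited paper; the reward bounds and maximum length below are illustrative assumptions, not the values used in our training.

```python
import math

def cosine_length_reward(is_correct, gen_len, max_len=2048,
                         r_c0=1.0, r_cL=0.5, r_w0=-0.5, r_wL=0.0):
    """Length-scaled reward on a cosine schedule (illustrative constants).
    Correct answers: reward decays from r_c0 (short) to r_cL (long),
    favoring concise responses once the answer is right.
    Wrong answers: reward rises from r_w0 (short) to r_wL (long),
    so early give-ups are penalized more than continued exploration."""
    t = min(gen_len, max_len) / max_len
    cos_t = math.cos(t * math.pi)  # 1 at t = 0, -1 at t = 1
    start, end = (r_c0, r_cL) if is_correct else (r_w0, r_wL)
    return end + 0.5 * (start - end) * (1 + cos_t)
```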

In addition, we implemented a repetition reward to penalize repeated n-gram patterns, as we observed instances of repetition in generated responses. Given that our focus is on the open-vocabulary detection (OVD) task, multiple objects are likely to be present in a single image, so we parse the OVD output JSON and compute the n-grams over the elements it contains.
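A minimal sketch of such a repetition penalty over the parsed OVD JSON follows; the n-gram size, penalty weight, and extraction logic are assumptions.

```python
import json
import re

def repetition_penalty(completion, n=3, penalty=-0.05):
    """Penalize repeated n-gram patterns among the detected objects.
    Each object entry in the OVD JSON output is treated as one token."""
    # \x60 is the backtick character, so this matches a fenced json block.
    match = re.search(r"\x60{3}json\s*(.*?)\s*\x60{3}", completion, re.DOTALL)
    if not match:
        return 0.0
    try:
        objects = json.loads(match.group(1))
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(objects, list) or len(objects) < n:
        return 0.0
    tokens = [json.dumps(obj, sort_keys=True) for obj in objects]
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    repeated = len(ngrams) - len(set(ngrams))
    return penalty * repeated
```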

We found that, with cosine and repetition rewards, a single training run converges to one consistent reasoning pattern, while different training runs may converge to distinct patterns. Below, we present several interesting reasoning patterns observed during our experiments.

problem: 
 Please carefully observe the image and detect the following targets: plastic chair.Output each detected targets' bbox coordinates in JSON format. The format of the bbox coordinates is: ```json
[{"bbox_2d": [x1, y1, x2, y2], "label": "target name"}, {"bbox_2d": [x1, y1, x2, y2], "label": "target name"}]
```. If there are no such targets in the image, simply respond with None.
Content: <think>
4
</think>
<answer>
```json
[
    {"bbox_2d": [38, 95, 113, 164], "label": "plastic chair"},
    {"bbox_2d": [130, 90, 177, 150], "label": "plastic chair"},
    {"bbox_2d": [174, 89, 222, 140], "label": "plastic chair"},
    {"bbox_2d": [222, 86, 258, 135], "label": "plastic chair"}
]
```
</answer>
problem: 
 Please carefully observe the image and detect the following targets: iron door.Output each detected targets' bbox coordinates in JSON format. The format of the bbox coordinates is: ```json
[{"bbox_2d": [x1, y1, x2, y2], "label": "target name"}, {"bbox_2d": [x1, y1, x2, y2], "label": "target name"}]
```. If there are no such targets in the image, simply respond with None.
Content: <think>
image.
</think>
<answer>
```json
[
    {"bbox_2d": [97, 91, 406, 718], "label": "iron door"}
]
```
</answer>
problem: Please carefully check the image and detect the following objects: ['a fisherman standing on the shore with his body soaked in water', 'a street food truck without an umbrella', 'person covered with armor', 'woodpecker on the side of the trunk', 'unused microphone stand on stage']. Output the bbox coordinates of detected objects in <answer></answer>. The bbox coordinates in Markdown format should be: 
```json
[{"bbox_2d": [x1, y1, x2, y2], "label": "object name"}]
```
 If no targets are detected in the image, simply respond with "None".
Content: <think>
I carefully analyzed the image to locate specific objects.
</think>
<answer>
```json
[{"bbox_2d": [294, 221, 542, 442], "label": "woodpecker on the side of the trunk"}]
```
</answer>

When we introduced the cosine reward, we noticed a notable improvement in performance, with mAP rising to 29.3. This increase is accompanied by a significant boost in precision, which reaches 76.95, and a slight improvement in recall to 59.11. When we further add the repetition reward to the weighted_sum and cosine reward combination, mAP slightly decreases to 28.8: precision remains high at 77.08, but recall drops to 57.98. This suggests that while the repetition reward may help reduce redundancy in predictions, it can also make the model overly conservative, missing some relevant instances. Overall, the results indicate that adding the cosine reward significantly enhances the model's performance, particularly in terms of precision.

5. Prompt Engineering Findings

We ran a series of experiments on the 3B model with OVDEval and made the following discoveries.

3B Format Results

Format rewards on 3B model during training

3B Accuracy Results

Accuracy rewards on 3B model during training

5.1 Impact of Minor Grammatical Issues

Even small grammatical issues (like a missing space) significantly affected training convergence on the 3B model.
At the beginning of training the 3B model on the OVDEval data (with the grammatical problems and the standard prompt), we observed that the accuracy reward did not start rising when the format reward finished converging (around step 15); it only began to rise around step 125. After checking the training configuration, we found that the dataset had missing spaces in the prompts. We fixed this problem and trained again, and the accuracy reward indeed started rising earlier, at around step 65. However, while the fix made the accuracy reward rise faster, it eventually converged to the same value (around 0.7).

Example of problematic data:

Missing space data

<image>\n Please carefully observe the image and detect the following targets: person sit on motorcycle; motorcycle is sat on by person.Output each detected targets' bbox coordinates in JSON format. The format of the bbox coordinates is: \`\`\`json\n\[{\"bbox\_2d\": \[x1, y1, x2, y2\], \"label\": \"target name\"}, {\"bbox\_2d\": \[x1, y1, x2, y2\], \"label\": \"target name\"}\]\n\`\`\`. If there are no such targets in the image, simply respond with None.

Fixed data

<image>\n Please carefully observe the image and detect the following targets: person sit on motorcycle; motorcycle is sat on by person. Output each detected targets' bbox coordinates in JSON format. The format of the bbox coordinates is: \`\`\`json\n\[{\"bbox\_2d\": \[x1, y1, x2, y2\], \"label\": \"target name\"}, {\"bbox\_2d\": \[x1, y1, x2, y2\], \"label\": \"target name\"}\]\n\`\`\`. If there are no such targets in the image, simply respond with None.
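A minimal sketch of the kind of cleanup applied to the prompts, assuming the only issue is a missing space after a sentence-ending period (as in "person.Output" above):

```python
import re

def fix_missing_space(prompt: str) -> str:
    """Insert a space after a period that is immediately followed by an
    uppercase letter, e.g. "...by person.Output each..." becomes
    "...by person. Output each..."."""
    return re.sub(r"\.(?=[A-Z])", ". ", prompt)
```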

5.2 Prompt Structure Optimization

Prompt modifications can improve the efficiency of training convergence on the 3B model.
Although the accuracy reward started converging faster (around step 65) after we corrected the data, it still lagged behind the format reward (around step 15). We then modified the prompt without changing any other training settings, and the accuracy reward finally began rising immediately after the format reward finished converging (around step 15). The prompt modification still did not improve the final accuracy reward value (around 0.7), although the convergence speed increased again during training.

Enhanced prompt

"First thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"

5.3 Different Robustness of the 7B and 3B Models to the Prompt

7B Format Results

Format rewards on 7B model during training

7B Accuracy Results

Accuracy rewards on 7B model during training

The 7B model is more robust to the prompt, which brings stronger initial format adherence.
We ran the same experiment (fixed data and enhanced prompt) on the 7B model as on the 3B model, and the training rewards turned out to be almost the same: the accuracy reward rose along with the format reward right from the beginning. Unlike the 3B model, whose training behavior can change significantly with minor prompt adjustments, the 7B model exhibits relatively low sensitivity to prompt modifications, owing to its stronger generalization ability and higher parameter capacity. This is also reflected in the 7B model's stronger initial format adherence (an initial format score of 0.75, compared to 0.4 for the 3B model).

5.4 System Prompt Variations

For instruct models, system prompts had minimal effect on format learning
We additionally tried training with the enhanced prompt passed in as the built-in system prompt, and found that the format reward barely increased. This validates the point made in "MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning" (https://arxiv.org/abs/2503.07365): for an instruct model, we should retain the model's built-in system prompt and include the format-related information in the user prompt, whereas for a base model, the format information should go in the system prompt.
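As a sketch of the two placements under a standard chat-message layout (the default system prompt string here is a placeholder, not necessarily the one we used):

```python
# The format instruction is the enhanced prompt quoted in section 5.2.
FORMAT_INSTRUCTION = (
    "First thinks about the reasoning process in the mind and then provides "
    "the user with the answer. The reasoning process and answer are enclosed "
    "within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

def build_messages(user_prompt: str, instruct_model: bool = True):
    """Instruct model: keep the built-in system prompt and put the format
    instruction in the user turn. Base model: put it in the system turn."""
    if instruct_model:
        return [
            # Placeholder; in practice keep the model's own default system prompt.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": FORMAT_INSTRUCTION + "\n" + user_prompt},
        ]
    return [
        {"role": "system", "content": FORMAT_INSTRUCTION},
        {"role": "user", "content": user_prompt},
    ]
```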

| Model | COCO_pos_1 (mAP) | Precision (IoU=0.5) | Recall (IoU=0.5) |
|---|---|---|---|
| 3B base (Qwen2.5-VL-3B-Instruct) | 23.6 | 70.14 | 47.69 |
| 3B (fixed data & enhanced prompt) | 24.5 | 68.2 | 50.27 |
| 7B base (Qwen2.5-VL-7B-Instruct) | 24.6 | 71.53 | 50.56 |
| 7B (fixed data & enhanced prompt) | 24.6 | 68.1 | 55.15 |

We finally evaluated our models on COCO_pos_1. The 7B model showed no improvement in mAP, while the 3B model gained some. Perhaps with the same training data (OVDEval), the 3B model, with fewer parameters, has more headroom for improvement than the 7B model, which starts from a stronger base capability.

However, all of the above findings were obtained by training on the OVDEval dataset, and its data characteristics may have had some impact. In the future, we will continue these experiments on other datasets to verify the findings.

6. RL vs SFT Training Comparison

6.1 Basic Performance Comparison

We conducted an initial comparison between basic Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) approaches using identical training data (COCO) and the same base model (Qwen2.5-VL-3B). In this comparison, we used the mAP reward for reinforcement learning. Note that these results reflect a basic RL implementation without advanced techniques such as KL-divergence adjustment (KL=0) or output-length control rewards, which were explored later.

| Training Method | REFCOCO | REFGTA | COCO_filtered (mAP) |
|---|---|---|---|
| Base (Qwen2.5-VL-3B) | 73.73 | 71.8 | 23.7 |
| Basic RL Training | 73.87 | 67.4 | 23.5 |
| SFT Training | 83.20 | 70.4 | 25.5 |

Note: These initial results showed SFT outperforming basic RL without advanced techniques. Later experiments with optimized RL rewards showed substantial improvements over these baseline RL results.

6.2 Token Generation Analysis

| Model | Min Tokens | Max Tokens (excluding ≥3000) | Average Tokens (excluding ≥3000) | Records with Tokens ≥3000 |
|---|---|---|---|---|
| Base Model | 34 | 1892 | 139.752 | 281 |
| RL Model | 79 | 2999 | 240.563 | 534 |
| SFT Model | 48 | 2999 | 192.259 | 238 |

Key observation: SFT-trained models tend to be more concise in their outputs while maintaining better performance. RL with the mAP reward tended to generate more verbose outputs, potentially "hacking" the reward by generating more bounding boxes.
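For reference, a minimal sketch of how the token statistics above could be computed; the tokenizer checkpoint and the in-memory list of completion strings are assumptions.

```python
from transformers import AutoTokenizer

# Assumes the text tokenizer of the base model used in these experiments.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

def length_stats(completions, cutoff=3000):
    """Token-length statistics; records at or above the cutoff are excluded
    from the max/average, matching the table above."""
    lengths = [len(tokenizer.encode(text)) for text in completions]
    kept = [n for n in lengths if n < cutoff]
    return {
        "min_tokens": min(lengths),
        "max_tokens_excl_cutoff": max(kept),
        "avg_tokens_excl_cutoff": sum(kept) / len(kept),
        "records_ge_cutoff": sum(n >= cutoff for n in lengths),
    }
```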

7. Other Implementation Findings

7.1 Trainer Consistency Verification

Multiple implementations were tested for consistency:

| Code Version | SuperCLEVR Test Score |
|---|---|
| Old Version [734e46] | 85 |
| New Version [a301eb] | 87 |
| VLLM Version [85d9f4] | 87 |

All implementations showed consistent performance, validating the codebase's stability.

7.2 Model Size Impact (3B vs 7B)

Performance comparison on OVDEVAL:

| Model | OVDEval Avg (mAP) | Proper Noun Avg | Attribute Avg | Position | Relationship | Negation |
|---|---|---|---|---|---|---|
| 7B Best | 45.27 | 53.33 | 25.10 | 65.4 | 26.1 | 56.4 |
| 3B Best | 43.43 | 51.87 | 24.70 | 63.1 | 26.6 | 50.9 |

7B models generally performed slightly better, though the difference was not dramatic. The 7B model had particular advantages on negation tasks.

7.3 Completion Length and Batch Size

8. Key Takeaways and Best Practices

Training Methodology

Data Preparation

Reward Selection

Prompt Engineering

Implementation

Future Research Directions

We continue to explore new approaches and welcome community feedback and contributions.