VLX-Seek: Improving VLM Fine-Grained Perception via Region Reference Instead of Coordinate Generation
VLX-Seek improves fine-grained VLM perception by replacing fragile coordinate generation with region reference. It introduces addressable region tokens, a hybrid fine-grained region encoder, and compact object-centric reasoning for detection, counting, and open-vocabulary localization.