October 2025 | Honolulu, Hawaii | ICCV 2025
The era of large reasoning models (LRMs) has begun, bringing new opportunities and challenges to the computer vision community. The strong semantic intelligence of LLMs and the long-chain reasoning ability of LRMs have opened new frontiers in visual understanding and interpretation.
This workshop aims to bridge the gap between computer vision and large language/reasoning models, focusing on complex tasks requiring advanced reasoning capabilities. We will explore how models can comprehend complex relationships through slow-thinking approaches like Neuro-Symbolic reasoning, Chain-of-Thought, and Multi-step Reasoning, pushing beyond traditional fixed tasks to understand object interactions within complex scenes.
The goal is to bring together perspectives from computer vision, multimodal learning, and large language models to address outstanding challenges in multimodal reasoning and slow thinking in the context of large reasoning models. We hope this fosters more flexible and robust understanding in AI systems.
UT Austin
Meta
Stanford University
Monash University
Alibaba Group
Wuhan University of Technology
ModelScope Community
Nanyang Technological University
University of Oxford
INSAIT Sofia University
Wuhan University of Technology
Tsinghua University
Chinese Academy of Sciences
| Time | Event | Presenter |
|---|---|---|
| 08:25-08:30 | Opening remarks | - |
| 08:30-09:15 | Invited talk and Q&A #1 | Kristen Grauman |
| 09:15-10:00 | Invited talk and Q&A #2 | Ishan Misra (Virtual) |
| 10:00-10:30 | Oral presentation and break | M. A. H. Khan |
| 10:30-11:15 | Invited talk and Q&A #3 | Jiajun Wu |
| 11:15-13:00 | Lunch | - |
| 11:15-13:15 | Poster presentation | Poster area |
| 13:15-14:00 | Invited talk and Q&A #4 | Hamid Rezatofighi |
| 14:30-15:15 | Invited talk and Q&A #5 | Junyang Lin (Virtual) |
| 15:15-16:00 | Invited talk and Q&A #6 | Yaxiong Chen (Virtual) |
Submissions must be in PDF format and conform to the ICCV 2025 proceedings style (double-blind review). Papers may be at most 8 pages, excluding references.
We welcome submissions of:
Awards will be distributed to top performers in each track.
Generously sponsored by:
Visual Grounding in Real-world Scenarios
Evaluating scene perception, object localization, and spatial reasoning.
Visual Question Answering with Spatial Awareness
Evaluating spatial, commonsense, and counterfactual reasoning.
Visual Reasoning in Creative Advertisement Videos
Evaluating cognitive reasoning abilities in advertisement videos.
We will respond to all competition registration and inquiry emails promptly. If you do not receive a reply within 12 hours, please resend your email; it may have been caught in a spam filter or failed to deliver.
The competition will feature a custom-made dataset with over 2K images, 1.5K videos, 17K question-answer pairs, and 15K bounding-box annotations.