Can AI Truly “Think in Pictures”?
Imagine asking an AI, “What’s wrong with this diagram?” Instead of a flat answer, the system walks you through the logic—pointing at components, evaluating relationships, weighing context. That’s the promise of multimodal chain-of-thought (CoT) reasoning: structured, transparent reasoning across modalities.
In 2025, two landmark models, Skywork R1V and GThinker, pushed this frontier forward. Skywork brings chain-of-thought into the visual domain with an efficient, open-source design; GThinker layers in dynamic “cue-guided rethinking” for deeper understanding. In this blog, we unpack their breakthroughs, plain-English implications, risks, and why they’re redefining AI reasoning.
Skywork R1V: Visual Chain-of-Thought Meets Open Access
What Happened
Skywork R1V was introduced in April 2025 as one of the first open-source models fusing visual and reasoning capabilities with chain-of-thought techniques. It extends the R1-series large language model into multimodal territory using a lightweight visual projector, avoiding complete retraining of either text or vision models (arXiv).
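The projector idea can be sketched in a few lines. The block below is an illustrative toy, not Skywork’s actual module: the dimensions (`VISION_DIM`, `LLM_DIM`, `HIDDEN`) and the two-layer MLP shape are assumptions chosen for demonstration. The key point it shows is that only the projector’s weights would be trained, while the vision encoder and the language model stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the real model's sizes are not stated here.
VISION_DIM, LLM_DIM, HIDDEN = 1024, 4096, 2048

# A small two-layer MLP projector: the vision encoder and the LLM stay
# frozen, so only these two weight matrices would need training.
W1 = rng.normal(scale=0.02, size=(VISION_DIM, HIDDEN))
W2 = rng.normal(scale=0.02, size=(HIDDEN, LLM_DIM))

def project(vision_tokens):
    """Map vision-encoder patch embeddings into the LLM's embedding space."""
    h = np.maximum(vision_tokens @ W1, 0.0)  # ReLU nonlinearity
    return h @ W2

patches = rng.normal(size=(16, VISION_DIM))  # 16 image-patch embeddings
llm_tokens = project(patches)                # now shaped for the LLM
```

Because the projector is tiny relative to either backbone, this design sidesteps the cost of retraining a full vision-language model from scratch.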
Training blends supervised fine-tuning with Group Relative Policy Optimization (GRPO), plus an adaptive-length CoT distillation that calibrates reasoning steps to avoid overthinking (arXiv).
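GRPO’s core trick is scoring each sampled response against the statistics of its own group, rather than against a separately learned value network. The sketch below illustrates that group-relative normalization; `group_relative_advantages` is a name I made up for illustration, not code from the paper.

```python
def group_relative_advantages(rewards):
    """Compute group-relative advantages in the spirit of GRPO: each
    sampled response is scored against the mean and spread of its own
    group, so no separate value model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Normalize each reward against its group's statistics.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled answers to one prompt: two scored correct, two incorrect.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses end up with positive advantage and incorrect ones with negative advantage, and the advantages within a group sum to zero.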
Despite having only 38 billion parameters, it achieves impressive results: 69.0 on MMMU, 67.5 on MathVista, and strong performance in text-based reasoning (72.0 on AIME, 94.0 on MATH500) (Hugging Face). Public model weights are available for transparency and reproducibility.
Why It Matters (Plain English)
- Lightweight vision integration: Think of it as attaching a vision “lens” to a reasoning brain, without rebuilding either from scratch.
- Transparent thinking process: You can see how it reaches its conclusions—step by step, like a teacher explaining a math solution.
- Open-source advantage: Being freely available accelerates collective improvement and real-world adoption.
According to its GitHub page:
“Skywork R1V, the first industry open-sourced multimodal reasoning model with advanced visual chain-of-thought capabilities” (GitHub).
Risks & Limitations
- Benchmark gap: Scores, while strong, still trail some proprietary systems.
- Overthinking hazard: Without careful length control, chain-of-thought could drift into hallucination.
- Compute demands remain high: Even at 38B parameters, it’s not trivial to run.
Skywork R1V2 & R1V3: Toward Smarter, Sturdier Reasoning
What Happened
Skywork R1V2, released shortly after, upgrades reasoning via hybrid reinforcement learning—balancing reward signals with rule-based guidance. A Selective Sample Buffer (SSB) filters training data, confronting issues like vanishing advantage in GRPO, while tuned reward thresholds reduce hallucinations (arXiv).
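The SSB’s motivation is easy to see in code: when every response in a group earns the same reward, the group-relative advantage is zero for all of them, so the group contributes no learning signal. A minimal sketch of that filtering idea follows; `selective_sample_buffer` is an illustrative function, not the paper’s implementation.

```python
def selective_sample_buffer(groups, min_spread=1e-6):
    """Keep only prompt groups whose sampled responses disagree in reward.
    Groups where all rewards match have vanishing group-relative
    advantage, so they are dropped from the training buffer."""
    kept = []
    for prompt, rewards in groups:
        mean = sum(rewards) / len(rewards)
        spread = max(abs(r - mean) for r in rewards)
        if spread > min_spread:
            kept.append((prompt, rewards))
    return kept

groups = [
    ("easy prompt", [1.0, 1.0, 1.0]),  # unanimous: no signal, dropped
    ("hard prompt", [1.0, 0.0, 1.0]),  # mixed outcomes: kept
]
buffer = selective_sample_buffer(groups)
```

Filtering this way concentrates gradient updates on prompts that are neither trivially easy nor hopelessly hard for the current model.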
Benchmarks confirm improvements: 79.0 on AIME2024, 74.0 on MMMU, and 62.6 on OlympiadBench—narrowing the gap with closed-source models (arXiv). The R1V2-38B model weights are publicly available for review (arXiv).
Building on that, Skywork R1V3, introduced mid-2025, shows state-of-the-art open-source performance—76.0 on MMMU and 77.1 on MathVista—and includes quantized versions to enable single-GPU and CPU inference (Hugging Face).
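Quantization is what makes the single-GPU and CPU story work: storing one-byte integers plus a float scale instead of 4-byte floats cuts weight memory roughly 4x. Below is a minimal symmetric int8 sketch of the general idea, written for illustration; it is not the actual scheme used in the R1V3 release.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid scale == 0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25]
q, s = quantize_int8(w)
restored = dequantize(q, s)
```

The round trip loses a little precision per weight, which is why quantized checkpoints typically score slightly below their full-precision counterparts.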
Why It Matters
- Refined learning dynamics: Combines structured reasoning with adaptive reward signals—like teaching logic alongside trial-and-error.
- Stability & fidelity: SSB ensures learning focuses on quality examples.
- Accessibility: Quantized versions allow broader usage, even on resource-constrained hardware.
Risks
- Training complexity: Hybrid RL demands careful hyperparameter management.
- Interpretability trade-off: A more sophisticated training pipeline can make the model’s reasoning harder to audit than simpler CoT setups.
GThinker: Cue-Guided Rethinking for General Vision Reasoning
What Happened
GThinker, unveiled June 2025, brings what the team calls Cue-Rethinking. When visual cues aren’t clear, it revisits and refines its reasoning—a “think, reflect, correct” loop. Training uses a two-stage pipeline: pattern-guided cold start, followed by incentive reinforcement learning. The team introduced the GThinker-11K dataset (7K reasoning paths + 4K RL samples) to support general scenario understanding (arXiv, GitHub).
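The “think, reflect, correct” loop can be sketched as a simple control flow. This is my paraphrase of the idea, assuming hypothetical `reason` and `check_cues` callables; GThinker’s actual mechanism operates inside the model’s generation process, not as an external loop.

```python
def cue_rethink(question, reason, check_cues, max_rounds=3):
    """Sketch of cue-guided rethinking: produce an answer, check it
    against the visual cues, and revise until the answer is consistent
    with every cue or the round budget runs out.

    reason(question, feedback) -> candidate answer
    check_cues(answer) -> list of conflicting cues (empty = consistent)
    """
    feedback = None
    answer = None
    for _ in range(max_rounds):
        answer = reason(question, feedback)
        conflicts = check_cues(answer)
        if not conflicts:       # answer fits all visual evidence
            return answer
        feedback = conflicts    # fold the conflicting cues back in
    return answer               # best effort after the budget is spent

# Toy demo: the first attempt conflicts with a cue, the second passes.
attempts = iter(["the gauge reads 40", "the gauge reads 60"])
answer = cue_rethink(
    "What does the gauge read?",
    reason=lambda q, fb: next(attempts),
    check_cues=lambda a: [] if "60" in a else ["needle points past 50"],
)
```

The budget (`max_rounds`) matters: unbounded rethinking would compound the latency cost noted in the risks below.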
Performance is strong: 81.5% on the challenging M³CoT benchmark, surpassing OpenAI’s o4-mini. It also shows consistent gains (around 2.1%) across general multimodal reasoning domains, while maintaining comparable math proficiency (arXiv, Hugging Face, GitHub).
Why It Matters
- Human-like correction: Think of it as giving the model a second glance, catching inconsistencies after an initial pass.
- Generalist strength: Excels across varied domains—math, science, and everyday scenes.
- Open scaffolding: Dataset and model release support further innovation.
Risks & Limitations
- Compute cost: Iterative reasoning increases latency and resource use.
- Early-stage adoption: Needs wider benchmarks and real-world validation.
Quick Comparison
| Model | Innovation | Strengths | Considerations |
| --- | --- | --- | --- |
| Skywork R1V | Lightweight visual CoT | Transparent, open, efficient | Trails proprietary scores; compute needs |
| Skywork R1V2/R1V3 | Hybrid RL + improved optimization | Stronger scores, quantized deployment | Training complexity; interpretability trade-off |
| GThinker | Cue-guided iterative reasoning | Generalist, accurate, adaptive | Production cost; early adoption phase |
Related Reading
This exploration of AI reasoning aligns nicely with broader tech trends—such as those discussed in my piece on Beyond the Hype: The Most Important AI Breakthroughs in Mid-2025. It’s vital to see AI progress not just technically, but in its geopolitical and economic context.
Reflective Conclusion: Steering the Mind of Machines
Skywork R1V injects structured, explainable thinking into visual AI. R1V2 and R1V3 enhance robustness and accessibility. GThinker models the elusive “double-check” loop, spanning general contexts with clarity.
What’s next? Supporting these advances with interpretability, efficiency, and real-world testing—whether in education, robotics, or medical diagnostics—will determine if they remain laboratory curiosities or become trusted collaborators.