You are such a monster
Congratulations to @tomerullman.bsky.social on the official release! For everyone: my disagreement with this paper has also already been accepted at NeurIPS SpaVLE this year.
link: arxiv.org/abs/2510.20835
developmental embodiment
#DevelopmentalEmbodiment #GrowAI
congratulations
What type of pen are you using?
VMEvalKit is 100% open source. We're building this in public with everyone. Plz join us ‼️
Slack: join.slack.com/t/growingail...
Early Results: grow-ai-like-a-child.com/video-reason/
Paper: github.com/hokindeng/VM...
GitHub: github.com/hokindeng/VM...
The age of video reasoning is here.
While failure cases clearly show idiosyncratic patterns, we currently lack a principled framework to systematically analyze or interpret them. We invite everyone to explore these examples, as they may offer valuable clues for future research directions.
Here is a generated video from the video models solving Raven's Matrices. For more, check out grow-ai-like-a-child.com/video-reason/
Raven's Matrices are one of the standard tasks for testing IQ in humans; they require subjects to find patterns and regularities. Intriguingly, video models are able to solve them quite well!
Here is an example of testing mental rotation in video models. For more, check out grow-ai-like-a-child.com/video-reason/
For testing mental rotation, we give them an {n}-voxel structure from a tilted camera view (20-40° elevation) and ask them to rotate it horizontally by exactly a 180° azimuth change. The hard parts are 1) not deforming the structure and 2) rotating by exactly the right amount. Interestingly, some models are able to do it quite well.
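To make the setup concrete, here is a minimal sketch of how the start and goal viewpoints could be parameterized; the function, radius, and specific angles are illustrative assumptions, not VMEvalKit's actual rendering code.

```python
import math

def camera_position(radius: float, elevation_deg: float, azimuth_deg: float):
    """Spherical camera parameters -> Cartesian (x, y, z), looking at the origin."""
    el, az = math.radians(elevation_deg), math.radians(azimuth_deg)
    return (radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el))

# Hypothetical task instance: same tilted elevation for both keyframes,
# azimuth shifted by exactly 180 degrees (the required rotation).
start_view = camera_position(radius=5.0, elevation_deg=30.0, azimuth_deg=0.0)
goal_view = camera_position(radius=5.0, elevation_deg=30.0, azimuth_deg=180.0)
```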
Here is a video example. For more, check out grow-ai-like-a-child.com/video-reason/
For the Sudoku problems, the video models need to fill the gaps with the correct numbers so that each row and column contains 1, 2, and 3. Surprisingly, this is the easiest task for video models.
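As a concrete illustration of the constraint (not VMEvalKit's scoring code, which compares the generated final frame against a ground-truth image), checking a 3x3 solution is just a Latin-square test:

```python
def is_valid_solution(grid):
    """True if every row and column of a 3x3 grid contains exactly {1, 2, 3}."""
    target = {1, 2, 3}
    return (all(set(row) == target for row in grid) and
            all(set(col) == target for col in zip(*grid)))

print(is_valid_solution([[1, 2, 3], [2, 3, 1], [3, 1, 2]]))  # True
print(is_valid_solution([[1, 2, 3], [2, 2, 1], [3, 1, 2]]))  # False: row 2 repeats 2
```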
Here is an example of a generated video from the models solving the maze problem. Check out more at grow-ai-like-a-child.com/video-reason/
In the maze problems, video models need to generate videos that navigate the green dot to the red flag. And they are also able to do it quite well~
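For intuition about what a correct trajectory must accomplish, here is a standard BFS check that a grid maze has a path from the dot to the flag; the grid encoding is an assumption for illustration, not the repo's task format.

```python
from collections import deque

def shortest_path_length(maze, start, goal):
    """BFS on a grid maze (0 = open, 1 = wall); returns step count or None."""
    rows, cols = len(maze), len(maze[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # flag unreachable

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(shortest_path_length(maze, start=(0, 0), goal=(0, 2)))  # 6
```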
Here is a generated video for solving the chess problem. For more examples, check out: grow-ai-like-a-child.com/video-reason/
Let's see some examples. Video models are able to figure out the checkmate moves in the following problems.
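For readers who want to verify such positions themselves, a mate can be checked mechanically with the python-chess library (our choice here for illustration; the thread doesn't say how VMEvalKit scores chess outputs):

```python
import chess  # pip install python-chess

def is_checkmate_move(fen: str, move_san: str) -> bool:
    """True if playing move_san from the FEN position delivers checkmate."""
    board = chess.Board(fen)
    board.push_san(move_san)
    return board.is_checkmate()

# Scholar's mate position: after 1.e4 e5 2.Bc4 Nc6 3.Qh5 Nf6??, Qxf7 is mate.
fen = "r1bqkb1r/pppp1ppp/2n2n2/4p2Q/2B1P3/8/PPPP1PPP/RNB1K1NR w KQkq - 4 4"
print(is_checkmate_move(fen, "Qxf7#"))  # True
```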
Idiosyncratic behavioral patterns exist.
For example, Sora-2 somehow figures out how to solve chess problems, but no other model shows this ability.
Veo 3 and 3.1 are actually able to do mental rotation quite well, but they really fail on the maze problems.
Tasks also exhibit a clear difficulty hierarchy, with Sudoku the easiest and mental rotation the hardest, across all models.
Models exhibit a clear performance hierarchy, with Sora-2 currently the best.
The basic unit of VMEvalKit is a Task Pair:
1️⃣ Initial image: the unsolved puzzle
2️⃣ Text instruction: “Solve this ...”
3️⃣ Final image: the correct solution (hidden during generation)
Models see (1)+(2); we compare their output to (3). Simple and straightforward ✅
‼️ Video models are starting to reason. Let's build a scaled eval in public together.
github.com/hokindeng/VM... (Apache 2.0) offers:
1️⃣ One-click inference across ALL available models
2️⃣ Unified API & datasets & auto-resume + error handling + eval
3️⃣ Plug in new models and tasks in <5 lines of code (sketch below)
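As a sketch of what "plug in a new model" could look like (the class and method names here are hypothetical; see the repo for the actual interface):

```python
class MyVideoModel:
    """Hypothetical adapter: implement one generate() call for your provider."""
    name = "my-video-model-v1"

    def generate(self, image_path: str, prompt: str) -> str:
        # Call your provider's image-to-video endpoint here and
        # return the path of the generated .mp4 for evaluation.
        raise NotImplementedError("wire this up to your model's API")
```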
a thread (1/n)
Our paper is now available at arxiv.org/abs/2510.20835. For anyone interested, we'd love to hang out and chat.
#EmbodiedAI #SpatialReasoning #NeuroAI #CognitiveScience
Third, in embodied AI, explicit simulators (MuJoCo/Isaac/Genesis) are vital but brittle on their own. Implicit world models (VIP, R3M, visual pretraining) supply perceptual structure that boosts generalization, long-horizon planning, and sim-to-real transfer.
However, it's necessary that visual and spatial mental content co-construct conscious experience rather than run on isolated tracks.
Second, it makes it sound like the dorsal stream, where the "MuJoCo" software of our brain lies, almost becomes a "zombie" stream, i.e., one with no participation in our conscious experience.
The first lies in our different interpretation of the neuro-clinical literature on aphantasia. People with aphantasia can solve mental rotation tasks yet report no visual imagery. We interpret this as a gating/decoding issue, not an absence of "rendering" in the brain.
We argue for an alternative: robust spatial reasoning needs fine-grained perceptual content and higher-order relational indices. There's no free lunch: coarse abstractions into "language-of-thought"-like representations won't yield human-like spatial competence.