Thrilled to see this work accepted at NeurIPS!
Kudos to @hafezghm.bsky.social for the heroic effort in demonstrating the efficacy of seq-JEPA in representation learning from multiple angles.
#MLSky 🧠🤖
Excited to share that seq-JEPA has been accepted to NeurIPS 2025!
Huge thanks to my supervisors and co-authors @neuralensemble.bsky.social and @shahabbakht.bsky.social !
Check out the full paper here: 📄 arxiv.org/abs/2505.03176
💻 Code coming soon!
💬 DM me if you'd like to chat about the paper!
(10/10)
Interestingly, seq-JEPA shows path integration capabilities (an important research problem in neuroscience): by observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.
(9/10)
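For the curious: below is a minimal sketch of what such a path-integration readout could look like. This is not the paper's evaluation protocol; the linear probe, the 2-D shift actions, and all array shapes are assumptions, with random arrays standing in for features from a trained model.

```python
# Illustrative path-integration readout (NOT the paper's exact protocol).
# Idea: the target is the composed action connecting the first view to the
# last; we test whether the aggregate representation linearly predicts it.
import numpy as np
from sklearn.linear_model import Ridge

n, d, T = 2000, 256, 5
agg_feats = np.random.randn(n, d)             # aggregate reps of view sequences
step_actions = np.random.randn(n, T - 1, 2)   # per-step actions (2-D shifts here)
# For 2-D shifts the composed path is just the sum; rotations would
# compose multiplicatively instead.
path = step_actions.sum(axis=1)               # initial -> final transformation

probe = Ridge().fit(agg_feats[:1500], path[:1500])
print("path R^2:", probe.score(agg_feats[1500:], path[1500:]))
```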
Thanks to action conditioning, the visual backbone encodes rotation information, which can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reduces intra-class variation (caused by rotations), and produces a semantic object representation.
(8/10)
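Here is a minimal sketch of how that kind of decodability is commonly tested: linear probes on frozen features. Everything below is illustrative; random arrays stand in for representations extracted from a trained model, and the probe and target choices are assumptions.

```python
# Illustrative linear-probing evaluation: decode rotation from per-view
# backbone features (equivariance) and object category from the aggregate
# representation (invariance). Random arrays stand in for real features.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

n, d = 2000, 256
per_view_feats = np.random.randn(n, d)       # backbone outputs, one per view
aggregate_feats = np.random.randn(n, d)      # transformer aggregate outputs
rotations = np.random.randn(n, 4)            # e.g. rotation quaternions
labels = np.random.randint(0, 50, size=n)    # placeholder object classes

# Equivariance probe: rotation should be decodable from per-view features.
rot_probe = Ridge().fit(per_view_feats[:1500], rotations[:1500])
print("rotation R^2:", rot_probe.score(per_view_feats[1500:], rotations[1500:]))

# Invariance probe: category should be decodable from the aggregate.
cls_probe = LogisticRegression(max_iter=1000).fit(aggregate_feats[:1500], labels[:1500])
print("top-1 acc:", cls_probe.score(aggregate_feats[1500:], labels[1500:]))
```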
On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, w/o sacrificing one for the other.
(7/10)
Seq-JEPA learns invariant-equivariant representations for tasks that involve sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, w/o hand-crafted augmentations or masking.
(6/10)
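As a toy illustration of that input format, here is one way to build a (views, actions) sequence from simulated fixations. The patch size, sampling scheme, and function name are assumptions for illustration, not the paper's pipeline.

```python
# Toy construction of a (views, actions) sequence from simulated eye
# movements: crop small patches at random fixation points; each action is
# the relative displacement between consecutive fixations. Illustrative only.
import numpy as np

def fixation_sequence(image, n_views=5, patch=32, rng=None):
    rng = rng or np.random.default_rng()
    H, W = image.shape[:2]
    # Sample fixation centres so every patch stays inside the image.
    ys = rng.integers(patch // 2, H - patch // 2, size=n_views)
    xs = rng.integers(patch // 2, W - patch // 2, size=n_views)
    views, actions = [], []
    for i, (y, x) in enumerate(zip(ys, xs)):
        views.append(image[y - patch // 2: y + patch // 2,
                           x - patch // 2: x + patch // 2])
        if i > 0:  # relative "saccade" from the previous fixation
            actions.append((y - ys[i - 1], x - xs[i - 1]))
    return np.stack(views), np.array(actions)

img = np.random.rand(224, 224, 3)   # placeholder image
views, actions = fixation_sequence(img)
print(views.shape, actions.shape)   # (5, 32, 32, 3) (4, 2)
```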
Post-training, the model has learned two segregated representations:
An action-invariant aggregate representation
Action-equivariant individual-view representations
💡 No explicit equivariance loss or dual predictor required!
(5/10)
Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).
⚡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet-unseen view.
(4/10)
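For a concrete picture, here is a minimal PyTorch sketch of that idea. It is not the official implementation (code is coming soon; see above): the module sizes, the additive action conditioning, the learnable aggregate token, and all names are assumptions.

```python
# Minimal sketch of the seq-JEPA idea (NOT the official code; all names,
# sizes, and design choices here are illustrative assumptions).
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    def __init__(self, feat_dim=256, action_dim=4, n_heads=4, n_layers=3):
        super().__init__()
        # Per-view backbone: maps each observed view to a representation
        # (a tiny CNN stands in for a real vision backbone).
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Embed the relative transformation (action) between views.
        self.action_embed = nn.Linear(action_dim, feat_dim)
        # Transformer encoder aggregates the action-conditioned view tokens.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learnable aggregate token, read out as the invariant representation.
        self.agg_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        # Predictor: (aggregate rep, action to target) -> predicted target rep.
        self.predictor = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, views, actions, action_to_target):
        # views: (B, T, 3, H, W); actions: (B, T, action_dim)
        B, T = views.shape[:2]
        z = self.view_encoder(views.flatten(0, 1)).view(B, T, -1)
        tokens = z + self.action_embed(actions)  # condition each view on its action
        tokens = torch.cat([self.agg_token.expand(B, -1, -1), tokens], dim=1)
        agg = self.aggregator(tokens)[:, 0]      # aggregate (invariant) rep
        pred = self.predictor(
            torch.cat([agg, self.action_embed(action_to_target)], dim=-1))
        # z: action-equivariant per-view reps; agg: action-invariant rep;
        # pred: predicted representation of the yet-unseen view.
        return z, agg, pred
```

In training, `pred` would be matched (e.g., with an MSE loss) against a target encoder's representation of the unseen view; whether actions are added to or concatenated with view representations is a design detail this sketch only guesses at.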
🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements), and these views are then integrated to form new concepts in the brain.
(3/10)
Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for tasks that depend on details like object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.
(2/10)
Preprint Alert 🚨
Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?
TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases, without extra loss terms or predictors!
🧵 (1/10)