Thrilled to see this work accepted at NeurIPS!
Kudos to @hafezghm.bsky.social for the heroic effort in demonstrating the efficacy of seq-JEPA in representation learning from multiple angles.
#MLSky 🧠🤖
Excited to share that seq-JEPA has been accepted to NeurIPS 2025!
Huge thanks to my supervisors and co-authors @neuralensemble.bsky.social and @shahabbakht.bsky.social !
Check out the full paper here: 📄 arxiv.org/abs/2505.03176
💻 Code coming soon!
💬 DM me if you'd like to chat about the paper!
(10/10)
Interestingly, seq-JEPA shows path integration capabilities (an important research problem in neuroscience): by observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.
(9/10)
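For the curious: below is a minimal sketch of what such a path-integration readout could look like. This is not the paper's evaluation protocol; the linear probe, the 2-D shift actions, and all array shapes are assumptions, with random arrays standing in for features from a trained model.

```python
# Illustrative path-integration readout (NOT the paper's exact protocol).
# Idea: the target is the composed action connecting the first view to the
# last; we test whether the aggregate representation linearly predicts it.
import numpy as np
from sklearn.linear_model import Ridge

n, d, T = 2000, 256, 5
agg_feats = np.random.randn(n, d)             # aggregate reps of view sequences
step_actions = np.random.randn(n, T - 1, 2)   # per-step actions (2-D shifts here)
# For 2-D shifts the composed path is just the sum; rotations would
# compose multiplicatively instead.
path = step_actions.sum(axis=1)               # initial -> final transformation

probe = Ridge().fit(agg_feats[:1500], path[:1500])
print("path R^2:", probe.score(agg_feats[1500:], path[1500:]))
```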
Thanks to action conditioning, the visual backbone encodes rotation information, which can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reduces intra-class variation (caused by rotations), and produces a semantic object representation.
(8/10)
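Here is a minimal sketch of how that kind of decodability is commonly tested: linear probes on frozen features. Everything below is illustrative; random arrays stand in for representations extracted from a trained model, and the probe and target choices are assumptions.

```python
# Illustrative linear-probing evaluation: decode rotation from per-view
# backbone features (equivariance) and object category from the aggregate
# representation (invariance). Random arrays stand in for real features.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

n, d = 2000, 256
per_view_feats = np.random.randn(n, d)       # backbone outputs, one per view
aggregate_feats = np.random.randn(n, d)      # transformer aggregate outputs
rotations = np.random.randn(n, 4)            # e.g. rotation quaternions
labels = np.random.randint(0, 50, size=n)    # placeholder object classes

# Equivariance probe: rotation should be decodable from per-view features.
rot_probe = Ridge().fit(per_view_feats[:1500], rotations[:1500])
print("rotation R^2:", rot_probe.score(per_view_feats[1500:], rotations[1500:]))

# Invariance probe: category should be decodable from the aggregate.
cls_probe = LogisticRegression(max_iter=1000).fit(aggregate_feats[:1500], labels[:1500])
print("top-1 acc:", cls_probe.score(aggregate_feats[1500:], labels[1500:]))
```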
On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, w/o sacrificing one for the other.
(7/10)
Seq-JEPA learns invariant-equivariant representations for tasks that involve sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, w/o hand-crafted augmentations or masking.
(6/10)
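As a toy illustration of that input format, here is one way to build a (views, actions) sequence from simulated fixations. The patch size, sampling scheme, and function name are assumptions for illustration, not the paper's pipeline.

```python
# Toy construction of a (views, actions) sequence from simulated eye
# movements: crop small patches at random fixation points; each action is
# the relative displacement between consecutive fixations. Illustrative only.
import numpy as np

def fixation_sequence(image, n_views=5, patch=32, rng=None):
    rng = rng or np.random.default_rng()
    H, W = image.shape[:2]
    # Sample fixation centres so every patch stays inside the image.
    ys = rng.integers(patch // 2, H - patch // 2, size=n_views)
    xs = rng.integers(patch // 2, W - patch // 2, size=n_views)
    views, actions = [], []
    for i, (y, x) in enumerate(zip(ys, xs)):
        views.append(image[y - patch // 2: y + patch // 2,
                           x - patch // 2: x + patch // 2])
        if i > 0:  # relative "saccade" from the previous fixation
            actions.append((y - ys[i - 1], x - xs[i - 1]))
    return np.stack(views), np.array(actions)

img = np.random.rand(224, 224, 3)   # placeholder image
views, actions = fixation_sequence(img)
print(views.shape, actions.shape)   # (5, 32, 32, 3) (4, 2)
```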
Post-training, the model has learned two segregated representations:
An action-invariant aggregate representation
Action-equivariant individual-view representations
💡 No explicit equivariance loss or dual predictor required!
(5/10)
Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).
⚡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet-unseen view.
(4/10)
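For a concrete picture, here is a minimal PyTorch sketch of that idea. It is not the official implementation (code is coming soon; see above): the module sizes, the additive action conditioning, the learnable aggregate token, and all names are assumptions.

```python
# Minimal sketch of the seq-JEPA idea (NOT the official code; all names,
# sizes, and design choices here are illustrative assumptions).
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    def __init__(self, feat_dim=256, action_dim=4, n_heads=4, n_layers=3):
        super().__init__()
        # Per-view backbone: maps each observed view to a representation
        # (a tiny CNN stands in for a real vision backbone).
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Embed the relative transformation (action) between views.
        self.action_embed = nn.Linear(action_dim, feat_dim)
        # Transformer encoder aggregates the action-conditioned view tokens.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learnable aggregate token, read out as the invariant representation.
        self.agg_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        # Predictor: (aggregate rep, action to target) -> predicted target rep.
        self.predictor = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, views, actions, action_to_target):
        # views: (B, T, 3, H, W); actions: (B, T, action_dim)
        B, T = views.shape[:2]
        z = self.view_encoder(views.flatten(0, 1)).view(B, T, -1)
        tokens = z + self.action_embed(actions)  # condition each view on its action
        tokens = torch.cat([self.agg_token.expand(B, -1, -1), tokens], dim=1)
        agg = self.aggregator(tokens)[:, 0]      # aggregate (invariant) rep
        pred = self.predictor(
            torch.cat([agg, self.action_embed(action_to_target)], dim=-1))
        # z: action-equivariant per-view reps; agg: action-invariant rep;
        # pred: predicted representation of the yet-unseen view.
        return z, agg, pred
```

In training, `pred` would be matched (e.g., with an MSE loss) against a target encoder's representation of the unseen view; whether actions are added to or concatenated with view representations is a design detail this sketch only guesses at.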
🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements), and these views are then integrated to form new concepts in the brain.
(3/10)
Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for tasks that depend on details like object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.
(2/10)
Preprint Alert 🚨
Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?
TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases, without extra loss terms or predictors!
🧵 (1/10)