
Roei Herzig

@roeiherz

Research Scientist @ IBM Research. Postdoc @ Berkeley AI. PhD @ Tel Aviv University. Working on Compositionality, Multimodal Foundation Models, and Structured Physical Intelligence. πŸ”— https://roeiherz.github.io/ πŸ“Bay Area πŸ‡ΊπŸ‡²

416
Followers
183
Following
29
Posts
23.11.2024
Joined

Latest posts by Roei Herzig @roeiherz

MMFM 3rd Workshop - Program
08:30am - Welcome (5min)
08:35am - Keynote Talk by Ludwig Schmidt (Stanford, Anthropic) (25min + 5min QA)
Title: LAION-5B & DataComp: In search of the next generation of multim...

CVPR panel at the What is Next in Multimodal Foundation Models? workshop kicks off soon!

11:30AM, R207 A–D (Level 2)

Don't miss an amazing discussion with: Ludwig Schmidt, @andrewowens.bsky.social , Arsha Nagrani, and Ani Kembhavi πŸ”₯

@cvprconference.bsky.social

sites.google.com/view/mmfm3rd...

12.06.2025 15:51 πŸ‘ 9 πŸ” 3 πŸ’¬ 0 πŸ“Œ 0

We found that 4D representations maintain a shared geometric structure between the point and robot state representations up to a linear transformation, thus enabling efficient transfer learning from human video data to low-level robotic control.
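As a rough illustration of "shared up to a linear transformation" (a hypothetical sketch with synthetic data, not the paper's method or dimensions): if robot states are a linear image of the point representations, ordinary least squares over paired samples recovers the map, which is what makes transfer from one representation to the other cheap.

```python
import numpy as np

# Hypothetical sketch: point representations P and robot states S that are
# related by an unknown linear map W_true. All sizes are made up.
rng = np.random.default_rng(0)

d_points, d_state, n = 16, 7, 200
W_true = rng.normal(size=(d_state, d_points))   # unknown linear transformation
P = rng.normal(size=(n, d_points))              # 4D point representations
S = P @ W_true.T                                # robot states (linear image of P)

# Recover the transformation from paired samples via ordinary least squares.
W_hat, *_ = np.linalg.lstsq(P, S, rcond=None)   # shape (d_points, d_state)

# The fitted map transfers point features directly to state predictions.
err = np.linalg.norm(P @ W_hat - S)
print(err < 1e-8)
```

If the structure really is shared up to a linear map, this cheap fit is all the "adapter" needed between the two spaces.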

24.02.2025 03:49 πŸ‘ 4 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

For example, VLAs use language decoders, which are pretrained on tasks like visual question answering and image captioning.

This presents a discrepancy between the models’ high-level pre-training objective and the need for robotic models to predict low-level actions.

24.02.2025 03:49 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Pretraining has significantly contributed to recent Foundational Model success. However, in robotics, progress has been limited due to a lack of robotic annotations and insufficient representations that accurately model the physical world.

24.02.2025 03:49 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Our paper: arxiv.org/pdf/2502.13142.

Our project page and code will be released soon!

Team: w/ Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, and Trevor Darrell.

24.02.2025 03:49 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

What happens when vision🀝 robotics meet? 🚨 Happy to share our new work on Pretraining Robotic Foundational Models!πŸ”₯

ARM4R is an Autoregressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better robotic model.

BerkeleyAI 😊

24.02.2025 03:49 πŸ‘ 16 πŸ” 5 πŸ’¬ 1 πŸ“Œ 0

The best friend of Auto-regressive Robotic Models is 4D representations...πŸ€–πŸ˜»β€οΈ

20.02.2025 05:01 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Was it a bus?

26.01.2025 04:48 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Wow! This image is so horrible and beautiful at the same time.

15.01.2025 21:49 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I wouldn't recommend deleting your old accounts on X and Facebook, as this social network is still in beta.

08.01.2025 21:32 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

The Star of David on the Christmas tree is quite hilarious :)

24.12.2024 17:32 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Our workshop "What is Next in Multimodal Foundation Models?" has been accepted to #CVPR for the 3rd time!

We are cooking amazing talks and an excellent panel for you, so stay tuned!

@cvprconference.bsky.social

21.12.2024 19:06 πŸ‘ 9 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

For all our @neuripsconf.bsky.social friendsπŸ€–πŸ¦‹, our work is presented NOW at POSTER #3701.

Come hear us talk about our work on many-shot in-context learning and test-time scaling by leveraging the activations! You won't be disappointed😎

#Multimodal-InContextLearning #NeurIPS

12.12.2024 19:13 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Oh no, I have a NeurIPS @neuripsconf.bsky.social FOMOπŸ™ƒπŸ˜ƒπŸ€—

Or is it actually more of a Taylor Swift FOMO?🫠

10.12.2024 23:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

This fantastic work was done by the outstanding students, Brandon Huang, Chancharik Mitra and Tianning Chai, as well as Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky.

I also want to give special thanks to the amazing Trevor Darrell and Deva Ramanan for their invaluable guidance.

04.12.2024 21:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Key-takeaways:

(1) Utilizing truly multimodal features (like those found in generative architectures)

(2) Demonstrating how generative LMMs can be used for discriminative VL tasks

(3) Showing it is very convenient to have all the task information in a small, separate set of heads for each VL task.

04.12.2024 21:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

We tried several different tasks, such as Safety, Visual Question Answering (VQA), and Classification benchmarks.

The results suggest that SAVs are particularly useful even when compared to LoRA, especially when there are not many samples available to fine-tune the model.

04.12.2024 21:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

What did we do? ->

We propose an algorithm for finding small sets of attention heads (~20!) as multimodal features in Generative LMMs that can be used for discriminative VL tasks, outperforming encoder-only architectures (CLIP, SigLIP) without training.
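A minimal sketch of that selection idea (hypothetical shapes and synthetic activations; not the paper's exact algorithm): score each attention head by how well its activations separate classes on a few labeled examples, keep a sparse set of top-scoring heads, and classify with a nearest-class-mean rule over just those heads.

```python
import numpy as np

# Toy stand-in for per-head activations of a generative LMM; sizes are made up.
rng = np.random.default_rng(1)
n_heads, d_head, n_shots = 32, 8, 10

def sample(label, n):
    """Simulated activations: only heads 0-2 carry the class signal."""
    x = rng.normal(size=(n, n_heads, d_head))
    x[:, :3] += 2.0 * label
    return x

X0, X1 = sample(0, n_shots), sample(1, n_shots)   # a few labeled examples

# Score each head by the distance between its per-class mean activations.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)       # (n_heads, d_head)
scores = np.linalg.norm(mu1 - mu0, axis=1)        # one score per head
top = np.argsort(scores)[-3:]                     # keep a sparse set of heads

def predict(x):
    """Nearest-class-mean classifier restricted to the selected heads."""
    d0 = np.linalg.norm(x[top] - mu0[top])
    d1 = np.linalg.norm(x[top] - mu1[top])
    return int(d1 < d0)

print(predict(sample(1, 1)[0]))
```

The point of the sketch is the training-free recipe: a handful of labeled shots picks the discriminative heads, and classification is then just distances in their activation space.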

04.12.2024 21:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Motivation:

On the one hand, encoder-only architectures are great for discriminative VL tasks but lack multimodal features.

On the other hand, decoder-only architectures have a joint multimodal representation but are not suited for discriminative tasks.

Can we enjoy both worlds? The answer is YES!

04.12.2024 21:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
In-Context Learning Enables Robot Action Prediction in LLMs Recently, Large Language Models (LLMs) have achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities within LLMs to directly predict...

🚨Excited to share for the first time our work here in πŸ¦‹ "Sparse Attention Vector (SAVs)" πŸ₯³

We showed that when done properly, generative multimodal features can be discriminative vision-language classifiers.

A really fun & enjoyable collab w/ @CMU, @BAIR, and @MIT-IBM Lab

arxiv.org/abs/2410.12782

04.12.2024 21:24 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I think for two main reasons. First, ICL is an emergent property of LLMs/VLMs, not something they were pre-trained to do originally. Second, the VLMs that suffer from poor ICL are usually those that were instruction-tuned, while most pretrained VLMs (i.e., generative models) should still have it.

30.11.2024 02:04 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I took my two kids there last April, and I was amazed at how much they could climb even at a young age!

Also, I highly recommend visiting Northern California (Mendocino, Fort Bragg, etc.) during this time of year!

30.11.2024 01:38 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

My hot take: it is essential to have many-shot capabilities. In our NeurIPS work arxiv.org/abs/2406.15334, we showed how to use Multimodal Task Vectors for many-shot.
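Very roughly, the task-vector idea can be caricatured like this (a toy sketch with synthetic activations, not the paper's implementation): compress many in-context demonstrations into a single mean activation vector, then add it to a zero-shot query's activation to steer the model toward the task.

```python
import numpy as np

# Synthetic stand-in for internal model activations; all values are made up.
rng = np.random.default_rng(2)
d = 32

task_direction = rng.normal(size=d)                      # latent "task" signal
demos = task_direction + 0.1 * rng.normal(size=(50, d))  # many-shot activations
task_vector = demos.mean(axis=0)                         # 50 shots -> one vector

query = 0.1 * rng.normal(size=d)                         # zero-shot query activation
steered = query + task_vector                            # inject the task vector

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The steered activation aligns with the task direction; the raw query doesn't.
print(cos(query, task_direction), cos(steered, task_direction))
```

The compression is the point: many shots no longer need to fit in the context window, because their effect is carried by one vector in activation space.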

But, I'm not sure it makes sense to pretrain for this. ICL is an emergent property, not a downstream task...

Anyway, nice work!

28.11.2024 20:07 πŸ‘ 2 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

πŸ€”

28.11.2024 01:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

You can access keynotes, papers, and spotlights on Robotic Learning for free! So cool!πŸ€–

youtube.com/watch?v=0joZ...

#Robotics #DeepLearning #CoRL2024

26.11.2024 05:07 πŸ‘ 2 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
Assistant Professor (tenure-track) - Computer Science, College of Engineering and Applied Sciences
Tenure-Track Assistant Professor in AI, Department of Computer Science, Stony Brook University
Stony Brook University’s Department of Computer Science invites applications for a tenure-track assistant p...

πŸ“£ Stony Brook University’s Department of Computer Science invites applications for a tenure-track assistant professor position with an expected starting date of Fall 2025.

Link to the job post: careercenter.cra.org/job/assistan...

25.11.2024 21:31 πŸ‘ 9 πŸ” 2 πŸ’¬ 0 πŸ“Œ 1

Wow! This is fantastic! Well deserved.

25.11.2024 19:57 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

So far, my experience with this platform has shown that it is much better for research. I really love the research feed here!

24.11.2024 18:35 πŸ‘ 6 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
ALT: a penguin with a fish on its head is on a wooden raft with the words "little help" written above it

For all the ML/AI researchers, are you still tweeting both at X and BK at the same time? Is there a convenient way to do this?

24.11.2024 17:57 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Very usefulπŸ˜ƒ

24.11.2024 03:40 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0