Our #NeurIPS2025 oral presentation is starting in a few minutes!
Join us:
⏰ 3:30 pm
📍 Ballroom 6AB
grafting.stanford.edu
arxiv.org/abs/2506.05340
Such a great two days of workshops! Fuelled by inspiring talks and excellent reconnections with friends and colleagues; definitely my favorite part of the conference.
Missed my talk on AI Agents: from Language to Multimodal Reasoning?
Summary and slides are here:
www.niebles.net/blog/2025/mm...
Talk is done!
Shared our work on Multimodal AI Agents at the #ICCV2025 Workshop on Multi-Modal Reasoning. 🤖
All the slides, key papers, and the research journey are consolidated in this new blog post:
🔗 https://www.niebles.net/blog/2025/mmagents/
@iccv.bsky.social
We will be presenting Strefer today at Poster 52, 9:30-10:30am. Join us to learn more about our work on Video-Language at @salesforce.com AI Research @iccv.bsky.social #ICCV2025
strefer.github.io
arxiv.org/abs/2509.03501
Check out the latest on Strefer: model & data are now available!
arxiv.org/abs/2509.03501
We will see you at #ICCV2025 🗓️
📢📢 Exciting news!
Our paper, "Exploring Diffusion Transformer Designs via Grafting," has been accepted as an Oral at #NeurIPS2025, with only 77 out of 21k submissions receiving this honor.
📄 Paper: arxiv.org/abs/2506.05340
🌐 Website: grafting.stanford.edu
🧑🏻‍💻 Code: github.com/keshik6/graf...
Strefer: our new work on auto-generating instruction data for space-time-focused video tasks (spatiotemporal reasoning, space-time reference understanding, etc.) for Video LLMs
✅ Auto & scalable
✅ Fine-grained, space-time-grounded queries
✅ Effective
📄: arxiv.org/abs/2509.03501
🌐: strefer.github.io
Check out a new episode of The AI Research Lab - Explained on Multimodal AI.
Had a blast creating this with the @salesforce.com team!
youtu.be/r98jGdLtO6Q
Congrats Chaitanya on winning the BEST PAPER AWARD! 🥇 🎉
Check out details of our work:
arxiv.org/abs/2504.12513
Our first #cvpr2025 poster is up!
👉 Come check it out right now until 13:00
"AdaVid: Adaptive Video-Language Pretraining"
🪧 ExHall D Poster #203
📄 arxiv.org/abs/2504.12513
Just finished a day at the #CVPR2025 Area Chair workshop. Lots of interesting discussions and ideas, reconnection with colleagues and friends.
Had the chance to present our ViUniT poster to fellow ACs. If you missed it, come to our Sunday poster session.
See details in the 🧵⬇️
If you're at #CVPR2025, please stop by my posters and say hello! I'd love to chat about our work and all things computer vision. See you in Nashville!
Last but not least, presenting "ViUniT: Visual Unit Tests for More Robust Visual Programming" #CVPR2025
🗓️ Sun Jun 15, 10:30AM-12:30PM
📍 ExHall D Poster #346
📄 Paper: arxiv.org/abs/2412.08859
📝 Blog: www.niebles.net/blog/2025/vi...
#VisualProgramming #RobustAI
Next, "Re-thinking Temporal Search for Long-Form Video Understanding" #CVPR2025
🗓️ Fri Jun 13, 4PM-6PM
📍 ExHall D Poster #306
📄 Paper: arxiv.org/abs/2504.02259
🌐 Website: longvideohaystack.github.io
💻 Code: github.com/LongVideoHay...
📊 Data: huggingface.co/datasets/LVH...
#VideoUnderstanding
I'll also be presenting multiple papers at #CVPR2025! First up: "AdaVid: Adaptive Video-Language Pretraining".
🗓️ Thu Jun 12, 12:00-13:00
📍 ExHall D Poster #202
📄 Paper: arxiv.org/abs/2504.12513
🌐 Website: chaitanya100100.github.io/AdaVid/
#VideoLanguage #Pretraining
Kicking things off on June 11th by participating in the #CVPR2025 Area Chair workshop! Eager to connect with fellow ACs and colleagues. Let's make this an impactful conference!
Excited to attend #CVPR2025 in Nashville! 🤠 Looking forward to a fantastic week of cutting-edge computer vision research and connecting with the community.
@cvprconference.bsky.social
Read the full post for more details: "Level up your Agents: Teaching Vision-Language Models to Play by the Rules".
blog: www.niebles.net/blog/2025/vl...
arxiv: arxiv.org/abs/2505.03181
Work with Jake Grigsby, Michael Ryoo and Yuke Zhu
#AI #MachineLearning #DeepLearning
This RL approach effectively aligns VLMs with the demands of interactive decision-making. It's a powerful new pathway for developing more capable and adaptable visual agents using readily available VLM tech.
We tested our approach on PaliGemma, xGen-MM, and MoonDream2 across Gym Cards, BabyAI, and MiniWoB. Results? Substantial improvements in valid action syntax accuracy and task success rates, even starting from noisy data!
This approach works great for offline-to-online fine-tuning, learning from static datasets (even random actions!) and then smoothly transitioning to online learning where the agent gathers new data to refine its policy. Self-improvement is key!
AFSFT helps VLMs overcome challenges like strict action syntax and suboptimal data. It learns from demonstrations and filters out tokens that would lead to invalid syntax or poor choices, even penalizing invalid syntax.
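The filtering idea described above can be sketched in a few lines of Python. Everything here, the `afsft_filter` name, the simple return-minus-baseline advantage estimate, and the `valid_actions` set, is an illustrative assumption for intuition, not the paper's actual implementation:

```python
# Hypothetical sketch of AFSFT-style data filtering: keep only
# demonstrations that are syntactically valid and look better than a
# value baseline. Names and the advantage estimate are assumptions.

def afsft_filter(transitions, value_fn, valid_actions):
    """Filter (state, action, return) demos for supervised fine-tuning."""
    kept = []
    for state, action, ret in transitions:
        if action not in valid_actions:    # drop invalid action syntax
            continue
        advantage = ret - value_fn(state)  # return vs. value baseline
        if advantage > 0:                  # keep only promising choices
            kept.append((state, action))
    return kept

# Toy usage: a misspelled action and a below-baseline action both get dropped.
demos = [
    ("s0", "click(btn)", 1.0),   # valid, positive advantage -> kept
    ("s0", "clck btn",   1.0),   # invalid syntax -> dropped
    ("s1", "type(x)",   -0.5),   # valid but below baseline -> dropped
]
kept = afsft_filter(demos, lambda s: 0.0, {"click(btn)", "type(x)"})
```

The surviving pairs would then feed a standard SFT loss, which is what lets this kind of approach start from noisy or even random-action data.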
Enter Reinforcement Learning (RL)! Our paper introduces an "offline-to-online" RL technique called Advantage-Filtered Supervised Fine-Tuning (AFSFT) that allows VLMs to learn through trial and error, improving even with imperfect initial data.
Traditional supervised fine-tuning (SFT) has limits: it can't go beyond its training data, and imperfect datasets mean replicating flaws. What if we don't have perfect examples or a good initial VLM?
The catch? VLMs can struggle with the precise rules and structured outputs many agent tasks require, unlike LLMs which excel at function calling and specific syntax. Think describing a button vs. knowing the exact command to click it.
Large Language Models (LLMs) are great for agents, but what happens when we give them "eyes"? VLMs extend this power to process visual info, opening up new possibilities like robotic control and automating tasks by "seeing" your screen.
Just dropped a new blog post: "Level up your Agents: Teaching Vision-Language Models to Play by the Rules"! We're exploring how to make Vision-Language Models (VLMs) even smarter at interactive tasks.
blog: www.niebles.net/blog/2025/vl...
arxiv: arxiv.org/abs/2505.03181
#multimodalAI #agents #VLM
Check out this great intro to Large Action Models, the key engine powering the AI Agent revolution. 🤖
By @salesforce.com AI Research's Shelby Heinecke.
See video here:
youtube.com/watch?v=vlvv...
@salesforce.com #AI Research has a new series called "AI Explained."
🎬 "The AI Research Lab - Explained" debuts with our groundbreaking work on Large Action Models! Sr. Mgr Shelby Heinecke reveals how we're training these specialized models to generate precise, executable actions. t.co/XLhlN2EZyk
Behind every great conference is a team of dedicated reviewers. Congratulations to this yearβs #CVPR2025 Outstanding Reviewers!
cvpr.thecvf.com/Conferences/...