
Mrinal Verghese

@mrinal-verghese

PhD student at Carnegie Mellon Robotics Institute. I work on task learning for household robots. He/Him. http://mrinal.verghese.org

1,470
Followers
676
Following
10
Posts
11.11.2024
Joined

Latest posts by Mrinal Verghese @mrinal-verghese


We just released AnySense, an iPhone app for effortless data acquisition and streaming for robotics. We leverage Apple’s development frameworks to record and stream:

1. RGBD + Pose data
2. Audio from the mic or custom contact microphones
3. Seamless Bluetooth integration for external sensors
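As a rough illustration of the kind of multimodal record such an app produces, here is a minimal sketch of one synchronized sample. The field names and layout are illustrative assumptions, not AnySense's actual data format.

```python
# Hypothetical layout for one synchronized capture sample; fields are
# illustrative assumptions, not AnySense's real on-disk format.
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float   # seconds since capture start
    rgb: bytes         # encoded RGB frame
    depth: bytes       # depth map aligned to the RGB frame
    pose: tuple        # 6-DoF camera pose, e.g. (x, y, z, roll, pitch, yaw)
    audio: bytes       # audio chunk from the mic or a contact microphone
```

Downstream robotics code could then iterate over a list of such samples, pairing each RGBD frame with its pose and audio by timestamp.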

26.02.2025 15:14 πŸ‘ 34 πŸ” 10 πŸ’¬ 2 πŸ“Œ 0

How well do Multimodal LLMs consider visual information when creating plans to complete household activities? To answer this, we put a few multimodal LLMs on a pair of smart glasses and had participants try to solve cooking tasks while taking instructions from them.

23.02.2025 22:07 πŸ‘ 8 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0

8/ A huge thank you goes out to my co-authors, Brian Chen, @heghbalz.bsky.social, Tushar Nagarajan, and Ruta Desai.

23.02.2025 22:07 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 1

7/ Finally, even though the overall success rate was low, in 50% of successful trials with our best model, the model guided a participant to complete an activity they had never done before. This highlights the potential of these systems to provide household assistance, particularly to elderly folks.

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

6/ 3) Metrics from related offline benchmarks, like action anticipation, can be misleading and are not indicative of real-world performance. Check out our paper to see some of the errors we found with these metrics and how to conduct your own study!

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
A diagram showing the flow of a latte-making activity. The majority of errors made by the system are classified as "grounding errors".

5/ 2) Grounding errors, where the LLM fails to recognize previously completed actions or suggests actions for a different variation of the task, are the dominant error modes. We can make progress in this domain by better enabling LLMs to attend to long visual histories.

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
A table showing two methods, Socratic 13B and VCLM 13B. The Socratic 13B method has a success rate of 27.8 and a mean intersection over union of 30.4. The VCLM has a success rate of 16.7 and a mean intersection over union of 23.0.

4/ 1) Encoding the visual task history using the Socratic approach is more effective than representing this info implicitly using VCLMs. Implicit representations capture β€œlow-level” info, which is less useful for planning than the β€œhigh-level” info in explicit text representations.

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
A diagram showing the system setup for evaluating Multimodal LLMs for activity assistance. A video stream from a user is fed to a Multimodal LLM, which generates a plan to complete an activity.

3/ We set up a user study where users would complete the first half of a task themselves while the LLM monitored their progress and then relied on the LLM to guide them through the rest of the task.
We came away with three important findings:

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

2/ We tested two approaches:
Socratic Models convert vision to text using pretrained models such as narration models and pass it to an off-the-shelf LLM.
Vision-Conditioned Language Models (VCLMs) encode vision with pretrained encoders and pass the embeddings to a fine-tuned LLM.
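As a toy illustration, the two pipelines differ mainly in where vision enters the language model. In this sketch all three "models" are stub functions standing in for real pretrained components; `narrate`, `llm`, and `vclm` are hypothetical placeholders, not an actual API.

```python
# Toy sketch contrasting the two approaches; the stubs below stand in
# for real pretrained models and are NOT a real API.

def narrate(frame):
    # Placeholder for a pretrained narration model (frame -> caption).
    return f"observed: {frame}"

def llm(prompt):
    # Placeholder for a frozen, off-the-shelf text-only LLM.
    return "next step for: " + prompt.splitlines()[0]

def vclm(embeddings, task):
    # Placeholder for an LLM fine-tuned to consume visual embeddings.
    return f"next step for: Task: {task}"

def socratic_plan(frames, task):
    """Socratic: vision -> text captions -> off-the-shelf LLM."""
    history = "\n".join(narrate(f) for f in frames)
    return llm(f"Task: {task}\n{history}\nNext step:")

def vclm_plan(frames, task):
    """VCLM: vision -> embeddings -> fine-tuned LLM."""
    embeddings = [hash(f) for f in frames]  # stand-in for a visual encoder
    return vclm(embeddings, task)
```

The key design difference: the Socratic route keeps the task history as explicit text the LLM can re-read, while the VCLM route compresses it into embeddings.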

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance Our research investigates the capability of modern multimodal reasoning models, powered by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step daily activities. Such a...

1/ Quick Info:
This work, User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance, is being presented next weekend at #WACV2025.

Paper: www.arxiv.org/abs/2408.03160
Poster: Saturday March 1, Poster Session 2
Oral: Sunday March 2, Oral Session 5.4 Generative Models V 2:00 PM

23.02.2025 22:07 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

1/ I am seeing a lot of comments on the slashing of NIH support along the lines of β€œuniversities should just spend their huge endowments.”

I’m the last person to cheer on the institutional stratification rising endowments have contributed to. But let me explain why this is not a solution.

18.02.2025 13:48 πŸ‘ 1244 πŸ” 524 πŸ’¬ 44 πŸ“Œ 120

Does everyone in your community agree on some folk knowledge that isn’t published anywhere? Put it in a paper! It’s a pretty valuable contribution

26.11.2024 22:31 πŸ‘ 201 πŸ” 26 πŸ’¬ 24 πŸ“Œ 10

Introducing Generative Omnimatte:

A method for decomposing a video into complete layers, including objects and their associated effects (e.g., shadows, reflections).

It enables a wide range of cool applications, such as video stylization, compositions, moment retiming, and object removal.

26.11.2024 15:55 πŸ‘ 134 πŸ” 20 πŸ’¬ 3 πŸ“Œ 8
ChatGPT Has No Place in the Classroom By Emily On November 20, 2024, OpenAI and an outfit called "Common Sense Media" released a guide to using ChatGPT in K-12 educationβ€”a guide which shows a...

No, ChatGPT won't help in the classroom, won't save teachers time, and doesn't represent a set of skills students need to learn.

On OpenAI's latest nonsense:

buttondown.com/maiht3k/arch...

22.11.2024 14:14 πŸ‘ 517 πŸ” 191 πŸ’¬ 43 πŸ“Œ 34

Hello world! I'm a PhD student at CMU in robotics, at the intersection of vision and robot manipulation.

18.11.2024 10:48 πŸ‘ 31 πŸ” 3 πŸ’¬ 1 πŸ“Œ 0

If you like robots & makers, here's a thread of starter packs!

go.bsky.app/9xL642E

14.11.2024 04:43 πŸ‘ 32 πŸ” 11 πŸ’¬ 9 πŸ“Œ 4

I am trying to create a robotics and AI starter pack on Bluesky: go.bsky.app/DfAoaJ1

It's very incomplete, so please comment with suggestions (or just say if you're missing and want to be added!)

11.11.2024 15:01 πŸ‘ 110 πŸ” 38 πŸ’¬ 78 πŸ“Œ 4

Hello!

11.11.2024 18:55 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0