We just released AnySense, an iPhone app for effortless data acquisition and streaming for robotics. We leverage Apple's development frameworks to record and stream:
1. RGBD + Pose data
2. Audio from the mic or custom contact microphones
3. Seamless Bluetooth integration for external sensors
26.02.2025 15:14
How well do Multimodal LLMs consider visual information when creating plans to complete household activities? To answer this, we put a few multimodal LLMs on a pair of smart glasses and had participants try to solve cooking tasks while taking instructions from them.
23.02.2025 22:07
8/ A huge thank you goes out to my co-authors, Brian Chen, @heghbalz.bsky.social, Tushar Nagarajan, and Ruta Desai.
23.02.2025 22:07
7/ Finally, even though the overall success rate was low, in 50% of successful trials with our best model, the model guided a participant to complete an activity they had never done before. This highlights the potential of these systems to provide household assistance, particularly to elderly folks.
23.02.2025 22:07
6/ 3) Metrics from related offline benchmarks, like action anticipation, can be misleading and are not indicative of real-world performance. Check out our paper to see some of the errors we found with these metrics and how to conduct your own study!
23.02.2025 22:07
A diagram showing the flow of a latte-making activity. The majority of errors made by the system are classified as "grounding errors".
5/ 2) Grounding errors, where the LLM fails to recognize previously completed actions or suggests actions for a different variation of the task, are the dominant error modes. We can make progress in this domain by better enabling LLMs to attend to long visual histories.
23.02.2025 22:07
A table showing two methods, Socratic 13B and VCLM 13B. The Socratic 13B method has a success rate of 27.8 and a mean intersection over union of 30.4. The VCLM 13B method has a success rate of 16.7 and a mean intersection over union of 23.0.
4/ 1) Encoding the visual task history using the Socratic approach is more effective than representing this info implicitly using VCLMs. Implicit representations capture "low-level" info, which is less useful for planning than the "high-level" info in explicit text representations.
23.02.2025 22:07
A diagram showing the system setup for evaluating Multimodal LLMs for activity assistance. A video stream from a user is fed to a Multimodal LLM, which generates a plan to complete an activity.
3/ We set up a user study where users would complete the first half of a task themselves while the LLM monitored their progress and then relied on the LLM to guide them through the rest of the task.
We came away with three important findings:
23.02.2025 22:07
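To make the study protocol concrete, here is a rough Python sketch of the assisted second half of a trial. This is purely illustrative and not code from the paper: `assistant_model`, `get_video_history`, and `user_executes` are hypothetical placeholders for the multimodal LLM, the egocentric video recorded so far, and the participant carrying out an instruction.

```python
def run_assisted_second_half(assistant_model, get_video_history, user_executes,
                             goal: str, max_steps: int = 10) -> bool:
    """Illustrative loop: the participant has already completed the first half
    of the activity unassisted; from here on the model proposes each next step."""
    for _ in range(max_steps):
        # Everything observed so far, including the unassisted first half.
        history = get_video_history()
        instruction = assistant_model(goal=goal, video=history)
        if instruction == "DONE":      # model judges the activity complete
            return True
        user_executes(instruction)     # participant follows the suggestion
    return False                       # step budget exhausted without finishing
```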
2/ We tested two approaches:
Socratic Models convert vision to text using pretrained models such as narration models and pass it to an off-the-shelf LLM.
Vision-Conditioned Language Models (VCLMs) encode vision with pretrained encoders and pass the embeddings to a fine-tuned LLM.
23.02.2025 22:07
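To make the contrast between the two approaches concrete, here is a minimal Python sketch. The names here (`narrate_frames`, `encode_frames`, `text_llm`, `vclm`) are hypothetical stand-ins rather than the actual models used in the paper: the Socratic pipeline turns the visual history into explicit text for an off-the-shelf LLM, while the VCLM pipeline passes visual embeddings to a fine-tuned LLM.

```python
from typing import Callable, List, Sequence

def narrate_frames(frames: Sequence) -> List[str]:
    """Hypothetical pretrained narration model: video frames -> text descriptions."""
    return [f"step {i}: <narration>" for i in range(len(frames))]

def encode_frames(frames: Sequence) -> List[List[float]]:
    """Hypothetical pretrained visual encoder: video frames -> embedding vectors."""
    return [[0.0] * 512 for _ in frames]

def socratic_plan(frames: Sequence, goal: str, text_llm: Callable[[str], str]) -> str:
    """Socratic approach: the visual history becomes explicit text that any
    off-the-shelf LLM can reason over."""
    history = "\n".join(narrate_frames(frames))
    prompt = (f"Goal: {goal}\n"
              f"Steps observed so far:\n{history}\n"
              "What should the user do next?")
    return text_llm(prompt)

def vclm_plan(frames: Sequence, goal: str, vclm: Callable[..., str]) -> str:
    """VCLM approach: the visual history stays implicit, as embeddings consumed
    by an LLM fine-tuned to accept them."""
    embeddings = encode_frames(frames)
    return vclm(goal=goal, visual_history=embeddings)
```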
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance
Our research investigates the capability of modern multimodal reasoning models, powered by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step daily activities. Such a...
1/ Quick Info:
This work, User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance, is being presented next weekend at #WACV2025.
Paper: www.arxiv.org/abs/2408.03160
Poster: Saturday March 1, Poster Session 2
Oral: Sunday March 2, Oral Session 5.4 Generative Models V 2:00 PM
23.02.2025 22:07
1/ I am seeing a lot of comments on the slashing of NIH support along the lines of "universities should just spend their huge endowments."
I'm the last person to cheer on the institutional stratification rising endowments have contributed to. But let me explain why this is not a solution.
18.02.2025 13:48
Does everyone in your community agree on some folk knowledge that isn't published anywhere? Put it in a paper! It's a pretty valuable contribution
26.11.2024 22:31
Introducing Generative Omnimatte:
A method for decomposing a video into complete layers, including objects and their associated effects (e.g., shadows, reflections).
It enables a wide range of cool applications, such as video stylization, compositions, moment retiming, and object removal.
26.11.2024 15:55
Hello world! I'm a PhD student at CMU in robotics, at the intersection of vision and robot manipulation.
18.11.2024 10:48
If you like robots & makers, here's a thread of starter packs!
go.bsky.app/9xL642E
14.11.2024 04:43
I am trying to create a robotics and AI starter pack on Bluesky: go.bsky.app/DfAoaJ1
It's very incomplete; please comment with suggestions (or just comment if you're missing and want to be added!)
11.11.2024 15:01
Hello!
11.11.2024 18:55