
Shaily

@shaily99

PhDing at LTI, CMU. Prev: Ai2, Google Research, MSR. Evaluating language technologies, regularly ranting, and probably procrastinating. https://sites.google.com/view/shailybhatt/

3,201
Followers
534
Following
322
Posts
18.07.2024
Joined

Latest posts by Shaily @shaily99

Title, author list, and two figures from the paper. 
Title: The Aftermath of DrawEduMath: Vision Language Models
Underperform with Struggling Students and Misdiagnose Errors
Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Figure 1: On the left is a math problem, where students are asked to draw x < 5/2 on a number line. The right side shows two example student responses that differ in correctness. DrawEduMath pairs each math problem with one student response, and prompts VLMs to answer questions about the student response.
Figure 2: VLMs consistently perform worse on answering DrawEduMath benchmark questions pertaining to erroneous student responses. Performance on non-erroneous student responses is labeled with specific VLMs’ names; that same model’s performance on erroneous student responses is directly below.


Models are now expert math solvers, and so AI for math education is receiving increasing attention.
Our new preprint evaluates 11 VLMs on our QA benchmark, DrawEduMath. We highlight a startling gap: models perform less well on inputs from K-12 students who need more help. 🧵

03.03.2026 03:08 👍 34 🔁 11 💬 4 📌 2

@lucy3.bsky.social

15.02.2026 17:50 👍 1 🔁 0 💬 0 📌 0

The Workshop on Developing Standards and Documentation For LLM Use in HCI Human Subjects Research aims to bring the HCI community together to develop standards, guidance, and documentation for the use of large language models (LLMs) as simulated research authors. 1/2

11.02.2026 23:58 👍 2 🔁 1 💬 1 📌 0

It finally happened: someone told me that a direction I suggested made sense because "gemini says it's novel and no one is focusing on it".

08.02.2026 16:07 👍 3 🔁 0 💬 0 📌 0
Don’t Let The Machines Do The Living | Culture Study

Loved this wonderful essay, which talks about discernment of LLM use but also how we are doing too much.

08.02.2026 15:57 👍 1 🔁 0 💬 0 📌 0

A bit of both. Who you want your audience to be, and knowing what that audience wants and needs, can often be correlated with diasporic-ness (if that's a word) too...

04.02.2026 02:24 👍 1 🔁 0 💬 0 📌 0

I am sure someone has studied this, but if not, some day I will study how the dialects/mannerisms of characters differ within the same book and across books, depending on who the character is talking to, who the author is, etc.

It would make a cool DH project too...

CC: @mariaa.bsky.social 🙈

03.02.2026 21:27 👍 2 🔁 0 💬 1 📌 0

I have always found it very interesting to think about the differences in depiction between diasporic authors like Jhumpa Lahiri and non-diasporic authors. And also non-EN works (usually translated). Scenes, but also character mannerisms and dialects.

03.02.2026 21:27 👍 1 🔁 0 💬 2 📌 0

This is a real banger of a paper. The example of a model being weirdly focused on jasmine (lol) makes me increasingly think that single-point-of-access models don't really consider who their audience is. Jasmine is a super legible cultural marker for people outside, but is so, _so_ generic.

03.02.2026 16:41 👍 12 🔁 4 💬 2 📌 0

THIS is pretty much what the person said. One quote isn't in the paper because it didn't align with any of the categories, but one of my fav ones was: "it sounds like someone visited Bangalore for three days and wrote this story". The systems really miss emic representation!

03.02.2026 17:02 👍 3 🔁 0 💬 1 📌 0

Deadline for submission is in just under 10 days! Reach out if you have any questions.

03.02.2026 01:40 👍 1 🔁 1 💬 0 📌 0
TALES: A Taxonomy and Analysis of Cultural Representations in LLM-generated Stories Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how they represent diverse cultures. However, evaluating cultural represe...

Excited that this will be my first ever CHI! See you in Barcelona (visa gods permitting).

In the meantime, check out our paper:

arxiv.org/abs/2511.21322

and use our data:

cultural-misrepresentations.github.io

10/10

02.02.2026 21:38 👍 9 🔁 0 💬 0 📌 0

It was a blast working on this with Kirti and awesome collaborators, Athul, @imadityav.bsky.social, Shachi, and @danishpruthi.bsky.social

Big thanks to friends and colleagues at @ltiatcmu.bsky.social, IISc Bangalore, IIT Madras, Google DeepMind, @cornellbowers.bsky.social for feedback and support.

02.02.2026 21:38 👍 6 🔁 0 💬 1 📌 0

Surprisingly, models CAN answer questions, even those from misreps THEY MADE!!

So, lack of cultural knowledge is NOT the problem!

There is a clear headroom for model improvement here: models are unable to use cultural knowledge appropriately in long-form creative writing!

8/10

02.02.2026 21:38 👍 3 🔁 0 💬 2 📌 0

This begs the question: do models simply lack the knowledge that could have prevented these misreps?

We converted the misrep span annotations into a question bank ✨ TALES-QA ✨ using GPT-4.1 and had it fully human-verified by our cultural experts.

7/10

02.02.2026 21:38 👍 3 🔁 0 💬 1 📌 0

We find:

1️⃣ Stories in mid- and low-resource langs have more misreps

2️⃣ Stories from tier-2 and tier-3 regions have more factual and cultural misreps

3️⃣ Most misreps are about social norms, practices, & food

Lots of other analysis in paper! Clear avenues of improvement!

6/10

02.02.2026 21:38 👍 5 🔁 1 💬 1 📌 0

Armed with ✨ TALES-Tax ✨ we ran a large-scale evaluation of 6 LLMs in 13 Indic langs + EN

We partnered with AI4Bharat to recruit 108 experts from 71 different regions in India. They annotated 540 stories leading to about 3K span annotations of misrepresentations.

5/10

02.02.2026 21:38 👍 4 🔁 0 💬 1 📌 0

The types of misrepresentations that emerge range from mistakes in cultural details, language, logic, and facts. People also frequently noticed cliches and romanticized unlikely scenarios. Finally, participants found overgeneralizations when cultural nuance was lost.

4/10

02.02.2026 21:38 👍 5 🔁 0 💬 1 📌 0

We developed ✨ TALES-Tax ✨, a taxonomy of cultural misrepresentations, after conducting focus groups (N=9) and interviews (N=15)

Participation is important! Cultural representation is subjective and nuanced, so we consulted the experts, the people, to tell us what LLMs get wrong!

3/10

02.02.2026 21:38 👍 4 🔁 0 💬 1 📌 0

We introduce:

✨ TALES-Tax ✨ a taxonomy of cultural misrepresentations

✨ TALES-QA ✨ a question bank to test cultural knowledge

Lots of (human) work went into building them.

All the data & code: cultural-misrepresentations.github.io

2/10

02.02.2026 21:38 👍 4 🔁 0 💬 1 📌 0

🎭 How do LLMs (mis)represent culture?
🧮 How often?
🧠 Misrepresentations = missing knowledge? spoiler: NO!

At #CHI2026 we are bringing ✨TALES✨ a participatory evaluation of cultural (mis)reps & knowledge in multilingual LLM-stories for India

📜 arxiv.org/abs/2511.21322

1/10

02.02.2026 21:38 👍 45 🔁 21 💬 1 📌 2

CS ArXiv recently banned “review and position” papers, but what are those? Do they include more generated content? Who is most affected by this change? @yanai.bsky.social and I dug into the data to find out!

Nearly 50% of Computers & Society papers might be censored, vs 3% of Computer Vision ‼️

29.01.2026 14:14 👍 42 🔁 19 💬 2 📌 0

NOOO. Tok tok!!!

23.01.2026 01:17 👍 2 🔁 0 💬 1 📌 0

- being able to easily and controllably cross-post with other social media like LinkedIn / twitter
- my use tends to increase when looking for internships etc. so features to more easily discover hiring posts or indicate interest in being hired
- being able to pin multiple paper threads

15.01.2026 17:18 👍 3 🔁 0 💬 0 📌 0

Congratulations 🎉🎉

13.01.2026 12:40 👍 1 🔁 0 💬 0 📌 0

How do you remove such followers? I can see a block account option, but not an explicit remove follower.

04.01.2026 14:05 👍 0 🔁 0 💬 1 📌 0

✨The NLP+CSS workshop is returning to ACL 2026!✨

And this year, we have a new shared task with prizes!

Website/CfP: sites.google.com/site/nlpandc...
Deadlines: March 5 (direct), March 24 (pre-reviewed ARR)

#NLProc #CompSocialSci #ComputationalSocialScience #ACL2026NLP
@aclmeeting.bsky.social

18.12.2025 12:38 👍 18 🔁 12 💬 0 📌 3

What is the future of reading? 📗

Announcing the 1st Science & Technology of Augmented Reading (STAR) workshop at #CHI2026!

We want your takes on: 🤖 AI & Agents for reading 👁️ Visual Interactions 🗺️ Domains (Code, Law, Ed, etc.)

👇 Submit a 2-4 pg paper: chi-star-workshop.github.io

19.12.2025 18:13 👍 5 🔁 2 💬 0 📌 1
Screenshot of paper title and authors. 

Title: Social Story Frames: Contextual Reasoning about Narrative Intent and Reception
Authors: Joel Mire, Maria Antoniak, Steven R. Wilson, Zexin Ma, Achyutarama R. Ganti, Andrew Piper, Maarten Sap


Reading social media stories evokes a wide range of contextual reader reactions—inferential, affective, evaluative—yet we lack methods to study these at scale.

Excited to share our new paper that builds a framework for analyzing storytelling practices across online communities!

19.12.2025 23:05 👍 22 🔁 7 💬 1 📌 1

Why read a 300-word "abstract" summary of a paper written by the actual authors when one can read a 300-word summary produced by an AI prone to hallucinations?

16.12.2025 20:16 👍 75 🔁 16 💬 5 📌 3