Caleb Ziems

@calebziems.com

PhD student at Stanford NLP. Working on Social NLP and CSS. Previously at GaTech, Meta AI, Emory. 📍Palo Alto, CA 🔗 calebziems.com

1,789
Followers
484
Following
17
Posts
06.11.2024
Joined
Latest posts by Caleb Ziems @calebziems.com

Thanks to many @stanfordnlp.bsky.social members for feedback! @juliakruk.bsky.social @yanzhe.bsky.social @myra.bsky.social @jaredlcm.bsky.social

May be of interest to @paul-rottger.bsky.social @monadiab77.bsky.social @vinodkpg.bsky.social @dbamman.bsky.social @davidjurgens.bsky.social and you

04.11.2025 18:04 👍 0 🔁 0 💬 0 📌 0

Our implementation of Culture Cartography is based on Farsight (Wang et al., 2024).

This was an interdisciplinary effort across computer science (@diyiyang.bsky.social, @williamheld.com, Jane Yu) and sociology (David Grusky and Amir Goldberg), and the research process taught me so much!

04.11.2025 17:38 👍 0 🔁 0 💬 1 📌 0

Finally, Culture Cartography is aligned with prior notions of culture evals in our field.

We observe positive transfer performance from Cartography to two leading benchmarks: BLEnD (Myung et al., 2024) and CulturalBench (Chiu et al., 2024).

04.11.2025 17:35 👍 0 🔁 0 💬 1 📌 0

Compared to knowledge extraction, Culture Cartography is less prone to test-set contamination.

We evaluate GPT-4o with and without search and find no significant difference in their recall on Cartography data.

Culture Cartography is "Google proof" since search doesn't help.

04.11.2025 17:34 👍 0 🔁 0 💬 1 📌 0

Compared to traditional annotation, Culture Cartography more often elicits knowledge that is unknown to LLMs.

Qwen-2 72B recalls 21% less Cartography data than traditional data (p < .0001).

Even a strong reasoning model (R1) is challenged more by our data.

04.11.2025 17:33 👍 0 🔁 0 💬 1 📌 0
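A recall gap like the one above is typically checked with a two-proportion z-test. A minimal sketch, with hypothetical counts chosen only to illustrate a 21-point gap (these are not the paper's actual numbers):

```python
from statistics import NormalDist

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided two-proportion z-test: does recall k1/n1 differ from k2/n2?"""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)                       # pooled success rate
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    return z, p_value

# Hypothetical: a model recalls 550/1000 traditionally annotated items
# but only 340/1000 Cartography items (a 21-point gap).
z, p = two_proportion_z(550, 1000, 340, 1000)
```

With a gap this large at n = 1000 per condition, the test returns a p-value far below .0001.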

We propose a mixed-initiative method called Culture Cartography.

To find culturally-representative knowledge, we let the human steer towards what they find most salient.

And to find challenging questions, we let the LLM steer towards topics where it has low confidence.

04.11.2025 17:33 👍 0 🔁 0 💬 1 📌 0

Other benchmarks use knowledge extracted from the rich cultural artifacts that humans actively produce on the web.

Still, this is a single-initiative process.

Researchers can't steer the distribution towards questions of interest (i.e., those that challenge LLMs).

04.11.2025 17:32 👍 1 🔁 0 💬 1 📌 0

How are prior benchmarks constructed?

In traditional annotation, the researcher picks some questions and the annotator passively provides ground truth answers.

This is single-initiative.

Annotators don't steer the process, so their interests and culture may not be represented.

04.11.2025 17:32 👍 0 🔁 0 💬 1 📌 0

Can we map out gaps in LLMs' cultural knowledge?

Check out our #EMNLP2025 talk: Culture Cartography

🗓️ 11/5, 11:30 AM
📌 A109 (CSS Orals 1)

Compared to traditional benchmarking, our mixed-initiative method finds more knowledge gaps even in reasoning models like R1!

Paper: arxiv.org/pdf/2510.27672

04.11.2025 17:31 👍 1 🔁 1 💬 1 📌 0
Screenshot of paper title: Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence

AI always calling your ideas "fantastic" can feel inauthentic, but what are sycophancy's deeper harms? We find that in the common use case of seeking AI advice on interpersonal situations, specifically conflicts, sycophancy makes people feel more right and less willing to apologize.

03.10.2025 22:53 👍 115 🔁 48 💬 2 📌 7
A screenshot of our paper's:

Title: A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms
Authors: Emma Harvey, Rene Kizilcec, Allison Koenecke
Abstract: Increasingly, individuals who engage in online activities are expected to interact with large language model (LLM)-based chatbots. Prior work has shown that LLMs can display dialect bias, which occurs when they produce harmful responses when prompted with text written in minoritized dialects. However, whether and how this bias propagates to systems built on top of LLMs, such as chatbots, is still unclear. We conduct a review of existing approaches for auditing LLMs for dialect bias and show that they cannot be straightforwardly adapted to audit LLM-based chatbots due to issues of substantive and ecological validity. To address this, we present a framework for auditing LLM-based chatbots for dialect bias by measuring the extent to which they produce quality-of-service harms, which occur when systems do not work equally well for different people. Our framework has three key characteristics that make it useful in practice. First, by leveraging dynamically generated instead of pre-existing text, our framework enables testing over any dialect, facilitates multi-turn conversations, and represents how users are likely to interact with chatbots in the real world. Second, by measuring quality-of-service harms, our framework aligns audit results with the real-world outcomes of chatbot use. Third, our framework requires only query access to an LLM-based chatbot, meaning that it can be leveraged equally effectively by internal auditors, external auditors, and even individual users in order to promote accountability. To demonstrate the efficacy of our framework, we conduct a case study audit of Amazon Rufus, a widely-used LLM-based chatbot in the customer service domain. Our results reveal that Rufus produces lower-quality responses to prompts written in minoritized English dialects.

I am so excited to be in 🇬🇷Athens🇬🇷 to present "A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms" by me, @kizilcec.bsky.social, and @allisonkoe.bsky.social, at #FAccT2025!!

🔗: arxiv.org/pdf/2506.04419

23.06.2025 14:44 👍 31 🔁 10 💬 1 📌 2

AI companions aren't science fiction anymore 🤖💬❤️
Thousands are turning to AI chatbots for emotional connection – finding comfort, sharing secrets, and even falling in love. But as AI companionship grows, the line between real and artificial relationships blurs.

18.06.2025 16:27 👍 6 🔁 3 💬 1 📌 0
Comprehensive Assessment for Voice Assistants: CAVA is a new benchmark for assessing how well Large Audio Models support voice assistant capabilities.

Introducing CAVA: The Comprehensive Assessment for Voice Assistants

A new benchmark for evaluating the capabilities required for speech-in-speech-out voice assistants!

- Latency
- Instruction following
- Function calling
- Tone awareness
- Turn taking
- Audio safety

TalkArena.org/cava

07.05.2025 16:15 👍 0 🔁 1 💬 1 📌 0
Screenshot of Arxiv paper title, "Rejected Dialects: Biases Against African American Language in Reward Models," and author list: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap.

Reward models for LMs are meant to align outputs with human preferences, but do they accidentally encode dialect biases? 🤔

Excited to share our paper on biases against African American Language in reward models, accepted to #NAACL2025 Findings! 🎉

Paper: arxiv.org/abs/2502.12858 (1/10)

06.03.2025 19:49 👍 38 🔁 11 💬 1 📌 2

EgoNormia (egonormia.org) exposes a major gap in Vision-Language Models' understanding of the social world: they don't know how to behave when norms about the physical world *conflict* ⚔️ (<45% acc.)

But humans are naturally quite good at this (>90% acc.)

Check it out!

➡️ arxiv.org/abs/2502.20490

04.03.2025 04:44 👍 8 🔁 2 💬 0 📌 0
Culture is not trivia: sociocultural theory for cultural NLP. By Naitian Zhou and David Bamman from the Berkeley School of Information and Isaac L. Bleaman from Berkeley Linguistics.

There's been a lot of work on "culture" in NLP, but not much agreement on what it is.

A position paper by me, @dbamman.bsky.social, and @ibleaman.bsky.social on cultural NLP: what we want, what we have, and how sociocultural linguistics can clarify things.

Website: naitian.org/culture-not-...

1/n

18.02.2025 20:45 👍 121 🔁 35 💬 5 📌 3

LM agents today primarily aim to automate tasks. Can we turn them into collaborative teammates? 🤖➕👀

Introducing Collaborative Gym (Co-Gym), a framework for enabling & evaluating human-agent collaboration! I've already gotten used to agents proactively seeking my confirmation or deeper thinking. (🧡 with video)

17.01.2025 17:44 👍 22 🔁 10 💬 1 📌 1

Bill Labov died this morning. I'm not coherent enough to talk about how important and influential and brilliant he was. I am very sad.

I was so lucky to know him, and I am grateful every day that he (and Gillian, and Walt, etc) built an academic field where kindness is expected.

18.12.2024 02:08 👍 699 🔁 120 💬 24 📌 25
Talk Arena: Interactive Evaluation of Large Audio Models

With an increasing number of Large *Audio* Models 🔊, which one do users like the most?

Introducing talkarena.org, an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks.
🧡 (1/5)

10.12.2024 00:01 👍 30 🔁 8 💬 3 📌 3

Maybe some starter packs for the Dyirbal noun classes?

1. most animate objects, men
2. women, water, fire, violence, and exceptional animals
3. edible fruit and vegetables
4. miscellaneous (includes things not classifiable in the first three)

24.11.2024 17:53 👍 10 🔁 1 💬 0 📌 0
AI is not the GOAT. (Uh oh, your professor is attempting stand up comedy.) YouTube video by Casey Fiesler

Hi Bluesky! You get to be the very first internet people to see my standup comedy debut. Because I know you'll be nicer to me than the 12-year-olds on TikTok. youtu.be/KqL2ahOvAgg?...

23.11.2024 18:52 👍 72 🔁 7 💬 8 📌 3

I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux

Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!

23.11.2024 19:54 👍 176 🔁 54 💬 101 📌 4

go.bsky.app/VZBhuJ5

22.11.2024 02:42 👍 1 🔁 0 💬 0 📌 0

👋

19.11.2024 20:48 👍 2 🔁 0 💬 0 📌 0

@butanium.bsky.social I nominate @aryaman.io

19.11.2024 16:57 👍 2 🔁 0 💬 0 📌 1
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.

I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.

Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park 😎

Apply by December 15th!

19.11.2024 10:38 👍 303 🔁 136 💬 9 📌 12

Repost if you've participated in a Summer Institute in Computational Social Science. Let's get #SICSS Bluesky going!

08.10.2023 19:49 👍 51 🔁 63 💬 0 📌 3
resources | Julia Mendelsohn: Materials that some people might find helpful

I'm sharing materials from my academic job search last year! Includes research, teaching, and diversity statements, plus my UMD cover letter and job talk slides. (I applied for a mix of iSchool, data sci, CS, and linguistics positions.) Feel free to share!
juliamendelsohn.github.io/resources/

18.11.2024 16:00 👍 70 🔁 12 💬 0 📌 1

All the ACL chapters are here now: @aaclmeeting.bsky.social @emnlpmeeting.bsky.social @eaclmeeting.bsky.social @naaclmeeting.bsky.social #NLProc

19.11.2024 03:48 👍 107 🔁 37 💬 1 📌 3

I wanted to contribute to "Starter Pack Season" with one for Stanford NLP+HCI: go.bsky.app/VZBhuJ5

Here are some other great starter packs:

- CSS: go.bsky.app/GoEyD7d + go.bsky.app/CYmRvcK
- NLP: go.bsky.app/SngwGeS + go.bsky.app/JgneRQk
- HCI: go.bsky.app/p3TLwt
- Women in AI: go.bsky.app/LaGDpqg

15.11.2024 19:20 👍 25 🔁 10 💬 2 📌 2