
Taylor Sorensen

@taylor-sorensen

NLP PhD Candidate at UW

245
Followers
256
Following
33
Posts
11.10.2023
Joined

Latest posts by Taylor Sorensen @taylor-sorensen

Also check out these very cool related papers exploring diversity and mode collapse!
arxiv.org/pdf/2505.00047 from Peter West et al.

arxiv.org/abs/2504.05228 and arxiv.org/abs/2404.10859 by Yiming Zhang, Daphne Ippolito, et al.

arxiv.org/abs/2510.01171 by Jiayi Zhang et al.

08.10.2025 14:27 👍 1 🔁 0 💬 0 📌 0
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability Language model post-training has enhanced instruction-following and performance on many downstream tasks, but also comes with an often-overlooked cost on tasks with many possible valid answers. We cha...

Check it out!
Paper: arxiv.org/abs/2510.06084
Models: huggingface.co/collections/...
Data and code: github.com/tsor13/spect...

08.10.2025 14:26 👍 1 🔁 0 💬 1 📌 0

In new work, we introduce a simple post-training method and large-scale resource for maximizing diversity and coverage! We call it Spectrum Tuning.
More on this in the coming days - but I'm really excited about this work, and am so happy that it's now public

08.10.2025 14:25 👍 1 🔁 0 💬 1 📌 0

Pretrained models are better at this - they actually give you substantively different outputs when you sample. BUT, they are unable to reliably follow instructions.
How can we train models to follow instructions AND to span the space of possible outputs?

08.10.2025 14:25 👍 1 🔁 0 💬 1 📌 0

This may seem like a silly toy example - shouldn’t we just use np.random.randint()?
Fair - but this simple case is illustrative of a broader weakness. What about creative writing? Or hypothesis generation? Or diverse data generation?
We need models that SPAN the entire output space.
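
As a toy quantification of "spanning the output space" (my illustration, not from the paper): score a batch of samples by how much of the output space they cover and by their empirical entropy. The `collapsed` and `diverse` lists are made-up stand-ins for model outputs when asked for a number from 1 to 10.

```python
import math
from collections import Counter

def coverage(samples, support_size):
    """Fraction of the possible outputs that the samples actually hit."""
    return len(set(samples)) / support_size

def empirical_entropy(samples):
    """Empirical Shannon entropy of the samples, in bits."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

# Hypothetical stand-ins for sampled model outputs:
collapsed = [7] * 95 + [3] * 5              # a mode-collapsed post-trained model
diverse = [1 + i % 10 for i in range(100)]  # a model that spans the space

print(coverage(collapsed, 10), empirical_entropy(collapsed))  # 0.2, ~0.29 bits
print(coverage(diverse, 10), empirical_entropy(diverse))      # 1.0, ~3.32 bits
```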

08.10.2025 14:25 👍 1 🔁 0 💬 1 📌 0

Current post-training teaches a model to output the highest reward answer, even if there are other good answers. E.g. when picking random numbers, 7 seems like the most β€œrandom” number to annotators - so models ALWAYS pick 7!
arxiv.org/pdf/2505.00047
arxiv.org/pdf/2203.02155
arxiv.org/pdf/2510.01171

08.10.2025 14:24 👍 1 🔁 0 💬 1 📌 0

Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same joke! This is true across samples and even across model families!

Why does this happen? Can we improve it?

08.10.2025 14:22 👍 4 🔁 2 💬 1 📌 0

Others can disagree, but in my view it's important to understand the science behind diverse viewpoint modeling both for understanding potential risks and for building prosocial systems. AI alignment especially is a case where I think if we aren't careful, it will be easy for people to be left behind

21.03.2025 21:59 👍 5 🔁 0 💬 0 📌 0

That is a risk to personalization technologies like these 😬 However, I'm excited about AI's potential to _reduce_ polarization ( www.pnas.org/doi/full/10....), find common ground (www.science.org/doi/10.1126/...), and help people gain sympathy by exploring others' perspectives (ongoing work)

21.03.2025 21:55 👍 2 🔁 0 💬 1 📌 0

That being said, I'm in total agreement that there are absolutely risks to the technology as well, and figuring out which technologies to deploy where in what way will be very important.

21.03.2025 21:36 👍 0 🔁 0 💬 0 📌 0

Additionally, I think steerability to diverse perspectives becomes even more important as AI systems start having more autonomy (e.g. AI agents). I want an AI agent that knows _my_ perspective, not just the average one!

21.03.2025 21:36 👍 0 🔁 0 💬 1 📌 0

That being said, it's my belief that all model responses already have _a_ worldview associated with them - so personally, I think it's important to have systems where a) we can measure/explicitly see what perspectives they are being aligned to, so that b) many people's perspectives can be included.

21.03.2025 21:36 👍 0 🔁 0 💬 1 📌 0

Apologies if this wasn't clear! They're provided textually in an in-context prompt, which the model then tries to steer towards

And yes you are absolutely right, that's one of the risks of personalization in general (see great paper here: arxiv.org/pdf/2303.05453)

21.03.2025 21:36 👍 0 🔁 0 💬 1 📌 0
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. ...

Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
πŸ™…β€β™€οΈ Model weights
πŸ™…β€β™€οΈ Training data
πŸ™…β€β™€οΈ Token probabilities 🧡 (1/5)

21.03.2025 19:08 👍 97 🔁 27 💬 4 📌 8

This was my Google DeepMind internship work with amazing collaborators Pushkar Mishra, @romapatel.bsky.social, Michael Henry Tessler, @mbakker.bsky.social, Georgina Evans, Iason Gabriel, @noahdgoodman.bsky.social, @verenarieser.bsky.social
They are a really amazing team!

20.03.2025 04:02 👍 1 🔁 0 💬 0 📌 0

Read the full paper here!
arxiv.org/abs/2503.15484

20.03.2025 03:57 👍 0 🔁 0 💬 1 📌 0

Value profiles enable new ways to model variation and enable representation at the individual level. We hope that our work helps enable systems that better model diverse perspectives and that work for everyone (yay for #pluralisticalignment #nlpforsocialgood #compsocialscience #compdemocracy!)

20.03.2025 03:57 👍 0 🔁 0 💬 1 📌 0

There are benefits though!
✅ Value profiles may enhance user agency, as a person could change their own value profile
✅ They enable value reflection via bottom-up discovery and top-down editing
✅ Unlike sociodemographics, which are often unchosen, people can choose values for themselves
(16/?)

20.03.2025 03:56 👍 0 🔁 0 💬 1 📌 0

Our goal in this work is to improve AI systems' ability to model diverse perspectives and better serve more people! However, risks remain
❌ Privacy risks: people may not wish values inferred
❌ Systems may fail to generalize to less common values, and we only test on English-language data
(15/?)

20.03.2025 03:55 👍 0 🔁 0 💬 2 📌 0

As a last experiment, we simulate an annotator population ("jury learning", @mitchellgordon) using our trained models and value profiles.

We find that the instance-level interannotator agreement (IAA) predicted by our simulated population correlates with the observed IAA.
(14/?)
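
Concretely, instance-level IAA can be computed as the fraction of annotator pairs that agree on an item, and the simulated-vs-observed comparison is then a correlation across items. A minimal sketch with made-up labels (not the paper's data or necessarily its exact metric):

```python
def pairwise_agreement(labels):
    """Fraction of annotator pairs that give the same label to one instance."""
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    return sum(a == b for a, b in pairs) / len(pairs)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up labels: 3 instances, each rated by 4 observed / simulated annotators.
observed = [["ok", "ok", "ok", "bad"], ["ok", "bad", "ok", "bad"], ["ok"] * 4]
simulated = [["ok", "ok", "bad", "ok"], ["bad", "bad", "ok", "ok"], ["ok"] * 4]

obs_iaa = [pairwise_agreement(l) for l in observed]
sim_iaa = [pairwise_agreement(l) for l in simulated]
print(pearson(obs_iaa, sim_iaa))  # agreement patterns match here, so ~1.0
```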

20.03.2025 03:53 👍 0 🔁 0 💬 1 📌 0

We also find that our value profile system is very well-calibrated.

This calibration is important for trusting the model's confidence and for disentangling value-related epistemic uncertainty from aleatoric uncertainty in rater variation.
(13/?)
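
For reference, one standard way to check calibration like this is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. A generic sketch (the paper may use a different calibration measure):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: weighted average gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += len(b) / total * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy case: 80% confidence, 80% empirical accuracy.
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))  # ~0.0
# Overconfident toy case: 90% confidence, 50% accuracy.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))  # ~0.4
```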

20.03.2025 03:52 👍 0 🔁 0 💬 1 📌 0

Value profiles are written in natural language. But are they actually semantically interpretable? Does the system change its judgments with wording changes in common-sense ways?

Yes, we find that semantic changes in value profile lead to expected changes in the output.
(12/?)

20.03.2025 03:52 👍 0 🔁 0 💬 1 📌 0

Clustering with value profiles also enables dataset-level qualitative analysis.

For example, for OQA/DIC, even restricting to just 2 clusters explains the majority of rater variation, suggesting a bimodal distribution.

Additionally, the profile descriptions suggest why people may disagree.
(11/?)

20.03.2025 03:52 👍 1 🔁 0 💬 1 📌 0

Our algorithm is effective at uncovering useful rater groupings, with the resulting value profile clusters outperforming the most performant demographic grouping!

Additionally, on the dataset where demographics helped most, the clusters partially recover ideological trends.
(10/?)

20.03.2025 03:51 👍 3 🔁 0 💬 1 📌 0

To characterize common modes of (dis)agreement, we introduce a value-based clustering algorithm.

Unlike traditional methods, ours: 1) does not require that raters label overlapping instances, 2) leverages semantic instance information, and 3) returns cluster descriptions.
(9/?)
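
To make the idea concrete, here is a heavily hedged sketch of one way value-based clustering could work (this is my illustration, not the paper's exact algorithm): embed each rater's natural-language value profile, run a small k-means over the embeddings, and describe each cluster by the member profile nearest its centroid. `embed` is a hypothetical text-embedding callable; the toy 1-D embedding below just uses string length as a stand-in.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_raters(profiles, embed, k=2, iters=20):
    """Cluster raters via k-means over embeddings of their value profiles;
    return cluster assignments plus a textual description per cluster."""
    vecs = [embed(p) for p in profiles]
    rng = random.Random(0)
    centroids = rng.sample(vecs, k)
    assign = [0] * len(vecs)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vecs]
        for j in range(k):
            members = [v for v, a in zip(vecs, assign) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    # Cluster "description": the member profile nearest each centroid.
    descs = [profiles[min(range(len(vecs)), key=lambda i: sq_dist(vecs[i], c))]
             for c in centroids]
    return assign, descs

assign, descs = cluster_raters(
    ["free speech", "free expression",
     "harm-avoidance and care", "care and protection from harm"],
    embed=lambda p: [float(len(p))],  # toy stand-in for a real text embedder
)
print(assign, descs)
```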

20.03.2025 03:51 👍 0 🔁 0 💬 1 📌 0

Since the value profiles are inferred (or compressed) from in-context examples, information can only be lost. But how much is retained?

We find that value profiles generated by Gemini preserve the majority (>70%) of the useful predictive information!
(8/?)

20.03.2025 03:51 👍 0 🔁 0 💬 1 📌 0

Okay, onto results - we find that:

- Using a rater's in-context examples improves predictions most
- Value profiles significantly improve predictions (but not as much as examples)
- Demographics _do not_ offer a significant performance boost (except for OpinionQA)
(7/?)

20.03.2025 03:50 👍 0 🔁 0 💬 1 📌 0

We carry out experiments on six datasets relevant to model alignment, content moderation, and computational social science!

OpinionQA (Santurkar et al.)
Hatespeech (Kumar et al.)
DICES (Aroyo et al.)
Habermas (Tessler/@mbakker.bsky.social et al.)
Prism @hannahrosekirk.bsky.social
ValuePrism
(6/?)

20.03.2025 03:49 👍 0 🔁 0 💬 1 📌 0

Given the different representations, how can we tell which is most useful for modeling variation?

We apply an information-theoretic methodology to measure the amount of model-usable information in a rater representation for predicting an individual's ratings.
(5/?)
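
The quantity being measured is along the lines of predictive v-usable information: how much a model's held-out predictive performance improves when it is also given the rater representation. A schematic sketch with made-up per-example negative log-likelihoods (the paper's estimator details may differ):

```python
def v_usable_information(nll_without, nll_with):
    """v-information style estimate, in bits: the drop in average held-out
    negative log-likelihood when the predictor also sees the rater
    representation. Both inputs are per-example NLLs in bits."""
    h_y = sum(nll_without) / len(nll_without)    # ~ H_v(Y)
    h_y_given_r = sum(nll_with) / len(nll_with)  # ~ H_v(Y | representation)
    return h_y - h_y_given_r

# Made-up numbers: conditioning on the representation lowers per-example NLL.
print(v_usable_information([1.0, 0.9, 1.1], [0.6, 0.5, 0.7]))  # ~0.4 bits
```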

20.03.2025 03:47 👍 0 🔁 0 💬 1 📌 0

In the absence of a person's value self-description, we infer rater values from data via an autoencoder setup.

An encoder proposes values that could explain a person's ratings, and a decoder generalizes to held-out examples based on the value description.
(4/?)
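
Schematically, that setup can be expressed as two LLM-backed callables (names and signatures here are illustrative stand-ins, not the paper's code):

```python
def infer_value_profile(encode, decode, train_ratings, heldout_items):
    """Autoencoder-style inference sketch:
      encode: list of (item, rating) -> candidate natural-language value profile
      decode: (profile, item)        -> predicted rating
    The profile is judged by how well the decoder generalizes with it
    to held-out items."""
    profile = encode(train_ratings)
    predictions = {item: decode(profile, item) for item in heldout_items}
    return profile, predictions

# Dummy stand-ins for the LLM calls, just to show the data flow:
profile, preds = infer_value_profile(
    encode=lambda ratings: "values free expression over harm avoidance",
    decode=lambda profile, item: ("acceptable" if "expression" in profile
                                  else "unacceptable"),
    train_ratings=[("edgy joke", "acceptable")],
    heldout_items=["political satire"],
)
print(profile, preds)
```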

20.03.2025 03:46 👍 0 🔁 0 💬 1 📌 0