Thank you!
Feel free to reach out here or elsewhere if any of this interests you!
I'm also really interested in epistemological questions about how best to study LMs, relating primarily to construct validity (are the tasks we use appropriate?) and external validity (are the findings we obtain generalizable, and to which "populations" of LMs?).
Or whether and to what extent human behavior on mental state reasoning tasks (e.g., the false belief task) can be approximated by LMs trained solely on the distributional statistics of language—and how said LMs appear to solve such tasks.
E.g., asking whether human representations of ambiguous words (as operationalized by behavior on psycholinguistic tasks) can be approximated by the continuous representations in transformer LMs.
Much (though not all) of my current research uses language models as "model organisms" to test theories about human cognition—and, increasingly, adapts methods and conceptual frameworks from Cognitive Science to better understand the behaviors and internal mechanisms of LMs.
Our lab will work on questions at the intersection of language, cognition, and computation, like how humans represent ambiguous words; which factors plausibly give rise to our ability to reason about mental states; and how linguistic representations are integrated with sensorimotor experience.
Very excited to announce that I'll be starting as an Assistant Professor in the Psychology department at Rutgers University-Newark in January 2026!
(Thanks to @camrobjones.bsky.social, Pam Rivière, Oisín Parkinson-Coombs, and Kola Ayonrinde for valuable comments on various iterations of this work!)
In general I think there's tons of interesting work to be done exploring what *kinds of claims* generalize across *which kinds of model instances*!
Paper link here: openreview.net/pdf?id=sZZIO...
It's also possible that we live in a world where many mechanisms won't generalize at all, or won't generalize along most of these dimensions. But knowing that depends on having the investigatory framework in the first place—this paper is a first stab at systematizing that.
Another problem is that this is simply very hard to implement: we don't have multiple random seeds for most models! Indeed, the problem is even worse: the available models are *not* a representative sample of possible models! But this should just make us more cautious about our conclusions.
I conclude by discussing potential objections. E.g., if interpretability is intended to be idiographic rather than nomothetic, then we don't really need a framework for generalizability. But if we do want to generalize, then organizing principles are key.
Figure: larger models show earlier onset, higher peak, and steeper slope of 1-back attention.
Additionally, seeds of larger models generally show earlier onsets, higher peaks, and steeper slopes of 1-back attention over the course of training. There's also some positional variation, though putative 1-back heads usually appear in earlier layers.
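To make "onset," "peak," and "slope" concrete, here's a rough sketch of how such trajectory statistics could be extracted by smoothing a head's 1-back score over training checkpoints with a GAM. The library (pygam), the log-step transform, and the 50%-of-peak onset rule are my assumptions for illustration, not necessarily the paper's procedure.

```python
# Hedged sketch: extract onset/peak/slope of a head's 1-back trajectory
# by smoothing scores over training steps with a GAM. pygam, the log-step
# transform, and the 50%-of-peak onset rule are illustrative assumptions.
import numpy as np
from pygam import LinearGAM, s

def trajectory_stats(steps, scores, onset_frac=0.5):
    """steps: checkpoint steps; scores: a head's 1-back score at each one."""
    X = np.log10(np.asarray(steps, dtype=float) + 1).reshape(-1, 1)
    y = np.asarray(scores, dtype=float)
    gam = LinearGAM(s(0)).fit(X, y)
    grid = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)
    pred = gam.predict(grid)
    peak = pred.max()
    onset = grid[np.argmax(pred >= onset_frac * peak), 0]  # first crossing
    slope = np.gradient(pred, grid.ravel()).max()          # steepest rise
    return {"onset_log10_step": onset, "peak": peak, "max_slope": slope}

# e.g., trajectory_stats([0, 1000, 3000, 13000, 143000],
#                        [0.02, 0.05, 0.40, 0.80, 0.75])
```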
Figure: development of putative 1-back attention heads across seeds in Pythia models.
I then test select axes with a very simple example (1-back attention) across random seeds of the Pythia suite. Consistent with other work, I find strong *developmental convergence* across seeds and also (to a lesser extent) across architectures. (Red line = GAM predictions across all models.)
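For concreteness, a minimal sketch of how a per-head 1-back score could be measured in a Pythia checkpoint (mean attention from each token to the immediately preceding token). The model name, revision, and input text are illustrative; the paper's exact metric may differ.

```python
# Hedged sketch: per-head "1-back" score = mean attention from each token
# to the immediately preceding token. Model/revision are illustrative;
# Pythia exposes training checkpoints as git branches like "step3000".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name, rev = "EleutherAI/pythia-160m", "step3000"  # swap in seed variants as available
tok = AutoTokenizer.from_pretrained(name, revision=rev)
model = AutoModelForCausalLM.from_pretrained(
    name, revision=rev, attn_implementation="eager"  # eager returns attention weights
).eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
for layer, attn in enumerate(out.attentions):
    # Subdiagonal = attention from position i to position i-1; average it.
    one_back = attn[0].diagonal(offset=-1, dim1=-2, dim2=-1).mean(dim=-1)
    h = one_back.argmax().item()
    print(f"layer {layer}: strongest 1-back head {h} ({one_back[h]:.3f})")
```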
Axes include Functional (~same behavior and responsiveness to ablations), Developmental (emerge at similar points in training), Positional (at similar layers/depths), Relational (interact with other components), and Configurational (similar weight-space regions).
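As one concrete illustration of the Functional axis, here's a hedged sketch of a "responsiveness to ablations" check: remove a head's contribution (by zeroing its slice of the output projection) and measure the change in loss on text that exercises the behavior. Module paths follow the Hugging Face GPT-NeoX (Pythia) implementation; the model, text, and layer/head indices are placeholders, not the paper's actual procedure.

```python
# Hedged sketch of a functional-correspondence check via head ablation.
# Zeroing columns of attention.dense removes that head's contribution,
# since head outputs are concatenated before the output projection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def loss(model, ids):
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def ablation_effect(model, ids, layer, head):
    dense = model.gpt_neox.layers[layer].attention.dense
    d_head = model.config.hidden_size // model.config.num_attention_heads
    sl = slice(head * d_head, (head + 1) * d_head)
    base = loss(model, ids)
    saved = dense.weight.data[:, sl].clone()
    dense.weight.data[:, sl] = 0.0     # ablate this head's contribution
    ablated = loss(model, ids)
    dense.weight.data[:, sl] = saved   # restore original weights
    return ablated - base              # > 0 means the head mattered here

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m").eval()
ids = tok("one two three one two three one two", return_tensors="pt").input_ids
print(ablation_effect(model, ids, layer=3, head=5))  # placeholder indices
```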
Drawing on recent interp literature, I first identify/propose five potential *axes of correspondence* along which the generalizability of mechanistic claims could be investigated. You can think of this as a set of organizing principles to guide investigations of generalizability/universality.
The issue of generalizability is not limited to mechinterp: we see it in psychology (e.g., "WEIRD" subjects) and in work on LLM behavior more generally. But the nature of interp research raises a further question: what does it even mean to say two instances have the "same" circuit?
Mechinterp research typically aims to identify *circuits* implementing functions in particular model instances. But it's unclear whether and when findings *generalize* to other model instances.
Image: screenshot of the paper title.
Will be presenting a new paper on generalizability in mechinterp research at the 2025 NeurIPS MechInterp workshop! Thread below. #NeurIPS
Does vision training change how language is represented and used in meaningful ways?🤔The answer is a nuanced yes! Comparing VLM-LM minimal pairs, we find that while the taxonomic organization of the lexicon is similar, VLMs are better at _deploying_ this knowledge. [1/9]
I've been working on a related question: along which *correspondence axes* (developmental, etc.) can we reasonably expect mechanistic claims to generalize across instances? Will be presenting this work (along with a case study) at the NeurIPS 2025 MechInterp workshop: openreview.net/pdf?id=sZZIO...
I think understanding which factors lead to convergence and divergence (both in behavior and internal mechanisms) across networks is crucial to understanding what kinds of systems we're studying and what kinds of claims we can generalize across model instances. Very cool work!
This is really cool! I've been doing some related work on seed-wise variability that I'll actually be presenting at the NeurIPS MechInterp workshop (openreview.net/pdf?id=sZZIO...). Will try to make it to your presentation/poster!
📍Excited to share that our paper was selected as a Spotlight at #NeurIPS2025!
arxiv.org/pdf/2410.03972
It started from a question I kept running into:
When do RNNs trained on the same task converge/diverge in their solutions?
🧵⬇️
A confounding thing for the linguistics of LMs: the best way to assess their grammatical ability is string probability. Yet string probability and grammaticality are famously not the same!
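To illustrate the standard workaround, a minimal sketch of the minimal-pair approach: compare log-probabilities of a matched grammatical/ungrammatical pair, which controls for length and lexical frequency within the pair but doesn't dissolve the deeper probability/grammaticality mismatch. GPT-2 and the agreement pair here are just for illustration.

```python
# Hedged sketch: minimal-pair comparison of summed string log-probabilities,
# rather than treating raw probability as grammaticality. Model and pair
# are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob(s: str) -> float:
    ids = tok(s, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is mean NLL over the seq_len - 1 predicted tokens.
    return -(out.loss.item() * (ids.shape[1] - 1))

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print(logprob(good), logprob(bad))  # pair is matched on length/lexicon
```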
Really excited to have this out, where we give a formal account, w/ experiments, of how to make sense of that!
(By @raphaelmilliere.com and @cameronbuckner.bsky.social )
Thank you! I'd also recommend this philosophy-oriented overview of "interventionist" methods for studying neural networks: philpapers.org/rec/MILIMF-2
Congratulations!
Hard to process the news about Harvard and international students. Other universities should stand in solidarity with our colleagues who are being persecuted.
Please read my essay in TIME, which @science.org did not do carefully before publishing this assertion. time.com/7285045/resi...