CNTR AISLE
CNTR AISLE Portal
It's been a journey of nearly 3 years, but I'm very excited to announce the CNTR AISLE Portal! cntr-aisle.org It's a new way to review and evaluate the 1,000+ AI bills introduced in the U.S. over the last three years. Check out the Bill Library and our Profiles. #AIPolicy #OpenData
02.03.2026 17:00
GLMMs are just one approach – we're looking forward to more work on statistical frameworks for AI evaluation. Send questions/comments to caisi-metrology@nist.gov.
Paper (w/ the talented Drew Keller, Kweku Kwegyir-Aggrey, Anita Rao, Julia Sharp, and Stevie Bergman): nvlpubs.nist.gov/nistpubs/ai/...
19.02.2026 15:54
Fig. 6a from the paper: Distribution of estimated question difficulties by domain (GPQA-Diamond). Each dot indicates a GPQA-Diamond question's GLMM-estimated difficulty (i.e., random effect value); box plots display quartiles and violin plots display estimated density. These estimates show that GPQA-Diamond's chemistry questions were particularly difficult for the 22 tested LLMs.
Fig. 6b from the paper: Distribution of estimated question difficulties by writer-labeled difficulty (GPQA-Diamond), for the 191 questions at the three most common writer-annotated difficulty levels. Each dot indicates a question's GLMM-estimated difficulty; box plots display quartiles and violin plots display estimated density. Question difficulty for LLMs has a weak relationship with question-writer-labeled difficulty. This may suggest that humans and the tested LLMs find different questions difficult, and/or could call into question whether writer annotations are accurate even for human difficulty.
GLMMs have other benefits, too:
- We can estimate question difficulties to identify problematic questions and other patterns in benchmarks.
- Variance decomposition (between- and within-questions) can highlight nuances in performance between tasks, languages, and other subsets of a benchmark.
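As a rough illustration of the between-/within-question decomposition idea (the simulation setup and all numbers below are hypothetical, not from the paper), here is a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical scores: n questions x k repeated trials (1 = correct).
n, k = 100, 8
p_q = rng.beta(2, 2, n)  # varying per-question difficulty
x = rng.binomial(1, p_q[:, None], (n, k)).astype(float)

# One-way decomposition with equal group sizes:
# total variance = between-question variance + mean within-question variance.
grand = x.mean()
between = ((x.mean(axis=1) - grand) ** 2).mean()  # variance of question means
within = x.var(axis=1).mean()                     # avg trial-to-trial variance
total = x.var()

print(f"between={between:.4f} within={within:.4f} total={total:.4f}")
```

A large between-question component relative to the within-question component is one signal that question difficulty varies substantially across the benchmark (or a subset of it).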
19.02.2026 15:54
Ideally, evaluators should use a statistical model to explicitly define the estimand & other statistical assumptions.
We propose one approach using generalized linear mixed models. GLMMs can often estimate uncertainty more precisely than typical "regression-free" approaches.
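For readers who want to experiment, here is a minimal sketch of fitting a logistic GLMM (a fixed effect per model, a random intercept per question) on synthetic data using statsmodels' Bayesian mixed GLM; this is illustrative only, not the paper's code, and all data are simulated:

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(1)
# Synthetic eval data: 30 questions x 4 models x 3 trials each.
n_q, n_m, k = 30, 4, 3
q_eff = rng.normal(0, 1.0, n_q)          # per-question difficulty (random effect)
m_eff = np.array([0.0, 0.4, 0.8, 1.2])   # per-model ability (fixed effect)
rows = []
for q in range(n_q):
    for m in range(n_m):
        p = 1 / (1 + np.exp(-(m_eff[m] - q_eff[q])))
        for _ in range(k):
            rows.append({"correct": rng.binomial(1, p),
                         "model": f"m{m}", "question": f"q{q}"})
df = pd.DataFrame(rows)

# Logistic GLMM: fixed effect per model, random intercept per question.
glmm = BinomialBayesMixedGLM.from_formula(
    "correct ~ 0 + C(model)", {"question": "0 + C(question)"}, df)
fit = glmm.fit_vb()  # variational Bayes fit
print(fit.summary())
```

The fitted fixed effects estimate each model's ability on the log-odds scale, while the estimated random-effect variance summarizes how much difficulty varies across questions.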
19.02.2026 15:54
Fig. 1 from the paper: Comparing accuracy estimates (GPQA-Diamond). Lower plots show the estimated accuracy of a selection of tested LLMs* with 95% confidence intervals. Upper plots show corresponding confidence interval (CI) widths. Generalized accuracy CIs are larger than benchmark accuracy CIs because they account for the selection of benchmark items from a superpopulation. Notably, some pairs of LLMs may have significantly different benchmark accuracy but not generalized accuracy. The simple average (pink) estimates reflect the average across all n benchmark questions, with standard error calculated as the standard deviation of results divided by √n. For estimates of benchmark accuracy, the simple average method results in under-confident CIs compared to a valid regression-free method (blue). For estimates of generalized accuracy, the simple average method provides valid CIs, but precision can be increased by running more trials per item (as in the regression-free method). Generalized linear mixed model (GLMM, orange) estimates require additional assumptions but further increase precision.
AI evals rarely specify which question is being answered – but the choice matters, especially when it comes to computing error bars. (Assuming error bars are included at all…)
In particular, error bars for generalized accuracy tend to be larger and may yield different rankings.
19.02.2026 15:54
We identify two distinct questions about accuracy:
- Benchmark accuracy: How well does the LLM perform on this specific, fixed benchmark?
- Generalized accuracy: How well would the LLM perform across the larger population of questions similar to those in this benchmark?
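To make the distinction concrete, here is a toy NumPy sketch with synthetic data and deliberately simplified variance formulas (not the paper's estimators): the benchmark-accuracy standard error counts only trial-to-trial randomness on the fixed question set, while the generalized-accuracy standard error also counts question-to-question variation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic benchmark: n questions with varying difficulty, k trials each.
n, k = 200, 5
p_q = rng.beta(3, 2, n)                          # per-question success prob
results = rng.binomial(1, p_q[:, None], (n, k))  # 1 = correct, 0 = incorrect

p_hat = results.mean()

# Benchmark accuracy SE: the question set is fixed, so only within-question
# (trial-to-trial) randomness contributes.
within_var = results.var(axis=1, ddof=1).mean()
se_benchmark = np.sqrt(within_var / (n * k))

# Generalized accuracy SE: questions are themselves a sample from a larger
# population, so between-question variation contributes as well.
q_means = results.mean(axis=1)
se_generalized = np.sqrt(q_means.var(ddof=1) / n)

print(f"accuracy = {p_hat:.3f}, benchmark SE = {se_benchmark:.4f}, "
      f"generalized SE = {se_generalized:.4f}")
```

When question difficulty varies, the generalized SE is typically the larger of the two, which is why rankings that look significant under benchmark accuracy can lose significance under generalized accuracy.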
19.02.2026 15:54
New Report: Expanding the AI Evaluation Toolbox with Statistical Models
AI benchmark evals commonly report "accuracy" metrics – but what's really being measured? And how should we compute the error bars?
New NIST report from my team at CAISI outlines a better statistical framework for eval analysis: www.nist.gov/news-events/...
19.02.2026 15:54
This has been a massive community project, and we need you all to participate!
See more: evalevalai.com/projects/eve...
17.02.2026 17:39
Had the chance to give feedback on this project on CAISI's behalf. I'm very excited to see this develop!
17.02.2026 17:41
If this kind of work speaks to you, come work with us! My team at CAISI is hiring an Applied Systems AI Research Scientist, among many other roles. www.nist.gov/caisi/career...
11.02.2026 21:17
Towards Best Practices for Automated Benchmark Evaluations
Comments Sought on Initial Public Draft of NIST AI 800-2 through March 31
CAISI invites input on any aspect of this draft, including from orgs that conduct AI evals and from users of eval reports (for decision-making, procurement, integration, etc.)
Public comment closes March 31 – details here: www.nist.gov/news-events/...
11.02.2026 21:14
Table I.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
Automated benchmarks are not all you need, but they are popular tools in AI development. Hoping this doc is a foundation for future guidelines on field testing and other kinds of evals.
11.02.2026 21:14
Table 3.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
Section 3 covers critical practices related to responsible and transparent reporting – including uncertainty quantification, reproducibility, and properly qualified claims.
11.02.2026 21:14
Section 2 dives into the nitty-gritty operational details of setting up and running a benchmark – including helpful lists of relevant settings and design principles.
11.02.2026 21:14
Table 1.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
I'm especially excited about the focus on practical measurement validity.
Section 1 describes ways to assess the relationship between the contents of a benchmark and what evaluators really want to measure.
11.02.2026 21:14
Towards Best Practices for Automated Benchmark Evaluations
Comments Sought on Initial Public Draft of NIST AI 800-2 through March 31
Excited to co-author a new public draft from NIST CAISI on best practices for automated benchmark evals.
We want your feedback! Public comment is open until March 31. Highlights below 🧵
www.nist.gov/news-events/...
11.02.2026 21:14
NIST CAISI is also hiring post-docs β applications due Feb. 1.
Come work with our team on AI evaluations and metrology!
Apply here: ra.nas.edu/RAPLab10/Opp...
More detail: www.linkedin.com/posts/astevi...
14.01.2026 15:02
About the PhD
Audits and evaluation of AI systems – and the broader context that AI systems operate in – have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, and with little deliberation and reflection around conceptual rigour and methodological validity.
This PhD is for a candidate who is passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as:
What does it mean to represent "ground truth" in proxies, synthetic data, or computational simulation?
How do we reliably measure abstract and complex phenomena?
What are the epistemological or methodological implications of the quantification and measurement approaches we choose to employ? In particular, what underlying presuppositions, values, or perspectives do they entail?
How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies?
Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices, with the aim of advancing shared standards and best practice in AI evaluation.
The candidate is expected to integrate empirical (for example, through analysis or evaluation of existing benchmarks) or practical (for example, by executing evaluation of AI systems) components into the overall work.
Are you passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like? Come do a PhD with us.
Closing Date: 10 February 2026
Apply here aial.ie/hiring/phd-a...
17.12.2025 18:52
US CAISI is hiring – the internal govt name for the role is "IT Specialist," but it is effectively a research scientist role!
Salary is $120,579 to $195,200 per year, and you get to work on AI evaluation within government agencies!
Job posting (**closes EOD 12/28/2025**): lnkd.in/exJgkqr5
11.12.2025 22:01
USAJOBS Help Center - How do I write a resume for a federal job?
USAJOBS Help Center
Note that this position requires a specially formatted resume, 2 pages max: help.usajobs.gov/faq/applicat...
08.12.2025 14:47
Also, our team is hiring an AI Research Scientist!
www.usajobs.gov/job/851528400
08.12.2025 14:47
Also, belated announcement that I joined @steviebergman.bsky.social's wonderful Applied Systems team at CAISI – with @anitakrao.bsky.social, Drew Keller, & (formerly) @kwekuka.bsky.social. More to come!
04.12.2025 20:17
"Building gold-standard AI systems requires gold-standard AI measurement science... Today, many evaluations of AI systems do not precisely articulate what has been measured, much less whether the measurements are valid."β¨β¨
We highlight open q's about construct validity, field studies, and more.
04.12.2025 20:17
US CAISI (the equivalent of the US "AI Safety Institute") just put out their approach to AI measurement & there's such a significant portion on construct validity (nist.gov/blogs/caisi-...).
Great to see this after ongoing advocacy about this issue (arxiv.org/abs/2511.04703)!
04.12.2025 13:50
After having such a great time at #CHI2025 and #FAccT2025, I wanted to share some of my favorite recent papers here!
I'll aim to post new ones throughout the summer and will tag all the authors I can find on Bsky. Please feel welcome to chime in with thoughts / paper recs / etc.!!
🧵⬇️:
14.07.2025 17:02
Thank you :) Love this thread
24.07.2025 20:40
A title slide with the paper title: "Legacy Procurement Practices Shape How U.S. Cities Govern AI". The title includes a small illustration that is a simple chart: A government provides an AI vendor with money, and in exchange the vendor provides the government with an AI system.
Q: What do school buses, desktop computers, and AI have in common?
A: The same decades-old laws and processes apply when governments go to purchase them.
Our new #FAccT2025 paper examines how these legacy public procurement norms apply to AI. 🧵
14.05.2025 02:34
Excited to present "Towards AI Accountability Infrastructure: Gaps and Opportunities in AI Audit Tooling" at #CHI2025 tomorrow (today)!
📅 Tue, 29 Apr | 9:48–10:00 AM JST (Mon, 28 Apr | 8:48–9:00 PM ET)
📍 G401 (Pacifico North 4F)
🔗 dl.acm.org/doi/10.1145/...
28.04.2025 11:26