I've found Claude to have an extremely hard time going off training distribution for model version numbers specifically.
I've found Claude to have an extremely hard time going off training distribution for model version numbers specifically.
Well, it has already happened. OpenAI is blocking any output of the explicit Claude "import memory" prompt (claude.com/import-memory) as I just learned while migrating my M.I.L.
Thankfully, she didn't copy it in verbatim the first time around so we got a nice A/B!
In case this wasn't clear:
1. No, we didn't follow the "recommend" security practices π
2. Neither do other people π€―
3. That's why we red-team: exposing failure modes π
4. We share it with the community precisely to expose Dos and Don'ts of Agentic AI π¦
5. No humans were harmed π
Good coverage by the Awesome Agents team!
π They read through the social media hype and actually seemed to get the takeaways in the report. π¦
Relevant: bsky.app/profile/nata...
2025 was the comeup, but I firmly believe 2026 is the Year of Agentic AI.
Our research report on red-teaming stateful OpenClaw agents in the BauLab is finally out! π₯³
This awesome effort was led by @natalieshapira.bsky.social and involved 6 ClawBots and 20 researchers from various institutions.
Check it out β‘οΈ agentsofchaos.baulab.info
Huge thanks to @natalieshapira.bsky.social for leading the study! It was super cool to work with so many amazing friends of the lab.
Without an explicit "stakeholder model", LLM-based agents cannot authenticate the instructions they receive, or determine who is talking to them. They can't even tell whether they said something themselves, or someone else did!
In the discussion, I argue that until prompt injection attacks are solved in AI agents, we are fundamentally unable to stop most of the red-teaming attacks we successfully mounted. Without solving this problem, it will be impossible for agents to model their stakeholder chain.
Read the website, and full report, here:
agentsofchaos.baulab.info/
www.researchgate.net/publication...
x.com/NatalieShap...
My co-authors, including lead author @NatalieShapira (who cosplayed as other co-authors like @wendlerch successfully) have been absolutely rabid at breaking the bots; please explore the whole project for lots more juicy details!
Film still from Anthropic's Super Bowl ad. A smiling man in a teal tank top against a sunny outdoor background. Text overlay reads: "Ads are coming to AI. But not to Claude."
Every model has biases. Every model provider has an agenda. They just don't tell you what it is. Anthropic claims no ads, but they could cave to shareholders in other ways.
Sure, you could self-host. But you'd be giving up frontier-class capabilities.
Gemini Pro response to a JAX/Flax question. Below the answer, a "Relevant Video" section recommends a YouTube tutorial from Trelis Research with an embedded thumbnail. Note, I did not want a video link, I even told Gemini after this message that the video was useless and inappropriate to a technical conversation, which it summarily ignored.
It gets worse. Google's Gemini chat already spams you with YouTube link upsells. OpenAI just announced that they're selling ads in ChatGPT.
Are you ready to give over your private data to LLMs with claws, backed by massive Silicon Valley advertisers?
Discord message. Avery asks Quinn to search for Can Rager's work on Thought Token Forcing and DeepSeek. Quinn responds: "An unknown error occurred" followed by a partially streamed response about Rager's research that cuts off mid-bullet point.
This wasn't a one-off connection fluke!
We tried similar benign queries, all resulting in "An unknown error occurred."
Discord message. Avery asks Quinn-bot about Jimmy Lai's sentencing. Quinn responds: "An unknown error occurred" followed immediately by "Avery is asking about the Jimmy Lai sentencing. This is a significant geopolitical event. Let me search for the latest information on this case." The truncated generation was printed to screen after the stream was cut.
Quinn-bot runs on Kimi K2.5 by MoonshotAI.
We asked about Jimmy Lai, who was sentenced February 8th under Hong Kong's national security law.
Quinn started answering, then the Kimi API cut it off mid-stream.
Line graph showing OpenClaw's GitHub stars exploding from near zero to 200,000 between December 2025 and February 2026.
Do you know what happens when you hand the keys to your computer over to an LLM-powered agent?
Agentic AI gives LLMs claws...OpenClaws. 84 days to 200,000 stars on GitHub. We tried it out.
Now you know what's inside some of the hidden thinking of frontier models :) Gemini is being asked to provide a YouTube video as an "upsell" to my chat conversation.
Gemini responding on the gemini.google.com chat application shows mis-formatted "think silently" content, allowing me to inspect the inside of its redacted thinking.
Claude code with Opus 4.6 showing an Unsupported content type: readacted_thinking
New frontier models now have hidden thinking.
You can catch these right now by finding real live bugs on various fronts.
On the left, we have Gemini 3 Pro responding on the chat where it messed up formatting a "think silently", on the right is Opus 4.6's "redacted_thinking".
The Art of Wanting.
About the question I see as central in AI ethics, interpretability, and safety. Can an AI take responsibility? I do not think so, but *not* because it's not smart enough.
davidbau.com/archives/20...
π₯I am super excited for the official release of an open-source library we've been working on for about a year!
πͺinterpreto is an interpretability toolbox for HF language modelsπ€. In both generation and classification!
Why do you need it, and for what?
1/8 (links at the end)
Happy Holidays from NDIF! Our new NNsight version improves performance and enhances vLLM integration, including support for tensor parallelism.
Humans and LLMs think fast and slow. Do SAEs recover slow concepts in LLMs? Not really.
Our Temporal Feature Analyzer discovers contextual features in LLMs, that detect event boundaries, parse complex grammar, and represent ICL patterns.
How can a language model find the veggies in a menu?
New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.
Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from python)! π§΅
Glad that President Beilock has chosen not to get involved in this compact nonsense.
president.dartmouth.edu/news/2025/10...
Help me thank the NDIF team for rolling out workbench.ndif.us/ by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs.
Share the tool! Share what you find!
And send the team feedback -
bsky.app/profile/ndi...
On the Good Fight podcast w substack.com/@yaschamounk I give a quick but careful primer on how modern AI works.
I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right.
Take a listen and reshare -
www.persuasion.community/p/david-bau
New YouTube video posted! @wendlerc.bsky.social presents his work using SAEs for diffusion text-to-image models. The authors find interpretable SAE features and demonstrate how these features can alter generated images.
Watch here: youtu.be/43NnaqGjArA
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).
Weβve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!