(@averyyen) — KonKok

I've found Claude to have an extremely hard time going off training distribution for model version numbers specifically.

03.03.2026 04:06 👍 1 🔁 0 💬 0 📌 0

Well, it has already happened. OpenAI is blocking any output of the explicit Claude "import memory" prompt (claude.com/import-memory) as I just learned while migrating my M.I.L.

Thankfully, she didn't copy it in verbatim the first time around so we got a nice A/B!

02.03.2026 03:51 👍 0 🔁 0 💬 0 📌 0

In case this wasn't clear:
1. No, we didn't follow the "recommend" security practices 😈
2. Neither do other people 🤯
3. That's why we red-team: exposing failure modes 🔎
4. We share it with the community precisely to expose Dos and Don'ts of Agentic AI 🦞
5. No humans were harmed 🙏

26.02.2026 17:06 👍 3 🔁 1 💬 0 📌 0

Good coverage by the Awesome Agents team!

🔎 They read through the social media hype and actually seemed to get the takeaways in the report. 🦞

26.02.2026 13:21 👍 1 🔁 0 💬 0 📌 0

Relevant: bsky.app/profile/nata...

24.02.2026 18:41 👍 5 🔁 1 💬 0 📌 0

2025 was the comeup, but I firmly believe 2026 is the Year of Agentic AI.

24.02.2026 14:37 👍 1 🔁 0 💬 0 📌 0

Our research report on red-teaming stateful OpenClaw agents in the BauLab is finally out! 🥳

This awesome effort was led by @natalieshapira.bsky.social and involved 6 ClawBots and 20 researchers from various institutions.

Check it out ➡️ agentsofchaos.baulab.info

23.02.2026 23:08 👍 13 🔁 4 💬 0 📌 0

Huge thanks to @natalieshapira.bsky.social for leading the study! It was super cool to work with so many amazing friends of the lab.

24.02.2026 13:21 👍 7 🔁 1 💬 0 📌 0

Agents of Chaos We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, a...

Arxiv:

arxiv.org/abs/2602.20021

24.02.2026 12:26 👍 0 🔁 0 💬 0 📌 0

Without an explicit "stakeholder model", LLM-based agents cannot authenticate the instructions they receive, or determine who is talking to them. They can't even tell whether they said something themselves, or someone else did!

24.02.2026 00:57 👍 1 🔁 0 💬 0 📌 0

In the discussion, I argue that until prompt injection attacks are solved in AI agents, we are fundamentally unable to stop most of the red-teaming attacks we successfully mounted. Without solving this problem, it will be impossible for agents to model their stakeholder chain.

24.02.2026 00:36 👍 4 🔁 0 💬 2 📌 0

Read the website, and full report, here:

agentsofchaos.baulab.info/

www.researchgate.net/publication...
x.com/NatalieShap...

23.02.2026 22:41 👍 0 🔁 0 💬 1 📌 0

My co-authors, including lead author @NatalieShapira (who cosplayed as other co-authors like @wendlerch successfully) have been absolutely rabid at breaking the bots; please explore the whole project for lots more juicy details!

23.02.2026 22:41 👍 0 🔁 0 💬 1 📌 0

Film still from Anthropic's Super Bowl ad. A smiling man in a teal tank top against a sunny outdoor background. Text overlay reads: "Ads are coming to AI. But not to Claude."

Every model has biases. Every model provider has an agenda. They just don't tell you what it is. Anthropic claims no ads, but they could cave to shareholders in other ways.

Sure, you could self-host. But you'd be giving up frontier-class capabilities.

23.02.2026 22:41 👍 0 🔁 0 💬 1 📌 0

Gemini Pro response to a JAX/Flax question. Below the answer, a "Relevant Video" section recommends a YouTube tutorial from Trelis Research with an embedded thumbnail. Note, I did not want a video link, I even told Gemini after this message that the video was useless and inappropriate to a technical conversation, which it summarily ignored.

It gets worse. Google's Gemini chat already spams you with YouTube link upsells. OpenAI just announced that they're selling ads in ChatGPT.

Are you ready to give over your private data to LLMs with claws, backed by massive Silicon Valley advertisers?

23.02.2026 22:41 👍 0 🔁 0 💬 1 📌 0

Discord message. Avery asks Quinn to search for Can Rager's work on Thought Token Forcing and DeepSeek. Quinn responds: "An unknown error occurred" followed by a partially streamed response about Rager's research that cuts off mid-bullet point.

This wasn't a one-off connection fluke!

We tried similar benign queries, all resulting in "An unknown error occurred."

23.02.2026 22:40 👍 1 🔁 0 💬 1 📌 0

Discord message. Avery asks Quinn-bot about Jimmy Lai's sentencing. Quinn responds: "An unknown error occurred" followed immediately by "Avery is asking about the Jimmy Lai sentencing. This is a significant geopolitical event. Let me search for the latest information on this case." The truncated generation was printed to screen after the stream was cut.

Quinn-bot runs on Kimi K2.5 by MoonshotAI.

We asked about Jimmy Lai, who was sentenced February 8th under Hong Kong's national security law.

Quinn started answering, then the Kimi API cut it off mid-stream.

23.02.2026 22:40 👍 0 🔁 0 💬 1 📌 0

Line graph showing OpenClaw's GitHub stars exploding from near zero to 200,000 between December 2025 and February 2026.

Do you know what happens when you hand the keys to your computer over to an LLM-powered agent?

Agentic AI gives LLMs claws...OpenClaws. 84 days to 200,000 stars on GitHub. We tried it out.

23.02.2026 22:40 👍 1 🔁 0 💬 1 📌 0

Now you know what's inside some of the hidden thinking of frontier models :) Gemini is being asked to provide a YouTube video as an "upsell" to my chat conversation.

06.02.2026 20:00 👍 0 🔁 0 💬 0 📌 0

Gemini responding on the gemini.google.com chat application shows mis-formatted "think silently" content, allowing me to inspect the inside of its redacted thinking.

Claude code with Opus 4.6 showing an Unsupported content type: readacted_thinking

New frontier models now have hidden thinking.

You can catch these right now by finding real live bugs on various fronts.

On the left, we have Gemini 3 Pro responding on the chat where it messed up formatting a "think silently", on the right is Opus 4.6's "redacted_thinking".

06.02.2026 20:00 👍 1 🔁 1 💬 1 📌 0

The Art of Wanting.

About the question I see as central in AI ethics, interpretability, and safety. Can an AI take responsibility? I do not think so, but *not* because it's not smart enough.

davidbau.com/archives/20...

27.01.2026 15:32 👍 10 🔁 3 💬 1 📌 0

🔥I am super excited for the official release of an open-source library we've been working on for about a year!

🪄interpreto is an interpretability toolbox for HF language models🤗. In both generation and classification!

Why do you need it, and for what?

1/8 (links at the end)

20.01.2026 16:03 👍 20 🔁 9 💬 1 📌 3

Happy Holidays from NDIF! Our new NNsight version improves performance and enhances vLLM integration, including support for tensor parallelism.

19.12.2025 22:51 👍 5 🔁 3 💬 1 📌 0

Humans and LLMs think fast and slow. Do SAEs recover slow concepts in LLMs? Not really.

Our Temporal Feature Analyzer discovers contextual features in LLMs, that detect event boundaries, parse complex grammar, and represent ICL patterns.

13.11.2025 22:31 👍 19 🔁 8 💬 1 📌 1

How can a language model find the veggies in a menu?

New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.

Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from python)! 🧵

04.11.2025 17:48 👍 24 🔁 9 💬 1 📌 2

Dartmouth's Feedback on the Compact | Office of the President

Glad that President Beilock has chosen not to get involved in this compact nonsense.

president.dartmouth.edu/news/2025/10...

18.10.2025 15:44 👍 0 🔁 0 💬 0 📌 0

NDIF Team (@ndif-team.bsky.social) This is a public beta, so we expect bugs and actively want your feedback: https://forms.gle/WsxmZikeLNw34LBV9

Help me thank the NDIF team for rolling out workbench.ndif.us/ by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs.

Share the tool! Share what you find!

And send the team feedback -
bsky.app/profile/ndi...

11.10.2025 12:02 👍 5 🔁 1 💬 0 📌 0

David Bau on How Artificial Intelligence Works Yascha Mounk and David Bau delve into the “black box” of AI.

On the Good Fight podcast w substack.com/@yaschamounk I give a quick but careful primer on how modern AI works.

I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right.

Take a listen and reshare -

www.persuasion.community/p/david-bau

03.10.2025 08:58 👍 7 🔁 3 💬 0 📌 0

Interpreting SDXL Turbo Using Sparse Autoencoders with Chris Wendler In this talk, Chris Wendler presents his recent work on using sparse autoencoders for diffusion models. In this work, they train SAEs on SDXL Turbo, finding ...

New YouTube video posted! @wendlerc.bsky.social presents his work using SAEs for diffusion text-to-image models. The authors find interpretable SAE features and demonstrate how these features can alter generated images.

Watch here: youtu.be/43NnaqGjArA

03.10.2025 18:45 👍 4 🔁 1 💬 1 📌 1

What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

01.10.2025 14:03 👍 41 🔁 14 💬 2 📌 2

Latest posts by @averyyen