New Google DeepMind safety paper! LLM agents are coming – how do we stop them from finding complex plans to hack the reward?
Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Inspired by myopic optimization, but with better performance – details in 🧵
23.01.2025 15:33
👍 40
🔁 8
💬 1
📌 5
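To make the contrast with ordinary RL concrete, here is a minimal sketch of the myopic-optimization idea the post alludes to, under the assumption that the method combines per-step reward with a far-sighted overseer's approval of each action. Function names, signatures, and the exact objective are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: standard RL credit assignment vs. a myopic,
# approval-based objective. Purely illustrative, not the paper's code.

def standard_return(rewards: list[float], t: int, gamma: float = 0.99) -> float:
    # Standard RL: the action at step t is credited with all discounted
    # future reward, so a multi-step plan that hacks the reward many
    # steps later still gets reinforced.
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

def myopic_approved_return(rewards: list[float], approvals: list[float], t: int) -> float:
    # Myopic alternative: the action at step t is credited only with its
    # immediate reward plus an overseer's approval of the action itself.
    # Foresight comes from the overseer's judgment rather than from
    # optimizing over future outcomes, so a hack the overseer cannot
    # endorse at step t is never reinforced, even if humans would miss
    # it in the final outcome.
    return rewards[t] + approvals[t]
```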
This was all based on work by the Google DeepMind Rater Assist team, the absolute best team ever 🙂
24.12.2024 00:00
👍 0
🔁 0
💬 1
📌 0
We achieved human-AI complementarity by combining Hybridization and Rater Assistance, but continued research is needed as the rating landscape evolves. Making progress in this space will require cross-disciplinary work. Let’s build these collaborations now! If you’re interested, please reach out.
24.12.2024 00:00
👍 1
🔁 0
💬 1
📌 0
Importantly, the best type of rater assistance depends a lot on how much raters over-rely on the assistant. In our slice of data where humans > AI, showing directly quoted evidence alone helps more than showing that evidence alongside the AI’s reasoning, judgments, and confidence.
24.12.2024 00:00
👍 0
🔁 0
💬 1
📌 0
Hybridization can also enable impactful Rater Assistance. Prior HCI work has shown that achieving complementarity can be hard in settings where AI > Humans. Our hybridization identifies a slice of data where humans > AI. Here, rater assistance helps!
24.12.2024 00:00
👍 1
🔁 0
💬 1
📌 0
Combining judgements from human raters and AI raters working in isolation (called Hybridization) can be a useful technique for achieving complementarity.
We’ve found that confidence-based hybridization (using the AI’s rating when it is confident, and the human’s rating otherwise) achieves complementarity!
24.12.2024 00:00
👍 1
🔁 0
💬 1
📌 0
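As a concrete illustration, here is a minimal sketch of confidence-based hybridization as described above. The threshold value and the rater interfaces are assumptions for illustration, not our actual setup.

```python
# Minimal sketch: use the AI's rating when it is confident,
# fall back to a human rater otherwise.

from typing import Callable

def hybrid_rating(
    item: str,
    ai_rater: Callable[[str], tuple[float, float]],  # -> (rating, confidence)
    human_rater: Callable[[str], float],
    confidence_threshold: float = 0.9,  # illustrative value
) -> float:
    rating, confidence = ai_rater(item)
    if confidence >= confidence_threshold:
        return rating          # AI handles the cases it is sure about...
    return human_rater(item)   # ...humans handle the rest
```

Note that the low-confidence slice routed to humans is exactly the slice where humans > AI, which (as the post above notes) is where rater assistance pays off.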
Achieving complementarity can be quite hard! A key issue is over-reliance: how do we get humans to appropriately use AI, and not just default to its outputs? And, this problem gets worse in settings where AI > Humans. But there is hope!
24.12.2024 00:00
👍 0
🔁 0
💬 1
📌 0
Rater Assistance is not so useful if the combined Human-AI team doesn’t outperform humans or AI alone. Restated, the goal is to achieve Human-AI Complementarity. Fundamentally, this is a Human-Computer Interaction (HCI) problem!
24.12.2024 00:00
👍 1
🔁 0
💬 1
📌 0
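To pin down the criterion, here is a minimal sketch; the metric name and the numbers in the examples are illustrative assumptions.

```python
# Sketch of the Human-AI Complementarity criterion described above.

def is_complementary(team_acc: float, human_acc: float, ai_acc: float) -> bool:
    """Complementarity: the combined Human-AI team must beat BOTH
    the human-alone and AI-alone baselines, not just one of them."""
    return team_acc > max(human_acc, ai_acc)

# Example: if AI alone scores 0.80 and humans alone 0.75, the team
# only counts as complementary if it exceeds 0.80.
assert is_complementary(0.83, 0.75, 0.80)
assert not is_complementary(0.79, 0.75, 0.80)  # beats humans, not the AI
```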
This is the field of Amplified Oversight (a subfield of Scalable Oversight). Much of the past work in this field, such as critiques, debate, and iterative amplification, has focused on Rater Assistance: assisting and enabling human raters to properly evaluate AI outputs.
24.12.2024 00:00
👍 1
🔁 0
💬 1
📌 0
As AI is able to perform increasingly challenging tasks, how do we make sure we’re able to properly evaluate its outputs so that we can accurately align the model to human values via e.g. RLHF? Relying on humans alone for this will be hard on tasks such as summarizing 1M pages.
24.12.2024 00:00
👍 0
🔁 0
💬 1
📌 0
Read our blog for the full details: deepmindsafetyresearch.medium.com/human-ai-com...
Here’s a quick summary:
24.12.2024 00:00
👍 1
🔁 0
💬 1
📌 0
How do we ensure humans can still effectively oversee increasingly powerful AI systems? In our blog, we argue that achieving Human-AI complementarity is an underexplored yet vital piece of this puzzle! It’s hard, but we achieved it.
🧵(1/10)
24.12.2024 00:00
👍 1
🔁 1
💬 1
📌 1
Can someone let me into Croatia’s inside joke
13.05.2023 21:14
👍 0
🔁 1
💬 0
📌 0