(@anandraghavan)

AGENTS.md
AGENTS.md is a simple, open format for guiding coding agents. Think of it as a README for agents.

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

arxiv.org/abs/2602.11988

06.03.2026 23:02

Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench. In contrast, in real use, these agents solve more various and complex…

Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

arxiv.org/abs/2602.16819

05.03.2026 23:01

FAMOSE: A ReAct Approach to Automated Feature Discovery
Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space…

FAMOSE: A ReAct Approach to Automated Feature Discovery

arxiv.org/abs/2602.176...

04.03.2026 23:01

KLong: Training LLM Agent for Extremely Long-horizon Tasks
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via…

KLong: Training LLM Agent for Extremely Long-horizon Tasks

arxiv.org/abs/2602.17547

03.03.2026 23:01

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in…

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

arxiv.org/abs/2602.10356

02.03.2026 23:01

Payrolls to Prompts: Firm-Level Evidence on the Substitution of Labor for AI
Generative AI has the potential to transform how firms produce output. Yet, credible evidence on how AI is actually substituting for human labor remains limited. In this paper, we study firm-level…

Payrolls to Prompts: Firm-Level Evidence on the Substitution of Labor for AI

arxiv.org/abs/2602.00139

27.02.2026 23:00

Prompt Repetition Improves Non-Reasoning LLMs
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.

Fascinating! "Prompt Repetition Improves Non-Reasoning LLMs" arxiv.org/abs/2512.14982

26.02.2026 23:01

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when…

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks arxiv.org/abs/2512.22255

25.02.2026 23:02

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily…

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning arxiv.org/abs/2602.08234

24.02.2026 23:00

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent…

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning arxiv.org/abs/2602.100...

23.02.2026 23:01

PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice
As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing…

PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice arxiv.org/abs/2601.16669

20.02.2026 23:01

Agentic Reasoning for Large Language Models
Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world…

Agentic Reasoning for Large Language Models arxiv.org/abs/2601.12538

19.02.2026 23:01

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently…

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces arxiv.org/abs/2601.11868

18.02.2026 23:01

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets…

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks arxiv.org/abs/2601.02439

17.02.2026 23:01

Towards a Science of Scaling Agent Systems
Agents, language model-based systems that are capable of reasoning, planning, and acting are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the…

Towards a Science of Scaling Agent Systems arxiv.org/abs/2512.08296

16.02.2026 23:00

Agent-as-a-Judge
LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the…

Agent-as-a-Judge www.arxiv.org/abs/2601.05111

13.02.2026 23:01

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require…

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

arxiv.org/abs/2601.09688

12.02.2026 23:00

CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically…

CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment arxiv.org/abs/2508.02298

26.12.2025 23:01

Evaluating AI’s ability to perform scientific research tasks
We introduce FrontierScience, a new benchmark that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.

Evaluating AI’s ability to perform scientific research tasks openai.com/index/fronti...

25.12.2025 23:00

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text…

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality arxiv.org/abs/2512.107...

24.12.2025 23:00

Fantastic Bugs and Where to Find Them in AI Benchmarks
Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark…

Fantastic Bugs and Where to Find Them in AI Benchmarks arxiv.org/abs/2511.16842

23.12.2025 23:00

AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)-especially for complex, multi-turn, and system-prompted…

AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following arxiv.org/abs/2511.10507

22.12.2025 23:01

Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing…

Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation arxiv.org/abs/2507.17937

19.12.2025 23:00

Cryptographers Show That AI Protections Will Always Have Holes | Quanta Magazine
Large language models such as ChatGPT come with filters to keep certain info from getting out. A new mathematical argument shows that systems like this can never be completely safe.

Cryptographers Show That AI Protections Will Always Have Holes

www.quantamagazine.org/cryptographe...

15.12.2025 18:39

2025: The State of Generative AI in the Enterprise | Menlo Ventures
For all the fears of over-investment, AI is spreading across enterprises at a pace with no precedent in modern software history.

The State of Generative AI in the Enterprise - report from Menlo Ventures menlovc.com/perspective/...

14.12.2025 22:35

The Universal Weight Subspace Hypothesis
We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates…

The Universal Weight Subspace Hypothesis arxiv.org/abs/2512.05117

14.12.2025 18:39

How Google Maps quietly allocates survival across London’s restaurants - and how I built a dashboard to see through it
I wanted a dinner recommendation and got a research agenda instead. Using 13000+ restaurants, I rebuild its ratings with machine learning and map how algorithmic visibility actually distributes power.

How Google Maps quietly allocates survival across London’s restaurants - and how I built a dashboard to see through it laurenleek.substack.com/p/how-google...

13.12.2025 22:35

The Polyglot Neuroscientist Resolving How the Brain Parses Language | Quanta Magazine
Is language core to thought, or a separate process? For 15 years, the neuroscientist Ev Fedorenko has gathered evidence of a language network in the human brain - and has found some similarities to…

The Polyglot Neuroscientist Resolving How the Brain Parses Language www.quantamagazine.org/the-polyglot...

13.12.2025 18:39

AI & Human Co-Improvement for Safer Co-Superintelligence
Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to…

AI & Human Co-Improvement for Safer Co-Superintelligence www.arxiv.org/abs/2512.05356

12.12.2025 22:35

Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with…

Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems arxiv.org/abs/2502.04510

12.12.2025 18:39
