Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
arxiv.org/abs/2602.11988
Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
arxiv.org/abs/2602.16819
FAMOSE: A ReAct Approach to Automated Feature Discovery
arxiv.org/abs/2602.176...
KLong: Training LLM Agent for Extremely Long-horizon Tasks
arxiv.org/abs/2602.17547
Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
arxiv.org/abs/2602.10356
Payrolls to Prompts: Firm-Level Evidence on the Substitution of Labor for AI
arxiv.org/abs/2602.00139
Fascinating! "Prompt Repetition Improves Non-Reasoning LLMs" arxiv.org/abs/2512.14982
Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks arxiv.org/abs/2512.22255
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning arxiv.org/abs/2602.08234
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning arxiv.org/abs/2602.100...
PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice arxiv.org/abs/2601.16669
Agentic Reasoning for Large Language Models arxiv.org/abs/2601.12538
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces arxiv.org/abs/2601.11868
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks arxiv.org/abs/2601.02439
Towards a Science of Scaling Agent Systems arxiv.org/abs/2512.08296
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
arxiv.org/abs/2601.09688
CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment arxiv.org/abs/2508.02298
Evaluating AI’s ability to perform scientific research tasks openai.com/index/fronti...
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality arxiv.org/abs/2512.107...
Fantastic Bugs and Where to Find Them in AI Benchmarks arxiv.org/abs/2511.16842
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following arxiv.org/abs/2511.10507
Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation arxiv.org/abs/2507.17937
Cryptographers Show That AI Protections Will Always Have Holes
www.quantamagazine.org/cryptographe...
The State of Generative AI in the Enterprise - report from Menlo Ventures menlovc.com/perspective/...
How Google Maps quietly allocates survival across London’s restaurants - and how I built a dashboard to see through it laurenleek.substack.com/p/how-google...
The Polyglot Neuroscientist Resolving How the Brain Parses Language www.quantamagazine.org/the-polyglot...
AI & Human Co-Improvement for Safer Co-Superintelligence www.arxiv.org/abs/2512.05356
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems arxiv.org/abs/2502.04510