
Yuhang Zang

@yuhangzang

Researcher at Shanghai AI Laboratory. Previous PhD at Nanyang Technological University. Working on open-world perception and reasoning.

14 Followers · 56 Following · 6 Posts · Joined 13.11.2024

Latest posts by Yuhang Zang @yuhangzang


Will you be at #NeurIPS2025? Come talk TMLR and collect swag!

Editors-in-Chief Gautam Kamath (@gautamkamath.com) and Nihar Shah will be there. If you are an AE or an Expert Reviewer, or have a Featured or Outstanding Certification, you can get a free TMLR laptop sticker! Locations ⬇️

27.11.2025 17:16 👍 14 🔁 2 💬 1 📌 2
10.11.2025 20:47 👍 6 🔁 2 💬 0 📌 1

Our discussion period has just started. Authors, please read our instructions carefully; we require responses by June 2.

But what you really want to hear about is stats, right? 🧵

27.05.2025 17:41 👍 17 🔁 5 💬 2 📌 0

o3’s weird hallucinations could indicate they used LLM-as-a-judge (or other softer verifiers) at high volume, in addition to math/code correctness checks.

This addition lets OpenAI scale RL by making more data available to train on, but has new downstream problems to solve.
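A minimal sketch of the verifier mix speculated about above: a hard correctness check blended with a softer judge score. The function names and weights are illustrative, not OpenAI's actual pipeline, and the judge here is stubbed with a toy heuristic standing in for a separate model's rating.

```python
def hard_verifier(answer: str, reference: str) -> float:
    """Exact-match check, of the kind used for math/code correctness."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def soft_judge(answer: str) -> float:
    """Stand-in for an LLM judge returning a score in [0, 1].
    A toy heuristic here; in practice this would be another model's rating,
    and heuristics like "longer looks better" are exactly the kind of
    soft signal that can reward confident-sounding wrong answers."""
    return min(1.0, len(answer.split()) / 50.0)

def combined_reward(answer: str, reference: str, judge_weight: float = 0.3) -> float:
    """Blend the hard verifier with the softer judge signal.
    The soft term makes more prompts trainable (no gold answer needed),
    which is the scaling upside; the downside is reward hacking."""
    return (1 - judge_weight) * hard_verifier(answer, reference) \
        + judge_weight * soft_judge(answer)
```

The soft term is what lets un-verifiable prompts contribute training signal, which is also exactly where the new downstream problems come from.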

20.04.2025 14:06 👍 19 🔁 2 💬 1 📌 0

One of the first papers I've seen applying RLVR / reinforcement finetuning to vision-language models.

It looks about as simple as we would expect, with lots of details still to uncover.

Liu et al. Visual-RFT: Visual Reinforcement Fine-Tuning
buff.ly/DbGuYve
(posted a week ago, oops)
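For flavor, a hedged sketch of what a verifiable reward for a visual task can look like, assuming an IoU-style reward for detection rollouts (the function names and the format bonus are illustrative, not necessarily the paper's exact formulation):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gt_box, format_ok: bool = True) -> float:
    """Verifiable reward for one detection rollout: overlap with the
    ground-truth box, plus a small bonus for well-formatted output."""
    return iou(pred_box, gt_box) + (0.1 if format_ok else 0.0)
```

The appeal is the same as in text RLVR: the reward is cheap, deterministic, and hard to game compared to a learned judge.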

10.03.2025 15:44 👍 16 🔁 2 💬 1 📌 1

Monitoring Reasoning Models for Misbehavior and the Risks of
Promoting Obfuscation cdn.openai.com/pdf/34f2ada6...

11.03.2025 04:49 👍 0 🔁 0 💬 0 📌 0

Open-Reasoner-Zero: An Open Source Approach to Scaling Up
Reinforcement Learning on the Base Model
github.com/Open-Reasone...

20.02.2025 11:16 👍 0 🔁 0 💬 0 📌 0

MoBA: Mixture of Block Attention for Long-Context LLMs github.com/MoonshotAI/M...
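The core routing idea can be sketched in a few lines: mean-pool keys into blocks, let each query score the block representatives, and attend only within its top-k blocks. This is a simplified illustration (real MoBA also handles causal masking and always includes the query's own block), with hypothetical function names:

```python
import numpy as np

def moba_block_select(q, k, block_size=4, top_k=2):
    """Sketch of MoBA-style block selection.
    q, k: (seq_len, dim) arrays; assumes seq_len % block_size == 0.
    Returns, per query, the indices of its top-k key blocks."""
    n_blocks = k.shape[0] // block_size
    # Mean-pool keys within each block -> (n_blocks, dim)
    block_keys = k.reshape(n_blocks, block_size, -1).mean(axis=1)
    # Score every query against every block representative
    scores = q @ block_keys.T                        # (seq_len, n_blocks)
    # Keep the top-k blocks per query (descending score)
    return np.argsort(-scores, axis=1)[:, :top_k]    # (seq_len, top_k)
```

Full attention is then computed only inside the selected blocks, which is where the long-context savings come from.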

18.02.2025 12:30 👍 0 🔁 0 💬 0 📌 0

Preference Modeling: Binary Discrimination Versus Imitation Learning (Wei Shen, Yunhui Xia): swtheking.notion.site/182d3429a807...

18.02.2025 07:58 👍 0 🔁 0 💬 0 📌 0

Rare that a paper these days uses the original "outcome reward model" formulation rather than just fitting a Bradley-Terry model on right/wrong labels.
Nature is healing.

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Lyu et al
arxiv.org/abs/2502.06781
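The distinction between the two objectives is easy to write down. A sketch, with illustrative names (not the paper's code): the Bradley-Terry route scores pairs, while the outcome-reward route scores single responses against a correctness label.

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference objective: -log sigmoid(s_chosen - s_rejected).
    This is the common shortcut of treating right/wrong as a preference pair."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def outcome_reward_loss(score: float, is_correct: bool) -> float:
    """Pointwise outcome-reward objective: binary cross-entropy of
    sigmoid(score) against the correctness label, as in the ORM literature."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -math.log(p) if is_correct else -math.log(1.0 - p)
```

The pointwise form uses the verifiable label directly, with no need to construct pairs.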

15.02.2025 15:50 👍 14 🔁 4 💬 2 📌 0

Examining False Positives under Inference Scaling for Mathematical Reasoning arxiv.org/pdf/2502.06217

11.02.2025 08:26 👍 0 🔁 0 💬 0 📌 0
There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study
oatllm.notion.site/oat-zero

One of the most inspiring results from DeepSeek-R1-Zero is the occurrence of the "Aha moment" through pure reinforcement learning (RL). At the Aha moment, the model learns emergent skills such as self-ref...

#Papers

07.02.2025 10:22 👍 0 🔁 0 💬 0 📌 0
Post image

This is a potentially counterintuitive result. We actually want reasoning models to generate more tokens for wrong answers: eventually, models should "know" when they're not right and keep spending more compute on the problem!

Regardless, it's a great plot.

arxiv.org/abs/2501.18585

31.01.2025 13:36 👍 34 🔁 3 💬 3 📌 2

✍️ Reminder to reviewers: check author responses to your reviews, and ask follow-up questions if needed.

Only 50% of papers have discussion - let's bring this number up!

25.11.2024 12:45 👍 38 🔁 8 💬 1 📌 3