We push further with reinforcement learning 🚀
When fine-tuned with GRPO, the backtracking model shines: it discovers new, efficient strategies. 🌟
The no-backtracking model?
✅ Great at low compute (pass@1)
❌ But loses the ability to generate diverse solutions—hurting pass@k performance.
11.04.2025 16:29
👍 1
🔁 0
💬 1
📌 0
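(For readers unfamiliar with the pass@1 vs. pass@k tradeoff above: pass@k is the probability that at least one of k samples solves the task. A minimal sketch using the standard unbiased estimator from Chen et al., 2021—the numbers below are illustrative, not from the paper.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations of which c are correct, succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A low-diversity model can look strong at pass@1 yet gain little at
# larger k; e.g. 40 correct out of 100 samples:
print(pass_at_k(100, 40, 1))   # 0.4
print(pass_at_k(100, 40, 10))  # close to 1.0 — diversity pays off at large k
```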
Can we fix backtracking on CountDown by tackling these 2 issues? 🔧 We try two variations:
🔀 Mix-backtracking: trained on more diverse search traces
🧠 Think-backtracking: skips steps to encourage implicit reasoning
Both help! But with enough compute, the direct solution model still wins.
11.04.2025 16:29
👍 2
🔁 0
💬 1
📌 0
2️⃣ Backtracking makes models verbose—often at the expense of “actual” reasoning 💬
Instead of thinking internally without outputting CoT, they learn to spell out every step, even when it’s unnecessary.
It talks more 🤯📝 but thinks less—and this hurts test-time efficiency!
11.04.2025 16:29
👍 1
🔁 0
💬 1
📌 0
But what goes wrong when backtracking fails (e.g., on CountDown)? 🤔 We find 2 pitfalls:
1️⃣ Teaching models to search via CoT can backfire—they learn to make mistakes. On many problems, our backtracking model makes more mistakes before finding the right answer (vs. the direct solution model)!
11.04.2025 16:29
👍 1
🔁 0
💬 1
📌 0
Here’s what we found:
🔢 On CountDown, the direct solution model—no self-reflection, just raw diversity—outperforms backtracking.
🧮 But on Sudoku, the result flips: backtracking wins.
So backtracking isn’t universally beneficial—it depends on the nature of the reasoning required.
11.04.2025 16:29
👍 1
🔁 0
💬 1
📌 0
We compare backtracking (BT) to an alternative way to scale test-time compute: parallel sampling + best-of-N.
We train:
1️⃣ A backtracking model using CoT to perform search
2️⃣ A direct solution model that learns from the optimal solution
Equating test-time compute, who will win? 🤔
11.04.2025 16:29
👍 3
🔁 0
💬 1
📌 0
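(The parallel-sampling baseline above can be sketched in a few lines. `generate` and `is_correct` are hypothetical stand-ins for sampling one solution and verifying it—not the paper's actual harness; the point is just that compute scales with N.)

```python
def best_of_n(generate, is_correct, n: int) -> bool:
    """Best-of-N baseline: draw n independent samples and succeed
    if any one passes the verifier (test-time compute ~ n samples)."""
    return any(is_correct(generate()) for _ in range(n))

# Toy illustration with a scripted sampler (hypothetical):
samples = iter(["wrong", "wrong", "42"])
print(best_of_n(lambda: next(samples), lambda s: s == "42", 3))  # True
```

The comparison is then: spend the same token budget either on one long backtracking trace, or on N short direct-solution samples plus selection.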
In our newest work (led by the amazing
@sunnytqin.bsky.social , w/ @emalach.bsky.social, Samy Jelassi), we investigate a core question for LLMs: "𝑡𝑜 𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘 𝑜𝑟 𝑛𝑜𝑡 𝑡𝑜 𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘" in two prototypical logic-heavy puzzles: CountDown and Sudoku.
11.04.2025 16:29
👍 3
🔁 2
💬 1
📌 0
🚨 New preprint! TL;DR: Backtracking is not the "holy grail" for smarter LLMs.
It’s praised for helping models “fix mistakes” and improve reasoning—but is it really the best use of test-time compute? 🤔
11.04.2025 16:29
👍 8
🔁 2
💬 1
📌 0
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
Language models (LMs), like other neural networks, often favor shortcut heuristics based on surface-level patterns. Although LMs behave like n-gram models early in training, they must eventually learn...
Transformer LMs get pretty far by acting like n-gram models, so why do they learn syntax? A new paper by sunnytqin.bsky.social, me, and @dmelis.bsky.social illuminates grammar learning in a whirlwind tour of generalization, grokking, training dynamics, memorization, and random variation. #mlsky #nlp
20.12.2024 17:55
👍 142
🔁 31
💬 5
📌 4