
Leena C Vankadara

@leenacvankadara

Lecturer @GatsbyUCL; Previously Applied Scientist @AmazonResearch; PhD @MPI-IS @UniTuebingen

264 Followers
67 Following
11 Posts
Joined 19.11.2024

Latest posts by Leena C Vankadara @leenacvankadara

On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully ...

πŸ“„ Paper: arxiv.org/abs/2505.22491

Catch our Spotlight at #NeurIPS2025 Today!

πŸ“… Wed Dec 3 πŸ•Ÿ 4:30 - 7:30 PM πŸ“ Exhibit Hall C,D,E β€” Poster #3903
Huge thanks to my amazing collaborators: @mohaas.bsky.social @sbordt.bsky.social @ulrikeluxburg.bsky.social

03.12.2025 17:37 πŸ‘ 3 πŸ” 2 πŸ’¬ 0 πŸ“Œ 1

Summary: Practical nets do not approach kernel limits. Instead, they converge to a Feature Learning Limit.

This offers a new lens: empirical quirks (like aggressive LR scaling) are not mere finite-width artefacts; they are faithful reflections of the true scaling limit. (9/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Early experiments suggest that other deep learning components, such as Adam and normalization layers, also enable Controlled Divergence regimes.

Caveat: Controlled Divergence can still cause overconfidence and floating-point instabilities (precision failure) at scale! (8/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

This may explain the practical success of CE over MSE!

CE admits larger LRs → richer feature learning. MSE is restricted to the Lazy regime.

Validation: Under Β΅P (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
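Not the paper's code: a minimal PyTorch sketch of one way to probe the CE-vs-MSE claim above, assuming a toy width-m MLP with He init and random data. It sweeps the learning rate under each loss and logs whether training stays finite and how much the hidden features move; all sizes, constants, and data are arbitrary placeholders.

```python
# Hypothetical diagnostic (not the paper's code): sweep LRs under CE vs. MSE
# on a width-m MLP with He init; log stability and hidden-feature movement.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, m, k, n = 32, 1024, 10, 256            # input dim, width, classes, samples (placeholders)
X = torch.randn(n, d)
y = torch.randint(0, k, (n,))
Y_onehot = F.one_hot(y, k).float()

def make_mlp():
    net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, k))
    for lin in (net[0], net[2]):
        nn.init.kaiming_normal_(lin.weight)   # He init, single global LR (standard setup)
        nn.init.zeros_(lin.bias)
    return net

def run(loss_name, lr, steps=200):
    net = make_mlp()
    h0 = net[1](net[0](X)).detach()           # hidden features at init
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        logits = net(X)
        loss = (F.cross_entropy(logits, y) if loss_name == "CE"
                else F.mse_loss(logits, Y_onehot))
        if not torch.isfinite(loss):
            return float("inf"), float("nan")
        opt.zero_grad()
        loss.backward()
        opt.step()
    h = net[1](net[0](X)).detach()
    feat_move = ((h - h0).norm() / h0.norm()).item()   # crude feature-learning proxy
    return loss.item(), feat_move

for loss_name in ("CE", "MSE"):
    for lr in (1e-3, 1e-2, 1e-1, 1.0):
        final_loss, feat_move = run(loss_name, lr)
        print(f"{loss_name:3s} lr={lr:<7} final_loss={final_loss:.3g} feature_movement={feat_move:.3g}")
```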
Post image

At the edge of this regime (where η ∝ 1/√m), there exists a well-defined infinite-width limit where feature learning persists in all hidden layers.

This Feature Learning Limit closely matches the behavior of optimally tuned finite-width networks under CE loss. (6/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
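Not the paper's experiments: a minimal sketch of how one might probe this numerically, again on a toy MLP with He init and random data. It fixes η = c/√m (the constant c is an arbitrary placeholder), trains briefly at several widths, and checks whether the relative change in hidden-layer features stays roughly width-independent.

```python
# Hypothetical width sweep (illustrative only): with eta = c / sqrt(m), does the
# relative hidden-feature movement stay roughly width-independent as m grows?
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, k, n, c, steps = 32, 10, 256, 0.5, 100     # all sizes and the constant c are placeholders
X = torch.randn(n, d)
y = torch.randint(0, k, (n,))

for m in (128, 512, 2048, 8192):
    net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, k))
    for lin in (net[0], net[2]):
        nn.init.kaiming_normal_(lin.weight)    # He init, single global LR
        nn.init.zeros_(lin.bias)
    h0 = net[1](net[0](X)).detach()            # hidden features at init
    lr = c / math.sqrt(m)                      # the "edge" scaling from the thread
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(net(X), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    h = net[1](net[0](X)).detach()
    rel_change = ((h - h0).norm() / h0.norm()).item()
    print(f"m={m:<6} lr={lr:.2e} relative feature change={rel_change:.3f}")
```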

In the Controlled Divergence regime, network outputs diverge (saturating to one-hot). Yet all other dynamical quantities, such as activations and gradients, remain stable throughout training.

This regime, however, does not exist under MSE. (5/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
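Not the paper's code: a minimal sketch, under the same toy assumptions (random data, placeholder sizes), of the quantities one would monitor to observe this regime: logit scale and softmax confidence versus hidden-activation and gradient norms during CE training at a large LR.

```python
# Hypothetical monitor (not the paper's code): track logit scale, softmax
# confidence, hidden-activation norm, and gradient norm during CE training
# at a large LR, to see diverging outputs alongside stable internal quantities.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, m, k, n = 32, 4096, 10, 256                # placeholder sizes
X = torch.randn(n, d)
y = torch.randint(0, k, (n,))

net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, k))
for lin in (net[0], net[2]):
    nn.init.kaiming_normal_(lin.weight)        # He init, single global LR
    nn.init.zeros_(lin.bias)

lr = 0.5 / m ** 0.5                            # "large" relative to the 1/m kernel scaling
opt = torch.optim.SGD(net.parameters(), lr=lr)

for step in range(1, 501):
    logits = net(X)
    loss = F.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    grad_norm = torch.cat([p.grad.flatten() for p in net.parameters()]).norm()
    opt.step()
    if step % 100 == 0:
        with torch.no_grad():
            h = net[1](net[0](X))                                  # hidden activations
            conf = F.softmax(net(X), dim=-1).max(dim=-1).values.mean()
        print(f"step {step:4d} |logits|={logits.norm():9.1f} "
              f"max softmax prob={conf:.3f} |h|={h.norm():9.1f} |grad|={grad_norm:.3f}")
```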
Post image

We resolve this via a fine-grained analysis of the regime previously considered unstable (and therefore uninteresting).

Under CE loss, we find this regime comprises two distinct sub-regimes: a Catastrophically Unstable regime and a benign Controlled Divergence regime. (4/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

We find this discrepancy persists even after accounting for finite-width effects from catapult/edge-of-stability dynamics, large depth, and alignment violations.

In fact, infinite-width alignment predictions hold robustly when measured with sufficient granularity.

So what explains this discrepancy? (3/10)

03.12.2025 17:37 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Most nets use He/LeCun init with a single LR η. As width m→∞, theory says

η ∈ O(1/m) ⟹ Kernel regime;  η ∈ ω(1/m) ⟹ Unstable regime.

Thus the max stable LR ∝ 1/m.

Practice violates this: optimal LRs are larger (e.g. ∝ 1/√m) and models exhibit feature learning, contradicting kernel predictions. Why? (2/10)

03.12.2025 17:37 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
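Not the paper's code: a rough sketch of how one might probe the max-stable-LR scaling empirically on a toy MLP with He init and a single global LR (data and constants are placeholders). For each width m it finds the largest LR on a geometric grid for which short training stays finite, to compare against the 1/m and 1/√m reference rates.

```python
# Hypothetical sweep (toy data, arbitrary constants): estimate the largest LR at
# which short training stays finite for each width m, and compare its decay with
# the 1/m (kernel theory) and 1/sqrt(m) (practice) reference rates.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, k, n, steps = 32, 10, 256, 50
X = torch.randn(n, d)
y = torch.randint(0, k, (n,))

def stays_finite(m, lr):
    net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, k))
    for lin in (net[0], net[2]):
        nn.init.kaiming_normal_(lin.weight)    # He init, single global LR (standard setup)
        nn.init.zeros_(lin.bias)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(net(X), y)
        if not torch.isfinite(loss):
            return False
        opt.zero_grad()
        loss.backward()
        opt.step()
    return bool(torch.isfinite(loss))

lrs = [2.0 ** e for e in range(-14, 4)]        # geometric grid of candidate LRs
for m in (256, 1024, 4096):
    max_stable = max((lr for lr in lrs if stays_finite(m, lr)), default=None)
    print(f"m={m:<6} est. max stable LR={max_stable} 1/m={1 / m:.2e} 1/sqrt(m)={m ** -0.5:.2e}")
```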

Under He/LeCun inits, theory implies either Kernel or Unstable regimes as width→∞. Discrepancies (e.g. feature learning) are dismissed as finite-width effects.

Our #NeurIPS2025 spotlight refutes this: practical nets do not converge to kernel limits; feature learning persists as width→∞ 🧵

03.12.2025 17:37 πŸ‘ 7 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0
Post image ×2

Stable model scaling with width-independent dynamics?

Thrilled to present two papers at #NeurIPS 🎉 that study width scaling in Sharpness-Aware Minimization (SAM) (Thu 16:30, #2104) and in Mamba (Fri 11, #7110). Our scaling rules stabilize training and transfer optimal hyperparameters across scales.

🧡 1/10

10.12.2024 07:08 πŸ‘ 21 πŸ” 5 πŸ’¬ 1 πŸ“Œ 0

Could you please add me to the list?

26.11.2024 09:15 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0