As expected. Congrats to the authors.
I remember that when I saw the ICML 2015 Test of Time Award winner, I noticed the paper "Learning to Rank Using Gradient Descent" for the first time. That is what gave me the idea for the "Deep Relative Attributes" paper.
The #ICML2025 Test of Time Award is likely going to the Batch Normalization paper.
A while back, when Sam Altman was in India, he was asked whether a team with around $10M could build something to compete with OpenAI, and he said it was "hopeless".
DeepSeek just released DeepSeek-V3, a model with very high capability (on benchmarks), at a cost of around $6M for the pre-training run.
During graduate school I lived in Germany for around 5 years without learning German. It is certainly possible, but not learning the native language makes everyday life a bit too hard and long-term living there not feasible.
Jason Weston comments on Ilya's ToT award talk.
Says: "Pre-training _as we know it_ _will_ end"
emphasis on "as we know it" and "will"
And language
I hope this idea of multiple smaller meetings won't converge to one big meeting in North America and some small meetings around the world.
Thomas Kipf with the Google I/O DJ bathrobe :D
#NeurIPS2024
It is sad to see authors not being able to present their work at #NeurIPS2024 because of visa issues.
But some authors went above and beyond.
Here is @hadivafaii.bsky.social tele-presenting his work with an impressive setup (ipad, mic, speaker, holder, battery).
Well done sir!
Who wrote this? It seems a little fishy. (Not saying it is untrue.)
Given the large number of posters, it was really hard, if not impossible, to check them all out, but I still came across some interesting ones, and the authors usually did a great job explaining their work.
Although it seems that some work and good engineering remain to be done to make this scheduler work in large-scale distributed settings.
The talk by the Meta folks about their schedule-free learning was great.
They provide nice theoretical insights as well as good experiments in their paper "The Road Less Scheduled".
arxiv.org/abs/2405.15682
I guess the picture below shows the "any-time stopping" property well.
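For anyone curious what "any-time stopping" means in practice, here is a minimal toy sketch of the schedule-free update as I understand it from the paper; the variable names and the toy problem are my own, so check the paper for the exact method.

```python
import numpy as np

# Minimal sketch of the schedule-free SGD idea from "The Road Less Scheduled"
# (Defazio et al., arXiv:2405.15682). Names and toy setup are my own.
def schedule_free_sgd(grad, z0, lr=0.1, beta=0.9, steps=2000):
    z = np.asarray(z0, dtype=float)  # base SGD iterate
    x = z.copy()                     # equal-weight average of the z iterates
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # gradients are taken at an interpolation
        z = z - lr * grad(y)               # plain SGD step, no LR schedule
        x = (1 - 1 / t) * x + (1 / t) * z  # keep x the running mean of z_1..z_t
    return x  # x can be read out at any step, hence "any-time stopping"

# Toy convex problem: f(w) = ||w||^2 / 2, so grad f(w) = w.
w = schedule_free_sgd(lambda v: v, z0=[5.0, -3.0])
```

The point is that there is no horizon-dependent schedule anywhere: you can stop at any step and return the averaged iterate `x`.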
Attending Orals was not easy yesterday; with 4 parallel tracks it was really hard. I think shorter orals with fewer tracks (more similar papers in the same track) would be better than the current format.
But one Oral presentation did stand out.
One is Saurabh Tiwary, whom I have heard talk about "Industrial Deep Learning" many times before.
The rest of the talk was about xLSTM, among some other work, and his company.
He also claimed that "the bitter lesson is over!"
bsky.app/profile/yass...
In the morning we had the Sepp Hochreiter talk about "Industrial AI". He gave interesting analogies to steam engines and production of ammonium nitrate for fertilizers.
I guess many people have noticed these analogies and come up with similar talks before.
#NeurIPS2024 (@neuripsconf.bsky.social) Day 2 (Wednesday) Experience
One of the great things about conferences like NeurIPS is that you get to see people who you admire for different reasons. I also got to see and talk to some. Really happy I got to talk to William Agnew.
Lots of grokking papers recently. Lol
#NeurIPS2024
Sepp Hochreiter claims "the bitter lesson is over"!
#neurips2024
There was a panel discussion at the end which I missed. (Hope to catch up on the video)
Fun-fact, Sean Welleck is the host of the amazing "The Thesis Review" podcast: wellecks.com/thesisreview
This Tutorial was mostly based on their recent paper: arxiv.org/abs/2406.16838
In the afternoon I attended the "Beyond Decoding" Tutorial by Sean Welleck and others.
cmu-l3.github.io/neurips2024-...
This was truly an amazing Tutorial on Generation/Sampling for decoding, Meta-generation and efficient decoding, highly recommended.
- Usually there is around 65x reduction in data volume after filtering.
- Still training a good reward model is a challenge.
Some notes:
- It seems like if the scientific community does not do something, it might face major challenges accessing large-scale data. The inequality in data access is widening.
- User-provided content like Wikipedia and arXiv amount to less than 1% of data used in pre-training.
In the morning I attended the "Opening the Language Model Pipeline" Tutorial by @natolambert.bsky.social and others from Allen AI.
github.com/allenai/awes...
They talked about their work on Data, Pre-training and Post-Training while highlighting some recent works such as OLMov2, TULU3, etc.
#NeurIPS2024 (@neuripsconf.bsky.social) Day 1 Experience
There were a bunch of interesting Tutorials, Talks, and events today at NeurIPS. But definitely the highlight of the day was catching up with friends and current and past colleagues and seeing folks.
Ilya Sutskever has won 3 test of time awards at NeurIPS now!
2022: for AlexNet paper
2023: for word2vec paper
2024: for Seq2Seq paper
Often you will figure out that your intuition was wrong and everyone else was right. But that only happens often, not all the time, which is great!
Excellent explanation of RoPE embedding, from scratch with all the math needed: https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding
And with beautiful 3Blue1Brown-style animations (made with manim): https://github.com/3b1b/manim.
Original RoPE paper: arxiv.org/abs/2104.09864
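The core trick is simple enough to fit in a few lines. Here is my own toy NumPy sketch of the rotation at the heart of RoPE (not code from the linked post or paper):

```python
import numpy as np

# Toy re-implementation of the core RoPE rotation (Su et al., arXiv:2104.09864).
# My own minimal sketch for illustration.
def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by angle pos * base^(-2i/d)."""
    d = x.shape[-1]                                  # must be even
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin                  # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# The property that makes RoPE work: <rope(q, m), rope(k, n)> depends only on
# the relative offset m - n, so attention scores become position-relative.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
```

Since each pair is just rotated, norms are preserved, and the query-key dot product depends only on the relative position of the two tokens.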