Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset
How our data curation pipeline delivers substantial improvements in LLM quality, training speed, and inference efficiency.
ICYMI, check out our latest results @datologyai.com on curating data for LLMs.
Intervening only on training data, our pipeline can train models faster (7.7x less compute), better (+8.5% performance), and smaller (models half the size outperform by >5%)!
www.datologyai.com/post/technic...
29.11.2024 16:36
The text team cooked so much 🧑‍🍳 it might be better than your Thanksgiving meal
Check out this super thorough thread on what and how we achieved the best curated text dataset using public data
25.11.2024 20:29
DatologyAI Jobs
Working on making data curation dirt cheap btw
If you're a cracked engineer we'd love to have you :))
DM me if you have any questions!
jobs.ashbyhq.com/DatologyAI
(also looking for enthusiastic research interns)
25.11.2024 20:37
1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we've been working on!
Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.
Blog: tinyurl.com/best-llm-data 🧵
25.11.2024 18:43
Train faster - Reach the same performance 7.7x faster
Train better - Improve performance by 8.5% over exact-deduplicated RPJv1, 6.1% over FineWeb-Edu, and 4.4% over DCLM
Train smaller - Train a model that's 2.1x smaller while simultaneously improving performance by >5%
25.11.2024 17:56
Made my account today!
15.11.2024 02:01
Massive, impressive post on data curation strategies for producing better models with less data and compute. The best part of data curation is that it's a (relatively small) one time cost that gets amortized over all future models.
Link to the technical write-up: www.datologyai.com/post/product...
14.11.2024 19:16
This is the most interesting and most impactful data pipeline problem I have ever worked on (and if you know me, you know that's saying something).
So happy to be able to share this work with the world! And now it's time for a little vacation.
14.11.2024 19:21
Hello bluesky! First post, first results drop!
Today, we @datologyai.bsky.social are so excited to release our first results, demonstrating *massive* gains in training efficiency, performance, and inference efficiency with better data.
www.datologyai.com/post/datolog...
14.11.2024 19:37