HuggingFaceFW/fineweb-2 · Datasets at Hugging Face
Going far beyond our original FineWeb, we've created something massive - 1,893 script-language pairs with almost 3 trillion words spanning 8TB of compressed files! 📚
It's fully open source, released under ODC-By 1.0, with fully reproducible code! 💻
huggingface.co/datasets/Hug...
08.12.2024 09:27
We heard you liked the FineWeb, so we made a second one: FineWeb 2! 🥂 Now supporting thousands of languages! 🌎
True to our standard, the fermentation process is of the highest quality; it beats all other datasets in 83% of tracked languages 📈.
08.12.2024 09:27