@thomvaughan.bsky.social did a WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive. You can read more about his study in the thread below.
02.03.2026 17:12
Common Crawl - Blog - Introducing the New Examples & Resources Browser
We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.
blog.commoncrawl.org/blog/introdu...
23.02.2026 15:51
CommonLID can help us to create the next generation of open-source LangID models, which can in turn help create larger multilingual datasets. We would like to thank members of Masakhane and @seacrowd.bsky.social for their support in this effort.
10.02.2026 20:44
CommonLID proves to be the most challenging dataset in our evaluation of existing LangID systems, as can be seen in the last column of the table above.
10.02.2026 20:44
A table showing the results of our evaluation of existing LangID systems across 6 different datasets. The full text of the table is available in the paper linked below.
Current benchmarks over-estimate LangID performance on web data. In our evaluations, we show top existing models have < 80% F1, even when limiting to languages the models explicitly support.
10.02.2026 20:44
Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.
Language identification remains a challenging task, especially for web data. In collaboration with @mlcommons.org, @eleutherai.bsky.social, @jhu.edu, and 97 community members, we created CommonLID, a new LangID benchmark covering 100+ languages!
10.02.2026 20:44
Laurie Burchell at a lectern, with a blackboard behind her, presenting her Turing Seminar talk
A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”
In this talk, she introduced @commoncrawl.bsky.social and the data products it offers.
27.11.2025 13:05
Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025
We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.
commoncrawl.org/blog/host--a...
24.11.2025 17:46
Banner for the World Digital Preservation Day, 6th of November 2025
Common Crawl celebrates World Digital Preservation Day on Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?
commoncrawl.org/blog/common-...
06.11.2025 14:56