pietrolesci's Collections
Interesting Pre-Training Datasets
Updated Sep 19, 2025
Zyphra/Zyda-2 • Preview • Updated Aug 6, 2025 • 150k • 94
Note: Look at the preprocessing code: https://github.com/Zyphra/Zyda_processing
HuggingFaceTB/dclm-edu • Viewer • Updated Mar 7, 2025 • 1B • 8.35k • 31
HuggingFaceFW/fineweb-edu • Viewer • Updated Jul 11, 2025 • 3.5B • 625k • 1.09k
HuggingFaceTB/stack-edu • Viewer • Updated Mar 20, 2025 • 167M • 5.94k • 70
HuggingFaceTB/finemath • Viewer • Updated Feb 6, 2025 • 48.3M • 39.9k • 360
bigcode/the-stack • Viewer • Updated Apr 13, 2023 • 546M • 19.8k • 1k
bigcode/the-stack-v2 • Viewer • Updated Apr 23, 2024 • 5.45B • 18.1k • 565
HuggingFaceTB/smollm-corpus • Viewer • Updated Sep 6, 2024 • 237M • 61.5k • 457
mlfoundations/dclm-baseline-1.0 • Preview • Updated Jul 22, 2024 • 498k • 274
HuggingFaceFW/fineweb-2 • Viewer • Updated Oct 27, 2025 • 4.48B • 57k • 803