Datasets
Explore and download datasets for your projects
CodeParrot Dataset
The CodeParrot Dataset, created by the authors of the "Natural Language Processing with Transformers" book (Lewis Tunstall, Leandro von Werra, and Thomas Wolf), is a large collection of Python source code built for training code generation language models. It contains approximately 180GB of Python code from public GitHub repositories, which makes it a practical resource for learning how to train domain-specific language models and build Python programming assistants. Available on Hugging Face, this dataset is well suited to training GPT-style code models, building Python code completion tools, learning the end-to-end process of dataset curation and model training, and developing...
The Stack Dataset
The Stack Dataset, created by the BigCode project (a collaboration between Hugging Face and ServiceNow), is one of the largest collections of source code assembled for machine learning research. It contains 6.4TB of permissively licensed source code from GitHub repositories across 358 programming languages and served as training data for code generation models such as StarCoder. Available on Hugging Face, this dataset is well suited to training large language models for code, building programming assistants, developing code completion tools, studying programming language patterns, and advancing AI-powered software development - representing a crucial resource for the next...
Credit Card Fraud Detection Dataset
The Credit Card Fraud Detection Dataset, created by the Machine Learning Group (MLG) at Université Libre de Bruxelles (ULB) in collaboration with Worldline, is a widely used benchmark for fraud detection and imbalanced classification problems. It contains real credit card transactions made by European cardholders in September 2013, of which only 0.172% are fraudulent, an extreme class imbalance that mirrors real-world fraud detection challenges. Available on Kaggle, this dataset is well suited to building anomaly detection models, practicing techniques for handling severely imbalanced data, developing real-time fraud scoring systems, and understanding the critical balance...
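To make the class imbalance concrete, here is a minimal sketch that uses only the 0.172% fraud rate quoted above. The total of 100,000 transactions is a hypothetical round number chosen for illustration, not the dataset's actual size, and the function name `always_legit_baseline` is likewise made up for this example.

```python
def always_legit_baseline(n_transactions: int, fraud_rate: float) -> dict:
    """Metrics for a trivial model that labels every transaction as legitimate."""
    n_fraud = round(n_transactions * fraud_rate)
    n_legit = n_transactions - n_fraud
    # Every legitimate transaction is classified correctly, every fraud is missed.
    accuracy = n_legit / n_transactions
    fraud_recall = 0.0  # no fraudulent transaction is ever flagged
    return {"n_fraud": n_fraud, "accuracy": accuracy, "fraud_recall": fraud_recall}


# Assumption: 100,000 transactions (hypothetical) at the 0.172% fraud rate.
stats = always_legit_baseline(100_000, 0.00172)
print(stats)
```

Under these assumptions the trivial model is right more than 99.8% of the time while catching no fraud at all, which is why metrics such as precision, recall, or the area under the precision-recall curve are usually more informative than accuracy on data like this.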