Datasets
Explore and download datasets for your projects
HumanEval Dataset
The HumanEval Dataset, created by OpenAI and introduced in the Codex paper, is a widely used benchmark for evaluating code generation models. It contains 164 hand-written Python programming problems, each with a function signature, docstring, reference implementation, and unit tests that verify functional correctness, so evaluation is execution-based rather than a check of syntax or textual similarity. Available on Hugging Face, it is well suited to benchmarking code generation models, evaluating AI coding assistants, measuring functional correctness with the pass@k metric, comparing code LLMs, and understanding the state of the art in automated program synthesis - making it the definitive evaluation...
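The pass@k metric mentioned above can be sketched as follows. This is a minimal illustration, assuming the unbiased estimator described in the Codex paper: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not part of any official evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (hypothetical helper).

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: sample budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample
    # is C(n - c, k) / C(n, k); comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two problems, 10 samples each: 3 passing and 0 passing.
    print(mean_pass_at_k([(10, 3), (10, 0)], k=1))
```

Using exact integer binomials keeps the sketch simple; for very large n, a product-based formulation may be preferable for speed.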
CodeParrot Dataset
The CodeParrot Dataset, created by the authors of the "Natural Language Processing with Transformers" book (Lewis Tunstall, Leandro von Werra, and Thomas Wolf), is a large collection of Python source code designed specifically for training code generation language models. The dataset contains approximately 180GB of Python code from public GitHub repositories (the size before deduplication; a cleaned, deduplicated variant is considerably smaller), making it a practical resource for learning how to train domain-specific language models and build Python programming assistants. Available on Hugging Face, it is well suited to training GPT-style code models, building Python code completion tools, learning the end-to-end process of dataset curation and model training, and developing...
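Deduplication is one curation step for a code corpus like this one. The sketch below shows exact-match deduplication by content hash. It is an illustration only, not the CodeParrot authors' pipeline; the function name `dedup_exact` and the choice to normalize whitespace before hashing are assumptions made for the example.

```python
import hashlib


def dedup_exact(files: list[str]) -> list[str]:
    """Keep the first occurrence of each file, comparing by content hash.

    Whitespace is collapsed before hashing (an assumption of this sketch)
    so that trivially reformatted copies map to the same key. The original
    text of each kept file is returned unchanged.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for text in files:
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    corpus = ["a = 1\n", "a  =  1", "b = 2"]
    print(dedup_exact(corpus))
```

Exact hashing only catches identical files after normalization; near-duplicate detection would need a similarity-based method instead.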
The Stack Dataset
The Stack Dataset, created by the BigCode project (a collaboration between Hugging Face and ServiceNow), is one of the largest collections of source code assembled for machine learning research. It contains 6.4TB of permissively licensed source code from GitHub repositories across 358 programming languages and served as the basis for training code generation models such as StarCoder. Available on Hugging Face, it is well suited to training large language models for code, building programming assistants, developing code completion tools, studying programming language patterns, and advancing AI-powered software development - representing a crucial resource for the next...