The HumanEval Dataset, created by OpenAI and introduced in the Codex paper, is the gold standard benchmark for evaluating code generation models. This dataset contains 164 hand-written Python programming problems, each with a function signature, docstring, reference implementation, and multiple unit tests to verify functional correctness - providing a rigorous, execution-based evaluation framework that goes beyond syntactic correctness.
Available on Hugging Face, this dataset is excellent for benchmarking code generation models, evaluating AI coding assistants, measuring functional correctness using the pass@k metric, comparing different code LLMs, and understanding the state-of-the-art in automated program synthesis - making it the definitive evaluation standard for code generation research.
Key Features
- Records: 164 hand-crafted Python programming problems.
- Difficulty: Ranges from simple (basic string manipulation) to moderate (algorithms and data structures).
- Variables/Structure:
  - task_id: Unique identifier (e.g., "HumanEval/0", "HumanEval/1")
  - prompt: Function signature and docstring describing the task
  - canonical_solution: Reference implementation (ground truth)
  - test: Unit test code with assertions to verify correctness
  - entry_point: Name of the function to be implemented
- Data Type: Text (Python code with natural language descriptions).
- Format: JSON Lines (.jsonl), accessible via Hugging Face datasets library.
- Quality: Hand-written at OpenAI, both for clarity and to avoid overlap with code the models may have seen during training.
- Test Coverage: Multiple test cases per problem (visible and hidden edge cases).
- Evaluation Metric: pass@k - percentage of problems solved with k generated attempts.
- Language: Python 3 exclusively.
Why This Dataset
HumanEval revolutionized code generation evaluation by introducing execution-based testing rather than relying on text similarity metrics like BLEU. It measures whether generated code actually works, not just whether it looks similar to reference code. It's ideal for projects that aim to:
- Benchmark code generation models on functional correctness.
- Evaluate AI coding assistants and copilots objectively.
- Compare different code LLMs (GPT-4, Claude, Codex, StarCoder, CodeGen, etc.).
- Measure improvements from model fine-tuning or prompt engineering.
- Study the relationship between model size and coding ability.
- Develop better code generation techniques and architectures.
- Understand failure modes of code generation models.
- Research program synthesis and automated programming.
How to Use the Dataset
- Load via Hugging Face:
from datasets import load_dataset
# Load HumanEval dataset
dataset = load_dataset("openai/openai_humaneval")
# Access test split (only split available)
humaneval = dataset['test']
print(f"Number of problems: {len(humaneval)}")
# Output: 164
- Examine problem structure:
# Look at the first problem
problem = humaneval[0]
print("Task ID:", problem['task_id'])
print("\nPrompt (what the model sees):")
print(problem['prompt'])
print("\nCanonical Solution:")
print(problem['canonical_solution'])
print("\nTests:")
print(problem['test'])
print("\nEntry Point:", problem['entry_point'])
- Example problem breakdown:
# HumanEval/0: check whether any two numbers in a list are closer than a threshold
problem = humaneval[0]

# prompt (given to the model):
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer
    to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

# canonical_solution (the indented body that completes the function above):
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False

# test (run during evaluation):
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    # ... more test cases
- Generate code with a model:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a code generation model
model_name = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_code(prompt, num_samples=1, max_length=512):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_samples,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    completions = []
    for output in outputs:
        completion = tokenizer.decode(output, skip_special_tokens=True)
        # Keep only the newly generated code (assumes the decoded text starts with the prompt)
        completion = completion[len(prompt):]
        completions.append(completion)
    return completions

# Generate candidate solutions for the first problem
problem = humaneval[0]
solutions = generate_code(problem['prompt'], num_samples=10)
- Evaluate generated code:
import os
import subprocess
import tempfile

def execute_code(prompt, completion, test_code, entry_point):
    """Execute generated code and run the problem's tests."""
    # Assemble prompt + completion + tests, then invoke check()
    full_code = prompt + completion + "\n" + test_code
    full_code += f"\ncheck({entry_point})\n"
    # Write to a temporary file
    with tempfile.NamedTemporaryFile(
        mode='w',
        suffix='.py',
        delete=False
    ) as f:
        f.write(full_code)
        temp_file = f.name
    try:
        # Execute with a timeout
        result = subprocess.run(
            ['python', temp_file],
            timeout=5,
            capture_output=True,
            text=True
        )
        # A zero exit code means every assertion passed
        passed = result.returncode == 0
        return passed, result.stderr if not passed else ""
    except subprocess.TimeoutExpired:
        return False, "Timeout"
    except Exception as e:
        return False, str(e)
    finally:
        os.unlink(temp_file)

# Test one generated solution
problem = humaneval[0]
solution = solutions[0]
passed, error = execute_code(
    problem['prompt'],
    solution,
    problem['test'],
    problem['entry_point']
)
print(f"Test passed: {passed}")
if not passed:
    print(f"Error: {error}")
- Calculate pass@k metric:
import numpy as np
from collections import defaultdict

def estimate_pass_at_k(num_samples, num_correct, k):
    """
    Unbiased pass@k estimator from the Codex paper:
    pass@k = E[1 - C(n - c, k) / C(n, k)]
    where n = num_samples and c = num_correct.
    """
    if num_samples - num_correct < k:
        return 1.0
    return 1.0 - np.prod(
        1.0 - k / np.arange(num_samples - num_correct + 1, num_samples + 1)
    )

def calculate_pass_at_k(results, k_values=[1, 10, 100]):
    """
    Calculate pass@k averaged over all problems.
    results: dict mapping task_id -> list of bools (passed/failed)
    """
    pass_at_k = {k: [] for k in k_values}
    for task_id, outcomes in results.items():
        num_samples = len(outcomes)
        num_correct = sum(outcomes)
        for k in k_values:
            if k <= num_samples:
                pass_k = estimate_pass_at_k(num_samples, num_correct, k)
                pass_at_k[k].append(pass_k)
    # Average across all problems
    return {k: np.mean(v) for k, v in pass_at_k.items()}

# Example usage
results = defaultdict(list)
for problem in humaneval:
    task_id = problem['task_id']
    solutions = generate_code(problem['prompt'], num_samples=100)
    for solution in solutions:
        passed, _ = execute_code(
            problem['prompt'],
            solution,
            problem['test'],
            problem['entry_point']
        )
        results[task_id].append(passed)

metrics = calculate_pass_at_k(results, k_values=[1, 10, 100])
print(f"pass@1: {metrics[1]:.1%}")
print(f"pass@10: {metrics[10]:.1%}")
print(f"pass@100: {metrics[100]:.1%}")
- Use official evaluation harness:
# Install the official evaluation harness from OpenAI's repository
git clone https://github.com/openai/human-eval
pip install -e human-eval

# Generate completions and save them as JSONL, one record per sample:
# {"task_id": "HumanEval/0", "completion": "    return ..."}

# Run the evaluation
evaluate_functional_correctness samples.jsonl
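With the library installed, completions can be written in the expected format using its helpers. A minimal sketch following the repository's README, with the generate_code() helper from earlier standing in for the model call:
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # dict: task_id -> problem fields

# One completion per task here; the Codex paper samples many per task
samples = [
    dict(task_id=task_id,
         completion=generate_code(problems[task_id]["prompt"], num_samples=1)[0])
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)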
- Batch evaluation pipeline:
import json
from tqdm import tqdm

def evaluate_model_on_humaneval(model, tokenizer, humaneval, num_samples=10):
    """Complete evaluation pipeline."""
    results = []
    for problem in tqdm(humaneval, desc="Evaluating"):
        task_id = problem['task_id']
        # Generate multiple candidate solutions
        solutions = generate_code(
            problem['prompt'],
            num_samples=num_samples
        )
        # Test each solution
        for solution in solutions:
            passed, error = execute_code(
                problem['prompt'],
                solution,
                problem['test'],
                problem['entry_point']
            )
            results.append({
                'task_id': task_id,
                'completion': solution,
                'passed': passed,
                'error': error if not passed else None
            })
    return results

# Run the evaluation
results = evaluate_model_on_humaneval(model, tokenizer, humaneval)

# Save results
with open('evaluation_results.jsonl', 'w') as f:
    for result in results:
        f.write(json.dumps(result) + '\n')
- Analyze failure patterns:
def analyze_failures(results):
    """Tally common failure categories from evaluation results."""
    failure_types = {
        'syntax_error': 0,
        'runtime_error': 0,
        'assertion_error': 0,
        'timeout': 0,
        'other': 0
    }
    for result in results:
        if not result['passed']:
            error = result.get('error') or ''
            if 'SyntaxError' in error:
                failure_types['syntax_error'] += 1
            elif 'AssertionError' in error:
                failure_types['assertion_error'] += 1
            elif 'Timeout' in error:
                failure_types['timeout'] += 1
            elif error:
                failure_types['runtime_error'] += 1
            else:
                failure_types['other'] += 1
    return failure_types

failures = analyze_failures(results)
print("Failure Analysis:")
for error_type, count in failures.items():
    print(f"  {error_type}: {count}")
- Compare multiple models:
models_to_compare = [
    "bigcode/starcoder",
    "Salesforce/codegen-350M-mono",
    "microsoft/CodeGPT-small-py"
]

comparison_results = {}
for model_name in models_to_compare:
    print(f"\nEvaluating {model_name}...")
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    results = evaluate_model_on_humaneval(
        model,
        tokenizer,
        humaneval,
        num_samples=10
    )
    # Calculate pass@k
    task_results = defaultdict(list)
    for r in results:
        task_results[r['task_id']].append(r['passed'])
    metrics = calculate_pass_at_k(task_results, k_values=[1, 10])
    comparison_results[model_name] = metrics
    print(f"pass@1: {metrics[1]:.1%}")
    print(f"pass@10: {metrics[10]:.1%}")
Possible Project Ideas
- Code model benchmark evaluating and comparing different code LLMs on HumanEval.
- Prompt engineering study testing how different prompts affect pass@k scores.
- Fine-tuning experiment measuring improvement from domain-specific fine-tuning.
- Few-shot learning analysis testing impact of providing example solutions in prompts.
- Temperature and sampling study optimizing generation parameters for code quality.
- Error analysis dashboard categorizing and visualizing common failure modes.
- Model size vs. performance studying scaling laws for code generation.
- Ensemble code generation combining multiple models for better pass rates.
- Self-consistency evaluation generating multiple solutions and selecting via voting.
- Chain-of-thought for coding adding reasoning steps before code generation.
- Difficulty analysis categorizing problems by complexity and model performance.
- Execution-based filtering using test execution to select best among k generations.
- Iterative refinement generating, testing, and refining code automatically.
- Transfer learning study evaluating models trained on different code corpora.
- Human-AI comparison benchmarking models against human programmer performance.
Dataset Challenges and Considerations
- Limited Size: With only 164 problems, small differences between models may not be statistically significant.
- Python-Only: Doesn't evaluate multi-language code generation abilities.
- Moderate Difficulty: Problems are relatively straightforward; doesn't test complex algorithms.
- Test Coverage: Fixed test suites may not catch all edge cases or bugs.
- Overfitting Risk: Models could potentially memorize solutions if dataset leaked into training.
- Single Solution: Canonical solution may not represent the only or best approach.
- Evaluation Cost: Requires code execution, which is computationally expensive and potentially unsafe.
- No Context: Problems are isolated; doesn't test integration or project-level coding.
- Temporal Shift: Python features and best practices evolve; dataset is static.
- Security Concerns: Executing generated code requires sandboxing to prevent malicious operations.
Understanding pass@k Metric
Definition: pass@k measures the probability that at least one solution out of k generated attempts passes all test cases.
Formula:
pass@k = E[1 - (n-c choose k) / (n choose k)]
where:
- n = total number of generated samples
- c = number of correct samples
- k = number of attempts we're evaluating
Interpretation:
- pass@1: Likelihood of getting it right on first try (most stringent)
- pass@10: Likelihood of getting it right in 10 tries (more lenient)
- pass@100: Likelihood of getting it right in 100 tries (very lenient)
Example: If a model generates 100 solutions per problem and 30 are correct (verified in the snippet below):
- pass@1 ≈ 30% (probability the first sample is correct)
- pass@10 ≈ 98% (very likely at least one of 10 samples is correct)
- pass@100 = 100% (with all 100 samples drawn, at least one is correct by construction)
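These values follow directly from the estimator implemented earlier; a quick check reusing the estimate_pass_at_k function from the pass@k section:
# Worked example: n = 100 samples, c = 30 correct
for k in [1, 10, 100]:
    print(f"pass@{k}: {estimate_pass_at_k(100, 30, k):.1%}")
# pass@1: 30.0%, pass@10: 97.7%, pass@100: 100.0%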
Why Use pass@k?
- Models often generate diverse solutions; first isn't always best
- Reflects real-world usage where developers can try multiple suggestions
- Captures model's ability to explore solution space
- More stable metric than binary pass/fail on single generation
Performance Benchmarks
State-of-the-Art Results (as of 2024):
| Model | pass@1 | pass@10 | pass@100 |
|---|---|---|---|
| GPT-4 | ~67% | ~85% | ~92% |
| Claude 3 Opus | ~65% | ~82% | ~90% |
| GPT-3.5 Turbo | ~48% | ~70% | ~82% |
| StarCoder (15B) | ~34% | ~60% | ~78% |
| CodeGen (16B) | ~29% | ~55% | ~73% |
| Codex (12B) | ~28.8% | ~46.8% | ~72.3% |
| GPT-3 (175B) | ~0% | ~2% | ~5% |
Key Insights:
- Larger models generally perform better
- Code-specialized models outperform general LLMs
- GPT-4 represents current SOTA
- Even the best models fail roughly 30% of problems at pass@1
- Improvement from pass@1 to pass@100 shows value of multiple attempts
Historical Context:
- Original Codex (2021): 28.8% pass@1
- Competitive programmer ceiling: ~90% pass@1 (estimated)
- Continued rapid improvement in code generation capabilities
Problem Categories and Difficulty
Easy Problems (pass@1 > 60%):
- String manipulation
- Basic list operations
- Simple mathematical functions
- Type conversions
Medium Problems (30% < pass@1 < 60%):
- Algorithm implementation (sorting, searching)
- Data structure manipulation
- Mathematical computations
- Pattern matching
Hard Problems (pass@1 < 30%):
- Complex algorithms
- Edge case handling
- Optimization problems
- Multi-step reasoning
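These buckets can also be derived empirically from per-problem pass rates. A minimal sketch, assuming the results dict built in the pass@k section (task_id mapped to a list of pass/fail booleans) and reusing the thresholds above:
def bucket_by_difficulty(results):
    """Group problems into easy/medium/hard by empirical pass rate."""
    buckets = {'easy': [], 'medium': [], 'hard': []}
    for task_id, outcomes in results.items():
        pass_rate = sum(outcomes) / len(outcomes)
        if pass_rate > 0.6:
            buckets['easy'].append(task_id)
        elif pass_rate >= 0.3:
            buckets['medium'].append(task_id)
        else:
            buckets['hard'].append(task_id)
    return buckets

buckets = bucket_by_difficulty(results)
for name, tasks in buckets.items():
    print(f"{name}: {len(tasks)} problems")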
Common Failure Modes:
- Off-by-one errors: Incorrect loop boundaries
- Edge cases: Empty lists, single elements, negative numbers
- Type errors: Mixing int/float, string formatting
- Logic errors: Incorrect algorithm implementation
- Incomplete solutions: Partial implementations
- Syntax errors: Invalid Python code
- Infinite loops: Non-terminating code
Evaluation Best Practices
Safety Considerations:
# Use sandboxing for code execution (requires the docker package and a running Docker daemon)
import docker

def safe_execute_code(code, timeout=5):
    """Execute code in a Docker container for safety."""
    client = docker.from_env()
    container = None
    try:
        container = client.containers.run(
            "python:3.10-slim",
            ["python", "-c", code],  # pass as argv to avoid shell-quoting issues
            detach=True,
            mem_limit="512m",
            cpu_quota=50000,         # limit CPU usage
            network_disabled=True    # no network access
        )
        result = container.wait(timeout=timeout)
        logs = container.logs().decode('utf-8')
        return result['StatusCode'] == 0, logs
    except Exception as e:
        return False, str(e)
    finally:
        if container is not None:
            container.remove(force=True)
Reproducibility:
- Fix random seeds for sampling
- Document model version, temperature, top_p, etc.
- Save all generated solutions for analysis
- Use consistent evaluation code (official library)
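The first two items can be pinned down in code. A minimal sketch (set_seed is a transformers utility that seeds Python, NumPy, and PyTorch; the config fields are illustrative):
from transformers import set_seed

# Record every generation parameter alongside the results
generation_config = {
    "model_name": "bigcode/starcoder",
    "temperature": 0.8,
    "top_p": 0.95,
    "num_samples": 100,
    "seed": 42,
}
set_seed(generation_config["seed"])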
Statistical Significance:
- Run multiple trials with different seeds
- Report confidence intervals
- Use bootstrap resampling for uncertainty estimates
- Larger n (samples) provides more reliable pass@k estimates
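For the bootstrap, a sketch assuming the per-task results dict from earlier:
import numpy as np

def bootstrap_pass_at_1_ci(results, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean pass@1 across problems."""
    rng = np.random.default_rng(seed)
    # Per-problem pass@1 is the fraction of samples that passed
    scores = np.array([sum(v) / len(v) for v in results.values()])
    boot_means = [
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ]
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (low, high)

mean, (low, high) = bootstrap_pass_at_1_ci(results)
print(f"pass@1: {mean:.1%} (95% CI: {low:.1%} - {high:.1%})")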
Reporting Standards:
- Always report pass@1 (primary metric)
- Include pass@10 and pass@100 for completeness
- Document number of samples used
- Report evaluation time and compute costs
- Share failure analysis and error categories
Extensions and Variations
MBPP (Mostly Basic Python Problems):
- 974 problems (larger than HumanEval)
- Similar format but more entry-level
- Complementary benchmark
HumanEval+:
- Extended test suite with more edge cases
- Harder evaluation (catches more bugs)
- Same problems, more comprehensive testing
MultiPL-E:
- HumanEval translated to 18+ languages
- Tests multi-language code generation
- Cross-language comparison
APPS (Automated Programming Progress Standard):
- 10,000 competitive programming problems
- Much harder than HumanEval
- Tests advanced algorithmic skills
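Several of these variants can be loaded the same way as HumanEval. The Hugging Face dataset paths below are assumptions and may change; check each project's page:
from datasets import load_dataset

# MBPP: larger, more entry-level complement to HumanEval
mbpp = load_dataset("mbpp")

# HumanEval+: same problems with extended test suites (path assumed)
humaneval_plus = load_dataset("evalplus/humanevalplus")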
Ethical Considerations
Academic Integrity:
- Don't use for homework or assignments without disclosure
- Understand code rather than blindly copying
- Credit AI assistance appropriately
Data Contamination:
- HumanEval may be in training data of newer models
- Results may be inflated due to memorization
- Need new benchmarks to avoid overfitting
Safety:
- Generated code may contain vulnerabilities
- Always review before executing or deploying
- Sandbox untrusted code execution
Accessibility:
- Democratizes programming knowledge
- Lowers barrier to entry for coding
- May disrupt programming education and employment