The HumanEval Dataset, created by OpenAI and introduced in the Codex paper, is the gold standard benchmark for evaluating code generation models. This dataset contains 164 hand-written Python programming problems, each with a function signature, docstring, reference implementation, and multiple unit tests to verify functional correctness - providing a rigorous, execution-based evaluation framework that goes beyond syntactic correctness.
Available on Hugging Face, this dataset is excellent for benchmarking code generation models, evaluating AI coding assistants, measuring functional correctness using the pass@k metric, comparing different code LLMs, and understanding the state-of-the-art in automated program synthesis - making it the definitive evaluation standard for code generation research.
Key Features
- Records: 164 hand-crafted Python programming problems.
- Difficulty: Ranges from simple (basic string manipulation) to moderate (algorithms and data structures).
- Variables/Structure:
  - task_id: Unique identifier (e.g., "HumanEval/0", "HumanEval/1")
  - prompt: Function signature and docstring describing the task
  - canonical_solution: Reference implementation (ground truth)
  - test: Unit test code with assertions to verify correctness
  - entry_point: Name of the function to be implemented
- Data Type: Text (Python code with natural language descriptions).
- Format: JSON Lines (.jsonl), accessible via Hugging Face datasets library.
- Quality: Hand-written at OpenAI, both for clarity and to avoid overlap with code the models may have seen during training.
- Test Coverage: Multiple test cases per problem (visible and hidden edge cases).
- Evaluation Metric: pass@k - percentage of problems solved with k generated attempts.
- Language: Python 3 exclusively.
Why This Dataset
HumanEval revolutionized code generation evaluation by introducing execution-based testing rather than relying on text similarity metrics like BLEU. It measures whether generated code actually works, not just whether it looks similar to reference code. It's ideal for projects that aim to:
- Benchmark code generation models on functional correctness.
- Evaluate AI coding assistants and copilots objectively.
- Compare different code LLMs (GPT-4, Claude, Codex, StarCoder, CodeGen, etc.).
- Measure improvements from model fine-tuning or prompt engineering.
- Study the relationship between model size and coding ability.
- Develop better code generation techniques and architectures.
- Understand failure modes of code generation models.
- Research program synthesis and automated programming.
How to Use the Dataset
- Load via Hugging Face:
from datasets import load_dataset
# Load HumanEval dataset
dataset = load_dataset("openai/openai_humaneval")
# Access test split (only split available)
humaneval = dataset['test']
print(f"Number of problems: {len(humaneval)}")
# Output: 164
- Examine problem structure:
# Look at the first problem
problem = humaneval[0]
print("Task ID:", problem['task_id'])
print("\nPrompt (what the model sees):")
print(problem['prompt'])
print("\nCanonical Solution:")
print(problem['canonical_solution'])
print("\nTests:")
print(problem['test'])
print("\nEntry Point:", problem['entry_point'])
- Example problem breakdown:
# HumanEval/0: check whether any two numbers in a list are closer than a threshold
problem = humaneval[0]

# prompt (given to the model):
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer
    to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

# canonical_solution (the indented body that completes the function above):
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False

# test (run during evaluation):
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    # ... more test cases
- Generate code with a model:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a code generation model
model_name = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_code(prompt, num_samples=1, max_length=512):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_samples,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    completions = []
    for output in outputs:
        completion = tokenizer.decode(output, skip_special_tokens=True)
        # Keep only the newly generated code (assumes the decoded text starts with the prompt)
        completion = completion[len(prompt):]
        completions.append(completion)
    return completions

# Generate candidate solutions for the first problem
problem = humaneval[0]
solutions = generate_code(problem['prompt'], num_samples=10)
- Evaluate generated code:
import os
import subprocess
import tempfile

def execute_code(prompt, completion, test_code, entry_point):
    """Execute generated code and run the problem's tests."""
    # Assemble prompt + completion + tests, then invoke check()
    full_code = prompt + completion + "\n" + test_code
    full_code += f"\ncheck({entry_point})\n"
    # Write to a temporary file
    with tempfile.NamedTemporaryFile(
        mode='w',
        suffix='.py',
        delete=False
    ) as f:
        f.write(full_code)
        temp_file = f.name
    try:
        # Execute with a timeout
        result = subprocess.run(
            ['python', temp_file],
            timeout=5,
            capture_output=True,
            text=True
        )
        # A zero exit code means every assertion passed
        passed = result.returncode == 0
        return passed, result.stderr if not passed else ""
    except subprocess.TimeoutExpired:
        return False, "Timeout"
    except Exception as e:
        return False, str(e)
    finally:
        os.unlink(temp_file)

# Test one generated solution
problem = humaneval[0]
solution = solutions[0]
passed, error = execute_code(
    problem['prompt'],
    solution,
    problem['test'],
    problem['entry_point']
)
print(f"Test passed: {passed}")
if not passed:
    print(f"Error: {error}")
- Calculate pass@k metric:
import numpy as np
from collections import defaultdict

def estimate_pass_at_k(num_samples, num_correct, k):
    """
    Unbiased pass@k estimator from the Codex paper:
    pass@k = E[1 - C(n - c, k) / C(n, k)]
    where n = num_samples and c = num_correct.
    """
    if num_samples - num_correct < k:
        return 1.0
    return 1.0 - np.prod(
        1.0 - k / np.arange(num_samples - num_correct + 1, num_samples + 1)
    )

def calculate_pass_at_k(results, k_values=[1, 10, 100]):
    """
    Calculate pass@k averaged over all problems.
    results: dict mapping task_id -> list of bools (passed/failed)
    """
    pass_at_k = {k: [] for k in k_values}
    for task_id, outcomes in results.items():
        num_samples = len(outcomes)
        num_correct = sum(outcomes)
        for k in k_values:
            if k <= num_samples:
                pass_k = estimate_pass_at_k(num_samples, num_correct, k)
                pass_at_k[k].append(pass_k)
    # Average across all problems
    return {k: np.mean(v) for k, v in pass_at_k.items()}

# Example usage
results = defaultdict(list)
for problem in humaneval:
    task_id = problem['task_id']
    solutions = generate_code(problem['prompt'], num_samples=100)
    for solution in solutions:
        passed, _ = execute_code(
            problem['prompt'],
            solution,
            problem['test'],
            problem['entry_point']
        )
        results[task_id].append(passed)

metrics = calculate_pass_at_k(results, k_values=[1, 10, 100])
print(f"pass@1: {metrics[1]:.1%}")
print(f"pass@10: {metrics[10]:.1%}")
print(f"pass@100: {metrics[100]:.1%}")
- Use official evaluation harness:
# Install the official evaluation harness from OpenAI's repository
git clone https://github.com/openai/human-eval
pip install -e human-eval

# Generate completions and save them as JSONL, one record per sample:
# {"task_id": "HumanEval/0", "completion": "    return ..."}

# Run the evaluation
evaluate_functional_correctness samples.jsonl
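With the library installed, completions can be written in the expected format using its helpers. A minimal sketch following the repository's README, with the generate_code() helper from earlier standing in for the model call:
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # dict: task_id -> problem fields

# One completion per task here; the Codex paper samples many per task
samples = [
    dict(task_id=task_id,
         completion=generate_code(problems[task_id]["prompt"], num_samples=1)[0])
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)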
- Batch evaluation pipeline:
import json
from tqdm import tqdm

def evaluate_model_on_humaneval(model, tokenizer, humaneval, num_samples=10):
    """Complete evaluation pipeline."""
    results = []
    for problem in tqdm(humaneval, desc="Evaluating"):
        task_id = problem['task_id']
        # Generate multiple candidate solutions
        solutions = generate_code(
            problem['prompt'],
            num_samples=num_samples
        )
        # Test each solution
        for solution in solutions:
            passed, error = execute_code(
                problem['prompt'],
                solution,
                problem['test'],
                problem['entry_point']
            )
            results.append({
                'task_id': task_id,
                'completion': solution,
                'passed': passed,
                'error': error if not passed else None
            })
    return results

# Run the evaluation
results = evaluate_model_on_humaneval(model, tokenizer, humaneval)

# Save results
with open('evaluation_results.jsonl', 'w') as f:
    for result in results:
        f.write(json.dumps(result) + '\n')
- Analyze failure patterns:
def analyze_failures(results):
    """Tally common failure categories from evaluation results."""
    failure_types = {
        'syntax_error': 0,
        'runtime_error': 0,
        'assertion_error': 0,
        'timeout': 0,
        'other': 0
    }
    for result in results:
        if not result['passed']:
            error = result.get('error') or ''
            if 'SyntaxError' in error:
                failure_types['syntax_error'] += 1
            elif 'AssertionError' in error:
                failure_types['assertion_error'] += 1
            elif 'Timeout' in error:
                failure_types['timeout'] += 1
            elif error:
                failure_types['runtime_error'] += 1
            else:
                failure_types['other'] += 1
    return failure_types

failures = analyze_failures(results)
print("Failure Analysis:")
for error_type, count in failures.items():
    print(f"  {error_type}: {count}")
- Compare multiple models:
models_to_compare = [
    "bigcode/starcoder",
    "Salesforce/codegen-350M-mono",
    "microsoft/CodeGPT-small-py"
]

comparison_results = {}
for model_name in models_to_compare:
    print(f"\nEvaluating {model_name}...")
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    results = evaluate_model_on_humaneval(
        model,
        tokenizer,
        humaneval,
        num_samples=10
    )
    # Calculate pass@k
    task_results = defaultdict(list)
    for r in results:
        task_results[r['task_id']].append(r['passed'])
    metrics = calculate_pass_at_k(task_results, k_values=[1, 10])
    comparison_results[model_name] = metrics
    print(f"pass@1: {metrics[1]:.1%}")
    print(f"pass@10: {metrics[10]:.1%}")
Possible Project Ideas
- Code model benchmark evaluating and comparing different code LLMs on HumanEval.
- Prompt engineering study testing how different prompts affect pass@k scores.
- Fine-tuning experiment measuring improvement from domain-specific fine-tuning.
- Few-shot learning analysis testing impact of providing example solutions in prompts.
- Temperature and sampling study optimizing generation parameters for code quality.
- Error analysis dashboard categorizing and visualizing common failure modes.
- Model size vs. performance studying scaling laws for code generation.
- Ensemble code generation combining multiple models for better pass rates.
- Self-consistency evaluation generating multiple solutions and selecting via voting.
- Chain-of-thought for coding adding reasoning steps before code generation.
- Difficulty analysis categorizing problems by complexity and model performance.
- Execution-based filtering using test execution to select best among k generations.
- Iterative refinement generating, testing, and refining code automatically.
- Transfer learning study evaluating models trained on different code corpora.
- Human-AI comparison benchmarking models against human programmer performance.
Dataset Challenges and Considerations
- Limited Size: With only 164 problems, small differences between models may not be statistically significant.
- Python-Only: Doesn't evaluate multi-language code generation abilities.
- Moderate Difficulty: Problems are relatively straightforward; doesn't test complex algorithms.
- Test Coverage: Fixed test suites may not catch all edge cases or bugs.
- Overfitting Risk: Models could potentially memorize solutions if dataset leaked into training.
- Single Solution: Canonical solution may not represent the only or best approach.
- Evaluation Cost: Requires code execution, which is computationally expensive and potentially unsafe.
- No Context: Problems are isolated; doesn't test integration or project-level coding.
- Temporal Shift: Python features and best practices evolve; dataset is static.
- Security Concerns: Executing generated code requires sandboxing to prevent malicious operations.
Understanding pass@k Metric
Definition: pass@k measures the probability that at least one solution out of k generated attempts passes all test cases.
Formula:
pass@k = E[1 - (n-c choose k) / (n choose k)]
where:
- n = total number of generated samples
- c = number of correct samples
- k = number of attempts we're evaluating
Interpretation:
- pass@1: Likelihood of getting it right on first try (most stringent)
- pass@10: Likelihood of getting it right in 10 tries (more lenient)
- pass@100: Likelihood of getting it right in 100 tries (very lenient)
Example: If a model generates 100 solutions per problem and 30 are correct (verified in the snippet below):
- pass@1 ≈ 30% (probability the first sample is correct)
- pass@10 ≈ 98% (very likely at least one of 10 samples is correct)
- pass@100 = 100% (with all 100 samples drawn, at least one is correct by construction)
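These values follow directly from the estimator implemented earlier; a quick check reusing the estimate_pass_at_k function from the pass@k section:
# Worked example: n = 100 samples, c = 30 correct
for k in [1, 10, 100]:
    print(f"pass@{k}: {estimate_pass_at_k(100, 30, k):.1%}")
# pass@1: 30.0%, pass@10: 97.7%, pass@100: 100.0%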
Why Use pass@k?
- Models often generate diverse solutions; first isn't always best
- Reflects real-world usage where developers can try multiple suggestions
- Captures model's ability to explore solution space
- More stable metric than binary pass/fail on single generation
Performance Benchmarks
State-of-the-Art Results (as of 2024):
| Model | pass@1 | pass@10 | pass@100 |
|---|---|---|---|
| GPT-4 | ~67% | ~85% | ~92% |
| Claude 3 Opus | ~65% | ~82% | ~90% |
| GPT-3.5 Turbo | ~48% | ~70% | ~82% |
| StarCoder (15B) | ~34% | ~60% | ~78% |
| CodeGen (16B) | ~29% | ~55% | ~73% |
| Codex (12B) | ~28.8% | ~46.8% | ~72.3% |
| GPT-3 (175B) | ~0% | ~2% | ~5% |
Key Insights:
- Larger models generally perform better
- Code-specialized models outperform general LLMs
- GPT-4 represents current SOTA
- Even the best models fail roughly 30% of problems at pass@1
- Improvement from pass@1 to pass@100 shows value of multiple attempts
Historical Context:
- Original Codex (2021): 28.8% pass@1
- Competitive programmer ceiling: ~90% pass@1 (estimated)
- Continued rapid improvement in code generation capabilities
Problem Categories and Difficulty
Easy Problems (pass@1 > 60%):
- String manipulation
- Basic list operations
- Simple mathematical functions
- Type conversions
Medium Problems (30% < pass@1 < 60%):
- Algorithm implementation (sorting, searching)
- Data structure manipulation
- Mathematical computations
- Pattern matching
Hard Problems (pass@1 < 30%):
- Complex algorithms
- Edge case handling
- Optimization problems
- Multi-step reasoning
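These buckets can also be derived empirically from per-problem pass rates. A minimal sketch, assuming the results dict built in the pass@k section (task_id mapped to a list of pass/fail booleans) and reusing the thresholds above:
def bucket_by_difficulty(results):
    """Group problems into easy/medium/hard by empirical pass rate."""
    buckets = {'easy': [], 'medium': [], 'hard': []}
    for task_id, outcomes in results.items():
        pass_rate = sum(outcomes) / len(outcomes)
        if pass_rate > 0.6:
            buckets['easy'].append(task_id)
        elif pass_rate >= 0.3:
            buckets['medium'].append(task_id)
        else:
            buckets['hard'].append(task_id)
    return buckets

buckets = bucket_by_difficulty(results)
for name, tasks in buckets.items():
    print(f"{name}: {len(tasks)} problems")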
Common Failure Modes:
- Off-by-one errors: Incorrect loop boundaries
- Edge cases: Empty lists, single elements, negative numbers
- Type errors: Mixing int/float, string formatting
- Logic errors: Incorrect algorithm implementation
- Incomplete solutions: Partial implementations
- Syntax errors: Invalid Python code
- Infinite loops: Non-terminating code
Evaluation Best Practices
Safety Considerations:
# Use sandboxing for code execution (requires the docker package and a running Docker daemon)
import docker

def safe_execute_code(code, timeout=5):
    """Execute code in a Docker container for safety."""
    client = docker.from_env()
    container = None
    try:
        container = client.containers.run(
            "python:3.10-slim",
            ["python", "-c", code],  # pass as argv to avoid shell-quoting issues
            detach=True,
            mem_limit="512m",
            cpu_quota=50000,         # limit CPU usage
            network_disabled=True    # no network access
        )
        result = container.wait(timeout=timeout)
        logs = container.logs().decode('utf-8')
        return result['StatusCode'] == 0, logs
    except Exception as e:
        return False, str(e)
    finally:
        if container is not None:
            container.remove(force=True)
Reproducibility:
- Fix random seeds for sampling
- Document model version, temperature, top_p, etc.
- Save all generated solutions for analysis
- Use consistent evaluation code (official library)
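The first two items can be pinned down in code. A minimal sketch (set_seed is a transformers utility that seeds Python, NumPy, and PyTorch; the config fields are illustrative):
from transformers import set_seed

# Record every generation parameter alongside the results
generation_config = {
    "model_name": "bigcode/starcoder",
    "temperature": 0.8,
    "top_p": 0.95,
    "num_samples": 100,
    "seed": 42,
}
set_seed(generation_config["seed"])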
Statistical Significance:
- Run multiple trials with different seeds
- Report confidence intervals
- Use bootstrap resampling for uncertainty estimates
- Larger n (samples) provides more reliable pass@k estimates
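For the bootstrap, a sketch assuming the per-task results dict from earlier:
import numpy as np

def bootstrap_pass_at_1_ci(results, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean pass@1 across problems."""
    rng = np.random.default_rng(seed)
    # Per-problem pass@1 is the fraction of samples that passed
    scores = np.array([sum(v) / len(v) for v in results.values()])
    boot_means = [
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ]
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (low, high)

mean, (low, high) = bootstrap_pass_at_1_ci(results)
print(f"pass@1: {mean:.1%} (95% CI: {low:.1%} - {high:.1%})")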
Reporting Standards:
- Always report pass@1 (primary metric)
- Include pass@10 and pass@100 for completeness
- Document number of samples used
- Report evaluation time and compute costs
- Share failure analysis and error categories
Extensions and Variations
MBPP (Mostly Basic Python Problems):
- 974 problems (larger than HumanEval)
- Similar format but more entry-level
- Complementary benchmark
HumanEval+:
- Extended test suite with more edge cases
- Harder evaluation (catches more bugs)
- Same problems, more comprehensive testing
MultiPL-E:
- HumanEval translated to 18+ languages
- Tests multi-language code generation
- Cross-language comparison
APPS (Automated Programming Progress Standard):
- 10,000 competitive programming problems
- Much harder than HumanEval
- Tests advanced algorithmic skills
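Several of these variants can be loaded the same way as HumanEval. The Hugging Face dataset paths below are assumptions and may change; check each project's page:
from datasets import load_dataset

# MBPP: larger, more entry-level complement to HumanEval
mbpp = load_dataset("mbpp")

# HumanEval+: same problems with extended test suites (path assumed)
humaneval_plus = load_dataset("evalplus/humanevalplus")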
Ethical Considerations
Academic Integrity:
- Don't use for homework or assignments without disclosure
- Understand code rather than blindly copying
- Credit AI assistance appropriately
Data Contamination:
- HumanEval may be in training data of newer models
- Results may be inflated due to memorization
- Need new benchmarks to avoid overfitting
Safety:
- Generated code may contain vulnerabilities
- Always review before executing or deploying
- Sandbox untrusted code execution
Accessibility:
- Democratizes programming knowledge
- Lowers barrier to entry for coding
- May disrupt programming education and employment