- File Size 20.17 MB
- File Count 1
- Create Date December 15, 2025
- Last Updated December 15, 2025
Lithology Prediction Dataset
The Lithology Prediction Dataset (FORCE 2020), created as part of the FORCE (The Norwegian Offshore Competence Centre) Machine Learning Competition 2020, is one of the most comprehensive and challenging open datasets for lithology prediction in the petroleum industry. This dataset contains extensive well log data from multiple wells in the North Sea, designed specifically for developing and testing advanced machine learning models for rock type classification.
Available on Kaggle and the FORCE website, the dataset supports building predictive models that help geoscientists and petroleum engineers identify subsurface rock types automatically from geophysical measurements, which can speed up reservoir characterization workflows.
Key Features
- Records: Hundreds of thousands of depth measurements across multiple North Sea wells.
- Variables: Over 20 features including:
  - Target Variable: FORCE_2020_LITHOFACIES_LITHOLOGY (rock type classification)
  - Confidence Score: FORCE_2020_LITHOFACIES_CONFIDENCE (label reliability indicator)
  - Standard Logs: GR, RHOB, NPHI, DTC, SP, PEF, SGR
  - Resistivity Suite: RSHA, RMED, RDEP, RMIC, RXO
  - Borehole Quality: CALI, BS, DCAL, DRHO
  - Drilling Parameters: ROP, ROPA, MUDWEIGHT
  - Advanced Measurements: DTS (shear sonic travel time)
  - Location Data: X_LOC, Y_LOC, Z_LOC, DEPTH_MD
  - Geological Context: GROUP, FORMATION, WELL
- Data Type: Mixed (numerical measurements and categorical labels).
- Format: CSV file.
- Challenge: Highly imbalanced classes representing North Sea lithologies.
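The missing data and class imbalance noted above can be quantified right after loading. The following is a minimal sketch, not a definitive recipe: the helper name `summarize_logs` is hypothetical, and the file name `train.csv` and the `;` separator in the usage comment are assumptions that should be checked against the CSV extracted from the archive.

```python
import pandas as pd

TARGET = "FORCE_2020_LITHOFACIES_LITHOLOGY"


def summarize_logs(df: pd.DataFrame, target: str = TARGET):
    """Return (fraction of missing values per column, share of rows per class)."""
    # Fraction of NaN entries in each column, worst columns first.
    missing = df.isna().mean().sort_values(ascending=False)
    # Relative frequency of each lithology code, most common first.
    class_share = df[target].value_counts(normalize=True)
    return missing, class_share


# Usage sketch. The file name and the ";" separator are assumptions;
# check the CSV extracted from the archive before relying on them.
# df = pd.read_csv("train.csv", sep=";")
# missing, class_share = summarize_logs(df)
# print(missing.head(10))
# print(class_share)
```

Columns with a large missing fraction are candidates for imputation or removal, and a class share close to zero flags a lithology that will need special handling.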
Why This Dataset
This dataset represents a real-world industrial challenge with authentic complexities including missing data, class imbalance, measurement noise, and geological variability. It allows practitioners to explore how multiple geophysical measurements combine to characterize subsurface formations. It's ideal for projects that aim to:
- Predict lithology from multi-variate well log measurements.
- Handle severe class imbalance in geological classification problems.
- Build models that generalize across different wells and geological formations.
- Incorporate prediction confidence into model evaluation and deployment.
- Perform feature selection to identify the most diagnostic log measurements.
- Develop ensemble methods that combine multiple weak learners.
- Create production-ready models for automated lithology interpretation.
- Benchmark different machine learning algorithms on industry-standard data.
How to Use the Dataset
- Download the CSV file from Kaggle or the FORCE website.
- Load into Python using Pandas for comprehensive data analysis.
- Explore the data structure using .info(), .describe(), and .value_counts() on the target variable.
- Analyze class distribution to understand the severity of imbalance in lithology types.
- Visualize well logs as depth tracks, colored by lithology, using Matplotlib or Plotly.
- Handle missing values deliberately: a missing log may mean the tool was not run in that well rather than a random gap.
- Engineer features such as log ratios, moving averages, depth-based trends, or crossplot-derived indicators.
- Normalize/standardize features as different log measurements have vastly different scales.
- Split data by wells rather than randomly to ensure models can generalize to unseen locations.
- Address class imbalance using techniques like SMOTE, class weighting, or ensemble methods.
- Train models (Random Forest, XGBoost, LightGBM, Neural Networks, CatBoost).
- Evaluate carefully using F1-Score (macro/weighted), confusion matrices, and per-class metrics.
- Incorporate confidence scores when assessing model performance against uncertain labels.
- Visualize predictions as synthetic lithology logs alongside ground truth for geological validation.
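Several of the steps above (splitting by well, handling missing values, class weighting, and macro F1 evaluation) can be sketched with scikit-learn as follows. This is one possible baseline under stated assumptions, not the competition's reference method: the helper names `split_by_well` and `fit_and_score` are hypothetical, median imputation is only a simple placeholder, and a Random Forest is used because tree models do not need feature scaling.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline

TARGET = "FORCE_2020_LITHOFACIES_LITHOLOGY"


def split_by_well(df: pd.DataFrame, test_size: float = 0.25, seed: int = 0):
    """Hold out whole wells so no well contributes rows to both sets."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["WELL"]))
    return df.iloc[train_idx], df.iloc[test_idx]


def fit_and_score(train: pd.DataFrame, test: pd.DataFrame, features: list):
    """Fit a class-weighted Random Forest and return (model, macro F1 on test)."""
    model = make_pipeline(
        # Placeholder strategy: replace missing values with the training median.
        SimpleImputer(strategy="median"),
        # class_weight="balanced" reweights classes inversely to their frequency.
        RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0, n_jobs=-1
        ),
    )
    model.fit(train[features], train[TARGET])
    pred = model.predict(test[features])
    return model, f1_score(test[TARGET], pred, average="macro")
```

A typical call would be `train, test = split_by_well(df)` followed by `model, score = fit_and_score(train, test, ["GR", "RHOB", "NPHI"])`, where the feature list is only an illustration. Macro F1 gives every lithology equal weight, so rare classes influence the score as much as common ones.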
Possible Project Ideas
- Competition-grade lithology classifier optimized for the FORCE 2020 challenge metrics.
- Imbalanced learning study comparing SMOTE, class weights, focal loss, and ensemble approaches.
- Multi-well transfer learning to predict lithology in wells with limited or missing log suites.
- Feature importance analysis identifying which logs are most critical for each lithology.
- Deep learning for sequential patterns using LSTM, 1D CNN, or Transformer architectures on depth sequences.
- Uncertainty-aware predictions using Bayesian methods or ensemble variance estimates.
- Missing log imputation system using machine learning to synthesize unavailable measurements.
- Geological formation-specific models that account for GROUP and FORMATION context.
- Interactive web application for real-time lithology prediction with Streamlit or Dash.
- Explainable AI dashboard using SHAP or LIME to make predictions interpretable for geologists.
- Cost-sensitive learning where misclassifying certain lithologies has higher business impact.
- Semi-supervised learning leveraging the confidence scores to identify reliable vs. uncertain labels.
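As a starting point for the confidence-related ideas above, the confidence column can be turned into per-sample weights. The sketch below is illustrative only: the weight mapping is hypothetical and assumes that lower codes (1) mark more reliable labels than higher codes (3), which should be verified against the dataset documentation. The helper names are also hypothetical.

```python
import numpy as np
import pandas as pd

CONFIDENCE = "FORCE_2020_LITHOFACIES_CONFIDENCE"

# Hypothetical mapping: assumes 1 = most reliable label, 3 = least reliable.
DEFAULT_WEIGHTS = {1: 1.0, 2: 0.5, 3: 0.25}


def confidence_weights(df: pd.DataFrame, mapping=None, fallback: float = 0.5):
    """Map confidence codes to sample weights; unknown or missing codes get `fallback`."""
    mapping = DEFAULT_WEIGHTS if mapping is None else mapping
    return df[CONFIDENCE].map(mapping).fillna(fallback).to_numpy(dtype=float)


def weighted_accuracy(y_true, y_pred, weights) -> float:
    """Accuracy in which each row counts in proportion to its weight."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * correct) / np.sum(weights))
```

The same weights can be passed as `sample_weight` to the `fit` method of scikit-learn estimators that accept it, so that uncertain labels contribute less to training.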
Dataset Challenges and Considerations
- Class Imbalance: Some North Sea lithologies are rare, requiring specialized handling techniques.
- Missing Data: Not all wells have complete log suites, and the strategy chosen for handling missing values affects model performance.
- Well-to-Well Variability: Geological conditions change across locations; models must generalize spatially.
- Label Uncertainty: The confidence column indicates that some labels are more reliable than others.
- Feature Correlation: Many log measurements are correlated; dimensionality reduction or regularization may help.
- Depth Dependencies: Sequential nature of depth measurements can be exploited for improved predictions.
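The depth dependency mentioned above can be exposed to a non-sequential model through windowed features computed within each well. This is a minimal sketch under assumptions: the helper name `add_depth_window_features`, the window length, and the choice of a centered rolling mean plus a first difference are illustrative, not a prescribed feature set.

```python
import pandas as pd


def add_depth_window_features(df: pd.DataFrame, logs: list, window: int = 5) -> pd.DataFrame:
    """Add a centered rolling mean and a sample-to-sample difference for each log, per well."""
    # Sort so that consecutive rows within a well are adjacent in depth.
    out = df.sort_values(["WELL", "DEPTH_MD"]).copy()
    for log in logs:
        grouped = out.groupby("WELL", sort=False)[log]
        # Rolling statistics are computed inside each well, so windows
        # never mix samples from two different wells.
        out[f"{log}_roll_mean"] = grouped.transform(
            lambda s: s.rolling(window, center=True, min_periods=1).mean()
        )
        # Difference to the previous (shallower) sample; NaN at the top of each well.
        out[f"{log}_grad"] = grouped.diff()
    return out
```

Grouping by WELL before rolling matters: without it, the first samples of one well would be averaged with the last samples of another.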
Attached Files
| File | Action |
|---|---|
| archive (5).zip | Download |
