Every machine learning project eventually hits the same organizational problem. You ran fifty experiments last month. You know the best one used a learning rate of 0.01 and some regularization, but you cannot remember exactly which combination. The model that performed best on the validation set is saved somewhere as model_final_v3_actually_final.pkl. The preprocessing steps that produced the training data are documented in a notebook you can no longer find. A colleague asks you to reproduce the results from three weeks ago and you spend two days trying before admitting you cannot.
This is the experiment tracking problem, and it plagues teams doing machine learning work without dedicated tooling. MLflow is a widely adopted open source platform for solving it. It records the parameters, metrics, and artifacts of each training run, versions your models, and provides a UI for comparing results across hundreds of runs.
This tutorial walks through the full MLflow workflow from installation through experiment tracking, artifact logging, the model registry, and serving. Every section includes working code you can run immediately.
What MLflow Actually Does
MLflow has four main components that address different parts of the ML lifecycle.
MLflow Tracking logs parameters, metrics, and artifacts from training runs and provides a UI for comparing them. This is the component most teams start with and the one this tutorial focuses on most heavily.
MLflow Projects packages ML code in a reusable, reproducible format with defined dependencies. It lets you run the same code reliably across different environments.
MLflow Models provides a standard format for packaging models that works with multiple deployment targets. A model packaged in MLflow format can be served as a REST API, deployed to cloud platforms, or loaded in batch inference pipelines without changing the model code.
MLflow Model Registry is a centralized model store with versioning, lifecycle management, and annotations. It tracks which model versions are in staging, production, and archived states.
Installation and Setup
```bash
pip install mlflow scikit-learn pandas numpy matplotlib
```
MLflow works in two modes. Without a tracking server, it logs everything to a local mlruns directory in your working directory. With a tracking server, it logs to a remote database and artifact store that a team can share. Start local and add a server when you need team collaboration.
```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import pandas as pd
import numpy as np

# Verify installation
print(f"MLflow version: {mlflow.__version__}")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
```
Your First Experiment
An experiment in MLflow is a named collection of runs. A run is a single execution of your training code with a specific set of parameters. Grouping related runs under one experiment makes comparison straightforward.
```python
# Load data
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create or set an experiment
mlflow.set_experiment("wine_classification")

# Start a run
with mlflow.start_run(run_name="random_forest_baseline"):
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 5,
        "min_samples_split": 2,
        "random_state": 42
    }
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)

    # Log metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
        "roc_auc_ovr": roc_auc_score(y_test, y_prob, multi_class="ovr")
    }
    mlflow.log_metrics(metrics)

    # Log the model
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"F1 Score: {metrics['f1_weighted']:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
Open the MLflow UI to see this run:
```bash
mlflow ui
```
Navigate to http://localhost:5000 and you will see the wine_classification experiment with your run, its parameters, and its metrics.
Logging Metrics Across Training Steps
For models that train iteratively, logging a metric at each step produces a curve rather than a single number. The step argument of mlflow.log_metric enables this.
```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

mlflow.set_experiment("wine_classification")

n_epochs = 50

with mlflow.start_run(run_name="mlp_iterative_logging"):
    mlflow.log_params({
        "hidden_layer_sizes": "(128, 64)",
        "learning_rate_init": 0.001,
        "n_epochs": n_epochs,
        "random_state": 42
    })

    # With max_iter=1 and warm_start=True, each fit() call continues from
    # the current weights and trains for exactly one more epoch
    model = MLPClassifier(
        hidden_layer_sizes=(128, 64),
        learning_rate_init=0.001,
        max_iter=1,
        warm_start=True,
        random_state=42
    )

    for epoch in range(1, n_epochs + 1):
        # Stopping after a single epoch raises a ConvergenceWarning, which is expected here
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            model.fit(X_train_scaled, y_train)

        train_loss = model.loss_
        # The held-out test split doubles as the validation set in this small example
        val_accuracy = accuracy_score(y_test, model.predict(X_test_scaled))

        # Log with step for curve visualization in UI
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)

    # Log final metrics
    y_pred = model.predict(X_test_scaled)
    mlflow.log_metric("final_accuracy", accuracy_score(y_test, y_pred))
    mlflow.sklearn.log_model(model, "mlp_model")
```
In the MLflow UI, metrics logged with a step parameter display as line charts showing how they evolved across training. This is the view you need for diagnosing overfitting, comparing convergence rates across runs, and deciding when to stop training.
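The same curve can also be inspected in code. The sketch below shows one simple way to turn a logged metric history into a stopping decision using a patience rule. The helper names best_step and should_stop and the sample curve are hypothetical and exist only for this illustration; the numbers are placeholders, not results from the runs above.

```python
from typing import List, Tuple

History = List[Tuple[int, float]]  # (step, metric value) pairs


def best_step(history: History) -> Tuple[int, float]:
    """Return the (step, value) pair with the highest value; the earliest step wins ties."""
    return max(history, key=lambda pair: (pair[1], -pair[0]))


def should_stop(history: History, patience: int) -> bool:
    """True when the best value is at least `patience` steps older than the latest step."""
    if not history:
        return False
    last_step = max(step for step, _ in history)
    return last_step - best_step(history)[0] >= patience


# Placeholder validation-accuracy curve, not real results
curve = [(1, 0.80), (2, 0.90), (3, 0.94), (4, 0.93), (5, 0.94), (6, 0.92)]

print(best_step(curve))                # (3, 0.94)
print(should_stop(curve, patience=3))  # True
print(should_stop(curve, patience=4))  # False
```

For a real run, the history could be built from MlflowClient().get_metric_history(run_id, "val_accuracy"), whose entries expose step and value attributes.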
Logging Artifacts
Artifacts are any files associated with a run: plots, feature importance charts, confusion matrices, preprocessors, and datasets. Anything you would want to inspect or reproduce later belongs as an artifact.
```python
import json
import os
import pickle

import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

mlflow.set_experiment("wine_classification")

with mlflow.start_run(run_name="rf_with_artifacts"):
    params = {"n_estimators": 200, "max_depth": 8, "random_state": 42}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Log metrics
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted")
    })

    os.makedirs("artifacts", exist_ok=True)

    # Confusion matrix plot
    fig, ax = plt.subplots(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(
        confusion_matrix=cm,
        display_labels=data.target_names
    )
    disp.plot(ax=ax, cmap="Blues")
    ax.set_title("Confusion Matrix")
    fig.savefig("artifacts/confusion_matrix.png", dpi=150, bbox_inches="tight")
    plt.close(fig)
    mlflow.log_artifact("artifacts/confusion_matrix.png")

    # Feature importance plot
    importances = pd.Series(
        model.feature_importances_,
        index=data.feature_names
    ).sort_values(ascending=True)
    fig, ax = plt.subplots(figsize=(8, 8))
    importances.plot(kind="barh", ax=ax, color="steelblue")
    ax.set_title("Feature Importances")
    ax.set_xlabel("Importance")
    fig.savefig("artifacts/feature_importance.png", dpi=150, bbox_inches="tight")
    plt.close(fig)
    mlflow.log_artifact("artifacts/feature_importance.png")

    # Log the scaler as artifact for reproducibility
    with open("artifacts/scaler.pkl", "wb") as f:
        pickle.dump(scaler, f)
    mlflow.log_artifact("artifacts/scaler.pkl")

    # Log training data stats as JSON
    stats = {
        "train_samples": len(X_train),
        "test_samples": len(X_test),
        "n_features": X_train.shape[1],
        # Number of training samples per class, keyed by class name
        "class_distribution": {
            str(name): int(count)
            for name, count in zip(data.target_names, np.bincount(y_train))
        }
    }
    with open("artifacts/data_stats.json", "w") as f:
        json.dump(stats, f, indent=2)
    mlflow.log_artifact("artifacts/data_stats.json")

    mlflow.sklearn.log_model(model, "model")
    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
Every artifact is visible in the MLflow UI under its run. Six months later, anyone reviewing this run can see the size and class balance of the training data, how the model performed by class, and which features drove its decisions.
Hyperparameter Search With Automatic Tracking
MLflow integrates naturally with hyperparameter search loops. Each parameter combination becomes its own run, all grouped under one experiment.
```python
from itertools import product

mlflow.set_experiment("wine_classification_hypersearch")

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 5]
}

best_accuracy = 0
best_run_id = None

combinations = list(product(
    param_grid["n_estimators"],
    param_grid["max_depth"],
    param_grid["min_samples_split"]
))

for n_estimators, max_depth, min_samples_split in combinations:
    with mlflow.start_run(run_name=f"rf_n{n_estimators}_d{max_depth}_s{min_samples_split}"):
        params = {
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "min_samples_split": min_samples_split,
            "random_state": 42
        }
        mlflow.log_params(params)

        model = RandomForestClassifier(**params)
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)

        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average="weighted")
        mlflow.log_metrics({"accuracy": accuracy, "f1_weighted": f1})
        mlflow.sklearn.log_model(model, "model")

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_run_id = mlflow.active_run().info.run_id

print(f"Best accuracy: {best_accuracy:.4f}")
print(f"Best run ID: {best_run_id}")
```
In the MLflow UI, switch to the table view and sort by accuracy to find the best run instantly. The comparison view lets you select multiple runs and plot their metrics side by side, which reveals patterns like diminishing returns on n_estimators or optimal depth ranges.
The Model Registry
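The model registry adds lifecycle management on top of experiment tracking. It answers the questions of which model version is currently in production, which is being tested in staging, and which have been retired. The stage-based calls used in this section were deprecated in MLflow 2.9 in favor of model version aliases and tags, so newer releases emit deprecation warnings for them; check the behavior of your installed version.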
```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the best model from the hyperparameter search
model_uri = f"runs:/{best_run_id}/model"
registered_model = mlflow.register_model(
    model_uri=model_uri,
    name="wine_classifier"
)
print(f"Model version: {registered_model.version}")
print(f"Model name: {registered_model.name}")

# Transition to staging
client.transition_model_version_stage(
    name="wine_classifier",
    version=registered_model.version,
    stage="Staging",
    archive_existing_versions=False
)

# Add description
client.update_model_version(
    name="wine_classifier",
    version=registered_model.version,
    description="Random Forest trained on wine dataset. Best from grid search over 24 combinations."
)

# After validation, transition to production
client.transition_model_version_stage(
    name="wine_classifier",
    version=registered_model.version,
    stage="Production",
    archive_existing_versions=True  # Archives previous production version
)

# Load the production model for inference
production_model = mlflow.sklearn.load_model(
    model_uri="models:/wine_classifier/Production"
)
sample = X_test_scaled[:5]
predictions = production_model.predict(sample)
print(f"Predictions: {predictions}")
print(f"Classes: {[data.target_names[p] for p in predictions]}")
```
The model registry creates a clean separation between experimentation and deployment. Data scientists register models from experiments. ML engineers or an automated evaluation pipeline promote them through staging to production. The version history shows every model that has ever been in production and when it was replaced.
Setting Up a Remote Tracking Server
Local tracking works for individual projects. Team collaboration requires a shared tracking server with a database backend and remote artifact storage.
```bash
# Start MLflow server with PostgreSQL backend and S3 artifact storage
mlflow server \
  --backend-store-uri postgresql://user:password@localhost/mlflow \
  --default-artifact-root s3://your-mlflow-bucket/artifacts \
  --host 0.0.0.0 \
  --port 5000
```
Point your training code at the remote server:
```python
import mlflow

# Set tracking URI to remote server
mlflow.set_tracking_uri("http://your-mlflow-server:5000")

# All subsequent runs log to the remote server
mlflow.set_experiment("shared_experiment")

with mlflow.start_run():
    mlflow.log_param("team_member", "alice")
    mlflow.log_metric("accuracy", 0.94)
```
Every team member pointing to the same tracking URI sees every run in the shared UI. The experiment becomes a single source of truth for what has been tried and what worked.
MLflow Cheat Sheet
| Task | Code |
|---|---|
| Set experiment | mlflow.set_experiment("name") |
| Start run | with mlflow.start_run(run_name="name"): |
| Log parameter | mlflow.log_param("lr", 0.01) |
| Log parameters dict | mlflow.log_params({"lr": 0.01, "epochs": 10}) |
| Log metric | mlflow.log_metric("accuracy", 0.95) |
| Log metric with step | mlflow.log_metric("loss", 0.3, step=10) |
| Log artifact file | mlflow.log_artifact("plot.png") |
| Log sklearn model | mlflow.sklearn.log_model(model, "model") |
| Register model | mlflow.register_model("runs:/run_id/model", "name") |
| Load production model | mlflow.sklearn.load_model("models:/name/Production") |
| Get run ID | mlflow.active_run().info.run_id |
| Launch UI | mlflow ui (run in a terminal) |
Common Mistakes
Not setting experiment names leaves all runs in the default experiment, which becomes an unusable dump of unrelated runs within a few weeks. Always call mlflow.set_experiment at the top of every training script.
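Calling mlflow.log_* after the with block has exited does not attach the values to the run you just finished. The fluent API starts a new, separate run instead, so the values land in the wrong place and that stray run stays active until mlflow.end_run() is called. Keep all mlflow.log_* calls inside the with mlflow.start_run() block.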
Forgetting to log the preprocessor alongside the model is a reproducibility failure. A model loaded for inference without its scaler produces garbage predictions because the features are on the wrong scale. Log every preprocessing artifact that the model depends on.
Using the same run name for every experiment run makes the UI hard to navigate. Include the key parameters in the run name so you can identify runs at a glance without opening each one.
Skipping the model registry and deploying directly from run artifacts means there is no record of what is in production or when it changed. Even for solo projects, the discipline of using the registry pays off when you need to roll back or audit a production decision.
FAQs
What is MLflow and what problem does it solve?
MLflow is an open source platform for managing the machine learning lifecycle. It solves the experiment tracking problem where teams lose track of which parameters, data, and code produced which results across many training runs. It logs parameters, metrics, and artifacts (automatically when autologging is enabled), provides a UI for comparing runs, versions models in a registry, and packages models for deployment. The core problem it solves is reproducibility: being able to recreate any past experiment and know exactly which model version is in production.
How do I install and start using MLflow?
Install MLflow with pip install mlflow. Without any additional setup, MLflow logs everything to a local mlruns directory. Wrap your training code in with mlflow.start_run(): and call mlflow.log_param, mlflow.log_metric, and mlflow.log_artifact inside it. Run mlflow ui in your terminal and navigate to http://localhost:5000 to see the tracking UI. The entire setup from installation to seeing your first run in the UI takes under ten minutes.
What is the difference between MLflow runs and experiments?
An experiment is a named collection of runs organized around a specific modeling problem or dataset. A run is a single execution of training code with a specific set of parameters, recording the metrics and artifacts it produced. You might have an experiment called customer_churn_prediction containing fifty runs that each represent a different model architecture or hyperparameter combination. Grouping related runs under one experiment makes comparison and navigation straightforward in the UI.
What is the MLflow Model Registry and when should I use it?
The MLflow Model Registry is a centralized store for versioning models and managing their lifecycle through staging and production states. You should use it when you need to track which model version is currently deployed, maintain a history of every model that has been in production, and create a clear process for promoting models from experimentation through validation to deployment. For throwaway prototypes or early-stage exploration, it adds overhead that may not be justified. For team environments or any model that reaches production, solo projects included, the registry creates accountability and auditability that pays off quickly.
Can MLflow track deep learning experiments with PyTorch or TensorFlow?
Yes. MLflow has native integrations with PyTorch, TensorFlow, Keras, and most major frameworks through modules such as mlflow.pytorch, mlflow.tensorflow, and mlflow.keras. Autologging, enabled with mlflow.autolog() or framework-specific calls like mlflow.pytorch.autolog(), captures parameters, metrics, and models without additional logging code. For PyTorch, autologging relies on PyTorch Lightning, so hand-written training loops still need explicit mlflow.log_metric calls. Step-level metric logging works the same way across frameworks, producing training curves in the UI for loss, accuracy, and any other metrics computed during training.