You have trained a machine learning model. It performs well in evaluation. Now comes the most important question:
How do you actually use it to make predictions in the real world?
The answer depends almost entirely on one decision — whether your predictions need to happen now (in real time) or whether they can happen later (in batch).
This single architectural choice shapes everything about how your model is deployed, how much it costs to run, how complex the infrastructure becomes, and what kind of business problems it can solve.
In this guide, we will break down batch inference and real time inference clearly — what they are, how they work, when to use each, and how to choose between them.
What Is Inference?
Before comparing the two types, it helps to define inference itself.
Inference is the process of using a trained machine learning model to generate predictions on new, unseen data.
Training is when the model learns from historical data. Inference is when the trained model is put to work — receiving inputs and returning predictions.
Training: Historical data → Model learns patterns → Trained model
Inference: New data → Trained model → Predictions
Inference happens constantly in production ML systems — every time a fraud score is calculated, a recommendation is generated, or a customer churn probability is computed, that is inference happening.
What Is Batch Inference?
Batch inference is the process of running a trained model on a large collection of data all at once — at a scheduled time — and storing the predictions for later use.
The model does not respond to individual requests in real time. Instead, it processes an entire dataset periodically — hourly, daily, or weekly — and writes the results to a database, data warehouse, or file system. Applications then read those pre-computed predictions when needed.
Simple Analogy
Think of batch inference like a bakery that bakes bread every morning. The baker does not bake one loaf every time a customer walks in. Instead, they bake a large batch at 5 AM and have it ready when the store opens. Every customer who comes in during the day gets bread from that pre-baked batch.
The predictions are made ahead of time and served from storage — not computed on demand.
How Batch Inference Works
1. Scheduler triggers at defined time (e.g., 2 AM daily)
2. Batch pipeline reads all entities from the data store
3. Features are computed for all entities
4. Model generates predictions for all entities at once
5. Predictions are written to a database or data warehouse
6. Applications query the database to retrieve pre-computed predictions
Example: Email Marketing Churn Prediction
A SaaS company runs a churn prediction model every Sunday night. It scores all 50,000 active customers with a churn probability. On Monday morning, the customer success team queries the results and calls customers with probability above 0.7. The predictions are already there — no waiting, no real-time computation needed.
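The Monday-morning step is just a query over stored predictions, not model inference. A minimal sketch with pandas, using a small hand-made DataFrame in place of the real prediction store (in practice this would be a database or warehouse query):

```python
import pandas as pd

# Stand-in for the table the weekly batch job wrote
predictions = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'churn_probability': [0.82, 0.15, 0.74, 0.41],
})

# The customer success team pulls everyone above the 0.7 threshold.
# The predictions already exist, so this is a plain filter, not inference.
high_risk = predictions[predictions['churn_probability'] > 0.7]
print(high_risk['customer_id'].tolist())  # [1, 3]
```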
What Is Real Time Inference?
Real time inference (also called online inference) is the process of running a model to generate a prediction immediately in response to a specific request — typically within milliseconds to seconds.
Instead of pre-computing predictions for a batch of entities, the model receives a single input, processes it, and returns a prediction right now — while a user, system, or application is waiting for the response.
Simple Analogy
Think of real time inference like a coffee shop that makes each drink to order. When you walk in and order a latte, the barista makes it specifically for you, right now. You wait a couple of minutes and receive your drink. Nothing is pre-made — everything is fresh and specific to your request.
The prediction is generated on demand, specific to the current input, and returned immediately.
How Real Time Inference Works
1. User or system sends a prediction request
2. Request hits the model serving endpoint (API)
3. Features are retrieved from a low-latency store
4. Model generates a single prediction
5. Prediction is returned in the response
6. Application uses the prediction immediately
Example: Fraud Detection at a Bank
When a customer swipes their credit card, the payment system sends the transaction details to a fraud detection model API. Within 200 milliseconds, the model returns a fraud score. If the score exceeds a threshold, the transaction is declined. The cardholder never notices the model running — it happens faster than the card terminal processes the payment.
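The approve/decline step reduces to a threshold check on the model's score. A sketch, where the 0.8 cutoff is an illustrative assumption rather than a real bank's policy:

```python
def authorize(fraud_score, threshold=0.8):
    """Decide on a transaction from the model's fraud score.
    The 0.8 threshold is an illustrative assumption."""
    return "DECLINE" if fraud_score >= threshold else "APPROVE"

print(authorize(0.93))  # DECLINE
print(authorize(0.12))  # APPROVE
```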
Key Differences — Side by Side
Difference 1: Timing
Batch — Predictions are generated at scheduled intervals (hourly, daily, weekly). The prediction is always about past or current state — never about what is happening right now.
Real Time — Predictions are generated the moment a request arrives. The prediction reflects the current state of the world at the exact moment of the request.
Difference 2: Latency
Batch — Latency is irrelevant during inference itself — the model runs in the background. Latency matters only when applications read pre-computed results, which is near-instant.
Real Time — Latency is everything. The model must respond within a defined SLA — often under 100ms for customer-facing applications, sometimes under 10ms for financial systems.
Difference 3: Input
Batch — Processes many entities simultaneously. Inputs are a dataset with hundreds, thousands, or millions of rows.
Real Time — Processes one entity at a time (or a very small micro-batch). Input is a single record or a tiny payload.
Difference 4: Infrastructure
Batch — Runs on batch processing frameworks like Apache Spark, Databricks, or simple Python scripts on a schedule. No always-on serving infrastructure needed.
Real Time — Requires a continuously running model serving endpoint — a REST API, gRPC service, or managed endpoint that is always available to receive requests.
Difference 5: Cost
Batch — Compute is used only when the batch job runs — cost is proportional to job frequency and data volume. Highly cost-efficient for large-scale predictions.
Real Time — Requires always-on serving infrastructure — the endpoint must be running even when no requests are coming in. Costs are higher, especially at low request volumes.
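A rough back-of-envelope calculation makes the cost gap concrete. The hourly rates below are illustrative assumptions, not quotes from any cloud provider:

```python
# Illustrative assumptions (not real cloud prices)
batch_instance_per_hour = 2.00     # large instance, used only while the job runs
batch_hours_per_day = 0.5          # nightly job finishes in 30 minutes
serving_instance_per_hour = 0.50   # smaller instance, but always on

batch_monthly = batch_instance_per_hour * batch_hours_per_day * 30
serving_monthly = serving_instance_per_hour * 24 * 30

print(f"Batch:     ${batch_monthly:.2f}/month")    # $30.00/month
print(f"Real time: ${serving_monthly:.2f}/month")  # $360.00/month
```

Even with a bigger machine, the batch job is an order of magnitude cheaper here because it only runs for minutes a day, while the endpoint bills around the clock.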
Difference 6: Feature Freshness
Batch — Features are computed from data available at batch time. Predictions may be hours or days old by the time they are used.
Real Time — Features are retrieved from a low-latency store at request time — reflecting the most current available state.
Full Comparison Table
| Feature | Batch Inference | Real Time Inference |
|---|---|---|
| Timing | Scheduled (hourly, daily, weekly) | On-demand, immediate |
| Latency requirement | None during inference | Milliseconds to seconds |
| Input volume | Thousands to millions of records | One or few records |
| Infrastructure | Batch job scheduler | Always-on serving endpoint |
| Cost | Low — compute only when running | Higher — always-on infrastructure |
| Feature freshness | Hours to days old | Current at request time |
| Scalability | Scales to very large datasets easily | Scales with request volume |
| Complexity | Lower — simpler architecture | Higher — serving, monitoring, SLA |
| Prediction availability | Pre-stored in database | Returned in API response |
| When to use | Non-urgent, scheduled decisions | Immediate, customer-facing decisions |
Implementing Batch Inference in Python
Simple Batch Inference Pipeline
```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import joblib
from datetime import datetime

# Simulated trained model and scaler
np.random.seed(42)
n_training = 1000
X_train = np.random.randn(n_training, 5)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Save model artifacts
joblib.dump(model, 'churn_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

def run_batch_inference(input_data_path, output_path):
    """
    Run batch inference on all customers and save predictions.
    This function would be triggered by a scheduler (e.g., Airflow, cron).
    """
    print(f"Batch inference started: {datetime.now()}")

    # Load all customer data
    # (simulated here; a production job would read from input_data_path)
    df = pd.DataFrame(
        np.random.randn(50000, 5),
        columns=['feature_1', 'feature_2', 'feature_3',
                 'feature_4', 'feature_5']
    )
    df['customer_id'] = range(1, len(df) + 1)
    print(f"Processing {len(df):,} customers...")

    # Load model artifacts
    model = joblib.load('churn_model.pkl')
    scaler = joblib.load('scaler.pkl')

    # Prepare features
    feature_cols = ['feature_1', 'feature_2', 'feature_3',
                    'feature_4', 'feature_5']
    X = scaler.transform(df[feature_cols])

    # Generate predictions for all customers at once
    df['churn_probability'] = model.predict_proba(X)[:, 1]
    df['churn_prediction'] = model.predict(X)
    df['risk_segment'] = pd.cut(
        df['churn_probability'],
        bins=[0, 0.3, 0.6, 1.0],
        labels=['Low', 'Medium', 'High']
    )
    df['prediction_timestamp'] = datetime.now()

    # Save predictions to database/file
    output_df = df[['customer_id', 'churn_probability',
                    'churn_prediction', 'risk_segment',
                    'prediction_timestamp']]
    output_df.to_csv(output_path, index=False)

    print(f"Batch inference completed: {datetime.now()}")
    print(f"High risk customers: {(df['risk_segment'] == 'High').sum():,}")
    print(f"Predictions saved to: {output_path}")
    return output_df

# Run the batch job
results = run_batch_inference(
    input_data_path='customer_features.csv',
    output_path='churn_predictions_2024_01_15.csv'
)
print(results.head())
```
Output:

```
Batch inference started: 2024-01-15 02:00:01
Processing 50,000 customers...
Batch inference completed: 2024-01-15 02:01:23
High risk customers: 12,847
Predictions saved to: churn_predictions_2024_01_15.csv
```
Implementing Real Time Inference With FastAPI
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import joblib
import time
from datetime import datetime

app = FastAPI(title="Churn Prediction API")

# Load model at startup — not per request
model = joblib.load('churn_model.pkl')
scaler = joblib.load('scaler.pkl')

class PredictionRequest(BaseModel):
    customer_id: int
    feature_1: float
    feature_2: float
    feature_3: float
    feature_4: float
    feature_5: float

class PredictionResponse(BaseModel):
    customer_id: int
    churn_probability: float
    churn_prediction: int
    risk_segment: str
    latency_ms: float
    timestamp: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    start_time = time.time()
    try:
        # Prepare features from the request
        features = np.array([[
            request.feature_1,
            request.feature_2,
            request.feature_3,
            request.feature_4,
            request.feature_5
        ]])

        # Scale and predict
        features_scaled = scaler.transform(features)
        churn_probability = float(model.predict_proba(features_scaled)[0][1])
        churn_prediction = int(model.predict(features_scaled)[0])

        # Classify risk
        if churn_probability >= 0.6:
            risk_segment = "High"
        elif churn_probability >= 0.3:
            risk_segment = "Medium"
        else:
            risk_segment = "Low"

        latency_ms = (time.time() - start_time) * 1000

        return PredictionResponse(
            customer_id=request.customer_id,
            churn_probability=round(churn_probability, 4),
            churn_prediction=churn_prediction,
            risk_segment=risk_segment,
            latency_ms=round(latency_ms, 2),
            timestamp=datetime.now().isoformat()
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "loaded"}

# To run: uvicorn main:app --host 0.0.0.0 --port 8000
# Test: curl -X POST "http://localhost:8000/predict" \
#   -H "Content-Type: application/json" \
#   -d '{"customer_id": 123, "feature_1": 0.5, "feature_2": -0.3,
#        "feature_3": 1.2, "feature_4": -0.8, "feature_5": 0.1}'
```
Example API Response:
```json
{
  "customer_id": 123,
  "churn_probability": 0.2847,
  "churn_prediction": 0,
  "risk_segment": "Low",
  "latency_ms": 12.34,
  "timestamp": "2024-01-15T14:30:45.123456"
}
```
Near Real Time Inference — The Middle Ground
Some systems need predictions faster than batch scheduling allows but not as immediately as true real time. This middle ground is called near real time inference or micro-batch inference.
Batch: Runs every 24 hours
Near Real Time: Runs every 5 minutes or triggers on new events
Real Time: Runs within 100ms of each request
Near real time works well for use cases like:
- Refreshing recommendation scores every 15 minutes
- Scoring new transactions every minute for operational monitoring
- Updating customer segments every few hours based on latest activity
Infrastructure is simpler than real time but more responsive than daily batch — often using stream processing tools like Apache Kafka or Apache Flink to trigger inference on new events rather than on a fixed schedule.
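The micro-batch pattern can be sketched with a plain in-memory queue standing in for a stream like Kafka; the event shape and the toy scoring rule are illustrative assumptions:

```python
from collections import deque

# Stand-in for a Kafka topic or event stream
event_queue = deque([{'id': 1, 'amount': 50}, {'id': 2, 'amount': 9000}])

def score(event):
    # Stand-in for a real model: flag large amounts
    return 1.0 if event['amount'] > 1000 else 0.1

def run_micro_batch(queue):
    """Drain whatever has arrived since the last run and score it all at once."""
    batch = []
    while queue:
        batch.append(queue.popleft())
    return [(e['id'], score(e)) for e in batch]

# In production this would run every few minutes on a schedule or be
# triggered by the stream; here we run a single iteration.
print(run_micro_batch(event_queue))  # [(1, 0.1), (2, 1.0)]
```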
When to Use Each
Use Batch Inference When
- The prediction does not need to influence an action happening right now
- You have large volumes of entities to score (millions of customers, products, transactions)
- Cost efficiency is a priority
- Predictions are consumed hours or days after they are generated
- Infrastructure simplicity matters — no always-on serving required
Classic batch inference use cases:
- Weekly customer churn scoring for a customer success team
- Daily product recommendation refresh for an email campaign
- Nightly credit risk scoring for a loan portfolio review
- Monthly demand forecasting for inventory planning
Use Real Time Inference When
- A user or system is waiting for the prediction right now
- The decision cannot be made without a fresh, up-to-the-moment prediction
- Predictions must reflect the most current state of the world
- The business impact of a stale prediction is unacceptable
Classic real time inference use cases:
- Fraud detection at the point of transaction
- Dynamic pricing based on current demand
- Search result ranking for each query
- Content recommendation during an active user session
- Autonomous vehicle decision making
Real-World Use Cases
E-commerce Platform
Batch: Every night at 1 AM, a recommendation model scores all 2 million product-user combinations. Results are stored in a Redis cache. When users browse the next day, recommendations load instantly from cache — no model inference needed during browsing.
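The batch-to-cache pattern can be sketched with a plain dict standing in for Redis; the `rec:<user_id>` key scheme and the product names are illustrative assumptions:

```python
# A plain dict stands in for the Redis cache here.
cache = {}

def batch_write_recommendations(user_ids):
    """Nightly batch job: score users and write results to the cache."""
    for uid in user_ids:
        # Stand-in for real model output
        cache[f"rec:{uid}"] = [f"product_{uid * 10 + i}" for i in range(3)]

def get_recommendations(user_id):
    """Serving path: a cache read, no model inference during browsing."""
    return cache.get(f"rec:{user_id}", [])  # empty list if the user was never scored

batch_write_recommendations([1, 2])
print(get_recommendations(1))   # ['product_10', 'product_11', 'product_12']
print(get_recommendations(99))  # []
```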
Real Time: When a user adds an item to their cart, a complementary product recommendation model is called instantly — generating context-aware suggestions based on the exact cart state at that moment.
Banking
Batch: Every morning before markets open, a risk model scores the entire loan portfolio for expected default probability. Risk officers review the output during the trading day.
Real Time: Every card transaction is scored in under 200ms by a fraud detection model. The decision to approve or decline the transaction happens before the payment terminal times out.
Healthcare
Batch: Every week, a patient readmission risk model scores all patients who were discharged in the last 30 days. Care coordinators follow up with high-risk patients.
Real Time: During an ICU monitoring session, a deterioration prediction model scores patient vitals every 60 seconds and alerts nursing staff if the score crosses a critical threshold.
Advantages and Disadvantages
Batch Inference
Advantages:
- Simple architecture — no always-on serving required
- Highly cost-efficient — compute used only when job runs
- Easily handles very large datasets
- Failures are recoverable — rerun the batch job
- No latency constraints during inference
Disadvantages:
- Predictions become stale between batch runs
- Cannot react to events in real time
- Not suitable for customer-facing, interactive applications
- Predictions may be outdated by the time they are used
Real Time Inference
Advantages:
- Predictions reflect the current state of the world
- Enables immediate, personalized decision making
- Supports interactive and customer-facing applications
- Can react to events as they happen
Disadvantages:
- Requires always-on, low-latency serving infrastructure
- Higher operational complexity — monitoring, scaling, SLA management
- More expensive — especially at low request volumes
- Harder to debug and test than batch pipelines
- Tight latency requirements limit model complexity
Common Mistakes to Avoid
- Using real time inference when batch is sufficient — Real time infrastructure is significantly more complex and expensive. If predictions do not need to be instant, batch is almost always the better choice
- Not monitoring prediction staleness in batch systems — Batch predictions go stale. If the batch job fails silently, applications serve stale predictions without knowing it. Always monitor batch job success and add freshness checks
- Loading the model on every real time request — Loading a model from disk can take seconds. In a real time serving endpoint, the model must be loaded once at startup and kept in memory — not reloaded on each prediction request
- Ignoring training-serving skew — If features are computed differently in batch vs the feature store used for real time serving, predictions will differ unexpectedly. Use a feature store to ensure consistency
- Not setting latency budgets for real time systems — Real time inference without a defined latency SLA leads to unpredictable user experiences. Define acceptable latency upfront and design the system to meet it
- Over-engineering with real time when near real time is enough — Running a batch job every 5 minutes is dramatically simpler than building a fully real time serving system and meets the requirements for many use cases that feel like they need real time
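The staleness check from the second point above can be a one-line comparison, assuming predictions carry a `prediction_timestamp` column as in the batch pipeline earlier; the 26-hour budget is an illustrative assumption (24-hour cadence plus a grace period):

```python
from datetime import datetime, timedelta

def predictions_are_fresh(prediction_timestamp, max_age_hours=26):
    """Return False when the newest batch predictions are older than the
    expected cadence, so a silent job failure triggers an alert."""
    age = datetime.now() - prediction_timestamp
    return age <= timedelta(hours=max_age_hours)

# Fresh: written 2 hours ago
print(predictions_are_fresh(datetime.now() - timedelta(hours=2)))   # True
# Stale: the nightly job silently failed for two days
print(predictions_are_fresh(datetime.now() - timedelta(hours=50)))  # False
```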
Batch inference and real time inference are not competing approaches — they are complementary tools for different situations. Many production ML systems use both simultaneously.
Here is the simplest decision framework:
- Is someone waiting for this prediction right now? → Real time inference
- Can this prediction be generated ahead of time and stored? → Batch inference
- Do you need predictions more often than daily, but not sub-second? → Near real time inference
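The framework above can be written as a tiny helper; the daily-staleness cutoff is a rule of thumb, not a hard rule:

```python
def choose_inference_mode(someone_waiting, max_staleness_minutes):
    """Map the decision questions onto a serving mode (rule of thumb only)."""
    if someone_waiting:
        return "real time"
    if max_staleness_minutes < 24 * 60:  # fresher than daily, but not instant
        return "near real time"
    return "batch"

print(choose_inference_mode(True, 0))             # real time
print(choose_inference_mode(False, 15))           # near real time
print(choose_inference_mode(False, 7 * 24 * 60))  # batch
```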
The choice shapes everything — infrastructure, cost, complexity, and what business problems you can solve. Getting it right at the design stage saves enormous amounts of engineering effort and cost later.
FAQs
What is the main difference between batch and real time inference?
Batch inference generates predictions for many entities at once on a schedule and stores them for later use. Real time inference generates a single prediction immediately in response to a request. The core difference is timing — pre-computed vs on-demand.
Which is cheaper — batch or real time inference?
Batch inference is almost always cheaper. Compute runs only when the batch job executes. Real time inference requires always-on serving infrastructure that runs even when no requests are coming in, making it significantly more expensive especially at low request volumes.
When should I use real time inference?
Use real time inference when the prediction must influence a decision that cannot wait — fraud detection at point of transaction, search result ranking for each query, dynamic pricing based on current demand, or any user-facing interactive application.
What is near real time inference?
Near real time inference is the middle ground — predictions generated more frequently than a daily batch (every few minutes or triggered by new events) but not as immediately as true real time. It uses stream processing tools like Kafka or Flink and is simpler than full real time serving infrastructure.
Can I use both batch and real time inference in the same system?
Yes — many production ML systems use both. A recommendation system might use batch inference to pre-compute personalized recommendations stored in a cache, and real time inference to generate context-specific suggestions during an active user session.
What is training-serving skew and why does it matter here?
Training-serving skew occurs when features are computed differently during model training compared to inference time — producing different prediction behavior in production than in evaluation. It is especially important to manage when switching between batch and real time inference contexts.