How to Fine-Tune a Small Language Model Locally

The gap between what a general-purpose language model does and what a specific application needs it to do is where fine-tuning lives. A base model trained on internet-scale data knows a lot about everything and is optimized for nothing in particular. Fine-tuning takes that general capability and specializes it for a specific task, domain, writing style, or output format using your own data.

Until recently, fine-tuning was practically out of reach for most practitioners. It required expensive GPU clusters, deep knowledge of distributed training, and access to large proprietary datasets. That has changed significantly. Techniques like LoRA and QLoRA have reduced the compute requirements by an order of magnitude. Small but capable models like Llama 3, Mistral, Phi-3, and Gemma run comfortably on consumer hardware. The Hugging Face ecosystem has made the tooling accessible to anyone comfortable with Python.

This guide walks through the entire local fine-tuning workflow end to end: dataset preparation, parameter-efficient training with LoRA, running training on your own machine, evaluating the result, and understanding when fine-tuning is the right tool versus alternatives like retrieval-augmented generation (RAG).

When Fine-Tuning Is the Right Choice

Fine-tuning is not always the right tool, and knowing when it is prevents wasted effort.

Fine-tuning changes how a model behaves: its style, tone, format, and domain-specific reasoning patterns. It does not reliably inject factual knowledge. A model fine-tuned on medical literature does not become a reliable source of specific medical facts. It learns to reason and communicate in a medical style. For factual knowledge injection, RAG is almost always more effective, auditable, and updatable.

The use cases where fine-tuning genuinely shines are consistent output format (training a model to always return structured JSON in a specific schema), domain-specific style (training a model to write in the voice and format of your documentation), task specialization (training a model to perform a specific classification or extraction task), and instruction following for proprietary workflows (training a model to follow your organization’s specific conventions and terminology).

If your goal is making a model answer questions about your documents, start with RAG. If your goal is making a model behave differently, fine-tuning is the right tool.

Hardware Requirements

Local fine-tuning of small models (1B to 13B parameters) is achievable on consumer hardware with the right techniques.

A GPU with 8GB VRAM can fine-tune models up to roughly 7B parameters using QLoRA with 4-bit quantization, provided sequence lengths and batch sizes stay small. 16GB VRAM handles 7B models more comfortably and enables 13B models with QLoRA. 24GB VRAM (RTX 3090, RTX 4090, or A10) handles 7B models with 16-bit LoRA and 13B models comfortably with QLoRA; the 16-bit weights of a 13B model alone take about 26GB, so 16-bit LoRA at that size does not fit on a single 24GB card. Apple Silicon Macs with 32GB or more unified memory are increasingly viable for local fine-tuning through Metal acceleration.
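
As a rough sanity check on these tiers, the memory taken by the model weights alone can be estimated from the parameter count and the bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement: the helper name `weight_memory_gb` is made up for illustration, and real usage is higher because activations, LoRA gradients, optimizer state, and quantization metadata are not counted.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare 16-bit and 4-bit storage for two common model sizes
for name, params in [("7B", 7e9), ("13B", 13e9)]:
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB of weights")
```

Under these assumptions a 7B model needs about 14 GB of weights at 16-bit but only about 3.5 GB at 4-bit, which is why QLoRA leaves headroom on an 8GB card.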

CPU-only fine-tuning is possible for very small models but training times become impractical for anything beyond quick experimentation.

Dataset Preparation

Dataset quality determines fine-tuning quality more than any other factor. A small, high-quality dataset consistently outperforms a large, noisy one.

Fine-tuning datasets for instruction following use a prompt-completion format where each example shows the model the kind of input it will receive and the kind of output it should produce.

python

# Standard instruction fine-tuning format
dataset_example = {
    "instruction": "Classify the sentiment of the following customer review.",
    "input": "The product arrived damaged and customer service was unhelpful.",
    "output": "Negative"
}

# Chat format (used by most modern models)
chat_example = {
    "messages": [
        {"role": "system", "content": "You are a sentiment classifier. Return only: Positive, Negative, or Neutral."},
        {"role": "user", "content": "The product arrived damaged and customer service was unhelpful."},
        {"role": "assistant", "content": "Negative"}
    ]
}
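
If existing data is stored in the instruction format, it can be converted to the chat format before applying a chat template. The following is a minimal sketch of one way to do that; the function name `instruction_to_chat` and the choice to join the instruction and input with a blank line are assumptions for illustration, not a fixed convention.

```python
from typing import Optional


def instruction_to_chat(example: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert an instruction/input/output record into the chat "messages" format."""
    user_content = example["instruction"]
    # Append the optional input below the instruction, separated by a blank line
    if example.get("input"):
        user_content = user_content + "\n\n" + example["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}


record = {
    "instruction": "Classify the sentiment of the following customer review.",
    "input": "The product arrived damaged and customer service was unhelpful.",
    "output": "Negative",
}
chat_record = instruction_to_chat(record)
print(chat_record["messages"])
```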

For most fine-tuning tasks, 500 to 2000 high-quality examples is enough. Below 200 examples, the model struggles to learn the pattern consistently. Above 5000, you are likely including lower-quality examples that hurt more than they help.

python

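import json

from datasets import Dataset
from transformers import AutoTokenizer

# The chat template comes from the tokenizer of the model you plan to fine-tune
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")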

# Load your examples
with open("training_data.jsonl", "r") as f:
    examples = [json.loads(line) for line in f]

# Convert to Hugging Face dataset
dataset = Dataset.from_list(examples)

# Apply chat template formatting
def format_example(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
            add_generation_prompt=False
        )
    }

formatted_dataset = dataset.map(format_example)
train_test = formatted_dataset.train_test_split(test_size=0.1)

Inspect your formatted examples before training. Malformed examples, incorrect chat template application, or truncation issues are common and invisible until you notice the trained model behaving unexpectedly.
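
Part of that inspection can be automated before any tokenizer is involved. The sketch below checks only the structure of chat-format records (valid JSON, known roles, non-empty content, an assistant turn at the end); the helper names and the specific rules are assumptions for illustration, and it does not detect truncation, which requires counting tokens with the actual tokenizer.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def find_problems(example: dict) -> list:
    """Return a list of human-readable problems found in one chat example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")

    # The final turn is the completion the model is trained to produce
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def check_jsonl_lines(lines) -> dict:
    """Map 1-based line numbers to problems; clean lines are omitted."""
    report = {}
    for line_no, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as err:
            report[line_no] = [f"invalid JSON: {err.msg}"]
            continue
        if not isinstance(example, dict):
            report[line_no] = ["record is not a JSON object"]
            continue
        problems = find_problems(example)
        if problems:
            report[line_no] = problems
    return report


# Hypothetical sample lines: one valid, one missing the assistant turn, one broken
sample_lines = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    "not json",
]
report = check_jsonl_lines(sample_lines)
for line_no, problems in report.items():
    print(line_no, problems)
```

To run it against a real file, pass the open file object instead of `sample_lines`, since iterating over a file yields its lines.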

LoRA and QLoRA: Why They Make Local Fine-Tuning Possible

Full fine-tuning updates every parameter in the model. For a 7B parameter model, that means storing and computing gradients for seven billion floating point numbers, which requires enormous GPU memory. LoRA (Low-Rank Adaptation) sidesteps this by adding small trainable matrices alongside the frozen original weights rather than modifying the original weights directly.

The intuition is that the weight updates needed to adapt a model to a new task tend to be low-rank, meaning they can be represented as the product of two much smaller matrices. Instead of updating a 4096 by 4096 weight matrix directly, LoRA trains a 4096 by 8 matrix and an 8 by 4096 matrix whose product approximates the needed update. That is 65,536 trainable values instead of roughly 16.8 million. The number 8 here is the rank, a tunable parameter that controls the expressiveness versus efficiency tradeoff.
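
The idea can be sketched in a few lines of plain Python. This is a toy illustration with tiny matrices, not how any library implements it: the helper names are made up, and the zero initialization of `B` and the `alpha / r` scaling follow the common LoRA formulation, under which training starts from exactly the base model's behavior.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_out, d_in, r):
    """Trainable values in B (d_out x r) plus A (r x d_in)."""
    return r * (d_in + d_out)


random.seed(0)
d, r, alpha = 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen d x d weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # trainable r x d
B = [[0.0] * r for _ in range(d)]                                   # trainable d x r, starts at zero
x = [random.gauss(0, 1) for _ in range(d)]

# With B at zero, the adapted layer reproduces the frozen layer exactly
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))

# Parameter counts for the 4096 by 4096 example at rank 8
print(4096 * 4096, "values in the full matrix")
print(lora_param_count(4096, 4096, 8), "trainable LoRA values")
```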

QLoRA extends this further by quantizing the base model to 4-bit precision before applying LoRA, reducing memory requirements dramatically at minimal quality cost.

python

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# QLoRA: load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS for padding when the tokenizer defines no pad token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank: higher = more expressive, more memory
    lora_alpha=32,                 # Scaling factor, typically 2x rank
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable and total parameter counts; the trainable share should be well under 1%

Well under one percent of the parameters are trained. This is what makes local fine-tuning tractable.
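
The trainable count can be estimated by hand from the configuration. Each adapted module with input size d_in and output size d_out adds r × (d_in + d_out) values. The sketch below assumes the published Llama 3 8B shapes, which are not stated in this guide: hidden size 4096, key and value projections of size 1024 because of grouped-query attention, and 32 layers. The figure printed by `print_trainable_parameters` can differ with the actual architecture and library version.

```python
# Assumed shapes for Llama 3 8B (hypothetical constants for this estimate)
HIDDEN = 4096       # model hidden size
KV_DIM = 1024       # key/value projection size under grouped-query attention
NUM_LAYERS = 32
RANK = 16           # matches r=16 in the LoRA configuration

# (d_in, d_out) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Values in the two low-rank matrices: (r x d_in) plus (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}")
print(f"total trainable: {total:,}")
```

Under these assumptions the total comes to about 13.6 million trainable values, a fraction of a percent of an 8B-parameter model.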

Running the Training Loop

With the dataset and model configured, the SFTTrainer from the trl library handles the training loop with sensible defaults for instruction fine-tuning.

python

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./fine_tuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # Effective batch size = 8
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",           # Named evaluation_strategy in older transformers releases
    bf16=True,
    max_seq_length=2048,             # Named max_length in newer trl releases
    dataset_text_field="text"
)

# The model was already wrapped by get_peft_model above, so peft_config is not passed again
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_test["train"],
    eval_dataset=train_test["test"]
)

trainer.train()
trainer.save_model("./fine_tuned_model")

Training an 8B model on 1000 examples for three epochs takes roughly two to four hours on an RTX 4090 and six to ten hours on a 16GB GPU, depending on sequence length and settings. Watch the training loss curve. It should decrease consistently across epochs. A loss that stops decreasing early suggests the learning rate is too low or the dataset is too small. A loss that oscillates wildly suggests the learning rate is too high.
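
To read the loss curve it helps to know how many optimizer steps a run contains and what the learning rate is doing at each one. The sketch below approximates both for a linear warmup followed by cosine decay; the function names are made up for illustration, and the step count the trainer actually reports may differ slightly because of how partial accumulation batches are rounded.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Approximate number of weight updates over the whole run."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs


def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 900 training examples = 1000 examples minus a 10% evaluation split
steps = total_optimizer_steps(900, per_device_batch=2, grad_accum=4, epochs=3)
print("optimizer steps:", steps)

for step in (0, 5, 11, steps // 2, steps):
    lr = lr_at_step(step, steps, base_lr=2e-4, warmup_ratio=0.03)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With only a few hundred updates in total, each logged point covers a meaningful share of the run, so a flat or erratic stretch of even a few dozen steps is worth investigating.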

Evaluating the Fine-Tuned Model

Evaluation is where most fine-tuning guides stop too early. Loss on the validation set tells you whether training progressed but not whether the model actually does what you need.

python

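from peft import PeftModel

def generate(model, messages, max_new_tokens=256):
    # Build the prompt with the same chat template used for the training data
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        # Greedy decoding keeps the comparison repeatable
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Compare base model vs fine-tuned model on held-out examples
test_messages = [
    {"role": "user", "content": "Classify the sentiment: 'Shipping was fast but the item broke immediately.'"}
]

# Load the base model in bf16 and record its answer before attaching the adapter.
# PeftModel.from_pretrained and merge_and_unload modify base_model in place,
# so the base model cannot be queried separately afterwards.
base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
print("Base model:", generate(base_model, test_messages))

# Attach the LoRA adapter and merge it into the base weights
ft_model = PeftModel.from_pretrained(base_model, "./fine_tuned_model")
ft_model = ft_model.merge_and_unload()
print("Fine-tuned:", generate(ft_model, test_messages))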

For classification and extraction tasks, calculate precision, recall, and F1 on a held-out test set. For generation tasks, human evaluation on twenty to fifty examples is more informative than automated metrics. The key question is whether the fine-tuned model consistently produces outputs that match your requirements better than the base model. If the improvement is marginal, revisit your dataset before adjusting training parameters.
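
For a classification task, the metrics can be computed without extra dependencies. The sketch below calculates per-class precision, recall, and F1 plus a macro-averaged F1; the labels and predictions are made-up illustrative data, not results from a real model, and the helper names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_metrics(y_true, y_pred, label)[2] for label in labels) / len(labels)


labels = ["Positive", "Negative", "Neutral"]

# Hypothetical held-out labels and model predictions
y_true = ["Negative", "Positive", "Neutral", "Negative"]
y_pred = ["Negative", "Positive", "Negative", "Negative"]

for label in labels:
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label:<9} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, labels):.2f}")
```

Running the same function on the base model's predictions and the fine-tuned model's predictions for the same held-out set gives a like-for-like comparison.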

Common Fine-Tuning Mistakes

Starting with too little data and blaming the technique is the most common mistake. If the model is not learning the pattern you want, add more high-quality examples before adjusting hyperparameters.

Training for too many epochs causes overfitting where the model memorizes training examples rather than learning generalizable patterns. Three to five epochs is usually sufficient for small datasets.

Using a rank that is too high wastes memory without improving quality. A rank of 8 to 16 handles most tasks. Reserve rank 64 or higher for tasks with genuinely complex stylistic requirements.

Skipping qualitative evaluation produces models that look good on validation loss but fail on real inputs. Always test on examples the model has never seen and evaluate the outputs manually before deploying.

Cheat Sheet

Parameter                 Conservative   Balanced    Aggressive
LoRA rank (r)             8              16          64
Learning rate             1e-4           2e-4        5e-4
Epochs                    2              3           5
Batch size (effective)    4              8           16
Dataset size              200-500        500-2000    2000+

FAQs

What is the difference between fine-tuning and RAG?

RAG retrieves external documents at inference time and includes them as context for the language model. It is best for factual knowledge injection, up-to-date information, and auditable outputs. Fine-tuning modifies the model’s weights to change its behavior, style, or output format. Most production AI applications use RAG for knowledge and fine-tuning for behavior. If your goal is making a model answer questions about your documents, start with RAG. If your goal is making a model behave differently, fine-tune.

How much data do I need to fine-tune a small language model?

For most instruction fine-tuning tasks, 500 to 2000 high-quality examples is the practical sweet spot. Below 200 examples the model struggles to learn the pattern reliably. Quality matters far more than quantity. A dataset of 500 carefully curated, correctly formatted examples will outperform 5000 noisy or inconsistently formatted ones.

What is LoRA and why is it used for local fine-tuning?

LoRA, or Low-Rank Adaptation, is a technique that adds small trainable matrices alongside the frozen weights of a base model rather than updating all the model’s parameters directly. This reduces the number of trainable parameters from billions to millions, making fine-tuning feasible on consumer GPUs. QLoRA extends this by quantizing the base model to 4-bit precision before applying LoRA, further reducing memory requirements with minimal quality loss.

Which small models are best for local fine-tuning?

Llama 3 8B Instruct, Mistral 7B Instruct, Phi-3 Mini, and Gemma 2 9B are the most practical choices in 2024. All are capable, have permissive licenses for most use cases, and have strong community support in the Hugging Face ecosystem. For hardware with less than 8GB VRAM, Phi-3 Mini at 3.8B parameters is the most capable option that fits comfortably.

How do I know if my fine-tuned model is actually better?

Compare the fine-tuned model against the base model on a held-out test set of examples it has never seen. For classification tasks, calculate F1 score on both models. For generation tasks, evaluate twenty to fifty outputs manually and assess whether they consistently match your requirements. Validation loss decreasing during training is necessary but not sufficient evidence of improvement. Always evaluate qualitatively on real inputs before considering the fine-tune successful.
