dbt Tutorial for Beginners in Analytics Engineering

dbt Tutorial for Beginners in Analytics Engineering

If you have spent any time in data circles recently, you have probably noticed that dbt keeps coming up. In job descriptions, in Slack communities, in conversations about the modern data stack. Analytics engineers list it as their primary tool. Data analysts list it as the skill they are most glad they learned. Data engineers describe it as the thing that finally made SQL-based transformation work manageable at scale.

The reason dbt gets this kind of consistent praise is that it solved a real and frustrating problem. Before dbt, SQL transformations in the data warehouse were a mess of undocumented queries, fragile dependencies, no version control, no testing, and no reliable way to understand what any given table actually contained or where it came from. dbt brought software engineering practices, version control, testing, documentation, and modularity, into the SQL transformation workflow without asking analysts to become software engineers.

This tutorial walks you through everything you need to understand and start using dbt as a beginner. It covers what dbt actually does, how to set up a project, how to write models, how to test your data, and how to document your work. By the end you will have a clear mental model of how dbt fits into the analytics engineering workflow and enough practical knowledge to build your first real project.

What dbt Actually Does

dbt stands for data build tool. Understanding what it does and what it does not do saves a lot of confusion early on.

dbt does one thing: it transforms data that is already in your data warehouse. It takes raw data that has been loaded into your warehouse by some other tool and runs SQL transformations against it to produce clean, modeled tables that analysts can query.

dbt does not extract data from source systems. It does not load data into the warehouse. Those steps are handled by separate tools like Fivetran, Airbyte, or custom pipelines. dbt sits at the T in ELT, the transform step, and it does that step exceptionally well.

The way dbt works is straightforward. You write SQL SELECT statements. dbt wraps those SELECT statements in CREATE TABLE or CREATE VIEW statements and executes them in your warehouse in the correct order. What makes dbt powerful is everything it builds around those SQL statements: dependency management, testing, documentation, version control compatibility, and a modular project structure that scales from a handful of models to hundreds.

dbt Core vs dbt Cloud

Before setting up dbt you need to understand the two ways to use it.

dbt Core is the open source command line tool. It is free, runs locally or on your own infrastructure, and gives you full control over your environment. You write your models in a text editor, run dbt commands in your terminal, and manage scheduling and deployment yourself. dbt Core is what most analysts start with and what this tutorial focuses on.

dbt Cloud is the managed platform built on top of dbt Core. It adds a web-based IDE, managed job scheduling, a documentation hosting environment, and collaboration features. It has a free tier for individual developers and paid tiers for teams. dbt Cloud is worth considering once you are comfortable with dbt Core concepts and working in a team environment.

For learning purposes, dbt Core gives you better visibility into what is actually happening and a deeper understanding of the tool. Start there.

Setting Up Your First dbt Project

Prerequisites

You need Python installed (version 3.8 or higher), a data warehouse to connect to, and a terminal you are comfortable using. For learning purposes, BigQuery has a generous free tier and works well with dbt. Snowflake and DuckDB are also popular choices for beginners. DuckDB in particular requires no account setup and runs entirely locally, making it the fastest path to a working dbt environment.

Installing dbt Core

Install dbt Core and the adapter for your warehouse using pip. Each warehouse has its own adapter package.

bash

# For BigQuery
pip install dbt-bigquery

# For Snowflake
pip install dbt-snowflake

# For DuckDB (easiest for local learning)
pip install dbt-duckdb

# For Redshift
pip install dbt-redshift

Verify the installation worked:

bash

dbt --version

Initializing a Project

Navigate to the directory where you want to create your project and run:

bash

dbt init my_first_project

dbt will ask you which warehouse adapter you are using and walk you through the connection configuration. It creates a project directory with the following structure:

my_first_project/
├── dbt_project.yml
├── models/
│   └── example/
├── tests/
├── macros/
├── seeds/
├── snapshots/
└── analyses/

The two things you need to understand immediately are dbt_project.yml and the models/ directory. Everything else can wait.

The dbt_project.yml File

This is the configuration file for your entire project. It defines the project name, the warehouse connection profile to use, and configuration that applies to groups of models. A minimal version looks like this:

yaml

name: 'my_first_project'
version: '1.0.0'
config-version: 2

profile: 'my_first_project'

model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

models:
  my_first_project:
    +materialized: view

The profiles.yml File

Your warehouse connection credentials live in a separate file called profiles.yml that sits outside your project directory in a .dbt folder in your home directory. This separation keeps credentials out of your project repository.

A BigQuery profile looks like this:

yaml

my_first_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: your-gcp-project-id
      dataset: dbt_dev
      threads: 4

A DuckDB profile for local development looks like this:

yaml

my_first_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'my_first_project.duckdb'
      threads: 4

Test your connection with:

bash

dbt debug

A successful debug output confirms dbt can reach your warehouse and your project is configured correctly.

Understanding dbt Models

Models are the core of any dbt project. A model is simply a SQL file containing a SELECT statement. dbt takes that SELECT statement and materializes it in your warehouse as a table or view.

Your First Model

Create a file called customers.sql inside the models/ directory:

sql

select
    id as customer_id,
    first_name,
    last_name,
    email,
    created_at
from raw.customers

That is a complete dbt model. Run it with:

bash

dbt run

dbt executes this SELECT statement and creates a view called customers in your warehouse. Every time you run dbt run, it recreates this view with the latest logic.

Materializations

Materialization controls how dbt persists your model in the warehouse. There are four options.

view creates a database view. The SQL runs every time someone queries the view. This is the default and is appropriate for lightweight transformations where you do not need to store intermediate results.

table creates a full table by running your SELECT and storing the results. Queries against a table are faster than against a view because the computation is done at build time rather than query time. Use this for models that are queried frequently or involve expensive computation.

incremental creates a table on the first run and then only processes new or updated records on subsequent runs. This is critical for large event tables where rebuilding the full table on every run would be prohibitively slow.

ephemeral does not create anything in the warehouse. The model is injected as a common table expression (CTE) into models that reference it. Use this for intermediate logic you want to reuse but do not need materialized as a standalone object.

Set materialization in your dbt_project.yml for groups of models or in individual model files using a config block:

sql

{{ config(materialized='table') }}

select
    id as customer_id,
    first_name,
    last_name,
    email,
    created_at
from raw.customers

Building Model Dependencies With ref()

The feature that makes dbt models genuinely powerful is the ref() function. Instead of hardcoding table names when referencing other models, you use ref() to declare dependencies between models.

Consider a two-model example. The first model stg_orders.sql cleans raw order data:

sql

select
    id as order_id,
    customer_id,
    status,
    amount,
    created_at as order_date
from raw.orders
where status != 'test'

The second model customer_orders.sql aggregates order data at the customer level and references the first model:

sql

select
    customer_id,
    count(*) as total_orders,
    sum(amount) as lifetime_value,
    min(order_date) as first_order_date,
    max(order_date) as most_recent_order_date
from {{ ref('stg_orders') }}
group by customer_id

The {{ ref('stg_orders') }} call does two things. It replaces itself with the actual database path of the stg_orders model at runtime, so it always points to the right table regardless of environment. And it tells dbt that customer_orders depends on stg_orders, so dbt knows to build stg_orders first when running the project.

dbt builds a directed acyclic graph (DAG) of all model dependencies and executes models in the correct order automatically. You never have to manually manage execution order.

You can visualize the DAG with:

bash

dbt docs generate
dbt docs serve

This opens a browser interface showing the full lineage of your project, which models depend on which, and the SQL behind each one.

Project Structure and Naming Conventions

A well-structured dbt project makes it easy to understand what each model does and where it sits in the transformation pipeline. The most widely adopted convention organizes models into three layers.

Staging models (stg_)

Staging models sit directly on top of raw source data. Their job is to rename columns to consistent conventions, cast data types correctly, filter out obviously bad records, and do nothing else. One staging model per source table is the standard pattern. They are named with the stg_ prefix.

sql

-- models/staging/stg_customers.sql
select
    id as customer_id,
    first_name,
    last_name,
    lower(email) as email,
    cast(created_at as timestamp) as created_at
from raw.customers

Intermediate models (int_)

Intermediate models join and combine staging models into useful building blocks. They sit between staging and marts and handle the complex joining logic. Not all projects need this layer, but it is valuable when you have logic that is reused across multiple mart models.

sql

-- models/intermediate/int_customer_orders.sql
select
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    count(o.order_id) as total_orders,
    sum(o.amount) as lifetime_value
from {{ ref('stg_customers') }} c
left join {{ ref('stg_orders') }} o
    on c.customer_id = o.customer_id
group by 1, 2, 3, 4

Mart models (fct_, dim_)

Mart models are the final output of the transformation pipeline. They are the tables that analysts and BI tools query. Fact tables (fct_) contain transactional or event data. Dimension tables (dim_) contain descriptive attributes of entities like customers, products, or locations.

sql

-- models/marts/dim_customers.sql
select
    customer_id,
    first_name,
    last_name,
    email,
    total_orders,
    lifetime_value,
    case
        when lifetime_value >= 1000 then 'high_value'
        when lifetime_value >= 200 then 'mid_value'
        else 'low_value'
    end as customer_segment
from {{ ref('int_customer_orders') }}

Testing Your Data

Testing is one of the most important features dbt provides and one of the most underused by beginners. dbt has two types of tests: generic tests and singular tests.

Generic Tests

Generic tests are built-in tests you apply to columns in a configuration file called schema.yml. The four built-in generic tests cover the most common data quality requirements.

unique asserts that every value in a column is unique. not_null asserts that no values in a column are null. accepted_values asserts that every value in a column belongs to a defined list. relationships asserts that every value in a column exists as a value in another column in another model, functioning like a foreign key check.

Define these tests in a schema.yml file alongside your models:

yaml

version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'cancelled']

Run your tests with:

bash

dbt test

dbt executes each test as a query against your warehouse and reports failures. A failing test means data in your warehouse does not meet the assertion you defined.

Singular Tests

Singular tests are custom SQL files that return rows when something is wrong. If the query returns zero rows, the test passes. If it returns any rows, the test fails and dbt shows you what went wrong.

Create a file in the tests/ directory:

sql

-- tests/assert_positive_order_amounts.sql
select
    order_id,
    amount
from {{ ref('stg_orders') }}
where amount < 0

This test fails if any orders have negative amounts. Singular tests are useful for business logic assertions that the generic tests cannot express.

Documenting Your Models

dbt documentation lives alongside your code in the same schema.yml files where you define tests. Adding descriptions to models and columns takes a few extra lines but produces documentation that is automatically kept in sync with your code.

yaml

version: 2

models:
  - name: dim_customers
    description: >
      One row per customer. Contains customer attributes and
      aggregated order metrics. Use this model as the primary
      source for customer-level analysis.
    columns:
      - name: customer_id
        description: Primary key. Unique identifier for each customer.
        tests:
          - unique
          - not_null
      - name: lifetime_value
        description: >
          Total USD value of all completed orders for this customer.
          Does not include cancelled or refunded orders.
      - name: customer_segment
        description: >
          Derived segment based on lifetime value.
          high_value = LTV >= $1000, mid_value = LTV >= $200,
          low_value = LTV < $200.

Generate and serve the documentation site with:

bash

dbt docs generate
dbt docs serve

The documentation site shows every model in your project, its description, its columns and their descriptions, the SQL that defines it, and its lineage in the DAG. For anyone joining the data team, this is invaluable context.

Sources and Freshness Checks

So far the examples have referenced raw tables directly with hardcoded paths like raw.customers. dbt provides a better pattern for this through sources.

Sources are declarations of the raw tables your dbt project depends on. Defining them in a sources.yml file gives you several advantages: you can reference them with the source() function which makes the dependency explicit, dbt can test properties of source tables the same way it tests models, and dbt can check whether source data is fresh.

yaml

version: 2

sources:
  - name: raw
    database: my_database
    schema: raw
    tables:
      - name: customers
        description: Raw customer records loaded from the production database.
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        loaded_at_field: _loaded_at
      - name: orders
        description: Raw order records loaded from the production database.

Reference sources in your staging models using source():

sql

-- models/staging/stg_customers.sql
select
    id as customer_id,
    first_name,
    last_name,
    email,
    created_at
from {{ source('raw', 'customers') }}

Check source freshness with:

bash

dbt source freshness

dbt queries the loaded_at_field column and warns or errors based on how recently data was loaded. This gives you automated alerting when upstream data pipelines are delayed.

Incremental Models

For large tables with millions or billions of rows, rebuilding the full table on every dbt run is impractical. Incremental models solve this by only processing new or changed records on each run.

sql

{{ config(materialized='incremental') }}

select
    event_id,
    user_id,
    event_type,
    properties,
    occurred_at
from {{ source('raw', 'events') }}

{% if is_incremental() %}
    where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}

The is_incremental() block only executes on runs after the first. {{ this }} refers to the existing table in the warehouse. This pattern reads only events that occurred after the latest event already in the table, inserts them, and leaves existing rows untouched.

Incremental models require careful thought about your data. If source records can be updated after insertion, a simple max(occurred_at) filter misses those updates. The unique_key config tells dbt to upsert records rather than just insert them:

sql

{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

Running dbt Commands

The commands you will use most frequently as you work with dbt:

dbt run builds all models in your project in dependency order.

dbt run --select stg_customers builds only the stg_customers model.

dbt run --select +dim_customers builds dim_customers and all of its upstream dependencies.

dbt test runs all tests defined in your project.

dbt test --select stg_orders runs only tests for the stg_orders model.

dbt build runs models and tests together, stopping if a test fails before building downstream models that depend on the failing model.

dbt compile compiles your SQL without executing it, useful for debugging what your Jinja templates produce.

dbt docs generate && dbt docs serve generates and serves your documentation site locally.

dbt source freshness checks whether source tables have been updated recently enough.

dbt Beginner Cheat Sheet

ConceptWhat It IsWhen to Use It
ModelA SQL SELECT statement dbt materializesAlways, this is the core unit
ref()Reference to another dbt modelWhenever one model depends on another
source()Reference to a raw source tableFor all raw table references in staging
Staging modelCleans one raw source tableOne per source table
Mart modelFinal analytical outputWhat analysts and BI tools query
Generic testBuilt-in unique, not_null, etcColumn-level data quality checks
Singular testCustom SQL that returns failing rowsComplex business logic assertions
IncrementalOnly processes new recordsLarge event or log tables
schema.ymlTests and docs configurationAlongside every model folder
dbt buildRuns models and tests togetherCI/CD and production runs

Common Beginner Mistakes

Putting business logic in staging models is the most common structural mistake. Staging models should only clean and rename. Joins, aggregations, and business logic belong in intermediate or mart models. Keeping staging models simple makes the entire project easier to debug and modify.

Not using ref() consistently breaks the dependency graph. If you hardcode a table name instead of using ref(), dbt does not know about the dependency and may build models in the wrong order. Always use ref() for models and source() for raw tables.

Skipping tests entirely is a mistake that causes problems later. Even a minimal set of unique and not_null tests on primary keys catches a surprising number of data quality issues before they reach analysts. Start with basic tests on every primary key and expand from there.

Writing models that are too large is a readability problem. A model with hundreds of lines of SQL joining ten tables is hard to understand, hard to test, and hard to debug. Break complex logic into intermediate models that each do one thing clearly.

FAQs

What is dbt used for in analytics engineering?

dbt is used to transform raw data inside a data warehouse into clean, tested, and documented analytical models. It allows data teams to write SQL SELECT statements and have dbt handle materializing them as tables or views, managing dependencies between models, running data quality tests, and generating documentation. It brings software engineering practices like version control, testing, and modularity to SQL-based data transformation work.

Do I need to know Python to use dbt?

No. The core dbt workflow is SQL-based. You write SQL SELECT statements and dbt handles the rest. dbt uses Jinja templating for dynamic SQL, which has a small syntax to learn, but it is not Python programming. Python knowledge is useful for writing custom dbt packages and macros but is not required to be productive with dbt from day one.

What is the difference between dbt Core and dbt Cloud?

dbt Core is the free open source command line tool that you install locally and run from your terminal. dbt Cloud is a managed web platform built on dbt Core that adds a browser-based IDE, managed job scheduling, hosted documentation, and collaboration features. dbt Cloud has a free tier for individual users. Most people learn with dbt Core and adopt dbt Cloud when working in team environments.

What data warehouses does dbt support?

dbt supports all major data warehouses through adapter packages including Snowflake, BigQuery, Redshift, Databricks, DuckDB, PostgreSQL, and many others. The core dbt workflow is the same regardless of warehouse and you install the adapter package for your specific warehouse when setting up dbt Core.

How long does it take to learn dbt as a beginner?

An analyst comfortable with SQL can be productive with dbt within a few days of focused learning. Understanding models, ref(), basic tests, and the staging and mart layer pattern covers the majority of day-to-day dbt work. More advanced topics like incremental models, custom macros, packages, and deployment take longer to master but are not required to start building real projects.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top