Building Your First Data Warehouse: A Hands-on Tutorial

Global 0410 October 2024 Blog, Datawarehouse, Tutorials

What a data warehouse is, the reference architecture, how to stand one up (Snowflake/BigQuery/Redshift), build a first ETL/ELT pipeline, and validate performance and costs.

What is a Data Warehouse?

A data warehouse centralizes historical, read-optimized data across sources for reporting and analytics. Key traits: schema-on-write, dimensional models, and separation of compute & storage.

Reference Architecture

Sources: OLTP DBs, SaaS (Stripe, Salesforce), files, events.
Staging/Landing: raw immutable copies (often partitioned by ingestion date).
Transform: ELT in-warehouse with SQL (dbt) or ETL before load.
Warehouse: curated dimensional models (Star/Snowflake schema).
Data Marts: domain-oriented subsets for BI teams.
BI/Serving: dashboards (Looker, Power BI), APIs, ML features.

Spin Up the Warehouse

Choose a platform (Snowflake, BigQuery, Redshift). Create a database + schema, a compute (warehouse/slot/cluster), and a service user.

CREATE SCHEMA analytics.staging;
CREATE SCHEMA analytics.marts;

CREATE TABLE analytics.marts.dim_customer (
  customer_id STRING PRIMARY KEY,
  name STRING, email STRING, created_at TIMESTAMP
);

CREATE TABLE analytics.marts.fact_orders (
  order_id STRING PRIMARY KEY,
  customer_id STRING REFERENCES analytics.marts.dim_customer(customer_id),
  order_ts TIMESTAMP, amount NUMERIC(12,2), status STRING
);

First ELT Pipeline

Land raw data daily, transform with SQL models, then publish marts. Tools: Airbyte/Fivetran (ingest), dbt (transform), Airflow/Cloud Composer (orchestration).

-- dbt-style model (orders_clean.sql)
SELECT
  id          AS order_id,
  customer_id,
  CAST(created_at AS TIMESTAMP) AS order_ts,
  CAST(amount AS NUMERIC)       AS amount,
  COALESCE(status, 'unknown')   AS status
FROM analytics.staging.orders_raw;

Validation, Performance & Cost

Validation: row counts, known null checks, business rule tests (dbt tests).
Performance: partition by date, cluster by high-cardinality columns.
Cost: schedule off-hours downscaling, cache results, materialize incremental models.

Back to Tutorials