
Building Your First Data Warehouse: A Hands-on Tutorial
Global 0410 October 2024 Blog, Datawarehouse, Tutorials
What a data warehouse is, the reference architecture, how to stand one up (Snowflake/BigQuery/Redshift), build a first ETL/ELT pipeline, and validate performance and costs.
What is a Data Warehouse?
A data warehouse centralizes historical, read-optimized data across sources for reporting and analytics. Key traits: schema-on-write, dimensional models, and separation of compute & storage.
Reference Architecture
- Sources: OLTP DBs, SaaS (Stripe, Salesforce), files, events.
- Staging/Landing: raw immutable copies (often partitioned by ingestion date).
- Transform: ELT in-warehouse with SQL (dbt) or ETL before load.
- Warehouse: curated dimensional models (Star/Snowflake schema).
- Data Marts: domain-oriented subsets for BI teams.
- BI/Serving: dashboards (Looker, Power BI), APIs, ML features.
Spin Up the Warehouse
Choose a platform (Snowflake, BigQuery, Redshift). Create a database + schema, a compute (warehouse/slot/cluster), and a service user.
CREATE SCHEMA analytics.staging; CREATE SCHEMA analytics.marts; CREATE TABLE analytics.marts.dim_customer ( customer_id STRING PRIMARY KEY, name STRING, email STRING, created_at TIMESTAMP ); CREATE TABLE analytics.marts.fact_orders ( order_id STRING PRIMARY KEY, customer_id STRING REFERENCES analytics.marts.dim_customer(customer_id), order_ts TIMESTAMP, amount NUMERIC(12,2), status STRING );
First ELT Pipeline
Land raw data daily, transform with SQL models, then publish marts. Tools: Airbyte/Fivetran (ingest), dbt (transform), Airflow/Cloud Composer (orchestration).
-- dbt-style model (orders_clean.sql) SELECT id AS order_id, customer_id, CAST(created_at AS TIMESTAMP) AS order_ts, CAST(amount AS NUMERIC) AS amount, COALESCE(status, 'unknown') AS status FROM analytics.staging.orders_raw;
Validation, Performance & Cost
- Validation: row counts, known null checks, business rule tests (dbt tests).
- Performance: partition by date, cluster by high-cardinality columns.
- Cost: schedule off-hours downscaling, cache results, materialize incremental models.
