How We Cut Our ETL Costs by 37% and Runtime by 44%: A dbt Optimization Story
At Calm, our data platform runs thousands of dbt models across Databricks, orchestrated by Apache Airflow. What started as a performance investigation turned into a complete rethinking of how we run dbt in production—and the results speak for themselves:
- Runtime: 16 hours → 9 hours (44% faster)
- Compute costs: 37% reduction
- Warehouse size: 75% smaller (Medium → xSmall)
This is the story of how we got there.
The Problem: Death by a Thousand Parses
Our dbt DAGs were slow. Painfully slow. A model that executed in 10 seconds would take 13+ minutes of wall clock time. Tests that ran in under a second consumed 16+ minutes per task.
When we dug into the logs, we found the culprit:
```
Unable to do partial parsing because saved manifest not found. Starting full parse.
```
Our Airflow DAGs created one task per dbt model and test. Every single task was parsing the entire dbt project from scratch—all 2,200+ models and 3,700+ tests. With hundreds of tasks per DAG, we were burning hours on repeated parsing overhead.
Here's what a typical task looked like:
| Phase | Duration |
|---|---|
| Pod startup | ~5 min |
| Full manifest parse | ~11 min |
| Model discovery | ~8 min |
| Actual model execution | 10 sec |
| Cleanup | ~1 min |
We were spending 81% of our compute time on overhead.
What We Tried First (And Why It Didn't Work)
Our initial instinct was to cache the manifest in the Docker image at build time. We modified our Dockerfile to run dbt parse during the build, hoping to ship a pre-built manifest with each deployment.
The problem: --vars.
We were passing --vars to our dbt commands for incremental model configuration. According to dbt's parsing documentation, certain inputs will trigger a full re-parse even when partial parsing is enabled:
> If certain inputs change between runs, dbt will trigger a full re-parse. Today those inputs are: `--vars`, `profiles.yml` content, `dbt_project.yml` content, installed packages, dbt version, and certain widely-used macros.
Since each Airflow task could have different --vars values, dbt would invalidate the cached manifest and trigger a full reparse every time:
```
Unable to do partial parsing because vars changed. Starting full parse.
```
We could have removed the --vars usage, but it was integral to how we configured incremental models. We needed a different approach.
The Solution: Three Complementary Optimizations
Instead of fighting the manifest caching problem, we took a step back and asked: What if we just parse less often?
1. Switch to dbt build with Tag-Based Selection
Our original architecture created one Airflow task per dbt model and test. This gave us granular retry capability but meant:
- Hundreds of tasks per DAG
- Each task parsing the full manifest independently
- No benefit from dbt's internal parallelization
We consolidated to a single dbt build command per DAG, using tags to select the right models:
```bash
dbt build --select tag:dag_id:core_subscriptions_databricks
```
This delegates parallelization to dbt itself, which handles it much more efficiently than Airflow orchestrating individual tasks. One parse, one command, dbt manages the rest internally.
We also cranked up the --threads argument to 256—an arbitrarily large value that lets dbt run as many models in parallel as it can. Since our Databricks warehouse is configured to auto-scale, it spins up additional compute as needed to handle the parallelism. This eliminates the queuing time that would otherwise occur if dbt had to wait for available threads.
One thing to watch: as parallelism increases, you may need to tune your warehouse's auto-scaling max values to keep up. We've adjusted ours a few times as we've pushed more concurrent models through.
The key enabler: We already had our models tagged by DAG in dbt_project.yml:
```yaml
models:
  our_project:
    dags:
      core_subscriptions:
        +tags:
          - "dag_id:core_subscriptions"
      core_ua:
        +tags:
          - "dag_id:core_ua"
```
Each tag corresponds to a logical grouping of models with clear ownership. The tags existed for organizational purposes and alerting, but they turned out to be perfect for selection.
This change also dramatically simplified our codebase—we were able to delete thousands of lines of custom DAG construction code.
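To make the shape of this concrete, here's a minimal sketch of the consolidated pattern: one DAG per tag group, each with a single task that shells out to `dbt build`. The DAG ids, schedule, and operator choice are illustrative rather than our exact production code; in a Kubernetes setup the plain BashOperator would typically be a pod-based operator, but the structure is the same.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One Airflow DAG per tag group defined in dbt_project.yml (illustrative list).
DAG_GROUPS = ["core_subscriptions", "core_ua"]

for group in DAG_GROUPS:
    with DAG(
        dag_id=f"{group}_databricks",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # One parse, one command: dbt handles model- and test-level parallelism.
        BashOperator(
            task_id="dbt_build",
            bash_command=f"dbt build --select tag:dag_id:{group} --threads 256",
        )

    # Expose each generated DAG at module level so Airflow discovers it.
    globals()[dag.dag_id] = dag
```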
2. Upgrade Python from 3.9 to 3.12
While investigating parsing performance, we noticed that Python itself was a bottleneck. dbt's manifest parsing is CPU-intensive, and Python 3.12 includes significant performance improvements over 3.9.
We updated our Dockerfile:
```dockerfile
# Before
FROM python:3.9-slim as build_base

# After
FROM python:3.12-slim as build_base
```
This alone reduced our parsing time meaningfully. Python 3.12's faster startup, improved dict operations, and better overall interpreter performance all contributed.
3. Right-Size the Warehouse
With parsing no longer dominating runtime, we could finally see our actual compute costs clearly. We were running everything on a medium serverless warehouse.
The conventional wisdom says "bigger warehouse = faster queries = lower total cost." But that's not always true.
We tested progressively smaller warehouses:
| Configuration | Runtime | Cost Reduction |
|---|---|---|
| Pre-optimization | 16 hrs | baseline |
| dbt build + Medium | 3 hrs | 18% savings |
| dbt build + Small | 7 hrs | 28% savings |
| dbt build + xSmall | 9 hrs | 37% savings |
The xSmall warehouse ran slower, but the cost savings more than compensated. Our workload wasn't compute-bound—it was I/O and orchestration bound. Throwing more compute at it didn't help proportionally.
Important caveat: This won't be true for everyone. If your models are doing heavy aggregations or complex joins, a larger warehouse may genuinely be worth it. Profile your actual workload.
The Rollout
We took a phased approach to minimize risk:
Day 1: Migrated a small handful of DAGs to validate the new pattern worked correctly in production. We monitored closely for any issues with model execution, test failures, or unexpected behavior.
Days 2-3: Rolled out to the remaining DAGs once we had confidence in the approach.
This incremental rollout let us catch issues early without affecting the entire platform.
Tradeoffs: What About Retries?
The tradeoff with consolidated dbt build is granularity: you can't retry individual models from Airflow's UI anymore.
For us, this was acceptable because:
- The actual model execution is fast; the overhead was the problem
- dbt 1.6+ has `dbt retry`, which re-runs only failed nodes
- We can use `dbt build --select result:fail --state ./previous_run` for targeted reruns
- Our failure rate is low enough that full DAG reruns aren't expensive
If you have flaky models or need fine-grained retry control, you might need a hybrid approach—perhaps grouping related models into batches rather than running everything in one command.
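As one illustration of that hybrid idea, a DAG could run a handful of batch-level tasks instead of a single all-in-one command. The `batch:` tags below are hypothetical and would need to be added in `dbt_project.yml` alongside the `dag_id` tags; the sketch just shows the shape, with each batch still being a single dbt invocation that can be retried on its own from the Airflow UI.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical batch tags; these would need to exist in dbt_project.yml.
BATCHES = ["staging", "intermediate", "marts"]

with DAG(
    dag_id="core_subscriptions_databricks",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    previous = None
    for batch in BATCHES:
        task = BashOperator(
            task_id=f"dbt_build_{batch}",
            # Intersection selector: models carrying both the DAG tag and the batch tag.
            bash_command=(
                f"dbt build --select tag:dag_id:core_subscriptions,tag:batch:{batch} "
                "--threads 256"
            ),
            retries=1,
        )
        if previous is not None:
            previous >> task  # preserve layer ordering across batches
        previous = task
```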
Cost Visibility: Knowing Where Your Money Goes
Once we optimized runtime, we wanted to understand costs per model. Databricks' system.query.history shows query-level metrics, but mapping queries back to dbt models was tricky since table names can have aliases.
We added metadata comments to every query via dbt_project.yml:
```yaml
query-comment:
  comment: >
    dbt_meta: {
      "model": "{{ node.name if node else 'manual' }}",
      "schema": "{{ node.schema if node else 'none' }}"
      {%- if target.type == 'databricks' and node -%}
      , "catalog": "{{ node.database }}"
      {%- endif -%}
    }
  append: true
```
Now every query carries its metadata (appended to the end of the statement, per `append: true`):

```sql
CREATE OR REPLACE TABLE ...
/* dbt_meta: {"model": "fct_subscriptions", "schema": "marts", "catalog": "data_prod"} */
```
And we can track relative costs by comparing execution times across models:
```sql
SELECT
  REGEXP_EXTRACT(statement_text, '"model":\\s*"([^"]+)"', 1) AS model_name,
  COUNT(*) AS runs,
  ROUND(SUM(execution_duration_ms) / 1000 / 60, 2) AS total_runtime_minutes,
  ROUND(AVG(execution_duration_ms) / 1000, 2) AS avg_runtime_seconds
FROM system.query.history
WHERE statement_text LIKE '%dbt_meta:%'
  AND start_time >= CURRENT_DATE() - INTERVAL 7 DAYS
GROUP BY model_name
ORDER BY total_runtime_minutes DESC;
```
This visibility helped us identify expensive models and prioritize further optimization work.
Results Summary
| Metric | Before | After | Improvement |
|---|---|---|---|
| Orchestration runtime | 16 hours | 9 hours | 44% faster |
| Compute costs | baseline | -37% | 37% savings |
| Warehouse size | Medium | xSmall | 75% smaller |
Key Learnings
Tag your models. The entire migration depended on our existing practice of tagging models by DAG.
Don't fight the parsing. Caching the manifest sounds appealing but has edge cases with --vars. Reducing parse frequency by consolidating tasks is more robust.
dbt build is underrated. It combines run + test, handles parallelization internally, and simplifies your orchestration layer significantly.
Bigger isn't always better. Profile your workload before assuming you need more compute. We saved money by downsizing.
Measure your costs. Query comments are cheap to add and invaluable for optimization decisions. You can't improve what you can't measure.
Roll out incrementally. Even when you're confident in a change, phased rollouts let you catch issues before they affect everything.
What's Next
We're continuing to explore:
- Model-level cost alerting: Flag when a model's cost spikes unexpectedly
- Incremental model optimization: Converting full-refresh models where appropriate
- Workload-specific warehouse sizing: Some jobs may benefit from different configurations
If you're running dbt on Airflow and seeing similar performance issues, I hope this helps. The optimizations aren't complex—they just require understanding where your time and money are actually going.
Thanks to Cameron Gridley for his contributions to this work.