How We Cut Our ETL Costs by 37% and Runtime by 44%: A dbt Optimization Story
At Calm, our data platform runs thousands of dbt models across Databricks, orchestrated by Apache Airflow. What started as a performance investigation turned into a complete rethinking of how we run dbt in production—and the results speak for themselves:
- Runtime: 16 hours → 9 hours (44% faster)
- Compute costs: 37% reduction
- Warehouse size: 75% smaller (Medium → xSmall)
This is the story of how we got there.
The Problem: Death by a Thousand Parses
Our dbt DAGs were slow. Painfully slow. A model that executed in 10 seconds would take 13+ minutes of wall clock time. Tests that ran in under a second consumed 16+ minutes per task.
When we dug into the logs, we found the culprit:
```
Unable to do partial parsing because saved manifest not found. Starting full parse.
```
Our Airflow DAGs created one task per dbt model and test. Every single task was parsing the entire dbt project from scratch—all 2,200+ models and 3,700+ tests. With hundreds of tasks per DAG, we were burning hours on repeated parsing overhead.
Here's what a typical task looked like:
| Phase | Duration |
|---|---|
| Pod startup | ~5 min |
| Full manifest parse | ~11 min |
| Model discovery | ~8 min |
| Actual model execution | 10 sec |
| Cleanup | ~1 min |
We were spending 81% of our compute time on overhead.
What We Tried First (And Why It Didn't Work)
Our initial instinct was to cache the manifest in the Docker image at build time. We modified our Dockerfile to run dbt parse during the build, hoping to ship a pre-built manifest with each deployment.
The problem: --vars.
We were passing --vars to our dbt commands for incremental model configuration. According to dbt's parsing documentation, certain inputs will trigger a full re-parse even when partial parsing is enabled:
> If certain inputs change between runs, dbt will trigger a full re-parse. Today those inputs are: `--vars`, `profiles.yml` content, `dbt_project.yml` content, installed packages, dbt version, and certain widely-used macros.
Since each Airflow task could have different --vars values, dbt would invalidate the cached manifest and trigger a full reparse every time:
```
Unable to do partial parsing because vars changed. Starting full parse.
```
We could have removed the --vars usage, but it was integral to how we configured incremental models. We needed a different approach.
The Solution: Three Complementary Optimizations
Instead of fighting the manifest caching problem, we took a step back and asked: What if we just parse less often?
1. Switch to dbt build with Tag-Based Selection
Our original architecture created one Airflow task per dbt model and test. This gave us granular retry capability but meant:
- Hundreds of tasks per DAG
- Each task parsing the full manifest independently
- No benefit from dbt's internal parallelization
We consolidated to a single dbt build command per DAG, using tags to select the right models:
```bash
dbt build --select tag:dag_id:core_subscriptions_databricks
```
This delegates parallelization to dbt itself, which handles it much more efficiently than Airflow orchestrating individual tasks. One parse, one command, dbt manages the rest internally.
We also cranked up the --threads argument to 256—an arbitrarily large value that lets dbt run as many models in parallel as it can. Since our Databricks warehouse is configured to auto-scale, it spins up additional compute as needed to handle the parallelism. This eliminates the queuing time that would otherwise occur if dbt had to wait for available threads.
One thing to watch: as parallelism increases, you may need to tune your warehouse's auto-scaling max values to keep up. We've adjusted ours a few times as we've pushed more concurrent models through.
The key enabler: We already had our models tagged by DAG in dbt_project.yml:
```yaml
models:
  our_project:
    dags:
      core_subscriptions:
        +tags:
          - "dag_id:core_subscriptions"
      core_ua:
        +tags:
          - "dag_id:core_ua"
```
Each tag corresponds to a logical grouping of models with clear ownership. The tags existed for organizational purposes and alerting, but they turned out to be perfect for selection.
This change also dramatically simplified our codebase—we were able to delete thousands of lines of custom DAG construction code.
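To make the shape of this concrete, here's a minimal sketch of the consolidated pattern: one DAG per tag group, each with a single task that shells out to `dbt build`. The DAG ids, schedule, and operator choice are illustrative rather than our exact production code; in a Kubernetes setup the plain BashOperator would typically be a pod-based operator, but the structure is the same.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One Airflow DAG per tag group defined in dbt_project.yml (illustrative list).
DAG_GROUPS = ["core_subscriptions", "core_ua"]

for group in DAG_GROUPS:
    with DAG(
        dag_id=f"{group}_databricks",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # One parse, one command: dbt handles model- and test-level parallelism.
        BashOperator(
            task_id="dbt_build",
            bash_command=f"dbt build --select tag:dag_id:{group} --threads 256",
        )

    # Expose each generated DAG at module level so Airflow discovers it.
    globals()[dag.dag_id] = dag
```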
2. Upgrade Python from 3.9 to 3.12
While investigating parsing performance, we noticed that Python itself was a bottleneck. dbt's manifest parsing is CPU-intensive, and Python 3.12 includes significant performance improvements over 3.9.
We updated our Dockerfile:
```dockerfile
# Before
FROM python:3.9-slim as build_base

# After
FROM python:3.12-slim as build_base
```
This alone reduced our parsing time meaningfully. Python 3.12's faster startup, improved dict operations, and better overall interpreter performance all contributed.
3. Right-Size the Warehouse
With parsing no longer dominating runtime, we could finally see our actual compute costs clearly. We were running everything on a medium serverless warehouse.
The conventional wisdom says "bigger warehouse = faster queries = lower total cost." But that's not always true.
We tested progressively smaller warehouses:
| Configuration | Runtime | Cost Reduction |
|---|---|---|
| Pre-optimization | 16 hrs | baseline |
| dbt build + Medium | 3 hrs | 18% savings |
| dbt build + Small | 7 hrs | 28% savings |
| dbt build + xSmall | 9 hrs | 37% savings |
The xSmall warehouse ran slower, but the cost savings more than compensated. Our workload wasn't compute-bound—it was I/O and orchestration bound. Throwing more compute at it didn't help proportionally.
Important caveat: This won't be true for everyone. If your models are doing heavy aggregations or complex joins, a larger warehouse may genuinely be worth it. Profile your actual workload.
The Rollout
We took a phased approach to minimize risk:
Day 1: Migrated a small handful of DAGs to validate the new pattern worked correctly in production. We monitored closely for any issues with model execution, test failures, or unexpected behavior.
Days 2-3: Rolled out to the remaining DAGs once we had confidence in the approach.
This incremental rollout let us catch issues early without affecting the entire platform.
Tradeoffs: What About Retries?
The tradeoff with consolidated dbt build is granularity: you can't retry individual models from Airflow's UI anymore.
For us, this was acceptable because:
- The actual model execution is fast; the overhead was the problem
- dbt 1.6+ has `dbt retry`, which re-runs only failed nodes
- We can use `dbt build --select result:fail --state ./previous_run` for targeted reruns
- Our failure rate is low enough that full DAG reruns aren't expensive
If you have flaky models or need fine-grained retry control, you might need a hybrid approach—perhaps grouping related models into batches rather than running everything in one command.
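As one illustration of that hybrid idea, a DAG could run a handful of batch-level tasks instead of a single all-in-one command. The `batch:` tags below are hypothetical and would need to be added in `dbt_project.yml` alongside the `dag_id` tags; the sketch just shows the shape, with each batch still being a single dbt invocation that can be retried on its own from the Airflow UI.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical batch tags; these would need to exist in dbt_project.yml.
BATCHES = ["staging", "intermediate", "marts"]

with DAG(
    dag_id="core_subscriptions_databricks",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    previous = None
    for batch in BATCHES:
        task = BashOperator(
            task_id=f"dbt_build_{batch}",
            # Intersection selector: models carrying both the DAG tag and the batch tag.
            bash_command=(
                f"dbt build --select tag:dag_id:core_subscriptions,tag:batch:{batch} "
                "--threads 256"
            ),
            retries=1,
        )
        if previous is not None:
            previous >> task  # preserve layer ordering across batches
        previous = task
```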
Cost Visibility: Knowing Where Your Money Goes
Once we optimized runtime, we wanted to understand costs per model. Databricks' system.query.history shows query-level metrics, but mapping queries back to dbt models was tricky since table names can have aliases.
We added metadata comments to every query via dbt_project.yml:
```yaml
query-comment:
  comment: >
    dbt_meta: {
      "model": "{{ node.name if node else 'manual' }}",
      "schema": "{{ node.schema if node else 'none' }}"
      {%- if target.type == 'databricks' and node -%}
      , "catalog": "{{ node.database }}"
      {%- endif -%}
    }
  append: true
```
Now every query carries its metadata (appended to the end of the statement, per `append: true`):

```sql
CREATE OR REPLACE TABLE ...
/* dbt_meta: {"model": "fct_subscriptions", "schema": "marts", "catalog": "data_prod"} */
```
And we can track relative costs by comparing execution times across models:
```sql
SELECT
  REGEXP_EXTRACT(statement_text, '"model":\\s*"([^"]+)"', 1) AS model_name,
  COUNT(*) AS runs,
  ROUND(SUM(execution_duration_ms) / 1000 / 60, 2) AS total_runtime_minutes,
  ROUND(AVG(execution_duration_ms) / 1000, 2) AS avg_runtime_seconds
FROM system.query.history
WHERE statement_text LIKE '%dbt_meta:%'
  AND start_time >= CURRENT_DATE() - INTERVAL 7 DAYS
GROUP BY model_name
ORDER BY total_runtime_minutes DESC;
```
This visibility helped us identify expensive models and prioritize further optimization work.
Results Summary
| Metric | Before | After | Improvement |
|---|---|---|---|
| Orchestration runtime | 16 hours | 9 hours | 44% faster |
| Compute costs | baseline | -37% | 37% savings |
| Warehouse size | Medium | xSmall | 75% smaller |
Key Learnings
Tag your models. The entire migration depended on our existing practice of tagging models by DAG.
Don't fight the parsing. Caching the manifest sounds appealing but has edge cases with --vars. Reducing parse frequency by consolidating tasks is more robust.
dbt build is underrated. It combines run + test, handles parallelization internally, and simplifies your orchestration layer significantly.
Bigger isn't always better. Profile your workload before assuming you need more compute. We saved money by downsizing.
Measure your costs. Query comments are cheap to add and invaluable for optimization decisions. You can't improve what you can't measure.
Roll out incrementally. Even when you're confident in a change, phased rollouts let you catch issues before they affect everything.
What's Next
We're continuing to explore:
- Model-level cost alerting: Flag when a model's cost spikes unexpectedly
- Incremental model optimization: Converting full-refresh models where appropriate
- Workload-specific warehouse sizing: Some jobs may benefit from different configurations
If you're running dbt on Airflow and seeing similar performance issues, I hope this helps. The optimizations aren't complex—they just require understanding where your time and money are actually going.
Thanks to Cameron Gridley for his contributions to this work.