BigQuery Cost-Efficient ETL Using MERGE-Based UPSERT
Tools: Looker Studio, Google BigQuery, Google Cloud Functions, Google Cloud Scheduler
Tags: Optimization
Created: 2023
💡 "I utilized a datamart created by our data engineering team and leveraged the INFORMATION_SCHEMA.JOBS_BY_PROJECT view in BigQuery to develop a simple audit dashboard. This dashboard helps identify which datamarts or queries may require optimization based on usage patterns and performance metrics."
To optimize BigQuery costs in our ETL process, I implemented an UPSERT strategy using the MERGE statement. This approach lets us process only new or changed records, significantly reducing the amount of data handled each day. Instead of overwriting full tables or running expensive full refreshes, we now keep datasets up to date at a fraction of the cost.
Our original ETL jobs reloaded the full table, covering 2 to 3 months of history, on every run, even when only a few rows had changed. This led to unnecessary data processing, higher costs, and longer query times. Since BigQuery charges by the amount of data scanned, we needed a more efficient way to keep tables fresh without compromising performance or accuracy.
I transitioned the daily load process from full refreshes to incremental UPSERTs by applying the MERGE statement in SQL. The statement checks whether a record already exists in the target table based on a unique key and either updates it or inserts a new row. Because the source query only pulls the last day of data (an INTERVAL 1 DAY filter), the volume scanned, and therefore the cost, is drastically lower.
Sample logic:
MERGE target_table t
USING new_data n
  ON t.order_id = n.order_id
WHEN MATCHED THEN
  UPDATE SET t.status = n.status, t.updated_at = n.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, status, updated_at)
  VALUES (n.order_id, n.status, n.updated_at);
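In the production job, new_data is not a full table; the source is restricted to the last day of changes so that only that slice is billed. Here is a sketch of how that daily source could look, assuming a staging table called staging_orders with an updated_at timestamp column (both names are illustrative):

-- Illustrative daily incremental MERGE: only the last day of changes is read from the source.
MERGE target_table t
USING (
  SELECT order_id, status, updated_at
  FROM staging_orders  -- hypothetical staging table, ideally partitioned on updated_at
  WHERE updated_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
) n
ON t.order_id = n.order_id
WHEN MATCHED THEN
  UPDATE SET t.status = n.status, t.updated_at = n.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, status, updated_at)
  VALUES (n.order_id, n.status, n.updated_at);

If the target table is partitioned as well, adding a predicate on its partition column to the ON clause can further limit how much of the target is scanned.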
To show the impact, here’s a simulated comparison between my previous approach and the optimized one:
Cost Saving: ~Rp 1.93 million per month, a roughly 99% reduction in query cost, and that is the saving from just one datamart's ETL.
🔻 Reduced Query Cost: The volume of data scanned daily dropped by 80–90% in most pipelines.
⚡ Faster Loads: Job durations decreased significantly.
🧹 Cleaner Logic: Easier to maintain and scale with schema changes.
📈 Business Value: Dashboards and reports remain fresh and accurate—without burning budget.
By switching to incremental MERGE operations, we gained control over how much data is processed each day, keeping costs low while still ensuring the latest data is reflected in our dashboards and reports.