Optimizing SQL Runner in Snowflake

We have recently been contacted by a few customers who noticed that their SQL Runner jobs were taking a long time on their Snowflake warehouse, caused by full table scans on their atomic events table despite filtering on the appropriate timestamp column. After some investigation with a customer, we identified that the issue was caused by filters which use subqueries for their values, and implemented a fix in our data models.

In this post I’ll cover what the issue was, how you can alter your models to take advantage of this improvement, and then for those interested I’ll attempt to give a simplified explanation of what was causing it.

The issue

Our data models have always taken advantage of the partitioning method of the warehouse where possible by filtering on the relevant _tstamp column, based on the limits calculated for that run. However, we have done this via a subquery on the limits table:

...
    AND a.collector_tstamp >= (SELECT lower_limit FROM {{.scratch_schema}}.base_new_events_limits{{.entropy}})
    AND a.collector_tstamp <= (SELECT upper_limit FROM {{.scratch_schema}}.base_new_events_limits{{.entropy}})
...

For reasons explained later, there are cases where the query optimizer will perform a full table scan before it fetches these values and applies the filter. In the case of the events table this could mean that 99% of the data read isn’t actually required, and this will only get worse the more events you generate.
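
If you want to check whether your own models are affected, the query profile for a run shows a “Partitions scanned” statistic next to “Partitions total”; if these are roughly equal despite a timestamp filter, pruning isn’t happening. You can also look at the compile-time plan with EXPLAIN. The snippet below is only an illustrative sketch (the scratch schema name is a placeholder), not something our models run:

-- Illustrative sketch: inspect the compile-time plan for the subquery-based filter.
-- Compare the partitionsAssigned and partitionsTotal columns of the output - if they
-- match, the optimizer could not plan any pruning from this filter at compile time.
EXPLAIN
SELECT count(*)
FROM atomic.events AS a
WHERE a.collector_tstamp >= (SELECT lower_limit FROM scratch.base_new_events_limits)
    AND a.collector_tstamp <= (SELECT upper_limit FROM scratch.base_new_events_limits);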

The fix

To fix this issue, we force the database to fetch the limits first, which allows the optimizer to use these values at compile time and take advantage of the partitioning approach of the warehouse.

SET (LOWER_LIMIT, UPPER_LIMIT) = (SELECT lower_limit, upper_limit FROM {{.scratch_schema}}.base_new_events_limits{{.entropy}});

...
    AND a.collector_tstamp >= $LOWER_LIMIT
    AND a.collector_tstamp <= $UPPER_LIMIT
...
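
If you want to sanity-check the values while testing this change manually, Snowflake lets you inspect session variables. This is entirely optional and not part of the fix itself:

-- Optional: confirm the limits were picked up before the main query runs
SHOW VARIABLES LIKE '%LIMIT%';
SELECT GETVARIABLE('LOWER_LIMIT') AS lower_limit, GETVARIABLE('UPPER_LIMIT') AS upper_limit;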

You can see the full change we made here. Making similar changes to your models, especially for queries on the events table, could lead to massive performance improvements (we have seen some customers enjoy over 10x improvements in speed). Even if you don’t currently experience this issue, the change will not decrease performance and will future-proof your models against this happening.

Note that our dbt models don’t experience the same issue as they already set the limits outside of the query itself.
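
For context, “setting the limits outside of the query” in dbt looks roughly like the sketch below: the limits are fetched with run_query while the model compiles, so the SQL that reaches Snowflake already contains literal values. This is only an illustration of the pattern (the model and source names are made up for the example), not the actual code from our dbt packages.

{# Illustrative sketch of the general dbt pattern only - not our packages' actual macros #}
{% set results = run_query("select lower_limit, upper_limit from " ~ ref('base_new_events_limits')) %}

{% if execute %}
    {% set lower_limit = results.columns[0].values()[0] %}
    {% set upper_limit = results.columns[1].values()[0] %}
{% endif %}

select *
from {{ source('atomic', 'events') }} as a
where a.collector_tstamp >= '{{ lower_limit }}'
    and a.collector_tstamp <= '{{ upper_limit }}'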

For those of you that just wanted a fix, feel free to skip the rest of this post!

The Technical Details

Query optimization is somewhat of a mystic art: just because you write your query in a particular way does not mean it will be executed in that order. Optimizers need to take into account joins, filters, nested queries or CTEs, order by clauses, and storage and metadata (such as partitions, indexes, clustering, etc.) to run your query in the fastest possible way. In this case fast usually means reading the least amount of data possible and applying transformations/ordering as late as possible. All this, combined with certain parts of the query only being available at run time (when it’s running) instead of compile time (when it decides how to run it), makes the job of an optimizer not an easy one.

To simplify things a bit, let’s imagine a large number of filing cabinets with our event records in them; on the front of each drawer is a little card giving information about the records inside, such as the maximum and minimum collector timestamp, the range of IP addresses, and how many distinct user_ids there are. New events get added to later drawers. This is actually pretty similar to how Snowflake micro-partitions work!
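
If you’re curious what those cards look like for your own events table, Snowflake exposes some of this metadata. A quick sketch (assuming the standard atomic.events table name):

-- Summarises how the micro-partitions line up with collector_tstamp: the returned JSON
-- includes the total partition count and overlap/depth statistics for this column
SELECT SYSTEM$CLUSTERING_INFORMATION('atomic.events', '(collector_tstamp)');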

Now, if I ask you to go and find all records with a collector_tstamp between some specific dates, you will look first at the card on the front, and if the dates I asked for are not within the min and max on that card, you won’t even bother to open the drawer. When the ranges overlap, you’ll open the drawer and look at each record to see which ones match. When the date range I ask for entirely covers the min to max of the drawer, you just grab them all and don’t even check. If I also asked you to only get records for a specific app_id, you would have to check each record, but only in drawers where the dates aligned. This is known as query pruning: you have been able to ignore a large number of records without having to check them individually!

-- Checking drawers based on the dates on the front, then checking individual records for app_id
select * from atomic.events 
where collector_tstamp between timestamp '2022-10-01' and timestamp '2022-11-01'
    and app_id in ('app1', 'app2')

If instead I had just asked for records with those app_ids, you would probably have to check each record individually. This is a full table scan, because you are reading every record in the table. This is why it’s important to always try to filter on a partition column in any data warehouse.

-- Checking every record in all drawers for app_id
select * from atomic.events 
where app_id in ('app1', 'app2')

Now let’s pretend that instead of telling you the date range, I have put it on a noticeboard on the other side of the office. You have to make a decision: do you walk over there and get the dates first, or do you check the records based on app_id first, and then go and get the dates and filter on those?

select * from atomic.events 
where collector_tstamp between (select min_date from noticeboard) and (select max_date from noticeboard)
    and app_id in ('app1', 'app2')

In this case you understand the data, and you know that filtering on the date first is going to save a lot more time. But what if you get there and the dates I’ve given cover every drawer? You’ve just wasted your time! What about a more complex case, where you need to join to some other table and I have even more filters to apply? What if that join would get rid of 90% of the records? How do you know what to do first to save yourself the most time? Fortunately, this is why we have optimizers!

Optimizers have a lot of metadata: those cards on the front of the drawers cover every column and hold even more information. They can estimate how many records a join will remove and how many a filter will remove - most of the time. They can only do this when they have that information at compile time, i.e. when they are making the plan. When I wrote the dates on the noticeboard instead of telling you, that’s the same as using a subquery in the where clause, whose value is only available at run time - the optimizer doesn’t know upfront how good that filter will be at getting rid of records, so sometimes it chooses to do other things first that it thinks will be better.

When we swap to using a SET we force it to go and find out those dates first, so it knows the values at compile time and can optimize properly. Because we know that on the events table this filter is going to give by far the best query pruning, this is how we can make sure the optimizer knows it too. Even if there is some better filter to apply first, the optimizer can still make that choice!
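
Translating the noticeboard example back into SQL, the fix is simply to read the noticeboard once up front and hand the optimizer the answer as constants (using the same hypothetical noticeboard table from the earlier example):

-- Walk over to the noticeboard first and write the dates down...
SET (MIN_DATE, MAX_DATE) = (SELECT min_date, max_date FROM noticeboard);

-- ...so by the time this query is compiled the range is a known constant, and any
-- drawer (micro-partition) outside it can be skipped without being opened
select * from atomic.events 
where collector_tstamp between $MIN_DATE and $MAX_DATE
    and app_id in ('app1', 'app2')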

Further reading

For those interested in better understanding how the Snowflake optimizer works, the following is a great article that covers the basics: Understanding the Snowflake Query Optimizer


After identifying this optimization, we recommended it to one of our customers who was experiencing long-running sql-runner jobs on their Snowflake warehouse. Their jobs were taking 9+ hours to complete end to end.

After introducing this optimization across their SQL files, they noticed dramatic improvements, as Snowflake was able to more effectively prune the partitions in the tables.

[Image: job run durations before and after the optimization]

As you can see, jobs that were taking nearly 10 hours to run each time (costing the company a lot in Snowflake credits) were reduced to barely over 1 hour - nearly a 10x improvement in performance!
