Redshift PercentageDiskSpaceUsed is increasing over 80

Hello,

I’m experiencing some strange spikes in disk space used on Redshift every 6 hours, and I can’t tell for sure what is causing them. I know that no data will be lost if disk space reaches 100% in Redshift, but I want to understand what is causing this and whether I should be alarmed.

The EmrEtlRunner, StorageLoader and SqlRunner run every 6 hours with no problems, but around the time the EMR job finishes, an alarm is triggered in Redshift (PercentageDiskSpaceUsed is over 80).

Redshift alarm history from CloudWatch on 30/07/2017:

EmrEtlRunner history on 30/07/2017:

For example, if I inspect Redshift for the alarm triggered on 30/07 at 05:40, I can see it fires right after the creation of web.page_views_tmp and the loading of page_views_tmp. Is loading all of that data into memory (green bar) causing this spike?


The alarm triggered on 30/07 at 10:12 also coincides with the creation of web.page_views_tmp.

My guess is that this is happening in one of the SqlRunner processes and is completely normal.

  1. Is the data insertion into Redshift really related to the alarm I see?
    • The SqlRunner is invoking the following playbooks: deduplicate, refined-model-unload, web-model, refined-model-add.
  2. Should I worry and add another Redshift instance?

Thanks, and looking forward to some explanations.

This can often occur when running queries whose memory demands exceed the available memory on the cluster. When that memory is exhausted, Redshift begins spilling intermediate results to disk, which is what you’re seeing here.

You can use the following query (lifted from here) to determine which step in the query plan is causing these issues:

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = [queryid]
order by workmem desc;

If is_diskbased is set to ‘t’, that step in the query plan had to spill to disk. After the query has computed its result, Redshift removes the temporary/intermediate tables involved, which is why you’ll see this cyclical pattern of disk usage.
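If you need the query id to plug into that query, something along these lines against stl_query should list the candidate queries and their ids (the time window here is just an example around your alarm, so adjust it to the run you’re investigating):

-- list queries around the alarm window to find the [queryid] above
select query, starttime, trim(substring(querytxt, 1, 60)) as statement
from stl_query
where starttime between '2017-07-30 04:30' and '2017-07-30 06:00'
order by starttime;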


Thanks @mike for your quick response.

If I analyze the queries from 30/07 around the alarm at 05:40, I can see that the creation of the following tables is disk-based (see the query sketch after the list):

  1. 04:56 - CREATE TABLE scratchpad.web_page_context
  2. 05:00 - CREATE TABLE scratchpad.web_events
  3. 05:09 - CREATE TABLE scratchpad.web_events_time
  4. 05:11 - CREATE TABLE scratchpad.web_events_scroll_depth
  5. 05:13 - CREATE TABLE scratchpad.web_ua_parser_context
  6. 05:21 - CREATE TABLE scratchpad.web_timing_context
  7. 05:25 - CREATE TABLE web.page_views_tmp
    05:40 Alarm triggered
  8. 05:43 - CREATE TABLE web.sessions_tmp
  9. 05:45 - CREATE TABLE web.users_tmp
    05:45 Alarm OK
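For reference, a query roughly along these lines surfaces those disk-based steps; the join on stl_query is just to get the statement text, and the time window matches the alarm period above:

-- disk-based steps around the 05:40 alarm, with the originating statement text
select q.query, q.starttime, trim(substring(q.querytxt, 1, 50)) as statement,
       s.step, s.label, s.workmem, s.is_diskbased
from stl_query q
join svl_query_summary s on s.query = q.query
where s.is_diskbased = 't'
  and q.starttime between '2017-07-30 04:50' and '2017-07-30 05:50'
order by q.starttime, s.step;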

I’m currently running a dc1.large Redshift cluster in eu-central-1 (Frankfurt).
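For context, I can also check disk usage per node from inside the cluster with something like the query below (based on stv_partitions); it should roughly track the CloudWatch PercentageDiskSpaceUsed metric:

-- approximate per-node disk usage, comparable to PercentageDiskSpaceUsed
select owner as node,
       sum(used)::float / sum(capacity) * 100 as pct_disk_used
from stv_partitions
group by owner
order by owner;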

If the number of events increases dramatically, will the queries become even more disk-based?
Do you think I should scale up the Redshift instance / increase the number of nodes to 2, or is this something you can live with?

@radubogdan As you scale the number of events, you’ll probably want to move to the incremental model, which drastically improves performance by querying only new data.


Thank you very much @bernardosrulzon. I’ll research and migrate to the incremental model in the near future, since my initial implementation will become problematic.