Redshift PercentageDiskSpaceUsed is increasing over 80

Hello,

I’m experiencing some strange spikes in disk space used on Redshift every 6 hours, and I can’t tell for sure what is causing them. I know that no data will be lost if disk space reaches 100% in Redshift, but I want to understand what is causing this and whether I should be alarmed.

The EmrEtlRunner, StorageLoader and SqlRunner run every 6 hours with no problems, but around the time the EMR job finishes, an alarm is triggered in Redshift (PercentageDiskSpaceUsed is over 80).

Redshift alarm history from CloudWatch on 30/07/2017:

EmrEtlRunner history on 30/07/2017:

For example, if I inspect Redshift for the alarm triggered on 30/07 at 05:40, I can see it fires right after the creation of web.page_views_tmp and the loading of page_views_tmp. Is loading all of that data into memory (green bar) causing this spike?


The alarm triggered on 30/07 at 10:12 also coincides with the creation of web.page_views_tmp.

My guess is that this is happening in one of the SqlRunner processes and is completely normal.

  1. Is the data insertion into Redshift really related to the alarm I see?
    • The SqlRunner is invoking the following playbooks: deduplicate, refined-model-unload, web-model, refined-model-add.
  2. Should I worry and add another Redshift instance?

Thanks, and looking forward to some explanations.

This can often occur when running queries whose memory demands exceed the available memory on the cluster. When that memory is exhausted, Redshift begins spilling intermediate results to disk, which is what you’re seeing here.

You can use the following query (lifted from here) to determine which step in the query plan is causing these issues:

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = [queryid]
order by workmem desc;

If is_diskbased is set to ‘t’, that step in the query plan had to spill to disk. After the query has computed its result, Redshift removes the temporary/intermediate tables involved, which is why you’ll see this cyclical pattern of disk usage.
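If you need the query id to plug into that query, something along these lines against stl_query should list the candidate queries and their ids (the time window here is just an example around your alarm, so adjust it to the run you’re investigating):

-- list queries around the alarm window to find the [queryid] above
select query, starttime, trim(substring(querytxt, 1, 60)) as statement
from stl_query
where starttime between '2017-07-30 04:30' and '2017-07-30 06:00'
order by starttime;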


Thanks @mike for your quick response.

If I analyze the queries from 30/07 around the alarm at 05:40, I can see that the creation of the following tables is disk-based (see the query sketch after the list):

  1. 04:56 - CREATE TABLE scratchpad.web_page_context
  2. 05:00 - CREATE TABLE scratchpad.web_events
  3. 05:09 - CREATE TABLE scratchpad.web_events_time
  4. 05:11 - CREATE TABLE scratchpad.web_events_scroll_depth
  5. 05:13 - CREATE TABLE scratchpad.web_ua_parser_context
  6. 05:21 - CREATE TABLE scratchpad.web_timing_context
  7. 05:25 - CREATE TABLE web.page_views_tmp
    05:40 Alarm triggered
  8. 05:43 - CREATE TABLE web.sessions_tmp
  9. 05:45 - CREATE TABLE web.users_tmp
    05:45 Alarm OK
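For reference, a query roughly along these lines surfaces those disk-based steps; the join on stl_query is just to get the statement text, and the time window matches the alarm period above:

-- disk-based steps around the 05:40 alarm, with the originating statement text
select q.query, q.starttime, trim(substring(q.querytxt, 1, 50)) as statement,
       s.step, s.label, s.workmem, s.is_diskbased
from stl_query q
join svl_query_summary s on s.query = q.query
where s.is_diskbased = 't'
  and q.starttime between '2017-07-30 04:50' and '2017-07-30 05:50'
order by q.starttime, s.step;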

I’m currently running a dc1.large Redshift cluster in eu-central-1 (Frankfurt).
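For context, I can also check disk usage per node from inside the cluster with something like the query below (based on stv_partitions); it should roughly track the CloudWatch PercentageDiskSpaceUsed metric:

-- approximate per-node disk usage, comparable to PercentageDiskSpaceUsed
select owner as node,
       sum(used)::float / sum(capacity) * 100 as pct_disk_used
from stv_partitions
group by owner
order by owner;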

If the number of events increases dramatically, will the queries become even more disk-based?
Do you think I should scale up the Redshift instance / increase the number of nodes to 2, or is this something you can live with?

@radubogdan As you scale the number of events, you’ll probably want to move to the incremental model, which drastically improves performance by querying only new data.


Thank you very much @bernardosrulzon. I’ll research and migrate to the incremental model in the near future, since my initial implementation will become problematic.