I can imagine most of the production Snowplow / Redshift users here had similar thoughts about this topic so I thought I’d see if I can learn from anyone’s experience.
We currently run Redshift on a 4 dc1.large cluster. We are currently bound by disk space (so we archive fairly aggressively) and by CPU (heavy queries will max out the CPU on all instances).
I think we will need about 3-4 times more disk space than we have now and some more CPU capcity.
The thing is, giving the options and the cost of the instances, I’m not sure if we’re better off moving to say, 12-16 dc1.large instances or maybe 3-4 ds2.xlarge or even a single dc1.8xlarge. All of which would be approximately the same in terms of pricing. I know Redshift is meant to be linearly scalable but there’s always a tradeoffs between having more, weaker machines and having less stronger machines.
I also did those calculations. My gut is that there’s a cost which is hard to account for of moving data between nodes when, for example, joined DIST keys don’t match.
This is also where throughput comes into play. dc1.large has 0.20GB/s “I/O” and that doesn’t scale with adding more nodes. In way that’s a limiting factor with scaling a cluster.
Just to make sure - I’m not trying to contradict your advice in any way. Just adding some more thoughts to the mix.
I’ve seen various ways of handling this but it really depends on how your current implementation works so there’s a few options here.
A few questions first up.
What sort of queries are maxing out the CPUs on the cluster currently? (e.g., are they mostly adhoc or can you cache partial results?)
Is the execution load balanced across all 8 vCPUs for these queries?
Are the columnar compressions the same as the ones in the Snowplow definition?
Depending on how much data you have in Redshift it’s not uncommon to have another source (other than your primary query cluster) as colder storage (S3 or a ds2 cluster) and a hot storage for insertion and queries that require more recent data, depending on the data size that could be a dc1 cluster.