EmrEtlRunner: sinking shredded data into an S3 bucket


Our Snowplow installation is going to run short of storage on the Redshift cluster. Adding another node would solve the problem, but only for now.

I’ve done the following to test out Redshift Spectrum -

  1. UNLOAD 3 days of atomic.* tables from Redshift into S3 buckets (partitioned by date, ordered by event_id, etc.)
  2. Run SQL Runner batch jobs against this data
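
For reference, step 1 can be sketched roughly as below. This is a minimal illustration, not the exact statements I ran: the bucket prefix, the IAM role ARN, the dt=YYYY-MM-DD folder layout and the helper name build_unload are all hypothetical, and it assumes atomic.events is filtered on collector_tstamp and ordered by event_id. It only builds the UNLOAD statement text, one statement per day, so each day lands under its own S3 prefix.

```python
from datetime import date, timedelta


def build_unload(table, tstamp_col, order_col, day, s3_prefix, iam_role):
    """Build a Redshift UNLOAD statement for one day of one table.

    All names passed in (bucket prefix, IAM role, dt= layout) are
    illustrative assumptions, not part of any Snowplow convention.
    """
    next_day = day + timedelta(days=1)
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {tstamp_col} >= '{day.isoformat()}' "
        f"AND {tstamp_col} < '{next_day.isoformat()}' "
        f"ORDER BY {order_col}"
    )
    # UNLOAD takes the query as a quoted string, so single quotes
    # inside the query have to be doubled.
    escaped = query.replace("'", "''")
    # One prefix per day; "part_" is the file name prefix Redshift appends to.
    target = f"{s3_prefix.rstrip('/')}/{table}/dt={day.isoformat()}/part_"
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{target}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"GZIP DELIMITER '|' ESCAPE;"
    )


def build_unloads(table, tstamp_col, order_col, start, days, s3_prefix, iam_role):
    """Build one UNLOAD statement per day, starting at `start`."""
    return [
        build_unload(table, tstamp_col, order_col,
                     start + timedelta(days=i), s3_prefix, iam_role)
        for i in range(days)
    ]


if __name__ == "__main__":
    # Hypothetical bucket and role, three days of atomic.events.
    for stmt in build_unloads(
        "atomic.events", "collector_tstamp", "event_id",
        date(2017, 6, 1), 3,
        "s3://my-hypothetical-bucket/spectrum",
        "arn:aws:iam::123456789012:role/hypothetical-unload-role",
    ):
        print(stmt, end="\n\n")
```

The generated statements could then be run through SQL Runner or any Redshift client. The same helper would apply to the other atomic.* tables by passing their own timestamp and ordering columns.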

Based on those test results, I have the following question -
Q. What would be a recommended approach to sink the shredded data from EmrEtlRunner directly into an S3 bucket?

(I realise that the data will currently be in web/archive/shredded/run=x/ etc. My question is more about a production batch run, i.e. how I would be able to customize the partitioning, etc.)

Hope to hear back from you!