Hi, how exactly do we load Snowplow's data from the S3 enriched folder into PostgreSQL?
Based on the Setup Guide, loading to PostgreSQL is automated when running snowplow-emr-etl-runner by setting the --target argument. That works for raw data that is going to be enriched and eventually loaded into PostgreSQL.
But what about enriched data that already exists in S3? Can we run just the load-to-DB step? I read something about using storage-loader-runner, but that method is deprecated.
I think what should work for you is the following process:
Copy all enriched folders you want to load from enriched.archive to enriched.good
Run EmrEtlRunner, starting from RDB Shredder step
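The two steps above might look roughly like this. This is a dry-run sketch that only prints the commands; the bucket name, run folder, and EmrEtlRunner flags are assumptions, so substitute the real paths from your config.yml:

```shell
# Dry-run sketch: prints the commands instead of executing them.
BUCKET="s3://my-snowplow-bucket"   # hypothetical -- use the buckets from config.yml
RUN="run=2018-11-01-00-00-00"      # hypothetical archived run folder to reload

# Step 1: copy the archived enriched run folder back into enriched.good
echo "aws s3 cp --recursive $BUCKET/enriched/archive/$RUN/ $BUCKET/enriched/good/$RUN/"

# Step 2: re-run EmrEtlRunner, resuming from the RDB Shredder step
echo "./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --resume-from shred"
```

Repeat step 1 for every archived run folder you want reloaded before kicking off step 2.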
Bear in mind, however, that Postgres support is very limited and considered experimental at the moment. It does not support any shredded entities (e.g. contexts and self-describing events), only atomic data. It also uses a fairly inefficient loading process that copies the data to a local machine first, so you need to make sure none of your enriched folders exceeds the EMR master's free disk space. I'd refrain from using PostgreSQL at the moment and go with Redshift instead.
The good news, though, is that we have proper Postgres support in mind. No ETA yet, but my hope is that it will be available in 2019.
Thanks @anton for responding. This is very helpful.
If I am understanding it right, do we run EmrEtlRunner with the options --target postgresql.json --resume-from shred?
Although it is deprecated, can we still use storage-loader-runner?
I don't actually know much about shredding. I'm probably skipping this step, but I still have my data in TSV format in enriched.archive. May I know what this step is for?
Thanks for the advice; however, Redshift's cost is beyond justification at this stage, so we opted for PostgreSQL. As long as the event data (which I assume is in the format defined here https://github.com/snowplow/snowplow/wiki/canonical-event-model and which I also see in enriched.archive) can be stored properly in PostgreSQL, I am good.
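For the atomic data that is supported, the Postgres load essentially boils down to a bulk COPY of the tab-separated enriched file into the atomic.events table. A dry-run sketch that only prints the psql command; the part-file name, connection string, and COPY options are assumptions, not the exact invocation the loader uses:

```shell
# Dry-run sketch: print a psql command that would bulk-load one enriched
# TSV part file into atomic.events. File name and connection string are
# hypothetical -- adjust to your own database and run folder.
PART_FILE="part-00000"
COPY_SQL="\\copy atomic.events FROM '$PART_FILE' WITH DELIMITER E'\\t' NULL ''"

echo "psql postgresql://snowplow@localhost/snowplow -c \"$COPY_SQL\""
```

This only works because the column order of the canonical enriched TSV matches the atomic.events table definition, so the file can be copied in positionally.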
I am very glad that more proper PostgreSQL support is on Snowplow's roadmap.
Hi, if I re-run using --resume-from shred, I get an error at the archive_raw step. This is the error:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-20-168.ap-southeast-1.compute.internal:8020/tmp/2502e778-27bc-4379-bee5-79d931acbcd5/files