I keep reading about how you can drip-feed data into Redshift.
"In 2014, Snowplow added an Amazon Kinesis stream to its service to capture and store data from client systems. The data is then drip-fed into Redshift for continuous real-time processing. " https://aws.amazon.com/solutions/case-studies/snowplow/
Yet I can’t find any documentation on this; at the moment I can only find Elasticsearch, so I am a bit confused. Can you feed directly into Redshift from the Kinesis stream or not?
Long story short: you can set up a real-time pipeline with Amazon Kinesis and the Snowplow S3 Loader dumping enriched data to S3. Then you can set up EmrEtlRunner to only shred the enriched data and load it straight into Redshift.
Thank you, I now have the S3 Loader running, which is taking the stream from Kinesis to a gzipped S3 file.
Just fighting with EmrEtlRunner as it’s not playing ball. Well, it’s running, it’s just not picking up any files to push to Redshift, and I am not sure how it knows where to pick the files up from.
Yes, you do. Shredding is a step dedicated to preparing enriched data (which can be considered a canonical format) for loading into Redshift. As described in the R102 release notes, you need to add a new enriched.stream bucket to your config.yml pointing to the Kinesis output dir. EmrEtlRunner will stage this data for shredding.
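For reference, the relevant part of config.yml would look roughly like this (the s3://my-snowplow/... paths are placeholders for your own buckets, and the rest of the buckets section stays as it was):

    aws:
      s3:
        buckets:
          enriched:
            good: s3://my-snowplow/enriched/good        # staging location for the shred step
            archive: s3://my-snowplow/enriched/archive
            stream: s3://my-snowplow/enriched/stream    # point this at the S3 Loader (Kinesis) output dir
          shredded:
            good: s3://my-snowplow/shredded/good        # shredded output that gets loaded into Redshift
            bad: s3://my-snowplow/shredded/bad
            archive: s3://my-snowplow/shredded/archive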
I have a stream-collector pushing data to S3; this works.
I have a stream-enricher pulling data from the above stream and pushing data to a new stream.
I have snowplow-s3-loader-0.6.0.jar pulling data from the above stream and gzipping the data to an S3 bucket.
This is the step I am unclear on now.
I now need to run the snowplow-emr-etl-runner. I have a targets folder with the Redshift database target.
So I need to shred the data.
I have added - stream: s3://pf-dol-my-out-bucket/enriched/good to my config, as per the release notes.
Should the ETL runner pick up the zipped enriched files and then shred them, storing them in the S3 shredded.good bucket, which will then be pushed up to Redshift?
./snowplow-emr-etl-runner run -x staging,enrich,elasticsearch,archive_raw,analyze,archive_enriched,archive_shredded --config config.yml --resolver iglu_resolver.json --target targets
Once you have added the enriched.stream bucket, you don’t need to explicitly skip the enrich step anymore: in “Stream Enrich mode”, EmrEtlRunner simply “forgets” about Spark Enrich. I actually wouldn’t recommend skipping any of those steps unless you fully understand what they mean.
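So in Stream Enrich mode your run can be reduced to something like the following (the same flags as the command you posted, just without the -x skip list):

    ./snowplow-emr-etl-runner run --config config.yml --resolver iglu_resolver.json --target targets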
One more question: is there a way of seeing what the errors are for the Redshift storage option? At the moment it is failing on that step but I am not sure why.
EmrEtlRunner should fetch RDB Loader’s output and print it to stdout. If it didn’t fetch it, then I doubt it really connected, and there was probably a configuration error. You can also check RDB Loader’s stdout in the EMR console. Most likely it is something like a non-existent Redshift table or a JSONPaths mismatch.
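If it does turn out to be the missing-table case, a quick sanity check is to connect with psql and look for the standard atomic.events table (this assumes you deployed Snowplow’s standard Redshift DDL; the host, user and database names below are placeholders for your own cluster):

    psql -h my-cluster.abc123.eu-west-1.redshift.amazonaws.com -p 5439 -U loader -d snowplow \
      -c "SELECT COUNT(*) FROM atomic.events;"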