Monitoring snowplow

grvregmi · May 7, 2018, 5:23pm

I am setting up snowplow to track user events from web and mobile apps in my platform. I got it up and running and it is now logging events to Redshift.

Web/mobile -> Clojure collector -> S3 -> EMREtl -> Redshift

Now, I want to monitor if there are any errors getting event data to the collector or if the EMR ran as it was intended to and dumped the data into redshift. What do you guys use to log any errors or just to monitor the overall health of the pipeline.

Thanks in advance for your input.

Lars · May 7, 2018, 6:10pm

Can you share a bit more about your ETL pipeline?

Any orchestration tool that you’re using, like Airflow or Luigi?
do you have a separate user for your ETL in Redshift? (i.e. no shared login)
how frequent are you dumping data into Redshift?
do you have a separate staging schema for those data dumps in Redshift?
have you set up workload management in Redshift?

grvregmi · May 7, 2018, 7:22pm

Hi Lars,

Thanks Lars for replying. Please find the answers below

No orchestration tools yet.
No separate user
Every two hours
No separate staging schema
No workload management set up yet

Lars · May 7, 2018, 8:16pm

ok, thanks for that additional detail.

If you’re just playing around a bit, and not giving anybody direct SQL access to your data, you’re fine.

But if you increase your activity on the cluster, maybe even a bit more mission-critical (e.g. embedded charts, or a scheduled report), then there’s a bit of upfront work you can do now, and it will help you in the long run.

I wrote that up in 3 Things to Avoid When Setting Up an Amazon Redshift Cluster

tl;dr

give your ETL pipeline a separate user, e.g. ‘snowplow_etl’
add a “load” queue in your WLM, assign your ETL user to that queue
create a separate schema for your data loads

By following that set-up, you’ll have more visibility into your ETL, which will make isolating any errors / exceptions straightforward. By having a separate “raw schema”, you’ll protect your raw data (or “atomic data”, as the Snowplow teams like to say) from ad-hoc use. If your end-users start building queries straight on top of your raw data, it’ll be hard to change those tables later.

For your data loads, there are also some best practices.

For example it’s important to select a single timestamp format and enforce it across all tables in your schemas. Also when possible validate simple strings like emails, URLs, or other IDs to avoid needing to do that later.

Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Primary keys are not enforced! De-duplication is your responsibility.

Hope that helps! Ping me here if you have more questions, happy to help.

Topic		Replies	Views
Not all events going into Redshift using EmrEtlRunner Redshift	3	1801	May 8, 2016
How to transform and load data from s3 into redshift Troubleshooting	3	536	December 11, 2023
Data loading error to redshift Storage targets	7	8767	November 1, 2019
SendGrid+Snowplow+AWS S3&Redshift For engineers	30	4630	October 30, 2019
How do is store snowplow events to both s3 and redshift For engineers	1	1650	December 2, 2018

Monitoring snowplow

Related topics