Just getting started and learning about Snowplow. A few noob questions I was hoping to get some help with:
1- Is it possible to have a minimal pipeline of collector -> S3 loader? Or is the enrich step required?
2- If I do not need to enrich the data, can it also be saved in Redshift right after collection?
It’s possible but not recommended. The raw format is tricky - though not impossible - to work with.
If Snowplow data is a chocolate cake then the raw format is a couple of farm-fresh eggs, some butter, milk, cocoa powder and flour in a mixing bowl. You can certainly eat this as-is but you’re going to have a terrible time. You’ll come to question whether this was a good idea at all.
Meanwhile the enriched data is a mixture of these ingredients lovingly baked at 160C (fan forced), rested until cool and iced with a buttercream frosting. Unlike the raw ingredients, this cake doesn’t taste like crushing regret. The thunder and lightning caused by consuming raw Apache Thrift ingredients recedes and a beam of sunlight strikes the cake illuminating the room. You notice that a solitary tear begins rolling down your face - but this time it’s from joy rather than the uncontrollable shaking sobs of the existential crisis induced by dealing with Elephant Bird encoded Apache Thrift records serialized to Protocol Buffers compressed as LZO.
The raw format was designed to feed into the enricher, not to serve as a data structure for analysis. Most of the interesting stuff in Snowplow really happens after collection. For example, the enricher is responsible for (among other things):
- Splitting out multiple events (POST) into single events
- Validating the data within a payload
- Running base enrichments (parsing URLs into their components, device detection, UTM attribution, referrer parsing, etc.)
- Running custom enrichments
- Transforming the raw payload into something that is more consumable by machines and humans (such as tab-separated values and JSON)
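To make a few of those steps concrete, here is a rough Python sketch. It is not the actual enricher, which does far more (including schema validation, omitted here). It assumes a POST body shaped like the tracker's `payload_data` JSON, with a `data` array and a `url` field on each event. The function names (`split_events`, `parse_page_url`, `to_tsv`) are made up for illustration, and the output keys only mimic a handful of enriched columns.

```python
import json
from urllib.parse import urlsplit, parse_qs


def split_events(post_body: str) -> list:
    """Split one POST body that carries several events into single events."""
    payload = json.loads(post_body)
    return list(payload.get("data", []))


def parse_page_url(event: dict) -> dict:
    """Add URL components and UTM parameters to a copy of the event."""
    parts = urlsplit(event.get("url", ""))
    query = parse_qs(parts.query)

    def first(key):
        # parse_qs returns a list per key; keep the first value or None
        return query.get(key, [None])[0]

    return {
        **event,
        "page_urlscheme": parts.scheme,
        "page_urlhost": parts.hostname,
        "page_urlpath": parts.path,
        "page_urlquery": parts.query,
        "mkt_source": first("utm_source"),
        "mkt_medium": first("utm_medium"),
        "mkt_campaign": first("utm_campaign"),
    }


def to_tsv(event: dict, columns: list) -> str:
    """Flatten selected fields into one tab-separated line (None becomes empty)."""
    return "\t".join("" if event.get(c) is None else str(event[c]) for c in columns)


# Hypothetical POST body with two page views batched together
sample_body = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
    "data": [
        {"e": "pv", "url": "https://example.com/pricing?utm_source=newsletter&utm_medium=email"},
        {"e": "pv", "url": "https://example.com/about"},
    ],
})

for raw_event in split_events(sample_body):
    enriched = parse_page_url(raw_event)
    print(to_tsv(enriched, ["e", "page_urlhost", "page_urlpath", "mkt_source", "mkt_medium"]))
```

Even this toy version hints at why the raw payload is awkward for analysis: none of these fields exist as separate columns until something downstream of the collector derives them.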
From a technical standpoint, yes (with some modification); from an execution standpoint, I’d implore you not to.
What a wonderfully delicious description of Snowplow data.
Thank you for the delicious explanation. I will leave the stream coming out of the collector alone and focus on enrichment.