I’ve been wondering what the difference is between the CSVs in shredded/archive/run=<date>/atomic-events/ and the events in enriched/archive.
Is shredded/../atomic-events a subset of the atomic events found in enriched/archive, or are there more events in enriched/archive than in shredded/?
Or is the enriched/... folder merely an intermediate step, while the shredded/... folder can be considered the “final” destination that contains all events (from the canonical event model plus [un]structured events), from which the RDB loader eventually takes the events?
@pocin, the (shredded) atomic_events are the canonical events in TSV format that end up in the events table. The enriched events contain, apart from the canonical data, custom (self-describing) events and contexts as well as derived contexts. During shredding, the self-describing (unstruct) data gets shredded out of the enriched event and is kept in the “shredded” bucket in JSON format, ready to be loaded into Redshift. You can see a visualization of this process here (diagram at the bottom): https://github.com/snowplow/snowplow/wiki/StorageLoader.
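To make the split concrete, here is a minimal Python sketch of the idea, not the actual shredder code. The five-column layout, the column names and the `root_id` key are simplifying assumptions for illustration; a real enriched event has far more fields.

```python
import json

# Hypothetical, heavily simplified column layout (an assumption for illustration).
COLUMNS = ["event_id", "collector_tstamp", "event", "unstruct_event", "contexts"]
# Columns assumed to carry self-describing JSON.
JSON_COLUMNS = {"unstruct_event", "contexts"}


def shred(enriched_line):
    """Split one enriched TSV line into a canonical TSV row and a list of JSON documents."""
    values = enriched_line.rstrip("\n").split("\t")
    record = dict(zip(COLUMNS, values))

    # Canonical part: every column that does not carry self-describing JSON.
    atomic_row = "\t".join(record[c] for c in COLUMNS if c not in JSON_COLUMNS)

    # Self-describing part: each JSON payload becomes its own document,
    # tagged with the parent event_id so it can be joined back to the event.
    shredded = []
    for column in COLUMNS:
        if column not in JSON_COLUMNS or not record[column]:
            continue
        payload = json.loads(record[column])
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            shredded.append({
                "root_id": record["event_id"],
                "schema": item["schema"],
                "data": item["data"],
            })
    return atomic_row, shredded
```

Calling `shred` on one line returns the TSV row destined for the events table plus one JSON document per self-describing event or context, which mirrors the two kinds of output described above.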
Shredding is only applicable if you need to load the data into Redshift. The canonical (TSV) data is loaded into the events table with the COPY command, while the self-describing (JSON) data is loaded with COPY using its JSON format option.
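As a rough illustration of the two load styles, the sketch below builds one COPY statement of each kind as a string. The helper names, table names, S3 paths and IAM role are all hypothetical, and the options shown are only the basic Redshift COPY ones; the statements the loader actually issues may include more.

```python
def copy_tsv(table, s3_path, iam_role):
    """Build a Redshift COPY statement for tab-delimited canonical events."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '\\t'"
    )


def copy_json(table, s3_path, jsonpaths, iam_role):
    """Build a Redshift COPY statement for shredded JSON, mapped via a JSONPaths file."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"JSON '{jsonpaths}'"
    )


# Hypothetical values, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/example-redshift-load"
print(copy_tsv("atomic.events", "s3://example-bucket/shredded/atomic-events/", ROLE))
print(copy_json(
    "atomic.com_acme_click_1",
    "s3://example-bucket/shredded/com.acme/click/",
    "s3://example-bucket/jsonpaths/com.acme/click_1.json",
    ROLE,
))
```

The JSONPaths file in the second statement tells Redshift which JSON fields map to which table columns, which is why each self-describing type needs its own table.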