I want to process the files generated by the S3 Loader, so I’m trying to make sense of the data in them. I’ve struggled a bit to find documentation on the topic. When I found this schema in the Iglu repository, I thought it was all I needed; sadly, the number of columns did not match (128 in the schema, 131 in the TSV files).
I also found this issue, in which all 131 properties are mentioned, and they seem to match the data in the file. The three extra columns not included in the Iglu schema are contexts, unstruct_event and derived_contexts.
Is it just that the Iglu schema is not up to date with those properties? Where can I find the definitive source of truth for the schema of these files? Is there any documentation that I missed for the S3 Loader or Enrich applications where this is stated?
Iglu schemas describe self-describing JSONs (more info here). We don’t have an equivalent for TSV.
To parse these TSV enriched events, you can use our analytics SDK. To use it you will need to import com.snowplowanalytics.snowplow.analytics.scalasdk.Event and parse a line with Event.parse(line). You can then directly access the fields, e.g. event.user_id. The list of fields is available here.
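As a rough illustration (this is a sketch, not the SDK itself), one useful sanity check before handing a line to Event.parse is that it actually has 131 tab-separated fields. The object and method names below are made up for the example:

```scala
// Sketch only: a raw enriched TSV line carries 131 tab-separated fields.
// With the Analytics SDK you would instead call Event.parse(line) and then
// access typed fields such as event.user_id.
object EnrichedTsvCheck {
  val ExpectedFields = 131

  def fieldCount(line: String): Int =
    line.split("\t", -1).length // limit -1 keeps trailing empty fields

  def main(args: Array[String]): Unit = {
    // Hypothetical line: 131 empty fields joined by tabs
    val line = List.fill(ExpectedFields)("").mkString("\t")
    println(fieldCount(line)) // 131
  }
}
```

Note the split limit of -1: without it, Scala drops trailing empty strings, so a line whose last fields are empty would appear to have fewer columns than it really does.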
That schema reflects the atomic.events table in the database (e.g. Redshift). The three fields you mentioned each correspond to their own (mostly custom) schemas.
unstruct_event is where custom event data goes; tracking a custom event involves creating a schema for it.
contexts is an array housing custom context data. The scenario is similar to events, except that the standard contexts already have their own schemas (e.g. in Iglu Central).
derived_contexts houses data produced by enrichments. Some of these are custom (so you would again create your own schema); some are standard and have schemas in Iglu Central.
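For concreteness, all three columns hold self-describing JSON envelopes: a "schema" URI plus a "data" payload. The sketch below shows the shape for unstruct_event; the outer envelope schema is the standard one from Iglu Central, while the inner com.acme/button_click schema is a hypothetical custom event:

```scala
// Sketch: the shape of the JSON stored in the unstruct_event column.
// The outer envelope uses the standard Iglu Central schema; the inner
// com.acme/button_click schema is a made-up example of a custom event.
object SelfDescribingSketch {
  val unstructEvent: String =
    """{
      |  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
      |  "data": {
      |    "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
      |    "data": { "buttonId": "checkout" }
      |  }
      |}""".stripMargin

  // contexts and derived_contexts follow the same pattern, except that
  // their inner "data" is an array of self-describing JSONs, one per
  // attached context.
  def main(args: Array[String]): Unit =
    println(unstructEvent.contains("iglu:"))
}
```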