I want to process the files generated by the S3 Loader, so I’m trying to make sense of the data in them. I’ve struggled a bit to find documentation on the topic. When I found this schema in the Iglu repository, I thought it was all I needed; sadly, the number of columns did not match (128 in the schema, 131 in the TSV files).
I also found this issue, in which all 131 properties are mentioned, and they seem to match the data in the file. The three extra columns not included in the Iglu schema are contexts, unstruct_event and derived_contexts.
Is it just that the Iglu schema is not up to date with those properties? Where can I find the definitive source of truth for the schema of these files? Is there any documentation that I missed for the S3 Loader or Enrich applications where this is stated?
Iglu schemas describe self-describing JSONs (more info here). We don’t have an equivalent for TSV.
To parse these TSV enriched events, you can use our analytics SDK. To use it you will need to import com.snowplowanalytics.snowplow.analytics.scalasdk.Event and parse a line with Event.parse(line). You can then directly access the fields, e.g. event.user_id. The list of fields is available here.
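As a rough illustration (this is a sketch, not the SDK itself), one useful sanity check before handing a line to Event.parse is that it actually has 131 tab-separated fields. The object and method names below are made up for the example:

```scala
// Sketch only: a raw enriched TSV line carries 131 tab-separated fields.
// With the Analytics SDK you would instead call Event.parse(line) and then
// access typed fields such as event.user_id.
object EnrichedTsvCheck {
  val ExpectedFields = 131

  def fieldCount(line: String): Int =
    line.split("\t", -1).length // limit -1 keeps trailing empty fields

  def main(args: Array[String]): Unit = {
    // Hypothetical line: 131 empty fields joined by tabs
    val line = List.fill(ExpectedFields)("").mkString("\t")
    println(fieldCount(line)) // 131
  }
}
```

Note the split limit of -1: without it, Scala drops trailing empty strings, so a line whose last fields are empty would appear to have fewer columns than it really does.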
That schema reflects the atomic.events table in the database (e.g. Redshift). The three fields you mentioned each correspond to their own (mostly custom) schemas.
unstruct_event is where custom event data goes; tracking a custom event involves creating a schema for it.
contexts is an array housing custom context data. The scenario is similar to events, except that the standard contexts already have their own schemas (e.g. in Iglu Central).
derived_contexts houses data produced by enrichments. Some of these are custom (so you would again create your own schema); some are standard and have schemas in Iglu Central.
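For concreteness, all three columns hold self-describing JSON envelopes: a "schema" URI plus a "data" payload. The sketch below shows the shape for unstruct_event; the outer envelope schema is the standard one from Iglu Central, while the inner com.acme/button_click schema is a hypothetical custom event:

```scala
// Sketch: the shape of the JSON stored in the unstruct_event column.
// The outer envelope uses the standard Iglu Central schema; the inner
// com.acme/button_click schema is a made-up example of a custom event.
object SelfDescribingSketch {
  val unstructEvent: String =
    """{
      |  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
      |  "data": {
      |    "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
      |    "data": { "buttonId": "checkout" }
      |  }
      |}""".stripMargin

  // contexts and derived_contexts follow the same pattern, except that
  // their inner "data" is an array of self-describing JSONs, one per
  // attached context.
  def main(args: Array[String]): Unit =
    println(unstructEvent.contains("iglu:"))
}
```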