I want to process the files generated by the S3 Loader, so I’m trying to make sense of the data in them. I’ve struggled a bit to find documentation on the topic. When I found this schema in the Iglu repository, I thought it was all I needed. Sadly, the number of columns was not the same (128 in the schema, 131 in the TSV files).
I also found this issue in which all the 131 properties are mentioned and it seems to match the data in the file. The 3 extra columns that are not included in the iglu schema are
Is it just that the Iglu schema has not been updated to reflect those properties? Where can I find the definitive source of truth for the file schema? Is there any documentation that I missed for the S3 Loader or Enrich applications where this is stated?
Hi @danielsepulvedab, welcome to the Snowplow community!
Iglu schemas describe self-describing JSONs (more info here). We don’t have an equivalent for TSV.
To parse these TSV enriched events, you can use our analytics SDK. To use it you will need to import com.snowplowanalytics.snowplow.analytics.scalasdk.Event and parse a line with Event.parse(line). You can then directly access the fields, e.g. event.user_id. The list of fields is available here.
There is also a Python version.
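If you just want to eyeball a raw file before wiring in an SDK, a minimal stdlib-only sketch along these lines may help. It is not the SDK's API: `split_enriched_line` and `EXPECTED_COLUMNS` are hypothetical names, and the 131 comes from the column count reported in this thread, so adjust it if your pipeline version differs.

```python
from typing import List, Optional

# Column count reported in this thread for the enriched TSV files (assumption:
# your files have the same layout; adjust if they do not).
EXPECTED_COLUMNS = 131


def split_enriched_line(line: str, expected: int = EXPECTED_COLUMNS) -> List[Optional[str]]:
    """Split one enriched-event TSV line into columns; empty strings become None."""
    # Strip only the line terminator, so trailing empty columns (trailing tabs) survive.
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != expected:
        raise ValueError(f"expected {expected} columns, got {len(columns)}")
    return [value if value != "" else None for value in columns]
```

For real processing the SDKs are still the better option, since they also attach field names and types to each column.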
Please do not hesitate to ask if you have more questions!
That schema reflects the atomic.events table in the database (e.g. Redshift). The three fields you mentioned each correspond to their own (mostly custom) schemas.
unstruct_event is where custom event data goes. Tracking a custom event involves creating a schema for it.
contexts is an array housing custom context data. It is a similar scenario to events, except that the standard contexts already have their own schemas (which are in Iglu Central).
derived_contexts houses data that came from an enrichment process - some of these are custom (and so you would again create your own schema), some are standard and have schemas in Iglu Central.
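To make the shape of those three columns concrete, here is a rough sketch of unwrapping them with the Python standard library. It assumes each column holds a self-describing JSON envelope with `schema` and `data` keys, where `data` is a list of self-describing JSONs for contexts and derived_contexts and a single one for unstruct_event. `unwrap_self_describing` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Any, List, Optional, Tuple


def unwrap_self_describing(raw: Optional[str]) -> List[Tuple[str, Any]]:
    """Return (schema URI, data) pairs from a contexts, derived_contexts
    or unstruct_event column. An empty or missing column yields []."""
    if not raw:
        return []
    envelope = json.loads(raw)
    inner = envelope["data"]
    # Assumption: contexts / derived_contexts wrap a list of entities,
    # while unstruct_event wraps a single entity.
    items = inner if isinstance(inner, list) else [inner]
    return [(item["schema"], item["data"]) for item in items]
```

Each returned schema URI tells you which schema (custom or from Iglu Central) describes the accompanying data, which is why these columns have no fixed structure of their own.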
As I type, I noticed that BenB has responded to your use case. I’ll just add that there are also JS and .NET Analytics SDKs!
Thanks for the answers! I’ll check the SDKs