My current pipeline for page_view events looks like this:
JavaScript Tracker -> Kinesis Collector -> Kinesis Enricher -> Lambda Function -> S3
I have used Athena to connect to the S3 files to query the data.
Now, for unstructured events, I followed this link. According to it, we need to use a Redshift cluster/database to store the events.
All our data analysis happens in BigQuery. For the data that is already in S3 (other than the data collected through Snowplow), we have jobs that sync it to BigQuery. We want to reuse those existing jobs to move the Snowplow data from S3 to BigQuery so that our data analysis team can continue working with the new data captured by Snowplow.
I have a couple of questions related to this.
Is it mandatory to use Redshift for unstructured events?
Can the unstructured events be stored in S3? If yes, how can I link event_id and root_id, as is done with FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)?
If not, do I need to use Redshift for structured (page_view) events as well, so that I can link unstructured and structured events with a FOREIGN KEY constraint?
If there is any other way to improve the pipeline, let me know. The end goal is to move the data to BigQuery, as we want a hybrid (AWS and GCP) cloud solution.
@raghavn, why do you keep mentioning Redshift if you load the data into BigQuery? Unstructured events can be stored in BigQuery, Google Cloud Storage, Snowflake, S3, or Redshift, depending on your architecture and circumstances.
Redshift is different from the rest in that unstructured events are stored in dedicated child tables. All the other storage targets keep the unstructured event in a dedicated field/column of the atomic events table/record.
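To illustrate, in those other targets the value held in that field is the self-describing JSON sent by the tracker, roughly of the shape below (the vendor, event name and inner fields are only placeholders):

{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:com.example/my_custom_event/jsonschema/1-0-0",
    "data": {
      "category": "signup",
      "value": 42
    }
  }
}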
Can the unstructured events be stored in S3?
Sure. If you are running your pipeline on AWS, you can have your events in both the enriched and the shredded format. Both can be queried with Athena. The shredded format is typically used when you also need to load data into Redshift. Otherwise, we would recommend either using Athena or the Snowplow Analytics SDKs to work on the enriched data.
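For example, once an Athena table has been defined over the enriched files, the custom event payload is just another column you can select (a rough sketch; the database and table names are placeholders):

-- Rough sketch, assuming an Athena table named atomic.events has been
-- created over the enriched files; names are placeholders.
SELECT
  event_id,
  collector_tstamp,
  unstruct_event   -- self-describing JSON for custom events
FROM atomic.events
WHERE event = 'unstruct'
LIMIT 100;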
As per your response, the unstructured events are stored in a dedicated field/column of the atomic events table/record. If that is the case, do I need to create my own schema to track the unstructured events? I presume I do not have to create SQL and JSON schemas as I am not using Redshift or Postgres. The unstructured value will be stored as a JSON field.
We created our custom schema and uploaded it to S3. In the JavaScript tracker, I used the custom schema name ("schema": "example/unstructured_events/jsonschema/1-0-0"). In the enrichments folder, I added my custom JSON file.
The actual path of the schema file is https://example-schemas.s3.region.amazonaws.com/schemas/example/unstructured_events/jsonschema/1-0-0
In the JavaScript tracker and in the enricher JSON file, we have given iglu:example/unstructured_events/jsonschema/1-0-0.
We are getting the error below in S3.
"errors":[{"level":"error","message":"error: Could not find schema with key iglu:example/unstructured_events/jsonschema/1-0-0 in any repository, tried:\n level: \"error\"\n repositories: [\"Iglu Central - GCP Mirror [HTTP]\",\"Iglu Central [HTTP]\",\"Iglu Client Embedded [embedded]\"]\n"}],"failure_tstamp":"2020-01-20T13:46:22.833Z"}
My questions related to this are: 1. Do we need to give the actual path of the JSON schema in the tracker and in the enricher JSON file? 2. Do we need to make the schema folder public?
If you provide an example for creating the custom schema, that would be really useful.
We have gone through the blogs and the technical documentation for Iglu and Snowplow, but we are still struggling to understand how the pipeline works with custom schemas.
I presume I do not have to create SQL and JSON schemas as I am not using Redshift or Postgres. The unstructured value will be stored as a JSON field.
If you’re not loading into Redshift, you don’t need the SQL DDL or the JSONPaths files. You still need the JSON schema itself, though.
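For reference, a minimal self-describing JSON schema for your event could look something like the one below. The vendor and name follow the Iglu URI you used, but the properties are placeholders to replace with your own fields:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Example schema for a custom unstructured event",
  "self": {
    "vendor": "example",
    "name": "unstructured_events",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "category": {
      "type": "string",
      "maxLength": 255
    },
    "value": {
      "type": ["number", "null"]
    }
  },
  "required": ["category"],
  "additionalProperties": false
}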
1. Do we need to give the actual path of the JSON schema in the tracker and in the enricher JSON file?
I’m not sure I follow the question exactly, but for each custom event you track, you need to supply the Iglu URI of the schema. The format of the example you’ve provided looks correct (iglu:example/unstructured_events/jsonschema/1-0-0).
As for the enrichment step of the pipeline, on that side you just need to make sure your Iglu resolver configuration contains the location of your schema repository (in addition to Iglu Central, which is used for the standard events).
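As a rough example, the resolver configuration could look like this, with your bucket added as a repository alongside Iglu Central. The repository name and priority values are placeholders, and the URI is the root of your bucket rather than the full /schemas/... path, since the resolver appends that part itself:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Example schema registry",
        "priority": 5,
        "vendorPrefixes": ["example"],
        "connection": {
          "http": {
            "uri": "https://example-schemas.s3.region.amazonaws.com"
          }
        }
      }
    ]
  }
}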
2. Do we need to make the schema folder public?
If you’re using S3 to host your schemas, then yes. Per-bucket authentication for S3 registries isn’t possible at present. If you require a private repository, then Iglu Server allows you to configure private, API-key-based access.
You’re pointing your tracking at "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0", but your schema is at iglu:example/unstructured_events/jsonschema/1-0-0.
Tracking unstructured events has nothing to do with enrichments. The enrichment JSON files are configuration files for the enrich step of the pipeline.
All you need to do to track a custom event is upload the schema to Iglu, then point your tracking to that schema.
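With the JavaScript tracker that looks something like the snippet below, assuming the tracker was initialised under the global name snowplow; the data fields are placeholders and must validate against your schema:

// Track a custom self-describing (unstructured) event against your own schema.
// The data fields here are placeholders; they must validate against the schema.
window.snowplow('trackSelfDescribingEvent', {
  schema: 'iglu:example/unstructured_events/jsonschema/1-0-0',
  data: {
    category: 'signup',
    value: 42
  }
});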
I’m unsure what the behaviour of the pipeline is if you have an enrichment configuration file that doesn’t correspond to an actual enrichment, so the best thing to do is remove it, in case it causes issues.
I recommend spinning up a Snowplow Mini instance to give yourself a faster feedback loop on the process of setting up custom events - we normally use Mini to test and debug schemas and tracking before taking it to prod.
Finally, I would recommend designing your events as one schema per event type rather than a single schema that covers all of them.
Thanks for the updates.
It worked. I realized that there is no need to keep any JSON for the unstructured schema in the enrichments folder.
Meanwhile, I will use Snowplow Mini for testing purposes.
About designing one schema per event type, I agree with you. I will write a separate schema for each event type.
I could successfully import the data from S3 to BigQuery using BigQuery transfer.
When I use my custom unstruct schema, the event is stored in the unstruct_events column as JSON. Is there any way to flatten the unstruct events in BigQuery?
Our entire analysis platform is on BigQuery.
The current pipeline now looks like this:
JavaScript Tracker -> Kinesis Collector -> Kinesis Enricher -> Lambda Function -> S3 -> BigQuery
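In the meantime, I can pull individual fields out of the JSON with BigQuery's JSON functions, roughly along these lines (a rough sketch; the table, column and field names come from my own setup and schema):

-- Rough sketch: extract fields from the self-describing JSON stored in the
-- unstruct events column after the S3 -> BigQuery transfer.
-- Table, column and field names are placeholders from my own setup.
SELECT
  event_id,
  collector_tstamp,
  JSON_EXTRACT_SCALAR(unstruct_events, '$.data.data.category') AS category,
  JSON_EXTRACT_SCALAR(unstruct_events, '$.data.data.value')    AS value
FROM `my_project.snowplow.events`
WHERE event = 'unstruct';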