I am stuck on a seemingly simple issue that I can’t get around no matter what I do.
I have a custom unstructured event that is making it through to the BigQuery events table without throwing any errors, but it is coming in empty. The mutator was able to correctly build the unstructured event column from the schema. The custom event schema was uploaded with igluctl as a public schema to my hosted Iglu registry/repo. The same site is able to track pings, pageviews, and other out-of-the-box schemas.
I am not sure if the two Snowplow debugger Chrome extensions are worthwhile in this case, but I will note the behavior I am seeing with them:
Snowplow Debugger Chrome Extension
With the Snowplow Debugger tool I can see the event with what appears to be the correct data, but the validity checker seems to think it is invalid. The main thing that stands out to me as weird is that the Iglu server URL shown is Iglu Central, even though the schema is stored in my self-hosted Iglu server.
I am not sure how the debugger would be able to check my self-hosted Iglu server, so maybe that is a non-issue…
Snowplow Chrome Extension
This one had some more interesting behavior. While tracking the event with this extension I could, again, see what appeared to be the correct data in the event, but the upper schema checker showed up as unrecognized in the image below. I am not sure why there are two schemas: the top one is my custom event, and the bottom one is the unstruct_event schema with my custom schema in the data field.
As you can see, it is showing up as valid.
That is because I imported the self-hosted Iglu registry that my pipeline uses into the Chrome extension’s schema registry tool and validated against that. Again, I don’t think the trackers can see, or care about, my self-hosted Iglu, but I wanted to validate against it to make sure my enricher sees the right schema.
So I am just kind of wandering around aimlessly, trying to determine why this event, which can populate the field in BigQuery in a janky test environment, just doesn’t push any data through in the semi-prod environment. I have restarted the Iglu server and enrichment server a few times to make sure I am not going insane.
I can’t speak to the first extension (or why it is trying to validate your schema against Iglu Central rather than your custom repo), but the second extension will validate against your schema in your custom Iglu repository.
In a very small set of circumstances it may show ‘valid’ where the pipeline invalidates the row (this is due to slight differences between the JavaScript JSON Schema validation library and the one used in the enrichment pipeline).
Is the event being enriched and appearing in PubSub as expected? Is it perhaps appearing in failed events or somewhere else if the BQ sink cannot insert the row?
Any chance you can share your schema / an example of the payload not sinking and the corresponding BQ column definition?
I am not entirely sure. When I look at the pubsub message I get the feeling that it is being enriched; I say that because it appears similar to the context enrichments that I added. Should the pubsub message contain both of the JSON schemas shown in this picture, i.e. both the top schema, which is my custom event, and the bottom schema, which appears to be the unstruct_event schema with my custom schema in the data field?
I am not sure why the Chrome extension shows two schemas for one event here, but my pubsub message only shows the JSON that contains the unstruct_event schema with my custom schema in the data field.
While there are a few failed events, it appears that most of the events are making it into BigQuery. I am only tracking a single site right now, so it’s easier to determine what is coming through. In the BigQuery table I can see the fields that hold the metadata for my custom event [event, event_vendor, event_name, event_format, event_version], but the record field that should contain the data is all nulls.
Another interesting behavior related to this question: I still have the Postgres atomic and atomic_bad databases being populated from the enriched-topic and bad-1-topic pubsub queues as a backup. The table that was created for this custom event is still receiving all the empty events even though the schema has been modified.
I would have assumed the Postgres loader would create a new table when I uploaded a new version of the schema with different property names and datatypes.
It is a simple schema with all properties as strings and no validation checks.
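For reference, here is a minimal sketch of roughly what it looks like (the property names below are placeholders for the real ones; the vendor/name/version match my iglu URI, iglu:com.newsbreak/event/jsonschema/1-0-1):

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Custom event (placeholder property names)",
  "self": {
    "vendor": "com.newsbreak",
    "name": "event",
    "format": "jsonschema",
    "version": "1-0-1"
  },
  "type": "object",
  "properties": {
    "examplePropertyOne": { "type": "string" },
    "examplePropertyTwo": { "type": "string" }
  },
  "additionalProperties": true
}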
I am not sure where the best spot to pull an example payload would be, but here is the data field of the pubsub message from the enriched topic, exported directly to BigQuery:
The above payload from the pubsub-to-BigQuery export was in a TSV-like format with no column titles; I removed all the empty tabs and transposed it for easier viewing. If you can tell me where to grab the payload that would be the most helpful, I will delete this one and put in the new one.
Yep - the top one will be the validation for the newsbreak event schema, and the bottom will be for the unstruct_event schema (which is a built-in one that the tracker wraps for you).
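To illustrate the wrapping (the property name and value here are placeholders), the tracked payload ends up shaped like this, which is why you see two schemas for the one event:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:com.newsbreak/event/jsonschema/1-0-1",
    "data": {
      "examplePropertyOne": "some value"
    }
  }
}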
Nothing in what you’ve posted looks obviously incorrect. I generally set additionalProperties: false on any schemas, but this is unlikely to cause any issues. It looks like the event is being enriched without any issues, so I’d say it’s something between your enricher pubsub topic and the BigQuery loader / table.
To confirm the behaviour: are you seeing no row at all go into BigQuery, or is a row being sent in but with the unstruct_event column empty (or null)?
They are all coming in empty/null. The overall Snowplow tracked event is fine, but the custom event is empty: "unstruct_event_com_newsbreak_event_1_0_1": {}
The Mutator logs seem to recognize the schema, at least the wrapped one, I think:
2022-09-15T14:58:55.214143308Z [Gax-3] INFO com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main - [{"schema":"iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0","type":"CONTEXTS"},{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","type":"DERIVED_CONTEXTS"},{"schema":"iglu:nl.basjes/yauaa_context/jsonschema/1-0-2","type":"DERIVED_CONTEXTS"},{"schema":"iglu:com.newsbreak/event/jsonschema/1-0-1","type":"UNSTRUCT_EVENT"}]
2022-09-15T14:58:55.215095453Z [ioapp-compute-1] INFO com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main - Received Contexts(CustomContexts) iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0, Contexts(DerivedContexts) iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0, Contexts(DerivedContexts) iglu:nl.basjes/yauaa_context/jsonschema/1-0-2, UnstructEvent iglu:com.newsbreak/event/jsonschema/1-0-1
I am looking into the communication between the enriched topic and the loaders (Postgres and BigQuery).
I will also double-check the iglu_resolver and the HOCON config files associated with those loaders to see if there are any clues there.
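For reference, this is the shape I expect my iglu_resolver.json to have, with my self-hosted registry listed alongside Iglu Central (the URI and apikey below are placeholders, not my real values):

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": { "http": { "uri": "http://iglucentral.com" } }
      },
      {
        "name": "Self-hosted Iglu Server",
        "priority": 1,
        "vendorPrefixes": ["com.newsbreak"],
        "connection": { "http": { "uri": "https://iglu.example.com/api", "apikey": "PLACEHOLDER-API-KEY" } }
      }
    ]
  }
}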