We have set up the Snowplow pipeline and it is running fine without any issues. However, we have noticed that valid messages are getting inserted into the BigQuery bad Pub/Sub topic. We would like to know the possible reasons for BQ bad events. What types of events don't get processed by the BQ Loader? It would be great if you could provide a list of valid reasons.
Thanks for the quick response. We have three topics in the BQ Loader config: typesTopic, badRows, and failedInserts. We understand what failedInserts are, but we would like to understand when badRows are produced.
One possible reason is invalid data coming from Pub/Sub (data that cannot be parsed into an enriched event), which should never happen unless there is a deployment mistake. Another, slightly more common cause of bad rows in the BQ Loader is an unavailable Iglu registry, because the Loader needs a schema to transform JSONs (contexts and self-describing events) into BigQuery format (e.g. to understand that a string with format: date-time is actually a TIMESTAMP and not a plain string).
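To make the second case more concrete, here is a minimal Python sketch of the kind of schema-driven type mapping involved; the function and its mapping rules are illustrative assumptions, not the Loader's actual code (which is written in Scala).

```python
# Illustrative sketch only: not the BigQuery Loader's real code.
# It shows why the Iglu schema is needed to pick a BigQuery column type
# for a value that, in the payload itself, is just a string.

def bigquery_type(field_schema: dict) -> str:
    """Map a (simplified) JSON Schema field definition to a BigQuery type."""
    json_type = field_schema.get("type")
    if json_type == "string":
        # The "format" hint from the schema is what turns a plain string
        # into a TIMESTAMP; the payload alone cannot tell us that.
        if field_schema.get("format") == "date-time":
            return "TIMESTAMP"
        return "STRING"
    if json_type == "integer":
        return "INTEGER"
    if json_type == "number":
        return "FLOAT"
    if json_type == "boolean":
        return "BOOLEAN"
    return "STRING"

# With the schema available, a value like "2021-03-01T10:15:00Z" can be
# cast to the right column type:
print(bigquery_type({"type": "string", "format": "date-time"}))  # TIMESTAMP
# Without the schema (e.g. the Iglu registry is unreachable), the Loader
# cannot make this decision at all, so the row goes to the bad topic.
print(bigquery_type({"type": "string"}))                          # STRING
```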
Overall, bad rows coming from the BigQuery Loader are a very rare occurrence, so rare that I'd say most users never see them. But at the same time, from an architecture point of view we cannot remove the topic, because that would mean that if one of the above scenarios happened, the data would be silently dropped.
We're trying to keep the Snowplow pipeline as lossless as possible, and as a result we prefer to keep a topic for bad data even if the chance of generating it is vanishingly small.
Thank you for the clarification. What puzzles me is that even though the Iglu registry is available, the message was processed by the loader but failed validation and was sent to the BQ bad Pub/Sub topic. The message is mentioned below.
By any chance, is there some other software sitting between Beam Enrich and BigQuery Loader?
This event has likely been sent to the enriched topic by something other than Beam Enrich. As the error indicates, the TSV line is missing several columns that are necessary for a Snowplow enriched event: it has 128 columns instead of the expected 131. Namely, it is missing:
app_id - the first column, before pc
event_fingerprint - a hash after your last column (2-0-0)
true_tstamp - a timestamp at the very end of the TSV
All of these columns can be empty (just a \t), but they must be present. It seems something has trimmed these columns because they were empty in the original line produced by Beam Enrich.
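As a minimal sketch of why this matters (the 131-field count comes from the explanation above; the helper function is an illustrative assumption, not part of the Loader), consecutive tabs keep empty fields in place, whereas trimming them changes the field count and makes the line unparseable:

```python
EXPECTED_FIELDS = 131  # width of the canonical Snowplow enriched event TSV

def check_enriched_line(line: str) -> None:
    """Fail if a TSV line does not have the expected number of fields."""
    fields = line.split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )

# An empty field is still a field: app_id is empty, platform is "pc",
# and every remaining column is present but empty.
good = "\t".join([""] + ["pc"] + [""] * (EXPECTED_FIELDS - 2))
check_enriched_line(good)  # passes: 131 fields, most of them empty

# Trimming trailing empty columns silently shrinks the line...
bad = good.rstrip("\t")
try:
    check_enriched_line(bad)
except ValueError as e:
    print(e)  # ...and it can no longer be parsed as an enriched event
```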
Thank you for the clarification. It works fine after fixing the TSV formatting. We understand now that Snowplow is strict about the format of the event. We were under the impression that if a corresponding field was not available, the loader would simply load it as null. Thanks again for clarifying.