I have set up a real-time Scala Stream Collector > Stream Enrich > S3 Loader pipeline. I moved from a batch pipeline setup that used the Elastic Beanstalk Clojure collector.
I use AWS Athena to query bad events. I used to be able to query the batch pipeline bad events using the Presto functions from_utf8(from_base64(line)).
This gave me a TSV line of raw events that was then easy to deconstruct and parse in SQL queries.
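For reference, this is the sort of query I was running against the batch bad rows (the table name is just illustrative):

```sql
-- decode the base64 payload back into the raw TSV event line
SELECT from_utf8(from_base64(line)) AS raw_event
FROM bad_rows
LIMIT 10;
```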
But since I moved to the Scala Stream Collector, the data in the line field of bad events has become undecipherable, possibly because the line is Thrift-encoded. Is there a way to make it more human-readable and, more importantly, queryable?
I’m not sure this is the reason for your troubles, but the key difference in bad rows for the stream pipeline is that the partition structure changes. The batch pipeline outputs a structure partitioned by run, but since the stream pipeline doesn’t run on a schedule in the same way, that’s not possible.
There’s a recent tutorial here on querying bad rows specifically for the real-time format. Note that if you have a lot of data in bad rows, it’s worth copying a sample to another bucket and querying that at first, to avoid running up charges.
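For illustration, the table definition in that tutorial is along these lines (the bucket path is a placeholder, and it's worth double-checking the field list against your own data):

```sql
-- external table over the bad rows JSON; line holds the raw payload
CREATE EXTERNAL TABLE IF NOT EXISTS bad_rows (
  line STRING,
  errors ARRAY<STRUCT<level: STRING, message: STRING>>,
  failure_tstamp STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<your-bad-rows-sample-bucket>/';
```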
Is it possible that your queries produce an undecipherable output because of this change in format?
line is Thrift-encoded, which makes it a bit of a pain to turn into a human-readable format. It is possible to decode the Thrift record (for example, our Snowplow Chrome Inspector will do it for you), but I don’t think you’ll be able to do this in Athena.
Currently Athena doesn’t support UDFs (though BigQuery does), but once this becomes an option, it will be possible to have a UDF capable of deserialising the Thrift record.
Hey @Colm, I solved this a few weeks ago by adding date-partitioned writing to the S3 Loader. https://github.com/snowplow/snowplow-s3-loader/pull/135
So I am able to query the partitioned bad rows, and I can see the error JSON and related details; the trouble is with the line field specifically.
Thanks @mike for the insight.
I think there is support for writing simple lambda functions in Presto; I might attempt writing one for the Thrift decoding. Alternatively, I think we can load Python UDFs into Redshift Spectrum, something like the sketch below.
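This is a rough, untested sketch of what I have in mind for the Redshift route; it assumes the thrift package and generated CollectorPayload bindings have been loaded with CREATE LIBRARY, and the module name is made up:

```sql
CREATE OR REPLACE FUNCTION decode_collector_payload(line VARCHAR(MAX))
RETURNS VARCHAR(MAX)
STABLE
AS $$
    # decode the base64-wrapped Thrift record back into a payload object
    import base64
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TTransport
    from collector_payload.ttypes import CollectorPayload  # hypothetical module name

    transport = TTransport.TMemoryBuffer(base64.b64decode(line))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    payload = CollectorPayload()
    payload.read(protocol)
    # GET events carry their data in querystring, POST events in body
    return payload.querystring or payload.body
$$ LANGUAGE plpythonu;
```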
Can you point me to a BigQuery version of the UDF that I can use as a starting point?