Parsing lzo compressed events using Spark

ablimit · November 12, 2020, 8:41am

I wanted to use Spark to read and parse the bad events that were compressed with lzo. However, looks like they were also encoded with Thrift at some point.

What I have is something like this:

val input = sc.textFile(“s3://snowplow-raw-events-bucket/2020-10-07-00/2020-10-07-00/raw*.lzo”)
val df = input.toDF()
df.show()

However it’s showing a lot of unrecognizable dark question marks instead of showing a nice Json. How do get around this ?

I went over the tutorial from @christophe and @yali . But looks like they are parsing a decoded text file.

mike · November 12, 2020, 9:15pm

Are you reading raw events or bad events? The raw events are Thrift encoded but the bad events should either

be line delimited JSON with a Thrift encoded payload (if you are running an older version of the pipeline)
or they should be line delimited JSON that performs a partial enrichment of the bad row (and provides more helpful error messaging, R118 onwards)

The tutorial is likely for the older format, rather than the new JSON format - but either format should be line delimited rather than containing raw Thrift records.

ablimit · November 12, 2020, 9:50pm

@mike Thanks for responding. 'm reading both bad and raw events. We have used lzo encoding, and it’s list of .lzo and .loz.index files. Given that Spark is reading them fine, I thought the split lzo files are handled by Spark, I assumed it’s an encoding issue.

mike · November 12, 2020, 10:09pm

Bad should read without any issues, raw will be a bit more problematic. What’s your use case for reading the raw data (as hopefully you don’t need to!).

ablimit · November 12, 2020, 11:02pm

My use case for now it’s pretty much to verify that good + bad = raw. We have some events we don’t see either in the good stream or bad stream, and hoping to see if raw contains those events.

ihor · November 13, 2020, 12:03am

@ablimit, the formula good + bad = raw will not necessarily work. It depends on how you count the events. There could be many events in a payload. A single bad event in the payload will result in the whole payload ending up in bad despite the fact that the good events from the payload will also end up in good.

Topic		Replies	Views
Archive/raw file format and encoding details Enrichment	5	1588	March 20, 2019
Snowplow bad events reprocessing	7	1436	February 23, 2021
Correct/change data in thrift LZO files for reprocessing Enrichment	20	5317	March 27, 2017
Encoded bad rows in Elasticsearch - advanced debugging support For data modelers & consumers	3	2564	October 18, 2018
Catching bad events using the Scala SDK Analytics SDKs	4	2883	July 23, 2016

Parsing lzo compressed events using Spark

Related topics