I wanted to use Spark to read and parse the bad events that were compressed with LZO. However, it looks like they were also encoded with Thrift at some point.
What I have is something like this:
```scala
val input = sc.textFile("s3://snowplow-raw-events-bucket/2020-10-07-00/2020-10-07-00/raw*.lzo")
val df = input.toDF()
df.show()
```
However, it's showing a lot of unrecognizable dark question marks instead of nice JSON. How do I get around this?
I went over the tutorial from @christophe and @yali, but it looks like they are parsing an already decoded text file.
Are you reading raw events or bad events? The raw events are Thrift encoded, but the bad events should either be line-delimited JSON with a Thrift-encoded payload (if you are running an older version of the pipeline) or line-delimited JSON containing a partially enriched bad row with more helpful error messaging (R118 onwards).

The tutorial is likely for the older format rather than the new JSON format, but either format should be line-delimited rather than containing raw Thrift records.
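For the bad rows, a minimal sketch along these lines should work, assuming the hadoop-lzo input format (`com.hadoop.mapreduce.LzoTextInputFormat`) is on your classpath; the bucket path here is illustrative:

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read the LZO archives as splittable text, one bad row per line.
// The companion .lzo.index files are what make the splits possible.
val lines = sc.newAPIHadoopFile(
  "s3://snowplow-bad-events-bucket/2020-10-07-00/*.lzo", // illustrative path
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map(_._2.toString) // copy out of Hadoop's reused Text object

// Parse each line as JSON and let Spark infer the bad row schema.
import spark.implicits._
val badRows = spark.read.json(lines.toDS())
badRows.printSchema()
badRows.show()
```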
@mike Thanks for responding. I'm reading both bad and raw events. We have used LZO encoding, and it's a list of .lzo and .lzo.index files. Given that Spark is reading them fine, I assumed the split LZO files are handled by Spark and that the problem is an encoding issue.
Bad should read without any issues; raw will be a bit more problematic. What's your use case for reading the raw data? (Hopefully you don't need to!)
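If you do end up needing raw, it can be read, but it takes the elephant-bird input formats plus the Thrift model for Snowplow's collector payload. Treat this as an untested sketch: the `LzoThriftBlockInputFormat`/`ThriftWritable` usage and the `CollectorPayload` package (from the collector-payload-1 artifact) are assumptions to verify against your pipeline version, and the path is illustrative:

```scala
import com.twitter.elephantbird.mapreduce.input.LzoThriftBlockInputFormat
import com.twitter.elephantbird.mapreduce.io.ThriftWritable
import org.apache.hadoop.io.LongWritable
// Thrift model generated from Snowplow's collector payload schema
// (package name assumed; check the artifact your pipeline version uses).
import com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload

// Tell elephant-bird which Thrift class the LZO block files contain.
LzoThriftBlockInputFormat.setClassConf(classOf[CollectorPayload], sc.hadoopConfiguration)

val payloads = sc.newAPIHadoopFile(
  "s3://snowplow-raw-events-bucket/2020-10-07-00/*.lzo", // illustrative path
  classOf[LzoThriftBlockInputFormat[CollectorPayload]],
  classOf[LongWritable],
  classOf[ThriftWritable[CollectorPayload]]
).map { case (_, writable) =>
  writable.setConverter(classOf[CollectorPayload]) // required before get()
  writable.get
}

// Each CollectorPayload may carry a whole POST body of multiple events.
payloads.take(5).foreach(p => println(p.getPath))
```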
My use case for now is pretty much to verify that good + bad = raw. We have some events that we don't see in either the good stream or the bad stream, and I'm hoping to see if raw contains those events.
@ablimit, the formula good + bad = raw will not necessarily hold; it depends on how you count the events. A single payload can contain many events, and a single bad event will send the whole payload to bad, even though the good events from that payload also end up in good. For example, if a payload of 10 events contains 1 invalid event, the whole 10-event payload lands in bad while the 9 valid events still land in good, so naive counts of good + bad won't match raw.