We are using the Snowplow Scala collector to collect events. It's a standard collection pipeline: the collector sinks events to Kinesis, and kinesis-s3 consumes from Kinesis and writes the events to S3.
Our intention is to use PrestoDB to analyze the S3 files. We'd like to convert these Thrift files to Parquet format, since Parquet supposedly performs better. Any suggestions on how we should go about that? Also, is it possible to sink the events to S3 directly in Parquet format?
Hi @shardnit - putting a query engine like Presto, Drill or Impala over the raw collector payloads isn't going to get you very far: you'll be missing out on all of the format translation, schema validation and event enrichment ("dimension widening") that the Stream Enrich (or indeed Hadoop Enrich) component performs.
So we would always recommend analyzing the enriched event files in S3. As for Parquet support for those files: it's something we'd like to add in the future, but there's a lot of work to do first to refactor our enriched event format (likely into Avro). You'll find our first Avro milestone in our GitHub repo.
In the meantime, the recommended way of analyzing the enriched event files in S3 is to use Spark together with our Python or Scala Analytics SDKs.
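For reference, here's a minimal sketch of that approach in Scala. It assumes Spark 2.x and an early release of the Scala Analytics SDK, whose `EventTransformer.transform` turns an enriched TSV line into a JSON string wrapped in a success/failure type (the wrapper changed between SDK versions, so the `.isSuccess`/`.toOption` calls may need adjusting); the S3 paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

object EnrichedEventsAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("snowplow-enriched-events")
      .getOrCreate()
    import spark.implicits._

    // Enriched events land in S3 as TSV files; the bucket/prefix here is a placeholder
    val lines = spark.sparkContext.textFile("s3://your-bucket/enriched/good/*")

    // Transform each TSV line into a JSON string, keeping only successful transformations
    val jsons = lines
      .map(line => EventTransformer.transform(line))
      .filter(_.isSuccess)
      .flatMap(_.toOption)

    // Load the JSON strings into a DataFrame for ad hoc analysis
    val events = spark.read.json(jsons.toDS())
    events.printSchema()
    events.groupBy("event_name").count().show()

    // Optionally write the events back out as Parquet (e.g. for Presto to query)
    events.write.parquet("s3://your-bucket/enriched-parquet/")

    spark.stop()
  }
}
```

The last step is one way to bridge to your Parquet/Presto goal today: once the enriched events are in a DataFrame, writing them out as Parquet is a standard Spark operation.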