We are currently testing out Snowplow on AWS using the quickstart-examples. Is there a way to land the data in S3 in compressed but pure JSON format? Currently it’s a compressed document consisting of some metadata followed by a JSON string. Getting it there in pure JSON format would remove the need for an intermediary transformation step to convert it to JSON before moving it to the final warehouse destination (Snowflake in our case).
I attempted to use the purpose variable in the terraform-aws-s3-loader-kinesis-ec2 module by setting it to ‘JSON’ (GitHub - snowplow-devops/terraform-aws-s3-loader-kinesis-ec2), but the data stopped sinking to S3 after that. Any ideas?
Hey @Pratik, the S3 Loader does not do any transformation of the data before landing it in S3. The enriched data comes in a TSV format where certain values are indeed JSON - this is the format you are seeing inside the GZipped files.
To convert TSV → JSON you would generally use our Analytics SDKs inside a Lambda function or some other microservice-style consumer of the Kinesis stream. You would then re-publish each event to a new Kinesis stream, which would now contain the JSON you want to use.
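As a rough illustration, here is a minimal sketch of that approach using the Python Analytics SDK in a Lambda triggered by the enriched Kinesis stream. The output stream name and the choice of partition key are placeholders for your own setup:

```python
import base64
import json

import boto3
from snowplow_analytics_sdk.event_transformer import transform
from snowplow_analytics_sdk.snowplow_event_transformation_exception import (
    SnowplowEventTransformationException,
)

kinesis = boto3.client("kinesis")
OUTPUT_STREAM = "enriched-json"  # placeholder: the new stream for JSON events


def handler(event, context):
    for record in event["Records"]:
        # Kinesis hands Lambda each enriched event as a base64-encoded TSV line
        tsv_line = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        try:
            # The Analytics SDK turns the enriched TSV into a dict, parsing
            # the embedded JSON fields (contexts, unstruct_event, etc.)
            json_event = transform(tsv_line)
        except SnowplowEventTransformationException as e:
            # In production you would route failures to a bad stream instead
            print(e.error_messages)
            continue
        kinesis.put_record(
            StreamName=OUTPUT_STREAM,
            Data=json.dumps(json_event),
            PartitionKey=json_event["event_id"],
        )
```

An S3 Loader consuming this new stream would then sink the plain JSON rather than TSV.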
Thanks Josh! This was really helpful! I will look into updating the pipeline using the above approach.
Alternatively, I also came across Snowflake Loader - Snowplow Docs, which we’re wondering if we could use since our end goal is to get the data into Snowflake. Is there any documentation around adding that to the existing pipeline? The existing documentation is helpful, but I’m a bit lost as to where to start. Currently, we have everything up to S3 set up. I would really appreciate some pointers on adding the Snowplow Snowflake Transformer and Snowplow Snowflake Loader pieces to the existing pipeline.
Hi @Pratik, you are already on the right track with loading into Snowflake!
After the enriched data has landed in S3, the following stages happen:
1. EMR: the Snowflake Transformer stages the data that has landed in S3, converts it into a format ready for loading into Snowflake, and saves it to a new staging location in S3
2. EMR: the Snowflake Loader copies the data from that S3 staging bucket into Snowflake
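To give you a feel for the Loader side: it is configured with a self-describing JSON. The snippet below is illustrative only - the field names are from the Snowflake Loader setup guide as I recall it, and every value is a placeholder, so please check it against the docs for the version you deploy:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-3",
  "data": {
    "name": "Snowflake Storage Target",
    "awsRegion": "us-east-1",
    "auth": {
      "roleArn": "arn:aws:iam::123456789012:role/SnowflakeLoadRole",
      "sessionDuration": 900
    },
    "manifest": "snowflake-run-manifest",
    "snowflakeRegion": "us-west-2",
    "database": "snowplow_db",
    "input": "s3://your-bucket/enriched/archive/",
    "stage": "snowplow_stage",
    "stageUrl": "s3://your-bucket/snowflake/stage/",
    "warehouse": "snowplow_wh",
    "schema": "atomic",
    "account": "your-account",
    "username": "snowflake_loader",
    "password": {
      "ec2ParameterStore": {
        "parameterName": "snowplow.snowflake.password"
      }
    },
    "purpose": "ENRICHED_EVENTS"
  }
}
```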
The setup steps are fairly involved, but the guide should cover everything required. We are working hard on simplifying this, however, and on extending our open-source modules to support loading into Snowflake with the same ease as the rest of the setup.
When we are closer to releasing that, would you be open to beta-testing the modules (which should automate this whole process)?