AWS Quickstart: optimized Snowplow infra

Hi, we have been operating the Snowplow infrastructure launched using the Terraform Quickstart AWS Secure Setup. Our plan is to eliminate the other components, including the Postgres loader and Iglu, and rely solely on the collected data in the S3 raw bucket. This should give us a more cost-efficient and manageable infrastructure. Is there a guide available for this type of architecture? Additionally, how can we start analyzing the collected data stored in the raw bucket in .gz format?


Just to clarify, do you mean the raw data (collector payloads) or the data that comes out of the enrichment process? The "raw" data that comes from the collector isn't really in an analysable format, whereas the enriched data certainly can be used (e.g., Kinesis => S3 loader => S3 => Glue/Athena).

Hi Mike, yes, this is exactly what I am trying to do (e.g., Kinesis => S3 loader => S3 => Glue/Athena). I am having trouble finding documentation for this setup. Are there any references on how this can be achieved? I am new to data engineering, thanks for understanding.

You're going to want to go:

Snowplow Collector → Raw Kinesis → Snowplow Enrich → Enriched Kinesis → S3 Loader → Enriched S3

You should be able to strip the loaders (Postgres/Snowflake) out of the Quickstart and keep the rest.

Then you can use Glue/Athena to query the enriched data in S3. Some links that might be useful:

Some of the manual deployment docs might help with your broader understanding: Manual Setup on AWS | Snowplow Documentation
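
If you want a quick look at the enriched files before setting up Glue/Athena, something like the Python sketch below may help. It assumes the S3 Loader wrote gzipped, tab-separated enriched events with one event per line, and that the first seven columns are the ones listed in `LEADING_FIELDS`. Both are assumptions to check against your loader config and the Snowplow canonical event model. `read_enriched_gz` and `LEADING_FIELDS` are names made up for this example, not part of any Snowplow library.

```python
import gzip

# Assumed names of the first columns in the enriched TSV.
# Verify these against the Snowplow canonical event model docs.
LEADING_FIELDS = [
    "app_id",
    "platform",
    "etl_tstamp",
    "collector_tstamp",
    "dvce_created_tstamp",
    "event",
    "event_id",
]


def read_enriched_gz(source):
    """Yield one dict per event from a gzipped, tab-separated file.

    `source` can be a local file path or a binary file-like object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            values = line.split("\t")
            # Name the leading columns; keep everything else positionally.
            record = dict(zip(LEADING_FIELDS, values))
            record["_rest"] = values[len(LEADING_FIELDS):]
            yield record
```

To use it, download one object from the enriched bucket and loop over it, e.g. `for event in read_enriched_gz("some-enriched-file.gz"): print(event["event"], event["collector_tstamp"])` (the file name is a placeholder). This is only for inspecting a file or two. For real analysis Glue/Athena is still the way to go.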