Configure Collector

nando_roz · September 14, 2021, 5:08pm

Hello,

I have more than one app as my source. Using the quickstart guide, I configured an EC2 instance as the collector and passed the collector address to my app.

As I have more than one app, can I use the same collector address, or should I use one collector address for each application?

Regards

Colm · September 14, 2021, 6:37pm

You just need one collector, which receives data over HTTP from any number of clients. For a production use case, though, we recommend that the collector consists of more than one instance across more than one availability zone, and sits behind a load balancer.

However, you’ll need more than just the collector to process the data, if you haven’t set up the rest of the pipeline yet - you can find more details in the documentation.

nando_roz · September 14, 2021, 9:10pm

Great. Yes, I implemented the whole pipeline. My doubt was only about the number of collectors.

Thanks a lot,

josh · September 15, 2021, 7:12am

This is why there is a Load Balancer sitting in front of the actual collector servers. In this way your client side + server side trackers can all point to this single Load Balancer and behind this you can horizontally scale the number of Collectors as your traffic grows.

As @Colm mentioned you should have a minimum of 2 Collector Server nodes for production to provide HA in case of sudden traffic surges / AZ downtime.
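
To illustrate the point about several apps sharing one endpoint, here is a minimal sketch that builds page view request URLs for two apps against the same collector address. The hostname and app IDs are hypothetical, and the query parameter names (`e`, `aid`, `p`, `url`) and the `/i` path reflect my understanding of the Snowplow tracker protocol, so check them against the documentation. In practice a tracker SDK builds these requests for you.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical load balancer hostname; replace with your own collector address.
COLLECTOR_ENDPOINT = "https://collector.example.com"


def page_view_url(app_id, page_url):
    """Build a GET URL for a page view event tagged with the app's ID."""
    params = {
        "e": "pv",        # event type: page view (assumed protocol field)
        "aid": app_id,    # application ID, tells the source apps apart
        "p": "web",       # platform
        "url": page_url,  # page being viewed
    }
    return f"{COLLECTOR_ENDPOINT}/i?{urlencode(params)}"


# Two hypothetical apps, one collector address.
urls = [
    page_view_url("shop-web", "https://shop.example.com/"),
    page_view_url("blog-web", "https://blog.example.com/"),
]

for u in urls:
    parts = urlsplit(u)
    print(parts.netloc, parse_qs(parts.query)["aid"][0])
```

Both requests go to the same host; only the app ID differs, and that is what lets you separate the apps downstream.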

nando_roz · September 15, 2021, 9:23am

Great Josh,

I am testing the quick start and am now working out the right sizes for the EC2 instances, RDS and so on.

Another question: is there a specific topic here for discussing the quick start guide architecture? For example, the Terraform created 8 EC2 instances. Should I use all of them, or does it depend on my needs?

Thanks a lot

josh · September 15, 2021, 9:29am

It sort of depends on your needs. At a minimum you need the Collector, Enrichment and an Iglu Server. The other nodes are saving the raw, enriched and bad data to different destinations so you need to see where you are actually going to use the data from and decide where you want to load it.

If you are never going to access the raw data on S3 you can get rid of that loader. Equally if you just never intend to access any data on S3 you can disable all of those 3 loaders.

It's really up to you!
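
As a rough sketch of that decision, the function below starts from the three components named above as the minimum and adds an S3 loader only for the data you plan to read from S3. The component names are illustrative labels, not the actual Terraform module names, and loaders for other destinations are not modelled here.

```python
# Illustrative labels only; these are not the quick start's Terraform module names.
CORE = {"collector", "enrich", "iglu_server"}


def components_to_run(need_raw_in_s3, need_enriched_in_s3, need_bad_in_s3):
    """Return the set of components to keep, given which S3 data you will use."""
    components = set(CORE)
    if need_raw_in_s3:
        components.add("s3_loader_raw")
    if need_enriched_in_s3:
        components.add("s3_loader_enriched")
    if need_bad_in_s3:
        components.add("s3_loader_bad")
    return components


# Example: keep enriched and bad data in S3, skip the raw loader.
print(sorted(components_to_run(False, True, True)))
```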

nando_roz · September 15, 2021, 9:37am

Perfect. Yes, I do intend to use the data in S3, but I haven’t done so yet. So what is the difference between the raw data and the enriched/bad data in S3?

Colm · September 15, 2021, 12:42pm

Raw data consists of messy, thrift-formatted payloads basically. They’re still in the format that the trackers sent them, and also the raw stream will contain junk data that later gets filtered out by the validation step (which happens in the enrich component).

Generally you should never need to work directly with raw data, and doing so is a lot of work and pain.

The reason one would load raw data to S3 is usually as a failsafe. If some drastic issue happens downstream, then the raw data is in S3 and can be reprocessed (note that it’s not easy to do this, it’s a last resort).

Enriched (good) data has been validated so only contains the good, high quality data. It also has information added to it in the enrichment process, and it’s in TSV format.
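
Because the enriched data is TSV, it is easy to read with standard tooling. The sketch below parses a few made-up rows and counts events per app. The column names are my assumption of the first few fields of an enriched event and the list is truncated for illustration; a real enriched event has many more columns, so check the documented field order before relying on this.

```python
import csv
import io
from collections import Counter

# Assumed leading columns of an enriched event, truncated for illustration.
COLUMNS = ["app_id", "platform", "etl_tstamp", "collector_tstamp"]

# Made-up sample rows, not real pipeline output.
SAMPLE = (
    "shop-web\tweb\t2021-09-15 09:00:01.000\t2021-09-15 09:00:00.000\n"
    "blog-web\tweb\t2021-09-15 09:00:03.000\t2021-09-15 09:00:02.000\n"
    "shop-web\tweb\t2021-09-15 09:00:05.000\t2021-09-15 09:00:04.000\n"
)


def parse_enriched(text, columns=COLUMNS):
    """Parse tab-separated rows into dicts keyed by the given column names."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader]


events = parse_enriched(SAMPLE)
events_per_app = Counter(event["app_id"] for event in events)
print(events_per_app)
```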

The enrichment process also produces bad data, which is the result of failing validation. Normally, the bad data is loaded to S3 so that one can use tools like Athena to debug issues (for example a tracking mistake where an int is sent as a string).

Both the good and bad enriched data can be accessed via Elasticsearch, or directly from the streams. So if you never care about using the data in S3, nor do you care about having a backup copy in file storage (and you’re not loading to Redshift or Snowflake), then there’s no real reason to load to S3.

If I were using Snowplow for the first time, and my starting volumes were low, I’d probably start with the enriched good and bad data in S3, just to give me a way to dig into the data directly. Then after getting to grips with things, I’d turn off whatever I don’t think I’ll need to keep.

nando_roz · September 15, 2021, 9:41pm

Perfect. Thank you so much for your explanation
