You just need one collector, which receives data over HTTP from any number of clients. For a production use case, though, we recommend that the collector consist of more than one instance, spread across more than one availability zone, sitting behind a load balancer.
This is why there is a Load Balancer sitting in front of the actual collector servers. That way your client-side and server-side trackers can all point to this single Load Balancer, and behind it you can horizontally scale the number of Collectors as your traffic grows.
As @Colm mentioned, you should have a minimum of two Collector nodes in production to provide high availability in case of sudden traffic surges or AZ downtime.
I am testing the quick start and am now working out the right sizing to forecast for the EC2 instances, RDS, and so on.
Another question: is there a dedicated topic here for discussing the quick start guide architecture? For example, the Terraform created 8 EC2 instances. Should I use all of them, or does it depend on my needs?
It sort of depends on your needs. At a minimum you need the Collector, Enrichment, and an Iglu Server. The other nodes save the raw, enriched, and bad data to different destinations, so you need to work out where you will actually consume the data from and decide what you want to load.
If you are never going to access the raw data on S3, you can get rid of that loader. Equally, if you never intend to access any data on S3, you can disable all three of those loaders.
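Since the quick start is driven by Terraform, disabling loaders typically comes down to flipping input variables and re-applying. The variable names below are purely illustrative, not the quick start's actual names; check the `variables.tf` of the version you deployed for the real toggles.

```hcl
# Illustrative sketch only - these variable names are hypothetical.
# Consult variables.tf in your quick start release for the real flags.

# Keep Collector, Enrich and Iglu Server; switch off the three S3 loaders:
s3_raw_loader_enabled      = false
s3_enriched_loader_enabled = false
s3_bad_loader_enabled      = false
```

Re-running `terraform apply` after a change like this would then tear down the corresponding loader instances, which is one way to reduce the 8-instance footprint to only the pieces you use.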
Raw data basically consists of messy, Thrift-serialised payloads. They are still in the format the trackers sent them in, and the raw stream also contains junk data that later gets filtered out by the validation step (which happens in the enrich component).
Generally you should never need to work directly with raw data; doing so is a lot of work and pain.
The reason one would load raw data to S3 is usually as a failsafe. If some drastic issue happens downstream, the raw data is in S3 and can be reprocessed (note that reprocessing is not easy; it is a last resort).
Enriched (good) data has been validated, so it only contains the good, high-quality data. It also has information added to it in the enrichment process, and it is in TSV format.
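To give a feel for the TSV format, here is a minimal Python sketch of splitting an enriched event line into named fields. It only covers the first handful of columns of the canonical event model (the real format has well over a hundred tab-separated fields), and the sample line is invented purely for illustration.

```python
# Sketch: naming the first few columns of a Snowplow enriched TSV line.
# The full canonical event model has 100+ fields; this is illustrative only.

FIELDS = ["app_id", "platform", "etl_tstamp", "collector_tstamp",
          "dvce_created_tstamp", "event", "event_id"]

def parse_enriched(line: str) -> dict:
    """Split a tab-separated enriched event into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

# Invented sample line, just to show the shape of the data.
sample = ("my-app\tweb\t2023-01-01 00:00:05\t2023-01-01 00:00:01\t"
          "2023-01-01 00:00:00\tpage_view\t42838c04-c5cc-4d5c-b3a0-0c2d0a1b2c3d")

event = parse_enriched(sample)
print(event["event"])  # -> page_view
```

In practice you would rarely hand-roll this; the Snowplow analytics SDKs do the TSV-to-JSON transformation for you, but the sketch shows why the enriched stream is easy to consume compared with the raw one.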
The enrichment process also produces bad data, which is the result of failing validation. Normally the bad data is loaded to S3 so that one can use tools like Athena to debug issues (for example, a tracking mistake where an int is sent as a string).
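As a sketch of the kind of triage you would do over bad data (shown here in plain Python rather than Athena): bad rows are self-describing JSON, so grouping them by the failure name in their schema URI quickly shows what is going wrong. The sample rows below are invented for illustration.

```python
import json
from collections import Counter

# Invented sample bad rows - self-describing JSON with a "schema" field
# naming the failure type, and the failure detail under "data".
bad_rows = [
    '{"schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0", "data": {}}',
    '{"schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0", "data": {}}',
    '{"schema": "iglu:com.snowplowanalytics.snowplow.badrows/adapter_failures/jsonschema/1-0-0", "data": {}}',
]

def failure_type(row: str) -> str:
    """Extract the failure name from a bad row's schema URI,
    e.g. iglu:vendor/schema_violations/jsonschema/2-0-0 -> schema_violations."""
    return json.loads(row)["schema"].split("/")[1]

counts = Counter(failure_type(r) for r in bad_rows)
print(counts)  # e.g. Counter({'schema_violations': 2, 'adapter_failures': 1})
```

With the bad data in S3, the equivalent `GROUP BY` in Athena over the real bad-row buckets gives you the same breakdown at scale, which is usually the first step in spotting a tracking mistake.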
Both the good and bad enriched data can be accessed via Elasticsearch, or directly from the streams. So if you never care about using the data in S3, nor about having a backup copy in file storage (and you are not loading to Redshift or Snowflake), then there is no real reason to load to S3.
If I were using Snowplow for the first time and my starting volumes were low, I'd probably start with the enriched good and bad data in S3, just to give myself a way to dig into the data directly. Then, after getting to grips with things, I'd turn off whatever I don't think I'll need to keep.