Recommendations for starting Snowplow procs in autoscaling groups

We run the Scala Stream Collector and the Kinesis S3 sink in autoscaling groups to ensure availability.

The setup for the collectors, load balancers, autoscaling groups, etc. is pretty straightforward and we have all of that working fine. What I'm not sure about is the best practice for starting up the collector or Kinesis sink on the machine after it boots.

We have an AMI with all our config that is used to spin up new instances, but over the years several different people have set up the AMIs, so there's a hodgepodge of ways the actual process is kicked off, how logging is handled, and so on. Just curious whether there is documentation anywhere for a standard way to handle this.



We are using the LaunchConfiguration's UserData to first configure the instance and then, as the last step, start the Snowplow process with nohup. For logging we redirect the output to a file and run a CloudWatch agent on the machine that watches that log.
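For illustration, a minimal UserData sketch of that approach might look like the following. The jar name, config path, and log location are all assumptions for the example, not our actual setup:

```shell
#!/bin/bash
# EC2 UserData sketch (hypothetical paths and filenames).
# Runs once at first boot; any instance-specific configuration
# would happen before this point.

APP_DIR=/opt/snowplow
LOG_FILE=/var/log/snowplow/collector.log
mkdir -p "$(dirname "$LOG_FILE")"

# Start the Scala Stream Collector in the background so it survives
# the UserData shell exiting, and redirect stdout/stderr to the log
# file that the CloudWatch agent is configured to watch.
nohup java -jar "$APP_DIR/snowplow-stream-collector.jar" \
  --config "$APP_DIR/collector.hocon" \
  >> "$LOG_FILE" 2>&1 &
```

The CloudWatch agent's config then just points at `$LOG_FILE` as a collected log. One downside of this pattern is that nothing restarts the process if it dies, which is part of why a supervisor (systemd, Docker, etc.) is worth considering.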

This is working fine but we’re looking to test dockerizing Snowplow and running it on Fargate at some point.

I haven't found any "standard" way of doing this either.


One way is to use AWS Elastic Beanstalk with the Docker platform. Snowplow publishes official Docker images that mirror the Scala Stream Collector releases. You can build your own images on top of these, add your own configuration, and publish them to ECR.
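A rough sketch of what that build-and-publish flow could look like. The image tag, config path, region, account ID, and repository name are all placeholders; check Docker Hub for the current official collector image and tags:

```shell
# Extend the official collector image with your own config
# (tag and config path are placeholder assumptions).
cat > Dockerfile <<'EOF'
FROM snowplow/scala-stream-collector-kinesis:2.9.0
COPY collector.hocon /snowplow/config/collector.hocon
CMD ["--config", "/snowplow/config/collector.hocon"]
EOF

docker build -t my-collector .

# Authenticate to ECR and push (account ID, region, and repo
# name below are placeholders).
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin \
  123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker tag my-collector \
  123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-collector:latest
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-collector:latest
```

From there the same image works for Elastic Beanstalk, ECS, or the Fargate experiment mentioned above.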