I’m new to the Snowplow setup and not an expert developer, but I’m really keen on setting up a real-time streaming pipeline.
I downloaded the zip with the collector, enrich, and sink files, but to be honest I don’t know what to do with them. Which Amazon service should I use? I’ve already set up an EC2 instance, but that’s about where I get stuck. Should I start an Elastic Beanstalk environment, or …?
I was in the same boat just a couple of months ago. Yes, a good place to start is Elastic Beanstalk. It requires a bit more legwork and more concepts to learn but it’s going to be worth it in the medium term (comes with ASGs out of the box, as @alex recommended). Note where your resources are located - it’d make sense to put them in a VPC. Again, more concepts - but more options for you later. Good luck! AWS really does give you ropes to hang yourself with and for someone without devops expertise it’s a jungle like any other - be patient and give yourself time.
Thanks a lot for your quick replies @alex and @vivricanopy, this gives me some pointers again :)
One more question to prevent me from taking a wrong turn here: should I set up a worker environment or a web server environment?
Thanks a lot in advance!
@vivricanopy thanks again, I got the collector launched and the config file set :) But since I started the EC2 instance via Beanstalk, I can’t/shouldn’t run it by accessing the EC2 instance directly, right? How should I do this?
Thanks a billion again :D
No worries, man! So… one way to do it is to have a Docker container run it; another would be setting up something like a daemon service directly on the EC2 instance. Be careful though: with EB, you can’t restore a terminated environment AFAIK, so make sure everything you do is scripted and saved in Git.
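For the daemon route, a hypothetical systemd unit is one way to sketch it. Every path and filename below is an assumption; point them at wherever you actually put the collector jar and config:

```ini
[Unit]
Description=Snowplow Scala Stream Collector (hypothetical unit)
After=network.target

[Service]
# Placeholder paths -- substitute your actual jar and config locations
ExecStart=/usr/bin/java -jar /opt/snowplow/snowplow-stream-collector.jar --config /opt/snowplow/collector.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With something like this saved as a `.service` file under `/etc/systemd/system/`, the collector survives reboots and restarts on failure, which a manually launched process does not.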
You don’t need Tomcat, as the collector binary has a web server bundled. The easiest way, if you’re doing it manually, is to bundle the collector binary, Procfile, and config file in one zip and upload it through the AWS console or the eb CLI.
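A minimal sketch of such a bundle, assuming hypothetical filenames `snowplow-stream-collector.jar` and `collector.conf` (use your actual names). The Procfile is a one-liner telling EB how to start the collector:

```
web: java -jar snowplow-stream-collector.jar --config collector.conf
```

Then zip the three files flat together, e.g. `zip -j collector-eb.zip snowplow-stream-collector.jar collector.conf Procfile`, and upload that zip as the application version.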
I am running it locally now, and it seems to get pretty far: the Kinesis streams are found and active, but after that I get the error below. This doesn’t really mean anything to me. Does anybody have a clue how I can fix this?
[ForkJoinPool-2-worker-1] ERROR c.s.s.c.scalastream.ScalaCollector$ - Failure binding to port
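A quick way to rule out the most common cause of a bind failure (something else already holding the port, or a privileged port without root) is a bind test like this sketch; port 8000 is an assumption, so match it to your collector config:

```python
import socket

def can_bind(host: str, port: int) -> bool:
    """Return True if we can bind host:port right now, False otherwise."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError:
        # Typically: port already in use, or a port < 1024 without root
        return False
    finally:
        s.close()

host, port = "0.0.0.0", 8000
if can_bind(host, port):
    print(f"{host}:{port} is free to bind")
else:
    print(f"{host}:{port} is NOT available")
```

If the port is free and the collector still fails, look instead at the environment (security groups, the user the process runs as, and what port the EB proxy expects).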
Which component is failing? Is it the enricher? If so, it can’t find the collector to do the health check (have you set that up? Maybe try without it first). I’ve set up the collector to accept from 0.0.0.0 on port 80. Another probable cause is an availability-zone/security-group/subnet misconfiguration.
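For reference, the binding is controlled in the collector’s HOCON config file; a minimal sketch of the relevant section (the values here are assumptions — note that ports below 1024, such as 80, require root):

```hocon
collector {
  # Listen on all interfaces so EB's proxy/load balancer can reach it
  interface = "0.0.0.0"
  # 8000 avoids the root requirement of port 80 while experimenting
  port = 8000
}
```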
@vivricanopy if I run it locally on 0.0.0.0 port 8000 I get no errors, and if I start an EB environment with health reporting set to basic, it seems to work fine. However, if I enable enhanced health reporting it fails, even though I haven’t configured anything other than selecting that option.
(Nothing other than the collector and the good and bad Kinesis streams is set up currently, so no enricher or anything else.)
Well, I think you’re on the right track; the “enhanced health reporting” may be sending logs to S3, so you need to configure an access policy for it. I don’t really know what’s needed, but the truth is out there. Also, make sure the role you run it under (maybe the service role? I’m not sure) has enough permissions: maybe S3 permissions, maybe CloudWatch permissions, maybe something else entirely. Look at the log output and see what’s fishy. In any case, you don’t need it to experiment and to create a proof of concept.
@vivricanopy if I don’t need it to keep going, I’ll gladly skip it for now. I have already been looking at setting up the tracker, to see whether things are working. But I can’t find an example of a tracker for the real-time pipeline yet. Where do I find it?
I was wondering if you were able to get this running successfully? I’m in a similar boat too and would really appreciate any help with setting up the Scala collector through EB.
By the way… I had to remove the option -Dhttp.port=5000 as it threw errors in the EB event log. I wonder if something changed in the Scala collector jar since November?
Were you able to identify the root cause of the bind failure? I’m experiencing a similar issue and I think it’s related to my AWS environment, but I’m not sure how to address it. Any information would be appreciated.