Scala Stream Collector on Elastic Beanstalk - how to configure and run?

Hi all,

I am setting up the pipeline for the first time on AWS and would really appreciate any help covering what is missing from the setup guide. I promise that once I have this running, I’ll post my detailed setup here as a reference!

I am trying to get the Scala collector set up on Elastic Beanstalk. From my understanding this should be fairly straightforward. I used a few very helpful support topics and the official setup guide for the Clojure collector as a reference.

However, I am not able to get the collector running. Here are some steps I followed:

  1. Tested on my local dev machine (127.0.0.1 on port 8080) with the sink set to stdout. This worked.
  2. I created an EB environment with the web server environment type (with Java as the configuration) and ELB enabled.
  3. I uploaded a zip file containing the Procfile, the collector jar and the above config file. For testing, I kept the settings in the config file the same as I used for local testing (i.e. interface: 127.0.0.1, port: 8080, sink: stdout). The Procfile contents were: web: ./snowplow-stream-collector --config test.conf
  4. Note: I will eventually set this up with Kinesis but I wanted to test it on stdout first. Could that be an issue as well?
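
For anyone repeating step 3, here is a minimal Python sketch of how the upload bundle could be assembled. The function name build_bundle and the file names in the usage note are placeholders of mine, not anything from the Snowplow docs; the only thing it relies on is that the Procfile sits at the root of the zip.

```python
import zipfile
from pathlib import Path


def build_bundle(bundle_path, procfile_command, files):
    """Write a flat zip archive to upload as an Elastic Beanstalk source bundle.

    bundle_path      -- where to write the zip (hypothetical name, pick your own)
    procfile_command -- the command to run for the 'web' process
    files            -- paths of the collector jar and the config file
    """
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The Procfile goes at the root of the archive.
        bundle.writestr("Procfile", f"web: {procfile_command}\n")
        for file_path in files:
            # arcname strips directories so every file lands at the root too.
            bundle.write(file_path, arcname=Path(file_path).name)
    return bundle_path
```

With the files from step 3 in the current directory, a call such as build_bundle("collector.zip", "./snowplow-stream-collector --config test.conf", ["snowplow-stream-collector", "test.conf"]) would produce a zip that can be uploaded as the application version.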

The EB environment starts up successfully. Accessing the EB URL gives an nginx 502 error, but I am assuming that is just port 80, which EB serves automatically because of the web server configuration. Trying to access <eb_url:8080>, however, refuses the connection completely. I cannot telnet or stat this port on the server at all.

I logged into the EC2 instance being used and checked the processes. I can see the line java -jar ./snowplow-stream-collector --config ih.conf so the process is running. But I am not sure why it is inaccessible. Is it a firewall issue (the EB is not inside a VPC)?
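
A quick way to narrow this kind of problem down is to test the port from two places: on the instance itself against 127.0.0.1, and from outside against the public address. Below is a minimal sketch using only the Python standard library; port_is_open is a name I made up. If it returns True locally but False remotely, the likely suspects are the security group or the interface the process is bound to (a process bound to 127.0.0.1 only accepts local connections).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False
```

For example, port_is_open("127.0.0.1", 8080) on the instance, then the same call with the instance's public DNS name from your own machine.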

Would appreciate any help…thanks all!

UPDATE: While trying to figure this out, I tried the above with 2 Kinesis streams (good and bad) and got the same result, i.e. it worked when testing locally, but not in the EB. Thanks!

UPDATE 2: OK…after some more debugging, it seems the AWS setup was partly at fault. The EB created a default security group with this application, which of course did not allow 8080 inbound! So I added 8080 as a rule, and I was able to access the server! But weirdly I could not use the EB URL to do so; I had to use the EC2 IP address (or the Amazon public DNS) to access it. That doesn’t sound right to me…Also, eventually I would like to run the collector on port 80. How can I do that for this application and override the nginx process that runs on it by default? Thanks a million!

UPDATE 2: Now, after a few retries, I can use the EB URL (e.g. *.us-east-1.elasticbeanstalk.com) to access the box on port 8080. But as above, if I configure the Scala conf to use port 80, it doesn’t work, since the box’s default web server (nginx) takes control. How can I disable this so I can run the Scala collector on the standard port 80?
Hope someone can help, as it really can’t be all that difficult :)

UPDATE 3: Sorry for using the forum as a rather verbose running log of my activities, but I am hoping the resolution will help others struggling at this stage as well. Leaving the previous EB as is, I tried the same approach by building an ELB/ASG in with the EB (since this is highly recommended by the Snowplow team anyhow). Now I cannot expose port 8080 to the outside anymore. I can still use the EC2 box’s IP and return OK for port 80, but that defeats the purpose of the ELB. So it’s a bit of a catch-22 here. I can work with 8080 but the ELB doesn’t allow connecting to it. But if I switch to port 80 (ideal) or possibly HTTPS (443), then the default nginx takes over the requests.

Would appreciate any help or guidance…thanks very much!

OK…I decided to answer this myself in case it helps others. The issue was essentially on the AWS side. Here’s what I finally did to get the Scala collector working on port 8080 through an ELB inside a VPC:

  1. Create an EB environment with an ELB (and inside a VPC in my case). I used a public subnet in the VPC for the ELB and the instance.

  2. Set the health check URL to /health

  3. Once started, I had to update the Load Balancer settings: set the listener for port 80 to forward to instance port 8080.

  4. I updated the Load Balancer’s security group settings (inbound and outbound) to allow port 8080 for any source.

  5. I updated the instance’s security group settings to allow inbound 8080 traffic from any source.

Finally, with these edits, I was able to connect to the collector on the standard HTTP port through the ELB! I will eventually set it up to work on HTTPS instead, but I assume the process will be similar.

Considering the number of manual steps required above, I wonder if there’s a way to automate them so we don’t have to go through these steps at every ELB setup.
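
One possible direction, which I have not tested against a live environment: Elastic Beanstalk exposes the classic load balancer listener and the health check URL as configuration options, so steps 2 and 3 could in principle be expressed as option settings rather than console clicks. The sketch below only builds the settings list. The function name is hypothetical, and the security group changes from steps 4 and 5 are not covered. It is also worth verifying that the health check ends up targeting the right instance port.

```python
def collector_option_settings(instance_port=8080, health_path="/health"):
    """Build option settings mapping ELB port 80 to the collector's port.

    Assumes a classic load balancer; the namespaces and option names are
    the ones listed in the Elastic Beanstalk configuration options docs.
    """
    listener = "aws:elb:listener:80"
    return [
        {"Namespace": listener, "OptionName": "ListenerProtocol", "Value": "HTTP"},
        {"Namespace": listener, "OptionName": "InstanceProtocol", "Value": "HTTP"},
        {"Namespace": listener, "OptionName": "InstancePort", "Value": str(instance_port)},
        {
            "Namespace": "aws:elasticbeanstalk:application",
            "OptionName": "Application Healthcheck URL",
            "Value": health_path,
        },
    ]


# Assumption: with boto3 installed, the list could be passed along as
#   boto3.client("elasticbeanstalk").update_environment(
#       EnvironmentName="my-collector-env",  # hypothetical environment name
#       OptionSettings=collector_option_settings(),
#   )
```

The same options can also be kept in an .ebextensions config file inside the source bundle, which would make every new environment pick them up on deploy.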

I would love to hear comments from others…while I resume digging further into Snowplow…and I am sure I will have more questions to post!

Thanks very much!

Thanks for sharing your experiences @kjain!

Thanks for sharing, @kjain. Can you upload the config file here?

My case:
Set up the Stream Collector on Elastic Beanstalk (select “Java”; Tomcat is not needed).
The Procfile should be:
web: java -jar snowplow-stream-collector-kinesis-0.15.0.jar --config movan-stream-collector.conf
Done.
Everything works well!