Amazon is deprecating the DescribeJobFlows EMR API that old versions of EmrEtlRunner / elasticity have been using. We are in the unfortunate situation that our old pipeline is still making these API calls.
Do you have any suggestions on how to move away from the DescribeJobFlows API? Has anyone had success upgrading Elasticity without upgrading EmrEtlRunner? We’ve already bumped Elasticity once for EMR role support, but this change looks more substantial.
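For anyone else hitting this: per the EMR API documentation, DescribeJobFlows is superseded by ListClusters, DescribeCluster, ListSteps and DescribeStep. Here is a minimal, hypothetical Ruby sketch of what the polling side of that migration can look like — the `classify_cluster_state` helper and its state mapping are my own illustration, not Elasticity code (the cluster state names themselves are from the EMR docs):

```ruby
# Terminal cluster states in the post-DescribeJobFlows API surface
# (state names per the EMR DescribeCluster documentation).
TERMINAL_STATES = %w[TERMINATED TERMINATED_WITH_ERRORS].freeze
FAILED_STATES   = %w[TERMINATED_WITH_ERRORS].freeze

# Illustrative helper: classify a cluster state string into
# :running, :completed or :failed.
def classify_cluster_state(state)
  return :failed    if FAILED_STATES.include?(state)
  return :completed if TERMINAL_STATES.include?(state)
  :running
end

# The polling loop itself would then use DescribeCluster instead of
# DescribeJobFlows (requires the aws-sdk gem and real credentials):
#
#   emr = Aws::EMR::Client.new(region: 'us-east-1')
#   loop do
#     state = emr.describe_cluster(cluster_id: 'j-XXXX').cluster.status.state
#     outcome = classify_cluster_state(state)
#     break outcome unless outcome == :running
#     sleep 30
#   end
```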
Hi @rgabo, is there a reason why you can’t use the latest version of EmrEtlRunner with your old pipeline?
@alex, my concern was that the new EmrEtlRunner/StorageLoader does shredding, and the version we have in place for our legacy pipeline does not do any of that.
Are you saying that I should be able to configure the latest EmrEtlRunner (which we use for our new pipeline) to do enrich and storage load without shredding?
I realize that the version of hadoop-enrich is configurable, but how will StorageLoader behave?
I’d suggest giving it a try and reporting back! You need a new version of EmrEtlRunner, but you can always play around with an older version of StorageLoader, and EmrEtlRunner’s …
Even if you end up needing some manual Boto scripting between EmrEtlRunner and StorageLoader, that’s probably the least painful path to take…
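The glue between the two CLIs doesn’t have to be Boto; any script that runs each step as a child process and stops on the first failure would do. A rough, hypothetical Ruby sketch (the command lines themselves would be whatever your EmrEtlRunner and StorageLoader invocations are):

```ruby
# Hypothetical glue script: run each CLI step in sequence and stop on
# the first non-zero exit status. The commands are placeholders.
def run_pipeline(commands)
  commands.each do |cmd|
    ok = system(*cmd)  # run the step as a child process, wait for exit
    raise "step failed: #{cmd.join(' ')}" unless ok
  end
  :ok
end

# Usage sketch:
#   run_pipeline([
#     %w[./snowplow-emr-etl-runner --config config.yml],
#     %w[./snowplow-storage-loader  --config config.yml]
#   ])
```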
We’ve tried, but we’re on such an old version that it was easier to backport the necessary changes.
Bumping Elasticity in the old EmrEtlRunner was actually quite straightforward, and we are going to stop using our old Snowplow pipeline anyway, so hopefully we won’t need to maintain it for too long.
I was also hoping to package up EmrEtlRunner with Warbler, but for some reason I can only get it to produce a snowplow-emr-etl-runner.jar, not a snowplow-emr-etl-runner that is executable on its own. Am I missing a step?
Ah, that’s here: https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/Rakefile#L26
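For the archives: a common way to turn a fat jar into a directly runnable binary (which, as I understand it, is roughly what the linked Rakefile automates) is to prepend a small shell stub to the jar bytes. Zip archives are read from the end of the file, so the JVM ignores the leading stub. A sketch in plain Ruby — file names and the stub wording are illustrative, not the exact Snowplow build code:

```ruby
# Shell stub that re-executes the file itself via `java -jar`.
STUB = "#!/bin/sh\nexec java -jar \"$0\" \"$@\"\n"

# Concatenate stub + jar and mark the result executable.
def make_executable_jar(jar_path, out_path)
  File.open(out_path, 'wb') do |out|
    out.write(STUB)                    # leading shell stub
    out.write(File.binread(jar_path))  # raw jar bytes follow; zip readers
  end                                  # locate entries from the file's end
  File.chmod(0755, out_path)
end
```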
Getting closer to a working solution: I was able to backport JRuby/Warbler support to build executable fat jars and deploy them similarly to how recent releases are deployed (a zip file containing executables). Btw, we call them “Snowplow Classic” and not legacy.
While functionally everything works, Sluice staging/archiving is extremely slow (multiple seconds per file, even though it’s threaded). I bumped Sluice to 0.2.2; could there be any reason for the slowdown? When building/compiling I used JRuby 1.7.4 (Ruby 1.9.3 mode). At runtime, the Docker container has Java 1.7.0_101 (OpenJDK IcedTea 2.6.6).
Any ideas? I’m almost there
UPDATE: CPU is 100% maxed out on a simple move_files; something is fishy with the JRuby jar produced.
@alex, any idea why the JRuby + Sluice + Fog performance is abysmal? 200% CPU and still an order of magnitude slower than the non-JRuby counterpart. Given that we use the CloudFront collector in our old pipeline, there are a lot of files to move and EmrEtlRunner can’t handle it.
Bumped Sluice all the way to 0.4.0 but did not see any improvement.
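For context on why per-file moves hurt at CloudFront scale: the work pattern is a fixed pool of worker threads, each moving one object at a time, so per-file round-trip latency dominates once you have tens of thousands of small log files. A toy, self-contained model of that pattern — local files stand in for S3 objects, and the method name and structure are my own sketch, not Sluice’s actual implementation:

```ruby
require 'fileutils'

# Toy model of a Sluice-style threaded move: workers pull paths off a
# shared queue and move them one at a time. Against S3, each "move" is
# a copy + delete round-trip, so per-file overhead dominates at scale.
def move_files_threaded(paths, dest_dir, concurrency: 10)
  queue = Queue.new
  paths.each { |p| queue << p }
  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        path = begin
          queue.pop(true)   # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        FileUtils.mv(path, dest_dir)  # with S3: copy + delete, per file
      end
    end
  end
  workers.each(&:join)
end
```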
I can’t see why your JRuby fatjar would have different performance characteristics from the one we publish… What EC2 instance type are you running this from?
@alex, we used to run t2.small / t2.medium, which worked great for r77, but the fatjar of the old 0.9.9 EmrEtlRunner blows up on them. We’re now running it on a c4.large to see if the larger instance helps, but it’s not ideal.
There were a few hiccups around the exact versions of fog-core, mime-types, etc. to use, but I ended up pinning most of them to the same versions as r81.
I’ll keep testing and investigating…
Yes - conducting file moves from the orchestration box was an original design mistake.
We’ll be first replacing these file moves with S3DistCp, and later hopefully removing file moves altogether (in favor of manifests; writing an RFC on this soon).
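For reference, S3DistCp runs as an ordinary Hadoop step on the cluster itself, copying files in parallel, so the orchestration box stops shuttling objects one by one. A hedged sketch of what such a step definition could look like as a plain Ruby hash — the step name, bucket paths, and jar location are illustrative (the jar path shown is where classic EMR AMIs shipped S3DistCp, but check your AMI):

```ruby
# Hypothetical S3DistCp EMR step definition. Paths and names are
# invented for illustration; the --src/--dest/--deleteOnSuccess
# options are from the S3DistCp documentation.
def s3distcp_step(src, dest)
  {
    name: 'S3DistCp: stage raw logs',
    action_on_failure: 'TERMINATE_JOB_FLOW',
    hadoop_jar_step: {
      jar: '/home/hadoop/lib/emr-s3distcp-1.0.jar',
      args: [
        '--src',  src,
        '--dest', dest,
        '--deleteOnSuccess'   # delete sources after copy, i.e. a move
      ]
    }
  }
end
```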
New Relic charts from our test runs yesterday:
The first two “spikes” (~10% CPU, 500 MB RAM), lasting a few minutes each, are the r77 Great Auk EmrEtlRunner moving and processing Clojure collector logs.
The second two spikes (100% CPU, 1100 MB RAM), lasting 2.5 hours each, are an old 0.9.x build with JRuby support backported, processing CloudFront collector logs.
There are a lot more CloudFront collector logs, so that plays a role, but we don’t have the same problem when running the old build with MRI Ruby.
The diff is dead simple too: https://github.com/sspinc/snowplow/pull/5
I’m still suspecting MRI vs JRuby, with the Gemfile bundle potentially playing a role. I will now test the same codebase with MRI to see whether it is fast.
Side note: I wish EmrEtlRunner and StorageLoader were both Go binaries, like SqlRunner is.
@alex what version of JRuby do you use to create the releases?
Thanks @alex. It seems we were finally able to crack it. I was running JRuby/Warbler on Java 8, and it seems that was causing performance issues in the resulting jar when run on OpenJDK 7. I don’t know exactly which combination of runtime and dependencies was the culprit, but going back to a JRuby 1.7.x build running on Java 7 and using the project’s latest Gemfile.lock files did the trick.
That’s good to know @rgabo. BTW, we use Oracle JDK everywhere for production; OpenJDK has fallen quite far behind at this point…
Makes sense. The CLIs were built into an existing Docker image that had OpenJDK 7, but I guess it would be beneficial to migrate at some point.