DescribeJobFlows deprecated

rgabo · July 7, 2016, 6:25pm

Hello Snowplowers,

Amazon is deprecating the DescribeJobFlows EMR API that old versions of EmrEtlRunner / elasticity have been using. We are in the unfortunate situation that our old pipeline is still making these API calls.

Do you have any suggestions on how to move away from the DescribeJobFlows API? Has anyone had success upgrading elasticity without upgrading EmrEtlRunner itself? We’ve already bumped elasticity once for EMR role support, but this change looks much larger.

Thanks,
Gabor

alex · July 7, 2016, 8:13pm

Hi @rgabo, is there a reason why you can’t use the latest version of EmrEtlRunner with your old pipeline?

rgabo · July 8, 2016, 11:21am

@alex, my concern was that the new EmrEtlRunner/StorageLoader does shredding, and the version we have in place for our legacy pipeline does not do any of that.

Are you saying that I should be able to configure the latest EmrEtlRunner (which we use for our new pipeline) to do enrich and storage load without shredding?

I realize that the version of hadoop-enrich is configurable, but how will StorageLoader behave?

alex · July 8, 2016, 11:45am

I’d suggest giving it a try and reporting back! You need a new version of EmrEtlRunner, but you can always play around with an older version of StorageLoader together with EmrEtlRunner’s --skip shred option.

Even if you end up needing some manual Boto scripting between EmrEtlRunner and StorageLoader, that’s probably the least painful path to take…
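For anyone scripting against EMR directly, here is a minimal sketch of reading cluster and step status without DescribeJobFlows. It relies on the DescribeCluster and ListSteps operations, which AWS documents as replacements for the deprecated call. The helper name `summarize_cluster` and its return shape are hypothetical, and this is not how EmrEtlRunner or Elasticity implement it.

```python
# Sketch only: summarize_cluster is a hypothetical helper, not part of
# EmrEtlRunner or Elasticity. `emr` is expected to behave like a boto3
# EMR client (describe_cluster / list_steps).

def summarize_cluster(emr, cluster_id):
    """Return the cluster state and per-step states without DescribeJobFlows."""
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]

    # ListSteps is paginated: keep following Marker until it is absent.
    steps = []
    kwargs = {"ClusterId": cluster_id}
    while True:
        page = emr.list_steps(**kwargs)
        steps.extend(page.get("Steps", []))
        marker = page.get("Marker")
        if not marker:
            break
        kwargs["Marker"] = marker

    return {
        "id": cluster["Id"],
        "state": cluster["Status"]["State"],
        "steps": [(s["Name"], s["Status"]["State"]) for s in steps],
    }


# Usage (assumes boto3 is installed and AWS credentials are configured;
# the region and cluster id below are placeholders):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   print(summarize_cluster(emr, "j-XXXXXXXXXXXXX"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.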

rgabo · July 29, 2016, 11:55am

hey @alex,

We’ve tried, but we’re on such an old version that it was easier to backport the necessary changes.

Bumping Elasticity on the old EmrEtlRunner was actually quite straightforward, and we are going to stop using our old Snowplow pipeline anyway, so hopefully we won’t need to maintain it for too long.

I was also hoping to package up EmrEtlRunner with warble, but for some reason I can only get it to produce a snowplow-emr-etl-runner.jar, not a snowplow-emr-etl-runner that is executable without java -jar.

Am I missing a step?

Thanks,
Gabor

alex · July 29, 2016, 12:10pm

Ah, that’s here: https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/Rakefile#L26

rgabo · July 29, 2016, 1:36pm

rgabo · August 3, 2016, 10:19am

Getting closer to a working solution: I was able to backport JRuby/Warble support to build executable fat jars and deploy them the way recent releases are deployed (a zip file containing executables). Btw we call them “Snowplow Classic” and not legacy

While functionally everything works, sluice and staging/archiving are extremely slow (multiple seconds per file even though it’s threaded). I bumped Sluice to 0.2.2; could there be any reason for the slowdown? When building/compiling I used JRuby 1.7.4 (Ruby 1.9.3). When running, the Docker container has Java 1.7.0_101 (OpenJDK IcedTea 2.6.6).

Any ideas? I’m almost there

UPDATE: CPU is 100% maxed out on a simple move_files, something is fishy with JRuby/jar produced.

rgabo · August 8, 2016, 7:36am

@alex any idea why the JRuby + sluice + fog performance is abysmal? 200% CPU and still an order of magnitude slower than the non-JRuby counterpart. Given that we use the CloudFront collector in our old pipeline, it’s a lot of files we need to move and EmrEtlRunner can’t handle it.

Bumped Sluice all the way to 0.4.0 but did not see improvements.

alex · August 8, 2016, 8:21am

I can’t see why your JRuby fatjar would have different performance characteristics from the one we publish… What EC2 instance type are you running this from?

rgabo · August 8, 2016, 4:05pm

@alex we used to run t2.small / t2.medium, which worked great for r77, but when trying to fatjar an old 0.9.9 EmrEtlRunner, it blows up. We’re now running it on c4.large to see if the larger instance improves things, but it’s not ideal.

There were a few hiccups around the exact versions of fog-core, mime-types, etc to use but I ended up pinning most to the same version as r81.

I’ll keep testing and investigating…

alex · August 8, 2016, 4:35pm

Yes - conducting file moves from the orchestration box was an original design mistake.

We’ll be first replacing these file moves with S3DistCp, and later hopefully removing file moves altogether (in favor of manifests; writing an RFC on this soon).
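As a rough illustration of what moving files on the cluster rather than on the orchestration box could look like, here is a sketch that builds an S3DistCp step definition for submission to a running cluster. It assumes an EMR release where `command-runner.jar` can launch `s3-dist-cp` (4.x and later); older AMIs would need the S3DistCp jar path instead. The helper name and bucket paths are hypothetical, and this is not Snowplow’s implementation.

```python
# Sketch only: s3distcp_move_step is a hypothetical helper. It assumes an
# EMR release (4.x or later) where command-runner.jar can run s3-dist-cp.

def s3distcp_move_step(name, src, dest, src_pattern=None):
    """Build an EMR step that copies src to dest and deletes the source on success."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest, "--deleteOnSuccess"]
    if src_pattern:
        # Optional regex that limits which source files are copied.
        args += ["--srcPattern", src_pattern]
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Usage (assumes boto3, AWS credentials and an existing cluster; the bucket
# names and cluster id are placeholders):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   step = s3distcp_move_step("Stage raw logs",
#                             "s3://example-in/", "s3://example-processing/")
#   emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The copy then runs in parallel on the cluster nodes, so the orchestration box only submits the step and polls for its status.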

rgabo · August 9, 2016, 10:33am

New Relic charts from our test runs yesterday:

The first two “spikes” (~10% CPU, 500MB RAM), lasting a few minutes each, are r77 Great Auk EmrEtlRunner moving Clojure collector logs and processing them.

The second two spikes (100% CPU, 1100MB RAM), lasting 2.5 hours each, are an old 0.9.x build with JRuby support backported, processing CloudFront collector logs.

There are a lot more CloudFront collector logs so that plays a role, but we don’t have the same problem when running the old build with MRI Ruby.

The diff is dead simple too: https://github.com/sspinc/snowplow/pull/5

I’m still suspecting MRI vs JRuby with the Gemfile bundle potentially playing a role. I will now test the same codebase with MRI to see if that is fast.

rgabo · August 9, 2016, 10:53am

Side note: I wish EmrEtlRunner and StorageLoader were both Go binaries like SqlRunner is

alex · August 9, 2016, 11:43am

StorageLoader will most probably be moving into Spark and Spark Streaming, per our RFC “Migrating the Snowplow batch jobs from Scalding to Spark”.
EmrEtlRunner is splitting into snowplowctl and Dataflow Runner, a Golang app, per our RFC “Splitting EmrEtlRunner into snowplowctl and Dataflow Runner”; you can follow our progress in the dataflow-runner repo.

rgabo · August 9, 2016, 1:27pm

@alex what version of JRuby do you use to create the releases?

alex · August 9, 2016, 5:12pm

jruby 1.7.19

rgabo · August 10, 2016, 2:07pm

Thanks @alex. It seems we were finally able to crack it. I was running JRuby / warble on Java 8, and it seems that was causing performance issues in the resulting jar when running on openjdk-7. I do not know exactly which combination of runtime and dependencies was the culprit, but going back to JRuby 1.7.20.1 running on Java 7 and using the latest Gemfile.lock files for the project did the trick.
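One way to avoid repeating this mismatch is a small pre-build check that the JDK used for packaging has the same major version as the JDK in the runtime image. The sketch below is only an illustration: the function names are hypothetical, and it enforces the combination that worked here rather than explaining the root cause.

```python
import re

# Sketch only: java_major and check_build_jdk are hypothetical helpers for a
# build script; they are not part of Snowplow, JRuby or Warbler.

def java_major(version):
    """Map a Java version string to its major number ('1.7.0_101' -> 7)."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if not match:
        raise ValueError("unrecognised Java version: %r" % version)
    first = int(match.group(1))
    # Legacy scheme: 1.7 means Java 7, 1.8 means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def check_build_jdk(build_version, runtime_version):
    """Raise if the packaging JDK and the runtime JDK differ in major version."""
    build, runtime = java_major(build_version), java_major(runtime_version)
    if build != runtime:
        raise RuntimeError(
            "building on Java %d but the runtime image has Java %d" % (build, runtime)
        )
    return build


# Usage: compare the version reported by `java -version` on the build host
# with the one inside the Docker image before running the packaging task.
check_build_jdk("1.7.0_79", "1.7.0_101")
```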

alex · August 10, 2016, 2:09pm

That’s good to know @rgabo - BTW we use oraclejdk everywhere for production; openjdk has fallen so far behind by this point…

rgabo · August 10, 2016, 2:17pm

Makes sense, the CLIs were built into an existing Docker image that had openjdk-7, but I guess it would be beneficial to migrate at some point.
