I downloaded snowplow_emr_r89_plain_of_jars.zip and tried to use JDecompiler to get the classes in snowplow-emr-etl-runner.jar. I can see the file names in JD's left pane, but the right pane shows no contents for any of those files.
Sometimes our data volume is much larger than usual, and the pipeline doesn’t seem to scale up well (at least in certain steps; that might not be a Snowplow problem but rather something in our configuration). Right now emr-etl-runner/storage-loader is basically a black box to us: we rely on Jenkins’ logs to guess what is going on there. If we had all the source code, we could see exactly what each step of the pipeline does and what each log line really means, and then we might be able to skip or tune certain steps to get the performance we need.
In case you didn’t know, both EmrEtlRunner and StorageLoader are open-source applications, so you don’t need to reverse-engineer them.
More importantly, they are just thin wrappers and don’t do any of the heavy lifting. All of your data is processed on the EMR cluster, which you need to scale manually according to your needs.
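As for the empty pane in the decompiler: a Java decompiler can only render compiled `.class` files, so a quick first check is to see what kinds of files the jar actually holds. Below is a minimal sketch using only Python's standard `zipfile` module. The helper name `count_entries_by_extension` is made up for illustration, and I have not verified what this particular jar contains, so treat it as a diagnostic aid rather than an explanation.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_entries_by_extension(jar_path):
    """Count the files inside a jar/zip archive, grouped by file extension.

    jar_path may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration; not part of any Snowplow tooling.
    """
    counts = Counter()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith("/"):
                continue  # directory entry, not a file
            # Archive member names always use forward slashes.
            suffix = PurePosixPath(name).suffix.lower() or "(none)"
            counts[suffix] += 1
    return counts
```

Calling `count_entries_by_extension("snowplow-emr-etl-runner.jar").most_common(5)` lists the five most frequent extensions. If `.class` files are only a small share of the entries, the decompiler has little to show for the rest, and reading the published source is the more practical route.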