I have had the batch pipeline running successfully (CloudFront + EmrEtlRunner + Redshift) for a couple of months now, and recently tried out the --use-persistent-jobflow option in hopes of speeding up the rate at which I can load into Redshift. The option, however, seems to break the run by not creating either the [archive_enriched] or the [archive_shredded] EMR step.
Recreating the error looks something like this:
1. Run snowplow-emr with --use-persistent-jobflow for the first time, creating the cluster --> runs all steps and completes successfully.
2. Run snowplow-emr again with --use-persistent-jobflow --> re-uses the cluster, runs all steps up to [rdb_load], then runs [archive_shredded], then finishes. It does not create the [archive_enriched] step.
3. Run snowplow-emr a third time with --use-persistent-jobflow --> errors with "There seems to be an ongoing run of EmrEtlRunner: Cannot safely add enrichment step to jobflow, s3://snowplow-emr-etl/enriched/good/ is not empty".
This issue has been happening consistently over the couple of hours I have spent debugging it: EMR runs that reuse an existing cluster skip either the [archive_shredded] or the [archive_enriched] step.
Is this a known issue? Is --use-persistent-jobflow compatible with the CloudFront collector setup? I am on EmrEtlRunner version 0.34.1. Happy to provide any additional information about the issue.
@thedstrom, typically a persistent cluster is used when the data is processed in real time and is thus available more often for the batch job to prepare the files for the Redshift load. Using the CloudFront collector means the data is available for processing in EMR no more than once an hour.
It might be more sensible to configure your EMR cluster so that your data is processed faster, that is, to select suitable EC2 instance types and provide an appropriate Spark configuration.
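For reference, the relevant part of config.yml would look roughly like the sketch below; the instance types, counts, and Spark settings are only illustrative placeholders, so size them to your data volume:

```yaml
aws:
  emr:
    jobflow:
      master_instance_type: m4.large   # placeholder
      core_instance_count: 2           # placeholder
      core_instance_type: r4.xlarge    # placeholder
      task_instance_count: 0
      task_instance_type: m4.large
    configuration:
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "6"    # placeholder values - tune to the
        spark.executor.cores: "3"        # cores/memory of your core nodes
        spark.executor.memory: "8G"
```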
Having said that, I don't think the data produced by the CloudFront collector would be treated any differently on a persistent cluster (we do not use the CloudFront collector ourselves, to be honest).
The scenario you describe appears to be a temporary glitch whereby the enriched files failed to be moved by the AWS S3DistCp utility utilized by EmrEtlRunner. Recovery would mean archiving the enriched files manually, as the shredded files have already been archived.
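If it helps, manual archiving can be done with the AWS console or CLI, or with a small script; below is a rough boto3 sketch of the idea. The bucket names and prefixes are placeholders based on your error message, so double-check them (and the copy/delete logic) against your own config.yml before running anything.

```python
# Rough sketch only: move everything under enriched/good/ into the enriched
# archive location, mirroring what the [archive_enriched] S3DistCp step does.
# Bucket names and prefixes below are placeholders - adjust to your config.yml.
import boto3

s3 = boto3.resource("s3")

SOURCE_BUCKET = "snowplow-emr-etl"            # from your error message
SOURCE_PREFIX = "enriched/good/"
ARCHIVE_BUCKET = "snowplow-emr-etl-archive"   # placeholder: your enriched archive bucket
ARCHIVE_PREFIX = "enriched/"

src = s3.Bucket(SOURCE_BUCKET)
for obj in src.objects.filter(Prefix=SOURCE_PREFIX):
    # Keep the run=... part of the key under the archive prefix
    dest_key = ARCHIVE_PREFIX + obj.key[len(SOURCE_PREFIX):]
    s3.Object(ARCHIVE_BUCKET, dest_key).copy_from(
        CopySource={"Bucket": SOURCE_BUCKET, "Key": obj.key}
    )
    obj.delete()  # empty enriched/good/ so the next run can start cleanly
```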
One possible reason for failed archiving is a large number of empty files or directories that the S3DistCp utility leaves behind when the s3n or s3a protocol is used for your buckets in the EmrEtlRunner configuration file. It is advisable to clean those files up regularly.
This corresponds to R113. I would advise upgrading to the latest R115, as a similar issue was fixed in that release.
I suppose you get 3 warnings in a row that it is not possible to submit a step, and the job then continues its execution without the step when it should fail instead. Have you looked at stdout and stderr to understand the behavior?
@ihor and @egor Thanks for the quick responses. Looking back at the logs, yes, I was getting errors while submitting steps, which would explain why they were missing:
D, [2019-10-14T17:35:13.872683 #24145] DEBUG -- : Initializing EMR jobflow
W, [2019-10-14T17:35:38.052701 #24145] WARN -- : Got an error while trying to submit a jobflow step: [archive_enriched] s3-dist-cp: Enriched S3 -> Enriched Archive S3
W, [2019-10-14T17:35:40.077746 #24145] WARN -- : Got an error while trying to submit a jobflow step: [archive_enriched] s3-dist-cp: Enriched S3 -> Enriched Archive S3
W, [2019-10-14T17:35:44.113545 #24145] WARN -- : Got an error while trying to submit a jobflow step: [archive_enriched] s3-dist-cp: Enriched S3 -> Enriched Archive S3
I will upgrade my version and see if it helps. For cleaning up the empty files and directories left behind by S3DistCp, are you referring to the run=... files and folders left behind in enriched/good, shredded/bad, etc.?
Depending on how you reference the buckets in your config.yml, you will get either empty (zero-byte) files (s3n) or empty "run" directories (s3a) in the good bucket. Internally, we execute a Python script that locates zero-byte objects and deletes them from the bucket to keep it "clean". As those objects grow in number, they can affect EmrEtlRunner performance, since the S3DistCp utility has to scan more and more objects.
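Not our exact script, but the idea is roughly the following (the bucket name and prefixes are placeholders; point them at your own good buckets):

```python
# Minimal sketch: sweep the given prefixes and delete any zero-byte objects
# that S3DistCp has left behind. Bucket and prefixes are placeholders.
import boto3

s3 = boto3.client("s3")

BUCKET = "snowplow-emr-etl"                      # placeholder
PREFIXES = ["enriched/good/", "shredded/good/"]  # placeholder prefixes to sweep

paginator = s3.get_paginator("list_objects_v2")
for prefix in PREFIXES:
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Size"] == 0:
                print("Deleting zero-byte object:", obj["Key"])
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```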