EmrEtlRunner not loading data into Redshift

Hi. I’m having issues getting data loaded into Redshift.

I’m using the Scala Stream Collector, Stream Enrich, the S3 Loader, and EmrEtlRunner (version 0.34.2). Events are appearing in the appropriate S3 bucket for EmrEtlRunner to process; however, after the EmrEtlRunner job runs, no data is loaded into Redshift and no errors are logged or displayed.

The command I’m using to start the EmrEtlRunner process:

$ ./snowplow-emr-etl-runner run -c /home/ubuntu/configs/config_emr_etl_runner.yml -r resolver.js -t /home/ubuntu/targets

Output from the EmrEtlRunner command:

uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
uri:classloader:/gems/json-schema-2.7.0/lib/json-schema/util/array_set.rb:18: warning: constant ::Fixnum is deprecated
D, [2019-06-20T15:20:17.002891 #10172] DEBUG -- : Initializing EMR jobflow
D, [2019-06-20T15:20:19.525834 #10172] DEBUG -- : EMR jobflow j-16QCOF4O410G0 started, waiting for jobflow to complete...
I, [2019-06-20T16:00:28.183705 #10172] INFO -- : RDB Loader logs
D, [2019-06-20T16:00:28.195534 #10172] DEBUG -- : Downloading s3://snowplow-emr-log/rdb-loader/2019-06-20-15-20-17/fcb2400a-2fc6-40dd-9254-e28f7a6e8275 to /tmp/rdbloader20190620-10172-jzwhho
I, [2019-06-20T16:00:28.261527 #10172] INFO -- : AWS Redshift enriched events storage
I, [2019-06-20T16:00:28.271089 #10172] INFO -- : RDB Loader successfully completed following steps: [Discover]
D, [2019-06-20T16:00:28.464189 #10172] DEBUG -- : EMR jobflow j-16QCOF4O410G0 completed successfully.
I, [2019-06-20T16:00:28.472809 #10172] INFO -- : Completed successfully

EmrEtlRunner completes successfully, but no data is loaded into Redshift. I never see the message “RDB Loader successfully completed following steps: [Discover, Load, Analyze]”, only “RDB Loader successfully completed following steps: [Discover]”.

My configuration for the EMR ETL Runner:

aws:
  access_key_id: "** redacted **"
  secret_access_key: "** redacted **"
  s3:
    region: "us-west-2"
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: "s3://snowplow-emr-log"
      encrypted: false
      enriched:
        good: "s3://snowplow-emr-enriched-good"
        archive: "s3://snowplow-emr-enriched-archive"
        stream: "s3://snowplow-stream-enriched"
      shredded:
        good: "s3://snowplow-emr-shredded-good"
        bad: "s3://snowplow-emr-shredded-bad"
        errors:
        archive: "s3://snowplow-emr-shredded-archive"
    consolidate_shredded_output: false
  emr:
      ami_version: 5.9.0
      region: "us-west-2"
      jobflow_role: EMR_EC2_DefaultRole
      service_role: EMR_DefaultRole
      placement: "us-west-2b"
      ec2_subnet_id:
      ec2_key_name: "snowplow"
      security_configuration:
      bootstrap: []
      software:
        hbase:
        lingual:
      # Adjust your Hadoop cluster below
      jobflow:
        job_name: Snowplow ETL
        master_instance_type: m1.medium
        core_instance_count: 2
        core_instance_type: m1.medium
        core_instance_ebs:
          volume_size: 50
          volume_type: gp2
          volume_iops: 400
          ebs_optimized: false
        task_instance_count: 0
        task_instance_type: m1.medium
        task_instance_bid: 0.015
      bootstrap_failure_tries: 3
      configuration:
        yarn-site:
          yarn.resourcemanager.am.max-attempts: "1"
        spark:
          maximizeResourceAllocation: "true"
      additional_info:
collectors:
  format: "thrift"
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: false
  output_compression: GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG

My targets configuration file:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {
        "name": "AWS Redshift enriched events storage",
        "host": "** redacted **",
        "database": "sandbox",
        "port": 5439,
        "sslMode": "DISABLE",
        "username": "** redacted **",
        "password": "** redacted **",
        "roleArn": "arn:aws:iam::** redacted **:role/RedshiftLoadRole",
        "schema": "snowplow.events",
        "maxError": 1,
        "compRows": 20000,
        "sshTunnel": null,
        "purpose": "ENRICHED_EVENTS"
    }
}

Thank you in advance for looking at this issue.

@opethian, do you get any shredded data at all? If no shredded data is produced, there is nothing to load into Redshift.
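
A quick way to check is to list the shredded output bucket from your config (a sketch assuming the AWS CLI is installed and configured; the bucket name is taken from the config above):

$ # List the most recent shredded output, if any
$ aws s3 ls s3://snowplow-emr-shredded-good/ --recursive | head -n 20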

Yes I’m seeing the shredded data.

Staging_Stream_Enrich step

  • S3 bucket snowplow-stream-enriched has data.

Shredding step

  • S3 bucket snowplow-emr-shredded-good has data in run=2019-06-24-16-45-20/atomic-events.

  • S3 bucket snowplow-emr-shredded-good has data in run=2019-06-24-16-45-20/shredded_types.

  • S3 bucket snowplow-emr-shredded-bad has a part-00000-REDACTED.txt file in run=2019-06-24-16-45-20 that is empty.

  • S3 bucket snowplow-emr-shredded-bad has a _SUCCESS file that is empty.

RDB_Load step

  • RDB Loader runs successfully with the message “RDB Loader successfully completed following steps: [Discover]”.

Archive Enriched step

  • S3 bucket snowplow-emr-enriched-archive has data in run=2019-06-24-16-45-20.

Archive Shredded step

  • S3 bucket snowplow-emr-shredded-archive has data in run=2019-06-24-16-45-20/atomic-events.

@opethian, I cannot see anything wrong. The only thing I spotted is the schema for the target configuration. Is your schema really snowplow.events and not just snowplow?

@opethian,

Is Redshift able to read data from your S3 bucket (role/credentials)?
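
For reference, here is a rough sketch of the kind of permissions the role referenced in roleArn usually needs so that Redshift can COPY from the shredded bucket (the bucket name is taken from the config above; this is an illustrative policy, not one from the Snowplow docs). The role also has to trust redshift.amazonaws.com and be attached to the Redshift cluster.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::snowplow-emr-shredded-good",
        "arn:aws:s3:::snowplow-emr-shredded-good/*"
      ]
    }
  ]
}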

@ihor Good catch. My schema isn’t snowplow.events, just snowplow; however, I still don’t see any data being loaded into Redshift.

Regarding @grzegorzewa’s question about whether Redshift has the proper roles/credentials to read from S3: I’m currently setting up and running everything as an administrator (which is really bad) until I get everything in Snowplow set up.

My next question: where would I look in AWS for errors caused by missing credentials or roles?

@opethian I am running into the same problem you are… Were you able to figure out a solution? It seems like the issue might be that the shredded events are empty, and the docs say the loader will only pick up non-empty shredded events… I don’t know why they are empty, though.

@darussian No, I’m still experiencing the same problem: the RDB Loader reports “successfully completed following steps: [Discover]”, but no data is loaded into Redshift.

I have gone down the road of permissions, which I still think is a possibility, and have ended up completely baffled.

@ihor based on the schema you mentioned, I created the events table in Redshift under the schema “atomic”.
So should I put “atomic.events” in redshift.json?

thanks!

@AllenWeieiei, no, it is always the events table for core Snowplow data. The schema property in the target configuration file holds only the schema name; in your case it will be just “atomic”.
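
In other words, in redshift.json the schema property should hold the schema name only, with the rest of the file unchanged:

    "schema": "atomic"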

Hi, @ihor

I think I am facing a similar issue. Without redshift.json, everything looks good: I get files in the shredded bucket, and the runner completes successfully.

After adding redshift.json and changing the command, it looks like the files from the S3 Loader bucket were shredded fine (I can find them in the S3 shredded bucket under the good folder, and all the subfolders are created successfully), but the RDB Loader is not working properly. I got the following error: packet_write_wait: Connection to 18.205.53.235 port 2225: Broken pipe

I lose the SSH connection to the server after the runner has been running for some time. Do you have any advice?

Thanks!

@ihor
Looks like it was fixed after adding a server alive interval to my SSH configuration.
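
For anyone else hitting the packet_write_wait error, the keep-alive is a client-side OpenSSH setting, set roughly like this in ~/.ssh/config (the values here are just example defaults):

# Send a keep-alive every 60 seconds; give up after 5 missed replies
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 5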

I got a new error:
F, [2019-10-25T18:32:00.513819 #12727] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-31G1GGSLCYVQY failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
j-31G1GGSLCYVQY: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2019-10-25 18:19:06 +0000 - ]

    1. [staging_stream_enrich] s3-dist-cp: Stream Enriched s3://enrich-s3-loader/ -> Enriched Staging S3: COMPLETED ~ 00:02:02 [2019-10-25 18:19:08 +0000 - 2019-10-25 18:21:11 +0000]
    2. [shred] s3-dist-cp: Enriched S3 -> HDFS: COMPLETED ~ 00:01:50 [2019-10-25 18:21:13 +0000 - 2019-10-25 18:23:03 +0000]
    3. [shred] spark: Shred Enriched Events: COMPLETED ~ 00:03:16 [2019-10-25 18:23:05 +0000 - 2019-10-25 18:26:22 +0000]
    4. [shred] s3-dist-cp: Shredded HDFS -> S3: COMPLETED ~ 00:01:38 [2019-10-25 18:26:24 +0000 - 2019-10-25 18:28:02 +0000]
    5. [shred] s3-dist-cp: Shredded HDFS _SUCCESS -> S3: COMPLETED ~ 00:01:24 [2019-10-25 18:28:04 +0000 - 2019-10-25 18:29:28 +0000]
    6. [rdb_load] Load SendGrid-Target Storage Target: FAILED ~ 00:01:09 [2019-10-25 18:29:35 +0000 - 2019-10-25 18:30:44 +0000]
    7. [archive_shredded] s3-dist-cp: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    8. [archive_enriched] s3-dist-cp: Enriched S3 -> Enriched Archive S3: CANCELLED ~ elapsed time n/a [ - ]):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:799:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:138:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
      org/jruby/RubyKernel.java:994:in `load'
      uri:classloader:/META-INF/main.rb:1:in `<main>'
      org/jruby/RubyKernel.java:970:in `require'
      uri:classloader:/META-INF/main.rb:1:in `(root)'
      uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in

Does this mean I need to set up logging for Hadoop?

@AllenWeieiei, as the logs suggest, the failure took place while loading data into Redshift. You need to investigate it further; I suspect your permissions are not set up correctly. Check the EMR logs and see if more information on the failure can be found.
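
A rough sketch of where to look, assuming the AWS CLI is configured (the log bucket name below is a placeholder for whatever you set as aws.s3.buckets.log in your config, and the exact key layout can vary by EMR version):

$ # RDB Loader writes its own log under rdb-loader/, as shown earlier in this thread
$ aws s3 ls s3://<your-emr-log-bucket>/rdb-loader/ --recursive

$ # Per-step EMR logs (stderr/syslog) for the failed jobflow
$ aws s3 cp s3://<your-emr-log-bucket>/j-31G1GGSLCYVQY/steps/ /tmp/emr-steps --recursive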

thank you so much!


Hi @ihor,

When I check the logs as the instructions say, the job shut down because a step failed. The screenshot shows that the loading step failed, and I checked its syslog; there is only a WARN there about credentials. Do you have any idea about this? In my redshift.json, the username and password are those of the master user of the target cluster, and the roleArn points to a role that has full access to my S3 and Redshift.

The WARN says I need to move a file; could you please provide more info about it?

Much appreciated!

@ihor

The WARN refers to a config file at a specific location. Is that location created by EmrEtlRunner? Is it configurable on my end?

Thanks!

@AllenWeieiei, that warning is not Snowplow-related. It is related to how you use the AWS CLI to run the pipeline. Indeed, the AWS CLI expects your credentials file to be located at ~/.aws/credentials: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.

Having said that, it is just a warning, and the job does run. You need to find out why it fails at the data load step.
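
For reference, the credentials file the AWS CLI looks for is a simple INI file along these lines (the profile name is the default one; keys are placeholders):

# ~/.aws/credentials
[default]
aws_access_key_id = ** redacted **
aws_secret_access_key = ** redacted **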

OK, thanks!

@ihor, when I checked the article about connecting to Redshift, I had a question: even if I don’t use SQL Workbench to connect to it, does the server (where I am running the collector, enrich, loader, and runner) need a JDBC driver installed?

@ihor, hello.

I got the process working, and data is loaded into Redshift. I also set up the collector and enrich to run under a supervisor, so now I have data in the collector stream and the enriched stream whenever new data comes in. When I checked the instructions, I only found steps about scheduling EmrEtlRunner, so I believe EmrEtlRunner should be scheduled to run periodically.

What about the S3 Loader? Does it need a schedule or some other setting?

Thanks!

@AllenWeieiei, indeed, EmrEtlRunner needs to be scheduled. We just set up cron ourselves, but you can use whatever is more appropriate for you.
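
For example, a crontab entry along these lines (the schedule and the absolute paths are illustrative, based on the command earlier in this thread):

# Run EmrEtlRunner every 4 hours and append output to a log file
0 */4 * * * /home/ubuntu/snowplow-emr-etl-runner run -c /home/ubuntu/configs/config_emr_etl_runner.yml -r /home/ubuntu/resolver.js -t /home/ubuntu/targets >> /home/ubuntu/logs/emr-etl-runner.log 2>&1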

The S3 Loader is a real-time component (a subscriber to the Kinesis stream), and as such it is “always on” once started. You do need to configure when the streamed events are flushed to S3 for batch processing (shredding and loading into Redshift); that is controlled by the buffer properties in the S3 Loader configuration file.
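
A sketch of what that buffer section looks like in the S3 Loader configuration (key names and values here are from memory and may differ slightly between loader versions):

# The buffer is flushed to S3 whenever any one of these thresholds is reached
buffer {
  byteLimit = 1048576    # flush after ~1 MB of accumulated records
  recordLimit = 500      # ...or after 500 records
  timeLimit = 300000     # ...or after 5 minutes (milliseconds)
}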