Hi. I’m having issues getting data loaded into Redshift.
I’m using the Scala Stream Collector, Stream Enrich, the S3 Loader, and EmrEtlRunner (version 0.34.2). The events are appearing in the S3 bucket that EmrEtlRunner processes, however after the EmrEtlRunner job runs no data is loaded into Redshift, and no errors are logged or displayed.
The command I’m using to start the EmrEtlRunner process:
$ ./snowplow-emr-etl-runner run -c /home/ubuntu/configs/config_emr_etl_runner.yml -r resolver.js -t /home/ubuntu/targets
Output from the EmrEtlRunner command:
uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
uri:classloader:/gems/json-schema-2.7.0/lib/json-schema/util/array_set.rb:18: warning: constant ::Fixnum is deprecated
D, [2019-06-20T15:20:17.002891 #10172] DEBUG -- : Initializing EMR jobflow
D, [2019-06-20T15:20:19.525834 #10172] DEBUG -- : EMR jobflow j-16QCOF4O410G0 started, waiting for jobflow to complete...
I, [2019-06-20T16:00:28.183705 #10172] INFO -- : RDB Loader logs
D, [2019-06-20T16:00:28.195534 #10172] DEBUG -- : Downloading s3://snowplow-emr-log/rdb-loader/2019-06-20-15-20-17/fcb2400a-2fc6-40dd-9254-e28f7a6e8275 to /tmp/rdbloader20190620-10172-jzwhho
I, [2019-06-20T16:00:28.261527 #10172] INFO -- : AWS Redshift enriched events storage
I, [2019-06-20T16:00:28.271089 #10172] INFO -- : RDB Loader successfully completed following steps: [Discover]
D, [2019-06-20T16:00:28.464189 #10172] DEBUG -- : EMR jobflow j-16QCOF4O410G0 completed successfully.
I, [2019-06-20T16:00:28.472809 #10172] INFO -- : Completed successfully
EmrEtlRunner completes successfully, but no data is loaded into Redshift. I never see the message "RDB Loader successfully completed following steps: [Discover, Load, Analyze]", only "RDB Loader successfully completed following steps: [Discover]".
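As I understand it, a run that only reports [Discover] means RDB Loader didn’t find any shredded data to load, so I’ve been trying to confirm whether the shredder step writes anything at all. A minimal check against the buckets from my config below (a sketch; the run=... folder layout is what I’d expect the shredder to produce):

$ # Shredder output: I'd expect run=YYYY-MM-DD-hh-mm-ss/ folders containing atomic-events files
$ aws s3 ls --recursive s3://snowplow-emr-shredded-good/ | head -n 20
$ # Enriched data staged for the same run
$ aws s3 ls --recursive s3://snowplow-emr-enriched-good/ | head -n 20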
My configuration for the EMR ETL Runner:
aws:
  access_key_id: "** redacted **"
  secret_access_key: "** redacted **"
  s3:
    region: "us-west-2"
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: "s3://snowplow-emr-log"
      encrypted: false
      enriched:
        good: "s3://snowplow-emr-enriched-good"
        archive: "s3://snowplow-emr-enriched-archive"
        stream: "s3://snowplow-stream-enriched"
      shredded:
        good: "s3://snowplow-emr-shredded-good"
        bad: "s3://snowplow-emr-shredded-bad"
        errors:
        archive: "s3://snowplow-emr-shredded-archive"
    consolidate_shredded_output: false
  emr:
    ami_version: 5.9.0
    region: "us-west-2"
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement: "us-west-2b"
    ec2_subnet_id:
    ec2_key_name: "snowplow"
    security_configuration:
    bootstrap: []
    software:
      hbase:
      lingual:
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:
        volume_size: 50
        volume_type: gp2
        volume_iops: 400
        ebs_optimized: false
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: "thrift"
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: false
  output_compression: GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG
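Since enriched:stream is set, my understanding is that EmrEtlRunner runs in Stream Enrich mode, staging whatever the S3 Loader has written to that bucket and archiving it after the run. A quick sanity check on those buckets (sketch, bucket names taken from the config above):

$ # Enriched files written by the S3 Loader that are waiting to be picked up
$ aws s3 ls --recursive s3://snowplow-stream-enriched/ | tail -n 5
$ # Enriched files archived by previous runs
$ aws s3 ls --recursive s3://snowplow-emr-enriched-archive/ | tail -n 5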
My targets configuration file:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "** redacted **",
    "database": "sandbox",
    "port": 5439,
    "sslMode": "DISABLE",
    "username": "** redacted **",
    "password": "** redacted **",
    "roleArn": "arn:aws:iam::** redacted **:role/RedshiftLoadRole",
    "schema": "snowplow.events",
    "maxError": 1,
    "compRows": 20000,
    "sshTunnel": null,
    "purpose": "ENRICHED_EVENTS"
  }
}
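To rule out a connectivity or permissions problem on the Redshift side, I can also connect with the same credentials the loader uses and count rows directly. A minimal sketch (host and username are redacted above; <events_schema> stands for whatever schema the events table actually lives in, which should match the "schema" value in the target file):

$ # Connect as the loader user and count events; the .events table name is my assumption
$ psql -h <redshift-host> -p 5439 -d sandbox -U <username> -c "SELECT COUNT(*) FROM <events_schema>.events;"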
Thank you in advance for looking at this issue.