Hi everyone, I’m quite new to Snowplow. I’ve managed to get the enriched/good stream sinking to S3, and now I’m working on the next step: loading that data into Redshift.
I’ve got a couple of questions:
- How do I schedule the EMR jobs that run S3DistCp and the Shredder? Is it just a time-based cron job? (I’ve sketched what I think the setup looks like just below these questions.)
- I’m getting an error when deploying the snowplow/snowplow-rdb-loader:1.1.0 Docker image to Elastic Beanstalk.
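For the first question, my understanding from the docs is that the batch Shredder is submitted as a transient EMR job (e.g. via dataflow-runner) and that you trigger it on a schedule yourself, for example with a cron entry that calls dataflow-runner. Something like the playbook sketch below is what I’ve pieced together; the Shredder jar location, its main class, the schema version and the run= timestamp are all placeholders I have not verified, so please correct me:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "us-west-2",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp: copy enriched data into a new run= folder of the archive",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "s3://{{path_to_enriched}}/",
          "--dest", "s3://{{path_to_enriched}}/archive/run={{timestamp}}/",
          "--srcPattern", ".*",
          "--deleteOnSuccess"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--deploy-mode", "cluster",
          "--class", "{{shredder main class from the setup guide}}",
          "{{s3 path to the snowplow-rdb-shredder-1.1.0 jar}}",
          "--iglu-config", "{{resolver.json base64 encoded}}",
          "--config", "{{config.hocon base64 encoded}}"
        ]
      }
    ],
    "tags": []
  }
}

Is that the intended setup, or is there a managed way to schedule this that I’m missing?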
Here is my config.hocon (should the id be created by my Iglu Server, or can it be any UUID?):
{
  # Human-readable identifier, can be random
  "name": "{{ custom name}}",
  # Machine-readable unique identifier, must be a UUID
  "id": "{{id created on iglu server with create permission}}",
  # Data Lake (S3) region
  "region": "us-west-2",
  # SQS topic name used by Shredder and Loader to communicate
  "messageQueue": "{{name_of_sqs}}.fifo",
  # Shredder-specific configs
  "shredder": {
    # "batch" for Spark job and "stream" for fs2 streaming app
    "type": "batch",
    # For batch: path to enriched archive (must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories) for S3 input
    "input": "s3://{{path_to_enriched}}/archive/",
    # For stream: appName, streamName, region triple for kinesis
    #"input": {
    #  # kinesis and file are the only options for stream shredder
    #  "type": "kinesis",
    #  # KCL app name - a DynamoDB table will be created with the same name
    #  "appName": "acme-rdb-shredder",
    #  # Kinesis Stream name
    #  "streamName": "enriched-events",
    #  # Kinesis region
    #  "region": "us-east-1",
    #  # Kinesis position: LATEST or TRIM_HORIZON
    #  "position": "LATEST"
    #},
    # For stream shredder: frequency to emit loading finished message - 5, 10, 15, 20, 30, 60 etc minutes
    #"windowing": "10 minutes",
    # Path to shredded archive
    "output": {
      # Path to shredded output
      "path": "s3://{{path_to_enriched}}/shredded/",
      # Shredder output compression, GZIP or NONE
      "compression": "GZIP"
    }
  },
  # Schema-specific format settings (recommended to leave all three groups empty and use TSV as the default)
  "formats": {
    # Format used by default (TSV or JSON)
    "default": "TSV",
    # Schemas to be shredded as JSONs, corresponding JSONPath files must be present. Automigrations will be disabled
    "json": [ ],
    # Schemas to be shredded as TSVs, presence of the schema on Iglu Server is necessary. Automigrations enabled
    "tsv": [ ],
    # Schemas that won't be loaded
    "skip": [ ]
  },
  # Optional. S3 path that holds JSONPaths
  #"jsonpaths": "s3://bucket/jsonpaths/",
  # Warehouse connection details
  "storage": {
    # Database, redshift is the only acceptable option
    "type": "redshift",
    # Redshift hostname
    "host": "{redshifhostname}.redshift.amazonaws.com",
    # Database name
    "database": "dev",
    # Database port
    "port": 5439,
    # AWS Role ARN allowing Redshift to load data from S3
    "roleArn": "{rolearn}/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift",
    # DB schema name
    "schema": "atomic",
    # DB user with permissions to load data
    "username": "{{username}}",
    # DB password
    "password": "{{pass}}",
    # Custom JDBC configuration
    "jdbc": {"ssl": true},
    # MAXERROR, amount of acceptable loading errors
    "maxError": 10
  },
  # Additional steps. analyze, vacuum and transit load are valid values
  "steps": ["vacuum"],
  # Observability and reporting options
  "monitoring": {
    # Snowplow tracking (optional)
    "snowplow": {
      "appId": "snowplow",
      "collector": "{{collector}}.elasticbeanstalk.com"
    },
    # Optional, for tracking runtime exceptions
    "sentry": {
      "dsn": ""
    },
    # Optional, configure how metrics are reported
    #"metrics": {
    #  # Optional, send metrics to StatsD server
    #  "statsd": {
    #    "hostname": "localhost",
    #    "port": 8125,
    #    # Any key-value pairs to be tagged on every StatsD metric
    #    "tags": {
    #      "app": "rdb-loader"
    #    }
    #    # Optional, override the default metric prefix
    #    # "prefix": "snowplow.rdbloader."
    #  },
    #  # Optional, print metrics on stdout (with slf4j)
    #  "stdout": {
    #    # Optional, override the default metric prefix
    #    # "prefix": "snowplow.rdbloader."
    #  }
    #}
  }
}
My resolver.json (is this file necessary?):
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "{{uri_to_my_iglu_server}}.elasticbeanstalk.com:8080/api/",
            "apikey": "{{my api key}}"
          }
        }
      }
    ]
  }
}
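Related to that: should the resolver list Iglu Central and my own Iglu Server as two separate repositories? My guess is something like the sketch below, where the com.mycompany prefix, the priorities and the URIs are just illustrative placeholders on my part:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "My Iglu Server",
        "priority": 1,
        "vendorPrefixes": [ "com.mycompany" ],
        "connection": {
          "http": {
            "uri": "http://{{uri_to_my_iglu_server}}.elasticbeanstalk.com:8080/api",
            "apikey": "{{my api key}}"
          }
        }
      }
    ]
  }
}

Or is pointing everything at my own server, as in my current resolver, good enough?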
My docker-compose.yaml to run it on Elastic Beanstalk:
services:
  rdb-loader:
    container_name: rdb-loader
    image: snowplow/snowplow-rdb-loader:1.1.0
    command: [
      "--config", "{{config.hocon base64 encoded}}",
      "--iglu-config", "{{resolver base64 encoded}}"
    ]
And I get this error when deploying:
Attaching to rdb-loader
rdb-loader | [ioapp-compute-0] WARN io.sentry.dsn.Dsn - *** Couldn't find a suitable DSN, Sentry operations will do nothing! See documentation: https://docs.sentry.io/clients/java/ ***
rdb-loader | [ioapp-compute-0] WARN io.sentry.DefaultSentryClientFactory - No 'stacktrace.app.packages' was configured, this option is highly recommended as it affects stacktrace grouping and display on Sentry. See documentation: https://docs.sentry.io/clients/java/config/#in-application-stack-frames
rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Sentry has been initialised at
rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - RDB Loader 1.1.0 [Torre Redshift Loader] has started. Listening snowplow-rdb-loader-queue.fifo
rdb-loader | [ioapp-compute-0] ERROR com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Loader shutting down
rdb-loader | java.lang.NullPointerException
rdb-loader | at com.amazon.redshift.core.jdbc42.S42NotifiedConnection.setAutoCommit(Unknown Source)
rdb-loader | at doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$setAutoCommit$1(kleisliinterpreter.scala:800)
rdb-loader | at doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$setAutoCommit$1$adapted(kleisliinterpreter.scala:800)
rdb-loader | at doobie.free.KleisliInterpreter.$anonfun$primitive$2(kleisliinterpreter.scala:109)
rdb-loader | at blockOn$extension @ doobie.free.KleisliInterpreter.$anonfun$primitive$1(kleisliinterpreter.scala:112)
rdb-loader | at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
rdb-loader | at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader | at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
rdb-loader | at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader | at bracketCase @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
rdb-loader | at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
rdb-loader | at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader | at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader | at bracketCase @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
rdb-loader | at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
rdb-loader | at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader | at use @ com.snowplowanalytics.snowplow.rdbloader.Main$.run(Main.scala:36)
rdb-loader | [cats-effect-blocker-0] INFO org.http4s.client.PoolManager - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
rdb-loader exited with code 1
Any help would be truly appreciated! Thanks!