Hi, I’m trying to set up the SnowflakeDB Loader, but it’s currently failing in the “Snowflake Transformer” step (“Staging enriched data” is successful).
I followed the Quick start AWS guide and managed to get everything working with the Postgres loader, and am now trying to replace it with Snowflake (using the S3 loader to get data into my bucket).
Some preliminary thoughts:
I’m a bit confused about the setup for the roleArn.
The setup guide (Setup - Snowplow Docs) says to follow the Snowflake documentation for setting up a storage integration.
I did that (which I have done before for other services and got working).
But then the guide says to create another role with similar statements but a different trust relationship, one which doesn’t include the Snowflake role, only EMR_EC2_DefaultRole, and to use this for the roleArn.
Obviously this won’t work, as that role cannot be assumed by Snowflake.
So I’m wondering whether this policy should be combined with the Snowflake one, and if so, whether the trust relationship needs to include both EMR_EC2_DefaultRole and my Snowflake role? (I did the latter for now, to be able to run the setup.)
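For reference, the combined trust relationship I ended up with looks roughly like this (the account ID, Snowflake IAM user ARN, and external ID below are placeholders for the values from my account and storage integration):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<MY_ACCOUNT_ID>:role/EMR_EC2_DefaultRole"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN from DESC INTEGRATION>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID from DESC INTEGRATION>"
        }
      }
    }
  ]
}
```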
The setup also mentions: “Don’t forget to use 1-0-3 version of configuration schema.” I’m not sure where I’m supposed to use this; it doesn’t seem related to the loader itself. Or is it just for the resolver.json? If so, I suppose it’s already set up.
I also created a DynamoDB table, which I’m unsure is correct. This is its Terraform config (the resource label is just my local name):
resource "aws_dynamodb_table" "snowflake_event_manifest" {
  name           = "snowflake-event-manifest"
  billing_mode   = "PROVISIONED"
  read_capacity  = 20
  write_capacity = 20
  hash_key       = "RunId"

  attribute {
    name = "RunId"
    type = "S"
  }
}
But it has an item count of 0.
Then, to run everything I did this:
java -jar snowplow-snowflake-loader-0.8.2.jar setup --config ./config.json --resolver ./resolver.json
and then:
./dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json
This is my config.json:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-3",
  "data": {
    "name": "Snowflake Config",
    "awsRegion": "eu-west-1",
    "auth": {
      "integrationName": "SNOWPLOW_S3_INTEGRATION",
      "roleArn": "<ROLE_ARN>",
      "sessionDuration": 900
    },
    "manifest": "snowflake-event-manifest",
    "snowflakeRegion": "eu-west-1",
    "database": "SNOWPLOW",
    "input": "s3://MYBUCKET/archive/enriched/",
    "stage": "STAGE_SNOWPLOW_S3_INTEGRATION",
    "badOutputUrl": "s3://MYBUCKET/archive/snowflake/badrow/",
    "stageUrl": "s3://MYBUCKET/archive/snowflake/transformed/",
    "warehouse": "snowplow_warehouse",
    "schema": "atomic",
    "account": "MYSNOWFLAKEACCOUNT",
    "username": "SNOWPLOW_LOADER",
    "password": {
      "ec2ParameterStore": {
        "parameterName": "/analytics/snowflake/snowplow/loader"
      }
    },
    "maxError": 1,
    "purpose": "ENRICHED_EVENTS"
  }
}
And this is my resolver.json:
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [
          "com.snowplowanalytics"
        ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "MY-COMPANY iglu schema repository",
        "priority": 1,
        "vendorPrefixes": [
          "se.MY-COMPANY"
        ],
        "connection": {
          "http": {
            "uri": "http://MY-COMPANY-iglu-schemas.s3-website-eu-west-1.amazonaws.com"
          }
        }
      }
    ]
  }
}
And this is my events_manifest.json:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/2-0-0",
  "data": {
    "name": "Snowflake deduplication config",
    "auth": null,
    "awsRegion": "eu-west-1",
    "dynamodbTable": "snowflake-event-manifest",
    "id": "MY ID",
    "purpose": "EVENTS_MANIFEST"
  }
}
And this is my cluster.json:
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "dataflow-runner - snowflake transformer",
    "logUri": "s3://MYBUCKET/logs/",
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "roles": {
      "jobflow": "EMR_EC2_DefaultRole",
      "service": "EMR_DefaultRole"
    },
    "ec2": {
      "amiVersion": "5.9.0",
      "keyName": "MYKEYNAME",
      "location": {
        "vpc": {
          "subnetId": "MYSUBNET"
        }
      },
      "instances": {
        "master": {
          "type": "m2.xlarge"
        },
        "core": {
          "type": "m2.xlarge",
          "count": 1
        },
        "task": {
          "type": "m1.medium",
          "count": 0,
          "bid": "0.015"
        }
      }
    },
    "tags": [],
    "bootstrapActionConfigs": [],
    "configurations": [
      {
        "classification": "core-site",
        "properties": {
          "io.file.buffer.size": "65536"
        }
      },
      {
        "classification": "mapred-site",
        "properties": {
          "mapreduce.user.classpath.first": "true"
        }
      },
      {
        "classification": "yarn-site",
        "properties": {
          "yarn.resourcemanager.am.max-attempts": "1"
        }
      },
      {
        "classification": "spark",
        "properties": {
          "maximizeResourceAllocation": "true"
        }
      }
    ],
    "applications": [
      "Hadoop",
      "Spark"
    ]
  }
}
And this is my playbook.json (I removed "--conf", "spark.hadoop.mapreduce.job.outputformat.class=com.snowplowanalytics.snowflake.transformer.S3OutputFormat" from the transformer step, as recommended in Upgraded to snowflake loader 0.8.0 but data is not loaded, but it made no difference):
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "Staging enriched data",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src",
          "s3://MYBUCKET/enriched/",
          "--dest",
          "s3://MYBUCKET/archive/snowflake/transformed/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
          "--s3Endpoint",
          "s3.amazonaws.com",
          "--srcPattern",
          ".*\\.gz",
          "--deleteOnSuccess",
          "--s3ServerSideEncryption"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "Snowflake Transformer",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--deploy-mode",
          "cluster",
          "--class",
          "com.snowplowanalytics.snowflake.transformer.Main",
          "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.8.2.jar",
          "--config",
          "{{base64File "./config.json"}}",
          "--resolver",
          "{{base64File "./resolver.json"}}",
          "--events-manifest",
          "{{base64File "./events_manifest.json"}}"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "Snowflake Loader",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.8.2.jar",
        "arguments": [
          "load",
          "--base64",
          "--config",
          "{{base64File "./config.json"}}",
          "--resolver",
          "{{base64File "./resolver.json"}}"
        ]
      }
    ],
    "tags": []
  }
}
I’m posting the error logs in a reply, as this post became too long…