I have a successful pipeline loading data into Redshift without issue. I am now trying to add the Snowflake Transformer/Loader (using the 0.4.2 jars) to also pipe the data into Snowflake.
I ran the loader's setup task, which completed successfully and created the warehouse, events table, etc. without error.
I then got the cluster.json and playbook.json config files to the point where kicking off dataflow-runner successfully runs both the transform and load steps:
./dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json
INFO[0000] Launching EMR cluster with name 'dataflow-runner - snowflake transformer'...
INFO[0000] EMR cluster is in state STARTING - need state WAITING, checking again in 30 seconds...
INFO[0451] EMR cluster is in state STARTING - need state WAITING, checking again in 30 seconds...
INFO[0481] EMR cluster is in state STARTING - need state WAITING, checking again in 30 seconds...
INFO[0511] EMR cluster launched successfully; Jobflow ID: j-3L41FF6YEO99H
INFO[0511] Successfully added 2 steps to the EMR cluster with jobflow id 'j-3L41FF6YEO99H'...
INFO[0662] Step 'Snowflake Transformer' with id 's-X2L179XZXDA8' completed successfully
INFO[0707] Step 'Snowflake Loader' with id 's-180LKK3PA3KGS' completed successfully
INFO[0707] Terminating EMR cluster with jobflow id 'j-3L41FF6YEO99H'...
INFO[0707] EMR cluster is in state TERMINATING - need state TERMINATED, checking again in 30 seconds...
INFO[0887] EMR cluster is in state TERMINATING - need state TERMINATED, checking again in 30 seconds...
INFO[0918] EMR cluster terminated successfully
INFO[0918] Transient EMR run completed successfully
The log from the loader step in EMR shows:
Loading...
Preliminary checks passed
Warehouse snowplow_wh_staging resumed
New column [contexts_com_snowplowanalytics_snowplow_ua_parser_context_1] has been added
Folder [run=2019-04-24-17-19-28] from stage [snowplow_stage] has been loaded
Folder [run=2019-04-24-20-03-41] from stage [snowplow_stage] has been loaded
Success. Exiting...
Yet there were no records written into the events table in Snowflake. Is there something I’m missing here? What might cause the tasks to run without errors, but not actually load any data?
In the meantime I have checked the credentials in the config and have also added them explicitly to the stage in Snowflake, but I am continuing to get the same error saying the remote file cannot be accessed. Is there anything else I can do to debug this, or something I might have missed in setting up the credentials correctly?
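For reference, the credentials were set on the stage roughly like this (stage name taken from the loader log; the key values are placeholders):

ALTER STAGE snowplow_stage
  SET CREDENTIALS = (AWS_KEY_ID = '<access-key-id>' AWS_SECRET_KEY = '<secret-access-key>');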
The message was from the Snowflake query history; nothing about the error was output in any of the Snowplow logs.
That column was added when I ran the setup process. There are 129 columns in the events table in Snowflake, which appears correct from what I can see.
Yes, that output was from stdout. There was nothing output to stderr from the EMR step. The syslog had the following:
2019-04-26 15:51:18,393 WARN shadeaws.profile.path.cred.CredentialsLegacyConfigLocationProvider (main): Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
Let me know if there is anything else I can provide to help!
contexts_com_snowplowanalytics_snowplow_ua_parser_context_1 was added because the transformer found this context in your enriched data and then communicated that fact to the loader via the DynamoDB manifest. This means that your Snowflake user certainly has enough permissions to alter the table.
I have two main hypotheses about what's going on:
The first is that there's a mismatch between the stageUrl in the config and the actual stage URL, which you can check with something like SHOW STAGES. Can you verify that the data is indeed present in the path the Loader tries to load it from?
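Something like the following should show the URL the stage actually points at (the schema name here is just a placeholder for wherever the setup process created the stage):

-- list the stages in the loader's schema, including their URLs
SHOW STAGES IN SCHEMA my_schema;
-- or inspect the stage directly and compare its url property with stageUrl in the config
DESC STAGE my_schema.snowplow_stage;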
The second hypothesis is that Snowflake cannot authenticate itself to load data from the stage you created via the setup process. Could you please double-check that you followed the role creation instructions precisely?
If that doesn't work, I'd recommend trying static credentials (an AWS access key and secret key) in the loader config instead of the role.
Then check that you can use those same credentials with the AWS CLI to list and fetch data from the stageUrl.
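For example, something along these lines, with the placeholders replaced by the keys from the config and the bucket/prefix from stageUrl:

export AWS_ACCESS_KEY_ID=<access-key-id>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>

# list the transformed run folders under the stage URL
aws s3 ls s3://<stage-bucket>/<stage-prefix>/

# fetch one of the listed files to confirm read access
aws s3 cp s3://<stage-bucket>/<stage-prefix>/run=2019-04-24-17-19-28/<some-file> .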
It's more likely the former than the latter, but I'm puzzled, because since 0.4.0 the Loader should abort the process on a stage mismatch; that behavior was only possible with pre-0.4.0 loaders.
Thanks again for sticking with this. I had the same ideas for what might be wrong and have been digging into both of those. Here is what I tried:
I verified the URL in the stage and it is definitely correct.
I ran an ALTER STAGE command to set the credentials on the stage directly in Snowflake.
I copied the commands that the loader ran into a Snowflake worksheet and ran them manually. The only thing I changed was removing the CREDENTIALS section of the COPY INTO. The command I ran looked like this:
COPY INTO staging.snowplow_tmp_run_2019_04_26_18_31_15(enriched_data)
FROM @staging.snowplow_stage/run=2019-04-26-18-31-15
ON_ERROR = SKIP_FILE_1
FILE_FORMAT = (FORMAT_NAME = 'staging.snowplow_enriched_json'
STRIP_NULL_VALUES = TRUE);
The data was loaded correctly with no errors.
So that proves to me that the credentials are good and that the stage is set up correctly. That leaves the role as the only possible configuration problem.
One thing I'm not clear on, which may be causing issues: there are credentials in cluster.json, in playbook.json, and also in config.json. In config.json I have the roleArn and sessionDuration in the "auth" section, with the role set up as per the docs as far as I can tell. What about the "credentials" sections of playbook.json and cluster.json? For those I am just using static credentials right now while debugging, but are they used at all by the loader? My understanding was that those creds should only be used for spinning up the EMR cluster. Is that correct?
Some follow-ups from yesterday: I was able to get the storage loader to load correctly into Snowflake by making some changes to the policy attached to the role. I had to remove the resource-specific permissions for the buckets and paths that the role had access to, so I must have made a mistake in the path names in the role. I'll try again to add the restrictions based on the doc and see what is going on.
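For what it's worth, one way to sanity-check those path restrictions is to assume the role from the CLI and try listing the exact prefix the stage points at (the role ARN and bucket below are placeholders):

# assume the loader role and export the temporary credentials it returns
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::<account-id>:role/<snowflake-load-role> \
  --role-session-name policy-check \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$CREDS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# if the resource paths in the policy are right, this should list the run folders
aws s3 ls s3://<stage-bucket>/<stage-prefix>/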
Would still like some clarity (and maybe doc updates) on the different auth/credentials sections, but that’s not urgent. Thanks for your help!
Can you share this policy change? I’m having the same issue.
shadeaws.services.dynamodbv2.model.AmazonDynamoDBException: User: arn:aws:iam::213790343224:user/username is not authorized to perform: dynamodb:Scan on resource: arn:aws:dynamodb:us-east-1:213747503224:table/snowplow-manifest (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException; Request ID: 3AM85881GMUU1B32JEIAD6HDAFVV4KQNSO5AEMVJF66Q9ASUAAJG)
In my case it was just a typo in the load role; the ARN for the bucket was not correct. But it looks like yours is for DynamoDB, not S3.
Is this during the enrich stage? You might need to look at EMR_EC2_DefaultRole, which I think is where it gets the permissions to access the DynamoDB table.
Hey @sonnypolaris, this is slightly outside my area so forgive me if I'm wrong, but I have a hunch that the manifest in question might be part of the deduplication process, which happens at the loading stage.
If you take a look at the full error message for the run, it should indicate which stage this failed at. If that's RDB Load / EMR ETL Runner, then I think the solution would be to grant the user your loader is using permission to scan DynamoDB.
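Once that permission is granted, a quick way to confirm it would be a one-item scan with the same AWS credentials the loader uses (table name and region taken from the error above):

# a single-item scan is enough to prove dynamodb:Scan works for this user
aws dynamodb scan \
  --table-name snowplow-manifest \
  --region us-east-1 \
  --limit 1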