Snowflake Loader - Process ran successfully but no data loaded

Hi there,

I have a pipeline successfully loading data into Redshift without issue. I am now trying to connect the Snowflake Transformer/Loader so the data is also piped into Snowflake, using the 0.4.2 jars.

I ran the setup task in the loader, which completed successfully and created the warehouse, events table, etc. without error.

I then got the cluster.json and playbook.json config files to the point where kicking off dataflow-runner successfully runs both the transform and load steps.

./dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json
INFO[0000] Launching EMR cluster with name 'dataflow-runner - snowflake transformer'...
INFO[0000] EMR cluster is in state STARTING - need state WAITING, checking again in 30 seconds...
INFO[0451] EMR cluster is in state STARTING - need state WAITING, checking again in 30 seconds...
INFO[0481] EMR cluster is in state STARTING - need state WAITING, checking again in 30 seconds...
INFO[0511] EMR cluster launched successfully; Jobflow ID: j-3L41FF6YEO99H
INFO[0511] Successfully added 2 steps to the EMR cluster with jobflow id 'j-3L41FF6YEO99H'...
INFO[0662] Step 'Snowflake Transformer' with id 's-X2L179XZXDA8' completed successfully
INFO[0707] Step 'Snowflake Loader' with id 's-180LKK3PA3KGS' completed successfully
INFO[0707] Terminating EMR cluster with jobflow id 'j-3L41FF6YEO99H'...
INFO[0707] EMR cluster is in state TERMINATING - need state TERMINATED, checking again in 30 seconds...
INFO[0887] EMR cluster is in state TERMINATING - need state TERMINATED, checking again in 30 seconds...
INFO[0918] EMR cluster terminated successfully
INFO[0918] Transient EMR run completed successfully

The log from the loader step in EMR shows:

Loading...
Preliminary checks passed
Warehouse snowplow_wh_staging resumed
New column [contexts_com_snowplowanalytics_snowplow_ua_parser_context_1] has been added
Folder [run=2019-04-24-17-19-28] from stage [snowplow_stage] has been loaded
Folder [run=2019-04-24-20-03-41] from stage [snowplow_stage] has been loaded
Success. Exiting...

Yet there were no records written into the events table in Snowflake. Is there something I’m missing here? What might cause the tasks to run without errors, but not actually load any data?

Appreciate any assistance.

After digging into the Snowflake logs further, I found that the COPY command failed.

The error was “Failed to access remote file: access denied. Please check your credentials”.

I’m investigating the credentials side of things, but why did the job report success if Snowflake returned an error from the COPY INTO statement?

@Brandon_Kane, version 0.4.2 appears to have an issue that prevents the failure from being surfaced. We will be working on a new release to address this.

Thanks @ihor!

In the meantime I have checked the credentials in the config, and have also added them explicitly to the stage in Snowflake, but I am continuing to get the same error that the remote file cannot be accessed. Is there anything else I can do to debug this, or something I might have missed in setting up the credentials correctly?

Hi @Brandon_Kane,

Could you please share where you found the “Failed to access remote file: access denied. Please check your credentials” error message?

Can you also confirm that the contexts_com_snowplowanalytics_snowplow_ua_parser_context_1 column has been added?

I’m assuming the output you provided is from stdout. Is there anything in stderr?

Hi @anton, thanks for the follow up.

The message was from the Snowflake query history; nothing about the error was output in any of the Snowplow logs.

That column was added when I ran the setup process. There are 129 columns in the events table in Snowflake, which appears correct from what I can see.

Yes, that output was from stdout. There was nothing output to stderr from the EMR step. The syslog had the following:

2019-04-26 15:51:18,393 WARN shadeaws.profile.path.cred.CredentialsLegacyConfigLocationProvider (main): Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].

Let me know if there is anything else I can provide to help!

contexts_com_snowplowanalytics_snowplow_ua_parser_context_1 was added because the transformer found this context in your enriched data and then communicated this fact to the loader via the DynamoDB manifest. This means your Snowflake user certainly has sufficient permissions to alter the table.

I have two main hypotheses about what’s going on:

The first is that there’s a mismatch between the stageUrl in the config and the actual stage URL; you can check the stage definition with something like SHOW STAGES. Can you verify that data is indeed present in the path the Loader tries to load it from?

The second hypothesis is that Snowflake cannot authenticate itself to load data from the stage you created via the setup process. Could you please double-check that you followed the role creation instructions precisely?

If that doesn’t work, I’d recommend trying static credentials:

"auth": {
    "accessKeyId": "YOUR_ACCESS_KEY_ID",
    "secretAccessKey": "YOUR_SECRET_ACCESS_KEY"
}

Then check that you can use these same credentials with the AWS CLI to list and fetch data from the stageUrl.

It’s more likely the former than the latter, but I’m puzzled, because since 0.4.0 the Loader should abort the process on a stage mismatch; that behavior was indeed possible with pre-0.4.0 loaders, though.

Hey @anton,

Thanks again for sticking with this. I had the same ideas for what might be wrong and have been digging into both of those. Here is what I tried:

  • I verified the URL in the stage and it is definitely correct.
  • I ran an ALTER STAGE command to set the credentials on the stage directly in Snowflake.
  • I copied the commands that the loader ran into a Snowflake worksheet and ran them manually. The only thing I changed was removing the CREDENTIALS section of the COPY INTO. The command I ran looked like this:
COPY INTO staging.snowplow_tmp_run_2019_04_26_18_31_15(enriched_data) 
FROM @staging.snowplow_stage/run=2019-04-26-18-31-15 
ON_ERROR = SKIP_FILE_1 
FILE_FORMAT = (FORMAT_NAME = 'staging.snowplow_enriched_json' 
STRIP_NULL_VALUES = TRUE);
  • The data was loaded correctly with no errors.

So that proves to me that the credentials are good and that the stage is set up correctly. That leaves the role as the only possible configuration problem.

One thing I’m not clear on, which may be causing issues: there are credentials in cluster.json, playbook.json, and also in config.json. In config.json I have the roleArn and sessionDuration in the “auth” section, with the role set up as per the docs as far as I can tell. What about the “credentials” sections of playbook.json and cluster.json? For those I am just using static credentials right now while debugging, but are they used at all by the loader? My understanding was that those credentials should only be used for spinning up the EMR cluster. Is that correct?
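For reference, the auth section of my config.json looks roughly like this (the role ARN and duration below are placeholders, not my real values):

"auth": {
    "roleArn": "arn:aws:iam::123456789012:role/YOUR_SNOWFLAKE_LOAD_ROLE",
    "sessionDuration": 900
}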

Hi @anton,

Some follow-ups from yesterday: I was able to get the storage loader to load correctly into Snowflake by making some changes to the policy attached to the role. I had to remove the resource-specific permissions for the buckets and paths that the role had access to, so I must have made a mistake in the path names in the role. I’ll try again to add the restrictions based on the doc and see what is going on.
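To illustrate the kind of change (the bucket name and prefix below are placeholders, not my actual values): the role’s policy originally scoped S3 access to specific bucket/path ARNs, roughly like this, and a typo anywhere in those ARNs produces exactly the “access denied” behavior I was seeing:

{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
        "arn:aws:s3:::YOUR-TRANSFORMER-OUTPUT-BUCKET",
        "arn:aws:s3:::YOUR-TRANSFORMER-OUTPUT-BUCKET/snowflake-transformed/*"
    ]
}

Loosening the Resource (removing the path restriction) is what got the load working for me, which points at a typo somewhere in those path ARNs.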

I would still like some clarity (and maybe doc updates) on the different auth/credentials sections, but that’s not urgent. Thanks for your help!

Can you share this policy change? I’m having the same issue.

shadeaws.services.dynamodbv2.model.AmazonDynamoDBException: User: arn:aws:iam::213790343224:user/username is not authorized to perform: dynamodb:Scan on resource: arn:aws:dynamodb:us-east-1:213747503224:table/snowplow-manifest (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException; Request ID: 3AM85881GMUU1B32JEIAD6HDAFVV4KQNSO5AEMVJF66Q9ASUAAJG)

In my case it was just a typo in the load role: the ARN for the bucket was not correct. But it looks like yours is for DynamoDB, not S3.

Is this during the enrich stage? You might need to look at the EMR_EC2_DefaultRole, which I think is where it gets the permissions to access the DynamoDB table.

Hey @sonnypolaris, this is slightly outside my area, so forgive me if I’m wrong, but I have a hunch that the manifest in question might be part of the deduplication process, which happens at the loading stage.

If you take a look at the full error message for the run, it should indicate which stage this failed at. If that’s RDB Load / EMR ETL Runner, then I think the solution would be to grant the user that your loader is using permission to scan DynamoDB.
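Something along these lines should cover it (the table ARN here is taken from your error message; the loader may need other DynamoDB actions on the same table as well, so check the logs if Scan alone isn’t enough):

{
    "Effect": "Allow",
    "Action": ["dynamodb:Scan"],
    "Resource": "arn:aws:dynamodb:us-east-1:213747503224:table/snowplow-manifest"
}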

You are right. I had something incorrect in setting up my role. I defaulted to the key/secret approach and everything worked.
