Snowflake Loader - process ran successfully but no data loaded in the transformed S3 bucket

I am using Snowplow Snowflake Loader version 0.5.0 and trying to run the loader.

The transform process ran successfully, but I cannot see any data in the S3 output folder.

Can someone guide me on this issue?

Hi @Milan_Mathew,

That's very strange. Could you let us know:

  1. If there's any data in badOutputUrl
  2. If you're sure there was data in the input (enriched archive)
  3. If there are any records in the DynamoDB manifest

If possible, your Snowflake config file would also be useful (you can hide sensitive details such as bucket names).

Hi @anton,

Thank you for the prompt reply.

  1. If there's any data in badOutputUrl - no output there
  2. If you're sure there was data in the input (enriched archive) - yes, there is data
  3. If there are any records in the DynamoDB manifest - no records (is there any permission I have to grant to the user credentials I am using so they can access DynamoDB?)

Snowflake config (sensitive values redacted; accessKeyId/secretAccessKey are the static credentials):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-2",
  "data": {
    "name": "Snowflake_config",
    "awsRegion": "",
    "auth": {
      "accessKeyId": "",
      "secretAccessKey": ""
    },
    "manifest": "snowflake-event-manifest",
    "snowflakeRegion": "eu-west-1",
    "database": "PFDW_STG_LANDING_DB",
    "input": "s3://sp-sf/archive/enriched/",
    "stage": "snowplow_stage",
    "badOutputUrl": "s3://sp-sf/archive/snowflake/badrow/",
    "stageUrl": "s3://sp-sf/archive/snowflake/transformed/stage/",
    "warehouse": "PFDW_DEV_DE_TINY_VWH",
    "schema": "snowplow",
    "account": "",
    "username": "",
    "password": "",
    "maxError": 1,
    "purpose": "ENRICHED_EVENTS"
  }
}
```

PFA the screenshot of the bucket policy I attached for the user I am using for the whole process (the static credentials in the Snowflake config belong to that user).

I have the following doubts regarding credentials:

  1. Are the credentials I specified in the config the same ones used to access DynamoDB, create the Snowflake stage, and run the playbook JSON?
  2. In the setup guide, the S3 policy mentions a prefix, "Resource": ["arn:aws:s3:::YOUR-SNOWFLAKE-BUCKET/prefix/*"] - is this relevant, or is it okay to specify it the way I did in my policy (refer to the screenshot)?

No, not exactly. The DynamoDB client relies on the default credentials chain. We usually run it on EMR, and the EMR instances have access to DynamoDB.

Yes, that should be fine.
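
For reference, here is a rough sketch of the kind of permissions the transformer/loader credentials (or the EMR instance role) need, using your bucket and manifest names as placeholders: S3 access to the enriched input, the stage output and the bad rows output, plus access to the manifest table.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SnowflakeTransformerS3",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::sp-sf", "arn:aws:s3:::sp-sf/archive/*"]
    },
    {
      "Sid": "SnowflakeLoaderManifest",
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Scan",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:YOUR-ACCOUNT-ID:table/snowflake-event-manifest"
    }
  ]
}
```

Treat this as a starting point rather than the exact policy from the setup guide: the precise set of actions may differ, so double-check it against the documentation.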

It's very suspicious that you don't have any records in DynamoDB. Are you sure your Transformer job finished successfully? Can you provide stdout/stderr from the EMR console? How long did it take, and how much data did you have in the input? I suspect that your EMR cluster tried to access DynamoDB, failed at it, and (potentially) exited with a false positive.

@anton
Are you sure your Transformer job finished successfully? Yes, it shows: 'Snowflake Transformer' with id 's-2KI7JN4FC13NL' completed successfully.

Only stderr is getting populated:
```
Warning: Skip remote jar s3://sp-dev-hostedassetsmirror-25zbzj9qbp5t/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.4.2.jar.
19/10/17 07:50:18 INFO RMProxy: Connecting to ResourceManager at ip-172-31-13-35.eu-west-1.compute.internal/172.31.13.35:8032
19/10/17 07:50:18 INFO Client: Requesting a new application from cluster with 1 NodeManagers
19/10/17 07:50:18 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
19/10/17 07:50:18 INFO Client: Will allocate AM container, with 12288 MB memory including 1117 MB overhead
19/10/17 07:50:18 INFO Client: Setting up container launch context for our AM
19/10/17 07:50:18 INFO Client: Setting up the launch environment for our AM container
19/10/17 07:50:18 INFO Client: Preparing resources for our AM container
19/10/17 07:50:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/10/17 07:50:21 INFO Client: Uploading resource file:/mnt/tmp/spark-344e6a17-8a12-4081-9583-13f5b1cc056d/__spark_libs__7368307070419365450.zip -> hdfs://ip-172-31-13-35.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1571213926957_0020/__spark_libs__7368307070419365450.zip
19/10/17 07:50:24 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
19/10/17 07:50:24 INFO Client: Uploading resource s3://sp-dev-hostedassetsmirror-25zbzj9qbp5t/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.4.2.jar -> hdfs://ip-172-31-13-35.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1571213926957_0020/snowplow-snowflake-transformer-0.4.2.jar
19/10/17 07:50:24 INFO S3NativeFileSystem: Opening 's3://sp-dev-hostedassetsmirror-25zbzj9qbp5t/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.4.2.jar' for reading
19/10/17 07:50:26 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-172-31-13-35.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1571213926957_0020/hive-site.xml
19/10/17 07:50:26 INFO Client: Uploading resource file:/mnt/tmp/spark-344e6a17-8a12-4081-9583-13f5b1cc056d/__spark_conf__5755840517616732680.zip -> hdfs://ip-172-31-13-35.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1571213926957_0020/__spark_conf__.zip
19/10/17 07:50:26 INFO SecurityManager: Changing view acls to: hadoop
19/10/17 07:50:26 INFO SecurityManager: Changing modify acls to: hadoop
19/10/17 07:50:26 INFO SecurityManager: Changing view acls groups to:
19/10/17 07:50:26 INFO SecurityManager: Changing modify acls groups to:
19/10/17 07:50:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
19/10/17 07:50:26 INFO Client: Submitting application application_1571213926957_0020 to ResourceManager
19/10/17 07:50:26 INFO YarnClientImpl: Submitted application application_1571213926957_0020
19/10/17 07:50:27 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:27 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1571298626222 final status: UNDEFINED tracking URL: http://ip-172-31-13-35.eu-west-1.compute.internal:20888/proxy/application_1571213926957_0020/ user: hadoop
19/10/17 07:50:28 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:29 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:30 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:31 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:32 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:33 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:34 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:35 INFO Client: Application report for application_1571213926957_0020 (state: ACCEPTED)
19/10/17 07:50:36 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:36 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 172.31.9.216 ApplicationMaster RPC port: 0 queue: default start time: 1571298626222 final status: UNDEFINED tracking URL: http://ip-172-31-13-35.eu-west-1.compute.internal:20888/proxy/application_1571213926957_0020/ user: hadoop
19/10/17 07:50:37 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:38 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:39 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:40 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:41 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:42 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:43 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:44 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:45 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:46 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:47 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:48 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:49 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:50 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:51 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:52 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:53 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:54 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:55 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:56 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:57 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:58 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:50:59 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:00 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:01 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:02 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:03 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:04 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:05 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:06 INFO Client: Application report for application_1571213926957_0020 (state: RUNNING)
19/10/17 07:51:07 INFO Client: Application report for application_1571213926957_0020 (state: FINISHED)
19/10/17 07:51:07 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 172.31.9.216 ApplicationMaster RPC port: 0 queue: default start time: 1571298626222 final status: SUCCEEDED tracking URL: http://ip-172-31-13-35.eu-west-1.compute.internal:20888/proxy/application_1571213926957_0020/ user: hadoop
19/10/17 07:51:07 INFO ShutdownHookManager: Shutdown hook called
19/10/17 07:51:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-344e6a17-8a12-4081-9583-13f5b1cc056d
Command exiting with ret '0'
```

How long did it take and how much data did you have in the input? It took 54 seconds, and I have about 440 KB of data, just for testing.

I tried putting an item into DynamoDB from the EMR cluster and it was successful, so I don't think DynamoDB has access issues:

```bash
aws dynamodb put-item --table-name snowflake-event-manifest --item '{"RunId": {"S": "1234"}}'
```

OK, from this log we can assume that if there was an error, at least it did not happen in the Spark driver. You're right that any DynamoDB exception would manifest itself in the driver's log, so I agree with you: it's not DynamoDB.

There's also a chance it happened in a YARN container. You need to go to your job's logs and navigate to containers/application_1571213926957_0020. There can be many containers, but one of them might contain the necessary information. I don't know exactly what we're looking for here, but you can check for any errors/failures in those containers' stdout/stderr.

Also, it looks like you’re using Transformer 0.4.2, not 0.5.0. This version does not support bad rows.
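
If you did intend to run 0.5.0, the jar version comes from the transformer step in your Dataflow Runner playbook, so it's worth checking which asset that step points at. Assuming you use the public hosted-assets bucket (your log shows a mirror, so the bucket name will differ for you), the path would look something like:

```
s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.5.0.jar
```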

Hi @anton,
Sorry for the delay in updating you…

Good news! The transform now works fine after I added a folder between the S3 bucket and the data files in the input S3 location

(s3://sp-sf/archive/enriched/----here----/data.gz)

because when I debugged your code, I could see that it's looking for a prefix…
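
In case it helps anyone else: as far as I can tell, the transformer treats each folder under the input prefix as a run to process, so files sitting directly at the root of enriched/ are never picked up. In a standard pipeline these would be the run=… folders written by the archive step, for example (the folder name below is just an illustration):

```
s3://sp-sf/archive/enriched/run=2019-10-17-07-00-00/part-00000.gz
```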

So the transform is fine now, but my loader is failing with:

```
Exception in thread "main" net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object does not exist, or operation cannot be performed.
```

I didn't create any role in AWS for it. I am using the static credentials of a user who has access to that S3 bucket (s3://sp-sf/archive/snowflake/transformed/stage/).

I tried the following steps in the Snowflake UI with the SYSADMIN role:

1.
```sql
CREATE STAGE "PFDW_STG_LANDING_DB"."SNOWPLOW".snowplow_stage
  URL = 's3://sp-sf/archive/snowflake/transformed/stage/'
  CREDENTIALS = (AWS_KEY_ID = '<static credentials>' AWS_SECRET_KEY = '<static credentials>')
  file_format = (type = 'JSON')
  COMMENT = 'snowplow stage for snowplow-snowflake loader process';
```

2.
```sql
COPY INTO "PFDW_STG_LANDING_DB"."SNOWPLOW".ATOMIC
  FROM @PFDW_STG_LANDING_DB."SNOWPLOW".snowplow_stage
  ON_ERROR = SKIP_FILE_1;
```

Both steps worked fine and data was loaded. Unfortunately, when I tried the loader from EMR, it failed saying the object doesn't exist.
So my assumption is: does it have something to do with the role the loader is using to perform the operations in Snowflake?

I think this loader process is using ROLE PUBLIC, and it's failing when trying to perform operations on the object with that role. Am I right?

@Milan_Mathew I'm not as much of an expert on this as Anton, but the difference between copying the data across manually and what the pipeline does is that the pipeline's loader also needs to create columns and modify the table.

If you ensure that the pipeline's Snowflake user has sufficient permission to modify the atomic events table, and write access to the atomic schema, I believe you'll solve that.
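
For example, something along these lines. This is only a sketch, assuming the loader connects with a dedicated role (SNOWPLOW_LOADER_ROLE is a made-up name) and using the database, schema and warehouse names from your config; note that adding new columns to the events table may additionally require ownership of that table:

```sql
-- SNOWPLOW_LOADER_ROLE is a hypothetical role name; replace it (and <loader_user>) with your own
GRANT USAGE ON WAREHOUSE PFDW_DEV_DE_TINY_VWH TO ROLE SNOWPLOW_LOADER_ROLE;
GRANT USAGE ON DATABASE PFDW_STG_LANDING_DB TO ROLE SNOWPLOW_LOADER_ROLE;
GRANT USAGE, CREATE TABLE, CREATE STAGE, CREATE FILE FORMAT ON SCHEMA PFDW_STG_LANDING_DB.SNOWPLOW TO ROLE SNOWPLOW_LOADER_ROLE;
GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA PFDW_STG_LANDING_DB.SNOWPLOW TO ROLE SNOWPLOW_LOADER_ROLE;
GRANT USAGE ON STAGE PFDW_STG_LANDING_DB.SNOWPLOW.snowplow_stage TO ROLE SNOWPLOW_LOADER_ROLE;
GRANT ROLE SNOWPLOW_LOADER_ROLE TO USER <loader_user>;
```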

One way to verify is to grant it SYSADMIN and see if it runs. I definitely don't recommend giving it full SYSADMIN access for a production use case, but it's easier to revoke what you don't need after a successful run than to figure out what you do need from an unsuccessful one.

Best,