Spark missing in Dataflow-runner

anton · December 9, 2020, 12:30pm

I added a dedicated section on our docs website: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-snowflake-loader/setup/#Staging_enriched_data, please have a look and let us know if it worked out.

joseph · December 10, 2020, 12:29am

Hi @anton,

Thank you so much – I just saw this, so I will give it a try shortly and post my results here.

-Joseph

joseph · December 10, 2020, 2:27am

Hi @anton,

I have updated my playbook.json file to include the new first step.
I changed --src to reflect where my s3-loader sinks, and --dest to s3://my-stageUrl/enriched/archive/run={{nowWithFormat "2006-01-02-15-04-05"}}/.
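For anyone reading along: the string inside nowWithFormat is a Go time layout, where "2006-01-02-15-04-05" stands for year-month-day-hour-minute-second. Below is a rough Python sketch of what the rendered folder name looks like. The function name run_folder is made up for illustration, and using UTC here is an assumption; I have not checked which clock Dataflow Runner uses.

```python
from datetime import datetime, timezone


def run_folder(now: datetime) -> str:
    """Render a run= folder name equivalent to the Go layout 2006-01-02-15-04-05."""
    return now.strftime("run=%Y-%m-%d-%H-%M-%S/")


if __name__ == "__main__":
    # Assumption: UTC is used only for this illustration.
    print("s3://my-stageUrl/enriched/archive/" + run_folder(datetime.now(timezone.utc)))
```

The output has the same run=YYYY-MM-DD-hh-mm-ss/ shape that the later steps look for.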

On the first run, the new first step failed with the following error (via stderr):
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2'

This one took me a minute. I know that my resources are in us-west-2, but I did not see us-east-1 specified explicitly in the new step.

It turns out that us-east-1 is the default region when you specify --s3Endpoint as s3.amazonaws.com.

The transformer finally succeeded after I changed my --s3Endpoint to s3-us-west-2.amazonaws.com. And by that, I mean that it moved events from the --src folder to the --dest folder.
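A small sketch of the endpoint choice described above, in case it helps someone template their playbook. The helper name s3_endpoint is hypothetical. The dash-style hostname is simply the form that worked in this thread; as far as I know AWS also accepts the dot-style s3.<region>.amazonaws.com, and newer regions may only accept that one.

```python
def s3_endpoint(region: str) -> str:
    """Pick an --s3Endpoint value for a bucket region (illustrative helper)."""
    if region == "us-east-1":
        # The global endpoint is treated as us-east-1, per the error in this thread.
        return "s3.amazonaws.com"
    # Dash-style regional endpoint, as used successfully above for us-west-2.
    return f"s3-{region}.amazonaws.com"


if __name__ == "__main__":
    print(s3_endpoint("us-west-2"))
```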

According to the stdout log, the loader step (#3) succeeded as well. However, when I checked atomic.events in Snowflake, there were still no rows.

In the log, I saw this message:

2020-12-10T01:36:08.675Z: Launching Snowflake Loader. Fetching state from DynamoDB
2020-12-10T01:36:09.618Z: State fetched, acquiring DB connection
2020-12-10T01:36:11.619Z: DB connection acquired. Loading...
2020-12-10T01:36:12.434Z: Existing column [event_id VARCHAR(36) NOT NULL] doesn't match expected definition [event_id CHAR(36) NOT NULL UNIQUE] at position 7
2020-12-10T01:36:12.434Z: Existing column [domain_sessionidx INTEGER] doesn't match expected definition [domain_sessionidx SMALLINT] at position 17
2020-12-10T01:36:12.434Z: Existing column [geo_country VARCHAR(2)] doesn't match expected definition [geo_country CHAR(2)] at position 19
2020-12-10T01:36:12.434Z: Existing column [geo_region VARCHAR(3)] doesn't match expected definition [geo_region CHAR(3)] at position 20
2020-12-10T01:36:12.434Z: Existing column [tr_currency VARCHAR(3)] doesn't match expected definition [tr_currency CHAR(3)] at position 107
2020-12-10T01:36:12.434Z: Existing column [ti_currency VARCHAR(3)] doesn't match expected definition [ti_currency CHAR(3)] at position 111
2020-12-10T01:36:12.434Z: Existing column [base_currency VARCHAR(3)] doesn't match expected definition [base_currency CHAR(3)] at position 113
2020-12-10T01:36:12.435Z: Existing column [domain_sessionid VARCHAR(128)] doesn't match expected definition [domain_sessionid CHAR(128)] at position 121
2020-12-10T01:36:12.779Z: Warehouse snowplow_wh resumed
2020-12-10T01:36:12.790Z: Success. Exiting...
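If the log gets long, the column-mismatch messages can be pulled out with a regular expression. This is only a sketch based on the message wording in the log above; the function name column_mismatches is made up, and the pattern may need adjusting if the loader's wording differs between versions.

```python
import re

# Pattern derived from the log text in this post; accepts a straight or curly apostrophe.
MISMATCH = re.compile(
    r"Existing column \[(?P<existing>[^\]]+)\] doesn['’]t match expected "
    r"definition \[(?P<expected>[^\]]+)\] at position (?P<position>\d+)"
)


def column_mismatches(log_text: str) -> list:
    """Return (position, existing definition, expected definition) tuples."""
    return [
        (int(m["position"]), m["existing"], m["expected"])
        for m in MISMATCH.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2020-12-10T01:36:12.434Z: Existing column [geo_country VARCHAR(2)] "
        "doesn't match expected definition [geo_country CHAR(2)] at position 19"
    )
    for position, existing, expected in column_mismatches(sample):
        print(position, existing, "->", expected)
```

It works whether the log entries are on separate lines or run together on one line.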

I vaguely remember reading here that the above message is not necessarily an error, but I cannot recall.

Regardless, what might be causing the Snowflake Loader step to fail somewhat silently?

Thanks again for your update – I had been trying to use the s3-loader.hocon config file to enforce the directory structure in s3, to no avail. This helped me a lot.

-Joseph

anton · December 10, 2020, 6:33pm

Hey @joseph,

It seems to be the same problem that @danrodrigues has. I’d recommend having a look at your DynamoDB table to make sure that the Transformer managed to find the folder and process it.

Just to recap the structure of the playbook and role of the steps:

S3DistCp - simply stages data from the enriched sink into run=YYYY-MM-DD-hh-mm-ss folders
Transformer - discovers that folder on S3, processes it, and adds the corresponding folder to the DynamoDB manifest (with New state when discovered, with Processed state when finished)
Loader - discovers all folders in Processed state via the DynamoDB manifest and loads them

So you need to make sure the second step worked as expected.
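To make the hand-off between the steps concrete, here is a toy in-memory model of the manifest states described above. It is not the real DynamoDB schema or the actual Transformer/Loader code; the class and method names are invented for illustration, and only the state names New and Processed come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyManifest:
    """Hypothetical stand-in for the DynamoDB manifest: run folder -> state."""
    states: dict = field(default_factory=dict)

    def discover(self, run_id: str) -> None:
        # Transformer finds a staged run= folder and records it as New.
        self.states.setdefault(run_id, "New")

    def mark_processed(self, run_id: str) -> None:
        # Transformer finished processing the folder.
        if self.states.get(run_id) != "New":
            raise ValueError(f"{run_id} was never discovered in New state")
        self.states[run_id] = "Processed"

    def loadable(self) -> list:
        # Loader only picks up folders in Processed state.
        return sorted(r for r, s in self.states.items() if s == "Processed")


if __name__ == "__main__":
    manifest = ToyManifest()
    manifest.discover("run=2020-12-05-14-33-05/")
    print(manifest.loadable())  # nothing to load until the Transformer finishes
    manifest.mark_processed("run=2020-12-05-14-33-05/")
    print(manifest.loadable())
```

In this model, a Loader that exits successfully with nothing loaded corresponds to loadable() returning an empty list, which is why the second step is the one to check.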

joseph · December 10, 2020, 7:28pm

Hey @anton,

It seems the transformer step did not work quite right after all.

Although the files were moved from the --src to the --dest, the file format did not change at all.

In DynamoDB, I see RunIds enriched/ and run=/ (the latter may be a remnant of a previous mistake while fiddling with the s3-loader config).

I am not sure what to make of this, but here is what it looks like:

[Screenshot: Screen Shot 2020-12-10 at 12.03.26 PM]

[Screenshot: Screen Shot 2020-12-10 at 12.03.07 PM]

If DynamoDB is working as expected (and I am not sure that it is, based on the above), is the snowflake-transformer jar supposed to take care of the file format in --dest?

anton · December 10, 2020, 7:43pm

Hi @joseph,

"Although the files were moved from the --src to the --dest, the file format did not change at all."

I think you’re confusing S3DistCp and the Transformer. The former isn’t supposed to change the format of files, and the latter doesn’t have --src and --dest options. Please re-read my previous message explaining the difference; you must have all three steps in your playbook.

The data in DynamoDB doesn’t look correct: the RunId column should have the format run=2020-12-05-14-33-05/. I’d advise you to manually delete both records.
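A quick way to sanity-check RunId values before or after cleaning up the table is a format check like the sketch below. The function name is_valid_run_id is made up, and the pattern is inferred only from the example format above.

```python
import re
from datetime import datetime

RUN_ID = re.compile(r"run=(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})/")


def is_valid_run_id(run_id: str) -> bool:
    """True if run_id looks like run=YYYY-MM-DD-hh-mm-ss/ with a real timestamp."""
    match = RUN_ID.fullmatch(run_id)
    if match is None:
        return False
    try:
        datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["run=2020-12-05-14-33-05/", "enriched/", "run=/"]:
        print(candidate, is_valid_run_id(candidate))
```

Both records mentioned earlier in the thread (enriched/ and run=/) fail this check.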
