We’re back with renewed energy in our effort to upgrade our Snowplow system. We consider the collector part solved and have now moved on to the Snowflake Loader. While reading the official documentation, I noticed a couple of things that lead to confusion unless you take the time to understand the code.
The documentation recommends using "amiVersion": "5.9.0" together with version 0.8.2 of s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.8.2.jar. However, this combination seems to give the error "java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)".
I noticed, however, that there is a version 0.9.0 that implicitly recommends AMI version 6.4.0 and seems to resolve that error, although I haven’t been able to run it end to end yet.
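For reference, here is roughly where those two version numbers live in the Dataflow Runner configs. This is a trimmed sketch, not our full config; the name, region, and bucket below are placeholders:

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "snowflake-loader-cluster",
    "region": "eu-west-1",
    "ec2": {
      "amiVersion": "6.4.0"
    }
  }
}
```

The matching jar version then goes into the corresponding steps in playbook.json, e.g. s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.9.0.jar.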
The playbook.json references events_manifest.json, but this file is not introduced until you read the Cross-batch deduplication page. That page also never explicitly states whether you have to create the underlying DynamoDB table manually, but I guess that is the case?
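For anyone else hitting this: below is the shape our events_manifest.json ended up with, pieced together from the Cross-batch deduplication page. Take it as a sketch rather than gospel; the credentials, region, table name, and id are placeholders to swap for your own:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/2-0-0",
  "data": {
    "name": "eventsManifest",
    "auth": {
      "accessKeyId": "PLACEHOLDER_ACCESS_KEY_ID",
      "secretAccessKey": "PLACEHOLDER_SECRET_ACCESS_KEY"
    },
    "awsRegion": "eu-west-1",
    "dynamodbTable": "snowplow-events-manifest",
    "id": "00000000-0000-0000-0000-000000000000",
    "purpose": "EVENTS_MANIFEST"
  }
}
```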
Leaving --s3Endpoint as s3.amazonaws.com makes S3DistCp default to region us-east-1, causing the error
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2'
You need to manually change the s3Endpoint for S3DistCp in playbook.json to your own region, in our case s3-eu-west-1.amazonaws.com.
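Concretely, the S3DistCp staging step in our playbook.json now looks something like this (a trimmed sketch; the bucket paths are placeholders for our own buckets):

```json
{
  "type": "CUSTOM_JAR",
  "name": "Staging enriched data",
  "actionOnFailure": "CANCEL_AND_WAIT",
  "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
  "arguments": [
    "--src", "s3://our-bucket/enriched/good/",
    "--dest", "s3://our-bucket/enriched/archive/",
    "--s3Endpoint", "s3-eu-west-1.amazonaws.com",
    "--deleteOnSuccess"
  ]
}
```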
@anton: Maybe the combination of AMI 6.4.0 and 0.9.0 isn’t quite there either. Now I get
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.snowplowanalytics.snowflake.transformer.S3OutputFormat not found
in my containers/application_1642171377915_0002/container_1642171377915_0002_01_000001/stderr
The steps/s-2RJYVHBLPA2JA/stderr log says
Exception in thread "main" org.apache.spark.SparkException: Application application_1642171377915_0002 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1253)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1645)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala: