We’re back with renewed energy in our effort to upgrade our Snowplow system. We consider the collector part solved and have now moved on to the Snowflake Loader. While reading the official documentation, I noticed a couple of things that lead to confusion unless you take the time to understand the code.
The documentation recommends using "amiVersion": "5.9.0" together with version 0.8.2 of s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.8.2.jar. However, this combination seems to give the error "java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)".
I noticed, however, that there is a version 0.9.0 that implicitly recommends AMI version 6.4.0 and seems to resolve that error, although I haven’t been able to run it end to end yet.
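For reference, here is roughly where those two version numbers live in the Dataflow Runner configs. This is a trimmed sketch, not our full config; the name, region, and bucket below are placeholders:

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "snowflake-loader-cluster",
    "region": "eu-west-1",
    "ec2": {
      "amiVersion": "6.4.0"
    }
  }
}
```

The matching jar version then goes into the corresponding steps in playbook.json, e.g. s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.9.0.jar.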
The playbook.json references events_manifest.json, but this file is not introduced until you read the Cross-batch deduplication page. That page also never explicitly states whether you have to create the underlying DynamoDB table manually, but I guess that is the case?
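For anyone else hitting this: below is the shape our events_manifest.json ended up with, pieced together from the Cross-batch deduplication page. Take it as a sketch rather than gospel; the credentials, region, table name, and id are placeholders to swap for your own:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/2-0-0",
  "data": {
    "name": "eventsManifest",
    "auth": {
      "accessKeyId": "PLACEHOLDER_ACCESS_KEY_ID",
      "secretAccessKey": "PLACEHOLDER_SECRET_ACCESS_KEY"
    },
    "awsRegion": "eu-west-1",
    "dynamodbTable": "snowplow-events-manifest",
    "id": "00000000-0000-0000-0000-000000000000",
    "purpose": "EVENTS_MANIFEST"
  }
}
```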
Leaving --s3Endpoint as s3.amazonaws.com makes S3DistCp default to region us-east-1, causing the error
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2'
You need to manually change the s3Endpoint for S3DistCp in playbook.json to your own region, in our case s3-eu-west-1.amazonaws.com.
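Concretely, the S3DistCp staging step in our playbook.json now looks something like this (a trimmed sketch; the bucket paths are placeholders for our own buckets):

```json
{
  "type": "CUSTOM_JAR",
  "name": "Staging enriched data",
  "actionOnFailure": "CANCEL_AND_WAIT",
  "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
  "arguments": [
    "--src", "s3://our-bucket/enriched/good/",
    "--dest", "s3://our-bucket/enriched/archive/",
    "--s3Endpoint", "s3-eu-west-1.amazonaws.com",
    "--deleteOnSuccess"
  ]
}
```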
@anton: Maybe the combination of AMI 6.4.0 and 0.9.0 isn’t quite there either. Now I get
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.snowplowanalytics.snowflake.transformer.S3OutputFormat not found
in my containers/application_1642171377915_0002/container_1642171377915_0002_01_000001/stderr
The steps/s-2RJYVHBLPA2JA/stderr log says
Exception in thread "main" org.apache.spark.SparkException: Application application_1642171377915_0002 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1253)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1645)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala: