Which version of Enrich/S3 Loader(s) did you use before?
Generally speaking, the shredder/loader versions you mention (rdb_shredder 0.13.0 and rdb_loader 0.14.0) should be compatible with the latest components. Given that, it is important to understand what error you are getting (for this you might need to dive into the EMR logs).
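If it helps, here is a minimal sketch of how you could pull the step logs out of the cluster's S3 log bucket to find the underlying error. This is Python with boto3, and it assumes your EMR `logUri` points at an S3 bucket; the bucket name, prefix and cluster id below are placeholders you would need to adjust:

```python
# Minimal sketch: list and print EMR step stderr logs from the cluster's
# S3 log bucket. Bucket name, prefix and cluster id are placeholders --
# adjust them to match your pipeline's EMR logUri setting.
import gzip

import boto3

LOG_BUCKET = "my-snowplow-emr-logs"    # hypothetical bucket name
CLUSTER_ID = "j-XXXXXXXXXXXXX"         # your EMR cluster id
PREFIX = f"logs/{CLUSTER_ID}/steps/"   # EMR writes steps/<step-id>/stderr.gz here

s3 = boto3.client("s3")

# Walk every object under the steps/ prefix and print the stderr logs,
# which usually contain the real Shredder/Loader failure reason.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("stderr.gz"):
            body = s3.get_object(Bucket=LOG_BUCKET, Key=key)["Body"].read()
            print(f"===== {key} =====")
            print(gzip.decompress(body).decode("utf-8", errors="replace"))
```

If the step stderr only shows the generic "finished with failed status" message, the container logs under the cluster's containers/ prefix can be inspected the same way.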
Ideally, of course, we would recommend using the latest recommended versions, which are listed in the version compatibility matrix. However, this will require some effort on your side, especially for RDB Loader. You will need to go up to 0.18.2 first, which is the latest version that runs with EmrEtlRunner (EER), and then move to v1, where shredding and loading are separated. For upgrade guides, see Snowplow RDB Loader - Snowplow Docs.
The S3 Loader was actually the first component in the pipeline that I upgraded, from version 0.18 to 2.0.0-rc2. Without giving too much thought to compatibility I pushed the change to our dev, staging and production environments, without problems.
Recently I upgraded Stream Enrich in dev and staging, and although it is deployed and running, the downstream RDB Loader (which is only in staging) is failing with this error:

INFO Client: Deleted staging directory hdfs://ip-10-5-215-178.ec2.internal:8020/user/hadoop/.sparkStaging/application_1631198412915_0003
Exception in thread "main" org.apache.spark.SparkException: Application application_1631198412915_0003 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/09/09 15:56:57 INFO ShutdownHookManager: Shutdown hook called
I'm not too familiar with debugging EMR, so I'm not sure whether you can make out the error from the log above.
[EDITED]
Found this error in the EMR output:

Data loading error [Amazon](500310) Invalid operation: Cannot COPY into nonexistent table nl_basjes_yauaa_context_1;
ERROR: Data loading error [Amazon](500310) Invalid operation: Cannot COPY into nonexistent table nl_basjes_yauaa_context_1;
Following steps completed: [Discover]
I guess this error explains what's missing in my Redshift schema, but I'm not sure how to create the new schema/table. Also, how can I make sure that all my other enrichments are supported by my Redshift cluster?
Regarding the upgrade steps (going to 0.18.2 and then to v1): can't I simply deploy the newest version "alongside" the old (current) process, and then just change the s3-loader output bucket (to the new RDB Loader input bucket)?
> I guess this error explains what's missing in my Redshift schema, but I'm not sure how to create the new schema/table. Also, how can I make sure that all my other enrichments are supported by my Redshift cluster?
Starting from RDB Loader R32 (rdb_shredder 0.16.0 and rdb_loader 0.17.0), new tables are created automatically by the loader. The same applies to changes in existing tables.
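Until you are on R32+, one way to check whether all the contexts you shred have matching tables is to compare the expected table names against what actually exists in Redshift. Below is a rough sketch in Python with psycopg2; the connection details and the expected table list are placeholders you would fill in from your own enrichment configuration. For any table that turns out to be missing, the Redshift DDL for standard schemas such as the YAUAA context is published in the iglu-central repository.

```python
# Rough sketch: check which shredded/context tables already exist in Redshift.
# Connection details and the expected table list are placeholders -- derive the
# list from the enrichments/contexts you have enabled (each context maps to a
# table named like <vendor>_<name>_<model> in your events schema).
import psycopg2

EXPECTED_TABLES = [
    "nl_basjes_yauaa_context_1",          # YAUAA enrichment
    "com_snowplowanalytics_snowplow_ua_parser_context_1",
    # ...one entry per enabled enrichment / tracked context
]

conn = psycopg2.connect(
    host="my-cluster.xxxxxx.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="snowplow",
    user="loader",
    password="...",
)

with conn, conn.cursor() as cur:
    # List every table in the schema the loader writes to (commonly "atomic").
    cur.execute(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = %s",
        ("atomic",),
    )
    existing = {row[0] for row in cur.fetchall()}

missing = [t for t in EXPECTED_TABLES if t not in existing]
print("Missing tables:", missing or "none")
```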
> Regarding the upgrade steps (going to 0.18.2 and then to v1): can't I simply deploy the newest version "alongside" the old (current) process, and then just change the s3-loader output bucket (to the new RDB Loader input bucket)?
In theory you can, but in practice it might be hard to put all the changes together at once. You are currently on R28, and the best path would be: