Snowplow R89 Plain of Jars released

We are pleased to announce the release of Snowplow 89 Plain of Jars:

https://snowplowanalytics.com/blog/2017/06/12/snowplow-r89-plain-of-jars-released

This release ports the Snowplow batch pipeline to Apache Spark, building on our RFC:

http://discourse.snowplow.io/t/migrating-the-snowplow-batch-jobs-from-scalding-to-spark/492
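To make the move concrete, here is a minimal, hypothetical sketch of the kind of Spark job the Scalding code was ported to (this is not Snowplow's actual enrichment code; `parseAndEnrich` is a stand-in for the real logic):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EnrichSketch {
  // Stand-in for the real enrichment step
  def parseAndEnrich(rawLine: String): String = rawLine.trim

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("enrich-sketch"))
    sc.textFile(args(0))        // raw collector payloads in, from S3/HDFS
      .map(parseAndEnrich)      // per-event transformation
      .saveAsTextFile(args(1))  // enriched events out
    sc.stop()
  }
}
```

The read-map-write dataflow that Scalding expressed with TypedPipe maps directly onto Spark's RDD API, which is what made the port tractable.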


The Bintray link doesn't work for me: http://dl.bintray.com/snowplow/snowplow-generic/snowplow_emr_r89_plain_of_jars.zip

Our continuous delivery failed us this morning. We're currently rebuilding the artifacts one by one; I'll post here once they're all up. Sorry about the inconvenience.

Everything is up! Again, apologies for the inconvenience.

What a massive release - nice work to everybody who contributed to the Spark port!

Does the move from Scalding to Spark change anything about recovering from bad rows?

Hi @tclass - are you referring to Hadoop Event Recovery? No, that remains a Scalding-based application, and of course the underlying Snowplow data formats have not changed in this release.


Yes, that's what I meant. I just wanted to make sure we don't lose that feature while upgrading. Thanks!

Any recommendations on AWS instance sizes with the new internals? We've been using c3.8xlarges, but with Spark being more memory-intensive, are r3s a better fit now? And is instance storage still a requirement (i.e. c3/r3 vs c4/r4)?

Hi @rbolkey - the c3.8xlarges should be fine, but let us know how you go.

You don't need instance storage as such, but if your instances don't have it (c4/r4), you will need to attach EBS volumes, because we still use HDFS on the cluster. We'll remove that use of HDFS in a future release.
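For illustration, here is a hedged sketch of what an EBS-backed core instance group looks like via the AWS Java SDK (EmrEtlRunner normally launches the cluster for you; the instance type, count, and volume size here are hypothetical and should be tuned to your event volumes):

```scala
import com.amazonaws.services.elasticmapreduce.model.{
  EbsBlockDeviceConfig, EbsConfiguration, InstanceGroupConfig, VolumeSpecification
}

object ClusterSketch {
  // Hypothetical sizing: tune instance type, count and volume size to your load
  val coreGroup: InstanceGroupConfig = new InstanceGroupConfig()
    .withInstanceRole("CORE")
    .withInstanceType("r4.8xlarge")   // EBS-only family: no instance storage
    .withInstanceCount(3)
    .withEbsConfiguration(
      new EbsConfiguration()
        .withEbsBlockDeviceConfigs(
          new EbsBlockDeviceConfig()
            .withVolumeSpecification(
              new VolumeSpecification()
                .withVolumeType("gp2")
                .withSizeInGB(320))   // scratch space for the on-cluster HDFS
            .withVolumesPerInstance(1)))
}
```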