We’re happy to announce release R31 of RDB Loader and Shredder, with a new bad rows format and data quality improvements.
Hello! We’ve noticed a bug in R31 that may be impacting some users. The result is a significant increase in bad rows coming out of the Shredder.
The bug is in the schema validator: elements of a property with type `[array, null]` are treated as invalid when the property is null. We believe occurrences of this are quite rare, but please do check for this case if you’re on this release.
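To illustrate the affected shape, here is a minimal, hypothetical schema fragment (the property name and payload below are made up for the example, not taken from any real schema):

```json
{
  "type": "object",
  "properties": {
    "tags": {
      "type": ["array", "null"],
      "items": { "type": "string" }
    }
  }
}
```

Under R31, an otherwise-valid event carrying `"tags": null` could be wrongly rejected into bad rows, because the validator attempts to validate array elements that don’t exist.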
We will be pushing R32 shortly, which will include a fix for this issue. Apologies for any inconvenience.
Hi there. Is it normal to experience a twofold increase in RDB Shredder run length after upgrading from 0.14.0 to 0.15.0?
We did notice a 5-10% increase in run length for some pipelines (mostly those with cross-batch deduplication enabled), due to the fact that we re-worked caching of the DAG for the “orphan events” fix. But a 100% increase definitely doesn’t look normal.
Do you have cross-batch deduplication enabled? Does your pipeline use shredded types heavily? What instances/volume are we talking about?
Hi @anton,
Thanks for such a swift response!
Cross-batch deduplication is not enabled.
Our pipeline has around 45 distinct shredded types spanning ~80 schema versions (some obsolete, so around 60 active ones).
The latest test was performed on 24 GB of raw .lzo files using 20 i3.2xlarge instances, yielding ~10 million rows in the atomic.events table.
Shredding of enriched events took 74 minutes, versus a 32-minute benchmark performed on the same data but with no version bump (i.e. `{"rdb_loader": 0.16.0, "rdb_shredder": 0.15.0}` vs `{"rdb_loader": 0.15.0, "rdb_shredder": 0.14.0}`).
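For reference, the bump was made in the storage section of our EmrEtlRunner config.yml; a sketch with other keys omitted:

```yaml
storage:
  versions:
    rdb_loader: 0.16.0    # benchmark run used 0.15.0
    rdb_shredder: 0.15.0  # benchmark run used 0.14.0
```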
Thanks for the details, @Aurimas_Griciunas!
> 20 i3.2xlarge
I’m wondering if the number of instances has something to do with it. Most of our high-volume pipelines tend to use more “vertical scaling”, e.g. a single r4.16xlarge or several 8xlarges. I’ll try to analyze some of our pipelines with similar characteristics and get back to you ASAP.
Meanwhile, I think the most important question for you is whether you care about the orphan events issue and shredded bad rows. If not, it probably makes sense to roll back to R30.
Quick update. Changing the EMR cluster to 5 x i3.8xlarge CORE instances actually doubled the run time of the shred job for both the old and new rdb_shredder versions on the same data.
@Aurimas_Griciunas, different EC2 instance types and counts require different Spark tuning to utilize those instances effectively in an EMR cluster. There are plenty of posts on the subject in this forum.
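To make that concrete, here’s a minimal sketch of the kind of tuning I mean, assuming an EmrEtlRunner config.yml recent enough to support the configuration: block. The numbers are illustrative for 5 x i3.8xlarge (32 vCPU, 244 GiB each), derived from the usual cores-per-executor arithmetic, not a recommendation for your workload:

```yaml
aws:
  emr:
    jobflow:
      core_instance_count: 5
      core_instance_type: i3.8xlarge
    configuration:
      spark:
        maximizeResourceAllocation: "false"  # we size executors by hand below
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.cores: "5"          # 6 executors per node use 30 of 32 vCPUs
        spark.executor.instances: "29"     # 5 nodes x 6 executors, minus 1 slot for the driver
        spark.executor.memory: "37G"       # ~244 GiB / 6 executors, with headroom for overhead
        spark.driver.memory: "37G"
        spark.default.parallelism: "290"   # ~2x total executor cores
```

The point is that a cluster shape change (e.g. 20 x 2xlarge to 5 x 8xlarge) changes all of these numbers at once, so leaving defaults in place can easily double run time.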