BigQuery Loader 0.6.0 released

[EDIT-2: 0.6.1 is now out: BigQuery Loader 0.6.1 released.]

[EDIT: We have discovered a bug in version 0.6.0 that can lead to contexts being dropped from the event. Please do not upgrade to this version. We’re working on a fix and will update soon.]

We are very excited to have released version 0.6.0 of BigQuery Loader, our family of apps that load Snowplow data into BigQuery.

BigQuery StreamLoader

The main highlight of this release is the introduction of BigQuery StreamLoader, a standalone Scala app (in the vein of Repeater and Mutator) that can be deployed as an alternative to the Loader Dataflow job. Loading into BigQuery is a relatively simple task: it doesn’t require much aggregation or data shuffling, and so doesn’t benefit from some of Beam’s key analytics features.

This component still has experimental status, but we are very keen to get it into users’ hands and hear your feedback on it.
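For anyone keen to try it, launching StreamLoader might look roughly like the sketch below. The image name, tag, and flag names here are assumptions modelled on how the other apps in this family (Mutator, Repeater) are invoked, so please check the release notes for the exact incantation:

```shell
# Hypothetical invocation -- image location, tag and flags are assumptions,
# not confirmed by this release announcement.
docker run \
  -v /path/to/creds:/snowplow/creds \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/creds/service-account.json \
  snowplow/snowplow-bigquery-streamloader:0.6.0 \
  --config=$(base64 -w0 config.json) \
  --resolver=$(base64 -w0 resolver.json)
```

Like the other apps, it would read from the enriched Pub/Sub topic and write straight into BigQuery, with no Dataflow job in between.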


This version fixes a bug in Mutator where a projectId passed in through the config file was being ignored. Many thanks to simplylizz for contributing this fix.


We’ve updated the versions of several transitive dependencies to address security vulnerabilities in those libraries.

Read more

For the full release notes, check out:


This sounds interesting @dilyan - we’d be keen to test this out.

Are there any binaries available outside of the Docker images? Couldn’t spot anything up on Bintray and thought I’d check before setting up SBT here.


Hey @robkingston,

I don’t think there are any jars out there, unfortunately. In fact, we’re planning to deprecate Bintray in the near future, so even if we continue publishing fatjars, they will likely be hosted somewhere else.

But building one locally should be straightforward - please do let us know if you need any help there; I can prepare a branch that builds it with one command. We’re very interested in giving it a shot as soon as possible and potentially making it the default implementation at some point, so any information would be highly appreciated. Even if it doesn’t have better performance now, there are plenty of things that can be tweaked in the stream loader that cannot be in the current Dataflow implementation.
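In case it helps in the meantime, building a fatjar from the repo should look roughly like this. The sbt project id and output path below are assumptions, so check `build.sbt` for the exact module name:

```shell
git clone https://github.com/snowplow-incubator/snowplow-bigquery-loader.git
cd snowplow-bigquery-loader
# "streamloader" is an assumed sbt project id -- check build.sbt for the real one
sbt "project streamloader" assembly
# the fatjar lands somewhere under target/ (exact path depends on the Scala version)
```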


No worries @anton… SBT was actually dead easy to set up and build the binaries with.

Gave streamloader a crack and it’s working great. Kind of astonished… first command and off it went.

CPU/RAM usage was high on the cheap VPS I tested it from, and it only hit ~300 elements/second (compared to ~2,000 elements/second on an n1-standard). But who cares, when we can run this off spare CPU cycles of machines we already have running… might throw this at our Xeons later in the week.

Is there any way to drain streamloader before exiting? Or would ctrl+c'ing out gracefully drain and close the streamloader? Anything else you’d want reported back?


Awesome stuff, @robkingston! I agree that flexibility is the big win here, and again, there’s a lot of room for improvement. I believe we can exceed those 2K events per second on an n1-standard with a bit of tweaking.

With Ctrl+C it will release all acquired resources gracefully. Also, I believe Dataflow had a slightly different mechanism: StreamLoader doesn’t buffer any data and acknowledges a message only after it has been inserted into BigQuery. In other words, even if something goes wrong and you lose a node after it has pulled a message, that message shouldn’t be lost without being inserted.

One more thing we’re curious about is its scaling ability. There’s not much sense in using StreamLoader at high scale yet, but as I mentioned, eventually we’d like to make it the default implementation and scale it with Kubernetes. We did some tests with it already, and although it got faster, it also produced duplicates (10K for 500K events). That’s partly due to the nature of Pub/Sub, but a scaled-up Dataflow job should produce fewer of them.


That is excellent news @robkingston! In addition to what @anton said, if you have a chance to check the container logs for the pods that are running streamloader, we’d be interested to know if you see anything suspicious there. With Beam, the logs are quite noisy and often there are exceptions thrown by underlying libraries and either ignored or buried under other logs. With streamloader we expect it to look much tidier.

How frustrating. Takes only ~30 mins for a single n1-standard to load all our events each day. We’re probably too small to verify this.

Sure! Is it just what’s emitted to stdout/stderr from the binary? Will share those logs when I set it loose on my workstation.