BigQuery Loader 0.6.0 released

[EDIT-2: 0.6.1 is now out: BigQuery Loader 0.6.1 released.]

[EDIT: We have discovered a bug in version 0.6.0 that can lead to contexts being dropped from the event. Please do not upgrade to this version. We’re working on a fix and will update soon.]

We are very excited to have released version 0.6.0 of BigQuery Loader, our family of apps that load Snowplow data into BigQuery.

BigQuery StreamLoader

The main highlight of this release is the introduction of BigQuery StreamLoader, a standalone Scala app (in the vein of Repeater and Mutator) that can be deployed as an alternative to the Loader Dataflow job. Loading into BigQuery is a relatively simple task: it doesn’t require much aggregation or data shuffling, and so doesn’t benefit from some of Beam’s key analytics features.

This component still has experimental status, but we are very keen to get it into users’ hands and hear your feedback on it.
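For anyone keen to try it, launching StreamLoader might look roughly like the sketch below. The image name, tag, and flag names here are assumptions modelled on how the other apps in this family (Mutator, Repeater) are invoked, so please check the release notes for the exact incantation:

```shell
# Hypothetical invocation -- image location, tag and flags are assumptions,
# not confirmed by this release announcement.
docker run \
  -v /path/to/creds:/snowplow/creds \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/creds/service-account.json \
  snowplow/snowplow-bigquery-streamloader:0.6.0 \
  --config=$(base64 -w0 config.json) \
  --resolver=$(base64 -w0 resolver.json)
```

Like the other apps, it would read from the enriched Pub/Sub topic and write straight into BigQuery, with no Dataflow job in between.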


This version fixes a bug in Mutator where a projectId passed in through the config file was being ignored. Many thanks to simplylizz for contributing this fix.


We’ve updated the versions of several transitive dependencies to address security vulnerabilities in those libraries.

Read more

For the full release notes, check out:


This sounds interesting @dilyan - we’d be keen to test this out.

Are there any binaries available outside of the Docker images? Couldn’t spot anything up on Bintray and thought I’d check before setting up SBT here.


Hey @robkingston,

I don’t think there are any jars out there, unfortunately. In fact, we’re planning to deprecate Bintray in the near future, so even if we continue publishing fatjars, they will likely be hosted somewhere else.

But building one locally should be straightforward - please do let us know if you need any help there; I can prepare a branch that builds it with one command. We’re very interested in giving it a shot as soon as possible and potentially making it the default implementation at some point, so any information would be highly appreciated. Even if it doesn’t have better performance now, there are plenty of things that can be tweaked in the stream loader that cannot be in the current Dataflow implementation.
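In case it helps in the meantime, building a fatjar from the repo should look roughly like this. The sbt project id and output path below are assumptions, so check `build.sbt` for the exact module name:

```shell
git clone https://github.com/snowplow-incubator/snowplow-bigquery-loader.git
cd snowplow-bigquery-loader
# "streamloader" is an assumed sbt project id -- check build.sbt for the real one
sbt "project streamloader" assembly
# the fatjar lands somewhere under target/ (exact path depends on the Scala version)
```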


No worries @anton… SBT was actually dead easy to set up and build the binaries with.

Gave streamloader a crack and it’s working great. Kind of astonished… first command and off it went.

CPU/RAM usage was high on the cheap VPS I tested it from, and it only hit ~300 elements/second (compared to ~2,000 elements/second on an n1-standard). But who cares, when we can run this off spare CPU cycles of machines we already have running… might throw this at our Xeons later in the week.

Is there any way to drain streamloader before exiting? Or would ctrl+c'ing out gracefully drain and close the streamloader? Anything else you’d want reported back?


Awesome stuff, @robkingston! I agree that flexibility is the big win here, and again, there’s a lot of room for improvement. I believe we can exceed those 2K events per second on an n1-standard with a bit of tweaking.

With Ctrl+C it will release all acquired resources gracefully. Also, I believe Dataflow had a slightly different mechanism: StreamLoader doesn’t buffer any data and acknowledges a message only after it has been inserted into BigQuery. In other words, even if something goes wrong and you lose a node after it has pulled a message, that message shouldn’t be lost without being inserted.

One more thing we’re curious about is its scaling ability. There’s not much sense in using StreamLoader at high scale yet, but as I mentioned, eventually we’d like to make it the default implementation and scale it with Kubernetes. We did some tests with it already, and although it got faster, it also produced duplicates (10K for 500K events). That’s partly due to the nature of Pub/Sub, but a scaled-up Dataflow job should produce fewer of them.


That is excellent news @robkingston! In addition to what @anton said, if you have a chance to check the container logs for the pods that are running streamloader, we’d be interested to know if you see anything suspicious there. With Beam, the logs are quite noisy and often there are exceptions thrown by underlying libraries and either ignored or buried under other logs. With streamloader we expect it to look much tidier.

How frustrating. Takes only ~30 mins for a single n1-standard to load all our events each day. We’re probably too small to verify this.

Sure! Is it just what’s emitted to stdout/stderr from the binary? Will share those logs when I set it loose on my workstation.