Google Cloud Dataflow example project released

colobas · March 31, 2017, 9:52am

We are pleased to announce the Google Cloud Dataflow example project.

This project will help you start your own real-time event processing pipeline, using some of the great services and tools offered by Google Cloud Platform: Pub/Sub (for distributed message queueing), Dataflow (for data processing) and Bigtable (for NoSQL storage).

alex · March 31, 2017, 9:59am

This is the first output from Gui’s winter internship at Snowplow working on Google Cloud Platform - great work @colobas!

evaldas · March 31, 2017, 3:59pm

very interesting, thanks!

couple of questions:

Is this a test pilot, or are there plans for future improvements to the sample pipeline?
Any reason why you haven’t used Kafka + BigQuery and the existing stream collector + enrich?

Cheers,
Evaldas

alex · March 31, 2017, 9:39pm

Hi @evaldas! I can probably answer for @colobas:

This is a standalone test pilot, independent of any future work porting Snowplow to GCP.
We didn’t use Kafka because we were trying to learn about GCP, not Kafka!
We thought about using BigQuery, but as we had done some experimentation with BigQuery a while back, it was more interesting to try out Bigtable. Plus, it fitted this analytics-on-write use case better.
We didn’t use existing Snowplow components because this is meant to be a standalone example project - no prior knowledge of Snowplow required.
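
For readers unfamiliar with the term, here is a rough sketch of what "analytics-on-write" means: aggregates are updated as each event arrives, so reads are simple key lookups rather than scans over raw events. This is only an illustration, not code from the example project. The event fields (`type`, `timestamp`), the per-minute bucketing and the in-memory `Counter` standing in for a key-value store such as Bigtable are all assumptions.

```python
from collections import Counter
from datetime import datetime


def bucket_key(event):
    """Build an aggregate key: the event type plus the minute it occurred in.

    The "type" and "timestamp" field names are hypothetical.
    """
    ts = datetime.fromisoformat(event["timestamp"])
    minute = ts.replace(second=0, microsecond=0)
    return (event["type"], minute.isoformat())


def apply_events(counts, events):
    """Update the running counts at write time, one event at a time."""
    for event in events:
        counts[bucket_key(event)] += 1
    return counts


# Hypothetical sample events; a real pipeline would read these from a stream.
events = [
    {"type": "Green", "timestamp": "2017-03-31T09:52:10"},
    {"type": "Green", "timestamp": "2017-03-31T09:52:45"},
    {"type": "Red", "timestamp": "2017-03-31T09:52:30"},
    {"type": "Green", "timestamp": "2017-03-31T09:53:05"},
]

# The Counter stands in for the store that would hold the aggregates.
counts = apply_events(Counter(), events)
for key in sorted(counts):
    print(key, counts[key])
```

The point of the pattern is that a query such as "how many Green events in a given minute" becomes a single lookup by key, which is the kind of access a wide-column store handles well.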

Hope this helps. Stay tuned for the Snowplow on GCP Request for Comments @evaldas - it sounds like this is more what you are looking for…

evaldas · April 9, 2017, 6:09pm

Hi @alex, thanks for the info. I was meaning to try all of those standard GCP parts, as that is the canonical setup that Google always demos for event streaming example projects. Now I’ll have a good reason to try this out. I guess it does make sense to use Bigtable in some cases, especially if you need high throughput in terms of QPS, which Bigtable provides. For data-warehouse-style analysis, though, BigQuery has a lot more advantages: it is a columnar store, it supports nested data structures, and you don’t need to manage the nodes yourself. It also supports streaming inserts, which are not available in Redshift (though they have some caveats too).

Dataflow seems interesting too, especially if you combine it with the Apache Beam abstraction for managing the pipelines. It might offer the best of both worlds: a cloud option without lock-in, plus the ability to switch to another solution.

Will be very interesting to see the RFC for GCP!

evaldas · April 9, 2017, 6:38pm

BTW, when I try any inv command I always get " did not receive all required positional arguments!", even though vagrant up completed OK.

alex · April 9, 2017, 6:55pm

Hi @evaldas - yes, the potential for “programming to the interface” and using Beam for other, non-GCP environments too is super interesting.

Feel free to raise a bug in the repository!

colobas · April 18, 2017, 12:56am

Hey @evaldas, I’m sorry it took me so long to answer. I forked the repo and tried to correct the problem, but unfortunately I have no GCP account with available resources to test it out right now. If you have the time, could you try it out? If it works I’ll open a PR. It’s here: https://github.com/colobas/google-cloud-dataflow-example-project

I believe the problem has to do with Python 3 vs 2 conflicts, but it was my fault for sure - I probably didn’t test the helper script properly inside the vagrant machine (Python 2 environment), and only tested it in my development environment (Python 3 environment). Sorry!

evaldas · April 23, 2017, 4:53pm

Hey @colobas, thanks for the fix. I tried your fork and ran into another error, which I posted here: https://github.com/snowplow/google-cloud-dataflow-example-project/issues/5

Wouldn’t it be easier just to update Vagrant to use Python 3 instead?

Cheers,
Evaldas
