This project will help you start your own real-time event processing pipeline, using some of the great services and tools offered by Google Cloud Platform: Pub/Sub (for distributed message queueing), Dataflow (for data processing) and Bigtable (for NoSQL storage).
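To give a feel for how the pieces fit together, here is a rough sketch of such a pipeline using the Apache Beam Python SDK: read events from Pub/Sub, count them per event type in one-minute windows, and write the counts to Bigtable. This is only an illustration, not the actual code from the example project, and the project ID, topic, instance, table and column family names are all placeholders:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.transforms import window
from google.cloud.bigtable import row as bt_row


def to_bigtable_row(kv):
    """Turn an (event_type, count) pair into a Bigtable DirectRow mutation."""
    event_type, count = kv
    # Row key per event type here; a real analytics-on-write design would
    # likely fold a time bucket into the row key as well.
    r = bt_row.DirectRow(row_key=event_type.encode("utf-8"))
    r.set_cell("counts", b"count", str(count).encode("utf-8"))
    return r


options = PipelineOptions(project="my-project", region="us-central1",
                          temp_location="gs://my-bucket/tmp")
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> ReadFromPubSub(topic="projects/my-project/topics/events")
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "KeyByType" >> beam.Map(lambda event: event.get("event_type", "unknown"))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute buckets
     | "CountPerType" >> beam.combiners.Count.PerElement()
     | "ToBigtableRow" >> beam.Map(to_bigtable_row)
     | "WriteCounts" >> WriteToBigTable(project_id="my-project",
                                        instance_id="my-instance",
                                        table_id="event-counts"))
```

A few notes on the choices we made: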
This is a standalone pilot project, independent of any future work porting Snowplow to GCP.
We didn’t use Kafka because we were trying to learn about GCP, not Kafka!
We thought about using BigQuery, but since we had already done some experimentation with BigQuery a while back, it was more interesting to try out Bigtable. It also fit this analytics-on-write use case better.
We didn’t use existing Snowplow components because this is meant to be a standalone example project - no prior knowledge of Snowplow is required.
Hope this helps. Stay tuned for the Snowplow on GCP Request for Comments @evaldas - it sounds like that is closer to what you are looking for…
Hi @alex, thanks for the info. I had been meaning to try all of those standard GCP parts, as that is the canonical setup Google usually demos for event streaming example projects. Now I’ll have a good reason to try this out. I guess it does make sense to use Bigtable in some cases, especially if you need high throughput in terms of QPS, which Bigtable provides. For DWH-type analysis, though, BigQuery has a lot more advantages: it is a columnar store, it has nested data structures, and you don’t need to manage the nodes yourself. It also supports streaming inserts, which are not available in Redshift (though they come with some caveats too).
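For example, a BigQuery streaming insert with the Python client is roughly this (the project and table names below are made up):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project
table_id = "my-project.analytics.events"          # hypothetical dataset.table

rows = [
    {"event_id": "e1", "event_type": "page_view", "ts": "2017-05-01T12:00:00Z"},
]

# Streaming insert; rows land in the streaming buffer and become queryable quickly.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)
```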
Dataflow also seems interesting, especially if you combine it with the Apache Beam abstraction to manage the pipelines: it might offer the best of both worlds - an option that isn’t locked to one cloud, plus the ability to switch to any other solution.
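To illustrate that portability point: with Beam, the same pipeline code can run locally or on Dataflow just by changing the pipeline options (the project, region and bucket below are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Run locally with the DirectRunner...
local_opts = PipelineOptions(runner="DirectRunner")

# ...or on GCP with the DataflowRunner; only the options change, not the pipeline.
dataflow_opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=local_opts) as p:
    (p
     | beam.Create(["page_view", "click", "page_view"])
     | beam.combiners.Count.PerElement()
     | beam.Map(print))
```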
Hey @evaldas, I’m sorry it took me so long to answer. I forked the repo and tried to correct the problem, but unfortunately I have no GCP account w/ available resources to test it out right now. If you have the time, could you try it out? If it works I’ll open a PR. It’s here: https://github.com/colobas/google-cloud-dataflow-example-project
I believe the problem has to do with Python 3 vs 2 conflicts, but it was my fault for sure - I probably didn’t test the helper script properly inside the Vagrant machine (Python 2 environment), and only tested it in my development environment (Python 3 environment). Sorry!
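For anyone hitting something similar, these are the kinds of 2-vs-3 differences that typically bite helper scripts. This is just an illustrative sketch, not the actual patch in the fork above:

```python
# Hypothetical examples of Python 2/3 compatibility shims for a helper script.
from __future__ import print_function  # print() behaves the same on 2 and 3

import sys


def read_line(prompt):
    # input() on Python 2 evaluates the typed text; raw_input() is the safe equivalent.
    if sys.version_info[0] < 3:
        return raw_input(prompt)  # noqa: F821 - only defined on Python 2
    return input(prompt)


def decode_output(data):
    # subprocess output is bytes on Python 3 but str on Python 2.
    return data.decode("utf-8") if isinstance(data, bytes) else data
```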