As a digital analyst, I want reliable analytics data without much administration overhead, so that I can focus on supporting the business with numbers and insights.
I tried to find the best solution for the following criteria:
stream processing, enabling near-realtime data
ideally fully-managed services, no administration/maintenance after initial deployment
ideally no fixed costs, pay-per-use
ideally automatic scaling, no managing of instance sizes or counts, clusters, load balancing, etc.
I believe that Google Cloud Platform is a great fit for fully-managed services:
Firebase Hosting comes with a free SSL certificate for a custom domain (see here)
Firebase Hosting handles dynamic requests with Cloud Functions (see here)
The Firebase Hosting Blaze plan is pay-per-use with no monthly fee; the first 2,000,000 Cloud Function invocations are free
Google Cloud Pub/Sub is pay-per-use with no monthly fee, the first 10GB are free (see here)
Google Cloud Dataflow is pay-per-use with no monthly fee (see here)
Google BigQuery is pay-per-use and costs $0.05 per GB for streaming data inserts and $0.02 per GB per month for storage after the first 10GB (see here); a rough cost sketch follows this list
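To make the pay-per-use point concrete, here is a rough back-of-the-envelope calculation using the BigQuery prices quoted above; the monthly event volume and the ~1 KB average event size are assumptions for illustration, not measured figures:

```javascript
// Rough BigQuery streaming cost sketch based on the prices quoted above.
// Assumed (not measured): ~1 KB per enriched event, 10 million events per month.
const events = 10 * 1000 * 1000;            // hypothetical monthly event volume
const gbStreamed = (events * 1024) / 1e9;   // ≈ 10.24 GB streamed per month
const insertCost = gbStreamed * 0.05;       // $0.05 per GB streamed ≈ $0.51

// Storage accrues on top of this at $0.02 per GB per month after the first 10 GB.
console.log(`≈ ${gbStreamed.toFixed(1)} GB streamed, ≈ $${insertCost.toFixed(2)} in streaming inserts per month`);
```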
How it would work
The /i path on the custom domain HTTPS host points to a Node.js Cloud Function
The Cloud Function takes care of cookie management and puts the payload into Google Cloud Pub/Sub (SnowCannon might be useful; see the sketch after this list)
Cloud Pub/Sub triggers Snowplow’s scala-stream-collector on Cloud Dataflow
From the scala-stream-collector the data goes to Snowplow’s stream-enrich on Cloud Dataflow
From Cloud Dataflow the data is streamed into BigQuery, where it can be queried directly, explored with Data Studio and Cloud Datalab, or accessed via third-party analytics and visualization tools that support BigQuery, including Apache Superset.
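To make the first two steps more concrete, here is a minimal sketch of such a Cloud Function, assuming an HTTP-triggered Node.js function and a recent version of the @google-cloud/pubsub client; the topic name (raw-events), cookie name (sp_uid), and the shape of the published message are placeholders, not part of any existing implementation:

```javascript
// Minimal sketch of an HTTP Cloud Function that sets a first-party cookie and
// forwards the raw tracker payload to a Pub/Sub topic.
// Placeholders/assumptions: topic name 'raw-events', cookie name 'sp_uid',
// and the shape of the published message.
const crypto = require('crypto');
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const topic = pubsub.topic('raw-events'); // hypothetical topic name

exports.collect = async (req, res) => {
  // Reuse the visitor id from the cookie if present, otherwise mint a new one.
  const cookies = (req.headers.cookie || '')
    .split(';')
    .map((c) => c.trim().split('='))
    .reduce((acc, [key, value]) => ({ ...acc, [key]: value }), {});
  const visitorId = cookies.sp_uid || crypto.randomBytes(16).toString('hex');

  // Publish the payload (query string for GET, body for POST) plus some context.
  const message = {
    visitorId,
    collectedAt: new Date().toISOString(),
    query: req.query,
    body: req.body,
    userAgent: req.headers['user-agent'],
  };
  await topic.publishMessage({ data: Buffer.from(JSON.stringify(message)) });

  // Refresh the cookie and reply with an empty response.
  res.set('Set-Cookie', `sp_uid=${visitorId}; Max-Age=31536000; Path=/; Secure; HttpOnly`);
  res.status(204).send('');
};
```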
What I like about this is that a website or business of any size can benefit from real-time clickstream analytics, and the costs correlate directly with the amount of data.
What do you think about the solution proposed above?
You are more than welcome to join me in making the adjustments necessary to deploy to GCP. The following projects might be useful resources:
Google Cloud Dataflow example project, see here and here
Awesome! I know Snowplow have been planning an RFC for GCP, so they are likely to have some valuable thoughts on this. Here are some thoughts/open questions below.
I’d avoid using Firebase/Cloud Functions/Node.js for anything at the moment and instead try to closely replicate the battle-tested real-time pipeline by running the Scala Stream Collector inside Kubernetes. Cloud Functions is only available in a single zone (1/33) and is still in beta.
Some services are only available in specific regions. This limitation is likely to disappear over time, but for some clients it matters (e.g., BigQuery isn’t available in any Asia Pacific zone).
Pub/Sub is amazing, but the interface differs quite a bit from Kinesis, so it may need some additional logic around handling high availability for consumers. It’s also worth thinking about deduplication downstream from here (particularly where duplicates are more than 60 seconds apart) due to the at-least-once behaviour of Pub/Sub; see the query sketch after this list.
Dataflow currently isn’t advanced enough to replicate the stream-enrich functionality. I’d likely put Dataproc in its place instead, together with a future Spark version of stream-enrich. Dataflow (Apache Beam) is also a little buggy at the moment, but given the significant amount of development on the project it is likely to make large gains quickly.
Sinking data into BigQuery via streaming and/or batch raises interesting data modelling questions (to shred or not to shred?) around how data is structured in BigQuery, how schema migrations are handled, etc. BigQuery is the thing I’m most excited about on GCP (streaming, and no more having to think about column compressions!).
Pricing. We need a better way of calculating how much Snowplow is going to cost on GCP compared to AWS and Azure. Most of the pipeline is relatively easy to estimate (Compute Engine, Dataflow, Dataproc, etc.), but BigQuery is a lot trickier, as most customers don’t know in advance how much data they are going to process, so it’s problematic to estimate (particularly when including BI tools). Being able to compare, predict, and forecast these costs is critical.
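On the deduplication point above, one downstream option is deduplicating at query time in BigQuery. A minimal sketch, assuming enriched events are streamed into a table that retains Snowplow’s event_id and collector_tstamp fields; the analytics.events dataset and table names are placeholders:

```javascript
// Sketch: query-time deduplication in BigQuery, keeping one row per event_id.
// Assumption: enriched events land in a table that retains Snowplow's
// event_id and collector_tstamp fields; `analytics.events` is a placeholder name.
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

const query = `
  SELECT * EXCEPT (row_num)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY collector_tstamp) AS row_num
    FROM \`analytics.events\`
  )
  WHERE row_num = 1
`;

async function dedupedEvents() {
  const [rows] = await bigquery.query({ query });
  return rows;
}

dedupedEvents().then((rows) => console.log(`${rows.length} unique events`));
```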
We have been using Cloud Functions since the closed alpha, and in my opinion they are very stable and reliable now. I assume the service will leave beta soon and become available in more regions, though it might be North America and Europe first. Regarding battle-tested components: I agree, but I have also been in touch with the developer of SnowCannon, and he told me that the Node.js collector has been used by a large media company for years. However, I only wanted to use it for cookie management and for putting the payload into Pub/Sub, so that it can be handled by the scala-stream-collector on Google Cloud Dataflow.
Services being available only in certain regions is definitely a problem with Google Cloud Platform, unfortunately.
I assumed that Dataflow could host the scala-stream-collector and stream-enrich, and that Kinesis could be replaced by Pub/Sub. I must admit that I am only a JavaScript/Python developer and have not used Dataflow yet. Because you mentioned Spark, I’d like to add this link to the discussion: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
Data storage and data modeling with BigQuery probably require the most work, and I look forward to the official GCP RFC in this regard. By the way, there is a pricing calculator for BigQuery, but the actual costs probably depend a lot on how this gets implemented: https://cloud.google.com/products/calculator/