I saw that loading good/bad events is not in the GCP quickstart or manual setup docs. I'm wondering if the Elasticsearch loader works for GCP implementations.
Hi @ryanrozich, not currently, no - the loader only supports Kinesis as a data source to pull from.
We have generally found on GCP that, because we stream the data directly into BigQuery, the “realtime” use-cases Elasticsearch serves on AWS can be handled directly in BigQuery without needing to load into Elasticsearch as well.
Is there a particular use-case you are looking to achieve by loading into Elasticsearch on GCP?
This is my first time working with Snowplow on GCP; I didn't realize that the BigQuery loader streams data directly to BigQuery. That's very cool!
Elasticsearch was our go-to for real-time monitoring, but it looks like a streaming loader could solve that too.
We also used Elasticsearch in AWS deployments for a few other things:
(a) Monitoring failed events, figuring out why events failed and getting reports on them using Kibana – super helpful.
(b) Building Kibana dashboards from event data was a breeze since we could query any field in the JSON easily. Not sure how querying JSON in BigQuery SQL compares.
(c) Finding events and digging into event lists and Kibana dashboards with Lucene query strings was always quick and fun. Just typing a search or clicking on some facets for different event properties made everything more interactive and easy to observe. Not a must-have, but it sure made debugging smoother!
If we wanted to replicate these use cases in GCP with an ES cluster, do you think it would be easier to customize the Kinesis ES Loader to subscribe to Pub/Sub instead of Kinesis, or could this GCP template do the same thing? Pub/Sub to Elasticsearch template | Cloud Dataflow | Google Cloud
So it is in our backlog to add Elasticsearch support to GCP - though not near the top of the pile currently!
You can build out some of these use-cases by querying the bad row data directly in BigQuery as well - see this guide: Querying failed events in Athena and BigQuery | Snowplow Documentation
It is not quite “real-time” but it's generally fast enough for most use-cases.
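For illustration, a query from Python might look like the sketch below - the project, dataset, and table names are placeholders (the linked guide walks through creating the actual tables over your bad rows bucket):

```python
# Hypothetical sketch: querying failed events in BigQuery from Python
# with the google-cloud-bigquery client. All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT *
    FROM `my-project.snowplow_bad_rows.enrichment_failures`  -- placeholder table
    LIMIT 10
"""

# Run the query and print each failed event row as a dict
for row in client.query(query).result():
    print(dict(row))
```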
On to this one - the main challenge in inserting the “enriched” data into Elasticsearch is that you need to convert the TSV into JSON. You can do this fairly easily with our Analytics SDKs, but you would need to apply that transformation before loading the data into Elasticsearch.
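To make that concrete, here is a minimal sketch using the Python Analytics SDK - the Elasticsearch host and index name are placeholders, and in practice the TSV lines would come off your enriched Pub/Sub topic:

```python
# Sketch: converting an enriched TSV event to JSON with the Python
# Analytics SDK, then indexing it. Host and index names are placeholders.
from elasticsearch import Elasticsearch
from snowplow_analytics_sdk.event_transformer import transform
from snowplow_analytics_sdk.snowplow_event_transformation_exception import (
    SnowplowEventTransformationException,
)

es = Elasticsearch("http://localhost:9200")  # placeholder host

def index_enriched_event(tsv_line: str) -> None:
    try:
        # transform() turns the tab-separated enriched event into a
        # JSON-friendly dict keyed by the canonical field names
        event = transform(tsv_line)
    except SnowplowEventTransformationException as sete:
        # Malformed lines end up here; a real pipeline would route
        # these to a dead-letter destination rather than just printing
        for error_message in sete.error_messages:
            print(error_message)
        return
    es.index(index="snowplow-enriched", document=event)
```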
The bad rows are already JSON, so you may be lucky in that streaming directly from the bad stream into Elasticsearch just works. It's not something we have attempted but certainly worth having a go at to see if it does what you need it to do!
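If you do give it a go, an untested sketch along these lines might be a starting point - the project, subscription, host, and index names are all placeholders:

```python
# Untested sketch: forwarding bad rows (already JSON) from a Pub/Sub
# subscription straight into Elasticsearch. All names are placeholders.
import json

from elasticsearch import Elasticsearch
from google.cloud import pubsub_v1

es = Elasticsearch("http://localhost:9200")  # placeholder host
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "bad-rows-sub")

def callback(message):
    # Bad rows are self-describing JSON, so they can be indexed as-is
    es.index(index="snowplow-bad", document=json.loads(message.data))
    message.ack()

future = subscriber.subscribe(subscription, callback=callback)
future.result()  # block and process messages until interrupted
```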
I didn’t realize that you could query failed events in BigQuery too, also very cool!
This is great news for my GCP implementation. Are there any capabilities yet to query failed events in Snowflake?
Thanks again!
Not that we have support for - it should be technically possible to load the data in, but it's not a path we have gone down!
However, if you are on AWS loading Snowflake, you can follow the above guide but use Athena instead.