Using Snowplow data to feed other applications

bernardosrulzon · September 5, 2016, 1:49am

Hey all!

We’re in the process of building an infrastructure that will allow Snowplow data to be used in other applications (e.g. a python-based algorithm to predict the number of service providers interested in a given service). We currently use Redshift as a data warehouse, copying our transactional MySQL databases to be used together with web-analytics events.

We have basically two use cases:

1. When data could be outdated by a few days
In simple cases, where we can load historical to train a model, data could be outdated by a few days without significant impact. We could connect to Redshift, load data into memory, and work from there because Redshift has a small-ish limit of concurrent connections. Is there a better option?

2. When data must be fresh
In more complex cases, we need data to be up-to-date within a few seconds. With the Snowplow real-time pipeline implemented, Elasticsearch would be an obvious choice here. But the ES cluster would need to store massive amounts of data (including all historical data) - is that the way to go? What other options should we consider?

I’d really appreciate if you could share your experience in this subject.

Thanks!
Bernardo

Simon_Rumble · September 5, 2016, 10:01am

What you’re describing in 2 is a Lambda Architecture. Your Elasticsearch
cluster doesn’t need to have more data in it than up to the last time a
batch mode loaded into Redshift.

Check out these links:

How to setup a Lambda Architecture for Snowplow
http://discourse.snowplow.io/t/how-to-setup-a-lambda-architecture-for-snowplow/249
Lambda Architecture http://lambda-architecture.net/

Topic		Replies	Views
Sample Data For Redshift Redshift	3	2048	December 22, 2016
Snowplow real-time analysis with on-premise pipeline For engineers	6	2594	January 4, 2018
On-premise Realtime Pipeline For engineers	2	2438	January 3, 2018
SendGrid+Snowplow+AWS S3&Redshift For engineers	30	4592	October 30, 2019
Using AWS Athena to query the 'good' bucket on S3 For data modelers & consumers	2	8907	June 28, 2017

Using Snowplow data to feed other applications

Related topics