Using Snowplow data to feed other applications

Hey all!

We’re in the process of building an infrastructure that will allow Snowplow data to be used in other applications (e.g. a python-based algorithm to predict the number of service providers interested in a given service). We currently use Redshift as a data warehouse, copying our transactional MySQL databases to be used together with web-analytics events.

We have basically two use cases:

1. When data could be outdated by a few days
In simple cases, where we load historical data to train a model, the data can be a few days stale without significant impact. Because Redshift has a small-ish limit on concurrent connections, we'd connect, load the data into memory, and work from there. Is there a better option?
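The "connect briefly, pull everything into memory, disconnect" pattern for use case 1 can be sketched like this. This is only an illustration: it uses stdlib `sqlite3` as a stand-in for a Redshift connection (against Redshift you'd typically use `psycopg2` with the same DB-API calls), and the table and column names are made up.

```python
import os
import sqlite3
import tempfile

def load_training_rows(db_path, query):
    """Open a connection only long enough to pull the result set
    into memory, then close it so the scarce connection slot is freed."""
    conn = sqlite3.connect(db_path)  # with Redshift: psycopg2.connect(dsn)
    try:
        rows = conn.execute(query).fetchall()  # whole result set in memory
    finally:
        conn.close()  # release the connection immediately
    return rows

# Demo: a temp-file database stands in for the warehouse.
path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "view"), (2, "click"), (1, "click")])
conn.commit()
conn.close()

rows = load_training_rows(path, "SELECT user_id, action FROM events")
print(len(rows))  # 3 rows now in memory; no connection held open
```

The model-training code then works entirely off the in-memory rows (or a DataFrame built from them), so a long-running job never ties up one of Redshift's limited connections.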

2. When data must be fresh
In more complex cases, we need the data to be up-to-date within a few seconds. With the Snowplow real-time pipeline in place, Elasticsearch would be an obvious choice here. But the ES cluster would need to store massive amounts of data (including all historical data) - is that the way to go? What other options should we consider?

I’d really appreciate it if you could share your experience on this subject.


What you’re describing in 2 is a Lambda Architecture. Your Elasticsearch
cluster only needs to hold the data that has arrived since the last batch
load into Redshift - everything older can be served from Redshift itself.

Check out these links:
