On-Premise PostgreSQL storage. Still requires S3?


I’m interested in running as much of Snowplow on-premise as possible. Currently, I’m stuck on configuring the step that picks up enriched events and loads them into PostgreSQL storage.

Some of the documentation I’ve seen seems to imply that S3 is required for persisting the events to storage.

Is there a way to swap out Amazon S3 buckets for an on-premise option when handing enriched events off for long-term storage in PostgreSQL?

If S3 is required, has anyone tried using a local Minio instance instead?

FYI, I’m using Kafka as the back-end, and have simple steps configured for

  • JavaScript Tracker
  • Snowplow Collector (Scala), writing to Kafka
  • Snowplow Stream Enrich, reading from Kafka
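For reference, the Kafka sink is selected in the Scala Stream Collector’s HOCON configuration. This is a minimal sketch, not a complete config; the broker address and topic names are placeholder assumptions:

```hocon
# Minimal sketch of the Kafka-relevant parts of the Scala Stream Collector
# config. Broker address and topic names are placeholders.
collector {
  interface = "0.0.0.0"
  port = 8080

  streams {
    good = "collector-good"   # raw events topic, consumed by Stream Enrich
    bad = "collector-bad"     # events that fail size/format checks

    sink {
      enabled = "kafka"
      brokers = "kafka-broker:9092"
    }
  }
}
```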

Currently, the company I work for uses Mesosphere DC/OS. There is considerable investment in running their systems in DC/OS (locally hosted) and keeping data within their own data centers.

My Snowplow use cases are as follows:

  1. Web-click tracking and analysis - the most traditional Snowplow use case.
    1a. Track/analyze users’ clicks, page navigation, changes of selected options, timings on pages and page items, etc. Perform analytics on this info.
  2. Serve as a “Messaging Framework” - hence the interest in Kafka.
    2a. Audit-Only Events - capture events from various applications and systems purely for auditing and further analytics.
    2b. Serve as a Central Messaging Hub/Broker for Actionable Events (events from one or more systems/apps that require further applications/systems to perform some processing). These would also be audited. Here, minimal latency between putting an event into a collector and having the enriched event available in a Kafka topic will be critical.
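To make use case 2b concrete, here is a sketch of the consumer side: enriched events land on the Kafka topic as tab-separated lines, so a downstream consumer mainly needs to map columns to named fields. The field order below follows Snowplow’s canonical event model (leading columns only); the sample line is synthetic, and a real consumer would read lines from the enriched topic via a Kafka client rather than from a literal.

```python
# Sketch of the parsing step for a latency-sensitive consumer of enriched
# events (use case 2b). Enriched events arrive as tab-separated lines; the
# field names below are the leading columns of Snowplow's canonical event
# model. The sample line is synthetic, with most trailing columns omitted.

LEADING_FIELDS = ["app_id", "platform", "etl_tstamp",
                  "collector_tstamp", "dvce_created_tstamp",
                  "event", "event_id"]

def parse_enriched_tsv(line: str) -> dict:
    """Map the leading columns of an enriched TSV line to named fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(LEADING_FIELDS, values))

# Example with a synthetic enriched line:
sample = ("shop\tweb\t2017-01-01 00:00:00\t2017-01-01 00:00:01\t"
          "2017-01-01 00:00:00\tpage_view\t"
          "deadbeef-0000-0000-0000-000000000000")
event = parse_enriched_tsv(sample)
print(event["event"])   # page_view
print(event["app_id"])  # shop
```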

I appreciate any comments and guidance.


In theory you should be able to switch out S3 for Minio/OpenStack, though in practice I’ve never seen anyone do it. Out of interest, what is your use case for on-premise?

Hello @dbh,

A few points here.

  1. There’s no way right now to load enriched data straight into any relational database. In the vanilla batch pipeline we have an additional Shred step that prepares enriched data for loading into Redshift and Postgres.
  2. Even with the Shred step, Postgres support right now lacks self-describing JSONs - it loads only the atomic.events table, which is most likely less than you want.
  3. S3 right now is hardcoded into RDB Loader, so it simply doesn’t know how to fetch data from other sources. This is obviously not going to remain the case forever - we’re planning to add new cloud providers and storage targets, which will inevitably also open up opportunities for on-premise solutions. But considering the previous points, this one is the least of the problems.

All of the above makes loading Postgres from any object storage other than S3 hardly feasible right now. Still, we’ve seen many efforts (1, 2, 3) on this forum to build an on-premise pipeline using Kafka. I believe people usually end up with Kafka Connect’s JDBC sink, which is less durable than object storage but looks very promising.
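For anyone exploring that route, a Kafka Connect JDBC sink is driven by a small properties file. This is a sketch only: the topic name, host, database, and credentials are placeholders, and note that the Confluent JDBC sink expects records with schemas (e.g. Avro, or JSON with an embedded schema), so raw enriched TSV lines would need a conversion step first.

```properties
# Hypothetical Kafka Connect JDBC sink config for loading a Postgres table.
# Topic name, host, database, and credentials are placeholders.
name=enriched-events-pg-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=enriched-good
connection.url=jdbc:postgresql://pg-host:5432/snowplow?user=loader&password=changeme
insert.mode=insert
auto.create=true
```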

Hope that helps.


Thank you! I’ve added use case descriptions to the original post.
FYI - Usage of Mesosphere DC/OS may be similar to OpenStack.

Thanks very much for the feedback! This is good to know and may shape the direction for the analytics part of the project.

FYI, I added some use case info to the original post.

Check this link: https://www.confluent.io/blog/kafka-connect-sink-for-postgresql-from-justone-database/