On-Premise PostgreSQL storage. Still requires S3?

anton · September 26, 2017, 5:51am

Hello @dbh,

A few points here.

Right now there is no way to load enriched data from S3 directly into a relational database. The vanilla batch pipeline has an additional Shred step that prepares enriched data for loading into Redshift and Postgres.
Even with the Shred step, Postgres currently lacks support for self-describing JSON: it loads only the atomic.events table, which is most likely less than you want.
S3 is currently hardcoded into RDB Loader, so it simply doesn’t know how to fetch data from other sources. This will obviously not remain the case forever: we’re planning to add new cloud providers and storage targets, which will inevitably also open opportunities for on-premise solutions. But considering the previous points, this one is the least of our problems.

All of the above makes loading Postgres from any object storage other than S3 hardly feasible right now. Still, we have seen many efforts (1, 2, 3) on this forum to build an on-premise pipeline using Kafka. I believe people usually end up with the Kafka Connect JDBC connector, which is less persistent than object storage but looks very promising.
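To give a rough idea of what the Kafka Connect route could look like, here is a minimal sketch that builds a configuration payload for the Confluent JDBC sink connector. The connector name, topic name, database URL and credentials are all made up for illustration. It also assumes the topic already carries records with a schema (for example Avro or JSON with schema); Snowplow enriched events are TSV, so some conversion step would be needed first, and that step is not shown here.

```python
import json


def jdbc_sink_config(name, topic, jdbc_url, user, password):
    """Build the JSON payload for registering a JDBC sink connector.

    All argument values are supplied by the caller; nothing here is
    specific to Snowplow. The topic is assumed to contain schema-ful records.
    """
    return {
        "name": name,
        "config": {
            # Confluent's JDBC sink connector class
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "connection.url": jdbc_url,
            "connection.user": user,
            "connection.password": password,
            # Plain inserts; the target table is created if it is missing
            "insert.mode": "insert",
            "auto.create": "true",
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only
    payload = jdbc_sink_config(
        name="snowplow-postgres-sink",
        topic="enriched-structured",
        jdbc_url="jdbc:postgresql://localhost:5432/snowplow",
        user="loader",
        password="change-me",
    )
    print(json.dumps(payload, indent=2))
```

The printed JSON is the kind of payload you would POST to the `/connectors` endpoint of a Kafka Connect worker's REST API. Treat it as a starting point, not a tested Snowplow setup.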

Hope that helps.

