Install on VPS not GCP or AWS possible?

michacassola · November 15, 2022, 2:03pm

Hey, is it possible to install Snowplow on generic cloud instances using only open source software, without GCP or AWS?

josh · November 16, 2022, 12:47am

Hey @michacassola, parts of it can certainly be installed without those cloud providers, but the full flow is not really possible.

We currently support Kafka and RabbitMQ as streaming layers between the core micro-services, so you can get enriched and validated data into a topic with either of these systems. Not all of our micro-services support these queues, however, especially when it comes to downstream loading of warehouses.

You should be able to orchestrate the micro-services in whatever way works for you: running the JARs directly, running the containers directly, or using systems like docker-compose/swarm or more advanced schedulers like Kubernetes.

However, getting that data into a data warehouse like Databricks or Snowflake is where this starts to get a little more difficult, as we lean on cloud blob storage to load data into these destinations. So without something like S3 you can’t get it into a DWH.

If you are purely interested in real-time data or have the ability to deal with loading the data where it needs to go from Kafka / RabbitMQ then it certainly should be possible!
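To make the "load it yourself from Kafka / RabbitMQ" option a bit more concrete, here is a minimal sketch of turning raw message payloads into rows you could insert wherever you like. It assumes the enriched records arrive as tab-separated text; the shortened column list and the function names are made up for illustration, and the sample list stands in for whatever consumer loop you use.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical, shortened column list for illustration only.
# A real enriched record has many more columns; take the names from the docs.
FIELDS: List[str] = ["app_id", "platform", "etl_tstamp"]


def parse_tsv_event(line: str, fields: List[str] = FIELDS) -> Dict[str, str]:
    """Map the leading columns of one tab-separated record to field names."""
    values = line.rstrip("\n").split("\t")
    # zip stops at the shorter sequence, so extra columns are ignored here.
    return dict(zip(fields, values))


def to_rows(messages: Iterable[bytes]) -> Iterator[Dict[str, str]]:
    """Decode raw payloads (e.g. from a consumer loop) into dicts."""
    for payload in messages:
        yield parse_tsv_event(payload.decode("utf-8"))


if __name__ == "__main__":
    # Stand-in for messages read from a topic or queue.
    sample = [b"shop\tweb\t2022-11-16 00:47:00.000\tignored"]
    for row in to_rows(sample):
        print(row)
```

From there the rows can go into any store you can write to, which is the part the pipeline does not do for you outside the supported destinations.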

Would you mind sharing the use case you are trying to fulfill?

michacassola · November 16, 2022, 1:29pm

Hey @josh , thanks so much for your detailed answer!
I am just shopping around for an open source Google Analytics alternative and never really fell in love with Matomo. I know that what you offer is overkill for my use case, but better to have more power than you need than too little. I saw it praised as an alternative in a YouTube video, but I am realising that you offer data management rather than web analytics like GA?

I am no expert, but aren’t there capable open source data warehouses out there?

Also, S3 isn’t a problem: many other providers offer S3-compatible object storage.
There is also MinIO: https://min.io
I think Ceph also has S3 compatibility.

josh · November 17, 2022, 3:24am

Snowplow can certainly be used as a replacement for GA from a tracking point of view, but you won’t get any of the out-of-the-box visualizations and such. What you get is an open source data pipeline, plus open source data models that you can run on your warehouse to pull insights from the data you collect. Our dbt web model supports BigQuery, Databricks, Redshift, Snowflake & Postgres (GitHub: snowplow/dbt-snowplow-web, a fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables of varying levels of aggregation).

The best way to get started and see if Snowplow is a fit would likely be to go down the Quick Start route and spin up a base pipeline in an AWS environment. This will stream data in real time into a Postgres instance that you can then model with the above package.

We also have our Try Snowplow experience, which is hosted by us, if you just want to have a poke around at what you can do with Snowplow.

On open source data warehouses: there certainly are capable ones; we just don’t have any out-of-the-box support for loading them as of yet!

On S3-compatible storage: very true, but this depends on our micro-services allowing endpoint overrides so that the AWS SDK still works, which is not true of every service as of today. We have some support for this in core components, to use things like Localstack or other AWS-compatible APIs, but it’s not a core focus to have that compatibility layer.
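For anyone wondering what an "endpoint override" looks like from the client side, here is a rough sketch using boto3, whose `client()` call accepts an `endpoint_url` argument. This shows the general idea for your own code talking to MinIO, Ceph or Localstack; it is not how the Snowplow services are configured. The environment variable names `S3_ENDPOINT_URL` and `S3_REGION` and both function names are assumptions made up for this example.

```python
import os
from typing import Dict, Optional


def s3_client_kwargs(env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for an S3 client, with an optional endpoint override."""
    env = dict(os.environ) if env is None else env
    kwargs = {"region_name": env.get("S3_REGION", "us-east-1")}
    # Hypothetical variable name; when unset, the SDK's default AWS endpoint is used.
    endpoint = env.get("S3_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


def make_s3_client(env: Optional[Dict[str, str]] = None):
    """Create the client; boto3 is third-party, so it is imported lazily here."""
    import boto3

    return boto3.client("s3", **s3_client_kwargs(env))
```

With `S3_ENDPOINT_URL=http://localhost:9000` set, `make_s3_client()` would talk to a local S3-compatible server instead of AWS. A service only gets this behaviour if it exposes such a setting, which is the limitation described above.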
