I am a software engineer (increasingly a data engineer!) at a B2B SaaS company. We’ve grown steadily more frustrated with our existing analytics setup (GA; we can’t afford 360). The problems I believe Snowplow would solve for us are:
Data ownership. We are EU-based, and many of our customers are in heavily regulated sectors with strict expectations of their data processors.
Data integration. I have found no other tool that, like Snowplow, natively supports extracting individual events into your data warehouse and combining them with application data. At least, not without paying $$$$$. We’re interested in the habits of individual users, not just aggregations.
What I am more concerned about, though, is complexity and maintenance. Being B2B, our customer base is not large (certainly not in millions-of-events-per-day territory), and you could count the heads in our ops and data departments on one hand, so I wonder if I am overthinking things.
Could others comment on whether they see Snowplow as a tool for large companies that need the scalability of its architecture, or whether it is practical for SMBs to maintain with little ongoing overhead? Even if it is, what alternatives might be worth investigating?
As a B2B SMB that has been using Snowplow for years, I can definitely say yes, it is suitable. There is some overhead to maintain it, but the costs are low, and other than spending a few days every few months on upgrades, once you have it up and running the day-to-day time investment is minimal to zero.
Snowplow definitely addresses your two requirements, but it also gives you lots of flexibility in how you use the data. We use Snowflake as our data warehouse, pipe all the data from Snowplow into Snowflake, and then do further analysis and visualization from there using an assortment of tools. Other than the obvious tongue-twister and mix-ups between the two snow-things, it works perfectly, and the costs are manageable as well.
For context, we are a dev team of two full-stack developers who spend the vast majority of our time on our web application; Snowplow and Snowflake are also our responsibility but take very little effort to set up and maintain. We ingest roughly 5-6 million events per day in total across the sites we use Snowplow for.
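To make the “pipe it into Snowflake, analyse from there” step a bit more concrete, here is a minimal sketch of what querying the loaded events can look like from code. It assumes the Node.js snowflake-sdk package and the standard atomic.events table that the Snowplow loaders write to; the account, credential, warehouse and database names are placeholders rather than our actual setup, so treat it as an illustration only.

```typescript
// Rough sketch: daily event counts per page over the last 30 days, read from
// the atomic.events table populated by the Snowplow loader. All connection
// details below are placeholders.
import * as snowflake from "snowflake-sdk";

const connection = snowflake.createConnection({
  account: "your_account",                        // placeholder
  username: "ANALYTICS_USER",                     // placeholder
  password: process.env.SNOWFLAKE_PASSWORD ?? "",
  warehouse: "ANALYTICS_WH",                      // placeholder
  database: "SNOWPLOW_DB",                        // placeholder
});

connection.connect((connectErr) => {
  if (connectErr) throw connectErr;

  connection.execute({
    sqlText: `
      SELECT DATE_TRUNC('day', derived_tstamp) AS day,
             page_urlpath,
             COUNT(*) AS events
      FROM   atomic.events
      WHERE  derived_tstamp >= DATEADD('day', -30, CURRENT_TIMESTAMP())
      GROUP  BY 1, 2
      ORDER  BY 1, 3 DESC`,
    complete: (queryErr, _stmt, rows) => {
      if (queryErr) throw queryErr;
      console.table(rows); // hand off to whatever visualization tool you prefer
    },
  });
});
```

The same query runs just as happily from a Snowflake worksheet or a BI tool; the point is simply that the raw, event-level data sits in an ordinary table you can join against your application data.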
Definitely agree with @Brandon_Kane: the pivot point for adopting Snowplow often isn’t scale. It’s needing high-quality, complete data sets you can trust (there are many facets to this, several of which you’ve already mentioned). It is comforting to know that the pipeline scales so well too, as scale won’t become a deciding factor in a subsequent move.
There’s plenty of support available if you do decide to adopt the tech.
I’ve spent over a decade using Omniture (aka SiteCatalyst / Adobe Analytics) and the rest of the Marketing Cloud. I think you’ll find the flexibility of Snowplow is one of its greatest assets; the fact that you can create your own JSON schemas and enrichments to fit your business needs is, imo, unparalleled. Further to this, landing the data in your own EDW, rather than in a separate silo like many other solutions, is a huge bonus and should not be underestimated. In the past I’ve had to do this with Adobe Workbench, either extracting from the EDW by creating a primary key or pushing into an EDW, and the added overhead, complexity and maintenance is a serious time sink. Finally, other solutions can be a bit of a black box: each comes with its own data processing framework, and the idiosyncrasies can be frustrating. You control the end to end with Snowplow, so there’s none of that ambiguity.
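To illustrate what that schema flexibility looks like in practice, here’s a rough sketch of tracking a custom self-describing event with the JavaScript browser tracker (@snowplow/browser-tracker). The collector URL, app ID, schema vendor, event name and fields are all made-up examples rather than a published schema; the idea is that the event validates against a JSON Schema you write and host yourself, so what lands in your warehouse is structured exactly the way your business needs.

```typescript
// Sketch only: firing a custom self-describing event. The collector endpoint,
// app ID, and schema URI below are hypothetical placeholders.
import { newTracker, trackSelfDescribingEvent } from "@snowplow/browser-tracker";

// Point the tracker at your own collector, running in your own cloud account.
newTracker("sp", "https://collector.example.com", { appId: "my-b2b-app" });

// The schema URI resolves against a JSON Schema you publish to your own Iglu
// repository, e.g. com.example/report_exported/jsonschema/1-0-0.
trackSelfDescribingEvent({
  event: {
    schema: "iglu:com.example/report_exported/jsonschema/1-0-0",
    data: {
      reportId: "rpt-42",
      format: "csv",
    },
  },
});
```

Because the schema is versioned, you can evolve the event definition later without breaking the history you’ve already collected.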
As for the pipeline, it will simply scale with your business needs, so it’s really a non-issue.
To give you some idea, the smallest company I’ve seen Snowplow successfully implemented at has been a roughly 10-person startup, and they went with the open-source version. It’s not necessarily a small amount of effort to set up, but once you have everything up and humming it requires relatively little ongoing maintenance.
The largest we’ve helped is around 5k employees (though I know of installations far larger than this), using the Snowplow Insights product (the commercial offering of Snowplow), which is significantly easier to get started with than the open source. We’ve seen plenty of companies that fit between those two extremes as well.
Things to consider are:
From a data ownership perspective it’s hard to find a better tool, given that Snowplow runs within your own region / AWS / GCP account.
Matomo is a fantastic tool but has inherited MySQL as its analytics store under the hood. MySQL is a pain for analytics workloads because it’s not an OLAP / columnar database in the same class as BigQuery, Redshift and Snowflake.
For either the open-source or the commercial offering you will need to budget for running the cloud infrastructure. The largest variable in cost here is typically the data warehouse (Redshift, Snowflake or BigQuery). The infrastructure cost for lower-volume pipelines set up for high availability is often around $1-2k USD / month, though it is certainly possible to optimise this.
The starting price point for Snowplow Insights (depending on your volume and plan) is significantly lower than the starting point for GA360.
Do you have the engineering capacity to support running Snowplow open source? Many companies do, but they have to weigh the cost of that engineering support against what the commercial version would cost, where the infrastructure is managed for them.
Snowplow is open source, so you can run, inspect and modify the underlying code at will. A lot of effort has been made to make the ecosystem as extensible as possible, which means that if there’s something the pipeline doesn’t do today, you can either request a feature or make a pull request for it.
I believe that Snowplow very much leads the analytics ecosystem rather than follows the crowd. Early developments like schema versioning, validation and enrichments have paved the way for other analytics vendors to offer similar features.
Disclosure: I work at a company that is a Snowplow partner (along with other analytics vendors)
I just want to add that there’s another way to think about scalability: scaling the range of use cases and analytics. Over time, the problems you’re solving now will evolve into different questions, and new questions will arise.
Because Snowplow is built to do a good job of structured data collection in general, rather than to solve only the problem you have right now (and, obviously, because you own the data), a well-designed implementation will likely keep paying off as those use cases evolve. It should also suit iteration: you build on top of the implementation rather than rebuilding it.
So if in two years, for example, you realise that some of your metrics are no longer fit for purpose (and you’ve done a good job of the design), you should be in a position not only to iterate on your tracking, but also to recompute your metrics in a way that keeps your two-year history relevant and valuable wherever you have already been collecting the underlying data.
(following suit with Mike - disclaimer: I work for Snowplow)