We use Snowplow for analytics at our company and we have run into a problem querying the events table. We deployed Snowplow on AWS using Terraform, and we use Metabase as the UI tool for analysis.
Over time we have uploaded a lot of schemas and the events table has grown considerably. It has become so big that Metabase is no longer able to load the table.
I am also not able to query the table directly in the Postgres DB; the query just returns Killed.
So my question: is there any way to optimize the events table or the way we query it? Maybe I can split the table somehow? What could be solutions to this problem?
So, the events table is basically a master historical record of every event you’ve ever tracked, which isn’t well suited to frequent access or analytics workloads. Also, by its nature a raw list of events isn’t a great starting point for analysis, since analysis generally wants to start from some business context (which you get with simple aggregations). For example, page ping events alone don’t tell you much, but aggregated up to page view level they expose a lot of rich information on which a lot of interesting analysis can be built.
For that reason, the recommended way to interact with the data for analysis is to set up a data modeling step in your pipeline, which aggregates the raw events into something more intelligible from an analysis point of view. Typically the output tables of that modeling process are what you would layer Metabase or similar tools over.
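To make that concrete, here’s a minimal sketch of what such a modeling step could look like in plain Postgres SQL. This is not the official Snowplow model (the real models are much more complete and key each page view on the web page context’s page_view_id); it assumes the standard atomic.events column names and the default 10-second page ping heartbeat:

```sql
-- Minimal sketch only - not the official Snowplow model.
-- Assumes standard atomic.events columns and a 10-second page ping heartbeat.
CREATE SCHEMA IF NOT EXISTS derived;

CREATE MATERIALIZED VIEW derived.page_views AS
SELECT
    domain_userid,
    domain_sessionid,
    page_urlpath,
    MIN(derived_tstamp)                               AS first_seen,
    MAX(derived_tstamp)                               AS last_seen,
    COUNT(*) FILTER (WHERE event = 'page_view')       AS page_views,
    COUNT(*) FILTER (WHERE event = 'page_ping') * 10  AS approx_engaged_seconds
FROM atomic.events
WHERE event IN ('page_view', 'page_ping')
  AND derived_tstamp >= now() - interval '90 days'    -- keep the output small
GROUP BY 1, 2, 3;

-- Refresh periodically to pick up newly loaded events:
REFRESH MATERIALIZED VIEW derived.page_views;
```

Pointing Metabase at derived.page_views instead of atomic.events means it only ever has to load the small aggregated table.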
We have a bunch of standard data models, which you’ll find in the docs I linked above. I recommend starting with the dbt ones - they’re the most accessible starting point.
Actually, we can’t afford any data modeling for now. dbt is not suitable for us, and I don’t quite understand how sql-runner works. Does it create a new table in my Postgres DB? Because that is what we actually need. And where do I need to run it - locally?
Or maybe there are other ways to do what we want?
sql-runner will run locally and uses a combination of YAML and SQL files. The YAML organises the SQL into steps. Once it’s fully configured and the Snowplow playbooks are set up with the right steps, the final output will be views or tables (up to you) covering aggregations of users, sessions and page views, which is an extremely powerful starting point to build on. That being said, the runner is more configuration heavy; dbt is way more accessible. Either the dbt or the sql-runner approach will give you what you need here. You can then build on the models with joins, for example a structured events view (if you use structured events) or your own custom context/unstruct views.
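To answer the “does it create a new table in my Postgres DB” question concretely: yes - each playbook step is just SQL that sql-runner executes against the target database defined in the playbook’s YAML, so the output lands as ordinary tables or views in your Postgres DB. As an illustrative sketch only (not the SQL shipped with the Snowplow playbooks; column names assume the standard atomic.events schema), one step might run something like:

```sql
-- One playbook step is just SQL like this, executed by sql-runner against the
-- Postgres target defined in the playbook's YAML file.
-- Illustrative only - not the official Snowplow model SQL.
DROP TABLE IF EXISTS derived.sessions;

CREATE TABLE derived.sessions AS
SELECT
    domain_userid,
    domain_sessionid,
    MIN(derived_tstamp)                           AS session_start,
    MAX(derived_tstamp)                           AS session_end,
    COUNT(*) FILTER (WHERE event = 'page_view')   AS page_views,
    COUNT(DISTINCT page_urlpath)                  AS distinct_pages
FROM atomic.events
WHERE event IN ('page_view', 'page_ping')
GROUP BY 1, 2;
```

You can run it from your laptop or any small box that can reach the database (e.g. on a schedule), and then point Metabase at the derived schema rather than at atomic.events.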