Loading Snowplow events into Redshift requires creating tables to store these values according to their schema. When I was searching for how to create the tables, I came across two options:
Generating the schemas using
Running the command
./igluctl.jar static generate schemas/ where the
schema folder contains the schemas from the iglu-cental repository and applying manually the create table scripts into Redshift.
Using the scripts from iglu-central
https://github.com/snowplow/iglu-central/tree/master/sql provides the create table scripts ready to run.
The problem is that when using the first option, I ran into an error on the
rdb_load step of the
`ERROR: Data loading error [Amazon](500310) Invalid operation: Number of jsonpaths and the number of columns should match. JSONPath size: 13, Number of columns in table or column list: 14`
It turns out that there’s a mismatch between the schema generated by igluctl and the hosted assets bucket (
The second option (using the ready-to-run scripts) worked out for me, as they were compatible with the schemas in the bucket.
What I want to know is if I am doing something wrong and why are the schemas (https://github.com/snowplow/iglu-central/tree/master/schemas) different from the create table scripts (https://github.com/snowplow/iglu-central/tree/master/sql).
Also, is there a way to automate the process of creating tables in Redshift?
@Pelielo, sure there is. However, first, could you explain why you would amend tables DDL for which is stored in Iglu Central? That repository is for the events and contexts supported by Snowplow tracking, dedicated methods in various tracking SDK.
Do you intend to evolve your own, custom, event (as opposed to those hosted by Snowplow)?
Generally, you would utilise your own Iglu server to host JSON schemas for your custom events/contexts in addition to those hosted in Iglu Central. Your resolver configuration file would contain a reference to your Iglu server as well. You can use different priorities to state which Iglu is the primary one which normally always should be your own Iglu (in case you need to override JSON schemas hosted by Snowplow)
Similarly, the JSONPaths could also be hosted in your own bucket which is denoted by
jsonpath_assets in your EmrEtlRunner configuration file, for example https://github.com/snowplow/snowplow/blob/r112-baalbek/3-enrich/emr-etl-runner/config/config.yml.sample#L9. Again, your own assets bucket takes priority over assets hosted by Snowplow.
@ihor, thank you for the response.
I am indeed thinking about creating custom events and your answer did clarify some of my doubts, but actually the problem I ran into was that the table definitions created while using
igluctl on iglu-central/schemas were different than the definitions on iglu-central/sql.
Can you tell me how the create table automation process could be achieved?
@Pelielo, those table definitions were generated a long time ago - some are as far as 3 years back. The difference you might see is Igluctl (later versions) supports new compression - ZSTD. When your table is defined to use that compression it allows to save up to 60% of your Redshift disk usage.