Currently, I am testing the gcp-lake-loader. There are a few things I want to understand and check.
Currently, the event and context columns are named like this:
Is there a way to simplify these names? Are there any built-in transformers for this?
Can I get rid of canonical null columns to shorten the width of the rows?
@Jayant_Kumar I don’t see the need to simplify this. In columnar storage it doesn’t really matter much how many columns you have, as it has minimal effect on data size (null columns/fields don’t allocate any extra space).
It just looks a little messy, with a bunch of attributes that mean nothing to many people.
Arguably it would be better to have some sort of filter support to disable some of them by default.
I’m not sure how you could do this. The idea behind creating these columns is that they should (at some stage) contain non-null values, so you can’t really hide them at all. Ideally nobody should really be looking at the raw table; in downstream modelling you can select and filter down to just the things you are interested in.
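As a rough illustration of that downstream filtering, here is a minimal Python sketch that keeps only the columns with at least one non-null value. The helper names (`non_null_columns`, `project`) and the sample rows are made up for illustration and are not part of any loader; in practice you would more likely do this with a view or model in your warehouse.

```python
from typing import Any


def non_null_columns(rows: list[dict[str, Any]]) -> list[str]:
    """Return the column names that hold at least one non-null value,
    in the order they are first seen."""
    seen: dict[str, bool] = {}
    for row in rows:
        for name, value in row.items():
            seen[name] = seen.get(name, False) or value is not None
    return [name for name, has_value in seen.items() if has_value]


def project(rows: list[dict[str, Any]], columns: list[str]) -> list[dict[str, Any]]:
    """Keep only the requested columns in every row."""
    return [{col: row.get(col) for col in columns} for row in rows]


# Hypothetical sample of raw rows; the values are invented for this example.
raw_rows = [
    {"event_id": "e1", "app_id": "web", "mkt_medium": None, "ti_sku": None},
    {"event_id": "e2", "app_id": "web", "mkt_medium": "cpc", "ti_sku": None},
]

keep = non_null_columns(raw_rows)    # ti_sku is null everywhere, so it is dropped
slim_rows = project(raw_rows, keep)  # the narrower "modelled" rows
```

The raw table stays untouched, so a column that starts receiving values later simply shows up the next time the column list is computed.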
@mike If you think about creating a data catalog over the raw data, it will look like a mess.
I was thinking of dropping fields using transformers after the enrich stage, but I am not sure that would work, because the loaders deserialise events using Event from the analytics SDK. So it may not help.
If there were a way to drop null fields using a config or something, that would be great. The loaders support schema evolution anyway, so when those columns have values, they can start writing them.