This is more of a question about whether my idea can be used with the Snowplow trackers. I would like to filter the data through an external API so that Snowplow collects only specific types of data. I tried sending data to my AWS collector through a third-party URL, but it is not working.
My question is: is this idea valid? Will Snowplow allow such filtering?
Can you give an example of what sort of filtering logic you are trying to employ? Some filtering may require an external service, while other filtering may be more appropriate at either the tracker or load balancer level.
My filtering service is running at ‘filter.example.com’.
I am sending tracking data to ‘filter.example.com’, and this filtering server redirects all the hits to my collector URL (c1.example.com).
I have gone through the JavaScript tracker. Now I want to do the following filtering:
1. Stop tracking data from users in certain countries.
2. Stop certain text that does not comply with our reporting system, such as ‘ợc+hợp+nhất|’. Such text breaks reporting files such as CSV and TSV.
3. Block certain kinds of content, for example porn.
Although you can do this in JavaScript (by calling an API to look up the likely country for an IP address), this is probably easier to do at the load balancer / CDN level by serving an empty JavaScript file / tracker. Avoid blocking the Snowplow requests directly, as this will cause them to queue up in the user's local storage.
Do you know where this data is coming from? Everything in the Snowplow pipeline should be UTF-8, so this shouldn’t break loading or processing in any part of the pipeline, but you may want to filter it out somewhere depending on how it’s being sent and whether it’s expected (you could likely do this in your schema definitions).
Is it porn URLs that are being sent through in fields or something else? This can be a trickier one to remove but I suspect the best place to do this would be a custom enrichment that flags adult sites and removes / redacts or drops the event depending on your desired behaviour.
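On the CSV / TSV point, here is a minimal Python sketch of one way a reporting export could be made more robust. It rests on an assumption the thread does not confirm: that the breakage comes from control characters or delimiters inside a field rather than from the non-ASCII characters themselves. The names clean_field and to_tsv_row are hypothetical helpers, not part of Snowplow.

```python
import csv
import io
import unicodedata


def clean_field(value: str) -> str:
    """Replace control characters (tab, newline, etc.) with a space.

    Non-ASCII letters such as 'ợ' are left alone; only characters in the
    Unicode 'Cc' (control) category are replaced.
    """
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in value)


def to_tsv_row(fields) -> str:
    """Serialise one row as TSV, letting the csv module handle any quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([clean_field(f) for f in fields])
    return buf.getvalue()


# Example row: the text from the question plus a field with embedded controls.
row = to_tsv_row(["ợc+hợp+nhất|", "line\tbreak\nhere"])
parsed = next(csv.reader(io.StringIO(row), delimiter="\t"))
```

Reading and writing the file as UTF-8 and leaving quoting to a CSV library is usually enough to keep such values in a single column.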
The IP lookup enrichment runs after collection so its primary use is to add geographic information to an event for analysis and filtering - rather than blocking.
If you want to stop events before they are collected, this depends a bit more on the use case. For example, do you want to stop events because you don’t have consent to collect, or do you just not want to collect for some other reason?
Depending on this, you could look at blocking countries at the CDN level (with the warning that this will still be an approximation based on the IP address), or alternatively you could run some client-side code that retrieves the country for an IP address using an API (such as ipify) and then determines whether the tracker should be initialised or not.
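Whichever layer makes the call, the decision itself is small. A rough sketch of it in Python follows; in the browser the same check would be written in JavaScript before initialising the tracker. The country codes, the should_track name and the choice not to track when the country is unknown are all assumptions made for illustration.

```python
# Hypothetical blocklist of ISO 3166-1 alpha-2 country codes (placeholders).
BLOCKED_COUNTRIES = frozenset({"XX", "YY"})


def should_track(country_code, blocked=BLOCKED_COUNTRIES) -> bool:
    """Return True if the tracker should be served / initialised.

    country_code is whatever the IP lookup returned; it may be None or
    empty when the lookup fails.
    """
    if not country_code or not country_code.strip():
        # Assumption: when the country is unknown, err on the side of not tracking.
        return False
    return country_code.strip().upper() not in blocked
```

If the check returns False, the idea is to serve an empty tracker file or skip initialisation, rather than letting the tracker run and then blocking its requests.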
I suppose by CDN level you mean at the application level? Do you think we can filter on the Snowplow end by adding a custom enrichment using JavaScript?
You could filter in the enrichment process, but you may be better off using the API enrichment rather than the JavaScript enrichment, so that you can easily swap out resources and databases as they get updated.
Yes - you would send the parts of the event that you want to filter to this API and then you could flag the events appropriately and remove them from your database.
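As a rough illustration of the flagging step, the logic behind such an API might look like the Python sketch below. It only shows the decision; how the enrichment is configured and what response format it expects are not covered here. The domain list, country list, function name and output keys are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical lists; in practice these would come from a maintained database.
ADULT_DOMAINS = {"adult.example"}
BLOCKED_COUNTRIES = {"XX"}


def classify_event(page_url, geo_country) -> dict:
    """Return flags for one event, based on its page URL and country code."""
    host = (urlsplit(page_url or "").hostname or "").lower()
    # Match the listed domain itself and any of its subdomains.
    adult = any(host == d or host.endswith("." + d) for d in ADULT_DOMAINS)
    blocked_country = (geo_country or "").upper() in BLOCKED_COUNTRIES
    return {
        "adult": adult,
        "blocked_country": blocked_country,
        "drop": adult or blocked_country,
    }


flags = classify_event("https://www.adult.example/page", "GB")
```

Events flagged with drop could then be excluded or deleted downstream, and keeping the lists behind the API means they can be updated without touching the pipeline.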