When I work with the DevOps team, we want to minimize resources, risk, etc. One question about the process we have right now is: what do the 2 enrich steps (Stream Enrich after the collector, and the enrich step in EmrEtlRunner) really do? Are they adding new fields? Increasing the number of records?
@AllenWeieiei, they can do a lot of things depending on your needs. In broad terms, the following 2 tasks are performed:
data validation against the corresponding JSON schemas (data quality)
widening of the captured data with additional info (configurable)
The 1st item is a must. Any data failing validation will be rejected and (depending on your pipeline architecture) set aside for further examination/recovery/reprocessing.
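To make the validation step concrete, here is a minimal sketch in Python. It is not Snowplow's implementation: the `SCHEMA` dictionary, the field names, and the `validate`/`split_events` helpers are all hypothetical, and a real pipeline validates against full JSON schemas rather than a simple type map. It only illustrates the idea that each event either passes and moves on, or fails and is set aside together with the reasons.

```python
# Hypothetical, simplified stand-in for a JSON schema:
# field name -> expected Python type.
SCHEMA = {"event_id": str, "user_id": str, "page_url": str}


def validate(event, schema=SCHEMA):
    """Return a list of validation errors (empty if the event is valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"wrong type for {field}")
    return errors


def split_events(events):
    """Route each event to 'good' or 'bad'; bad rows keep their errors."""
    good, bad = [], []
    for event in events:
        errors = validate(event)
        if errors:
            bad.append({"event": event, "errors": errors})
        else:
            good.append(event)
    return good, bad


events = [
    {"event_id": "e1", "user_id": "u1", "page_url": "https://example.com"},
    {"event_id": "e2", "user_id": 42},  # wrong type and a missing field
]
good, bad = split_events(events)
print(len(good), len(bad))
```

In this toy model the step never creates records: every incoming event ends up in exactly one of the two lists.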
The 2nd item allows you to enhance your data with additional values, including those coming from 3rd parties. These enrichments are configurable and optional. The Snowplow pipeline is very flexible and can be customized to your specific needs. Here’s the link to the various enrichments you can add to the pipeline: https://github.com/snowplow/snowplow/wiki/Configurable-enrichments.
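As a rough illustration of what "widening" means, the sketch below models configurable enrichments as functions that each contribute extra fields to an event. Everything here is an assumption made for the example: the `ENRICHMENTS` registry, the enrichment names, and the output field names are hypothetical and do not correspond to Snowplow's actual configuration format or code.

```python
from urllib.parse import urlparse


def add_url_parts(event):
    """Hypothetical enrichment: derive host and path from the page URL."""
    parsed = urlparse(event["page_url"])
    return {"url_host": parsed.netloc, "url_path": parsed.path}


def add_is_mobile(event):
    """Hypothetical enrichment: naive device flag from the user agent."""
    return {"is_mobile": "Mobile" in event.get("useragent", "")}


# Hypothetical registry; only the names listed in 'enabled' are applied.
ENRICHMENTS = {"url_parts": add_url_parts, "is_mobile": add_is_mobile}


def enrich(event, enabled):
    """Return a copy of the event widened with the enabled enrichments."""
    enriched = dict(event)
    for name in enabled:
        enriched.update(ENRICHMENTS[name](event))
    return enriched


event = {
    "event_id": "e1",
    "page_url": "https://example.com/pricing",
    "useragent": "Mozilla/5.0 (iPhone) Mobile Safari",
}
print(enrich(event, enabled=["url_parts"]))
```

The point of the sketch is that one event goes in and one event comes out, with more columns: turning an enrichment on or off changes which fields are added, not how many records there are.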