Dumb question, but I’ve been through as much of the documentation as I can find. I have custom events stored in S3, and I want to batch process them once an hour with additional validation rules and enrichment data mappings. For example: take the contents of the URL referrer field; if it’s a known value, translate it to X, otherwise translate it to Y. Where and how do I program Scalding to define this as a MapReduce function? I’ve found the EmrEtlRunner config file, but I’m not seeing where the actual business logic resides.
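To make it concrete, here is roughly the kind of mapping I’m after (just a sketch with made-up names, since I don’t know the real API yet):

```scala
// Hypothetical referrer translation — names are illustrative,
// not Snowplow's actual enrichment API.
val knownReferrers: Map[String, String] = Map(
  "google.com" -> "Search",
  "t.co"       -> "Social"
)

// Pure lookup: known referrers map to a category ("X"),
// everything else falls through to a default ("Y").
def translateReferrer(referrer: String): String =
  knownReferrers.getOrElse(referrer, "Other")
```

In Scalding terms I’d expect this to run as a map over each event, but I don’t see where to plug it in.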
Yes, I saw that, but it wasn’t clear that this is the primary extension mechanism. So the Hadoop parallelism will be per event, and I should just call out to an external service that does the data translation of the various parameters? No data lookups happen inside Hadoop this way, correct?
It means you can run and rerun the Snowplow enrichment process without causing side effects in other systems (in functional-programming terms, a pure versus an impure function).
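A toy illustration of the distinction (hypothetical names, not Snowplow’s actual enrichment API):

```scala
// Pure: same input always yields the same output, no side effects,
// so rerunning the enrichment over the same events is safe.
def enrichPure(event: Map[String, String]): Map[String, String] =
  event + ("platform" -> event.getOrElse("p", "web"))

// Impure: touching an external system during enrichment means a
// rerun repeats the side effect (double-notifies, double-writes, etc.).
def enrichImpure(event: Map[String, String]): Map[String, String] = {
  // e.g. an HTTP POST to a downstream system would go here
  println(s"notifying downstream for ${event.getOrElse("event_id", "?")}")
  event
}
```

That is why the enrichment step itself should stay pure, with lookups against static mappings shipped with the job rather than live calls to other systems.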
Our whole idea is to avoid errors where possible and make sure every event reaches the end of the pipeline. We plan to achieve this by normalizing incoming data: if we get a string field where we expect an integer, we just fix it in place, convert it to the right type, and log it (we keep our logs in Elasticsearch, BTW). By analyzing the logs we can then fix the problems in our code. There could be many situations (especially in the early stages of developing our analytics) where merely adding a new field to self-describing events and contexts would cause all the data to land in the bad bucket, which is really not good for us.
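The “normalize instead of reject” approach we have in mind looks roughly like this (a minimal sketch; field handling and the warning message are our own assumptions, not Snowplow’s API):

```scala
// Coerce a raw string to Int; on failure, substitute a default and
// return a warning message to be shipped to our log store instead of
// sending the whole event to the bad bucket.
def normalizeInt(raw: String, default: Int = 0): (Int, Option[String]) =
  raw.trim.toIntOption match {
    case Some(n) => (n, None)
    case None    => (default, Some(s"expected integer, got '$raw'; using $default"))
  }
```

The first element feeds the enriched event; the optional second element is what we would index into Elasticsearch for later analysis.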