Dumb question, but I’ve been through as much of the documentation as I can find. I have custom events stored in S3 and want to batch process them once an hour with additional validation rules and enrichment data mappings. For example: take the contents of the URL referrer field and, if it’s known, translate it to X, otherwise translate it to Y. Where and how do I program Scalding to define this as a MapReduce function? I’ve found the EmrEtlRunner config file, but I’m not seeing where the actual business logic resides.
Yes, I saw that, but it was not clear that this is the primary extension mechanism. So the Hadoop parallelism will be by event, and I should just make a call out to an external service that does the data translation of the various parameters? No data lookups in Hadoop this way, correct?
Correct - the Hadoop parallelism is by event. We are working on adding support for writing a custom enrichment as a packaged JVM jar (so you could write it in Java or Scala), but in the meantime, yes, the JavaScript enrichment is the way to go.
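To answer the original question: the business logic lives in the script you pass to the JavaScript enrichment - you define a `process(event)` function and return an array of self-describing contexts to attach to the event. A minimal sketch of the referrer mapping you described might look like the following; the `getPage_referrer()` getter and the `com.acme/referrer_mapping` schema are assumptions for illustration, so double-check the getter names against the enriched event POJO and register your own schema in Iglu:

```javascript
// A minimal sketch of a JavaScript enrichment script (Rhino, so ES5).
// Assumes the enriched event exposes a getPage_referrer() getter and that
// com.acme/referrer_mapping is a schema you have defined yourself.
var KNOWN_REFERRERS = {
  "https://www.google.com/": "search",
  "https://t.co/": "social"
};

function process(event) {
  var referrer = String(event.getPage_referrer() || "");

  // If the referrer is known, translate it to X; otherwise fall back to Y.
  var mapped = KNOWN_REFERRERS[referrer] || "unknown";

  // Return a derived context which gets attached to the event.
  return [{
    schema: "iglu:com.acme/referrer_mapping/jsonschema/1-0-0",
    data: {
      originalReferrer: referrer,
      mappedReferrer: mapped
    }
  }];
}
```

The script itself is base64-encoded into the enrichment’s JSON configuration file, so you don’t need to touch the Scalding job or the EmrEtlRunner config at all.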
If you’d rather not put the logic inside the JavaScript enrichment, in R79 you’ll be able to integrate an external service holding the logic, using the API Request Enrichment.
@alex, I was asking whether it’s possible to do that in the Rhino JavaScript enricher. Sorry, I shouldn’t have said ‘enrichment’; I was talking about this particular JavaScript enricher.
I am pretty sure it’s possible to make an HTTP call from inside the JavaScript enrichment - but if you can, it would be cleaner to handle the error in-band, simply returning an error context which will be attached to the event for further processing downstream.
It means you can run and rerun the Snowplow enrichment process without causing side effects in other systems (in functional programming terms, pure versus impure function).
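To make “in-band” concrete, something along these lines should work - the `enrichment_error` schema and the `riskyLookup` helper are just placeholders you would replace with your own:

```javascript
// Sketch of handling a failure in-band instead of letting an exception
// escape the enrichment. Schema URIs and riskyLookup are illustrative only.
function riskyLookup(referrer) {
  // Placeholder for whatever might fail, e.g. parsing or an HTTP call.
  if (!referrer) {
    throw "no referrer on event";
  }
  return String(referrer).toLowerCase();
}

function process(event) {
  try {
    var mapped = riskyLookup(event.getPage_referrer());
    return [{
      schema: "iglu:com.acme/referrer_mapping/jsonschema/1-0-0",
      data: { mappedReferrer: mapped }
    }];
  } catch (e) {
    // Returning an error context keeps the event moving through the
    // pipeline; downstream jobs can filter on this context instead.
    return [{
      schema: "iglu:com.acme/enrichment_error/jsonschema/1-0-0",
      data: { message: String(e) }
    }];
  }
}
```

Because the failure is recorded as data rather than thrown, the event still reaches the end of the pipeline and a rerun stays free of side effects.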
Our whole idea is to avoid errors wherever possible and make sure every event reaches the end of the pipeline. We plan to achieve this by normalizing incoming data: for example, if we receive a string in a field where we expect an integer, we just fix it in place, convert it to the right type, and log it (we keep logs in Elasticsearch, BTW). By analyzing the logs we can then fix the problems in our code. There could be many situations (especially in the early stages of developing our analytics) where simply adding a new field to self-describing events and contexts would cause whole batches of data to be moved to the bad bucket, which is really not good for us.
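Roughly what we have in mind, as a sketch - `coerceToInt`, the field name and the `normalization_fix` schema are just illustrative names of ours, and in practice we would apply this to the fields of our self-describing events and contexts:

```javascript
// Sketch of the normalization we have in mind, done inside the JavaScript
// enrichment: coerce a value that should have been an integer and record
// what happened in a derived context we can later analyze in Elasticsearch.
function coerceToInt(name, value, fixes) {
  if (typeof value === "string") {
    var parsed = parseInt(value, 10);
    if (!isNaN(parsed)) {
      fixes.push({ field: name, from: value, to: parsed });
      return parsed;
    }
  }
  return value;
}

function process(event) {
  var fixes = [];

  // Example only: a field that should be an integer but arrived as a string.
  // In reality the value would be pulled from our own event/context fields.
  var quantity = coerceToInt("quantity", "42", fixes);

  if (fixes.length === 0) {
    return [];
  }
  // The derived context ends up alongside the event, so we can review the
  // fixes later and correct our tracking code.
  return [{
    schema: "iglu:com.acme/normalization_fix/jsonschema/1-0-0",
    data: { fixes: fixes }
  }];
}
```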