I was trying to set up an enrichment pipeline. I downloaded GeoLite2-City.mmdb from MaxMind and put it in a self-hosted S3 bucket that is open to the public. Then I updated the ip_lookups.json to
The enricher app is run on an EC2 and I can
wget in the EC2 from
However when I ran the pipeline I only got null values for all
I then checked whether the user_ipaddress could be queried in the GeoLite2-City.mmdb I got from MaxMind using mmdblookup, and I got the correct geolocations.
enricher.config.hocon and resolver.json both look good to me, and the enricher app doesn’t complain about anything at INFO log level. Other enrichments, e.g. ua_parser, run properly and return correct values in
Enricher app is Version 1.4.0.
Anyone encountered similar behaviour?
We haven’t made any changes to the IP Lookups enrichment in 1.4.0, apart from bumping the MaxMind client, but it would still be worth checking it with 1.3.2. You can use the same configuration.
I checked a pipeline running the IP Lookups enrichment with 1.4.0 and it populates the geo fields just fine. The only difference I see so far is that it uses an S3 URI in its config, not HTTP.
Another thing worth checking is whether you restarted Enrich after you updated the config. Enrich won’t pick up configuration changes automatically; it reads the configuration only once, at initialization.
@anton I found out why… The first time I ran it, I pointed the enricher to the wrong mmdb. Unbeknownst to me, the enricher downloaded the file and cached it for subsequent re-initialisations of the app. I had to manually remove the file before rerunning the app to get the IP enrichment working.
A couple of suggestions:
- can we change the cache behaviour so the app overwrites ip_geo every time it gets initialised?
- can we add validation of the mmdb format and content for the file that actually gets used, instead of just validating the JSON schema?
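
To illustrate the second suggestion, here is a minimal Python sketch of a sanity check that could run on the downloaded file before it is used. It relies on the MaxMind DB format, in which a valid database carries a metadata marker (the bytes `\xab\xcd\xef` followed by `MaxMind.com`) within the last 128 KiB of the file. The function name `looks_like_mmdb` is hypothetical, and this is not how Enrich validates the file today.

```python
import os

# Byte sequence that precedes the metadata section of a MaxMind DB file.
METADATA_MARKER = b"\xab\xcd\xefMaxMind.com"
# The format caps the metadata section at 128 KiB, so the marker must
# appear within that distance of the end of the file.
METADATA_MAX_SIZE = 128 * 1024


def looks_like_mmdb(path):
    """Return True if the file at `path` contains the MaxMind DB metadata marker.

    This is only a cheap format check; it does not prove the database
    content is correct or that a given IP can be resolved.
    """
    window = METADATA_MAX_SIZE + len(METADATA_MARKER)
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)   # seek() returns the new position
        f.seek(max(0, size - window))   # read only the tail of the file
        return METADATA_MARKER in f.read()
```

A check like this would reject an empty file or an HTML error page saved under a .mmdb name, while a lookup with mmdblookup remains the way to confirm the content itself.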
Thanks for confirming it’s not a bug @jmak123!
As for your suggestions:
- Our new FS2 Enrich actually does something very similar already. Your Stream Enrich downloaded the asset locally and then just re-used it. FS2 Enrich, by contrast, uses the downloaded asset for some time, then downloads a new copy in the background, and if that new copy has changed (its MD5 hash is different), it reinitializes the enrichment. The interval for this refresh is configurable, but unfortunately FS2 Enrich supports only GCS for now. We’ll make it work with AWS in the future.
- On your second point, I’m actually surprised it didn’t throw an exception. This should probably be raised in the upstream lib https://github.com/snowplow/scala-maxmind-iplookups/. May I ask what was in that file? Was it just empty or corrupted?
Also, Stream Enrich has --force-cached-files-download, which re-downloads the asset once.
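
The refresh-on-change idea described above for FS2 Enrich can be sketched roughly as follows in Python. This only illustrates comparing the cached asset with a fresh download by MD5; the function names `file_md5` and `refresh_if_changed` are hypothetical, and the actual Enrich implementation may differ.

```python
import hashlib
import os
import shutil


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_if_changed(cached_path, fresh_path):
    """Replace the cached asset with the fresh download if the content differs.

    Returns True when the cache was replaced (or did not exist yet), which
    is the point at which a caller would reinitialize the enrichment.
    """
    if os.path.exists(cached_path) and file_md5(cached_path) == file_md5(fresh_path):
        return False  # same content, keep using the cached asset
    shutil.copyfile(fresh_path, cached_path)
    return True
```

With a scheme like this, a wrong file cached on the first run would be replaced on the next refresh once the configured source serves different content, without removing the cached file by hand.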
I started with a URL for the legacy S3 bucket hosted by Snowplow, which basically returns an HTML page that says access denied. Snowplow seemed to run just fine with a corrupted file like that.