Enrichments: how to enable them in the quickstart examples?

Hello,

I’m writing this topic after several attempts to find information about how to properly enable an enrichment. If someone has already created a similar post, my apologies; I’d appreciate a link, as I haven’t been able to find the right answer myself so far.

I’m running the quickstart project on AWS and almost everything is working fine; I can collect and query my events.

Then I started to explore more and decided to test the ip_lookups enrichment. Following the docs, I found the basics about what enrichments are and which ones are available, but when I started to follow the guidance to set up the IP Lookups enrichment I ran into problems at step 3:

3. Configure the enrichment for your pipeline

Snowplow BDP customers can enable the IP Lookup enrichment for your pipeline in the Snowplow console. Open Source will need to upload the enrichment json for use in their Snowplow pipeline.

I understand they made a console for BDP, but for open source we need to upload the enrichment file (I already have the JSON file pointing to the URL of my self-hosted MaxMind file, following the example), but I can’t find where that file needs to be uploaded. It’s vague to me: I can’t find the place in my pipeline, and in my Iglu Server I’ve already pushed the schema files, right? So the only thing missing is uploading this file? Where? How? Is there a guide I’m missing?

Maybe I’m missing some part of the docs, or I haven’t understood how to upload that file using igluctl.

Also, I noticed that the link at the very bottom, in the Output section, may be wrong; it appears to have been copied from another enrichment:

Output

This enrichment adds a new context to the enriched event with this schema.

Thank you so much! :slight_smile:
PR

Hey guys, I found more information:

I missed that part (my bad) on my quickstart journey:

And also, this:

I edited my Terraform file, added the enrichment part for ip_lookups, and applied the changes, but now I’ve stopped receiving events at all. Does applying the changes affect something I need to restart or adjust? I can see the tracker posting, but nothing appears in my PostgreSQL anymore. Any help appreciated.

It’s likely that something went wrong with the enrichment, or the enrichment has caused the data to fail validation. Either way, the data would land in failed events - if you check that, there should be error messages to help debug.

I edited my Terraform file, added the enrichment part for ip_lookups, and applied the changes, but now I’ve stopped receiving events at all.

That worked for us! Maybe post what you entered as enrichment json?

Hey guys, thanks for the replies!

Here is the code I’m running in my Terraform, following the guide posted above.

locals {
  enrichment_ip_lookups = jsonencode(<<EOF
{
    "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
    "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
            "geo": {
                "database": "GeoLite2-City.mmdb",
                "uri": "s3://path-to-my-s3-assets/third-party/maxmind"
            }
        }
    }
}
EOF
  )
}

and:

  # Enable this enrichment
  enrichment_ip_lookups = local.enrichment_ip_lookups
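For context, that argument sits inside the enrich module block in my main.tf; roughly like this (the module label and source shown here are illustrative, not necessarily what your quickstart copy uses):

module "enrich_kinesis" {
  source = "snowplow-devops/enrich-kinesis-ec2/aws"

  # ... the existing collector / stream / Iglu settings from the quickstart ...

  # Enable this enrichment
  enrichment_ip_lookups = local.enrichment_ip_lookups
}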

After that I applied the changes with terraform apply. Things updated and then I stopped receiving events.

Does the bucket or the mmdb file need some special permission? It’s under my assets bucket, which is private.

I’ll look now at the failed events stream / bad buckets to check if I can find more details.

Hey guys, just to let you and others know how I solved my issues (and to get feedback if I did something outside the usual patterns).

I didn’t find any info inside the bad buckets and tables, and I also didn’t see anything wrong inside the streams, so I had no way to check what was happening; the events simply didn’t appear in the table.

I tried uploading my MaxMind file to a new bucket, changed the permissions to make it publicly accessible, and the events started to be enriched successfully. I’m not sure if this is correct, but while my file was inside a private bucket it didn’t work. Maybe some misconfiguration on my side?

The docs only say to upload the file to a private bucket, and as far as I know the permissions from my Terraform configs should be enough for the enrichment server to access the file. No? Maybe the only missing part is setting the database file to be public?

Thank you! And please, if my solution is wrong, I’d appreciate some guidance to make it better.

You shouldn’t need to have this in a public bucket.

I’d double check:

  • simulating the role / policy that you have set up on the instance that fetches the asset, to ensure the permissions are correct (rough policy sketch after this list)
  • checking whether any errors or warnings are raised on the instance at initialisation if it cannot fetch from S3
  • failing that, looking at the CloudTrail logs (for S3 specifically) to see whether you can spot the API call that is likely failing
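If you want to keep the bucket private, the role attached to the enrich instance needs read access to it one way or another. A rough Terraform sketch of the kind of statement that has to end up in that role's policy (the bucket ARN and role reference here are placeholders, not the quickstart's actual resource names):

data "aws_iam_policy_document" "maxmind_assets_read" {
  # Allow reading the MaxMind database object(s)
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::path-to-my-s3-assets/third-party/maxmind/*"]
  }

  # Allow listing the assets bucket itself
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::path-to-my-s3-assets"]
  }
}

resource "aws_iam_role_policy" "maxmind_assets_read" {
  name   = "maxmind-assets-read"
  role   = "<enrich-instance-role-name>" # placeholder: the IAM role the enrich EC2 instance uses
  policy = data.aws_iam_policy_document.maxmind_assets_read.json
}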

Hi @prss, you likely need to use this option:

This allows you to add the bucket you are hosting the databases in to the IAM policy for the role. Hope this helps!
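In the quickstart that boils down to one extra argument on the enrich module; a sketch, assuming the input name from the registry docs (double-check the exact spelling against your module version):

  # Adds the bucket hosting the MaxMind database to the enrich instance's IAM policy
  custom_s3_hosted_assets_bucket_name = "path-to-my-s3-assets"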

Thanks @josh and @mike. I’ve tested using the option input_custom_s3_hosted_assets_bucket_name and it worked successfully :slight_smile:

many kudos!


Glad to hear it! If you have time, we would welcome a PR into the README that makes using that setting more obvious than it was for you, to save future users the same pain.


Hey josh, how are you? Can you point me to the correct URL of the repo where I can find that README?

I’ll be glad to contribute for sure.

Also, something happened here. While I was evaluating the pipeline, I installed a tag on our website product and configured some enrichments.

This weekend, the storage on our RDS database reached 100% usage; of course, I hadn’t changed the 10 GB from the quickstart tutorials, and it ended that way. I managed to add some storage manually before reading about the storage auto-scaling option. I will enable it later.
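(For reference, I believe the underlying AWS/Terraform knob is max_allocated_storage on the RDS instance; I’m not sure yet whether the quickstart exposes it directly, so this is just a sketch of the resource-level setting, with a placeholder resource name:)

resource "aws_db_instance" "pipeline_db" {
  # ... existing engine / instance_class / credential settings ...
  allocated_storage     = 10   # the quickstart default that filled up
  max_allocated_storage = 100  # lets RDS grow the storage automatically, up to 100 GB
}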

But I noticed that after adding more space, I started to receive the old events (from the weekend) very slowly, and the real-time ones didn’t come through. So I’m trying to understand how to get current events flowing normally again, whether I can recover the batch of events collected during the weekend, and why they seem to come through so slowly.

Can you clarify how I can get things working again? Is there any adjustment I need to make to Kinesis or the loader instances?

Thank you a lot!

Hi guys, good night.

Just updating here with how I solved my problem with the events that were stored but not loaded while my DB storage was full.

I tried to trace the problem and found that the EC2 instance responsible for loading the events was not using any resources (CPU etc.), so I figured that at some point, while trying to load events into my DB (which was full), it had stopped the routine. Then I restarted the service to check.

After restarting, the events started to load, but very slowly; of course, the reason was that this instance was too modest for the job of loading an accumulated two days of events. So I changed the instance_type in my Terraform config and deployed a t3.medium to do the job. After the deployment things started moving faster, and the next problem was RDS, which also needed a beefier configuration to receive the volume of events.
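Roughly what I changed in the Terraform config (the exact variable name in your copy of the quickstart may differ, so treat this as a placeholder):

  # temporary bump for the loader instance while it worked through the backlog; reverted afterwards
  instance_type = "t3.medium"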

After that, I changed the instance types back, and things are now normal.

I don’t know if this is the correct approach, but I didn’t find any guidelines to follow regarding pipeline failures: what to do and how to recover properly. If such a guide exists, please send me the link; I’d appreciate it.

Hope my solution helps someone, and if there’s a better way to handle this, please share it here for us :slight_smile:

Thanks!

Hey @prss, the URL is the same as the one with the setting I pointed you to!

How you resolved the issue is indeed pretty close to what you should do. One of the limitations of the quick-start is that auto-scaling is not built into the solution (generally in these cases you would want alerting + auto-scaling rules to handle this dynamically). Manually scaling up is, however, a totally valid way to get things working again.
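For the alerting side, even something as small as a CloudWatch alarm on the RDS free-storage metric would have caught this early. A minimal sketch (the DB identifier and SNS topic are placeholders, not the quickstart's actual names):

resource "aws_cloudwatch_metric_alarm" "rds_free_storage_low" {
  alarm_name          = "pipeline-rds-free-storage-low"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = 1000000000 # alarm when roughly 1 GB of storage is left (metric is in bytes)

  dimensions = {
    DBInstanceIdentifier = "<your-quickstart-rds-identifier>" # placeholder
  }

  alarm_actions = ["<your-sns-topic-arn>"] # placeholder: where the alert should be sent
}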

Hey @josh, thanks for the feedback! My next step will be implementing a health status dashboard for monitoring and alerts, and also the auto-scaling configuration! Thank you!

Also, about the correction I referred to earlier: it was about this doc, ip-lookup-enrichment, where the “Output” section of the article links the reference schema of the full example guide (about IP lookups) to the schema for spider-and-bots. I don’t know if and where I can adjust that and open a PR; maybe this part doesn’t allow changes via GitHub.

Thanks again! Have a great day.