Enricher high CPU utilisation issue

Hi,

I am running stream-enrich-kinesis in production. After running the enricher for only a few minutes, the machine hosting it goes down due to high CPU utilisation by the enricher.

The CPU utilisation goes beyond 100% many times. Also, I am receiving around 200K records but my enricher is able to process only 120K, due to which latency is very high in the Snowplow pipeline.

I am running 8 instances of the enricher with 2 CPUs and 8 GiB memory each, and I have 16 shards in Kinesis.

Below is the config of my enricher.

enrich {

  streams {
    appName = "snowplow-enrich"
    sourceSink {
      enabled = kinesis
      region = <>

      threadPoolSize = 10

      aws {
        accessKey = <>
        secretKey = <>
      }
      initialPosition = LATEST
      backoffPolicy {
        minBackoff = 3000
        maxBackoff = 600000
      }
      maxRecords = 10000
    }

    in {
      raw = "snowplow-good"
    }

    out {
      enriched = "snowplow-enriched-good"
      bad = "snowplow-enriched-bad"
      pii = "snowplow-enriched-pii"
      partitionKey = "user_ipaddress"
    }

    buffer {
      byteLimit = 4500000
      recordLimit = 500
      timeLimit = 3000
    }

  }
}

Kindly help me resolve this issue. All our use cases depend on Snowplow's performance.

Hi @karan,

Welcome to the Snowplow community! Thank you for reporting the issue.

We're currently investigating what the issue could be; we'll get back to you once we've identified the culprit.

Thanks… Please let me know ASAP, as my pipelines depend on Snowplow only.

@karan could you share the logs of Stream Enrich when this happens, please?

@BenB Please find the enricher logs below. We are getting the errors below, but they are not failing the enricher. Also, CPU usage increases abruptly while writing and processing records.

[RecordProcessor-0010] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0010] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 249 out of 500 records

[RecordProcessor-0010] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - 251 records failed with error code ProvisionedThroughputExceededException. Example error message: Rate exceeded for shard shardId-000 in stream snowplow-enriched-good under account 942XXXXX.

[RecordProcessor-0010] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Retrying all failed records in 8161 milliseconds…

[RecordProcessor-0011] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 140 out of 140 records

[RecordProcessor-0011] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0011] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

Thanks for the logs @karan.

This happens when the Kinesis quotas are reached: each shard accepts at most 1 MiB/s or 1,000 records/s for writes. Please check the AWS documentation to learn more about these quotas.

In your configuration we can see partitionKey = "user_ipaddress". Could you please check the throughput of each shard and confirm that the shards are equally balanced when the IP address is used as the partition key?

If you don't need the data partitioned by IP downstream, we would recommend not setting this parameter, so that the data is partitioned evenly by random UUIDs.
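To illustrate why a skewed partition key produces hot shards, here is a rough sketch in Scala (not Kinesis's exact shard-mapping logic, which splits the 128-bit MD5 hash-key space into one contiguous range per shard; the traffic mix below is an assumption for illustration):

import java.math.BigInteger
import java.security.MessageDigest

// Approximate Kinesis's key-to-shard assignment with (MD5 of key) mod 16.
object PartitionSkewSketch extends App {
  val shards = 16

  def shardOf(key: String): Int = {
    val md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"))
    new BigInteger(1, md5).mod(BigInteger.valueOf(shards)).intValue
  }

  // Assumption: half of all events share one IP (e.g. users behind a NAT).
  val keys = Seq.fill(50000)("203.0.113.7") ++
    (1 to 50000).map(i => s"198.51.100.${i % 250}")

  keys.groupBy(shardOf).toSeq.sortBy(_._1).foreach { case (s, ks) =>
    println(f"shard $s%2d: ${ks.size}%6d records")
  }
}

Running this, the shard receiving the NAT'd IP gets roughly half of all records while the others share the rest, which is exactly the imbalance that triggers per-shard throttling.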

The IP is not required; I will remove it and update you on the performance… Thanks

@BenB Thanks again… I have made the changes and removed the partition key. Please find the configuration below:

   out {
      enriched = "snowplow-enriched-good"
      bad = "snowplow-enriched-bad"
      pii = "snowplow-enriched-pii"
      partitionKey = ""
    } 

The above configuration change helped me increase the enricher's processing speed by 50% (it is now processing around 180K+ records, compared to the previous 120K).
But the enricher is still showing high CPU utilisation on the machines. Please find below screenshots of the CPU utilisation by the enricher.

Thanks for your feedback @karan; good to hear that you could increase the speed of Enrich with the same number of machines.

Regarding the high CPU utilization, we suspect the KCL to be the culprit. In this comment we can see someone else whose CPU goes to 100%. Stream Enrich is still using an old version of the library and we plan to bump it to the latest (PR here). We will keep you updated when the new version is released.

Do the CPUs stay at 100% forever, or do they get back to normal after some time?

Initially it behaves normally, but after a few minutes it starts utilising CPU above 100% continuously.

Also, could you let me know what the enricher config should be to process 400K records?

Right now, I have 16 shards (Kinesis behaving normally) and 8 instances with 2 CPUs and 4 GiB memory each.

With 16 shards we can't have 400K events per second; is it 400K events per day?

To determine the optimal number of shards, you need to know the maximum throughput per second; then you can use the formula here.
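As a back-of-the-envelope example in Scala (the record size and consumer count are assumptions; plug in your own measurements):

// AWS formula: shards = max(incoming_write_KiB / 1024, outgoing_read_KiB / 2048),
// combined here with the 1,000 records/s per-shard write limit.
object ShardEstimate extends App {
  val recordsPerSecond = 5000  // assumed peak throughput
  val avgRecordKiB     = 1.0   // assumed average event size
  val consumers        = 1     // applications reading the stream

  val writeKiB = recordsPerSecond * avgRecordKiB
  val readKiB  = writeKiB * consumers

  val byBandwidth = math.max(writeKiB / 1024, readKiB / 2048)
  val byRecords   = recordsPerSecond / 1000.0

  println(s"shards needed: ${math.ceil(math.max(byBandwidth, byRecords)).toInt}")
}

With these assumed numbers this gives about 5 shards; note that the enriched and bad streams need their own estimates, since their volumes differ from the raw stream's.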

Then what we usually do is set up an auto scaling group that decides the number of enricher instances based on CPU usage.

@BenB Sorry for the late reply… Here is a correction to my input: we are getting around 3,000 to 5,000 records per second, as I saw in the Kinesis monitoring graph… Please find below the Get and Put records graph from the Kinesis snowplow-good stream:

Also, we have increased the number of CPUs for the enricher… Below is the configuration:

2 machines of 16 cores each and 5 instances of 2 CPUs each… After increasing the CPUs to 36, we are still seeing over-utilisation of the CPUs (CPU utilised over 100%).

So, any suggestions on the number of CPUs required to balance the enricher? We have tried deploying this solution in K8s with autoscaling; the enricher in K8s started properly, but it began creating new instances every few minutes, which causes problems in the K8s cluster.

@BenB any suggestions on the above?

Hi @karan, sorry for the late reply.

Your K8s configuration might be correct. The problem comes from Stream Enrich, with the CPU going to 100% for no apparent reason, which leads to more instances being created.

We have just released Stream Enrich 1.3.0, which bumps the KCL library and which we hope will fix the issue. Feel free to try it out and see if the problem persists; if it does, we will use profiling to check what is using the CPU in Stream Enrich and identify the culprit.

Thanks for the reply @BenB… I will definitely try the new enricher and let you know the outcome…

@BenB I tried the new enricher; it looks much more stable and consumes less CPU compared to the previous enricher. But when I checked the logs, most of the records are getting dumped into the snowplow-enriched-bad queue, whereas with the previous enricher most of the records were going into snowplow-enriched-good.

Also, the major thing is processing speed: the previous enricher was able to process events very fast compared to the new one (with the previous enricher we were able to process 150K+ records, but with the 1.3.0 enricher it's only around 80-90K, almost half).

New enricher logs (v1.3.0):

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Initializing record processor for shard: shardId-00XXXXXXXX

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 251 out of 251 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 601 records from shardId-000000XXXXXXX

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 249 out of 500 records

[RecordProcessor-0000] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - 251 records failed with error code ProvisionedThroughputExceededException. Example error message: Rate exceeded for shard shardId-00000XXXXX in stream snowplow-enriched-bad under account.

[RecordProcessor-0000] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Retrying all failed records in 8179 milliseconds...

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 445 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 445 out of 445 records

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 249 out of 500 records

[RecordProcessor-0001] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - 251 records failed with error code ProvisionedThroughputExceededException. Example error message: Rate exceeded for shard shardId-00XXXXXXXX in stream snowplow-enriched-bad under account.

[RecordProcessor-0001] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Retrying all failed records in 5919 milliseconds...

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 251 out of 251 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 76 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 76 out of 76 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-0000XXXXXXXX

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 397 records from shardId-0000XXXXXXX

Previous version logs:

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0003] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 46 records from shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 43 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 49 records from shardId-0000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 43 out of 43 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-0000000

[RecordProcessor-0004] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 557 records from shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 24 records from shardId-00000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 26 records from shardId-00000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0003] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 552 records from shardId-0000000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 66 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 66 out of 66 records

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 46 records from shardId-0000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 61 records from shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 26 records from shardId-000000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 30 records from shardId-000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 7 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 7 out of 7 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-0000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 11 records to Kinesis stream snowplow-enriched-good

So, how can I increase the processing speed of the new enricher?

Thanks a lot for the feedback @karan.

It looks much more stable and consumes less CPU compared to the previous enricher.

That's great news. Does this mean that the CPU isn't going to 100% any more?

But when I checked the logs, most of the records are getting dumped into the snowplow-enriched-bad queue, whereas with the previous enricher most of the records were going into snowplow-enriched-good.

Could you please take a look at the errors contained in this bad queue? They will say what the problem is.

with the previous enricher we were able to process 150K+ records, but with the 1.3.0 enricher it's only around 80-90K, almost half

This might be related to all these bad rows that you have. In the new logs we can see Rate exceeded for shard shardId-00XXXXXXXX in stream snowplow-enriched-bad. The number of shards for the bad queue is not big enough for the number of bad rows that you have, so Stream Enrich is being throttled, and while it waits it's not processing new data, hence the drop in throughput. You could increase the number of shards for the bad queue, but we still need to see what these bad rows contain and fix the root issue.

@BenB: No, it's again consuming 100%+ CPU. I removed the SQL enrichment that was causing records to fail and checked the usage; it's between 100% and 200% CPU utilization.

Below is the error for which the SQL enrichment is failing… The same messages pass through the old enricher.

{"enrichment":{"schemaKey":"iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0","identifier":"sql-query"},"message":{"error":"The placeholder map error. The map: Some(IntMap()), where count is: 1"}}]}

To be precise, every message is failing the SQL enrichment.

Below is my SQL enrichment config:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0",
  "data": {
    "name": "sql_query_enrichment_config",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "inputs": [
        {
          "placeholder": 1,
          "json": {
            "field": "contexts",
            "schemaCriterion": "iglu:com.snowplowanalytics.iglu/anything-a/jsonschema/1-*-*",
            "jsonPath": "$.question_id"
          }
        }
      ],
      "database": {
        "mysql": {
          "host": "host",
          "port": port,
          "sslMode": false,
          "username": "user",
          "password": "password",
          "database": "database"
        }
      },
      "query": {
        "sql": "select field1,field2... fieldn from questions_meta a inner join  questions b on a.question_id = b.question_id AND a.question_id = ?"
      },
      "output": {
        "expectedRows": "AT_MOST_ONE",
        "json": {
          "schema": "iglu:com.snowplowanalytics.iglu/anything-b/jsonschema/1-0-0",
          "describes": "EVERY_ROW",
          "propertyNames": "AS_IS"
        }
      },
      "cache": {
        "size": 3000,
        "ttl": 60
      }
    }
  }
}

Hi @karan!

Regarding the SQL enrichment failing, we identified an issue where the enrichment was failing if one of the placeholders (question_id in your case) was null. This has been fixed in Stream Enrich 1.3.1.
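For context, "placeholder": 1 in your config maps to the first positional parameter of the SQL query. Conceptually the enrichment does something like this simplified Scala sketch (not the enrichment's actual code; it assumes a MySQL JDBC driver on the classpath and uses dummy connection details):

import java.sql.DriverManager

object SqlEnrichmentSketch extends App {
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/database", "user", "password")
  val stmt = conn.prepareStatement(
    "select field1 from questions_meta a inner join questions b " +
      "on a.question_id = b.question_id and a.question_id = ?")

  // Value extracted from the event's contexts via jsonPath $.question_id.
  val questionId: Option[Int] = Some(42)

  questionId match {
    case Some(id) =>
      stmt.setInt(1, id)            // fill placeholder 1
      val rs = stmt.executeQuery()  // expectedRows: AT_MOST_ONE
      if (rs.next()) println(rs.getString("field1"))
    case None =>
      // When no context provides a value, the placeholder map stays empty:
      // "The map: Some(IntMap()), where count is: 1" (fixed in 1.3.1).
      println("placeholder 1 unfilled")
  }
  conn.close()
}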

We managed to reproduce the CPU issue and used profiling on Stream Enrich to investigate. A race condition in scala-lru-map, used by the IP lookups enrichment, was causing a memory leak: memory usage kept growing until it stayed close to 100% all the time, triggering a lot of garbage collection and using all the CPU for that. This explains why Stream Enrich was not processing events yet not failing. This issue was solved in 1.3.0.
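For the curious, here is a minimal Scala sketch (not scala-lru-map's actual code) of how an unsynchronized LRU cache can leak under concurrent access; since none of the structures are thread-safe, a run may also throw or corrupt instead of just leaking:

import scala.collection.mutable

object LruRaceSketch extends App {
  val maxSize = 100
  val map     = mutable.Map.empty[Int, String]
  val order   = mutable.Queue.empty[Int] // eviction order, must mirror `map`

  def put(k: Int, v: String): Unit = {
    if (map.size >= maxSize && order.nonEmpty) {
      map.remove(order.dequeue()) // check-then-act: racy without a lock
    }
    map(k) = v
    order.enqueue(k)
  }

  val threads = (1 to 4).map { t =>
    new Thread(() => (0 until 100000).foreach(i => put(t * 1000000 + i, "x")))
  }
  threads.foreach(_.start())
  threads.foreach(_.join())

  // Interleavings let `map` and `order` drift apart, so eviction stops
  // keeping up and the map can grow well past its bound, mirroring the
  // ever-growing heap (and resulting GC load) seen in Stream Enrich.
  println(s"map size = ${map.size} (bound should be $maxSize)")
}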

It's not clear to me what 200% CPU usage means. Would you have a way to provide us with a graph showing the CPU usage of each machine running Stream Enrich over time, like this one (in CloudWatch metrics)?

Also, which command do you use to start Stream Enrich?