Enrich 3.0.0 released

We are very excited to release enrich 3.0.0.

Assets

This release covers three assets:

  1. enrich-kinesis: this is the new enrich asset for AWS, which aims to replace Stream Enrich.
  2. enrich-pubsub: this is now the only enrich asset maintained for GCP.
  3. Stream Enrich: this asset for AWS remains supported until the transition to enrich-kinesis is complete and a new enrich-kafka asset is ready. In this release it only received library version bumps.

As previously announced in this post, Beam Enrich is now deprecated in favor of enrich-pubsub.

enrich-kinesis

This new enrich asset for Kinesis is based on fs2 and shares most of its codebase with enrich-pubsub.

Compared with Stream Enrich, this app brings several improvements:

  • It can export metrics. More details can be found on this page.
  • Assets used in the enrichments (e.g. MaxMind DB) can be periodically refreshed while enrich is running with this config parameter:
"assetsUpdatePeriod": "7 days"
  • It uses Kinesis Client Library (KCL) 2.x.
  • It provides the pipeline operator with more possibilities for fine-tuning.
  • It is now possible to use Kinesis aggregation, which packs several user records (e.g. enriched events) into one Kinesis record. This can improve throughput and/or reduce the number of shards needed (in particular if records are bigger than 1 KB). More information about aggregation can be found here. It can be activated with the following section in the configuration (e.g. for enriched events):
"output": {
  "good": {
    "aggregation": {
      "maxCount": 1000
      "maxSize": 51200
    }
  }
}
  • It is possible to run the app with a very minimal configuration file, such as:
{
  "input": {
    "streamName": "collector-payloads"
  }

  "output": {
    "good": {
      "streamName": "enriched"
    }

    "bad": {
      "streamName": "bad"
    }
  }
}
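
To illustrate how two limits such as maxCount and maxSize (from the aggregation settings shown earlier) can bound what goes into one aggregated record, here is a minimal JavaScript sketch of the grouping step only. It is not the enrich-kinesis implementation: groupRecords is a hypothetical helper, the real aggregated record format is not reproduced, and treating maxSize as a limit in bytes is an assumption.

```javascript
// Hypothetical helper: groups user records into batches bounded by a maximum
// record count and a maximum total size (assumed here to be in bytes).
function groupRecords(records, maxCount, maxSize) {
  const batches = [];
  let current = [];
  let currentSize = 0;

  for (const record of records) {
    const size = Buffer.byteLength(record, "utf8");
    const wouldOverflow =
      current.length + 1 > maxCount || currentSize + size > maxSize;

    // Close the current batch before adding a record that would exceed a limit.
    if (current.length > 0 && wouldOverflow) {
      batches.push(current);
      current = [];
      currentSize = 0;
    }
    current.push(record);
    currentSize += size;
  }

  // Keep the last, partially filled batch.
  if (current.length > 0) batches.push(current);
  return batches;
}
```

In this sketch a record that is larger than maxSize on its own still ends up in a batch of one, so nothing is dropped.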

Instructions to run enrich-kinesis can be found on this page and details about its configuration on this page.

enrich-pubsub

More parameters have been exposed in the config file to give finer-grained control over the app.

All the details about its configuration can be found on this page.

JavaScript enrichment: ECMAScript 6 features (#508)

Users of the JavaScript enrichment will be pleased to hear that, starting from this version, most ECMAScript 6 features are supported. For example, the arrow => syntax and the const keyword are now available. This change is fully backward-compatible and existing configs will keep working.

More details on the JavaScript enrichment can be found on this page.
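
As a rough illustration, an enrichment script using the two ES6 features mentioned above might look like the sketch below. It assumes the enrichment calls a function named process with the enriched event and expects an array of contexts back, and that the event exposes a getApp_id() getter. The schema URI and the toContext helper are hypothetical.

```javascript
// Hypothetical schema URI, used only for illustration.
const SCHEMA = "iglu:com.acme/app_info/jsonschema/1-0-0";

// Arrow function (ES6): builds one context from an application id.
const toContext = (appId) => ({ schema: SCHEMA, data: { appId: appId } });

// Assumed entry point: receives the enriched event and returns the contexts
// to attach (an empty array attaches nothing).
function process(event) {
  const appId = event.getApp_id();
  return appId ? [toContext(appId)] : [];
}
```

The same script written with var and function expressions would behave identically, which is consistent with the change being backward-compatible.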

Enriched events validation in enrich-kinesis and enrich-pubsub (#517)

Enriched events emitted by enrich are expected to match the atomic schema. If an event is not valid against this schema (for instance because a field is too long), a bad row should be emitted instead of the enriched event. To further improve data quality inside the pipeline, enrich 3.0.0 introduces this additional check.

However, we are aware that this is a breaking change, and we want to give users time to adapt in case they currently work downstream with enriched events that are not valid against the atomic schema. For this reason, the new validation was added as a feature that can be deactivated as follows:

"featureFlags": {
  "acceptInvalid": true
}

In this case, enriched events that are not valid against the atomic schema will still be emitted as before, so that enrich 3.0.0 can be fully backward compatible. There are two ways to find out whether the new validation would have had an impact:

  1. A new metric invalid_enriched has been introduced. It reports the number of enriched events that were not valid against the atomic schema. Like the other metrics, it can be seen on stdout and/or StatsD.
  2. Each time an enriched event is invalid against the atomic schema, a line is logged with the bad row that would normally have been emitted instead of the enriched event (add -Dorg.slf4j.simpleLogger.log.InvalidEnriched=debug to the JAVA_OPTS to see it).

In a few months, we'll remove the feature flag and it will become impossible to emit invalid enriched events.
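
The behaviour described in this section can be summarised with a small JavaScript sketch. It is a simplified model, not the code used by enrich: route, state, and the shape of the bad row are hypothetical, and updating the metric and the debug log only when acceptInvalid is true is an assumption drawn from the description above.

```javascript
// Hypothetical sketch: decides what to emit for one enriched event.
// `acceptInvalid` mirrors the feature flag; every other name is illustrative.
// `validationError` is null when the event is valid against the atomic schema.
function route(enriched, validationError, acceptInvalid, state) {
  if (validationError === null) {
    return { stream: "good", payload: enriched };
  }

  const badRow = { failure: validationError, payload: enriched };

  if (acceptInvalid) {
    // Assumption: the metric and the debug log are only produced in this branch.
    state.invalidEnriched += 1;
    state.debugLog.push(badRow);
    return { stream: "good", payload: enriched };
  }

  // Default behaviour in 3.0.0: the bad row replaces the enriched event.
  return { stream: "bad", payload: badRow };
}
```

With the flag enabled, the output streams stay the same as before the upgrade, and the counter and log show how many events would have become bad rows.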

Metrics for enrich-kinesis and enrich-pubsub (#494)

There were two issues with the metrics periodically sent by enrich-pubsub:

  1. The counts of collector payloads, enriched events and bad rows were ever-increasing and not reset to 0 after sending the metrics.
  2. These counts were sent to StatsD with this format: snowplow.enrich.good:1234|g|#key1:value1 where g means gauge, whereas it should be c for counter.

This has been fixed. On top of that, it is now possible to see the metrics directly in the logs of the app, with this section in the config file:

"monitoring": {
  "metrics": {
    "stdout": {
      "period": "1 minute"
      "prefix": "snowplow.enrich."
    }
  }
}

Because enrich-pubsub and enrich-kinesis share most of the code, all of the above is also true for the latter.
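
The two fixes described above (counts reset after each report, and the c type instead of g) can be sketched in JavaScript as follows. This is an illustrative model only: the Counter class and its methods are hypothetical and are not taken from the enrich codebase. The output line follows the format quoted earlier in this section.

```javascript
// Hypothetical counter that reports its value in StatsD format and then
// resets, so each report only covers the period since the previous one.
class Counter {
  constructor(name) {
    this.name = name;
    this.count = 0;
  }

  increment(n = 1) {
    this.count += n;
  }

  // Formats the current count with the "c" (counter) type, then resets it to 0.
  flush(prefix, tags = {}) {
    const tagStr = Object.entries(tags)
      .map(([key, value]) => `${key}:${value}`)
      .join(",");
    const line =
      `${prefix}${this.name}:${this.count}|c` + (tagStr ? `|#${tagStr}` : "");
    this.count = 0;
    return line;
  }
}
```

For example, after `good.increment(1234)` on a counter named good, calling `good.flush("snowplow.enrich.", { key1: "value1" })` yields a line of the same shape as the one above but with c in place of g, and the next flush starts again from 0.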

More information about metrics can be found on this page.

YAUAA context 1-0-3 (#515)

The context attached by YAUAA enrichment has been updated to 1-0-3.

Compared to 1-0-2, this version allows a longer agentVersionMajor string field. This addresses a problem in which some user agents exceeded the old maximum length, resulting in a failed event.

Telemetry in enrich-kinesis and enrich-pubsub (#487)

enrich-kinesis and enrich-pubsub introduce telemetry, which regularly sends heartbeats with some meta-information about the application (schema here). To improve the product, we need to understand what is popular so that we can focus our development effort in the right place.

At a minimum, telemetry sends the application name and version every hour. It would be helpful for us if users could provide userProvidedId in the config file:

"telemetry": {
  "userProvidedId": "myCompany"
}

Telemetry can be deactivated by putting the following section in the configuration file:

"telemetry": {
  "disable": true
}