Earlier this week Beam Enrich started throwing errors out of the blue. Before 2020-03-10 09:00 AEDT, our Dataflow job was successfully pushing events onto our “Enriched-Good” topic, which were then being loaded just fine into our BQ dataset.
But after 2020-03-10 09:00 AEDT the pipeline stopped working. When the job kicks off, only the first ~5 seconds of enriched events show up in the enriched-good subscription. After that, the error below fires every ~10 seconds and no more events land in enriched-good.
Our setup is:
- beam-enrich-1.0.0 (us-central1-a, n1-standard-4)
- snowplow-bigquery-loader-0.2.0 (us-central1-a, n1-standard-4)
- BigQuery located in australia-southeast1
We haven’t made any changes recently that would have impacted this, though Google did recently announce changes to Pub/Sub cross-region access.
And the error being thrown is:
java.lang.NullPointerException
at io.circe.Printer$PrintingFolder.onString(Printer.scala:278)
at io.circe.Printer$PrintingFolder.onString(Printer.scala:262)
at io.circe.Json$JString.foldWith(Json.scala:302)
at io.circe.JsonObject$MapAndVectorJsonObject.appendToFolder(JsonObject.scala:433)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.Printer.pretty(Printer.scala:194)
at io.circe.Json.noSpaces(Json.scala:116)
at com.snowplowanalytics.snowplow.badrows.BadRow.compact(BadRow.scala:32)
at com.snowplowanalytics.snowplow.badrows.BadRow.compact$(BadRow.scala:32)
at com.snowplowanalytics.snowplow.badrows.BadRow$EnrichmentFailures.compact(BadRow.scala:140)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$3(Enrich.scala:270)
at cats.data.NonEmptyList.map(NonEmptyList.scala:76)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$2(Enrich.scala:270)
at cats.data.Validated.leftMap(Validated.scala:213)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$1(Enrich.scala:270)
at scala.collection.immutable.List.map(List.scala:286)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.enrich(Enrich.scala:269)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrichEvents$2(Enrich.scala:188)
at com.snowplowanalytics.snowplow.enrich.beam.utils$.timeMs(utils.scala:153)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrichEvents$1(Enrich.scala:185)
at com.spotify.scio.util.Functions$$anon$6.processElement(Functions.scala:220)
Having trouble figuring out what’s happening here. It was rock solid up until the other day, and now a backlog is forming with the shadow of data loss hanging over us. Though we’re not yet dependent on our RT Pipeline, we want to make sure we have a few months of solid data before transitioning from Batch.
@anton - thanks. I managed to work around this: rolling back to 0.4.1 solved it. Backlog clearing …
Both beam-enrich-1.0.0 and beam-enrich-1.1.0 fail in the same way. I noticed the new badrows functionality was added in beam-enrich-1.x.x but I’m not yet using that. It’s strange because it was working fine until the other day.
The fact that it was working before tells me that it probably stumbles upon some very unusual bad rows that our JSON library then struggles to encode. On the other hand, I still don’t see anything there that could contain null, so it must be something related to Beam Enrich passing nulls around.
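To illustrate what I mean, here’s a minimal sketch (the field name is made up, and it assumes circe is on the classpath, as it is in Beam Enrich): if a Json value ends up wrapping a raw null String rather than an explicit Json.Null, serialising it blows up in Printer.onString exactly like the trace above.

```scala
import io.circe.Json

object NullStringRepro extends App {
  // Hypothetical payload: a JSON string node wrapping a null Scala String
  // (as opposed to an explicit Json.Null).
  val suspect: Json = Json.obj(
    "someParameter" -> Json.fromString(null: String)
  )

  // Serialising dereferences the underlying String and throws
  // java.lang.NullPointerException in io.circe.Printer$PrintingFolder.onString.
  println(suspect.noSpaces)
}
```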
Anyway, we’ll investigate this. Any other information you can provide is highly appreciated.
I’m afraid this bad row can’t be found anywhere, as it’s created inside beam-enrich but never emitted. You also can’t find it in your batch bad rows, since batch doesn’t use the new bad row format and the problem is related to the encoding of these new bad rows.
The stack trace shows that the problem occurs for an EnrichmentFailures bad row, so maybe one of your enrichments (API, SQL, …) was working well before and has stopped working, and for some reason the associated bad row can’t be encoded with circe.
I wonder if it could be that one of the bad row fields is set to null where a value is expected. We’ll continue to investigate and will let you know as soon as we find something.
0.4.1 and 1.0.0 should emit the same bad rows, so in theory the problematic bad rows with the new format in 1.0.0 should also be emitted in 0.4.1 with the old format.
Could you tell us if you have bad rows with 0.4.1, and if so, would it be possible to get a sample please?
The only change in our enrichments was to upgrade referer_parser to support the changes in 1.x.x (and we rolled it back for 0.4.1). Both schemas validate, and the 2-0-0 referer_parser schema had been working successfully for a few weeks, so what we have here may be a config issue.
@BenB we always have at least some bad rows. Happy to share some of them if it helps.
Are there specific errors I can pull for you? I suspect it’s only a handful of bad rows across a few weeks of data. I can pull them from S3 for you.
@robkingston if you could send us some bad rows related to enrichment problems it would be great. The bad rows that were failing in 1.0.0 should also be emitted by 0.4.1, just with a different format. It would help us to see what the 1.0.0 equivalent is and what could be wrong. Thanks a lot!
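To save you some digging, here’s a rough sketch of the kind of filter we’d run over a local dump of the 0.4.1 / batch bad rows. It assumes the legacy newline-delimited format with an errors array of {level, message} objects, and the file name is just a placeholder:

```scala
import scala.io.Source

import io.circe.Json
import io.circe.parser.parse

object FilterEnrichmentBadRows extends App {
  // Placeholder path: a local dump of legacy-format bad rows, one JSON object per line.
  val lines = Source.fromFile("bad-rows-sample.ndjson").getLines()

  // Keep only the rows whose error messages look enrichment-related.
  val enrichmentRelated = lines.filter { line =>
    parse(line).toOption.exists { json =>
      json.hcursor
        .downField("errors")
        .as[List[Json]]
        .getOrElse(Nil)
        .exists(_.hcursor.get[String]("message").exists(_.toLowerCase.contains("enrich")))
    }
  }

  enrichmentRelated.take(10).foreach(println)
}
```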
We might have found the origin of the problem: it would occur when one of the query string parameter values is null and something goes wrong in the enrichments. We created the issue on GitHub. Thanks a lot for reporting.
We’ll let you know when a new version with the fix is out. Meanwhile, if you could still send us some bad rows from 0.4.1 related to enrichment problems, that would be really great, so that we can confirm our hypothesis.
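To give an idea of the kind of guard we’re looking at (just a sketch of the idea, not the actual patch; the parameter names are made up), the point is to map a null query string value to Json.Null instead of wrapping it as a string:

```scala
import io.circe.Json

object QuerystringGuard extends App {
  // Hypothetical raw query string parameters, with one value null,
  // as they might come out of URL parsing.
  val params: List[(String, String)] = List(
    "utm_source" -> "newsletter",
    "utm_term"   -> null
  )

  // Unsafe: Json.fromString(null) produces a value that NPEs at serialisation time.
  // Safe: wrap the value in Option so a null becomes Json.Null instead.
  val payload: Json = Json.obj(
    params.map { case (k, v) =>
      k -> Option(v).map(Json.fromString).getOrElse(Json.Null)
    }: _*
  )

  println(payload.noSpaces) // {"utm_source":"newsletter","utm_term":null}
}
```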
I’m experiencing the exact same issue and I’m happy to learn you seem to have found the origin.
I would like to avoid reverting to 0.4.1, as it uses quite an old SDK that doesn’t support the Dataflow Streaming Engine. Is there any news on when to expect a release containing a fix?