Earlier this week Beam Enrich started throwing errors out of the blue. Before 2020-03-10 09:00 AEDT, our Dataflow job was successfully pushing events onto our “Enriched-Good” topic, which were then being loaded just fine into our BQ dataset.
But after 2020-03-10 09:00 AEDT the pipeline stopped working. When the job kicks off, only the first ~5 seconds of enriched events show up in the enriched-good subscription. After that, the error below fires every ~10 seconds and no more events land in enriched-good.
Our setup is:
- beam-enrich-1.0.0 (us-central1-a, n1-standard-4)
- snowplow-bigquery-loader-0.2.0 (us-central1-a, n1-standard-4)
- BigQuery located in australia-southeast1
We haven’t made any changes recently that would have impacted this, though Google did recently announce changes to Pub/Sub cross-region access.
And the error being thrown is:
java.lang.NullPointerException
at io.circe.Printer$PrintingFolder.onString(Printer.scala:278)
at io.circe.Printer$PrintingFolder.onString(Printer.scala:262)
at io.circe.Json$JString.foldWith(Json.scala:302)
at io.circe.JsonObject$MapAndVectorJsonObject.appendToFolder(JsonObject.scala:433)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
at io.circe.Json$JObject.foldWith(Json.scala:364)
at io.circe.Printer.pretty(Printer.scala:194)
at io.circe.Json.noSpaces(Json.scala:116)
at com.snowplowanalytics.snowplow.badrows.BadRow.compact(BadRow.scala:32)
at com.snowplowanalytics.snowplow.badrows.BadRow.compact$(BadRow.scala:32)
at com.snowplowanalytics.snowplow.badrows.BadRow$EnrichmentFailures.compact(BadRow.scala:140)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$3(Enrich.scala:270)
at cats.data.NonEmptyList.map(NonEmptyList.scala:76)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$2(Enrich.scala:270)
at cats.data.Validated.leftMap(Validated.scala:213)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$1(Enrich.scala:270)
at scala.collection.immutable.List.map(List.scala:286)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.enrich(Enrich.scala:269)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrichEvents$2(Enrich.scala:188)
at com.snowplowanalytics.snowplow.enrich.beam.utils$.timeMs(utils.scala:153)
at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrichEvents$1(Enrich.scala:185)
at com.spotify.scio.util.Functions$$anon$6.processElement(Functions.scala:220)
Having trouble figuring out what’s happening here. It was rock solid up until the other day, and now a backlog is forming with the shadow of data loss hanging over us. Though we’re not yet dependent on our RT Pipeline, we want to make sure we have a few months of solid data before transitioning from Batch.
@anton - thanks. I managed to work around this: rolling back to 0.4.1 solved it. Backlog clearing …
Both beam-enrich-1.0.0 and beam-enrich-1.1.0 fail in the same way. I noticed the new badrows functionality was added in beam-enrich-1.x.x but I’m not yet using that. It’s strange because it was working fine until the other day.
The fact that it was working before tells me that it probably stumbles upon some very unusual bad rows that our JSON library then struggles to encode. On the other hand, I still don’t see anything there that could contain null, so it must be something related to Beam Enrich passing nulls around.
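To illustrate what I mean, here’s a minimal sketch (the field name is made up, and it assumes circe is on the classpath, as it is in Beam Enrich): if a Json value ends up wrapping a raw null String rather than an explicit Json.Null, serialising it blows up in Printer.onString exactly like the trace above.

```scala
import io.circe.Json

object NullStringRepro extends App {
  // Hypothetical payload: a JSON string node wrapping a null Scala String
  // (as opposed to an explicit Json.Null).
  val suspect: Json = Json.obj(
    "someParameter" -> Json.fromString(null: String)
  )

  // Serialising dereferences the underlying String and throws
  // java.lang.NullPointerException in io.circe.Printer$PrintingFolder.onString.
  println(suspect.noSpaces)
}
```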
Anyway, we’ll investigate this. Any other information you can provide is highly appreciated.
I’m afraid this bad row can’t be found anywhere, as it’s created inside beam-enrich but never emitted. You also can’t find it in your batch bad rows, since batch doesn’t use the new bad row format and the problem is related to the encoding of these new bad rows.
The stack trace shows that the problem occurs for an EnrichmentFailures bad row, so maybe one of your enrichments (API, SQL, …) was working well before and has stopped working, and for some reason the associated bad row can’t be encoded with circe.
I wonder if it could be that one of the bad row fields is set to null where a value is expected. We’ll continue to investigate and will let you know as soon as we find something.
0.4.1 and 1.0.0 should emit the same bad rows, so in theory the problematic bad rows with the new format in 1.0.0 should also be emitted in 0.4.1 with the old format.
Could you tell us if you have bad rows with 0.4.1, and if so, would it be possible to get a sample please?
The only change in our enrichments was to upgrade referer_parser to support the changes in 1.x.x (and we rolled it back for 0.4.1). Both schemas validate, and the 2-0-0 referer_parser schema had been working successfully for a few weeks, so what we have here may be a config issue.
@BenB we always have at least some bad rows. Happy to share some of them if it helps.
Are there specific errors I can pull for you? I suspect it’s only a handful of bad rows across a few weeks of data. I can pull them from S3 for you.
@robkingston if you could send us some bad rows related to enrichment problems it would be great. The bad rows that were failing in 1.0.0 should also be emitted by 0.4.1, just with a different format. It would help us to see what the 1.0.0 equivalent is and what could be wrong. Thanks a lot!
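To save you some digging, here’s a rough sketch of the kind of filter we’d run over a local dump of the 0.4.1 / batch bad rows. It assumes the legacy newline-delimited format with an errors array of {level, message} objects, and the file name is just a placeholder:

```scala
import scala.io.Source

import io.circe.Json
import io.circe.parser.parse

object FilterEnrichmentBadRows extends App {
  // Placeholder path: a local dump of legacy-format bad rows, one JSON object per line.
  val lines = Source.fromFile("bad-rows-sample.ndjson").getLines()

  // Keep only the rows whose error messages look enrichment-related.
  val enrichmentRelated = lines.filter { line =>
    parse(line).toOption.exists { json =>
      json.hcursor
        .downField("errors")
        .as[List[Json]]
        .getOrElse(Nil)
        .exists(_.hcursor.get[String]("message").exists(_.toLowerCase.contains("enrich")))
    }
  }

  enrichmentRelated.take(10).foreach(println)
}
```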
We might have found the origin of the problem: it would occur when one of the query string parameter values is null and something goes wrong in the enrichments. We created the issue on GitHub. Thanks a lot for reporting.
We’ll let you know when a new version with the fix is out. Meanwhile, if you could still send us some bad rows from 0.4.1 related to enrichment problems, that would be really great, so that we can confirm our hypothesis.
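To give an idea of the kind of guard we’re looking at (just a sketch of the idea, not the actual patch; the parameter names are made up), the point is to map a null query string value to Json.Null instead of wrapping it as a string:

```scala
import io.circe.Json

object QuerystringGuard extends App {
  // Hypothetical raw query string parameters, with one value null,
  // as they might come out of URL parsing.
  val params: List[(String, String)] = List(
    "utm_source" -> "newsletter",
    "utm_term"   -> null
  )

  // Unsafe: Json.fromString(null) produces a value that NPEs at serialisation time.
  // Safe: wrap the value in Option so a null becomes Json.Null instead.
  val payload: Json = Json.obj(
    params.map { case (k, v) =>
      k -> Option(v).map(Json.fromString).getOrElse(Json.Null)
    }: _*
  )

  println(payload.noSpaces) // {"utm_source":"newsletter","utm_term":null}
}
```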
I’m experiencing the exact same issue and I’m happy to learn you seem to have found the origin.
I would like to avoid reverting to 0.4.1, as it uses quite an old SDK that doesn’t support the Dataflow Streaming Engine. Is there any news on when to expect a release containing a fix?