BigQuery Stream Loader errors

Hi,
we are seeing regular errors on our Stream Loader machines, at a volume of 10-20 million events per day. A typical example from the logs:

```
2023-01-27 08:10:58.738 CET [io-compute-3] INFO com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Shutdown - Source of events was cancelled
2023-01-27 08:10:58.740 CET [io-compute-3] ERROR com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Main - Application shutting down with error
com.google.cloud.bigquery.BigQueryException: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support.
  at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
  at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:507)
  at com.google.cloud.bigquery.BigQueryImpl.insertAll(BigQueryImpl.java:1097)
  at com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Bigquery$.$anonfun$mkInsert$2(Bigquery.scala:91)
  at blocking @ com.permutive.pubsub.producer.grpc.internal.PubsubPublisher$.$anonfun$createJavaPublisher$1(PubsubPublisher.scala:46)
  at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Bigquery$.go$1(Bigquery.scala:54)
  at *> @ com.snowplowanalytics.snowplow.storage.bigquery.streamloader.StreamLoader$.$anonfun$run$1(StreamLoader.scala:66)
  at flatMap @ fs2.Stream.$anonfun$parEvalMapAction$6(Stream.scala:2133)
  at *> @ com.snowplowanalytics.snowplow.storage.bigquery.streamloader.StreamLoader$.$anonfun$run$1(StreamLoader.scala:66)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 500 Internal Server Error
POST https://www.googleapis.com/bigquery/v2/projects/project/datasets/snowplow/tables/events/insertAll?prettyPrint=false
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support.",
    "reason" : "internalError"
  } ],
  "message" : "An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support.",
  "status" : "INTERNAL"
}
  at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
  at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
```

This looks like a problem on Google's side, but since the error message suggests it can be solved by "Retrying the job with back-off as described in the BigQuery SLA", my question is: does the BigQuery Stream Loader already implement any back-off/retry mechanism?

Thanks,
Andreas

Hi @volderette, this is a good question. I have seen these errors before too, but I had not paid them enough attention until now.

Our BigQuery loader does have some retry logic. If you like looking at code, it is implemented on these lines and configured here. Essentially, the loader delegates all retries to the underlying third-party BigQuery client library, up to a certain time limit; the sketch below shows roughly what that delegation looks like.
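To illustrate, here is a minimal sketch of configuring the Java BigQuery client's retry settings. It is not the loader's actual code, and the concrete values and object names are assumptions for illustration only:

```scala
// Minimal sketch, assuming retry behaviour is handed to the Java BigQuery
// client: you build a RetrySettings object and the client retries failed
// API calls internally, up to a total timeout. Values are illustrative,
// not the loader's real configuration.
import com.google.api.gax.retrying.RetrySettings
import com.google.cloud.bigquery.{BigQuery, BigQueryOptions}
import org.threeten.bp.Duration

object RetryConfigSketch {

  val retrySettings: RetrySettings =
    RetrySettings.newBuilder()
      .setInitialRetryDelay(Duration.ofMillis(500)) // first back-off delay
      .setRetryDelayMultiplier(2.0)                 // exponential back-off
      .setMaxRetryDelay(Duration.ofSeconds(30))     // cap on any single delay
      .setTotalTimeout(Duration.ofMinutes(5))       // give up after this long overall
      .build()

  val bigQuery: BigQuery =
    BigQueryOptions.newBuilder()
      .setRetrySettings(retrySettings)
      .build()
      .getService
}
```

Whether a particular call such as insertAll actually goes through that retry path is decided inside the client library, which is exactly where the caveat below comes in.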

However… now that I have looked at it more closely, it seems our current retry settings only apply to failed API calls, which is not quite the same thing as retrying failed BigQuery jobs. I found this explained a bit in a comment on GitHub.

So – I think we (the Snowplow maintainers) can do a better job of making the loader retry on this type of failure, for example by wrapping the insert in its own back-off loop as sketched below. I think it will be a nice improvement to the loader that will benefit many Snowplow users.
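For the sake of discussion, a hypothetical back-off wrapper in the cats-effect style the loader already uses might look like this. The names `insert`, `withBackoff` and the retry parameters are made up for illustration; this is not the actual fix:

```scala
// Hypothetical back-off wrapper, not the actual fix. It retries an effect
// (e.g. the streaming insert) on transient BigQuery errors, doubling the
// delay each time and giving up after a fixed number of attempts.
import scala.concurrent.duration._
import cats.effect.IO
import com.google.cloud.bigquery.BigQueryException

object BackoffSketch {

  // Treat 5xx responses as transient and worth retrying.
  private def isTransient(e: Throwable): Boolean = e match {
    case bq: BigQueryException => bq.getCode >= 500
    case _                     => false
  }

  def withBackoff[A](insert: IO[A],
                     delay: FiniteDuration = 1.second,
                     attemptsLeft: Int = 5): IO[A] =
    insert.handleErrorWith { e =>
      if (isTransient(e) && attemptsLeft > 0)
        IO.sleep(delay) *> withBackoff(insert, delay * 2, attemptsLeft - 1)
      else
        IO.raiseError(e)
    }
}
```

A library like cats-retry could express the same idea more declaratively; the point is that the loader itself, rather than only the client library, would own the back-off for transient 5xx responses.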

I will open an issue on GitHub and then try to get a fix out in the next release. Thank you for bringing this to my attention.


Thank you for the great explanation and for opening the ticket on GitHub, @istreeter!