Dataflow jobs offer the possibility of sharding. Would it be possible to use this for micro-batching instead of streaming inserts into BigQuery? This could save some cost, since streaming inserts are billed separately…
Example Dataflow job step that loads data in micro-batches (a sketch using Beam's BigQueryIO with FILE_LOADS, which issues batch load jobs instead of streaming inserts; the table reference is a placeholder):

    .apply("Write to Custom BigQuery",
        BigQueryIO.writeTableRows()
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)       // batch load jobs, not streaming inserts
            .withTriggeringFrequency(Duration.standardMinutes(5)) // micro-batch interval
            .withNumFileShards(10)
            .to("my-project:my_dataset.my_table"))
Snowplow BigQuery Loader supports a batch mode as of version 0.1.0: https://github.com/snowplow-incubator/snowplow-bigquery-loader/wiki/Setup-guide#loading-mode. However, I have to admit we never used it internally, and I have a vague memory of someone on this forum complaining that it was throwing OOM errors.
If you have files on Google Cloud Storage, loading should be fairly straightforward, as BigQuery handles it quickly even for bigger files. Even better if the data is partitioned by date, in which case you can safely reload the files belonging to a single date multiple times without having to worry about duplication. Though I'm not certain this is supported by Dataflow.
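On the safe-reload point: BigQuery supports partition decorators (`table$YYYYMMDD`), and a load job with write disposition WRITE_TRUNCATE against a decorator replaces only that one day's partition, which is what makes reloading a single date idempotent. A minimal sketch of building the decorator string (the class and method names here are mine, not from any library):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionDecorator {

    // Builds a BigQuery partition decorator ("table$YYYYMMDD").
    // Targeting a load job at this decorator with WRITE_TRUNCATE
    // replaces only that date's partition, so reloading one day's
    // files is idempotent.
    static String decorator(String table, LocalDate date) {
        return table + "$" + date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // Example: load target for the 2019-01-01 partition of "events".
        System.out.println(decorator("events", LocalDate.of(2019, 1, 1)));
        // prints "events$20190101"
    }
}
```

You would pass the resulting string as the destination table of a `bq load` command or a load-job API call, together with WRITE_TRUNCATE.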