Dataflow jobs offer the possibility of sharding. Would it be possible to use this for micro-batching instead of streaming inserts into BigQuery? This could save some cost, since streaming inserts are billed separately…
Example Dataflow job step that loads data in micro-batches (a sketch using Beam's BigQueryIO with FILE_LOADS, which issues batch load jobs instead of streaming inserts; the table reference is a placeholder):

    .apply("Write to Custom BigQuery",
        BigQueryIO.writeTableRows()
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)       // batch load jobs, not streaming inserts
            .withTriggeringFrequency(Duration.standardMinutes(5)) // micro-batch interval
            .withNumFileShards(10)
            .to("my-project:my_dataset.my_table"))
Snowplow BigQuery Loader supports a batch mode as of version 0.1.0: https://github.com/snowplow-incubator/snowplow-bigquery-loader/wiki/Setup-guide#loading-mode. However, I have to admit we never used it internally, and I have a vague memory of someone on this forum complaining that it was throwing OOM errors.
If you have files on Google Cloud Storage, loading should be fairly straightforward, as BigQuery handles it quickly even for bigger files. Even better if the data is partitioned by date, in which case you can safely reload the files belonging to a single date multiple times without having to worry about duplication. Though I'm not certain this is supported by Dataflow.
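On the safe-reload point: BigQuery supports partition decorators (`table$YYYYMMDD`), and a load job with write disposition WRITE_TRUNCATE against a decorator replaces only that one day's partition, which is what makes reloading a single date idempotent. A minimal sketch of building the decorator string (the class and method names here are mine, not from any library):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionDecorator {

    // Builds a BigQuery partition decorator ("table$YYYYMMDD").
    // Targeting a load job at this decorator with WRITE_TRUNCATE
    // replaces only that date's partition, so reloading one day's
    // files is idempotent.
    static String decorator(String table, LocalDate date) {
        return table + "$" + date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // Example: load target for the 2019-01-01 partition of "events".
        System.out.println(decorator("events", LocalDate.of(2019, 1, 1)));
        // prints "events$20190101"
    }
}
```

You would pass the resulting string as the destination table of a `bq load` command or a load-job API call, together with WRITE_TRUNCATE.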