Failed inserts for time Partitioned table

Abhishek_Singh · September 6, 2021, 5:52pm

Hi, I have set up a Snowplow pipeline in GCP. I was able to load the events table in BigQuery without any issues until I tried the following.

I tried to set up a similar pipeline where the target table is a time-partitioned table. I’ve read the discussion "Google Cloud Platform data pipeline optimization". Before running the pipeline, I manually created the table, partitioned on the derived_tstamp column (with DAY granularity), using the schema suggested in the atomic schema. But when I fire events from the tracker, they are not loaded into the table. Instead they go to the failed inserts topic in PubSub, even though the mutator is able to mutate the table and the custom columns are being added to it. Could you please suggest what the likely cause is and how to resolve it?
Also, my ultimate aim is to create a partitioned table with event_name as the partitioning column. Please advise how I can achieve this.

mike · September 6, 2021, 10:24pm

Failed inserts, at least while the mutator is running, are generally pretty standard. As long as they are being retried (by the BQ repeater), they should make it through the pipeline once the mutator has successfully created the columns.

BigQuery won’t (at the moment) let you partition by a string column. The advice here is to partition first by a timestamp and then cluster within that partition, using something like event_name.
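
As a rough illustration of the partition-then-cluster advice, here is a minimal Python sketch that builds a BigQuery DDL statement for a table partitioned by day on derived_tstamp and clustered by event_name. The helper name, the table name my-project.snowplow.events and the four-column list are assumptions made up for the example; a real events table would need the full atomic schema.

```python
def partitioned_table_ddl(table, columns,
                          partition_column="derived_tstamp",
                          cluster_columns=("event_name",)):
    """Build a CREATE TABLE statement that partitions by day and clusters.

    Hypothetical helper for illustration; it only formats a string and
    does not talk to BigQuery.
    """
    column_defs = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n{column_defs}\n)\n"
        # DATE(<timestamp column>) gives daily partitions on a TIMESTAMP column.
        f"PARTITION BY DATE({partition_column})\n"
        # Clustering sorts rows within each partition by these columns.
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Illustrative subset only, NOT the full atomic schema.
EXAMPLE_COLUMNS = [
    ("event_id", "STRING"),
    ("collector_tstamp", "TIMESTAMP"),
    ("derived_tstamp", "TIMESTAMP"),
    ("event_name", "STRING"),
]

ddl = partitioned_table_ddl("my-project.snowplow.events", EXAMPLE_COLUMNS)
print(ddl)
```

Queries that filter on both derived_tstamp and event_name can then prune partitions first and use the clustering within each partition.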

Abhishek_Singh · September 7, 2021, 8:50am

Hi @mike,
Thanks for your response. I have been trying for a few days now, but the failed inserts do not seem to be getting loaded at all.

I will definitely try the clustering option. Thank you.

dilyan · September 7, 2021, 10:22am

@Abhishek_Singh With regard to the failed inserts not being retried, are you trying to load real-time data (i.e., load data as it is being collected) or are you trying to load a historical archive?

The mutator needs some time to make the table changes, so when the repeater sees a failed insert, it won’t retry it immediately. It waits some time (15 mins by default) before it tries to re-insert the event. This way the mutator has enough time to mutate the table. However, this waiting period is calculated as the difference between now() and the collector_tstamp of the event. So if you are trying to load historical data whose collector_tstamp is already more than 15 mins before now(), the repeater will retry those events straight away. That leaves very little time for the mutator to do its job, and the events will ultimately go to your dead-end bucket on GCS.
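
The timing rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the repeater's actual code; the function names and the DEFAULT_BACKOFF constant are made up for the example, with the 15-minute value taken from the explanation above.

```python
from datetime import datetime, timedelta, timezone

# Default waiting period described above (assumed name, not a real setting).
DEFAULT_BACKOFF = timedelta(minutes=15)


def ready_for_retry(collector_tstamp, now, backoff=DEFAULT_BACKOFF):
    """True once the event is at least `backoff` old relative to `now`."""
    return now - collector_tstamp >= backoff


def remaining_wait(collector_tstamp, now, backoff=DEFAULT_BACKOFF):
    """How much longer a failed insert would wait before being retried."""
    return max(backoff - (now - collector_tstamp), timedelta(0))


now = datetime(2021, 9, 7, 10, 0, tzinfo=timezone.utc)

# Real-time event collected 2 minutes ago: the mutator still has 13 minutes.
fresh = now - timedelta(minutes=2)
print(ready_for_retry(fresh, now), remaining_wait(fresh, now))

# Historical event collected 30 days ago: retried straight away, no grace period.
historical = now - timedelta(days=30)
print(ready_for_retry(historical, now), remaining_wait(historical, now))
```

Because the age is measured from collector_tstamp rather than from the moment the insert failed, historical events get no grace period at all.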

Abhishek_Singh · September 9, 2021, 6:29am

@dilyan, thanks for the detailed explanation of the failed inserts. This is working fine now. The problem in my case, I believe, was that I was stopping the bqloader after waiting for some time.
