Hi, I have set up a Snowplow pipeline in GCP. I am able to load the events table in BigQuery without any issue, but I ran into a problem when I tried the setup below.
I tried to set up a similar pipeline where my target table is a time-partitioned table. I've read the discussion Google Cloud Platform data pipeline optimization. Before running the pipeline, I manually created the partitioned table on the derived_tstamp column (with DAY granularity), using the same schema suggested in the atomic schema. But as I fire events from the tracker, the events are not loading into the table; instead they are going to the failed inserts topic in Pub/Sub, even though the mutator is able to mutate the table and the custom columns are being added. Could you please suggest what the possible reason is and how to resolve it?
Also, my ultimate aim is to create a table partitioned on the event_name column. Please advise how I can achieve this.
Failed inserts are generally pretty standard, at least while the mutator is still catching up, but as long as they are being retried (by the BQ Repeater) they should make it into the table after the mutator has successfully created the columns.
BigQuery won’t (at the moment) let you partition by a string column. The advice here is to partition first by a timestamp and then cluster within that partition, using something like event_name.
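For reference, here is a minimal sketch of creating such a table with the Python BigQuery client. The project, dataset and table names are placeholders, and the schema is cut down to two columns; the real atomic events table has many more fields.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder schema: the real atomic events table has many more columns.
schema = [
    bigquery.SchemaField("derived_tstamp", "TIMESTAMP"),
    bigquery.SchemaField("event_name", "STRING"),
]

table = bigquery.Table("my-project.my_dataset.events", schema=schema)

# Partition on the timestamp column with DAY granularity...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="derived_tstamp",
)
# ...and cluster within each partition on event_name.
table.clustering_fields = ["event_name"]

client.create_table(table)
```

Queries that filter on event_name will then prune blocks within each daily partition, which gets you most of the benefit you were after with string partitioning.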
@Abhishek_Singh Regarding the failed inserts not being retried: are you trying to load real-time data (i.e. load data as it is being collected) or are you trying to load a historical archive?
The mutator needs some time to make the table changes, so when the repeater sees a failed insert, it won't retry it immediately. It waits some time (15 minutes by default) before it tries to re-insert the event. This way the mutator has enough time to mutate the table. However, this waiting period is calculated as the difference between now() and the collector_tstamp of the event. So if you are trying to load historical data whose collector_tstamp is already more than 15 minutes before now(), the repeater will retry those events straight away. That leaves very little time for the mutator to do its job, and the events will ultimately go to your dead-end bucket on GCS.
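To illustrate the timing logic described above (this is just a sketch, not the repeater's actual code):

```python
from datetime import datetime, timedelta, timezone

# Default back-off described above; the real repeater makes this configurable.
BACKOFF = timedelta(minutes=15)

def ready_to_reinsert(collector_tstamp: datetime) -> bool:
    """An event is only retried once BACKOFF has elapsed since its
    collector_tstamp, which gives the mutator time to add new columns."""
    return datetime.now(timezone.utc) - collector_tstamp >= BACKOFF

# Real-time event collected a minute ago: not retried yet, mutator gets time.
# Historical event collected days ago: retried immediately, so the mutator may
# not have created the columns yet and the event can end up in the dead-end bucket.
```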
@dilyan, thanks for the detailed explanation of the failed inserts. This is working fine now. The problem in my case, I believe, was that I was stopping the BQ Loader after waiting for some time.