API enrichment from BQ vs batch update later

Can you suggest which approach is better: getting enriched data in real time from BigQuery via API enrichment, or updating the data directly once a day later?

I think this really depends on your use cases.

Using the API/SQL Enrichment means your contexts/entities are present in your atomic events table as soon as the row loads into BigQuery. This can be very handy if that data needs to be consumed from atomic events in near real time, or if you want to consume it from a Pub/Sub stream before it even hits BigQuery. However, you’ll probably increase your BigQuery invoice slightly by running queries against it, and you’ll introduce some minor lag into your pipeline by calling another API (not much, but if you have other near-real-time use cases, an extra second or two might matter). Additionally, you’re creating another dependency in your pipeline by calling an API; if that API goes down, you’ll likely start seeing failed events.
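To illustrate the dependency and latency point, here’s a minimal sketch, not Snowplow’s actual enrichment internals: the event fields, the `fetch` callable, and the status values are all hypothetical. The shape of the problem is the same, though: every event waits on the external call, and an outage turns into failed events.

```python
# Hypothetical sketch of an enrichment step that depends on an external API.
# `fetch` stands in for the real HTTP call; all names are illustrative only.

def enrich_event(event: dict, fetch, timeout_s: float = 2.0) -> dict:
    """Attach enrichment data to an event, routing it to 'failed' on error."""
    try:
        # Every event now waits on this call -- even a fast API adds latency.
        context = fetch(event["user_id"], timeout=timeout_s)
        return {**event, "context": context, "status": "enriched"}
    except Exception as exc:
        # If the API is down or slow, the event ends up in the failed stream
        # instead of landing enriched in the atomic events table.
        return {**event, "status": "failed", "error": str(exc)}

def healthy_fetch(user_id, timeout):
    return {"plan": "pro"}

def broken_fetch(user_id, timeout):
    raise TimeoutError("enrichment API down")

ok = enrich_event({"user_id": "u1"}, healthy_fetch)
down = enrich_event({"user_id": "u1"}, broken_fetch)
```

Here `ok` carries the extra context, while `down` is marked failed rather than loaded, which is exactly the new failure mode the extra dependency introduces.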

Most use cases don’t require these contexts/entities to be on the atomic events row in near real time; in fact, I’d say 95% don’t. So I’d always start by doing the join in a data model when you create derived tables (like the Snowplow Web and Mobile models do), perhaps once an hour or once a day, depending on how important it is to be able to query the data close to the time the event occurred. If that turns out not to be sufficient, then it’s time to consider the API/SQL enrichments.
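For the batch approach, the join can live in a scheduled data model job. A minimal sketch, assuming hypothetical table and column names (`atomic.events`, `enrichment.user_attributes`, `derived.events_enriched`, and the `plan`/`segment` columns are placeholders for your own schema), using the `google-cloud-bigquery` client:

```python
# Sketch of a daily batch join into a derived table. Table and column names
# are hypothetical placeholders -- substitute your own schema.

def build_daily_join_sql(events_table: str, enrichment_table: str,
                         target_table: str) -> str:
    """Build SQL that joins enrichment data onto yesterday's events."""
    return f"""
    CREATE OR REPLACE TABLE `{target_table}` AS
    SELECT e.*, u.plan, u.segment
    FROM `{events_table}` AS e
    LEFT JOIN `{enrichment_table}` AS u
      ON e.user_id = u.user_id
    WHERE DATE(e.collector_tstamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    """

if __name__ == "__main__":
    # Requires GCP credentials; run this from your scheduler
    # (e.g. cron, Airflow, or a BigQuery scheduled query instead).
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = build_daily_join_sql("atomic.events",
                               "enrichment.user_attributes",
                               "derived.events_enriched")
    client.query(sql).result()  # blocks until the job completes
```

Because the join runs once a day against already-loaded data, the pipeline itself has no extra API dependency, and an enrichment-source outage only delays the derived table rather than failing events.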
