We recently migrated to the Snowplow realtime stack, using the Kafka realtime pipeline. The migration went very well, except for one issue in one of our custom tracking setups.
In this custom tracking we have integrated the pixel tracker to track email opens. We have observed that the open rate of emails has surged by about 1.5X. The counts from the other trackers in the same setup match the batch numbers.
One notable finding: the surge in open rate comes with no increase in unique users, only an increase in opens per user.
We are not able to figure out the reason for this. Is open tracking being duplicated in the realtime stack, or were there drops in batch tracking that are now fixed?
Hi @jimy2004king, can you verify whether these are duplicated events? If they are, they would have the same event_id. I’m wondering whether you’re seeing duplicates caused by Kafka’s at-least-once delivery guarantees.
If they are indeed duplicates, you could do some cleanup in the warehouse, where you only keep one of them.
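As a rough illustration of that check and cleanup, here is a minimal Python sketch. It assumes the events have been exported as a list of dicts with an `event_id` key; the function names and sample rows are hypothetical, and in practice this would more likely be a query in the warehouse itself.

```python
from collections import Counter


def find_duplicate_event_ids(events):
    """Return {event_id: count} for every event_id seen more than once."""
    counts = Counter(e["event_id"] for e in events)
    return {eid: n for eid, n in counts.items() if n > 1}


def dedupe_by_event_id(events):
    """Keep the first occurrence of each event_id, preserving input order."""
    seen = set()
    kept = []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            kept.append(e)
    return kept


# Hypothetical sample rows: "a1" was delivered twice.
events = [
    {"event_id": "a1", "user_id": "u1"},
    {"event_id": "a1", "user_id": "u1"},
    {"event_id": "b2", "user_id": "u1"},
]

print(find_duplicate_event_ids(events))  # {'a1': 2}
print(len(dedupe_by_event_id(events)))   # 2
```

If `find_duplicate_event_ids` comes back empty, at-least-once redelivery is unlikely to be the cause.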
If it’s an increase in opens per user (and it’s not the same event ids as @dilyan has mentioned) then I’d try to narrow it down by email clients.
Some email clients eagerly pre-fetch (and cache) images within emails so this can inflate open rates as the image pixel may be requested more than once.
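One way to narrow it down is to compute opens per user for each client. The sketch below assumes each open event carries a `useragent` string and a `user_id`; the key names, function name, and sample user agents are made up for illustration.

```python
from collections import defaultdict


def opens_per_user_by_client(events):
    """Return {useragent: opens / unique users} for the given open events."""
    opens = defaultdict(int)
    users = defaultdict(set)
    for e in events:
        client = e["useragent"]
        opens[client] += 1
        users[client].add(e["user_id"])
    return {client: opens[client] / len(users[client]) for client in opens}


# Hypothetical sample: one client fetches the pixel repeatedly for user u1.
sample = [
    {"useragent": "ProxyFetcher/1.0", "user_id": "u1"},
    {"useragent": "ProxyFetcher/1.0", "user_id": "u1"},
    {"useragent": "ProxyFetcher/1.0", "user_id": "u1"},
    {"useragent": "ProxyFetcher/1.0", "user_id": "u2"},
    {"useragent": "DesktopMail/2.3", "user_id": "u3"},
]

ratios = opens_per_user_by_client(sample)
print(ratios["ProxyFetcher/1.0"])  # 2.0
print(ratios["DesktopMail/2.3"])   # 1.0
```

A client whose ratio is well above the others would be the first place to look.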
@dilyan I checked for duplicates and there are none; all the event_ids are different.
@mike If it were due to email providers/clients pre-fetching, it should happen on both the batch and the realtime stack.
One more interesting finding from yesterday: we started adding two pixels to the email, one for batch and one for realtime. The stats were the same: we got a 1.5X count for realtime compared to batch.
We also played around with one of the emails delivered to our test account. We opened that email multiple times on 2-3 different machines in different geographies. I also downloaded the HTML content, pasted it into a local HTML file, and opened it multiple times to see how both pixel requests get executed. There was no difference in the way the two requests were sent. But on checking the raw data for this particular user, we found only 3 entries in batch and 21 entries in realtime. I am pretty sure that we opened it more than 3 times.
This suggests that the batch collector / EMR process is reducing the entries with some algorithm based on IP and other similarities, dropping subsequent requests, while realtime is not dropping them.
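The comparison we did by hand for the test account could be repeated for every user with something like the sketch below. It assumes the raw events from each pipeline are available as lists of dicts with a `user_id` key; the function name and the `test-account` id are hypothetical, and the 3 vs 21 counts are the ones from our test.

```python
from collections import Counter


def compare_pipelines(batch_events, realtime_events):
    """Return {user_id: (batch_count, realtime_count)} across both pipelines."""
    batch = Counter(e["user_id"] for e in batch_events)
    realtime = Counter(e["user_id"] for e in realtime_events)
    # Counter returns 0 for users that are missing from one side.
    return {
        user: (batch[user], realtime[user])
        for user in sorted(set(batch) | set(realtime))
    }


# Counts taken from the test account: 3 entries in batch, 21 in realtime.
batch_events = [{"user_id": "test-account"}] * 3
realtime_events = [{"user_id": "test-account"}] * 21

report = compare_pipelines(batch_events, realtime_events)
print(report["test-account"])  # (3, 21)
```

Users where the two counts diverge most are the ones whose raw requests are worth inspecting field by field.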
Hi @jimy2004king, did you migrate from a CloudFront batch pipeline or a Clojure (ElasticBeanstalk) stack?
If it’s the former, it might have something to do with the CDN caching options: subsequent hits are not recorded because your client has already cached the i pixel, so the CDN does not serve it again.
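To make the caching hypothesis concrete, here is a toy Python model (not Snowplow or CDN code) of how many pixel requests would actually be recorded when a client caches the pixel after the first fetch. The function name and parameters are invented for illustration.

```python
def count_recorded_hits(open_count, client_caches_pixel):
    """Toy model: count the opens that result in a recorded pixel request."""
    cached = False
    recorded = 0
    for _ in range(open_count):
        if client_caches_pixel and cached:
            # Served from the local cache, so no request is logged.
            continue
        recorded += 1
        cached = True
    return recorded


print(count_recorded_hits(5, client_caches_pixel=True))   # 1
print(count_recorded_hits(5, client_caches_pixel=False))  # 5
```

Under this simplified model, a pipeline that allows caching would under-count repeat opens while still reporting the same number of unique users.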
@josh We migrated from Clojure (ElasticBeanstalk) to the Scala Stream Collector.
But I will check with my team about the CDN caching. Thanks for the heads-up.