I have a suspicion that I have duplicate values in the domain_userid attribute in my data. After reading your related post about the event_id attribute, I wanted to ask you if a similar approach could be used for domain_userid as well.
Could you clarify what you mean by duplicate domain user ID (perhaps by sharing the SQL query)? The domain user ID is a user identifier (generated by the Javascript tracker and–by default–stored in a cookie) so we expect different events to share specific values of the domain user ID.
Thank you for your message. First of all I would like to tell you that we updated the Snowplow JS web tracker to its latest version one week ago and maybe this could have caused issues to our data?
The reason why I have this suspicion about duplicate domain_userids is because I have notice user cases like:
The same user, at exactly the same time is in two different sessions (this is a case of a user before the update of the JS tracker)
This user case is not an outlier in my data and this is the reason why I wanted to start from something basic and important (such as the uniqueness of the user id) in order to understand where the problem is.
Here is a query I am using to get such kind of issues:
SELECT A.domain_userid, A.domain_sessionid, A.domain_sessionidx, A.derived_tstamp, A.event, A.page_url
FROM atomic.events A
INNER JOIN (SELECT Distinct domain_userid, derived_tstamp
FROM atomic.events
WHERE derived_tstamp >= '2017-10-19’
GROUP BY domain_userid, derived_tstamp
HAVING COUNT(Distinct domain_sessionid) > 1 AND COUNT(Distinct domain_sessionidx) > 1 AND COUNT(Distinct page_url) > 1) B
ON A.domain_userid = B.domain_userid AND A.derived_tstamp = B.derived_tstamp
WHERE A.derived_tstamp >= ‘2017-10-19’ AND A.event = 'page_view’
ORDER BY A.derived_tstamp
Also, adding to my previous post, there are cases of new users (after the update of the tracker) who have the same behaviour.