Identity stitching and member resolution
The bigger picture behind how users are interacting with your site
Identity stitching is a key part of understanding how people use your website and what they are doing there. In this article I discuss the concept of identity stitching, my method of implementation, and things to be conscious of given the personally identifiable information (PII) we are handling. Identity stitching can then play an invaluable role in sessionisation and attribution.
So what is identity stitching?
Identity stitching is a process that lets us enrich events where we don’t know the identity of the person, using data collected and processed elsewhere or after the fact, so that we can associate these otherwise unidentifiable events with a person.
I stress at this point that the focus here isn’t to find out what John Smith is doing, but to develop a system that gives us a better understanding of what people as a group (or groups) are doing. To emphasise this point, I still recommend implementing identity stitching even if you choose to anonymise all PII data.
Let’s use a crude example to explain what identity stitching helps us understand:
John Smith is sitting on the train, staring down at his feet, and decides it’s time for a new pair of trainers. He gets out his phone and starts by searching for inspiration (“trainers in fashion”), which returns a set of results. At the top of the results is a paid ad from your company, “Trainers in this season by www.bestkicks.com”, a site he’s never been to before. John clicks your ad, looks around your website and finds a pair of trainers he likes, but decides to wait until he gets home to show his partner rather than rush his decision. When John gets home he goes to www.bestkicks.com and shows Jane Smith the new trainers. She also loves them, so John purchases them.
Now this is a pretty basic example, but the pattern of events repeats many times for Best Kicks: a person is exposed to the brand by an ad in a search engine, then comes back directly to the website after some time for consideration and purchases. On the face of it there isn’t a problem. The user has a cookie id or a device id, so we are able to associate the purchase with the ad click through attribution.
However, what happens when, instead of making the purchase on the phone he originally browsed on, John goes home and opens the site on his laptop or tablet to get a better sized image of the trainers? The data behind this journey becomes disjointed. The ad session occurred on his phone and the company has no known association between the two devices, so the data presents as two unrelated sessions: one where an ad generated a visit that didn’t convert, and another where a user came directly to the company website and bought trainers. This is where identity stitching comes into action. The process allows us to associate the phone traffic with John and therefore understand that the ad was the source that drove John to the website to begin with.
This gives us a better understanding of the complexities behind the modern user journey to a purchase/conversion.
Implementation
The foundation of my implementation is the premise that every event will have at least some sort of ambiguous identifier, like a cookie id or device id. It may also have more specific identifiers, which we can use in identity stitching.
These ambiguous identifiers are set in your tracker implementation. Many trackers provide a default version: a client side cookie that is set on the user’s device and persists for a certain amount of time. The longer it persists, the better.
A colleague of mine was able to go one better and implement a browser id. This is a computed id that is unique to each browser on a device and therefore persists for as long as the user keeps using that browser on that device.
The first thing I did was list all the identifiers we were tracking (cookie id, browser id, user id, domain user id, idfa, email address and so on) and sort them into two groups: ‘known identifiers’ and ‘unknown identifiers’.
Known identifiers
These are identifiers that exist on an event and that we can use to associate traffic with an actual person, such as email addresses or internal user ids. Known identifiers aren’t always present. They normally only occur when the user does something in their journey that requires them to affirm who they are, like submitting a form, logging in or purchasing. In the modern world, where the user experience should take precedence on a site, these steps are enforced less and less in order to create a frictionless experience, but their presence is key to understanding who the person behind the traffic is.
At this point it’s important to note that these identifiers can be PII. With that in mind we need to be conscious of how we collect them (for user safety) and how we store them (for GDPR compliance). For things like email addresses especially, we want to be certain we are encrypting them wherever possible and documenting where they are stored.
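One way to keep raw email addresses out of downstream tables is sketched below. Note that this is a keyed hash (HMAC-SHA256), which is one-way pseudonymisation rather than the encryption mentioned above, so treat it as an alternative under stated assumptions and not as what we ran in production. The function name and the idea of loading the key from a secrets store are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymise_email(email: str, secret_key: bytes) -> str:
    """Return a stable, non-reversible token for an email address.

    The address is trimmed and lower-cased first so that the same mailbox
    always produces the same token and can still be used as a join key.
    """
    normalised = email.strip().lower()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Hypothetical key: in practice this would come from a secrets store.
    key = b"replace-with-a-real-secret"
    print(pseudonymise_email(" John.Smith@Example.com ", key))
```

Because the token is deterministic for a given key, it can stand in for the email address everywhere identifiers are compared, while the raw value stays in one documented location.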
Unknown Identifiers
Some form of unknown identifier can be found on every event, for example a cookie id, device id, idfa or unique browser id. Although these don’t tell us much on their own about who a user actually is, they allow us to group one ‘user’s’ events together, and they act as the mapping key for joining known identifiers onto events once a mapping has been identified.
Hierarchy of Identifiers
The basis of the hierarchy is that some identifiers are more valuable than others. By valuable I mean they tell us more about a user than others, or simply make downstream modelling easier. Generally, known identifiers are more valuable than unknown identifiers, and each group can then be ranked further internally. Specifying this hierarchy matters because it drives the implementation: after listing all our identifiers and splitting them into known and unknown, we note the order of importance of each one (e.g. user id is more important than email address, and email address is more important than cookie id).
Having established the foundations of identifiers, we can now build a process that records all distinct combinations of identifiers as they occur. As new combinations appear we keep adding them to a table; let’s call this the identity associations table. This allows us to build up a mapping of identifiers as we learn more about our users. At this stage our only concern is to collect distinct combinations. We want to decouple this from the identity stitching process itself, because stitching can be implemented iteratively and can also be enriched from other data sources.
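As a minimal sketch, the hierarchy can be written down as an ordered list and used to pick the ‘best’ identifier available. The ordering of user id, email address and cookie id follows the example above; the column names and the relative order of the remaining unknown identifiers are assumptions for illustration.

```python
# Most valuable first. Known identifiers outrank unknown identifiers.
KNOWN_IDENTIFIERS = ["user_id", "email_address"]
# The order within this group is an assumption; pick what suits your data.
UNKNOWN_IDENTIFIERS = ["browser_id", "cookie_id", "device_id", "idfa"]
IDENTIFIER_HIERARCHY = KNOWN_IDENTIFIERS + UNKNOWN_IDENTIFIERS


def best_identifier(identifiers: dict):
    """Return (identifier_type, value) for the highest-ranked identifier present.

    Returns None if the record carries no identifier at all.
    """
    for name in IDENTIFIER_HIERARCHY:
        value = identifiers.get(name)
        if value:
            return name, value
    return None


if __name__ == "__main__":
    print(best_identifier({"cookie_id": "c1", "email_address": "john@example.com"}))
```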
As an example, suppose the cookie id ‘a5a87af1-aa62-4b99-ad63-9bb81656da0f’ was first recorded with no identifier information other than the cookie id itself. Two days later, the same person comes back to the website and this time decides to log in, so we add a new row to the table recording the new combination of cookie id and user id. You can already see that if we simply look for any previous traffic with this cookie id, we can now attach a user id to it. We also want to store the first recorded timestamp of each combination, as this will help us de-dupe later on.
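The recording step can be sketched as follows: keep one row per distinct combination of identifiers, together with the earliest timestamp at which it was seen. The column names, the ISO-8601 string timestamps and the in-memory dictionary are assumptions for illustration; in practice this would be a table in your warehouse.

```python
IDENTIFIER_COLUMNS = ("cookie_id", "user_id", "email_address")


def record_associations(events, associations=None):
    """Add each distinct identifier combination found in `events`.

    `associations` maps a tuple of identifier values to the first timestamp
    at which that combination was seen. Timestamps are ISO-8601 strings,
    which compare correctly as plain strings.
    """
    associations = {} if associations is None else associations
    for event in events:
        combination = tuple(event.get(column) for column in IDENTIFIER_COLUMNS)
        seen_at = event["timestamp"]
        if combination not in associations or seen_at < associations[combination]:
            associations[combination] = seen_at
    return associations


if __name__ == "__main__":
    cookie = "a5a87af1-aa62-4b99-ad63-9bb81656da0f"
    events = [
        {"cookie_id": cookie, "timestamp": "2021-03-01T09:00:00"},
        {"cookie_id": cookie, "timestamp": "2021-03-01T09:05:00"},
        {"cookie_id": cookie, "user_id": "u-123", "timestamp": "2021-03-03T18:30:00"},
    ]
    for combination, first_seen in record_associations(events).items():
        print(combination, first_seen)
```

Three events produce two rows here: the anonymous combination and, two days later, the cookie id paired with a user id.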
Identity Stitching
With an associations table that has recorded all the unique combinations of identifiers, we can now build a simple process that references the hierarchy we decided on earlier: for each unknown identifier we assign the ‘best’ identifier based on this hierarchy. I stored this identifier in a generic field called ‘attributed user id’, which can hold any of our unknown or known identifiers depending on which one has been assigned. I also created another column, ‘stitched identity type’, that records which kind of identifier the attributed user id holds. This is useful later on for filtering traffic based on how well we can categorise the user.
This table will be joined onto events. We already know each event will have an unknown identifier of some sort, so we can join the identity stitching table onto our events by that unknown identifier and enrich them with a ‘better’ identity if evidence of one exists.
There may be a couple of outlier situations you need to cater for, and these might be more or less apparent depending on your business model. The classic example is ‘shared devices’, where two people (users) have logged in on the same device. In the data this shows up as the identity associations table frequently having multiple different known identifiers for the same unknown identifier. We can choose to handle this in different ways, but the most sensible approach is to assess the situations that actually occur. For example, you might have a business model where multiple logins on the same device are expected, in which case you might want to take the association that occurs most often. The end goal is to ‘stitch’ each unknown identifier to only one identity, otherwise you will end up duplicating data in downstream modelling, so you need to assess which method best fits your business model.
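A minimal sketch of this step, assuming the cookie id is the unknown identifier and the associations table is a list of rows with hypothetical column names. When two rows tie on rank, this sketch keeps the one seen first, which is only one of several reasonable tie-breaks.

```python
# Most valuable first, following the hierarchy described earlier.
HIERARCHY = ("user_id", "email_address", "cookie_id")


def build_stitch_table(associations):
    """Pick the best identifier seen for each cookie id.

    Each row is a dict with `cookie_id`, optional `user_id` and
    `email_address`, and `first_seen` as an ISO-8601 string.
    """
    best = {}
    for row in associations:
        # Every row carries a cookie id, so a match is always found.
        rank, identity_type = next(
            (i, name) for i, name in enumerate(HIERARCHY) if row.get(name)
        )
        candidate = (rank, row["first_seen"], identity_type, row[identity_type])
        key = row["cookie_id"]
        # Lower rank wins; on equal rank the earlier first_seen wins.
        if key not in best or candidate[:2] < best[key][:2]:
            best[key] = candidate
    return {
        key: {"attributed_user_id": value, "stitched_identity_type": identity_type}
        for key, (_, _, identity_type, value) in best.items()
    }


if __name__ == "__main__":
    rows = [
        {"cookie_id": "c1", "first_seen": "2021-03-01"},
        {"cookie_id": "c1", "user_id": "u-123", "first_seen": "2021-03-03"},
        {"cookie_id": "c2", "email_address": "jane@example.com", "first_seen": "2021-03-02"},
    ]
    for cookie_id, stitched in build_stitch_table(rows).items():
        print(cookie_id, stitched)
```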
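The join itself can be sketched as a left join from events to the stitching table, falling back to the unknown identifier when nothing better exists. The field names are assumptions for illustration, and the stitching table is assumed to be keyed by cookie id.

```python
def enrich_events(events, stitch_table):
    """Attach the stitched identity to every event (left join on cookie id)."""
    enriched = []
    for event in events:
        match = stitch_table.get(event["cookie_id"])
        enriched.append({
            **event,
            # Fall back to the cookie id when no better identity is known.
            "attributed_user_id": match["attributed_user_id"] if match else event["cookie_id"],
            "stitched_identity_type": match["stitched_identity_type"] if match else "cookie_id",
        })
    return enriched


if __name__ == "__main__":
    stitch_table = {
        "phone-cookie": {"attributed_user_id": "u-123", "stitched_identity_type": "user_id"},
        "laptop-cookie": {"attributed_user_id": "u-123", "stitched_identity_type": "user_id"},
    }
    events = [
        {"cookie_id": "phone-cookie", "event": "ad_click"},
        {"cookie_id": "laptop-cookie", "event": "purchase"},
        {"cookie_id": "other-cookie", "event": "page_view"},
    ]
    for row in enrich_events(events, stitch_table):
        print(row)
```

In the trainers example, this is the point where the ad click on the phone and the purchase on the laptop end up under the same attributed user id, provided both cookies have been associated with that user at some point.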
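One possible shared-device rule, taking the association that occurs most often, might look like the sketch below. It counts events per cookie id and user id pair; the field names are assumptions, and on a tie it keeps whichever user was seen first in the input.

```python
from collections import Counter


def most_frequent_user_per_cookie(events):
    """Resolve shared devices by keeping the user id seen most often per cookie id."""
    counts = Counter(
        (event["cookie_id"], event["user_id"])
        for event in events
        if event.get("user_id")  # ignore events with no known identifier
    )
    winners = {}
    for (cookie_id, user_id), n in counts.items():
        # Strict comparison: on a tie the first user encountered is kept.
        if cookie_id not in winners or n > winners[cookie_id][1]:
            winners[cookie_id] = (user_id, n)
    return {cookie_id: user_id for cookie_id, (user_id, _) in winners.items()}


if __name__ == "__main__":
    events = [
        {"cookie_id": "family-tablet", "user_id": "john"},
        {"cookie_id": "family-tablet", "user_id": "john"},
        {"cookie_id": "family-tablet", "user_id": "jane"},
        {"cookie_id": "family-tablet"},
    ]
    print(most_frequent_user_per_cookie(events))
```

Whichever rule you choose, the output has exactly one identity per unknown identifier, which is what prevents duplication in downstream joins.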
Things to note
A couple of things worth pointing out:
- If you aren’t able to create a unique browser id, the lifetime of your client side cookie is quite important. The longer the lifespan, the higher the level of resolution you will achieve. If, for example, your cookie only lasts for two weeks, a customer can browse your website without logging in and come back after two weeks, by which stage their cookie will have expired. They will be given a new cookie (with a new cookie id) and you won’t be able to stitch their previous journey.
- Client side tracking contains a mass of interesting information about the user, but it’s well within a user’s rights to restrict what you can record. They are also within their rights to block this information from being sent to you entirely, in which case stitching their identity can become extremely difficult. We used server side tracking to help overcome the limitations of client side tracking.
- This process can evolve to include information from other sources. For example, there may be a large proportion of traffic that we can only associate with an email address, when (in accordance with our hierarchy) we ideally want to get traffic up to internal user id level. We can simply reference another internal user table (collected elsewhere) that stores email addresses alongside user ids, and use it to update our identity stitched table with user ids.
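The last point can be sketched as a second pass over the stitching table, using a lookup from email address to user id that comes from elsewhere. The table shapes and field names are assumptions for illustration.

```python
def upgrade_email_to_user_id(stitch_table, email_to_user_id):
    """Promote email-level stitches to user ids where a mapping is known.

    Rows that are not stitched to an email address, or whose email address
    has no known user id, are returned unchanged.
    """
    upgraded = {}
    for unknown_id, row in stitch_table.items():
        user_id = None
        if row["stitched_identity_type"] == "email_address":
            user_id = email_to_user_id.get(row["attributed_user_id"])
        if user_id is not None:
            upgraded[unknown_id] = {
                "attributed_user_id": user_id,
                "stitched_identity_type": "user_id",
            }
        else:
            upgraded[unknown_id] = dict(row)
    return upgraded


if __name__ == "__main__":
    stitch_table = {
        "c1": {"attributed_user_id": "john@example.com", "stitched_identity_type": "email_address"},
        "c2": {"attributed_user_id": "c2", "stitched_identity_type": "cookie_id"},
    }
    # Hypothetical lookup sourced from an internal user table.
    users = {"john@example.com": "u-123"}
    print(upgrade_email_to_user_id(stitch_table, users))
```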
Next Steps
A high level of stitching will not be immediate. The nature of the process is simply that the more we collect, the better it gets. All it takes is for us to associate a user on a device with a known identifier once, and we can reclassify all traffic on that device to that person. With that in mind, we want to promote points in the user journey that let us collect this information as often as possible, while making sure the user’s experience of the site isn’t compromised in the process.
I’ve witnessed some creative ways to weave these into the user journey without impacting the experience. Here are some examples:
- Embed a user identifier in the hyperlinks included in email communications. When a user clicks a link in a promotional email, the internal user id is carried in their landing page URL, so even if they don’t log in we know who the user actually is. This is particularly helpful for businesses where the vast majority of conversions occur on a personal computer. The theory is that a lot of the inspiration behind a conversion occurs on a mobile device, and if a person never logs in or transacts on that mobile device, there will never be a record that associates its traffic with the user who later transacted on their personal computer.
- Create promotional content on the website that requires the user to log in using an email address or social media authentication. The goal here is to keep it to the bare minimum, where essentially all the user needs to do is press a button or enter their email address.
- Competition forms. This isn’t the primary goal of competitions, whose purpose should still be PR and customer acquisition, but while you’re running the competition it makes sense to also collect this information to better understand what devices users are using.
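The email link idea from the first bullet can be sketched with the standard library as below. The parameter name `uid` is hypothetical, and the sketch assumes you embed an opaque token that maps back to the internal user id rather than the raw id, since landing page URLs tend to end up in logs and referrers.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def add_user_token(url: str, token: str, param: str = "uid") -> str:
    """Return `url` with the user token added as a query parameter."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[param] = [token]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def read_user_token(landing_url: str, param: str = "uid"):
    """Return the user token from a landing page URL, or None if absent."""
    values = parse_qs(urlparse(landing_url).query).get(param)
    return values[0] if values else None


if __name__ == "__main__":
    link = add_user_token("https://www.bestkicks.com/trainers?utm_source=email", "tok-123")
    print(link)
    print(read_user_token(link))
```

The token read from the landing page URL can then be recorded on the event alongside the cookie id, which gives the associations table a known identifier for that device without a login.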
If you know or can think of any more, I’d love to hear them.
Thank you for taking the time to read this. If you haven’t seen my other pieces, this article is part of my implementation approach to modelling raw event data up to usable and scalable data assets for business intelligence level reporting: