Time to sessionise

Robin Patel
9 min read · Dec 1, 2020


Grouping events together to associate traffic sources with user behaviour

The purpose of this article is to explain the basis of sessionisation and how I implemented a business definition of a session using Snowplow event level data on Snowflake.

I hope you’ve had a chance to read my two previous posts that lead into this one; they describe how I’ve modelled data in preparation for sessionisation.

What is sessionisation?

A session is a group of user interactions with your website that take place within a given time frame. — Google Analytics

Sessionisation is the process of grouping a person’s events together into sessions to articulate what that person did on your website/app in a single sitting. The main point of it is to help identify which sources generated traffic that did something of interest; these ‘things of interest’ can be anything from member sign ups to conversions. This data can then be utilised to optimise traffic based on the type of behaviour it generates or to increase efficiency of spend (return on investment, ROI).

Sessionisation approaches

There are two main approaches to creating sessions: pre- and post-data collection.

Pre-collection sessionisation

Pre-collection sessionisation is where the responsibility is put on the event tracker to define which events belong together in a session. Most good event collector services offer pre-collection sessionisation out of the box. It works by the tracker generating a session id (normally by encrypting a concatenation of the timestamp, browser identifier, etc.) and assigning it to each event. This session id is preserved in a cookie stored on the user’s computer. The cookie has an expiry time (usually 30 minutes) and becomes invalid after this time. When a user visits the site, the tracker checks whether a valid cookie exists; if it does, it uses the session id stored in the cookie, and if no valid cookie exists, it generates one.

Limitations:

  • No real flexibility
  • Requires the user to agree to store cookies
  • Vulnerable to client-side tracking gaps, e.g. ad blockers, data transmission errors

Post-collection sessionisation

Post-collection sessionisation is handled by modelling event level data into sessions. This doesn’t rely on a tracker to define a session and therefore adds flexibility to adjust the notion of a session to meet business requirements.

Limitations:

  • Requires advanced data modelling
  • Can be costly to compute

So which one should you use? In most cases the pre-collection sessionisation approach is the way to go: it makes things a lot easier, there’s no need to create complex modelling logic to sessionise your data, there are no additional compute costs to group events into sessions, and it’s quicker to implement. If, however, like me, the limitations of pre-collection sessionisation aren’t something you can afford to compromise on, and you have a passion for ensuring your data is the best representation of your business, then embark on post-collection sessionisation via data modelling.

Define what a session is for your business

If you are still on the fence about whether to use pre- or post-collection defined sessions, then this step might help you with your decision. Start out by working out what your company deems a ‘sitting’ of continuous events for a user: for example, does a duration of inactivity, switching devices, a new marketing signal, or some combination of these constitute a new session?

In my situation, the business had various marketing initiatives that were generating site/app traffic. Many of these were overlapping, and therefore it was important that new marketing signals on an event signified the start of a new session. The Google Analytics definition of a session accounts for this: if a person is on a site, goes away and returns to the site within 30 minutes but with new marketing parameters, this is counted as a new session. However, the Snowplow pre-collection defined session doesn’t handle this, and rightly so; the definition of which marketing signals should or shouldn’t mark the start of a new session can be very business specific. We defined our session expiry rules as: new marketing parameters, a duration of 30 minutes of inactivity, or a change in device.

How to sessionise using SQL

The general premise of implementing sessionisation is to group events together by user, order them chronologically, and then subdivide these events into sessions based on the business decisions mentioned in the previous section.

I mentioned that my previous posts helped set up some foundational datasets to assist with sessionisation. Identity stitching allows us to group events together by assigning the ‘best’ user identifier to traffic based on the identifiers that exist on each event. Modelling url strings and categorising them allows us to divide events based on whether or not url parameters or referrer categories have been observed. These, combined with the event timestamps and the device/platform assigned to each event, were all the pieces we needed to meet the business requirement.

My approach to sessionising on these criteria was to break down the division logic by computing each criterion separately, then compute sessions whenever any of the criteria signal a new session.

SQL window functions are an integral part of this implementation, so if you aren’t familiar with them I would advise you to read up on them. Here is a great post by Sherwin Zheng demonstrating how to use them.

Note the dialect of SQL I’m using for these examples is Snowflake.

Time of inactivity

The premise here is quite simple: for every event we want to check whether the previous event by the same user was more or less than 30 minutes prior to the timestamp of the current event. If it is more, we set time_diff_flag = 1; if it isn’t, we set time_diff_flag = 0. We can then cumulatively sum the time_diff_flag column and it will group events together based on an expiry of 30 minutes.

The LAG window function (present in most common SQL dialects) lets you fetch a column’s value from a previous row, grouped by a common partition and sorted in your desired order. Here’s some example pseudo code:

LAG(event_tstamp) OVER (PARTITION BY attributed_user_id ORDER BY event_tstamp) AS previous_tstamp

This will produce the previous event’s timestamp for each event, per attributed user id. We can then compare this to the event_tstamp of the current event and test whether the gap is more or less than 30 minutes.

TIMEDIFF(second, LAG(event_tstamp) 
OVER (PARTITION BY attributed_user_id ORDER BY event_tstamp), event_tstamp) < 1800 --30 mins

Important to note: if this is the user’s first event, the LAG function will return NULL, so you’ll want to handle this when you assign the time_diff_flag.
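
Putting those pieces together, here’s a minimal sketch of the full flag assignment (my illustration rather than the exact production code), treating the NULL from a user’s first event as the start of a new session:

IFF(
  COALESCE(
    TIMEDIFF(second,
             LAG(event_tstamp) OVER (PARTITION BY attributed_user_id ORDER BY event_tstamp),
             event_tstamp),
    1800 -- first event for a user: LAG returns NULL, so force a new session
  ) >= 1800, -- 30 minutes or more of inactivity
1, 0) AS time_diff_flag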

Example time_diff_flag

We can then use the SUM window function with a sliding window to compute the cumulative sum of the time_diff_flag column to group events together based on lag difference.

SUM(time_diff_flag)
OVER (PARTITION BY attributed_user_id ORDER BY event_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS time_diff_partition
SUM window function with sliding window
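
To make the two steps concrete, here’s a hypothetical run of events for a single user (made-up timestamps, not data from the screenshots):

event_tstamp           time_diff_flag   time_diff_partition
2020-12-01 10:00:00    1                1    <- first event, LAG is NULL
2020-12-01 10:05:00    0                1
2020-12-01 10:20:00    0                1
2020-12-01 11:30:00    1                2    <- 70 minute gap, new sitting
2020-12-01 11:32:00    0                2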

Change in marketing signals

Change in marketing signals can be quite business specific; it requires knowledge of how your business is using marketing parameters to drive traffic. The Snowplow enrichment service does extract utm params from url strings and harmonises them into standard fields (mkt_medium, mkt_source, etc.). If you are not using their enrichment, or want personal control over which url parameters get harmonised, you can simply do this yourself by COALESCE’ing the parameters together. In my implementation I only needed to do this for paid ad click ids.

COALESCE(
a.url_parameters:gclid::VARCHAR,
a.url_parameters:msclkid::VARCHAR,
a.url_parameters:dclid::VARCHAR,
a.url_parameters:clickid::VARCHAR,
a.url_parameters:fbclid::VARCHAR
) AS click_id

I advise discussing with your marketing and other traffic-driving departments (e.g. CRM) the explicit list of url parameters in use that would act as signals identifying new traffic. The list we devised was utm_campaign, utm_medium, utm_source, utm_term, utm_content and click_id. A change in any of the above parameters would signal the start of a new session.

With each event you also have the referrer url, which is the url of the page the user was on immediately before this event. These referrers can be internal to your website (if the event is not the landing page of a session) or external to your website (if this is the landing page of a session or part of an external redirect like a payment gateway).

The URL hostnames we categorised earlier can be used to identify the referrer medium of each event. The reason we need these is that the presence of url parameters is not enough: sometimes someone will return to the website without any url parameters (usually from organic search or partner sites), and when they do, the utm params and the click id will often be null. Note that almost all events will have a referrer, so we utilise the hostname categorisation we did previously to only attach the referrer medium for hostnames that we consider are not part of the user journey within the website, i.e. any referrer medium other than internal, payment gateway and oauth.
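
As an illustrative sketch, assuming a referrer_category field produced by that earlier hostname categorisation:

-- keep the referrer medium only for hostnames outside our own user journey
IFF(referrer_category IN ('internal', 'payment_gateway', 'oauth'),
    NULL,
    referrer_category) AS referrer_medium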

So our marketing signal for the initiation of a new session will be a combination of changes in the shortlisted url parameters and the referrer medium.

Similarly to the time diff flag, I first create a field (the marketing signals flag) that is a combination of the utm params discussed above and the referrer medium; this can be done using a CONCAT function. The result will look something like this:

Example marketing_signals_flag
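
Since the screenshot doesn’t reproduce in text, here’s a sketch of how such a flag might be built (field names assumed; note that Snowflake’s CONCAT_WS returns NULL if any input is NULL, hence the inner COALESCEs):

IFF(COALESCE(mkt_campaign, mkt_medium, mkt_source, mkt_term,
             mkt_content, click_id, referrer_medium) IS NULL,
    NULL, -- no new marketing signal on this event
    CONCAT_WS('|',
              COALESCE(mkt_campaign, ''), COALESCE(mkt_medium, ''),
              COALESCE(mkt_source, ''), COALESCE(mkt_term, ''),
              COALESCE(mkt_content, ''), COALESCE(click_id, ''),
              COALESCE(referrer_medium, ''))
) AS marketing_signals_flag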

One of the important features I leverage for the next step is the IGNORE NULLS option within window functions. The LAST_VALUE window function, combined with a sliding window and the IGNORE NULLS argument, allows us to persist the flag down through the events that aren’t landing pages.

LAST_VALUE(marketing_signals_flag) 
IGNORE NULLS OVER (PARTITION BY attributed_user_id
ORDER BY event_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS marketing_signals_partition
LAST_VALUE window function with ignore nulls and sliding window

Sessionising

Now that we’ve computed the two main partitions for sessionisation to match the business’s needs, all that’s left to do is partition based on the combination of these preprocessed ‘partitions’.

I decided to identify my sessions by using the unique id of the first event in the session; this has a few benefits:

  1. We don’t need to compute a new unique id; each event id is unique, so the first id for any grouping of events has to be unique.
  2. It is an easy link back to the original event stream to find the event that is considered the landing event for each session, for debugging and future enrichment.
  3. It aids with incremental loading: when incrementally loading sessions you might find that you have to retract and reprocess sessions (because of late-arriving data, or because processing started midway through a session).

With that in mind, the final step becomes a simple FIRST_VALUE window function partitioned by the time_diff_partition, the marketing_signals_partition and the device_platform.

Note that even if you don’t want to sessionise cross-platform events together, you should still include the device type/platform in your partitioning logic, because sessions can occur on multiple devices at the same time, e.g. someone could be browsing on their phone and their computer simultaneously.

FIRST_VALUE(event_id)
OVER (PARTITION BY attributed_user_id,
time_diff_partition,
marketing_signals_partition,
device_platform
ORDER BY event_tstamp) AS session_id
FIRST_VALUE window function

As you can see from the example above, it has persisted the event_id from the landing page event as the session_id for events that are grouped together in a session.

The output table generated by this process is an event level link table: it works as a way of joining session level tables to event level tables, and is used for downstream modelling to aggregate events up to a session. I persist a few key fields that are helpful when switching grain between sessions and events.

Example sessionisation table
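
To tie the steps together, here’s a minimal end-to-end sketch of the whole flow (my illustration; the source table name and some field names are assumed):

WITH flagged AS (
    SELECT
        event_id,
        attributed_user_id,
        device_platform,
        event_tstamp,
        IFF(COALESCE(TIMEDIFF(second,
                LAG(event_tstamp) OVER (PARTITION BY attributed_user_id ORDER BY event_tstamp),
                event_tstamp), 1800) >= 1800, 1, 0) AS time_diff_flag,
        marketing_signals_flag -- built as sketched earlier
    FROM events -- assumed source: the modelled event stream
),
partitioned AS (
    SELECT
        *,
        SUM(time_diff_flag) OVER (PARTITION BY attributed_user_id ORDER BY event_tstamp
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS time_diff_partition,
        LAST_VALUE(marketing_signals_flag) IGNORE NULLS OVER (PARTITION BY attributed_user_id
            ORDER BY event_tstamp
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS marketing_signals_partition
    FROM flagged
)
SELECT
    event_id,
    attributed_user_id,
    event_tstamp,
    device_platform,
    FIRST_VALUE(event_id) OVER (PARTITION BY attributed_user_id, time_diff_partition,
        marketing_signals_partition, device_platform
        ORDER BY event_tstamp) AS session_id
FROM partitioned;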

Again, thank you for taking the time to read this article; I hope these posts are helpful for anyone trying to solve the same problems. I welcome any feedback. In the next one I’ll talk about making session level attributes to help you categorise each session, which will feed into channelling and then attribution.
