Using Lag To Calculate Time Between Events in R and SQL
Last week, amidst building some dashboards for the customer experience team at work, I came across an interesting problem, which led me to a cool function called lag()
that I hadn’t come across before. It struck me as a pretty common problem type for time-series data, so I thought I’d make a short blog post about it both for my own sake, and in case it can help anyone else out.
The Problem
Our customer experience data comes from a messaging platform called Intercom. After some preliminary cleaning steps (damn you, nested JSON files!), my data looked broadly like this:
messages %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | message_from | created_at |
1 | user | 2020-07-20 18:29:25 |
1 | user | 2020-07-20 19:56:45 |
1 | admin | 2020-07-20 19:57:34 |
1 | admin | 2020-07-20 19:57:36 |
1 | admin | 2020-07-21 21:08:39 |
2 | user | 2020-07-20 18:57:19 |
2 | user | 2020-07-20 18:59:22 |
2 | admin | 2020-07-20 19:00:50 |
2 | admin | 2020-07-20 19:00:54 |
2 | admin | 2020-07-20 21:04:56 |
3 | user | 2020-07-20 19:29:27 |
3 | admin | 2020-07-20 19:30:07 |
Each row in the dataframe is a message, with a date-time stamp. Messages are grouped into conversations, indentified by the convo_id column. Within conversations, messages can be sent either from users, or from an admin from the customer experience team.
Now, extracting the time taken for the admin to respond to the first message in a conversation is fairly straightforward…
messages %>%
group_by(convo_id, message_from) %>%
summarise(first_message = min(created_at)) %>%
pivot_wider(names_from = message_from, values_from = first_message) %>%
mutate(first_response_time = round(difftime(admin, user, units = "mins"), 2)) %>%
select(convo_id, user_first_message = user, admin_first_reply = admin, first_response_time) %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | user_first_message | admin_first_reply | first_response_time |
1 | 2020-07-20 18:29:25 | 2020-07-20 19:57:34 | 88.15 mins |
2 | 2020-07-20 18:57:19 | 2020-07-20 19:00:50 | 3.52 mins |
3 | 2020-07-20 19:29:27 | 2020-07-20 19:30:07 | 0.67 mins |
4 | 2020-07-20 20:53:04 | 2020-07-20 20:54:56 | 1.87 mins |
5 | 2020-07-20 21:12:44 | 2020-07-21 08:05:45 | 653.02 mins |
6 | 2020-07-20 21:47:10 | 2020-07-20 23:57:01 | 129.85 mins |
7 | 2020-07-20 21:52:29 | 2020-07-20 23:55:59 | 123.50 mins |
8 | 2020-07-20 22:39:49 | 2020-07-21 00:05:10 | 85.35 mins |
9 | 2020-07-20 23:33:06 | 2020-07-21 00:10:33 | 37.45 mins |
10 | 2020-07-20 23:59:43 | 2020-07-21 00:01:16 | 1.55 mins |
11 | 2020-07-21 02:14:06 | 2020-07-21 02:15:26 | 1.33 mins |
12 | 2020-07-21 06:36:36 | 2020-07-21 06:39:02 | 2.43 mins |
And, since I actually working in SQL for my dashboards, here’s what that would look like in SQL:
WITH admin_reply AS (SELECT convo_id,
MIN(created_at) AS admin_first_reply
FROM messages
WHERE message_from = 'admin'
GROUP BY convo_id)
SELECT convo_id,
MIN(created_at) AS user_first_message,
admin_first_reply,
TIMESTAMP_DIFF(user_first_message, admin_first_reply, MINUTE) AS response_time
FROM messages
LEFT JOIN admin_reply
ON messages.convo_id = admin_reply.convo_id
WHERE message_from = 'user'
GROUP BY convo_id
But, what about if we want to extract the time taken for the admin to respond to each subsequent user message in a conversation?
To do this, we need a way to directly compare the date-time of each admin message to the date-time of the last user message in the conversation.
Note that users and admin may send multiple messages in a row so we can’t even assume that we’re always looking for the previous message in a conversation.
So what can we do?
Turns out what we need is a helpful function called lag()
.
lag()
is a neat function that pulls out the previous value in a vector. It’s accompanied by a sister function lead()
that pulls out the next value in a vector.
By combining lag()
with mutate()
we can pull the previous entry from the created_at column into a new column to compare a message with its predecessor side by side.
Note that group by conversation id, because we only want to pull the date-time of the previous message if it belongs to the same conversation. We also need to use lag()
on the message_from column to find out who sent the previous message in the conversation:
messages %>%
group_by(convo_id) %>%
mutate(last_message_from = lag(message_from),
last_message_at = lag(created_at)) %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | message_from | created_at | last_message_from | last_message_at |
1 | user | 2020-07-20 18:29:25 | NA | |
1 | user | 2020-07-20 19:56:45 | user | 2020-07-20 18:29:25 |
1 | admin | 2020-07-20 19:57:34 | user | 2020-07-20 19:56:45 |
1 | admin | 2020-07-20 19:57:36 | admin | 2020-07-20 19:57:34 |
1 | admin | 2020-07-21 21:08:39 | admin | 2020-07-20 19:57:36 |
2 | user | 2020-07-20 18:57:19 | NA | |
2 | user | 2020-07-20 18:59:22 | user | 2020-07-20 18:57:19 |
2 | admin | 2020-07-20 19:00:50 | user | 2020-07-20 18:59:22 |
2 | admin | 2020-07-20 19:00:54 | admin | 2020-07-20 19:00:50 |
2 | admin | 2020-07-20 21:04:56 | admin | 2020-07-20 19:00:54 |
3 | user | 2020-07-20 19:29:27 | NA | |
3 | admin | 2020-07-20 19:30:07 | user | 2020-07-20 19:29:27 |
Now, we’re only interested in the response time of an admin to a user message, so we need to filter for messages sent by admin, where the previous message was from a user. Once we’ve done this, we can calculate the time taken for the admin to respond:
messages %>%
group_by(convo_id) %>%
mutate(last_message_from = lag(message_from),
last_message_at = lag(created_at)) %>%
filter(message_from == "admin" & last_message_from == "user") %>%
mutate(response_time = round(difftime(created_at, last_message_at, units = "mins"), 2)) %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | message_from | created_at | last_message_from | last_message_at | response_time |
1 | admin | 2020-07-20 19:57:34 | user | 2020-07-20 19:56:45 | 0.82 mins |
2 | admin | 2020-07-20 19:00:50 | user | 2020-07-20 18:59:22 | 1.47 mins |
3 | admin | 2020-07-20 19:30:07 | user | 2020-07-20 19:29:27 | 0.67 mins |
4 | admin | 2020-07-20 20:54:56 | user | 2020-07-20 20:53:04 | 1.87 mins |
4 | admin | 2020-07-20 20:56:14 | user | 2020-07-20 20:55:27 | 0.78 mins |
4 | admin | 2020-07-20 20:58:52 | user | 2020-07-20 20:56:38 | 2.23 mins |
4 | admin | 2020-07-20 21:01:00 | user | 2020-07-20 20:59:40 | 1.33 mins |
4 | admin | 2020-07-20 21:01:26 | user | 2020-07-20 21:01:25 | 0.02 mins |
4 | admin | 2020-07-20 21:04:13 | user | 2020-07-20 21:02:14 | 1.98 mins |
4 | admin | 2020-07-20 21:04:43 | user | 2020-07-20 21:04:42 | 0.02 mins |
4 | admin | 2020-07-20 23:59:06 | user | 2020-07-20 21:05:35 | 173.52 mins |
5 | admin | 2020-07-21 08:05:45 | user | 2020-07-20 21:13:39 | 652.10 mins |
I wasn’t totally sure how this would translate into SQL, but thankfully, SQL also has a helpful LAG() function. So here’s how my solution looks in SQL:
WITH lagged_messages AS (SELECT convo_id,
message_from,
created_at,
LAG(user_type) OVER
(PARTITION BY convo_id ORDER BY created_at) AS last_message_from,
LAG(created_at) OVER
(PARTITION BY convo_id ORDER BY created_at) AS last_message_at
FROM messages)
SELECT convo_id,
message_from,
created_at,
last_message_from,
last_message_at,
TIMESTAMP_DIFF(created_at, last_message_at, MINUTE) AS response_time
FROM lagged_messages
WHERE message_from != last_message_from
For extra credit, here’s a final fun problem: As mentioned earlier, sometimes, users and admin send multiple messages in a row.
Currently, we have calculated the time between the user’s last message and the admin’s response.
But is that really what we want?
For instance, what if a user sends a message, waits around… and eventually, losing patience, hits the admin up with another message?
In this case, taking the date-time of the user’s last message before the admin response could lead to severely understimating the admin’s response time.
Instead, what we want is the date-time of the user’s first message in the series preceding the admin response.
To get this, all we need is, yet again, lag()
!
First, we need to lag the message_from column, to find out who sent the previous message in a conversation chain:
messages %>%
group_by(convo_id) %>%
mutate(last_message_from = lag(message_from)) %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | message_from | created_at | last_message_from |
1 | user | 2020-07-20 18:29:25 | |
1 | user | 2020-07-20 19:56:45 | user |
1 | admin | 2020-07-20 19:57:34 | user |
1 | admin | 2020-07-20 19:57:36 | admin |
1 | admin | 2020-07-21 21:08:39 | admin |
2 | user | 2020-07-20 18:57:19 | |
2 | user | 2020-07-20 18:59:22 | user |
2 | admin | 2020-07-20 19:00:50 | user |
2 | admin | 2020-07-20 19:00:54 | admin |
2 | admin | 2020-07-20 21:04:56 | admin |
3 | user | 2020-07-20 19:29:27 | |
3 | admin | 2020-07-20 19:30:07 | user |
At this point, we want to filter out any instances where the user or admin sends a second or third message in a row. We can do this by filtering out any rows where the message_from and last_message_from columns have the same value:
messages %>%
group_by(convo_id) %>%
mutate(last_message_from = lag(message_from),
last_message_from = ifelse(is.na(last_message_from), "none", last_message_from)) %>%
filter(message_from != last_message_from) %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | message_from | created_at | last_message_from |
1 | user | 2020-07-20 18:29:25 | none |
1 | admin | 2020-07-20 19:57:34 | user |
2 | user | 2020-07-20 18:57:19 | none |
2 | admin | 2020-07-20 19:00:50 | user |
3 | user | 2020-07-20 19:29:27 | none |
3 | admin | 2020-07-20 19:30:07 | user |
4 | user | 2020-07-20 20:53:04 | none |
4 | admin | 2020-07-20 20:54:56 | user |
4 | user | 2020-07-20 20:55:27 | admin |
4 | admin | 2020-07-20 20:56:14 | user |
4 | user | 2020-07-20 20:56:38 | admin |
4 | admin | 2020-07-20 20:58:52 | user |
At this point, we can use lag()
again to get the date-time of the previous message in the conversation chain. The next steps are the same as in our first example; filter to keep only admin messages that were preceded by a user message, and caclulate response time:
messages %>%
group_by(convo_id) %>%
mutate(last_message_from = lag(message_from),
last_message_from = ifelse(is.na(last_message_from), "none", last_message_from)) %>%
filter(message_from != last_message_from) %>%
group_by(convo_id) %>%
mutate(last_message_at = lag(created_at)) %>%
filter(message_from == "admin" & last_message_from == "user") %>%
mutate(response_time = round(difftime(created_at, last_message_at, units = "mins"), 2)) %>%
head(12) %>%
regulartable() %>%
autofit()
convo_id | message_from | created_at | last_message_from | last_message_at | response_time |
1 | admin | 2020-07-20 19:57:34 | user | 2020-07-20 18:29:25 | 88.15 mins |
2 | admin | 2020-07-20 19:00:50 | user | 2020-07-20 18:57:19 | 3.52 mins |
3 | admin | 2020-07-20 19:30:07 | user | 2020-07-20 19:29:27 | 0.67 mins |
4 | admin | 2020-07-20 20:54:56 | user | 2020-07-20 20:53:04 | 1.87 mins |
4 | admin | 2020-07-20 20:56:14 | user | 2020-07-20 20:55:27 | 0.78 mins |
4 | admin | 2020-07-20 20:58:52 | user | 2020-07-20 20:56:38 | 2.23 mins |
4 | admin | 2020-07-20 21:01:00 | user | 2020-07-20 20:59:40 | 1.33 mins |
4 | admin | 2020-07-20 21:01:26 | user | 2020-07-20 21:01:25 | 0.02 mins |
4 | admin | 2020-07-20 21:04:13 | user | 2020-07-20 21:01:40 | 2.55 mins |
4 | admin | 2020-07-20 21:04:43 | user | 2020-07-20 21:04:42 | 0.02 mins |
4 | admin | 2020-07-20 23:59:06 | user | 2020-07-20 21:05:14 | 173.87 mins |
5 | admin | 2020-07-21 08:05:45 | user | 2020-07-20 21:12:44 | 653.02 mins |
Finally, this is my SQL solution to the same extended problem:
WITH lagged_users AS (SELECT convo_id,
message_from,
created_at,
LAG(message_from) OVER
(PARTITION BY convo_id ORDER BY created_at) AS last_message_from
FROM messages),
lag_times AS (SELECT convo_id,
message_from,
created_at,
last_message_from,
LAG(created_at) OVER
(PARTITION BY convo_id ORDER BY created_at) AS last_message_at
FROM lagged_users
WHERE message_from != last_message_from)
SELECT convo_id,
message_from,
created_at,
last_message_from,
last_message_at,
TIMESTAMP_DIFF(created_at, last_message_at, MINUTE) AS response_time
FROM lag_times
WHERE user_type = 'admin' AND prev_user_type = 'user'
Additional Code
The Intercom data structure is a little more complex than the dataframe I used to go through the example. For anyone grappling with Intercom data, I’ve left the full SQL code that I used to extract response times below.
WITH first_message AS (SELECT conversations.id AS convo_id,
MAX(user.id) AS user_id,
MAX(user.type) AS user_type,
MIN(conversations.created_at) AS created_at
FROM `intercom.conversations` AS conversations
GROUP BY convo_id),
next_messages AS (SELECT conversations.id AS convo_id,
cp.value.author.id AS user_id,
cp.value.author.type AS user_type,
cp.value.created_at AS created_at
FROM `intercom.conversations` AS conversations,
UNNEST (conversation_parts) AS cp),
all_messages AS (SELECT * FROM first_message
UNION ALL
SELECT * FROM next_messages
WHERE user_type <> "bot"
ORDER BY convo_id, created_at),
lagged_users AS (SELECT convo_id,
user_id,
user_type,
created_at,
LAG(user_type) OVER (PARTITION BY convo_id ORDER BY created_at) AS prev_user_type
FROM all_messages),
tidy_lagged_users AS (SELECT convo_id,
user_id,
(CASE WHEN user_type = 'lead' THEN 'user'
WHEN user_type IS NULL THEN 'none'
ELSE user_type
END) AS user_type,
created_at,
(CASE WHEN prev_user_type = 'lead' THEN 'user'
WHEN prev_user_type IS NULL THEN 'none'
ELSE prev_user_type
END) AS prev_user_type
FROM lagged_users),
lag_times AS (SELECT convo_id,
user_id,
user_type,
created_at,
prev_user_type,
LAG(created_at) OVER (PARTITION BY convo_id ORDER BY created_at) AS prev_message_time
FROM tidy_lagged_users
WHERE user_type != prev_user_type)
SELECT convo_id,
user_id AS admin_id,
name AS admin_name,
user_type,
created_at,
prev_user_type,
prev_message_time,
TIMESTAMP_DIFF(created_at, prev_message_time, MINUTE) AS response_time
FROM lag_times
LEFT JOIN `intercom.admins` AS admins
ON lag_times.user_id = admins.id
WHERE user_type = 'admin' AND prev_user_type = 'user'