Using Lag To Calculate Time Between Events in R and SQL

July 29, 2020 9-minute read

Last week, amidst building some dashboards for the customer experience team at work, I came across an interesting problem, which led me to a cool function called lag() that I hadn’t come across before. It struck me as a pretty common problem type for time-series data, so I thought I’d make a short blog post about it both for my own sake, and in case it can help anyone else out.

The Problem

Our customer experience data comes from a messaging platform called Intercom. After some preliminary cleaning steps (damn you, nested JSON files!), my data looked broadly like this:

messages %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	message_from	created_at
1	user	2020-07-20 18:29:25
1	user	2020-07-20 19:56:45
1	admin	2020-07-20 19:57:34
1	admin	2020-07-20 19:57:36
1	admin	2020-07-21 21:08:39
2	user	2020-07-20 18:57:19
2	user	2020-07-20 18:59:22
2	admin	2020-07-20 19:00:50
2	admin	2020-07-20 19:00:54
2	admin	2020-07-20 21:04:56
3	user	2020-07-20 19:29:27
3	admin	2020-07-20 19:30:07

Each row in the dataframe is a message, with a date-time stamp. Messages are grouped into conversations, indentified by the convo_id column. Within conversations, messages can be sent either from users, or from an admin from the customer experience team.

Now, extracting the time taken for the admin to respond to the first message in a conversation is fairly straightforward…

messages %>% 
  group_by(convo_id, message_from) %>% 
  summarise(first_message = min(created_at)) %>% 
  pivot_wider(names_from = message_from, values_from = first_message) %>% 
  mutate(first_response_time = round(difftime(admin, user, units = "mins"), 2)) %>% 
  select(convo_id, user_first_message = user, admin_first_reply = admin, first_response_time) %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	user_first_message	admin_first_reply	first_response_time
1	2020-07-20 18:29:25	2020-07-20 19:57:34	88.15 mins
2	2020-07-20 18:57:19	2020-07-20 19:00:50	3.52 mins
3	2020-07-20 19:29:27	2020-07-20 19:30:07	0.67 mins
4	2020-07-20 20:53:04	2020-07-20 20:54:56	1.87 mins
5	2020-07-20 21:12:44	2020-07-21 08:05:45	653.02 mins
6	2020-07-20 21:47:10	2020-07-20 23:57:01	129.85 mins
7	2020-07-20 21:52:29	2020-07-20 23:55:59	123.50 mins
8	2020-07-20 22:39:49	2020-07-21 00:05:10	85.35 mins
9	2020-07-20 23:33:06	2020-07-21 00:10:33	37.45 mins
10	2020-07-20 23:59:43	2020-07-21 00:01:16	1.55 mins
11	2020-07-21 02:14:06	2020-07-21 02:15:26	1.33 mins
12	2020-07-21 06:36:36	2020-07-21 06:39:02	2.43 mins

And, since I actually working in SQL for my dashboards, here’s what that would look like in SQL:

WITH admin_reply AS (SELECT convo_id,
                            MIN(created_at) AS admin_first_reply
                     FROM messages
                     WHERE message_from = 'admin'
                     GROUP BY convo_id)
SELECT convo_id,
       MIN(created_at) AS user_first_message,
       admin_first_reply,
       TIMESTAMP_DIFF(user_first_message, admin_first_reply, MINUTE) AS response_time
FROM messages
LEFT JOIN admin_reply
  ON messages.convo_id = admin_reply.convo_id
WHERE message_from = 'user'
GROUP BY convo_id

But, what about if we want to extract the time taken for the admin to respond to each subsequent user message in a conversation?

To do this, we need a way to directly compare the date-time of each admin message to the date-time of the last user message in the conversation.

Note that users and admin may send multiple messages in a row so we can’t even assume that we’re always looking for the previous message in a conversation.

So what can we do?

Turns out what we need is a helpful function called lag().

lag() is a neat function that pulls out the previous value in a vector. It’s accompanied by a sister function lead() that pulls out the next value in a vector.

By combining lag() with mutate() we can pull the previous entry from the created_at column into a new column to compare a message with its predecessor side by side.

Note that group by conversation id, because we only want to pull the date-time of the previous message if it belongs to the same conversation. We also need to use lag() on the message_from column to find out who sent the previous message in the conversation:

messages %>% 
  group_by(convo_id) %>% 
  mutate(last_message_from = lag(message_from),
         last_message_at = lag(created_at)) %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	message_from	created_at	last_message_from	last_message_at
1	user	2020-07-20 18:29:25		NA
1	user	2020-07-20 19:56:45	user	2020-07-20 18:29:25
1	admin	2020-07-20 19:57:34	user	2020-07-20 19:56:45
1	admin	2020-07-20 19:57:36	admin	2020-07-20 19:57:34
1	admin	2020-07-21 21:08:39	admin	2020-07-20 19:57:36
2	user	2020-07-20 18:57:19		NA
2	user	2020-07-20 18:59:22	user	2020-07-20 18:57:19
2	admin	2020-07-20 19:00:50	user	2020-07-20 18:59:22
2	admin	2020-07-20 19:00:54	admin	2020-07-20 19:00:50
2	admin	2020-07-20 21:04:56	admin	2020-07-20 19:00:54
3	user	2020-07-20 19:29:27		NA
3	admin	2020-07-20 19:30:07	user	2020-07-20 19:29:27

Now, we’re only interested in the response time of an admin to a user message, so we need to filter for messages sent by admin, where the previous message was from a user. Once we’ve done this, we can calculate the time taken for the admin to respond:

messages %>% 
  group_by(convo_id) %>% 
  mutate(last_message_from = lag(message_from),
         last_message_at = lag(created_at)) %>% 
  filter(message_from == "admin" & last_message_from == "user") %>% 
  mutate(response_time = round(difftime(created_at, last_message_at, units = "mins"), 2)) %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	message_from	created_at	last_message_from	last_message_at	response_time
1	admin	2020-07-20 19:57:34	user	2020-07-20 19:56:45	0.82 mins
2	admin	2020-07-20 19:00:50	user	2020-07-20 18:59:22	1.47 mins
3	admin	2020-07-20 19:30:07	user	2020-07-20 19:29:27	0.67 mins
4	admin	2020-07-20 20:54:56	user	2020-07-20 20:53:04	1.87 mins
4	admin	2020-07-20 20:56:14	user	2020-07-20 20:55:27	0.78 mins
4	admin	2020-07-20 20:58:52	user	2020-07-20 20:56:38	2.23 mins
4	admin	2020-07-20 21:01:00	user	2020-07-20 20:59:40	1.33 mins
4	admin	2020-07-20 21:01:26	user	2020-07-20 21:01:25	0.02 mins
4	admin	2020-07-20 21:04:13	user	2020-07-20 21:02:14	1.98 mins
4	admin	2020-07-20 21:04:43	user	2020-07-20 21:04:42	0.02 mins
4	admin	2020-07-20 23:59:06	user	2020-07-20 21:05:35	173.52 mins
5	admin	2020-07-21 08:05:45	user	2020-07-20 21:13:39	652.10 mins

I wasn’t totally sure how this would translate into SQL, but thankfully, SQL also has a helpful LAG() function. So here’s how my solution looks in SQL:

WITH lagged_messages AS (SELECT convo_id,
                                message_from,
                                created_at,
                                LAG(user_type) OVER 
                                  (PARTITION BY convo_id ORDER BY created_at) AS last_message_from,
                                LAG(created_at) OVER 
                                  (PARTITION BY convo_id ORDER BY created_at) AS last_message_at
                         FROM messages)
SELECT convo_id,
       message_from,
       created_at,
       last_message_from,
       last_message_at,
       TIMESTAMP_DIFF(created_at, last_message_at, MINUTE) AS response_time
FROM lagged_messages
WHERE message_from != last_message_from

For extra credit, here’s a final fun problem: As mentioned earlier, sometimes, users and admin send multiple messages in a row.

Currently, we have calculated the time between the user’s last message and the admin’s response.

But is that really what we want?

For instance, what if a user sends a message, waits around… and eventually, losing patience, hits the admin up with another message?

In this case, taking the date-time of the user’s last message before the admin response could lead to severely understimating the admin’s response time.

Instead, what we want is the date-time of the user’s first message in the series preceding the admin response.

To get this, all we need is, yet again, lag()!

First, we need to lag the message_from column, to find out who sent the previous message in a conversation chain:

messages %>% 
  group_by(convo_id) %>% 
  mutate(last_message_from = lag(message_from)) %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	message_from	created_at	last_message_from
1	user	2020-07-20 18:29:25
1	user	2020-07-20 19:56:45	user
1	admin	2020-07-20 19:57:34	user
1	admin	2020-07-20 19:57:36	admin
1	admin	2020-07-21 21:08:39	admin
2	user	2020-07-20 18:57:19
2	user	2020-07-20 18:59:22	user
2	admin	2020-07-20 19:00:50	user
2	admin	2020-07-20 19:00:54	admin
2	admin	2020-07-20 21:04:56	admin
3	user	2020-07-20 19:29:27
3	admin	2020-07-20 19:30:07	user

At this point, we want to filter out any instances where the user or admin sends a second or third message in a row. We can do this by filtering out any rows where the message_from and last_message_from columns have the same value:

messages %>% 
  group_by(convo_id) %>% 
  mutate(last_message_from = lag(message_from),
         last_message_from = ifelse(is.na(last_message_from), "none", last_message_from)) %>% 
  filter(message_from != last_message_from) %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	message_from	created_at	last_message_from
1	user	2020-07-20 18:29:25	none
1	admin	2020-07-20 19:57:34	user
2	user	2020-07-20 18:57:19	none
2	admin	2020-07-20 19:00:50	user
3	user	2020-07-20 19:29:27	none
3	admin	2020-07-20 19:30:07	user
4	user	2020-07-20 20:53:04	none
4	admin	2020-07-20 20:54:56	user
4	user	2020-07-20 20:55:27	admin
4	admin	2020-07-20 20:56:14	user
4	user	2020-07-20 20:56:38	admin
4	admin	2020-07-20 20:58:52	user

At this point, we can use lag() again to get the date-time of the previous message in the conversation chain. The next steps are the same as in our first example; filter to keep only admin messages that were preceded by a user message, and caclulate response time:

messages %>% 
  group_by(convo_id) %>% 
  mutate(last_message_from = lag(message_from),
         last_message_from = ifelse(is.na(last_message_from), "none", last_message_from)) %>% 
  filter(message_from != last_message_from) %>% 
  group_by(convo_id) %>% 
  mutate(last_message_at = lag(created_at)) %>% 
  filter(message_from == "admin" & last_message_from == "user") %>% 
  mutate(response_time = round(difftime(created_at, last_message_at, units = "mins"), 2)) %>% 
  head(12) %>% 
  regulartable() %>% 
  autofit()

convo_id	message_from	created_at	last_message_from	last_message_at	response_time
1	admin	2020-07-20 19:57:34	user	2020-07-20 18:29:25	88.15 mins
2	admin	2020-07-20 19:00:50	user	2020-07-20 18:57:19	3.52 mins
3	admin	2020-07-20 19:30:07	user	2020-07-20 19:29:27	0.67 mins
4	admin	2020-07-20 20:54:56	user	2020-07-20 20:53:04	1.87 mins
4	admin	2020-07-20 20:56:14	user	2020-07-20 20:55:27	0.78 mins
4	admin	2020-07-20 20:58:52	user	2020-07-20 20:56:38	2.23 mins
4	admin	2020-07-20 21:01:00	user	2020-07-20 20:59:40	1.33 mins
4	admin	2020-07-20 21:01:26	user	2020-07-20 21:01:25	0.02 mins
4	admin	2020-07-20 21:04:13	user	2020-07-20 21:01:40	2.55 mins
4	admin	2020-07-20 21:04:43	user	2020-07-20 21:04:42	0.02 mins
4	admin	2020-07-20 23:59:06	user	2020-07-20 21:05:14	173.87 mins
5	admin	2020-07-21 08:05:45	user	2020-07-20 21:12:44	653.02 mins

Finally, this is my SQL solution to the same extended problem:

WITH lagged_users AS (SELECT convo_id,
                             message_from,
                             created_at,
                             LAG(message_from) OVER 
                              (PARTITION BY convo_id ORDER BY created_at) AS last_message_from
                       FROM messages),
     lag_times AS (SELECT convo_id,
                          message_from,
                          created_at,
                          last_message_from,
                          LAG(created_at) OVER 
                            (PARTITION BY convo_id ORDER BY created_at) AS last_message_at
                   FROM lagged_users
                   WHERE message_from != last_message_from)
SELECT convo_id,
       message_from,
       created_at,
       last_message_from,
       last_message_at,
       TIMESTAMP_DIFF(created_at, last_message_at, MINUTE) AS response_time
FROM lag_times
WHERE user_type = 'admin' AND prev_user_type = 'user'

Additional Code

The Intercom data structure is a little more complex than the dataframe I used to go through the example. For anyone grappling with Intercom data, I’ve left the full SQL code that I used to extract response times below.

WITH first_message AS (SELECT conversations.id AS convo_id,
                              MAX(user.id) AS user_id,
                              MAX(user.type) AS user_type,
                              MIN(conversations.created_at) AS created_at
                       FROM `intercom.conversations` AS conversations
                       GROUP BY convo_id),
     next_messages AS (SELECT conversations.id AS convo_id,
                              cp.value.author.id AS user_id,
                              cp.value.author.type AS user_type,
                              cp.value.created_at AS created_at
                       FROM `intercom.conversations` AS conversations,
                       UNNEST (conversation_parts) AS cp),
     all_messages AS (SELECT * FROM first_message
                      UNION ALL
                      SELECT * FROM next_messages
                      WHERE user_type <> "bot"
                      ORDER BY convo_id, created_at),
     lagged_users AS (SELECT convo_id,
                             user_id,
                             user_type,
                             created_at,
                             LAG(user_type) OVER (PARTITION BY convo_id ORDER BY created_at) AS prev_user_type
                       FROM all_messages),
     tidy_lagged_users AS (SELECT convo_id,
                                  user_id,
                                  (CASE WHEN user_type = 'lead' THEN 'user'
                                        WHEN user_type IS NULL THEN 'none'
                                        ELSE user_type
                                   END) AS user_type,
                                  created_at,
                                  (CASE WHEN prev_user_type = 'lead' THEN 'user'
                                        WHEN prev_user_type IS NULL THEN 'none'
                                        ELSE prev_user_type
                                   END) AS prev_user_type
                          FROM lagged_users),
     lag_times AS (SELECT convo_id,
                          user_id,
                          user_type,
                          created_at,
                          prev_user_type,
                          LAG(created_at) OVER (PARTITION BY convo_id ORDER BY created_at) AS prev_message_time
                   FROM tidy_lagged_users
                   WHERE user_type != prev_user_type)
SELECT convo_id,
       user_id AS admin_id,
       name AS admin_name,
       user_type,
       created_at,
       prev_user_type,
       prev_message_time,
       TIMESTAMP_DIFF(created_at, prev_message_time, MINUTE) AS response_time
FROM lag_times
LEFT JOIN `intercom.admins` AS admins
  ON lag_times.user_id = admins.id
WHERE user_type = 'admin' AND prev_user_type = 'user'