Notice: For best experience, you can duplicate my Deepnote notebook here and run the supporting code yourself.

At Traction Tools we’re highly committed to making our clients succeed. We run a platform for EOS, a system that helps entrepreneurs run their business, internal operations, and effective meetings in the cloud.

However, as a SaaS company, we routinely deal with issues like churn and customer retention. Here we’re going to discuss how we analyze churn and some of the important factors that make our customers stay or cancel their subscription.

It is very common for companies to try to predict customer churn using so-called black-box models: highly complex algorithms that can detect whether a client is going to cancel their subscription based on a number of factors.

This is not necessarily bad, but there are better ways to predict tenure and calculate the probability of a user churning while using interpretable models, which help us understand what is causing our users to cancel their subscription.

This short article is aimed at data scientists and business analysts who would like a better understanding of how to calculate a client’s churn probability, its causes, and the overall churn ratio.

Introduction

We build software for EOS, and we offer our platform to users who want to run effective meetings. Following our business model, we have teams, each with some number of users and some number of meetings run every week.

This allows us to have a sample dataset that includes:

  • Weekly Average Meetings: How many meetings the user runs per week.
  • Active User Count: How many users are within a team.
  • Has Churned: Marks the observable event of “death” (i.e., cancellation).
  • Cluster Labels: A categorical variable that tells us whether the account has high or low activity on the platform.
  • Tenure: How many days the account was active on our platform.

In this article we will work only with a sample containing synthetic data and limited features to keep sensitive information private.
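If you want to follow along without the notebook, here’s a minimal sketch that builds a synthetic dataframe with the same schema (the distributions are my own illustrative assumptions, not Traction Tools data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
is_high = rng.random(n) < 0.17  # roughly one in six accounts is high activity

# Same columns as the article's dataset; values are made up
accounts = pd.DataFrame({
    'weekly_avg_meetings': np.where(is_high, rng.poisson(10, n), rng.poisson(2, n)),
    'active_user_count': np.where(is_high, rng.poisson(39, n), rng.poisson(11, n)),
    'has_churned': rng.integers(0, 2, n),
    'cluster_labels': np.where(is_high, 'high_activity', 'low_activity'),
    'tenure': rng.integers(1, 1800, n),  # tenure in days
})
print(accounts.head())
```

Any of the snippets below should run against a dataframe shaped like this one.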

Key Objectives

Key objectives from this analysis are:

  1. Performing a basic and short EDA (Exploratory Data Analysis) to get insights
  2. Getting the median lifetime of our customers
  3. Validating if the median lifetime varies per account activity

To run this analysis we’ll use a Python environment with libraries such as Pandas, Matplotlib, and Lifelines. Without further ado, let’s jump right into the exploratory data analysis (EDA).

Exploratory Data Analysis

Let’s start by importing some libraries that we’re going to use and also our dataframe to inspect it.

import pandas as pd
import matplotlib.pyplot as plt

# Establish chart style and figure size
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 14, 7

# Load our dataframe and visualize a random sample of the dataframe
accounts = pd.read_csv('clustered_users.csv')
accounts.sample(10)
      weekly_avg_meetings  active_user_count  has_churned  cluster_labels  tenure
424                     3                 12            1    low_activity     798
691                     5                 10            0    low_activity     800
283                     3                 15            0    low_activity     117
958                     6                 70            0   high_activity     673
277                     9                 63            0   high_activity    1083
1035                    1                  5            1    low_activity     365
399                     3                 17            0    low_activity     992
305                     0                 14            0    low_activity      42
597                    11                 44            0   high_activity     860
1174                    4                  9            0    low_activity     593

The first thing I notice in this sample is the low number of high activity accounts. Let’s verify this assumption by running the .value_counts() method on our dataframe.

# Let's pass the normalize argument to give us percentages
accounts.cluster_labels.value_counts(normalize=True)

# Output
low_activity     0.829308
high_activity    0.170692
Name: cluster_labels, dtype: float64

As expected, 83% of our sample is composed of low activity accounts, while the other 17% are high activity accounts.

It would be a good idea to visualize these numbers in a horizontal bar chart. Fortunately, this can be easily done using the pandas method .plot().

accounts.cluster_labels \
    .value_counts(normalize=True) \
    .plot(kind='barh')

# Always label your axes
plt.title('Population Activity Distribution')
plt.ylabel('Cluster Label')
plt.xlabel('Percentage');

Having a visual representation helps us identify an issue: if we run a survival analysis, we may have to split these two groups to better understand their behavior and lifetime on our platform.

Now let’s consider the tenure column, which tells us how long a client stays with us. We can run the .describe() method to get some basic statistics about this feature.

The first thing I want to do before running .describe() is to convert the column from days to months.

# Convert tenure from days to months (~30.4167 days per month)
accounts['tenure'] = accounts.tenure / 30.4167
accounts.tenure.describe()

# Output
count    3064.000000
mean       17.252132
std        11.720486
min         0.098630
25%         8.284922
50%        14.564368
75%        23.876686
max        59.342401
Name: tenure, dtype: float64

Now, we have 3,064 observations here, and we can see that the mean tenure is 17 months, with a 25th percentile of about 8 months and a 75th percentile of about 24 months.

However, this is not the appropriate way of measuring churn: we cannot say that every client stays with us from 8 to 24 months, since not everyone has the same experience. Furthermore, we already saw that we have different groups of accounts, and this information might vary wildly between them.

Now, let’s compare the behavior of accounts that have cancelled against accounts that are still active in this sample and try to get some insights.

# Group by churn and activity and
# aggregate with mean and median
accounts.drop('tenure', axis=1) \
    .groupby(['has_churned', 'cluster_labels']) \
    .agg(['mean', 'median'])

# Output
                           weekly_avg_meetings        active_user_count
                                          mean median              mean median
has_churned cluster_labels
0           high_activity            10.711111   10.0         39.337374   37.0
            low_activity              2.486349    2.0         11.398427   10.0
1           high_activity            10.285714    7.5         39.642857   37.5
            low_activity              1.900000    1.0          7.957895    7.0

What we have here is a multi-index describing the mean and median values for our features, broken down by account activity for active and cancelled accounts.

Ok, that was a mouthful, but let’s focus on meetings first:

  • When it comes to weekly meetings, active and cancelled high activity accounts hold roughly the same number of meetings per week on average, but cancelled accounts run 2.5 fewer when we compare medians.
  • For active accounts with low activity, the mean and the median don’t differ wildly. Two meetings a week is reasonable; however, for cancelled accounts the median drops to 1 meeting a week instead of 2.

Now, what can we conclude about the active user count in each team?

  • When it comes to high activity teams the mean and median value for active and cancelled accounts don’t differ much.
  • On the other hand, active low_activity accounts usually have about 10 users in their team, while cancelled ones usually have about 7 users on their subscription, a different behavior that requires further analysis.

EDA Key Insights

With this information we can already conclude that keeping our users busy on the platform is paramount to retaining them, and because Traction Tools is a collaborative space, having more users in a team increases engagement.

We can already start developing retention strategies to succeed with our customers. Based on this information we can also build machine learning algorithms to detect churn, anomalies, and clients that will provide more value over time.

Let’s go a bit further and try to estimate probabilities around the insights we have discovered.

Using Lifelines for Survival Analysis

There’s a great library out there for properly doing survival analysis, created by Cameron Davidson-Pilon, called Lifelines. It’s one of the best libraries for survival analysis that I’ve tried so far. Let’s use it to analyze the chance of survival at any point of our clients’ subscription.

Global Survivability Rates

We’ll start by using the Kaplan-Meier fitter to analyze the survivability rates for the whole population.

from lifelines import KaplanMeierFitter

# Filter observed events only
churn_filter = (accounts.has_churned == 1)
cancelled_accounts = accounts[churn_filter]

# Feed the model with our dataset for churned accounts
kmf = KaplanMeierFitter()
kmf.fit(cancelled_accounts.tenure, label='Churned Customers')

# Plot the survival chance of our population
fig, ax = plt.subplots()
kmf.plot(ax=ax, at_risk_counts=True)
ax.set_title('Kaplan-Meier Survival Curve — Churned Customers')
ax.set_xlabel('Customer Tenure (in Months)')
ax.set_ylabel('Customer Survival Probability (%)')
plt.show();

Now we can see the total survival chance of our population at any point in time. In the chart above we can observe that there’s initially a 100% chance of survival, which slowly declines as time goes by.

Of 408 observations, we can see that by the 10th month 232 of them still have an active subscription, while 176 have already cancelled.

Now let’s try to get the median survival time and also the survival chance at this point in time.

median_surv_time = kmf.median_survival_time_
# Probability a customer has churned by the median survival time
churn_prob = kmf.cumulative_density_at_times(median_surv_time).iloc[0]
print(f'The median survival time is: {median_surv_time:0.2f} months')
print(f'With a churn probability of: {churn_prob:0.0%}')

# Output
The median survival time is: 11.74 months
With a churn probability of: 50%

It seems that around the 12th month our clients have a 50/50 chance of cancelling their subscription. Now let’s get a lower and an upper bound so we have a confidence interval instead of only the median value.

from lifelines.utils import median_survival_times

median_ci = median_survival_times(kmf.confidence_interval_)
lower_bound, upper_bound = median_ci.loc[0.5]

print(f'Survival Lower Bound is: {lower_bound:0.2f} months')
print(f'Survival Upper Bound is: {upper_bound:0.2f} months')

# Output
Survival Lower Bound is: 10.42 months
Survival Upper Bound is: 12.95 months

Now we know that we should pay special attention to accounts that are between 10 and 13 months old. Using this information we can trigger actions to take care of these customers and improve their lifespan on the platform.
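For example, a simple pandas filter can flag the active accounts sitting in that 10-to-13-month window (the `account_id` values and the tiny dataframe below are made up for illustration):

```python
import pandas as pd

# Hypothetical active accounts, tenure already converted to months
accounts = pd.DataFrame({
    'account_id': [1, 2, 3, 4],
    'has_churned': [0, 0, 0, 1],
    'tenure': [4.0, 11.2, 12.8, 11.5],
})

# Flag accounts that are still active and inside the 10-13 month risk window
at_risk = accounts[
    (accounts.has_churned == 0)
    & accounts.tenure.between(10, 13)
]
print(at_risk)  # accounts 2 and 3 qualify
```

A list like this could feed a customer-success workflow or an automated outreach campaign.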

Segmented Survivability Rates

However, there’s one thing we have to note: these are values for the entire population, but we know that we have different types of clients in our sample, so we should separate the two populations and observe their behavior.

To achieve this we’ll split our population using the cluster_labels column, which separates the accounts by activity.

from lifelines.plotting import add_at_risk_counts

low_ = (cancelled_accounts.cluster_labels == 'low_activity')
high_ = (cancelled_accounts.cluster_labels == 'high_activity')

fig, ax = plt.subplots()
low_kmf = KaplanMeierFitter()
low_kmf.fit(
    cancelled_accounts.tenure[low_],
    cancelled_accounts.has_churned[low_],
    label='Low Activity Accounts',
)
low_kmf.plot(ax=ax)

high_kmf = KaplanMeierFitter()
high_kmf.fit(
    cancelled_accounts.tenure[high_],
    cancelled_accounts.has_churned[high_],
    label='High Activity Accounts',
)
high_kmf.plot(ax=ax)

add_at_risk_counts(low_kmf, high_kmf);

We can already observe that there’s a BIG difference in survivability between accounts with low activity and accounts with high activity. Let’s now get the lower and upper bounds for these two types of accounts.

low_median_ci = median_survival_times(low_kmf.confidence_interval_)
lowact_lower_bound, lowact_upper_bound = low_median_ci.loc[0.5]

high_median_ci = median_survival_times(high_kmf.confidence_interval_)
highact_lower_bound, highact_upper_bound = high_median_ci.loc[0.5]

print('Low Activity Accounts:')
print(f'\t- Survival Lower Bound is: {lowact_lower_bound:0.2f} months')
print(f'\t- Survival Upper Bound is: {lowact_upper_bound:0.2f} months')

print('High Activity Accounts:')
print(f'\t- Survival Lower Bound is: {highact_lower_bound:0.2f} months')
print(f'\t- Survival Upper Bound is: {highact_upper_bound:0.2f} months')

# Output
Low Activity Accounts:
    - Survival Lower Bound is: 9.83 months
    - Survival Upper Bound is: 12.26 months
High Activity Accounts:
    - Survival Lower Bound is: 17.88 months
    - Survival Upper Bound is: 31.30 months

This is great! We now know that high activity accounts have a better chance of staying with us for a long time than low activity accounts. While low activity accounts can be retained for between 9 and 12 months, high activity accounts can stay with us between 17 and 31 months.

From a business development perspective, this is useful information we can use to help our customers move from a low activity account to a high activity one and create a more engaging space for them.

Understanding the Impact of Covariates

To finish this short study, I’d like to understand the impact of our variables, Weekly Average Meetings and Active User Count, within a team. This will help us answer questions like:

  • Does having more users in a team space improve the chances of survival?
  • Does having more meetings also affect the chances of survival?

What we’re trying to find out with these questions is whether increasing the activity on accounts can change the probability of an account leaving the service early in their subscription.

To begin with, we’ll use the Cox Proportional Hazards model to understand how these variables affect a customer’s survival chance. We can import the model directly from Lifelines and then fit it with our dataset.

from lifelines import CoxPHFitter
import numpy as np

no_clusters_accounts = accounts.drop(['cluster_labels'], axis=1)

cph = CoxPHFitter()
cph.fit(no_clusters_accounts, duration_col='tenure', event_col='has_churned')

# Output
<lifelines.CoxPHFitter: fitted with 3064 total observations, 2656 right-censored observations>

Once we’ve fitted our model we can see how our variables affect the churn probability for different groups. In the example below we can see how the survival probability changes for accounts with 5, 15, 25, 35, and 45 users.

cph.plot_partial_effects_on_outcome('active_user_count', np.arange(5, 50, 10), cmap='coolwarm_r');

It’s clear that users with sizable teams are more likely to stick around. Because Traction Tools is a collaborative space, it makes sense that having more people in organization accounts improves retention.

Now, I want to see if having a fair amount of meetings within a week improves retention.

cph.plot_partial_effects_on_outcome('weekly_avg_meetings', np.arange(0, 8, 2), cmap='coolwarm_r');

As expected, running a fair amount of meetings improves retention. This is one of the reasons why high activity accounts are more likely to stick around than low activity users.

Conclusion

Managing customer churn is no easy task; however, we were able to uncover a good amount of insights that allow us to drive strategies and make informed decisions based on data. These insights help us understand our users when it comes to churning, build alert systems and campaigns based on AI, and provide training to our customers to make collaboration happen.

This is how we use data at Traction Tools to make important decisions, democratize information, and provide value to our customers.

Also, a huge thank you to the team at Deepnote for enabling these tools and helping us adopt and scale information as a second language throughout our company. Can’t thank them enough!