What is churn?
Customer churn is a well-known phenomenon in virtually every business you can imagine, since every business has customers and sells something to them. What is sold is irrelevant: the customer is always present. Even in B2B there are customers (they simply happen to be other businesses). The customer is therefore the cornerstone of every business, and losing them hurts no matter what we sell.
Luckily for us, losing customers does not happen very often. Depending on the business, it may happen at a ratio of 1 in 4 (a high customer loss ratio) or perhaps 1 in 20 for businesses built on loyal customers. It always depends on the kind of business. In any case, reducing this ratio will be our goal. However, defining what an acceptable ratio is for a business is neither easy nor reusable across business cases. What ratio would be acceptable for a telecommunications company? For a gym? For a supermarket? In the latter case, is it even possible to measure churn?
Depending on the context, the very first question we must answer when dealing with churn is: what is churn IN OUR BUSINESS SCENARIO? This can be viewed as an extension of the classic problem-definition phase of any Machine Learning project, but with a twist. And, as in any Machine Learning project, monetizing and valuing our model is key to understanding how good it really is beyond statistical measures (which, let’s be honest, are not useful for the real goal of the model: to be applied to a business). Translating models into money is never easy, and churn is no exception, but churn is a widely studied case across industries, and from those studies we can start to get an idea of how important it is (taken from here):
- Increasing customer retention rates by 5% increases profits by 25% to 95%
- It costs five times as much to attract a new customer as to keep an existing one
Of course, feel free to adjust these numbers and/or apply them to your own business case, but they give us a mind-blowing perception of the business-critical role of a model like this. They help us quickly grasp the profitability that derives from changes in our model (a 1% increase in a performance measure translates into X money for our business).
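To make this concrete, here is a minimal back-of-envelope sketch of how model quality can translate into money. The `expected_net_savings` helper and every number in it are illustrative assumptions (invented customer counts, costs, and rates), not figures from any real business:

```python
# Back-of-envelope monetization of a churn model.
# Every number here is an illustrative assumption, not a benchmark.

def expected_net_savings(n_customers, churn_rate, recall, precision,
                         contact_cost, customer_value, save_rate):
    """Estimate the net value of targeting predicted churners with a campaign."""
    churners = n_customers * churn_rate    # customers who would actually leave
    caught = churners * recall             # churners the model flags
    contacted = caught / precision         # everyone flagged (incl. false positives)
    retained = caught * save_rate          # flagged churners the campaign saves
    return retained * customer_value - contacted * contact_cost

# Hypothetical scenario: 100k customers, 10% churn, a model with
# recall 0.6 / precision 0.5, a cost of 5 per contact, a value of 200
# per retained customer, and a 30% campaign success rate.
savings = expected_net_savings(100_000, 0.10, 0.60, 0.50, 5.0, 200.0, 0.30)
```

Plugging your own recall, precision, and costs into a sketch like this is usually enough to argue for (or against) a model in business terms, and to see what that extra 1% of performance is worth.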
Once we have established what churn is for our particular business case, we will always want to know not just the what, but, at least, these key points:
- Knowing which customers are more likely to leave, we can design retention campaigns and prevent their departure.
- These retention campaigns have to be focused on key areas of our business. Knowing the reasons shared by the customers who are likely to leave will help us target customer profiles and design actions.
In balance there is virtue
As the philosopher said, in balance there is virtue. This is true in life and no different in Machine Learning, especially in a case like churn. We have already mentioned that churn ratios can be higher or lower depending on the business case, but the data will always be unbalanced in favor of the ‘not churned’ cases. This is a well-known situation in Machine Learning, and it makes it harder for our models (in this case, a binary classifier, since we have 2 classes: ‘churned’ and ‘not churned’) to learn from the data, since fewer cases of the ‘churned’ class are shown. There are diverse techniques to balance our data and make the learning task easier for the algorithms.
We could apply:
- Oversampling: Increasing the minority class (our ‘churned’ cases) creating synthetic cases that are similar enough to the real cases.
- Undersampling: Decreasing the majority class. Balancing the dataset can be achieved by removing cases from the majority class that add little information because nearly or exactly identical cases already exist.
- Hybrid methods: Combining oversampling and undersampling methods.
There are other techniques tied to the specific algorithm we are building our model on, but as in any other area of Machine Learning there is no free lunch: test combinations, investigate which one fits your data and algorithm best, and apply it.
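As a sketch of the two basic resampling ideas, here is the simplest random variant of each, written with plain NumPy (synthetic-case generators such as SMOTE live in the separate `imbalanced-learn` package; the toy dataset below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 95 'not churned' (0) vs 5 'churned' (1).
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

def random_oversample(X, y, minority_label=1):
    """Duplicate minority-class rows (with replacement) until classes balance."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label=1):
    """Drop majority-class rows at random until classes balance."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([kept_majority, minority])
    return X[keep], y[keep]

X_over, y_over = random_oversample(X, y)     # 190 rows, 95 per class
X_under, y_under = random_undersample(X, y)  # 10 rows, 5 per class
```

Note the trade-off the sketch makes visible: oversampling keeps every original case but repeats minority rows, while undersampling throws real majority cases away.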
Trust Occam’s razor
Machine Learning models can become very complex, with dozens or hundreds of features to compute and to understand. This happens in churn analysis too, especially when computed business metrics enter the recipe. Although this can benefit our algorithms (the richer the information, the better the model), we can fall into the “curse of dimensionality”. Keeping too many features when we can build a good-enough model (remember, we will never have a perfect model) with fewer of them is a waste of time and compute power.
This is especially critical if we think about one of the key questions in our approach to churn: “why are my customers abandoning me?”. If the answer relies on hundreds of factors it will be extremely hard to make it understandable to our colleagues or clients when designing retention campaigns or, even more important sometimes, when we are selling our results!
On top of that, simple models are usually more robust and less prone to overfitting (a.k.a. becoming overspecialized). Data will always surprise us when we put our model in production: customers will behave differently, and new customers will arrive bringing new cases. A simpler model (as long as it performs well enough) will usually beat an overcomplicated one, since it generalizes better to unseen data.
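One simple way to act on Occam’s razor is automatic feature selection. The sketch below uses scikit-learn’s `SelectKBest` to keep only the 10 most relevant of 50 features; the synthetic dataset and the choice of k=10 are arbitrary, for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a churn dataset: 50 features, only 5 informative.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Keep the 10 features with the strongest univariate relationship to the target.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_small = selector.transform(X)

print(X.shape, "->", X_small.shape)  # far fewer features to compute and explain
```

If the model trained on `X_small` scores close to the one trained on `X`, the smaller feature set wins: it is cheaper to compute and far easier to explain to colleagues and clients.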
The why of everything together
Having a good churn model is great, but not enough. Being able to detect churn and predict which customers are more likely to abandon our business is not useful if we cannot interpret why, i.e., which aspects of their behavior matter most for the customers who churn. This is achieved through model interpretation, since the model is a summary of the whole dataset. At the same time, the dataset is (or should be) a summary of our customers’ behavior. Thus, by interpreting the model we are interpreting our customers’ activity, not in a general scope but focused on their churn-related actions and profile.
Model interpretation is a hot topic in Machine Learning nowadays. It is a fundamental part of ML for many reasons. Some of them:
- It allows us to understand our models beyond metrics
- It is useful for debugging and improving models
- It helps detect and avoid bias in our datasets and models
- It takes us toward prescriptive analytics (such as the aforementioned retention campaigns)
Feature importances can be extracted from our models to check which features have contributed the most to the model’s decision on whether a customer will churn or not. Different algorithms store this information in different ways. For example, tree-based algorithms save it as metadata reflecting how much each feature helped them during the training phase to learn our customers’ behavior from the dataset. Other algorithms, like logistic regression, store these importances as coefficients or weights for each feature. From any algorithm that stores its feature importance information, we can extract and plot it.
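As a sketch with scikit-learn, the two storage styles just mentioned look like this (the synthetic dataset is for illustration only): tree ensembles expose `feature_importances_` after fitting, while logistic regression exposes a learned weight per feature in `coef_`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a churn dataset (illustration only).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Tree-based model: impurity-based importances saved during training.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
tree_importances = forest.feature_importances_   # one score per feature, sums to 1

# Logistic regression: the magnitude of each learned weight.
logit = LogisticRegression(max_iter=1000).fit(X, y)
weight_importances = abs(logit.coef_[0])         # one |weight| per feature
```

Either array can be sorted and plotted as a bar chart to show which customer behaviors drive the churn prediction.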
Other algorithms, like Support Vector Machines or Neural Networks, have historically been considered “black boxes”. That is no longer true thanks to techniques like Permutation Feature Importance or Shapley Additive Explanations (SHAP). These aim to be model-agnostic: instead of relying on the inner characteristics of the model to report importances, they challenge the model by feeding it modified data and then analyzing its performance. If modifying a feature has a deep effect on the model’s performance, they assign that feature a higher score. Because they require constantly challenging and interacting with the model, these techniques can be time-consuming, but they allow us to explain virtually any model, including the complex ensemble models frequently used in production systems.
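Permutation Feature Importance is available directly in scikit-learn; a minimal sketch on a “black-box” SVM might look like this (SHAP requires the separate `shap` package, and the dataset here is synthetic, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a churn dataset (illustration only).
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An SVM is a classic "black box": it has no native feature_importances_.
model = SVC().fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the test score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
# result.importances_mean[i] = average score drop when feature i is permuted
```

The repeated shuffling and re-scoring is exactly why these techniques are slower than reading stored importances, but the same call works unchanged for virtually any fitted estimator.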