RFM and K-means based analysis of user value

1. Overview

Industries such as finance, retail, and telecommunications often need to segment users and tag each segment in order to run personalized, precisely targeted marketing. The RFM model segments users based on their transaction behavior, such as purchases, consumption, or recharges (referred to as "transactions" below). It is simple yet practical. The RFM model consists of three indicators:

Recency (R): The time elapsed since the user's most recent transaction. A higher R value means the last transaction occurred further in the past, while a lower R value means it occurred more recently.
Frequency (F): The number of transactions the user made in a recent period. A higher F value means the customer transacts more often, while a lower F value means the customer is less active.
Monetary (M): The total amount the user spent on transactions in a recent period. A higher M value means the customer's value is higher, while a lower M value means it is lower.
All three indicators are calculated from the user's transaction behavior, and together they broadly reflect user activity, purchasing power, and loyalty. The operator's goal is to calculate the RFM indicators for each user and divide the users into distinct groups, so that each group can be analyzed and targeted with precise marketing actions. Refer to Figure 1:

Figure 1 User classification (the picture is annotated in Japanese; I found its description good, so I used it as-is, and it should be understandable)

Each of R, F, and M can be split into two levels, high and low. We find the median of each indicator: a value above the median is high, and a value below the median is low (for R, the value is the time since the most recent purchase). This results in 2 × 2 × 2 = 8 types of user classifications, which can be used to identify high-value users, key development users, and churned users, and to carry out targeted marketing actions. When formulating operational strategies, it is necessary to consider the proportion of each user type in the product, as well as the actual business logic of the product. Refer to Table 1:

策略.png

Table 1 Operational strategies for different user classifications
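The median split described above can be sketched in a few lines of pandas. This is only an illustration: the six users and their values are made up, and the `median_flags` helper and the three-character `segment` code are hypothetical names, not taken from the article's source code. A flag of 1 simply means "above the median", so for R it marks users whose last transaction lies further in the past.

```python
import pandas as pd

# Hypothetical per-user RFM values (R in days, F in orders, M in currency units).
rfm = pd.DataFrame(
    {
        "R": [5, 40, 300, 12, 150, 700],
        "F": [20, 3, 1, 8, 2, 1],
        "M": [900.0, 120.0, 30.0, 400.0, 60.0, 15.0],
    },
    index=["u1", "u2", "u3", "u4", "u5", "u6"],
)


def median_flags(rfm: pd.DataFrame) -> pd.DataFrame:
    """Flag each indicator as high (1) or low (0) relative to its median."""
    flags = (rfm > rfm.median()).astype(int)
    # Combine the three flags into one of 2 x 2 x 2 = 8 segment codes, e.g. "011".
    flags["segment"] = (
        flags["R"].astype(str) + flags["F"].astype(str) + flags["M"].astype(str)
    )
    return flags


flags = median_flags(rfm)
print(flags)
print(flags["segment"].value_counts())
```

Each segment code can then be mapped to a strategy such as those in Table 1; that mapping is a business decision and is left out here.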

2. Empirical Analysis

2.1 Data Preprocessing

Obtain data from a CSV file (real operational data of a certain e-commerce platform during a certain period), including user ID, transaction date, transaction amount, and other fields. The data overview is as follows:

Number of feature variables: 4 (USERID, ORDERDATE, ORDERID, AMOUNTINFO)
Number of data records: 4442
Presence of NA values: None
Maximum value of AMOUNTINFO: 9188
Minimum value of AMOUNTINFO: 0.01

Handle missing values and outliers in the original data. Since this data has no missing values, no processing is required (if there were any, they could be deleted or filled in as needed). Transactions with an amount of 0.01 were confirmed to be operational test data, so they are deleted. Convert the transaction date (ORDERDATE) to a date type for the subsequent calculation of time intervals.
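A minimal sketch of these preprocessing steps with pandas might look as follows. The column names come from the data overview above; the inline CSV text, its rows, and the `preprocess` helper are made up for illustration, since the article does not give the file name or its contents.

```python
import io

import pandas as pd

# Stand-in for the article's CSV file (rows are invented for illustration).
CSV_TEXT = """USERID,ORDERDATE,ORDERID,AMOUNTINFO
1001,2018-11-30,50001,258.00
1002,2018-12-01,50002,0.01
1001,2018-12-05,50003,9188.00
1003,2017-06-15,50004,42.50
"""


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop missing rows and 0.01 test orders, then parse ORDERDATE as dates."""
    df = raw.dropna()
    # Orders of 0.01 are test data; rounding guards against float parsing noise.
    df = df[df["AMOUNTINFO"].round(2) > 0.01].copy()
    df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"])
    return df


raw = pd.read_csv(io.StringIO(CSV_TEXT))
orders = preprocess(raw)
print(orders.dtypes)
print(len(raw), "->", len(orders), "rows")
```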

2.2 RFM Score Calculation

Step 1. Specify the most recent time point. Since the most recent transaction date in the data is December 5, 2018, this date is used as the reference point, and all time intervals are calculated relative to it.
Step 2. Calculate the R, F, and M values. Grouping by user ID (USERID), take the maximum of the transaction date (ORDERDATE), the count of transaction dates (ORDERDATE), and the sum of the transaction amounts (AMOUNTINFO). Combine the maximum date with the reference point from step 1 to obtain R; the count and the sum give F and M.
Step 3. Divide the R, F, and M values into intervals by quantile, usually 5 intervals, and assign the interval labels through the labels parameter. For R, the larger the value, the further the transaction is from the reference point, and the smaller the interval label. F and M are the opposite of R.
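The three steps can be sketched roughly as below. The order data is invented, and `rfm_table` and `quantile_scores` are hypothetical helper names. Using `pd.qcut` with its `labels` parameter is an assumption about how the quantile binning is done; ranking with `method="first"` before binning is a common workaround (not mentioned in the article) to keep tied values from producing duplicate bin edges.

```python
import pandas as pd

# Invented order data; column names follow the article's dataset.
orders = pd.DataFrame(
    {
        "USERID": ["a", "a", "a", "b", "b", "c", "d", "e"],
        "ORDERDATE": pd.to_datetime(
            ["2018-12-05", "2018-11-01", "2018-10-01", "2018-11-25",
             "2018-11-20", "2018-09-06", "2018-01-01", "2017-12-05"]
        ),
        "AMOUNTINFO": [100.0, 50.0, 50.0, 30.0, 40.0, 500.0, 10.0, 20.0],
    }
)

REF_DATE = pd.Timestamp("2018-12-05")  # most recent transaction date in the data


def rfm_table(orders: pd.DataFrame, ref_date: pd.Timestamp) -> pd.DataFrame:
    """Aggregate orders per user into R (days), F (count), and M (sum)."""
    grouped = orders.groupby("USERID").agg(
        last_order=("ORDERDATE", "max"),
        F=("ORDERDATE", "count"),
        M=("AMOUNTINFO", "sum"),
    )
    grouped["R"] = (ref_date - grouped["last_order"]).dt.days
    return grouped[["R", "F", "M"]]


def quantile_scores(rfm: pd.DataFrame, bins: int = 5) -> pd.DataFrame:
    """Turn R, F, M into 1..bins interval labels; R is labeled in reverse."""
    ascending = list(range(1, bins + 1))
    descending = ascending[::-1]
    scores = pd.DataFrame(index=rfm.index)
    for col, labels in (("R", descending), ("F", ascending), ("M", ascending)):
        ranked = rfm[col].rank(method="first")  # break ties before binning
        scores[col + "S"] = pd.qcut(ranked, bins, labels=labels).astype(int)
    return scores


rfm = rfm_table(orders, REF_DATE)
scores = quantile_scores(rfm)
print(rfm.join(scores))
```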
The distribution of the calculated RFM data can be seen in Figures 2 and 3:

方法前 - 散点图.png

Figure 2 RFM-Scatter plot distribution

方法前.png

Figure 3 RFM-3D plot distribution

From Figures 2 and 3, it can be seen that this user group contains many early customers, the transaction frequency is concentrated below 50, and the transaction amount is generally not very high.

2.3 Weight Calculation

Generally, the Analytic Hierarchy Process (AHP) is used for weight calculation. It is commonly used in the development of personal credit risk rating models in financial companies, and can also be used in the retail industry for loan rating. With AHP, first judge the relative importance of the factors through pairwise comparison, and then calculate the indicator weights from these judgments. The weights must also pass a consistency test: the smaller the consistency ratio (CR), the better (generally CR < 0.1 is required). The expert scoring matrix can refer to Table 2:

表 2.png

Table 2 Expert scoring matrix

The final weights are determined as: [W_R, W_F, W_M] = [0.30, 0.54, 0.16] (detailed calculation process omitted). Since this case has only 3 indicators, which are few and hard to confuse, the weights can also simply be set directly.
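For readers who want to see the omitted calculation, the sketch below shows one common way to derive AHP weights and the consistency ratio with NumPy (principal eigenvector method). The pairwise matrix is hypothetical: it is not the expert scoring matrix of Table 2, and was chosen only so that the result lands near the weights quoted above. The random index values are Saaty's commonly cited ones, and `ahp_weights` is an illustrative name.

```python
import numpy as np

# Saaty's random consistency index (RI) for matrix orders 1 to 5.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(pairwise):
    """Return (weights, consistency ratio) for a reciprocal pairwise matrix."""
    a = np.asarray(pairwise, dtype=float)
    n = a.shape[0]
    eigvals, eigvecs = np.linalg.eig(a)
    k = np.argmax(eigvals.real)          # principal eigenvalue
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                      # normalize weights to sum to 1
    ci = (lam_max - n) / (n - 1)         # consistency index
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0      # consistency ratio
    return w, cr


# Hypothetical judgments in the order R, F, M (entry [i][j] = importance of i over j).
pairwise = [
    [1.0, 1 / 2, 2.0],
    [2.0, 1.0, 3.0],
    [1 / 2, 1 / 3, 1.0],
]

weights, cr = ahp_weights(pairwise)
print("weights [W_R, W_F, W_M]:", np.round(weights, 2))
print("CR:", round(cr, 4), "(acceptable if < 0.1)")
```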

2.4 User Classification Based on RFM Interval Labels + Classification Rules

The first classification method combines RFM interval labels with classification rules (Song Tianlong, 2017), as shown in Table 1. Taking R as an example, "RS distribution" refers to the mean of RS, and "High" means above that mean. F and M are handled the same way. The classification results can be seen in Figure 4:

方法 1.png

Figure 4 Classification results based on method 1

From Figure 4, it can be seen that this user group mainly consists of important retained users and general retained customers, with a small number of general value customers and general maintenance users, and a serious lack of important value users and important development customers.

2.5 User Classification Based on Actual RFM Values + K-means

The second method uses the actual RFM values directly, combined with K-means clustering, without using RFM interval labels. When using this method, pay attention to:

Whether the RFM values contain outliers and whether they are evenly distributed
The RFM values are on different scales and need to be standardized
The number of clusters to use for K-means

Outlier handling and standardization can be done with conventional methods. Outliers can be handled with the MAD method or by simply taking the logarithm, and standardization can use the z-score method. In this case, only standardization is performed. As for the number of clusters, the elbow method (detailed reference: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set) can be used to infer a suitable value. As shown in Figure 5, take the point where the curve's decline starts to level off, that is, k=4.

碎石检验.png

Figure 5 Elbow method
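The standardization and elbow steps can be sketched with scikit-learn as below. The data is synthetic (four well-separated groups generated with NumPy), so the numbers do not reproduce Figure 5; `StandardScaler` performs the z-score transformation, and the within-cluster sum of squares is read from `KMeans.inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM table: columns are R, F, M.
rng = np.random.default_rng(0)
centers = np.array(
    [[30, 40, 20000], [200, 10, 5000], [600, 3, 800], [900, 20, 15000]],
    dtype=float,
)
spread = np.array([10.0, 2.0, 300.0])
rfm = np.vstack([c + rng.normal(size=(50, 3)) * spread for c in centers])

# z-score standardization so that R, F, and M contribute on the same scale.
X = StandardScaler().fit_transform(rfm)

# Elbow method: fit K-means for several k and record the inertia.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 2))
# Plotting inertias against ks gives the elbow curve; pick the k where
# the decrease levels off.
```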

Perform K-means clustering with 4 clusters. Because the number of categories is no longer fixed at 8, customers can simply be graded as 1-star to k-star customers. The classification results are shown in Figure 6:

方法 2 最终排名.png

Figure 6 Classification results based on method 2

At this point the clustering only assigns users to groups, and the differences between the groups are not yet visible. To show them, take the centroid of each cluster and rank the centroids. The classification results combined with the rankings can be seen in Figure 7:

方法 23d.png

Figure 7 Classification + ranking results based on method 2
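One possible way to turn the clusters into star ratings is sketched below. The exact ranking rule is an assumption: here each centroid (in standardized units) gets a weighted score using the weights from section 2.3, with the sign of R flipped because a smaller recency is better. The data is synthetic and `star_ratings` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

WEIGHTS = np.array([0.30, 0.54, 0.16])  # [W_R, W_F, W_M] from section 2.3
SIGNS = np.array([-1.0, 1.0, 1.0])      # assumption: lower R is better


def star_ratings(rfm: pd.DataFrame, k: int = 4, seed: int = 0) -> pd.Series:
    """Cluster users with K-means and rank clusters from 1 star to k stars."""
    X = StandardScaler().fit_transform(rfm[["R", "F", "M"]])
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Weighted score of each centroid; higher means a more valuable cluster.
    centroid_score = km.cluster_centers_ @ (WEIGHTS * SIGNS)
    order = np.argsort(centroid_score)   # cluster ids from worst to best
    stars_of_cluster = np.empty(k, dtype=int)
    stars_of_cluster[order] = np.arange(1, k + 1)
    return pd.Series(stars_of_cluster[km.labels_], index=rfm.index, name="stars")


# Synthetic users: four groups, ordered from most to least valuable.
rng = np.random.default_rng(42)
centers = np.array(
    [[20, 40, 20000], [200, 20, 10000], [500, 8, 3000], [900, 2, 500]],
    dtype=float,
)
noise = rng.normal(size=(4, 30, 3)) * np.array([10.0, 0.5, 100.0])
rfm = pd.DataFrame(
    (centers[:, None, :] + noise).reshape(-1, 3), columns=["R", "F", "M"]
)

stars = star_ratings(rfm)
print(rfm.join(stars).groupby("stars").mean().round(1))
```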

The density plot can be used to view the RFM distribution of each ranking user category (watermelon12138, 2019). Taking the 3-star members as an example, refer to Figure 8:

概率密度图.png

Figure 8 Density plot of 3-star members

From Figure 8, it can be seen that for this category of users the recency is around 800 days, the transaction frequency is concentrated around 20, and the transaction amount is widely spread, ranging from 15000 to 25000.

2.6 Comparison of the Two Classification Methods

Comparing the two results, although both divide users into 4 categories, the categories themselves differ. The former follows explicit classification rules, so it is easier to align with what operators expect. The latter relies more on the automatic learning ability of machine learning and delegates more of the work to K-means clustering. The former also calculates a value score for each user, but its classification does not depend on that score, which serves only as a reference. The latter clusters users with K-means, but K-means by itself cannot rank the resulting categories. The ranking of user categories mainly depends on the user's value score and the RFM weights.

The source code of this article can be found in my code sharing on JoinQuant:

https://www.joinquant.com/view/community/detail/dee9aa758086d5a37923300e6b288456

References:

Song Tianlong. "Python Data Analysis and Data-driven Operations"
HarveyLau. "RFM-Clustering" https://github.com/HarveyLau/RFM-Clustering
watermelon12138. "K-means clustering algorithm, Pandas density plot and TSNE display of clustering results" https://blog.csdn.net/watermelon12138/article/details/86549474
Wilame Lima Vallantin. "Apply RFM principles to cluster customers with K-Means"