Business Analytics Micro-Case #2: Clustering

By: Paul Newton, Director of Business Analytics

During the summer, we introduced our series on Micro-Cases in Analytics where we showed how the laborious task of applying credits can be solved in seconds with the right algorithm, and that the benefits can be easily recaptured countless times thereafter.

In our second Micro-Case, we dip our toes into Machine Learning in order to determine which Procurement Card holders use their cards in a markedly different manner from the majority of holders. In other words, who are the outliers? We use the same State of Oklahoma P-Card dataset that we used in the previous case.

Intro

“Markedly different” is an admittedly vague term to assess credit card users, but as is often the case when analyzing large or complex datasets, we didn’t know where to start. Consequently, we selected an Unsupervised Machine Learning model where, instead of predicting any particular outcome, we leveraged the algorithm’s ability to detect patterns in P-Card usage.

We used an algorithm called K-Means Clustering to group our 7,365 employees (with 630K transactions) into clusters with “similar” spending profiles. The fewer employees a cluster contains, the more its members stand out as outliers in terms of their spending patterns. Statistically, the algorithm does this by minimizing the distance between each employee’s co-ordinates and the center of the group to which they are assigned. Below, we cluster based on the spend by expense category, so in a very simplified example, someone who spends $1,000 on Travel is further from a cluster with an average Travel spend of $10,000 than from one with a $500 average, and so belongs with the latter. (This is just one dimension; the algorithm looks across all 11 expense categories/dimensions for each employee.)
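The simplified Travel example above can be sketched in a few lines of Python. This is only an illustration of the distance idea, not the code used for the analysis: the function names, the second (Retail) dimension and its dollar values are assumptions made up for the sketch.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length spend vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(profile, centers):
    """Return (index, distance) of the cluster center closest to a profile."""
    distances = [euclidean(profile, c) for c in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]

# One dimension, as in the text: Travel spend only.
# Center 0 averages $10,000 and center 1 averages $500.
travel_centers = [[10_000.0], [500.0]]
idx, dist = nearest_center([1_000.0], travel_centers)
print(idx, dist)  # 1 500.0

# Hypothetical two-dimension version (Travel, Retail); the Retail
# figures are invented purely to show that the same distance
# calculation extends to more expense categories.
centers_2d = [[10_000.0, 2_000.0], [500.0, 300.0]]
idx_2d, dist_2d = nearest_center([1_000.0, 300.0], centers_2d)
```

With 11 expense categories, each profile and each center would simply be an 11-element list; the distance function itself does not change.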

Clustering

The first task in clustering is to decide how many clusters of cardholders best explain the data. This can be solved mathematically, and we do this by summing the squared distances between each cluster’s center point and the employees assigned to it. As we add more clusters (center points), this total distance declines, but at some stage the declines are too small to meaningfully differentiate between clusters. Below is the Scree Plot that we calculated to conclude that 5 clusters optimize the trade-off between the information gained from additional clusters and the noise of those additional clusters.

[Figure: Scree Plot]
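To make the calculation behind such a plot concrete, here is a minimal, self-contained sketch of K-Means that reports the total squared distance for each candidate number of clusters. It rests on several assumptions: the twelve two-category data points are invented toy values, and the starting centers are chosen with a deterministic "farthest point" rule purely so the sketch is reproducible. Library implementations, such as scikit-learn's KMeans, typically use randomized starting points instead, and nothing here is meant to represent how the original analysis was run.

```python
def squared_distance(a, b):
    """Squared straight-line distance between two spend vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point
    farthest from every center chosen so far (an assumption made
    for reproducibility, not the usual randomized start)."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(points, k, max_iter=100):
    """Basic K-Means loop; returns (centers, labels, total squared distance)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    total = sum(squared_distance(p, centers[lab])
                for p, lab in zip(points, labels))
    return centers, labels, total

# Hypothetical spend profiles in $K for two categories, forming three groups.
points = [
    (1, 1), (1, 3), (3, 1), (3, 3),
    (21, 1), (21, 3), (23, 1), (23, 3),
    (11, 21), (11, 23), (13, 21), (13, 23),
]

# Total squared distance for 1 to 5 clusters: the values behind a scree plot.
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, total in inertia_by_k.items():
    print(k, round(total, 2))
```

On this toy data the total falls from roughly 1,891 with one cluster to 24 with three, after which extra clusters gain very little. That flattening is the "elbow" we look for when choosing the number of clusters.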

Now we group our users into the five clusters. The K-Means model produces the results below, where the cluster number is followed by the number of employees, and each dollar figure is the average employee spend:

[Table: cluster number, number of employees, and average employee spend by expense category]

It is immediately noticeable that the 5 employees in Cluster 1 average ~$2.6M in Travel, and that the single Cluster 4 employee has spent $20.1M, 73% of it with Retail merchants. Furthermore, there are three Cluster 3 employees who average $3.5M with Retail merchants and $4M overall.

Cluster 5 contains the vast majority of cardholders, so it can be viewed as “typical” usage. Cluster 2 has 80 employees with an average spend of $662K ($53M in total), so in addition to the obvious outliers above, these 80 employees would also be a clear group to follow up on.
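A summary of this kind (employees per cluster, average spend by category, and which clusters are small enough to treat as outliers) can be sketched as follows. The profiles, the cluster labels and the 5% size threshold are all hypothetical values chosen for illustration; only the Cluster 2 arithmetic at the end uses figures from the text.

```python
from collections import defaultdict

def summarize_clusters(profiles, labels):
    """profiles: list of {category: spend}; labels: cluster id per profile."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)

    summary = {}
    for label, members in sorted(grouped.items()):
        categories = sorted({c for m in members for c in m})
        totals = {c: sum(m.get(c, 0.0) for m in members) for c in categories}
        summary[label] = {
            "employees": len(members),
            "avg_by_category": {c: totals[c] / len(members) for c in categories},
            "total_spend": sum(totals.values()),
        }
    return summary

def small_clusters(summary, max_share=0.05):
    """Clusters holding at most max_share of all employees.
    The 5% default is an arbitrary, hypothetical threshold."""
    n = sum(s["employees"] for s in summary.values())
    return [label for label, s in summary.items()
            if s["employees"] / n <= max_share]

# Hypothetical data: 20 "typical" cardholders and one very heavy spender.
profiles = [{"Travel": 100.0, "Retail": 200.0} for _ in range(20)]
profiles.append({"Travel": 5_000.0, "Retail": 15_000.0})
labels = [5] * 20 + [4]

summary = summarize_clusters(profiles, labels)
for label, stats in summary.items():
    print(label, stats["employees"], stats["avg_by_category"])
print("clusters to review first:", small_clusters(summary))

# Check of the Cluster 2 figures quoted in the text:
# 80 employees x $662K average is about $53M in total.
cluster2_total = 80 * 662_000  # 52,960,000
```

Sorting clusters by size in this way puts the smallest groups, and therefore the most unusual spending profiles, at the top of the review list.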

Conclusion

Despite having to evaluate 630K rows and 7,365 employees/cards, it took less than a second for the K-Means Clustering algorithm to group our users into 5 meaningful groups. These groups not only made it easy to digest a voluminous amount of data, but also allowed us to quickly identify and evaluate key outliers in a more focused manner.

By extension, this same approach can be used to very quickly determine the outlier user profiles from travel expenses, customers or almost any dataset where the volume or number of possible combinations prevents a manual approach.

Finally, as was the case with our apply_credits algorithm in Micro-Case #1, after just a little set-up and configuration to solve the problem initially, a previously time-consuming problem is solved in seconds for each subsequent iteration.

Questions on this topic? Contact Paul Newton, Director of Business Analytics at pnewton@cviewllc.com.
