Friday, August 14, 2015

Cluster Analysis on a selection of ATP Players

I have always felt that the goldmine of data Jeff Sackmann of tennisabstract.com provides makes a lot of interesting studies and analyses possible. I chose to devote some time to one such analysis, to see if I could cluster players based on their statistics. To my pleasant surprise, players I expected to be together did land in the same cluster. If, say, Ivo Karlovic had not ended up in the same cluster as John Isner, I would have assumed the analysis contained some gross miscalculation or bad assumption.

But there is more to it than that. Playing style is one thing, but player statistics can tell a different story. For example, a player like Roger, with his mighty serve, could still have a 'points won % on serve' number similar to that of a seemingly weaker server like Rafa, thanks to how the shots played in the rally after the serve turn out. The core of my analysis takes this into account, and for that reason I did not choose individual variables tied to style of play, like number of aces, aces per match etc. I chose the percentages that matter the most in the sport.

Some details on the model.

The match data includes:
  • All ATP matches from 2005 to mid-2015
  • Players who were active in 2014 or 2015 (played at least one match in the last 1.5 years)
  • Players who have served more than 1500 aces since 2005 (this was used to retain the more popular names). So don't expect to see where, and alongside whom, arguably the most popular player of the week, Nick Kyrgios, would fit in.

Variables used in the model:

Serve Stats
  • First Serves In %
  • First Serves Points Won %
  • Second Serves Points Won %
  • Break Points Saved %

Return Stats
  • First Serves Return Points Won %
  • Second Serves Return Points Won %
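
For anyone wanting to reproduce the six variables, here is a minimal sketch of how they could be derived from raw counts summed over a player's matches. The field names and the exact denominators (for instance, treating every service point that is not a first serve in as a second-serve point) are my assumptions for illustration, and the numbers in the example are invented, not a real player's.

```python
from dataclasses import dataclass


@dataclass
class Totals:
    """Raw counts summed over a player's matches (hypothetical field names)."""
    sv_pts: int          # service points played
    first_in: int        # first serves that landed in
    first_won: int       # points won behind a first serve
    second_won: int      # points won behind a second serve
    bp_faced: int        # break points faced
    bp_saved: int        # break points saved
    opp_sv_pts: int      # the same counts for the opponents' serves
    opp_first_in: int
    opp_first_won: int
    opp_second_won: int


def pct(num, den):
    """Percentage, or NaN when the denominator is zero."""
    return 100.0 * num / den if den else float("nan")


def features(t):
    """Turn summed counts into the six percentages listed above."""
    second_pts = t.sv_pts - t.first_in              # assumed: includes double faults
    opp_second_pts = t.opp_sv_pts - t.opp_first_in
    return {
        "first_in_pct": pct(t.first_in, t.sv_pts),
        "first_won_pct": pct(t.first_won, t.first_in),
        "second_won_pct": pct(t.second_won, second_pts),
        "bp_saved_pct": pct(t.bp_saved, t.bp_faced),
        # Return points won are the opponent's serve points that he did NOT win.
        "first_ret_won_pct": pct(t.opp_first_in - t.opp_first_won, t.opp_first_in),
        "second_ret_won_pct": pct(opp_second_pts - t.opp_second_won, opp_second_pts),
    }


# Invented counts, purely to show the call.
example = features(Totals(
    sv_pts=1000, first_in=600, first_won=450, second_won=200,
    bp_faced=50, bp_saved=30,
    opp_sv_pts=1000, opp_first_in=600, opp_first_won=420, opp_second_won=220,
))
```

Computing the percentages from summed counts rather than averaging per-match percentages weights long matches more heavily; either choice is defensible, and the sketch simply picks one.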

Algorithm: k-means clustering. It takes the values of these variables for all players and assigns each player to the cluster whose centre (the mean of its members across all variables) is nearest in Euclidean distance, which keeps the total squared distance between players and their cluster centres as small as possible. After a few trial runs with different values of k, a size of k=5 looked to provide a decent tradeoff between accuracy and making sense.
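
To make the idea concrete, here is a bare-bones sketch of k-means (Lloyd's algorithm) in plain Python. It is not the code behind the results in this post: the standardisation step, the random initialisation and the function names are my own assumptions, and the toy rows are invented numbers, not real player stats.

```python
import math
import random


def standardize(rows):
    """Rescale every column to zero mean and unit variance (an optional step)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Return (labels, centroids, inertia) for k clusters."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # k random players as starting centres
    labels = None
    for _ in range(n_iter):
        # Assignment step: each player goes to the nearest centre.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break  # nothing moved, so the algorithm has converged
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centre
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    # Inertia: total squared distance of players to their own centre.
    inertia = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, inertia


# Invented rows of (first serve points won %, second serve return points won %).
toy = [[78, 44], [79, 43], [77, 45], [70, 54], [71, 55], [69, 53]]
scaled = standardize(toy)
labels, centroids, inertia = kmeans(scaled, k=2)
inertia_by_k = {k: kmeans(scaled, k)[2] for k in (1, 2, 3)}
```

Comparing the inertia across several values of k, as in `inertia_by_k`, is one common way to structure the trial runs: the value keeps falling as k grows, so the judgement call is where the drop stops being worth the extra clusters.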

Getting this data in order from the source involved a bit of a circus, as the base data set is organised by match rather than by player, i.e., each row shows all these stats for both the match winner and the match loser. Going through that circus actually helped. I mention it here to emphasise that data exploration is, and should always be, part of any analysis effort, since it makes you familiar with the data set.
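
The reshaping can be sketched as follows: every match row contributes once to the winner's totals and once to the loser's, with the other side's numbers kept as "opponent" stats for the return variables. I am assuming column names with `w_` and `l_` prefixes plus `winner_name` and `loser_name`; check them against the files you actually download. The two match rows and the player names A, B and C are invented.

```python
from collections import defaultdict

# Stat suffixes assumed to appear as w_<stat> and l_<stat> in each match row.
STATS = ["svpt", "1stIn", "1stWon", "2ndWon", "bpSaved", "bpFaced"]


def player_totals(matches):
    """Sum per-match counts into per-player totals (own serve and opponents' serve)."""
    totals = defaultdict(lambda: defaultdict(int))
    for m in matches:
        # Visit the row twice: once from the winner's side, once from the loser's.
        for me, opp, name_col in (("w", "l", "winner_name"), ("l", "w", "loser_name")):
            t = totals[m[name_col]]
            t["matches"] += 1
            for s in STATS:
                t[s] += m[f"{me}_{s}"]            # the player's own serve stats
                t[f"opp_{s}"] += m[f"{opp}_{s}"]  # the opponent's serve stats
    return totals


# Two invented match rows, purely for illustration.
matches = [
    {"winner_name": "A", "loser_name": "B",
     "w_svpt": 60, "w_1stIn": 40, "w_1stWon": 32, "w_2ndWon": 11, "w_bpSaved": 2, "w_bpFaced": 3,
     "l_svpt": 70, "l_1stIn": 42, "l_1stWon": 28, "l_2ndWon": 12, "l_bpSaved": 4, "l_bpFaced": 8},
    {"winner_name": "C", "loser_name": "A",
     "w_svpt": 65, "w_1stIn": 39, "w_1stWon": 30, "w_2ndWon": 14, "w_bpSaved": 3, "w_bpFaced": 4,
     "l_svpt": 75, "l_1stIn": 45, "l_1stWon": 30, "l_2ndWon": 13, "l_bpSaved": 5, "l_bpFaced": 9},
]
totals = player_totals(matches)
```

Real rows would first need their values parsed to integers, and matches with missing stats would have to be skipped or handled before summing.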

Now to the results. This is how the players were clustered into their respective groups:

The means of the variables under consideration in these groups looked like this:

Cluster 4 was the litmus test that validates the analysis in tennis-fan terms. All the big servers in the game land here. With an enviable 1st Serve Points Won % (78% mean!), these are the towering senders of crushing serves. Unsurprisingly, they end up with poor numbers for 1st and 2nd Serve Return Points Won % (25% and 44%).

Cluster 2 is the one that carries 3 of the Big 4 and, of course, the talented delPo doesn't get left behind and promptly joins them. These guys are the clinical ones, setting the benchmark in 4 of 5 categories and trailing only by a whisker in the First Serves In department. Happy to see another of my personal favourites, Nikolay Davydenko, in there. Hi Kolya! Verdasco seems to be the lone outlier in the group but hey, wouldn't Nando spot in an instant that double faults weren't part of the variable set!

Cluster 5 houses Andy Murray & Stan the Man! The group has the lowest mean First Serves In % (which I suspect could be due to a few players skewing it to look that bad!), and Murray's conspicuous absence from Cluster 2 could be down to his 2nd serve: his group stands at 50% in this variable, against 54% for the group that shelters the other elite members.

Cluster 1 comprises players who have been on and off, and their pain point seems to be the 2nd serve and, as a result, break points saved. The lads there are probably mentally weaker than the ones in the other groups.

Cluster 3 houses the consistent journeymen who hover around ranks 10 to 40, and their weakest links seem to be their 2nd serves and returns. They are, in more ways than one, close to Cluster 1. Look at the names in the two groups. Those are your regular R32 guys at slams!

I also looked at how the members of the different clusters fared on combinations of these variables.

More on that, with plots of how the variables interact with each other for these clusters and a deep-dive into Cluster 2, in my continuation here.

Any interesting insights you see in there?! Feel free to share them, along with any other comments you may have.

