Welcome to the GOTM analytics series! Here’s where I’ll keep writing posts about stats from the PGA Tour and how they affect results, courses, and players. I have a background in computer programming, love scraping data, and have already grabbed a bunch of data from pgatour.com, which means it’s time for a series of posts where I analyze stats on Tour. If you want to look at the method and code that I wrote to snag this data, you can check out this post on my other blog, Big-Ish Data.
Considering there are tons of different ideas on golf-specific stats, I decided to write golf posts here, and leave the other blog for more specific programming topics.
If you have other ideas for Tour stats analytics, hit GOTM up on Twitter and I’ll see what I can do!
And to start out, I just want to say that this post violates Betteridge’s Law because the Strokes Gained stats are very impressive at determining scoring averages, and defining which parts of the game players are good at.
Part 1 of the series here is analyzing the importance of Strokes Gained. So what is Strokes Gained? Take a look at the PGA Tour’s press release. If you’re looking for a more technical explanation of strokes gained, read this article.
Basically, they know the average number of shots it takes a Tour player to hole out from a specific distance. They count the number of shots it takes a player to finish the hole from that distance, and then credit him with + or – strokes relative to the average, with a positive number indicating he is that many strokes better than the average. Then they add up his strokes gained from the four different parts of a hole: 1) off the tee, 2) approaching the green, 3) around the green, and 4) putting. The total of those four is his total SG, but SG can also be broken down into the four components.
As an example, let’s look at Daniel Berger:
- Off the Tee: Berger ranks 39th, with an average of +.416.
- Approach-the-Green: Berger ranks 18th, with an average of +.595.
- Around-the-Green: Berger ranks 152nd, with an average of -.136.
- Putting: Berger ranks 25th, with an average of +.462.
To get Berger’s total SG, we add up all four averages to find a total SG of +1.337 which is exactly what the stats show for him here! (Note that these stats are updated weekly, so depending on when you read this, the numbers may not match exactly.) Super cool, and fantastic to show different places where players are better, and where they probably need to practice more. Based on these stats, Daniel Berger may want to focus more on shots around the green, since his average there is pulling down his total SG.
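Here is a tiny sketch of that addition in Python, using only the four Berger averages quoted above. The dictionary and variable names are my own for illustration and don't come from the scraping code.

```python
# Daniel Berger's four Strokes Gained averages, as quoted in the list above.
berger_sg = {
    "Off the Tee": 0.416,
    "Approach-the-Green": 0.595,
    "Around-the-Green": -0.136,
    "Putting": 0.462,
}

# SG: Total is just the sum of the four components.
sg_total = sum(berger_sg.values())

# The weakest area is the component with the lowest value.
weakest = min(berger_sg, key=berger_sg.get)

print(f"SG: Total = {sg_total:+.3f}")
print(f"Weakest area: {weakest}")
```

The same two lines work for any player once you have his four component averages.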
The goal of this analysis is to answer questions about the SG stats:
- How good is the Strokes Gained stat at predicting scoring average? (In case you’re wondering right now, it’s really really good).
- Which of the strokes gained values is most correlated with a player’s scoring average?
- If you have four different players who are all +1 strokes gained in each of the four different SG stats, which one would you expect to have the lowest scoring average?
Analytics terms defined
Before getting back to the golf stuff, I want to talk quickly about two terms you’ll need to be familiar with for the rest of this article to make sense: normal distributions and r-squared.
The scoring averages on tour follow a normal distribution, like this:
Look at that! So pretty. In this case, the average scoring average is 70.923 with a standard deviation of 0.591. Because the distribution is normal, about 68.27% of players’ scoring averages fall between (70.923 – 0.591 =) 70.332 and (70.923 + 0.591 =) 71.514.
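If you want to check where the 68.27% comes from, here is a minimal sketch using only Python's standard library. The mean and standard deviation are the ones quoted above, and the variable names are just for illustration.

```python
import math

mean = 70.923  # average scoring average quoted above
sd = 0.591     # standard deviation quoted above

# One standard deviation either side of the mean.
low, high = mean - sd, mean + sd

# For a normal distribution, the share of values within k standard
# deviations of the mean is erf(k / sqrt(2)); here k = 1.
share_within_one_sd = math.erf(1 / math.sqrt(2))

print(f"One-SD range: {low:.3f} to {high:.3f}")
print(f"Share inside that range: {share_within_one_sd:.2%}")
```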
Before running analysis on stats, it’s important to make sure they’re normally distributed. Clearly the scoring average stats are normally distributed, and so are the strokes gained stats, as I’ll show quickly below.
We’re good to go on that front!
One last definition is the coefficient of determination, also known as r-squared, which I’ll mention a lot. R-squared values range from 0 to 1, and the higher the r-squared value, the more strongly correlated the two data sets are. And since all the data sets we are using are normally distributed, these numbers are comparable across stats.
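To make r-squared a bit more concrete, here is a minimal sketch of how it can be computed for two lists of numbers. For a straight-line fit with one input, r-squared is the square of the Pearson correlation. The function name and the sample numbers are made up for illustration and are not real Tour data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, which equals r-squared for a one-variable linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up numbers, not real Tour data.
sg = [-1.0, -0.5, 0.0, 0.5, 1.0]
perfect = [72.0, 71.5, 71.0, 70.5, 70.0]  # sits exactly on a line
noisy = [71.6, 71.9, 70.8, 70.9, 70.1]    # same trend, with scatter

print(f"perfect line: {r_squared(sg, perfect):.3f}")
print(f"with scatter: {r_squared(sg, noisy):.3f}")
```

Points that sit exactly on a line give an r-squared of 1, and the value falls toward 0 as the scatter grows.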
When looking at Strokes Gained stats for the first time, I decided to check how correlated the SG: Total stat is with Scoring Average. Checking that gives an initial indication of how correlated these stats can be. When I saw the following graph and how correlated the numbers are, I knew this was a great part of PGA Tour stats to analyze.
Look at how incredibly, dead, freaking accurate this is! The r-squared value is an impressively high 0.926, and just by looking at the graph, you can see how correlated those dots are.
Another quick test of how valid this regression is comes from checking the fitted line on the graph. Using the equation of that line, a player with 0 strokes gained total will probably have a scoring average of ~71.079, which is very close to the average scoring average of 70.923 mentioned above. Since the slope of that line is -0.96, adding one stroke gained in total drops your predicted scoring average by 0.96 strokes. Not exactly 1, but close enough to give me a lot of confidence in the data.
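As a quick sketch, here is the fitted line written as a function, using the intercept (~71.079) and slope (-0.96) quoted above. The function and constant names are my own, and the sample inputs are just values to plug in, not predictions for real players.

```python
INTERCEPT = 71.079  # predicted scoring average at 0 SG: Total (from the fit above)
SLOPE = -0.96       # change in scoring average per stroke gained (from the fit above)


def predict_scoring_average(sg_total):
    """Predicted scoring average from the fitted line: intercept + slope * SG."""
    return INTERCEPT + SLOPE * sg_total


# A few illustrative SG: Total values, one stroke apart.
for sg_total in (-1.0, 0.0, 1.0):
    print(f"SG: Total {sg_total:+.1f} -> {predict_scoring_average(sg_total):.3f}")
```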
Ok, now time to test the more specific SG stats.
In case you’re not sure how golf works, the first shot you hit on a hole is off the tee. And in this case, the SG stat measures performance using all tee shots on par 4s and par 5s.
For any particular hole, the SG Off-the-Tee is calculated by taking the hole’s scoring average and subtracting the overall average shots into the hole from where your tee shot ends up, and then subtracting 1 to account for the shot that you already took off the tee. Again, I highly suggest checking out this article that walks you through the calculations for an actual hole that Rickie Fowler played.
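To make that formula concrete, here is a minimal sketch. The baseline numbers (4.1 shots from the tee, 2.9 from the fairway, 3.3 from the rough) are invented for illustration and are not real PGA Tour baselines, and the function name is my own.

```python
def strokes_gained(avg_before, avg_after, shots_taken=1):
    """Strokes gained for a shot: expected shots before, minus expected
    shots after, minus the shot(s) actually taken."""
    return avg_before - avg_after - shots_taken


# Hypothetical baselines, NOT real Tour averages.
hole_average_from_tee = 4.1  # average shots to hole out from the tee
avg_from_fairway = 2.9       # average shots to hole out after a good drive
avg_from_rough = 3.3         # average shots to hole out after a poor drive

good_drive = strokes_gained(hole_average_from_tee, avg_from_fairway)
poor_drive = strokes_gained(hole_average_from_tee, avg_from_rough)

print(f"Good drive: {good_drive:+.1f}")
print(f"Poor drive: {poor_drive:+.1f}")
```

With these made-up baselines, the good drive gains a fraction of a stroke on the field and the poor one loses about the same amount.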
As you can see it’s much more randomly distributed than SG Total. Instead of a very straight correlated scatter, this looks more oval-ish. But this is to be expected, since being great off the tee doesn’t necessarily mean that you’re overall great at shooting low scores.
The r-squared value here is much lower at 0.2236, but that still implies a decent correlation. The other super impressive thing about this regression is that if you have 0 strokes gained off the tee, meaning you are exactly average off the tee, your predicted scoring average is 70.962, which is also dead on with the average scoring average.
In this case, the slope of the fitted line is ~0.86, meaning that if your strokes gained for all the other stats stays the same, but your SG Off the Tee improves by a shot, our guess for your scoring average would drop by only 0.86 strokes.
Moving onward to the next shot on a hole, the approach.
Strokes Gained: Approach-the-Green takes into account shots that are more than 30 yards from the green, and excludes tee shots on par 4s and par 5s, as talked about above. It does include tee shots on par 3s, considering all of those shots are approaching the green.
And look at that! The r-squared value for this correlation is 0.4301! Meaning it’s more “accurate” than strokes gained off the tee, which had an r-squared value of .2236. The graph visually shows this improved accuracy too – it’s more of a condensed group than random dots above.
With 0 strokes gained approaching the green, your scoring average is predicted to be 71.007, which again is dead on. The slope of the fitted line shows that if you gain a stroke approaching the green, your scoring average is predicted to drop by 1.14 strokes.
According to PGA Tour, “Around-the-Green measures player performance on any shot within 30 yards of the edge of the green. This statistic does not include any shots taken on the putting green.”
R-squared here is ~0.124, which is less than the first two, meaning that knowing how good you are around the green doesn’t necessarily imply that you’re better overall.
Again, 0 SG: Around-the-Green predicts a scoring average of 70.946, which is fantastically close to the average. In this case, if you gain or lose 1 stroke, your predicted scoring average moves by 1.028, which is almost exactly one for one!
Another interesting outcome is that no player gains or loses more than a stroke Around-the-Green! It doesn’t mean that strokes gained around the greens isn’t important, it’s just that it’s tough to lose a bunch of shots even if you have a bad short game. So going back to our example of Daniel Berger in the beginning, even though his SG Around-the-Green is his lowest of the four, it doesn’t affect his scoring average all that much.
Last thing here is how good you are at putting, and how many strokes you gained by making buckets.
R-squared is 0.1536, which is still a pretty good prediction: better than around the green, but lower than off the tee and approach.
Scoring average with 0 strokes gained is a dead on 70.937, and the one stroke gained scoring average difference is 0.747, meaning that if you increase your SG putting by one, you only gain 0.747 strokes overall.
Also, based on the graph, the max strokes gained is just over 1, but the max strokes lost is about 1.5 (an SG of -1.5). It’s somewhat tough to gain a bunch of strokes from putting, but easier to lose strokes if you’re terrible at putting. I always say this, but knowing your speed putting is key. Don’t 3 putt.
Predicting Scoring Averages
Finally, I want to show something about how it’s possible to predict scoring averages more accurately using the strokes gained stats. Instead of just checking the correlation of the variables, I ran the code that is used to predict future data using training data. In this case, I took all of the SG: Total and Scoring Average stats from 2013-2016, used 80% as training data, and then 20% as testing data. It’s the same algorithm, but in this case, also shows that the fit can be used for future values.
Same very high r-squared value as when fitting on all the stats! Great to see the similarities.
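Here is a minimal sketch of that 80/20 idea in plain Python. It is not the code I ran: the data below is synthetic, generated around the fitted line quoted earlier (intercept 71.079, slope -0.96) with assumed spreads of 0.6 for SG: Total and 0.16 for the noise, and the helper names are made up for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one input; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Synthetic (SG: Total, scoring average) pairs, NOT real Tour data.
rng = random.Random(0)
players = []
for _ in range(500):
    sg = rng.gauss(0.0, 0.6)                             # assumed spread
    score = 71.079 - 0.96 * sg + rng.gauss(0.0, 0.16)    # assumed noise
    players.append((sg, score))

# Shuffle, then keep 80% for training and 20% for testing.
rng.shuffle(players)
split = int(0.8 * len(players))
train, test = players[:split], players[split:]

# Fit on the training rows only, then score on rows the fit never saw.
slope, intercept = fit_line([sg for sg, _ in train], [s for _, s in train])
test_predictions = [intercept + slope * sg for sg, _ in test]
test_r2 = r2_score([s for _, s in test], test_predictions)

print(f"slope={slope:.3f} intercept={intercept:.3f} test r-squared={test_r2:.3f}")
```

The point of scoring on the held-out 20% is that a high r-squared there can't come from the line memorizing the points it was fitted on.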
Another way to show the similarities here, and to show how great linear regression is, here’s the graph of predicting SG Total, using all the other SG stats. As you can see, it’s pretty easy to figure that out considering the definition of SG Total is just adding up the values for the other SG stats.
The coefficients calculated by the regression are… you guessed it… 1! Since SG Total is just adding up the SG values for the other stats. Always good to check that you’re programming correctly by running an algorithm you know the result of.
And in case you’re curious about the different coefficients of the stats, and the data for the line, here you go. Sometimes, machine learning is dead on accurate.
- SG: Off-the-Tee coef: 1.00008532348
- SG: Approach-the-Green coef: 1.00014159888
- SG: Around-the-Green coef: 1.00008220847
- SG: Putting coef: 1.00004283695
- Intercept: 6.20688007397e-06
- Mean squared error: 0.00
- Variance score: 1.00
- R: 0.999999496465
- R^2: 0.99999899293
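The same sanity check can be sketched without any libraries. This is not the code that produced the numbers above: the four SG columns are random made-up values, the total is rounded to three decimals to mimic published figures (an assumption of this sketch), and the helper names are hypothetical. The fit solves the least-squares normal equations directly.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Least squares with an intercept; returns [intercept, coef_1, ..., coef_k]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)


# Made-up SG components for 200 imaginary players, NOT real Tour data.
rng = random.Random(1)
rows = [[rng.gauss(0.0, 0.4) for _ in range(4)] for _ in range(200)]

# SG: Total is the sum of the four parts, rounded like a published stat (assumption).
targets = [round(sum(row), 3) for row in rows]

intercept, *coefs = fit_linear(rows, targets)
print("coefficients:", [round(c, 4) for c in coefs])
print("intercept:", round(intercept, 6))
```

All four coefficients should land right next to 1 and the intercept right next to 0, which is the behavior you want to see before trusting the same code on a harder question.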
How Correlated are the different SG stats?
Remember how I kept mentioning the slopes of the fitted lines for the specific SG stats? Here’s the list again:
- SG: Off the Tee = 0.86
- SG: Approach-the-Green = 1.14
- SG: Around-the-Green = 1.028
- SG: Putting = 0.747
- SG: Total = 0.96
The question here is why SG: Putting is low, and SG: Approach-the-Green is high? Again, this means that if a player’s SG: Putting increases by a shot, our predicted change in their scoring average is only 0.747 shots lower. But their SG: Total will increase by 1 and with that number, our guess is that their scoring average will decrease by 0.96 shots.
If a player’s SG: Approach-the-Green increases by a shot, we predict their scoring average will drop by 1.14 shots. Again, in total we’ll see their scoring average drop by only 0.96 shots.
Where do the other shots come from?
The quest here is to check the correlation of the SG stats to each other. For example, if a player starts stuffing approach shots, but gets worse at making those putts, then the scoring average will balance out. Let’s quickly look at the correlation between the different SG stats. As you can see, they’re barely correlated at all.
Look at these – they’re blobs! It’s true, some look like they’re slightly correlated. For example, SG: Off the Tee and SG: Around-the-Green do look somewhat correlated, such that if your SG: Around-the-Green increases, your SG: Off the Tee might drop.
But look at those incredibly small r-squared values! The largest one is ~0.07 (SG: Off the Tee vs. SG: Around-the-Green), which is so small that the apparent relationship cannot be trusted! You cannot predict a player’s SG in the other stats just by knowing their SG in one of them. Some players are good at one thing, others are good at another, and the SG stats are separated into groups well enough that they’re not correlated.
The biggest thing here is how impressively Strokes Gained correlates with scoring average. Based on the slope of the line from regression, if a player has 1 SG: Total, their scoring average will be 0.96 shots lower. For a fitted regression line, this is very accurate. The developers were able to take specific data of each shot every player hits and turn that into something that better represents a player’s skill for different parts of the course, rather than just relying on the simple scoring average stat. Scoring average does do a pretty good job at quantifying a player’s skill, but having more refined data is fantastic.
For me, the most interesting thing about this analysis is how strongly SG: Approach-the-Green is correlated with scoring average. It has the largest r-squared value of the four components as well as the largest slope of the fitted line, which suggests that improving your approach shots will lower your scoring average the most compared to improving on the other stats. Just remember, correlation does not imply causation. For example, if a player is great at stuffing approach shots, that probably means his swing is good enough to carry over to the driver as well. So if you improve your SG: Approach-the-Green, I’d guess your SG Off the Tee also improves.
The main reason that the stats aren’t dead on for predicting scoring average is that the scoring average from each distance on the course is difficult to nail down. Course conditions, weather conditions, and unlucky lies in the rough aren’t able to be used in the averages. I assume there’s some way to use that information to change the averages. For example, possibly looking at the wind speed, seeing how the scoring average changes per speed, and using that to adjust the distance scoring averages. But considering how refined that can get, it’s just not a good way to run numerical predictions.
Finally, the last thing I’d be curious about is seeing how many shots a player has from the certain locations. With the scoring average of 71, we can say that 14 shots come from off the tee, probably something like 18 approaching the green, but as for the others, it’s not exactly certain. I’d be curious to see if there’s a correlation of the SG stats and the number of shots a player takes from the different buckets.
That’s it for part 1! Again, any suggestions on what to analyze going forward, follow and hit me up on Twitter.