Friday, July 17, 2020

Predicting NBA Team Scoring Rates with Splits

Using splits when analyzing sports is, generally, something that should be done with caution. Entire seasons' worth of player and team performance are subject to noise, so taking that data, breaking it into specific splits, and drawing conclusions leads to even noisier results.

With that being said, I am going to see if I can use team scoring rate splits to better predict next year's overall scoring efficiency. The impetus behind my idea was that I thought certain splits may better isolate how effective an NBA offense is at scoring efficiently. For example, one of the splits I looked at was scoring efficiency on possessions that started with a dead ball. The theory is that by looking at just these possessions, I could get more signal about future offensive efficiency because this split is not reliant on the opponent missing a shot and the location of the rebound after the missed shot. The issue is that, in general, teams have only about one fourth to one third as many possessions following a dead ball as possessions that start after an opponent miss.

I looked at how all-situations scoring efficiency in year N can be predicted by scoring efficiency in all situations, off of a miss, after a dead ball, and after a steal in year N-1. I included scoring efficiency after steals to show how offensive value generated off of steals is not something teams should count on year over year, since it is highly dependent on the opponent and transition efficiency in general is noisy. I made five simple linear models to predict year N all-situations offensive efficiency (measured in points per 100 possessions): scoring efficiency in all situations in the prior year adjusted for minutes continuity between years N and N-1, scoring efficiency after opponent misses in the prior year adjusted for minutes continuity between years N and N-1, scoring efficiency after dead balls in the prior year adjusted for minutes continuity between years N and N-1, a model using all three of the splits adjusted for minutes continuity between years N and N-1, and a model to look at the change in overall scoring efficiency with just minutes continuity between years N and N-1. The minutes continuity data can be found here.
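For readers who want to try something similar, here is a minimal sketch of one of these models in plain Python. It assumes that "adjusted for minutes continuity" means including continuity as a second predictor, which is only one plausible reading. The function names and the team-season numbers are made up for illustration and are not the data behind this post.

```python
def fit_ols(rows, y):
    """Least-squares coefficients for y ~ intercept + predictors (normal equations)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def adjusted_r2(rows, y, beta):
    """Adjusted R squared: penalizes R squared for the number of predictors."""
    n, p = len(y), len(beta) - 1
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up rows: (prior-year points per 100 possessions, minutes continuity).
rows = [(108.2, 0.72), (111.5, 0.55), (105.9, 0.81),
        (113.0, 0.64), (109.7, 0.90), (107.1, 0.48)]
next_year = [109.0, 110.1, 106.8, 111.9, 110.3, 107.5]

beta = fit_ols(rows, next_year)
print(beta, adjusted_r2(rows, next_year, beta))
```

Swapping the first column for a split (efficiency after misses, after dead balls) gives the other single-split models under the same assumption.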

The following represents the adjusted R squared values for each linear model:
Of all the models, just using the prior year overall scoring efficiency is most predictive followed by the mixed split model and the model using efficiency after opponent misses. First, my assumption that performance after dead balls might give more signal into a team's "true" offensive efficiency seems to not hold much weight. That model was barely more predictive than the model using scoring effectiveness after steals. I would attribute this to the sample issues I alluded to above, where there were only about one fourth to one third of the possessions after dead balls relative to those after a miss for a given team season. Furthermore, not breaking the data down into splits proves to be the best way to handicap a team's offensive efficiency in the following year. Now keep in mind, this is a very simple way to approach such handicapping; the R squared between all situations scoring efficiency in years N and N-1 is still only 0.38 after adjusting for minutes continuity. What is missing in this analysis is a more granular look at minutes continuity; the way I accounted for minutes continuity does not account for the quality of the players nor does it account for changes in player ability by means of aging. This manifests itself in the model using just simple roster continuity having an adjusted R squared of just 0.012. Incorporating player value and performance (through one or a blend of the many RAPM variants out in the public) and player aging would yield a better correlation.

What can be gleaned from this study? This is a reaffirmation that using splits to predict team performance is not a worthwhile endeavor. Splits can be useful for describing team and player past performance, but their contributions to past performance (either positive or negative) are noise for the purposes of prediction and should be regressed heavily towards the mean.



Wednesday, July 15, 2020

Considering a Nolan Arenado Trade

Over at Pinstripe Alley readers were asked to come up with a post about possible Yankees trade targets, including Nolan Arenado, a player general manager Brian Cashman has been rumored to covet. In a vacuum, I would think that any fan of the team would love to have Arenado don pinstripes. The issue is baseball players do not exist in a vacuum; players compete on the diamond and in return receive an agreed upon salary. The issue with Arenado is two-fold: his contract is large compared to most of his peers (he has been guaranteed the sixth largest contract amongst active contracts, per Cot's Contracts) and he can opt out of his contract after the 2021 season. The opt-out clause makes valuing Arenado in a trade tricky. With that being said, I wanted to try to find a fair way to value Arenado and construct a deal that would potentially make sense for the Yankees, if possible.


As I alluded to above, Arenado is guaranteed a lot of money. To be precise, from 2020 through 2026 he is set to receive 234 million dollars from the Rockies, per RosterResource. Now, using the methodology for valuing players I described here (a quick summation if you want to avoid the link: I use a simple aging curve combined with a Monte Carlo simulation of a player's WAR total and cost of a win in free agency to get a distribution of possible dollar figures for a player) and using his Depth Charts win projection of about 4.86 WAR this coming year, I projected Arenado to be worth about 265.57 million dollars over the next seven seasons, so about 31.57 million dollars in surplus over seven years. The following is a distribution of his seven-year value, in millions of dollars:

The column headers correspond to the percentiles. 95 percent of the time, he is expected to be worth at least 113.09 million dollars. Overall, Arenado’s contract is reasonable, without considering the opt-out clause.
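For a sense of how a simulation like this fits together, here is a rough sketch in Python. Every parameter in it (dollars per win, the yearly decline, the spread of outcomes, the number of simulations) is a placeholder assumption rather than the setting used for the figures in this post, so it will not reproduce the numbers above. The function names are also just for illustration.

```python
import random

def simulate_contract_value(war_now, years, dollars_per_win=9.0,
                            yearly_decline=0.5, war_sd=1.5,
                            n_sims=10_000, seed=0):
    """Return sorted simulated contract values in millions of dollars.

    All defaults are placeholder assumptions, not the post's settings.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        total = 0.0
        for year in range(years):
            # Simple linear aging curve: lose a fixed amount of WAR each year.
            projected_war = war_now - yearly_decline * year
            # Draw one possible WAR outcome around the projection.
            simulated_war = rng.gauss(projected_war, war_sd)
            total += simulated_war * dollars_per_win
        totals.append(total)
    return sorted(totals)

def percentile(sorted_values, pct):
    """Nearest-rank style percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(pct / 100 * len(sorted_values)))
    return sorted_values[index]

values = simulate_contract_value(war_now=4.86, years=7)
for pct in (5, 25, 50, 75, 95):
    print(pct, round(percentile(values, pct), 1))
```

Reading the 5th percentile of the output is what gives a statement of the form "95 percent of the time, he is expected to be worth at least X."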

I should note, if any of you may have forgotten, we are living in the middle of a global pandemic and the structure of the MLB season has been substantially altered to the tune of 60 games. As a result, Arenado will be paid a prorated share of his 35 million dollar salary, which comes out to 12.96 million dollars. So, shave off 22.04 million dollars from 234 million and about 26.01 million dollars from his 50th percentile outcome. From Craig Edwards' research into prospect value, Arenado would be worth either a 55 FV pitcher or 50 FV position player without the opt-out clause, or some combination of players that would yield a similar result if the Rockies value quantity over quality.
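The prorated salary figures above are straightforward arithmetic, assuming a standard 162-game season as the denominator. A small helper (the name is just for illustration) makes the calculation explicit:

```python
def prorated_salary(full_salary, games_played, full_season_games=162):
    """Share of a full-season salary earned over a shortened schedule."""
    return full_salary * games_played / full_season_games

full = 35.0  # millions of dollars
shortened = prorated_salary(full, 60)
print(round(shortened, 2))         # 12.96
print(round(full - shortened, 2))  # 22.04
```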

Without the opt-out clause, I would end the analysis here. But the option is arguably the most important and valuable component of the contract from Arenado’s side. Before I put a dollar value on the option, let’s first look at Arenado’s value assuming he opts out after the 2021 season. Under normal circumstances he would be due 70 million dollars over the next two years. Using the methodology from above, here is a snapshot of his distribution of possible outcomes:

On average, he would be worth 7.06 million dollars in surplus, or about a 45+ FV prospect for two seasons. Given Arenado’s stature as a player, this is surprising and speaks volumes about how well Arenado and his representation made out in their negotiation with the Rockies. Like we saw above, we need to consider the circumstances of the pandemic and the fact that if the Yankees were to trade for Arenado at the deadline on August 31st, they would only get him for about 30 games plus a potential playoff run. When accounting for his prorated salary over 30 games and his full salary next season, he is owed 41.48 million dollars over two years. The snapshot of his possible value distribution is as follows:
On average he will provide just a bit more than one million dollars in surplus. If the Yankees were to operate under the assumption that he will opt out with 100 percent certainty, it would be hard to argue they should give up anything of substance to bring Arenado into the fold. Now, this fails to consider that the value of WAR is not linear and the value of the great players is higher because concentrating WAR into one roster spot gives teams the opportunity to be flexible with the rest of the roster, but I did not think it was worth getting any more into the weeds.

The real issue underlying all of this is the team cannot value Arenado as if he will definitely play out all seven years under contract nor can it assume he will definitely opt out. Given the economic uncertainty brought on by the pandemic, I would tend to think Arenado would be more likely to opt in to the remaining five years on the deal. But let’s assume that the free agent market does not change as a result of the pandemic (which one might say is a foolish assumption, but I think provides a more interesting thought exercise). Let’s also assume that Arenado will opt out of his deal if he is projected to be worth more than the 164 million dollars he is owed over the final five years of the deal starting at age 31. Using my Monte Carlo simulation function and assuming his projection is about half a win lower at age 31 than it is at age 29, Arenado would be worth at least 164 million dollars about 57 percent of the time. There are two ways to account for this in the valuation of the contract. The first is to look at how much he is worth when he opts in, account for that in the valuation, and take a weighted average of his surplus value when he opts in and out, where the weights are the percent chance he makes each decision. Using this methodology, if he opts in he would be worth on average 129.26 million over five years, putting the contract 34.74 million dollars underwater at the end. Remember, over the first two years with the pandemic, he is worth 1.31 million in surplus. Taking the weighted average of the two figures yields a valuation of his contract being worth 14.15 million in the red.
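The weighted average described above can be written out directly. This sketch plugs in the rounded 57 percent opt-out probability, so it lands a few hundredths away from the 14.15 million quoted; the gap is plausibly down to the unrounded probability being used in the original calculation, though that is an assumption.

```python
def expected_surplus(p_opt_out, surplus_if_out, surplus_if_in):
    """Probability-weighted surplus across the opt-out and opt-in scenarios."""
    return p_opt_out * surplus_if_out + (1 - p_opt_out) * surplus_if_in

surplus_in = 129.26 - 164.0   # value minus salary owed if he opts in
value = expected_surplus(0.57, 1.31, surplus_in)
print(round(value, 2))  # -14.19
```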

The other way of accounting for the opt out clause is to isolate the value of the clause on Arenado’s end of the deal and add it to the total money guaranteed to get the value of the combination of the opt out clause and the guaranteed salary. When Arenado opts out of the deal, he is worth on average 208.14 million dollars. Since he opts out 57 percent of the time, you can value the opt out as the additional money he would make by hitting the open market multiplied by the percent chance he goes to the market. Since he would be worth 208.14 million over the five years, we will assume he would receive that amount on a contract. This would be a net gain of 44.14 million dollars (164 subtracted from 208.14). Multiply 44.14 by 57 percent and you get 25.16 million dollars. Add the 25.16 million to the 234 million he is owed and you get a seven-year value of 259.16 million dollars. Remember from the beginning, he is projected to be worth 265.18 million dollars over the seven-year period, not accounting for the pandemic, so the contract is worth 5.99 million in surplus, which is microscopic relative to the size of the deal. Obviously, I am not considering the pandemic in this scenario but not much changes when you do so. When accounting for his smaller prorated salary this year and the value he will provide in the shortened season, the total surplus does not change much.
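The option-value arithmetic in this second approach can be checked in a few lines, using only the figures quoted above (the function name is just for illustration):

```python
def opt_out_value(market_value, owed_if_staying, p_opt_out):
    """Expected extra money from the option: market gain times chance of opting out."""
    return (market_value - owed_if_staying) * p_opt_out

option = opt_out_value(208.14, 164.0, 0.57)
total = 234.0 + option   # guaranteed money plus the value of the option
print(round(option, 2))  # 25.16
print(round(total, 2))   # 259.16
```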

So, what did we learn from all of this? Well, Arenado and his agency (he was represented by Wasserman Media Group) did great work in negotiating this deal. They got him an early opt out, which gives Arenado the chance to hit the free agent market if the Rockies do not move in a direction he is comfortable with, and the opt out makes it difficult for the Rockies to move him for a decent return. When I consider the opt out clause and the probability that he exercises it, his contract is either a slight negative or a slight positive on the team end. Given the player-friendly nature of the deal (which I do not mean to paint as a negative in general, just a negative when looking to trade for Arenado; I love seeing players going out and negotiating deals in their favor) and the uncertainty derived from the player option, I would be wary of dealing anything of value for Arenado. If the Yankees are truly interested in his services, I would rather see the team wait a couple of seasons and try its luck in free agency.


Saturday, July 11, 2020

Can Players Drive Goals Above Expected?

The main pillar of hockey analysis in the year 2020 is expected goals, a concept first introduced in soccer that eventually made its way over to hockey. Expected goals attempts to gauge the probability that a shot attempt finds the back of the net. A shot attempt with an expected goal value of 0.09 should be expected to be converted 9% of the time, assuming that the expected goals model is well calibrated. The driving forces behind all expected goal models are the x and y coordinates of the shot attempt, which can be found in the play-by-play data from the NHL API. This should be intuitive: the closer a shot is to both the goal line and the center of the ice, the greater the chance that shot will be converted. Now, the shot coordinates are not the only variables. Analysts, such as those at Evolving-Hockey, MoneyPuck, and Natural Stat Trick, have incorporated variables such as the handedness of the shooter, the time between successive shots, how long players have been on the ice, etc. I would suggest going to this link and reading about the model at Evolving-Hockey for some more background. Included in that write-up is a short history of expected goal models.
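To make the idea concrete: summing the expected goal values over a set of shots gives the number of goals a well-calibrated model would expect, and subtracting that from the goals actually scored gives goals above expected. The shot values below are invented for illustration and do not come from any real model.

```python
def goals_above_expected(shots):
    """shots: list of (expected_goal_value, scored) pairs."""
    expected = sum(xg for xg, _ in shots)
    actual = sum(1 for _, scored in shots if scored)
    return actual - expected

# Hypothetical shot attempts: (xG, whether it went in).
shots = [(0.09, False), (0.31, True), (0.04, False), (0.12, False), (0.22, True)]
print(round(goals_above_expected(shots), 2))  # 1.22
```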

The premise of using expected goals to evaluate players is that players do not have much control over their individual shooting percentage, their teammates' shooting percentage, or their on-ice save percentage. This is not totally true (forwards, especially compared to defensemen, have some control over the two former figures, although these percentages should still be heavily regressed towards the mean), but it is mostly true, so knowing where the shots are coming from and the nature of the shots should be more valuable than either goals for and against or even just shot attempts for and against.

Now, as I alluded to above, there is some evidence that players can have some, albeit little, ownership over some of the percentages that are mainly chalked up to luck in hockey analyst circles. The analysts over at MoneyPuck have done work to gauge shooting talent using goals above expectation and Bayesian statistics. I wanted to see if I could find any signal in on-ice goals above expectation. To do so, I looked at the differences between actual and expected goal Regularized Adjusted Plus-Minus (RAPM), which can be found at Evolving-Hockey. RAPM is derived from linear regression, where the target variable (in our case goals and expected goals) is regressed against the skaters on the ice, the score and strength state, zone starts, whether or not the offensive team played a back-to-back, and whether or not the offensive team was the home team. RAPM looks to isolate each player's individual impact on his team's target variable differential.
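As a toy illustration of the regression behind RAPM, here is a sketch that assumes a ridge (L2) penalty as the regularization and strips the design down to skaters only: +1 if a skater is on the ice for the team attacking, -1 if he is defending, 0 otherwise. The context variables listed above are left out, and the shifts, targets, and penalty strength are made up, so this is not Evolving-Hockey's actual model.

```python
def ridge_fit(X, y, lam):
    """Solve (X'X + lam * I) beta = X'y with Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical shifts for three skaters (columns A, B, C).
X = [[1, -1, 0], [0, 1, -1], [1, 0, -1], [-1, 1, 0]]
y = [0.5, -0.2, 0.4, -0.3]   # made-up target rates for each shift

print(ridge_fit(X, y, lam=1.0))
```

The penalty is what keeps the estimates stable when the same skaters share the ice often; the third shift here is the sum of the first two, so the unpenalized system would not have a unique solution.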

For my analysis, the variable of interest was the difference between actual and expected goal RAPM per 60 minutes of even strength play, which I will refer to as RAPM delta for the remainder of the post. The idea is to figure out whether differences in the two RAPM figures can be considered meaningful for a player and then attribute that difference to the ability to drive goals by means of skills not captured in expected goals (mainly individual shooting talent and buoying teammate shooting talent).

I took every skater season pair from the 2007-2008 season through 2019-2020 where the skater played at least 50 minutes in each season at even strength (an example of a season pair is Connor McDavid from 2018-2019 to 2019-2020). First I wanted to get a sense of the stickiness of RAPM delta year over year.
This is almost completely noise. With weights corresponding to the average time on ice between the two seasons in the pair, this yields an R-squared of about 0.01. As I alluded to above, deviations from the mean in shooting talent or on-ice percentages need to be regressed heavily towards the mean, even for multi-year samples. So the fact that RAPM delta in year N effectively does not explain anything about RAPM delta in year N+1 is not surprising. To account for the need of a heavy dose of regression, I put together a similar linear regression, but instead of using years N and N+1 weighted by average time on ice, I used career RAPM delta up to year N and RAPM delta in year N, weighted by career time on ice up to year N. 
There are fewer data points here because I am just looking at non-rookie skaters from the 2019-2020 season. Surprisingly, there is not much more signal here compared to when I looked at just prior year RAPM delta. This regression yielded an R-squared of about 0.027, still almost entirely noise.
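The R-squared of a weighted regression with a single predictor, like the two time-on-ice-weighted regressions above, is the square of the weighted correlation between the two variables. A minimal sketch, with a hypothetical function name:

```python
def weighted_r_squared(x, y, w):
    """R squared of a weighted least-squares fit of y on x (with intercept)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy ** 2 / (sxx * syy)
```

Here `x` would hold RAPM delta in year N (or career RAPM delta up to year N), `y` the RAPM delta being predicted, and `w` the time-on-ice weights.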

My last resort was to look at all of the players in my data set. I created a model where the target variable was RAPM delta. Instead of using any prior RAPM deltas, however, I created a series of dummy variables for each player with a season pair in the data set and regressed RAPM delta on all of these variables (1645 in total).
Now we see some signal. This regression resulted in an R-squared of 0.2458, a much stronger correlation than the previous two models. The coefficients in front of each dummy variable represent the estimated effect of each player on goals above expectation per 60 minutes. The following is the distribution of those coefficients:
The distribution looks normal, as was expected. The fact that it is centered at about -0.05 goals per 60 minutes can be attributed to expected goals slightly underestimating actual goals. 
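One way to build intuition for the dummy-variable regression: if it is fit by unweighted ordinary least squares with one dummy per player and no intercept, each coefficient is simply that player's average RAPM delta across his seasons. Whether the regression here was set up exactly that way (intercept, weights) is an assumption, and the names and values below are invented.

```python
from collections import defaultdict

def player_effects(observations):
    """observations: list of (player, rapm_delta) pairs, one per season."""
    totals = defaultdict(lambda: [0.0, 0])
    for player, delta in observations:
        totals[player][0] += delta
        totals[player][1] += 1
    # Per-player mean, which equals the OLS dummy coefficient in this setup.
    return {player: s / n for player, (s, n) in totals.items()}

# Hypothetical skater seasons.
seasons = [("Skater A", 0.20), ("Skater A", 0.00), ("Skater B", -0.10),
           ("Skater B", -0.30), ("Skater B", -0.20)]
print(player_effects(seasons))
```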

So, there appears to be some merit to players being able to drive on ice goals above expected. Given the research done on player shooting talent over at MoneyPuck, I think this type of behavior can manifest itself at the on-ice level, but probably not as strongly as on the individual level (given the confluence of variables at the on-ice level). The next step would be to try to nail down an accurate figure or reasonable possible distribution of figures for each individual player.

I will leave you with a table with some estimates of player RAPM delta (i.e. the coefficients from the regression). I took the top ten and bottom ten RAPM delta impacts for players who played at least 1,000 even strength minutes this year. I would take these with a heavy grain of salt: there is still not an especially strong correlation between the model predictions and actual results, and I did no further regression to the mean for the uncertainty of players who do not have a lot of career minutes.



Wednesday, July 8, 2020

Statcast Aging Curves: Looking at How Hitter Exit Velocity Behavior Changes

Recently I looked at aging curves in the NHL. In my last couple of posts, I used the delta method to determine how forward skills age and how overall player value changes. Now, I am moving to baseball and, more specifically, exit velocity aging curves from the Statcast data. Since 2015, MLBAM has made some of the data recorded by TrackMan (and in the future Hawk-Eye) available to the public through its website Baseball Savant.

I wanted to see how hitter batted ball profiles aged, so I pulled a few batted ball metrics from the leaderboards and developed an aging pattern. In this study, I weighted each delta by the average number of batted balls between two player seasons.
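A small sketch of that weighting, pairing each hitter's consecutive seasons and averaging the changes by age. The data layout and the numbers are made up for illustration; the bucket is labeled by the second age in each pair.

```python
from collections import defaultdict

def weighted_deltas_by_age(seasons):
    """seasons: {(player, age): (metric, batted_balls)}.

    Returns {age: weighted mean change in metric from age - 1 to age}.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for (player, age), (metric, batted_balls) in seasons.items():
        following = seasons.get((player, age + 1))
        if following is None:
            continue  # no consecutive season, so no pair
        # Weight is the average number of batted balls across the two seasons.
        weight = (batted_balls + following[1]) / 2
        bucket = sums[age + 1]
        bucket[0] += (following[0] - metric) * weight
        bucket[1] += weight
    return {age: s / w for age, (s, w) in sorted(sums.items())}

# Hypothetical hitters: average exit velocity (mph) and batted balls.
seasons = {("a", 25): (90.0, 400), ("a", 26): (91.0, 200),
           ("b", 25): (88.0, 100), ("b", 26): (87.0, 300)}
print(weighted_deltas_by_age(seasons))
```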

First, the most commonly cited metric from Statcast, average exit velocity. I have some qualms with how average exit velocity is presented versus what it actually means in practice (average exit velocity gives no context to the spread of a hitter's exit velocities, so Christian Yelich and Franmil Reyes profile similarly). Nevertheless, here is how average exit velocity ages during a hitter's career:
Age, to reiterate the point I made in my first aging curve post, is the second age in a given bucket. So age 24 on the chart corresponds to the delta between age 23 and age 24. The peak is a bit later than I expected, given our understanding that peak performance generally occurs somewhere between age 24 and 27. Maybe average exit velocity over a season is not a good indicator of whether or not that season was successful, relative to that player's ability. Let us now look at how maximum exit velocity ages:
This is more in line with the traditional aging curve for player performance. For the sake of easy comparison, I put the maximum and average exit velocity curves on the same chart, along with average line-drive/fly-ball exit velocity.
Average exit velocities on balls put into the air age similarly to total average exit velocity (i.e. a less robust curve and later peak). This makes me think that maximum exit velocity is a better distillation of where a player is in his career than either total average exit velocity or air-ball exit velocity. This finding is similar to the conclusion Rob Arthur came to in a 2018 study at The Athletic (subscription required). This is despite the fact that the maximum exit velocity for a player in a given season consists of a sample size of one (because it is the maximum of a distribution) while the other two measures consider all balls put in play. Subjectively, I would think that maximum exit velocity is a better indicator of physicality than average exit velocity, the latter of which can remain relatively unchanged even if the maximum declines through the development of some of the "soft skills" of being a major league hitter (waiting for good pitches to drive, avoiding swinging at pitches outside the strike-zone, etc.).

Tuesday, July 7, 2020

NHL Skill Aging Curves: GAR Rates

This is a continuation of my study into NHL player aging patterns. In my last post I introduced aging curves and applied the concept to forward scoring and shot rates. Now I am going to apply the concept to Evolving Hockey's goals above replacement metric, which is a regression based statistic meant to capture a player's value in one single figure. It is composed of even strength and special teams contributions on both offense and defense and penalty differential (penalties drawn versus penalties taken). I used the same delta method I described in the last aging curves post and this time I looked at the aging patterns for forwards, defensemen, and goaltenders, to the last of which a separate GAR model applies. Instead of using total GAR, I used GAR per 60 minutes to avoid the issues associated with comparing and analyzing players based on cumulative metrics. Metrics where value is accumulated are affected by circumstances outside of a player's control (mainly, the coach deciding how much he is going to play). Thus, to remedy the issue and try to isolate skill, I looked at the rates at which each skater accumulated GAR. First, here are the aging curves for skaters for the three main components of GAR:
Skater offense peaks early. It only takes about a couple of years for skaters to reach their peak offensive value and once they reach that peak, they start a relatively slow decline until their late 20s, where that decline hastens. Defensive value seems to be at its peak basically right when skaters enter the league but declines much more slowly than offensive value. Value derived from drawing and avoiding taking penalties declines sharply the moment players enter the league. I believe this can be attributed to the idea that drawing penalties is mainly a product of a player's speed and physicality overwhelming opposing defenders, both attributes that are basically in decline the moment a player steps on the ice at the NHL level. When putting these components together and weighting them appropriately, one gets a total GAR rate. Here is the GAR rate aging curve for all skaters:
Unsurprisingly players peak early in their careers and decline slowly into their late 20s followed by a more rapid decline. Here are GAR rate curves split by forwards and defensemen: 
Defensemen retain their value a bit longer than their forward counterparts. Given that offensive and penalty drawing value peak earlier than defense, it is not too surprising that defensemen maintain their value longer after coming into the league, because the former two skills are more important to being a productive forward.

Finally, I created a goaltender GAR rate curve. This comes with the disclaimer that "goalies are voodoo" and both NHL teams and public analysts alike have always had trouble gauging the value of a given goaltender. With that being said, using Evolving Hockey's attempt at quantifying the value of a goaltender, here is how goalie value ages: 
Goaltenders basically begin to decline the moment they step into an NHL crease. Even with the uncertainty of valuing goaltenders, maybe teams should not be signing 30 year-old goaltenders to massive free agent contracts and long-term extensions. Just a thought and something to keep in mind with Braden Holtby and Jacob Markstrom hitting the free agent market at the conclusion of the Stanley Cup Playoffs.

NHL Skill Aging Curves: Scoring and Shot Rates

Player skills, no matter the skill or sport, are dynamic and fluctuate constantly throughout careers. Not only does a player's overall value (represented by a figure such as WAR) fluctuate as he ages, but so do his skills. The ways these skills develop over time, however, do not follow the same path. Aging curve studies have been conducted ad nauseam and can be found online (for specific examples I would look here and here). In most sports, conventional wisdom says that a player's "peak" or "prime years" come in his late 20s (so between ages 27 and 29). After further investigation, this assumption does not hold up and players actually peak earlier than we originally thought, especially in the NHL.

I looked at scoring and shot rates for NHL forwards over the last decade. While scoring and shot rates are not perfectly correlated with player value, they are measures of two skills that teams tend to pay for. Later, I will conduct a similar study with WAR. I pulled all of the information from Evolving Hockey. Both the scoring and shot rates were at even strength. To construct the curves, I used the delta method. The delta method consists of binning the changes in some measure of player performance by age grouping and averaging the changes in the statistic of choice weighted by playing time within each bin, and then looking at when those weighted averages peak across the landscape of the sport. I determined the playing time weights of each player season pair by taking the average of the playing time in each season in each bin. So, for example, when looking at Nathan MacKinnon in the age 21/22 bucket, MacKinnon's time on ice weight was the average of his even strength time on ice in his age 21 and 22 seasons, 1206.93 and 1137.47 minutes respectively.
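Here is a minimal sketch of the delta method as described above. The time-on-ice weight uses the MacKinnon figures from the example; the bucket deltas are made up. Chaining the bucket averages into a cumulative curve to locate the peak is the usual final step of the method and is assumed here.

```python
def pair_weight(toi_first, toi_second):
    """Playing-time weight for a season pair: average of the two seasons."""
    return (toi_first + toi_second) / 2

def weighted_mean_delta(pairs):
    """pairs: list of (delta, weight) for every player in one age bucket."""
    total_weight = sum(w for _, w in pairs)
    return sum(d * w for d, w in pairs) / total_weight

def build_curve(bucket_deltas):
    """Chain {age: weighted mean delta} into a cumulative aging curve."""
    curve, level = {}, 0.0
    for age in sorted(bucket_deltas):
        level += bucket_deltas[age]
        curve[age] = level
    return curve

print(round(pair_weight(1206.93, 1137.47), 2))  # 1172.2

# Hypothetical bucket averages (change in scoring rate), keyed by second age.
deltas = {22: 0.10, 23: 0.05, 24: 0.02, 25: -0.03, 26: -0.06}
curve = build_curve(deltas)
print(max(curve, key=curve.get))  # 24
```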

The following represents the aging curve for the average NHL forward's even strength scoring rates:
The age on the x-axis represents the age in the season of interest within the bin (so age 24 represents the age 23/24 bucket because for that bucket we are concerned with the scoring rates for that age relative to those at age 23). In general, NHL scoring rates for forwards peak at around 24, improve slowly before that peak, gradually decline for the rest of their 20s, and quickly decline in their 30s. This peak is much younger than previously thought and at least 3 years before a player can become an unrestricted free agent. Keep this in mind when your favorite team signs a 30 year-old star forward to a massive free agent contract.

Similarly I looked at forward shot rates and found that the peak age is even earlier than scoring rates:
The peak happens around age 22 and there is usually only small improvement if the player enters the league before that. You can basically assume that when a player comes into the league, he will be firing pucks toward the net at as high a rate as he ever will be throughout his career. 

Like I alluded to at the start, you can look at how players age with any number of statistics. While scoring and shot rates are reasonable barometers of forward productivity, more robust regression-based WAR metrics have been developed to value hockey players. In my next post I will look at how WAR and its component parts age, again using the data offered at Evolving Hockey.