Wednesday, October 28, 2020

NBA Championship Equity Based on Best Players

The Lakers' recent NBA Finals victory got me thinking about how unusually their roster was shaped. The teams that win the championship almost always have at least a couple of high-end, All-NBA type talents on their roster (the 2011 Mavericks and 2004 Pistons are exceptions, but certainly not the norm). Still, this Lakers roster seemed especially top-heavy, with LeBron James and Anthony Davis surrounded by a merely solid crop of players. According to the newest Box Plus-Minus (BPM) model available at basketball-reference (the write-up of which can be found here if you are like me and want to get into the weeds), the third best player on the Lakers was JaVale McGee, who was out of the rotation by the end of the Western Conference Finals and did not play a game in the final round. Of the players in the rotation against the Heat, the third best player during the regular season was Danny Green, who added about half a point above average per 100 possessions. McGee added 1.5; the difference between Davis (who posted a BPM of 8) and McGee was the second largest this century between the second and third best players on the NBA champion, after the gap between Wade and Bosh on the 2012 Heat. The gap between this year's Lakers and the third largest difference in the sample (incidentally between Kobe Bryant and Rick Fox on the 2001 Lakers) was the same as the difference between third and eighth on the list (which was the gap on the 2009 Lakers; the Lakers have won a lot of championships). 

It seemed like there were a lot of worthy challengers for the NBA crown this year who boasted much deeper rosters than the Lakers, including the Clippers, Bucks, Celtics, Heat (whom they vanquished in the championship round), Nuggets (whom they beat in the semifinal round), and even the Rockets (whom they beat in the second round). Yet the Lakers came out as the champion on the backs of their two superstars. We know the NBA is a star-driven league, much more so than any other sport, excluding maybe the quarterback on a football team. So it raises the question: how much championship equity does a team have going into the playoffs based on its best player and its best several players? To investigate, I took every team season since 2000 and pulled out the best player and the three best players on each team, based on BPM and a minimum of 500 minutes played in the regular season. I built two models. The first is a logistic regression where the target is whether or not a team won the championship and the only variable is the BPM of the best player. The second is another logistic regression with the same target, but the variables are the BPM of the best player, the second best player, and the third best player. 
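As a rough illustration of the setup described above (not the code behind this post), a logistic regression of "won the title" on best-player BPM can be sketched in plain Python. The data here is synthetic, the generating coefficients (-6.0 and 0.6) are made up, and the names `fit_logistic`, `predict`, `best_bpm`, and `won_title` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Return [intercept, w1, ...] fitted by gradient descent on the log loss."""
    n, k = len(rows), len(rows[0])
    # Center each column so plain gradient descent converges reasonably fast.
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    b, w = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * k
        for x, y in zip(centered, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err
            for j in range(k):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    # Undo the centering so the weights apply to raw BPM values.
    return [b - sum(wj * m for wj, m in zip(w, means))] + w

def predict(weights, x):
    """Predicted title probability for one team's feature list x."""
    return sigmoid(weights[0] + sum(wj * xj for wj, xj in zip(weights[1:], x)))

# Synthetic stand-in data: NOT the real team seasons from the post.
random.seed(0)
best_bpm = [random.uniform(0.0, 10.0) for _ in range(200)]
won_title = [1 if random.random() < sigmoid(-6.0 + 0.6 * b) else 0 for b in best_bpm]

one_player_model = fit_logistic([[b] for b in best_bpm], won_title)
```

The three-player version would be the same call with three columns per row (best, second best, and third best BPM) instead of one.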

First, I took a look at the predicted championship probabilities of the model using data from only the best player. I will note I tried incorporating other data about the best player into the model, such as usage and shooting efficiency, but it actually made the model worse (based on AIC, which weighs in-sample fit against the number of inputs). For context, here is the distribution of BPM figures for the best players on championship teams versus all other teams in the sample: 

Unsurprisingly, the teams that win championships have top players significantly better than those on the average team. The exceptions are Chauncey Billups on the 2004 Pistons, Dirk Nowitzki on the 2011 Mavericks, Kobe Bryant on the 2009 and 2010 Lakers, and Kawhi Leonard on the 2014 Spurs. Notable seasons from players on non-title winners were LeBron in 2009 and 2010 on the Cavaliers, Chris Paul on the 2009 Hornets, Steph Curry on the 2016 Warriors, Giannis Antetokounmpo on the 2020 Bucks, James Harden on the 2018 Rockets, Kevin Garnett on the 2004 Timberwolves, and Kevin Durant on the 2014 and 2015 Thunder. When you look at the output of the regression model, the best player has an outsized effect on a team's championship equity. 
Interestingly, the marginal gains associated with your best player getting a little bit better have an outsized effect on your probability of winning a title at the higher end of the player production spectrum. For example, if a team had a best player who was about one point per 100 possessions better than average and added a player in the off-season who was about 6.25 points per 100 possessions better than average, that would add about five percent of championship win probability in the average season this century (based on the model). The same championship equity would be added if the best player on a team went from about 8.75 points per 100 better than average to about 10. So on the upper end of the player ability spectrum, adding about 1.25 points is worth the same as adding 5.25 points lower down when considering just championship equity.
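The shape of the logistic curve is what produces this effect. The sketch below uses made-up coefficients (an intercept of -6.5 and a slope of 0.5, chosen only for illustration and not the fitted values from this post), so the exact probabilities will differ from the ones quoted above; the point is only that the gain per BPM point is larger at the top of the range.

```python
import math

def title_probability(best_bpm, intercept=-6.5, slope=0.5):
    """Logistic curve; the default coefficients are hypothetical."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * best_bpm)))

# Same two upgrades as in the text: +5.25 BPM at the low end, +1.25 at the high end.
low_end_gain = title_probability(6.25) - title_probability(1.0)
high_end_gain = title_probability(10.0) - title_probability(8.75)

# Title probability gained per point of BPM in each range.
gain_per_point_low = low_end_gain / 5.25
gain_per_point_high = high_end_gain / 1.25
```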

When incorporating the second and third best players into the model, the model becomes more accurate (by AIC). This is not surprising because I am incorporating more information about each team's roster. Theoretically, adding in all of the players would be the most accurate, but there would probably be significant diminishing returns after the seventh or eighth best players since rotations shorten in the playoffs. But in any sort of modelling process, there is a balance that needs to be struck between accuracy and the amount of information required to make a good prediction. A more complicated model that requires 20 inputs but adds only five percent more accuracy compared to a model with four inputs is not really a better model. Chasing marginal gains in accuracy with monumental increases in inputs is not good process. Thus, I figured for now looking at just the three best players would suffice due to its simplicity and intuitiveness. 
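For reference, AIC is 2k minus twice the log-likelihood, where k is the number of fitted parameters, and lower is better. A minimal sketch of the calculation for a win/loss target follows; the predicted probabilities and outcomes are invented purely to show how the parameter penalty can outweigh a better fit.

```python
import math

def log_likelihood(probs, outcomes):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for p, y in zip(probs, outcomes))

def aic(probs, outcomes, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood (lower is better)."""
    return 2 * n_params - 2 * log_likelihood(probs, outcomes)

# Hypothetical example: four team seasons, one champion.
outcomes = [1, 0, 0, 0]
small_model_probs = [0.4, 0.2, 0.2, 0.2]   # intercept + 1 input = 2 parameters
big_model_probs = [0.7, 0.1, 0.1, 0.1]     # intercept + 3 inputs = 4 parameters

aic_small = aic(small_model_probs, outcomes, n_params=2)
aic_big = aic(big_model_probs, outcomes, n_params=4)
```

In this made-up case the bigger model fits better but its extra parameters do not pay for themselves, so its AIC is higher. In the post's real data the three-player model came out ahead.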

With the model trained, I could generate championship probabilities based on the play of the three best players on the team, just like the model that only incorporated the best player. My initial reaction was to look at the landscape of each team's top two players and outliers in championship probability. 
The Warriors show up a few times here, as they put together one of the most dominant stretches of play in league history. Interestingly, we also see the 2018 and 2019 Rockets with James Harden and Chris Paul at the helm. In 2018 specifically, the model indicated that the Rockets actually had a better chance at the championship than the Warriors. This is based on regular season performance, and the Warriors after the Kevin Durant signing were known to loaf through games, so the results should be taken with a grain of salt. Another couple of teams that unfortunately ran into the Warriors juggernaut were the 2015 and 2016 Thunder, featuring the aforementioned Kevin Durant and Russell Westbrook. The 2001 Jazz were the last hurrah for Stockton and Malone as serious playoff contenders. I also isolated teams that were similar to this year's Lakers, where the third best player was much worse than the top two. I filtered by teams with an implied championship probability of at least 20 percent and a third best player who was at best worth 2.5 points per 100 above average. 
The "Heattles" show up here as do Durant's Thunder and the Stockton and Malone Jazz. Famously, Kevin Garnett carried a relatively barren roster (by contender standards) to the Western Conference Finals in 2004. The second and third best players on that team were Sam Cassell and Fred Hoiberg; this masterpiece from Garnett was one of the best seasons by a big-man in NBA history. Alternatively, here are some of the deepest contenders of the century (both the second and third best players were at least 3 points per 100 better than average): 

Here we see different versions of great Spurs teams, an organization known for its depth and ability to develop players for most of the Gregg Popovich tenure. The earlier Spurs teams were led by the trio of Tim Duncan, Tony Parker, and Manu Ginobili, while some of the later teams featured Kawhi Leonard and LaMarcus Aldridge. The Clippers led by Chris Paul and Blake Griffin also make an appearance. After recent renditions of the Rockets and Thunder, these Clippers were the best group to not win a title. 

When taking into account each team's championship probability every year since 2000, we can look at which teams exceeded or disappointed their expected championship output. First, all the teams that won at least one championship in the 21st century:
The Lakers lead the way in titles over expected. Some of this can be attributed to the Shaq and Kobe teams not taking the regular season seriously. The rest is attributable to the back-to-back championship teams in 2009 and 2010 not being especially strong title winners. Miami is second, and they exceeded their expectations for reasons similar to the Shaq and Kobe teams. Detroit and Toronto come out as surprisingly "lucky" by this measure (a small ratio of expected to actual titles). Toronto did not have much championship equity until they traded for Kawhi Leonard, and in his lone season in Toronto, Leonard often sat out regular season games to stay fresh. Detroit in 2004 was the worst champion by the model. Dallas had some great teams between 2000 and 2010 and, oddly, finally got a ring with Dirk when he was well past his MVP-level peak. 


As I mentioned above, the Thunder, Rockets, and Clippers were very unfortunate given the players they employed this century. Milwaukee's misfortune is concentrated in the past couple of seasons where Giannis posted some of the best seasons of the century. Minnesota's misfortune was also concentrated in a couple of seasons, 2004 and 2005. Also showing up here is Orlando (2008 through 2010), the Suns (mostly due to the Seven Seconds or Less teams) and Utah (the tail end of the Stockton and Malone connection). 
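The "titles over expected" comparison amounts to summing each franchise's model probabilities across seasons and subtracting that from its actual title count. A minimal sketch, with invented franchises and probabilities (the tuple layout and the name `titles_over_expected` are assumptions, not the post's code):

```python
from collections import defaultdict

# (franchise, season, model title probability, won title) with made-up values.
team_seasons = [
    ("Team A", 2001, 0.30, 1),
    ("Team A", 2002, 0.25, 1),
    ("Team B", 2001, 0.40, 0),
    ("Team B", 2002, 0.35, 0),
]

def titles_over_expected(rows):
    """Actual titles minus summed title probabilities, per franchise."""
    expected = defaultdict(float)
    actual = defaultdict(int)
    for franchise, _season, prob, won in rows:
        expected[franchise] += prob
        actual[franchise] += won
    return {f: actual[f] - expected[f] for f in expected}

over_expected = titles_over_expected(team_seasons)
```

Here Team A lands well above expectation (the "lucky" side) and Team B well below it (the "unfortunate" side).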

Finally, I looked at the strength of the top teams every season for the last 20 years. The model output does not consider other teams when computing championship probability. For example, the 2018 Rockets' title chances are not affected by the fact that the 2018 Warriors existed. The model was not trained on individual seasons but on the entire sample. Thus, the predicted probability is the probability of winning a championship given the players on hand in the average season in the 21st century. The issue for the 2018 Rockets is that the average NBA season does not also include a team of the Warriors' caliber. The top teams from each season have an outsized effect on the summation of title chances for all teams in a season. So when summing up all the title chances in each season, seasons over one can be considered top-heavy, while seasons under one have title contenders with worse players than the average contender. I call the sum of the probabilities the heavy weight index. The average heavy weight index in the 21st century is about one, which is not surprising. The landscape of the league, on the other hand, is noteworthy: 
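Computing the heavy weight index is just a per-season sum of the model's title probabilities. A small sketch with invented numbers (the seasons and probabilities below are hypothetical):

```python
from collections import defaultdict

def heavy_weight_index(rows):
    """Sum the model's title probabilities within each season."""
    totals = defaultdict(float)
    for season, prob in rows:
        totals[season] += prob
    return dict(totals)

# (season, model title probability) for each team; values are made up.
example_rows = [
    (2005, 0.20), (2005, 0.15), (2005, 0.25),
    (2017, 0.70), (2017, 0.50), (2017, 0.40),
]

index_by_season = heavy_weight_index(example_rows)
```

In this toy example the 2005 season sums to 0.6 (weaker than an average year at the top) and 2017 sums to 1.6 (top-heavy).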
Recent seasons, especially 2015 through 2018, were especially top-heavy. Those seasons were dominated by the Warriors, Cavaliers, Thunder, Spurs, and Rockets. These teams had some of the most impressive top-end talent the league has ever seen and just happened to exist in the same universe, so the Rockets and Thunder missed out on titles. All of these seasons had heavy weight indices of at least 1.5, so they were at least 50 percent more top-heavy than average. Seasons in the first ten years of the sample featured much less impressive talent, especially in the years where Kobe's Lakers took home titles. These seasons were almost 40 percent less top-heavy than average. With the break-up of the Thunder, Warriors, Cavaliers, and now Rockets, the league is starting to balance itself out again after some extremely top-heavy seasons. I think, subjectively, that an index a bit over one (the best teams are slightly better than the best teams in an average season, but not so dominant that they bowl over the competition) offers the best entertainment product. With many top players hitting the free agent market after the 2020-2021 season, it will be interesting to watch how the contenders sort themselves out and if any team adds talent to the extent of some of the great teams of the last 20 years.

Thursday, October 8, 2020

Addendum to Investigation of Gary Sanchez's Struggles

In my last post I talked about Gary Sanchez and his 2020 struggles at the plate. I noted that his excellent batted ball statistics persisted in 2020 and attributed his demise mainly to a bloated strikeout rate that is bound to regress and some bad luck on balls in play. After further reflection, my stance on his exceedingly low BABIP has not changed. It was one of the worst figures of the past five seasons and given how hard he hits the ball when it is put into play, I do not expect anything close to that figure to persist. Even though he hits the ball with so much authority, Sanchez may always post lower BABIP figures due to his propensity for hitting fly balls and the proportion of his hits that are home runs (home runs are not included in BABIP), but nevertheless there will be some positive regression on this front. 
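To make the home run point concrete, the standard BABIP formula removes home runs from both the numerator and the denominator: (H - HR) / (AB - K - HR + SF). A quick sketch with an invented stat line (not Sanchez's actual numbers):

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)

# Hypothetical power hitter: a third of his hits are home runs.
example_babip = babip(hits=30, home_runs=10, at_bats=150, strikeouts=60, sac_flies=1)
```

A hitter whose hits skew toward home runs loses those hits from the numerator, which is one reason a slugger can carry a lower BABIP than his contact quality might suggest.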

The strikeout rate requires more nuance. I noted a slight improvement in his approach; he improved his chase rate and swung at more pitches in the strike zone in 2020. His overall swing rate did not change much and he saw a slightly higher percentage of pitches in the strike zone. The issue was his contact rates, both inside and outside the zone. His zone contact rate declined a little bit from an already unimpressive figure, but the larger issue was that Sanchez, relative to league average, could not put the bat on the ball when he chased. This meant that even though he was chasing less, his contact rate when he did chase was so low that swings on pitches outside of the zone had an outsized negative effect on his results. I also included some analysis on the probability that this increase in strikeout rate was purely the result of variance and found there was some non-zero possibility that this was the case. I threw the improved approach, substantial negative regression in contact rates, and variance into a blender and concluded that he was bound to be much better in the strikeout department in 2021. 

Thinking more about the strikeout issue, I realized I had failed to offer context behind Sanchez's rising strikeout rate and had not looked at other players who saw large changes in strikeout rate and how they fared in later seasons. It is easy to say that any massive increase in strikeout rate should be followed by a corresponding decrease towards the player's "true talent". 

I pulled data on every set of three consecutive hitter seasons since 2015 where the hitter had at least 150 plate appearances in each season. I then calculated the changes in strikeout rate between year N and year N-1 and between year N-1 and N-2. 
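The year-over-year pairing described above can be sketched as follows. The hitter and his strikeout rates are hypothetical, and the function assumes the 150 plate appearance filter has already been applied to the input.

```python
def strikeout_rate_changes(k_rate_by_season):
    """Return (season N, change from N-2 to N-1, change from N-1 to N)
    for every run of three consecutive seasons. Rates are in percent."""
    out = []
    for n in sorted(k_rate_by_season):
        if n - 1 in k_rate_by_season and n - 2 in k_rate_by_season:
            prior = k_rate_by_season[n - 1] - k_rate_by_season[n - 2]
            current = k_rate_by_season[n] - k_rate_by_season[n - 1]
            out.append((n, prior, current))
    return out

# Hypothetical hitter with four qualifying seasons.
example_hitter = {2017: 22.0, 2018: 25.0, 2019: 24.0, 2020: 27.5}
changes = strikeout_rate_changes(example_hitter)
```

Each tuple is one point on the scatter plot: the prior change on one axis and the following change on the other.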

Positive values indicate increases in strikeout rate, which are generally bad for hitters (but not always; they could be an indication of a hitter being more selective or selling out for some power). As you can see from the direction of the trend line, increases in strikeout rate are often followed by decreases the following season. When you isolate Sanchez, the results are concerning: 
This is the same set of points with Sanchez highlighted. You do not want to find yourself on the top right of this chart, which indicates multiple seasons where your strikeout rate increased. Sanchez's 2020 is especially ugly in this regard. I will note that his 2019 season looks bad in this visualization, but in 2019 Sanchez posted his best numbers on contact. That might indicate that he was selling out for some power coming off a down 2018 season. Using the data presented above, I built a simple linear model predicting a player's change in strikeout rate. The model had two inputs: the prior change in strikeout rate and the player's age (we know changes in strikeout rate are partly a function of age). After fitting the model to the data, I wanted to look at how Sanchez looked in 2019 and 2020 relative to expectation (i.e. the output of the model). In 2019, based on changes from 2017 to 2018, we should have expected Sanchez to trim 0.57 percentage points off of his strikeout rate, while in actuality he added 3.47 percentage points. In 2020, we should have expected him to trim about 0.72 percentage points off of his strikeout rate. Instead, he added eight percentage points. His 8.72 percentage point change over expected was in the 98th percentile of the entire sample. His two-year total change is the 7th largest in baseball from 2018 to 2020. Furthermore, I built another model that gauged the probability of a player trimming his strikeout rate, based again on age and prior year strikeout change. Sanchez again sticks out. 
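The "change over expected" figures are simply the actual change minus the model's predicted change. The arithmetic below uses the numbers quoted above; the function name is a hypothetical helper, and the model's fitted coefficients are not given in the post, so only the residual step is shown.

```python
def change_over_expected(actual_change, expected_change):
    """Actual minus predicted change in strikeout rate, in percentage points."""
    return round(actual_change - expected_change, 2)

# Expected changes are negative because the model predicted a trimmed rate.
sanchez_2019 = change_over_expected(actual_change=3.47, expected_change=-0.57)
sanchez_2020 = change_over_expected(actual_change=8.0, expected_change=-0.72)
```

The 2020 value reproduces the 8.72 figure cited in the text; by the same arithmetic the 2019 value works out to 4.04.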
Player seasons in the top right quadrant are disappointing relative to expectation. In both seasons Sanchez was expected to trim his strikeout rate and he did the opposite. I will note that most of the seasons where players added to their strikeout rate much more than expected came from 2020, due to the small samples. Still, I think there is some reason to be concerned with Sanchez and his increasing strikeout rate. If he wants to return to his 2019 level, he is going to have to buck the strikeout trend. Sanchez will be 28 going into next season and has been a major league regular for about four years now when you account for the fact that he did not play full seasons in 2016 and 2020. He is probably not going to get better from a batted ball perspective at this point; exit velocity peaks in a player's mid-20s. Improvements will have to be made in his contact rates and correspondingly his strikeout rate. Whether or not he has the ability to make these improvements will largely dictate whether or not the Yankees tender him a contract going into his second year of arbitration. 

Tuesday, October 6, 2020

Gary Sanchez 2020 Struggles

Gary Sanchez has been the target of scorn among Yankee fans for a couple of years now and the distaste has only grown during the abbreviated 2020 season. Sanchez is much maligned for his subpar defense, which is always on display given the catcher is involved on every pitch. What this fails to recognize, however, is the value Sanchez brings with the bat and how that stacks up to his peers at the catching position. Since 2016 (he became a regular on August 3rd of that season), Sanchez has been the fifth most valuable catcher in baseball, per FanGraphs WAR. On offense alone he has provided the second most value while ranking just 10th in plate appearances among catchers. On defense he has been more middle-of-the-pack, but 2019 was actually his only season where he was below average with the glove, at least according to UZR and FanGraphs' catcher framing model. This is all to say I think the criticism of his play has been largely unfounded and lacks context, in that despite what fans may think he has been one of the best catchers in MLB since he became a full-time regular.

Having contextualized Sanchez's career performance, 2020 was still a disaster. Sanchez had a normalized batting line 31 percent worse than league average (based on wRC+) and struck out in 36 percent of his plate appearances, the 6th highest figure among all players who had at least 150 plate appearances. The 36 percent figure is by far the highest of his career and about eight percentage points worse than his previous high (which was 2019). This was the first time Sanchez was below replacement level over a 150 plate appearance sample in his entire career. The only stretch of play that resembled this was May of last season, when he regularly posted batting lines about 10 percent worse than league average. For Sanchez, who was about 20 percent better than average from 2016 to 2019, posting this type of line even over a tiny sample of 178 plate appearances is surprising and unlike anything we have seen from him.

The million-dollar question (and the one every member of the Yankees front office will be thinking about this off-season) is whether or not this abbreviated 2020 season is reason to panic. Making a rash decision or evaluation over 178 plate appearances, on its face, seems foolhardy. Just last season, players such as Austin Meadows, Franmil Reyes, Robinson Cano, Yuli Gurriel, and Joey Votto posted 60 game lines similar to Sanchez in 2020. Meadows was one of the best hitters in the American League, Gurriel finished with a line 26 percent better than league average, Reyes almost hit 40 home runs, Cano was great when he played this year, and Votto took a step forward in 2020 after a rough 2019 season. Good, even great hitters, are capable of having stretches like this. Someone reading carefully might counter and say that I am cherry-picking good players when in reality, the list of players with stretches similar to Sanchez's 2020 is littered with more bad players than good. That person would be correct. However, we have a demonstrably larger sample of Sanchez being a very good hitter as opposed to a below replacement-level contributor.

Still, I have not answered the original question: should we be worried about Sanchez's performance going forward? We need to address the strikeout rate and his performance on contact. Even though he made some incremental improvements in taking walks, striking out 36 percent of the time is not conducive to being an effective hitter. From 2018 through 2020, among player seasons with at least 178 plate appearances, just 11 players were able to post above average batting lines while striking out in at least 33 percent (about one third) of their plate appearances. The players that are able to overcome massive strikeout rates are among the best in the league at hitting the ball with authority. Players like Joey Gallo (2018 and 2019), Miguel Sano (2019), Brandon Lowe (2019), and Ian Happ (2018) have struck out at similar rates to Sanchez in 2020 and were above average hitters in those seasons. The list of players also includes Willy Adames (who has about league average batted ball stats but ran a 0.388 BABIP in 2020), Jake Cave (a solid fourth outfielder/AAAA guy who actually had characteristics that support high BABIPs until 2020), and Tyler Austin (the quintessential AAAA player with pop but not enough to offset strikeout woes). The common thread here (besides Adames and his outlier 2020 performance) is that these players have well above average exit velocity readings and hard hit rates (balls in play at or above 95 MPH). When they make contact, they make it count. Sanchez, unsurprisingly, did not post good enough results on contact and saw a sharp year-over-year decline in 2020.
For context, league average wOBACON (wOBA on balls in play or contact) in a given year is anywhere between 0.370 and 0.380. Expected wOBACON is based on the launch angle and exit velocity of a batted ball. The past three seasons, Sanchez has underperformed his expected wOBACON figures (though in 2020, the leaguewide xwOBACON was much smaller than wOBACON, which makes me think there was a calibration error in the model with the new HawkEye data or there was something weird with the baseball). Still, Sanchez's actual results on contact were not what we should have expected based on his batted ball characteristics and were much lower than league average and his career norms. Should we expect a player with barrel rates in the top five percent of the league in each of the past three seasons to post below-average results when he puts the ball in play? Probably not. To say he was unlucky on balls in play in 2020 is an understatement. Sanchez posted a 0.159 BABIP, the third lowest figure in the past five seasons for player seasons with at least 170 plate appearances. The only worse seasons were Edwin Encarnacion in 2020 (0.156 BABIP along with a 13.2 percent barrel rate and 33 percent hard hit rate) and Ryan Schimpf in 2017 (0.145 BABIP along with a 16.5 percent barrel rate and 30.9 percent hard hit rate). Sanchez had a 17.4 percent barrel rate (97th percentile, after posting a 99th percentile figure last year) and a 49.5 percent hard hit rate (91st percentile). Not many players hit the ball as consistently hard as Sanchez while also posting barrels (the highest value batted ball type) at similar rates. Only six players in MLB posted both a higher hard hit rate and higher barrel rate than Sanchez in 2020. 

All of this is meant to show that we should expect a healthy amount of positive regression from Sanchez in 2021 from a batted ball perspective. Players who hit the ball like Sanchez are among the elite hitters in the league. Can Sanchez get back to that level in 2021? Even if he maintains this level of performance on balls in play, he needs to trim his strikeout rate to be an all-star level contributor. Is this 36 percent strikeout rate here to stay or is he the guy who strikes out about a quarter of the time, his career rate going into the year? The classic sabermetrician in me says he should probably fall somewhere in between those two marks in 2021, if anything closer to the 25 percent because for much of his career he performed at that level. The issue is we know strikeout rate is one of the metrics that stabilizes most quickly for a hitter. But do not mistake stability for predictability. Stability indicates the amount of time (in this case plate appearances) required for a stat to be able to adequately explain a player's talent over the prior sample. Predictability is the amount of time we need for a sample of a stat to explain the stat in a future sample of the same size. So we need to diagnose whether Sanchez's strikeout woes are the product of variance or whether something has changed in his "true talent" level and we should expect strikeout rates similar to 2020 going forward. Borrowing from my methodology in a previous post, I am going to look at the probability of a player posting a 36 percent strikeout rate over 178 plate appearances by pure chance for varying levels of "true talent". I simulated the 178 plate appearance sample 1,000 times for each true talent level. The following is the distribution of strikeout rates for varying levels of "true talent" strikeout rates: 

The dashed line represents an in-sample strikeout rate of 36 percent. Even for hitters with strikeout rates in the range of 25 to 30 percent, a hitter would be expected to strike out at least 36 percent of the time fairly often. Furthermore, here is the percent of 178 plate appearance samples that showed strikeout rates exceeding 36 percent: 
A 30 percent strikeout hitter is expected to post strikeout rates at least as bad as Sanchez in 2020 about 35 percent of the time. For a 25 percent strikeout rate hitter (Sanchez's career rate going into 2020), the probability is about 15 percent. So posting a season like 2020, where Sanchez had 178 plate appearances, was definitely in the realm of possibility just by pure chance. And if you look at his approach at the plate, nothing really changed in 2020. 
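One simple way to set up this kind of simulation is sketched below. It assumes every plate appearance is an independent draw at the hitter's true strikeout rate; the post does not spell out its simulation details, so that assumption, the seed, and the function name are all hypothetical, and the shares it produces need not match the figures quoted above.

```python
import random

def share_of_samples_at_or_above(true_k_rate, plate_appearances=178,
                                 threshold=0.36, n_sims=1000, seed=0):
    """Fraction of simulated samples with a strikeout rate >= threshold."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_sims):
        # Each plate appearance is an independent strikeout/no-strikeout draw.
        strikeouts = sum(rng.random() < true_k_rate
                         for _ in range(plate_appearances))
        if strikeouts / plate_appearances >= threshold:
            count += 1
    return count / n_sims

# One share per assumed "true talent" strikeout rate.
shares = {rate: share_of_samples_at_or_above(rate)
          for rate in (0.20, 0.25, 0.30, 0.35)}
```

Under any setup of this kind, the share rises with the assumed true talent rate, which is the pattern the chart above illustrates.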
His overall swing rate barely changed from the past two seasons. He swung at slightly more pitches inside the zone while laying off more pitches outside of the zone compared to 2019. His chase rate was the best of his career. So I would argue his approach actually improved from 2019 despite much worse results. His zone contact rate was slightly down, but not by so much that you would expect a spike in strikeouts. If anything, the increase in swings in the strike zone should have offset that and allowed him to continue to put good hitter's pitches into play. Where Sanchez had a lot of problems was making contact when he chased pitches. While he has consistently posted out-of-zone whiff rates worse than league average, in 2020 Sanchez only made contact on 45.4 percent of his swings on pitches outside the strike zone, compared to a league average rate of about 60 percent. This was a large departure from his career rate going into the year (about 54 percent). The only way I can see this sustaining itself is if Sanchez is dealing with an issue seeing the ball. But if he has a vision issue, how could he have increased the number of pitches he swung at in the zone while also decreasing his chase rate? He improved at picking out balls from strikes. So the idea that he was not seeing the ball does not hold merit. I expect his contact rates to move back towards his career norms going into the season, which should come with a corresponding decrease in strikeout rate. 

When digging into the data, Sanchez's awful 2020 results at the plate seem to be the result of poor luck and being on the wrong end of variance. Yankees fans have been especially tough on him and called upon Aaron Boone to put him on the bench throughout the shortened 2020 season. Given what we know about his underlying performance and regression, I would expect Sanchez to have a bounce-back 2021 season. To believe that an all-star level contributor suddenly turned into one of the worst hitters in MLB would indicate a lack of understanding of the variance associated with outcomes in baseball and a failure to appreciate the brevity of the 2020 season and its small samples.