Saturday, February 20, 2021

The Effects of Curveball and Four-Seam Fastball Spin Axis Differential

A swinging strike is arguably the most important pitch result from the perspective of a pitcher. A strike, no matter how it is obtained, is a strike when you look at a scorecard. However, when evaluating and predicting the performance of a pitcher, swinging strike rates are vital. A swinging strike is different from a foul ball: in the case of a foul ball the batter makes contact with the pitch. A swinging strike is also different from a called strike: in the case of a called strike the batter does not swing, so we cannot know how effective the pitch would have been had he swung. The swinging strike is the outcome that gives us, as analysts, the most positive information about the pitcher. We see the batter decide to swing at the pitch and, having made that decision, miss. So while all strikes are technically equal, the swinging strike gives us the most information about the effectiveness of a pitch.

For these reasons, modelling pitches in terms of their probability of generating a swinging strike is the best way of gauging pitch quality. What goes into a swinging strike? Well, we know velocity and spin rate are essential factors. Higher velocities give batters less time to make the optimal decision. Higher spin rates allow pitches to move as much as possible given their spin axis (transverse spin and seam-shifted wakes play into pitch movement, but all else being equal spinning the ball as much as possible is beneficial, with the occasional exception of changeups and splitters; that is a topic for another time).

Once the batter makes the decision to swing at a pitch, disguising its movement is of the utmost importance. Pitchers who throw both a fastball and a curveball have the benefit of deception because the spin of the two pitches is opposite. Four-seam fastballs have close to perfect backspin (depending on the pitcher). Curveballs have close to perfect topspin (again, depending on the pitcher). To the batter, these pitches look eminently similar. The orientation of the seams is seemingly the same, but the actual spin is the opposite. Leveraging the spin characteristics of these two pitches can give pitchers an advantage in terms of deception.

With the implementation of Hawk-Eye cameras in all MLB stadiums in 2020, the spin axis of every pitch can be directly tracked. Previously, with Trackman, MLBAM inferred the spin axis from the movement of the pitch. Direct tracking gives analysts a better understanding of how pitches pair with each other and gives insight into pitches that move more than their movement-inferred spin axes would indicate.

In this post I wanted to look at the effect of spin axis differential between curveballs and four-seam fastballs on those pitches' swinging strike rates. There is a lot of research into the effect of spin axis on the effectiveness of pitches, all else being equal, but I wanted to look at how pitchers who throw both of these pitches can get an extra edge. I built a model for pitchers who throw both of these pitches that estimates how often they should generate swinging strikes.

First let us look at which pitchers generate the most spin on both four-seam fastballs and curveballs. I took all pitchers in 2020 who threw at least 100 four-seamers and 100 curveballs and compared their spin efficiencies on each pitch type (the percent of spin that contributes to transverse movement).
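
The qualification step and the efficiency measure can be sketched as follows. This is a minimal illustration, not the code behind the post: the record layout, the pitch type codes "FF" and "CU", and the function names are assumptions made for the example.

```python
from collections import Counter


def qualifying_pitchers(pitches, min_pitches=100):
    """Return pitchers with at least `min_pitches` four-seamers and curveballs.

    `pitches` is an iterable of (pitcher, pitch_type) tuples. The codes
    "FF" (four-seamer) and "CU" (curveball) are assumed for illustration.
    """
    counts = Counter(pitches)
    pitchers = {pitcher for pitcher, _ in counts}
    return sorted(
        p for p in pitchers
        if counts[(p, "FF")] >= min_pitches and counts[(p, "CU")] >= min_pitches
    )


def spin_efficiency(transverse_spin_rpm, total_spin_rpm):
    """Percent of total spin that contributes to transverse movement."""
    return 100.0 * transverse_spin_rpm / total_spin_rpm


# Hypothetical sample: pitcher "A" qualifies, pitcher "B" is one curveball short.
sample = (
    [("A", "FF")] * 120 + [("A", "CU")] * 100
    + [("B", "FF")] * 300 + [("B", "CU")] * 99
)
print(qualifying_pitchers(sample))
print(spin_efficiency(2000, 2500))
```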

There is not much of a relationship between the two quantities. Lance Lynn is notable because, despite the fact that he does not have great spin efficiency on either pitch, he is supremely effective. I will note that Lynn does not throw his curveball often and is known to manipulate the shape of his fastball. He is unusual in his ability to throw different fastballs to great effect. Shane Bieber was the best pitcher in the league in the shortened 2020 season and he shows well by these measures. Lucas Sims has elite spin rates overall but seems to have trouble generating high-end transverse movement given the ample spin he imparts on the ball. Hyun-Jin Ryu does not throw with great velocity but maintains great results through efficient spin. For further insight into spin rates and spin efficiency on four-seamers and curveballs, I direct the reader to the following visual, which gives an idea of which pitchers have the best raw pitch characteristics (without consideration of arsenal and seam-shifted wake; Bauer, Lugo, and Sims stand out here):
For all pitchers who threw 100 curveballs and four-seam fastballs, I binned the swinging strike rates of the curveballs and fastballs (the percentage of pitches that resulted in swinging strikes) by spin axis differential. I also gave context on the effectiveness of the pitchers in each bin by summarizing them by wOBA allowed. The spin axis differential should be important, but there are other factors at play, so only the qualitative importance of the spin axis differential can be derived from this visual:

The spin axis and spin axis differential are displayed in terms of the hands on a clock. For example, a curveball with perfect topspin has a spin axis of 6:00 (it moves straight down). A fastball with perfect backspin has a spin axis of 12:00 (relative to other pitches it has ride; obviously, in an absolute sense the pitch does not rise). The spin axes were taken from the perspective of the batter. I will note that this is not literally the axis around which the ball spins. It describes the axis of rotation in terms of how the ball should move from transverse spin.
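
As a rough illustration of the clock notation, the differential between two axes can be computed by converting each clock reading to minutes on a 12-hour face and subtracting. The sketch below assumes the differential is taken as fastball minus curveball, wrapped around the clock and rounded to the nearest half hour; the post does not state the exact convention, so the direction and bin width are assumptions, as are the function names.

```python
def clock_to_minutes(clock: str) -> int:
    """Convert a clock-face axis such as '1:30' to minutes past 12:00 (0-719)."""
    hours, minutes = clock.split(":")
    return (int(hours) % 12) * 60 + int(minutes)


def minutes_to_clock(total: int) -> str:
    """Convert minutes past 12:00 back to a clock-face string."""
    hours, minutes = divmod(total % 720, 60)
    return f"{12 if hours == 0 else hours}:{minutes:02d}"


def axis_differential(fastball: str, curveball: str, bin_minutes: int = 30) -> str:
    """Differential between two spin axes, binned to `bin_minutes`.

    Assumed convention: fastball axis minus curveball axis, wrapped to a
    12-hour clock face.
    """
    diff = (clock_to_minutes(fastball) - clock_to_minutes(curveball)) % 720
    binned = round(diff / bin_minutes) * bin_minutes
    return minutes_to_clock(binned)


# Perfect backspin paired with perfect topspin: a 6:00 differential.
print(axis_differential("12:00", "6:00"))
```

Under this convention, a 12:00 fastball paired with a 6:00 curveball gives 6:00, and the same holds for any pair of axes that sit directly opposite each other on the clock face.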

The visual above gives the reader a qualitative idea of the importance of spin axis differential between curveballs and four-seamers in a pitcher's arsenal. In the case of generating swinging strikes on either pitch, there is not a clear relationship. I wanted to account for the overall effectiveness of the pitchers in each bucket by including their wOBA allowed. My thought was that maybe the axis differential matters, but so do other aspects of a pitch. Maybe pitchers who happen to be really good do not get optimal axis differential on their four-seamers and curveballs and derive value via other means? It is impossible to fully discern these relationships from a chart, so I built a model to empirically evaluate the importance of the spin axis differential.

The model is a generalized additive model that takes the smoothed relationship between a pitch's velocity, location, and movement, whether or not the pitcher has the platoon advantage, and a variable for spin axis differential modelled as a random effect. The model was trained on 80 percent of the pitches thrown by the pitchers who threw 100 four-seamers and curveballs. It was then tested on the remaining 20 percent of those pitches. The model had an accuracy of 89.8 percent, in that it was correct 89.8 percent of the time when it predicted whether or not a pitch would be a swinging strike. From here I could apply the model to every pitch in the data set. There are many ways you can look at this data. I created a column in the data frame that I called expected swinging strike rate, the probability that the pitch was a swinging strike. I grouped and summarized the expected swinging strike rate data by pitcher and posted a thread on Twitter for those who are interested in which pitches fared best. Guys like Bieber, Glasnow, Gray, and Cole have curveballs at the top of this leaderboard, but it is worth noting that Griffin Canning has an excellent curveball by pure "stuff". And Josh Staumont has the only fastball that appears at the top when you combine four-seamers and curveballs.
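
The grouping step can be sketched as follows: average the per-pitch predicted probabilities within each pitcher and pitch type, then sort. This is a minimal sketch assuming the model's predictions are already available as plain tuples; the tuple layout, function names, and sample values are hypothetical and not taken from the post's data frame.

```python
from collections import defaultdict


def expected_swstr_rate(rows):
    """Average predicted swinging-strike probability per (pitcher, pitch_type).

    `rows` is an iterable of (pitcher, pitch_type, predicted_probability).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of probabilities, pitch count]
    for pitcher, pitch_type, prob in rows:
        bucket = totals[(pitcher, pitch_type)]
        bucket[0] += prob
        bucket[1] += 1
    return {key: prob_sum / n for key, (prob_sum, n) in totals.items()}


def leaderboard(rates, top_n=10):
    """Sort (pitcher, pitch_type) pairs by expected rate, highest first."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical predictions for two made-up pitchers.
rows = [
    ("Pitcher A", "CU", 0.20),
    ("Pitcher A", "CU", 0.40),
    ("Pitcher A", "FF", 0.10),
    ("Pitcher B", "CU", 0.25),
]
for (pitcher, pitch_type), rate in leaderboard(expected_swstr_rate(rows)):
    print(pitcher, pitch_type, round(rate, 3))
```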

As interesting as the player-level data is, ultimately I wanted to check the importance of spin axis differential for a pitcher's four-seamer and curveball. The variable for spin axis differential was statistically significant based on its p value, though much less vital to predicting swinging strikes than the pitch's velocity, movement, and location. The p value is often misinterpreted in general, but the result is encouraging nonetheless. Similar to my last post about park effects, I could extract the distribution of possible effects for each spin axis differential. Here are the results:

Remember, 6:00 is perfect spin mirroring. Based on the model, the spin axis differential is important in predicting swinging strike rates, but much less so than the other characteristics. Here you can see that while the differential matters, there is no discernible pattern. Perfect spin mirroring has almost no effect, while 6:30 has a negative effect. On the other hand, 7:00 and 5:00 have the highest positive effects, followed by 8:30, which is far off from perfect spin mirroring.

What can we conclude from this? I think spin mirroring is still very important. Making your pitches appear the same to the batter's eye is essential in deceiving him. However, my model does not capture this phenomenon, and my hypothesis is that the random effect in my model is not really addressing the effect of the axis differential. Instead it is capturing the effect of the pitcher overall, and the best pitchers at generating swinging strikes happen to fall into the 7:00, 5:00, and 8:30 buckets. I will point out that if you look at the Twitter thread with the expected swinging strike rate leaders, the top pitchers cluster around axis differentials between 5:30 and 6:30 (Bieber, Canning, Ray, Glasnow, Duffey, Cole, and Young), which is close to perfect mirroring and, from the standpoint of the batter, is probably barely discernible. Those guys make up half of the top of the leaderboard. Velocity and movement remain the most important factors in getting hitters to whiff, but I remain convinced that small edges are to be had with effective spin mirroring.

That does not mean work on this type of data is over. There are better models that can be built to isolate the effect of spin mirroring from the overall quality of the pitcher. And we will have more data. Hawk-Eye was implemented in the shortened 2020 season, so we have barely any pitch data relative to other seasons. I only had 82 pitchers who threw 100 four-seamers and curveballs, and the pitchers who met this threshold obviously did not throw a full season's worth of pitches. In a sport with as much variation as baseball, this is not a sufficient sample. In the future, as our Hawk-Eye based dataset grows, analysts (including myself) will be able to generate better insights on the effects of spin and learn more about the hitter-pitcher interaction, the most consequential part of any baseball game.

Monday, February 15, 2021

Homerun Park Factors

Anyone who watches baseball understands a park can affect the outcome of a batted ball and the overall season lines for players. Certain parks are huge and seem to have an endlessly vast outfield. Others seem like they just require a flick of the wrist to punch the ball out of the ballpark. Homeruns are the most obvious manifestation of park factors. Other plate appearance outcomes (singles, doubles, triples, and even strikeouts) are affected by the park in which a given plate appearance takes place. Homeruns, however, are the most visible and most consequential aspect of park factors. With the influx of Statcast data, park factors can be refined with batted ball characteristics which require a much smaller sample to stabilize than homerun rates. Homerun rates are also very dependent on teams/players. The Statcast batted ball data lets us strip the identity of the player and team from the batted ball at hand and evaluate it solely on how well it is struck. 

I built a generalized additive model with a smoothing term combining batted ball exit velocity, vertical launch angle, and spray angle (or horizontal launch angle). I also included a random effect for the home ballpark. The model was a logistic regression, in that the predictors were regressed against a binary output (whether or not a batted ball was a homerun). 
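
One way to write down a model of this shape, as a sketch under stated assumptions rather than the exact specification used here: the symbols EV (exit velocity), LA (vertical launch angle), and SA (spray angle) are shorthand introduced for this formula, f is the joint smooth over the three batted ball inputs, and the park random effect is assumed to be a normally distributed intercept shift, which is the usual convention.

```latex
\operatorname{logit}\,\Pr(\mathrm{HR}_i = 1)
  = \beta_0 + f(\mathrm{EV}_i, \mathrm{LA}_i, \mathrm{SA}_i) + b_{\mathrm{park}(i)},
\qquad
b_{\mathrm{park}} \sim \mathcal{N}\!\left(0, \sigma_{\mathrm{park}}^{2}\right)
```

Here i indexes batted balls and park(i) is the ballpark in which batted ball i was hit, so each park moves the log-odds of a homerun up or down by a constant amount regardless of how the ball was struck.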

To train the model, I used 80 percent of the batted balls from 2020 (selected randomly). The remaining 20 percent would be the test set. After training the model and applying it to the test set, the model yielded an accuracy of 97.8 percent. This meant that of the balls in the test set the model deemed to have at least a 50 percent chance of leaving the ballpark, about 97.8 percent were homeruns. The following is the ROC curve from the model, and you can see how small the false positive rate came out (the further away from the line with a slope of one and intercept of zero, the better):
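
For readers who want to check metrics like these on their own predictions, here is a minimal sketch of the usual definitions at a 50 percent threshold. It is not the evaluation code from this post; the function names and the tiny sample are hypothetical. Two distinctions are worth keeping in mind: accuracy counts every correct call (homerun or not), while the share of predicted homeruns that really were homeruns is usually called precision. And because homeruns are rare, it helps to compare accuracy against the baseline of always predicting "not a homerun".

```python
def confusion_counts(y_true, probs, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, probs):
        predicted = prob >= threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(y_true, probs, threshold=0.5):
    """Accuracy, precision, ROC coordinates, and a majority-class baseline."""
    tp, fp, tn, fn = confusion_counts(y_true, probs, threshold)
    n = tp + fp + tn + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "true_positive_rate": tp / (tp + fn) if tp + fn else nan,
        "false_positive_rate": fp / (fp + tn) if fp + tn else nan,
        # Accuracy obtained by always predicting the more common class.
        "baseline_accuracy": max(tp + fn, fp + tn) / n,
    }


# Hypothetical test set: 2 homeruns among 10 batted balls.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.3, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
print(summarize(y_true, probs))
```

Sweeping the threshold from 0 to 1 and plotting the true positive rate against the false positive rate traces out the ROC curve.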

Before I show the park effects, I applied the model to all of the batted balls in 2020 and looked at the results from the player level. Here is a visualization with all players with at least 15 expected homeruns (based on the model) in the 2020 season:

Voit stood out from the rest of the field. Many of the names are not too surprising, but some less-heralded names include Adam Duvall (who just signed a one-year deal with the Marlins), Wil Myers, whom San Diego spent all last season trying to dump, and Teoscar Hernandez, who always showed plus raw power but never could parlay that into enough contact to garner significant playing time. As one can see from the chart, players can have significant deviations from their expected homerun totals. Players who hit more homeruns than their batted ball data would indicate got a bit lucky. Correspondingly, players could have gotten unlucky based on the quality of the balls they put in play.
The further down and to the right a player sits, the more unlucky he was; up and to the left indicates the player was lucky. I included all players whose actual output deviated by at least five homeruns from their expected total. Harper and Freeman had excellent shortened seasons (Freeman won the NL MVP award) but were actually unlucky in terms of homerun output. Machado, Betts, and Ramirez were three of the top vote getters for MVP in their respective leagues and seem to have been buoyed by homerun luck.
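
The expected totals and the luck gap can be sketched like this: sum each player's per-ball homerun probabilities to get an expected total, then compare it with the homeruns he actually hit. This is a minimal sketch assuming the model output is available as plain tuples; the layout, names, and sample values are hypothetical.

```python
from collections import defaultdict


def home_run_totals(batted_balls):
    """Return {player: (actual_hr, expected_hr)}.

    `batted_balls` is an iterable of (player, hr_probability, was_home_run).
    """
    expected = defaultdict(float)
    actual = defaultdict(int)
    for player, prob, was_hr in batted_balls:
        expected[player] += prob      # expected homeruns: sum of probabilities
        actual[player] += int(was_hr)
    return {player: (actual[player], expected[player]) for player in expected}


def luck_outliers(totals, min_gap=5.0):
    """Players whose actual total differs from expected by at least `min_gap`.

    Positive gap: more homeruns than expected (lucky). Negative: unlucky.
    """
    gaps = {player: act - exp for player, (act, exp) in totals.items()}
    return {player: gap for player, gap in gaps.items() if abs(gap) >= min_gap}


# Hypothetical batted balls for two made-up players.
balls = (
    [("Player A", 0.5, True)] * 12
    + [("Player B", 0.25, True)] * 2
    + [("Player B", 0.25, False)] * 8
)
totals = home_run_totals(balls)
print(totals)
print(luck_outliers(totals))
```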

Finally let's look at the park factors. Like I said above the park was a random effect in the model; I knew each park had some effect on whether or not a batted ball was a homerun but I was not sure how much of an effect. The following is a chart showing each park's effect on homeruns labeled by the home team for that park. The effects are based on how much the park is estimated to have an effect on the size of the intercept of the model. The more positive the value, the more of a positive effect the park has on the homerun probability. The more negative the effect the more a park shrinks the probability of a homerun. The model was fit many times over with different random effects for each park. I included the 95 percent confidence intervals for each park to show how much these effects may overlap. 
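
To make the intercept interpretation concrete, here is a minimal sketch of how a park effect on the log-odds scale changes a homerun probability. The effect sizes used below are made-up values for illustration, not estimates from the model, and the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_park_effect(base_prob, park_effect):
    """Shift a homerun probability by a park effect on the log-odds scale."""
    return inv_logit(logit(base_prob) + park_effect)


# A batted ball that is a coin flip to leave a neutral park, moved to a
# hypothetical hitter-friendly park (+0.2) and pitcher-friendly park (-0.2).
for effect in (0.2, 0.0, -0.2):
    print(effect, round(apply_park_effect(0.5, effect), 3))
```

Because the shift happens on the log-odds scale, the same park effect moves borderline balls the most and barely changes balls that were already near-certain homeruns or near-certain outs.
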
Cincinnati has the largest positive effect by far, followed by Yankee Stadium and Dodger Stadium. On the other end of the spectrum, Miami brings up the rear, followed by the two Bay Area parks. One may quibble with Coors Field in Denver only ranking ninth. I would note that Coors almost definitely has the largest effect on overall run scoring. But the outfield in Coors was made so large to combat the thin air in Denver that hitting homeruns is not as easy as the run-scoring environment would suggest. Also note the position of Arizona. Arizona used to be a bandbox like Coors due to the altitude, but upon the introduction of a humidor in Chase Field both the homerun hitting and run-scoring environments have taken a tumble in favor of the pitchers. The new Rangers park (which now has a retractable roof) plays down compared to its older and completely outdoor counterpart. Arlington is one of the hottest places to play in the summer, if not the hottest, which helps the ball fly much further than in other parks. One final note: the Bay Area teams, surprisingly, play in the coldest weather in the summer months. This has a negative effect on the flight of the ball (from the perspective of the batter). This, in conjunction with the large park dimensions, makes both parks especially tough places to deposit the ball into the outfield bleachers.