Anyone who watches baseball understands that a park can affect the outcome of a batted ball and, over a full season, a player's overall line. Certain parks are huge and seem to have an endlessly vast outfield. Others seem to require just a flick of the wrist to punch the ball out of the ballpark. Other plate appearance outcomes (singles, doubles, triples, and even strikeouts) are also affected by the park in which a given plate appearance takes place, but homeruns are the most visible and most consequential manifestation of park factors. With the influx of Statcast data, park factors can be refined with batted ball characteristics, which require a much smaller sample to stabilize than homerun rates. Homerun rates also depend heavily on the teams and players involved. The Statcast batted ball data lets us strip the identity of the player and team from the batted ball at hand and evaluate it solely on how well it was struck.
I built a generalized additive model with a single smooth term combining batted ball exit velocity, vertical launch angle, and spray angle (that is, horizontal launch angle). I also included a random effect for the home ballpark. The model was a logistic regression, in that the predictors were regressed against a binary outcome (whether or not a batted ball was a homerun).
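Written out, the model described above looks roughly like the following. The notation is illustrative shorthand rather than anything taken from the fitted model: EV, LA, and SA stand for exit velocity, launch angle, and spray angle, f is the joint smooth, and b is the park effect. The normal distribution on the park effects is the usual assumption for a random effect, not something stated above.

```latex
\operatorname{logit}\,\Pr(\mathrm{HR}_i = 1)
  = \beta_0 + f(\mathrm{EV}_i, \mathrm{LA}_i, \mathrm{SA}_i) + b_{\mathrm{park}(i)},
\qquad b_j \sim \mathcal{N}(0, \sigma_b^2)
```

Here i indexes batted balls and park(i) is the park in which ball i was hit, so every ball in the same park shares one shift on the log-odds scale.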
To train the model, I used a random 80 percent of the batted balls from 2020. The remaining 20 percent served as the test set. After training the model and applying it to the test set, I got an accuracy of 97.8 percent. This meant that of the balls in the test set the model deemed to have at least a 50 percent chance of leaving the ballpark, about 97.8 percent were homeruns. (Strictly, the share of predicted homeruns that actually left the park is called precision; accuracy counts every correct call, homerun or not.) The following is the ROC curve from the model, and you can see how small the false positive rate came out (the further the curve sits above the line with a slope of one and an intercept of zero, the better):
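For readers who want to see the evaluation step concretely, here is a minimal sketch of a random 80/20 split and the threshold-based metrics, using only the Python standard library. The function names and the toy probabilities are made up for illustration; they are not the code or data behind the numbers above.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out a random test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def confusion_counts(probs, labels, threshold=0.5):
    """Count true/false positives/negatives, calling a homerun when p >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_hr = p >= threshold
        if predicted_hr and y == 1:
            tp += 1
        elif predicted_hr and y == 0:
            fp += 1
        elif not predicted_hr and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all batted balls classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp, tn, fn):
    """Share of predicted homeruns that really were homeruns."""
    return tp / (tp + fp) if (tp + fp) else float("nan")

def roc_point(probs, labels, threshold):
    """One (false positive rate, true positive rate) point on the ROC curve."""
    tp, fp, tn, fn = confusion_counts(probs, labels, threshold)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr

# Hypothetical predicted probabilities and outcomes (1 = homerun).
toy_probs = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1, 0.05, 0.02]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(toy_probs, toy_labels)
print("accuracy:", accuracy(*counts))
print("precision:", precision(*counts))
print("ROC point at 0.5:", roc_point(toy_probs, toy_labels, 0.5))
```

Sweeping the threshold from 1 down to 0 and plotting each `roc_point` traces out the full ROC curve. Because homeruns are a small share of batted balls, accuracy and precision can differ quite a bit, which is why the sketch keeps them separate.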
Voit stood out from the rest of the field. Many of the names are not too surprising, but some less-heralded ones include Adam Duvall (who just signed a one-year deal with the Marlins), Wil Myers, whom San Diego spent all last season trying to dump, and Teoscar Hernandez, who always showed plus raw power but never could parlay that into enough contact to garner significant playing time. As one can see from the chart, players can deviate significantly from their expected homerun totals. Players who hit more homeruns than their batted ball data would indicate got a bit lucky. Correspondingly, players who hit fewer than the quality of the balls they put in play would suggest got unlucky.
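One common way to turn per-ball probabilities into an expected homerun total is simply to add them up for each player. The sketch below assumes that approach, which is not spelled out above, and the probabilities and the actual total are invented for illustration.

```python
def expected_homeruns(probs):
    """Expected homerun total: the sum of per-batted-ball homerun probabilities."""
    return sum(probs)

def homeruns_over_expected(actual, probs):
    """Positive means more homeruns than the batted ball data suggests (lucky)."""
    return actual - expected_homeruns(probs)

# Hypothetical player: four batted balls, two of which left the park.
player_probs = [0.75, 0.5, 0.25, 0.0]
player_actual = 2

print("expected:", expected_homeruns(player_probs))
print("over expected:", homeruns_over_expected(player_actual, player_probs))
```

To strip out the park, the probabilities fed into the sum would need to come from the smooth term alone, with the park effect set aside.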
Finally, let's look at the park factors. As I said above, the park was a random effect in the model; I knew each park had some effect on whether or not a batted ball was a homerun, but I was not sure how much. The following is a chart showing each park's effect on homeruns, labeled by the home team for that park. Each effect is the estimated amount by which the park shifts the intercept of the model. The more positive the value, the more the park raises the homerun probability. The more negative the value, the more the park shrinks the probability of a homerun. The model was fit many times over with different random effects for each park. I included the 95 percent confidence intervals for each park to show how much these effects may overlap.
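Because the park effect shifts the intercept, it acts on the log-odds scale rather than directly on the probability. A minimal sketch of what that means for a single batted ball follows; the baseline probability, the effect sizes, and the intervals are hypothetical and are not estimates from the model.

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_park_effect(base_prob, park_effect):
    """Shift a neutral-park homerun probability by a park's intercept effect."""
    return inv_logit(logit(base_prob) + park_effect)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi

base = 0.30  # hypothetical homerun probability for one batted ball in a neutral park
for name, effect in [("hitter-friendly", 0.25), ("neutral", 0.0), ("pitcher-friendly", -0.25)]:
    print(f"{name:>16}: {apply_park_effect(base, effect):.3f}")

# Two hypothetical parks whose 95 percent intervals overlap cannot be cleanly ranked.
print(intervals_overlap((-0.10, 0.30), (0.20, 0.50)))
```

The same shift moves a borderline fly ball much more than a no-doubter or a routine out, since probabilities near 0 or 1 barely change when the log-odds move by a fixed amount.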