Monday, February 15, 2021

Homerun Park Factors

Anyone who watches baseball understands that a park can affect both the outcome of a batted ball and players' overall season lines. Some parks are huge, with seemingly endless outfields. Others seem to require just a flick of the wrist to punch the ball out of the ballpark. Other plate appearance outcomes (singles, doubles, triples, and even strikeouts) are affected by the park in which a given plate appearance takes place, but homeruns are the most visible and most consequential aspect of park factors. With the influx of Statcast data, park factors can be refined using batted ball characteristics, which require a much smaller sample to stabilize than homerun rates do. Raw homerun rates are also heavily dependent on the teams and players involved. Statcast batted ball data lets us strip the identity of the player and team from the batted ball at hand and evaluate it solely on how well it was struck.

I built a generalized additive model with a smoothing term combining batted ball exit velocity, vertical launch angle, and spray angle (horizontal launch angle). I also included a random effect for the home ballpark. The model was a logistic regression: the predictors were regressed against a binary outcome (whether or not a batted ball was a homerun).
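In symbols, the model described above can be sketched roughly as follows. The notation is my own shorthand and an assumption, not something taken from the fitted model: p_i is the homerun probability for batted ball i, f is the joint smooth over exit velocity (EV), launch angle (LA), and spray angle, and b_j is the random intercept for park j, assumed here to be normally distributed as is conventional for random effects.

```latex
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
  = \beta_0 + f(\mathrm{EV}_i, \mathrm{LA}_i, \mathrm{spray}_i) + b_{\mathrm{park}(i)},
\qquad b_j \sim \mathcal{N}\!\left(0, \sigma_{\mathrm{park}}^2\right)
```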

To train the model, I used a randomly selected 80 percent of the batted balls from 2020. The remaining 20 percent served as the test set. After training the model and applying it to the test set, it achieved an accuracy of 97.8 percent. Accuracy here means that, with a 50 percent predicted probability as the cutoff for calling a batted ball a homerun, the model's call matched the actual outcome for about 97.8 percent of the batted balls in the test set. The following is the ROC curve from the model, and you can see how small the false positive rate came out (the further the curve bows away from the diagonal line with a slope of one and an intercept of zero, the better):
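To make the evaluation step concrete, here is a minimal sketch of a random 80/20 split and two metrics computed at a 50 percent cutoff. Everything in it is illustrative: the function names and the tiny probs/labels lists are made up, and it is not the code used for the actual analysis. It shows accuracy next to precision because the two are easy to conflate: accuracy counts every correct call, while precision looks only at the balls the model flagged as homeruns.

```python
import random


def train_test_split(rows, train_frac=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(probs, labels, threshold=0.5):
    """Share of all batted balls whose predicted class matches the outcome."""
    calls = [p >= threshold for p in probs]
    correct = sum(call == bool(y) for call, y in zip(calls, labels))
    return correct / len(labels)


def precision(probs, labels, threshold=0.5):
    """Share of predicted homeruns that actually were homeruns."""
    flagged = [y for p, y in zip(probs, labels) if p >= threshold]
    return sum(flagged) / len(flagged) if flagged else float("nan")


# Made-up predicted homerun probabilities and outcomes (1 = homerun).
probs = [0.92, 0.70, 0.55, 0.30, 0.10, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print("accuracy: ", accuracy(probs, labels))
print("precision:", precision(probs, labels))
```

Because homeruns are a small share of all batted balls, it is worth looking at both numbers rather than accuracy alone.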

Before showing the park effects, I applied the model to all of the batted balls in 2020 and looked at the results at the player level. Here is a visualization of all players with at least 15 expected homeruns (based on the model) in the 2020 season:

Luke Voit stood out from the rest of the field. Many of the names are not too surprising, but some less-heralded ones include Adam Duvall (who just signed a one-year deal with the Marlins), Wil Myers, whom San Diego spent all of last season trying to dump, and Teoscar Hernandez, who always showed plus raw power but could never parlay it into enough contact to garner significant playing time. As one can see from the chart, players can deviate significantly from their expected homerun totals. Players who hit more homeruns than their batted ball data would indicate got a bit lucky. Conversely, players who hit fewer homeruns than the quality of their batted balls would indicate got unlucky.
The further down and to the right a player sits, the more unlucky he was; up and to the left indicates the player was lucky. I included all players whose actual output deviated from their expected total by at least five homeruns. Harper and Freeman had excellent shortened seasons (Freeman won the NL MVP award) but were actually unlucky in terms of homerun output. Machado, Betts, and Ramirez were three of the top MVP vote getters in their respective leagues and seem to have been buoyed by homerun luck.
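The expected-versus-actual comparison above can be sketched as follows, assuming a player's expected homerun total is the sum of the model's predicted probabilities over his batted balls (a natural definition, but an assumption on my part here). The player labels and probabilities are made up for illustration.

```python
from collections import defaultdict


def homerun_luck(batted_balls):
    """Summarize expected vs. actual homeruns per player.

    batted_balls: iterable of (player, predicted_hr_prob, was_homerun).
    Returns {player: {"expected": ..., "actual": ..., "diff": ...}},
    where diff = actual - expected (positive means "lucky").
    """
    expected = defaultdict(float)
    actual = defaultdict(int)
    for player, prob, was_hr in batted_balls:
        expected[player] += prob          # expected HR = sum of probabilities
        actual[player] += int(was_hr)     # actual HR = count of homeruns
    return {
        player: {
            "expected": expected[player],
            "actual": actual[player],
            "diff": actual[player] - expected[player],
        }
        for player in expected
    }


# Hypothetical batted balls: (player, model probability, was it a homerun?)
sample = [
    ("Hitter A", 0.9, True), ("Hitter A", 0.6, True), ("Hitter A", 0.1, True),
    ("Hitter B", 0.8, False), ("Hitter B", 0.7, True), ("Hitter B", 0.5, False),
]

for player, row in homerun_luck(sample).items():
    print(player, round(row["expected"], 2), row["actual"], round(row["diff"], 2))
```

In this toy data, Hitter A out-hit his batted ball quality (lucky) and Hitter B fell short of it (unlucky).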

Finally, let's look at the park factors. As I said above, the park was a random effect in the model; I knew each park had some effect on whether or not a batted ball was a homerun, but I was not sure how much. The following is a chart showing each park's effect on homeruns, labeled by the home team for that park. Each effect is the estimated amount by which the park shifts the model's intercept (on the log-odds scale, since the model is logistic). The more positive the value, the more the park raises the homerun probability. The more negative the value, the more the park shrinks it. The model was fit many times over with different random effects for each park. I included the 95 percent confidence intervals for each park to show how much these effects may overlap.
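To get a feel for what a shift in the intercept means in probability terms, here is a minimal sketch. It assumes the park effect is simply added to the log-odds of a homerun, as in a standard logistic model; the effect sizes of plus and minus 0.2 are hypothetical and are not estimates from my model.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_park_effect(base_prob, park_effect):
    """Shift a park-neutral homerun probability by a park's intercept effect."""
    return inv_logit(logit(base_prob) + park_effect)


# Hypothetical effects, NOT fitted values: a hitter-friendly park (+0.2)
# and a pitcher-friendly park (-0.2), applied to a 40 percent batted ball.
base = 0.40
for label, effect in [("friendly", 0.2), ("neutral", 0.0), ("tough", -0.2)]:
    print(label, round(apply_park_effect(base, effect), 3))
```

The same shift in log-odds moves a coin-flip batted ball much more than a near-certain homerun or a routine fly out, so park effects matter most for balls on the margin.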
Cincinnati has the largest positive effect by far, followed by Yankee Stadium and Dodger Stadium. On the other end of the spectrum, Miami brings up the rear, followed by the two Bay Area parks. One may quibble with Coors Field in Denver ranking only ninth. I would note that Coors almost definitely has the largest effect on overall run scoring. But the outfield at Coors was made so large to combat the thin air in Denver that hitting homeruns is not as easy as the run-scoring environment would suggest. Also note the position of Arizona. Arizona used to be a launching pad like Coors due to the altitude, but since the introduction of a humidor at Chase Field, both the homerun-hitting and run-scoring environments have taken a tumble in favor of the pitchers. The new Rangers park (which has a retractable roof) plays down compared to its older, completely outdoor counterpart. Arlington is one of the hottest places to play in the summer, if not the hottest, so in the old open-air park the ball flew much farther than in other parks. One final note: the Bay Area teams, surprisingly, play in the coldest weather in the summer months. This has a negative effect on the flight of the ball (from the perspective of the batter). This, in conjunction with the large park dimensions, makes both parks especially tough places to deposit the ball into the outfield bleachers.
