Thursday, March 4, 2021

RAPM Style wOBA Estimates

The most common all-in-one statistics in basketball are based on the adjusted plus-minus (APM) framework. The basic idea is to estimate how many points per 100 possessions a player adds to his team's scoring margin. It is a regression whose inputs are the player in question and the other nine players on the court during each stint he plays, and whose target is the score differential over that stint. This has spawned a plethora of other metrics including box plus-minus, ESPN's real plus-minus, 538's RAPTOR, Basketball Index's LEBRON, and most notably regularized adjusted plus-minus (RAPM). Ryan Davis over at nbashotcharts.com has developed the current gold standard for RAPM (at least in the public sphere). He publishes a raw RAPM, a multi-year RAPM, and luck-adjusted variants of both that account for opponent three-point shooting and free throw shooting, two quantities over which a single player has no control. Newer plus-minus metrics are usually built with, and compared against, players' RAPM figures.

The key word in RAPM is regularized. What this means is that when fitting the model, the coefficients (both the external factors and the actual player coefficients) are shrunk towards zero in a process called ridge regression. How strongly the coefficients are shrunk towards zero is controlled by the hyperparameter lambda. This technique has been used in other sports as well, most notably in hockey by a few public-facing analysts. For context, ridge regression does not give confidence intervals as an output, but bootstrap-based distributions can be derived by refitting the model over and over again on resampled data. This can take a while depending on the size of the training data set.
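To make the shrinkage and the bootstrap idea concrete, here is a minimal sketch in Python using only NumPy. It is not the code behind any published RAPM; the data are synthetic, and the function names (ridge_fit, bootstrap_ridge), the lambda values, and the number of resamples are all assumptions chosen for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def bootstrap_ridge(X, y, lam, n_boot=200, seed=0):
    """Refit the ridge model on rows resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample row indices
        draws[b] = ridge_fit(X[idx], y[idx], lam)
    return draws

# Synthetic stand-in data (an assumption, not real stint data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
true_beta = np.array([1.0, -0.5, 0.0, 0.25, 2.0])
y = X @ true_beta + rng.normal(scale=1.0, size=500)

beta_small = ridge_fit(X, y, lam=1.0)      # light shrinkage
beta_large = ridge_fit(X, y, lam=1000.0)   # heavy shrinkage

# Percentile intervals from the bootstrap distribution of each coefficient.
draws = bootstrap_ridge(X, y, lam=1.0)
lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
```

A larger lambda pulls every coefficient closer to zero, and the spread of the bootstrap draws gives a rough sense of the uncertainty around each one. The loop refits the model once per resample, which is why the procedure gets slow on a large training set.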

I set out to apply the ridge regression framework to derive estimates of true hitting and pitching talent in the form of wOBA and wOBA allowed, respectively. The inputs into the model were the pitcher, the batter, whether or not the batter had the platoon advantage, the park, and the month. The month was included because offenses get the benefit of the ball traveling farther in warmer air. The extra parameters are meant to make the estimates for each batter and pitcher in the dataset context neutral. Players who play in favorable offensive or defensive parks are adjusted accordingly, as are players who are deployed in such a way that they often have the platoon advantage.
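One way to picture the inputs is as a matrix with one row per plate appearance and one indicator column per batter, pitcher, park, and month, plus a platoon flag. The sketch below is only an illustration of that layout, not the model actually used here: the player, park, and outcome values are made up, the lambda of 5.0 is arbitrary, and centering the target on the sample mean (so that a coefficient reads as a deviation from average) is my assumption.

```python
import numpy as np

# Hypothetical toy plate appearances:
# (batter, pitcher, platoon_advantage, park, month, woba_value_of_outcome)
pas = [
    ("batter_a", "pitcher_x", 1, "park_1", "Apr", 0.9),
    ("batter_a", "pitcher_y", 0, "park_2", "Jul", 0.0),
    ("batter_b", "pitcher_x", 0, "park_1", "Apr", 0.0),
    ("batter_b", "pitcher_y", 1, "park_2", "Jul", 2.0),
    ("batter_a", "pitcher_x", 1, "park_1", "Jul", 0.7),
    ("batter_b", "pitcher_y", 0, "park_1", "Apr", 0.0),
]

def build_columns(pas):
    """Assign one column index to every batter, pitcher, park, and month."""
    cols = {}
    for batter, pitcher, _, park, month, _ in pas:
        for key in (("bat", batter), ("pit", pitcher),
                    ("park", park), ("month", month)):
            cols.setdefault(key, len(cols))
    cols[("platoon", "adv")] = len(cols)  # single 0/1 platoon column
    return cols

def build_design(pas, cols):
    """One row per plate appearance, indicator columns for each input."""
    X = np.zeros((len(pas), len(cols)))
    y = np.empty(len(pas))
    for i, (batter, pitcher, platoon, park, month, value) in enumerate(pas):
        X[i, cols[("bat", batter)]] = 1.0
        X[i, cols[("pit", pitcher)]] = 1.0
        X[i, cols[("park", park)]] = 1.0
        X[i, cols[("month", month)]] = 1.0
        X[i, cols[("platoon", "adv")]] = float(platoon)
        y[i] = value
    return X, y

cols = build_columns(pas)
X, y = build_design(pas, cols)

# Center the target so coefficients are deviations from the sample average.
league_mean = y.mean()
lam = 5.0  # arbitrary illustrative value
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                       X.T @ (y - league_mean))

# Context-neutral estimate per batter: average plus the batter's coefficient.
woba_est = {name: league_mean + beta[j]
            for (kind, name), j in cols.items() if kind == "bat"}
```

Because the park, month, and platoon columns soak up their share of each outcome, a batter's coefficient reflects what is left over after those contexts are accounted for. The pitcher columns work the same way, with a positive coefficient meaning more wOBA allowed than average.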

I trained the model on data from the 2019 and 2020 seasons. After training the model, I pulled the hitters and pitchers who had at least 200 plate appearances or 200 batters faced, respectively. Below are tables including the top and bottom few names from my dataset. wOBA is the batter's actual wOBA from 2019 through 2020, woba_est is the player's context-neutral wOBA estimate, and ERA- is the context-neutral ERA indexed to 100, where anything below 100 is better than average:


Model data via MLBAM. Plate appearances, batters faced, and ERA- via FanGraphs.