Sabermetric Musings: QB EPA per Play Reliability

With the growing use of expected points added (EPA) to evaluate football teams and the accessibility of play-by-play data, NFL fans are better equipped than ever to gauge the relative merits of every team and player around the league. Instead of relying on metrics that either fail to address context (yards per carry/catch, total yards) or are unreliable due to extremely small samples (touchdowns, interceptions, touchdown-to-interception ratio), EPA adjusts for down, distance, yard line, and a plethora of other factors (this provides a more in-depth explanation of EPA and the history of the statistic).

Anecdotally, EPA is most commonly used to evaluate team offensive and defensive efficiency (in the form of mean EPA per play). This does lack some nuance (the average EPA per play tells us nothing about the distribution of EPA for an offense or defense), but if you are going to boil a unit down to one number, mean EPA per play is about as useful a measure as we have access to. Given the outsized effect quarterbacks have on a team's offense, it is natural to use EPA to gauge the play of quarterbacks.

The natural question when using any statistic to analyze a quarterback is: when does that statistic become reliable? How large a sample do we need before we have an idea of how well that quarterback has played? Using one play is meaningless; randomly select one Patrick Mahomes throw and you can come up with a 50-yard touchdown or a pick-six. The same idea holds for two throws, three throws, etc. So how many throws do we need before a sample is reliable? I tried to answer that question using the play-by-play data since 2010.

I should note that measuring reliability or stability is different from measuring predictability. The point at which a metric stabilizes tells us how many plays we need to evaluate something that has already happened. So if a metric stabilizes or becomes reliable after 50 plays, we can look at the past 50 plays and see how well a player or team performed in those 50 plays. That does not mean that the player or team will perform at that level over the next 50 plays. We can be reasonably confident that the level of performance displayed over the prior 50 plays is a good description of that past performance, but it should not be used to predict future performance.

To conduct my analysis, I randomly sampled a fixed number of plays from all plays where a quarterback either ran or threw the ball. For each number of plays, I drew two such samples and repeated that 1,000 times. The number of plays per sample ranged from 25 on the low end to 400 on the high end, in increments of 25 plays. For context, a starting quarterback who plays in every game will throw anywhere between 400 and 700 passes and will have between 10 and 100 carries (unless you are Lamar Jackson, Cam Newton, or Josh Allen). First I looked at the spread in average EPA per play between the two samples.
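For anyone who wants to try something similar, here is a minimal sketch of that paired-sampling procedure. It is not my original code, and it rests on a few assumptions: plays are pooled into one list of EPA values, each sample is drawn without replacement, and the two samples in a pair are drawn independently. The function name `paired_sample_differences` is made up for illustration, and the EPA values are synthetic placeholders rather than real play-by-play data.

```python
import random
import statistics


def paired_sample_differences(epa_values, n_plays, n_trials=1000, seed=0):
    """Draw two random samples of n_plays EPA values per trial and return
    the difference in mean EPA per play for each pair."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        # Assumption: each sample is drawn without replacement from the pool,
        # and the two samples in a pair are independent of each other.
        sample_a = rng.sample(epa_values, n_plays)
        sample_b = rng.sample(epa_values, n_plays)
        diffs.append(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    return diffs


# Placeholder data: synthetic EPA values, NOT real play-by-play data.
data_rng = random.Random(42)
epa_values = [data_rng.gauss(0.05, 1.5) for _ in range(20000)]

# 25 to 400 plays per sample, in increments of 25.
sample_sizes = list(range(25, 401, 25))

# Fewer trials than the 1,000 described above, only to keep the demo quick.
differences = {
    n: paired_sample_differences(epa_values, n, n_trials=200)
    for n in sample_sizes
}
```

With real data, `epa_values` would be the EPA column for every quarterback run or pass, and `n_trials` would go back up to 1,000.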

The number above each window indicates the number of plays in the samples. The spread starts out very wide for 25 plays, so using 25 plays to evaluate a quarterback would be foolhardy. As the number of plays per sample grows, the spread tightens. Once we get to 125 or 150 plays, the spread changes less and less as we add plays to a sample. So where can we say "this sample is reliable"? I took the average difference between sample pairs and the standard deviation of that difference across all pairs and compared them by number of plays per sample:

The heights of the error bars represent the average difference plus or minus the standard deviation of the difference. At 25 plays, the deviation in sample difference is very large relative to the average. This effect is mitigated as the sample of plays grows. By the time we get to 275 plays, the change in standard deviation from one sample size to the next is about five percent of the total standard deviation. For the rest of the sample sizes, this five percent figure is relatively constant. Thus, I would say that a sample of about 275 plays is the point where EPA per play becomes reliable.
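The comparison above can be sketched roughly as follows, given paired-sample differences keyed by number of plays. This is an illustration, not my original code. It assumes the "average difference" is the mean of the signed differences and that the five percent figure is the drop in standard deviation relative to the previous sample size. The function names and the synthetic Gaussian data are placeholders.

```python
import random
import statistics


def summarize_differences(differences_by_size):
    """Return {n_plays: (mean difference, standard deviation of difference)}."""
    return {
        n: (statistics.fmean(diffs), statistics.stdev(diffs))
        for n, diffs in sorted(differences_by_size.items())
    }


def relative_std_change(summary):
    """Fractional drop in standard deviation from one sample size to the next.

    Assumption: the change is measured relative to the previous size's
    standard deviation.
    """
    sizes = sorted(summary)
    changes = {}
    for prev, curr in zip(sizes, sizes[1:]):
        prev_std = summary[prev][1]
        curr_std = summary[curr][1]
        changes[curr] = (prev_std - curr_std) / prev_std
    return changes


# Synthetic stand-in for paired-sample differences (NOT real NFL data):
# each value is the difference between the means of two Gaussian samples.
rng = random.Random(0)


def fake_difference(n_plays):
    a = statistics.fmean([rng.gauss(0.0, 1.5) for _ in range(n_plays)])
    b = statistics.fmean([rng.gauss(0.0, 1.5) for _ in range(n_plays)])
    return a - b


demo_differences = {
    n: [fake_difference(n) for _ in range(200)]
    for n in range(25, 201, 25)
}
demo_summary = summarize_differences(demo_differences)
demo_changes = relative_std_change(demo_summary)
```

If the differences behave like differences of independent sample means, the standard deviation should shrink roughly in proportion to 1/sqrt(n), so the step from 250 to 275 plays works out to 1 - sqrt(250/275), or about 4.7 percent, which is in line with the five percent figure. With only a few hundred trials per sample size, the estimated changes are noisy, so the exact crossover point will move around from run to run.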

For a quarterback who starts the entire season, 275 plays is somewhere between six and eight games' worth of plays. That means we need six, seven, or eight games to be reasonably confident of the level at which the quarterback played. Keep that in mind when you see various members of the media and talking-head types freak out after a game or two, or when a player has a great three games to end a season and people start to sing his praises going into the next one. Also remember, this does not mean that we need six games to predict how well a quarterback will perform in the future. 275 plays just provides a good snapshot of how well a quarterback played in the past.

Sunday, August 30, 2020
