Moving the Chains: Does a Timeout Before Fourth Down Help the Offense's Chances?
Note: this piece originally ran on Football Outsiders in October 2019 (former link: https://www.footballoutsiders.com/stat-analysis/2019/timeouts-and-fourth-downs)
Introduction/Summary
As NFL offensive play-callers have grown more aggressive over the past decade, fourth-down decisions have taken on growing importance throughout the league. From the infamous Belichick “fourth-and-2” play in November 2009 to the Eagles’ historic “Philly Special” play in Super Bowl LII, decisions to pull the trigger on the final down have influenced the outcomes of major games, both for better and for worse.
The benefits of leaving the offense on the field on fourth downs have already been well-documented by data analysts nationwide. But while it seems to be beyond reasonable doubt that the mindset of risk aversion is slowly fading in football, the conclusion of “teams should go for it on fourth down more often” doesn’t necessarily tell us how those teams should approach those attempts. Often, fourth down plays come in incredibly high-leverage situations, and coaches call timeouts immediately beforehand in order to (a) decide whether they actually want to leave the offense out there, and/or (b) figure out what the optimal play call will be. Because of this, a natural question arises: does calling a timeout before a fourth down play help an offense’s chances of succeeding? Using data from NFLScrapR, I attempted to find out.
For the sake of keeping this to a reasonable length, I won’t publish all of the R code here (though I’m not opposed to sharing it privately with anyone interested). But as an abridged version, I used all of the play-by-play data available via NFLScrapR for the past 10 completed seasons (2009 to 2018), which I placed into a data frame called pbp_all. I removed all broken plays from that set (e.g. a botched field goal attempt where the holder keeps and runs with the ball, which NFLScrapR classifies as neither a rush nor a pass). From there, I isolated all fourth downs which didn’t result in punts or field goals into a data frame called pbp_all_NoSpecials_fourthDown, which has a size of 4,633 plays. I then split that data frame into the plays that came immediately after timeouts (pbp_all_NoSpecials_fourthDownPlaysFollowingTimeout), of which there are 907, and plays that weren’t immediately after timeouts (pbp_all_NoSpecials_fourthDownsNOTAfterTimeout), of which there are 3,726. I performed some further manipulations which will be detailed later, including but not limited to classifying plays as “Short” for fourth-and-2 or less, “Med” for fourth-and-3 to 6, and “Long” for fourth-and-7 or more.
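As a rough illustration of the filtering and bucketing steps just described, here is a minimal Python sketch. It is not the R code used for this project: the field names (down, play_type, yards_to_go, after_timeout) and the toy rows are assumptions made up for the example, and the real NFLScrapR columns may be named differently.

```python
def distance_bucket(yards_to_go):
    """Label a fourth down by yards to go, using the cutoffs described above."""
    if yards_to_go <= 2:
        return "Short"
    if yards_to_go <= 6:
        return "Med"
    return "Long"


def split_fourth_downs(plays):
    """Keep fourth-down runs and passes, then split on the timeout flag.

    Filtering on play_type also drops punts, field goals, and broken plays
    that are classified as neither a run nor a pass.
    """
    kept = [
        {**play, "distance": distance_bucket(play["yards_to_go"])}
        for play in plays
        if play["down"] == 4 and play["play_type"] in ("run", "pass")
    ]
    after_timeout = [p for p in kept if p["after_timeout"]]
    no_timeout = [p for p in kept if not p["after_timeout"]]
    return after_timeout, no_timeout


# Toy rows for illustration only; these are not real plays.
toy_plays = [
    {"down": 4, "play_type": "run", "yards_to_go": 1, "after_timeout": True},
    {"down": 4, "play_type": "pass", "yards_to_go": 5, "after_timeout": False},
    {"down": 4, "play_type": "punt", "yards_to_go": 9, "after_timeout": False},
    {"down": 3, "play_type": "pass", "yards_to_go": 8, "after_timeout": True},
    {"down": 4, "play_type": "pass", "yards_to_go": 12, "after_timeout": False},
]

after_timeout, no_timeout = split_fourth_downs(toy_plays)
print(len(after_timeout), len(no_timeout))
```

In the toy data, the punt and the third-down pass are filtered out, leaving one play after a timeout and two without.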
The “too long, didn’t read” summary of the project: when it comes to all fourth downs overall, there is no evidence that calling a timeout helps the offense in a statistically significant manner. Furthermore, if we stratify the data by play type and distance, we can reach more detailed potential conclusions. The data suggests that if we split our set of 4,633 plays into six categories based on both distance to go (i.e. short vs. medium vs. long) and play type (i.e. run vs. pass), calling a timeout does not appear to clearly benefit the offense in any of those six instances. At first glance, the numbers suggest that in one of the six categories (fourth-and-long plays resulting in runs), calling a timeout is actually detrimental to the offense, but a deeper analysis of what actually happened on those plays implies that this conclusion is unreliable because of the small sample size.
The Basics: Timeout vs. No Timeout
To embark on this project, the clear first step was to compare all plays following timeouts with all plays not following timeouts as broadly as possible. Without considering play type or distance to go, this is quite simple: it takes only a two-sample T-test. For those unfamiliar with the T-test, it’s essentially a method to determine whether the difference between two sample means is larger than random chance alone would plausibly produce. Most commonly, a p-value of ≤ 0.05 is used to signify potential statistical significance, though that varies based on context. If we perform a T-test on the conversion rates of all fourth down plays following timeouts vs. all fourth down plays not following timeouts, we get the following:
In other words, though the mean conversion rate for fourth downs following timeouts was slightly higher than that for plays not following timeouts (by a 0.504 to 0.501 margin), this margin was small enough that it doesn’t appear to be indicative of a systemic edge in favor of the plays following timeouts. One potentially interesting side note is that if we run the same T-test using win probability added (WPA) instead of conversion rate, we see a somewhat different result:
The relatively small p-value of 0.058 doesn’t rule out that there could be a WPA-based advantage to calling a timeout, if we held other factors like play type and distance constant. That gap still could be due to random chance, since a p-value below or near 0.05 never guarantees that a real difference exists, but it could also signify that calling a timeout might be more likely to lead to a more explosive play, even if it doesn’t necessarily increase the chance of merely converting the fourth down. While the primary purpose of this piece is to evaluate what a timeout does to a team’s odds of converting on fourth down, we’ll also touch on the “explosive play” idea at the end.
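For readers who want to see the mechanics of the conversion-rate comparison in this section, here is a minimal Python sketch of a two-sample test on 0/1 outcomes. Two assumptions are worth flagging: the conversion counts are reconstructed by rounding the reported rates (0.504 and 0.501) against the reported sample sizes, so they are approximate, and the p-value uses a normal approximation rather than the exact t distribution that R’s T-test would use. With samples this large the two should be very close, but this is an illustration, not a reproduction of the original output.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_test_binary(successes_a, n_a, successes_b, n_b):
    """Welch-style test statistic for two samples of 0/1 outcomes.

    Returns the statistic and a two-sided p-value from a normal
    approximation (an assumption; reasonable only for large samples).
    """
    mean_a = successes_a / n_a
    mean_b = successes_b / n_b
    # Sample variance of 0/1 data with the usual n - 1 denominator.
    var_a = mean_a * (1 - mean_a) * n_a / (n_a - 1)
    var_b = mean_b * (1 - mean_b) * n_b / (n_b - 1)
    std_err = sqrt(var_a / n_a + var_b / n_b)
    stat = (mean_a - mean_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


# Counts reconstructed from the rounded rates in the text (approximate).
n_timeout, n_no_timeout = 907, 3726
conv_timeout = round(0.504 * n_timeout)
conv_no_timeout = round(0.501 * n_no_timeout)

stat, p_value = two_sample_test_binary(
    conv_timeout, n_timeout, conv_no_timeout, n_no_timeout
)
print(f"statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

Under these assumptions the p-value comes out far above 0.05, in line with the conclusion that the small gap in conversion rates is not evidence of a real edge.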
Splitting by Distance
A necessary task for this project is to stratify our data by the distance to go on each play. Splitting the categories up by each individual yard likely would’ve led to some very noisy data due to small sample sizes (e.g., there probably haven’t been many fourth-and-11 passing attempts that came immediately after timeouts), which is why I went with the “Short”, “Medium”, and “Long” system. The pbp_all_NoSpecials_fourthDown data frame has 2,300 fourth-and-short plays, 1,211 fourth-and-medium plays, and 1,122 fourth-and-long plays. A breakdown of all of these plays can be seen in the following group_by table:
For those who favor a more visual approach rather than the code, all three categories are accounted for in the following graph, made using the ggplot2 package in R:
The black brackets represent 95% confidence intervals. As a rough rule of thumb, if these intervals overlap, we can’t confidently say that there’s a significant difference between the two groups being compared. All three pairs of intervals above overlap, indicating that in all three distance categories, calling a timeout doesn’t seem to noticeably help or hurt the offense.
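To make the interval comparison concrete, here is a minimal Python sketch of a 95% confidence interval for a conversion rate and a simple overlap check. It assumes the standard normal-approximation interval for a proportion, which may not be exactly how the brackets in the graph were computed, and the counts in the example are hypothetical rather than taken from the data.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    rate = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(rate * (1 - rate) / n)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts for illustration only.
timeout_ci = proportion_ci(55, 100)
no_timeout_ci = proportion_ci(50, 100)
print(timeout_ci, no_timeout_ci, intervals_overlap(timeout_ci, no_timeout_ci))
```

Overlap is a conservative check: intervals that do not overlap point to a real gap, while overlapping intervals are better followed up with a direct test on the difference.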
Runs vs. passes: does it make a difference?
If distance alone wasn’t able to reveal any impact created by timeouts, incorporating further divisions based on play type is the logical next step. As a frame of reference, all fourth down passes have a mean conversion rate of 0.429, and fourth down runs have a mean rate of 0.646, which makes sense given that fourth down runs typically come with far less yardage to go.
That data can be seen in graphical form here:
A similar conclusion to our first graph is reached. Even when we break into runs and passes separately, calling a timeout has no apparent impact on the success of either.
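The grouped summaries behind these breakdowns boil down to counting attempts and conversions per group. Here is a minimal Python sketch of that idea; it is not the group_by code used for the project, and the field names (play_type, distance, after_timeout, converted) and toy rows are assumptions for illustration.

```python
from collections import defaultdict


def conversion_rates(plays, keys):
    """Return {group: (conversion_rate, attempts)} for the given grouping keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [conversions, attempts]
    for play in plays:
        group = tuple(play[key] for key in keys)
        totals[group][0] += int(play["converted"])
        totals[group][1] += 1
    return {
        group: (conversions / attempts, attempts)
        for group, (conversions, attempts) in totals.items()
    }


# Toy rows for illustration only; these are not real plays.
toy_plays = [
    {"play_type": "run", "distance": "Short", "after_timeout": True, "converted": True},
    {"play_type": "run", "distance": "Short", "after_timeout": True, "converted": False},
    {"play_type": "run", "distance": "Short", "after_timeout": False, "converted": True},
    {"play_type": "pass", "distance": "Long", "after_timeout": False, "converted": False},
]

by_type = conversion_rates(toy_plays, ["play_type", "after_timeout"])
for group, (rate, attempts) in sorted(by_type.items()):
    print(group, f"rate={rate:.3f}", f"n={attempts}")
```

Passing ["distance", "play_type", "after_timeout"] as the keys gives the finer split by distance, play type, and timeout at once. Keeping the attempt count next to each rate makes small groups easy to spot.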
Combining it all: play type and distance to go
We’ve seen that the length of a fourth down attempt on its own doesn’t seem to indicate any positive value in calling a timeout, nor does the “run vs. pass” designation of a play. What happens if we look at both simultaneously? The following table is the most detailed one yet, grouping all fourth down plays separately based on distance, run vs. pass, and whether a timeout was called or not:
For more visual thinkers, the same information is conveyed in the following three graphs:
At first glance, there are hints that a timeout can have some effect. In particular, it looks like the gaps between “timeout vs. no timeout” on fourth-and-long runs and fourth-and-medium runs may be worth looking into. (As an aside, the first of the three graphs clearly indicates that on fourth-and-short, a run is more effective than a pass.)
Do the graphs deceive us, or does calling a timeout actually harm the offense in these fourth-and-medium and fourth-and-long runs? To find out in more detail than looking at the black confidence intervals, we can run some T-tests:
It appears that we have something. A p-value of only 0.02 — that has to mean we’re in business, right? We have statistical proof that calling a timeout hurts the offense on fourth-and-long runs?
Unfortunately, deeper analysis shows that this is a case where the numbers can be deceptive. Scrolling back up to our group_by table shows that the sample size for fourth-and-long runs following a timeout is only 13 plays, a small enough sample that it’s reasonable to dissect the plays one by one. Specifically, three of the 13 came late in the fourth quarter of blowout games in which the offensive team was winning and simply used its fourth down play to either take an intentional safety or run around in the backfield for as long as possible to kill clock. (Drew Brees did it twice, and Seahawks punter Jon Ryan did it once.)
As such, there are only 10 total “real” plays, for lack of a better word, meeting the criteria of fourth-and-long, run, and following a timeout over the last 10 seasons. This is simply far too small a sample to draw any conclusions from. (For what it’s worth, the p-value of a T-test jumps from 0.02 to 0.07 after removing those three plays.) Furthermore, of the 52 plays in pbp_all_fourthandlong_Runs_NOTimeout, 28 are fake punts or field goals, which means the sample of true fourth-and-long runs with no timeout is also much smaller than originally suggested. The bottom line is that running the ball on fourth-and-7+ out of a traditional offensive formation is both incredibly rare and incredibly ill-advised, and the presence of a timeout doesn’t change that in either direction.
Does a timeout help one get “the big play?”
Our earlier data suggested that a timeout could be more likely to lead to a more explosive play. To find out if there’s any substance to this, we can create a new metric called “Big Plays”, which I deemed to be any fourth down play that was either a touchdown, or a conversion that picked up 10+ yards. The following graph breaks down how likely big plays are by distance to go:
At first glance, our intervals on the right suggest that a big play is far more likely when a timeout is called before a fourth-and-short. But a series of T-tests showed this to be misleading. Specifically, the average fourth-and-short run that came after a timeout was 7 yards closer to the end zone than the average one that came without a timeout, with a p-value of 9.96 × 10^-7. (In other words, the gap is very unlikely to be due to random chance.) Because any touchdown counts as a big play, plays that start closer to the end zone are more likely to qualify regardless of whether a timeout was called.
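The “Big Plays” definition is simple enough to write down directly. Here is a minimal Python sketch; the field names (touchdown, converted, yards_gained) and the toy rows are assumptions for illustration, not the actual columns or plays from the data set.

```python
def is_big_play(play):
    """A touchdown, or a conversion that picked up at least 10 yards."""
    return bool(
        play["touchdown"] or (play["converted"] and play["yards_gained"] >= 10)
    )


def big_play_rate(plays):
    """Share of plays that count as big plays (0.0 for an empty list)."""
    if not plays:
        return 0.0
    return sum(is_big_play(play) for play in plays) / len(plays)


# Toy rows for illustration only; these are not real plays.
toy_plays = [
    # 1-yard touchdown from the goal line: big play despite the short gain.
    {"touchdown": True, "converted": True, "yards_gained": 1},
    # Routine 3-yard conversion: not a big play.
    {"touchdown": False, "converted": True, "yards_gained": 3},
    # 15-yard conversion: big play.
    {"touchdown": False, "converted": True, "yards_gained": 15},
    # 12-yard gain that fell short on fourth-and-long: not a big play.
    {"touchdown": False, "converted": False, "yards_gained": 12},
]

print(big_play_rate(toy_plays))
```

The first toy row shows why field position matters for this metric: a one-yard plunge at the goal line qualifies, so a group of plays that starts closer to the end zone will tend to post a higher big-play rate.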
Possible Sources of Error/Other Comments
As is the case with any statistical analysis project, there are some factors that are very difficult to account for. For starters, there’s the issue of how to handle the fluky misclassified plays (like the example of the botched field goal attempt). From a coding standpoint, omitting them made the most sense, because the fact that NFLScrapR classifies them as neither passes nor runs would complicate the calculations. From a football standpoint, determining the “right” course of action has to be done on a case-by-case basis, because some of these plays were intended punts or field goals gone wrong (therefore, rightly being omitted from the project), while some were intended offensive plays that went wrong so quickly that the statisticians couldn’t tell if they were supposed to be runs or passes. This is one of several cases where I would need unlimited time and unlimited access to NFL game film to evaluate each play fairly.
On a similar note, another key question is how to handle the fake field goals/punts that were actually done on purpose. These are technically offensive plays, but because they are defended so differently due to both sides having special teams formations pre-snap, it’s easy to argue that these should also be omitted. However, it would require significantly more complex code involving text analysis to detect and remove those plays, and those plays were infrequent enough that they didn’t hugely skew any of the data (save for the fourth-and-long runs subset, which was already very small). So for this reason, I chose not to attempt to remove them all, despite the arguable flaw it causes.
There are also some plays where a fourth down conversion wasn’t that meaningful, and the defense didn’t care about allowing it. For example, when the Jets played the Texans in 2009, New York faced a fourth-and-2 from the HOU 43 with two seconds left in the first half. Mark Sanchez completed a pass to Dustin Keller for 25 yards, but even though the data marks this as a fourth down conversion, it was still a success from the defense’s standpoint. As was the case with fake field goals, these plays were infrequent enough that it would’ve been counter-productive to either scroll through all 4,000-plus play descriptions to find them or beef up the code to detect these plays automatically, so I let them stand as is.
Finally, in perhaps the most arguable concept here, I didn’t isolate plays where the offensive team called timeout, instead looking at all fourth downs that followed a timeout at all. This was for two reasons: partially in an effort to boost our reasonably small sample size, but primarily because in the context of the project, which team called the timeout isn’t that important. Whether the offense or defense makes the call, both teams get 60+ seconds to figure out how they are approaching the upcoming play, and the purpose of the project is to determine whether a fourth down conversion is more likely or not after that minute-or-so period happens. In that regard, it doesn’t matter who called the timeout; it’s not as if the offense doesn’t get to discuss its play call if the defense was the one to call timeout. It certainly could be material for a future project to look into whether the offense or defense calling the timeout makes a difference, but I felt this project was lengthy enough as is.
Thanks for the read, and I hope to hear any positive or negative feedback. I’d like to give a special thanks to Keegan Abdoo of NFL Next Gen Stats and Bailey Joseph of the Oklahoma City Thunder for giving specific R tips.
Cole Jacobson is an Editorial Researcher at the NFL Media office in Los Angeles. He played varsity sprint football as a defensive lineman at the University of Pennsylvania, where he was a 2019 graduate as a mathematics major and statistics minor. With any questions, comments, or ideas, he can be contacted at jacole@alumni.upenn.edu, and @ColeJacobson32 on Twitter.