KEES+: A Stuff Model That Cares About Pitch Tunneling

Mar 13, 2024

By: Kees Hendrik van Hemmen

KEES+ Data:

https://docs.google.com/spreadsheets/d/1jo9aojnYg6gPp7JT0nmpBHs74G8NIh9L6nJoiEUYtqU/edit?usp=sharing

HENDRIK+ Data: To Be Uploaded (I’m still tweaking this model in hopes of improving it even more)

KEES+ Github Link: To Be Uploaded (I’m making the Jupyter Notebook nice and neat so you don’t have to bumble through my crazy workflow)

FF=4-seam;SI=Sinker;FC=Cutter;FS=Splitter;ST=Sweeper;CH=Changeup;SL=Slider;CU=Curveball

A few months ago I had an idea. I had seen the popular Stuff models that had blown up in the public sabermetrics realm during the 2023 MLB season, and I wanted to try my hand at making my own. Not only that, but I wanted to add a bit of flair: I wanted to make a Stuff model that encoded certain intuitions we have about pitch tunneling. I didn’t just want to know how hard a pitcher’s pitches were to hit; I wanted to know how hard they were to hit in the context of the rest of their pitch arsenal. This was inspired predominantly by this article from prospectlive.com and Lance Brozdowski’s YouTube, which often discusses principles surrounding how different pitches interact with one another.

The idea was basically this: I would take the average, 95th percentile, and 5th percentile of all of the metrics that would typically go into a Stuff+ model (velocity, horizontal and vertical break, spin rate, etc.) for each pitch thrown by each pitcher. Then I would engineer new features for each pitch: the difference between each of those aggregated values (average fastball velocity, 95th percentile fastball spin rate, 5th percentile breaking ball horizontal break, etc.) and the equivalent values for each individual pitch (velocity, spin rate, etc.). The result would be a model that could encode truths that many pitching coaches know about designing a pitching arsenal. A good example would be something like this: a normal old-fashioned stuff model might like a fastball with lots of tailing movement. But when paired with a breaking ball with lots of sweeping movement, the two cease to complement one another, as they’re too easy to tell apart. A normal Stuff+ model wouldn’t be able to account for this. Ideally, mine would.
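To make the feature construction concrete, here is a minimal sketch of the "difference from the arsenal" idea for a single metric. The dictionary keys (`pitcher`, `pitch_type`, `velo`), the helper names, and the use of "FF" as the reference pitch are all illustrative assumptions; the actual notebook groups pitches into a primary fastball, breaking ball, and changeup and most likely uses pandas rather than plain Python.

```python
from statistics import mean


def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (pct / 100) * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def reference_stats(pitches, pitcher, reference_type, metric):
    """Mean, 95th and 5th percentile of `metric` for one pitcher's reference pitch."""
    vals = [p[metric] for p in pitches
            if p["pitcher"] == pitcher and p["pitch_type"] == reference_type]
    if not vals:
        return None  # this pitcher does not throw the reference pitch
    return {"mean": mean(vals), "p95": percentile(vals, 95), "p05": percentile(vals, 5)}


def add_tunnel_features(pitches, reference_type, metric):
    """Return copies of the pitches with '<metric>_minus_<reference_type>_<stat>' keys added."""
    cache = {}
    out = []
    for p in pitches:
        if p["pitcher"] not in cache:
            cache[p["pitcher"]] = reference_stats(pitches, p["pitcher"], reference_type, metric)
        row = dict(p)
        for stat, ref in (cache[p["pitcher"]] or {}).items():
            row[f"{metric}_minus_{reference_type}_{stat}"] = p[metric] - ref
        out.append(row)
    return out


# Toy data: three four-seamers and one slider from the same (hypothetical) pitcher.
pitches = [
    {"pitcher": "A", "pitch_type": "FF", "velo": 94.0},
    {"pitcher": "A", "pitch_type": "FF", "velo": 95.0},
    {"pitcher": "A", "pitch_type": "FF", "velo": 96.0},
    {"pitcher": "A", "pitch_type": "SL", "velo": 85.0},
]
features = add_tunnel_features(pitches, "FF", "velo")
```

In this toy example the slider ends up with three extra values describing how far its velocity sits from the pitcher's average, hardest, and softest four-seamers, which is the kind of signal a tree model can use to reason about separation between pitches.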

Anyway, enough of the buildup. Let’s get down to the nitty-gritty. The target variable for the model was delta expected run value (ERV from here on), which is a fancy way of saying the change in the number of runs you’d expect the batting team to score before the inning ends. Statcast data has this as a built-in variable, so all I had to do was adjust the value to remove the context of outs, runners on base, and the count. Once that was done, I did some feature engineering. I started with the following features:

Spin Rate
Spin Axis
Release speed
Extension
Horizontal Release Point
Vertical Release Point
Horizontal Break
Induced Vertical Break

I then used these features to engineer 2 more ‘base’ features: Vertical Approach Angle and Horizontal Approach Angle. You can find explainers for those two metrics here and here respectively.

I then had to make a further adjustment: Vertical Approach Angle (VAA) is linearly related to pitch location. A pitch at the top of the zone has a higher VAA than a pitch at the bottom of the zone (read the VAA primer I just linked to understand why). I didn’t want this, because it would unintentionally encode pitch location into my stuff model (it’s not supposed to consider location!). The way I got around this was by changing VAA to VAA Above Average, taking the difference between the VAA of a pitch and the average VAA of said pitch given where it was located vertically in the zone. I got this idea from the aforementioned article on Fangraphs by Alex Chamberlain (thanks Alex!).

This left me with 10 ‘base’ features. I took the difference between each of these base features and the mean, 95th percentile, and 5th percentile value for each of the same features for each pitcher’s primary fastball, breaking ball, and changeup. This, along with a whole bunch of other features I engineered, left me with something like 200+ features at my model’s inception. After many weeks of feature selection and hyperparameter optimization that I won’t bore you with, I ultimately wound up with the following 20 features, sorted by importance, from most important to least important.

20 features were included in the final version of the KEES+ model

That’s actually a half-truth: I trained 4 models, one for each handedness combination (RvL, LvR, RvR, LvL) of pitchers and batters. What you see above is the feature importance for the Righty vs. Righty model, but it’s worth noting that the four models varied hardly at all in the order of importance for the majority of features. I had the idea for four different models, one for each handedness combination, from this piece by Chase Coppersmith.
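The four-model setup can be sketched as a simple routing step: group pitches by pitcher and batter handedness, then fit one model per group. `p_throws` and `stand` are the Statcast column names for pitcher and batter handedness. The `target` key and the `mean_baseline` trainer are placeholders so the example runs without third-party packages; the real models here were XGBoost regressors.

```python
from collections import defaultdict


def split_by_handedness(rows):
    """Group pitches into matchup buckets such as 'RvL' (pitcher hand v batter side)."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"{row['p_throws']}v{row['stand']}"].append(row)
    return dict(groups)


def train_per_matchup(rows, train_fn):
    """Fit one model per handedness matchup using the supplied training function."""
    return {matchup: train_fn(group)
            for matchup, group in split_by_handedness(rows).items()}


def mean_baseline(group):
    """Placeholder trainer: predicts the group's average target for every pitch."""
    average = sum(r["target"] for r in group) / len(group)
    return lambda row: average
```

At prediction time each pitch is routed to the model for its own matchup, so a righty's slider is only ever scored against what worked for righties facing that batter side.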

HAA sadly did not make the cut, as it was highly correlated with horizontal break, and just didn’t add enough value to justify keeping. The same was true of almost all of the ‘Mean’ and ‘5th percentile’ differences. In any place you see “Difference” in the title above, it’s the difference from the 95th percentile of that metric for the pitch type and pitcher in question. The only exception is “Difference” metrics using VAA Above Average, which is the difference from the mean VAA Above Average for the pitch type and pitcher in question.

The ultimate result was a model that had two clear distinctions: 1) it loved VAA Above Average and 2) it cared about the tunneling metrics I’d introduced. And number 2 was not just a trivial add-on: the tunneling metrics were among the most important features in the model.

[Boring Data Science Aside]

Now, if you’re a machine learning engineer, you might be asking yourself: couldn’t those “tunneling” metrics just be highly correlated with their associated base features, so that you’re really taking a roundabout way of encoding the same features multiple times? I asked myself that same question. The answer is no: no pair of the 20 features in the final model had a correlation coefficient greater than 0.8, with the overwhelming majority of pairs falling closer to 0.2. No multicollinearity here, not that it would matter too much: the model I used was an XGBoost model, and as a tree-based ensemble learner its predictions generally wouldn’t be hurt by multicollinearity anyway.

[End Data Science Aside]

One final detail before we get into the fun part: my model was trained on the 2017–2019 and 2021–2022 seasons, and tested on the 2023 season.
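Splitting by season rather than at random keeps every 2023 pitch out of training. A minimal sketch, assuming each row carries a hypothetical `season` field (note that 2020 appears in neither set):

```python
TRAIN_SEASONS = {2017, 2018, 2019, 2021, 2022}
TEST_SEASONS = {2023}


def split_by_season(rows):
    """Return (train, test) lists; rows from any other season are dropped."""
    train = [r for r in rows if r["season"] in TRAIN_SEASONS]
    test = [r for r in rows if r["season"] in TEST_SEASONS]
    return train, test
```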

With all of that covered, let’s get into the results. Meet KEES+: The new public Stuff model I ever-so-humbly named after myself. In the following piece I will compare KEES+ with an equivalent model, Eno Sarris’ Stuff+, to see how the two measure up to each other in different tasks.

[Edit: It’s come to my attention that Stuff+ may have recently been retrained on 2023 pitch data. This would constitute a significant advantage in explaining variance on 2023 data, which is my testing set. If this is the case, KEES+ may actually be outperforming Stuff+ by a wide margin across the board. Check out Chase Copperfield’s piece from last year for a look at how Stuff+ performed before this change.]

How Does KEES+ Do When Compared With Other Stuff+ Models?

The short answer? Well. The long(er) answer? It’s complicated.

Stuff+ does a better job of explaining K rate, full stop. No two ways around it. I suspect this may be because, rather than using ERV as a target variable, Sarris may have used whiff (a pitch that induced a whiff would receive a score of 1, whereas a pitch that did not would receive a score of 0). This would make the model more discerning about swing-and-miss, but less discerning about outcomes on contact. With that said, when we look at how the two perform on wOBA…

… Stuff+ is still superior. So, why should you care about KEES+ at all?

Because, it appears, KEES+ is more predictive of future ERA than Stuff+. “But, Kees” — you might be saying — “Stuff+ isn’t meant to predict ERA, it’s meant to be a raw measure of stuff quality, and thus meant to be more correlated with strikeout rate.” Well, be that as it may, when one compares Stuff+’s equivalent Pitching model Pitching+ with KEES+’s equivalent Pitching model HENDRIK+ (Hendrik is my middle name), we get some more illuminating results:

HENDRIK+ outperforms Pitching+ when used to predict a pitcher’s ERA the ensuing season as well. This is very much one of Pitching+’s purposes, and HENDRIK+ is, by the looks of it, at least marginally outperforming Pitching+ in doing so. The margin is not so large as to support claiming that HENDRIK+ is a superior model, but it is large enough to suggest that KEES+ and HENDRIK+ are modeling something different, and of equal importance, to that which Stuff+ and Pitching+ attempt to model.

It’s worth noting, also, that KEES+ is quite sticky. Year to year, 2022 KEES+ explained more than 70% of the variance in 2023 KEES+.

So, in summary, KEES+ explains variation in K% worse than Stuff+, but as a predictive metric for future pitching performance, it actually betters Stuff+ and its sister metric Pitching+.

All of which brings us to the truly interesting question. If we accept that KEES+ and Stuff+ both have merit in accomplishing the same task while producing very different outputs, then what is KEES+ doing well that Stuff+ is having more trouble with? And vice versa?

[Note: the standard deviation of pitches in Stuff+ is about 2x the standard deviation of pitches in KEES+. This doesn’t affect the following analysis, but it does explain why you’ll see fewer very high and very low KEES+ scores]

Pitch Type Comparison of Stuff+ and KEES+: 4-seam Fastballs

One thing I noticed in comparing my model to Stuff+ and PitchingBot was this: KEES+ cares way less about four seam velocity than the models available on Fangraphs:

All graphs in this section are comparing pitches in the 2023 season

When I first saw this, I was panicking a bit. I think we can all agree Félix Bautista’s 4-seamer is not a league average pitch… KEES+ gives it a score of 99.5. Dead set league average. Not a great look for the eye test.

This plays out in whiff and K rate comparisons. We can say with confidence that Stuff+ does a better job of predicting swing-and-miss on 4-seam fastballs than KEES+:

Stuff+ describes the variance in whiff rate for 4-seamers better than KEES+ in smaller samples
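For reference, whiff rate and a binary whiff target of the kind speculated about earlier can both be derived from Statcast's `description` column. The exact set of descriptions that count as a swing or a whiff is a judgment call; the sets below are one common choice, not necessarily what either model uses.

```python
WHIFF_DESCRIPTIONS = {"swinging_strike", "swinging_strike_blocked"}
SWING_DESCRIPTIONS = WHIFF_DESCRIPTIONS | {"foul", "foul_tip", "hit_into_play"}


def whiff_target(description):
    """1 if the pitch drew a swing and miss, else 0."""
    return 1 if description in WHIFF_DESCRIPTIONS else 0


def whiff_rate(descriptions):
    """Whiffs divided by swings; returns None when there were no swings."""
    swings = sum(d in SWING_DESCRIPTIONS for d in descriptions)
    if swings == 0:
        return None
    return sum(whiff_target(d) for d in descriptions) / swings
```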

Finally, Stuff+’s greater ability to explain swing-and-miss on 4-seamers appears to carry over to what we really care about: wOBA.

In smaller samples, Stuff+ describes variance in wOBA better than KEES+

This is a pretty discouraging start to our deep dive on specific pitch comparisons. But wait! There’s a twist! When I increase the minimum number of pitches thrown from 250 to 500, a change occurs:

In larger samples, KEES+ is capable of describing variance in wOBA against 4-seamers just as effectively as Stuff+

KEES+ and Stuff+ describe wOBA against 4-seamers at equal proficiency when considering only 4-seamers thrown more than 500 times. There are two possible explanations for this.

The first possible explanation is simple: when you increase the number of pitches thrown, you necessarily filter out relievers. Relievers throw harder, on average, than starters. So it may be that KEES+ performs significantly better on the 4-seamers thrown predominantly by pitchers who have thrown more pitches (i.e., starters). This would make sense. Most starters that have effective 4-seam fastballs rely not only on velocity, but also on good fastball shape, rather than simply pumping 100 past opposing hitters. I wanted to rule out this explanation, so I compared the two models on fastballs thrown above and below 95 mph. Does Stuff+ perform better on high velocity fastballs?

KEES+ actually performs significantly better on high velocity 4-seamers than on low velocity ones

Nope. KEES+ actually outperforms Stuff+ on high velocity fastballs, whereas Stuff+ takes KEES+ to town on lower velocity fastballs. So we can safely rule out the reliever hypothesis.

The alternative explanation is also quite simple: the inferential distinction that KEES+ has from other models is partially down to pitch tunneling related features. It’s likely that these kinds of features would take longer to stabilize than something like, for instance, fastball velocity, which we’ve just established KEES+ cares about far less than Stuff+. Pitchers who have thrown their fastball more have also likely thrown their secondaries more, and throwing your secondaries more means that their associated metrics have likely stabilized to a greater extent. The result? KEES+ can more reliably discern how the 4-seamer is tunneling with said secondaries.

How KEES+ and Stuff+ perform on fastballs depending on where they’re located

One last note: it appears that KEES+ is more indicative of success when throwing your fastball outside of the zone, whereas Stuff+ is more indicative of success when throwing in the zone. This probably makes Stuff+ a better measure of 4-seamer quality, as 4-seamers are generally pitches that pitchers use to throw for strikes.
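Splitting pitches by location requires a rule for what counts as "in the zone". One simple option, sketched below, uses Statcast's `plate_x`, `plate_z`, `sz_top`, and `sz_bot` values (all in feet) together with the 17-inch width of home plate. The optional ball-radius allowance is an assumption; definitions of the zone edge vary between sources.

```python
PLATE_HALF_WIDTH_FT = (17 / 12) / 2  # home plate is 17 inches wide


def in_zone(plate_x, plate_z, sz_top, sz_bot, ball_radius_ft=0.0):
    """True if the pitch crossed the plate inside the batter's strike zone."""
    inside_horizontally = abs(plate_x) <= PLATE_HALF_WIDTH_FT + ball_radius_ft
    inside_vertically = sz_bot - ball_radius_ft <= plate_z <= sz_top + ball_radius_ft
    return inside_horizontally and inside_vertically


def split_by_zone(pitches):
    """Return (in_zone_pitches, out_of_zone_pitches) for dicts holding Statcast-style keys."""
    inside, outside = [], []
    for p in pitches:
        target = inside if in_zone(p["plate_x"], p["plate_z"], p["sz_top"], p["sz_bot"]) else outside
        target.append(p)
    return inside, outside
```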

So, what’s the takeaway here? Stuff+ is probably better for assessing 4-seamers in small samples (how nasty was your 4-seamer yesterday?) whereas KEES+ is at least as good at assessing 4-seamers in larger samples (how nasty was your 4-seamer last year?). KEES+ also performs better on high velocity 4-seam fastballs, whereas Stuff+ performs better in the strike zone.

Sinkers

I don’t have much to add here. KEES+ does much better on sinkers than Stuff+ when it comes to wOBA. I don’t know what Stuff+ is using as a target variable, so it’s possible that this isn’t a fair comparison. If you look at whiff rate, things flip:

Stuff+ predicts whiffs on sinkers better than KEES+

Stuff+ predicts whiffs on sinkers a lot better than KEES+, so if that’s what you want to know about a sinker, go to Fangraphs. That said — sinkers aren’t really a swing-and-miss pitch for most pitchers. Pitchers are generally looking for weak contact when throwing them, and I think that’s what’s showing up in the difference in these two models. Once again, Stuff+’s preference for velocity is giving it an edge in predicting swing-and-miss, but ultimately not lending a real edge in predicting pitch efficacy.

My takeaway? Use KEES+ to understand sinker efficacy, but use Stuff+ to understand whether or not a sinker will induce swing-and-miss.

Changeups

Stuff models famously have a ton of trouble with arm-side break. Splitters (which I’ve omitted from this analysis because the sample sizes were so small), and in particular changeups, are really hard to model. This is often thought of as being due to one of two reasons: either a) changeup efficacy is primarily down to how a changeup tunnels with its fastball or b) pitch location, which appears in Pitching+ models but not Stuff+ models, plays a big role in predicting changeup success. I don’t know if I answered that question with KEES+, but I definitely learned some interesting stuff.

I’ve used whiff rate and xwOBA, rather than wOBA, for this analysis. This is because neither of the two models does very well at predicting wOBA on changeups. I suspect this is because changeups are heavily dependent on deception. As a result, they’re heavily dependent on whiffs. Most of the skill in a good changeup is about inducing whiffs. Once a hitter makes contact with a changeup, the nastiness of the changeup has very little bearing on the quality of the contact. This is different from something like a sinker, for example: a great sinker is not necessarily hard to make contact with, but rather it’s hard to square up and drive. That means that good changeups sometimes get hit hard in smaller samples (like a single season, for example) and predicting wOBA becomes difficult because of noise. Expected wOBA helps filter this out, as does whiff rate.

The performance here is pretty similar. Stuff+ does slightly better when it comes to describing the variance in xwOBA against changeups. This is a bit of a disappointment, as I was expecting the tunneling related features in KEES+ to help with modeling changeups. It appears that was not to be. Stuff+ also once again does significantly better at assessing whiffs, interestingly:

Somehow this doesn’t bear out to nearly as exaggerated an extent in xwOBA, so clearly there’s more to limiting quality contact in changeups than I previously posited. Another interesting bit: KEES+ seems to have a better handle on hard changeups, whereas Stuff+ seems to be more effective with slow changeups. Take a look:

The statistical significance on this last bit is pretty dicey, so take it with a grain of salt, but it’s interesting enough to be, well: interesting.

Finally, I took a look at changeup performance based on location of each pitch.

Model performance in predicting Run Value of changeups based on pitch location

The results? KEES+ seems to be more indicative of a changeup’s performance within the strike zone, whereas Stuff+ appears more indicative of a changeup’s performance outside the strike zone. This makes sense: changeups that induce lots of whiffs are generally changeups that are thrown outside of the zone, and Stuff+ seems to care quite a bit about inducing whiffs. On the other hand, many pitchers who count on their changeup as a primary offspeed offering still throw their changeup in the zone, typically looking for in-zone whiffs and called strikes. KEES+ may be picking up on this.

My takeaway for changeups? Stuff+ is doing better here, but if I had a larger sample I suspect KEES+ would do as well as Stuff+, if not outpace it. More on that later.

Curveballs, Sliders, and Cutters

Let’s start with curveballs.

Stuff+ has a slight edge here for xwOBA and a huge edge for whiffs. I suspect this is because there are, broadly speaking, two types of effective curveball. There are big, loopy curveballs that pitchers throw in the zone and use to steal called strikes. There are also shorter breaking, higher velocity curveballs that pitchers use to induce swing-and-miss. If I had to bet, I’d say my model is doing better at seeing the value in the curveballs being thrown for strikes, while Stuff+ is doing a better job dealing with the whiff inducing curves.

Note that the minimum pitches thrown is very different in these two plots. This was done to increase statistical significance.

After a lot of massaging the numbers to get r scores that were statistically significant, it looks like my hypothesis might be correct. Stuff+ appears to do better with curveballs that induce lots of swing-and-miss, whereas KEES+ appears to do better at modeling curveballs that induce very little swing-and-miss. I wouldn’t lean on this too hard as an insight, but it’s something to keep an eye on. You can also see it bear out when you look at curveball performance based on location:

KEES+ and Stuff+ performance of Curveballs in the zone

KEES+ is performing better on curveballs located in the zone.

KEES+ and Stuff+ performance of Curveballs outside the zone

Stuff+ is performing better on curveballs located outside of the zone. This seems to support the two-types-of-curveball theory. That said, overall, I’d say Stuff+ has a slight edge for curves. When you throw a curveball you’re looking for swing-and-miss, and Stuff+ captures that better.

On to sliders.

Both of these models do really well at predicting wOBA against sliders. Stuff+ does better in smaller samples, but KEES+ closes the gap as a pitcher throws his slider more, ultimately eclipsing Stuff+ somewhere around 500–600 pitches. This seems to be a trend across a lot of different pitch types, and follows logically: as tunneling metrics stabilize, the inferential power of KEES+ sometimes eclipses that of Stuff+. You can see this bear out one more time in slider whiff rate:

Left: Minimum pitches thrown = 250, Stuff+ far superior; Right: Minimum pitches thrown = 500, KEES+ superior

Slider takeaway? Stuff+ for small samples, KEES+ for larger samples.

Lastly for breaking balls, we look at cutters.

Left: Minimum pitches thrown = 400, Stuff+ superior; Right: Minimum pitches thrown = 500, KEES+ superior

Once again, we have the same trend. Somewhere around 500 pitches thrown, KEES+ eclipses Stuff+. KEES+ also keeps pace when it comes to predicting whiff rate:

The Major Discrepancies

Now that you know more about the model’s inner workings, let’s take a look at some pitchers and pitches that Stuff+ likes, but KEES+ doesn’t, and vice versa. First, let’s start with the ones KEES+ likes and Stuff+ does not:

The 20 pitches with the largest difference in KEES+ percentile rank and Stuff+ percentile rank

[A Note: “Kees Qualitative Assessment” isn’t actually qualitative: it’s just me dropping xwOBA Percentile Rank into buckets. Anything <15th percentile is pure filth, anything <30th percentile is plus, etc.]
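The discrepancy lists are built from percentile ranks, so a short sketch may help. The functions below rank each pitch within each model, sort by the gap between the two ranks, and map an xwOBA percentile to a label. Field names are hypothetical, and only the two buckets spelled out in the note are encoded; the remaining cutoffs aren't given here, so everything else gets a placeholder label.

```python
from bisect import bisect_left


def percentile_ranks(values):
    """Percentile rank (0 to 100) of each value; tied values share the lower rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        return [50.0 for _ in values]
    return [100 * bisect_left(ordered, v) / (n - 1) for v in values]


def largest_gaps(pitches, top_n):
    """Pitches with the largest (KEES+ percentile minus Stuff+ percentile) gap."""
    kees_pct = percentile_ranks([p["kees"] for p in pitches])
    stuff_pct = percentile_ranks([p["stuff"] for p in pitches])
    ranked = sorted(zip(pitches, kees_pct, stuff_pct),
                    key=lambda item: item[1] - item[2], reverse=True)
    return [dict(p, gap=k - s) for p, k, s in ranked[:top_n]]


def qualitative_bucket(xwoba_percentile):
    """Label for an xwOBA-against percentile rank (lower is better for the pitcher)."""
    if xwoba_percentile < 15:
        return "pure filth"
    if xwoba_percentile < 30:
        return "plus"
    return "unspecified"  # placeholder: the other cutoffs are not listed in the note
```

Reversing the sort order (or swapping the two score columns) produces the opposite list: pitches that Stuff+ ranks far higher than KEES+.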

Immediately here I feel we get some vindication: the first pitch on this list is a curveball that isn’t inducing whiffs at a high rate. I haven’t looked at how often he’s throwing it in the zone, but I’d venture to guess this is a big, loopy curve that Thompson’s using to steal strikes all the time. It’s certainly big and loopy: he’s getting over 7 inches more vertical drop on this curveball than league average. KEES+ likes it a lot, rightfully, and Stuff+ thinks it’s garbage, probably because it doesn’t have the characteristics that usually induce swing-and-miss.

Otherwise, we’ve got a mixed bag of sinkers that my model likes and Stuff+ does not. Again, that’s mostly down to whiff inducing characteristics I suspect. There are also a half dozen or so changeups in here that KEES+ likes a lot (rightfully) and Stuff+ hates. Finally, in 20th, we’ve got Rich Hill’s cutter. For some reason KEES+ loves Rich Hill’s secondary offerings. All of them. And, in 2023, they were all absolutely awful. A peculiar blind spot. Now let’s go to the pitches that Stuff+ likes, but KEES+ does not.

The 20 pitches with the largest difference in Stuff+ percentile rank and KEES+ percentile rank

Hands up: this list looks a lot better than mine. Whereas KEES+ identified 15/20 pitches that are above the median pitch in terms of xwOBA, Stuff+ identified 20/20. No misses. Though KEES+ had some major wins in the changeup and strike-stealing curveball department, it had some major losses when it came to identifying some of the best strikeout pitches in the game. Tyler Glasnow’s curveball and Blake Snell’s slider here are two major misses, with both of them inducing whiff rates above 50%. Oof.

KEES+ actually grades all of Snell’s stuff, except his fastball (which in real life is his worst pitch), as below average. I suspect the reason for this is as follows: KEES+ weights VAA Above Average very highly. Snell and Glasnow both spin the h*ll out of the ball on their slider and curveball, respectively. The result is this: both pitches will reach home plate at a crazy low VAA, because the spin on the pitches is inducing tons of drop that’s not associated with gravity. That, and they’re burying them so deep trying to induce whiffs that the ‘above average’ adjustment that I’m doing to remove location is probably not entirely effective. “Average” VAA more than a few inches north or south of the zone gets wonky very fast. My guess is that KEES+ ultimately views these as super hittable pitches, not knowing that neither of these guys is ever leaving these pitches anywhere near the zone because they’re so nasty. An interesting quirk.

Otherwise, the theme here was swing-and-miss. The cutters and changeups on this list are all super high whiff rate pitches. Stuff+ hunts those characteristics — KEES+, less so.

Now for some pitchers:

The 10 Pitchers with the largest difference in KEES+ percentile rank and Stuff+ percentile rank

The first five names on this list all have something in common: they sport a changeup that KEES+ loves and Stuff+ absolutely hates. Pablo López falls into this boat as well. In every instance, KEES+ is correct to be significantly higher on the changeups than Stuff+ is. In a lot of cases, KEES+ also likes these pitchers’ primary fastballs a lot more than Stuff+, and in those instances it’s typically right as well. This further reinforces my conviction that KEES+ has the potential to perform better on changeups than Stuff+ across a larger sample, largely as a result of its encoding of tunneling metrics.

This seems important. A big question that’s often asked of Stuff models is why they so frequently fail to explain the success of changeup-reliant middle-of-the-rotation starters. It appears that perhaps KEES+ offers a metric that can explain some of this performance.

Justin Steele’s place on the list is down to his low velocity four seam — it was a super valuable pitch for him last year, and KEES+ appears to see merit either in its shape or tunnel, whereas Stuff+ does not. I suspect this is largely because it only averages 91.8 MPH. Joe Ryan is a similar story.

As for Flaherty — KEES+ likes his changeup, curveball, cutter, and 4-seam all significantly more than Stuff+. In the instance of his curveball and changeup, KEES+ is on to something. As for his cutter and fastball, far less so. Finally, the less said about Rich Hill on this list the better. If he has a comeback year in 2024, I’ll take a victory lap. Otherwise we will never discuss this again.

The 10 Pitchers with the largest difference in Stuff+ percentile rank and KEES+ percentile rank

Here’s the list where Stuff+ likes these guys significantly more than my model. Stuff+ likes Cease’s 4-seamer and slider a lot more than KEES+. The 4-seamer was not great last year, but the slider was. This is another instance where these wipeout swing-and-miss pitches seem to throw KEES+ for a loop. Musgrove’s curveball and slider appear to be the same, but interestingly his (high-velocity) four seam is preferred by KEES+ more than Stuff+. Snell I’ve already discussed. Olson’s another slider and curveball combination that Stuff+ likes. Darvish has his curveball and fastball. For Bradish, it’s the curveball and the slider again (though KEES+ likes both, just not as much as Stuff+).

The final takeaway? Stuff+ loves these spin monsters, guys who have swing-and-miss breaking balls that I’d venture to guess they are often throwing out of the zone. The Hollywood slider, the Wall Street curve. In contrast, KEES+ really likes the fastball-changeup pitch mix. The working man’s arsenal. The people’s pitchers, if you will. Both have their merits, and I’m curious to see how a model using KEES+’s changeups and sinkers paired with Stuff+’s 4-seamers and breaking balls might perform.

Conclusion

My main takeaways from this whole exercise are as follows:

Stuff+ performs better than KEES+ in small-to-medium samples
KEES+ eclipses Stuff+ when it comes to explaining wOBA variation somewhere around 500 pitches thrown for most pitch types
KEES+ seems to do better than Stuff+ on pitches that are meant to induce weak contact (sinkers, cutters), whereas Stuff+ appears superior for pitches that are meant to induce whiffs (4-seamers, curveballs).
KEES+ does not outperform Stuff+ when it comes to describing the variance in an individual pitcher’s K rate, but it does outperform Stuff+ when it comes to predicting a pitcher’s ERA in the ensuing season. The same is true of HENDRIK+ when compared with Pitching+, both of which are pitch location and count sensitive metrics.

I hope you had as much fun reading this as I had putting it together. Eno Sarris’ Stuff+ model, as well as FanGraphs’ PitchingBot, Chase Copperfield’s C Stuff, and Thomas Nestico’s tjStuff+ were all major inspirations for this project. I don’t claim to have made the stuff model to end all stuff models, but I do hope this exercise at the very least begins to ask key questions surrounding the impact of pitch tunneling on pitcher stuff efficacy. If you liked this piece, feel free to reach out to me on Twitter (@HemmenKees) and please go nosing around in the data (linked at the top of this piece) for interesting tidbits. My DMs are always open and I’m happy to answer questions.

I leave you with this, the 10 best pitches thrown more than 100 times in 2023 by pitchers still employed by the Boston Red Sox: