KEES+ Part 2: Introducing WhiZ and ChoZo

Yes, that’s actually what I’ve named two new sabermetrics.

Kees van Hemmen
15 min read · Apr 1, 2024

WhiZ, ChoZo, and kStuff data from the 2023 season: https://docs.google.com/spreadsheets/d/12JGo6i-b8efOznDZCiA_UDXY57QGKJhhwhz-0qQ9z88/edit?usp=sharing

A few weeks ago, I rolled out my brand new, eponymous “stuff model” KEES+. You can read about KEES+ in detail here, but for the purposes of this article you need only know the following: KEES+ is a stuff model that uses data not only about individual pitches, but also about how each pitch relates to the rest of a pitcher’s arsenal. The idea was to encode insights about “pitch tunneling” into a modern stuff model, in hopes of giving it greater predictive power than other stuff models. The model I wound up creating does appear to predict a given pitcher’s future success at least as well as, if not marginally better than, other publicly available stuff models.

However, though KEES+ performed well in the aggregate, on a pitch-by-pitch basis it appeared to have quite a few blind spots. For starters, it seemed to undervalue high-velocity fastballs and high-spin breaking balls. It also failed to demonstrate a better understanding of what makes changeups effective than other stuff models do, which was a major disappointment, as that was one of my initial hopes for the model. With these shortcomings in mind, among others, I set out this past week to remedy them and produce a superior model with fewer weak spots that still leverages the advantages KEES+ has over other models.

New Component Metric #1: WhiZ

As mentioned above, one of the clear shortcomings of KEES+ was that it seemed to under-weight extremely high velocity. As a consequence, some of the most dominant fastballs in the sport graded out as relatively average pitches, and Eno Sarris’ Stuff+ model outperformed KEES+ (albeit by a relatively small margin) on 4-seam fastballs. At first, I was not sure how to fix this. However, two things pushed me in the direction of creating the metric that would ultimately become WhiZ:

  1. During my comparison of the performance of Stuff+ and KEES+, I increasingly suspected that Stuff+ was trained not on Delta Expected Run Value (the change in the number of runs you’d expect the hitting team to score in an inning as a consequence of a pitch) but rather on Whiffs (where 1 would be a pitch that generated a whiff, and 0 was a pitch that did not). This was because, overwhelmingly, Stuff+ had much more success describing variation in a pitcher’s K% and whiff rates than KEES+, whereas KEES+ performed better on pitches typically known for producing soft contact.
  2. At some point in the last two weeks, I heard someone, on some podcast (that I believe was the SoxProspects Podcast, but I admit that after a short search I was unable to confirm this), say that the Tampa Bay Rays have basically started telling their pitchers in the minor leagues that if they can’t throw their stuff straight down the middle of the plate then they probably won’t have success at the Major League level with the organization. This article from Yahoo Sports seems to corroborate that claim.

After having been haunted by this second claim (which very well may be a complete fabrication of my subconscious) for a few weeks, I decided to put it to the test. I took the data that I’d trained KEES+ on and filtered it down to only pitches that were what I called ‘deep strikes’: strikes located more than the diameter of one baseball inside the edge of the strike zone. The idea was this: I wanted to look only at strikes, but I didn’t want to include pitches that were benefitting from perfect placement in the zone. Lots of pitches located on the very edges of the plate induce defensive swings when the hitter recognizes too late that the pitch is a strike, and I wanted to minimize those.

The next thing I did was to limit the dataset again — this time to only pitches that induced a swing. I didn’t want this new model to consider called strikes — I just wanted it to answer the following question: If a given pitcher throws this pitch straight down the hitter’s throat, what are the chances that the hitter whiffs completely?

The result was WhiZ: Whiff in Zone [Rating]. I’m going to come back to this, but for now suffice it to say I am super happy with what WhiZ wound up becoming.
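The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual WhiZ pipeline: the field names (`plate_x`, `plate_z`, `sz_bot`, `sz_top`) are borrowed from public Statcast column names, the 2.9-inch ball diameter is an approximation, and measuring the margin from the center of the ball is an assumption on my part.

```python
from dataclasses import dataclass

# Assumed constants, in feet (approximations, not taken from the article).
BALL_DIAMETER_FT = 2.9 / 12      # a baseball is roughly 2.9 inches across
PLATE_HALF_WIDTH_FT = 8.5 / 12   # home plate is 17 inches wide


@dataclass
class Pitch:
    plate_x: float  # horizontal location at the plate, feet from center
    plate_z: float  # height at the plate, feet
    sz_bot: float   # bottom of this hitter's strike zone, feet
    sz_top: float   # top of this hitter's strike zone, feet
    swing: bool     # did the hitter swing?
    whiff: bool     # did the hitter swing and miss?


def is_deep_strike(p: Pitch) -> bool:
    """True if the pitch is more than one ball diameter inside every zone edge."""
    margin = BALL_DIAMETER_FT
    inside_x = abs(p.plate_x) < PLATE_HALF_WIDTH_FT - margin
    inside_z = p.sz_bot + margin < p.plate_z < p.sz_top - margin
    return inside_x and inside_z


def whiz_training_rows(pitches):
    """Keep only swings at deep strikes; the label is 1 for a whiff, else 0."""
    return [(p, int(p.whiff)) for p in pitches if is_deep_strike(p) and p.swing]
```

Called strikes drop out because they have no swing, and edge pitches drop out because they fail the margin check, which leaves the "down the hitter's throat" swings the question above is about.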

As you might expect, WhiZ hates sinkers (they don’t induce whiffs) and curveballs (they generally don’t perform well in the zone if you cut out called strikes, which WhiZ has). However, it loves splitters, sweepers, and 4-seamers. These last 3 are all super in-vogue and known for inducing lots of swing and miss, so this lines up exactly with what you’d expect.

New Component Metric #2: ChoZo

WhiZ produced lots of encouraging results. I’m spoiling some of what I’ll cover a bit later, but the broad strokes are as follows: WhiZ performed really well in predicting wOBA against fastballs and changeups (even when including those that were thrown outside the zone). WhiZ also, across the board, predicted whiff rate on a variety of pitches better than Stuff+ did. It even captured some of the phenomena I felt KEES+ was missing when it came to high spin breaking balls. However, it still wasn’t doing what I wanted it to on this last count: some of the super spinny, 70-grade curveballs and sliders that I highlighted KEES+ as struggling with last week were still grading out poorly by WhiZ. I was unsatisfied. I would need another metric: ChoZo.

The idea behind ChoZo was simple: if a model focused entirely on generating in-zone whiffs is struggling to capture what makes high-spin, high-whiff breaking balls successful, then there can only be one explanation: these pitches are finding their success when placed outside of the zone. And so ChoZo was born: Chase Outside of Zone [Rating].

This time, rather than training on “deep strikes” I trained ChoZo on “clear balls”: pitches that were located more than the full diameter of a baseball outside the strike zone. The logic here was the same: I didn’t want to include pitches on the fringes of the strike zone that were inducing weak, defensive swings. The target variable this time was not whiffs, though: it was swings. If a pitch received a swing, regardless of whether the swing led to contact, it was given a 1. Otherwise it got a 0.
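As a rough sketch of that selection and labeling step (again an illustration under assumptions, not the actual ChoZo code): the field names follow public Statcast column names, the ball diameter is approximated as 2.9 inches, and the margin is measured from the center of the ball.

```python
from dataclasses import dataclass

# Assumed constants, in feet (approximations, not taken from the article).
BALL_DIAMETER_FT = 2.9 / 12
PLATE_HALF_WIDTH_FT = 8.5 / 12


@dataclass
class Pitch:
    plate_x: float  # horizontal location at the plate, feet from center
    plate_z: float  # height at the plate, feet
    sz_bot: float   # bottom of this hitter's strike zone, feet
    sz_top: float   # top of this hitter's strike zone, feet
    swing: bool     # did the hitter swing?


def is_clear_ball(p: Pitch) -> bool:
    """True if the pitch misses the zone by more than one ball diameter."""
    margin = BALL_DIAMETER_FT
    off_side = abs(p.plate_x) > PLATE_HALF_WIDTH_FT + margin
    off_height = p.plate_z < p.sz_bot - margin or p.plate_z > p.sz_top + margin
    return off_side or off_height


def chozo_training_rows(pitches):
    """Keep only clear balls; the label is 1 for any swing, else 0."""
    return [(p, int(p.swing)) for p in pitches if is_clear_ball(p)]
```

Unlike the WhiZ filter, takes stay in the dataset here, since the model needs the non-swings to learn what does and doesn't get chased.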

The result? Once again, satisfactory. Let’s look at what I mean.

ChoZo does not like hard stuff. No surprise there — fastballs are generally meant for throwing strikes, and ChoZo knows nothing about those. What ChoZo does like? Sweepers, sliders, splitters, and changeups. Basically, anything that breaks. You can see it can’t make up its mind about curveballs — I suspect this is because of the two types of curveball thesis from this article that I referenced in my last piece. Some curveballs are meant for inducing chase pitches, and others are for getting called strikes. The two of them don’t do each other’s jobs very well.

Comparing the Models: KEES+, Stuff+, WhiZ, and ChoZo

We’re going to start by going pitch-by-pitch here. First, 4-seam fastballs:

[Fair warning: the rest of this piece is going to have lots of very, very busy graphs. Be prepared.]

Four-Seamers

WhiZ stands out for modeling 4-seamer performance

We’re off to a great start here. WhiZ is outperforming KEES+ and ChoZo by a wide margin, and beating out Stuff+ by a slim margin. As we increase the minimum number of times a pitcher must have thrown their four-seamer to qualify, that margin between WhiZ and Stuff+ increases. This makes sense — WhiZ is building in all of the same tunneling effects as KEES+, and so it’s likely to perform better with more data. That said, it’s still getting the best of both worlds. Take a look at how WhiZ does when modeling whiff rates:

Not only is WhiZ outperforming all other metrics in explaining the variance in wOBA against — it’s blowing the competition out of the water when it comes to explaining whiff rates. This is what we were hoping to improve on from KEES+, and it looks like we’ve accomplished that.
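For readers who want to reproduce this kind of comparison, here is a minimal sketch of one way to compute "variance explained" at a given minimum pitch threshold. It assumes a simple one-variable linear fit, where R² is just the squared Pearson correlation; the row keys (`n_pitches`, `metric`, `outcome`) are hypothetical names, and the article does not specify exactly how its R² values were computed.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a one-variable linear fit."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("metric and outcome must both vary")
    return sxy * sxy / (sxx * syy)


def r_squared_at_threshold(rows, min_pitches):
    """rows: one dict per pitcher's pitch, with hypothetical keys
    'n_pitches', 'metric' (e.g. a WhiZ grade) and 'outcome' (e.g. wOBA against)."""
    kept = [r for r in rows if r["n_pitches"] >= min_pitches]
    return r_squared([r["metric"] for r in kept], [r["outcome"] for r in kept])
```

Raising `min_pitches` trades data points for less noise per data point, which is why the gaps between metrics can shift as the threshold moves.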

Advantage: WhiZ

Sinkers

KEES+ still leads the pack for sinkers

KEES+ is still modeling sinkers best, it appears. This was the case when I compared Stuff+ and KEES+ last week as well. It appears that KEES+’s true inferential edge comes from its ability to model soft contact. In the future I may look to replace KEES+ with a model focused entirely on soft contact, but for now it serves its role here.

It’s worth noting here that ChoZo actually outperforms KEES+ and Stuff+ at higher minimum pitch thresholds. This is interesting — I assume it’s due to a lot of sinkers dropping out of the zone and forcing weak contact. That said, given that the margins are small in the instances where ChoZo outperforms KEES+, and that I’m doubtful a sinker model focused entirely on inducing a hitter to chase is a logically sound one, I think KEES+ takes the edge here.

Advantage: KEES+

Changeups

WhiZ appears to be the standout when it comes to modeling Changeup efficacy

I won’t lie, I’m really happy with how this worked out. No matter what I do with minimum pitch thresholds, WhiZ is clearly modeling changeup efficacy better than either KEES+ or Stuff+. At higher pitch thresholds, ChoZo does compete with WhiZ — and I don’t yet fully understand why that keeps happening — but WhiZ is the dominant metric here in general. This seems to follow logically: changeups that induce whiffs in the zone are generally going to be changeups that pitchers are comfortable throwing in the zone more often. That means more called strikes. I also suspect that changeups that induce more whiffs also likely induce more soft contact (though the correlation may be weak). Regardless — the results are clear. WhiZ looks to be a much more effective means for predicting wOBA against changeups than the other models here.

Advantage: WhiZ

Sliders

ChoZo appears to model slider efficacy better than other models

We’ve made it to breaking balls — and, would you look at that, for the first time ChoZo seems to shine. ChoZo explains 33% of the variance in wOBA against sliders for sliders thrown more than 250 times in 2023. This is almost twice what Stuff+ and KEES+ manage (18%). It’s worth noting that this gap closes as the minimum number of pitches thrown increases, but ChoZo is still the clear standout. It appears my suspicion was correct: putting an emphasis on chase pitches does make for a superior breaking ball metric.

Advantage: ChoZo

Curveballs

ChoZo is once again the clear standout

No news here. ChoZo is again the model having the most success in predicting wOBA against a breaking ball: this time it’s curveballs. As you increase the minimum pitch threshold here, both WhiZ and Stuff+ close the gap a bit, but never with statistical significance. Advantage ChoZo (yes I laughed while writing that).

Advantage: ChoZo

Cutters

For cutters, we do not have a clear victor. When we look at a smaller minimum pitch threshold (more data points, fewer pitches thrown per data point) this is what we get:

Here, it looks like ChoZo is the clear standout. However, when we increase our threshold for minimum pitches (fewer data points, more pitches thrown per data point), look what happens:

WhiZ is suddenly the standout. A bit odd, but nothing we can’t solve. What if we cut out some noise by looking at xwOBA (estimated wOBA based on quality of contact)?

We encounter the same problem. ChoZo stands out in the larger sample, but WhiZ and, in this case, Eno Sarris’ Stuff+ both outperform ChoZo when we increase our threshold, while maintaining statistical significance. Very odd. Sadly, I don’t think we’re going to get a clear answer here. More data might bear out the difference, but for now we’re going to have to learn to sit with the mystery. I suspect the issue here is as follows: cutters are neither ‘swing-and-miss’ pitches nor ‘soft-contact’ pitches. Different pitchers use them differently. They’re sort of tweener pitches: a blanket group to describe hard sliders and cut fastballs that we can’t otherwise classify. As a result, each of these models is capturing different aspects of what makes different cutters effective. The end product? A split decision.

Advantage: Split Decision

Splitters

In a small sample, ChoZo seems to blow everyone out of the water again. This follows logically (splitters are definitely chase-oriented pitches) and holds up across sample size.

Advantage: ChoZo

Tying It All Together

So, what does this all mean? It’s certainly nice to have made a few models that describe different pitch outcomes well. That said, we want to integrate all of this into one model. How do we do that?

We build another model of course! This time, a super simple model: for each pitch type, all I did was run a linear regression of the three models (KEES+, WhiZ, and ChoZo) against the wOBA a pitch surrendered. As a result, for each pitch, this new model (kStuff) is basically just a different linear combination of the three models. The end result? A model that should tie all of our insights together, nice and neat:

[The training set here was the 2019, 2021, and 2022 seasons.]
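The per-pitch-type linear combination described above can be sketched like this. It is a small, dependency-free illustration using ordinary least squares with an intercept; the dictionary keys and function names are hypothetical, and the real kStuff fit may well have been done with a library regression rather than hand-rolled normal equations.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, targets):
    """Least-squares fit; returns [intercept, b_kees, b_whiz, b_chozo]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def fit_kstuff(pitches):
    """Fit one regression per pitch type.

    pitches: dicts with hypothetical keys 'pitch_type', 'kees', 'whiz',
    'chozo', 'woba'. Each pitch type needs at least four rows whose
    grades are not perfectly collinear, or the solve will fail.
    """
    by_type = {}
    for p in pitches:
        by_type.setdefault(p["pitch_type"], []).append(p)
    return {
        pitch_type: fit_ols(
            [[p["kees"], p["whiz"], p["chozo"]] for p in group],
            [p["woba"] for p in group],
        )
        for pitch_type, group in by_type.items()
    }


def predict(coefs, kees, whiz, chozo):
    """Predicted wOBA against for one pitch from its three component grades."""
    return coefs[0] + coefs[1] * kees + coefs[2] * whiz + coefs[3] * chozo
```

Fitting separately by pitch type is what lets the combination lean on different components for, say, sliders versus sinkers.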

As you can see, our new model (which I’m calling kStuff, because I think we’ve had enough ridiculous model names for one day) explains about 38% of the variance in wOBA for pitchers in 2023. That’s better than KEES+ did, but still short of Stuff+:

However, that’s not our final goal. The key value that KEES+ appeared to provide was the following: 2022 KEES+ predicted 2023 ERA better than 2022 Stuff+ predicted 2023 ERA. Can kStuff repeat the same trick?

Yep. The difference between kStuff and KEES+ is small (0.39 vs. 0.38), but both explain much more variance in future ERA than Stuff+ (0.34). Ultimately, this is a disappointing improvement given the amount of work that went in here. A linear combination of the WhiZ, ChoZo, and KEES+ models generally results in KEES+ making up most of the prediction. This is disappointing, confusing, and something I’ve not yet solved. For example, one thing that happened when making kStuff that I still don’t understand: why would a linear model weight KEES+ higher than WhiZ for 4-seam fastballs, when training on wOBA as a target variable, given that WhiZ predicts wOBA against fastballs better than KEES+? I have no idea, but that is what’s happening in this linear combination. Very odd. It is here that I reach out for help: if you’re reading this, and think you know the answer, please shoot me a DM on Twitter.

Nevertheless! While I’m left with the mystery of my misfiring linear model, WhiZ and ChoZo have demonstrated their predictive merit on their own! If you see a bad Stuff+ grade on a given pitch, but its results look good, or it appears filthy on screen, check out ChoZo and WhiZ! They may hold the answers you’re looking for. Basically, what ChoZo and WhiZ are doing is this: they’re simplifying the task of a Stuff model by breaking it down into component parts. Let me explain. There are basically 4 kinds of positive value pitches:

  1. Pitches in the zone that don’t get swung at (called strikes)
  2. Pitches in the zone that get swung at and whiffed on (in zone whiffs)
  3. Pitches out of the zone that get swung at and whiffed on (out of zone whiffs)
  4. Pitches that are swung at and result in poor contact (low quality batted balls)

KEES+, Stuff+, and basically every other Stuff model is trying to account for all 4 of these outcomes. The issue is this: pitches with the qualities that result in called strikes might also get hit very hard when a hitter swings at them. Or maybe pitches that induce out of zone whiffs might get thrown for balls more often. Perhaps pitches that induce in zone whiffs might also get hit hard when a hitter does make contact with them. I can go on and on here. A non-linear model like XGBoost can handle this, to an extent — but it will still get confused, in a sense, because it’s trying to understand multiple phenomena at once. By breaking the problem down into component parts (by modeling in zone whiff and out of zone chase value independently) we can begin to fill in the gaps of our understanding of pitch quality. Hopefully I can find a way to integrate these all into one, cohesive metric soon. But for now, I’ll have to settle for what I’ve accomplished so far.
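To make the decomposition concrete, the four positive outcomes listed above can be written as a small labeling function. This is only an illustration of the idea: the argument names are hypothetical, and what counts as "weak contact" is left to the caller.

```python
from typing import Optional


def positive_outcome(in_zone: bool, swing: bool, contact: bool,
                     weak_contact: bool) -> Optional[str]:
    """Label a pitch with one of the four positive outcomes, or None.

    `weak_contact` is only consulted when the hitter made contact.
    """
    if not swing:
        # A take only helps the pitcher if the pitch was in the zone.
        return "called strike" if in_zone else None
    if not contact:
        return "in-zone whiff" if in_zone else "out-of-zone whiff"
    return "low-quality batted ball" if weak_contact else None
```

A do-it-all Stuff model has to learn all four labels from one target, whereas WhiZ only has to care about the in-zone whiff and ChoZo only about getting a swing on a pitch out of the zone.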

A Quick Case Study on ChoZo and WhiZ

Accepting that kStuff was mostly a failed experiment, let’s go back to ChoZo and WhiZ. I’ve shown you their value: now let’s look at some individual pitchers and pitches.

WhiZ was primarily meant to help better understand changeup and fastball performance in the zone. It seems like we’ve accomplished that, but let’s look deeper for a second. Here are the 20 pitches from the 2023 season with the largest discrepancy between WhiZ and KEES+:

All changeups and fastballs, baby. More specifically, 15 of the 20 pitches identified were wrongly identified as below average pitches by KEES+, but correctly identified as above average by WhiZ. Even better: 10 of the pitches on this list allowed a wOBA against 100 points below the average for their respective pitch type last season. I’m super happy with this. On the other end of things? Here are the top 20 pitches that were rated highly by KEES+ but very low by WhiZ:

All plus sinkers! This makes sense. Sinkers don’t induce whiffs.
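Lists like the two above can be produced by ranking pitches on the gap between two grades. The sketch below assumes both metrics sit on the same "plus" scale so that a raw difference is meaningful; the dictionary keys are hypothetical, and the article does not say whether raw or standardized differences were used.

```python
def largest_gaps(pitches, high, low, n=20):
    """Return the n pitches where metric `high` most exceeds metric `low`.

    pitches: list of dicts holding both grades under the given keys.
    """
    return sorted(pitches, key=lambda p: p[high] - p[low], reverse=True)[:n]
```

Calling `largest_gaps(pitches, "whiz", "kees")` would give the pitches WhiZ likes far more than KEES+ does, and swapping the two key arguments gives the reverse list.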

Now let’s look at ChoZo. We set out to use ChoZo to adjust for KEES+’s seeming inability to identify elite chase pitches. How’d we do?

Pretty well, again! Patrick Corbin’s slider is a thorn in my side, but we’ve captured Blake Snell’s slider, Blake Snell’s curveball, Dylan Cease’s slider and Tyler Glasnow’s curveball here — four pitches I identified in my piece on KEES+ as pitches that KEES+ struggled to identify as the plus pitches that they are. In contrast? Here are the 20 pitches that ChoZo disliked and KEES+ liked:

Sinkers and fastballs! Not chase pitches in either instance — so both make sense! Sweet. I’m over the moon. Mission accomplished — in a sense. Though I haven’t managed to reach my ultimate goal of integrating these models into one model to rule them all, one model to find them —

one model to bring them all, and in the darkness bind them…

Driveline Sauron

… these models do seem to cover some of the gaps in KEES+ that were created by its attempt to be a do-it-all Stuff metric.

Time to wrap up. I’ve linked the data at the top of this article for your browsing pleasure. I hope you find it as interesting as I did! Be sure to tune in next time, when I take a look at incorporating some of the awesome insights from Alex Chamberlain’s recent piece on command into an all-new, revamped pitching model.
