Similarity Scores for Pitches

Ethan Moore, Feb 10

Recently, I learned about the concept of using Euclidean distance between one observation and every other observation (after scaling) to obtain a similarity score.

I decided to try that out with Cal Poly’s Rapsodo data.

First, I decided on which variables to compare between pitches.

I settled on pitch type, pitch speed, spin rate, spin efficiency, x-movement, and y-movement.

I don’t love using pitch location because it isn’t really an inherent characteristic of a pitch and is more of an effect than a cause.

Next, I took out any pitch that was missing data for one or more of those variables.

Rapsodo does this quirky thing where it doesn’t track any pitch data unless the pitch is in the strike zone or very close to it, so there were quite a few pitches from our bullpen sessions that I needed to filter out here, leaving me with just under 2,000 pitches.
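As a rough sketch of this filtering step (not the original code), here is one way to drop incomplete rows with NumPy. The column layout and the values are made up for illustration, and `NaN` is assumed to mark a value Rapsodo did not record.

```python
import numpy as np

# Hypothetical columns: speed (mph), spin rate (rpm), spin efficiency (%),
# x-movement (in), y-movement (in). All values are invented.
pitches = np.array([
    [88.1, 2150.0, 92.0, 8.3, 15.1],
    [np.nan, np.nan, np.nan, np.nan, np.nan],  # pitch that was not tracked
    [79.4, 2480.0, np.nan, -4.2, -6.0],        # one variable missing
    [87.6, 2120.0, 90.5, 7.9, 14.8],
])

# True for rows where every variable is present.
complete = ~np.isnan(pitches).any(axis=1)

# Keep only the complete rows.
tracked = pitches[complete]
print(tracked.shape)  # (2, 5)
```

A row is dropped if even one variable is missing, since a distance cannot be computed from an incomplete set of coordinates.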

From here, I converted pitch type into dummy variables, as the pairwise distance calculation I want to perform only operates on quantitative data.
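A minimal sketch of dummy coding, assuming one 0/1 indicator column per pitch type (the labels below are invented, and the original encoding may have differed):

```python
import numpy as np

pitch_types = ["Fastball", "Slider", "Fastball", "Curveball"]  # made-up labels

# One column per distinct pitch type, sorted for a stable column order.
categories = sorted(set(pitch_types))

# Each row gets a 1.0 in the column matching its pitch type, 0.0 elsewhere.
dummies = np.array(
    [[1.0 if p == c else 0.0 for c in categories] for p in pitch_types]
)

print(categories)  # ['Curveball', 'Fastball', 'Slider']
print(dummies)
```

With this encoding, two pitches of the same type contribute zero distance on these columns, while two pitches of different types are pushed apart.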

At this point, I decided to normalize the data.

This is necessary because a difference of one unit of pitch speed (1 mph) is much more important to us than a difference of one unit of spin rate (1 rpm).

Normalization helps the model only see differences within a variable, not between variables that may be on different scales.
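The post does not say which normalization was used. As one common choice, the sketch below assumes z-score standardization (subtract each column's mean, divide by its standard deviation) on two made-up columns:

```python
import numpy as np

# Hypothetical columns: pitch speed (mph) and spin rate (rpm).
X = np.array([
    [88.0, 2100.0],
    [90.0, 2300.0],
    [86.0, 2200.0],
    [92.0, 2400.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# After scaling, every column has mean 0 and standard deviation 1.
# Note: a constant column would have std 0 and would need special handling.
Z = (X - mean) / std

print(Z.round(3))
```

After this step, a one-unit difference in any column means "one standard deviation of that variable", so mph and rpm contribute on the same footing.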

Example

After getting the pairwise distances for the normalized data, I have a matrix where each column is a pitch and each cell in that column is the similarity score with the pitch in the corresponding row.

For example, the 6th pitch and the 2nd pitch have a similarity score of 0.002 (closer to 0 means more similar, so these two pitches were very similar).

Note: I changed all diagonal values from 0 to infinity so they didn’t appear when I searched for minimums in each column.

This is a very small version of the matrix as a whole, but it shows how we can now find the most similar pitch (or least similar pitch) to any given pitch in the data set and rank each pitch in between from most to least similar.
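The distance matrix, the infinite diagonal, and the per-column search can be sketched as follows. This is an illustration on four invented, already-scaled pitches, not the original code.

```python
import numpy as np

# Hypothetical scaled data: 4 pitches, 2 variables each.
Z = np.array([
    [1.0, 0.5],
    [-1.2, 0.3],
    [1.1, 0.4],
    [-0.9, -1.2],
])

# Pairwise Euclidean distances via broadcasting:
# D[i, j] = sqrt(sum_k (Z[i, k] - Z[j, k]) ** 2)
diff = Z[:, None, :] - Z[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Every pitch is at distance 0 from itself, so mask the diagonal
# before searching for minimums.
np.fill_diagonal(D, np.inf)

# For each column (pitch), the row index of its most similar pitch.
nearest = D.argmin(axis=0)
print(nearest)  # [2 3 0 1]

# Rank every other pitch from most to least similar to pitch 0.
# The masked self-distance (inf) sorts last.
ranking = np.argsort(D[:, 0])
print(ranking)  # [2 1 3 0]
```

The matrix is symmetric, so searching columns or rows gives the same answer.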

Let’s take a look at the 2nd and 6th pitches to see if they were actually similar:

[Figure caption: “perfect censorship”]

Yep, they really were.

Same pitcher, same pitch type, and everything else was very similar.

(Note: my model weights each of these variables equally.)

Comparing Averages

So this model is cool, but it’s not super informative.

Most of the time, two pitches we identify as very similar are going to be the same type of pitch from the same pitcher, like above.

But what if we want to compare pitch types between pitchers?

Well, then we’re dealing with averages.

Grouping the data by pitcher name and pitch type and taking the means of the remaining variables gives me a new data frame on which to repeat this process.

Following most of the same steps, I get another matrix, but one where each pitch type for each pitcher is only represented once.
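A small sketch of the grouping step, using plain Python plus NumPy. The pitcher names, pitch types, and values are placeholders, and only two variables are shown.

```python
from collections import defaultdict

import numpy as np

# Hypothetical rows: (pitcher, pitch type, speed in mph, spin rate in rpm).
rows = [
    ("Pitcher A", "Slider", 80.0, 2400.0),
    ("Pitcher A", "Slider", 82.0, 2500.0),
    ("Pitcher A", "Fastball", 90.0, 2200.0),
    ("Pitcher B", "Slider", 81.5, 2440.0),
]

# Collect the numeric values for each (pitcher, pitch type) pair.
groups = defaultdict(list)
for pitcher, pitch_type, *values in rows:
    groups[(pitcher, pitch_type)].append(values)

# One averaged row per pitcher and pitch type, in sorted label order.
labels = sorted(groups)
means = np.array([np.mean(groups[key], axis=0) for key in labels])

for label, row in zip(labels, means):
    print(label, row)
```

The `means` array then goes through the same scaling and pairwise-distance steps as the individual pitches, with `labels` naming each row and column of the smaller matrix.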

Visualizing this new, much smaller matrix gives us this heatmap:

[Figure: heatmap of the averaged similarity matrix. Caption: “sorry this looks like mike piazza’s back in 2004”]

Here, the darker the cell, the more similar it is to the pitch type on the corresponding axis.

A few nights ago, I tweeted about two Cal Poly pitchers who have pretty similar sliders, on average.

This is basically the extent of what my model can do.

I was able to sort all of the similarity scores in the entire second matrix to find the two most similar pitcher-pitch type combinations (a four-seamer and a two-seamer from the same pitcher), but learned that Rapsodo may not be super accurate with its calculation of two-seam movement, making two-seamers seem more similar to four-seamers than they really are.

When sorting for maximum distances, I found the pitcher-pitch type combination that is most unique on the team:

[Figure caption: “hammer”]

This makes sense, as no other lefty on the team throws a 12–6 curveball.
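Sorting the whole matrix can be sketched like this. The labels and distances are invented, and "most unique" is interpreted here as the entry whose nearest neighbor is farthest away, which is an assumption; the post does not define it, and other readings (such as largest average distance) are possible.

```python
import numpy as np

# Hypothetical labels and a symmetric distance matrix between them.
labels = ["A four-seam", "A two-seam", "B slider", "C curveball"]
D = np.array([
    [0.0, 0.3, 2.0, 4.0],
    [0.3, 0.0, 1.8, 4.2],
    [2.0, 1.8, 0.0, 3.5],
    [4.0, 4.2, 3.5, 0.0],
])

# Indices above the diagonal visit each unordered pair exactly once.
i, j = np.triu_indices(len(labels), k=1)
order = np.argsort(D[i, j])  # ascending: most similar pair first

closest_pair = (labels[i[order[0]]], labels[j[order[0]]])
print(closest_pair)  # ('A four-seam', 'A two-seam')

# Assumed definition of "most unique": largest distance to the nearest neighbor.
masked = D.copy()
np.fill_diagonal(masked, np.inf)
most_unique = labels[int(masked.min(axis=1).argmax())]
print(most_unique)  # C curveball
```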

Going Forward

I looked into using this technique to compare each of our pitchers and their pitch types to those of Major Leaguers using Baseball Savant data. As is well-documented, though, Rapsodo reports movement differently than Statcast does, and I don’t currently have the time to convert one into the other, so this will have to wait.

I do think it would be helpful to know which MLBers have pitches similar to those on our staff so we can study them and see how they get outs.

So for now these results may be more interesting than actually beneficial, but that may not always be the case.