A while back I started looking at similarity scores between pitches thrown by different pitchers, but I never really followed up.  Since they came up in a thread over at BtB, I thought it would be a good idea to revisit them now.  My basic methodology is the same one Josh Kalk used for pitchers, minus the percentage-thrown term.  I also ran a second set of scores that adds whiff rate and GB% to the three “physical” traits.

For my initial cut I only took pitches with over 150 instances in the database, in the hope that small samples wouldn’t skew the whiff and GB% distributions.  In other words, if Joe Pitcher only threw 100 curveballs over the last 2 years, his curveball was not included for comparison.  I’m not sure where I want to set the cutoff, or whether I want one at all, but for this cut it was 150 instances.  The primary drawback is that a cutoff that high eliminates a lot of the pitches where a comparison would be most useful: those from pitchers who are relatively new to the big leagues.  In retrospect I think I want to lower that bar substantially for “physical”-only scores and keep it slightly higher when results (whiff rate and GB%) are included.
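A minimal sketch of that cutoff, assuming the pitch data is a list of records with hypothetical `pitcher` and `pitch_type` fields (the real database layout isn't shown here, so those names are just for illustration):

```python
from collections import Counter

MIN_PITCHES = 150  # cutoff used for this first cut


def eligible_pitches(pitch_log, min_count=MIN_PITCHES):
    """Return the (pitcher, pitch_type) pairs seen more than min_count times.

    pitch_log is assumed to be an iterable of dicts with 'pitcher' and
    'pitch_type' keys (hypothetical field names).
    """
    counts = Counter((p["pitcher"], p["pitch_type"]) for p in pitch_log)
    return {key for key, n in counts.items() if n > min_count}


# Made-up example: Joe Pitcher's 100 curveballs fall below the cutoff.
log = (
    [{"pitcher": "Some Starter", "pitch_type": "FF"}] * 151
    + [{"pitcher": "Joe Pitcher", "pitch_type": "CU"}] * 100
)
kept = eligible_pitches(log)
```

Lowering `min_count` for the physical-only scores would then be a one-argument change.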

Anyway, on to the results!  I only ran the numbers on fastballs (both 2- and 4-seam), since I wanted to get a feel for how the methodology would work and what kind of numbers to expect.  The first table shows the most similar fastballs based solely on their physical traits (movement and velocity).

| Pitcher 1 | Pitcher 2 | Score |
|---|---|---|
| Joakim Soria | T.J. Beam | 0.008822813 |
| Jim Johnson | Jorge Julio | 0.016173611 |
| Armando Galarraga | Trevor Cahill | 0.029398104 |
| Josh Geer | Dan Giese | 0.029431471 |
| Tommy Hunter | Brendan Donnelly | 0.030507301 |
| Brandon Lyon | Jason Berken | 0.030858617 |
| Dan Wheeler | Kris Benson | 0.030926282 |
| Brandon McCarthy | James Parr | 0.031952073 |
| Jesse Crain | Andrew Bailey | 0.03290595 |
| Steve Trachsel | Keith Foulke | 0.033642835 |

So what does the number in the score column represent? It is the sum, across components, of the difference between the two pitches’ percentiles (kept as decimals, so 90th percentile = 0.9), which means a lower score indicates a closer match. Clearly that number doesn’t look pretty, and I’m racking my brain to come up with a better way to present it. Any thoughts would be appreciated.
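To make the score concrete, here is a rough sketch of the calculation. Two things in it are assumptions rather than the exact method used above: the percentile is taken as the fraction of pitches at or below a value, and the "difference" is taken as an absolute difference. The numbers are made up for illustration.

```python
def percentile_ranks(values):
    """Percentile of each value as a decimal in (0, 1].

    Assumption: percentile = share of values less than or equal to it.
    """
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def similarity_score(pcts_a, pcts_b):
    """Sum of absolute percentile differences; lower means more similar."""
    return sum(abs(a - b) for a, b in zip(pcts_a, pcts_b))


# Made-up fastball velocities for four pitchers.
velo_pcts = percentile_ranks([88.0, 90.0, 92.0, 94.0])
# velo_pcts is [0.25, 0.5, 0.75, 1.0]

# Made-up percentiles for the three physical components of two fastballs.
pitch_a = [0.90, 0.50, 0.25]
pitch_b = [0.85, 0.55, 0.25]
score = similarity_score(pitch_a, pitch_b)  # roughly 0.10
```

In practice each component would be ranked across every eligible fastball first, and the per-pitcher percentiles then fed into `similarity_score`.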

Moving on, the next table includes whiff and GB% to go along with the physical traits.

| Pitcher 1 | Pitcher 2 | Score |
|---|---|---|
| Miguel Batista | Luis Mendoza | 0.088735311 |
| Joe Blanton | Leo Rosales | 0.105227103 |
| Sean Green | Joe Smith | 0.106274147 |
| Kyle Lohse | Adam Wainwright | 0.107327243 |
| Chris Sampson | Shawn Camp | 0.108540848 |
| Braden Looper | Carlos Silva | 0.111769335 |
| Max Scherzer | Josh Roenicke | 0.118732513 |
| Francisco Cordero | Chris Resop | 0.122583619 |
| Leo Nunez | Jorge Julio | 0.127953011 |
| Pedro Martinez | Matt Herges | 0.136471458 |

For this set of scores all 5 components are equally weighted, which may or may not be valid. I’d like to put a little more thought into whether to weight them differently. Anyway, that’s the first cut at getting some of this information out there. The to-do list with these is long. I’d like to re-run the fastball physical numbers with a broader net, run all the other pitch types, and look at some of the top comps for various pitches that have good reputations/results (e.g. Mariano’s cutter, Brandon Webb’s sinker, Adam Wainwright’s curve).
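On the weighting question, one possible way to set it up is sketched below. This is only an illustration: the weighted sum is an assumed form, and the weights and percentiles are invented, not values I have settled on.

```python
def weighted_score(pcts_a, pcts_b, weights=None):
    """Weighted sum of absolute percentile differences.

    With weights=None every component counts equally (weight 1.0),
    which reduces to the plain unweighted sum.
    """
    if weights is None:
        weights = [1.0] * len(pcts_a)
    if not (len(pcts_a) == len(pcts_b) == len(weights)):
        raise ValueError("component and weight lists must be the same length")
    return sum(w * abs(a - b) for w, a, b in zip(weights, pcts_a, pcts_b))


# Made-up percentiles: three physical components, then whiff rate and GB%.
pitch_a = [0.90, 0.50, 0.25, 0.60, 0.40]
pitch_b = [0.85, 0.55, 0.25, 0.70, 0.20]

equal = weighted_score(pitch_a, pitch_b)
# Hypothetical alternative: count the two results components half as much.
physical_heavy = weighted_score(pitch_a, pitch_b, [1.0, 1.0, 1.0, 0.5, 0.5])
```

Changing the weights also changes the worst possible score (the sum of the weights), which matters if the score is later rescaled.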

Steve Sommer

Simulation analyst by day, father and baseball nerd by night

4 Responses to “Similarity Scores”

  1. Niiice! Thanks, man. I love this stuff.

    • No problem. When I get done refining the methodology and get a more final version I’ll make the data available if you want to use it to do some Tigers stuff.

  2. Hey cool. Why not just show the similarity scores like they do on B-R? Higher = better. On the other hand, it would be nice if they “meant” something. Could you express it like the percentage of how similar they are? I’m not sure how you would do that, but it would be more intuitive.

    • Yeah, I think I’ll try and change it to a higher = better scale. The more I think about it, it shouldn’t be too hard. For the physical, the worst you could be is basically 3, so I think I’ll do (3-score)/3 or something like that…
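A quick sketch of that rescaling, treating the worst possible score as the number of components (3 for the physical-only scores; using 5 for the scores with whiff rate and GB% is an assumption that follows the same logic):

```python
def rescaled_similarity(score, n_components=3):
    """Map a lower-is-better score onto a 0-to-1 scale where higher = better.

    Assumes the worst possible raw score is n_components, i.e. a full
    percentile gap of 1.0 on every component.
    """
    return (n_components - score) / n_components


# Soria / Beam from the physical-only table above.
soria_beam = rescaled_similarity(0.008822813)  # about 0.997

# Batista / Mendoza from the five-component table, assuming a worst case of 5.
batista_mendoza = rescaled_similarity(0.088735311, n_components=5)  # about 0.982
```

Multiplying by 100 would give the "percent similar" style number suggested above.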

