A while back I started looking at similarity scores between pitches for different pitchers, but I never really followed up. Since they came up in a thread over at BtB, I thought it would be a good idea to revisit them now. My basic methodology is the same one Josh Kalk used for pitchers, only with the percentage-thrown term removed. In addition, I also ran a set of scores where, along with the three “physical” traits, I added whiff rate and GB%.
For my initial cut I only took pitches that I had over 150 instances of in the database, in the hope that small samples wouldn’t skew the whiff and GB% distributions. In other words, if Joe Pitcher only threw 100 curveballs over the last 2 years, his curveball was not included for comparison. I’m not sure what I want to make the cutoff, or whether I want one at all, but for this cut it was 150 instances. The primary drawback is that a cutoff that high eliminates a lot of the pitches where a comparison would prove most useful: those from pitchers who are relatively new to the big leagues. In retrospect, I think I want to lower that bar substantially for “physical”-only scores and keep it slightly higher when introducing results (whiff rate and GB%).
Anyway, on to the results! I only ran the numbers on fastballs (both 2- and 4-seam), since I wanted to get a feel for how the methodology was going to work and what kind of numbers to expect. The first table shows the most similar fastballs based solely on their physical traits (movement and velocity).
| Pitcher 1 | Pitcher 2 | Score |
|---|---|---|
| Joakim Soria | T.J. Beam | 0.008822813 |
| Jim Johnson | Jorge Julio | 0.016173611 |
| Armando Galarraga | Trevor Cahill | 0.029398104 |
| Josh Geer | Dan Giese | 0.029431471 |
| Tommy Hunter | Brendan Donnelly | 0.030507301 |
| Brandon Lyon | Jason Berken | 0.030858617 |
| Dan Wheeler | Kris Benson | 0.030926282 |
| Brandon McCarthy | James Parr | 0.031952073 |
| Jesse Crain | Andrew Bailey | 0.03290595 |
| Steve Trachsel | Keith Foulke | 0.033642835 |
So what does the number in the score column represent? It is the sum, across components, of the differences between the two pitches’ percentiles (expressed as decimals, so 90th percentile = 0.9), which means a lower score indicates a closer match. Clearly that number doesn’t look pretty, and I’m racking my brain to come up with a better way to present it… any thoughts would be appreciated.
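To make the calculation concrete, here is a minimal Python sketch of that kind of score. It is not the code behind the tables. The trait names (`velo`, `hmov`, `vmov`), the "fraction of pitches at or below this value" percentile definition, the use of absolute differences, and the four made-up fastballs are all assumptions for illustration.

```python
from bisect import bisect_right


def percentile_rank(values, x):
    """Fraction of values less than or equal to x, as a decimal in [0, 1]."""
    ordered = sorted(values)
    return bisect_right(ordered, x) / len(ordered)


def similarity_score(pitch_a, pitch_b, population, traits):
    """Sum over traits of the absolute percentile difference (lower = more similar)."""
    total = 0.0
    for trait in traits:
        column = [pitch[trait] for pitch in population]
        pct_a = percentile_rank(column, pitch_a[trait])
        pct_b = percentile_rank(column, pitch_b[trait])
        total += abs(pct_a - pct_b)
    return total


# Hypothetical fastballs: velocity (mph), horizontal and vertical movement (inches).
population = [
    {"velo": 88.0, "hmov": -4.0, "vmov": 7.0},
    {"velo": 90.0, "hmov": -6.0, "vmov": 8.0},
    {"velo": 92.0, "hmov": -5.0, "vmov": 9.0},
    {"velo": 94.0, "hmov": -7.0, "vmov": 10.0},
]
traits = ["velo", "hmov", "vmov"]

score = similarity_score(population[0], population[1], population, traits)
print(score)
```

With only three traits the score can never exceed 3, and a pitch compared against itself scores 0.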
Moving on, the next table includes whiff and GB% to go along with the physical traits.
| Pitcher 1 | Pitcher 2 | Score |
|---|---|---|
| Miguel Batista | Luis Mendoza | 0.088735311 |
| Joe Blanton | Leo Rosales | 0.105227103 |
| Sean Green | Joe Smith | 0.106274147 |
| Kyle Lohse | Adam Wainwright | 0.107327243 |
| Chris Sampson | Shawn Camp | 0.108540848 |
| Braden Looper | Carlos Silva | 0.111769335 |
| Max Scherzer | Josh Roenicke | 0.118732513 |
| Francisco Cordero | Chris Resop | 0.122583619 |
| Leo Nunez | Jorge Julio | 0.127953011 |
| Pedro Martinez | Matt Herges | 0.136471458 |
For this set of scores all 5 components are equally weighted, which may or may not be valid. I’d like to put a little more thought into whether I want to weight them differently. Anyway, that’s the first cut at getting some of this information out there. The to-do list with these is long. I’d like to re-run the fastball physical numbers with a broader net, run all the other pitch types, and look at some of the top comps for various pitches that have good reputations/results (e.g. Mariano’s cutter, Brandon Webb’s sinker, Adam Wainwright’s curve).
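On the weighting question, one way it could work is sketched below in Python. This is purely illustrative: the trait names, the percentile values, and the 2-to-1 weighting of physical traits over results are all made up, and rescaling so the weights sum to the number of components is just one possible convention (it keeps the equal-weight case identical to the plain sum).

```python
def weighted_score(pct_a, pct_b, weights):
    """Weighted sum of absolute percentile differences (lower = more similar).

    pct_a, pct_b: dicts mapping trait name -> percentile as a decimal.
    weights: dict mapping trait name -> non-negative weight.
    """
    total_weight = sum(weights.values())
    raw = sum(w * abs(pct_a[t] - pct_b[t]) for t, w in weights.items())
    # Rescale so the weights effectively sum to the number of components.
    return raw * len(weights) / total_weight


# Hypothetical percentiles for two fastballs.
pitch_a = {"velo": 0.9, "hmov": 0.5, "vmov": 0.4, "whiff": 0.7, "gb": 0.2}
pitch_b = {"velo": 0.8, "hmov": 0.5, "vmov": 0.5, "whiff": 0.3, "gb": 0.4}

equal_weights = {"velo": 1, "hmov": 1, "vmov": 1, "whiff": 1, "gb": 1}
physical_heavy = {"velo": 2, "hmov": 2, "vmov": 2, "whiff": 1, "gb": 1}

equal = weighted_score(pitch_a, pitch_b, equal_weights)
weighted = weighted_score(pitch_a, pitch_b, physical_heavy)
print(equal, weighted)
```

In this made-up example most of the gap between the two pitches is in whiff rate and GB%, so emphasizing the physical traits pulls the score down.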
Niiice! Thanks, man. I love this stuff.
No problem. When I get done refining the methodology and get a more final version I’ll make the data available if you want to use it to do some Tigers stuff.
Hey cool. Why not just show the similarity scores like they do on B-R? Higher = better. On the other hand, it would be nice if they “meant” something. Could you express it like the percentage of how similar they are? I’m not sure how you would do that, but it would be more intuitive.
Yeah, I think I’ll try and change it to a higher = better scale. The more I think about it, it shouldn’t be too hard. For the physical, the worst you could be is basically 3, so I think I’ll do (3-score)/3 or something like that…
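A quick Python sketch of that (3-score)/3 idea, applied to the top pair from each table. Treating 5 as the worst possible score for the five-component version is an assumption that extends the same reasoning, and the function name is just for illustration.

```python
def to_similarity(score, n_components):
    """Flip a raw score onto a 0-to-1 scale where higher = more similar.

    Assumes the worst possible raw score equals the number of components,
    since each percentile difference is at most 1.
    """
    return (n_components - score) / n_components


# Top pair from each table above.
physical = to_similarity(0.008822813, 3)   # Soria / Beam, physical traits only
with_results = to_similarity(0.088735311, 5)  # Batista / Mendoza, all five components

print(f"{physical:.1%}", f"{with_results:.1%}")
```

Multiplying by 100 gives something that reads like a percent similarity, which gets at the suggestion above.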