Suppose that we have two players, whom we will call Allen and Bob. Allen and Bob are both right-handed hitters. Allen hits .290 against right-handed pitchers but .340 against lefthanders. Bob hits .290 against right-handed pitchers but .250 against lefties.
From this we attempt to derive a third measurement, which is the player’s platoon differential. Allen’s platoon differential is .050 (.340 minus .290); Bob’s is negative .040 (.250 minus .290). The platoon differential is what we could call a comparison offshoot — a measurement derived from a comparison of other measures.
The first problem with comparison offshoots is that they have the combined instability of all of their components. Every statistic in baseball is to a certain degree a measurement of a skill, to a certain degree a statement about the circumstances, and to a certain degree simply a product of luck. A pitcher goes 20-8; he goes 20-8 to a certain degree because he is a good pitcher, to a certain degree because he pitches for a good team, and to a certain degree because he is lucky (or unlucky). There is luck in everything, and baseball fans are always engaged in a perpetual struggle to figure out what is real and what is just luck.
In the case of any one statistical record, it is impossible to know to what precise extent it reflects luck, but a player usually bats only 100 to 200 times a year against left-handed pitchers. Batting averages in 100 or 200 at-bats involve huge amounts of luck. If a player hits .340 against lefties, is that 20% luck, or 50% luck, or 80% luck? There is no way of knowing, but batting averages in 100-150 at-bats are immensely unstable. Walter Johnson hit .433 one year in about 100 at-bats; the next year he hit .194. Just luck.
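To put a rough number on that instability, here is a minimal sketch that treats every at-bat as an independent trial with a fixed chance of a hit. That binomial model is an assumption added here, not something the essay states, and the function name `batting_average_sd` is made up for illustration.

```python
import math

def batting_average_sd(true_avg: float, at_bats: int) -> float:
    """Standard deviation of an observed batting average.

    Assumption: each at-bat is an independent trial with the same
    hit probability (a simple binomial model).
    """
    return math.sqrt(true_avg * (1.0 - true_avg) / at_bats)

# Compare the small samples typical against lefties with a full season.
for n in (100, 150, 200, 600):
    sd = batting_average_sd(0.290, n)
    print(f"{n:4d} at-bats: one standard deviation is about {sd * 1000:.0f} points")
```

Under this model a true .290 hitter has a standard deviation of roughly 45 points in 100 at-bats and roughly 32 points in 200, so a .340 or a .250 in a sample that size is not unusual.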
It is hard to distinguish the luck from the real skill, but as baseball fans we get to be pretty good at it. The problem is that the .290 batting average against right-handed pitchers also involves a great deal of luck.
When we create a new statistic, platoon differential, as a comparison offshoot of these other statistics, the new statistic embodies all of the instability, all of the luck, combined in either of its components. Suppose that you take two statistics, each of which is 30% luck, and you add them together. The resulting new statistic will still be 30% luck (understanding, of course, that the 30% number here is purely illustrative, and has no functional definition).
But when you take two statistics, each of which is 30% luck, and you subtract one from the other (or divide one by the other), then the resulting new statistic, the comparison offshoot, may be as much as 60% luck. By contrasting one statistic with another to reach a new conclusion, you are picking up all of the luck involved in either of the original statistics.
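One possible way to make the add-versus-subtract illustration concrete is a variance decomposition. This is a sketch under assumptions that are not in the original: each statistic is skill plus luck, the luck in the two statistics is independent, and the two skills are positively correlated (a good hitter tends to be good against both kinds of pitcher). The function `luck_share` and its parameters are hypothetical names.

```python
def luck_share(skill_var: float, luck_var: float,
               skill_corr: float, combine: str) -> float:
    """Fraction of the combined statistic's variance that is luck.

    Assumptions: both statistics have the same skill and luck variances,
    luck is independent between them, and the skills are correlated by
    skill_corr. combine is "sum" or "difference".
    """
    sign = 1.0 if combine == "sum" else -1.0
    # Var(S1 +/- S2) = 2 * skill_var * (1 +/- skill_corr)
    skill_part = 2.0 * skill_var * (1.0 + sign * skill_corr)
    # Independent luck adds in both cases: Var(L1 +/- L2) = 2 * luck_var
    luck_part = 2.0 * luck_var
    return luck_part / (skill_part + luck_part)

# Each input statistic is 30% luck, as in the illustration above.
for corr in (0.0, 0.5, 0.8):
    added = luck_share(0.7, 0.3, corr, "sum")
    subtracted = luck_share(0.7, 0.3, corr, "difference")
    print(f"skill correlation {corr:.1f}: sum {added:.0%} luck, "
          f"difference {subtracted:.0%} luck")
```

In this model the luck is the same size whether you add or subtract. What changes is the skill: subtracting cancels the shared part of it, so luck becomes a larger fraction of what remains. The exact percentages depend entirely on the assumed correlation, which is in keeping with the 30% figure being purely illustrative.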
But wait a minute — the problem is actually much, much more serious than that. A normal batting average for a regular player is in the range of .270. A normal platoon differential is in the range of 25 to 30 points — .025 to .030.
Thus, the randomness is operating on a vastly larger scale than the statistic can accommodate. The new statistic — the platoon differential — is operating on a scale in which the norm is about .0275 — but the randomness is occurring on a scale ten times larger than that. The new statistic is on the scale of a Volkswagen; the randomness is on the scale of an 18-wheeler. In effect, we are asking a Volkswagen engine to pull a semi.
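A related way to see the mismatch of scale is to compare the size of a normal platoon differential with the sampling noise in one season's observed differential. The sketch below uses the same independent-trials assumption as a simple binomial model, and the at-bat counts (150 against lefties, 450 against righties) are assumed values chosen for illustration. It measures sampling noise, which is a different quantity from the .270 scale of the batting average itself.

```python
import math

def differential_sd(avg_vs_left: float, ab_vs_left: int,
                    avg_vs_right: float, ab_vs_right: int) -> float:
    """Standard deviation of an observed platoon differential.

    Assumption: the two batting averages are independent binomial
    proportions, so their variances add when one is subtracted
    from the other.
    """
    var_left = avg_vs_left * (1.0 - avg_vs_left) / ab_vs_left
    var_right = avg_vs_right * (1.0 - avg_vs_right) / ab_vs_right
    return math.sqrt(var_left + var_right)

typical_differential = 0.0275  # midpoint of the .025 to .030 range
# Allen's averages from the example; the at-bat counts are assumed.
sd = differential_sd(0.340, 150, 0.290, 450)
print(f"noise in one season's differential: about {sd * 1000:.0f} points")
print(f"typical differential: {typical_differential * 1000:.1f} points")
print(f"noise-to-signal ratio: {sd / typical_differential:.1f}")
```

With these assumed at-bat counts, the noise in a single season's differential is roughly 44 points, larger than the 25 to 30 points the statistic is trying to measure.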
But wait a minute, the problem is still worse than that. In the platoon differential example, I reached the conclusion I did by comparing one comparison offshoot with a second comparison offshoot: the platoon differential in one year with the platoon differential the next year. Dick Cramer, in the clutch-hitting study, did the same thing, and catcher-ERA studies, which look for consistency in catchers’ impact on ERAs, do the same thing; they compare one comparison offshoot with a second comparison offshoot. It is a comparison of two comparison offshoots.
When you do that, the result embodies not just all of the randomness in two original statistics, but all of the randomness in four original statistics. Unless you have extremely stable “original elements” — original statistics stabilized by hundreds of thousands of trials — then the result is, for all practical purposes, just random numbers.
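The year-to-year comparison can be tried out in a small simulation. This is a sketch, not a reconstruction of any actual study: every simulated player is given a real, fixed platoon differential, and the spread of those true differentials (15 points), the at-bat counts, the .270 baseline, and all function names are assumptions chosen for illustration.

```python
import math
import random

def simulate_average(true_avg: float, at_bats: int, rng: random.Random) -> float:
    """Observed batting average from independent at-bats."""
    hits = sum(1 for _ in range(at_bats) if rng.random() < true_avg)
    return hits / at_bats

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def year_to_year_correlation(players: int = 1000, seed: int = 1) -> float:
    """Correlate observed platoon differentials across two seasons."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _player in range(players):
        # Assumed: true differentials vary around .0275 with SD .015.
        true_diff = rng.gauss(0.0275, 0.015)
        avg_right = 0.270
        avg_left = avg_right + true_diff
        observed = []
        for _season in range(2):
            vs_left = simulate_average(avg_left, 150, rng)
            vs_right = simulate_average(avg_right, 450, rng)
            observed.append(vs_left - vs_right)
        year1.append(observed[0])
        year2.append(observed[1])
    return pearson(year1, year2)

r = year_to_year_correlation()
print(f"year-to-year correlation of observed differentials: {r:.2f}")
```

Under these assumptions the skill is real and never changes, yet the expected correlation between one year's differential and the next is only on the order of 0.1, because the sampling noise in the four underlying averages swamps the spread in true differentials.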
We ran astray because we have been assuming that random data is proof of nothingness, when in reality random data proves nothing. In essence, starting with Dick Cramer’s article, Cramer argued, “I did an analysis which should have identified clutch hitters, if clutch hitting exists. I got random data; therefore, clutch hitters don’t exist.”
Cramer was using random data as proof of nothingness — and I did the same, many times, and many other people also have done the same. But I’m saying now that’s not right; random data proves nothing — and it cannot be used as proof of nothingness.
Why? Because whenever you do a study, if your study completely fails, you will get random data. Therefore, when you get random data, all you may conclude is that your study has failed. Cramer’s study may have failed to identify clutch hitters because clutch hitters don’t exist — as he concluded — or it may have failed to identify clutch hitters because the method doesn’t work — as I now believe. We don’t know. All we can say is that the study has failed."