
Tuesday, August 15, 2006

Are pitchers chicken?

The issue of retaliation for hit batsmen is a popular one among economists, because it tests the standard “people respond to incentives” theory.

The theory predicts that a pitcher will plunk more batters if he doesn’t have to come to the plate and risk being hit himself. That is, in the National League, when the pitcher faces potential retaliation, he has an incentive to refrain from deliberate plunkings – or, at least, to pitch more carefully to prevent the accidental hit batsman. But with the DH, the pitcher doesn’t have to bat, and he has less incentive to avoid the HBP.

(This is a hot topic, and there are more studies than I’m summarizing here, but I’ll eventually get to the only two Retrosheet-data-based ones I know of, which are probably the best.)

In 1997, Brian Goff, William Shughart, and Robert Tollison started the ball rolling – er, beaning – with a study showing that hit batsmen were substantially more common in the AL than NL, seemingly confirming the “moral hazard” hypothesis. (The paper doesn’t seem to be online and I haven’t actually read it, but it’s cited a lot.)

Steven Levitt responded in 1998 with this nice article, pointing out that almost the entire difference is explained by the fact that pitchers, being sucky hitters (I am paraphrasing), don’t get hit very much. Non-pitchers are hit at almost the same rate in both leagues. Furthermore, pitchers who hit more batters are no more likely to be hit themselves, which means that self-preservation is unlikely to be a motive.

More recently, this study by J. C. Bradbury (of sabernomics.com) and Douglas Drinen used Retrosheet data to analyze the question in more detail. For all interleague games 1997-2003, they ran a regression on a game-by-game basis instead of team by team, attempting to predict HBP (received) based on a bunch of season and game team factors.

They find some serious significance. Walks, HR, and game score difference were significant at 1% or better.

Most importantly, the DH was significant at the 5% level (the DH meant more HBP, as expected). Retaliation, as measured by the number of batters the team itself hit, was extremely significant at 1% -- in fact, the observed value was seven standard deviations from the mean.

The authors write that this means the DH is associated with an 11% increase in HBP, and each batter a team hits increases the number of its own batters hit by 10% to 15%.

I don’t know if the authors accounted for this, but if one HBP in the game increases the frequency of opponent HBPs by 10% in that game, it follows that it must have increased the frequency of subsequent retaliatory HBPs by more – since only half of HBPs can be retaliatory. (Assuming a team won’t retaliate after a retaliation.) And that’s the number that we’re most concerned with: how the rate of plunking increases after an HBP, not before.

Here’s an example, which you can ignore. Consider two teams, A and B. A hits B unprovoked in 1/3 of games. B hits A unprovoked in 1/3 of games. Nobody gets hit in the last 1/3 of games. Finally, teams retaliate 50% of the time they are hit unprovoked.

On average, every six games will look like this:

1. No HBP
2. No HBP
3. A hits
4. A hits then B hits
5. B hits
6. B hits then A hits

On average, A and B hit 1/2 batter per game each. But in games when a team is hit, it hits 2/3 batters per game. That looks like an increase of 33% due to retaliation. But we’ve seen that there is actually 50% retaliation. That comes out only if you look at the order in which the plunkings occurred.
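For anyone who wants to check the arithmetic, here is a minimal Python sketch of the six-game cycle above. The `games` list and the variable names are my own illustration, not anything from the Bradbury/Drinen study; it just counts the same hypothetical games two ways, once ignoring order and once respecting it.

```python
from fractions import Fraction

# Each game lists who hit a batter, in order. These six games are the
# hypothetical cycle from the example above, not real data.
games = [
    [],
    [],
    ["A"],
    ["A", "B"],
    ["B"],
    ["B", "A"],
]

# Overall rate at which A hits a batter.
a_overall = Fraction(sum("A" in g for g in games), len(games))

# Game-level view: A's rate in games where B hit someone, order ignored.
b_hit_games = [g for g in games if "B" in g]
a_given_b = Fraction(sum("A" in g for g in b_hit_games), len(b_hit_games))

# Order-aware view: games where B hit first, and how often A answered.
b_first = [g for g in games if g and g[0] == "B"]
a_retaliates = Fraction(sum("A" in g for g in b_first), len(b_first))

print(a_overall)                   # 1/2
print(a_given_b)                   # 2/3
print(a_given_b / a_overall - 1)   # 1/3, the apparent increase
print(a_retaliates)                # 1/2, the actual retaliation rate
```

The game-level comparison shows a 33% bump, while the order-aware count recovers the 50% retaliation rate that was built into the example.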

So when Bradbury and Drinen find an increase of 10% to 15%, that’s probably understating the actual retaliation effect, because they’re including “pre-taliations” – and those presumably are less frequent, which brings the average down.

The same would be true of the home run and score situation. It seems to me that because this study doesn’t account for what came first, the effects it finds are lower bounds, and real life probably has an even stronger cause and effect relationship than what the study found.

But anyway, the main purpose of the study was to find a DH effect. And it did find one, significant at the 5% level.

Finally, Bradbury and Drinen have one more paper, this time analyzing the question in even more detail – by plate appearance rather than by game. For each plate appearance from several seasons in both leagues, they regressed HBP on about 15 variables, including score, batter and pitcher quality, whether the previous batter hit a home run, whether there was an opposing HBP in the previous half-inning, and so on.

Their findings: the DH increases the probability of an HBP by 11 to 17 percent. Also, the study finds that the pitcher has four times the chance of being personally hit after hitting an opponent in the previous half-inning. It appears that pitchers do reduce their HBP out of fear of getting hit themselves.

We can even estimate the fear. I’ll run through the calculation – let me know if I’ve made any mistakes.

Levitt’s study shows that pitchers are hit once per 335 at-bats. The increase in risk is three times that (the difference between average and four times average) or, say, about once per 100 at-bats. The chance of a pitcher coming to bat the inning following an HBP is, say, 40%. So after an HBP, the increase in pitcher plunkings is about 1 in 250 at-bats. That is, every 250 HBP a pitcher commits cause him to get hit once himself.

So since pitchers reduce their HBP in response to this incentive, they marginally value their own safety about 250 times as much as they value the safety of the other team’s batters.

That might not be fair … Bradbury and Drinen’s study assumes that pitchers are out of danger if they don’t bat the next inning. Suppose that they always bear the four times risk in their next at-bat, even if it comes in a later inning or game. Then the 40% factor disappears, and instead of valuing their own safety 250 times as much, it becomes only 100 times as much.
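Here is the same back-of-envelope calculation as a small Python sketch, without the rounding. The 335, the factor of four, and the 40% guess all come from the text above; the variable names are mine. Unrounded, the figures come out closer to 112 and 279 than to the 100 and 250 used above, which doesn’t change the order of magnitude.

```python
from fractions import Fraction

base_rate = Fraction(1, 335)   # pitchers are hit once per 335 at-bats
multiplier = 4                 # risk multiple after hitting an opponent
p_bats_next = Fraction(2, 5)   # the 40% guess that he bats the next inning

# Added risk per at-bat: four times average minus average.
extra_risk = (multiplier - 1) * base_rate

# Added plunkings of the pitcher per HBP he commits.
per_hbp = extra_risk * p_bats_next

at_bats_per_extra_hit = 1 / extra_risk   # if he always carries the risk
hbp_per_extra_hit = 1 / per_hbp          # if he is only at risk next inning

print(float(at_bats_per_extra_hit))   # about 112
print(float(hbp_per_extra_hit))       # about 279
```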

So: introduce the DH, and a pitcher feels emboldened enough to hit 100 batters – just because it’ll save him being hit once. Is 100 to 1 the normal level of human self-interest? Or are pitchers particularly chicken?

Monday, August 14, 2006

How good was the WHA?

In 1985, Bill James’ famous minor-league study determined that the level of pitching in the majors was 18% higher than in AAA. That is, you’d have to discount minor-league hitters’ stats by 18% to predict what they would have done in the major leagues.

In “League Equivalencies,” an article on hockeyanalytics.com, Gabriel Desjardins looks to do the same for hockey.

For instance, Desjardins found all players in the 1972-73 WHA (its inaugural season) who played in the NHL the following year. Those 39 players subsequently scored 46% as many points per game in the NHL as they had in the WHA, and so the “league quality” of the 1972-73 WHA was 0.46.
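To make the idea concrete, here is a rough Python sketch of that kind of calculation. The player lines are invented, and pooling points and games across players before taking the ratio is my assumption; Desjardins may well weight players differently.

```python
# Hypothetical (points, games) lines for players who switched leagues.
# The numbers are invented for illustration only.
players = [
    {"wha": (80, 78), "nhl": (35, 76)},
    {"wha": (50, 70), "nhl": (22, 68)},
    {"wha": (30, 60), "nhl": (15, 72)},
]

def points_per_game(lines):
    """Pooled points per game over a list of (points, games) pairs."""
    points = sum(p for p, _ in lines)
    games = sum(g for _, g in lines)
    return points / games

def league_quality(players, old="wha", new="nhl"):
    """Ratio of the group's scoring rate in the new league to the old one."""
    old_rate = points_per_game([p[old] for p in players])
    new_rate = points_per_game([p[new] for p in players])
    return new_rate / old_rate

quality = league_quality(players)
print(round(quality, 2))
```

A result below 1.00 means the same players scored less often after the move, which is read as the old league being weaker.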

The WHA’s 0.46 was only slightly higher than the 0.43 for the minor-league AHL that year. “This is not surprising,” Desjardins writes, “since the WHA mined the AHL to fill out its teams.”

The WHA’s quality increased during its life – up to 0.76 the next year, and then irregularly to 0.89 in its final season of 1978-79. By contrast, the AHL stayed in the 0.50 range in the 70s, and is now at 0.45.

The Russian Elite League is the highest-quality non-NHL league at 0.91; the Czech league is second at 0.61, followed by Sweden (0.59) and Finland (0.54).

Desjardins argues, also, that the “real” quality is likely to be higher than the figures he presents, because players moving to the NHL normally get much less power play time than they did in the other leagues. That reduces their point scoring more than just their ability would suggest.

To which I would add: what about playing time? Wouldn’t it also be true that players good enough to be promoted will get less playing time in the NHL than they did in the minors? That would deflate their numbers even more. You’d think this would be a very large factor, at least as large as the power play issue. (On the other hand, you’d think that playing time in high-caliber leagues (like the Russian league) would be less of an issue, since the better the hockey, the more likely the NHL recruit is good enough to get substantial ice time.)

In baseball, of course, we have statistics broken down by plate appearance or outs made, so playing time is accounted for. In hockey, though, without playing time numbers, these results are lower-bound estimates and may be substantially off. But, on that basis, and taking the numbers for what they're worth, this is pretty valuable information.

A Response to Win Shares: A Partial Defense of Linear Weights

In a new article written for this blog, Charlie Pavitt defends the Linear Weights system against some Bill James criticisms:

"One of the impressions that reading [Win Shares] left me with is the seemingly constant attacks Bill makes in this book against Pete Palmer’s Linear Weights player evaluation system ... I believe these attacks to be at least partly misguided, in that, at least implicitly, Pete’s system is attempting to do something quite a bit different than Bill’s system is."

Click for the full article, "A Response to Win Shares: A Partial Defense of Linear Weights."

For those of you not familiar with Charlie's work, he regularly writes reviews of sabermetric studies for "By the Numbers." (Click here, scroll down for current and back issues.) He also maintains an indispensable sabermetric bibliography.

Friday, August 11, 2006

Bill James for Nobel

Apparently, economists are starting to pay attention to sports.

In one sense, they always have; there’s always been talk about player salaries, and stadium deals, and luxury boxes, and profit and loss. But that’s the traditional part of economics, the money part. All that talk about employment and recessions and interest rates and GDP and stuff is pretty boring, even when the subject is sports.

The fun part of economics is the side that applies it to human behavior. Economist Steven E. Landsburg argued that all of economics is based on one principle: “people respond to incentives.” And there are all kinds of interesting non-monetary incentives out there.

For instance (as Landsburg explains in one of his books), you would think that the introduction of air bags and ABS in cars would reduce accidents. But accidents actually increased. The reason: with all the safety features protecting them, drivers have less incentive to drive carefully. And, more recently, Landsburg described a study of how women respond to the incentives for faking orgasm.

As well, Tim Harford has explained why his favorite restaurant has an incentive to hire rude waiters. And in Freakonomics, famed economist Steve Levitt explains his academic finding that real estate agents get higher prices when selling their own houses than when selling their clients’ houses. Again, that’s because of incentives. For an equivalent amount of effort, they get to keep 100% of any increase in their own price, but only 5% of any increase in their client’s price.

Until recently, there hasn’t been a whole lot of this kind of thinking applied to sports. The only older example I can think of is Bill James arguing in the 1985 Abstract that because of the low level of competition in the AL West, the teams there had less incentive to get rid of their mediocre players. After all, if 86 games is enough to win the pennant, there’s less need to take chances than if you need 95 games or more.

But now, the economic way of thinking is gaining ground. Levitt has written a paper on incentives to hit batters. Another Levitt paper shows that when a third referee was added to college basketball, the number of fouls dropped (since the probability of getting caught increased) -- but when a second referee was added to NHL games, the frequency of penalties didn't change (because the probability of getting caught remained the same).

And, of course, there’s this study on how Sumo wrestlers cheat when it’s in their interests to do so (see my summary here).

There’s also the field of “behavioral economics.” In one of my favorite chapters in “Basketball on Paper,” Dean Oliver talks about the famous “ultimatum game.” In that game, player A is given $10, and makes an ultimatum offer to player B about how to split it between them. If B agrees, the money is split. If B refuses, neither player gets anything.

In theory, A should offer B only a penny, and B should accept, since even a penny is better than nothing. But, in real life, players offered a penny (or even a dollar) refuse out of spite, feeling that fairness entitles them to a larger chunk of the $10.

Oliver parallels this game to a coach splitting playing time among his team. Players not receiving enough time (and glory) may rebel out of spite, putting in less effort and co-operation despite the negative effect it would have on their careers. The coach’s job, Oliver argues, is to manage the situation and keep players from feeling slighted by what is, in effect, the coach’s ultimatum.

So there are three different aspects of economics so far – the boring money stuff, the fun incentive stuff, and the field of behavioral economics.

But now, there’s the possibility that economics may be branching into mainstream sabermetrics. “The Wages of Wins,” the recent book by three academic economists, starts out talking about salaries and attendance. But it quickly moves on to evaluating basketball players, and much of the book deals with formulas for measuring offensive productivity – no incentives, no dollar figures, just sabermetric analysis. I wouldn’t have expected this from economics, since it’s not about responses to incentives, but the internal details of how to measure output. It’s as if an economic analysis of outsourcing suddenly started talking about what kind of computers make your customer service representatives in India most productive.

But, having said that, is there any other academic field more suitable for sabermetric work? Probably not. In terms of the actual nuts and bolts, sabermetrics consists of taking large databases containing the end results of human interactions, and trying to come up with theories and relationships that make the data meaningful. And that’s what economists do all the time. Whether it’s clutch batting statistics, accident rates with and without air bags, sumo wrestling decisions in different types of critical situations, or lists of real estate transactions in Chicago involving real estate agents’ own houses, the logic and math involved in doing the actual work are pretty much the same. If Retrosheet had decades’ worth of play-by-play traffic data instead of baseball data, sabermetricians could tell as much about the value of performance tires as they can about the value of a stolen base.

So if this is a real phenomenon, instead of just a fad, we will see economists getting more and more into sports analysis over the next while. And if that happens, and academia eventually accepts sabermetrics as a worthy and legitimate branch of economics, there’s a good chance that Bill James could be in the running for a Nobel.

No, seriously, I mean it. You’ve got to admit that the body of Bill’s work is an amazing intellectual achievement. Is it as good as other Nobel work? My uneducated impression is that it certainly is. All that’s left is for some Nobel field to encompass this area of study in its purview. And, now, there’s a small but real chance it could be economics.

The Bill James Nobel is a long shot, sure, but it’s not impossible. If someone offered me, say, 500 to 1, I’d probably take it.

Thursday, August 10, 2006

NCAA home field advantage estimated within 14 points

How much blood can one study try to squeeze from a tiny little stone?

A lot. This academic study by Byron J. Gajewski, “There’s no Place Like Home: Estimating Intra-Conference Home Field Advantage Using a Bayesian Piecewise Linear Model,” tries to estimate home field advantage in Big 12 NCAA football – by using a sample of only 432 intra-conference games from 1996-2004.

With 432 games, the standard deviation of winning percentage for a .600 team is about .023. So, even if you observed a home winning percentage of .600, the 95% confidence interval would still be (.553, .647) – a pretty wide interval. So you’re not going to get all that much useful information from a sample of this size.
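A quick Python check of those numbers, treating each game as an independent coin flip with a .600 chance of a home win (the usual binomial assumption, which is mine and not the study’s) and taking two standard deviations each way as the 95% interval:

```python
import math

n_games = 432
p = 0.600

# Binomial standard deviation of an observed winning percentage.
sd = math.sqrt(p * (1 - p) / n_games)

# Roughly a 95% interval: two standard deviations in each direction.
low, high = p - 2 * sd, p + 2 * sd

print(round(sd, 4))                     # 0.0236
print(round(low, 3), round(high, 3))    # 0.553 0.647
```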

But the study is much more ambitious than even that – it tries to estimate a separate home field advantage (HFA) for each of the twelve teams. The assumption that each team has its own particular home field advantage is, I suppose, not unreasonable, but trying to find it off a sample of only 72 games per team seems like overreaching.

Further, those 72 games are played over nine years. If you assume that every team is going to have a different home field advantage, wouldn’t you also assume that it could vary from year to year, along with the players? This is college football, where there’s complete turnover every four years. Why assume that the 1996 Sooners will have the same HFA as the 2002 Sooners? The author calls the assumption “reasonable” because “the fan base is likely to be very stable,” making the implicit assumption (and I think I’ve seen at least one study disproving this for baseball) that HFA is a function of attendance.

There’s still more complexity. The study doesn’t just figure out the difference between a team’s home record and its road record. It actually tries to estimate each team’s intrinsic quality, and the quality of its opponent. And that’s hard to do. You can’t just take the season record, because of luck. You could take each season and regress it to the mean, but then you’re ignoring information about the previous or next season. For instance, a team that goes 6-2 is perhaps really a 5-3 team that got lucky -- but a team that goes 6-2 between 8-0 seasons might actually have 6-2 talent, or even 7-1.

The study chooses to solve this problem by fitting a straight line over the nine years, but allowing it to change direction at three fixed points. (That is, the best-fitting four straight lines of certain fixed length with no discontinuity.) This seems reasonable, but other, equally reasonable decisions could give substantially different results, especially with so few data points.

Finally, the author uses a Bayesian model and a simulation via “Markov Chain Monte Carlo” to get the final results. I’m not sure how this affects the conclusions, but some of them are unexpected. For instance, over the years of the study, Baylor was .167 (6-30) at home but .000 (0-36) on the road. A naive estimate of the home field advantage would be half of the difference, or .083. Gajewski’s method comes up with -.025, suggesting that Baylor was actually worse at home than on the road. (Part of the reason, presumably, is that they faced easier opponents at home, a fact the naive method wouldn’t consider.)
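The naive estimate is simple enough to write down as a short Python sketch. The function names are mine; the Baylor records are the ones quoted above.

```python
def win_pct(wins, losses):
    """Winning percentage from a won-lost record."""
    return wins / (wins + losses)

def naive_hfa(home, road):
    """Half the gap between home and road winning percentage."""
    return (win_pct(*home) - win_pct(*road)) / 2

baylor = naive_hfa(home=(6, 30), road=(0, 36))
print(round(baylor, 3))   # 0.083
```

Note that this ignores opponent quality entirely, which is exactly the gap the study’s model is trying to fill.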

Here’s the full list. The study’s numbers are approximate, as I had to read them off a graph.

Team ....... Naive estimate .. Study estimate

Baylor ........ .083 .......... -.02
Colorado ...... .069 ........... .02
Iowa State .... .069 ........... .15
Kansas State .. .069 ........... .03
Kansas ........ .097 ........... .10
Missouri ...... .083 ........... .12
Nebraska ...... .139 ........... .22
Oklahoma State. .083 ........... .15
Oklahoma ...... .069 ........... .08
Texas A&M ..... .097 ........... .11
Texas Tech .... .139 ........... .17
Texas ......... .111 ........... .05


While I admittedly don’t understand all of the Bayesian and Monte Carlo aspects of the study, I can’t imagine how this small amount of data, with so many variables, could yield anything close to an accurate estimate of anything.

And the study admits it. The estimates in the table above have very wide 90% error bars -- estimating from the graph, about .15 (almost exactly one touchdown) in each direction. The Oklahoma State home field advantage could be as low as zero points, or as high as 14 points. Which, really, isn’t anything we didn’t already know.

Wednesday, August 09, 2006

New Issue: By the Numbers

The just-released May issue of “By the Numbers,” the sabermetrics newsletter of the Society for American Baseball Research (SABR), is now available at my website. I am the editor, so I won’t be writing reviews of these articles, just these summaries:

“Academic Research: Pitcher Luck vs. Skill” by Charlie Pavitt: a review of two academic studies by Jim Albert, which talk about how to break down a pitcher’s record between luck and skill components.

“The Wages of Wins – Right Questions, Wrong Answers” by Phil Birnbaum (me): a review of the recent book on sports sabermetrics/econometrics.

“The Interleague Home Field Advantage” by Eric Callahan, Thomas J. Pfaff, and Brian Reynolds: a study that shows that the home field advantage in interleague games is significantly (in the baseball sense, not the statistical sense) higher than in other games.

“Best of the Ball Hawks” by Tom Hanrahan: a study that evaluates the best centerfielders of all time using Win Shares and the Baseball Prospectus ratings.

We are always looking for material for future issues – please e-mail me if interested in contributing.

Tuesday, August 08, 2006

How much is a slap shot worth? An empirical study

Back in 1986, Jeff Z. Klein and Karl-Eric Reif released a book called “The Klein and Reif Hockey Compendium.” Obviously modeled after The Bill James Baseball Abstract, it was entertaining and had a lot of numbers, but was not all that sabermetrically informative.

In that book, and again in the 2001 update (now simply called “The Hockey Compendium”), they introduced a goaltender rating stat called “perseverance.” The idea was that save percentage isn’t good enough – goalies who face many more shots per game will also face a higher caliber of shots, and their rating should be adjusted upward to account for their more difficult task.

It sounded reasonable, but the authors didn’t do any testing on it -- they chose their formula because it looked good to them. With the proper data, it would be pretty simple to actually investigate how save percentages change with number of shots, and create a formula that matches the empirical evidence. I’ve always thought that would be a good study to do.

But now, Alan Ryder, at hockeyanalytics.com, has an awesome study that goes many steps better and makes the Klein/Reif method obsolete. It uses (extremely useful!) NHL play-by-play data (sample) to investigate shot stoppability not just by number of shots, but by type of shot and distance from the net.

Here’s what Ryder did. First, he found that there are five special kinds of shots where type and distance don’t matter much – empty net goals, penalty shots, very long shots, rebounds, and scramble shots (shots from less than six feet away that weren’t rebounds). For those, the shot is rated by the overall probability of a goal from that type – so an even-strength rebound, which went in 34.8 percent of the time, counts as 34.8 percent regardless of the details of the shot.

For all other shots – “normal” shots -- distance matters. At even strength, the chance of scoring on a 10-foot shot was 15% -- but from 20 feet out, the chance dropped to 10%. (All figures are for even strength – on the power play, “normal” shots are uniformly about 50% more effective.)

He also found that the probabilities varied for different types of shots. Only 6.7% of slapshots were goals, but over 20% of tip-ins went in.

(Ryder is quick to note that the data does not mean that players should change their shot selection based on these findings, implicitly acknowledging that players are likely choosing the shot most appropriate for the situation.)

Combining types and distances, Ryder came up with a graph of the chance of scoring on any combination of shot and distance, and then smoothed out the results. Unfortunately, we don’t get the full set of data, but we get a graph of relative probabilities. For instance, a slapshot is a bit above average in effectiveness (relative to other shot types at that distance) anywhere from 5 to 50 feet, but drops after that.

Having done all that groundwork, Ryder is now in a position to easily evaluate defenses and goaltenders. Basically, the best defense is one that keeps the opposition from taking more dangerous shots. It’s now possible to add up the probabilities of all shots taken, to see how many “expected goals” the defense allowed. For instance, if a team allows six shots, each with a 15% chance of scoring, it’s effectively yielded 0.9 of a statistical goal.

And, of course, you can now evaluate the goalies, too. If the offense’s shot probabilities added up to 4.4, but only four goals were scored, you can credit the goalie with the 0.4 goals saved. Or, as Ryder chooses to do it, you’d give him a “goaltending index” of 0.909, which is 4 divided by 4.4.
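The two worked examples above fit in a few lines of Python. This is only a sketch of the bookkeeping: the function names are mine, and the per-shot probabilities would in practice come from Ryder’s shot-type and distance tables, which aren’t reproduced here.

```python
def expected_goals(shot_probs):
    """Sum of per-shot scoring probabilities allowed by a defense."""
    return sum(shot_probs)

def goaltending_index(goals_allowed, xg):
    """Actual goals divided by expected goals; below 1.0 is good goaltending."""
    return goals_allowed / xg

# Six shots, each with a 15% chance of going in.
xg = expected_goals([0.15] * 6)
print(round(xg, 2))                           # 0.9

# Shots worth 4.4 expected goals, but only four goals scored.
print(round(4.4 - 4, 1))                      # 0.4 goals saved
print(round(goaltending_index(4, 4.4), 3))    # 0.909
```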

I’ll leave it to you to check out the study – which is very easy to read and understand – to find out who the best and worst goalies and defenses are. I’ll mention only one of Ryder’s examples. In 2002-03, the Rangers allowed 21 more goals than the Lightning. But, after adjusting for the types and distances of shots, it turns out that the Rangers’ goaltending was actually significantly better – an edge that was more than offset by a defense that allowed many more quality opportunities.

Ryder’s study is by far the best hockey study I’ve seen (subject to the disclaimer that I haven’t seen that many). My only concern is that, just as Ryder points out that not all shots are equal, it’s probably also true that not all 30-foot wrist shots are equal. There could be many other factors that affect that kind of shot – who the shooter is, whether it was a one-timer, whether the goalie is screened, whether the defense is out of position, and so forth.

This doesn’t affect a team’s overall rating (which is simply goals allowed), but it would affect the proportion of credit or blame to assign to the goaltender. If the goalie is faced with a lot of difficult 30-foot wrist shots, he will be underrated by this system. If the 30-foot wrist shots are easy, he will be overrated.

Is this a big factor? One way to find out would be to see if a goalie’s rating is reliable and consistent from year to year, especially when he changes teams. If it’s not, the size of the inconsistency would be evidence of how much defenses vary in ways that aren’t captured by shot type and distance alone.

Monday, August 07, 2006

A flawed competitive balance hockey study

Pitching has nothing to do with winning baseball games. I've proved it!

Here’s what I did: I ran a regression to predict team wins. I had ten independent variables: team home runs, batting average, OPS, winning percentage, sacrifice hits, ERA, manager experience, total average, total payroll, and pitcher strikeouts.

After running the regression, only one variable was significant: team winning percentage. The others weren’t significant at all. And, so, obviously, ERA and strikeouts have nothing to do with winning. I’ve proven it!

Well, of course, I haven’t, and the flaw is kind of obvious: the regression includes “winning percentage” as one of the independent variables. Winning percentage is almost exactly the same as the “wins” I’m trying to predict. In fact, the regression equation would work out close to:

Wins = (162 * winning percentage) + (0 * ERA) + (0 * batting average) + (0 * OPS) …

ERA has a lot of effect on wins, but it does so by changing winning percentage. In this case, winning percentage “absorbs” all the effects of ERA. That is, once you know winning percentage, knowing ERA doesn’t help you predict wins any better. A .600 team wins 97 games, regardless of how good its pitching staff was.
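Here is a small, self-contained Python sketch of that absorption effect. Everything in it is made up for illustration: thirty fake teams, an invented link between ERA and winning percentage, and a bare-bones least-squares solver so that nothing beyond the standard library is needed. It is not a reconstruction of any real regression.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients, intercept first, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

random.seed(1)
era = [random.gauss(4.3, 0.5) for _ in range(30)]
# Invented relationship: a lower ERA means a higher winning percentage,
# plus noise standing in for hitting, luck, and everything else.
pct = [0.5 - 0.10 * (e - 4.3) + random.gauss(0, 0.04) for e in era]
wins = [162 * p for p in pct]

era_only = ols([[e] for e in era], wins)               # [intercept, ERA]
both = ols([[p, e] for p, e in zip(pct, era)], wins)   # [intercept, pct, ERA]

print("ERA alone:", era_only)
print("pct + ERA:", both)
```

On its own, ERA gets a clearly negative coefficient. Once winning percentage is in the model, the fit puts essentially 162 on winning percentage and essentially zero on ERA, even though ERA was built into the fake data as a real cause of winning.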

(I’m sure there’s a statistical term for this effect, when you have massive cross-correlations in your independent variables that cause otherwise-significant variables to be absorbed by others. But if there is, I don’t know it.)

Suppose we try to cure the problem by getting rid of “winning percentage” and using “expected wins” (pythagoras) instead. Our correlation would still be very high, because pythagoras predicts wins very well. And, again, we’d wind up with ERA being insignificant, for the same reason – all of the information ERA gives you is already included in the information in "expected wins". A team that scores 500 runs and allows 450 will win about 90 games, regardless of its staff’s ERA.
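For reference, the 500-and-450 example works out like this in Python. I’m assuming the classic version of the formula with an exponent of 2 and a 162-game season; the text above doesn’t say which variant is meant.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2):
    """Expected wins from runs scored and allowed (exponent of 2 assumed)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)

print(round(pythagorean_wins(500, 450), 1))   # 89.5
```

Nothing about the pitching staff enters the function except through runs allowed, which is the whole point: once runs are known, ERA has nothing left to add.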

One more try: let’s remove "expected wins", and add separate variables for “runs scored” and “runs allowed”. The flaw is more subtle, but it’s still there. Our correlation will drop a bit because, while you can predict winning percentage by a combination of runs scored and allowed (at about 10 runs equals one win), it’s not as accurate as pythagoras. But the other variables will still wind up not significant. And again, that’s because once you know a team’s runs scored and allowed, the ERA does not give you any more information.

This last situation is the flaw in this hockey study by Tom Preissing and Aju J. Fenn.

Preissing (who is now an active NHL player) and Fenn tried to figure out what factors are predictive of competitive balance in the NHL (as measured by single season clustering around .500). They included variables like free agency, the draft, the availability of European players, the existence of the WHA, and so on. Unfortunately, they also included variables for competitive balance in goals scored and allowed – which, as we saw, is a fatal flaw.

Since GS and GA directly cause winning percentage, most of the study’s other variables show up as insignificant. For instance, the amateur draft may have increased competitive balance measured in wins – but if it did, it would have done so by the mechanism of increasing competitive balance in goals scored, or by the mechanism of increasing competitive balance in goals allowed.

That is, a league that has lots of variation in goals scored and lots of variation in goals allowed will have lots of variation in wins, regardless of whether there was an amateur draft or not.

Sadly, the flaw means we don’t really get any reliable information from the study. But it would sure be interesting to run it again, without those two variables.

(Thanks to Tangotiger for the pointer.)
