Models of Causal Inference: Going Beyond the Neyman-Rubin-Holland Theory
Henry E. Brady
July 16, 2002

Paper Presented at the Annual Meetings of the Political Methodology Group, University of Washington, Seattle, Washington

Causation and Explanation in Social Science
Henry E. Brady, University of California, Berkeley

Causality

Humans depend upon causation all the time to explain what has happened to them, to make realistic predictions about what will happen, and to affect what happens in the future. Not surprisingly, we are inveterate searchers after causes.1 Almost no one goes through a day without uttering sentences of the form “X caused Y” or “Y occurred because of X,” even if the utterances are for the mundane purposes of

-- explaining why the tree branch fell (“the high winds caused the branch to break and gravity caused it to fall”),
-- predicting that we will be late for work (“the traffic congestion will cause me to be late”), or
-- affecting the future by not returning a phone call (“I did not call because I do not want that person to bother me again”).
All these statements have the same form in which a cause (X) leads to an effect (Y).2 Social scientists typically deal with bigger and more contentious causal claims such as:

“The economy grew because of the increase in the money supply.”
“The USSR became highly repressive in the 1920s and 1930s because of Stalin’s accession to power after Lenin.”
“The Protestant Reformation caused the development of capitalism in the West.”
“The lack of strict work requirements causes people to stay on welfare a long time.”
“‘Duverger’s Law’ – Single member electoral districts with a plurality voting rule lead to two political parties while proportional representation creates many small parties.”
“The butterfly ballot in Palm Beach County, Florida in the 2000 election caused Al Gore to lose the election.”

1 Humans also depend upon concepts to describe and understand the world, and other parts of this book focus on how concepts are formed and used.

2 The word “because” suggests that an explanation is being offered as well as a causal process. One way of explaining an event is to identify a cause for it. At the end of this paper, we discuss the relationship between determining causes and proffering explanations.

These are bigger causal claims, but the hunger for causal statements is just as great and the form of the statement is the same.3 The goals are also the same. Causal statements explain events, allow predictions about the future, and make it possible to take actions to affect the future. Knowing more about causality can be useful to social science researchers. Philosophers and statisticians know something about causality, but entering into the philosophical and statistical thickets is a daunting enterprise for social scientists because it requires technical skills (e.g., knowledge of modal logic) and technical information (e.g., knowledge of probability theory) that are not easily mastered.
The net payoff from forays into philosophy or statistics sometimes seems small compared to the investment required. The goal of this paper is to provide a user-friendly synopsis of philosophical and statistical musings about causation. Some technical issues will be discussed, but the goal will always be to ask about the bottom line – how can this information make us better researchers?

Three types of intellectual questions typically arise in philosophical discussions of causality:

– Psychological and linguistic – What do we mean by causality when we use the concept?
– Metaphysical or ontological – What is causality?
– Epistemological – How do we discover when causality is operative?4

Four distinct theories of causality, summarized in Table 1, provide answers to these and other questions about causality. Philosophers debate which theory is the right one. For our purposes, we embrace them all. Our primary goal is developing better social science methods, and our perspective is that all these theories capture some aspect of causality.5 Therefore, practical researchers can profit from drawing lessons from each one of them even though their proponents sometimes treat them as competing or even contradictory. Our standard has been whether or not we could think of concrete examples of research that utilized (or could have utilized) a perspective to some advantage. If we could think of such examples, then we think it is worth drawing lessons from the causal theory.

Table 1: Four Theories of Causality

Neo-Humean Regularity Theory
    Major authors associated with the theory: Hume (1739); Mill (1888); Hempel (1965); Beauchamp & Rosenberg (1981)
    Approach to the symmetric aspect of causality: Observation of constant conjunction and correlation.
    Approach to the asymmetric aspect of causality: Temporal precedence.
    Major problems solved: Necessary connection.
    Emphasis on causes of effects or effects of causes? Causes of effects (e.g., focus on dependent variable in regressions).
    Studies with comparative advantage using this definition: Observational and causal modeling.

Counterfactual Theory
    Major authors associated with the theory: Weber (1906); Lewis (1973; 1986)
    Approach to the symmetric aspect of causality: Truth in otherwise similar worlds of “If the cause occurs then so does the effect” and “if the cause does not occur then the effect does not occur.”
    Approach to the asymmetric aspect of causality: Consideration of the truth of: “If the effect does not occur, then the cause may still occur.”
    Major problems solved: Singular causation. Nature of necessity.
    Emphasis on causes of effects or effects of causes? Effects of causes (e.g., focus on treatment’s effects in experiments).
    Studies with comparative advantage using this definition: Experiments; case study comparisons; counterfactual thought experiments.

Manipulation Theory
    Major authors associated with the theory: Gasking (1955); Menzies & Price (1993); von Wright (1971)
    Approach to the symmetric aspect of causality: Recipe that regularly produces the effect from the cause.
    Approach to the asymmetric aspect of causality: Observation of the effect of the manipulation.
    Major problems solved: Common cause and causal direction.
    Emphasis on causes of effects or effects of causes? Effects of causes (e.g., focus on treatment’s effects in experiments).
    Studies with comparative advantage using this definition: Experiments; natural experiments; quasi-experiments.

Mechanisms and Capacities
    Major authors associated with the theory: Harre & Madden (1975); Cartwright (1989); Glennan (1996)
    Approach to the symmetric aspect of causality: Consideration of whether there is a mechanism or capacity that leads from the cause to the effect.
    Approach to the asymmetric aspect of causality: An appeal to the operation of the mechanism.
    Major problems solved: Preemption.
    Emphasis on causes of effects or effects of causes? Causes of effects (e.g., focus on mechanism that creates effects).
    Studies with comparative advantage using this definition: Analytic models; case studies.

3 Some philosophers deny that causation exists, but we agree with the philosopher D.H. Mellor (1995) who says: “I cannot see why. I know that educated and otherwise sensible people, even philosophers who have read Hume, can hold bizarre religious beliefs. I know philosophers can win fame and fortune by gulling literary theorists and others with nonsense they don’t themselves believe. But nobody, however gullible, believes in no causation (page 1).” Even the political scientist Alexander Wendt (1999) who defends a constructivist approach to international relations theory argues that a major task for social scientists is answering causal questions. He identifies “constitutive” theorizing, which we consider descriptive inference and concept formation, as the other major task. For a thoroughgoing rejection of causal argument see Taylor (1971).

4 A fourth question is pragmatic – How do we convince others to accept our explanation or causal argument? A leading proponent of this approach is Bas van Fraassen (1980). Kitcher and Salmon (1987, page 315) argue that “van Fraassen has offered the best theory of the pragmatics of explanation to date, but ... if his proposal is seen as a pragmatic theory of explanation then it faces serious difficulties” because there is a difference between “a theory of the pragmatics of explanation and a pragmatic theory of explanation.” From their perspective, knowing how people convince others of a theory does not solve the ontological or epistemological problems.

5 Margaret Somers (1998), in a sprawling and tendentious 63 page article replying to Kiser and Hechter’s advocacy of rational choice models in comparative sociological research (1991), ransacks philosophical theories to criticize rational choice theory and to defend historical sociology. Neither the attack nor the defense is very successful because Somers draws her ammunition from arguments made by philosophers of science. If philosophy of science were a settled field like logic or geometry, using its arguments might be a sensible strategy, but in its unsettled state, it seems an unreliable touchstone for social science methodologists. Turning to it for guidance merely multiplies our confusions as methodologists by their confusions as philosophers of science rather than adding to our understanding by providing new insights. In the end, the article accomplishes little for the practicing researcher except to suggest the limited usefulness of the philosophy of science for our enterprise. The flavor of the article can be quickly grasped: “Following the Lakatosian route out of Kuhn (through Popper) into the ‘hard core’ of ‘general theory,’ theoretical realism thus allows Kiser and Hechter to accomplish by theoretical fiat that which has for centuries confounded philosophers…” (page 748). Goldstone (1998) provides some useful correctives, and Skocpol and Somers (1980) provide a better defense of historical sociology. See also Kiser and Hechter (1998). As we shall see later, Somers also fails to understand the way that mechanisms can provide an explanation.

Indeed, we believe that a really good causal inference should satisfy the requirements of all four theories. Causal inferences will be stronger to the extent that they are based upon finding all the following.

(1) Constant conjunction of causes and effects required by the neo-Humean theory.
(2) No effect when the cause is absent in the most similar world to where the cause is present as required by the counterfactual theory.
(3) An effect after a cause is manipulated.
(4) Activities and processes linking causes and effects required by the mechanism theory.

The claim that smoking causes lung cancer, for example, first arose in epidemiological studies that found a correlation between smoking and lung cancer. These results were highly suggestive to many, but this correlational evidence was insufficient to others (including one of the founders of modern statistics, R. A. Fisher). These studies were followed by experiments that showed that, at least in animals, the absence of smoking reduced the incidence of cancer compared to the incidence with smoking when similar groups were compared. But animals, some suggested, are not people. Other studies showed that when people stopped smoking (that is, when the putative cause of cancer was manipulated) the incidence of cancer went down as well. Finally, recent studies have uncovered biological mechanisms that explain the link between smoking and lung cancer.
Taken together the evidence for a relationship between smoking and lung cancer now seems overwhelming. The remainder of this chapter explains the four theories in much more detail. Before providing this detail, we first define the notion of counterfactual which crops up again and again in discussions of causality. Then we briefly discuss the nature of psychological, ontological, and epistemological arguments regarding causality in order to situate our own efforts and to develop a language for thinking about causality. The central part of the chapter elaborates upon the four theories in Table 1. Then a well-known statistical approach to causation proposed by Jerzy Neyman, Don Rubin, and Paul Holland is discussed in light of the four theories. The chapter ends with a discussion of causation and explanation. Counterfactuals Causal statements are so useful that most people cannot let an event go by without asking why it happened and offering their own “because”. They often enliven these discussions with counterfactual assertions such as “If the cause had not occurred, then the effect would not have happened.” A counterfactual is a statement, typically in the subjunctive mood, in which a false or “counter to fact” premise is followed by some assertion about what would have happened if the premise were true. For example, if someone drank too much and got in a terrible accident, then a counterfactual assertion might be “if he had not drunk five scotch and sodas, he would not have had that terrible accident.” The statement uses the subjunctive (“if he had not drunk...., he would not have had”), and the premise is counter to the facts. The premise is false because the person did drink five scotch and sodas in the real world as it unfolded. The counterfactual claim is that without this drinking, the world would have proceeded differently, and he would not have had the terrible accident. Is this true? 
The truth of counterfactuals is closely related to the existence of causal relationships. The counterfactual claim made above implies that there is a causal link between drinking five scotch and sodas (the cause X) and the terrible accident (the effect Y). The counterfactual, for example, would be true if drinking caused the person to drive recklessly and to hit another car on the way home because of his drinking. Therefore, if the person had not drunk five scotch and sodas then he would not have driven recklessly and the accident would not have occurred – thus demonstrating the truth of the counterfactual “if he had not drunk five scotch and sodas he would not have had that terrible accident.”

Another way to think about this is to simply ask what would have happened in the most similar world in which the person did not drink the five scotch and sodas. Would the accident still have happened? One way to do this would be to rerun the world with the cause eradicated so that no scotch and sodas are drunk. The world would otherwise be the same. If the accident does not occur, then we would say that the counterfactual is true. Thus, the statement that drinking caused the accident is essentially the same as saying that in the most similar world in which drinking did not occur, the accident did not occur either. The existence of a causal connection can be checked by determining whether or not the counterfactual would be true in the most similar possible world where its premise is true.

The problem, of course, is defining the most similar world and finding evidence for what would happen in it. Consider the problem of definition first. Suppose, for example, that in the real world the person drank five scotch and sodas, got on a bus, and the bus driver got into a terrible accident which injured the drinker. In this case, the most similar world, in which the (sober) person got on the bus, would still have led to the terrible accident which hurt the drinker in the other world. Drinking could not be held responsible for the accident in this case. Or could it? What is the most similar world? Would the person who took the bus when he got drunk have taken the bus if he had not gotten drunk? Would he have driven home instead? If he had driven home, wouldn’t he have avoided the accident on the bus? Which is the most similar world, the one in which the person takes the bus or takes an automobile? This is a difficult question.

Beyond these definitional questions about most similar worlds, there is the problem of finding evidence for what would happen in the most similar world. We cannot rerun the world so that the person does not drink five scotch and sodas. What can we do? Many philosophers have wrestled with this question, and we discuss the problem in detail later in the section on the counterfactual theory of causation.6 For now, we merely note that people act as if they can solve this problem because they assert the truth of counterfactual statements all the time.

6 Standard theories of logic cannot handle counterfactuals because propositions with false premises are automatically considered true, which would mean that all counterfactual statements, with their false premises, would be true, regardless of whether or not a causal link existed. Modal logics, which try to capture the nature of necessity, possibility, contingency, and impossibility, have been developed for counterfactuals (Lewis, 1973). These logics typically judge the truthfulness of the counterfactual on whether or not the statement would be true in the most similar possible world where the premise is true. Problems arise, however, in defining the most similar world. These logics, by the way, typically broaden the definition of counterfactuals to include statements with true premises for which they consider the closest possible world to be the actual world so that their truth value is judged by whether or not their conclusion is true in the actual world.
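The contrast drawn in footnote 6 between standard logic and a closest-world reading can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the three world descriptions, the "keeps_habits" fact, the similarity weights, and all function names are hypothetical and introduced here; they are not part of Lewis's formal system or of this paper. The sketch shows that the material conditional is true whenever its premise is false, and that a simplified closest-world evaluation of the bus example gives different answers depending on how similarity is defined.

```python
from typing import Callable, Dict, List

World = Dict[str, bool]


def material_conditional(premise: bool, conclusion: bool) -> bool:
    """Classical 'if P then Q': false only when P is true and Q is false."""
    return (not premise) or conclusion


def weighted_distance(weights: Dict[str, float]) -> Callable[[World, World], float]:
    """Build a distance that sums the weights of the facts on which two worlds differ."""
    def distance(a: World, b: World) -> float:
        return sum(w for fact, w in weights.items() if a[fact] != b[fact])
    return distance


def closest_world_counterfactual(
    actual: World,
    candidates: List[World],
    premise: Callable[[World], bool],
    conclusion: Callable[[World], bool],
    distance: Callable[[World, World], float],
) -> bool:
    """Evaluate the conclusion in the candidate world nearest to the actual
    world among those where the premise holds (a simplified closest-world rule)."""
    premise_worlds = [w for w in candidates if premise(w)]
    if not premise_worlds:
        raise ValueError("no candidate world satisfies the premise")
    closest = min(premise_worlds, key=lambda w: distance(actual, w))
    return conclusion(closest)


def no_drinking(w: World) -> bool:
    return not w["drank"]


def no_accident(w: World) -> bool:
    return not w["accident"]


# Hypothetical worlds for the bus example (assumed facts, for illustration only).
actual = {"drank": True, "took_bus": True, "keeps_habits": True, "accident": True}
sober_bus = {"drank": False, "took_bus": True, "keeps_habits": False, "accident": True}
sober_drive = {"drank": False, "took_bus": False, "keeps_habits": True, "accident": False}
candidates = [sober_bus, sober_drive]

# Standard logic: the premise is false in the actual world, so the
# conditional is true no matter what the conclusion says.
vacuous = material_conditional(no_drinking(actual), no_accident(actual))

# Two assumed similarity metrics: keep the route fixed, or keep habits fixed.
by_route = closest_world_counterfactual(
    actual, candidates, no_drinking, no_accident, weighted_distance({"took_bus": 1.0})
)
by_habits = closest_world_counterfactual(
    actual, candidates, no_drinking, no_accident, weighted_distance({"keeps_habits": 1.0})
)

print(vacuous, by_route, by_habits)
```

Under these assumed facts, the metric that holds the route fixed makes the counterfactual false (the sober rider is still on the bus), while the metric that holds habits fixed makes it true (the sober person drives home). The disagreement is the definitional problem described above, not something the code resolves.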
In everyday conversation, counterfactuals serve many purposes. They are sometimes offered as explanatory laments7 such as “if he had not had that drink, he wouldn’t have had that terrible accident” or as sources of guidance for the future such as when, after a glass breaks at the dinner table, we admonish the miscreant that “if you had not reached across the table, then the glass would not have broken.” For social scientists, discussions of counterfactuals are serious attempts to understand the mainsprings of history, and each of the following (contentious) counterfactuals, which is parallel to the causal assertions listed earlier, suggests an explanation about why things turned out as they did and why they might have turned out differently:

“If the money supply had been increased more, then the economy would have grown more.”
“If Stalin had not succeeded Lenin, then the Soviet Union would have been more democratic.”
“If there had not been the Protestant Reformation, then capitalism would not have developed in the West.”
“If welfare recipients were required to meet strict work requirements, then they would get off welfare faster.”
“If Italy used plurality voting with single member districts instead of proportional representation with multi-member districts, then there would be fewer small parties and less government instability.”
“If the butterfly ballot had not been so confusing, Al Gore would have won the 2000 election.”

These counterfactuals, if true, provide us with a better understanding of these events and an ability to think about how we might change outcomes in the future. But their truth depends upon the validity of their implicit causal assertions.

Exploring Three Basic Questions about Causality

Causality is at the center of explanation and understanding, but what, exactly, is it? And how is it related to counterfactual thinking?
Somewhat confusingly, philosophers mingle psychological, ontological, and epistemological arguments when they discuss causality. Those not alerted to the different purposes of these arguments may find philosophical discussions perplexing as they move from one kind of discussion to another. Our primary focus is epistemological. We want to know when causality is truly operative, not just when some psychological process leads people to believe that it is operative. And we do not care much about metaphysical questions regarding what causality really is, although such ontological considerations become interesting to the extent that they might help us discover causal relationships.

7 Roese, Sanna, and Galinsky (2002) show that counterfactuals are often activated by negative affect (e.g., losses in the stock market, failure to achieve a goal) as well as being intentionally invoked to plan for the future.

Psychological and Linguistic Analysis – Although our primary focus is epistemological, our everyday understanding, and even our philosophical understanding, of causality is rooted in the psychology of causal inference. Perhaps the most famous psychological analysis is David Hume’s investigation of what people mean when they refer to causes and effects. Hume (1711-1776) was writing at a time when the pre-eminent theory of causality was the existence of a necessary connection – a kind of “hook” or “force” – between causes and their effects so that a particular cause must be followed by a specific effect. Hume looked for the feature of causes that guaranteed their effects. He argued that there was no evidence for the necessity of causes because all we could ever find in events was the contiguity, precedence, and regularity of cause and effect. There was no evidence for any kind of hook or force. He described his investigations as follows in his Treatise of Human Nature (1739):

“What is our idea of necessity, when we say that two objects are necessarily connected together?
.... I consider in what objects necessity is commonly supposed to lie; and finding that it is always ascribed to causes and effects, I turn my eye to two objects supposed to be placed in that relation, and examine them in all the situations of which they are susceptible. I immediately perceive that they are contiguous in time and place, and that the object we call cause precedes the other we call effect. In no one instance can I go any further, nor is it possible for me to discover any third relation betwixt these objects. I therefore enlarge my view to comprehend several instances, where I find like objects always existing in like relations of contiguity and succession. The reflection on several instances only repeats the same objects; and therefore can never give rise to a new idea. But upon further inquiry, I find that the repetition is not in every particular the same, but produces a new impression, and by that means the idea which I at present examine. For, after a frequent repetition, I find that upon the appearance of one of the objects the mind is determined by custom to consider its usual attendant, and to consider it in a stronger light upon account of its relation to the first object. It is this impression, then, or determination, which affords me the idea of necessity.” (Hume (1739), 1978, page 155).8

Thus for Hume the idea of necessary connection is a psychological trick played by the mind that observes repetitions of causes followed by effects and then presumes some connection that goes beyond that regularity. For Hume, the major feature of causation, beyond temporal precedence and contiguity, is simply the regularity of the association of causes with their effects, but there is no evidence for any kind of hook or necessary connection between causes and effects.9

8 In the Enquiry (1748, pages 144-45) which is a later reworking of the Treatise, Hume says: “So that, upon the whole, there appears not, throughout all nature, any one instance of connexion, which is conceivable by us. All events seem entirely loose and separate. One event follows another; but we never can observe any tye between them. They seem conjoined, but never connected. And as we can have no idea of any thing, which never appeared to our outward sense or inward sentiment, the necessary conclusion seems to be, that we have no idea of connexion or power at all, and that these words are absolutely without meaning, when employed either in philosophical reasonings, or common life.... This connexion, therefore, we feel in the mind, this customary transition of the imagination from one object to its usual attendant, is the sentiment or impression, from which we form the idea of power or necessary connexion.”

The Humean analysis of causation became the predominant perspective in the nineteenth and most of the twentieth century, and it led in two directions, both of which focused upon the logical form of causal statements. Some, such as the physicist Ernst Mach, the philosopher Bertrand Russell, and the statistician/geneticist Karl Pearson, concluded that there was nothing more to causation than regularity so that the entire concept should be abandoned in favor of functional laws or measures of association such as correlation which summarized the regularity.10 Others, such as the philosophers John Stuart Mill (1888), Carl Hempel (1965), and Tom Beauchamp and Alexander Rosenberg (1981), looked for ways to strengthen the regularity condition so as to go beyond mere accidental regularities. For them, true cause and effect regularities must be unconditional and follow from some lawlike statement. Their neo-Humean approach improved upon Hume’s theory, but, as we shall see, there appears to be no way to define lawlike statements in a way that captures all that we mean by causality.

What, then, do we typically mean by causality?
In their analysis of the fundamental metaphors used to mark the operation of causality, the linguist George Lakoff and the philosopher Mark Johnson (1980a,b, 1999) describe prototypical causation as “the manipulation of objects by force, the volitional use of bodily force to change something physically by direct contact in one’s immediate environment.” (1999, page 177) Causes bring, throw, hurl, propel, lead, drag, pull, push, drive, tear, thrust, or fling the world into new circumstances. These verbs suggest that causation is forced movement, and for Lakoff and Johnson the “Causation Is Forced Movement metaphor is in a crucial way constitutive of the concept of causation.” (page 187) Causation as forceful manipulation differs significantly from causation as the regularity of cause and effect because forceful manipulation emphasizes intervention, agency, and the possibility that the failure to engage in the manipulation will prevent the effect from happening. For Lakoff and Johnson, causes are forces and capacities that entail their effects in ways that go beyond mere regularity and that are reminiscent of the causal “hooks” rejected by Hume, although instead of hooks they emphasize manipulation, mechanisms, forces, and capacities.11

9 There are different interpretations of what Hume meant. For a thorough discussion see Beauchamp and Rosenberg (1981).

10 Bertrand Russell famously wrote that “the word ‘cause’ is so inextricably bound up with misleading associations as to make its complete extrusion from the philosophical vocabulary desirable.... The law of causality, like so much that passes muster among philosophers, is a relic of a bygone age, surviving like the monarchy, only because it is erroneously supposed to do no harm.” (Russell, 1918). Karl Pearson rejected causation and replaced it with correlation: “Beyond such discarded fundamentals as ‘matter’ and ‘force’ lies still another fetish amidst the inscrutable arcana of even modern science, namely the category of cause and effect. Is this category anything but a conceptual limit to experience, and without any basis in perception beyond a statistical approximation?” (Pearson, 1911, page vi) “It is this conception of correlation between two occurrences embracing all relationship from absolute independence to complete dependence, which is the wider category by which we have to replace the old idea of causation.” (Pearson, 1911, page 157).

11 As we shall show, two different theories of causation are conflated here. One theory emphasizes agency and manipulation. The other theory emphasizes mechanisms and capacities. The major difference is the locus of the underlying force that defines causal relationships. Agency and manipulation theories emphasize human intervention. Mechanism and capacity theories emphasize processes within nature itself.

“Causation as regularity” and “causation as manipulation” are quite different notions, but each carries with it some essential features of causality. And each is the basis for a different philosophical or everyday understanding of causality. From a psychological perspective, their differences emerge clearly in research done in the last fifteen years on the relationship between causal and counterfactual thinking (Spellman and Mandel, 1999). Research on this topic demonstrates that people focus on different factors when they think causally than when they think counterfactually. In experiments, people have been asked to consider causal attributions and counterfactual possibilities in car accidents in which they imagine that they chose a new route to drive home and were hit by a drunk driver. People’s causal attributions for these accidents tend to “focus on antecedents that general knowledge suggest would covary with, and therefore predict, the outcome (e.g., the drunk driver),” but counterfactual thinking focuses on controllable antecedents such as the choice of route (Spellman and Mandel, 1999, page 123). Roughly speaking, causal attributions are based upon a regularity theory of causation while counterfactual thinking is based upon a manipulation theory of causation. The regularity theory suggests that drunken drivers typically cause accidents, but the counterfactual theory suggests that in this instance the person’s choice of a new route was the cause of the accident because it was manipulable by the person. The logic of causal and the logic of counterfactual thinking are so closely related that these psychological differences in attributions lead to the suspicion that both the regularity and the manipulation theory tell us something important about causation.

This psychological research also reminds us that causes are defined in relation to what the philosopher John Mackie calls a “causal field” of other factors and that what people choose to consider the cause of an event depends upon how they define the causal field. Thus, an unfortunate person who lights a cigarette in a house which ignites a gas leak and causes an explosion will probably consider the causal field to be a situation where lighting a cigarette and no gas leak is the norm, hence the gas leak will be identified as the cause of the explosion. But an equally unfortunate person who lights a cigarette at a gas station which causes an explosion will probably consider lighting the cigarette to be the cause of the explosion and not the fact that gas fumes were present at the station.12 Similarly, a political scientist who studies great power politics may consider growing instability in the great power system to be the cause of World War I because a stable system could have weathered the assassination of Archduke Ferdinand, but an historian who studies the impact of assassination on historical events might argue that World War I was a prime example of how assassinations can cause bad consequences such as a world war. As Mackie notes, both are right, but “What is said to be caused, then, is not just an event, but an event-in-a-certain-field, and some ‘conditions’ can be set aside as not causing this-event-in-this-field simply because they are part of the chosen field, though if a different field were chosen, in other words if a different causal question were being asked, one of those conditions might well be said to cause this-event-in-that-other-field.” (Mackie, 1974, page 35)

12 Legal wrangling over liability often revolves around who should be blamed for an accident where the injured party has performed some action in a causal field. The injured party typically claims that the action should have been anticipated and its effects mitigated or prevented by the defendant in that causal field and the defendant argues that the action should not have been taken by the plaintiff or could not have been anticipated by the defendant.

Those familiar with regression analysis in which multiple factors are said to cause an event might translate this result into the simple adage that some researchers look at one coefficient in a regression equation (that for the causal impact of assassinations) and other researchers look at another coefficient (that for the causal impact of instability in the great power system), but the lesson is larger than that. The historian interested in assassinations will collect and study cases of failed and successful assassinations and will measure their impact in terms of changes in governmental policies. These changes will include declarations of war, but they will include many other things as well such as the passage of the Civil Rights Act after the assassination of John F. Kennedy. The political scientist studying the balance among great powers will have an entirely different set of cases and probably measure outcomes such as declarations of war, alliances, embargoes, and other international actions. In terms of our earlier example involving cigarettes and gas, the historian is studying the consequences of cigarette smoking and the political scientist is interested in the consequences of the use of gas and gasoline. The lesson for the practicing researcher is that the same events can be studied and understood from many different perspectives and the researcher must think carefully about the causal field.
These investigations of everyday causal thinking are very suggestive, but there is ultimately no reason why the way people ordinarily use the concept of causation should suffice for scholarly inquiry, although we would surely be concerned if scholarly uses departed from ordinary ones without any clear reason. Ontological Questions – Knowing how most people think and talk about causality is useful, but we are ultimately more interested in knowing what causality actually is and how we would discover it in the world. These are respectively ontological and epistemological questions.13 As we shall see, these questions are quite separate but their answers are often closely intertwined. Ontological questions ask about the characteristics of the abstract entities that exist in the world. Queries about the definition of events, the existence of abstract properties, the nature of causality, and the existence of God are all ontological questions. The study of causality raises a number of fundamental ontological questions regarding the things that are causally related and the nature of the causal relation.14 13 Roughly speaking, philosophy is concerned with three kinds of questions regarding “what is” (ontology), “how it can be known” (epistemology), and “what value it has” (ethics and aesthetics). In answering these questions, twentieth century philosophy has also paid a great deal of attention to logical, linguistic, and even psychological analysis. 14 Symbolically, we can think of the causal relation as a statement XcY where X is a cause, Y is an effect, and c is a causal relation. X and Y are the things that are causally related and c is the causal relation. As we shall see later, this relationship is usually considered to be incomplete (not all X and Y are causally related), asymmetric for those events that are causally related (either XcY or YcX but not both), and irreflexive (XcX is not possible). 
What are the things, the “causes” and the “effects” that are linked by causation? Whatever they are, they must be the same things because causes can also be effects and vice-versa. But what are they? Are they facts, properties, events, or something else?15 The practicing researcher cannot ignore questions about the definition of events. Are “arm reaching,” “glasses breaking,” “Stalin succeeding Lenin,” “a Democratic USSR,” and “the butterfly ballot” all events? They certainly differ in size, complexity, duration, and other features. One of the things that researchers must consider is the proper definition of an event,16 and a great deal of the effort in doing empirical work is defining events suitably. Not surprisingly, tremendous effort has gone into defining wars, revolutions, firms, organizations, democracies, religions, participatory acts, political campaigns, and many other kinds of events and structures that matter for social science research. Much could be said about defining events, but we shall only emphasize that defining events in a useful fashion is one of the major tasks of good social science research. A second basic set of ontological questions concerns the nature of the causal relationship. Is causality different when it deals with physical phenomena (e.g., billiard balls hitting one another or planets going around stars) than when it deals with social phenomena (democratization, business cycles, cultural change, elections) that are socially constructed?17 What role do human agency and mental events play in causation?18 What can we say about the time structure and nature of causal processes?19 Once again, there are real philosophical issues here, but we shall elide most of them because it would take us too far afield to deal with each one of them and because we are concerned with those situations where researchers want to determine causality. 
15 Events are located in space and time (e.g., “the WWI peace settlement at Versailles”) but facts are not (“The fact that the WWI peace settlement was at Versailles”). For discussions of causality and events see Bennett (1988) and for causality and facts see Mellor (1995). Many philosophers prefer to speak of “tropes” which are particularized properties (Ehring, 1997). Some philosophers reject the idea that the world can be described in terms of distinct events or tropes and argue for events as enduring things (Harre and Madden, 1975, Chapter 6).
16 A potpourri of citations that deal with the definition of events and social processes are Abbott (1983, 1992, 1993), Pierson (2002), Riker (1957), Tilly (1984).
17 For representative discussions see Durkheim (19xx), Berger and Luckmann (1966), von Wright (1971), Elster (19xx, nuts and bolts), Searle (1995), Wendt (1999).
18 See Dilthey (19xx), von Wright (1971, Chapter 1), Davidson (19xx), Elster (nuts and bolts), Searle (19xx), Wendt (1999).
19 In a vivid set of metaphors, Pierson (2002) compares different kinds of social science processes with tornadoes, earthquakes, large meteorites, and global warming in terms of the time horizon of the cause and the time horizon of the impact. He shows that the causal processes in each situation are quite different.
Our general attitude is that social science is about the formation of concepts and the identification of causal mechanisms. We believe that social phenomena such as the Protestant ethic, the system of nation-states, and culture exist and have causal implications. We also believe that reasons, perceptions, beliefs, and attitudes affect human behavior. Furthermore, we believe that these things can be observed and measured. We are prepared to defend these assertions in the abstract, but our focus here is on the methods that practicing researchers should use to study these things. 
Nevertheless, as we shall show, getting a grip on causality requires researchers to have a detailed understanding of the kinds of mechanisms that could link one event with another. Researchers must think about what these processes might be and how they operate. Perhaps most importantly, researchers must think very hard about the nature of human action. Another basic question about the causal relation is whether it is deterministic or probabilistic. The classic model of causation is the deterministic, clockwork Newtonian universe in which the same initial conditions inevitably produce the same outcome. But modern science has produced many examples where causal relationships appear to be probabilistic. The most famous is quantum mechanics where the position and momentum of particles are represented by probability distributions, but many other sciences rely upon probabilistic relationships. Geneticists, for example, do not expect that couples in which all the men have the same height and all the women have the same height will have children of the same height. In this case, the same set of (observed) causal factors produces a probability distribution over possible heights. We now know that even detailed knowledge of the couple’s DNA would not lead to exact predictions. Probabilistic causation, therefore, seems possible in the physical sciences, common in the biological sciences, and pervasive in the social sciences. Nevertheless, following the custom of a great deal of philosophical work, we shall start with a discussion of deterministic causation in order not to complicate the analysis. Epistemological Questions – Epistemology is concerned with how we can obtain intellectually certain knowledge (what the Greeks called “episteme”) and how we can identify and learn about causality. How do we figure out that X really caused Y? 
At the dinner table, our admonition not to reach across the table might be met with “I didn’t break the glass, the table shook,” suggesting that our causal explanation for the broken glass was wrong. How do we proceed in this situation? We would probably try to rule out alternatives by investigating whether someone shook the table, whether there was an earthquake, or something else happened to disturb the glass. The problem here is that there are many possibilities that must be ruled out, and what must be ruled out depends, to some extent, on our definition of causality. Learning about causality, then, requires that we know what it is and that we know how to recognize it when we see it. Simple Humean theories appear to solve both problems at once. Two events are causally related when they are contiguous, one precedes another, and they occur regularly in constant conjunction with one another. Once we have checked these conditions, we know that we have a causal connection. But upon examination, these conditions are not enough for causality because we would not say that night causes day, even though day and night are contiguous, night precedes day, and day and night are regularly associated. Furthermore, simple regularities like this do not make it easy to distinguish cause from effect – after all, day precedes night as well as night preceding day so that we could just as well, and just as mistakenly, say that day causes night. Something more is needed.20 It is this something more that causes most of the problems for understanding causation. John Stuart Mill suggested that there had to be an “unconditional” relationship between cause and effect and modern neo-Humeans have required a “lawlike” relationship, but even if we know what this means21 (which would solve the ontological problem of causation) it is hard to ensure that it is true in particular instances so as to solve the epistemological problem. In fact, it is possible, just as we might know what the perfect surfing wave should be without knowing how or where to find it, that we can know what causality is without knowing how or where to find it. We might have solved the ontological problem without solving the epistemological problem. Or, just as we might be able to bake an excellent souffle without being able to describe how it rises, we might be able to determine causality without really knowing what it is. We might just have a recipe for finding it. In this case, we would have solved the epistemological problem without solving the ontological problem. To make things even more complicated, some people might argue that the solution to the epistemological problem is the solution to the ontological one – a souffle is simply the recipe for it. Or alternatively, they might argue that the solution to the ontological problem indicates that there can be no solution to the epistemological one – we can know what causality is, but we can never establish it. In the following sections, we begin with a review of four theories of what causality might be. We spend most of our time on a counterfactual definition, mostly amounting to a recipe, that is now widely used in statistics. We end with a discussion of the limitations of the recipe and how far it goes towards solving the epistemological and ontological problems.
20 Something different might also be needed. Hume himself dropped the requirement for contiguity in his 1748 rewrite of his 1738 work, and many philosophers would also drop his requirement for temporal precedence.
21 Those new to this literature are presented with many statements about the need for lawfulness and unconditionality which seem to promise a recipe that will ensure lawfulness. But the conditions that are presented always seem to fall short of the goal.
Humean and Neo-Humean Theories of Causation
Lawlike Generalities and the Humean Regularity Theory of Causation – Humean and neo-Humean theories propose logical conditions that must hold for the constant conjunction of events to justify the inference that they have a cause-effect relationship. Specifically, Humeans have explored whether a cause must be sufficient for its effects, necessary for its effects, or something more complicated. The classic definition shared by Hume, John Stuart Mill, and many others was that “X is a cause of Y if and only if X is sufficient for Y.” That is, the cause must always and invariably lead to the effect. Certainly an X that is sufficient for Y can be considered a cause, but what about the many putative causes that are not sufficient for their effect? Striking a match, for example, may be necessary for it to light, but it may not light unless there is enough oxygen in the atmosphere. Is striking a match never a cause of a match lighting? This leads to an alternative definition in which “X is a cause of Y if and only if X is necessary for Y.” Under this definition, it is assumed that the cause (such as striking the match) must be present for the effect to occur, but it may not always be enough for the effect to actually occur (because there might not be enough oxygen). But how many causes are even necessary for their effects? If the match does not light after striking it, someone might use a blowtorch to light it so that striking the match is not even necessary for the match to ignite. Do we therefore assume that striking the match is never a cause of its lighting? 
Necessity and sufficiency seem unequal to the task of defining causation.22 These considerations led John Mackie to propose a set of conditions requiring that a cause be an insufficient [I] but necessary [N] part of a condition which is itself unnecessary [U] but exclusively sufficient [S] for the effect. These INUS conditions can be explained by an example. Consider two ways that the effect (E), which is a building burning down, might occur. (See Figure 1.) In one scenario the wiring might short-circuit and overheat, thus causing the wooden framing to burn. In another, a gasoline can might be next to a furnace that ignites and causes the gasoline can to explode. A number of factors here are INUS conditions for the building to burn down. The short circuit (C) and the wooden framing (W) together might cause the building to burn down, or the gasoline can (G) and the furnace (F) might cause the building to burn down. Thus, C and W together are exclusively sufficient [S] to burn the building down, and G and F together are exclusively sufficient [S] to burn the building down. Furthermore, the short circuit and wooden framing (C&W) are unnecessary [U], and the gasoline can and the furnace (G&F) are unnecessary [U] because the building could have burned down with just one or the other combination of factors. Finally, C, W, G, or F alone is insufficient [I] to burn the building down even though C is necessary [N] in conjunction with W (or vice-versa) and G is necessary [N] in conjunction with F (or vice-versa). This formulation allows for the fact that no single cause is sufficient or necessary, but when experts say that a short-circuit caused the fire they “... 
are saying, in effect that the short-circuit (C) is a condition of this sort, that it occurred, that the other conditions (W) which, conjoined with it, form a sufficient condition were also present, and that no other sufficient condition (such as G&F) of the house’s catching fire was present on this occasion.” (Mackie, 1965, page 245, letters added).
Figure 1
C & W ------->
               E: Burning Building
G & F ------->
From the perspective of a practicing researcher, three lessons follow from the INUS conditions. First, a putative cause such as C might not cause the effect E because G&F might be responsible. Hence, the burned down building (E) will not always result from a short circuit (C) even though C could cause the building to burn down. Second, interactions among causes may be necessary for any one cause to be sufficient (C and W require each other and G and F require each other). Third, the relationship between any INUS cause and its effect might appear to be probabilistic because of the other INUS causes. In summary, the INUS conditions suggest the multiplicity of causal pathways and causes, the possibility of conjunctural causation (Ragin, 1987), and the likelihood that social science relationships will appear probabilistic even if they are deterministic.23
22 And there are problems such as the following favorite of the philosophers: “If two bullets pierce a man’s heart simultaneously, it is reasonable to suppose that each is an essential part of a distinct sufficient condition of the death, and that neither bullet is ceteris paribus necessary for the death, since in each case the other bullet is sufficient.” (Sosa and Tooley, pages 8-9).
[This section until the next asterisks might be put in an appendix.]
**********************
A specific example might help to make these points clearer. Assume that the four INUS factors mentioned above, C, W, G, and F, occur independently of one another and that they are the only factors which cause fires in buildings. 
Further assume that short circuits (C) occur 10% of the time, wooden (W) frame buildings 50% of the time, furnaces (F) 90% of the time, and gasoline (G) cans near furnaces 10% of the time. Because these events are assumed independent of one another, it is easy to calculate that C and W occur 5% of the time and that G and F occur 9% of the time. (We simply multiply the probability of the two independent events.) All four conditions occur 0.45% of the time. (The product of all four percentages.) Thus, fires occur 13.55% of the time. This percentage includes the cases where the fire is the result of C and W (5% of the time) and where it is the result of G and F (9% of the time), and it adjusts downward for double-counting that occurs in the cases where all four INUS conditions occur together (0.45% of the time). Now suppose an experimenter did not know about the role of wooden frame buildings or gasoline cans and furnaces and only looked at the relationship between fires and short-circuits. A crosstabulation of fires with the short-circuit factor would yield Table 2. As assumed above, short circuits occur 10% of the time (see the third column total at the bottom of the table) and as calculated above, fires occur 13.55% of the time (see the third row total on the far right). The entries in the interior of the table are calculated in a similar way.24

Table 2 – Fires by Short Circuits in Hypothetical Example (Total Percentages of each Event)

                     Not C – No short circuits   C – Short Circuits   Row Totals
Not E – No fires     81.90%                      4.55%                86.45%
E – Fires            8.10%                       5.45%                13.55%
Column Totals        90.00%                      10.00%               100.00%

23 These points are made especially forcefully in Marini and Singer (1988).
24 Thus, the entry for short circuits and fires comes from the cases where there are short-circuits and wooden frame buildings (5% of the time) and where there are short-circuits and no wooden frame buildings but there are gasoline cans and furnaces (5% times 9%).
Even though each case occurs because of a deterministic process – either a short-circuit and a wooden frame building or a gasoline can and a furnace (or both) – this cross-tabulation suggests a probabilistic relationship between fires and short-circuits. In 4.55% of the cases, short circuits occur but no fires result because the building was not wooden (and no gasoline can was near a furnace). In 8.10% of the cases, there are no short circuits, but a fire occurs because the gasoline can has been placed near the furnace. For this table, a standard measure of association, the Pearson correlation, between the effect and the cause is about .40 which is far short of the 1.0 required for a perfect (positive) relationship. If, however, the correct model is considered in which there are the required interaction effects, the relationship will produce a perfect fit.25 Thus, a misspecification of a deterministic relationship can easily lead a researcher to think that there is a probabilistic relationship between the cause and effect.
*****************************************************************************
INUS conditions reveal a lot about the complexities of causality, but as a definition of it, they turn out to be too weak – they do not rule out situations where there are common causes, and they do not exclude accidental regularities. The problem of common cause arises in a situation where, for example, lightning strikes (L) the wooden framing (W) and causes it to burn (E) while also causing a short in the circuitry (C). That is, L –> E and L –> C (where the arrow indicates causation). 
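The arithmetic of the hypothetical fire example (Table 2, the correlation of about .40, and the perfect fit of the interaction model) can be checked by brute-force enumeration. The following is only a minimal sketch under the text's stated assumptions: the four factors are independent with the probabilities given, and a fire occurs exactly when C and W, or G and F, are present. The function names are illustrative and do not come from the paper.

```python
from itertools import product
from math import sqrt

# Marginal probabilities taken from the hypothetical example in the text.
P = {"C": 0.10, "W": 0.50, "G": 0.10, "F": 0.90}


def fire(c, w, g, f):
    """Deterministic rule: a fire occurs iff (C and W) or (G and F)."""
    return bool((c and w) or (g and f))


def joint_table():
    """Return {(c, e): probability} by enumerating all 16 factor combinations."""
    table = {(c, e): 0.0 for c in (0, 1) for e in (0, 1)}
    for c, w, g, f in product((0, 1), repeat=4):
        prob = 1.0
        for name, present in zip("CWGF", (c, w, g, f)):
            prob *= P[name] if present else 1.0 - P[name]
        table[(c, int(fire(c, w, g, f)))] += prob
    return table


def phi(table):
    """Pearson correlation between the 0/1 indicators for C and E."""
    p_c = table[(1, 0)] + table[(1, 1)]
    p_e = table[(0, 1)] + table[(1, 1)]
    covariance = table[(1, 1)] - p_c * p_e
    return covariance / sqrt(p_c * (1 - p_c) * p_e * (1 - p_e))


def interaction_model_is_exact():
    """Check that E = C*W + G*F - C*W*G*F holds in every one of the 16 cases."""
    return all(
        int(fire(c, w, g, f)) == c * w + g * f - c * w * g * f
        for c, w, g, f in product((0, 1), repeat=4)
    )


if __name__ == "__main__":
    table = joint_table()
    for (c, e), prob in sorted(table.items()):
        print(f"C={c} E={e}: {100 * prob:.2f}%")
    print(f"phi = {phi(table):.3f}")
    print("interaction model exact:", interaction_model_is_exact())
```

The identity checked in `interaction_model_is_exact` is one way to read footnote 25: once the products C·W, G·F, and C·W·G·F are available as regressors, the 0/1 outcome is reproduced exactly, so nothing probabilistic remains.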
If lightning always causes a short in the circuitry, but the short never has anything to do with a fire in these situations because the lightning starts the fire directly through its heating of the wood, we will nevertheless always find that C and E are constantly conjoined through the action of the lightning, suggesting that the short circuit caused the fire even though the truth is that lightning is the common cause of both.26 In some cases of common causes such as the rise in barometric pressure followed by the arrival of a storm, common sense tells us that the putative cause (the rise in barometric pressure) cannot be the real cause of the thunderstorm. But in the situation with the lightning, the fact that short circuits have the capacity to cause fires makes it less likely that we will realize that lightning is the common cause of both the short-circuits and the fires. We might be better off in the case where the lightning split some of the wood framing of the house instead of causing a short-circuit. In that case, we would probably reject the fantastic theory that split wood caused the fire because split wood does not have the capacity to start a fire, but the Humean theory would be equally confused by both situations because it could not appeal, within the ambit of its understanding, to causal capacities. For a Humean, the constant conjunction of split wood and fires suggests causation as much as the constant conjunction of short-circuits and fires. Indeed, the constant conjunction of storks and babies would be treated as probative of a causal connection. Attempts to fix-up these conditions usually focus on trying to require “lawlike” statements that are unconditionally true, not just accidentally true. Since it is not unconditionally true that splitting wood causes fires, the presumption is that some such conditions can be found to rule-out this explanation. 
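The common-cause problem in the lightning example can be illustrated with a toy model. This is a sketch under loud assumptions: the world is fully deterministic, lightning (L) produces both the short circuit (C) and the fire (E), and C has no effect on E. The `forced_short` parameter is a hypothetical device, not something in the text, for setting the short circuit directly so that the regularity can be contrasted with what happens when C varies independently of L.

```python
def world(lightning, forced_short=None):
    """Toy common-cause world: L -> C and L -> E, with no link from C to E.

    forced_short is a hypothetical knob (not in the text) that sets the
    short circuit directly, regardless of lightning.
    """
    short_circuit = lightning if forced_short is None else forced_short
    fire = lightning  # the fire depends on lightning alone
    return short_circuit, fire


def constantly_conjoined(cases):
    """True when C and E are both present or both absent in every case."""
    return all(short == fire for short, fire in cases)


# Passive observation: C and E always appear together.
observed = [world(lightning) for lightning in (True, False, True, True, False)]

# Forcing a short circuit with and without lightning breaks the pattern.
forced = [world(lightning, forced_short=True) for lightning in (True, False)]

if __name__ == "__main__":
    print("constant conjunction when observed:", constantly_conjoined(observed))
    print("constant conjunction when forced:", constantly_conjoined(forced))
```

In the observed cases the regularity is perfect, which is all a purely Humean test can see; only the forced cases reveal that the short circuit contributes nothing to the fire in this toy world.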
Unfortunately, no set of conditions seems to be successful.27 Although the regularity theory identifies a necessary condition for describing causation, it basically fails because association is not causation and there is no reason why purely logical restrictions on lawlike statements should be sufficient to characterize causal relationships. Part of the problem is that there are many different types of causal laws and they do not fit any particular patterns. For example, one restriction that has been proposed to ensure lawfulness is that lawlike statements should either not refer to particular situations or they should be derivable from laws that do not refer to particular situations. This would mean that Kepler’s first “law” about all planets moving in elliptical orbits around the sun (a highly specific situation!) was not a causal law before Newton’s laws were discovered, but it was a causal law after it was shown that it could be derived from Newton’s laws. But Kepler’s laws were always considered causal laws, and there seems to be no reason to rest their lawfulness on Newton’s laws. Furthermore, by this standard, almost all social science and natural science laws (e.g., plate tectonics) are about particular situations. In short, logical restrictions on the form of laws do not seem sufficient to characterize causality.
25 If each variable is scored zero or one depending upon whether the effect or cause is present or absent, then a regression equation of the effect on the product (or interaction) of C and W, the product of G and F, and the product of C, W, G, and F will produce a multiple correlation of one indicating a perfect fit.
26 It is also possible that the lightning’s heating of the wood is (always or sometimes) insufficient to cause the fire (not L–>E), but its creation of a short-circuit (L–>C) is (always or sometimes) sufficient for the fire (C–>E). In this case, the lightning is the indirect cause of the fire through its creation of the short circuit. That is, L –> C –> E.
27 For some representative discussions of the problems see (Harre and Madden, 1975, Chapter 2; Salmon, 1990, Chapters 1-2; Hausman, 1998, Chapter 3). Salmon (1990, page 15) notes that “Lawfulness, modal import [what is necessary, possible, or impossible], and support of counterfactuals seem to have a common extension; statements either possess all three or lack all three. But it is extraordinarily difficult to find criteria to separate those statements that do from those that do not.”
The Asymmetry of Causation – The regularity theory also fails because it does not provide an explanation for the asymmetry of causation. Causes should cause their effects, but INUS conditions are almost always symmetrical such that if C is an INUS cause of E, then E is also an INUS cause of C. It is almost always possible to turn around an INUS condition so that an effect is an INUS condition for its cause.28 One of the most famous examples of this problem involves a flagpole, the elevation of the sun, and the flagpole’s shadow. The law that light travels in straight lines implies that there is a relationship between the height of the flagpole, the length of its shadow, and the angle of elevation of the sun. When the sun rises, the shadow is long, at midday it is short, and at sunset it is long again. Intuition about causality suggests that the length of the shadow is caused by the height of the flagpole and the elevation of the sun. But, using INUS conditions, we can just as well say that the elevation of the sun is caused by the height of the flagpole and the length of the shadow. There is simply nothing in the conditions that precludes this fantastic possibility. The only feature of the Humean theory that provides for asymmetry is temporal precedence. If changes in the elevation of the sun precede corresponding changes in the length of the shadow, then we can say that the elevation of the sun causes the length of the shadow. And if changes in the height of the flagpole precede corresponding changes in the length of the shadow, we can say that the height of the flagpole causes the length of the shadow. But many philosophers reject making temporal precedence the determinant of causal asymmetry because it precludes the possibility of explaining the direction of time by causal asymmetry and it precludes the possibility of backwards causation. From a practical perspective, it also requires careful measures of timing that may be difficult in a particular situation.
28 *** Insert a footnote giving the source on the reversibility of INUS conditions.
Summary – This discussion reveals two basic aspects of the causal relation. One is a symmetrical form of association between cause and effect and the other is an asymmetrical relation in which causes produce effects but not the reverse. The Humean regularity theory, in the form of INUS conditions, provides a necessary condition for the existence of the symmetrical relationship,29 but it does not rule out situations such as common cause and accidental regularities where there is no causal relationship at all. From a methodological standpoint, it can easily lead researchers to presume that all they need to do is to find associations, and it also leads to an underemphasis on the rest of the requirement for a “lawlike” or “unconditional” relationship because it does not operationally define what that would really mean. A great deal of what passes for causal modeling suffers from these defects (Freedman, 1987, 1991, 1997, 1999). The Humean theory does even less well with the asymmetrical feature of the causal relationship because it provides no way to determine asymmetry except temporal precedence. Yet there are many other aspects of the causal relation that seem more fundamental than temporal precedence. 
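The symmetry that troubles the regularity theory in the flagpole example can be made concrete. The sketch below assumes flat ground, a vertical pole, and a sun elevation strictly between 0 and 90 degrees, so that straight-line light gives shadow = height / tan(elevation). The function names are illustrative. The point is only that the same lawlike relation can be solved in either direction, so the relation itself does not say which quantity is the cause.

```python
import math


def shadow_length(height, elevation_deg):
    """Shadow length from pole height and sun elevation (0 < elevation < 90)."""
    return height / math.tan(math.radians(elevation_deg))


def sun_elevation(height, shadow):
    """Sun elevation in degrees, 'derived' from pole height and shadow length."""
    return math.degrees(math.atan2(height, shadow))


if __name__ == "__main__":
    height = 10.0
    for elevation in (10.0, 45.0, 80.0):
        shadow = shadow_length(height, elevation)
        recovered = sun_elevation(height, shadow)
        print(f"elevation {elevation:5.1f} -> shadow {shadow:7.2f} "
              f"-> elevation {recovered:5.1f}")
```

Both functions encode the same regularity, and each recovers the other's input. Nothing in the equation marks `shadow_length` as the causal direction and `sun_elevation` as the fantastic one; that judgment has to come from somewhere outside the regularity.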
Causes not only typically precede their effects, but they also can be used to explain effects or to manipulate effects while effects cannot be used to explain causes or to manipulate them.30 Effects also depend upon causes, but causes do not depend upon effects. Thus, if a cause does not occur, then the effect will not occur because effects depend on their causes. The counterfactual, “if the cause did not occur, then the effect would not occur” is true. However, if the effect does not occur, then the cause might still occur because causes can happen without leading to a specific effect if other features of the situation are not propitious for the effect. The counterfactual, “if the effect did not occur, then the cause would not occur” is not necessarily true. For example, where a short-circuit causes a wooden frame building to burn down, if the short-circuit does not occur, then the building will not burn down. But if the building does not burn down, it is still possible that the short-circuit occurred but its capacity for causing fires was neutralized because the building was made of brick. This dependence of effects on causes suggests that an alternative definition of causation might be based upon a proper understanding of counterfactuals. Counterfactual Definition of Causation In a book On the Theory and Method of History published in 1902, Eduard Meyer claimed that it was an “unanswerable and so an idle question” whether the course of history would have been different if Bismarck, then Chancellor of Prussia, had not decided to go to war in 1866. By some accounts, the Austro-Prussian-Italian War of 1866 paved the way for German and Italian unification (see, Wawro, 1997). 
In reviewing Meyer’s book in 1906, Max Weber agreed that “from the strict ‘determinist’ point of view” finding out what would have happened if Bismarck had not gone to war “was ‘impossible’ given the ‘determinants’ which were in fact present.” But he went on to say that “And yet, for all that, it is far from being ‘idle’ to raise the question what might have happened, if, for example, Bismarck had not decided for war. For it is precisely this question which touches on the decisive element in the historical construction of reality: the causal significance which is properly attributed to this individual decision within the totality of infinitely numerous ‘factors’ (all of which must be just as they are and not otherwise) if precisely this consequence is to result, and the appropriate position which the decision is to occupy in the historical account.” (Weber, 1978, 111). Weber’s review is an early discussion of the importance of counterfactuals for understanding history and making causal inferences. He argues forcefully that if “history is to raise itself above the level of a mere chronicle of noteworthy events and personalities, it can only do so by posing just such questions” as the counterfactual in which Bismarck did not decide for war.31
29 Probabilistic causes do not necessarily satisfy INUS conditions because an INUS factor might only sometimes produce an effect. Thus, the short-circuit and the wooden frame of the house might only sometimes lead to a conflagration in which the house is burned down. Introducing probabilistic causes would add still another layer of complexity to our discussion which would only provide more reasons to doubt the Humean regularity theory.
30 Hausman (1998, page 1) also catalogs other aspects of the asymmetry between causes and effects.
31 I am indebted to Richard Swedberg for pointing me towards Weber’s extraordinary discussion.
Lewis’s Counterfactual Theory of Causation – The philosopher David Lewis (1973b) has proposed the most elaborately worked out theory of how causality is related to counterfactuals.32 His theory requires the truth of two statements regarding two distinct events X and Y. Lewis starts from the presumption that X and Y have occurred so that the “counterfactual” statement:33 “If X were to occur, then Y would occur” is true. The truth of this statement is Lewis’s first condition for a causal relationship. Then he considers the truth of a second counterfactual:34 “If X were not to occur, then Y would not occur either.” If this is true as well, then he says that X causes Y. If, for example, Bismarck decided for war in 1866 and, as some historians argue, German unification followed because of his decision, then we must ask “If Bismarck had not decided for war, would Germany have remained divided?” The heart of Lewis’s theory is the set of requirements, described below, that he lays down for the truth of this kind of counterfactual.
32 Lewis finds some support for his theory in the work of David Hume. In a famous change of course in a short passage in his Enquiry Concerning Human Understanding (1748), Hume first summarized his regularity theory of causation by saying that “we may define a cause to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second,” and then he changed to a completely different theory of causation by adding “Or in other words, where if the first object had not been, the second had never existed.” (Enquiry, page 146) As many commentators have noted, these were indeed other words, implying an entirely different theory of causation. The first theory equates causality with the constant conjunction of putative causes and effects across similar circumstances. The second, which is a counterfactual theory, relies upon what would happen in a world where the cause did not occur.
33 Lewis considers statements like this as part of his theory of counterfactuals by simply assuming that statements in the subjunctive mood with true premises and true conclusions are true. As noted earlier, most theories of counterfactuals have been extended to include statements with true premises by assuming, quite reasonably, that they are true if their conclusion is true and false otherwise.
34 This is a simplified version of Lewis’s theory based upon Lewis (1973a,b; 1986) and Hausman (1998, Chapter 6).
Lewis’s theory has a number of virtues. It deals directly with singular causal events, and it does not require the examination of a large number of instances of X and Y. At one point in the philosophical debate about causation, it was believed that the individual cases such as “the hammer blow caused the glass to break” or “the assassination of Archduke Ferdinand caused World War I” could not be analyzed alone because these cases had to be subsumed under a general law (“hammer blows cause glass to break”) derived from multiple cases plus some particular facts of the situation in order to meet the requirement for a “lawlike” relationship. The counterfactual theory, however, starts with singular events and proposes that causation can be established without an appeal to a set of similar events and general laws regarding them.35 The possibility of analyzing singular causal events is important for all researchers, but especially for those doing case studies who want to be able to say something about the consequences of Stalin succeeding Lenin as head of the Soviet Union or the impact of the butterfly ballot on the 2000 election. 
The counterfactual theory also deals directly with the issue of X’s causal “efficacy” with respect to Y by considering what would happen if X did not occur. The problem with the theory is the difficulty of determining the truth or falsity of the counterfactual “If X were not to occur, then Y would not occur either.” The statement cannot be evaluated in the real world because X actually occurs so that the premise is false, and there is no evidence about what would happen if X did not occur. It only makes sense to evaluate the counterfactual in a world in which the premise is true. Lewis’s approach to this problem is to consider whether the statement is true in the possible world closest to the actual world in which X does not occur. Thus, if X is a hammer blow and Y is a glass breaking, then the closest possible world is one in which everything else is the same except that the hammer blow does not occur. If in this world, the glass does not break, then the counterfactual is true, and the hammer blow (X) causes the glass to break (Y). The obvious problem with this approach is identifying the closest possible world. If X is the assassination of Archduke Ferdinand and Y is World War I, is it true that World War I would not have occurred in the closest possible world where the bullet shot by the terrorist Gavrilo Princip did not hit the Archduke? Or would some other incident have inevitably precipitated World War I? And, to add to the difficulty, would this “World War I” be the same as the one that happened in our world? Lewis’s theory substitutes the riddle of determining the similarity of possible worlds for the neo-Humean theory’s problem of determining lawlike relationships. To solve these problems, both approaches must be able to identify similar causes and similar effects. The Humean theory must identify them across various situations in the real world.
This aspect of the Humean approach is closely related to John Stuart Mill’s “Method of Concomitant Variation” which he described as follows: “Whatever phenomenon varies in any manner, whenever another phenomenon varies in some similar manner, is either a cause or an effect of that phenomenon, or is connected to it through some fact of causation.” (Mill, 1888, page xxx)36 Lewis’s theory must also identify similar causes and similar effects in the real world in which the cause does occur and in the many possible worlds in which the cause does not occur. This approach is closely related to Mill’s “Method of Difference” in which: “If an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance in common save one, that one occurring only in the former; the circumstance in which alone the two instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon.” (Mill, 1888, page 280).37 In addition to identifying similar causes and similar effects, the Humean theory must determine if the conjunction of these similar causes and effects is accidental or lawlike. This task requires understanding what is happening in each situation and comparing the similarities and differences across situations. Lewis’s theory must identify the possible world where the cause does not occur that is most similar to the real world. This undertaking requires understanding the facts of the real world and the laws that are operating in it. Consequently, assessing the similarity of a possible world to our own world requires understanding the lawlike regularities that govern our world.38 It seems as if Lewis has simply substituted one difficult task, that of identifying the most similar world, for another, that of establishing lawfulness.

35 In fact, many authors now believe that general causation (involving lawlike generalizations) can only be understood in terms of singular causation. “...general causation is a generalisation of singular causation. Smoking causes cancer iff (if and only if) smokers’ cancers are generally caused by their smoking.” (Mellor, 1995, pages 6-7). See also Sosa and Tooley, 1993. More generally, whereas explanation was once thought virtually to supersede the need for causal statements, many philosophers now believe that a correct analysis of causality will provide a basis for suitable explanations (see Salmon, 1990).
36 The Humean theory also has affinities with Mill’s Method of Agreement which he described as follows: “If two or more instances of the phenomenon under investigation have only one circumstance in common, the circumstance in which alone all the instances agree, is the cause (or effect) of the given phenomenon.” (Mill, 1888, page 280)

The Virtues of the Counterfactual Definition of Causation – Lewis has substituted one difficult problem for another, but the reformulation of the problem has a number of benefits. The counterfactual approach provides new insights into what is required to establish a causal connection between causes and effects.
The counterfactual theory makes it clear that establishing causation does not require observing the universal conjunction of a cause and an effect.39 One observation of a cause followed by an effect is sufficient for establishing causation if it can be shown that in a most similar world without the cause, the effect does not occur. The counterfactual theory proposes that causation can be demonstrated by simply finding a most similar world in which the absence of the cause leads to the absence of the effect. Consequently, comparisons, specifically the kind of comparison advocated by John Stuart Mill in his “Method of Difference,” have a central role in the counterfactual theory as they do in the analysis of case studies.

37 Mill goes on to note that the Method of Difference is “a method of artificial experiment.” (Page 281). Notice that for both the Method of Concomitant Variation and the Method of Difference, Mill emphasizes the association between cause and effect and not the identification of which event is the cause and which is the effect. Mill’s methods are designed to detect the symmetric aspect of causality but not its asymmetric aspect.
38 Nelson Goodman makes this point in a 1947 article on counterfactuals, and James Fearon (1991), in a masterful exposition of the counterfactual approach to research, discusses its implications for counterfactual thought experiments in political science. Also see Tetlock and Belkin (1996).
39 G. H. von Wright notes that the counterfactual conception of causality shows that the hallmark of a lawlike connection is “necessity and not universality.” (von Wright, 1971, page 22)

Lewis’s theory provides us with a way to think about the causal impact of singular events such as the badly designed butterfly ballot in Palm Beach County, Florida that led some voters in the 2000 Presidential election to complain that they mistakenly voted for Reform Party candidate Patrick Buchanan when they meant to vote for Democrat Al Gore. The ballot can be said to be causally associated with these mistakes if in the closest possible world in which the butterfly ballot was not used the vote for Buchanan was lower than in the real world. Ideally this closest possible world would be a parallel universe in which the same people received a different ballot, but this, of course, is impossible. The next best thing is a situation where similar people employed a different ballot. In fact, the butterfly ballot was only used for election day voters in Palm Beach County. It was not used by absentee voters. Consequently, the results for the absentee voting can be considered a surrogate for the closest possible world in which the butterfly ballot was not used, and in this absentee voting world, voting for Buchanan was dramatically lower, suggesting that at least 2000 people who preferred Gore – more than enough to give the election to Gore – mistakenly voted for Buchanan on the butterfly ballot.

The difficult question, of course, is whether the absentee voting world can be considered a good enough surrogate for the closest possible world in which the butterfly ballot was not used.40 The counterfactual theory does not provide us with a clear sense of how to make that judgment.41 But the framework does suggest that we should consider the similarity of the election-day world and the absentee voter world. To do this, we can ask whether election day voters are different in some significant ways from absentee voters, and this question can be answered by considering information on their characteristics and experiences.
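The logic of using absentee voters as a surrogate for the closest possible world can be sketched in a few lines of Python. This is only an illustration of the comparison, under the assumption that the surrogate group's vote share is what the election-day group would have shown without the butterfly ballot. The function name and all of the counts are hypothetical; they are not the Palm Beach County returns.

```python
def excess_votes(treated_votes_for, treated_total,
                 surrogate_votes_for, surrogate_total):
    """Votes in the treated group beyond what the surrogate world predicts.

    The surrogate group's share stands in for the unobservable
    closest possible world in which the cause (the ballot design)
    is absent. Whether that stand-in is good enough is a separate
    judgment about how similar the two groups are.
    """
    # Expected votes if the treated group had voted like the surrogate group.
    expected = surrogate_votes_for * treated_total / surrogate_total
    return treated_votes_for - expected


# Hypothetical counts chosen only to make the arithmetic easy to follow.
election_day = {"buchanan": 800, "total": 100_000}   # butterfly ballot used
absentee = {"buchanan": 20, "total": 10_000}         # butterfly ballot not used

excess = excess_votes(election_day["buchanan"], election_day["total"],
                      absentee["buchanan"], absentee["total"])
print(excess)  # 600.0 with these made-up counts
```

The calculation is trivial; the hard part, as the text notes, is defending the assumption that absentee voters resemble election-day voters closely enough for the comparison to mean anything.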
In summary, the counterfactual perspective allows for analyzing causation in singular instances, and it emphasizes comparison, which seems difficult but possible, rather than the recondite and apparently fruitless investigation of the lawfulness of statements such as “All ballots that place candidate names and punch-holes in confusing arrangements will lead to mistakes in casting votes.”

40 For an argument that the absentee votes are an excellent surrogate, see Wand et al., “The Butterfly Did It,” American Political Science Review, December, 2001.
41 In his book on counterfactuals, Lewis only claims that similarity judgments are possible, but he does not provide any guidance on how to make them. He admits that his notion is vague, but he claims it is not ill-understood. “But comparative similarity is not ill-understood. It is vague–very vague–in a well-understood way. Therefore it is just the sort of primitive that we must use to give a correct analysis of something that is itself undeniably vague.” (Lewis, 1973a, page 91). In later work Lewis (1979, 1986) formulates some rules for similarity judgements, but they do not seem very useful to us and to others (Bennett, 1984).

Controlled Experiments and Closest Possible Worlds – The difficulties with the counterfactual definition are identifying the characteristics of the closest possible world in which the putative cause does not occur and finding an empirical surrogate for this world. For the butterfly ballot, sheer luck led a team of researchers to discover that the absentee ballot did not have the problematic features of the butterfly ballot.42 But how can we find surrogates in other circumstances?

42 For the story of how the differences between the election day and absentee ballot were discovered, see Brady et al., 2001a.

One answer is controlled experiments. Experimenters can create mini-closest-possible worlds by finding two or more situations and assigning putative causes (called “treatments”) to some situations but not to others (which get the “control”). If in those cases where the cause C occurs, the effect E occurs, then the first requirement of the counterfactual definition is met: when C occurs, then E occurs. Now, if the situations which receive the control are not different in any significant ways from those that get the treatment, then they can be considered surrogates for the closest possible world in which the cause does not occur. If in these situations where the cause C does not occur, the effect E does not occur either, then the second requirement of the counterfactual definition is confirmed: in the closest possible world where C does not occur, then E does not occur. The crucial part of this argument is that the control situation, in which the cause does not occur, must be a good surrogate for the closest possible world to the treatment.

Two experimental methods have been devised for ensuring closeness between the treatment and control situations. One is classical experimentation in which as many circumstances as possible are physically controlled so that the only significant difference between the treatment and the control is the cause. In a chemical experiment, for example, one beaker holds two chemicals and a substance that might be a catalyst and another beaker of the same type, in the same location, at the same temperature, and so forth contains just the two chemicals in the same proportions without the suspected catalyst. If the reaction occurs only in the first beaker, it is attributed to the catalyst. The second method is random assignment of treatments to situations so that there are no reasons to suspect that the entities that get the treatment are any different, on average, from those that do not. We discuss this approach in detail below.
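The claim that random assignment makes treated and control units alike "on average" can be illustrated with a small simulation. The sketch below is a toy model, not a method taken from this paper: the outcome equation, the size of the effect, and every name in it are assumptions made for the illustration.

```python
import random
from statistics import mean


def run_randomized_experiment(n_units=10_000, true_effect=2.0, seed=0):
    """Simulate random assignment and report balance and estimated effect.

    Each unit has an unobserved background characteristic that also
    influences the outcome. Randomization does not remove it; it only
    makes its average about the same in both groups.
    """
    rng = random.Random(seed)
    background = [rng.gauss(0.0, 1.0) for _ in range(n_units)]

    # Random assignment: shuffle the unit labels and treat the first half.
    units = list(range(n_units))
    rng.shuffle(units)
    treated = set(units[: n_units // 2])

    # Assumed outcome model: effect of treatment + background + noise.
    outcome = [
        (true_effect if i in treated else 0.0) + background[i] + rng.gauss(0.0, 1.0)
        for i in range(n_units)
    ]

    treat_idx = [i for i in range(n_units) if i in treated]
    control_idx = [i for i in range(n_units) if i not in treated]

    # How different are the two groups on the background characteristic?
    balance_gap = (mean([background[i] for i in treat_idx])
                   - mean([background[i] for i in control_idx]))
    # The control group serves as the surrogate closest possible world.
    effect_estimate = (mean([outcome[i] for i in treat_idx])
                       - mean([outcome[i] for i in control_idx]))
    return balance_gap, effect_estimate


balance_gap, effect_estimate = run_randomized_experiment()
print(f"background gap between groups: {balance_gap:.3f}")
print(f"estimated effect of treatment: {effect_estimate:.3f}")
```

With many units the background gap should be close to zero and the estimate close to the assumed effect, which is the sense in which the control group can stand in for the world where the cause does not occur.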
Problems with the Counterfactual Definition43 – Although the counterfactual definition of causation leads to substantial insights about causation, it also leads to two significant problems. Using the counterfactual definition as it has been described so far, the direction of causation cannot be established, and two effects of a common cause can be mistaken for cause and effect. Consider, for example, an experiment as described above. In that case, in the treatment group, when C occurs, E occurs, and when E occurs, C occurs. Similarly, in the control group, when C does not occur, then E does not occur, and when E does not occur, then C does not occur. In fact, there is perfect observational symmetry between cause and effect which means that the counterfactual definition of causation as described so far implies that C causes E and that E causes C. The same problem arises with two effects of a common cause because of the perfect symmetry in the situation. Consider, for example, a rise in the mercury in a barometer and thunderstorms. Each is an effect of high pressure systems, but the counterfactual definition would consider them to be causes of one another.44

These problems bedevil Humean and counterfactual theories. If we accept these theories in their simplest forms, we must live with a seriously incomplete theory of causation that cannot distinguish causes from effects and that cannot distinguish two effects of a common cause from real cause and effect. That is, although the counterfactual theory can tell whether two factors A and B are causally connected45 in some way, it cannot tell whether A causes B, B causes A, or A and B are the effects of a common cause (sometimes called spurious correlation). The reason for this is that the truth of the two counterfactual conditions described so far amounts to a particular pattern of the crosstabulation of the two factors A and B. In the simplest case where the columns are the absence or presence of the first factor (A) and the rows are the absence or the presence of the second factor (B), then the same diagonal pattern is observed for situations where A causes B or B causes A, or for A and B being the effects of a common cause. In all three cases, we either observe the presence of both factors or their absence. It is impossible from this kind of symmetrical information, which amounts to correlational data, to detect causal asymmetry or spurious correlation. The counterfactual theory as elucidated so far, like the Humean regularity theory, only describes a necessary condition, the existence of a causal connection between A and B, for us to say that A causes B.

43 This section relies heavily upon Hausman, 1998, especially Chapters 4-7 and Lewis, 1973b.
44 Thus, if barometric pressure rises, thunderstorms occur and vice-versa. Furthermore, if barometric pressure does not rise, then thunderstorms do not occur and vice-versa. Thus, by the counterfactual definition, each is the cause of the other. (To simplify matters, we have ignored the fact that there is not a perfectly deterministic relationship between high pressure systems and thunderstorms.)
45 As implied by this paragraph, there is a causal connection between A and B when either A causes B, B causes A, or A and B are the effects of a common cause. (See Hausman, 1998, pages 55-63).

Requiring temporal precedence can solve the problem of causal direction by simply choosing the phenomenon that occurs first as the cause, but it cannot solve the problem of common cause because it would lead to the ridiculous conclusion that since the mercury rises in barometers before storms, this upward movement in the mercury must cause thunderstorms. For this and other reasons, David Lewis rejects using temporal precedence to determine the direction of causality.
Instead, he claims that when C causes E but not the reverse “then it should be possible to claim the falsity of the counterfactual ‘If E did not occur, then C would not occur.’” This counterfactual is different from “if C occurs then E occurs” and from “if C does not occur then E does not occur” which, as we have already mentioned, Lewis believes must both be true when C causes E. The required falsity of ‘If E did not occur, then C would not occur’ adds a third condition for causality.46 This condition amounts to finding situations in which C occurs but E does not – typically because there is some other condition that must occur for C to produce E. Appendix 1 explores this strategy in much more detail, but it suffices to say here that there is typically a much better way of establishing causal priority that is explored in the next section.

46 There are four possible counterfactuals involving C and E, and unlike standard propositional logic in which the truth of ‘if C then E’ implies the truth of its contrapositive, ‘if not E then not C’, the truth or falsity of these four counterfactuals is logically independent of one another. That is, the law of the contrapositive does not hold for counterfactuals. Lewis proposes three conditions on these four counterfactuals for C to be said to cause E. First, the counterfactual “if C occurs then E would occur” must be true. In an experiment, this means that both C and E must occur in the treatment condition. We would expect this to happen if C deterministically causes E. Thus Lewis’s first condition for causality holds when C causes E. Lewis proposes that a second counterfactual, “if C did not occur then E would not occur” must also be true if we are to say that C causes E. The premise of this counterfactual (“C did not occur”) is true for the control group, and the counterfactual will be true if the control group is considered the closest possible world to the treatment group for which C did not occur and if E does not occur in the control group. Now, there is every reason to consider the control situation the closest possible world to the treatment situation, and if C really causes E, then E will not occur in the control group. Thus, the second possible counterfactual “if C did not occur then E would not occur” will be true when C causes E, and we can say that C does cause E according to Lewis’s definition. But when C causes E, the results for the treatment group also imply that a third counterfactual “if E occurs then C would occur” is true which leads to the possibility that E also causes C according to Lewis’s definition even though E does not really cause C at all. To avoid concluding that E causes C as well, the fourth counterfactual “if E did not occur then C would not occur” must be false. (If it were true then Lewis’s first two conditions for a causal relationship would hold for E causing C.) The falsity of this fourth counterfactual is Lewis’s third condition for claiming that C causes E but not the reverse.

The counterfactual theory provides us with substantial insights into the nature of causation by leading us towards experiments as a way to construct counterfactual worlds. It also illuminates one very important aspect of experiments. Although the cross-tabulation of the data from an experiment will indicate that there is a causal connection between one factor and another if the entries lie along a diagonal formed by cases where both factors are absent or both are present,47 it will not rule out a common cause or reveal the direction of causation if one factor directly causes another.
Consequently, other ways (described in Appendix 1) must be found to determine causation such as introducing a factor that interacts with (or conditions) the operation of the supposed cause or that might be the entire cause itself. In an experimental situation, extra factors like these can help establish the direction of causation and rule out common causes, although they must be used artfully. Considering other factors can also be useful in both experimental and observational studies because it leads to more careful consideration of the exact mechanisms by which causality occurs. However, considering other factors in observational (as opposed to experimental) studies cannot even assure us that we will avoid spurious correlations. Whatever its virtues and defects, this technique of finding another factor seems a bit unwieldy because it requires the identification and introduction of a factor in addition to the supposed cause and the supposed effect. From the perspective of a practicing researcher, temporal precedence would seem to be a much easier way to establish the direction of causation. But it has its own limitations, including the difficulty of identifying what comes before what in many situations. Sometimes this is just the difficulty of measuring events in a timely fashion – when, for example, did Protestantism become fully institutionalized and did it precede the institutionalization of capitalism? Does the increase in the money supply really precede economic upturns?48 But identifying what comes before what can also involve deep theoretical difficulties regarding the role of expectations (Shiffrin, 19xx), intentions, and human decision-making. Consider, for example, the relationship between educational attainment and marriage timing.
“Among women who leave full-time schooling prior to entry into marriage, there are some who will leave school and then decide to get married and others who will decide to get married and then leave school in anticipation of the impending marriage.” (Marini and Singer, 1988, page 377). In both cases, leaving school will precede marriage, but in the first case leaving school preceded the decision to marry and in the second case leaving school came after the decision to get married. Thus the timing of observable events cannot always determine causality, although the timing of intentions (to marry in this case) can determine causality. Unfortunately, it may be hard to get data on the timing of intentions. Finally, there are philosophical qualms about using temporal precedence to determine causal priority. Clearly, from a practical and theoretical perspective, it would be better to have a way of establishing causal priority that did not rely upon temporal precedence.

47 We are assuming the same set-up as we described earlier in which each factor is coded absent or present, and the diagonals represent the factors being jointly absent or jointly present. In observational data, this same pattern can be produced if the factors are the effects of a common cause, but the experimental context rules out this possibility.
48 The appropriate lag length in the relationship between money and economic output continues to be debated in economics, and it has led to the “established notion that monetary policy works with long and variable lags” (Abdullah and Rangazas, 1988, page 680).

Experimentation and the Manipulation Theory of Causation

In an experiment, there is a readily available piece of information that we have overlooked so far because it is not mentioned in the counterfactual theory. The factor that has been manipulated can determine the direction of causality and help to rule out spurious correlation.
The cause must be the manipulated factor.49 It is hard to exaggerate the importance of this insight. Although philosophers are uncomfortable with manipulation and agency theories of causality because they put people (as the manipulators) at the center of our understanding of causality, there can be little doubt about the power of manipulation for determining causality. Agency and manipulation theories of causation (Gasking, 1955; von Wright, 1975; Menzies and Price, 1993) elevate this insight into their definition of causation. For Gasking “the notion of causation is essentially connected with our manipulative techniques for producing results” (1955, page 483), and for Menzies and Price “events are causally related just in case the situation involving them possesses intrinsic features that either support a means-end relation between the events as is, or are identical with (or closely similar to) those of another situation involving an analogous pair of means-end related events.” (1993, page 197). These theories focus on establishing the direction of causation, but Gasking’s metaphor of causation as “recipes” also suggests an approach towards establishing the symmetric, regularity aspect of causation. Causation exists when there is a recipe that regularly produces effects from causes. Perhaps our ontological definitions of causality should not employ the concept of agency because most of the causes and effects in the universe go their merry way without human intervention, and even our epistemological methods often discover causes, as with Newtonian mechanics or astrophysics, where human manipulation is impossible. Yet our epistemological methods cannot do without agency because human manipulation appears to be the best way to identify causes, and many researchers and methodologists have fastened upon experimental interventions as the way to pin down causation. These authors typically eschew ontological aims and emphasize epistemological goals.
After explicitly rejecting ontological objectives, for example, Herbert Simon proceeds to base his initial definition of causality on experimental systems because “in scientific literature the word ‘cause’ most often occurs in connection with some explicit or implicit notion of an experimenter’s intervention in a system.” (Simon, 1952, page 518). When full experimental control is not possible, Thomas Cook and Donald T. Campbell recommend “quasi-experimentation,” in which “an abrupt intervention at a known time” in a treatment group makes it possible to compare the impacts of the treatment over time or across groups (Cook and Campbell, 1986, page 149). The success of quasi-experimentation depends upon “a world of probabilistic multivariate causal agency in which some manipulable events dependably cause other things to change.” (Page 150). John Stuart Mill suggests that the study of phenomena which “we can, by our voluntary agency, modify or control” makes it possible to satisfy the requirements of the Method of Difference (“a method of artificial experiment”) even though “by the spontaneous operations of nature those requisitions are seldom fulfilled.” (Mill, 1888, pages 281, 282).

49 It might be more correct to say that the cause is buried somewhere among those things that were manipulated or that are associated with the manipulation. It is not always easy, however, to know what was manipulated as in the famous Hawthorne experiments in which the experimenters thought the treatment was reducing the lighting for workers but the workers apparently thought of the treatment as being treated differently from all other workers. (See ***) Part of the work required for good causal inference is clearly describing what was manipulated and unpacking it to see what feature caused the effect.
Sobel champions a manipulation model because it “provides a framework in which the nonexperimental worker can think more clearly about the types of conditions that need to be satisfied in order to make inferences” (Sobel, 1995, page 32). David Cox claims that quasi-experimentation “with its interventionist emphasis seems to capture a deeper notion” (Cox, 1992, page 297) of causality than the regularity theory. As we shall see, there are those who dissent from this perspective, but even they acknowledge that there is “wide agreement that the idea of causation as consequential manipulation is stronger or ‘deeper’ than that of causation as robust dependence.” (Goldthorpe, 2001, page 5). This account of causality is especially compelling if the manipulation theory and the counterfactual theory are conflated, as they often are, and viewed as one theory. Philosophers seldom combine them into one perspective, but all the methodological writers cited above (Simon, Cook and Campbell, Mill, Sobel, and Cox) conflate them because they draw upon controlled experiments, which combine intervention and control, for their understanding of causality. Through interventions, experiments manipulate one or more factors, which simplifies the job of establishing causal priority by appeal to the manipulation theory of causation. Through laboratory controls or statistical randomization, experiments also create closest possible worlds that simplify the job of eliminating confounding explanations by appeal to the counterfactual theory of causation. The combination of intervention and control in experiments makes them especially effective ways to identify causal relationships. If experiments only furnished closest possible worlds, then the direction of causation would be indeterminate without additional information. If experiments only manipulated factors, then accidental correlation would be a serious threat to valid inferences about causality. Both features of experiments do substantial work.
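The division of labor between observation and manipulation can be made concrete with a toy deterministic model. The sketch below is an illustration only; the three functions and their names are hypothetical constructions, with A read as the barometer reading, B as the storm, and the exogenous input as whatever drives the system (the pressure system in the common-cause case). It shows that all three causal structures yield the same diagonal cross-tabulation, while manipulating A separates the structure in which A causes B from the other two.

```python
def a_causes_b(exogenous, set_a=None):
    """A is driven from outside (or set by hand); B simply follows A."""
    a = exogenous if set_a is None else set_a
    b = a
    return a, b


def b_causes_a(exogenous, set_a=None):
    """B is driven from outside; A follows B unless A is set by hand."""
    b = exogenous
    a = b if set_a is None else set_a
    return a, b


def common_cause(exogenous, set_a=None):
    """A hidden factor Z drives both A and B; A can be overridden by hand."""
    z = exogenous
    a = z if set_a is None else set_a
    b = z
    return a, b


def observed_cells(model):
    """Cells of the A-by-B cross-tabulation that occur without intervention."""
    return {model(x) for x in (0, 1)}


def b_responds_to_manipulating_a(model):
    """Does forcing A to 0 versus 1 ever change B, holding the rest fixed?"""
    return any(model(x, set_a=0)[1] != model(x, set_a=1)[1] for x in (0, 1))


for model in (a_causes_b, b_causes_a, common_cause):
    print(model.__name__, sorted(observed_cells(model)),
          b_responds_to_manipulating_a(model))
```

Every model prints the same observed cells, (0, 0) and (1, 1), but only `a_causes_b` reports that B responds when A is manipulated. Separating `b_causes_a` from `common_cause` would take a further manipulation of B.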
Any approach to determining causation in non-experimental contexts that tries to achieve the same success as experiments must recognize both these features. The methodologists cited above conflate them, and the psychological literature on counterfactual thinking cited at the beginning of this chapter shows that our natural inclination as human beings is to conflate them. When considering alternative possibilities, people typically consider nearby worlds in which individual agency figures prominently. When asked to consider what could have happened differently in a vignette involving a drunken driver and a new route home from work, subjects focus on having taken the new route home instead of on the factors that lead to drunken driving. They choose a cause and a closest possible world in which their agency matters. But there is no reason why the counterfactual theory and the manipulation theory should be combined in this way. The counterfactual theory of causation emphasizes possible worlds without considering human agency and the manipulation theory of causation emphasizes human agency without saying anything about possible worlds. Experiments derive their strength from combining both theoretical perspectives, but it is all too easy to overlook one of these two elements in generalizing from experimental to observational studies.50

As we shall see in a later section, the best known statistical theory of causality emphasizes the counterfactual aspects of experiments without giving equal attention to their manipulative aspects. Consequently, when the requirements for causal inference are transferred from the experimental setting to the observational setting, those features of experiments that rest upon manipulation tend to get underplayed.

50 Some physical experiments actually derive most of their strength by employing such powerful manipulations that no controls are needed. At the detonation of the first atom bomb, no one doubted that the explosion was the result of nuclear fission and not some other uncontrolled factor. Similarly, in what might be an apocryphal story, it is said that a Harvard professor who was an expert on criminology once lectured to a class about how all social science evidence suggested that rehabilitating criminals simply did not work. A Chinese student raised his hand and politely disagreed by saying that during the Cultural Revolution, he had observed cases where criminals had been rehabilitated. Once again, a powerful manipulation may need no controls.

Preemption and the Mechanism Theory of Causation

Preemption – Experimentation’s amalgamation of the lessons of counterfactual and manipulation theories of causation produces a powerful technique for identifying the effects of manipulated causes. Yet, in addition to the practical problems of implementing the recipe correctly, the experimental approach does not deal well with two related problems. It does not solve the problem of causal preemption which occurs when one cause acts just before and preempts another, and it does not so much explain the causes of events as it demonstrates the effects of manipulated causes. In both cases, the experimentalists’ focus on the impacts of manipulations in the laboratory instead of on the causes of events in the world, leads to a failure to explain important phenomena, especially those phenomena which cannot be easily manipulated or isolated.

The problem of preemption illustrates this point. The following example of preemption is often mentioned in the philosophical literature. A man takes a trek across a desert. His enemy puts a hole in his water can. Another enemy, not knowing the action of the first, puts poison in his water. Manipulations have certainly occurred, and the man dies on the trip. The enemy who punctured the water can thinks that she caused the man to die, and the enemy who added the poison thinks that he caused the man to die. In fact, the water dripping out of the can preempted the poisoning so that the poisoner is wrong. This situation poses problems for the counterfactual theory because one of the basic counterfactual conditions required to establish that the hole in the water can caused the death of the man, namely the truth of the counterfactual “if the hole had not been put in the water can, the man would not have died,” is false even though the man did in fact die of thirst. The problem is that the man would have died of poisoning if the hole in the water can had not preempted that cause, and the “back-up” possibility of dying by poisoning falsifies the counterfactual.

The preemption problem is a serious one, and it can lead to mistakes even in well-designed experiments. Presumably the closest possible world to the one in which the water can has been punctured is one in which the poison has been put in the water can as well. Therefore, even a carefully designed experiment will conclude that the puncturing of the can did not kill the man crossing the desert because the unfortunate subject in the control condition would die (from poisoning) just as the subject in the treatment would die (from the hole in the water can). The experiment alone would not tell us how the man died.

A similar problem could arise in medical experiments. Arsenic was once used to cure venereal disease, and it is easy to imagine an experiment in which doses of arsenic “cure” venereal disease but kill the patient while the members of the control group without the arsenic die of venereal disease at the same rate. If the experiment simply looked at the mortality rates of the patients, it would conclude that arsenic had no medicinal value because the same number of people died in the two conditions.
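The two preemption examples above can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than part of the original argument: the function names, the rule that a punctured can empties before any poisoned water is drunk, and the figure of 40 deaths per 100 patients are all hypothetical.

```python
# Toy models of the desert-traveler and arsenic examples.
# All names and numbers are hypothetical, chosen only for illustration.

def desert_death(hole: bool, poison: bool):
    """Return the cause of death, or None if the traveler survives.

    Assumed mechanism: a punctured can empties before any water is drunk,
    so thirst preempts poisoning whenever the hole is present.
    """
    if hole:
        return "thirst"
    if poison:
        return "poison"
    return None


def death_depends_on_hole(hole: bool, poison: bool) -> bool:
    """Naive counterfactual test: remove the hole, hold the poison fixed."""
    dies_actual = desert_death(hole, poison) is not None
    dies_without_hole = desert_death(False, poison) is not None
    return dies_actual and not dies_without_hole


def arsenic_trial(deaths_per_arm: int = 40):
    """Deaths by cause in each arm when arsenic cures the disease
    but is itself lethal at the same (assumed) rate."""
    treatment = {"disease": 0, "arsenic": deaths_per_arm}
    control = {"disease": deaths_per_arm, "arsenic": 0}
    return treatment, control


# Mechanism-level trace: the traveler died of thirst ...
cause = desert_death(hole=True, poison=True)
# ... yet the counterfactual test denies that the hole mattered.
flagged = death_depends_on_hole(hole=True, poison=True)

treatment, control = arsenic_trial()
same_mortality = sum(treatment.values()) == sum(control.values())
```

In this sketch the counterfactual test fails to flag the hole only because the poison is present as a back-up cause, while the cause-of-death trace and the per-cause death counts recover what a comparison of total outcomes hides.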
In both these instances, the experimental method focuses on the effects of causes and not on explaining effects by adducing causes. Instead of asking why the man died in his trek across the desert, the experimental approach asks what happens when a hole is put in the man’s canteen and everything else remains the same. The method concludes that the hole had no effect. Instead of asking what caused the death of the patients with venereal disease, the experimental method asks whether giving arsenic to those with venereal disease had any net impact on mortality rates. It concludes that it did not. In short, experimental methods do not try to explain events in the world so much as they try to show what would happen if some cause were manipulated. This does not mean that experimental methods are not useful for explaining what happens in the world, but it does mean that they sometimes miss the mark.

Mechanisms, Capacities, and the Pairing Problem – The preemption problem is a vivid example of a more general problem with the Humean account that requires a solution. The general problem is that constant conjunction of events is not enough to “pair-up” particular events even when preemption is not present. Even if we know that holes in water cans generally spell trouble for desert travelers, we still have the problem of linking a particular hole in a water can with a particular death of a traveler. Douglas Ehring notes that:

Typically, certain spatial and temporal relations, such as spatial/temporal contiguity, are invoked to do this job. [That is, the hole in the water can used by the traveler is obviously the one that caused his death because it is spatially and temporally contiguous to him.] These singularist relations are intended to solve the residual problem of causally pairing particular events, a problem left over by the generalist core of the Humean account.
(Ehring, 1997, page 18)

Counterfactual theories, because they can explain singular causal events, do not suffer so acutely from this “pairing” problem, but the preemption problem shows that remnants of the difficulty remain even in counterfactual accounts (Ehring, 1997, Chapter 1). In both the desert traveler and arsenic examples, the counterfactual account cannot get at the proper pairing of causes and effects because there are two redundant causes to be paired with the same effects. Something more is needed.

The solution in both these cases seems obvious, but it does not follow from the neo-Humean, counterfactual, or manipulation definitions of causality. The solution is to inquire more deeply into what is happening in each situation in order to describe the capacities and mechanisms that are operating. An autopsy of the desert traveler would show that the person died of thirst, and an examination of the water can would show that the water would have run out before the poisoned water could be imbibed. An autopsy of those given arsenic would show that the signs of venereal disease were arrested while other medical problems, associated with arsenic poisoning, were present. Further work might even show that lower doses of arsenic cure the disease without causing death. In both these cases, deeper inquiries into the mechanism by which the causes and effects are linked would produce better causal stories.

But what does it mean to explicate mechanisms and capacities?51 “Mechanisms,” we are told by Machamer, Darden, and Craver (2000, page 3), “are entities and activities organized such that they are productive of regular changes from start or set-up to finish or termination conditions.” The crucial terms in this definition are “entities and activities” which suggest that mechanisms have pieces.
Glennan (1996, page 52) calls them “parts,” and he requires that it should be possible “to take the part out of the mechanism and consider its properties in another context” (page 53). Entities, or parts, are organized to produce change. For Glennan (page 52), this change should be produced by “the interaction of a number of parts according to direct causal laws.” The biological sciences abound with mechanisms of this sort such as the method of DNA replication, chemical transmission at synapses, and protein synthesis. But there are many mechanisms in the social sciences as well including markets with their methods of transmitting price information and bringing buyers and sellers together, electoral systems with their routines for bringing candidates and voters together in a collective decision-making process, the diffusion of innovation through social networks, the two-step model of communication flow, weak ties in social networks, dissonance reduction, reference groups, arms races, balance of power, etc. (Hedstrom and Swedberg, 1998). As these examples demonstrate, mechanisms are not exclusively mechanical, and their activating principles can range from physical and chemical processes to psychological and social processes. They must be composed of appropriately located, structured, and oriented entities which involve activities that have temporal order and duration, and “an activity is usually designated by a verb or verb form (participles, gerundives, etc.)” (Machamer, Darden, and Craver, 2000, page 4), which takes us back to the work of Lakoff and Johnson (1999) who identified a “Causation Is Forced Movement metaphor.”

Mechanisms provide another way to think about causation.
Glennan argues that “two events are causally connected when and only when there is a mechanism connecting them” and “the necessity that distinguishes connections from accidental conjunctions is to be understood as deriving from a underlying mechanism” which can be empirically investigated (page 64). These mechanisms, in turn, are explained by causal laws, but there is nothing circular in this because these causal laws refer to how the parts of the mechanism are connected. The operation of these parts, in turn, can be explained by lower level mechanisms. Eventually the process gets to a bedrock of fundamental physical laws which Glennan concedes “cannot be explained by the mechanical theory” (page 65).

51 These approaches are not the same, and those who favor one often reject the other (see, e.g., Cartwright, 1989 on capacities and Machamer, Darden, and Craver, 2000 on mechanisms). But both emphasize “causal powers” (Harre and Madden, 1975, Chapter 5) instead of mere regularity or counterfactual association. We focus on mechanisms because we believe that they are a somewhat better way to think about causal powers, but in keeping with our pragmatic approach, we find much that is useful in “capacity” theories.

Consider explaining social phenomena by examining their mechanisms. Duverger’s law, for example, is the observed tendency for just two parties in simple plurality single-member district election systems (such as the United States). The entities in the mechanisms behind Duverger’s law are voters and political parties. These entities face a particular electoral rule (single district plurality voting) which causes two activities. One is that voters often vote strategically by choosing a candidate other than their most liked because they want to avoid throwing their vote away on a candidate who has no chance of winning and because they want to forestall the election of their least wanted alternative.
The other activity is that political parties often decide not to run candidates when there are already two parties in a district because they anticipate that voters will spurn their third party effort. These mechanisms underlying Duverger’s law suggest other things that can be observed beyond the regularity of two party systems being associated with single member plurality-vote electoral systems that led to the law in the first place. People’s votes should exhibit certain patterns and third parties should exhibit certain behaviors. And a careful examination of the mechanism suggests that in some federal systems that use simple plurality single-member district elections we might have more than two parties, seemingly contrary to Duverger’s Law. Typically, however, there are just two parties in each province or state, but these parties may differ from one state to another, thus giving the impression, at the national level, of a multi-party system even though Duverger’s Law holds in each electoral district.52

Or consider meteorological53 and physical phenomena. Thunderstorms are not merely the result of cold fronts hitting warm air or being located near mountains; they are the results of parcels of air rising and falling in the atmosphere subject to thermodynamic processes which cause warm humid air to rise, to cool, and to produce condensed water vapor. Among other things, this mechanism helps to explain why thunderstorms are more frequent in areas, such as Denver, Colorado, near mountains because the mountains cause these processes to occur – without the need for a cold air front. Similarly, Boyle’s law is not merely a regularity between pressure and volume; it is the result of gas molecules moving within a container and exerting force when they hit the walls of the container. This mechanism for Boyle’s law also helps to explain why temperature affects the relationship between the pressure and volume of a gas.
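The molecular mechanism behind Boyle’s law can be written out as a short worked equation. This is the standard kinetic-theory result for an ideal gas, offered only as an illustration; the symbols (N molecules of mass m, mean squared speed ⟨v²⟩, Boltzmann’s constant k_B, absolute temperature T) are introduced here and do not appear in the original text.

```latex
PV = \tfrac{1}{3}\, N m \langle v^2 \rangle ,
\qquad
\tfrac{1}{2}\, m \langle v^2 \rangle = \tfrac{3}{2}\, k_B T
\quad\Longrightarrow\quad
PV = N k_B T .
```

At a fixed temperature the right-hand side is constant, which is Boyle’s regularity between pressure and volume; the same expression shows how the product PV changes when the temperature changes.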
When the temperature increases, the molecules move faster and exert more force on the container walls. Mechanisms like these are midway between general laws on the one hand and specific descriptions on the other hand, and activities can be thought of as causes which are not related to lawlike generalities.54 Mechanisms typically explicate observed regularities in terms of lower level processes, and the mechanisms vary from field to field and from time to time. Moreover, these mechanisms “bottom-out” relatively quickly – molecular biologists do not seek quantum mechanical explanations and social scientists do not seek chemical explanations of the phenomena they study.

52 This radically simplifies the literature on Duverger’s law (see Cox, 19xx for more details).

53 The points in this paragraph, and the thunderstorm example, come from Dessler (1991).

54 Jon Elster says: “Are there lawlike generalizations in the social sciences? If not, are we thrown back on mere description and narrative? In my opinion, the answer to both questions is No. The main task of this essay is to explain and illustrate the idea of a mechanism as intermediate between laws and descriptions.” (Elster, 1998, page 45)

When an unexplained phenomenon is encountered in a science, “Scientists in the field often recognize whether there are known types of entities and activities that can possibly accomplish the hypothesized changes and whether there is empirical evidence that a possible schemata is plausible.” They turn to the available types of entities and activities to provide building blocks from which to construct hypothetical mechanisms. “If one knows what kind of activity is needed to do something, then one seeks kinds of entities that can do it, and vice versa.” (Machamer, Darden, and Craver, page 17) Mechanisms, therefore, provide a way to solve the pairing problem, and they leave a multitude of traces that can be uncovered if a hypothesized causal relation really exists.
For example, those who want to test Max Weber’s hypothesis about the Reformation leading to capitalism do not have to rest content with simply correlating Protestantism with capitalism. They can also look at the detailed mechanism he described for how this came about, and they can look for the traces left by this mechanism (Hedstrom and Swedberg, 1998, page 5; Sprinzak, 1972).55

Multiple Causes and Mechanisms – Earlier in this paper, the need to rule out common causes and to determine the direction of causation in the counterfactual theory led us towards a consideration of multiple causes. In this section, the need to solve the problem of preemption and the pairing problem led to a consideration of mechanisms. Together, these theories lead us to consider multiple causes and the mechanisms that tie these causes together. Many different authors have come to a similar conclusion about the need to identify mechanisms (Cox, 1992; Simon and Iwasaki, 1988; Freedman, 1991; Goldthorpe, 2001), and this approach seems commonplace in epidemiology (Bradford Hill, 1965) where debates over smoking and lung cancer or sexual behavior and AIDS have been resolved by the identification of biological mechanisms that link the behaviors with the diseases.

Four Theories of Causality [Incomplete]

What is Causation? – We are now at the end of our review of four causal theories. We have described two fundamental features of causality. One is the symmetric association between causes and effects. The other is the asymmetric fact that causes produce effects, but not the reverse. Table 1 summarizes how each theory identifies these two aspects of causality. Regularity and counterfactual theories do better at capturing the symmetric aspect of causation than its asymmetric aspect. Regularity theories rely upon the constant conjunction of events and temporal precedence to identify causes and effects.
Their primary tool is essentially the “Method of Concomitant Variation” proposed by John Stuart Mill in which the causes of a phenomenon are sought in other phenomena which vary in a similar manner. Counterfactual theories rely upon elaborations of the “Method of Difference” to find causes by comparing instances where the phenomenon occurs and instances where it does not occur to see in what circumstances the situations differ. Counterfactual theories suggest searching for surrogates for the closest possible worlds where the putative cause does not occur to see how they differ from the situation where the cause did occur. This strategy leads naturally to experimental methods where the likelihood of the independence of assignment and outcome, which ensures one kind of closeness, can be increased by rigid control of conditions or by randomly assigning treatments to cases. None of these methods is fool-proof because none solves the pairing problem or gets at the connections between events, but experimental methods typically offer the best chance of achieving closest possible worlds for comparisons.

55 Hedstrom and Swedberg (1998) and Sorenson (1998) rightfully criticize causal modeling for ignoring mechanisms and treating correlations among variables as theoretical relationships. But it might be worth remarking that causal modelers in political science have been calling for more theoretical thinking (Achen, 19xx, Bartels and Brady, 19xx) for at least two decades, and a constant refrain at the annual meetings of the Political Methodology Group has been the need for better “micro-foundations.”

Causal theories that emphasize mechanisms and capacities provide guidance on how to solve the pairing problem and how to get at the connections between events. Our emphasis in this book upon causal process observations is in that spirit. These observations can be thought of as elucidations and tests of possible mechanisms.
And the growing interest in mechanisms in the social sciences (Hedstrom and Swedberg, 1998; Elster, 19xx) is providing a basis for opening up the black-box of the Humean regularity and the counterfactual theories. The other major feature of causality, the asymmetry of causes and effects, is captured by temporal priority, manipulated events, and the independence of causes. Each notion takes a somewhat different approach to distinguishing causes from effects once the unconditional association of two events (or sets of events) has been established. Temporal priority simply identifies causes with the events that came first. If growth in the money supply reliably precedes economic growth, then the growth in the money supply is responsible for growth. Manipulation theories identify the manipulated event as the causally prior one. If a social experiment manipulates work requirements and finds that greater stringency is associated with faster transitions off welfare, then the work requirements are presumed to cause these transitions. Finally, one event is considered the cause of another if a third event can be found that satisfies the INUS conditions for a cause and that varies independently of the putative cause. If short-circuits vary independently of wooden frame buildings, and both satisfy INUS conditions for burned down buildings, then both must be causes of those conflagrations. Or if education levels of voters vary independently of their getting the butterfly ballot, and both satisfy INUS conditions for mistakenly voting for Buchanan instead of Gore, then both must be causes of those mistaken votes. Causal Inference with Experimental and Observational Data – Now that we know what causation is, what lessons can we draw for doing empirical research? Table 1 shows that each theory provides sustenance for different types of studies and different kinds of questions. 
Regularity and mechanism theories tend to ask about the causes of effects while counterfactual and manipulation theories ask about the effects of imagined or manipulated causes. The counterfactual and manipulation theories converge on experiments, although counterfactual thought experiments flow naturally from the possible worlds approach of the counterfactual theory. Regularity theories are at home with observational data, and the mechanical theory thrives on analytical models and case studies.

Which method, however, is the best method? Clearly the gold-standard for establishing causality is experimental research, but even that is not without flaws. When they are feasible, well-done experiments can help us construct closest possible worlds and explore counterfactual conditions. But we still have to assume that there is no preemption occurring which would make it impossible for us to determine the true impact of the putative cause, and we also have to assume that there are no interactions across units in the treatment and control groups and that treatments can be confined to the treated cases. If, for example, we are studying the impact of a skill training program on the tendency for welfare recipients to get jobs, we should be aware that a very strong economy might preempt the program itself and cause those in both the control and treatment conditions to get jobs simply because employers did not care much about skills. As a result, we might conclude that skills do not count for much in getting jobs even though they might matter a lot in a less robust economy. Or if we are studying electoral systems in a set of countries with a strong bimodal distribution of voters, we should know that the voter distribution might preempt any impact of the electoral system by fostering two strong parties. Consequently, we might conclude that single-member plurality systems and proportional representation systems both led to two parties, even though this is not generally true.
And if we are studying some educational innovation that is widely known, we should know that teachers in the “control” classes might pick up and use this innovation thereby nullifying any effect it might have. If we add an investigation of mechanisms to our experiments, we might be able to develop safeguards against these problems. For the welfare recipients, we could find out more about their job search efforts, for the party systems we could find out about their relationship to the distribution of voters, and for the teachers we could find out about their adoption of new teaching methods.

Once we go to observational studies, matters get much more complicated. Spurious correlation is a real danger. There is no way to know whether those cases which get the treatment and those which do not differ from one another in other ways. It is very hard to be confident that either independence of assignment and outcome or conditional independence of assignment and outcome holds. Because nothing has been manipulated, there is no surefire way to determine the direction of causation. Temporal precedence provides some information about causal direction, but it is often hard to obtain and interpret it.

The Causality Checklist and Social Science Examples – [Still to be completed; This section will analyze several social science examples such as Duverger’s Law, the Protestant ethic and the rise of capitalism, work requirements and welfare rolls, and the butterfly ballot and voting in Florida. Table 3 is a preliminary version of the causality checklist which will be “filled out” for each example in order to show what must be established.]

Table 3: Causality Checklist

General Issues
-- What is the “cause” (C) event? What is the “effect” (E) event?
-- What is the exact causal statement of how C causes E?
-- What is the corresponding counterfactual statement about what happens when C does not occur?
-- What is the causal field? What is the context or universe of cases in which the cause operates?
-- Is this a physical or social phenomenon or some mixture?
-- What role, if any, does human agency play?
-- What role, if any, does social structure play?
-- Is the relationship deterministic or probabilistic?

Neo-Humean Theory
-- Is there a constant conjunction (i.e., correlation) of cause and effect?
-- Is the cause necessary, sufficient or INUS?
-- What are other possible causes, i.e., rival explanations?
-- Is there a constant conjunction after controls for these other causes are introduced?
-- Does the cause precede the effect? In what sense?

Counterfactual Theory
-- Is this a singular conjunction of cause and effect?
-- Can you describe a closest possible (most similar) world to where C causes E but C does not occur? How close are these worlds?
-- Can you actually observe any cases of this world (or something close to it, at least on average)? Again, how close are these worlds?
-- In this closest possible world, does E occur in the absence of C?
-- Are there cases where E occurs but C does not occur? What factor intervenes and what does this tell us about C causing E?

Manipulation Theory
-- What does it mean to manipulate your cause? Be explicit. How would you describe the cause?
-- Do you have any cases where C was actually manipulated? How? What was the effect?
-- Is this manipulation independent of other factors that influence E?

Mechanism and Capacities Theories
-- Can you explain, at a lower level, the mechanism(s) by which C causes E?
-- Do the mechanisms make sense to you?
-- What other predictions does this mechanism lead to?
-- Does the mechanism solve the pairing problem?
-- Can you identify some capacity that explains the way the cause leads to the effect?
-- Can you observe this capacity when it is present and measure it?
-- What other outcomes might be predicted by this capacity?
-- What are possible preempting causes?
Case Study: The Neyman-Rubin-Holland Counterfactual Conditions for Causation

Among statisticians, the best known theory of causality has grown out of the experimental tradition. The roots of this perspective are in Fisher (19xx) and especially Neyman (1923), and it has been most fully articulated by Rubin (1974, 1978) and Holland (1986). In this section, which is more technical than the rest of this chapter, we explain this perspective, and we evaluate it in terms of the four theories of causality described above. There are four aspects of the Neyman-Rubin-Holland (NRH) approach:

1. A Counterfactual Definition of Causal Effect – Causal relationships are defined using a counterfactual perspective which focuses on estimating causal effects. The definition provides no guidance on how researchers can actually identify causes because it relies upon an unobservable counterfactual. To the extent that the NRH approach considers causal priority, it equates it with temporal priority.

2. Finding a Substitute for the Counterfactual Situation: The Independence of Assignment and Outcome – As a step towards identifying causes, the NRH approach goes on to formulate a set of epistemological assumptions, namely the independence of assignment and outcome or the mean conditional independence of assignment and outcome, that make it possible to estimate causal effects with observable data, although there is no way to verify the assumption.

3. An Assumption for Creating Mini-Possible Worlds – As a prelude to suggesting concrete ways that the independence or conditional independence of assignment and outcome can be achieved, the statistical approach describes an assumption, the Stable Unit Treatment Value Assumption (SUTVA), that makes it possible to treat cases as separate mini closest possible worlds by assuming that they do not interfere or communicate with one another and that treatments do not vary from case to case.

4. Methods for Ensuring Independence of Assignment and Outcome if SUTVA holds – Finally, the NRH approach describes methods such as unit homogeneity or random assignment for obtaining independence or mean independence of assignment and outcome as long as SUTVA holds.

The definition of a causal effect based upon unobserved counterfactuals was first described in a 1923 paper published in Polish by Jerzy Neyman (1990). Although Neyman’s paper was relatively obscure until 1990, similar ideas informed much of the statistical work on experimentation from the 1920's to the present. Rubin (1974, 1978, 1990a,b) and Heckman (19xx) were the first to stress the importance of independence of assignment and outcome. A number of experimentalists identified the need for the SUTVA assumption (e.g., Cox, 1958). Random assignment as a method for estimating causal effects was first championed by R.A. Fisher (1925, 1926). Holland (1986) provides the best synthesis of the entire perspective.

Ontological Definition of Causal Effect Based Upon Counterfactuals – According to the NRH understanding of causation, establishing a causal relationship consists of comparing:

(a) the value of the outcome variable for a case that has been exposed to a treatment (Yt, with “t” for treatment), with

(b) the value of the outcome variable for the same case if that case had not been exposed to the treatment (Yc, with “c” for control).

Note that (a) refers to an actual observation in the treatment condition (“a case that has been exposed to a treatment”) so the value Yt is observed while (b) refers to a counterfactual observation of the control condition (“if that case had not been exposed to the treatment”).56 Because the case was exposed to the treatment, it cannot simultaneously be in the control condition, and the value Yc is the outcome in the closest possible world where the case was not exposed to the treatment.
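The comparison of (a) and (b), and the role that independence of assignment and outcome plays in estimating it, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the NRH authors' own procedure: the outcome distribution, the constant effect of 5, the self-selection rule, and all function and variable names are hypothetical. Units do not interact and the treatment is the same for every unit, so SUTVA holds by construction.

```python
import random
from statistics import fmean

TRUE_EFFECT = 5.0  # hypothetical constant treatment effect, assumed for illustration


def difference_in_means(self_select: bool, n_units: int = 10_000, seed: int = 0) -> float:
    """Estimate the treatment effect from observed outcomes only.

    Every unit has two potential outcomes (y_t, y_c), but only the one
    matching its assignment is ever observed.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_units):
        y_c = rng.gauss(50.0, 10.0)  # outcome if the unit is not treated
        y_t = y_c + TRUE_EFFECT      # outcome if the unit is treated
        if self_select:
            # Assignment depends on the outcome: high-y_c units take the treatment.
            gets_treatment = y_c > 50.0
        else:
            # Random assignment: independent of both potential outcomes.
            gets_treatment = rng.random() < 0.5
        if gets_treatment:
            treated.append(y_t)      # y_c remains counterfactual for this unit
        else:
            control.append(y_c)      # y_t remains counterfactual for this unit
    return fmean(treated) - fmean(control)


randomized_estimate = difference_in_means(self_select=False)
self_selected_estimate = difference_in_means(self_select=True)
```

Under random assignment the difference in observed group means should land close to the assumed effect. When assignment depends on the outcome, the same calculation is badly inflated, even though no unit's potential outcomes have changed.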
Although this value cannot be observed, we can still describe the conclusions we would draw if we could observe it. The Net Effect of the Treatment (NET) for a particular case is the difference in outcomes, NET = (Yt - Yc), for the case, and if this difference is zero (i.e., if NET = 0), we say the treatment has no net effect.57 If this difference is non-zero (i.e., NET ≠ 0), then the treatment has a net effect. Then, based on the counterfactual approach of David Lewis, there is a causal connection between the treatment and the outcome if two conditions hold. First, the treatment must be associated with a net effect, and second the absence of the treatment must be associated with no net effect.58 Although the satisfaction of these two conditions is enough to demonstrate a causal connection, it is not enough to determine the direction of causation or to rule out a common cause.
56 For simplicity, we assume that the treatment case has been observed, but the important point is not that the treatment is observed but rather that only one of the two conditions can be observed. There is no reason why the situation could not be reversed with the actual observation of the case in the control group and the counterfactual involving the unobserved impact of the treatment condition.
57 Technically, we mean that the treatment has no effect with respect to that outcome variable.
58 With a suitable definition of effect, one of these conditions will always hold by definition and the other will be determinative of the causal connection. The NRH approach focuses on the Net Effect of the Treatment (NET = Yt - Yc) in which the control outcome Yc is the baseline against which the treatment outcome Yt is compared. A nonzero NET implies the truth of the counterfactual “if the treatment occurs, then the net effect occurs,” and a zero NET implies that the counterfactual is false. In the NRH set-up the Net Effect for the Control (NEC) must always be zero because NEC = (Yc - Yc) is always zero. Hence, the counterfactual “if the treatment is absent then there is no net effect” is always true. The focus on the net effect of the treatment (NET) merely formalizes the fact that in any situation one of the two counterfactuals required for a causal connection can always be defined to be true by an appropriate definition of an effect. Philosophers, by custom, tend to focus on the situation where some effect is associated with some putative cause so that it is always true that “if the cause occurs then the effect occurs as well” and the important question is the truth or falsity of “if the cause does not occur then the effect does not occur.” Statisticians such as NRH, with their emphasis on the null hypothesis, seem to prefer the equivalent, but reverse, set-up where the important question is the truth or falsity of “if the treatment occurs, then the effect occurs.” The bottom line is that a suitable definition of effect can always lead to the truth of one of the two counterfactuals so that causal impacts must always be considered comparatively.
If the two conditions for a causal connection hold, then the third Lewis condition, which establishes the direction of causation and which rules out common cause, cannot be verified or rejected with the available information. The third Lewis condition requires determining whether the cause occurs in the closest possible world in which the net effect does not occur. But the only observed world in which the net effect does not occur in the NRH setup is the control condition in which the cause does not occur either. As discussed earlier, another situation in which the net effect does not occur and the cause does occur must be observed to verify the third Lewis condition and to show that the treatment causes the net effect.
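The definition of the Net Effect of the Treatment given in this section can be written out as a short sketch. The function names and the example values (a case whose outcome is 1 under treatment and 0 under control) are our own illustrative assumptions, not part of the NRH literature. The sketch only restates NET = Yt - Yc and the point that just one of the two outcomes is ever realized for a case.

```python
def net_effect(y_t, y_c):
    """Net Effect of the Treatment for one case: NET = Yt - Yc."""
    return y_t - y_c


def observed_outcome(y_t, y_c, treated):
    """Only one of the two potential outcomes is realized for a case."""
    return y_t if treated else y_c


# Hypothetical case: outcome is 1 under treatment and 0 under control.
y_t, y_c = 1, 0

net = net_effect(y_t, y_c)   # net effect of the treatment
nec = net_effect(y_c, y_c)   # net effect for the control: Yc - Yc, zero by construction

print("NET =", net, "NEC =", nec)
print("observed if treated:", observed_outcome(y_t, y_c, treated=True))
```

In practice `net_effect` cannot be evaluated for a single case, because one of its two arguments is always the unobserved counterfactual; the rest of the section concerns how to find a substitute for it.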
Alternatively, the direction of causation can be determined (although common cause cannot be ruled out) if the treatment is manipulated to produce the effect. Rubin and his collaborators mention manipulation when they say that “each of the T treatments must consist of a series of actions that could be applied to each experimental unit” (Rubin, 1978, page 39) and “it is critical that each unit be potentially exposable to any one of the causes” (Holland, 1986, page 946), but their use of phrases such as “could be applied” or “potentially exposable” suggests that they are more concerned about limiting the possible types of causes than with distinguishing causes from effects.59 To the degree that causal priority is mentioned in the NRH literature, it is established by temporal precedence. Rubin (1974, page 689), for example, says that the causal effect of one treatment over another “for a particular unit and an interval t1 to t2 is the difference between what would have happened at time t2 if the unit had been exposed to [one treatment] initiated at time t1 and what would have happened at t2 if the unit had been exposed to [another treatment] at t1.” Holland (1986, pages 980) says that “The issue of temporal succession is shamelessly embraced by the model as one of the defining characteristics of a response variable. The idea that an effect might precede a cause in time is regarded as meaningless in the model, and apparently also by Hume.” The problem with this approach, of course, is that it does not necessarily rule out common cause and spurious correlation.60 In fact, as we shall see, one of the limitations and possible confusions produced by the NRH approach is its failure to deal with the need for more information to rule out common causes and to determine the direction of causality.
Finding a Substitute for the Counterfactual Situation: The Independence of Assignment and Outcome – As with the Lewis counterfactual approach, the difficulty with the NHR definition of causal connections is that there is no way to observe both Yt and Yc for any particular case. One obvious line of attack is to consider two cases instead of just one. One case gets the treatment and the other gets the control condition. We now explore what happens under these circumstances. 59 Rubin and Holland believe in “NO CAUSATION WITHOUT MANIPULATION” (Holland, 1986, pages 959) which seems to eliminate attributes such as sex or race as possible causes, although Rubin softens this perspective somewhat by describing ways in which sex might be a manipulation (Rubin, 1986, pages 962). Clearly, researchers must consider carefully in what sense some factors can be considered causes. 60 Consider, for example, the experiment described earlier in which randomly assigned special tutoring first causes a rise in self-esteem and then an increase in test scores, but the increase in self-esteem does not cause the increase in test scores. The NHR framework would incorrectly treat self-esteem as the cause of the increased test scores because self esteem is randomly assigned and it precedes and is associated with the rise in test scores. Clearly something more than temporal priority is needed for causal priority. 38 Table 4 describes a simple situation where we are investigating whether a hammer blow to a glass will or will not break it. We assume that we have two glasses. The treatment is the hammer blow. The control condition is no hammer blow. For the moment consider the row for glass number one and the entries for the outcome variables Y1t or Y1c that are listed in the center of the next to last row of the table. 
The subscript “1” for these outcome variables indicates they are for glass number one, and the superscripts “t” or “c” indicate whether they are for the treatment or the control condition. These variables take on the values of zero if the glass is not broken and one if the glass is broken. The realized values of these variables, that is, the ones that are actually observed, depend upon61 whether glass number one gets the treatment or the control condition. These conditions are mutually exclusive states of the world in the sense that if the glass is in one of them, then it cannot be in the other. The glass either gets the treatment or the control condition. Consequently, only one of the columns can be observed for glass number one (or any other glass), and the final column which provides an evaluation of the impact of the treatment cannot be computed row by row because one of the two quantities is not observed. This unobserved quantity is the counterfactual outcome. In the introduction to this section, the counterfactual outcome for the glass was for the control condition, but we have not yet made any assumptions about which glass in Table 4 does or does not get the treatment. In practice, therefore, those doing causal inference must find some way to get a substitute value for the unobserved counterfactual outcome in the final column of Table 4. How can this be done? Suppose, for example, the researcher hits glass number one with a hammer blow and observes that the glass is broken so that Y1t = 1. Some substitute is then needed for Y1c, the counterfactual situation where the hammer blow is not struck against glass number one. One possibility is to observe the glass just before it was hit by the hammer and to take that value as a substitute for Y1c. Let us assume that the glass was unbroken in the moment before the hammer hit so that we set Y1c = Y1c* = 0 where Y1c* is the state of the glass a moment (indicated by c*) before the hammer hit.
In this case, we might conclude that the hammer blow is causally connected with the broken glass because the treatment is associated with the glass breaking and the glass was unbroken the moment before the hammer hit. That is, there is a difference between the outcomes t and c*: Y1t - Y1c* = 1 - 0 = 1. This approach to inference makes use of what Holland (1986) calls the “temporal stability” and “causal transience” assumptions. Temporal stability assumes the constancy of response over time so that Y1c = Y1c*. That is, the observation of the glass at c* is assumed to be the same as the observation would have been at c. Many of our everyday causal inferences, such as the belief that our turning the key in the ignition made the car start, are based upon this assumption. We believe that since the engine was not going just before we turned the key, it would not have been going a moment later if we had not turned the key. This assumption, however, is risky when things change over time as they often do. The second assumption of “causal transience,” which might be better called “measurement transience,” asserts that the act of observing the state of the glass did not change it in any way. This assumption sometimes makes sense with inanimate objects, but is worrisome with things that react to measurement by either learning or changing their behaviors.
61 We are using the term “depend upon” to mean that they may vary from one condition to another, but we do not mean to imply that they must vary in some way. In fact, the whole enterprise is designed to find out whether or not they vary based solely upon whether or not the glass is in the treatment or control condition.
For example, if we provide extra tutoring to a student after a poor test performance, we cannot presume that the student’s improvement on a subsequent test is due to our tutoring – it may have been due to learning (about test-taking) from the first test or from motivation to do better from the poor showing on the first test. Thus a skeptic could reasonably claim that the glass broke because of some other factor “Z” such as a nearby high-pitched sound that cracked the glass by sounding just after our observation Y1c* and just before the hammer blow (thereby violating “temporal stability”) or because of our handling of the glass when we checked to make sure it was not broken (thereby violating “causal transience”). That is, if it were possible to observe the actual Y1c it would have a value of one indicating that the glass would have broken without a hammer blow because of the high-pitched sound or our rough handling of it. Consequently there is no difference between Y1t and Y1c (both indicate a broken glass), and the hammer blow did not break the glass.62 The problem here is that Y1c* is similar to Y1c, but it is not identical with it. Of all the worlds in which the hammer blow is not struck, Y1c* is not quite the closest possible world to the treatment situation because it does not yet include the operation of the high-pitched sound which came just after the observation Y1c*. What arguments can be used to dissuade the skeptic who makes these arguments? The problem is to rule out factor Z. In practice, those doing causal inference seek to do this by replicating the hypothetical comparison of Yt and Yc through real-world comparisons across (hopefully) similar units some of which are exposed to the treatment and some of which are not. Thus, the researcher could use a similar glass from the same manufacturer and of the same sort as a “control.” Let us call this glass number two and affix a subscript “2" to Y for it, Y2. 
This approach uses what Holland calls the “unit homogeneity” assumption in which units are prepared carefully “so that they ‘look’ identical in all relevant aspects.” (Holland, 1986, page 948).
62 Note that this claim is not the same as the claim that “the hammer blow could not break the glass” which is a statement about capacities. It might be possible for a hammer blow to break the glass but in this case the glass might have broken just before the hammer blow was struck.
Table 4 – Making Causal Inferences and Independence of Assignment and Outcome
                  Mutually Exclusive States of the World
Cases             Treatment Outcome       Control Outcome           Impact of Treatment
                  “Hit with hammer”       “Not hit with hammer”
Glass Number 1    Y1t                     Y1c                       Y1t - Y1c
Glass Number 2    Y2t                     Y2c                       Y2t - Y2c
If this second glass is placed in the control condition, then the researcher could observe Y2c and substitute its value for Y1c. With this information, the researcher could calculate Y1t - Y2c to see if hitting the glass with a hammer caused it to break. If the second glass did not break while the first one did, then Y1t would be one and Y2c would be zero, apparently demonstrating that the hammer caused the glass to break. The skeptic, of course, might not be silenced. The doubter might argue that the high-pitched tone (factor Z) only affected the glass in the treatment condition but not the glass in the control condition (thereby violating “unit homogeneity”). This differential impact of Z is the crucial problem that can confound a causal inference. Perhaps glass number one in the treatment condition was closer to the high pitched sound, or glass number two in the control condition was shielded in some way from the tone. In effect, unit homogeneity failed because it did not extend to the relevant causal circumstances.
The trick for the researcher is to make sure that the circumstances of each glass are so similar that it is very difficult, if not impossible, to think of some difference in any factor Z that might break one glass but not the other. Technically what is required is that the two glasses be so similar and so similarly situated that if both glasses were in the control condition, then both would have the same outcome (either broken or unbroken) and if both glasses were in the treatment condition, then both would have the same outcome as well. In terms of the quantities in Table 4, we require that Y1c and Y2c have the same value and Y1t and Y2t have the same value.63 This condition amounts to saying that the circumstances of the objects must be interchangeable in terms of the outcome variable Y. If we interchange their indices (one becomes two and two becomes one), then we must have the same entries in Table 4. If this requirement is met, then the values of the outcomes for the treatment condition are independent of which glass is assigned to the treatment and the values of the outcomes for the control condition are independent of which glass is assigned to the control condition. For our purposes, the two glasses and their circumstances are identical.
63 Note that we could observe whether Y1c equals Y2c by not hitting either glass with a hammer, but then we never observe what happens when we hit a glass with a hammer. Similarly we could observe whether Y1t equals Y2t by hitting each glass with a hammer, but then we never observe what happens when we do not hit a glass with a hammer. The problem is that no matter what we do, we can only get some of the information that we need.
These independence conditions rule out the operation of any factor such as Z, and they can be thought of as a definition of what we mean by “closest possible world.” If independence of assignment and outcome holds, no matter which glass is assigned to the control condition, the value of Y1c and Y2c must be the same. But if glass number one is affected by high-pitched notes (the factor Z) and glass number two is not, then Y1c and Y2c will have different values – the first glass will break (from the high pitched sound) without the hammer blow but the second one will not break.64 Thus, if Z acts only on glass number one, the independence condition, which requires Y1c and Y2c to be equal, will not hold. If assignment is independent of the outcomes, the factor Z must either act on both glasses or on neither. If it acts on both, then all four values in Table 4 will indicate a broken glass, and the researcher will correctly conclude that the glass that was struck was not broken because of a hammer blow. (The researcher should not necessarily conclude that hammer blows never break glasses because the high-pitched sound may have broken the glass just before the hammer blow would have broken it.) If Z acts on neither, then the researcher will either correctly conclude that the hammer blow is associated with the glass breaking (if, in fact, the glass broke when it was hit by the hammer) or that the hammer blow is not associated with the breaking of the glass (if, in fact, the glass did not break when hit by the hammer). Independence of assignment and outcome, therefore, is one of the crucial conditions that ensures good causal inference. If Y1c equals Y2c and Y1t equals Y2t, then any observed difference between Y1t and Y2c (or alternatively between Y2t and Y1c) must be due to the treatment and not to any other factor. 
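The argument about Table 4 can be traced with a small sketch. It assumes, purely for illustration, that we could see all four potential outcomes, which is exactly what a researcher cannot do. The function names and the three scenarios are our own hypothetical constructions based on the hammer and glass example.

```python
def interchangeable(table):
    """Independence of assignment and outcome for Table 4:
    every glass has the same Yt and every glass has the same Yc."""
    t_values = {row["t"] for row in table.values()}
    c_values = {row["c"] for row in table.values()}
    return len(t_values) == 1 and len(c_values) == 1


def cross_glass_difference(table, treated, control):
    """What the researcher computes: Yt of the struck glass minus
    Yc of the other glass, used as a substitute for the counterfactual."""
    return table[treated]["t"] - table[control]["c"]


# 1 = broken, 0 = unbroken. All values are hypothetical.
no_z = {1: {"t": 1, "c": 0}, 2: {"t": 1, "c": 0}}         # Z acts on neither glass
z_on_both = {1: {"t": 1, "c": 1}, 2: {"t": 1, "c": 1}}    # Z breaks both glasses
z_on_first = {1: {"t": 1, "c": 1}, 2: {"t": 1, "c": 0}}   # Z breaks only glass one

for name, table in [("no Z", no_z), ("Z on both", z_on_both), ("Z on first", z_on_first)]:
    print(name, interchangeable(table),
          cross_glass_difference(table, treated=1, control=2),
          cross_glass_difference(table, treated=2, control=1))
```

When the glasses are interchangeable, the computed difference does not depend on which glass is struck. When Z acts only on glass one, the answer changes with the assignment, which is the skeptic's objection in numerical form.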
If independence does not hold, then one (or both) of these equalities does not hold and the outcome of the control or treatment condition will depend upon which glass is assigned to each condition. If, for example, the high-pitched notes only affect the first glass, then the assignment of the first glass as the control will result in its breaking whereas the assignment of the second glass as the control will not lead to a broken glass. When a real-world comparison is employed, the quality of the resulting causal inference depends on how cases are “assigned” to the group that gets the treatment and the group that does not. If the cases assigned to the treatment and control groups are different in terms of what would have happened to them if they had all been in the treatment group or had all been in the control group, then independence of assignment and outcome is violated and causal inference will be flawed. The independence condition with its focus on the outcome variable Y is a very convenient way of describing what we mean by the closest possible world in a counterfactual definition of causality, but it is not a testable assumption. We cannot know whether the outcomes for the two glasses would be identical if both got the hammer blow or both did not get it. There is no way that we can know whether we could interchange the two glasses and get the same results. All we can do is to try to control the circumstances of each case as much as possible to reduce the chance that differences in them might mean that assignment would not be independent of outcome. This discussion suggests two ways to provide that control. One is to consider whether temporal stability and causal transience hold. Another is to consider whether unit homogeneity holds. Confirming that these conditions hold requires a great deal of ancillary knowledge about the world such as whether glasses tend to break on their own, whether there are other features of the experimental situation (such as high pitched noises) that might cause them to break, and so forth. In effect, it requires establishing that the glasses being compared in the treatment and control condition are in closest possible worlds except for the difference in the treatment they get. It also typically requires another, very subtle, assumption.
64 If hitting a glass with a hammer does not cause it to break, then the fact that the skeptic’s factor Z only operates on glass one will also violate the requirement that the outcomes be the same in the treatment condition (i.e., that Y1t = Y2t) because glass number one will break (because of Z) and glass number two will not. But if a hammer blow usually does shatter a glass, although it did not do so in the case of glass number one because the high-pitched tone did the job before the hammer could, then Y1t and Y2t would be identical outcomes of broken glasses – but the first would be caused by the high pitched sound and the second (where Z was absent) would be caused by the hammer blow. Nevertheless, independence will still be violated in this case because the first condition, that is Y1c = Y2c, will not be true. Cases of (nearly) simultaneous causation like this can cause vexing problems which we discuss in a later section.
The SUTVA Assumption for Creating Mini-Possible Worlds – Perhaps the hammer blow is not the same for all units or perhaps the invocation of the treatment for some of the cases causes changes in the control cases. For example, the decision to hit some glasses with a hammer might change the structure of the glass for some “nearby” control glasses, but this change would not occur if none of the glasses were hit with a hammer. Or it would not occur for those glasses that were farther away. In this situation, the outcome for a control case depends upon what happens to other cases. This possibility seems unlikely given what we know, but consider the following.
Suppose people in a treatment condition are punished for poor behavior while those in a control condition are not. Further suppose that those in the control condition who are “near” those in the treatment condition are not fully aware that they are exempt from punishment or they fear that they might be made subject to it. Wouldn’t their behavior change in ways that it would not have changed if there had never been a treatment condition? Doesn’t this mean that it would be difficult, if not impossible, to satisfy the conditions for independence of assignment and outcome? In the Cal-Learn experiment in California, for example, teenage girls on welfare in the treatment group had their welfare check reduced if they failed to get passing grades in school. Those in the randomly selected control group were not subject to reductions but many thought they were in the treatment group (probably because they knew people who were in the treatment group) and they appear to have worked to get passing grades to avoid cuts in welfare (Mauldon et al., 19xx).65 Their decision to get better grades, however, may have led to an underestimate of the impact of Cal-Learn because it reduced the difference between the treatment group and the control group. The problem here is that there is interaction between the units. Similar problems arise if supposedly identical treatments vary in effectiveness so that the causal effect for a specific unit 65 Experimental subjects were told which group they were in, but some apparently did not get the message. They may not have gotten the message because the control group was only a small number of people and almost all teenage welfare mothers in the state were in the treatment group. In these circumstances, an inattentive teenager in the control group could have sensibly supposed that the program applied to everyone. Furthermore, getting better grades seemingly had the desired effect because their welfare check was not cut! 
depends upon which bag of fertilizer the agricultural plot got or which teacher the student had. To rule out these possibilities, Rubin (1990) proposed the “Stable-Unit-Treatment-Value-Assumption (SUTVA)” which asserts that the outcome for a particular case does not depend upon what happens to the other cases or which of the supposedly identical treatments the unit receives.66 SUTVA rules out a number of phenomena described in the literature. Agricultural experimenters have worried that bags of supposedly identical fertilizer are different because of variations in the manufacturing process. As a result, the causal effect of fertilizer for a plot may depend upon which bag of fertilizer was applied to it. Agricultural experimenters have also worried that the treatments given to agricultural plots could interact with one another if rainstorms cause the fertilizer applied to one plot to flow into adjacent plots. As a result, the causal impact for a plot may depend upon the pattern of assignment of fertilizer in neighboring plots. Researchers using human subjects have worried about similar problems. Cook and Campbell (1986, pages 148) mention four fundamental threats to randomized experiments. Compensatory rivalry occurs when control units decide that even though they are not getting the treatment, they can do as well as those getting it. Resentful demoralization occurs when those not getting the treatment become demoralized because they are not getting the treatment. Compensatory equalization occurs when those in charge of control units decide to compensate for the perceived inequities between treatment and control units, and treatment diffusion occurs when those in charge of control units mimic the treatment because of its supposed beneficial effects. SUTVA implies that each supposedly identical treatment really is identical and that each unit is a separate, isolated possible world that is unaffected by what happens to the other units.
SUTVA is the master assumption that makes controlled or randomized experiments a suitable solution to the problem of making causal inferences. SUTVA insures that treatment and control units really do represent the closest possible worlds to one another except for the difference in treatment. In order to believe that SUTVA holds, we must have a very clear picture of the units, treatments, and outcomes in the situation at hand so that we can convince ourselves that experimental (or observational) comparisons really do involve similar worlds. Rubin (1986, page 962) notes, for example, that statements such as “If the females at firm f had been male, their starting salaries would have averaged 20% higher” require much more elaboration of the counterfactual possibilities before they can be tested. What kind of treatment, for example, would be required for females to be males? Are individuals or the firm the basic unit of analysis? Is it possible to simply randomly assign men to the women’s jobs to see what would happen to salaries? From what pool would these men be chosen? If men were randomly assigned to some jobs formerly held by women, would there be interactions across units that would violate SUTVA? 66 SUTVA amounts to assuming that the outcome values in each cell in Table 4 do not vary with the pattern of assignment of treatments and controls or with the specific treatment or control given to the unit. If SUTVA fails, then we must develop additional notation that specifies each of the four possible patterns of assignment of treatment and control conditions [namely for (c,c), (c,t), (t,c), and (t,t) where the first entry refers to the first glass and the second to the second glass] and that specifies each possible treatment separately [with t1 and t2 considered different versions of the treatment and c1 and c2 different versions of the control]. Combining these notations, we must consider (c1,c2) to be different from (c2,c1) and from (c1,t1) and so forth. 
Thus, the entries in each cell in Table 4 will vary according to the pattern of treatments and controls and the allocation of each version of treatments and controls. 44 Not surprisingly, if the SUTVA assumption fails, then it will be at best hard to generalize the results of an experiment and at worst impossible to even interpret its results. Generalization is hard if, for example, imposing a policy of welfare time-limits on a small group of welfare recipients has a much different impact than imposing it upon every recipient. Perhaps the imposition of limits on the larger group generates a negative attitude towards welfare that encourages job-seeking which is not generated when the limits are only imposed on a few people. Or perhaps the random assignment of a “Jewish” culture to one country (such as Israel) is much different than assigning it to a large number of countries in the same area. In both cases, the pattern of assignment to treatments seems to matter as much as the treatments themselves because of interactions among the units, and the interpretation of these experiments might be impossible because of the complex interactions among units. If SUTVA does not hold, then there are no ways such as randomization to construct closest possible worlds, and the difficulty of determining closest possible worlds must be faced directly. If SUTVA holds and if there is independence of assignment and outcome,67 then the degree of causal connection can be estimated.68 But there are no direct tests that can insure that these assumptions hold, and much of the art in experimentation goes into strategies that will increase the likelihood that they do hold. 
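The non-interference part of SUTVA can be stated as a check over the four patterns of assignment listed in the footnote on SUTVA, namely (c,c), (c,t), (t,c), and (t,t). The sketch below is only an illustration: the two outcome rules, `isolated` and `spillover`, are hypothetical, and it does not cover the second part of SUTVA concerning different versions of a treatment.

```python
from itertools import product


def sutva_holds(outcome, n_units):
    """True if each unit's outcome depends only on its own condition,
    whatever conditions the other units are assigned."""
    for unit in range(n_units):
        seen = {}  # own condition -> outcome first seen under it
        for pattern in product("ct", repeat=n_units):
            y = outcome(unit, pattern)
            if seen.setdefault(pattern[unit], y) != y:
                return False
    return True


def isolated(unit, pattern):
    """Hypothetical rule: a glass breaks only if it is itself struck."""
    return 1 if pattern[unit] == "t" else 0


def spillover(unit, pattern):
    """Hypothetical rule: striking any glass also breaks its neighbor."""
    return 1 if "t" in pattern else 0


print("isolated glasses:", sutva_holds(isolated, 2))
print("glasses with spillover:", sutva_holds(spillover, 2))
```

Under the spillover rule the control outcome for glass one is 0 in pattern (c,c) but 1 in pattern (c,t), so a single entry Y1c in Table 4 no longer describes that glass.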
Cases can be isolated from one another to minimize interference, treatments can be made as uniform as possible, and the characteristics and circumstances of each case can be made as uniform as possible, but nothing can absolutely insure that SUTVA and the independence of assignment and outcome hold.69 Finding a Substitute for the Counterfactual Situation: Conditional Independence of Assignment and Outcome – Independence of assignment and outcome is a very strong condition that can be approached by thoroughgoing control over all confounding factors in the research situation, but it 67 These assumptions are logically independent of one another. SUTVA asserts that the values in Table 4 do not depend upon what ultimately happens to each glass, while independence of assignment and outcomes refers to the columns having the same values. If SUTVA does not hold, the values of Y1t, Y1c,Y2t, and Y2c will depend upon the overall pattern of assignment of treatment and control conditions and the specific treatments and controls given to each unit. The values within each column could be equal for a given pattern but different across different patterns so that independence of assignment and outcomes would hold (although this seems highly unlikely in most instances), or the values within each column might be different within a given pattern so that independence of assignment and outcomes would not hold. If SUTVA does hold, it is easy to see that independence of assignment and outcomes might or might not hold. 68 If SUTVA fails and independence of assignment and outcome obtains, then causal effects can also be estimated, but they will differ depending upon the pattern of treatments. Furthermore, the failure of SUTVA may make it impossible to rely upon standard methods such as experimental control or randomization to insure that the independence of assignment and outcome holds because the interaction of units may undermine these methods. 
69 Rosenbaum worries that SUTVA incorporates too much and that “it seems to bear a distinct resemblance to an attic trunk; what does not fit is neatly folded and packed away.” (Rosenbaum, 1987, page 313) He would like “to see SUTVA divided up into a series of more tangible assumptions with practical interpretations, so that violations could be quickly discerned and perhaps addressed.” (Page 110).
requires a degree of control that is seldom possible. Even physical scientists typically have to make corrections in their observations because of confounding factors such as stray sources of particles in high energy accelerators or stray sources of electromagnetic radiation that affect radio or visible light telescopes. When these scientists do this, they are using a weaker assumption called conditional independence. Conditional independence holds when there is some confounding variable Z that produces violations of independence of assignment but any subgroup of cases with the same value of Z satisfies the condition for independence of assignment. In this case, each subgroup can be analyzed separately. Thus, in the hammer and glass example, the effect of the hammer blow treatment for all glasses subject to the high-pitched tone can be analyzed as one subgroup and the effect of the hammer blow treatment for all those glasses not subject to the high-pitched tone can be analyzed as another subgroup.
Table 5 – Causal Inference and Conditional Independence of Assignment and Outcome
                  Mutually Exclusive States of the World
Cases             Treatment Outcome       Control Outcome           Subject to Z?
                  “Hit with hammer”       “Not hit with hammer”
Glass Number 1    Y1t                     Y1c                       Yes
Glass Number 2    Y2t                     Y2c                       Yes
Glass Number 3    Y3t                     Y3c                       No
Glass Number 4    Y4t                     Y4c                       No
Table 5 provides a schematic of the situation. To simplify matters, assume that the only confounding or concomitant variable is Z and that it either operates or it does not – either a glass is in the range of the high-pitched tone or it is not.
In Table 5, the first two glasses are subject to Z, and they will have identical values of Y1c and Y2c and identical values of Y1t, and Y2t. Consequently, independence of assignment will hold true for these two glasses. Because all these glasses are affected by Z, all of their Y values are equal to one because the glasses will break from the high-pitched tone. Similarly, glasses three and four will have identical values of Y3c and Y4c and identical values of Y3t and Y4t. The values of Y3c and Y4c will be equal to zero because the glasses will not break. The values of Y3t and Y4t will either both be zero if the hammer does not cause glasses to break or one if the hammer does cause glasses to break. In any case, independence of assignment will hold for these two glasses. Although the outcome values for glasses within the two subgroups are identical because Y1c = Y2c = 1 and Y3c = Y4c = 0, these values are not identical across the two groups. The first two glasses are subject to Z and will break without the hammer blow; the second two glasses are not subject to Z and will not break without the hammer blow. As a result, independence of assignment and outcome does not hold, but there is independence of assignment that is conditional on the value of Z. There is conditional independence of assignment and outcome. 46 Finding a Substitute for the Counterfactual Situation: Mean Independence of Assignment and Outcome when there is Outcome Variability – The preceding version of conditional independence, in which the cases are identical in the sense that the case values on the outcome variable are the same within each column conditional on Z, is very strong. It might be met in a situation where there is a lot of control over the factors that distinguish situations. We might be able to get identical glasses in identical situations, or identical chemicals in identical situations, but in most social science, outcomes are highly variable from case to case. 
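The reasoning about Table 5 can be checked mechanically. The sketch below fills in the table under the additional assumption, made only for illustration, that a hammer blow does break a glass (so Y3t = Y4t = 1); the helper names are hypothetical.

```python
from collections import defaultdict

# Table 5 with illustrative values: 1 = broken, 0 = unbroken.
# Glasses 1 and 2 are subject to Z; glasses 3 and 4 are not.
glasses = [
    {"glass": 1, "t": 1, "c": 1, "z": True},
    {"glass": 2, "t": 1, "c": 1, "z": True},
    {"glass": 3, "t": 1, "c": 0, "z": False},
    {"glass": 4, "t": 1, "c": 0, "z": False},
]


def columns_constant(rows):
    """Independence of assignment and outcome: one value per column."""
    return len({r["t"] for r in rows}) == 1 and len({r["c"] for r in rows}) == 1


def split_by_z(rows):
    """Group the cases by their value of the confounding variable Z."""
    strata = defaultdict(list)
    for r in rows:
        strata[r["z"]].append(r)
    return strata


unconditional = columns_constant(glasses)
conditional = all(columns_constant(g) for g in split_by_z(glasses).values())

# Within a subgroup any treated/control pair gives the same difference.
effect_by_z = {z: g[0]["t"] - g[1]["c"] for z, g in split_by_z(glasses).items()}

print(unconditional, conditional, effect_by_z)
```

Independence fails for the four glasses taken together because the control column mixes ones and zeros, but it holds within each value of Z, so each subgroup can be analyzed separately.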
For example, if the rows in Table 5 are welfare recipients who are involved in a job training program and the outcomes are wages in subsequent jobs, then we might expect that the training program would work for some but not for others and that some who did not even get job training would still get high wages. In short, the row values in the control and the treatment groups would vary considerably.

Conditional independence of assignment and outcome can be extended to this circumstance where there is variability in outcomes. The basic device is to be content with estimating only an average causal effect and to require only that outcomes be similar on average.70 In this and the following section, we begin by generalizing independence of assignment to average or mean independence of assignment, and then we generalize still further to mean conditional independence.

Identical values of all the Yit and of all the Yic (where i represents different cases) are not required. Rather, all that is needed is that cases are assigned in such a way that those in the treatment group are similar to those in the control group. Table 6 shows what is needed. In this table, the first four cases are control cases and the next four are treatment cases.71 Four averages are presented in the table, and the requirements for mean independence are that:
– the average of the treatment outcomes for those cases used as controls (Ct) equals the average of the treatment outcomes for those cases that get the treatment (Tt), and
– the average of the control outcomes for those cases getting the control condition (Cc) equals the average of the control outcomes for those cases that get the treatment (Tc).
In sum, the two averages in each column, one for the treated cases and the other for the control cases, must equal one another. On average the treatment and control groups must be similar.
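The two requirements can be restated compactly in terms of the averages defined in Table 6. The restatement below is only a sketch in notation chosen for illustration: the subscripted symbols mirror the table's Ct, Cc, Tt, and Tc, and the eight-case average causal effect is an assumption about which quantity is being targeted.

```latex
% Averages from Table 6 (cases 1-4 are controls, cases 5-8 are treated)
\begin{align*}
C_t &= \tfrac{1}{4}\sum_{i=1}^{4} Y_{it}, &
C_c &= \tfrac{1}{4}\sum_{i=1}^{4} Y_{ic}, &
T_t &= \tfrac{1}{4}\sum_{i=5}^{8} Y_{it}, &
T_c &= \tfrac{1}{4}\sum_{i=5}^{8} Y_{ic}.
\end{align*}

% Mean independence of assignment and outcome
\[
C_t = T_t \qquad\text{and}\qquad C_c = T_c .
\]

% Average causal effect over all eight cases
\[
\frac{1}{8}\sum_{i=1}^{8}\left(Y_{it} - Y_{ic}\right)
  = \frac{C_t + T_t}{2} - \frac{C_c + T_c}{2}.
\]

% Under mean independence the right-hand side reduces to
\[
\frac{C_t + T_t}{2} - \frac{C_c + T_c}{2} = T_t - C_c .
\]
```

Only Tt and Cc are ever observed, since each case reveals one of its two potential outcomes. The last line shows why the two equalities matter: they are what allow the observable difference Tt − Cc to stand in for the unobservable average effect.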
Obviously, these conditions will be met if the values in each column are identical to one another as in the example with the glasses, but we do not require this much. If these conditions hold, then the researcher will be able to make a good inference about the causal effect of the treatment by comparing the average of the observed Yt among the cases given the treatment with the average of the observed Yc among the cases given the control. If these conditions do not hold, then the researcher will be in danger of making a biased causal inference.

70 See Stone (19xx) for a discussion of the relationship between various definitions of causal impacts and the types of assumptions about conditional independence required to estimate them.

71 This discussion ignores two important, but somewhat technical issues. It does not mention how cases are sampled, and it does not mention that we really need the expectations of the averages to be equal. It also ignores some conditions that are stronger than equal means and which lead to stronger possibilities for causal inference. See Rubin (1974) for a very clear exposition of these issues.

One way to assess and to influence the probability of independence of assignment and outcome is to use statistical randomization. If cases are randomly assigned to treatment and control conditions, then we can calculate the probability that there are deviations from a given level of independence. We can also develop statistics to see if observed differences between the treatment and control group are due to chance or to a real causal impact of the treatment. Textbooks on experimentation provide the details of how this can be done (Fisher, 19xx; Kempthorne, 1952; Cox, 1958).
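The role of randomization can be illustrated with a small simulation. This is a minimal sketch, not a procedure taken from the textbooks cited: the function name, the wage-like outcome distributions, and the sample size are all hypothetical choices made for illustration. Each unit is given both potential outcomes, a random half is assigned to treatment, and the observed difference in means is compared with the true average causal effect that only the simulation can see.

```python
import random
from statistics import mean


def simulate_randomized_experiment(n_units=10_000, seed=0):
    """Return (true average causal effect, estimate from a randomized experiment)."""
    rng = random.Random(seed)

    # Hypothetical potential outcomes: Y_ic for every unit, and Y_it = Y_ic + effect,
    # where the effect varies from unit to unit (outcome variability).
    y_control = [rng.gauss(10.0, 2.0) for _ in range(n_units)]
    y_treated = [yc + rng.gauss(1.0, 1.0) for yc in y_control]
    true_ace = mean(yt - yc for yt, yc in zip(y_treated, y_control))

    # Random assignment: shuffle the unit indices and treat the first half.
    units = list(range(n_units))
    rng.shuffle(units)
    treated = set(units[: n_units // 2])

    # Only one potential outcome per unit is observed.
    observed_t = [y_treated[i] for i in range(n_units) if i in treated]
    observed_c = [y_control[i] for i in range(n_units) if i not in treated]
    estimate = mean(observed_t) - mean(observed_c)
    return true_ace, estimate


true_ace, estimate = simulate_randomized_experiment()
print(f"true average effect: {true_ace:.3f}  randomized estimate: {estimate:.3f}")
```

Because assignment is independent of the potential outcomes by construction, the treated and control groups are similar on average and the two printed numbers should be close. Rerunning with other seeds gives a feel for how large the chance deviations are.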
Table 6 – Independence of Assignment and Outcome with Variable Outcomes

                   Mutually Exclusive States of the World
Cases              Treatment Outcome             Control Outcome
Control Cases      Ct = (Y1t+Y2t+Y3t+Y4t)/4      Cc = (Y1c+Y2c+Y3c+Y4c)/4
1                  Y1t                           Y1c
2                  Y2t                           Y2c
3                  Y3t                           Y3c
4                  Y4t                           Y4c
Treatment Cases    Tt = (Y5t+Y6t+Y7t+Y8t)/4      Tc = (Y5c+Y6c+Y7c+Y8c)/4
5                  Y5t                           Y5c
6                  Y6t                           Y6c
7                  Y7t                           Y7c
8                  Y8t                           Y8c

Finding a Substitute for the Counterfactual Situation: Mean Conditional Independence when there is Outcome Variability – In an observational study, randomization is not available, and the mean independence assumption can easily fail for the same reason that independence of assignment can fail for the example of a hammer striking glasses. If there is some factor Z, say prior job experience among those getting job training, that affects the treatment cases but not the control cases, then the average Ct may not equal Tt and the average Cc may not equal Tc. If, for example, the treatment group has more prior job experience, then we would expect that even without job training, their average wages, namely Tc, would be higher than those of the control group Cc. And we would probably expect Tt to be higher than Ct as well.

What can be done in this situation? If, and this is a big if, the researcher can identify the variable (or variables) Z that cause these departures from independence of assignment, then statistical corrections can be made for the problem. The logic is simple, although the practice is hard because Z is seldom known. Suppose that cases with the same value of Z satisfy mean independence of assignment. This amounts to saying that it is possible to construct tables like Table 6 for each value of Z in which Ct = Tt and Cc = Tc. If this is possible, then statistical corrections can be made for the confounding caused by Z.
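One simple form such a correction can take is stratification: estimate the effect separately within each value of Z and then average the results. The simulation below is a sketch under loudly stated assumptions and is not the paper's own procedure. Z is a single binary indicator of prior job experience, the training effect is a constant 2.0, assignment to training is more likely when Z is present, and all names and numbers are hypothetical.

```python
import random
from statistics import mean


def simulate_observational_study(n_units=20_000, seed=1):
    """Return (naive, stratified) estimates; the true effect is 2.0 by construction."""
    rng = random.Random(seed)
    rows = []  # each row: (z, treated, observed wage)
    for _ in range(n_units):
        z = rng.random() < 0.5                         # hypothetical: has prior job experience
        treated = rng.random() < (0.8 if z else 0.2)   # assignment depends on Z (confounding)
        wage_control = 10.0 + 5.0 * z + rng.gauss(0.0, 1.0)
        wage_treated = wage_control + 2.0              # assumed constant effect of training
        rows.append((z, treated, wage_treated if treated else wage_control))

    def diff_in_means(subset):
        """Observed treated mean minus observed control mean within a subset."""
        t = [y for _, d, y in subset if d]
        c = [y for _, d, y in subset if not d]
        return mean(t) - mean(c)

    # Naive comparison ignores Z and mixes the training effect with the effect of experience.
    naive = diff_in_means(rows)

    # Stratified comparison: one Table-6-style comparison per value of Z,
    # weighted by the share of cases in each stratum.
    stratified = 0.0
    for value in (False, True):
        stratum = [r for r in rows if r[0] == value]
        stratified += len(stratum) / n_units * diff_in_means(stratum)
    return naive, stratified


naive, stratified = simulate_observational_study()
print(f"naive: {naive:.2f}  stratified on Z: {stratified:.2f}")
```

The naive difference is inflated because the treated group contains more experienced workers, while the stratified estimate stays near the built-in effect. The correction works here only because the simulation knows that Z is the sole confounder, which is exactly the knowledge that is seldom available in practice.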
Summary of the NRH Approach – If SUTVA holds and if the conditional independence conditions hold, then mini-closest-possible worlds have been created which can be used to compare the effects in a treatment and control condition. If SUTVA holds, then there are three ways to get the conditional independence conditions to hold:
(a) Controlled experiments in which either unit homogeneity holds or temporal stability and causal (or measurement) transience holds.
(b) Statistical experiments in which random assignment holds.
(c) Observational studies in which corrections are made for covariates that ensure mean conditional independence of assignment and outcome.

The mathematical conditions required for the third method to work follow easily from the Neyman-Rubin-Holland set-up, but there is no method for identifying the proper covariates. And outside of experimental studies, there is no way to be sure that conditional independence of assignment and outcome holds. Even if we know about some Z that may confound our results, we may not know about all of them, and without knowing all of them, we cannot be sure that correcting for some of them ensures conditional independence. Thus observational studies face the problem of identifying a set of Z variables that will ensure conditional independence so that the impact of the treatment can be determined. A great deal of research, however, does this in a rather cavalier way.

Even if SUTVA and some form of conditional independence are satisfied, the NRH framework, like Lewis’s counterfactual theory of which it is a close relative, can only identify causal connections. Additional information is needed to rule out spurious correlation and to establish the direction of causation. Appeal can be made to temporal precedence or to what was manipulated to pin down the direction of causation, but neither of these approaches provides full protection against common causes.
More experiments or observations that study the impact of other variables which suppress supposed causes or effects may be needed, and these have to be undertaken imaginatively in ways that explore different possible worlds.

There is a further problem of moving from experiments to observational studies by using these conditions. As shown in Chapter XX, the major method that has been used to make the statistical corrections required for conditional independence is regression analysis in which the left-hand-side variable is the outcome variable Y and the right-hand-side variables are the covariates Z and some measure of the treatment and control such as a dummy variable X. The use of this technique has led to two difficulties.

First, unlike correlation analysis which is inherently symmetrical, regression analysis is inherently asymmetrical. One variable has to be chosen as the left-hand-side or “dependent” variable and others have to be chosen as the right-hand-side or “independent” variables. It is all too easy for researchers to fall into the assumption that the left-hand-side variable is the effect of the right-hand-side causes. Yet, once outside of the experimental paradigm, there is no guidance about which variable should be considered the outcome or effect. Observational studies very seldom have any built-in asymmetry that suggests the proper dependent variable, but experiments do have this asymmetry which comes from one variable being manipulated.

Second, all the right-hand-side variables are treated symmetrically in regression. Yet, the conditional independence framework treats covariates and treatments asymmetrically. If conditional independence holds for the assignment of the treatment X and the outcome Y when the covariates Z are controlled, it does not follow that we can interchange X and Z. Once again, it is important to recognize that experiments identify putative causes through their manipulation of them.
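The first difficulty, that correlation is symmetric while regression is not, can be seen in a few lines. The sketch below uses simulated data from a hypothetical process in which y is twice x plus noise; the helper function and variable names are illustrative assumptions. The data themselves do not say which variable belongs on the left-hand side, and either regression can be computed.

```python
import math
import random
from statistics import mean


def slope_and_correlation(x, y):
    """Least-squares slope of y on x, and the Pearson correlation of x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


rng = random.Random(2)
x = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
y = [2.0 * a + rng.gauss(0.0, 1.0) for a in x]   # hypothetical data-generating process

slope_y_on_x, r_xy = slope_and_correlation(x, y)  # y treated as the "dependent" variable
slope_x_on_y, r_yx = slope_and_correlation(y, x)  # roles of the two variables swapped

print(f"correlation: {r_xy:.3f} either way")
print(f"slope of y on x: {slope_y_on_x:.3f}  slope of x on y: {slope_x_on_y:.3f}")
```

The correlation is the same in both directions, but the two slopes differ, and their product equals the squared correlation. Nothing in the arithmetic marks one regression as the causal one; that asymmetry has to come from outside the data, for example from knowing which variable was manipulated.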
Thus, those using the conditional independence framework for observational studies must do two things. First, they must identify a variable X that has been or could be manipulated to affect some outcome Y. The variable X is the putative cause and Y its effect. Then they must identify a set of covariates Z which can be used to adjust the Y values so that the impact of X in the closest possible world can be evaluated. In short, they must employ lessons from both the manipulation and counterfactual theories of causality.

Conclusion: Causality and Explanation

Wesley Salmon ends his review of “Four Decades of Scientific Explanation” (1990) with a chapter entitled “Peaceful Coexistence?” He finds that explanation has made a comeback after the logical positivists and logical empiricists had written it off as humbug and metaphysics. There are two major approaches to explanation. One is the unification approach that seeks general laws and a reduction in the number of independent assumptions needed to explain what goes on in the world. Salmon calls this a “top-down” approach which has close kinship with the neo-Humean theories of causality. The second approach builds explanations from the “bottom-up” analysis of causation for singular events and the investigation of causal mechanisms. Although these two approaches have fought with one another, Salmon considers them to be complementary aspects of scientific understanding. Sometimes, it turns out, we explain things best by appealing to a very general principle or law, but at other times, we explain them best when we appeal to specific events and mechanisms. There is no need to enshrine one approach over the other. Both have their uses.

Bibliography [Incomplete]

Abdullah, Dewan A. and Peter C. Rangazas, “Money and the Business Cycle: Another Look (in Notes),” The Review of Economics and Statistics, Vol. 70, No. 4 (Nov., 1988), pp. 680-685.
Achen, Christopher H., “Toward Theories of Data: The State of Political Methodology,” in Political Science: The State of the Discipline, Ada Finifter (editor), Washington, D.C.: American Political Science Association, 1983.
Angrist, Joshua D., “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records,” The American Economic Review, Vol. 80, No. 3 (Jun., 1990), pp. 313-336.
Bartels, Larry M., Presidential Primaries and the Dynamics of Public Choice, Princeton, N.J.: Princeton University Press, 1988.
Bartels, Larry and Henry E. Brady, “The State of Quantitative Political Methodology,” in Political Science: The State of the Discipline, 2nd Edition, Ada Finifter (editor), Washington, D.C.: American Political Science Association, 1993.
Beauchamp, Tom L. and Alexander Rosenberg, Hume and the Problem of Causation, New York: Oxford University Press, 1981.
Bennett, Jonathan, Events and Their Names, Indianapolis: Hackett Publishing Company, 1984.
Brady, Henry E., “Knowledge, Strategy and Momentum in Presidential Primaries,” in Political Analysis, John Freeman (editor), Ann Arbor: University of Michigan Press, 1996.
Brady, Henry E., Michael C. Herron, Walter R. Mebane, Jasjeet Singh Sekhon, Kenneth W. Shotts, and Jonathan Wand, “Law and Data: The Butterfly Ballot Episode,” PS: Political Science & Politics, Vol. 34, No. 1 (2001), pp. 59-69.
Brady, Henry E., Mary H. Sprague, Fredric C. Gey and Michael Wiseman, “The Interaction of Welfare-Use and Employment Dynamics in Rural and Agricultural California Counties,” 2000.
California Work Pays Demonstration Project: County Welfare Administrative Data, Public Use Version 4.1, Codebook, Berkeley, California: UC DATA Archive and Technical Assistance, 2001.
Campbell, Donald T. and Julian C. Stanley, Experimental and Quasi-Experimental Designs for Research, Chicago: Rand McNally, 1966.
Card, David and Alan B. Krueger, “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania,” The American Economic Review, Vol. 84, No. 4 (Sep., 1994), pp. 772-793.
Cartwright, Nancy, Nature's Capacities and Their Measurement, New York: Oxford University Press, 1989.
Chatfield, Chris, “Model Uncertainty, Data Mining and Statistical Inference,” Journal of the Royal Statistical Society, Series A (Statistics in Society), Vol. 158, No. 3 (1995), pp. 419-466.
Cook, Thomas D. and Donald T. Campbell, Quasi-Experimentation: Design & Analysis Issues for Field Settings, Boston: Houghton Mifflin Company, 1979.
Cook, Thomas D. and Donald T. Campbell, “The Causal Assumptions of Quasi-Experimental Practice,” Synthese, Vol. 68 (1986), pp. 141-180.
Cox, David Roxbee, “Causality: Some Statistical Aspects,” Journal of the Royal Statistical Society, Series A (Statistics in Society), Vol. 155, No. 2 (1992), pp. 291-301.
Cox, Gary W., Making Votes Count: Strategic Coordination in the World's Electoral Systems, New York: Cambridge University Press, 1997.
Dessler, David, “Beyond Correlations: Toward a Causal Theory of War,” International Studies Quarterly, Vol. 35, No. 3 (Sep., 1991), pp. 337-355.
Ehring, Douglas, Causation and Persistence: A Theory of Causation, New York: Oxford University Press, 1997.
Elster, Jon, “A Plea for Mechanisms,” in Social Mechanisms, Peter Hedstrom and Richard Swedberg (editors), Cambridge: Cambridge University Press, 1998.
Fearon, James D., “Counterfactuals and Hypothesis Testing in Political Science,” World Politics, Vol. 43, No. 2 (Jan., 1991), pp. 169-195.
Ferber, Robert and Werner Z. Hirsch, “Social Experimentation and Economic Policy: A Survey,” Journal of Economic Literature, Vol. 16, No. 4 (Dec., 1978), pp. 1379-1414.
Firebaugh, Glenn and Kevin Chen, “Vote Turnout of Nineteenth Amendment Women: The Enduring Effect of Disenfranchisement,” American Journal of Sociology, Vol. 100, No. 4 (Jan., 1995), pp. 972-996.
Fisher, Ronald Aylmer, Sir, The Design of Experiments, Edinburgh, London: Oliver and Boyd, 1935.
Fraker, Thomas and Rebecca Maynard, “The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs (in Symposium on the Econometric Evaluation of Manpower Training Programs),” The Journal of Human Resources, Vol. 22, No. 2 (1987), pp. 194-227.
Franke, Richard Herbert and James D. Kaul, “The Hawthorne Experiments: First Statistical Interpretation,” American Sociological Review, Vol. 43, No. 5 (1978), pp. 623-643.
Freedman, David A., “Statistical Models and Shoe Leather,” Sociological Methodology, Vol. 21 (1991), pp. 291-313.
Freedman, David A., “As Others See Us: A Case Study in Path Analysis,” Journal of Educational Statistics, Vol. 12, No. 2 (1987), pp. 101-223, with discussion.
Freedman, David A., “From Association to Causation via Regression,” in V. R. McKim and S. P. Turner (editors), Causality in Crisis?, Notre Dame, IN: University of Notre Dame Press, 1997, pp. 113-161.
Freedman, David A., “From Association to Causation: Some Remarks on the History of Statistics,” Statistical Science, Vol. 14 (1999), pp. 243-58.
Galison, Peter Louis, How Experiments End, Chicago: University of Chicago Press, 1987.
Gasking, Douglas, “Causation and Recipes,” Mind, New Series, Vol. 64, No. 256 (Oct., 1955), pp. 479-487.
Glennan, Stuart S., “Mechanisms and the Nature of Causation,” Erkenntnis, Vol. 44 (1996), pp. 49-71.
Goldthorpe, John H., “Causation, Statistics, and Sociology,” European Sociological Review, Vol. 17, No. 1 (2001), pp. 1-20.
Goodman, Nelson, “The Problem of Counterfactual Conditionals,” Journal of Philosophy, Vol. 44, No. 5 (Feb., 1947), pp. 113-128.
Greene, William H., Econometric Analysis, Upper Saddle River, New Jersey: Prentice-Hall, 1997.
Harre, Rom and Edward H. Madden, Causal Powers: A Theory of Natural Necessity, Oxford: B. Blackwell, 1975.
Hausman, Daniel M., Causal Asymmetries, Cambridge, U.K.; New York: Cambridge University Press, 1998.
Heckman, James J., “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, No. 1 (Jan., 1979), pp. 153-162.
Heckman, James J., “Randomization and Social Policy Evaluation,” in Evaluating Welfare and Training Programs, Charles F. Manski and Irwin Garfinkel (editors), Cambridge, MA: Harvard University Press, 1992.
Heckman, James J. and V. Joseph Hotz, “Choosing Among Alternative Non-experimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training: Rejoinder (in Applications and Case Studies),” Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 878-880.
Heckman, James and Richard Robb, “Alternative Methods for Evaluating the Impact of Interventions,” in Longitudinal Analysis of Labor Market Data, James Heckman and Burton Singer (editors), New York: Wiley, 1995.
Heckman, James J. and Jeffrey A. Smith, “Assessing the Case for Social Experiments,” The Journal of Economic Perspectives, Vol. 9, No. 2 (Spring 1995), pp. 85-110.
Hedstrom, Peter and Richard Swedberg (editors), Social Mechanisms: An Analytical Approach to Social Theory, New York: Cambridge University Press, 1998.
Hempel, Carl G., Aspects of Scientific Explanation, New York: Free Press, 1965.
Hill, A. Bradford, “The Environment and Disease: Association or Causation?,” Proceedings of the Royal Society of Medicine, Vol. 58 (1965), pp. 295-300.
Holland, Paul W., “Statistics and Causal Inference (in Theory and Methods),” Journal of the American Statistical Association, Vol. 81, No. 396 (Dec., 1986), pp. 945-960.
Holland, Paul W., “Causal Inference, Path Analysis, and Recursive Structural Equations,” Sociological Methodology, Vol. 18 (1988), pp. 449-484.
Holland, Paul W. and Donald B. Rubin, “Causal Inference in Retrospective Studies,” Evaluation Review, Vol. 12 (1988), pp. 203-231.
Hotz, V. Joseph, Guido W. Imbens and Jacob A. Klerman, “The Long-Term Gains from GAIN: A Re-Analysis of the Impacts of the California GAIN Program,” September 2001.
Hotz, V. Joseph, Charles H. Mullin, and Seth G. Sanders, “Bounding Causal Effects Using Data From a Contaminated Natural Experiment: Analysing the Effects of Teenage Childbearing,” The Review of Economic Studies, Vol. 64, No. 4, Special Issue: Evaluation of Training and Other Social Programmes (Oct., 1997), pp. 575-603.
Hume, David, A Treatise of Human Nature (1739), edited by L. A. Selby-Bigge and P. H. Nidditch, Oxford: Clarendon Press, 1978.
Jenkins, Jeffery A., “Examining the Bonding Effects of Party: A Comparative Analysis of Roll-Call Voting in the U.S. and Confederate Houses,” American Journal of Political Science, Vol. 43, No. 4 (Oct., 1999), pp. 1144-1165.
Jones, Stephen R. G., “Was There a Hawthorne Effect?,” American Journal of Sociology, Vol. 98, No. 3 (1992), pp. 451-468.
Judge, George G., R. Carter Hill, William E. Griffiths, Helmut Lütkepohl and Tsoung-Chao Lee, Introduction to the Theory and Practice of Econometrics, New York: John Wiley and Sons, 1988.
Lakoff, George and Mark Johnson, “Conceptual Metaphor in Everyday Language,” The Journal of Philosophy, Vol. 77, No. 8 (Aug., 1980), pp. 453-486 (1980a).
Lakoff, George and Mark Johnson, Metaphors We Live By, Chicago: University of Chicago Press, 1980 (1980b).
Lakoff, George and Mark Johnson, Philosophy in the Flesh: The Embodied Mind and its Challenge to Western Thought, New York: Basic Books, 1999.
LaLonde, Robert J., “Evaluating the Econometric Evaluations of Training Programs with Experimental Data,” The American Economic Review, Vol. 76, No. 4 (Sep., 1986), pp. 604-620.
Lewis, David, Counterfactuals, Cambridge: Harvard University Press, 1973 (1973a).
Lewis, David, “Causation,” Journal of Philosophy, Vol. 70, No. 17 (Oct., 1973), pp. 556-567 (1973b).
Lewis, David, “Counterfactual Dependence and Time's Arrow,” Noûs, Vol. 13, No. 4, Special Issue on Counterfactuals and Laws (Nov., 1979), pp. 455-476.
Lewis, David, Philosophical Papers, Vol. 2, New York: Oxford University Press, 1986.
Lichbach, Mark Irving, The Rebel's Dilemma, Ann Arbor: University of Michigan Press, 1995.
Lichbach, Mark Irving, The Cooperator's Dilemma, Ann Arbor: University of Michigan Press, 1996.
Lijphart, Arend, Electoral Systems and Party Systems: A Study of Twenty-seven Democracies, 1945-1990, Oxford; New York: Oxford University Press, 1994.
Machamer, Peter, Lindley Darden, and Carl F. Craver, “Thinking about Mechanisms,” Philosophy of Science, Vol. 67, No. 1 (2000), pp. 1-25.
Mackie, John L., “Causes and Conditions,” American Philosophical Quarterly, Vol. 2, No. 4 (1965), pp. 245-64.
Manski, Charles F., “Identification Problems in the Social Sciences,” Sociological Methodology, Vol. 23 (1993), pp. 1-56.
Manski, Charles F., Identification Problems in the Social Sciences, Cambridge, Mass.: Harvard University Press, 1995.
Marini, Margaret Mooney, and Burton Singer, “Causality in the Social Sciences,” Sociological Methodology, Vol. 18 (1988), pp. 347-409.
Mauldon, Jane, Jan Malvin, Jon Stiles, Nancy Nicosia and Eva Seto, “Impact of California’s Cal-Learn Demonstration Project: Final Report,” UC DATA Archive and Technical Assistance, 2000.
Mellor, D. H., The Facts of Causation, London: Routledge, 1995.
Menzies, Peter and Huw Price, “Causation as a Secondary Quality,” British Journal for the Philosophy of Science, Vol. 44, No. 2 (1993), pp. 187-203.
Metrick, Andrew, “Natural Experiment in "Jeopardy!",” The American Economic Review, Vol. 85, No. 1 (Mar., 1995), pp. 240-253.
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin, “Workers' Compensation and Injury Duration: Evidence from a Natural Experiment,” The American Economic Review, Vol. 85, No. 3 (Jun., 1995), pp. 322-340.
Mill, John Stuart, A System of Logic, Ratiocinative and Inductive, 8th Edition, New York: Harper and Brothers, 1988.
Pearson, Karl, The Grammar of Science, 3rd Edition, Revised and Enlarged, Part 1 – Physical, London: Adam and Charles Black, 1911.
Pindyck, Robert S. and Daniel L. Rubinfeld, Econometric Models and Economic Forecasts, New York: McGraw-Hill, 1991.
Ragin, Charles C., The Comparative Method: Moving beyond Qualitative and Quantitative Strategies, Berkeley: University of California Press, 1987.
Riccio, James et al., “GAIN: Benefits, Costs, and Three-Year Impacts of a Welfare-to-Work Program,” Manpower Demonstration Research Corporation, 1994.
Rosenbaum, Paul R. and Donald B. Rubin, “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, Vol. 70, No. 1 (Apr., 1983), pp. 41-55.
Rosenzweig, Mark R. and Kenneth I. Wolpin, “Testing the Quantity-Quality Fertility Model: The Use of Twins as a Natural Experiment,” Econometrica, Vol. 48, No. 1 (Jan., 1980), pp. 227-240.
Rubin, Donald B., “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, Vol. 66, No. 5 (1974), pp. 688-701.
Rubin, Donald B., “Bayesian Inference for Causal Effects: The Role of Randomization,” Annals of Statistics, Vol. 6, No. 1 (Jan., 1978), pp. 34-58.
Rubin, Donald B., “Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies,” Statistical Science, Vol. 5, No. 4 (1990), pp. 472-480.
Russell, Bertrand, “On the Notion of Cause,” in Mysticism and Logic and Other Essays, New York: Longmans, Green and Co., 1918.
Salmon, Wesley C., Four Decades of Scientific Explanation, Minneapolis: University of Minnesota Press, 1989.
Sheffrin, Steven M., Rational Expectations, Cambridge [Cambridgeshire]; New York: Cambridge University Press, 1983.
Simon, Herbert A., “On the Definition of the Causal Relation,” The Journal of Philosophy, Vol. 49, No. 16 (Jul., 1952), pp. 517-528.
Simon, Herbert A. and Yumi Iwasaki, “Causal Ordering, Comparative Statics, and Near Decomposability,” Journal of Econometrics, Vol. 39 (1988), pp. 149-173.
Sobel, Michael E., “Causal Inference in the Social and Behavioral Sciences,” in Handbook of Statistical Modeling for the Social and Behavioral Sciences, Gerhard Arminger, Clifford C. Clogg, and Michael E. Sobel (editors), New York: Plenum Press, 1995.
Sorenson, Aage B., “Theoretical Mechanisms and the Empirical Study of Social Processes,” in Social Mechanisms, Peter Hedstrom and Richard Swedberg (editors), Cambridge: Cambridge University Press, 1998.
Sosa, Ernest and Michael Tooley (editors), Causation, Oxford; New York: Oxford University Press, 1993.
Sprinzak, Ehud, “Weber's Thesis as an Historical Explanation,” History and Theory, Vol. 11, No. 3 (1972), pp. 294-320.
Tetlock, Philip E. and Aaron Belkin (editors), Counterfactual Thought Experiments in World Politics: Logical, Methodological, and Psychological Perspectives, Princeton, N.J.: Princeton University Press, 1996.
von Wright, Georg Henrik, Explanation and Understanding, Ithaca, N.Y.: Cornell University Press, 1971.
von Wright, Georg Henrik, Causality and Determinism, New York: Columbia University Press, 1974.
Wand, Jonathan N., Kenneth W. Shotts, Jasjeet S. Sekhon, Walter R. Mebane, Michael C. Herron, and Henry E. Brady, “The Butterfly Did It: The Aberrant Vote for Buchanan in Palm Beach County, Florida,” American Political Science Review, Vol. 95, No. 4 (Dec., 2001), pp. 793-810.
Wawro, Geoffrey, The Austro-Prussian War: Austria's War with Prussia and Italy in 1866, New York: Cambridge University Press, 1996.
Weber, Max, Selections in Translation, W. G. Runciman (editor) and E. Matthews (translator), Cambridge: Cambridge University Press, 1906 [1978].
Appendix 1 -- Causal Independence Among the Causes of a Given Effect

Lewis claims that when C causes E but not the reverse “then it should be possible to claim the falsity of the counterfactual ‘If E did not occur, then C would not occur.’” This counterfactual is different from “if C occurs then E occurs” and from “if C does not occur then E does not occur” which Lewis believes must both be true when C causes E. The required falsity of ‘If E did not occur, then C would not occur’ adds a third condition for causality. The intuition for this third requirement is that causes produce their effects but not vice-versa. Consequently, nullifying a cause should nullify an effect because the effect cannot be produced without the cause, but nullifying an effect should not nullify the cause because the presence or absence of an effect can have no determinant impact on one of its causes. This third counterfactual requirement also helps to rule out common causes.

Lewis presents a complicated rationale for why the counterfactual “If E did not occur, then C would not occur” would be false in what he considers the closest possible world to the world in which C truly causes E.72 Our reading of Lewis’s articles leads us to conclude that his theory is contrived and somewhat outlandish.73 Moreover, in an experiment where the cause is effective, neither C nor E occurs in the control condition, and it appears as if we can assert the truth, from our empirical observations, of the proposition that “if E does not occur, then C does not occur.” Thus, it seems as if the counterfactual theory and experimental observations could lead to the conclusion that E causes C as well as the reverse. What can be done? One avenue might be to abandon the counterfactual theory, but there are two other paths that might lead us out of this unhappy result.
One path is to assert that of all the possible worlds in which the effect E does not occur, the control group should not be considered the closest possible world to the treatment group.74 This approach, however, seems foolish. Of all the worlds in which E does not occur, the only additional way that the control group differs from the treatment group is that C does not occur. Hence, it does not seem possible that we can find any world in which E does not occur that is any closer to the treatment group.

72 His arguments are in a series of papers (1979; 1986) which have been criticized by several authors (e.g., Bennett, 1984; Hausman, 1999, Chapter 6).

73 Among other things, Lewis requires “miracles” in his possible worlds which might make sense for a metaphysical (ontological) definition of causation, but miracles seem well beyond the capacity of most social scientists who want a practical method for discovering causation in their everyday work.

74 Note that we have no quarrel with the notion that the closest possible world to the treatment group in which the cause does not occur is the control group, but we are questioning whether the closest possible world to the treatment group in which the effect does not occur is the control group. There is no a priori reason why the control group should be the closest to the treatment group in both circumstances.

Another way the counterfactual might be false is if there is another possible world, just as close to the treatment group as the control group, in which E does not occur, but C does occur. A moment’s reflection suggests that in this world there must be another factor, call it W, that is necessary (actually INUS)75 for the effect, and the absence of this factor must prevent E from occurring. Thus, if the cause is a short-circuit (C) and the effect (E) is the building burning down, then W might be the requirement that the building is fabricated of wood.
If a short-circuit occurs, but the building is not fabricated of wood (say it is made of brick),76 then the building will not burn down. That is, “if E does not occur, then C can still occur because W did not occur”. Consequently there are now two possible worlds in which the building does not burn down that are equally close to the treatment world. In one world, the building does not burn down, the short-circuit occurs, and the building is made of brick. Thus we have (not E, C, not W). In the original control group world, the building does not burn down, the short-circuit does not occur, and the building is made of wood. Thus we have (not E, not C, and W). These two worlds can be considered equally close to the treatment world in which the building burns down, there is a short-circuit, and the building is made of wood (E, C, and W) because they differ in exactly the same number of ways.77 Thus, it is not necessarily true that “if E did not occur, then C would not occur” because it is also possible that “if E did not occur, then W would not occur.” A counterfactual that is true is that “if E did not occur, then either C would not occur or W would not occur.”

The only requirement for this result is that there must be more than one INUS cause for E. There must be at least two different causes of a common effect which can occur independently of one another. Short-circuits, for example, can occur outside wooden buildings and wooden buildings can occur without short-circuits. The existence of each is independent of the other. Hausman (1999, pages 64-70) makes a strong argument that this “independence of causes” will always be true in any case of interest.

To determine the direction of causation, we can perform an experiment in which we construct three possible worlds. We can assign one group C and W, another W and not C, and still another C and not W. For completeness, we might also want to assign one group neither C nor W, but this is not necessary for the argument.
The result will be the pattern of four possible worlds depicted in Table A-1. The entries in the table are the expected effects for each world given the true (deterministic) causal relationships. Note that the entries in this table are symmetrical in C and W (interchanging these two variables leads to the same entries in the table), but they are not symmetrical in C and E or W and E. From an observational perspective, this asymmetry is the reason that the data can support the claim that C (or W) causes E. Specifically, we observe from three entries in the table (all but the bottom right-hand corner) that “if the building does not burn down (not E), then either the short circuit did not occur (not C) or the building was not made of wood (not W).” But the counterfactual conditional “if E does not occur, then C does not occur” is not true, because the building could be saved from burning down by being made of brick – by W not occurring (see the top right-hand entry in the table). Hence, we can obtain the falsity of the counterfactual conditional that we need to show that C (and W for that matter) are causes of the building burning down. That is, the pattern of entries in Table A-1 supports the claim that W and C together cause E.

75 Remember that INUS is really nothing more than the statement that W could cause E in conjunction with the right set of things but that an entirely different set of things (not including W) could also cause E.

76 We assume that buildings can only be wood or brick so that “not W” means a brick building.

77 Without any clear-cut notion of how to define similarity, it seems reasonable to conclude that two possible worlds differing in “exactly the same number of ways” from a third are equally close to the third world.
Table A-1 – Four Closest Possible Worlds

                            TREATMENTS
                            No short circuit (no C)    Short circuit (C)
Brick building (not W)      No fire (not E)            No fire (not E)
Wood building (W)           No fire (not E)            Fire (E)

It is imperative, however, to remember that this causal claim rests heavily on the fact that the four possible worlds in Table A-1 are as close as possible to one another given their differences in C and W. A similar table with the same pattern of entries from observational data cannot be used to make the same inference because the entries might be based upon quite different situations which are not close to one another. Consider, for example, a universe different from ours in which short-circuits never cause fires even when the short-circuit occurs next to a piece of wood because short-circuits do not have the capacity to start fires. In this universe, we could still get the pattern in Table A-1 from observational data if wooden buildings burned when they were hit by lightning and if, by sheer happenstance, only wooden buildings with short-circuits were hit by lightning. In statistical parlance, lightning would be a confounder that is correlated with short-circuits so that it appears that short-circuits cause wooden buildings to burn down, when in truth it is the lightning that does the work. Short-circuits and fires are the joint effects of a common cause which is lightning. In these observational data, the cases in the bottom right-hand corner where fires occur in wooden buildings with short-circuits are not the closest possible world to the cases in the bottom left-hand corner where fires do not occur in wooden buildings without short-circuits because lightning strikes the buildings in the first set of cases but not in the second set of cases. The two sets of cases would be closer if lightning struck both (in which case both would burn down) or it struck neither (in which case neither would burn down).
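The logic of Table A-1 can be checked mechanically. The following Python sketch is an illustration added to the argument, not part of it. It assumes the deterministic rule behind the table, namely that a fire occurs only when both a short circuit (C) and a wooden building (W) are present; the function name `fire` and the variable names are hypothetical. It enumerates the four worlds and evaluates the conditionals discussed in connection with the table.

```python
from itertools import product


def fire(short_circuit: bool, wood: bool) -> bool:
    """Deterministic rule assumed for Table A-1: a fire needs both C and W."""
    return short_circuit and wood


# The four possible worlds of Table A-1 as (C, W, E) triples.
worlds = [(c, w, fire(c, w)) for c, w in product([False, True], repeat=2)]

# The worlds in which the effect does not occur (not E).
no_fire = [(c, w) for c, w, e in worlds if not e]

# "If E does not occur, then C does not occur."
# This fails in the brick building that has a short circuit.
not_e_implies_not_c = all(not c for c, w in no_fire)

# "If E does not occur, then either C does not occur or W does not occur."
not_e_implies_not_c_or_not_w = all((not c) or (not w) for c, w in no_fire)

# "If C does not occur, then E does not occur."
not_c_implies_not_e = all(not e for c, w, e in worlds if not c)

print(not_e_implies_not_c)           # False
print(not_e_implies_not_c_or_not_w)  # True
print(not_c_implies_not_e)           # True
```

The first conditional is false only because of the world (not E, C, not W), which is exactly the asymmetry between cause and effect that the table is meant to display. The sketch says nothing about whether observed cases are close possible worlds; it presupposes that they are.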
This method of finding additional causes can also help to rule out the possibility that a causal connection between two events results from a common cause for both of them, although ruling out common causes can be a tricky business. Consider, for example, an experiment in which randomly assigned special tutoring causes a rise in self-esteem and then an increase in test scores, but the increase in self-esteem does not cause the increase in test scores. Rather it is the substantive content of the special tutoring that causes the rise in test scores. Self-esteem might be incorrectly considered the cause of the increased test scores in this situation because self-esteem is randomly assigned and it precedes and is associated with the rise in test scores. Two problems must be sorted out. One is the possibility that the treatment (which is special tutoring with substantive content) is misnamed “increases in self-esteem.” The other is the failure to identify the common cause. If the first mistake has been committed, then subsequent manipulations of special tutoring thought of as manipulations of self-esteem will perpetuate the mistake. But if attempts are made, as they should be, to look for other possible “independent” causes of increases in test scores, such as the substantive content of the special tutoring, and to manipulate the hypothesized causes (self-esteem and substantive content) separately, then it will become clear that even the second of Lewis’s conditions does not hold for self-esteem, properly described. Although self-esteem is associated with higher test scores in the initial situation, when we construct an experimental condition in which the increases in self-esteem are counteracted but the substantive content of the special tutoring remains, then test scores will still be high. Hence, self-esteem cannot be responsible for the higher test scores.
Note that this experimental condition (in which only self-esteem is manipulated) is actually a closer possible world to the initial situation than the experiment in which self-esteem is manipulated through the presence or absence of special tutoring because the manipulation of special tutoring actually changes (at least) two aspects of the world, the self-esteem of the subjects and the substantive content to which they have been exposed. Thus, looking for independent causes is one way in which researchers can try to get closer and closer possible worlds. Perhaps the most important lesson from this discussion is that the introduction of a new factor W in the experimental situation leads towards identifying the exact factors that cause the effect and the mechanism whereby the effect occurs. It turns out that it is not just short-circuits that cause buildings to burn down. Rather, the interaction of a short-circuit with a wooden building causes fires. Once this is established, we can investigate the mechanism in more detail by considering other sources of sparks and other sources of flammable material. Similarly, it turns out that it is not self-esteem that causes grades to rise; rather, it is the substantive content of the special tutoring. More generally, this logic can be applied to any situation. In the butterfly ballot case, if we find that less well-educated people were more likely to make mistakes (which was true), then we can see whether less well-educated people make other kinds of mistakes with voting equipment such as “overvoting” by inadvertently voting twice for the same office.78 Or we can check to see whether the mistakes that were made were really due to the butterfly ballot itself or to some other feature of the Palm Beach County election administration of which the ballot was just another effect. The counterfactual theory of causation, therefore, leads us towards introducing other possible causes and considering mechanisms that tie causes together.
78 See Brady et al (2001b).

Studying the Causes of Human Variability: The Role of Conditional Independence

Good social science research must rule out alternative cause and effect relationships.1 This requires finding a way to move from association to causation. This step is a very big one. For example, informal observation suggests that those countries using proportional representation tend to have more political parties than those using the familiar American system of single-member plurality voting districts. But it is wrong to move from this observation to the claim that proportional representation (PR) leads to more political parties than plurality voting systems if, as seems to be the case, PR systems are more common in those nations that include numerous powerful groups that want political parties devoted to their needs. In these circumstances, the powerful groups may have persuaded their governments to choose PR over plurality systems because they thought it would help them form political parties. A tangle of causal possibilities follows from the observed associations in this case. It is possible that PR causes more parties to form, but it is also possible that powerful groups are the major causal factor. Without PR, these groups might still spawn many parties in which case the association of the number of parties with PR is completely spurious. Alternatively, without PR, the groups might be thwarted in their attempts to create political parties which they could form if PR were used. In this case, PR interacts with powerful groups to produce many flourishing parties. Whatever the truth, it is clearly hard to figure out from the pattern of associations. The problem here, as in a great deal of social science research, is that the requirement for good causal inferences, called conditional independence by statisticians and the specification assumption by econometricians, is not met.
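The completely spurious case, in which powerful groups alone produce many parties, can be made concrete with a small calculation. The Python sketch below is an illustration only: every probability in it is hypothetical and chosen for convenience, and the function names are ours. It assumes that nations with numerous powerful groups adopt PR more often, and that the chance of having many parties depends on the groups but not on PR. It then compares the raw association between PR and many parties with the association that remains once the presence of powerful groups is held fixed.

```python
# Hypothetical probabilities, chosen only for illustration.
P_GROUPS = 0.5                              # share of nations with numerous powerful groups
P_PR_GIVEN_G = {True: 0.8, False: 0.2}      # such nations adopt PR more often
P_MANY_GIVEN_G = {True: 0.9, False: 0.1}    # many parties depend on groups only, not on PR


def joint(groups: bool, pr: bool, many: bool) -> float:
    """Joint probability of (groups, PR, many parties) under the assumptions above."""
    p_g = P_GROUPS if groups else 1 - P_GROUPS
    p_pr = P_PR_GIVEN_G[groups] if pr else 1 - P_PR_GIVEN_G[groups]
    p_m = P_MANY_GIVEN_G[groups] if many else 1 - P_MANY_GIVEN_G[groups]
    return p_g * p_pr * p_m


def p_many(pr: bool, groups=None) -> float:
    """P(many parties | PR = pr), optionally also conditioning on groups."""
    gs = [True, False] if groups is None else [groups]
    num = sum(joint(g, pr, True) for g in gs)
    den = sum(joint(g, pr, m) for g in gs for m in (True, False))
    return num / den


# Raw association: PR nations versus plurality nations.
raw_gap = p_many(True) - p_many(False)

# Association after holding the confounder fixed.
gap_with_groups = p_many(True, groups=True) - p_many(False, groups=True)
gap_without_groups = p_many(True, groups=False) - p_many(False, groups=False)

print(f"{raw_gap:.2f}")                  # 0.48
print(f"{abs(gap_with_groups):.2f}")     # 0.00
print(f"{abs(gap_without_groups):.2f}")  # 0.00
```

Under these assumed numbers PR nations are far more likely to have many parties, yet the gap vanishes within each stratum of the confounder. The calculation works only because the confounder is known and measured, which is precisely what cannot be taken for granted in observational work.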
The putative cause (proportional representation in this case) is correlated with another factor (numerous powerful political groups) that might explain the effect (the number of political parties) and the other factor is not controlled in a way that allows the researcher to rule it out as the true cause of a multiplicity of parties. Conditional independence might be satisfied if the confounding factor is controlled, but even if many confounding factors are identified, controlled, and ruled out, there still may be others that have not been identified that can derail an inference. Satisfying the requirement for conditional independence is the Achilles heel of all social science inference because it is so very hard to do. This chapter shows why this is so.

1 Of course it also requires ensuring the representativeness of the phenomena under consideration, conceptualizing and measuring the variables whose relationships are being studied, and guaranteeing, through statistical significance testing, that putative relationships are not solely due to chance (Campbell and Stanley, 1966). All four tasks are important parts of the scientific enterprise, but there is something especially futile about a representative study of the causes of, say, revolutions that pays careful attention to conceptualization and measurement and that dutifully reports statistical significance, but which gets the causes wrong because it failed to rule out obvious alternative explanations.

The fundamental problem for researchers who want to rule out alternative explanations is the extent of human variability and the difficulty of controlling that variation. In this chapter, we do four things towards understanding the ways that variability can be controlled. First we describe the experimental approaches, dating from the 17th century, used by physical scientists to develop and test new laws.
We show that these scientists were well aware of the need for control and for minimizing errors, but they dealt with a world in which statistical variation was not a great problem. Nevertheless, one of the major methods for analyzing social science data, ordinary least squares, was first developed to deal with the bothersome but relatively innocuous measurement error in astronomical and physical data. Second, we show how the recognition of statistical variability – a notion that went significantly beyond the idea of errors – by social and biological scientists in the 19th century posed a conceptual problem that required new ways of doing science. We show how 19th century statisticians found ways to describe, to summarize, and to understand this variability. Third, we discuss how randomized experiments provide a method for developing reliable law-like statements about the social and biological world by providing a setting in which counterfactual possibilities can be clearly stated and in which these counterfactual outcomes can be tested while ruling out confounding factors. Experiments do this by ensuring that the outcome, conditional on the treatment, is independent of all other factors that affect the outcome of the experiment. We explain what this means and how experiments can provide evidence about counterfactuals. Fourth, we discuss the ways that observational studies fall short of randomized experiments by not automatically satisfying conditional independence. We consider what observational work must accomplish to develop reliable knowledge about the social and biological world.

Experiments Under Ideal Conditions – Physical Science Research

Until the middle of the 19th century, the prototype of scientific progress was the development of a physical law relating physical quantities such as velocity, force, pressure, temperature, or momentum.
The methods for establishing these laws were controlled experiments and observational studies such as Galileo’s early 17th century experiments with falling bodies and his observations of the moons of Jupiter. Both methods depended upon conceptual and theoretical perspectives that suggested specific hypotheses. The experimental method also depended upon these perspectives to determine the factors that had to be manipulated and controlled.

Boyle’s Gas Law – Robert Boyle’s (1627-1691) experiments in pneumatics in the middle of the 17th century, which led to the eponymous law relating the volume and pressure of a gas to one another, exemplify the ideal. Boyle was a careful experimenter, and his experiments were the culmination of investigations that established the existence of air pressure caused by the sea of air surrounding the earth. The fundamental instrument in these experiments, invented by Evangelista Torricelli (1608-1647), was a column of mercury – a barometer – to measure air pressure. Critics of Torricelli’s theory believed that air could not have enough weight to push up a column of mercury, and the experiments that led to Boyle’s law were the result of Franciscus Linus’s now fantastic, but then conceivable alternative hypothesis that “the space above the mercury column in a Torricellian tube contained an invisible membrane or cord which he called a funiculus” (Conant, 1957, page 50). This membrane, Linus claimed, could draw up a column of mercury to about 29 inches, and it explained the observed behavior of the mercury in the Torricelli tube. Boyle rejected the idea of the funiculus, but to provide the necessary proof he had to find a way to show that air has sufficient “weight and spring ... to perform such great matters as the counterpoising of a mercurial cylinder of 29 inches, as we teach that it may” (Boyle, cited in Conant, 1957, page 51).
Boyle developed a series of ingenious experiments to do this by varying the pressure exerted on a volume of air. By doing this, he showed that pressures other than about 29 inches of mercury were possible, thus dealing a blow to the theory of the funiculus. In the process, he noted that the volume (V) of air decreased in a regular way as the pressure (P) increased. Specifically, the pressure was related to the volume according to the following simple relationship where C is a constant:

(1) PV = C,

which can be rewritten as follows after taking logarithms and rearranging terms:

log(P) = log(C) - log(V).

If we let p = log(P), v = log(V), and c = log(C), then we get the following linear equation:

(2) p = c - v.

Figure 1 plots the measurements of pressure versus volume from Boyle’s experiments (Conant, 1957, page 53) along with an ordinary least squares (OLS) fit to them.2 OLS finds a line that best fits a scatter of points by minimizing the sum of the squared vertical errors. These errors are the difference between the observed pressure for a given volume and the value that would be predicted for that volume from the regression line. In Figure 1, the points are so close to the fitted line that it is hard to see any discrepancies, but they do exist. Although Boyle could have taken logarithms as we did, he could not have fit an OLS line to his data because OLS was developed in the 18th and perfected in the early 19th century (Stigler, 1986). The inspiration for OLS was to provide better fits to physical data in which the deviations were thought of as errors of measurement. A 19th century physicist would have readily used this method to smooth out the errors of measurement and to determine the slope of the line (-1) and the intercept (c) in equation (2).
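The step from equation (1) to the fitted line in equation (2) can be sketched in a few lines of Python. The sketch is illustrative only: the volumes and the constant C are arbitrary synthetic values generated exactly from PV = C, not Boyle’s measurements, and the helper `ols_line` is a hypothetical name for the textbook least squares formulas. With noise-free data the fitted slope should be -1 and the intercept log(C), up to rounding.

```python
import math


def ols_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic, noise-free values generated from PV = C (NOT Boyle's data).
C = 1400.0                          # arbitrary constant
volumes = [48, 40, 32, 24, 16, 12]  # arbitrary volume units
pressures = [C / V for V in volumes]

# Take logarithms as in equation (2): p = c - v.
v = [math.log(V) for V in volumes]
p = [math.log(P) for P in pressures]

intercept, slope = ols_line(v, p)
print(round(slope, 6))                         # -1.0
print(round(intercept - math.log(C), 6) == 0)  # True
```

With real measurements the points would scatter slightly about the line, and the same formulas would return a slope close to, but not exactly, minus one.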
In Figure 1, the slope of the line is -1.0015 with a standard error of 0.0017 so that its departure from the predicted value of minus one in equation (2) is insignificant. The fit is remarkably good, and the largest error is only about 1%. This strongly suggests that Boyle’s care in constructing and using his experimental device led to very little imprecision in the measurement of volume and pressure. But the error could have also had another cause that is discussed below.

2 It is reasonable to ask why we plot pressure versus volume instead of the reverse. For each of his trials, Boyle apparently chose a target volume and then varied the pressure until he hit that target. Berkson suggests in this case that volume is the independent variable. In any case, the reverse regression has a coefficient of -0.9984 with a standard error of 0.0017 which is essentially identical to the first one. For a more thorough discussion of the issues involved in this case, see Berkson, 1950.

Boyle’s result illustrates a number of features of controlled experiments. First, his work involves a working hypothesis (about the “weight and spring” of the air), and there is an alternative hypothesis (the funiculus). Second, the experiment is carefully controlled so that precise numerical measurements can be made. Third, although he makes no explicit statement about how temperature would affect the accuracy of his results, Boyle knew that heated air expands and cooled air contracts. To check for the effects of temperature, he warmed his volume of air with a candle and noted that there were only small changes in its volume.
Conant notes that “This fact must have assured Boyle that the minor fluctuations in the room temperature during the experiment in which he varied the height of the column of mercury would not affect the significance of his results” (Conant, 1957, page 56). Nevertheless, changes in room temperature undoubtedly did occur during his experimentation, and they could be (partly) responsible for the deviations from a perfect fit in Figure 1.

More Complete Gas Laws – In fact, the modern version of the gas law incorporates two other laws, Charles’ law on temperature and volume and Avogadro’s law regarding the amount of matter:

(3) PV = R N T

where R is the “gas constant,” N is the amount of matter (measured in moles, a specific number of molecules), and T is temperature in degrees Kelvin. Taking logarithms and setting r = log(R), n = log(N), and t = log(T), we get:

(4) p = r + n - v + t.

Thus, Boyle developed his law by omitting two important variables (n and t), but he was saved from making an incorrect inference for one of two reasons. Either the variables did not vary because Boyle controlled them, or their range of variation was very small compared to the range of variation in v and p. It is likely that n was controlled because he did a series of experiments with the same amount of matter, but t undoubtedly varied during the course of his experiment. In this, Boyle was helped by the fact that the amount of variation in t required to have an impact is very large, and by the fact that he did check qualitatively to see that this was true. It is worth contemplating, however, what might have happened if the range of temperature had been greater in his experiments. We shall return to this problem later. Experiments such as the one undertaken by Boyle were successful, then, because of precision in measurement and calculated (or lucky) efforts to control “confounding” variables such as temperature. But there is a still deeper reason for their success.
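What might have happened with a greater range of temperature can be explored with a short Python sketch based on equation (4). It is a hedged illustration, not a reconstruction of Boyle’s experiment: the volumes, temperatures, and amount of gas are invented, the helper names are hypothetical, and the sketch assumes, purely for the sake of argument, that the temperature drifts upward as the volume is reduced, so that the omitted variable t is correlated with v. The regression of p on v alone is then compared across three temperature scenarios.

```python
import math


def ols_slope(x, y):
    """Slope of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


R = 8.314                           # gas constant; the units do not matter here
N = 1.0                             # amount of matter, held fixed (controlled)
volumes = [48, 40, 32, 24, 16, 12]  # invented volumes, largest first


def slope_of_p_on_v(temperatures):
    """Regress p = log(P) on v = log(V) while ignoring temperature."""
    pressures = [R * N * T / V for T, V in zip(temperatures, volumes)]
    v = [math.log(V) for V in volumes]
    p = [math.log(P) for P in pressures]
    return ols_slope(v, p)


# Invented temperature scenarios in kelvins.
slope_controlled = slope_of_p_on_v([293] * 6)                        # t held fixed
slope_small_drift = slope_of_p_on_v([291, 292, 293, 294, 295, 296])  # a few degrees
slope_large_drift = slope_of_p_on_v([283, 293, 303, 313, 323, 333])  # fifty degrees

print(round(slope_controlled, 6))  # -1.0
print(slope_small_drift)           # close to -1
print(slope_large_drift)           # noticeably below -1
```

In this sketch a drift of a few degrees barely moves the slope, while a drift of fifty degrees that is correlated with volume pushes it visibly away from minus one. Temperature variation unrelated to volume would instead add scatter about the line without systematically shifting the slope.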
Boyle was lucky because he was studying a subject for which the individual parts of matter could be isolated and would behave the same as any other part of matter. Boyle did not have to worry that a sample of air from one location would differ from that in another location. He did not have to worry that air might have many different characteristics that would vary and confound his analysis. Indeed, it turns out that he did not even have to worry about what kind of gas he was studying because by Avogadro’s Law, equal volumes of all gases at the same temperature and pressure have the same number of particles. Gases are extraordinarily homogeneous substances once pressure and temperature are controlled. The behavior of liquids, for example, is much harder to understand because different liquids behave quite differently. In truth, even gases have very small differences in the forces between molecules (van der Waals forces) that lead to departures from Avogadro’s Law, but these were undetectable until 19th century improvements in the art of physical measurement.

Finding the Right Tools for Analyzing Data in the Social and Biological Sciences

Social and biological scientists, however, are faced with a more difficult situation as they try to explain human variation. Even the most elementary social or biological phenomena display extraordinary variability. Tall parents do not inevitably have tall children – common experience tells us that tall parents can have children of all sizes. Susceptibility to disease varies enormously, and not all people get sick even in the midst of a plague. Criminals are not necessarily poor, badly educated, from broken families, or marked by distinctive body types. New agricultural techniques do not lead to the same yield on all plots.
Even if one tried to hold every factor constant in these instances (the same nutrition for all children, the same sanitary conditions for those exposed to disease, the same social conditions for possible criminals, the same fertilizers, rainfall, and other conditions for agricultural plots), we would still find substantial variation in these phenomena. It does not seem that there is any way to produce deterministic laws in the social sciences such as the one obtained by Boyle. In fact, upon initial examination, the biological and social world seems to be idiosyncratic, anomalous, and hopelessly variable. Few, if any, law-like regularities are apparent. Only through a series of halting steps, which would require most of the 19th century, would a strategy be developed for dealing with this problem. These steps would lead researchers towards a recognition of stochastic laws. First there would be a recognition that there were societal regularities that could be described by mean tendencies. Second, for some important situations it would be recognized that deviations from these tendencies took a lawlike form. Third, correlational and regression methods would be developed for summarizing statistical laws that differed in important ways from the deterministic laws discovered by Boyle and other physical scientists.3

3 As well as consulting many of the original articles myself, I have relied heavily upon three books for this section of the paper: Stephen Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge: Harvard University Press, 1986 (Chapters 5, 8, 10); Theodore M. Porter, The Rise of Statistical Thinking: 1820-1900, Princeton: Princeton University Press, 1986 (Chapters 2, 4, 5, 9); and Gerd Gigerenzer, Zeno Swijtink, Theodore Porter, Lorraine Daston, John Beatty, and Lorenz Kruger, The Empire of Chance, Cambridge: Cambridge University Press, 1989 (Chapter 2).

Step One: Categorizing the Average Person – Adolphe Quetelet (1796-1874) and Francis Galton (1822-1911) would be the leaders in the first steps. Quetelet, born in Belgium and educated in France, spent most of his professional life in Brussels. He was an energetic arranger of statistical data, but his greatest contribution was his conception of social physics, partly derived from his early training in astronomy, which served as a metaphor for his belief in the essential lawfulness of social phenomena. His belief in these regularities was buttressed by the ever growing efforts of industrializing nation-states to collect data that revealed surprising regularities in births, marriages, deaths, crime, and many other phenomena. “Quetelet interpreted the regularity of crime as proof that statistical laws may be true when applied to groups even though they are false in relation to any particular individual” (Porter, 1986, page 51). Statistical regularity, “the law of large numbers,” became for him the fundamental axiom of social physics. He identified the average man, “l’homme moyen,” as the central concept in his two volume work published in 1835 entitled On Man and the Development of His Faculties, or an Essay on Social Physics.4 For Quetelet, there were many average individuals, one for each way of categorizing people. “It was the relationship between these average men that was the focus of Quetelet’s attention, their rates of development and their differences and similarities” (Stigler, 1986, page 171). The assumption that the average person differed from one category to another made it imperative that Quetelet have a way to define analytically distinct groups. Yet, for any categorization, people often varied as much within the category as between categories. Quetelet had to solve this problem of variability within groups, and he had to do so in a way that would provide some theoretical power. Solving the problem would not be easy.

4 Sur l’homme et le développement de ses facultés, ou essai de physique sociale, Paris: Bachelier.

The recognition of the average man provided a focal point for the comparison of groups, but it made deviations from that average within the group a more difficult problem for social and biological scientists. In physical science, phenomena could be described in essentially complete ways such that physical laws would completely determine their other characteristics. The same amount of any gas contained in the same volume and at the same temperature would exert the same pressure. A range of different pressures was not observed for a given volume and temperature. Stars observed on one night at a given time would appear in the exact same locations on the next night with due allowance for the earth’s movement. Except for measurement error, they did not appear to move randomly about these locations. One way to describe these results is to say that physical phenomena followed deterministic causal laws, but this description requires the additional baggage of determinism and causal explanation. It is far simpler to say that physical laws provided ways to describe homogeneous groups of phenomena such that once some facts were known about them, then others followed (almost) exactly. The same amount of any gas with a given volume and set temperature did not exhibit a range of pressures. Stars did not deviate from their predicted positions. In these cases, any deviations from the predicted values invariably seemed to be errors that could be eliminated through better measurement.5 And even if individual measurements could not be improved to eliminate deviations, statistical techniques such as ordinary least squares could be used to average out errors.
But it seemed difficult, if not impossible, to develop descriptions of social and biological phenomena that were homogeneous in any way. A description of parents’ heights did not perfectly predict children’s heights, and additional information about the parents and the family still left a large residual of unexplained variation. Extensive knowledge about the income and living conditions of people did not perfectly predict whether they would become criminals or get ill. Detailed measurements of past fertility and other characteristics of agricultural plots were not enough to reliably predict their future yield. All descriptions seemed inevitably to lead to heterogeneous values on other characteristics, and it did not seem that better measurements or better theories would inevitably solve this problem. There did not appear to be deterministic social or biological laws. What could be done in these circumstances? What kind of laws were appropriate for social and biological phenomena? Was there another way to define homogeneity that would lead to useful results? Although Quetelet’s first step showed that mean values could be used to describe groups categorized by different characteristics, further progress seemed to require a new conception of homogeneous groups. What could this conception be?

Quetelet’s second contribution was to suggest a way to do this that led to an interpretation for the normal distribution that made it more than a theory of errors. The normal error curve was well-known by the middle of the 19th century, dating back to the work of Abraham De Moivre (1667-1754) in the 1730's and the generalization to a “central limit theorem” in the work of Pierre Simon Laplace (1749-1827) in the 1770s.
De Moivre showed that the normal distribution was the limiting distribution of the binomial distribution that arose in games of chance, and Laplace showed how the normal distribution could be thought of as the result of a large number of independent and identical factors such as errors of measurement or small deviations in ideal conditions that would cause measurements to deviate from their true mean. The normal distribution produced by this central limit theorem was considered an ideal model for the numerous factors such as eye fatigue, lens distortions, weather conditions, and recording error that affected astronomical measurements. With this in mind, it was natural (although by no means simple) for Carl Friedrich Gauss (1777-1855) and Laplace to join this model to the method of least squares in the early part of the 19th century in order to improve the analysis of astronomical data. The same rationale justifies our use of least squares in fitting Boyle’s data in Figure 1.

5 Or in extreme cases, through new theories.

Quetelet, however, developed a novel use of the normal distribution. He decided that the normal distribution was an apt standard for judging the homogeneity of a category. “If a collection of variable measurements were in fact homogeneous (that is, susceptible to the same dominant causes, differing only in the more minor and random aspects that Quetelet would term accidental causes), then Laplace’s theorem would tell us to expect the observations to follow the normal law ... supposing the accidental causes sufficiently numerous” (Stigler, 1986, pages 203-5). In effect, he turned the central limit theorem on its head.
Instead of using it as a model of the way errors followed a normal distribution that justified their averaging, Quetelet used the central limit theorem as a way to judge the homogeneity or averageness of a group.6 He considered a group whose distribution of a trait followed a normal distribution to be homogeneous because the deviations within the group could have been produced by essentially random errors. An example demonstrates the reasonableness of Quetelet’s approach. The heights of men or women separately follow an approximately normal law, but if men and women are mixed together, the result is not quite normal which suggests, by Quetelet’s criterion, that they should be analyzed separately. The conclusion is reasonable in this case, but Quetelet’s reliance on the normal distribution to make this decision ultimately falters on practical and theoretical grounds. First, it is remarkably hard to conclude that a set of data does or does not follow the normal distribution. Second, the basic logic of his approach is forced. There are random “error” processes that form distributions other than the normal (e.g., random arrival times for people entering queues follow the Poisson distribution) and empirically important subgroups can exist within normally distributed populations. Random errors, in short, are not the only explanation of a normal distribution, and they do not always lead to a normal distribution. Nevertheless, Quetelet’s acceptance of variation within groups was a big step forward, even if his explanation of it (random errors) and his solution (looking for normally distributed data) were both flawed.

Step Two: Developing Stochastic Laws – The next step that had to be taken was the recognition that stochastic variation, and not just error, was a natural feature of the biological and social world. Francis Galton would take this step in his study of human characteristics such as height and other bodily measurements.
To take it, he would have to overcome the legacy of the classical theory of errors, which had led researchers to assume that the normal curve was inevitably the result of many factors operating independently. For if this were necessarily true, then “what opportunity was there for a single factor, such as a parent, to have a measurable impact? And why did population variability not increase from year to year? (Stigler, 1986, page 272).” Galton’s solution involved two steps. First he showed that the normal distribution for an entire population could arise as the weighted sum of many different normal distributions with different means, with these means indicating a natural variability across subpopulations. Then he described a mechanism, regression to the mean, whereby this variability would neither grow nor diminish from generation to generation. In effect, he developed the first stochastic model of human phenomena.

6 The central limit theorem amounts to the result that if there are many small and independent causes of deviations, then the resulting distribution will be normal. Quetelet turned this around by presuming that if a distribution within a category was normal, then there must be many small and independent causes of deviations from the mean and the group within the category must be homogeneous.

In the 1870's and 1880's, Galton was concerned with understanding the relationships among bodily measurements from the same people and their kin. To do this, he collected data from people and their relatives. He invariably found that these measurements, suitably categorized by sex, age, and other factors, were normally distributed. For example, Figure 2 presents a histogram of height data of 205 parents and their 928 adult children from Galton’s 1886 study of “Regression towards mediocrity in hereditary stature.” These data are approximately normal, as we would expect,7 and there is substantial variability in both groups.
Moreover, this variability remains even after we control, in Figure 3, for parents’ heights by plotting children’s stature versus that of their parents. This graph, appropriately called a “scatterplot” in modern statistical parlance, looks much different from the one constructed from Boyle’s experiment. The points, represented by the number of petals on “sunflowers,” do not all lie along a straight line. Although they have a central tendency, they are scattered about. Galton recognized that this variation was not due to errors. Rather it was the result of variability in the human population for which he offered “a simple and far-reaching law that governs the hereditary transmission of, I believe, every one of those simple quantities which all possess, though in unequal degrees.” (Galton, 1886, page 246). Galton’s most important insight, described in his 1886 paper, was that this variability could be built up, from one generation to the next, from the mixture of the separate normal distributions for children produced from each group of parents with the same height. And the variability would remain stable if the means of these separate distributions regressed to the overall mean so that the average height of children of parents of above average stature was less than their parents and the average height of children of parents of “mediocre” stature was greater than their parents. Figure 3 plots this regression line along with a dashed line that would indicate no such regression to the mean. The dashed line is above the solid regression line for those parents who are below average stature and below it for those parents who are above average stature. With this stochastic model, Galton showed how the inter-generational stability of stature could be preserved without requiring, as would a deterministic law, that children have exactly the same heights as their parents.
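The stability argument can be illustrated with a small simulation. The sketch below is not Galton’s procedure; it assumes the modern formulation Y = bX + e, where X and Y are parents’ and children’s deviations from the mean, and it uses hypothetical values (a regression coefficient of 0.5 and a starting standard deviation of 2.5 inches) chosen only for illustration. The function names are likewise illustrative.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def next_generation(parents, b, sd_e, rng):
    """Children's deviations: Y = b*X + e, with e drawn independently of X."""
    return [b * x + rng.gauss(0.0, sd_e) for x in parents]


def variance_by_generation(b, sd_e, sd_start=2.5, generations=5, n=20000, seed=0):
    """Track the variance of height deviations over successive generations."""
    rng = random.Random(seed)
    heights = [rng.gauss(0.0, sd_start) for _ in range(n)]
    variances = [variance(heights)]
    for _ in range(generations):
        heights = next_generation(heights, b, sd_e, rng)
        variances.append(variance(heights))
    return variances


if __name__ == "__main__":
    b = 0.5                            # hypothetical regression coefficient
    sd = 2.5                           # hypothetical standard deviation (inches)
    sd_e = (1.0 - b * b) ** 0.5 * sd   # chosen so that Var(e) = (1 - b^2) Var(X)
    print("with regression to the mean:", variance_by_generation(b, sd_e))
    print("without regression (b = 1): ", variance_by_generation(1.0, sd_e))
```

With b below one and the scatter around the regression line set to match, the variance stays near its starting value; with b equal to one and the same scatter, the variance grows each generation, which is the puzzle Galton’s mechanism resolves.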
Instead, children could be different from their parents as long as there was enough regression to the mean to ensure the same overall variance in heights in the new generation as in the old. And in this situation of intergenerational stability, the amount by which children could vary from their parents was related in a clear mathematical way to the amount of regression to the mean. One implied the other.8

Galton’s approach showed that it was possible for two characteristics to be related to one another in a non-deterministic way. Homogeneous groups of parents could be defined based upon their heights, and even though the children of these parents with the same stature would vary in their heights, it was possible to say something lawlike about the relationship between the heights of parents and children. Galton had shown how to specify stochastic laws. All that remained was for him to find a way to characterize his law. One approach, of course, was to report the regression coefficient, the slope of the line, in Figure 3. But there are two regression coefficients: one for the regression of children’s heights on parents’ heights and one for the reverse regression. Which one should be reported? In this case, it does not matter. The two coefficients are identical because the height measurements of parents and children are in the same units and have the same variance,9 but problems arise when Galton’s approach is extended to the relationship of the length of people’s arms to the lengths of their legs.10 In this case, the two regression coefficients are quite different.

Step Three: Measuring Association through Correlations – Galton’s final major contribution to statistics, in “Co-Relations and Their Measurement, Chiefly from Anthropometric Data” (1888), showed that a common measure of relationship could be produced for any two characteristics by rescaling both of them by what we would now call their standard deviation, and taking the common value of the two possible regressions as a measure of their relationship. This correlation index provided a single measure of the degree of association between two normally distributed characteristics, and it ranged conveniently from -1 to +1. For Boyle’s data, the correlation is 0.999965, and for Galton’s data in Figure 3 it is 0.460.11

7 Galton adjusted female heights by multiplying them by 1.08, and then he averaged the heights of both parents and included the heights of both male and female children.

8 Using modern notation, we can say that if Y is the deviation of a child’s height from the mean for all children and X is the deviation of the parent’s height from the mean for all parents, then we have the following regression equation: Y = bX + e, where b is the regression coefficient and e is the variability of children with parents of a given height X. We assume that e has mean zero for each X and constant variance Var(e) so that the variation in children from tall parents is the same as the variation in children from short parents. Galton was dealing with the situation of stability where not only are the mean of Y and the mean of X equal (on average parents and children have the same heights) but the variance Var(Y) of Y and the variance Var(X) of X are also equal, so that the amount of variability in the two generations is the same. By the elementary properties of variance, we know that for the equation above, $\mathrm{Var}(Y) = b^2\,\mathrm{Var}(X) + \mathrm{Var}(e)$ so that $(1 - b^2)\,\mathrm{Var}(X) = \mathrm{Var}(e)$ because Var(X) = Var(Y). Thus, once the amount of variation within generations Var(X) = Var(Y) is known, the regression coefficient b (and hence the amount of regression to the mean) and the amount that children can vary from their parents Var(e) are mathematically related.

9 Given the assumptions of the preceding footnote, the regression coefficient of Y on X is Cov(Y,X)/Var(X) where Cov(Y,X) is the covariance of Y and X, and the regression coefficient of X on Y is Cov(Y,X)/Var(Y). Since Var(Y)=Var(X), the regression coefficients are identical.

10 Because, using the notation of the previous footnote, Var(X) does not necessarily equal Var(Y) so that Cov(Y,X)/Var(X) does not necessarily equal Cov(Y,X)/Var(Y).

11 Galton used the median and the median deviation where we would use the mean and standard deviation, and he did not use the covariance. But his 1888 paper describes a measure that is closely related to the modern correlation coefficient, $\mathrm{Cov}(Y,X)/[\mathrm{Var}(X)\,\mathrm{Var}(Y)]^{1/2}$. By rescaling each variable by dividing by its standard deviation, Galton ensured that each variable had a unit variance so that the correlation coefficient and both regression coefficients were equal to Cov(Y,X).

Galton’s insights might seem relatively prosaic. After all, the normal error model and its relationship to least squares was well-known by Gauss and Laplace in the early part of the nineteenth century. What Galton added was an interpretation of the normal distribution as a measure of variability – and not errors – in the population, and he developed the first stochastic model to explain how stability could coexist with variability. His model of the relationship between parental height and children’s height is very simple, involving only one cause (parental height) to produce one effect (children’s height), but Galton’s thinking about the problem was not simplistic. In his 1886 paper, he goes to some lengths to rule out alternative causes. He studied height because of “... its practical constancy during thirty-five years of middle life, its small dependence on differences of bringing up, and its inconsiderable influence on the rate of mortality (p.
249).” He shows that “the stature of the children depends closely on the average stature of the two parents, and may be considered in practice as having nothing to do with their individual heights (page 249),” and he provides data to show that “marriage selection takes little or no account of shortness or tallness (Pages 250-51).” In sum, stature is a good subject for study because “its discussion is little entangled with considerations of nurture, of the survival of the fittest, or of marriage selection (p 251).” Modern researchers would want to consider these factors in more detail (e.g., Floud, Wachter, and Gregory, 1990), but Galton chose a subject for which his bivariate approach was very well-suited. Within a decade of his major publications, Karl Pearson (1857-1936) and his student G. Udny Yule (1871-1951) would extend his framework to the multivariate case. Developing the Logic of Causal Inference in Observational Studies Karl Pearson developed the modern “product moment” approach to correlation that every student learns in introductory courses, and he constructed the institutional infrastructure that allowed mathematical statistics to thrive (Stigler, 1986, Chapter 10, Porter, 1986, Chapter 9). Although he is arguably the father of modern statistics, his approach to multivariate inference through the development of generalized frequency curves that could be fit to multivariate data proved to be less fruitful than the approach taken by Yule that involved the generalization of regression. Regression as a Model of Association – Yule’s seminal papers, written from 1895 to 1899 while Yule was still in his twenties involved the application of the new method of correlation to a question that confronted 19th century social reformers and that seems remarkably up-to-date given recent welfare reform efforts. 
Is pauperism (i.e., being supported by public welfare) increased by providing “out-of-doors” relief to people in their own homes with no work requirements instead of requiring more stigmatizing “indoors” relief given in workhouses? In modern terms, is the welfare caseload increased when people are allowed to receive welfare without work requirements? In a book published in 1894, Charles Booth claimed that lax work requirements had no impact on welfare caseloads – there was no relationship between pauperism and the proportion of total relief provided out-of-doors. Yule thought that Booth’s own data proved otherwise. Figure 4, based upon Table II (p. 609) in Yule’s 1895 paper – published when Yule was 24 years old – shows the relationship between pauperism and the ratio of out-relief to in-relief in 1891 for districts in Britain. In his comments on this table (and a similar one for data from 1871), Yule notes that the use of “‘Galton’s function’ or coefficient of correlation” is somewhat problematic because the joint distribution is not bivariate normal and “no theory of skew correlation has yet been published (p. 604).” But even though “no great stress can be laid on the value of the correlation coefficient (the surfaces not being normal), its magnitude at least may be suggestive (p. 604-605).” Yule reports a value of .388 using the new product-moment method developed by Karl Pearson12 from which he concludes that “the rate of total pauperism is positively correlated with the proportion of out-relief given (p. 605)” so that lax work requirements were associated with larger welfare caseloads.
In a footnote to his claim, Yule reveals that he understands the difficulty of making causal inferences from a correlation, and he demonstrates a sophisticated understanding of a causal equilibrium between pauperism and the form of administration: “This statement does not say either that the low mean proportion of out-relief is the cause of the lesser mean pauperism or vice-versa; such terms seem best avoided where one is not dealing with a catena of causation at all. To use a simile, due I believe to Professor Marshall, the case is like that of a lot of balls– say half a dozen – resting in a bowl. Then you cannot say that the position of ball No. 3 is the cause of the position of No. 5 or the reverse. But the position of 3 is a function of the position of all the others including 5; and the position of 5 is a function of the positions of all the others including 3: hence variations in the positions of the two will be correlated, and it is to this term I prefer to adhere. To be quite clear, I do not mean simply that out-relief determines pauperism in one union, and pauperism out-relief in another, so that you cannot say which is which in the average; but I mean that out-relief and pauperism mutually react in one and the same union (footnote 2, p. 605).” This sophisticated understanding is, at best, only hinted at in Yule’s subsequent papers, where he regresses pauperism on form of administration and other variables and he offers a causal interpretation of the impact of form of administration and other variables on pauperism. 
12 My calculations for these data produce the same results up to the third decimal place. In a footnote, Yule notes that “Professor Pearson kindly permits me to make use of results obtained by him since this paper was written, to state that the coefficient of correlation remains equally significant for skew surfaces, although it no longer completely gives the form of the distribution.” (Page 604).

Booth’s reply (March, 1896) to Yule raised the stakes by noting that “I did not find much which suggested the influence of the form of administration [out-of-doors or in-door relief] on pauperism, but a good deal to show the influence on administration of the different shapes which pauperism assumes, due to density or sparseness of population, to the presence of many old people, to geographical or industrial characteristics, or to prosperity or the reverse as connected with increase or decrease of population; and I came to the conclusion that good results follow wherever an appropriate and well-considered policy is acted upon, whatever the policy may be (page 71).” Booth’s comments suggest two possible problems with Yule’s analysis, although it is unlikely that Booth had a clear picture of either one of them. One, to which Yule responds in subsequent work, is that factors other than administration affect pauperism so that Yule’s inference may be spurious. In his subsequent papers, Yule goes to substantial lengths to avoid spuriousness by controlling for other variables. The second problem, to which Yule never really responds even though he identified it quite clearly in his original paper, amounts to the assertion that Yule has the causal arrow going the wrong way – the form of relief does not cause pauperism, rather the form of poverty determines the form of relief.
Adjudicating between Yule and Booth on this issue requires allowing for the possibility of simultaneous causation – the form of administration might cause pauperism and the pauperism might cause the form of administration – but the nature of this problem and a possible solution for it would only become clear almost fifty years later with the work of econometricians studying supply and demand (Haavelmo, 1943; Koopmans, 1945). In a December, 1896 article, Yule takes on the first problem by breaking down the data by different ways of measuring pauperism, by metropolitan, urban, mixed, and rural districts, by age groups, and by poverty level. He shows that the coefficient of correlation between pauperism and out-relief is significant in every circumstance except in the metropolitan areas, but he dismisses this result on the grounds that the metropolitan data are known to be of poor quality. His most interesting endeavor in this paper appears in a footnote where he considers the gross and partial correlations for rural districts among three variables, “pauperism (proportion of the population in receipt of relief of any kind), ratio of outdoor to indoor relief, and estimated earnings of agricultural labourers in each union (p. 615).” He shows, as we would expect, that higher earnings tend to reduce pauperism, and after controlling for earnings, out-relief still appears to increase pauperism. This is a nice step towards controlling for confounding variables, but the net impression from reading this footnote is that Yule is struggling with problems of confounding variables and causality and only beginning to get a foothold. He concludes in a somewhat contorted fashion by saying that “the question in the present case does not seem to me to be whether pauperism is mainly due to an out-relief policy but whether there is any direct connection between pauperism and out-relief, however slight.... 
My two notes have shown distinctly that there is a connection, but do not show whether it is direct, or whether, e.g., I must simply attribute the result, that pauperism is positively correlated with out-relief, to the fact that pauperism and outrelief are both positively correlated with poverty. I prefer not to follow Mr. Booth into what must be at present mere guesswork on this point, but may remark that the figures quoted in my note on p. 615 directly contradict any such hypothesis for rural unions. (p. 620).”13

13 A multivariate regression analysis using Yule’s data produces the following equation where each variable is assumed to be mean deviated: Pauperism = .524(Out-Relief Ratio) - .592(Earnings). The coefficient of the out-relief ratio, .524, though smaller than the bivariate coefficient of .60 when Pauperism is regressed on just the Out-Relief Ratio, is still highly statistically significant with these data.

Regression as a Model of Multivariate Causation – In December of the next year, Yule (1897b) published a paper which, while it was titled “On the theory of Correlation,” was really the first complete treatment of multiple regression analysis. In his first paragraph Yule announces his intention to use regression to discover causal relationships: “The investigation of causal relations between economic phenomena presents many problems of peculiar difficulty, and offers many opportunities for fallacious conclusions. Since the statistician can seldom or never make experiments for himself, he has to accept the data of daily experience, and discuss as best he can the relations of a whole group of changes; he cannot, like the physicist, narrow down the issue to the effect of one variation at a time. The problems of statistics are in this sense far more complex than the problems of physics (Yule, 1897b, p.
812).” In this paper and in an earlier one (1897a), Yule proposes to achieve his goal by using correlations and estimating linear regression equations to analyze the typically non-normal distributions found in social statistics. Yule treats both bivariate and trivariate regression in detail and he gives examples of bivariate regression using data on poor relief. Yule’s ambitious program led his teacher, Pearson, to write him that he did not think that a linear functional relationship was adequate to summarize social or biological data and that the proper approach started with a frequency surface (Stigler, 1986, p. 351).14 But although Yule noted that “the much more general problem of obtaining an expression completely describing the frequency distribution is one that may sometimes become of importance (Yule, 1897b, p. 839),” the difficulty of solving the problem for distributions other than the multivariate normal and the simplicity of Yule’s approach meant that linear regression analysis would carry the day within the social sciences. Regression analysis, however, typically meant that researchers chose one variable as the left-hand-side or “dependent” variable even though there are two regressions for two variables and K regressions for K variables. One of the consequences is that whereas the symmetry of correlational analysis discouraged causal interpretations, the asymmetry inherent in regression analysis led to causal interpretations in which the independent variables were assumed to affect the dependent variable. Researchers would eventually realize that substantial thought would have to be given to the choice of the dependent variable, and the techniques of causal modeling would become much more sophisticated (Wright, 1934, Koopmans, 1950, Simon, 1954, Wold and Jureen, 1952, Hood and Koopmans, 1953, Blalock, 1964, Joreskog, 1970, Goldberger and Duncan, 1973).
The resulting causal modeling tradition is considered a major achievement in some quarters (Hendry and Morgan, 1995, Morgan, 1990) and a failed approach in others (Freedman, 1987, 1991). In short, we are still debating the adequacy of Yule’s solution (McKim and Turner, 1997).

14 Stigler quotes a letter from Pearson which makes the key point: “In physics you know by experience that the finer your methods of observation and your powers of observation, the more nearly you get your two variables related by a single valued equation and you are justified in trying to find the value of its constants.... The key to your method is, such a relation between the two variables actually exists in nature, it is the axiom from which you start. In biology you start with the exact opposite – no such single valued relation exists, but I understand by correlation the theory which endeavors to supply its place.” (Stigler, 1986, p. 351). Pearson’s observation, though correct, amounted to restating the problem. It did not indicate why Yule’s solution would not solve it.

Yule’s 1899 paper on “An Investigation into the Causes of Changes in Pauperism in England, Chiefly during the Last Two Intercensal Decades (Part I.)” suggests the strengths and limitations of the approach. For Stephen Stigler, a University of Chicago statistician widely recognized as the major chronicler of the history of statistical methods, “the paper was in its way a masterpiece, a careful, full-scale, applied regression analysis of social science data (Stigler, 1986, p. 355).” For David Freedman, a University of California statistician widely recognized as a major critic of causal modeling in the social sciences, the paper is “quite modern in spirit” (p. 118) and fatally flawed (Freedman, 1997, p. 119). Both are right. The remarkable features of Yule’s paper are its ambition and its recognition of many problems that would bedevil all subsequent observational studies of its type.
The ambition comes in Yule’s desire to develop an explanatory model for pauperism. He begins his second paragraph by speaking of causes, and he endeavors to classify “[t]he various causes that one may conceive to effect changes in the rate of pauperism” (p. 249) under five headings: Changes in the administration of the law, changes in economic conditions, changes of a general social character such as overcrowding, changes of a moral character such as crime or illegitimacy, and changes in the age structure of the population. A modern researcher would have trouble coming up with a better list. Yule goes on to note that these causes might be interdependent and a method is needed to decide between “different interpretations of the same facts (p. 250).” For example, a change in pauperism might result from a change in the proportion of out-relief, but both pauperism and the proportion of out-relief might be due to a common association of both with economic and social changes. Some way, therefore, must be found to control for the other factors. “This,” he claims, “the method I have used is perfectly competent to do (p. 250).” By including all of the other causal factors in a regression equation along with the factor of interest, his method “gives the change due this factor when all the others are kept constant (p. 251).” Yule claims that he can determine the net effect of one factor on another. He recognizes that there may still be problems for he quickly adds that “There is still a certain chance of error depending on the number of factors correlated both with pauperism and with proportion of out-relief which have been omitted, but obviously this chance of error will be much smaller than before (p. 251).” The last part of this sentence seems too optimistic even to the most ardent supporters of causal modeling, and it seems positively wrong to critics of the methods. 
The chances of error are much greater than Yule imagined, and there are heated debates about what can be learned from regression analysis (Freedman, 1991). We review the problems in detail below. Although Yule achieves a great deal in his paper,15 Freedman is right in complaining that “there seem to be some important variables missing from the equation, including variables that measure economic activity (p. 116-117).” Freedman also notes that some coefficients change signs from one source of data to another, and Booth’s second concern may be a problem. Out-relief may be the result, not the cause of pauperism. Yule is not entirely unaware of some of these difficulties, and he includes a section on “Unaccounted Changes” (p. 260) and spends many pages discussing his results. Freedman even finds a deft retraction of causal claims in a footnote where Yule admits that in his tables, “Strictly, for ‘due to’ read ‘associated with.’” (footnote 25, p. 270). But later Yule screws up his courage again and argues that “It seems impossible to attribute the greater part, at all events, of the observed correlation between changes in pauperism and changes in out-relief to anything but the direct influence of change of policy on change of pauperism, the change in policy not being due to any external causes such as growth of population or economic changes (p. 277).”

15 Yule’s paper includes a number of innovations. Using data from 1871, 1881, and 1891, he takes differences in order to explain the change in pauperism. He also estimates regression equations for rural, mixed, urban, and metropolitan groups, thus anticipating time-series cross-sectional regressions. By trying to explain the determinants of the out-relief ratio as well as pauperism, he anticipates causal models in which there are chains of causation.

The comments published with Yule’s paper (p. 287-295) are remarkable for their similarity to modern discussions of the topic, and it is worth cataloging them.
There are criticisms of the models used by Yule. Other variables (e.g., growing prudence, the distribution of age) might explain the observed reductions in pauperism, and the statistical methods may not be appropriate for non-normal distributions or for measures bounded between zero and one. There are concerns that the causal mechanisms are obvious or too opaque. One commentator says that statistical analysis of this sort only confirms what administrators already know, and another asks what mechanisms explain the association of reductions of pauperism with decreases in out-relief – was it “the rejection of applications unwarranted by real destitution, or was it due to the deeply-rooted dread of the workhouse, which prevented application for relief in cases of real destitution? (p. 289)” There are concerns about the generality of the results. Restrictions on out-door relief might only work “in a society where certain conditions already existed, those conditions being at the present time a constant improvement in the economic and moral conditions of the community.” (P. 293). Finally, there is a call for more data. With more data collection, “It might be found that there were two kinds of pauperism to be dealt with: the pauperism which was more or less chronic, where the people were receiving relief from year’s end to year’s end, and on the other hand a pauperism which was more or less transient, where people received relief for short periods.” (Page 293). In his reply, Yule notes that practically speaking, the burdens of the arithmetic mean that only three or four factors can be considered in this type of analysis. Modern computers, of course, have overcome this defect, but they have not overcome the possibility that one or more of the assumptions of regression analysis might fail and vitiate the conclusions of a regression analysis. The most worrisome problem is that some omitted factors might be correlated with included ones. 
The Specification Assumption, Conditional Independence, and Regression

How Regression Can Go Wrong – It is worth stopping to describe this problem in detail because it is the Achilles heel of regression analysis, and the extent of the problem is often underestimated. It is best to make a few simplifying assumptions, none of which are essential to the results.16 First, we assume that all variables have been centered about their means,17 which has the effect of eliminating the intercept in the regression equation. Second, we assume that all variables have been standardized to have unit variance by dividing by their standard deviations.18 Third, we consider a model with only one independent variable.

16 This analysis is adapted from Clogg and Haritou, 1997, pages 95-96, but the basic ideas go back to Wold and Jureen, 1952 and Ezekiel, 1930. (Theil?)

17 That is, they have been mean-deviated by subtracting off the value of their means. As a result, these mean-deviated variables have a mean of zero.

Consider the bivariate regression:

(5) $Y = bX + e$

where Y is the dependent variable, b is a regression coefficient, X is an independent variable, and e is the error term consisting of all omitted variables. The dependent variable Y could be pauperism, the independent variable X could be the out-relief ratio, and e could be omitted variables such as wages, age distribution, type of area, moral climate, and so forth. Or Y could be pressure, X could be volume, and e could be temperature, the amount of matter, or simple measurement error. The standard OLS estimator b* for b is the ratio of the covariance of X and Y to the variance of X: b* = Cov(X,Y)/Var(X).
Because the variables have unit variance, b* is identical in this case to the correlation coefficient of X and Y, b* = Cor(X,Y).19 A fundamental result in statistics is that this estimator provides an unbiased estimate of the “true” value b if Cov(X,e) is zero – if there is no covariance (or correlation) between the included independent variable X and the omitted variables e. This assumption implies that none of the omitted variables are correlated with the included variables. There are various names for this assumption. Econometric textbooks (e.g., Greene, 1993) call it the specification assumption. Statisticians call it conditional independence (e.g., Holland, 1986, p. 949). It is closely related to other conditions such as no confounding and strong ignorability.20 It is always identified as a crucial assumption whose failure can lead to very poor inferences.

18 Thus, Var(Y) = 1 and Var(X) = 1.

19 As noted earlier, the Pearsonian correlation coefficient is $\mathrm{Cor}(X,Y) = \mathrm{Cov}(X,Y)/[\mathrm{Var}(X)\,\mathrm{Var}(Y)]^{1/2}$.

20 The literature in this area is quite technical, with many different conditions and results. A recent summary is in Stone, 1993.

What happens if it is not true? (For those not interested in a mathematical derivation, please skip to the next paragraph and equation (9) to find the answer to this question.) If the specification assumption fails, then the true b will be the solution to two equations formed by taking variances on both sides of equation (5) and the covariance of both sides of (5) with X:21

$\mathrm{Var}(Y) = \mathrm{Var}(bX + e) = b^2\,\mathrm{Var}(X) + 2b\,\mathrm{Cov}(X,e) + \mathrm{Var}(e)$

$\mathrm{Cov}(Y,X) = \mathrm{Cov}(bX + e, X) = b\,\mathrm{Var}(X) + \mathrm{Cov}(X,e)$

21 This step may appear a bit mysterious to those unfamiliar with the algebra of variances and covariances, but it is nothing more than the application of several simple rules that are proved in elementary statistics courses. Namely, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + 2\,\mathrm{Cov}(X,Y) + \mathrm{Var}(Y)$; for a constant a, $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$; $\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X,Z) + \mathrm{Cov}(Y,Z)$; and $\mathrm{Cov}(X,X) = \mathrm{Var}(X)$.
Because the variables are standardized these equations reduce to:

(6) 1 = b^2 + 2 b Cov(X,e) + Var(e)
(7) Cor(Y,X) = b + Cov(X,e)

In addition, we know by the definition of correlation that:

(8) Cor(X,e) = Cov(X,e)/[Var(X) Var(e)]^(1/2) = Cov(X,e)/[Var(e)]^(1/2)

Equations (6-8) are three equations in the five unknowns, b, Var(e), Cor(X,e), Cor(Y,X), and Cov(X,e). With some tedious algebra, including an application of the quadratic formula, two of the unknowns can be eliminated to obtain an expression entirely in terms of the correlations of Y with X and X with e.

(9) b = Cor(Y,X) - [1 - Cor^2(Y,X)]^(1/2) Cor(X,e)/[1 - Cor^2(X,e)]^(1/2)

The quadratic formula yields two roots, one with a plus and one with a minus sign on the second term, but equation (7) gives Cov(X,e) = Cor(Y,X) - b, so the root consistent with (7) is the one in which the second term is subtracted. The first term on the right is the OLS estimator b* for b, and the second term shows how the true value of the regression coefficient b for X departs from this estimate depending upon the degree to which the specification assumption, Cor(X,e) = 0, does not hold. When Cor(X,e) is near plus one, the value of b approaches minus infinity; when Cor(X,e) is near minus one, the value of b approaches plus infinity. Only when Cor(X,e) is zero does the specification assumption hold. Then the OLS coefficient is the correct measure of the impact of X. Note that even if the observed correlation between Y and X is zero, seemingly implying that X and Y are unrelated, the true value of b can be anywhere from minus infinity to plus infinity. This result shows why Cor(X,e) = 0 is called the specification assumption. Without it, the estimated value of b can be wildly wrong. There is a mildly reassuring result in the literature which shows that small departures from Cor(X,e) = 0 only cause small departures of b* from b (see Wold, 1956, p. 43), but the simple truth is that big problems can occur when the specification assumption fails.
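The relationship in equation (9) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: the sample size, the raw coefficient of 0.5, the target correlation of 0.6, and the helper names (simulate, implied_b) are all hypothetical, and X and e are simply drawn as correlated normal variables. It uses the root of (9) in which the second term is subtracted, because equation (7) gives Cov(X,e) = Cor(Y,X) - b.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Covariance computed with divisor n."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def cor(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))


def implied_b(r_yx, rho_xe):
    """Equation (9), using the root consistent with (7): Cov(X,e) = Cor(Y,X) - b."""
    return r_yx - rho_xe * math.sqrt(1 - r_yx ** 2) / math.sqrt(1 - rho_xe ** 2)


def simulate(raw_b, rho, n=20_000, seed=1):
    """Draw X and an error e correlated with X, then form Y = raw_b * X + e."""
    rng = random.Random(seed)
    x, e = [], []
    for _ in range(n):
        xi = rng.gauss(0, 1)
        zi = rng.gauss(0, 1)
        x.append(xi)
        e.append(rho * xi + math.sqrt(1 - rho ** 2) * zi)  # Cor(X, e) is roughly rho
    y = [raw_b * xi + ei for xi, ei in zip(x, e)]
    return x, y, e


RAW_B = 0.5  # hypothetical coefficient before standardization
x, y, e = simulate(raw_b=RAW_B, rho=0.6)

sd_x, sd_y = math.sqrt(cov(x, x)), math.sqrt(cov(y, y))
true_b_std = RAW_B * sd_x / sd_y          # true b once X and Y are standardized
ols_b = cor(x, y)                         # b* for standardized variables
recovered_b = implied_b(ols_b, cor(x, e)) # b recovered from the two correlations

print(f"OLS estimate b*       = {ols_b:.3f}")
print(f"true standardized b   = {true_b_std:.3f}")
print(f"b recovered from (9)  = {recovered_b:.3f}")
```

Because the error here is positively correlated with X, the OLS estimate overstates the true coefficient, and the correction term in (9) removes exactly that excess. In real data Cor(X,e) is of course unobservable, which is the whole difficulty.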
Examples of How Regression Can Go Wrong – If Yule omitted some important factor, such as moral climate, that affects pauperism and that is also correlated with out-relief, then his estimate of the impact of out-relief on pauperism could have any sign or magnitude depending upon the correlation between moral climate and out-relief. If Boyle had encountered a rather unusual British day with temperatures ranging from -20 degrees Fahrenheit in the morning to 100 degrees at midday and if he had set up his apparatus to deal with small volumes in the morning and with large volumes in the middle of the day, then he would not have found that pressure times volume equals a constant. Instead of obtaining a regression coefficient of minus one for the impact of logged volume on logged pressure, he would have obtained a regression coefficient for volume of about minus .82 – nowhere near his value of minus one.22 Luckily, the monotony, if not the salubrity, of the British weather saved Boyle from missing his chance to have a law named after him.

These results are theoretically disquieting, but perhaps we typically obtain similar regression coefficients across many different data sets, which would imply that the theoretical problem is empirically trivial. Unfortunately, changes in the magnitudes and even the signs of coefficients are not unusual from one data set to the next (remember Freedman’s critique of Yule’s work), but even if coefficients remained the same, stability of regression coefficients across data sets could be misleading. If the same specification is used across these data sets and if the same omitted variables are operating in the same way across them, then stable results provide little evidence for a correct specification. If Boyle consistently sets up his apparatus to study small volumes in the morning and large volumes at midday and if the British weather continues to have such extremes, then he will get the same result time after time.
If the weather changes, which is perhaps the best possible outcome for his research, his regression coefficients will become unstable, suggesting that something is amiss. In fact, the oft-noted instability of regression coefficients provides strong evidence that the specification is incorrect, but stability does not necessarily indicate that the specification is correct unless there are reasons to believe that plausible confounders have varied across the data sets without changing the regression results.

Another way to think about the problem is this. Regressions in observational studies tell us about the mean values of Y when we select a set of subjects with characteristics X from a given population. If we repeatedly select subjects in the same way from this population, the regression equation will typically provide a similar result. Then it is easy, for example, to assume that, because pauperism is associated with high levels of out-relief, a reduction in out-relief will decrease pauperism. But regression cannot guarantee this unless the specification assumption holds, and the specification assumption amounts to being able to say that when X is changed, all other things equal, then Y changes. Freedman (1997, p. 118) puts it this way. There is a substantial difference between the following two procedures: “Procedure #1. Select subjects with X=x; look at the average of their Y’s. Procedure #2. Intervene and set X=x for some subjects; look at the average of their Y’s. The first involves the data set as you find it. The second involves an intervention. (Emphasis added)”

22 The problem, of course, is that Boyle would have omitted an important variable, temperature, that was highly correlated with his changes in volume. A regression of logged pressure on logged volume and logged temperature would give the correct result.
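The point of footnote 22 can be sketched with a small calculation. Everything specific below is an assumption made for illustration: the eight volume readings are invented, the gas-law constant is set to one so that P = T/V, and temperatures are converted to an absolute scale. The sketch does not attempt to reproduce the exact coefficient quoted in the text; it only shows that the coefficient on logged volume moves away from minus one when logged temperature is omitted and returns to minus one when it is included.

```python
import math


def fahrenheit_to_kelvin(f):
    return (f - 32.0) * 5.0 / 9.0 + 273.15


def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical readings: small volumes on a cold morning (-20 F),
# large volumes at a hot midday (100 F).
readings = [(vol, -20.0) for vol in (1.0, 1.2, 1.4, 1.6)] + \
           [(vol, 100.0) for vol in (3.0, 3.4, 3.8, 4.2)]

log_v = [math.log(vol) for vol, _ in readings]
log_t = [math.log(fahrenheit_to_kelvin(f)) for _, f in readings]
# Assumed gas law with the constant set to one: P = T / V.
log_p = [lt - lv for lv, lt in zip(log_v, log_t)]

v, t, p = centered(log_v), centered(log_t), centered(log_p)

# Short regression: log P on log V only (temperature omitted).
slope_short = dot(v, p) / dot(v, v)

# Long regression: log P on log V and log T, via the 2x2 normal equations.
svv, stt, svt = dot(v, v), dot(t, t), dot(v, t)
svp, stp = dot(v, p), dot(t, p)
det = svv * stt - svt ** 2
slope_v = (svp * stt - stp * svt) / det
slope_t = (stp * svv - svp * svt) / det

print(f"log V coefficient, temperature omitted:  {slope_short:.3f}")
print(f"log V coefficient, temperature included: {slope_v:.3f}")
print(f"log T coefficient:                       {slope_t:.3f}")
```

The omitted-variable regression is pulled toward zero because large volumes happen to coincide with high temperatures, which is the correlation between X and e that the specification assumption rules out.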
Does this mean that only experimental studies that intervene while controlling all other factors can provide good inferences and that observational studies can never do so? This is too pessimistic, and Freedman’s terminology hints at part of the solution. Freedman distinguishes between what we might call “selection studies” and “intervention studies.” This distinction is not the same as that between observational and experimental studies. Experiments are invariably interventions, but observational studies are not always merely selection studies. Many observational studies consider, or adventitiously encounter, interventions such as new social programs, political campaigns, or efforts to change behavior. Under some circumstances, these studies can lead to useful inferences. In order to understand these circumstances, however, it is useful to see how experimental studies solve the inference problem.

Developing the Logic of Causal Inference in Experimental Studies

Modern randomized experiments come out of the long tradition of agricultural research in which experimenters addressed practical questions about varieties of seed, methods of ploughing, types of fertilizer, and methods of planting. The first scientific field experiments probably occurred at Rothamsted in England in 1839 (Wishart, 1934, p. 26), although Cochran (1976) discusses a surprisingly modern approach that appeared in 1771 as a three-volume work, A Course of Experimental Agriculture, published by Arthur Young. Young distrusted single trials because of uncontrollable variability in the outcomes, and he insisted that experiments must be comparative, thereby identifying the two most important ideas of modern experimental methods.

Variability in Experimental Studies – Young’s work and the experiments at Rothamsted revealed problems that were different from those faced by physical scientists who could often control their experimental circumstances.
Too many factors – the weather, the natural fertility of the soil, drainage, and bird and insect damage – could not be controlled in agricultural research. In an effort to test for the possible impacts of these uncontrolled factors, the early experiments at Rothamsted placed “control” plots at opposite ends of the field to check for variations in soil fertility. “If the control plots differed little in their final yield it was held to demonstrate that the area was satisfactorily uniform, and could continue to be used with confidence” (Wishart, 1934, p. 26). But this strategy for control could be easily confounded if soil fertility increased (or decreased) from each end of the field towards the middle which contained the experimental plots. Even though more sophisticated methods of control were developed (Student, 1909, 1923), none was entirely satisfactory until R. A. Fisher (1890-1962) developed a new conception of a field experiment. Fisher’s methods were based upon a new statistical technique, the analysis of variance, and they involved an element of randomization in which experimental treatments were assigned randomly to plots.

Fisher’s methods provided a way to deal with the inherent heterogeneity in the social and biological worlds. This heterogeneity led to two problems that the observational approaches, even the clever techniques proposed by Yule, had not solved. First, there had to be some definition of what was meant by a treatment effect. If fertilizers were meant to increase yields, then how was one to define the yield of a plot with and without fertilizer? In a physical experiment, such as tests of Hooke’s law of the stretching of springs by weights, the deviation of the spring before and after attaching a weight could be observed. The difference would be the net treatment effect. To check that the effect was due to the weight, the spring’s deviation after the weight had been removed could also be recorded.
Alternatively, the impact of the weight could be compared with two identical springs – one with a weight and one without. There was no such obvious strategy with agricultural experiments. Experience showed that even if a fertilizer were generally regarded as beneficial, it was possible for a plot to yield less with it than without it. The problem, of course, is the heterogeneity of plots over space and over time and the difficulty of defining an effect under these circumstances. The second problem was finding a way to control for this heterogeneity. The two problems are linked, but they are different, and it is remarkable that they seem to have been solved by two different people, although each person might rightly claim to have perceived both problems and to have offered solutions to them.

Defining the Impact of a Treatment – Jerzy Neyman (1894-1981) solved the problem of defining the impact of a treatment in an article, taken from his doctoral dissertation, that was published in a Polish journal in 1923.23 Neyman considered agricultural experiments in which the yield on a field from one variety of a crop is compared with the yield from another variety. Perhaps his most significant contribution in this paper was a formal notation that clarified the nature of the inference problem. Neyman specified the outcome Y (e.g., the crop yield) of the experiment for a unit u (e.g., a plot of land) for each possible condition i (e.g., varieties of a crop). Today, we might think of the conditions as treatment i=t and control i=c. Thus, Yt(u) is the outcome for unit u when it gets the treatment t, and Yc(u) is the outcome for unit u when it is in the control condition c.

Figure 5 describes Neyman’s setup. He envisioned a situation with a large number of different units, such as plots in a field, taken from the same population such as a field or a farm.
For the sake of pictorial economy, we list just four units in the four rows of Figure 5, but there would typically be many more units. The possible outcomes for each of these units are listed in the second and third columns. These outcomes depend upon the condition assigned to the unit. If the unit gets the treatment, then its outcome is described by the value in the second column, Yt(u). If the unit gets the control condition, then its outcome is described by the value in the third column, Yc(u). The obvious measure of the impact of the treatment compared to the control condition for a plot u is the difference between the outcome for u in the treatment condition and in the control condition, Yt(u) and Yc(u). But the variability in social and biological phenomena might mean that this difference, Yt(u) - Yc(u), will not be representative of the average effect of the treatment compared to the control in the reference population because the reaction of one plot in a field or one person in a group to a treatment is unlikely to be representative of the entire field or group.

23 The history of this article is interesting. Partly because it was published in Polish, it was little noticed, except for a few citations, before 1990. In that year, an English translation of part of it led to a recognition of its relationship to the work of Rubin (1974, 1978) who had extended the approach 50 years later (Rubin, 1990) without knowing about Neyman’s paper. My discussion draws upon Rubin, 1978 and Holland, 1986 which provide a modern version of Neyman’s model along with an extension to observational studies.

Another problem is that the difference cannot be computed because, for any experiment, only one of Yt(u) and Yc(u) can be observed. Either the unit gets the treatment, or it gets the control condition. It cannot get both. If it gets the treatment, then Yt(u) is observed. If it gets the control condition, then Yc(u) is observed.
For each unit only one of the two outcomes, Yt(u) or Yc(u), will be, in fact, observed. The value of the other outcome is a counterfactual observation – it can only be observed in a state of the world that does not occur. Yet having this observation seems crucial for evaluating an experiment. Indeed, philosophers have made counterfactuals the centerpiece of their explanations of causal inference (Goodman, 1983, Menzies, 2000). Causal statements make assertions about what would have happened if the cause had not been present. For example, Boyle’s law asserts that if the pressure had not been increased in Boyle’s experiments, then the volume would not have decreased. Similarly, according to Yule, his regression analysis shows that if out-relief had not increased in some districts, then pauperism would not have increased as well. Both of these assertions involve counterfactual statements. In neither of these cases, could the researcher actually compare what happened to the same unit with and without the cause present. They had to find some other way to establish their causal argument. Both Boyle and Yule had to overcome the difficulty of not being able to observe a counterfactual, but Boyle faced a much easier problem. The much lower variability in Yt(u) and Yc(u) across units in his experiments, the ability to control other factors that might affect the outcomes, and the homogeneity of his units (a large number of gas molecules) meant that convincing comparisons could be made that substituted for knowing the counterfactual outcome. In Boyle’s experiments, for example, he could use his apparatus to increase pressure and to then decrease it within such a short period of time that very little else could change (such as the temperature). As a result, he could, with some degree of confidence, compare different rows of Figure 5. 
For example, if in row one the unit of air was in the control condition (say, “low pressure”) with volume Yc(1) and in row two the unit was in the treatment condition (say, “high pressure”), with volume Yt(2), then the difference Yt(2) - Yc(1), could be considered the change in volume with the change in pressure. This difference is not at all the same as Yt(1) - Yc(1) which involves a counterfactual, and the comparison Yt(2) - Yc(1) runs the risk of confounding, but Boyle could make the conditions for unit 1 so close to those for unit 2 that the comparison seems acceptable. Moreover, if there were concerns about the validity of this comparison, then the same experiment could be tried again in rows three and four with nearly identical results. Of course, this strategy might fail if every time Boyle increased the pressure, his physical activity heated up the room so as to confound the result, but the almost perfect relationship between pressure and volume that he obtained (see Figure 1) suggested that other factors would have to be working rather artfully to confound him.

Boyle’s success depended upon a number of factors, but the most important one was that he could limit the variability in outcomes. This suggests that finding a way to control the variability in the outcomes of social and biological experiments might make comparisons possible that would substitute for having counterfactual information. Neyman’s contribution was to find a way to control this variability. His method was to take the average of the impact of the treatment condition over all units and to compare this with the average over all units of the impact of the control condition. Neyman defined the average outcome for the treatment group as the average of the outcomes Yt(u) in the second column, or Yt* = (1/4) Σ_{u=1,...,4} Yt(u), and he defined the average yield for the control group as the average of the Yc(u) in the third column, or Yc* = (1/4) Σ_{u=1,...,4} Yc(u).
Obviously, the impact of the treatment compared to the control is simply the difference between these two, or Yt* - Yc*. At first blush, this does not seem to advance the situation very much because it involves averages which include some counterfactual information, but we shall see that this approach is the first step towards finding a solution to the problem of only being able to observe the impact of the treatment or the control condition for each plot. One of the remarkable features of Neyman’s approach is that it deals with heterogeneity in a new way and it makes very minimal assumptions about the values of Yt(u) and Yc(u). Unlike Quetelet, Galton, and others who thought that comparisons could only be made if the treatment and control plots demonstrated some homogeneity, which they typically inferred was present when the outcomes from each type of plot followed the normal distribution, Neyman required no such thing. He states, complete with his own emphasis, that there is a misunderstanding “that probability theory can be applied to solve problems similar to the one discussed only if the yields from the different plots follow the Gaussian [normal] law.” (Page 468) But “consistency with the law of random errors [the normal law] should not justify a framework which is based on an assumption of independence of the measurements” and “it is enough to assume that our measurements are independent, and for that we need a large number of plots on the field.” (Page 468). For Neyman, the distribution of yields over a set of plots could be virtually anything depending upon the physical features of the field, and he made no assumptions about the distribution. In fact, Neyman conceived of his “experiment” very abstractly as the problem of drawing balls from two urns, one for the treatment condition and one for the control condition. Each urn has as many balls as plots and each ball is inscribed with the number of the plot and its outcome under the condition. 
The outcomes in an urn could have any distribution including a multimodal one if some plots are especially fertile, some only moderately so, and others not at all fertile. The averages of the outcomes written on the balls in each urn are equal to the averages Yt* and Yc* described above. These averages help to reduce variability. But how can the experimenter estimate them when they involve counterfactual observations?

Using Randomization to Get Estimates of Effects – Assume that the experimenter chooses an equal number of balls from each urn so that there is the same number of plots in the treatment and the control condition. The urns have the property that if a ball is taken from one of them, then the ball having the same plot number in the other urn disappears (Neyman, p. 467). This assumption, of course, is the requirement that only one possible world can be realized. Each plot either gets the treatment or control condition. It is at this point that Neyman gets tantalizingly close to the idea of random assignment. By choosing balls at random, Neyman is essentially assigning treatment and control conditions randomly, but he never mentions the physical act of randomization.24 Neyman seems to anticipate the random assignment of plots to treatment and control conditions, but his paper never makes that suggestion. Instead, he offers a thoroughly worked out “thought experiment.” In his 1926 paper, Fisher makes the suggestion explicit: “One way of making sure that a valid estimate of error will be obtained is to arrange the plots deliberately at random, so that no distinction can creep in between pairs of plots treated alike and pairs treated differently” (p. 507). Although Fisher made the idea explicit, Neyman’s justification for it is clearer. By randomly assigning treatment and control conditions, Neyman obtains random samples of Yt(u) and Yc(u) that can be used to calculate good estimates of Yt* and Yc* if the number of plots is large enough.
Both random assignment and the large number of plots matter for this argument. To demonstrate this, we begin with the example of four plots in Figure 5. The last six columns show all the possible ways that experiments can be set up in which two plots receive the treatment condition and two receive the control condition. There are six such ways for this to happen with four plots, two conditions, and the requirement of equal number of plots for each condition. We call the six columns “states-of-the-world” because they are mutually exclusive and exhaustive ways that the world might look after the assignment of treatment and control conditions. For concreteness, suppose that the first two plots (u=1,2) are very fertile and the second two (u=3,4) are not. If design A were chosen by the experimenter, then even if the treatment has no effect, the average for the first two plots will be larger than the average for the second two plots simply because of their greater fertility. Even if the number of plots is increased without limit by reproducing equal numbers of the first type of plots (“high fertility”) and the second type (“low fertility”), the experiment will still give the wrong answer. If plots are assigned at random, situation A will still occur one-sixth of the time, but other situations will also occur, including four (B,C,E,F) in which fertile and infertile plots are evenly divided between control and treatment groups and one (D) in which the control condition is favored. If the number of plots is again increased in the way described above and each plot is randomly assigned a condition, then the chance of randomly getting a state of the world favorable to the treatment decreases still farther and in a way that can be easily calculated. As a result, it is possible to say how likely an observed difference is due to chance versus some real effect. 
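The four-plot argument can be made concrete by enumerating every possible assignment. The yields below are hypothetical (Figure 5's actual entries are not reproduced here), with plots 1 and 2 fertile, plots 3 and 4 infertile, and an assumed constant treatment effect of one unit. Assignments are labeled by the pair of treated plots rather than by the letters A through F, since the ordering of the columns in Figure 5 is not assumed.

```python
from itertools import combinations

# Hypothetical potential outcomes: plots 1 and 2 are fertile, plots 3 and 4 are not.
y_c = {1: 10.0, 2: 10.0, 3: 4.0, 4: 4.0}          # yield under control
y_t = {u: yc + 1.0 for u, yc in y_c.items()}      # yield under treatment (effect = 1)


def estimate(treated):
    """Difference in observed means when the plots in `treated` get the treatment."""
    control = [u for u in y_c if u not in treated]
    treated_mean = sum(y_t[u] for u in treated) / len(treated)
    control_mean = sum(y_c[u] for u in control) / len(control)
    return treated_mean - control_mean


# True average effect Yt* - Yc*, which uses all eight potential outcomes.
true_effect = sum(y_t.values()) / 4 - sum(y_c.values()) / 4

# All six ways to treat two plots out of four (the "states-of-the-world").
estimates = {treated: estimate(treated) for treated in combinations(sorted(y_c), 2)}
average_estimate = sum(estimates.values()) / len(estimates)

for treated, value in estimates.items():
    print(f"treated plots {treated}: estimated effect = {value:+.1f}")
print(f"average over all assignments = {average_estimate:+.1f}")
print(f"true effect                  = {true_effect:+.1f}")
```

With these numbers, treating only the two fertile plots badly overstates the effect and treating only the two infertile plots badly understates it, while the four mixed assignments recover it. Averaged over the six equally likely assignments, the estimate equals the true effect, which is the sense in which randomization removes the bias of a design fixed in advance.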
In short, randomization ensures that variability will be averaged out and the true impact of the treatment versus the control will be estimated. As the number of plots increases, the law of large numbers averages out the variability and ensures that the estimates will be close to the actual values.25 Hence, the difference of the estimates is a good estimate of the impact of the treatment versus the control condition.

24 Rubin says: “I am in full agreement with Scheffe’s (1956) description of Neyman’s mathematical model as corresponding to the completely randomized experiment, and I also agree with Dabrowska and Speed [1990] that the explicit suggestion to use the urn model to physically assign varieties to plots is absent.” (P. 477).
25 Here Neyman was not as careful as a modern scholar. If the number of plots increases, then some assumption has to be made about how the Yt(u) and Yc(u) change as well. If, for example, the initial plots are located at Rothamsted, the next plots are in the arctic, those after that are in greenhouses, and so forth, then the law of large numbers will be unable to provide a stable estimate of average outcomes because the mean values will be jumping around. Thus, implicit in Neyman’s model is some understanding that the outcomes from the new plots look like the outcomes from the old plots. A sufficient condition is that all plots are taken from the same population (for which average outcomes exist for both treatment and control conditions).

Another way to think of this randomization is that it selects a random subset of observations from all possible observations of the impacts of the treatment and control conditions. The outcomes of all the possible observations from the treatment condition can be written as:

(10) Yt(u) = Yt* + et(u),

where et(u) represents Yt(u)’s deviation from the mean Yt*. This equation says that the outcome for any specific unit is equal to the average impact of the treatment plus some deviation from that average. If there are N units, then there are N equations like (10). The average of all N of these deviations et(u) will be zero across all units by the definition of the mean. If half the units are randomly assigned the treatment, then the law of large numbers implies that the average of the (N/2) deviations et(u) for the observed units will get closer and closer to zero as (N/2) increases. If the treatment condition is not assigned randomly, then this will not necessarily be true. For example, if all the plots with high yields are in the treatment condition, then all the observed outcomes for the treatment condition will have positive values of the deviations et(u). Similarly, all the possible observations for the control condition can be written as:

(11) Yc(u) = Yc* + ec(u)

where ec(u) represents Yc(u)’s deviation from the mean Yc*. The average of all ec(u) will be zero across all units by the definition of the mean, and the average of the deviations ec(u) for the (N/2) observed units will be approximately zero if the treatment condition is assigned randomly and if there are many units.

With this notation, we can show that randomized experiments satisfy the specification condition described earlier. We can write (10) by simply adding and subtracting Yc* to the equation:

(12) Yt(u) = Yc* + (Yt* - Yc*) + et(u)

Remember that (11) involves N possible observations and (12) involves N more possible observations. Assigning treatment and control conditions amounts to choosing one of the columns, that is, one of the states of the world, A,B,C,D,E,F in Figure 5. When this is done, half of the possible observations become impossible. Then, (11) and (12) can be written more economically for each value of u by defining a variable X(u) with the value of one if the unit is in the treatment group (i=t) and zero if the unit is in the control group (i=c).
By doing this, all counterfactual observations will be dropped from the new equation, and only actual observations will be included. Because we have designed the experiment so that there are equal numbers of units in each condition, there are N/2 units from (11) that will be assigned to the control condition, and there are N/2 units from (12) that will be assigned to the treatment condition. All the actual observations can be written as:

(13) Y(u) = Yc* + (Yt* - Yc*)X(u) + et(u)X(u) + ec(u)(1 - X(u))

or, equivalently,

Y(u) = a + b X(u) + e(u), where a = Yc*, b = (Yt* - Yc*), and e(u) = et(u)X(u) + ec(u)(1 - X(u)).

Note that the “t” or “c” subscript has been dropped, and there will be N of these equations, one for each observation. The intercept a is a measure of the average impact of the control condition and the slope b is a measure of the net average impact of the treatment over the control condition. The error term e(u) is either the error et(u) if X=1 or ec(u) if X=0. This equation has the form of a regression of the observed values Y(u) on the observed values of X(u).

We know from the discussion above that the OLS estimate of the slope b will be unbiased if Cov(X,e) = 0. Does randomization ensure this result in (13)? It is easy to show that without randomization difficulties might arise. If the fertile plots are assigned to the treatment and the infertile ones to the control condition, then when X=1 for the treatment plots, the values of e(u) will be highly positive values of et(u) and when X=0 for the control plots, the values of e(u) will be highly negative values of ec(u). Obviously, Cov(X,e) will be very positive because a value of X=1 will be associated with high values of e and a value of X=0 will be associated with low values of e. We know from our previous discussion that this will lead to an upward bias in the OLS estimate of the true impact of the treatment. The treatment will appear to work because it has been assigned to the fertile plots.
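The role of Cov(X,e) in equation (13) can be illustrated with a simulated field. The sketch below rests on assumptions that are not in the original: 2,000 plots, normally distributed fertility, a constant treatment effect of one unit, and the helper name run. It compares a design in which the most fertile half of the plots is treated with a design in which a random half is treated.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Covariance computed with divisor n."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def run(assign, y_t, y_c):
    """Return (OLS slope of Y on X, Cov(X, e)) for a 0/1 assignment vector X."""
    yt_star, yc_star = mean(y_t), mean(y_c)
    # Observed outcome and error term, following equation (13).
    y = [yt if x else yc for x, yt, yc in zip(assign, y_t, y_c)]
    e = [(yt - yt_star) if x else (yc - yc_star)
         for x, yt, yc in zip(assign, y_t, y_c)]
    return cov(assign, y) / cov(assign, assign), cov(assign, e)


rng = random.Random(0)
n, effect = 2000, 1.0                         # hypothetical field size and effect
fertility = [rng.gauss(0, 1) for _ in range(n)]
y_c = [5.0 + f for f in fertility]            # control yield driven by fertility
y_t = [yc + effect for yc in y_c]             # constant effect, for simplicity

# Non-random design: the most fertile half of the plots gets the treatment.
cutoff = sorted(fertility)[n // 2]
fertile_first = [1 if f >= cutoff else 0 for f in fertility]

# Randomized design: a random half of the plots gets the treatment.
randomized = [1] * (n // 2) + [0] * (n // 2)
rng.shuffle(randomized)

slope_ff, cov_ff = run(fertile_first, y_t, y_c)
slope_rd, cov_rd = run(randomized, y_t, y_c)

print(f"fertile plots treated: slope = {slope_ff:.2f}, Cov(X,e) = {cov_ff:.3f}")
print(f"random assignment:     slope = {slope_rd:.2f}, Cov(X,e) = {cov_rd:.3f}")
```

In both designs the slope equals b + Cov(X,e)/Var(X) exactly, so the bias of the OLS slope is entirely the covariance between assignment and the error term. Under random assignment that covariance is close to zero for a field of this size, though not exactly zero in any single draw.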
The reverse will occur if fertile plots are assigned to the control condition and infertile ones to the treatment condition. But when there is randomization, we know that the average value of the errors will be zero when X=1 and zero when X=0. Hence, the covariance of X and e must be zero, and OLS will give an unbiased estimate of the average net effect. Randomized experiments automatically satisfy the specification assumption.26

26 Two things will actually be true. The first has to do with a thought experiment in which the experiment is done repeatedly and the average result over all such experiments is considered. The expected values of the errors for each condition in these circumstances will be zero [E(e(u)|X=1) = 0 and E(e(u)|X=0) = 0] because randomization makes each of the six states of the world in Figure 5 equally likely, and the average of the errors for a given condition across the states of the world for all the units in Figure 5 will be zero. This condition ensures that the estimator of b, the net treatment impact, will be statistically unbiased no matter how many units are considered. The second has to do with a thought experiment in which the number of units increases without limit. This will increase the length of each column in Figure 5, and it will “split-up” the existing columns into many subcolumns as different randomizations occur for the new units. Thus, adding one new unit will cause state of the world A to split into two possibilities – one in which the new unit gets the treatment condition and the other in which the new unit gets the control condition. In this situation, by the law of large numbers, the average of the errors for a given condition down each column in Figure 5 will be more and more likely to be close to zero as the number of units increases. This condition ensures that the estimator of b will be statistically consistent as the number of units increases.

Problems with Experiments – Randomized experiments are classic intervention studies in which units – plots or people – are assigned some value of X, and then the average impact of X is observed. The fact that conditional independence is automatically satisfied with experiments makes them very attractive. Why, then, don’t we do them more often? Probably the biggest reason is that they are hard to do, especially in the social sciences. The limitations are both practical and ethical. The practical limitations are the same as those which held up the first experimental test of Newton’s theories of orbiting satellites until almost 300 years after their formulation. Sputnik was expensive and complicated and so are most social and biological experiments. The ethical limitations have to do with the unacceptability of randomly assigning people to families, political parties, or guerilla groups.

But there are other limitations as well. Consider, for example, what would happen if a “Boyle’s Law” experiment varied T and measured P but the apparatus was set up, unbeknownst to the investigator, so that V would adjust. If V adjusted enough, then P might not vary at all. Alternatively, P might vary somewhat as T was adjusted, but V might vary as well. If only P and T were being measured, then the experiment might grossly underestimate the possible impacts of T. There are no violations of physical laws here, and there is no failure of the experimental method. The method would be giving a true and correct rendition of what occurs under the experimental circumstances which just happen to allow V to vary. One of the problems, then, with experiments is that they only tell us what happens under the conditions that happen to exist in the experiment.

Now consider an experiment which tries to increase the employment P of people by providing them with some training T requiring substantial study and reading. Suppose that V measures the violence of this group of people, and suppose that the treatment affects some subjects by causing them to become employed (increasing P) but it affects others by getting them frustrated and more violent (increasing V) because they cannot seem to learn. In fact, suppose that for any individual, the relationship among these three variables is exactly the same as the gas law so that PV is always proportional to T, but for some people only P can vary, for others only V can vary, and for still others both can vary but to a different extent depending upon the person. Under these circumstances, the treatment, T, could lead to highly variable employment outcomes depending upon the mix of people in the program. In some instances the program might seem, on average, to get people employed, and in others it might seem to harm them by decreasing their employment possibilities. The inferences in each case would be correct for the population that was studied, but it would be wrong to generalize from it. In fact, if the experimenter had also measured V, then it would become clear that there was a structural, lawlike relationship among P, V, and T.

This example might seem far-fetched, but American social policy has already generalized from a series of experiments that might have been misleading in just this way. Through a series of experiments in California, welfare researchers concluded that a “work-first” welfare program was much better than a “training-first” program. The work-first program was based upon job attachment theory which presumes that welfare recipients have the skills to be good workers, but they have forgotten (or never learned) the habits required to get jobs.
Consequently, getting welfare recipients tied to jobs is the best way to move them out of poverty. Job attachment theory has much different implications than human capital theory which supposes that welfare recipients lack the skills to be good workers. According to this theory, “job-training” is needed to provide welfare recipients with the skills they need to get jobs. Recent work by Hotz, Klerman, and Imbens (19xx) suggest that work-first programs did well because they were implemented in counties where jobs were plentiful and the welfare recipients had relatively high education and past job experience, but if these same programs had been implemented in counties with a less job-ready population, then work-first might not have been so successful. Indeed, it might just frustrate welfare recipients who try to get jobs even though they are not suited for them. There are other problems with experiments as well. Heckman (1992) has argued that randomization may affect participation decisions so that the people who get involved in a randomized experiment may differ from those who would get involved in a full-scale program. The assumption that there is no effect of randomization on participation decisions “is not controversial in the context of randomized agricultural experimentation” (227) which is where the Fisher’s experimental model was developed. This model is the intellectual basis for modern social experiments, but it may require some modification with human subjects. Heckman also argues that experiments are “inflexible vehicle for predicting outcomes in environments different from those used to conduct the experiment.” (p. 227). Nevertheless, randomized experiments are still the gold standard for making valid inferences. 
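The unbiasedness argument behind this "gold standard" status can be checked by brute force for the four-unit setting of Figure 5. The sketch below is only an illustration: the potential outcomes in `y_t` and `y_c` are made-up numbers, and the variable names are assumptions rather than anything taken from the paper. It enumerates the six equally likely states of the world (two units treated, two in control), computes the difference in means in each state, and averages across states.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical potential outcomes for the four units of Figure 5 (invented values):
# y_t[u] is the outcome of unit u under treatment, y_c[u] under control.
y_t = {1: 12, 2: 9, 3: 15, 4: 8}
y_c = {1: 10, 2: 4, 3: 11, 4: 9}
units = sorted(y_t)

# True average net effect over all four units.
true_effect = Fraction(sum(y_t[u] - y_c[u] for u in units), len(units))


def difference_in_means(treated):
    """Treatment-minus-control mean for one assignment of units to treatment."""
    control = [u for u in units if u not in treated]
    mean_t = Fraction(sum(y_t[u] for u in treated), len(treated))
    mean_c = Fraction(sum(y_c[u] for u in control), len(control))
    return mean_t - mean_c


# The six states of the world: every way to pick two treated units out of four.
states = list(combinations(units, 2))
estimates = [difference_in_means(s) for s in states]

# Averaging over the equally likely states gives the expected value of the estimator.
expected_estimate = sum(estimates) / len(estimates)

for state, estimate in zip(states, estimates):
    print("treated units", state, "estimate", estimate)
print("average over states:", expected_estimate, "true effect:", true_effect)
```

Any single state can give an estimate far from the truth; it is only the average over the randomization distribution that matches the true average effect, which is the sense of "unbiased" used in the text.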
LaLonde (1986) and Fraker and Maynard (1987) have shown that when experimental data are analyzed using standard observational methods, the results are quite different, and there are good reasons to believe that the experimental methods are much more trustworthy. Although critics (Heckman and Hotz, 1989) using better observational methods provide evidence that “tempers the recent pessimism about nonexperimental evaluation procedures” (p. 863), experimental methods still seem to be the only dependable way to make reliable inferences.
Doing Observational Studies
Where does this leave observational studies? One, rather weak, answer is that we have no choice but to use them to answer many questions for which experiments are either impractical or unethical. A better answer is that observational studies can still produce reliable inferences if we are careful to consider alternative causes and to rule out competing explanations. The basic tool for this is disciplined comparisons where we try to find as many ways as possible to compare one situation with another in order to rule out competing explanations. We can offer six kinds of tools for improving this process. They range from better theory (models that provide mechanisms and explanations), through better research design (thinking about the inference problem and employing natural experiments and matching), to improved model-building (better model selection, more concern with model uncertainty, and model replication through the use of multiple data-sets).
1. Better Theory: Models that provide mechanisms and explanations – One of the major flaws in many observational studies (and experiments as well) is that there is often very little theory to help guide the inferential task. Yet, most observational studies must make a passel of assumptions – what variables to include, the functional form of relationships, the way error enters the model – that can affect the ultimate inference.
One of the best things that social scientists can do is to develop better models that will provide guidance about these decisions. These models should pay special attention to the “social mechanisms” (Hedstrom and Swedberg, 1998) that generate and explain events. At the simplest level, this means that researchers should not be happy with regression “models” that simply throw variables into a regression. It is nowhere near enough to know that job training programs increase the work effort of welfare recipients, that the possession of civic skills increases political participation, or that proportional voting systems increase the number of political parties. Researchers must seek to understand the exact mechanisms by which training increases work effort, civic skills increase participation, and proportional voting methods lead to more parties. These mechanisms must include detailed descriptions of the decisionmaking problem facing individuals and the way that they solve this problem. For example, if some candidates in American presidential primaries gain “momentum” from winning early primaries (Bartels, 1988), then models of the individual level processes (e.g., increased name recognition or strategic voting) that might lead to momentum should be developed (Brady, 1996) and experiments should be undertaken to see whether these processes actually occur (Brady, 19xx). Social science theories should seek to explain social phenomena in the same way that the Maxwell-Boltzmann theory of gases explains the regularities of the gas laws by developing a microtheory that unifies seemingly disparate phenomena – pressure and temperature – through the concept of energy and Newton’s laws. The Maxwell-Boltzmann theory postulates a large number of individual molecules that rush about at random with varying speeds and whose average speed is affected by the amount of energy in the system. 
This theory implies that if the temperature of a gas in a container is increased, the molecules become more energetic and their average speed increases, thus increasing their momentum and the pressure they exert when they hit the sides of the vessel that contains them. If the volume of the container gets smaller, then the number of collisions between these particles and the wall of the container increases, also increasing the pressure. The Maxwell-Boltzmann theory presents a unified way to understand the gas law, and it eliminates what Clark Glymour calls contingency. The theory makes it clear that pressure must, as a consequence of Newton’s laws, increase with temperature and decrease with volume (Glymour, 1980). Explanations like these not only improve our ability to make inferences, they also provide additional reasons to believe a theory.
This call for better theory may seem utopian, but social scientists have developed theories that help guide the research process. Political scientists, for example, have gained considerable understanding of electoral systems not only through detailed empirical studies (Lijphart, 1994) but also through sophisticated modeling (Cox, 1997) which helps to explain empirical regularities. Sociologists and others have developed theories of mass political behavior (Lichbach, 1995, 1996) which explain the actions of rebels and cooperators alike. Economic theory provides guidance about both macro-economic and micro-economic phenomena.
2. Better Research Design: Thinking about the inference problem – Researchers can never worry enough about the validity of their inferences. In his book on How Experiments End (1987), Peter Galison argues that experiments end when researchers believe they have a result that will stand up in court because they cannot think of any credible ways to make it go away.
A lot of the work of inference is trying to think of ways to make results go away, and researchers should think hard about this before, during, and after a study. Researchers who never experience sleepless nights worrying “what if I am wrong?” should probably rethink their research strategies.
General frameworks for thinking about inference can help to generate lists of generic threats to inference. Fisher’s classic The Design of Experiments (1935) is all about setting up experiments in ways that will make the results stand up in court. The classic handbook for observational studies, Campbell and Stanley’s Experimental and Quasi-Experimental Designs for Research (1966, see also Cook and Campbell, 1979), provides an extraordinarily fertile list of threats to validity for many different kinds of research designs. All researchers should be familiar with the Campbell-Stanley-Cook lists.
In the past 25 years, Rubin and his collaborators (Rubin, 1974, 1978, 1990; Holland and Rubin, 1988; Holland, 1986; Rosenbaum and Rubin, 1983) have developed an elegant generalization of the Neyman framework for inference that covers experiments and observational studies. The central focus of this work has been a careful explication of the assignment or selection method (see also Heckman, 1978, Heckman and Robb, 1985). This framework has led to concrete methods for improving causal inference such as the use of propensity scores for matching (Rosenbaum and Rubin, 1983) and the analysis of the conditions under which path modeling can be successful (Holland, 1988). Every empirical researcher should become familiar with this framework. Manski (1991, 1995) has explored what can be inferred from observational studies when there are problems of extrapolation, selection, simultaneity, mixing, or reflection. An understanding of these generic problems should also be part of every researcher’s tool-kit.
Familiarity with this literature should enable researchers to develop better designs for their research which control for some of the major threats to valid inferences. Time-series studies, for example, make it possible to determine whether putative causes really change before their supposed effects. Time-series cross-sectional studies add the possibility of comparing across different units to see if the same results occur. Life-history data provide more and better controls for individual differences. None of these designs is foolproof, but they can provide some confidence that major sources of confounding have been controlled.
3. Better Research Design: Employing natural experiments and using matching – Partly as a result of doubts about observational studies, researchers have increasingly looked for “natural experiments” in which essentially random events provide some inferential leverage akin to randomized experiments. For example, the Vietnam War draft lottery randomly selected some people to enter the military which makes it possible to determine the impact of military service on future earnings (Angrist, 1990), and miscarriages occur almost at random so that they can be used to study the consequences of teenage childbearing on mothers’ incomes (Hotz, Mullin, and Sanders, 1997). This approach has been used to determine the consequences of workers’ compensation on injury duration (Meyer, Viscusi, and Durbin, 1995), past disenfranchisement on future voting (Firebaugh and Chen, 1995), uncertainty on decision-making (Metrick, 1995), ballot form on voting choices (Wand, Shotts, Sekhon, Mebane, Herron, and Brady, 2001), the draft lottery and veteran status on lifetime earnings (Angrist, 1990), the minimum wage on total employment (Card and Krueger, 1994), the number of children in a family on their future life prospects (Rosenzweig and Wolpin, 1980), and political parties on voting behavior in the U.S. and Confederate Houses during the Civil War (Jenkins, 1999).
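The leverage that a lottery provides can be illustrated with simulated data. The sketch below is not a reproduction of any of the studies cited above: the data-generating process, the parameter values, and every name in it are assumptions chosen for illustration. An unobserved trait (`ability`) drives both volunteering for service and earnings, so a naive comparison of those who served with those who did not is confounded. Because the lottery draw `z` is random, comparing outcomes by lottery status and rescaling by the difference in service rates (a Wald-type ratio, one common way to exploit such designs) recovers the assumed effect of service.

```python
import random


def simulate_lottery(n=20000, true_effect=-1.5, seed=1):
    """Simulate (lottery draw, service, earnings) under invented assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)            # unobserved confounder
        z = 1 if rng.random() < 0.5 else 0       # random lottery draw
        drafted = z == 1 and rng.random() < 0.4  # the lottery pushes some people into service
        volunteers = ability < -1.0              # low-ability people serve regardless of the draw
        d = 1 if (drafted or volunteers) else 0
        y = 10.0 + 2.0 * ability + true_effect * d + rng.gauss(0.0, 1.0)
        rows.append((z, d, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate_lottery()

# Naive observational comparison: served versus did not serve (confounded by ability).
naive = mean(y for z, d, y in rows if d == 1) - mean(y for z, d, y in rows if d == 0)

# Natural-experiment comparison: outcomes and service rates by lottery status.
reduced_form = mean(y for z, d, y in rows if z == 1) - mean(y for z, d, y in rows if z == 0)
first_stage = mean(d for z, d, y in rows if z == 1) - mean(d for z, d, y in rows if z == 0)
wald = reduced_form / first_stage

print("naive estimate:", round(naive, 2))
print("lottery-based estimate:", round(wald, 2), "(assumed true effect: -1.5)")
```

The ratio works here only because the simulation builds in a constant effect of service and a lottery that affects earnings solely through service; in real applications both assumptions have to be argued for.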
Another method that has shown some promise for yielding good inferences is sophisticated matching methods based upon variants of the propensity score of Rosenbaum and Rubin (1983). For example, a study of the impact of Workers’ Compensation on future wages might start from workers who have been injured and then match them with a number of other workers in the same firm with similar characteristics who have not been injured. The estimated impact of the injury on earnings is simply the difference between the earnings of the non-injured and the injured workers. These data could be analyzed with standard regression techniques by regressing future wages on worker and firm characteristics with a dummy variable for injury, but this approach inevitably makes strong assumptions about functional forms. Matching techniques appear to rely on fewer assumptions and seem to provide good estimates of program impacts (Heckman, Ichimura, and Todd, 1997; Friedlander, Greenberg, and Robins, 1995), but much more needs to be learned about their strengths and limitations.
4. Improved Model Building: Model selection through encompassing and specification tests – If researchers must do observational studies, then they should be much more thoughtful about model selection. A number of strategies and statistical tests have been developed to improve this process. The “encompassing” methods of Hendry and his colleagues (Hendry and Richard, 1982; Mizon, 1984; Gilbert, 1990; Hendry, 1995, Chpt. 14) emphasize evaluating alternative theories by developing frameworks that encompass all the theories. Test statistics are then used to evaluate which theories fit the data. Leamer (1990) worries about fragile inferences, and he asks that “all empirical studies offer convincing evidence of inferential sturdiness” (page 88). He uses Bayesian methods to determine how sensitive parameter estimates are to decisions about model specification.
Sims (1980), working in a time-series context, advocates using the minimal amount of prior information and letting the data tell the story in vector auto-regressions in which each variable is regressed on its own lagged values and the current and lagged values of all other variables. Sims goes too far for my taste, but his concerns about the precarious state of our prior knowledge are well worth considering. Pagan (1990) provides a nice comparison of all three methods. In a series of publications, White (1982, 1990, 1994) has developed methods for producing “robust” standard errors and for testing the specification of models. Although it is already somewhat dated, the 1990 volume edited by Clive Granger provides readings on all these approaches. Davidson and MacKinnon (1990) provide a summary of simple ways to perform specification tests. Heckman and Hotz (1989) employ these methods to show that observational methods can produce estimates of impacts close to experimental results.
5. Improved Model Building: Model Uncertainty and Model Averaging – Once a model is selected, the researcher should be very skeptical about it. At least since Edward Leamer’s challenge to “Let’s Take the ‘con’ out of Econometrics” (1983), economists have been more sensitive to the way in which their “specification searches” (Leamer, 1978) make a mockery of standard statistical tests. The problem is very simple. Using the same dataset, researchers invariably try many different specifications before they find the one that they report, but when they report it, they attach significance levels to parameter estimates as if they had only tested this one specification. No account is taken of the pre-testing. There is ample evidence that this procedure leads to overfitting in which there is a much greater sense of confidence in the result than is warranted. In effect, all measures of fit are too optimistic and all standard errors are too small (Draper, 1995; Chatfield, 1995).
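The size of the pre-testing problem can be seen in a small simulation. The following sketch rests on stated assumptions and is not drawn from the cited papers: the outcome and all candidate regressors are independent noise, each "specification" is a one-variable regression, and the sample size, number of candidates, and approximate 5 percent critical value (2.01 for 48 degrees of freedom) are arbitrary illustrative choices. It compares how often a single pre-specified regressor appears significant with how often the best of twenty candidates does.

```python
import math
import random


def t_statistic(x, y):
    """t statistic for the slope in a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def simulate(n_sims=400, n_obs=50, n_candidates=20, critical=2.01, seed=0):
    """Share of 'significant' results with one specification versus a search."""
    rng = random.Random(seed)
    single_hits = 0
    search_hits = 0
    for _ in range(n_sims):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # outcome is pure noise
        t_values = []
        for _ in range(n_candidates):
            x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # regressor is pure noise
            t_values.append(abs(t_statistic(x, y)))
        single_hits += t_values[0] > critical    # one pre-specified model
        search_hits += max(t_values) > critical  # report the best of the search
    return single_hits / n_sims, search_hits / n_sims


single_rate, search_rate = simulate()
print("false-positive rate, one specification:", single_rate)
print("false-positive rate, best of 20 specifications:", search_rate)
```

With nothing to find in the data, the pre-specified model should cross the threshold about 5 percent of the time, while the reported "best" model crosses it far more often, which is the sense in which nominal significance levels understate uncertainty after a search.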
One way to diagnose the extent of this problem is to put some data aside during the model selection phase and to use it to validate the model (Picard and Cook, 1984) once it is chosen. Unfortunately, this only makes the researcher aware of the problem; it does not solve it. Model averaging (Draper, 1995) incorporates model uncertainty by averaging over a number of different model specifications. Bartels (1997) provides an accessible introduction to the method, and Bartels and Zaller (2001) apply it to presidential election forecasts. Chatfield (1995) provides a general discussion of model uncertainty.
6. Improved Model Building: Model Replication with Multiple Data Sets and Multiple Studies – The model selection methods described above can improve the quality of inferences, but they still lead to just one model that might be flawed. Incorporating model uncertainty can usefully increase our skepticism about any one model by considering an array of models. But these methods still use only one dataset, and they cannot protect researchers from a mischievous Nature that fails to vary or include important variables in a dataset. Therefore, the ultimate test of any finding is that it can be reproduced in other datasets derived from varying circumstances and situations. Ultimately, researchers should be looking for regularities across “many sets of data, drawn from different populations” (Ehrenberg and Bound, 1993). The statistical analysis of many such studies, called “meta-analysis” (Hedges and Olkin, 1985), is one way to provide stronger evidence for a hypothesis. Perhaps even better, researchers should look for circumstances that might be likely to disprove their hypothesis in order to test its limits.
Conclusions
Making an inference is engaging in an argument with nature. In the course of this argument, we must presume that nature is mischievous, if not downright cunning and deceitful.
There is no reason to believe that our initial theories are correct or that the data we have are very illuminating. We must constantly think of new queries to ask our adversary, and we must be skeptical of the answers we get. Randomized experiments provide a way to reduce nature’s efforts to confound us, but even they are not foolproof. In the end, David Freedman is right in saying that success depends upon “the clarity of the prior reasoning, the bringing together of many different lines of evidence, and the amount of shoe leather” (1991, p. 298) provided by the researcher. For observational studies, there is no specific technique that will solve the problem of making good inferences. But there are lots of things we can do better, and we have described many of them.

Figure 5 -- Outcomes Yi(u) Under Different Conditions and Experimental Set-Ups

             Conditions (i)        Possible States of the World with Only Two Units Getting Each Condition
Units (u)    Treatment   Control   A       B       C       D       E       F
1            Yt(1)       Yc(1)     Yt(1)   Yt(1)   Yt(1)   Yc(1)   Yc(1)   Yc(1)
2            Yt(2)       Yc(2)     Yt(2)   Yc(2)   Yc(2)   Yc(2)   Yt(2)   Yt(2)
3            Yt(3)       Yc(3)     Yc(3)   Yt(3)   Yc(3)   Yt(3)   Yt(3)   Yc(3)
4            Yt(4)       Yc(4)     Yc(4)   Yc(4)   Yt(4)   Yt(4)   Yc(4)   Yt(4)

References [Incomplete]
Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.” The American Economic Review 80(June):313-336.
Bartels, Larry M. 1997. “Specification Uncertainty and Model Averaging.” American Journal of Political Science 41(April):641-674.
Bartels, Larry M., and Zaller, John. 2001. “Presidential Vote Models: A Recount.” Political Science & Politics 34(1):9-20.
Berkson, Joseph. 1950. “Are There Two Regressions?” Journal of the American Statistical Association 45(June):164-180.
Blalock, H.M. Jr. 1964. Causal Inferences in Nonexperimental Research. Chapel Hill, NC: University of North Carolina.
Booth, Charles. 1896. “Poor Law Statistics.” The Economic Journal 6(March):70-74.
Bronars, Stephen G.
and Jeff Grogger. 1994. “The Economic Consequences of Unwed Motherhood: Using Twin Births as a Natural Experiment.” The American Economic Review 84(December):1141-1156.
Campbell, D.T. and J.C. Stanley. 1966. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
Card, David and Alan B. Krueger. 1994. “Minimum Wages and Employment: A Case Study of The Fast Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84(September):772-793.
Chatfield, Chris. 1995. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of The Royal Statistical Society, Series A (Statistics in Society), 158(3):419-466.
Clogg, Clifford C. and Haritou, Adamantios. 1997. “The Regression Method of Causal Inference and a Dilemma Confronting This Method.” In Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences, ed. Vaughn R. McKim and Stephen P. Turner. Notre Dame, IN: University of Notre Dame.
Cochran, William G. 1976. “Early Development of Techniques in Comparative Experimentation.” In On the History of Statistics and Probability. Statistics Textbooks and Monographs, ed. D.B. Owen, vol. 17. New York: Marcel Dekker, Inc.
Conant, James Bryant. 1957. “Robert Boyle’s Experiments in Pneumatics, Edited by James Bryant Conant.” In Harvard Case Studies in Experimental Science, Vol. 1, ed. James Bryant Conant. Cambridge, MA: Harvard University Press.
Cox, Gary W. 1997. Making Votes Count: Strategic Coordination in the World’s Electoral Systems. Cambridge, New York, and Melbourne: Cambridge University Press.
Davidson, Russell, and James G. MacKinnon. 1990. “Specification Tests Based on Artificial Regressions.” Journal of the American Statistical Association 85(March):220-227.
Draper, David. 1995. “Assessment and Propagation of Model Uncertainty.” Journal of the Royal Statistical Society, Series B (Methodological), 57(1):45-97.
Ehrenberg, A.S.C. 1968.
“The Elements of Lawlike Relationships.” Journal of the Royal Statistical Society, Series A (General), 131(3):280-302.
Ehrenberg, A.S.C., and J.A. Bound. 1993. “Predictability and Prediction.” Journal of the Royal Statistical Society, Series A (Statistics in Society), 156(2):167-206.
Firebaugh, Glenn, and Kevin Chen. 1995. “Vote Turnout of Nineteenth Amendment Women: The Enduring Effect of Disenfranchisement.” American Journal of Sociology 100(January):972-996.
Fisher, R.A. [1926] 1972. “The Arrangement of Field Experiments.” In Collected Papers of R.A. Fisher, vol. II, 1925-31, ed. J.H. Bennett. Adelaide, Australia: University of Adelaide.
Fisher, Ronald A. [1935] 1971. The Design of Experiments. 9th ed. New York: Hafner Press.
Freedman, David A. 1991. “Statistical Models and Shoe Leather.” Sociological Methodology 21(1991):291-313.
Freedman, David A. 1997. “From Association to Causation via Regression.” In Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences, ed. Vaughn R. McKim and Stephen P. Turner. Notre Dame, IN: University of Notre Dame.
Galison, Peter. 1987. How Experiments End. Chicago: University of Chicago Press.
Galton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” Journal of the Anthropological Institute of Great Britain and Ireland 15(1886):246-263.
Galton, Francis. 1888. “Co-Relations and Their Measurement, Chiefly from Anthropometric Data.” Proceedings of the Royal Society of London 45(1888-1889):135-145.
Gigerenzer, Gerd, Zeno Swijtink, Theodore Porter, Lorraine Daston, John Beatty, and Lorenz Krüger. 1989. The Empire of Chance: How probability changed science and everyday life. Cambridge, New York, and Melbourne: Cambridge University Press.
Gilbert, Christopher L. 1990. “Professor Hendry’s Econometric Methodology.” In Modelling Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University Press.
Glymour, Clark. 1980.
“Explanations, Tests, Unity and Necessity.” NOÛS, A.P.A. Western Division Meetings, March 1980, 14(1):31-50.
Goldberger, Arthur S. and Otis Dudley Duncan. 1973. Structural Equation Models in the Social Sciences. New York, San Francisco, and London: Seminar Press.
Haavelmo, Trygve. 1943. “The Statistical Implications of a System of Simultaneous Equations.” Econometrica 11(January):1-12.
Heckman, James J., and V. Joseph Hotz. 1989. “Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training.” Journal of the American Statistical Association 84(December):862-874.
Heckman, James J. 1992. “Randomization and Social Policy Evaluation.” In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel. Cambridge, MA: Harvard University Press.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.” The Review of Economic Studies 64, Special Issue: Evaluation of Training and Other Social Programmes (October):605-654.
Hedström, Peter and Swedberg, Richard, eds. 1998. Social Mechanisms: An Analytical Approach to Social Theory. Cambridge, New York, and Melbourne: Cambridge University Press.
Hendry, David F. and J-F. Richard. 1990. “On the Formulation of Empirical Models in Dynamic Econometrics.” In Modelling Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University Press.
Hendry, David F. 1997. Dynamic Econometrics. 3rd ed. Advanced Texts in Econometrics. New York: Oxford University Press.
Hodges, James S. 1987. “Uncertainty, Policy Analysis and Statistics.” Statistical Science 2(August):259-275.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81(December):945-960.
Holland, Paul W. 1988. “Causal Inference and Path Analysis.” Sociological Methodology 18(1988):449-484.
Holland, Paul W. and Donald B. Rubin. 1988. “Causal Inference in Retrospective Studies.” Evaluation Review 12:203-231.
Hood, Wm. C. and Koopmans, Tjalling C. [1953] 1970. Studies in Econometric Method. 3rd ed. Cowles Foundation Monograph 14. New Haven, CT: Yale University Press.
Hotz, V. Joseph, Charles H. Mullin, and Seth G. Sanders. 1997. “Bounding Causal Effects Using Data From a Contaminated Natural Experiment: Analysis of the Effects of Teenage Childbearing.” The Review of Economic Studies 64, Special Issue: Evaluation of Training and Other Social Programmes (October):575-603.
Hotz, V. Joseph, Guido W. Imbens, and Jacob A. Klerman. 2000. “The Long-Term Gains from GAIN: A Re-Analysis of the Impacts of the California GAIN program.” National Bureau of Economic Research, Working Paper No. W8007, November 2000. Available from National Bureau of Economic Research, http://papers.nber.org/papers/W8007.
Jenkins, Jeffery A. 1999. “Examining the Bonding Effects of Party: A Comparative Analysis Of Roll-Call Voting in the U.S. and Confederate Houses.” American Journal of Political Science 43(October):1144-1165.
Koopmans, Tjalling C. 1949. “Identification Problems in Economic Model Construction.” Econometrica 17(April):125-144.
Koopmans, Tjalling C., H. Rubin, and R.B. Leipnik. 1950. “Measuring the Equation Systems Of Dynamic Economics.” In Statistical Inference in Dynamic Economic Models, ed. Tjalling C. Koopmans. New York: John Wiley.
Leamer, Edward E. 1978. Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: John Wiley & Sons.
Leamer, Edward E. 1983. “Let’s Take the Con Out of Econometrics.” The American Economic Review 73(March):31-43.
Leamer, Edward E. 1985. “Sensitivity Analyses Would Help.” The American Economic Review 75(June):308-313.
Lichbach, Mark Irving. 1995. The Rebel’s Dilemma. Ann Arbor: University of Michigan Press.
Lichbach, Mark Irving. 1996. The Cooperator’s Dilemma. Ann Arbor: University of Michigan Press.
Manski, Charles F. 1993. “Identification Problems in the Social Sciences.” Sociological Methodology 23(1993):1-56.
Manski, Charles F. 1995. Identification Problems in the Social Sciences. Cambridge, MA: Harvard University Press.
McKim, Vaughn R. and Turner, Stephen P., eds. 1997. Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences. Notre Dame, IN: University of Notre Dame.
Menzies, Peter. 2001. “Counterfactual Theories of Causation.” In The Stanford Encyclopedia of Philosophy (Spring 2001 edition), ed. Edward N. Zalta. Available on-line at http://plato.Stanford.edu/entries/causation-counterfactual/.
Metrick, Andrew. 1995. “A Natural Experiment in ‘Jeopardy!’” The American Economic Review 85(March):240-253.
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin. 1995. “Workers’ Compensation and Injury Duration: Evidence from a Natural Experiment.” The American Economic Review 85(June):322-340.
Mizon, G.E. 1990. “The Encompassing Approach in Econometrics.” In Modelling Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University Press.
Pagan, Adrian R. 1990. “Three Econometric Methodologies: A Critical Appraisal.” In Modelling Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University Press.
Picard, Richard R. and R. Dennis Cook. 1984. “Cross-Validation of Regression Models.” Journal of the American Statistical Association 79(September):575-583.
Porter, Theodore M. 1986. The Rise of Statistical Thinking 1820-1900. Princeton, NJ: Princeton University Press.
Rosenbaum, Paul R. and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70(April):41-55.
Rosenzweig, Mark R. and Kenneth I. Wolpin. 1980. “Testing the Quantity-Quality Fertility Model: The Use of Twins as a Natural Experiment.” Econometrica 48(January):227-240.
Rubin, Donald B. 1974.
“Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66:688-701.
Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” Annals of Statistics 6(January):34-58.
Rubin, Donald B. 1990. “[On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.] Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies.” Statistical Science 5(November):472-480.
Scheffe, Henry. 1956. “Alternative Models for the Analysis of Variance.” Annals of Mathematical Statistics 27(2):251-271.
Simon, Herbert A. 1954. “Spurious Correlation: A Causal Interpretation.” Journal of the American Statistical Association 49(September):467-479.
Sims, Christopher A. 1990. “Macroeconomics and Reality.” In Modelling Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University Press.
Splawa-Neyman, Jerzy, D.M. Dabrowska, and T.P. Speed. 1990. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical Science 5(November):465-472.
Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap Press, Harvard University.
Stone, Richard. 1993. “The Assumptions on which Causal Inferences Rest.” Journal of the Royal Statistical Society, Series B (Methodological), 55(2):455-466.
Student. 1909. “The Distribution of the Means of Samples which are Not Drawn at Random.” Biometrika 7(July-October):210-214.
Student. 1923. “On Testing Varieties of Cereals.” Biometrika 15(December):271-293.
Turner, Stephen P. 1997. “‘Net Effects’: A Short History.” In Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences, ed. Vaughn R. McKim and Stephen P. Turner. Notre Dame, IN: University of Notre Dame.
White, Halbert. 1982.
“Maximum Likelihood Estimation of Misspecified Models.” Econometrica 50 (January):1-26. 40 White, Halbert. 1990. “A Consistent Model Selection.” In Modelling Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University Press. White, Halbert. 1994. Estimation, inference and specification analysis. Econometric Society Monographs, no. 22. Cambridge, New York, and Melbourne: Cambridge University Press. Wishart, John. 1934. “Statistics in Agricultural Research.” Supplement to the Journal of the Royal Statistical Society 1(1):26-61. Wold, Herman. 1954. “Causality and Econometrics.” Econometrica 22(April):162-177. Wold, Herman. 1956. “Causal Inference from Observational Data: A Review of End and Means.” Journal of the Royal Statistical Society, Series A (General), 119(1):28-61. Wold, H.O.A. and Jureen, L. 1953. Demand Analysis. New York: John Wiley. Wright, S. 1934. “The Method of Path Coefficients.” Annals of Mathematical Statistics 5(1934):161-215. Yule, G. Udny. 1895. “On the Correlation of Total Pauperism with Proportion of Out-Relief.” The Economic Journal 5(December):603-611. Yule, G. Udny. 1896. “Notes on the History of Pauperism in England and Wales from 1850, Treated by the Method of Frequency-Curves; with an Introduction on the Method.” Journal of the Royal Statistical Society 59(June):318-357. Yule, G. Udny. 1896. “On the Correlation of Total Pauperism with Proportion of Out-Relief.” The Economic Journal 6(December):613-623. Yule, G. Udny. 1897a. “On the Significance of Bravais’ Formulae for Regression, &c., in the Case of Skew Correlation.” Proceedings of the Royal Society of London 60(1897):477-489. Yule, G. Udny. 1897b. “On the Theory of Correlation.” Journal of the Royal Statistical Society 60(December):812-854. Yule, G. Udny. 1899. 
“An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades, Part I.” Journal of the Royal Statistical Society 62(June):249-295. Yule, Udny G. 1907. “On the Theory of Correlation for any Number of Variables, Treated by 41 A New System of Notation.” Proceedings of the Royal Society of London, Series A, Containing Papers of a Mathematical and Physical Character 79(May 14):182-193. 42 Figure 1 Boyle's Experimental Data 4.0 3.8 3.6 3.4 Log of Pressure 3.2 3.0 2.8 2.6 2.4 3.2 Rsq = 0.9999 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 Log of Volume Page 1 Figure 2 -- Distribution of Children's and Parent's Heights from Galton's Data Number of Children or Parents 600 500 400 300 200 Std. Dev = 2.21 100 Mean = 68.20 N = 1856.00 0 61.25 67.25 64.25 73.25 70.25 Distribution of Heights Page 3 Figure 3 -- Regression of Children's versus Parents' Heights from Galton's Data 74 Children's Heights 72 70 68 66 64 62 62 64 66 68 70 72 74 Parents' Heights Cases weighted by NUMBER Page 2 Figure 4 -- Yule's Data on Pauperism and Out-Door Relief 9% 8% 7% 6% Percent Paupers 5% 4% 3% 2% 1% 0% 0 2 4 6 8 10 12 14 16 18 Ratio of Out-Door Relief to Indoor Relief Page 4