Models of Causal Inference: Going Beyond the Neyman-Rubin-Holland Theory

Henry E. Brady

July 16, 2002

Paper Presented at the Annual Meetings of the Political Methodology Group, University of Washington, Seattle, Washington
Causation and Explanation in Social Science
Henry E. Brady, University of California, Berkeley
Causality
Humans depend upon causation all the time to explain what has happened to them, to make
realistic predictions about what will happen, and to affect what happens in the future. Not
surprisingly, we are inveterate searchers after causes.1 Almost no one goes through a day without
uttering sentences of the form “X caused Y” or “Y occurred because of X,” even if the utterances
are for the mundane purposes of
-- explaining why the tree branch fell (“the high winds caused the branch to break and
gravity caused it to fall”),
-- predicting that we will be late for work (“the traffic congestion will cause me to be
late”), or
-- affecting the future by not returning a phone call (“I did not call because I do not want
that person to bother me again”).
All these statements have the same form in which a cause (X) leads to an effect (Y).2
Social scientists typically deal with bigger and more contentious causal claims such as:
“The economy grew because of the increase in the money supply.”
“The USSR became highly repressive in the 1920's and 1930's because of Stalin’s
accession to power after Lenin.”
“The Protestant Reformation caused the development of capitalism in the West.”
“The lack of strict work requirements causes people to stay on welfare a long time.”
“‘Duverger’s Law’ – Single member electoral districts with a plurality voting rule lead to
two political parties while proportional representation creates many small parties.”
“The butterfly ballot in Palm Beach County, Florida, in the 2000 election caused Al Gore to
lose the election.”
1. Humans also depend upon concepts to describe and understand the world, and other parts of this book focus on how concepts are formed and used.

2. The word “because” suggests that an explanation is being offered as well as a causal process. One way of explaining an event is to identify a cause for it. At the end of this paper, we discuss the relationship between determining causes and proffering explanations.
These are bigger causal claims, but the hunger for causal statements is just as great and the form
of the statement is the same.3 The goals are also the same. Causal statements explain events,
allow predictions about the future, and make it possible to take actions to affect the future.
Knowing more about causality can be useful to social science researchers.
Philosophers and statisticians know something about causality, but entering into the philosophical
and statistical thickets is a daunting enterprise for social scientists because it requires technical
skills (e.g., knowledge of modal logic) and technical information (e.g., knowledge of probability
theory) that are not easily mastered. The net payoff from forays into philosophy or statistics
sometimes seems small compared to the investment required. The goal of this paper is to provide
a user-friendly synopsis of philosophical and statistical musings about causation. Some technical
issues will be discussed, but the goal will always be to ask about the bottom line – how can this
information make us better researchers?
Three types of intellectual questions typically arise in philosophical discussions of causality:
– Psychological and linguistic – What do we mean by causality when we use the
concept?
– Metaphysical or ontological – What is causality?
– Epistemological – How do we discover when causality is operative?4
Four distinct theories of causality, summarized in Table 1, provide answers to these and other
questions about causality. Philosophers debate which theory is the right one. For our purposes,
we embrace them all. Our primary goal is developing better social science methods, and our
perspective is that all these theories capture some aspect of causality.5 Therefore, practical
researchers can profit from drawing lessons from each one of them even though their proponents sometimes treat them as competing or even contradictory. Our standard has been whether or not we could think of concrete examples of research that utilized (or could have utilized) a perspective to some advantage. If we could think of such examples, then we think it is worth drawing lessons from the causal theory.

3. Some philosophers deny that causation exists, but we agree with the philosopher D. H. Mellor (1995) who says: “I cannot see why. I know that educated and otherwise sensible people, even philosophers who have read Hume, can hold bizarre religious beliefs. I know philosophers can win fame and fortune by gulling literary theorists and others with nonsense they don’t themselves believe. But nobody, however gullible, believes in no causation (page 1).” Even the political scientist Alexander Wendt (1999), who defends a constructivist approach to international relations theory, argues that a major task for social scientists is answering causal questions. He identifies “constitutive” theorizing, which we consider descriptive inference and concept formation, as the other major task. For a thoroughgoing rejection of causal argument see Taylor (1971).

4. A fourth question is pragmatic – How do we convince others to accept our explanation or causal argument? A leading proponent of this approach is Bas van Fraassen (1980). Kitcher and Salmon (1987, page 315) argue that “van Fraassen has offered the best theory of the pragmatics of explanation to date, but ... if his proposal is seen as a pragmatic theory of explanation then it faces serious difficulties” because there is a difference between “a theory of the pragmatics of explanation and a pragmatic theory of explanation.” From their perspective, knowing how people convince others of a theory does not solve the ontological or epistemological problems.

5. Margaret Somers (1998), in a sprawling and tendentious 63-page article replying to Kiser and Hechter’s advocacy of rational choice models in comparative sociological research (1991), ransacks philosophical theories to criticize rational choice theory and to defend historical sociology. Neither the attack nor the defense is very successful because Somers draws her ammunition from arguments made by philosophers of science. If philosophy of science were a settled field like logic or geometry, using its arguments might be a sensible strategy, but in its unsettled state, it seems an unreliable touchstone for social science methodologists. Turning to it for guidance merely multiplies our confusions as methodologists by their confusions as philosophers of science rather than adding to our understanding by providing new insights. In the end, the article accomplishes little for the practicing researcher except to suggest the limited usefulness of the philosophy of science for our enterprise. The flavor of the article can be quickly grasped: “Following the Lakatosian route out of Kuhn (through Popper) into the ‘hard core’ of ‘general theory,’ theoretical realism thus allows Kiser and Hechter to accomplish by theoretical fiat that which has for centuries confounded philosophers…” (page 748). Goldstone (1998) provides some useful correctives, and Skocpol and Somers (1980) provide a better defense of historical sociology. See also Kiser and Hechter (1998). As we shall see later, Somers also fails to understand the way that mechanisms can provide an explanation.

Table 1
Four Theories of Causality

Neo-Humean Regularity Theory
– Major authors associated with the theory: Hume (1739); Mill (1888); Hempel (1965); Beauchamp & Rosenberg (1981).
– Approach to the symmetric aspect of causality: Observation of constant conjunction and correlation.
– Approach to the asymmetric aspect of causality: Temporal precedence.
– Major problems solved: Necessary connection.
– Emphasis on causes of effects or effects of causes: Causes of effects (e.g., focus on dependent variable in regressions).
– Studies with comparative advantage using this definition: Observational and causal modeling.

Counterfactual Theory
– Major authors associated with the theory: Weber (1906); Lewis (1973; 1986).
– Approach to the symmetric aspect of causality: Truth, in otherwise similar worlds, of “If the cause occurs then so does the effect” and “if the cause does not occur then the effect does not occur.”
– Approach to the asymmetric aspect of causality: Consideration of the truth of “If the effect does not occur, then the cause may still occur.”
– Major problems solved: Singular causation; nature of necessity.
– Emphasis on causes of effects or effects of causes: Effects of causes (e.g., focus on treatment’s effects in experiments).
– Studies with comparative advantage using this definition: Experiments; case study comparisons; counterfactual thought experiments.

Manipulation Theory
– Major authors associated with the theory: Gasking (1955); Menzies & Price (1993); von Wright (1971).
– Approach to the symmetric aspect of causality: Recipe that regularly produces the effect from the cause.
– Approach to the asymmetric aspect of causality: Observation of the effect of the manipulation.
– Major problems solved: Common cause and causal direction.
– Emphasis on causes of effects or effects of causes: Effects of causes (e.g., focus on treatment’s effects in experiments).
– Studies with comparative advantage using this definition: Experiments; natural experiments; quasi-experiments.

Mechanisms and Capacities
– Major authors associated with the theory: Harre & Madden (1975); Cartwright (1989); Glennan (1996).
– Approach to the symmetric aspect of causality: Consideration of whether there is a mechanism or capacity that leads from the cause to the effect.
– Approach to the asymmetric aspect of causality: An appeal to the operation of the mechanism.
– Major problems solved: Preemption.
– Emphasis on causes of effects or effects of causes: Causes of effects (e.g., focus on mechanism that creates effects).
– Studies with comparative advantage using this definition: Analytic models; case studies.
Indeed, we believe that a really good causal inference should satisfy the requirements of all four
theories. Causal inferences will be stronger to the extent that they rest on all of the following: (1) constant conjunction of causes and effects, as required by the neo-Humean theory; (2) no effect when the cause is absent in the world most similar to the one where the cause is present, as required by the counterfactual theory; (3) an effect after the cause is manipulated, as required by the manipulation theory; and (4) activities and processes linking causes and effects, as required by the mechanism theory. The claim that
smoking causes lung cancer, for example, first arose in epidemiological studies that found a
correlation between smoking and lung cancer. These results were highly suggestive to many, but
this correlational evidence was insufficient to others (including one of the founders of modern
statistics, R. A. Fisher). These studies were followed by experiments that showed that, at least in
animals, the absence of smoking reduced the incidence of cancer compared to the incidence with
smoking when similar groups were compared. But animals, some suggested, are not people.
Other studies showed that when people stopped smoking (that is, when the putative cause of
cancer was manipulated) the incidence of cancer went down as well. Finally, recent studies have
uncovered biological mechanisms that explain the link between smoking and lung cancer. Taken
together the evidence for a relationship between smoking and lung cancer now seems
overwhelming.
The remainder of this chapter explains the four theories in much more detail. Before providing
this detail, we first define the notion of counterfactual which crops up again and again in
discussions of causality. Then we briefly discuss the nature of psychological, ontological, and
epistemological arguments regarding causality in order to situate our own efforts and to develop a
language for thinking about causality. The central part of the chapter elaborates upon the four
theories in Table 1. Then a well-known statistical approach to causation proposed by Jerzy
Neyman, Don Rubin, and Paul Holland is discussed in light of the four theories. The chapter ends
with a discussion of causation and explanation.
Counterfactuals
Causal statements are so useful that most people cannot let an event go by without asking why it
happened and offering their own “because”. They often enliven these discussions with
counterfactual assertions such as “If the cause had not occurred, then the effect would not have
happened.” A counterfactual is a statement, typically in the subjunctive mood, in which a false or
“counter to fact” premise is followed by some assertion about what would have happened if the
premise were true. For example, if someone drank too much and got in a terrible accident, then a
counterfactual assertion might be “if he had not drunk five scotch and sodas, he would not have
had that terrible accident.” The statement uses the subjunctive (“if he had not drunk...., he would
not have had”), and the premise is counter to the facts. The premise is false because the person
did drink five scotch and sodas in the real world as it unfolded. The counterfactual claim is that
without this drinking, the world would have proceeded differently, and he would not have had the
terrible accident. Is this true?
The truth of counterfactuals is closely related to the existence of causal relationships. The
counterfactual claim made above implies that there is a causal link between drinking five scotch
and sodas (the cause X) and the terrible accident (the effect Y). The counterfactual, for example, would be true if the drinking caused the person to drive recklessly and hit another car on the way home. Therefore, if the person had not drunk five scotch and sodas, then
he would not have driven recklessly and the accident would not have occurred – thus
demonstrating the truth of the counterfactual “if he had not drunk five scotch and sodas he would
not have had that terrible accident.”
Another way to think about this is to simply ask what would have happened in the most similar
world in which the person did not drink the five scotch and sodas. Would the accident still have
happened? One way to do this would be to rerun the world with the cause eradicated so that no
scotch and sodas are drunk. The world would otherwise be the same. If the accident does not
occur, then we would say that the counterfactual is true. Thus, the statement that drinking caused
the accident is essentially the same as saying that in the most similar world in which drinking did
not occur, the accident did not occur either. The existence of a causal connection can be checked
by determining whether or not the counterfactual would be true in the most similar possible world
where its premise is true. The problem, of course, is defining the most similar world and finding
evidence for what would happen in it.
Consider the problem of definition first. Suppose, for example, that in the real world the person
drank five scotch and sodas, got on a bus, and the bus driver got into a terrible accident which
injured the drinker. In this case, the most similar world, in which the (sober) person got on the bus, would still have led to the terrible accident that hurt the drinker in the actual world.
Drinking could not be held responsible for the accident in this case. Or could it? What is the
most similar world? Would the person who took the bus when he got drunk have taken the bus if
he had not gotten drunk? Would he have driven home instead? If he had driven home, wouldn’t
he have avoided the accident on the bus? Which is the most similar world, the one in which the
person takes the bus or takes an automobile? This is a difficult question.
Beyond these definitional questions about most similar worlds, there is the problem of finding
evidence for what would happen in the most similar world. We cannot rerun the world so that the
person does not drink five scotch and sodas. What can we do? Many philosophers have wrestled
with this question, and we discuss the problem in detail later in the section on the counterfactual
theory of causation.6 For now, we merely note that people act as if they can solve this problem
because they assert the truth of counterfactual statements all the time.
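The difficulty can be stated compactly in the potential-outcomes notation later associated with Neyman, Rubin, and Holland: each unit has one outcome in the world where the cause occurs and another in the most similar world where it does not, but only one of the two is ever observed. The sketch below, with invented data for the drinking example, is illustrative only:

```python
# Potential-outcomes sketch of the counterfactual problem (invented data).
# Each unit has two potential outcomes: y1 if the cause occurs, y0 if not.
# Only the outcome for the world that actually unfolded is observed.

units = [
    {"cause": True,  "y1": 1, "y0": 0},  # drank; sober world has no accident
    {"cause": True,  "y1": 1, "y0": 1},  # drank; bus crash happens either way
    {"cause": False, "y1": 0, "y0": 0},  # sober; no accident in either world
]

for u in units:
    u["observed"] = u["y1"] if u["cause"] else u["y0"]  # what we can see
    u["effect"] = u["y1"] - u["y0"]  # knowable only by "rerunning the world"

print([u["observed"] for u in units])  # [1, 1, 0] -- all the data we get
print([u["effect"] for u in units])    # [1, 0, 0] -- what we want to know
```

The counterfactual “if he had not drunk, no accident” is true only for the first unit, where the effect is 1; nothing in the observed column alone distinguishes it from the second.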
6. Standard theories of logic cannot handle counterfactuals because propositions with false premises are
automatically considered true which would mean that all counterfactual statements, with their false
premises, would be true, regardless of whether or not a causal link existed. Modal logics, which try to
capture the nature of necessity, possibility, contingency, and impossibility, have been developed for
counterfactuals (Lewis, 1973). These logics typically judge the truthfulness of the counterfactual on
whether or not the statement would be true in the most similar possible world where the premise is true.
Problems arise, however, in defining the most similar world. These logics, by the way, typically broaden
the definition of counterfactuals to include statements with true premises for which they consider the
closest possible world to be the actual world so that their truth value is judged by whether or not their
conclusion is true in the actual world.
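The point about false premises in standard logic can be checked directly: material implication “p implies q” is false only when p is true and q is false, so any conditional with a false premise comes out true:

```python
def implies(p: bool, q: bool) -> bool:
    # Material implication: false only when the premise holds and the
    # conclusion fails; a false premise makes the conditional true.
    return (not p) or q

# With a false premise, the truth of the conclusion is irrelevant:
print(implies(False, True))   # True
print(implies(False, False))  # True
print(implies(True, False))   # False -- the only falsifying case
```

This is why counterfactuals, whose premises are false by definition, require the modal-logic treatment described above rather than ordinary truth tables.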
In everyday conversation, counterfactuals serve many purposes. They are sometimes offered as
explanatory laments7 such as “if he had not had that drink, he wouldn’t have had that terrible
accident” or as sources of guidance for the future such as when, after a glass breaks at the dinner
table, we admonish the miscreant that “if you had not reached across the table, then the glass
would not have broken.” For social scientists, discussions of counterfactuals are serious attempts
to understand the mainsprings of history, and each of the following (contentious) counterfactuals
which is parallel to the causal assertions listed earlier suggests an explanation about why things
turned out as they did and why they might have turned out differently:
“If the money supply had been increased more, then the economy would have grown
more.”
“If Stalin had not succeeded Lenin, then the Soviet Union would have been more
democratic.”
“If there had not been the Protestant Reformation, then capitalism would not have
developed in the West.”
“If welfare recipients were required to meet strict work requirements, then they would get
off welfare faster.”
“If Italy used plurality voting with single member districts instead of proportional
representation with multi-member districts, then there would be fewer small parties and
less government instability.”
“If the butterfly ballot had not been so confusing, Al Gore would have won the 2000
election.”
These counterfactuals, if true, provide us with a better understanding of these events and an
ability to think about how we might change outcomes in the future. But their truth depends upon
the validity of their implicit causal assertions.
Exploring Three Basic Questions about Causality
Causality is at the center of explanation and understanding, but what, exactly, is it? And how is it
related to counterfactual thinking? Somewhat confusingly, philosophers mingle psychological,
ontological, and epistemological arguments when they discuss causality. Those not alerted to the
different purposes of these arguments may find philosophical discussions perplexing as they
move from one kind of discussion to another. Our primary focus is epistemological. We want to
know when causality is truly operative, not just when some psychological process leads people to
believe that it is operative. And we do not care much about metaphysical questions regarding
what causality really is, although such ontological considerations become interesting to the extent that they might help us discover causal relationships.

7. Roese, Sanna, and Galinsky (2002) show that counterfactuals are often activated by negative affect (e.g., losses in the stock market, failure to achieve a goal) as well as being intentionally invoked to plan for the future.
Psychological and Linguistic Analysis – Although our primary focus is epistemological, our
everyday understanding, and even our philosophical understanding, of causality, is rooted in the
psychology of causal inference. Perhaps the most famous psychological analysis is David
Hume’s investigation of what people mean when they refer to causes and effects. Hume (1711-1776) was writing at a time when the pre-eminent theory of causality was the existence of a
necessary connection – a kind of “hook” or “force” – between causes and their effects so that a
particular cause must be followed by a specific effect. Hume looked for the feature of causes that
guaranteed their effects. He argued that there was no evidence for the necessity of causes because
all we could ever find in events was the contiguity, precedence, and regularity of cause and effect.
There was no evidence for any kind of hook or force. He described his investigations as follows
in his Treatise of Human Nature (1739):
What is our idea of necessity, when we say that two objects are necessarily connected
together? .... I consider in what objects necessity is commonly supposed to lie; and
finding that it is always ascribed to causes and effects, I turn my eye to two objects
supposed to be placed in that relation, and examine them in all the situations of which they
are susceptible. I immediately perceive that they are contiguous in time and place, and
that the object we call cause precedes the other we call effect. In no one instance can I go
any further, nor is it possible for me to discover any third relation betwixt these objects. I
therefore enlarge my view to comprehend several instances, where I find like objects
always existing in like relations of contiguity and succession. The reflection on several
instances only repeats the same objects; and therefore can never give rise to a new idea.
But upon further inquiry, I find that the repetition is not in every particular the same, but
produces a new impression, and by that means the idea which I at present examine. For,
after a frequent repetition, I find that upon the appearance of one of the objects the mind is
determined by custom to consider its usual attendant, and to consider it in a stronger light
upon account of its relation to the first object. It is this impression, then, or determination,
which affords me the idea of necessity.” (Hume (1738), 1978, page 155).8
Thus for Hume the idea of necessary connection is a psychological trick played by the mind that
observes repetitions of causes followed by effects and then presumes some connection that goes
beyond that regularity. For Hume, the major feature of causation, beyond temporal precedence
and contiguity, is simply the regularity of the association of causes with their effects, but there is
no evidence for any kind of hook or necessary connection between causes and effects.9

8. In the Enquiry (1748, pages 144-45), which is a later reworking of the Treatise, Hume says: “So that, upon the whole, there appears not, throughout all nature, any one instance of connexion, which is conceivable by us. All events seem entirely loose and separate. One event follows another; but we never can observe any tye between them. They seem conjoined, but never connected. And as we can have no idea of any thing, which never appeared to our outward sense or inward sentiment, the necessary conclusion seems to be, that we have no idea of connexion or power at all, and that these words are absolutely without meaning, when employed either in philosophical reasonings, or common life.... This connexion, therefore, we feel in the mind, this customary transition of the imagination from one object to its usual attendant, is the sentiment or impression, from which we form the idea of power or necessary connexion.”
The Humean analysis of causation became the predominant perspective in the nineteenth and
most of the twentieth century, and it led in two directions both of which focused upon the logical
form of causal statements. Some, such as the physicist Ernst Mach, the philosopher Bertrand
Russell, and the statistician/geneticist Karl Pearson concluded that there was nothing more to
causation than regularity so that the entire concept should be abandoned in favor of functional
laws or measures of association such as correlation which summarized the regularity.10 Others,
such as the philosophers John Stuart Mill (1888), Carl Hempel (1965), and Tom Beauchamp and
Alexander Rosenberg (1981) looked for ways to strengthen the regularity condition so as to go
beyond mere accidental regularities. For them, true cause and effect regularities must be
unconditional and follow from some lawlike statement. Their neo-Humean approach improved
upon Hume’s theory, but, as we shall see, there appears to be no way to define lawlike statements
in a way that captures all that we mean by causality.
What, then, do we typically mean by causality? In their analysis of the fundamental metaphors
used to mark the operation of causality, the linguist George Lakoff and the philosopher Mark
Johnson (1980a,b, 1999) describe prototypical causation as “the manipulation of objects by force,
the volitional use of bodily force to change something physically by direct contact in one’s
immediate environment.” (1999, page 177) Causes bring, throw, hurl, propel, lead, drag, pull,
push, drive, tear, thrust, or fling the world into new circumstances. These verbs suggest that
causation is forced movement, and for Lakoff and Johnson the “Causation Is Forced Movement
metaphor is in a crucial way constitutive of the concept of causation.” (Page 187) Causation as
forceful manipulation differs significantly from causation as the regularity of cause and effect
because forceful manipulation emphasizes intervention, agency, and the possibility that the failure
to engage in the manipulation will prevent the effect from happening. For Lakoff and Johnson,
causes are forces and capacities that entail their effects in ways that go beyond mere regularity
and that are reminiscent of the causal “hooks” rejected by Hume, although instead of hooks they
emphasize manipulation, mechanisms, forces, and capacities.11
9. There are different interpretations of what Hume meant. For a thorough discussion see Beauchamp and
Rosenberg (1981).
10. Bertrand Russell famously wrote that “the word ‘cause’ is so inextricably bound up with misleading
associations as to make its complete extrusion from the philosophical vocabulary desirable.... The law of
causality, like so much that passes muster among philosophers, is a relic of a bygone age, surviving like
the monarchy, only because it is erroneously supposed to do no harm.” (Russell, 1918). Karl Pearson
rejected causation and replaced it with correlation: “Beyond such discarded fundamentals as ‘matter’ and
‘force’ lies still another fetish amidst the inscrutable arcana of even modern science, namely the category
of cause and effect. Is this category anything but a conceptual limit to experience, and without any basis in
perception beyond a statistical approximation?” (Pearson, 1911, page vi) “It is this conception of
correlation between two occurrences embracing all relationship from absolute independence to complete
dependence, which is the wider category by which we have to replace the old idea of causation.” (Pearson,
1911, page 157).
11. As we shall show, two different theories of causation are conflated here. One theory emphasizes agency and manipulation. The other theory emphasizes mechanisms and capacities. The major difference is the locus of the underlying force that defines causal relationships. Agency and manipulation theories emphasize human intervention. Mechanism and capacity theories emphasize processes within nature itself.

“Causation as regularity” and “causation as manipulation” are quite different notions, but each carries with it some essential features of causality. And each is the basis for a different philosophical or everyday understanding of causality. From a psychological perspective, their differences emerge clearly in research done in the last fifteen years on the relationship between causal and counterfactual thinking (Spellman and Mandel, 1999). Research on this topic demonstrates that people focus on different factors when they think causally than when they think counterfactually. In experiments, people have been asked to consider causal attributions and counterfactual possibilities in car accidents in which they imagine that they chose a new route to drive home and were hit by a drunk driver. People’s causal attributions for these accidents tend to “focus on antecedents that general knowledge suggest would covary with, and therefore predict, the outcome (e.g., the drunk driver),” but counterfactual thinking focuses on controllable antecedents such as the choice of route (Spellman and Mandel, 1999, page 123). Roughly speaking, causal attributions are based upon a regularity theory of causation while counterfactual thinking is based upon a manipulation theory of causation. The regularity theory suggests that drunken drivers typically cause accidents, but the counterfactual theory suggests that in this instance the person’s choice of a new route was the cause of the accident because it was manipulable by the person.

The logic of causal thinking and the logic of counterfactual thinking are so closely related that these psychological differences in attributions lead to the suspicion that both the regularity and the manipulation theory tell us something important about causation. This psychological research also reminds us that causes are defined in relation to what the philosopher John Mackie calls a “causal field” of other factors and that what people choose to consider the cause of an event depends upon how they define the causal field.

Thus, an unfortunate person who lights a cigarette in a house which ignites a gas leak and causes an explosion will probably consider the causal field to be a situation where lighting a cigarette and no gas leak is the norm; hence the gas leak will be identified as the cause of the explosion. But an equally unfortunate person who lights a cigarette at a gas station which causes an explosion will probably consider lighting the cigarette to be the cause of the explosion and not the fact that gas fumes were present at the station.12 Similarly, a political scientist who studies great power politics may consider growing instability in the great power system to be the cause of World War I because a stable system could have weathered the assassination of Archduke Ferdinand, but an historian who studies the impact of assassination on historical events might argue that World War I was a prime example of how assassinations can cause bad consequences such as a world war. As Mackie notes, both are right, but “What is said to be caused, then, is not just an event, but an event-in-a-certain-field, and some ‘conditions’ can be set aside as not causing this-event-in-this-field simply because they are part of the chosen field, though if a different field were chosen, in other words if a different causal question were being asked, one of those conditions might well be said to cause this-event-in-that-other-field.” (Mackie, 1974, page 35)

12. Legal wrangling over liability often revolves around who should be blamed for an accident where the injured party has performed some action in a causal field. The injured party typically claims that the action should have been anticipated and its effects mitigated or prevented by the defendant in that causal field, and the defendant argues that the action should not have been taken by the plaintiff or could not have been anticipated by the defendant.
Those familiar with regression analysis in which multiple factors are said to cause an event might
translate this result into the simple adage that some researchers look at one coefficient in a
regression equation (that for the causal impact of assassinations) and other researchers look at
another coefficient (that for the causal impact of instability in the great power system), but the
lesson is larger than that. The historian interested in assassinations will collect and study cases of
failed and successful assassinations and will measure their impact in terms of changes in
governmental policies. These changes will include declarations of war, but they will include
many other things as well such as the passage of the Civil Rights Act after the assassination of
John F. Kennedy. The political scientist studying the balance among great powers will have an
entirely different set of cases and probably measure outcomes such as declarations of war,
alliances, embargoes, and other international actions. In terms of our earlier example involving
cigarettes and gas, the historian is studying the consequences of cigarette smoking and the
political scientist is interested in the consequences of the use of gas and gasoline. The lesson for
the practicing researcher is that the same events can be studied and understood from many
different perspectives and the researcher must think carefully about the causal field.
These investigations of everyday causal thinking are very suggestive, but there is ultimately no
reason why the way people ordinarily use the concept of causation should suffice for scholarly
inquiry, although we would surely be concerned if scholarly uses departed from ordinary ones
without any clear reason.
Ontological Questions – Knowing how most people think and talk about causality is useful, but
we are ultimately more interested in knowing what causality actually is and how we would
discover it in the world. These are respectively ontological and epistemological questions.13 As
we shall see, these questions are quite separate but their answers are often closely intertwined.
Ontological questions ask about the characteristics of the abstract entities that exist in the world.
Queries about the definition of events, the existence of abstract properties, the nature of causality,
and the existence of God are all ontological questions. The study of causality raises a number of
fundamental ontological questions regarding the things that are causally related and the nature of
the causal relation.14
13
Roughly speaking, philosophy is concerned with three kinds of questions regarding “what is”
(ontology), “how it can be known” (epistemology), and “what value it has” (ethics and aesthetics). In
answering these questions, twentieth century philosophy has also paid a great deal of attention to logical,
linguistic, and even psychological analysis.
14
Symbolically, we can think of the causal relation as a statement XcY where X is a cause, Y is an effect,
and c is a causal relation. X and Y are the things that are causally related and c is the causal relation. As
we shall see later, this relationship is usually considered to be incomplete (not all X and Y are causally
related), asymmetric for those events that are causally related (either XcY or YcX but not both), and
irreflexive (XcX is not possible).
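The three formal properties this footnote attributes to the causal relation can be illustrated with a small sketch. The event names and the toy relation below are hypothetical illustrations, not drawn from the text: a causal relation represented as a set of ordered (cause, effect) pairs should come out irreflexive, asymmetric, and incomplete.

```python
from itertools import permutations

# Hypothetical events and a toy causal relation c, written as ordered
# (cause, effect) pairs.  All names here are illustrative assumptions.
events = {"short_circuit", "fire", "lightning", "rain"}
relation = {("short_circuit", "fire"), ("lightning", "fire")}

# Irreflexive: no event causes itself (XcX is not possible).
irreflexive = all(x != y for x, y in relation)

# Asymmetric: if XcY holds, then YcX does not.
asymmetric = all((y, x) not in relation for x, y in relation)

# Incomplete: some pairs of distinct events are causally unrelated in
# either direction (e.g., rain and short_circuit here).
incomplete = any((x, y) not in relation and (y, x) not in relation
                 for x, y in permutations(events, 2))
```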
What are the things, the “causes” and the “effects” that are linked by causation? Whatever they
are, they must be the same things because causes can also be effects and vice-versa. But what are
they? Are they facts, properties, events, or something else?15
The practicing researcher cannot ignore questions about the definition of events. Are “arm
reaching,” “glasses breaking,” “Stalin succeeding Lenin,” “a Democratic USSR,” and “the
butterfly ballot” all events? They certainly differ in size, complexity, duration, and other features.
One of the things that researchers must consider is the proper definition of an event,16 and a great
deal of the effort in doing empirical work is defining events suitably. Not surprisingly,
tremendous effort has gone into defining wars, revolutions, firms, organizations, democracies,
religions, participatory acts, political campaigns, and many other kinds of events and structures
that matter for social science research. Much could be said about defining events, but we shall
only emphasize that defining events in a useful fashion is one of the major tasks of good social
science research.
A second basic set of ontological questions concerns the nature of the causal relationship. Is
causality different when it deals with physical phenomena (e.g., billiard balls hitting one another
or planets going around stars) than when it deals with social phenomena (democratization,
business cycles, cultural change, elections) that are socially constructed?17 What role do human
agency and mental events play in causation?18 What can we say about the time structure and
nature of causal processes?19
Once again, there are real philosophical issues here, but we shall elide most of them because it
would take us too far afield to deal with each one of them and because we are concerned with
those situations where researchers want to determine causality. Our general attitude is that social
science is about the formation of concepts and the identification of causal mechanisms. We
believe that social phenomena such as the Protestant ethic, the system of nation-states, and culture
15
Events are located in space and time (e.g., “the WWI peace settlement at Versailles”) but facts are not
(“the fact that the WWI peace settlement was at Versailles”). For discussions of causality and events see
Bennett (1988) and for causality and facts see Mellor (1995). Many philosophers prefer to speak of
“tropes” which are particularized properties (Ehring, 1997). Some philosophers reject the idea that the
world can be described in terms of distinct events or tropes and argue for events as enduring things. (Harre
and Madden, 1973, Chapter 6).
16
A potpourri of citations that deal with the definition of events and social processes are Abbott (1983,
1992, 1993), Pierson (2002), Riker (1957), Tilly (1984).
17
For representative discussions see Durkheim (19xx), Berger and Luckman (1966), von Wright (1971),
Elster (19xx, nuts and bolts), Searle (1995), Wendt (1999).
18
See Dilthey (19xx), von Wright (1971, Chapter 1), Davidson (19xx), Elster (nuts and bolts), Searle
(19xx), Wendt (1999).
19
In a vivid set of metaphors, Pierson (2002) compares different kinds of social science processes with
tornadoes, earthquakes, large meteorites, and global warming in terms of the time horizon of the cause and
the time horizon of the impact. He shows that the causal processes in each situation are quite different.
exist and have causal implications. We also believe that reasons, perceptions, beliefs, and
attitudes affect human behavior. Furthermore, we believe that these things can be observed and
measured. We are prepared to defend these assertions in the abstract, but our focus here is on the
methods that practicing researchers should use to study these things. Nevertheless, as we shall
show, getting a grip on causality requires researchers to have a detailed understanding of the
kinds of mechanisms that could link one event with another. Researchers must think about what
these processes might be and how they operate. Perhaps most importantly, researchers must think
very hard about the nature of human action.
Another basic question about the causal relation is whether it is deterministic or probabilistic.
The classic model of causation is the deterministic, clockwork Newtonian universe in which the
same initial conditions inevitably produce the same outcome. But modern science has produced
many examples where causal relationships appear to be probabilistic. The most famous is
quantum mechanics, where the position and momentum of particles are represented by probability
distributions, but many other sciences rely upon probabilistic relationships. Geneticists, for
example, do not expect that couples in which all the men have the same height and all the women
have the same height will have children of the same height. In this case, the same set of
(observed) causal factors produce a probability distribution over possible heights. We now know
that even detailed knowledge of the couple’s DNA would not lead to exact predictions.
Probabilistic causation, therefore, seems possible in the physical sciences, common in the
biological sciences, and pervasive in the social sciences. Nevertheless, following the custom of a
great deal of philosophical work, we shall start with a discussion of deterministic causation in
order not to complicate the analysis.
Epistemological Questions – Epistemology is concerned with how we can obtain intellectually
certain knowledge (what the Greeks called “episteme”) and how we can identify and learn about
causality. How do we figure out that X really caused Y? At the dinner table, our admonition not
to reach across the table might be met with “I didn’t break the glass, the table shook,” suggesting
that our causal explanation for the broken glass was wrong. How do we proceed in this situation?
We would probably try to rule out alternatives by investigating whether someone shook the table,
whether there was an earthquake, or something else happened to disturb the glass. The problem
here is that there are many possibilities that must be ruled out, and what must be ruled out
depends, to some extent, on our definition of causality.
Learning about causality, then, requires that we know what it is and that we know how to
recognize it when we see it. Simple Humean theories appear to solve both problems at once.
Two events are causally related when they are contiguous, one precedes another, and they occur
regularly in constant conjunction with one another. Once we have checked these conditions, we
know that we have a causal connection. But upon examination, these conditions are not enough
for causality because we would not say that night causes day, even though day and night are
contiguous, night precedes day, and day and night are regularly associated. Furthermore, simple
regularities like this do not make it easy to distinguish cause from effect – after all, day precedes
night just as night precedes day, so we could just as well, and just as mistakenly, say that
day causes night. Something more is needed.20 It is this something more that causes most of the
20
Something different might also be needed. Hume himself dropped the requirement for contiguity in his
1748 rewrite of his 1739 work, and many philosophers would also drop his requirement for temporal
problems for understanding causation. John Stuart Mill suggested that there had to be an
“unconditional” relationship between cause and effect and modern neo-Humeans have required a
“lawlike” relationship, but even if we know what this means21 (which would solve the ontological
problem of causation) it is hard to ensure that it is true in particular instances so as to solve the
epistemological problem.
In fact, it is possible, just as we might know what the perfect surfing wave should be without
knowing how or where to find it, that we can know what causality is without knowing how or
where to find it. We might have solved the ontological problem without solving the
epistemological problem. Or, just as we might be able to bake an excellent souffle without being
able to describe how it rises, we might be able to determine causality without really knowing what
it is. We might just have a recipe for finding it. In this case, we would have solved the
epistemological problem without solving the ontological problem. To make things even more
complicated, some people might argue that the solution to the epistemological problem is the
solution to the ontological one – a souffle is simply the recipe for it. Or alternatively, they might
argue that the solution to the ontological problem indicates that there can be no solution to the
epistemological one – we can know what causality is, but we can never establish it.
In the following sections, we begin with a review of four theories of what causality might be. We
spend most of our time on a counterfactual definition, mostly amounting to a recipe, that is now
widely used in statistics. We end with a discussion of the limitations of the recipe and how far it
goes towards solving the epistemological and ontological problems.
Humean and Neo-Humean Theories of Causation
Lawlike Generalities and the Humean Regularity Theory of Causation – Humean and neo-Humean theories propose logical conditions that must hold for the constant conjunction of events
to justify the inference that they have a cause-effect relationship. Specifically, Humeans have
explored whether a cause must be sufficient for its effects, necessary for its effects, or something
more complicated.
The classic definition shared by Hume, John Stuart Mill, and many others was that “X is a cause
of Y if and only if X is sufficient for Y.” That is, the cause must always and invariably lead to the
effect. Certainly an X that is sufficient for Y can be considered a cause, but what about the many
putative causes that are not sufficient for their effect? Striking a match, for example, may be
necessary for it to light, but it may not light unless there is enough oxygen in the atmosphere. Is
striking a match never a cause of a match lighting? This leads to an alternative definition in
which “X is a cause of Y if and only if X is necessary for Y.” Under this definition, it is assumed
that the cause (such as striking the match) must be present for the effect to occur, but it may not
always be enough for the effect to actually occur (because there might not be enough oxygen).
precedence.
21
Those new to this literature are presented with many statements about the need for lawfulness and
unconditionality which seem to promise a recipe that will ensure lawfulness. But the conditions that are
presented always seem to fall short of the goal.
But how many causes are even necessary for their effects? If the match does not light after
striking it, someone might use a blowtorch to light it so that striking the match is not even
necessary for the match to ignite. Do we therefore assume that striking the match is never a cause
of its lighting? Necessity and sufficiency seem unequal to the task of defining causation.22
These considerations led John Mackie to propose a set of conditions requiring that a cause be an
insufficient [I] but necessary [N] part of a condition which is itself unnecessary [U] but
exclusively sufficient [S] for the effect. These INUS conditions can be explained by an example.
Consider two ways that the effect (E), which is a building burning down, might occur. (See
Figure 1.) In one scenario the wiring might short-circuit and overheat, thus causing the wooden
framing to burn. In another, a gasoline can might be next to a furnace that ignites and causes the
gasoline can to explode. A number of factors here are INUS conditions for the building to burn
down. The short circuit (C) and the wooden framing (W) together might cause the building to
burn down, or the gasoline can (G) and the furnace (F) might cause the building to burn down.
Thus, C and W together are exclusively sufficient [S] to burn the building down, and G and F
together are exclusively sufficient [S] to burn the building down. Furthermore, the short circuit
and wooden framing (C&W) are unnecessary [U], and the gasoline can and the furnace (G&F) are
unnecessary [U] because the building could have burned down with just one or the other
combination of factors. Finally, C, W, G, or F alone is insufficient [I] to burn the building down
even though C is necessary [N] in conjunction with W (or vice-versa) and G is necessary [N] in
conjunction with F (or vice-versa). This formulation allows for the fact that no single cause is
sufficient or necessary, but when experts say that a short-circuit caused the fire they “... are
saying, in effect that the short-circuit (C) is a condition of this sort, that it occurred, that the other
conditions (W) which, conjoined with it, form a sufficient condition were also present, and that no
other sufficient condition (such as G&F) of the house’s catching fire was present on this
occasion.” (Mackie, 1965, page 245, letters added).
Figure 1
C & W ------->
                E: Burning Building
G & F ------->
From the perspective of a practicing researcher, three lessons follow from the INUS conditions.
First, a putative cause such as C might not cause the effect E because G&F might be responsible.
Hence, the burned down building (E) will not always result from a short circuit (C) even though C
could cause the building to burn down. Second, interactions among causes may be necessary for
any one cause to be sufficient (C and W require each other and G and F require each other).
Third, the relationship between any INUS cause and its effect might appear to be probabilistic
because of the other INUS causes. In summary, the INUS conditions suggest the multiplicity of
22
And there are problems such as the following favorite of the philosophers: “If two bullets pierce a man’s
heart simultaneously, it is reasonable to suppose that each is an essential part of a distinct sufficient
condition of the death, and that neither bullet is ceteris paribus necessary for the death, since in each case
the other bullet is sufficient.” (Sosa and Tooley, pages 8-9).
causal pathways and causes, the possibility of conjunctural causation (Ragin, 1987), and the
likelihood that social science relationships will appear probabilistic even if they are
deterministic.23
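The INUS structure of the burning-building example can be checked mechanically. The sketch below (my encoding, not from the text) writes the effect as E = (C and W) or (G and F) and verifies that C&W is sufficient but unnecessary for E, while C alone is insufficient but necessary for the C&W conjunct to do its work:

```python
from itertools import product

def effect(C, W, G, F):
    # Deterministic rule from the example: a fire occurs with a short
    # circuit plus wooden framing, or a gasoline can plus a furnace.
    return (C and W) or (G and F)

cases = list(product((False, True), repeat=4))

# C & W is sufficient [S] for E ...
sufficient = all(effect(C, W, G, F) for C, W, G, F in cases if C and W)
# ... but unnecessary [U]: E can occur without C & W (via G & F).
unnecessary = any(effect(C, W, G, F) for C, W, G, F in cases if not (C and W))
# C alone is insufficient [I] for E ...
insufficient = not all(effect(C, W, G, F) for C, W, G, F in cases if C)
# ... but necessary [N] within its conjunct: W without C does not guarantee E.
necessary = not all(effect(C, W, G, F) for C, W, G, F in cases if W and not C)
```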
[This section until the next asterisks might be put in an appendix.]**********************
A specific example might help to make these points clearer. Assume that the four INUS factors
mentioned above, C, W, G, and F, occur independently of one another and that they are the only
factors which cause fires in buildings. Further assume that short circuits (C) occur 10% of the
time, wooden (W) frame buildings 50% of the time, furnaces (F) 90% of the time, and gasoline
(G) cans near furnaces 10% of the time. Because these events are assumed independent of one
another, it is easy to calculate that C and W occur 5% of the time and that G and F occur 9% of
the time. (We simply multiply the probability of the two independent events.) All four
conditions occur 0.45% of the time. (The product of all four percentages.) Thus, fires occur
13.55% of the time. This percentage includes the cases where the fire is the result of C and W
(5% of the time) and where it is the result of G and F (9% of the time), and it adjusts downward
for double-counting that occurs in the cases where all four INUS conditions occur together
(0.45% of the time).
Now suppose an experimenter did not know about the role of wooden frame buildings or gasoline
cans and furnaces and only looked at the relationship between fires and short-circuits. A
cross-tabulation of fires with the short-circuit factor would yield Table 2. As assumed above, short
circuits occur 10% of the time (see the third column total at the bottom of the table) and as
calculated above, fires occur 13.55% of the time (see the third row total on the far right). The
entries in the interior of the table are calculated in a similar way.24
Table 2 – Fires by Short Circuits in Hypothetical Example
(Total Percentages of each Event)

                        Not C – No short circuits    C – Short Circuits    Row Totals
Not E – No fires                 81.90%                     4.55%            86.45%
E – Fires                         8.10%                     5.45%            13.55%
Column Totals                    90.00%                    10.00%           100.00%
Even though each case occurs because of a deterministic process – either a short-circuit and a
wooden frame building or a gasoline can and a furnace (or both) – this cross-tabulation suggests a
probabilistic relationship between fires and short-circuits. In 4.55% of the cases, short circuits
occur but no fires result because the building was not wooden. In 8.10% of the cases, there are no
23
These points are made especially forcefully in Marini and Singer (1988).
24
Thus, the entry for short circuits and fires comes from the cases where there are short-circuits and
wooden frame buildings (5% of the time) and where there are short-circuits and no wooden frame
buildings but there are gasoline cans and furnaces (5% times 9%).
short circuits, but a fire occurs because the gasoline can has been placed near the furnace. For this
table, a standard measure of association, the Pearson correlation between the effect and the cause,
is about .40, which is far short of the 1.0 required for a perfect (positive) relationship. If, however,
the correct model is considered in which there are the required interaction effects, the relationship
will produce a perfect fit.25 Thus, a misspecification of a deterministic relationship can easily lead
a researcher to think that there is a probabilistic relationship between the cause and effect.
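The arithmetic in this appendix can be reproduced mechanically. The sketch below (variable names are mine, not the text's) enumerates the sixteen configurations of the four independent factors, accumulates the Table 2 cell percentages, and computes the Pearson (phi) correlation of about .40:

```python
from itertools import product

# Factor probabilities assumed in the text: short circuits (C), wooden
# framing (W), gasoline cans near furnaces (G), and furnaces (F).
p = {"C": 0.10, "W": 0.50, "G": 0.10, "F": 0.90}

def prob(assignment):
    """Joint probability of one configuration of the four independent factors."""
    out = 1.0
    for k, v in assignment.items():
        out *= p[k] if v else 1 - p[k]
    return out

# Enumerate all 16 configurations and accumulate the 2x2 table of E by C.
table = {(e, c): 0.0 for e in (0, 1) for c in (0, 1)}
for C, W, G, F in product((0, 1), repeat=4):
    E = (C and W) or (G and F)       # deterministic causal rule
    table[(E, C)] += prob({"C": C, "W": W, "G": G, "F": F})

pE = table[(1, 0)] + table[(1, 1)]   # fires occur 13.55% of the time
pC = p["C"]                           # short circuits occur 10% of the time

# Pearson (phi) correlation between the binary effect E and cause C.
cov = table[(1, 1)] - pE * pC
phi = cov / ((pE * (1 - pE) * pC * (1 - pC)) ** 0.5)
```

Scoring the variables zero or one and regressing E on the interactions C·W, G·F, and C·W·G·F, as footnote 25 describes, reproduces E exactly, since for 0/1 variables E = CW + GF − CW·GF by inclusion-exclusion.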
*****************************************************************************
INUS conditions reveal a lot about the complexities of causality, but as a definition of it, they turn
out to be too weak – they do not rule out situations where there are common causes, and they do
not exclude accidental regularities. The problem of common cause arises in a situation where, for
example, lightning strikes (L) the wooden framing (W) and causes it to burn (E) while also
causing a short in the circuitry (C). That is, L –> E and L –> C (where the arrow indicates
causation). If lightning always causes a short in the circuitry, but the short never has anything to
do with a fire in these situations because the lightning starts the fire directly through its heating of
the wood, we will nevertheless always find that C and E are constantly conjoined through the
action of the lightning, suggesting that the short circuit caused the fire even though the truth is
that lightning is the common cause of both.26 In some cases of common causes such as the rise in
barometric pressure followed by the arrival of a storm, common sense tells us that the putative
cause (the rise in barometric pressure) cannot be the real cause of the thunderstorm. But in the
situation with the lightning, the fact that short circuits have the capacity to cause fires makes it
less likely that we will realize that lightning is the common cause of both the short-circuits and
the fires. We might be better off in the case where the lightning split some of the wood framing
of the house instead of causing a short-circuit. In that case, we would probably reject the fantastic
theory that split wood caused the fire because split wood does not have the capacity to start a fire,
but the Humean theory would be equally confused by both situations because it could not appeal,
within the ambit of its understanding, to causal capacities. For a Humean, the constant
conjunction of split wood and fires suggests causation as much as the constant conjunction of
short-circuits and fires. Indeed, the constant conjunction of storks and babies would be treated as
probative of a causal connection.
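The common-cause trap can be made concrete with a minimal sketch (the event names are hypothetical): lightning L always produces a short circuit and always ignites the wood directly, so C and E are constantly conjoined even though no C-to-E link exists.

```python
# The true structure is L -> C and L -> E, with no C -> E link at all.
def short_circuit(L):
    return L          # lightning always shorts the circuitry

def fire(L):
    return L          # the fire comes from lightning directly, never from C

# Over every possible state of the world, C and E are constantly conjoined,
# so a pure regularity account would (wrongly) certify C as a cause of E.
conjoined = all(short_circuit(L) == fire(L) for L in (False, True))
```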
Attempts to fix up these conditions usually focus on trying to require “lawlike” statements that
are unconditionally true, not just accidentally true. Since it is not unconditionally true that
splitting wood causes fires, the presumption is that some such conditions can be found to rule out
this explanation. Unfortunately, no set of conditions seems to be successful.27 Although the
25
If each variable is scored zero or one depending upon whether the effect or cause is present or absent,
then a regression equation of the effect on the product (or interaction) of C and W, the product of G and F,
and the product of C, W, G, and F will produce a multiple correlation of one indicating a perfect fit.
26
It is also possible that the lightning’s heating of the wood is (always or sometimes) insufficient to cause
the fire (not L–>E), but its creation of a short-circuit (L–>C) is (always or sometimes) sufficient for the
fire (C–>E). In this case, the lightning is the indirect cause of the fire through its creation of the short
circuit. That is, L –> C –> E.
27
For some representative discussions of the problems see (Harre and Madden, 1975, Chapter 2; Salmon,
1990, Chapters 1-2; Hausman, 1998, Chapter 3). Salmon (1990, page 15) notes that “Lawfulness, modal
import [what is necessary, possible, or impossible], and support of counterfactuals seems to have a
regularity theory identifies a necessary condition for describing causation, it basically fails
because association is not causation and there is no reason why purely logical restrictions on
lawlike statements should be sufficient to characterize causal relationships. Part of the problem is
that there are many different types of causal laws and they do not fit any particular patterns. For
example, one restriction that has been proposed to ensure lawfulness is that lawlike statements
should either not refer to particular situations or they should be derivable from laws that do not
refer to particular situations. This would mean that Kepler’s first “law” about all planets moving
in elliptical orbits around the sun (a highly specific situation!) was not a causal law before
Newton’s laws were discovered, but it was a causal law after it was shown that it could be derived
from Newton’s laws. But Kepler’s laws were always considered causal laws, and there seems to
be no reason to rest their lawfulness on Newton’s laws. Furthermore, by this standard, almost all
social science and natural science laws (e.g., plate tectonics) are about particular situations. In
short, logical restrictions on the form of laws do not seem sufficient to characterize causality.
The Asymmetry of Causation – The regularity theory also fails because it does not provide an
explanation for the asymmetry of causation. Causes should cause their effects, but INUS
conditions are almost always symmetrical such that if C is an INUS cause of E, then E is also an
INUS cause of C. It is almost always possible to turn around an INUS condition so that an effect
is an INUS for its cause.28 One of the most famous examples of this problem involves a flagpole,
the elevation of the sun, and the flagpole’s shadow. The law that light travels in straight lines
implies that there is a relationship between the height of the flagpole, the length of its shadow,
and the angle of elevation of the sun. When the sun rises, the shadow is long, at midday it is
short, and at sunset it is long again. Intuition about causality suggests that the length of the
shadow is caused by the height of the flagpole and the elevation of the sun. But, using INUS
conditions, we can just as well say that the elevation of the sun is caused by the height of the
flagpole and the length of the shadow. There is simply nothing in the conditions that precludes
this fantastic possibility.
The only feature of the Humean theory that provides for asymmetry is temporal precedence. If
changes in the elevation of the sun precede corresponding changes in the length of the shadow,
then we can say that the elevation of the sun causes the length of the shadow. And if changes in
the height of the flagpole precede corresponding changes in the length of the shadow, we can say
that the height of the flagpole causes the length of the shadow. But many philosophers reject
making temporal precedence the determinant of causal asymmetry because it precludes the
possibility of explaining the direction of time by causal asymmetry and it precludes the possibility
of backwards causation. From a practical perspective, it also requires careful measures of timing
that may be difficult in a particular situation.
Summary – This discussion reveals two basic aspects of the causal relation. One is a symmetrical
form of association between cause and effect and the other is an asymmetrical relation in which
causes produce effects but not the reverse. The Humean regularity theory, in the form of INUS
common extension; statements either possess all three or lack all three. But it is extraordinarily difficult to
find criteria to separate those statements that do from those that do not.”
28
*** Insert a footnote giving the source on the reversibility of INUS conditions.
conditions, provides a necessary condition for the existence of the symmetrical relationship,29 but
it does not rule out situations such as common cause and accidental regularities where there is no
causal relationship at all. From a methodological standpoint, it can easily lead researchers to
presume that all they need to do is to find associations, and it also leads to an underemphasis on
the remaining requirement of a “lawlike” or “unconditional” relationship because it does not
operationally define what that would really mean. A great deal of what passes for causal
modeling suffers from these defects (Freedman, 1987, 1991, 1997, 1999).
The Humean theory does even less well with the asymmetrical feature of the causal relationship
because it provides no way to determine asymmetry except temporal precedence. Yet there are
many other aspects of the causal relation that seem more fundamental than temporal precedence.
Causes not only typically precede their effects, but they also can be used to explain effects or to
manipulate effects while effects cannot be used to explain causes or to manipulate them.30
Effects also depend upon causes, but causes do not depend upon effects. Thus, if a cause does not
occur, then the effect will not occur because effects depend on their causes. The counterfactual,
“if the cause did not occur, then the effect would not occur” is true. However, if the effect does
not occur, then the cause might still occur because causes can happen without leading to a specific
effect if other features of the situation are not propitious for the effect. The counterfactual, “if the
effect did not occur, then the cause would not occur” is not necessarily true. For example, where
a short-circuit causes a wooden frame building to burn down, if the short-circuit does not occur,
then the building will not burn down. But if the building does not burn down, it is still possible
that the short-circuit occurred but its capacity for causing fires was neutralized because the
building was made of brick. This dependence of effects on causes suggests that an alternative
definition of causation might be based upon a proper understanding of counterfactuals.
Counterfactual Definition of Causation
In a book On the Theory and Method of History published in 1902, Eduard Meyer claimed that it
was an “unanswerable and so an idle question” whether the course of history would have been
different if Bismarck, then Chancellor of Prussia, had not decided to go to war in 1866. By some
accounts, the Austro-Prussian-Italian War of 1866 paved the way for German and Italian
unification (see, Wawro, 1997). In reviewing Meyer’s book in 1906, Max Weber agreed that
“from the strict ‘determinist’ point of view” finding out what would have happened if Bismarck
had not gone to war “was ‘impossible’ given the ‘determinants’ which were in fact present.” But
he went on to say that “And yet, for all that, it is far from being ‘idle’ to raise the question what
might have happened, if, for example, Bismarck had not decided for war. For it is precisely this
question which touches on the decisive element in the historical construction of reality: the causal
29
Probabilistic causes do not necessarily satisfy INUS conditions because an INUS factor might only
sometimes produce an effect. Thus, the short-circuit and the wooden frame of the house might only
sometimes lead to a conflagration in which the house is burned down. Introducing probabilistic causes
would add still another layer of complexity to our discussion which would only provide more reasons to
doubt the Humean regularity theory.
30
Hausman (1998, page 1) also catalogs other aspects of the asymmetry between causes and effects.
significance which is properly attributed to this individual decision within the totality of infinitely
numerous ‘factors’ (all of which must be just as they are and not otherwise) if precisely this
consequence is to result, and the appropriate position which the decision is to occupy in the
historical account.” (Weber, 1978, 111). Weber’s review is an early discussion of the importance
of counterfactuals for understanding history and making causal inferences. He argues forcefully
that if “history is to raise itself above the level of a mere chronicle of noteworthy events and
personalities, it can only do so by posing just such questions” as the counterfactual in which
Bismarck did not decide for war.31
Lewis’s Counterfactual Theory of Causation – The philosopher David Lewis (1973b) has
proposed the most elaborately worked out theory of how causality is related to counterfactuals.32
His theory requires the truth of two statements regarding two distinct events X and Y. Lewis starts
from the presumption that X and Y have occurred so that the “counterfactual” statement:33 “If X
were to occur, then Y would occur” is true. The truth of this statement is Lewis’s first condition
for a causal relationship. Then he considers the truth of a second counterfactual:34 “If X were not
to occur, then Y would not occur either.” If this is true as well, then he says that X causes Y. If,
for example, Bismarck decided for war in 1866 and, as some historians argue, German unification
followed because of his decision, then we must ask “If Bismarck had not decided for war, would
Germany have remained divided?” The heart of Lewis’s theory is the set of requirements,
described below, that he lays down for the truth of this kind of counterfactual.
Lewis’s theory has a number of virtues. It deals directly with singular causal events, and it does
not require the examination of a large number of instances of X and Y. At one point in the
philosophical debate about causation, it was believed that the individual cases such as “the
hammer blow caused the glass to break” or “the assassination of Archduke Ferdinand caused
World War I” could not be analyzed alone because these cases had to be subsumed under a
general law (“hammer blows cause glass to break”) derived from multiple cases plus some
particular facts of the situation in order to meet the requirement for a “lawlike” relationship.

31. I am indebted to Richard Swedberg for pointing me towards Weber’s extraordinary discussion.

32. Lewis finds some support for his theory in the work of David Hume. In a famous change of course in a
short passage in his Enquiry Concerning Human Understanding (1748), Hume first summarized his
regularity theory of causation by saying that “we may define a cause to be an object, followed by another,
and where all the objects similar to the first, are followed by objects similar to the second,” and then he
changed to a completely different theory of causation by adding “Or in other words, where if the first
object had not been, the second had never existed.” (Enquiry, page 146) As many commentators have
noted, these were indeed other words, implying an entirely different theory of causation. The first theory
equates causality with the constant conjunction of putative causes and effects across similar circumstances.
The second, which is a counterfactual theory, relies upon what would happen in a world where the cause
did not occur.

33. Lewis considers statements like this as part of his theory of counterfactuals by simply assuming that
statements in the subjunctive mood with true premises and true conclusions are true. As noted earlier,
most theories of counterfactuals have been extended to include statements with true premises by assuming,
quite reasonably, that they are true if their conclusion is true and false otherwise.

34. This is a simplified version of Lewis’s theory based upon Lewis (1973a,b; 1986) and Hausman (1998,
Chapter 6).

The
counterfactual theory, however, starts with singular events and proposes that causation can be
established without an appeal to a set of similar events and general laws regarding them.35 The
possibility of analyzing singular causal events is important for all researchers, but especially for
those doing case studies who want to be able to say something about the consequences of Stalin
succeeding Lenin as head of the Soviet Union or the impact of the butterfly ballot on the 2000
election.
The counterfactual theory also deals directly with the issue of X’s causal “efficacy” with respect
to Y by considering what would happen if X did not occur. The problem with the theory is the
difficulty of determining the truth or falsity of the counterfactual “If X were not to occur, then Y
would not occur either.” The statement cannot be evaluated in the real world because X actually
occurs so that the premise is false, and there is no evidence about what would happen if X did not
occur. It only makes sense to evaluate the counterfactual in a world in which the premise is true.
Lewis’s approach to this problem is to consider whether the statement is true in the closest
possible world to the actual world where X does not occur. Thus, if X is a hammer blow and Y is
a glass breaking, then the closest possible world is one in which everything else is the same
except that the hammer blow does not occur. If in this world, the glass does not break, then the
counterfactual is true, and the hammer blow (X) causes the glass to break (Y). The obvious
problem with this approach is identifying the closest possible world. If X is the assassination of
Archduke Ferdinand and Y is World War I, is it true that World War I would not have occurred in
the closest possible world where the bullet shot by the terrorist Gavrilo Princip did not hit the
Archduke? Or would some other incident have inevitably precipitated World War I? And, to add
to the difficulty, would this “World War I” be the same as the one that happened in our world?
Lewis’s theory substitutes the riddle of determining the similarity of possible worlds for the
neo-Humean theory’s problem of determining lawlike relationships. To solve these problems, both
approaches must be able to identify similar causes and similar effects. The Humean theory must
identify them across various situations in the real world. This aspect of the Humean approach is
closely related to John Stuart Mill’s “Method of Concomitant Variation” which he described as
follows: “Whatever phenomenon varies in any manner, whenever another phenomenon varies in
some similar manner, is either a cause or an effect of that phenomenon, or is connected to it
through some fact of causation.” (Mill, 1888, page xxx)36 Lewis’s theory must also identify
similar causes and similar effects in the real world in which the cause does occur and in the many
possible worlds in which the cause does not occur. This approach is closely related to Mill’s
“Method of Difference” in which: “If an instance in which the phenomenon under investigation
occurs, and an instance in which it does not occur, have every circumstance in common save one,
that one occurring only in the former; the circumstance in which alone the two instances differ, is
the effect, or the cause, or an indispensable part of the cause, of the phenomenon.” (Mill, 1888,
page 280).37

35. In fact, many authors now believe that general causation (involving lawlike generalizations) can only be
understood in terms of singular causation. “...general causation is a generalisation of singular causation.
Smoking causes cancer iff (if and only if) smokers’ cancers are generally caused by their smoking.”
(Mellors, 1995, pages 6-7). See also Sosa and Tooley, 1993. More generally, whereas explanation was
once thought virtually to supersede the need for causal statements, many philosophers now believe that a
correct analysis of causality will provide a basis for suitable explanations (see Salmon, 1990).

36. The Humean theory also has affinities with Mill’s Method of Agreement, which he described as follows:
“If two or more instances of the phenomenon under investigation have only one circumstance in common,
the circumstance in which alone all the instances agree, is the cause (or effect) of the given phenomenon.”
(Mill, 1888, page 280)
In addition to identifying similar causes and similar effects, the Humean theory must determine if
the conjunction of these similar causes and effects is accidental or lawlike. This task requires
understanding what is happening in each situation and comparing the similarities and differences
across situations. Lewis’s theory must identify the possible world where the cause does not occur
that is most similar to the real world. This undertaking requires understanding the facts of the real
world and the laws that are operating in it. Consequently, assessing the similarity of a possible
world to our own world requires understanding the lawlike regularities that govern our world.38 It
seems as if Lewis has simply substituted one difficult task, that of identifying the most similar
world, for another, that of establishing lawfulness.
The Virtues of the Counterfactual Definition of Causation – Lewis has substituted one difficult
problem for another, but the reformulation of the problem has a number of benefits. The
counterfactual approach provides new insights into what is required to establish a causal connection
between causes and effects. The counterfactual theory makes it clear that establishing causation
does not require observing the universal conjunction of a cause and an effect.39 One observation
of a cause followed by an effect is sufficient for establishing causation if it can be shown that in a
most similar world without the cause, the effect does not occur. The counterfactual theory
proposes that causation can be demonstrated by simply finding a most similar world in which the
absence of the cause leads to the absence of the effect. Consequently, comparisons, specifically
the kind of comparison advocated by John Stuart Mill in his “Method of Difference,” have a
central role in the counterfactual theory as they do in the analysis of case studies.
Lewis’s theory provides us with a way to think about the causal impact of singular events such as
the badly designed butterfly ballot in Palm Beach County, Florida that led some voters in the
2000 Presidential election to complain that they mistakenly voted for Reform Party candidate
Patrick Buchanan when they meant to vote for Democrat Al Gore. The ballot can be said to be
causally associated with these mistakes if in the closest possible world in which the butterfly
ballot was not used the vote for Buchanan was lower than in the real world. Ideally this closest
possible world would be a parallel universe in which the same people received a different ballot,
but this, of course, is impossible. The next best thing is a situation where similar people
employed a different ballot. In fact, the butterfly ballot was only used for election day voters in
Palm Beach County. It was not used by absentee voters. Consequently, the results for the
absentee voting can be considered a surrogate for the closest possible world in which the butterfly
ballot was not used, and in this absentee voting world, voting for Buchanan was dramatically
lower, suggesting that at least 2000 people who preferred Gore – more than enough to give the
election to Gore – mistakenly voted for Buchanan on the butterfly ballot.

37. Mill goes on to note that the Method of Difference is “a method of artificial experiment.” (Page 281).
Notice that for both the Method of Concomitant Variation and the Method of Difference, Mill emphasizes
the association between cause and effect and not the identification of which event is the cause and which is
the effect. Mill’s methods are designed to detect the symmetric aspect of causality but not its asymmetric
aspect.

38. Nelson Goodman makes this point in a 1947 article on counterfactuals, and James Fearon (1991), in a
masterful exposition of the counterfactual approach to research, discusses its implications for
counterfactual thought experiments in political science. Also see Tetlock and Belkin (1996).

39. G. H. von Wright notes that the counterfactual conception of causality shows that the hallmark of a
lawlike connection is “necessity and not universality.” (von Wright, 1971, page 22)
The difficult question, of course, is whether the absentee voting world can be considered a good
enough surrogate for the closest possible world in which the butterfly ballot was not used.40 The
counterfactual theory does not provide us with a clear sense of how to make that judgment.41 But
the framework does suggest that we should consider the similarity of the election-day world and
the absentee voter world. To do this, we can ask whether election day voters are different in some
significant ways from absentee voters, and this question can be answered by considering
information on their characteristics and experiences. In summary, the counterfactual perspective
allows for analyzing causation in singular instances, and it emphasizes comparison, which seems
difficult but possible, rather than the recondite and apparently fruitless investigation of the
lawfulness of statements such as “All ballots that place candidate names and punch-holes in
confusing arrangements will lead to mistakes in casting votes.”
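The similarity check just described can be sketched as a simple comparison of group characteristics. This is only an illustrative sketch: the characteristics, the numbers, and the ten percent tolerance below are all hypothetical, not the actual Palm Beach County data.

```python
# A hedged sketch of the similarity check described above: compare the
# characteristics of two groups of voters to judge whether one can serve
# as a surrogate "closest possible world" for the other. All figures are
# hypothetical, not the actual Palm Beach County data.
election_day = {"mean_age": 52.0, "pct_democrat": 0.45, "pct_female": 0.53}
absentee = {"mean_age": 54.0, "pct_democrat": 0.44, "pct_female": 0.52}

def similar(group_a, group_b, tolerance=0.10):
    """Crude check: every characteristic differs by less than `tolerance`
    as a fraction of the first group's value."""
    return all(
        abs(group_a[k] - group_b[k]) / abs(group_a[k]) < tolerance
        for k in group_a
    )

print(similar(election_day, absentee))  # True under these made-up numbers
```

In practice one would compare many more characteristics and use formal balance tests rather than a fixed tolerance, but the underlying question is the same: are the two groups alike in every respect except the ballot they faced?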
Controlled Experiments and Closest Possible Worlds – The difficulties with the counterfactual
definition are identifying the characteristics of the closest possible world in which the putative
cause does not occur and finding an empirical surrogate for this world. For the butterfly ballot,
sheer luck led a team of researchers to discover that the absentee ballot did not have the
problematic features of the butterfly ballot.42 But how can we find surrogates in other
circumstances?
One answer is controlled experiments. Experimenters can create mini-closest-possible worlds by
finding two or more situations and assigning putative causes (called “treatments”) to some
situations but not to others (which get the “control”). If in those cases where the cause C occurs,
the effect E occurs, then the first requirement of the counterfactual definition is met: when C
occurs, then E occurs. Now, if the situations which receive the control are not different in any
significant ways from those that get the treatment, then they can be considered surrogates for the
closest possible world in which the cause does not occur. If in these situations where the cause C
does not occur, the effect E does not occur either, then the second requirement of the
counterfactual definition is confirmed: in the closest possible world where C does not occur, then
E does not occur. The crucial part of this argument is that the control situation, in which the
cause does not occur, must be a good surrogate for the closest possible world to the treatment.

40. For an argument that the absentee votes are an excellent surrogate, see Wand et al., “The Butterfly Did
It,” American Political Science Review, December, 2001.

41. In his book on counterfactuals, Lewis only claims that similarity judgments are possible, but he does not
provide any guidance on how to make them. He admits that his notion is vague, but he claims it is not
ill-understood. “But comparative similarity is not ill-understood. It is vague–very vague–in a well-understood
way. Therefore it is just the sort of primitive that we must use to give a correct analysis of
something that is itself undeniably vague.” (Lewis, 1973a, page 91). In later work Lewis (1979, 1986)
formulates some rules for similarity judgments, but they do not seem very useful to us and to others
(Bennett, 1984).

42. For the story of how the differences between the election day and absentee ballot were discovered, see
Brady et al., 2001a.
Two experimental methods have been devised for ensuring closeness between the treatment and
control situations. One is classical experimentation in which as many circumstances as possible
are physically controlled so that the only significant difference between the treatment and the
control is the cause. In a chemical experiment, for example, one beaker holds two chemicals and
a substance that might be a catalyst and another beaker of the same type, in the same location, at
the same temperature, and so forth contains just the two chemicals in the same proportions
without the suspected catalyst. If the reaction occurs only in the first beaker, it is attributed to the
catalyst. The second method is random assignment of treatments to situations so that there are no
reasons to suspect that the entities that get the treatment are any different, on average, from those
that do not. We discuss this approach in detail below.
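The logic of random assignment can be sketched in a short simulation. Everything here is hypothetical: the outcome model, the true effect of 2.0, and the sample size are illustrative assumptions, not results from any study discussed in the text.

```python
import random

random.seed(0)

# Illustrative simulation (all numbers hypothetical): each unit has a
# background trait that raises the outcome whether or not it is treated,
# and the treatment C adds a true effect of 2.0.
TRUE_EFFECT = 2.0
units = [{"trait": random.gauss(0, 1)} for _ in range(10_000)]

# Random assignment: treated units and controls differ, on average,
# only in receiving C, so the control group serves as a surrogate
# for the closest possible world in which C does not occur.
for u in units:
    u["treated"] = random.random() < 0.5
    u["outcome"] = u["trait"] + (TRUE_EFFECT if u["treated"] else 0.0)

treated = [u["outcome"] for u in units if u["treated"]]
control = [u["outcome"] for u in units if not u["treated"]]
estimate = sum(treated) / len(treated) - sum(control) / len(control)

print(round(estimate, 1))  # expected to be close to 2.0, the true effect
```

Because the treated and control units differ only randomly, the control-group mean stands in for the counterfactual outcome the treated units would have had without C.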
Problems with the Counterfactual Definition43 – Although the counterfactual definition of
causation leads to substantial insights about causation, it also leads to two significant problems.
Using the counterfactual definition as it has been described so far, the direction of causation
cannot be established, and two effects of a common cause can be mistaken for cause and effect.
Consider, for example, an experiment as described above. In that case, in the treatment group,
when C occurs, E occurs, and when E occurs, C occurs. Similarly, in the control group, when C
does not occur, then E does not occur, and when E does not occur, then C does not occur. In fact,
there is perfect observational symmetry between cause and effect which means that the
counterfactual definition of causation as described so far implies that C causes E and that E
causes C. The same problem arises with two effects of a common cause because of the perfect
symmetry in the situation. Consider, for example, a rise in the mercury in a barometer and
thunderstorms. Each is an effect of high pressure systems, but the counterfactual definition would
consider them to be causes of one another.44
These problems bedevil Humean and counterfactual theories. If we accept these theories in their
simplest forms, we must live with a seriously incomplete theory of causation that cannot
distinguish causes from effects and that cannot distinguish two effects of a common cause from
real cause and effect. That is, although the counterfactual theory can tell whether two factors A
and B are causally connected45 in some way, it cannot tell whether A causes B, B causes A, or A
and B are the effects of a common cause (sometimes called spurious correlation). The reason for
this is that the truth of the two counterfactual conditions described so far amounts to a particular
pattern of the crosstabulation of the two factors A and B. In the simplest case where the columns
are the absence or presence of the first factor (A) and the rows are the absence or the presence of
the second factor (B), then the same diagonal pattern is observed for situations where A causes B
or B causes A, or for A and B being the effects of a common cause. In all three cases, we either
observe the presence of both factors or their absence. It is impossible from this kind of
symmetrical information, which amounts to correlational data, to detect causal asymmetry or
spurious correlation. The counterfactual theory as elucidated so far, like the Humean regularity
theory, only describes a necessary condition, the existence of a causal connection between A and
B, for us to say that A causes B.

43. This section relies heavily upon Hausman, 1998, especially Chapters 4-7, and Lewis, 1973b.

44. Thus, if barometric pressure rises, thunderstorms occur and vice-versa. Furthermore, if barometric
pressure does not rise, then thunderstorms do not occur and vice-versa. Thus, by the counterfactual
definition, each is the cause of the other. (To simplify matters, we have ignored the fact that there is not a
perfectly deterministic relationship between high pressure systems and thunderstorms.)

45. As implied by this paragraph, there is a causal connection between A and B when either A causes B, B
causes A, or A and B are the effects of a common cause. (See Hausman, 1998, pages 55-63.)
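A short simulation can make this symmetry argument concrete. Three hypothetical deterministic processes (A causes B, B causes A, and a common cause producing both) generate exactly the same diagonal crosstabulation, so the table alone cannot reveal the direction of causation or rule out a common cause. The process names and sample size below are illustrative assumptions.

```python
import random

random.seed(1)

# Three hypothetical data-generating processes; each returns whether
# factors A and B are present for one case.
def a_causes_b():
    a = random.random() < 0.5
    return a, a                      # B follows A

def b_causes_a():
    b = random.random() < 0.5
    return b, b                      # A follows B

def common_cause():
    c = random.random() < 0.5        # e.g., a high-pressure system
    return c, c                      # both A and B follow C

def crosstab(process, n=1000):
    table = {(True, True): 0, (True, False): 0,
             (False, True): 0, (False, False): 0}
    for _ in range(n):
        table[process()] += 1
    # Return the off-diagonal cells: both are empty in every case, so
    # the data show only a causal connection, never its direction.
    return table[(True, False)], table[(False, True)]

for process in (a_causes_b, b_causes_a, common_cause):
    print(crosstab(process))  # (0, 0) each time
```

All three processes leave the off-diagonal cells empty, which is exactly the sense in which correlational data describe only a necessary condition for causation.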
Requiring temporal precedence can solve the problem of causal direction by simply choosing the
phenomenon that occurs first as the cause, but it cannot solve the problem of common cause
because it would lead to the ridiculous conclusion that since the mercury rises in barometers
before storms, this upward movement in the mercury must cause thunderstorms. For this and
other reasons, David Lewis rejects using temporal precedence to determine the direction of
causality. Instead, he claims that when C causes E but not the reverse “then it should be possible
to claim the falsity of the counterfactual ‘If E did not occur, then C would not occur.’” This
counterfactual is different from “if C occurs then E occurs” and from “if C does not occur then E
does not occur” which, as we have already mentioned, Lewis believes must both be true when C
causes E. The required falsity of ‘If E did not occur, then C would not occur’ adds a third
condition for causality.46 This condition amounts to finding situations in which C occurs but E
does not – typically because there is some other condition that must occur for C to produce E.
Appendix 1 explores this strategy in much more detail, but it suffices to say here that there is
typically a much better way of establishing causal priority that is explored in the next section.
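Lewis’s three conditions, including the required falsity of the fourth counterfactual, can be illustrated with a toy deterministic example. The auxiliary condition K and the three “worlds” below are hypothetical constructions; the point is only that the third condition holds precisely because C can occur without E.

```python
# Illustrative check of Lewis's three conditions (a simplification, with
# hypothetical deterministic worlds). Each "world" records whether the
# cause C, an auxiliary condition K, and the effect E occur; E occurs
# exactly when both C and K occur.
worlds = [
    {"C": True,  "K": True,  "E": True},   # treatment, condition present
    {"C": True,  "K": False, "E": False},  # C occurs but E does not
    {"C": False, "K": True,  "E": False},  # closest world without C
]

# Condition 1: in worlds where C occurs and K holds, E occurs.
cond1 = all(w["E"] for w in worlds if w["C"] and w["K"])

# Condition 2: in the closest world where C does not occur, E does not occur.
cond2 = all(not w["E"] for w in worlds if not w["C"])

# Condition 3: 'if E did not occur, C would not occur' must be FALSE --
# here it fails because a world exists with C but without E.
cond3 = any(w["C"] and not w["E"] for w in worlds)

print(cond1, cond2, cond3)  # True True True -> C causes E, not the reverse
```

The second world is what rules out the reverse claim that E causes C: E is absent there even though C occurred, so Lewis’s first two conditions cannot be satisfied with the roles of C and E exchanged.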
46. There are four possible counterfactuals involving C and E, and unlike standard propositional logic in
which the truth of ‘if C then E’ implies the truth of its contrapositive, ‘if not E then not C’, the truth or
falsity of these four counterfactuals is logically independent of one another. That is, the law of the
contrapositive does not hold for counterfactuals. Lewis proposes three conditions on these four
counterfactuals for C to be said to cause E. First, the counterfactual “if C occurs then E would occur” must
be true. In an experiment, this means that both C and E must occur in the treatment condition. We would
expect this to happen if C deterministically causes E. Thus Lewis’s first condition for causality holds
when C causes E. Lewis proposes that a second counterfactual, “if C did not occur then E would not
occur” must also be true if we are to say that C causes E. The premise of this counterfactual (“C did not
occur”) is true for the control group, and the counterfactual will be true if the control group is considered
the closest possible world to the treatment group for which C did not occur and if E does not occur in the
control group. Now, there is every reason to consider the control situation the closest possible world to
the treatment situation, and if C really causes E, then E will not occur in the control group. Thus, the
second possible counterfactual “if C did not occur then E would not occur” will be true when C causes E,
and we can say that C does cause E according to Lewis’s definition. But when C causes E, the results for
the treatment group also imply that a third counterfactual “if E occurs then C would occur” is true which
leads to the possibility that E also causes C according to Lewis’s definition even though E does not really
cause C at all. To avoid concluding that E causes C as well, the fourth counterfactual “if E did not occur
then C would not occur” must be false. (If it were true then Lewis’s first two conditions for a causal
relationship would hold for E causing C.) The falsity of this fourth counterfactual is Lewis’s third
condition for claiming that C causes E but not the reverse.
The counterfactual theory provides us with substantial insights into the nature of causation by
leading us towards experiments as a way to construct counterfactual worlds. It also illuminates
one very important aspect of experiments. Although the cross-tabulation of the data from an
experiment will indicate that there is a causal connection between one factor and another if the
entries lie along a diagonal formed by cases where both factors are absent or both are present,47 it
will not rule out a common cause or reveal the direction of causation if one factor directly causes
another. Consequently, other ways (described in Appendix 1) must be found to determine
causation such as introducing a factor that interacts (or conditions) the operation of the supposed
cause or that might be the entire cause itself. In an experimental situation, extra factors like these
can help establish the direction of causation and rule out common causes, although they must be
used artfully. Considering other factors can also be useful in both experimental and
observational studies because it leads to more careful consideration of the exact mechanisms by
which causality occurs. However, considering other factors in observational (as opposed to
experimental) studies cannot even assure us that we will avoid spurious correlations.
Whatever its virtues and defects, this technique of finding another factor seems a bit unwieldy
because it requires the identification and introduction of a factor in addition to the supposed cause
and the supposed effect. From the perspective of a practicing researcher, temporal precedence
would seem to be a much easier way to establish the direction of causation. But it has its own
limitations including the difficulty of identifying what comes before what in many situations.
Sometimes this is just the difficulty of measuring events in a timely fashion – when, for example,
did Protestantism become fully institutionalized and did it precede the institutionalization of
capitalism? Does the increase in the money supply really precede economic upturns?48
But identifying what comes before what can also involve deep theoretical difficulties regarding
the role of expectations (Shiffrin, 19xx), intentions, and human decision-making. Consider, for
example, the relationship between educational attainment and marriage timing. “Among women
who leave full-time schooling prior to entry into marriage, there are some who will leave school
and then decide to get married and others who will decide to get married and then leave school in
anticipation of the impending marriage.” (Marini and Singer, 1988, page 377). In both cases,
leaving school will precede marriage, but in the first case leaving school preceded the decision to
marry and in the second case leaving school came after the decision to get married. Thus the
timing of observable events cannot always determine causality, although the timing of intentions
(to marry in this case) can determine causality. Unfortunately, it may be hard to get data on the
timing of intentions. Finally, there are philosophical qualms about using temporal precedence to
determine causal priority. Clearly, from a practical and theoretical perspective, it would be better
to have a way of establishing causal priority that did not rely upon temporal precedence.

47. We are assuming the same set-up as we described earlier, in which each factor is coded absent or
present, and the diagonals represent the factors being jointly absent or jointly present. In observational
data, this same pattern can be produced if the factors are the effects of a common cause, but the
experimental context rules out this possibility.

48. The appropriate lag length in the relationship between money and economic output continues to be
debated in economics, and it has led to the “established notion that monetary policy works with long and
variable lags” (Abdullah and Rangazas, 1988, page 680).
Experimentation and the Manipulation Theory of Causation
In an experiment, there is a readily available piece of information that we have overlooked so far
because it is not mentioned in the counterfactual theory. The factor that has been manipulated can
determine the direction of causality and help to rule out spurious correlation. The cause must be
the manipulated factor.49 It is hard to exaggerate the importance of this insight. Although
philosophers are uncomfortable with manipulation and agency theories of causality because they
put people (as the manipulators) at the center of our understanding of causality, there can be little
doubt about the power of manipulation for determining causality. Agency and manipulation
theories of causation (Gasking, 1955; von Wright, 1975; Menzies and Price, 1993) elevate this
insight into their definition of causation. For Gasking “the notion of causation is essentially
connected with our manipulative techniques for producing results” (1955, page 483), and for
Menzies and Price “events are causally related just in case the situation involving them possesses
intrinsic features that either support a means-end relation between the events as is, or are identical
with (or closely similar to) those of another situation involving an analogous pair of means-end
related events.” (1993, page 197). These theories focus on establishing the direction of
causation, but Gasking’s metaphor of causation as “recipes” also suggests an approach towards
establishing the symmetric, regularity aspect of causation. Causation exists when there is a recipe
that regularly produces effects from causes.
Perhaps our ontological definitions of causality should not employ the concept of agency because
most of the causes and effects in the universe go their merry way without human intervention, and
even our epistemological methods often discover causes, as with Newtonian mechanics or
astrophysics, where human manipulation is impossible. Yet our epistemological methods cannot
do without agency because human manipulation appears to be the best way to identify causes, and
many researchers and methodologists have fastened upon experimental interventions as the way
to pin-down causation. These authors typically eschew ontological aims and emphasize
epistemological goals. After explicitly rejecting ontological objectives, for example, Herbert
Simon proceeds to base his initial definition of causality on experimental systems because “in
scientific literature the word ‘cause’ most often occurs in connection with some explicit or
implicit notion of an experimenter’s intervention in a system.” (Simon, 1952, page 518). When
full experimental control is not possible, Thomas Cook and Donald T. Campbell recommend
“quasi-experimentation,” in which “an abrupt intervention at a known time” in a treatment group
makes it possible to compare the impacts of the treatment over time or across groups (Cook and
Campbell, 1986, page 149). The success of quasi-experimentation depends upon “a world of
probabilistic multivariate causal agency in which some manipulable events dependably cause
other things to change.” (Page 150). John Stuart Mill suggests that the study of phenomena
which “we can, by our voluntary agency, modify or control” makes it possible to satisfy the
requirements of the Method of Difference (“a method of artificial experiment”) even though “by
the spontaneous operations of nature those requisitions are seldom fulfilled.” (Mill, 1888, pages
281, 282). Sobel champions a manipulation model because it “provides a framework in which the
nonexperimental worker can think more clearly about the types of conditions that need to be
satisfied in order to make inferences” (Sobel, 1995, page 32). David Cox claims that
quasi-experimentation “with its interventionist emphasis seems to capture a deeper notion” (Cox, 1992,
page 297) of causality than the regularity theory.

49. It might be more correct to say that the cause is buried somewhere among those things that were
manipulated or that are associated with the manipulation. It is not always easy, however, to know what
was manipulated, as in the famous Hawthorne experiments, in which the experimenters thought the
treatment was reducing the lighting for workers but the workers apparently thought of the treatment as
being treated differently from all other workers. (See ***) Part of the work required for good causal
inference is clearly describing what was manipulated and unpacking it to see what feature caused the
effect.
As we shall see, there are those who dissent from this perspective, but even they acknowledge
that there is “wide agreement that the idea of causation as consequential manipulation is stronger
or ‘deeper’ than that of causation as robust dependence.” (Goldthorpe, 2001, page 5). This
account of causality is especially compelling if the manipulation theory and the counterfactual
theory are conflated, as they often are, and viewed as one theory. Philosophers seldom combine
them into one perspective, but all the methodological writers cited above (Simon, Cook and
Campbell, Mill, Sobel, and Cox) conflate them because they draw upon controlled experiments,
which combine intervention and control, for their understanding of causality. Through
interventions, experiments manipulate one or more factors, which simplifies the job of
establishing causal priority by appeal to the manipulation theory of causation. Through
laboratory controls or statistical randomization, experiments also create closest possible worlds
that simplify the job of eliminating confounding explanations by appeal to the counterfactual
theory of causation.
The combination of intervention and control in experiments makes them especially effective ways
to identify causal relationships. If experiments only furnished closest possible worlds, then the
direction of causation would be indeterminate without additional information. If experiments
only manipulated factors, then accidental correlation would be a serious threat to valid inferences
about causality. Both features of experiments do substantial work.
Any approach to determining causation in non-experimental contexts that tries to achieve the
same success as experiments must recognize both these features. The methodologists cited above
conflate them, and the psychological literature on counterfactual thinking cited at the beginning of
this chapter shows that our natural inclination as human beings is to conflate them. When
considering alternative possibilities, people typically consider nearby worlds in which individual
agency figures prominently. When asked to consider what could have happened differently in a
vignette involving a drunken driver and a new route home from work, subjects focus on having
taken the new route home instead of on the factors that lead to drunken driving. They choose a
cause and a closest possible world in which their agency matters. But there is no reason why the
counterfactual theory and the manipulation theory should be combined in this way. The
counterfactual theory of causation emphasizes possible worlds without considering human agency
and the manipulation theory of causation emphasizes human agency without saying anything
about possible worlds. Experiments derive their strength from combining both theoretical
perspectives, but it is all too easy to overlook one of these two elements in generalizing from
experimental to observational studies.50
50
Some physical experiments actually derive most of their strength by employing such powerful
As we shall see in a later section, the best known statistical theory of causality emphasizes the
counterfactual aspects of experiments without giving equal attention to their manipulative aspects.
Consequently, when the requirements for causal inference are transferred from the experimental
setting to the observational setting, those features of experiments that rest upon manipulation tend
to get underplayed.
Preemption and the Mechanism Theory of Causation
Preemption – Experimentation’s amalgamation of the lessons of counterfactual and manipulation
theories of causation produces a powerful technique for identifying the effects of manipulated
causes. Yet, in addition to the practical problems of implementing the recipe correctly, the
experimental approach does not deal well with two related problems. It does not solve the
problem of causal preemption which occurs when one cause acts just before and preempts
another, and it does not so much explain the causes of events as it demonstrates the effects of
manipulated causes. In both cases, the experimentalists’ focus on the impacts of manipulations in
the laboratory, rather than on the causes of events in the world, leads to a failure to explain
important phenomena, especially those which cannot be easily manipulated or isolated.
The problem of preemption illustrates this point. The following example of preemption is often
mentioned in the philosophical literature. A man takes a trek across a desert. His enemy puts a
hole in his water can. Another enemy, not knowing the action of the first, puts poison in his
water. Manipulations have certainly occurred, and the man dies on the trip. The enemy who
punctured the water can thinks that she caused the man to die, and the enemy who added the
poison thinks that he caused the man to die. In fact, the water dripping out of the can preempted
the poisoning so that the poisoner is wrong. This situation poses problems for the counterfactual
theory because one of the basic counterfactual conditions required to establish that the hole in the
water can caused the death of the man, namely the truth of the counterfactual “if the hole had not
been put in the water can, the man would not have died,” is false even though the man did in fact
die of thirst. The problem is that the man would have died of poisoning if the hole in the water
can had not preempted that cause, and the “back-up” possibility of dying by poisoning falsifies
the counterfactual.
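The logic of the desert-traveler case can be made concrete in a small sketch (ours, not the author's; the function and its preemption rule are purely illustrative):

```python
# Illustrative toy model of the desert-traveler case. The rule that the
# draining water preempts the poison is built in by the order of the checks.
def cause_of_death(hole_in_can, poison_in_water):
    """Return what kills the traveler, or None if he survives."""
    if hole_in_can:
        return "thirst"      # the water drains before any poison is drunk
    if poison_in_water:
        return "poison"      # the back-up cause fires only if not preempted
    return None

# Actual world: both enemies act, and the man dies of thirst.
actual = cause_of_death(True, True)                # "thirst"

# Naive counterfactual test of "the hole caused the death": remove the hole
# and ask whether the man survives. He does not -- the back-up poisoning
# falsifies the counterfactual, even though the hole was the actual cause.
would_survive_without_hole = cause_of_death(False, True) is None   # False
```

The counterfactual condition fails precisely because the back-up cause stands ready in the closest possible world without the hole.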
The preemption problem is a serious one, and it can lead to mistakes even in well-designed
experiments. Presumably the closest possible world to the one in which the water can has been
punctured is one in which the poison has been put in the water can as well. Therefore, even a
carefully designed experiment will conclude that the puncturing of the can did not kill the man
crossing the desert because the unfortunate subject in the control condition would die (from
poisoning) just as the subject in the treatment would die (from the hole in the water can). The
experiment alone would not tell us how the man died. A similar problem could arise in medical
experiments. Arsenic was once used to cure venereal disease, and it is easy to imagine an
experiment in which doses of arsenic “cure” venereal disease but kill the patient while the
members of the control group without the arsenic die of venereal disease at the same rate. If the
experiment simply looked at the mortality rates of the patients, it would conclude that arsenic had
no medicinal value because the same number of people died in the two conditions.
50 (continued)
manipulations that no controls are needed. At the detonation of the first atom bomb, no one doubted that
the explosion was the result of nuclear fission and not some other uncontrolled factor. Similarly, in what
might be an apocryphal story, it is said that a Harvard professor who was an expert on criminology once
lectured to a class about how all social science evidence suggested that rehabilitating criminals simply did
not work. A Chinese student raised his hand and politely disagreed by saying that during the Cultural
Revolution, he had observed cases where criminals had been rehabilitated. Once again, a powerful
manipulation may need no controls.
In both these instances, the experimental method focuses on the effects of causes and not on
explaining effects by adducing causes. Instead of asking why the man died in his trek across the
desert, the experimental approach asks what happens when a hole is put in the man’s canteen and
everything else remains the same. The method concludes that the hole had no effect. Instead of
asking what caused the death of the patients with venereal disease, the experimental method asks
whether giving arsenic to those with venereal disease had any net impact on mortality rates. It
concludes that it did not. In short, experimental methods do not try to explain events in the world
so much as they try to show what would happen if some cause were manipulated. This does not
mean that experimental methods are not useful for explaining what happens in the world, but it
does mean that they sometimes miss the mark.
Mechanisms, Capacities, and the Pairing Problem – The preemption problem is a vivid example
of a more general problem with the Humean account that requires a solution. The general
problem is that constant conjunction of events is not enough to “pair-up” particular events even
when preemption is not present. Even if we know that holes in water cans generally spell trouble
for desert travelers, we still have the problem of linking a particular hole in a water can with a
particular death of a traveler. Douglas Ehring notes that:
Typically, certain spatial and temporal relations, such as spatial/temporal contiguity, are
invoked to do this job. [That is, the hole in the water can used by the traveler is obviously
the one that caused his death because it is spatially and temporally contiguous to him.]
These singularist relations are intended to solve the residual problem of causally pairing
particular events, a problem left over by the generalist core of the Humean account.
(Ehring, 1997, page 18)
Counterfactual theories, because they can explain singular causal events, do not suffer so acutely
from this “pairing” problem, but the preemption problem shows that remnants of the difficulty
remain even in counterfactual accounts. (Ehring, 1997, Chapter 1) In both the desert traveler and
arsenic examples, the counterfactual account cannot get at the proper pairing of causes and effects
because there are two redundant causes to be paired with the same effects. Something more is
needed.
The solution in both these cases seems obvious, but it does not follow from the neo-Humean,
counterfactual, or manipulation definitions of causality. The solution is to inquire more deeply
into what is happening in each situation in order to describe the capacities and mechanisms that
are operating. An autopsy of the desert traveler would show that the person died of thirst, and an
examination of the water can would show that the water would have run out before the poisoned
water could be imbibed. An autopsy of those given arsenic would show that the signs of venereal
disease were arrested while other medical problems, associated with arsenic poisoning, were
present. Further work might even show that lower doses of arsenic cure the disease without
causing death. In both these cases, deeper inquiries into the mechanisms by which the causes and
effects are linked would produce better causal stories.
But what does it mean to explicate mechanisms and capacities?51 “Mechanisms,” we are told by
Machamer, Darden, and Craver (2000, page 3), “are entities and activities organized such that
they are productive of regular changes from start or set-up to finish or termination conditions.”
The crucial terms in this definition are “entities and activities” which suggest that mechanisms
have pieces. Glennan (1996, page 52) calls them “parts,” and he requires that it should be
possible “to take the part out of the mechanism and consider its properties in another context”
(page 53). Entities, or parts, are organized to produce change. For Glennan (page 52), this
change should be produced by “the interaction of a number of parts according to direct causal
laws.” The biological sciences abound with mechanisms of this sort such as the method of DNA
replication, chemical transmission at synapses, and protein synthesis. But there are many
mechanisms in the social sciences as well including markets with their methods of transmitting
price information and bringing buyers and sellers together, electoral systems with their routines
for bringing candidates and voters together in a collective decision-making process, the diffusion
of innovation through social networks, the two-step model of communication flow, weak ties in
social networks, dissonance reduction, reference groups, arms races, balance of power, etc.
(Hedstrom and Swedberg, 1998). As these examples demonstrate, mechanisms are not
exclusively mechanical, and their activating principles can range from physical and chemical
processes to psychological and social processes. They must be composed of appropriately
located, structured, and oriented entities which involve activities that have temporal order and
duration, and “an activity is usually designated by a verb or verb form (participles, gerundives,
etc.)” (Machamer, Darden, and Craver, 2000, page 4), which takes us back to the work of Lakoff
and Johnson (1999), who identified a “Causation Is Forced Movement metaphor.”
Mechanisms provide another way to think about causation. Glennan argues that “two events are
causally connected when and only when there is a mechanism connecting them” and “the
necessity that distinguishes connections from accidental conjunctions is to be understood as
deriving from an underlying mechanism” which can be empirically investigated (page 64). These
mechanisms, in turn, are explained by causal laws, but there is nothing circular in this because
these causal laws refer to how the parts of the mechanism are connected. The operation of these
parts, in turn, can be explained by lower level mechanisms. Eventually the process gets to a
bedrock of fundamental physical laws which Glennan concedes “cannot be explained by the
mechanical theory (page 65).”
51
These approaches are not the same, and those who favor one often reject the other (see, e.g., Cartwright,
1989 on capacities and Machamer, Darden, and Craver, 2000 on mechanisms). But both emphasize
“causal powers” (Harre and Madden, 1975, Chapter 5) instead of mere regularity or counterfactual
association. We focus on mechanisms because we believe that they are a somewhat better way to think
about causal powers, but in keeping with our pragmatic approach, we find much that is useful in
“capacity” theories.

Consider explaining social phenomena by examining their mechanisms. Duverger’s law, for
example, is the observed tendency for just two parties in simple plurality single-member district
election systems (such as the United States). The entities in the mechanisms behind Duverger’s
law are voters and political parties. These entities face a particular electoral rule (single district
plurality voting) which causes two activities. One is that voters often vote strategically by
choosing a candidate other than their most liked because they want to avoid throwing their vote
away on a candidate who has no chance of winning and because they want to forestall the election
of their least wanted alternative. The other activity is that political parties often decide not to run
candidates when there are already two parties in a district because they anticipate that voters will
spurn their third party effort.
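The strategic-voting activity just described can be sketched as a toy computation (the electorate and its preference rankings are invented for illustration):

```python
from collections import Counter

def votes(preferences, viable):
    """Strategic voting: each voter backs the highest-ranked party
    that he or she considers viable."""
    return Counter(next(p for p in ranking if p in viable)
                   for ranking in preferences)

# Invented sincere rankings over three parties in one district.
prefs = ([("A", "B", "C")] * 40 +    # 40 voters rank A first, then B, then C
         [("B", "A", "C")] * 35 +
         [("C", "B", "A")] * 25)

sincere = votes(prefs, {"A", "B", "C"})          # A: 40, B: 35, C: 25
viable = {p for p, _ in sincere.most_common(2)}  # C looks hopeless
strategic = votes(prefs, viable)                 # C's supporters defect to B

# Only two parties receive any votes, as Duverger's law predicts.
```

Third parties anticipating this defection then decline to run candidates at all, reinforcing the two-party outcome.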
These mechanisms underlying Duverger’s law suggest other things that can be observed beyond
the regularity of two party systems being associated with single member plurality-vote electoral
systems that led to the law in the first place. People’s votes should exhibit certain patterns and
third parties should exhibit certain behaviors. And a careful examination of the mechanism
suggests that in some federal systems that use simple plurality single-member district elections we
might have more than two parties, seemingly contrary to Duverger’s Law. Typically, however,
there are just two parties in each province or state, but these parties may differ from one state to
another, thus giving the impression, at the national level, of a multi-party system even though
Duverger’s Law holds in each electoral district.52
Or consider meteorological53 and physical phenomena. Thunderstorms are not merely the result of
cold fronts hitting warm air or of being located near mountains; they are the results of parcels of air
rising and falling in the atmosphere subject to thermodynamic processes which cause warm
humid air to rise, to cool, and to produce condensed water vapor. Among other things, this
mechanism helps to explain why thunderstorms are more frequent in areas near mountains, such
as Denver, Colorado, because the mountains cause these processes to occur – without the
need for a cold air front. Similarly, Boyle’s law is not merely a regularity between pressure and
volume; it is the result of gas molecules moving within a container and exerting force when they
hit the walls of the container. This mechanism for Boyle’s law also helps to explain why
temperature affects the relationship between the pressure and volume of a gas. When the
temperature increases, the molecules move faster and exert more force on the container walls.
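The point can be sketched with the ideal-gas law, which summarizes this molecular mechanism (a standard physics illustration, not from the text):

```python
R = 8.314  # ideal gas constant, J/(mol*K)

def pressure(moles, temp_k, volume_m3):
    """Ideal-gas pressure: faster molecules (higher T) or a smaller
    container (lower V) mean harder, more frequent wall collisions."""
    return moles * R * temp_k / volume_m3

# Boyle's law as a special case: at fixed temperature, P * V is constant.
pv_small = pressure(1.0, 300.0, 0.010) * 0.010
pv_large = pressure(1.0, 300.0, 0.025) * 0.025
same = abs(pv_small - pv_large) < 1e-9            # True

# The mechanism also predicts what Boyle's bare regularity does not:
# raising the temperature raises the pressure at any fixed volume.
hotter_is_higher = pressure(1.0, 350.0, 0.010) > pressure(1.0, 300.0, 0.010)
```

The mechanism thus explains the regularity and tells us when it will shift, which the regularity alone cannot do.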
Mechanisms like these are midway between general laws on the one hand and specific
descriptions on the other hand, and activities can be thought of as causes which are not related to
lawlike generalities.54 Mechanisms typically explicate observed regularities in terms of lower
level processes, and the mechanisms vary from field to field and from time to time. Moreover,
these mechanisms “bottom-out” relatively quickly – molecular biologists do not seek quantum
mechanical explanations and social scientists do not seek chemical explanations of the
phenomena they study.
52
This radically simplifies the literature on Duverger’s law (see Cox, 19xx for more details).
53
The points in this paragraph, and the thunderstorm example, come from Dessler (1991).
54
Jon Elster says: “Are there lawlike generalizations in the social sciences? If not, are we thrown back on
mere description and narrative? In my opinion, the answer to both questions is No. The main task of this
essay is to explain and illustrate the idea of a mechanism as intermediate between laws and descriptions.”
(Elster, 1998, page 45)
When an unexplained phenomenon is encountered in a science, “Scientists in the field often
recognize whether there are known types of entities and activities that can possibly accomplish
the hypothesized changes and whether there is empirical evidence that a possible schemata is
plausible.” They turn to the available types of entities and activities to provide building blocks
from which to construct hypothetical mechanisms. “If one knows what kind of activity is needed
to do something, then one seeks kinds of entities that can do it, and vice versa.” (Machamer,
Darden, and Craver, 2000, page 17)
Mechanisms, therefore, provide a way to solve the pairing problem, and they leave a multitude of
traces that can be uncovered if a hypothesized causal relation really exists. For example, those
who want to test Max Weber’s hypothesis about the Reformation leading to capitalism do not
have to rest content with simply correlating Protestantism with capitalism. They can also look at
the detailed mechanism he described for how this came about, and they can look for the traces left
by this mechanism (Hedstrom and Swedberg, 1998, page 5; Sprinzak, 1972).55
Multiple Causes and Mechanisms – Earlier in this paper, the need to rule out common causes and
to determine the direction of causation in the counterfactual theory led us towards a consideration
of multiple causes. In this section, the need to solve the problem of preemption and the pairing
problem led to a consideration of mechanisms. Together, these theories lead us to consider
multiple causes and the mechanisms that tie these causes together. Many different authors have
come to a similar conclusion about the need to identify mechanisms (Cox, 1992; Simon and
Iwasaki, 1988; Freedman, 1991; Goldthorpe, 2001), and this approach seems commonplace in
epidemiology (Bradford Hill, 1965) where debates over smoking and lung cancer or sexual
behavior and AIDS have been resolved by the identification of biological mechanisms that link
the behaviors with the diseases.
Four Theories of Causality [Incomplete]
What is Causation? – We are now at the end of our review of four causal theories. We have
described two fundamental features of causality. One is the symmetric association between
causes and effects. The other is the asymmetric fact that causes produce effects, but not the
reverse. Table 1 summarizes how each theory identifies these two aspects of causality.
Regularity and counterfactual theories do better at capturing the symmetric aspect of causation
than its asymmetric aspect. Regularity theories rely upon the constant conjunction of events and
temporal precedence to identify causes and effects. Their primary tool is essentially the “Method
of Concomitant Variation” proposed by John Stuart Mill in which the causes of a phenomenon are
sought in other phenomena which vary in a similar manner. Counterfactual theories rely upon
elaborations of the “Method of Difference” to find causes by comparing instances where the
phenomenon occurs and instances where it does not occur to see in what circumstances the
situations differ. Counterfactual theories suggest searching for surrogates for the closest possible
worlds where the putative cause does not occur to see how they differ from the situation where
the cause did occur. This strategy leads naturally to experimental methods where the likelihood
of the independence of assignment and outcome, which insures one kind of closeness, can be
increased by rigid control of conditions or by randomly assigning treatments to cases. None of
these methods is fool-proof because none solves the pairing problem or gets at the connections
between events, but experimental methods typically offer the best chance of achieving closest
possible worlds for comparisons.
55
Hedstrom and Swedberg (1998) and Sorenson (1998) rightfully criticize causal modeling for ignoring
mechanisms and treating correlations among variables as theoretical relationships. But it might be worth
remarking that causal modelers in political science have been calling for more theoretical thinking (Achen,
19xx; Bartels and Brady, 19xx) for at least two decades, and a constant refrain at the annual meetings of
the Political Methodology Group has been the need for better “micro-foundations.”
Causal theories that emphasize mechanisms and capacities provide guidance on how to solve the
pairing problem and how to get at the connections between events. Our emphasis in this book
upon causal process observations is in that spirit. These observations can be thought of as
elucidations and tests of possible mechanisms. And the growing interest in mechanisms in the
social sciences (Hedstrom and Swedberg, 1998; Elster, 19xx) is providing a basis for opening up
the black-box of the Humean regularity and the counterfactual theories.
The other major feature of causality, the asymmetry of causes and effects, is captured by temporal
priority, manipulated events, and the independence of causes. Each notion takes a somewhat
different approach to distinguishing causes from effects once the unconditional association of two
events (or sets of events) has been established. Temporal priority simply identifies causes with
the events that came first. If growth in the money supply reliably precedes economic growth,
then the growth in the money supply is taken to cause that growth. Manipulation theories identify
the manipulated event as the causally prior one. If a social experiment manipulates work
requirements and finds that greater stringency is associated with faster transitions off welfare,
then the work requirements are presumed to cause these transitions. Finally, one event is
considered the cause of another if a third event can be found that satisfies the INUS conditions for
a cause and that varies independently of the putative cause. If short-circuits vary independently of
wooden frame buildings, and both satisfy INUS conditions for burned down buildings, then both
must be causes of those conflagrations. Or if education levels of voters vary independently of
their getting the butterfly ballot, and both satisfy INUS conditions for mistakenly voting for
Buchanan instead of Gore, then both must be causes of those mistaken votes.
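The INUS structure of the building example can be written out explicitly (a toy encoding; the two condition-bundles are our own simplification):

```python
def building_burns(short_circuit, wooden_frame, arson):
    """Fire occurs under either of two sufficient bundles of conditions."""
    return (short_circuit and wooden_frame) or arson

# A short-circuit is an Insufficient but Necessary part of an
# Unnecessary but Sufficient condition (INUS):
assert not building_burns(True, False, False)   # insufficient by itself
assert building_burns(True, True, False)        # its bundle is sufficient
assert not building_burns(False, True, False)   # necessary within the bundle
assert building_burns(False, False, True)       # other routes to fire exist,
                                                # so the bundle is unnecessary
```

Because the short-circuit and the wooden frame vary independently while both satisfy these conditions, each qualifies as a cause of the fire.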
Causal Inference with Experimental and Observational Data – Now that we know what causation
is, what lessons can we draw for doing empirical research? Table 1 shows that each theory
provides sustenance for different types of studies and different kinds of questions. Regularity and
mechanism theories tend to ask about the causes of effects while counterfactual and manipulation
theories ask about the effects of imagined or manipulated causes. The counterfactual and
manipulation theories converge on experiments, although counterfactual thought experiments
flow naturally from the possible worlds approach of the counterfactual theory. Regularity
theories are at home with observational data, and the mechanical theory thrives on analytical
models and case studies.
Which method, however, is the best method? Clearly the gold-standard for establishing causality
is experimental research, but even that is not without flaws. When they are feasible, well-done
experiments can help us construct closest possible worlds and explore counterfactual conditions.
But we still have to assume that there is no preemption occurring which would make it impossible
for us to determine the true impact of the putative cause, and we also have to assume that there
are no interactions across units in the treatment and control groups and that treatments can be
confined to the treated cases. If, for example, we are studying the impact of a skill training
program on the tendency for welfare recipients to get jobs, we should be aware that a very strong
economy might preempt the program itself and cause those in both the control and treatment
conditions to get jobs simply because employers did not care much about skills. As a result, we
might conclude that skills do not count for much in getting jobs even though they might matter a
lot in a less robust economy. Or if we are studying electoral systems in a set of countries with a
strong bimodal distribution of voters, we should know that the voter distribution might preempt
any impact of the electoral system by fostering two strong parties. Consequently, we might
conclude that single-member plurality systems and proportional representation systems both led
to two parties, even though this is not generally true. And if we are studying some educational
innovation that is widely known, we should know that teachers in the “control” classes might
pick up and use this innovation, thereby nullifying any effect it might have.
If we add an investigation of mechanisms to our experiments, we might be able to develop
safeguards against these problems. For the welfare recipients, we could find out more about their
job search efforts, for the party systems we could find out about their relationship to the
distribution of voters, and for the teachers we could find out about their adoption of new teaching
methods.
Once we go to observational studies, matters get much more complicated. Spurious correlation is
a real danger. There is no way to know whether those cases which get the treatment and those
which do not differ from one another in other ways. It is very hard to be confident that either
independence of assignment and outcome or conditional independence of treatment and
assignment holds. Because nothing has been manipulated, there is no surefire way to determine
the direction of causation. Temporal precedence provides some information about causal
direction, but such information is often hard to obtain and interpret.
The Causality Checklist and Social Science Examples – [Still to be completed; This section will
analyze several social science examples such as Duverger’s Law, the Protestant ethic and the rise
of capitalism, work requirements and welfare rolls, and the butterfly ballot and voting in Florida.
Table 3 is a preliminary version of the causality checklist which will be “filled out” for each
example in order to show what must be established.]
Table 3 Causality Checklist
General Issues
# What is the “cause” (C) event? What is the “effect” (E) event?
# What is the exact causal statement of how C causes E?
# What is the corresponding counterfactual statement about what happens when C does not occur?
# What is the causal field? What is the context or universe of cases in which the cause operates?
# Is this a physical or social phenomenon or some mixture?
# What role, if any, does human agency play?
# What role, if any, does social structure play?
# Is the relationship deterministic or probabilistic?
Neo-Humean Theory
# Is there a constant conjunction (i.e., correlation) of cause and effect?
# Is the cause necessary, sufficient or INUS?
# What are other possible causes, i.e., rival explanations?
# Is there a constant conjunction after controls for these other causes are introduced?
# Does the cause precede the effect? In what sense?
Counterfactual Theory
# Is this a singular conjunction of cause and effect?
# Can you describe a closest possible (most similar) world to where C causes E but C does not occur? How close are these worlds?
# Can you actually observe any cases of this world (or something close to it, at least on average)? Again, how close are these worlds?
# In this closest possible world, does E occur in the absence of C?
# Are there cases where E occurs but C does not occur? What factor intervenes and what does this tell us about C causing E?
Manipulation Theory
# What does it mean to manipulate your cause? Be explicit. How would you describe the cause?
# Do you have any cases where C was actually manipulated? How? What was the effect?
# Is this manipulation independent of other factors that influence E?
Mechanism and Capacities Theories
# Can you explain, at a lower level, the mechanism(s) by which C causes E?
# Do the mechanisms make sense to you?
# What other predictions does this mechanism lead to?
# Does the mechanism solve the pairing problem?
# Can you identify some capacity that explains the way the cause leads to the effect?
# Can you observe this capacity when it is present and measure it?
# What other outcomes might be predicted by this capacity?
# What are possible preempting causes?
Case Study: The Neyman-Rubin-Holland Counterfactual Conditions for Causation
Among statisticians, the best known theory of causality has grown out of the experimental
tradition. The roots of this perspective are in Fisher (19xx) and especially Neyman (1923), and it
has been most fully articulated by Rubin (1974, 1978) and Holland (1986). In this section, which
is more technical than the rest of this chapter, we explain this perspective, and we evaluate it in
terms of the four theories of causality described above.
There are four aspects of the Neyman-Rubin-Holland (NRH) approach:
1. A Counterfactual Definition of Causal Effect – Causal relationships are defined using a
counterfactual perspective which focuses on estimating causal effects. The definition
provides no guidance on how researchers can actually identify causes because it relies
upon an unobservable counterfactual. To the extent that the NRH approach considers
causal priority, it equates it with temporal priority.
2. Finding a Substitute for the Counterfactual Situation: The Independence of Assignment
and Outcome – As a step towards identifying causes, the NRH approach goes on to
formulate a set of epistemological assumptions, namely the independence of assignment
and outcome or the mean conditional independence of assignment and outcome, that make
it possible to estimate causal effects with observable data, although there is no way to
verify the assumption.
3. An Assumption for Creating Mini-Possible Worlds – As a prelude to suggesting
concrete ways that the independence or conditional independence of assignment and
outcome can be achieved, the statistical approach describes an assumption, the Stable Unit
Treatment Value Assumption (SUTVA) that makes it possible to treat cases as separate
mini closest possible worlds by assuming that they do not interfere or communicate with
one another and that treatments do not vary from case to case.
4. Methods for Insuring Independence of Assignment and Outcome if SUTVA holds –
Finally, the NRH approach describes methods such as unit homogeneity or random
assignment for obtaining independence or mean independence of assignment and outcome
as long as SUTVA holds.
The definition of a causal effect based upon unobserved counterfactuals was first described in a
1923 paper published in Polish by Jerzy Neyman (1990). Although Neyman’s paper was
relatively obscure until 1990, similar ideas informed much of the statistical work on
experimentation from the 1920's to the present. Rubin (1974, 1978, 1990a,b) and Heckman
(19xx) were the first to stress the importance of independence of assignment and outcome. A
number of experimentalists identified the need for the SUTVA assumption (e.g., Cox, 1958).
Random assignment as a method for estimating causal effects was first championed by R.A.
Fisher (1925, 1926). Holland (1986) provides the best synthesis of the entire perspective.
Ontological Definition of Causal Effect Based Upon Counterfactuals – According to the NRH
understanding of causation, establishing a causal relationship consists of comparing:
(a) the value of the outcome variable for a case that has been exposed to a treatment (Yt,
with “t” for treatment), with
(b) the value of the outcome variable for the same case if that case had not been exposed
to the treatment (Yc, with “c” for control).
Note that (a) refers to an actual observation in the treatment condition (“a case that has been
exposed to a treatment”) so the value Yt is observed while (b) refers to a counterfactual
observation of the control condition (“if that case had not been exposed to the treatment”).56
Because the case was exposed to the treatment, it cannot simultaneously be in the control
condition, and the value Yc is the outcome in the closest possible world where the case was not
exposed to the treatment. Although this value cannot be observed, we can still describe the
conclusions we would draw if we could observe it.
The Net Effect of the Treatment (NET) for a particular case is the difference in outcomes, NET =
(Yt - Yc), for the case, and if this difference is zero (i.e., if NET = 0), we say the treatment has no
net effect.57 If this difference is non-zero (i.e., NET ≠ 0), then the treatment has a net effect. Then,
based on the counterfactual approach of David Lewis, there is a causal connection between the
treatment and the outcome if two conditions hold. First, the treatment must be associated with a
net effect, and second, the absence of the treatment must be associated with no net effect.58
Although the satisfaction of these two conditions is enough to demonstrate a causal connection, it
56. For simplicity, we assume that the treatment case has been observed, but the important point is not that
the treatment is observed but rather that only one of the two conditions can be observed. There is no
reason why the situation could not be reversed with the actual observation of the case in the control group
and the counterfactual involving the unobserved impact of the treatment condition.
57. Technically, we mean that the treatment has no effect with respect to that outcome variable.
58. With a suitable definition of effect, one of these conditions will always hold by definition and the other
will be determinative of the causal connection. The NRH approach focuses on the Net Effect of the
Treatment (NET = Yt - Yc) in which the control outcome Yc is the baseline against which the treatment
outcome Yt is compared. A nonzero NET implies the truth of the counterfactual “if the treatment occurs,
then the net effect occurs,” and a zero NET implies that the counterfactual is false. In the NRH set-up the
Net Effect for the Control (NEC) must always be zero because NEC = (Yc - Yc) is always zero. Hence, the
counterfactual “if the treatment is absent then there is no net effect” is always true. The focus on the net
effect of the treatment (NET) merely formalizes the fact that in any situation one of the two counterfactuals
required for a causal connection can always be defined to be true by an appropriate definition of an effect.
Philosophers, by custom, tend to focus on the situation where some effect is associated with some putative
cause so that it is always true that “if the cause occurs then the effect occurs as well” and the important
question is the truth or falsity of “if the cause does not occur then the effect does not occur.” Statisticians
such as NRH, with their emphasis on the null hypothesis, seem to prefer the equivalent, but reverse, set-up
where the important question is the truth or falsity of “if the treatment occurs, then the effect
occurs.” The bottom line is that a suitable definition of effect can always lead to the truth of one
of the two counterfactuals so that causal impacts must always be considered comparatively.
is not enough to determine the direction of causation or to rule out a common cause. If the two
conditions for a causal connection hold, then the third Lewis condition, which establishes the
direction of causation and which rules out common cause, cannot be verified or rejected with the
available information. The third Lewis condition requires determining whether the cause occurs
in the closest possible world in which the net effect does not occur. But the only observed world
in which the net effect does not occur in the NRH setup is the control condition in which the
cause does not occur either. As discussed earlier, another situation in which the net effect does
not occur and the cause does occur must be observed to verify the third Lewis condition and to
show that the treatment causes the net effect.
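The bookkeeping behind this definition can be sketched in a few lines of code. The following is a hypothetical illustration with invented values (the `Case` class and its fields are our own notation, not part of the NRH literature): each case carries two potential outcomes, Yt and Yc, but assignment to a condition reveals only one of them, so NET can never be computed directly from the data for a single case.

```python
# Hypothetical sketch of the NRH potential-outcomes setup.
# All values are invented for illustration; 1 = effect occurs, 0 = it does not.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Case:
    y_t: int                        # potential outcome under treatment (Yt)
    y_c: int                        # potential outcome under control (Yc)
    treated: Optional[bool] = None  # assignment; None until the case is assigned

    def net_effect(self) -> int:
        # NET = Yt - Yc requires BOTH potential outcomes, which is exactly
        # what can never be observed for a single real case.
        return self.y_t - self.y_c

    def observed_outcome(self) -> Optional[int]:
        # Assignment reveals only one of the two potential outcomes.
        if self.treated is None:
            return None
        return self.y_t if self.treated else self.y_c

case = Case(y_t=1, y_c=0, treated=True)
print(case.observed_outcome())  # 1 -> we observe Yt
print(case.net_effect())        # 1 -> but computing NET also used the unobservable Yc
```

The point of the sketch is that `net_effect` is only computable because we wrote down both potential outcomes by fiat; real data supply only what `observed_outcome` returns.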
Alternatively, the direction of causation can be determined (although common cause cannot be
ruled out) if the treatment is manipulated to produce the effect. Rubin and his collaborators
mention manipulation when they say that “each of the T treatments must consist of a series of
actions that could be applied to each experimental unit” (Rubin, 1978, page 39) and “it is critical
that each unit be potentially exposable to any one of the causes” (Holland, 1986, page 946), but
their use of phrases such as “could be applied” or “potentially exposable” suggests that they are
more concerned about limiting the possible types of causes than with distinguishing causes from
effects.59 To the degree that causal priority is mentioned in the NRH literature, it is established by
temporal precedence. Rubin (1974, page 689), for example, says that the causal effect of one
treatment over another “for a particular unit and an interval t1 to t2 is the difference between what
would have happened at time t2 if the unit had been exposed to [one treatment] initiated at time t1
and what would have happened at t2 if the unit had been exposed to [another treatment] at t1.”
Holland (1986, page 980) says that “The issue of temporal succession is shamelessly embraced
by the model as one of the defining characteristics of a response variable. The idea that an effect
might precede a cause in time is regarded as meaningless in the model, and apparently also by
Hume.” The problem with this approach, of course, is that it does not necessarily rule out
common cause and spurious correlation.60 In fact, as we shall see, one of the limitations and
possible confusions produced by the NRH approach is its failure to deal with the need for more
information to rule out common causes and to determine the direction of causality.
Finding a Substitute for the Counterfactual Situation: The Independence of Assignment and
Outcome – As with the Lewis counterfactual approach, the difficulty with the NRH definition of
causal connections is that there is no way to observe both Yt and Yc for any particular case. One
obvious line of attack is to consider two cases instead of just one. One case gets the treatment and
the other gets the control condition. We now explore what happens under these circumstances.
59. Rubin and Holland believe in “NO CAUSATION WITHOUT MANIPULATION” (Holland, 1986,
page 959), which seems to eliminate attributes such as sex or race as possible causes, although Rubin
softens this perspective somewhat by describing ways in which sex might be a manipulation (Rubin, 1986,
page 962). Clearly, researchers must consider carefully in what sense some factors can be considered
causes.
60. Consider, for example, the experiment described earlier in which randomly assigned special tutoring
first causes a rise in self-esteem and then an increase in test scores, but the increase in self-esteem does not
cause the increase in test scores. The NRH framework would incorrectly treat self-esteem as the cause of
the increased test scores because self-esteem is randomly assigned and it precedes and is associated with
the rise in test scores. Clearly something more than temporal priority is needed for causal priority.
Table 4 describes a simple situation where we are investigating whether a hammer blow to a glass
will or will not break it. We assume that we have two glasses. The treatment is the hammer
blow. The control condition is no hammer blow. For the moment consider the row for glass
number one and the entries for the outcome variables Y1t or Y1c that are listed in the center of the
next to last row of the table. The subscript “1” for these outcome variables indicates they are for
glass number one, and the superscripts “t” or “c” indicate whether they are for the treatment or
the control condition. These variables take on the values of zero if the glass is not broken and one
if the glass is broken. The realized values of these variables, that is, the ones that are actually
observed, depend upon61 whether glass number one gets the treatment or the control condition.
These conditions are mutually exclusive states of the world in the sense that if the glass is in one
of them, then it cannot be in the other. The glass either gets the treatment or the control
condition. Consequently, only one of the columns can be observed for glass number one (or any
other glass), and the final column which provides an evaluation of the impact of the treatment
cannot be computed row by row because one of the two quantities is not observed. This
unobserved quantity is the counterfactual outcome. In the introduction to this section, the
counterfactual outcome for the glass was for the control condition, but we have not yet made any
assumptions about which glass in Table 4 does or does not get the treatment.
In practice, therefore, those doing causal inference must find some way to get a substitute value
for the unobserved counterfactual outcome in the final column of Table 4. How can this be done?
Suppose, for example, the researcher hits glass number one with a hammer blow and observes
that the glass is broken so that Y1t = 1. Some substitute is then needed for Y1c, the counterfactual
situation where the hammer blow is not struck against glass number one. One possibility is to
observe the glass just before it was hit by the hammer and to take that value as a substitute for Y1c.
Let us assume that the glass was unbroken in the moment before the hammer hit so that we set
Y1c = Y1c* = 0, where Y1c* is the state of the glass a moment (indicated by c*) before the hammer
hit. In this case, we might conclude that the hammer blow is causally connected with the broken
glass because the treatment is associated with the glass breaking and the glass was unbroken the
moment before the hammer hit. That is, there is a difference between the outcomes t and c*:
Y1t - Y1c* = 1 - 0 = 1.
This approach to inference makes use of what Holland (1986) calls the “temporal stability” and
“causal transience” assumptions. Temporal stability assumes the constancy of response over
time so that Y1c = Y1c*. That is, the observation of the glass at c* is assumed to be the same as the
observation would have been at c. Many of our everyday causal inferences, such as the belief that
our turning the key in the ignition made the car start, are based upon this assumption. We believe
that since the engine was not going just before we turned the key, it would not have been going a
moment later if we had not turned the key. This assumption, however, is risky when things
change over time as they often do. The second assumption of “causal transience,” which might
be better called “measurement transience,” asserts that the act of observing the state of the glass
did not change it in any way. This assumption sometimes makes sense with inanimate objects,
61. We are using the term “depend upon” to mean that they may vary from one condition to another, but we
do not mean to imply that they must vary in some way. In fact, the whole enterprise is designed to find out
whether or not they vary based solely upon whether or not the glass is in the treatment or control condition.
but is worrisome with things that react to measurement by either learning or changing their
behaviors. For example, if we provide extra tutoring to a student after a poor test performance,
we cannot presume that the student’s improvement on a subsequent test is due to our tutoring – it
may have been due to learning (about test-taking) from the first test or from motivation to do
better from the poor showing on the first test.
Thus a skeptic could reasonably claim that the glass broke because of some other factor “Z” such
as a nearby high-pitched sound that cracked the glass by sounding just after our observation Y1c*
and just before the hammer blow (thereby violating “temporal stability”) or because of our
handling of the glass when we checked to make sure it was not broken (thereby violating “causal
transience”). That is, if it were possible to observe the actual Y1c it would have a value of one
indicating that the glass would have broken without a hammer blow because of the high-pitched
sound or our rough handling of it. Consequently there is no difference between Y1t and Y1c (both
indicate a broken glass), and the hammer blow did not break the glass.62 The problem here is that
Y1c* is similar to Y1c, but it is not identical with it. Of all the worlds in which the hammer blow is
not struck, Y1c* is not quite the closest possible world to the treatment situation because it does
not yet include the operation of the high-pitched sound which came just after the observation Y1c*.
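The skeptic’s scenario can be made concrete with a small numerical sketch (all values hypothetical): under temporal stability, the pre-treatment observation Y1c* stands in for the unobservable Y1c, and the inference goes wrong precisely when some factor Z intervenes between the moment of observation and the treatment.

```python
# Hypothetical values for the glass example; 1 means "broken", 0 means "intact".
y1_t = 1        # observed: the glass is broken after the hammer blow
y1_c_star = 0   # observed: the glass was intact a moment before the blow

# Temporal stability assumes Y1c == Y1c*, so we substitute Y1c* for Y1c:
inferred_net = y1_t - y1_c_star
print(inferred_net)   # 1 -> the blow appears causally connected to the break

# But if a high-pitched sound (factor Z) acted between the observation and
# the blow, the true counterfactual outcome is Y1c = 1 (broken anyway):
y1_c_true = 1
true_net = y1_t - y1_c_true
print(true_net)       # 0 -> no net effect of the hammer blow after all
```

The substitute observation is taken in a world that is close to, but not identical with, the closest possible world in which the treatment is withheld; the gap between `inferred_net` and `true_net` is exactly the room in which factor Z operates.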
What arguments can be used to dissuade the skeptic who makes these arguments? The problem is
to rule out factor Z. In practice, those doing causal inference seek to do this by replicating the
hypothetical comparison of Yt and Yc through real-world comparisons across (hopefully) similar
units some of which are exposed to the treatment and some of which are not. Thus, the researcher
could use a similar glass from the same manufacturer and of the same sort as a “control.” Let us
call this glass number two and affix a subscript “2” to Y for it, Y2. This approach uses what
Holland calls the “unit homogeneity” assumption in which units are prepared carefully “so that
they ‘look’ identical in all relevant aspects.” (Holland, 1986, page 948).
62. Note that this claim is not the same as the claim that “the hammer blow could not break the glass,”
which is a statement about capacities. It might be possible for a hammer blow to break the glass but in this
case the glass might have broken just before the hammer blow was struck.
Table 4 – Making Causal Inferences and Independence of Assignment and Outcome

                       Mutually Exclusive States of the World
Cases             Treatment Outcome      Control Outcome          Impact of Treatment
                  “Hit with hammer”      “Not hit with hammer”
Glass Number 1    Y1t                    Y1c                      Y1t - Y1c
Glass Number 2    Y2t                    Y2c                      Y2t - Y2c
If this second glass is placed in the control condition, then the researcher could observe Y2c and
substitute its value for Y1c. With this information, the researcher could calculate Y1t - Y2c to see if
hitting the glass with a hammer caused it to break. If the second glass did not break while the first
one did, then Y1t would be one and Y2c would be zero, apparently demonstrating that the hammer
caused the glass to break.
The skeptic, of course, might not be silenced. The doubter might argue that the high-pitched tone
(factor Z) only affected the glass in the treatment condition but not the glass in the control
condition (thereby violating “unit homogeneity”). This differential impact of Z is the crucial
problem that can confound a causal inference. Perhaps glass number one in the treatment
condition was closer to the high pitched sound, or glass number two in the control condition was
shielded in some way from the tone. In effect, unit homogeneity failed because it did not extend
to the relevant causal circumstances. The trick for the researcher is to make sure that the
circumstances of each glass are so similar that it is very difficult, if not impossible, to think of
some difference in any factor Z that might break one glass but not the other.
Technically what is required is that the two glasses be so similar and so similarly situated that if
both glasses were in the control condition, then both would have the same outcome (either broken
or unbroken) and if both glasses were in the treatment condition, then both would have the same
outcome as well.
In terms of the quantities in Table 4, we require that Y1c and Y2c have the same value and Y1t and
Y2t have the same value.63 This condition amounts to saying that the circumstances of the objects
must be interchangeable in terms of the outcome variable Y. If we interchange their indices (one
becomes two and two becomes one), then we must have the same entries in Table 4. If this
requirement is met, then the values of the outcomes for the treatment condition are independent of
which glass is assigned to the treatment and the values of the outcomes for the control condition
are independent of which glass is assigned to the control condition. For our purposes, the two
glasses and their circumstances are identical.
63. Note that we could observe whether Y1c equals Y2c by not hitting either glass with a hammer, but then
we never observe what happens when we hit a glass with a hammer. Similarly we could observe whether
Y1t equals Y2t by hitting each glass with a hammer, but then we never observe what happens when we do
not hit a glass with a hammer. The problem is that no matter what we do, we can only get some of the
information that we need.
These independence conditions rule out the operation of any factor such as Z, and they can be
thought of as a definition of what we mean by “closest possible world.” If independence of
assignment and outcome holds, no matter which glass is assigned to the control condition, the
value of Y1c and Y2c must be the same. But if glass number one is affected by high-pitched notes
(the factor Z) and glass number two is not, then Y1c and Y2c will have different values – the first
glass will break (from the high pitched sound) without the hammer blow but the second one will
not break.64 Thus, if Z acts only on glass number one, the independence condition, which requires
Y1c and Y2c to be equal, will not hold.
If assignment is independent of the outcomes, the factor Z must either act on both glasses or on
neither. If it acts on both, then all four values in Table 4 will indicate a broken glass, and the
researcher will correctly conclude that the glass that was struck was not broken because of a
hammer blow. (The researcher should not necessarily conclude that hammer blows never break
glasses because the high-pitched sound may have broken the glass just before the hammer blow
would have broken it.) If Z acts on neither, then the researcher will either correctly conclude that
the hammer blow is associated with the glass breaking (if, in fact, the glass broke when it was hit
by the hammer) or that the hammer blow is not associated with the breaking of the glass (if, in
fact, the glass did not break when hit by the hammer).
Independence of assignment and outcome, therefore, is one of the crucial conditions that ensures
good causal inference. If Y1c equals Y2c and Y1t equals Y2t, then any observed difference between
Y1t and Y2c (or alternatively between Y2t and Y1c) must be due to the treatment and not to any other
factor. If independence does not hold, then one (or both) of these equalities does not hold, and
the outcome of the control or treatment condition will depend upon which glass is assigned to
each condition. If, for example, the high-pitched notes only affect the first glass, then the
assignment of the first glass as the control will result in its breaking whereas the assignment of the
second glass as the control will not lead to a broken glass.
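This dependence on assignment can be checked in a toy simulation (the potential-outcome values are invented for the two-glass example, and the `estimate` function is our own shorthand): when independence holds, swapping which glass gets the hammer leaves the estimated effect unchanged; when factor Z acts on only one glass, the estimate varies with the assignment.

```python
# Potential outcomes for two glasses, keyed by glass number; "t" and "c" give
# the treatment (hammer) and control outcomes. All values are hypothetical.

def estimate(treated_glass, control_glass, outcomes):
    """Observed treatment-minus-control difference for one assignment."""
    return outcomes[treated_glass]["t"] - outcomes[control_glass]["c"]

# Independence holds: Y1c == Y2c and Y1t == Y2t, so the glasses are interchangeable.
homogeneous = {1: {"t": 1, "c": 0}, 2: {"t": 1, "c": 0}}
print(estimate(1, 2, homogeneous))   # 1
print(estimate(2, 1, homogeneous))   # 1 -> same answer under either assignment

# Factor Z breaks glass 1 even without the hammer: Y1c = 1 but Y2c = 0.
z_on_glass_1 = {1: {"t": 1, "c": 1}, 2: {"t": 1, "c": 0}}
print(estimate(1, 2, z_on_glass_1))  # 1 - 0 = 1
print(estimate(2, 1, z_on_glass_1))  # 1 - 1 = 0 -> the estimate depends on assignment
```

Because only one cell per glass is ever observed in practice, this interchangeability can be simulated but never verified from real data, which is the point of the paragraphs that follow.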
When a real-world comparison is employed, the quality of the resulting causal inference depends
on how cases are “assigned” to the group that gets the treatment and the group that does not. If
the cases assigned to the treatment and control groups are different in terms of what would have
happened to them if they had all been in the treatment group or had all been in the control group,
then independence of assignment and outcome is violated and causal inference will be flawed.
The independence condition with its focus on the outcome variable Y is a very convenient way of
64. If hitting a glass with a hammer does not cause it to break, then the fact that the skeptic’s factor Z only
operates on glass one will also violate the requirement that the outcomes be the same in the treatment
condition (i.e., that Y1t = Y2t) because glass number one will break (because of Z) and glass number two will
not. But if a hammer blow usually does shatter a glass, although it did not do so in the case of glass
number one because the high-pitched tone did the job before the hammer could, then Y1t and Y2t would be
identical outcomes of broken glasses – but the first would be caused by the high-pitched sound and the
second (where Z was absent) would be caused by the hammer blow. Nevertheless, independence will still
be violated in this case because the first condition, that is Y1c = Y2c, will not be true. Cases of (nearly)
simultaneous causation like this can cause vexing problems which we discuss in a later section.
describing what we mean by the closest possible world in a counterfactual definition of causality,
but it is not a testable assumption. We cannot know whether the outcomes for the two glasses
would be identical if both got the hammer blow or both did not get it. There is no way that we
can know whether we could interchange the two glasses and get the same results.
All we can do is to try to control the circumstances of each case as much as possible to reduce the
chance that differences in them might mean that assignment would not be independent of
outcome. This discussion suggests two ways to provide that control. One is to consider whether
temporal stability and causal transience hold. Another is to consider whether unit homogeneity
holds. Confirming that these conditions hold requires a great deal of ancillary knowledge about
the world such as whether glasses tend to break on their own, whether there are other features of
the experimental situation (such as high pitched noises) that might cause them to break, and so
forth. In effect, it requires establishing that the glasses being compared in the treatment and
control condition are in closest possible worlds except for the difference in the treatment they get.
It also typically requires another, very subtle, assumption.
The SUTVA Assumption for Creating Mini-Possible Worlds – Perhaps the hammer blow is not the
same for all units or perhaps the invocation of the treatment for some of the cases causes changes
in the control cases. For example, the decision to hit some glasses with a hammer might change
the structure of the glass for some “nearby” control glasses, but this change would not occur if
none of the glasses were hit with a hammer. Or it would not occur for those glasses that were
farther away. In this situation, the outcome for a control case depends upon what happens to
other cases. This possibility seems unlikely given what we know, but consider the following.
Suppose people in a treatment condition are punished for poor behavior while those in a control
condition are not. Further suppose that those in the control condition who are “near” those in the
treatment condition are not fully aware that they are exempt from punishment or they fear that
they might be made subject to it. Wouldn’t their behavior change in ways that it would not have
changed if there had never been a treatment condition? Doesn’t this mean that it would be
difficult, if not impossible, to satisfy the conditions for independence of assignment and outcome?
In the Cal-Learn experiment in California, for example, teenage girls on welfare in the treatment
group had their welfare check reduced if they failed to get passing grades in school. Those in the
randomly selected control group were not subject to reductions but many thought they were in the
treatment group (probably because they knew people who were in the treatment group) and they
appear to have worked to get passing grades to avoid cuts in welfare (Mauldon et al., 19xx).65
Their decision to get better grades, however, may have led to an underestimate of the impact of
Cal-Learn because it reduced the difference between the treatment group and the control group.
The problem here is that there is interaction between the units. Similar problems arise if
supposedly identical treatments vary in effectiveness so that the causal effect for a specific unit
65. Experimental subjects were told which group they were in, but some apparently did not get the
message. They may not have gotten the message because the control group was only a small number of
people and almost all teenage welfare mothers in the state were in the treatment group. In these
circumstances, an inattentive teenager in the control group could have sensibly supposed that the program
applied to everyone. Furthermore, getting better grades seemingly had the desired effect because their
welfare check was not cut!
depends upon which bag of fertilizer the agricultural plot got or which teacher the student
had. To rule out these possibilities, Rubin (1990) proposed the “Stable-Unit-Treatment-Value
Assumption (SUTVA),” which asserts that the outcome for a particular case does not depend upon
what happens to the other cases or which of the supposedly identical treatments the unit
receives.66
SUTVA rules out a number of phenomena described in the literature. Agricultural experimenters
have worried that bags of supposedly identical fertilizer are different because of variations in the
manufacturing process. As a result, the causal effect of fertilizer for a plot may depend upon
which bag of fertilizer was applied to it. Agricultural experimenters have also worried that the
treatments given to agricultural plots could interact with one another if rainstorms cause the
fertilizer applied to one plot to flow into adjacent plots. As a result, the causal impact for a plot
may depend upon the pattern of assignment of fertilizer in neighboring plots. Researchers using
human subjects have worried about similar problems. Cook and Campbell (1986, page 148)
mention four fundamental threats to randomized experiments. Compensatory rivalry occurs when
control units decide that even though they are not getting the treatment, they can do as well as
those getting it. Resentful demoralization occurs when those not getting the treatment become
demoralized because they are not getting the treatment. Compensatory equalization occurs when
those in charge of control units decide to compensate for the perceived inequities between
treatment and control units, and treatment diffusion occurs when those in charge of control units
mimic the treatment because of its supposed beneficial effects.
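A minimal simulation of interference (with hypothetical response values, loosely patterned on the Cal-Learn story and on treatment diffusion) shows what SUTVA rules out: the same unit, in the same condition, produces different outcomes under different patterns of assignment.

```python
# Hypothetical outcome function: a control unit partially mimics the treatment
# whenever some nearby unit is treated (treatment diffusion). The 1.0 / 0.5 / 0.0
# response values are invented for illustration.

def outcome(condition, any_neighbor_treated):
    if condition == "t":
        return 1.0                       # treated units respond fully
    return 0.5 if any_neighbor_treated else 0.0

# Same unit, same condition ("c"), two different patterns of assignment:
print(outcome("c", any_neighbor_treated=False))  # 0.0
print(outcome("c", any_neighbor_treated=True))   # 0.5 -> SUTVA fails

# The observed treatment-control contrast shrinks from 1.0 to 0.5,
# understating the treatment effect, as in the Cal-Learn example.
print(outcome("t", True) - outcome("c", True))   # 0.5
```

Under SUTVA, `outcome` would depend only on the unit’s own condition; the second argument, which encodes the pattern of assignment, is precisely what the assumption forbids from mattering.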
SUTVA implies that each supposedly identical treatment really is identical and that each unit is a
separate, isolated possible world that is unaffected by what happens to the other units. SUTVA is
the master assumption that makes controlled or randomized experiments a suitable solution to the
problem of making causal inferences. SUTVA insures that treatment and control units really do
represent the closest possible worlds to one another except for the difference in treatment. In
order to believe that SUTVA holds, we must have a very clear picture of the units, treatments, and
outcomes in the situation at hand so that we can convince ourselves that experimental (or
observational) comparisons really do involve similar worlds. Rubin (1986, page 962) notes, for
example, that statements such as “If the females at firm f had been male, their starting salaries
would have averaged 20% higher” require much more elaboration of the counterfactual
possibilities before they can be tested. What kind of treatment, for example, would be required
for females to be males? Are individuals or the firm the basic unit of analysis? Is it possible to
simply randomly assign men to the women’s jobs to see what would happen to salaries? From
what pool would these men be chosen? If men were randomly assigned to some jobs formerly
held by women, would there be interactions across units that would violate SUTVA?
66. SUTVA amounts to assuming that the outcome values in each cell in Table 4 do not vary with the
pattern of assignment of treatments and controls or with the specific treatment or control given to the unit.
If SUTVA fails, then we must develop additional notation that specifies each of the four possible patterns
of assignment of treatment and control conditions [namely for (c,c), (c,t), (t,c), and (t,t), where the first
entry refers to the first glass and the second to the second glass] and that specifies each possible treatment
separately [with t1 and t2 considered different versions of the treatment and c1 and c2 different versions of
the control]. Combining these notations, we must consider (c1,c2) to be different from (c2,c1) and from
(c1,t1) and so forth. Thus, the entries in each cell in Table 4 will vary according to the pattern of
treatments and controls and the allocation of each version of treatments and controls.
Not surprisingly, if the SUTVA assumption fails, then it will be at best hard to generalize the
results of an experiment and at worst impossible to even interpret its results. Generalization is
hard if, for example, imposing a policy of welfare time-limits on a small group of welfare
recipients has a much different impact than imposing it upon every recipient. Perhaps the
imposition of limits on the larger group generates a negative attitude towards welfare that
encourages job-seeking which is not generated when the limits are only imposed on a few people.
Or perhaps the random assignment of a “Jewish” culture to one country (such as Israel) is much
different than assigning it to a large number of countries in the same area. In both cases, the
pattern of assignment to treatments seems to matter as much as the treatments themselves because
of interactions among the units, and the interpretation of these experiments might be impossible
because of the complex interactions among units. If SUTVA does not hold, then there are no
ways such as randomization to construct closest possible worlds, and the difficulty of determining
closest possible worlds must be faced directly.
If SUTVA holds and if there is independence of assignment and outcome,67 then the degree of
causal connection can be estimated.68 But there are no direct tests that can insure that these
assumptions hold, and much of the art in experimentation goes into strategies that will increase
the likelihood that they do hold. Cases can be isolated from one another to minimize interference,
treatments can be made as uniform as possible, and the characteristics and circumstances of each
case can be made as uniform as possible, but nothing can absolutely insure that SUTVA and the
independence of assignment and outcome hold.69
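The appeal of random assignment, mentioned as the fourth element of the framework, can be shown in a short Monte Carlo sketch. The potential outcomes below are invented, and the sketch illustrates mean independence only; it is not a test of the assumptions, which remain unverifiable in any single experiment. Although no one random assignment guarantees independence, across repeated random assignments the treatment-minus-control difference averages out to the true average NET.

```python
# Monte Carlo sketch with invented potential outcomes (Yt, Yc) for four units.
import random

random.seed(0)
units = [(1, 0), (1, 1), (0, 0), (1, 0)]
true_avg_net = sum(t - c for t, c in units) / len(units)   # 0.5

def one_random_experiment():
    shuffled = random.sample(units, len(units))
    half = len(shuffled) // 2
    treated, control = shuffled[:half], shuffled[half:]
    # Each group reveals only one potential outcome per unit.
    return (sum(t for t, _ in treated) / half -
            sum(c for _, c in control) / half)

estimates = [one_random_experiment() for _ in range(20000)]
avg_estimate = sum(estimates) / len(estimates)
print(true_avg_net)   # 0.5
print(avg_estimate)   # close to 0.5: unbiased on average, not in any one trial
```

Individual runs of `one_random_experiment` can return 0 or 1 rather than 0.5; randomization buys an unbiased estimate over the assignment distribution, not the right answer in any particular realization.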
Finding a Substitute for the Counterfactual Situation: Conditional Independence of Assignment
and Outcome – Independence of assignment and outcome is a very strong condition that can be
approached by thoroughgoing control over all confounding factors in the research situation, but it
67. These assumptions are logically independent of one another. SUTVA asserts that the values in Table 4
do not depend upon what ultimately happens to each glass, while independence of assignment and
outcomes refers to the columns having the same values. If SUTVA does not hold, the values of Y1t, Y1c, Y2t,
and Y2c will depend upon the overall pattern of assignment of treatment and control conditions and the
specific treatments and controls given to each unit. The values within each column could be equal for a
given pattern but different across different patterns so that independence of assignment and outcomes
would hold (although this seems highly unlikely in most instances), or the values within each column
might be different within a given pattern so that independence of assignment and outcomes would not
hold. If SUTVA does hold, it is easy to see that independence of assignment and outcomes might or might
not hold.
68. If SUTVA fails and independence of assignment and outcome obtains, then causal effects can also be
estimated, but they will differ depending upon the pattern of treatments. Furthermore, the failure of
SUTVA may make it impossible to rely upon standard methods such as experimental control or
randomization to ensure that the independence of assignment and outcome holds because the interaction of
units may undermine these methods.
69. Rosenbaum worries that SUTVA incorporates too much and that “it seems to bear a distinct
resemblance to an attic trunk; what does not fit is neatly folded and packed away.” (Rosenbaum, 1987,
page 313) He would like “to see SUTVA divided up into a series of more tangible assumptions with
practical interpretations, so that violations could be quickly discerned and perhaps addressed.” (Page 110).
requires a degree of control that is seldom possible. Even physical scientists typically have to
make corrections in their observations because of confounding factors such as stray sources of
particles in high energy accelerators or stray sources of electromagnetic radiation that affect radio
or visible light telescopes. When these scientists do this, they are using a weaker assumption
called conditional independence.
Conditional independence holds when there is some confounding variable Z that produces violations
of independence of assignment, but any subgroup of cases with the same value of Z satisfies the
condition for independence of assignment. In this case, each subgroup can be analyzed separately.
Thus, in the hammer and glass example, the effect of the hammer blow treatment for all glasses
subject to the high-pitched tone can be analyzed as one subgroup and the effect of the hammer
blow treatment for all those glasses not subject to the high-pitched tone can be analyzed as another
subgroup.
Table 5 – Causal Inference and Conditional Independence of Assignment and Outcome

                                    Mutually Exclusive States of the World
Cases              Subject to Z?    Treatment Outcome          Control Outcome
                                    “Hit with hammer”          “Not hit with hammer”
Glass Number 1     Yes              Y1t                        Y1c
Glass Number 2     Yes              Y2t                        Y2c
Glass Number 3     No               Y3t                        Y3c
Glass Number 4     No               Y4t                        Y4c
Table 5 provides a schematic of the situation. To simplify matters, assume that the only
confounding or concomitant variable is Z and that it either operates or it does not – either a glass is
in the range of the high-pitched tone or it is not. In Table 5, the first two glasses are subject to Z,
and they will have identical values of Y1c and Y2c and identical values of Y1t and Y2t.
Consequently, independence of assignment will hold true for these two glasses. Because both of
these glasses are affected by Z, all of their Y values are equal to one because the glasses will break from
the high-pitched tone. Similarly, glasses three and four will have identical values of Y3c and Y4c
and identical values of Y3t and Y4t. The values of Y3c and Y4c will be equal to zero because the
glasses will not break. The values of Y3t and Y4t will either both be zero if the hammer does not
cause glasses to break or one if the hammer does cause glasses to break. In any case,
independence of assignment will hold for these two glasses.
Although the outcome values for glasses within the two subgroups are identical because Y1c = Y2c =
1 and Y3c = Y4c = 0, these values are not identical across the two groups. The first two glasses are
subject to Z and will break without the hammer blow; the second two glasses are not subject to Z
and will not break without the hammer blow. As a result, independence of assignment and
outcome does not hold, but there is independence of assignment that is conditional on the value of
Z. There is conditional independence of assignment and outcome.
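The subgroup logic can be put in code. The following is an illustrative sketch, not part of the original text: the potential outcomes follow the glass example of Table 5, with the added assumption that the hammer does break glasses that are out of range of the tone.

```python
# Illustrative sketch (invented values): the glass example of Table 5, with
# potential outcomes filled in as described (1 = breaks, 0 = does not break),
# assuming the hammer does break glasses that are out of range of the tone.
glasses = [
    # (case, z, y_treat, y_control); z: subject to the high-pitched tone?
    ("glass 1", True,  1, 1),   # the tone breaks it with or without the hammer
    ("glass 2", True,  1, 1),
    ("glass 3", False, 1, 0),
    ("glass 4", False, 1, 0),
]

def effect_within_stratum(cases, z_value):
    """Average treatment effect among cases sharing the same value of Z."""
    stratum = [(yt, yc) for _, z, yt, yc in cases if z == z_value]
    return sum(yt - yc for yt, yc in stratum) / len(stratum)

print(effect_within_stratum(glasses, True))   # 0.0: the tone masks the hammer
print(effect_within_stratum(glasses, False))  # 1.0: the hammer's effect shows
```

Analyzing each value of Z separately is exactly what conditional independence licenses: within a stratum, assignment is as good as independent of the potential outcomes.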
Finding a Substitute for the Counterfactual Situation: Mean Independence of Assignment and
Outcome when there is Outcome Variability – The preceding version of conditional independence,
in which the cases are identical in the sense that the case values on the outcome variable are the
same within each column conditional on Z, is very strong. It might be met when there is a great
deal of control over the factors that distinguish the cases. We might be able to get
identical glasses in identical situations, or identical chemicals in identical situations, but in most
social science, outcomes are highly variable from case to case. For example, if the rows in Table 5
are welfare recipients who are involved in a job training program and the outcomes are wages in
subsequent jobs, then we might expect that the training program would work for some but not for
others and that some who did not even get job training would still get high wages. In short, the
row values in the control and the treatment groups would vary considerably.
Conditional independence of assignment and outcome can be extended to this circumstance where
there is variability in outcomes. The basic device is to be content with estimating only an average
causal effect and to require only that outcomes be similar on average.70 In this and the following
section, we begin by generalizing independence of assignment to average or mean independence of
assignment, and then we generalize still further to mean conditional independence.
Identical values of all the Yit and of all the Yic (where i represents different cases) are not required.
Rather, all that is needed is that cases are assigned in such a way that those in the treatment group
are similar to those in the control group. Table 6 shows what is needed. In this table, the first four
cases are control cases and the next four are treatment cases.71 Four averages are presented in the
table, and the requirements for mean independence of assignment and outcome are that:
– the average of the treatment outcomes for those cases used as controls (Ct) equals the
average of the treatment outcomes for those cases that get the treatment (Tt) and
– the average of the control outcomes for those cases getting the control condition (Cc)
equals the average of the control outcomes for those cases that get the treatment (Tc).
In sum, the two averages in each column, one for the treated cases and the other for the control
cases, must equal one another. On average the treatment and control groups must be similar.
Obviously, these conditions will be met if the values in each column are identical to one another as
in the example with the glasses, but we do not require this much.
If these conditions hold, then the researcher will be able to make a good inference about the causal
effect of the treatment by comparing the average of the observed Yt among the cases given the
treatment with the observed Yc among the cases given the control. If these conditions do not hold,
70. See Stone (19xx) for a discussion of the relationship between various definitions of causal impacts and
the types of assumptions about conditional independence required to estimate them.
71. This discussion ignores two important, but somewhat technical, issues. It does not mention how cases are
sampled, and it does not mention that we really need the expectations of the averages to be equal. It also
ignores some conditions that are stronger than equal means and which lead to stronger possibilities for
causal inference. See Rubin (1974) for a very clear exposition of these issues.
then the researcher will be in danger of making a biased causal inference.
One way to assess and to influence the probability of independence of assignment and outcome is
to use statistical randomization. If cases are randomly assigned to treatment and control
conditions, then we can calculate the probability that there are deviations from a given level of
independence. We can also develop statistics to see if observed differences between the treatment
and control group are due to chance or to a real causal impact of the treatment. Textbooks on
experimentation provide the details of how this can be done (Fisher, 19xx; Kempthorne, 1952;
Cox, 1958).
Table 6 – Independence of Assignment and Outcome with Variable Outcomes

                              Mutually Exclusive States of the World
Cases                Treatment Outcome                Control Outcome
Control Cases        Ct = (Y1t+Y2t+Y3t+Y4t)/4         Cc = (Y1c+Y2c+Y3c+Y4c)/4
  1                  Y1t                              Y1c
  2                  Y2t                              Y2c
  3                  Y3t                              Y3c
  4                  Y4t                              Y4c
Treatment Cases      Tt = (Y5t+Y6t+Y7t+Y8t)/4         Tc = (Y5c+Y6c+Y7c+Y8c)/4
  5                  Y5t                              Y5c
  6                  Y6t                              Y6c
  7                  Y7t                              Y7c
  8                  Y8t                              Y8c
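The randomization logic can be illustrated with a small permutation test. This is a hypothetical sketch with invented data for eight cases, not a calculation from the text: under the null hypothesis of no effect, every assignment of four treatment cases out of eight was equally likely, so the observed difference in means can be compared against all 70 possible reassignments.

```python
import itertools

# Hypothetical sketch (invented data, 8 cases): a permutation test for the
# difference in mean outcomes between treatment and control groups.
outcomes = [7.1, 6.8, 8.0, 7.4, 9.2, 8.9, 9.5, 8.7]  # observed Y
treated  = [0, 0, 0, 0, 1, 1, 1, 1]                  # assignment indicator

def mean_diff(y, d):
    """Treated-group mean minus control-group mean."""
    t = [yi for yi, di in zip(y, d) if di == 1]
    c = [yi for yi, di in zip(y, d) if di == 0]
    return sum(t) / len(t) - sum(c) / len(c)

observed = mean_diff(outcomes, treated)

# Under the null of no treatment effect, any 4-of-8 assignment was equally
# likely, so enumerate all C(8,4) = 70 reassignments and count those at
# least as extreme as the observed difference.
extreme = sum(
    1 for idx in itertools.combinations(range(8), 4)
    if mean_diff(outcomes, [1 if i in idx else 0 for i in range(8)]) >= observed
)
p_value = extreme / 70
print(observed, p_value)  # 1.75 and 1/70: no reassignment is as extreme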
Finding a Substitute for the Counterfactual Situation: Mean Conditional Independence when
there is Outcome Variability – In an observational study, randomization is not available, and the
mean independence assumption can easily fail for the same reason that independence of
assignment can fail for the example of a hammer striking glasses. If there is some factor Z, say
prior job experience among those getting job training, that affects the treatment cases but not the
control cases, then the average Ct may not equal Tt and the average Cc may not equal Tc. If, for
example, the treatment group has more prior job experience, then we would expect that even
without job training, their average wages, namely Tc, would be higher than those of the control
group Cc. And we would probably expect Tt to be higher than Ct as well.
What can be done in this situation? If, and this is a big if, the researcher can identify the variable
(or variables) Z that cause these departures from independence of assignment, then statistical
corrections can be made for the problem. The logic is simple, although the practice is hard
because Z is seldom known. Suppose that cases with the same value of Z satisfy mean
independence of assignment. This amounts to saying that it is possible to construct tables like
Table 6 for each value of Z in which Ct = Tt and Cc = Tc. If this is possible, then statistical
corrections can be made for the confounding caused by Z.
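The stratification logic can be sketched as follows. This is an illustrative example with invented wage data, assuming (as the text supposes) that mean independence holds within each value of Z, here coded as prior job experience: the effect is estimated within each stratum and the stratum estimates are averaged, weighted by stratum size.

```python
# Hypothetical sketch (invented data): correcting for a known confounder Z by
# stratification. Within each value of Z, mean independence is assumed to
# hold, so stratum-specific differences in means can be combined, weighted by
# stratum share, into an overall estimate of the effect.
# Each record: (z = prior job experience?, d = got training?, y = wage)
records = [
    (1, 1, 20.0), (1, 1, 22.0), (1, 0, 18.0), (1, 0, 20.0),
    (0, 1, 14.0), (0, 0, 12.0), (0, 0, 10.0), (0, 1, 16.0),
]

def stratified_effect(data):
    """Weighted average of within-stratum treatment/control mean differences."""
    z_values = sorted({z for z, _, _ in data})
    total, n = 0.0, len(data)
    for z in z_values:
        stratum = [(d, y) for zz, d, y in data if zz == z]
        t = [y for d, y in stratum if d == 1]
        c = [y for d, y in stratum if d == 0]
        diff = sum(t) / len(t) - sum(c) / len(c)
        total += diff * len(stratum) / n   # weight by stratum share
    return total

print(stratified_effect(records))
```

The hard part, as the text stresses, is not this arithmetic but knowing that Z (and only Z) produces the departures from independence of assignment.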
Summary of the NRH Approach – If SUTVA holds and if the conditional independence conditions
hold, then mini-closest-possible worlds have been created which can be used to compare the
effects in a treatment and control condition. If SUTVA holds, then there are three ways to get the
conditional independence conditions to hold:
(a) Controlled experiments in which either unit homogeneity holds or temporal stability
and causal (or measurement) transience holds.
(b) Statistical experiments in which random assignment holds.
(c) Observational studies in which corrections are made for covariates that ensure mean
conditional independence of assignment and outcome.
The mathematical conditions required for the third method to work follow easily from the
Neyman-Rubin-Holland set-up, but there is no method for identifying the proper covariates. And
outside of experimental studies, there is no way to be sure that conditional independence of
assignment and outcome holds. Even if we know about some Z that may confound our results, we
may not know about all of them, and without knowing all of them, we cannot be sure that
correcting for some of them ensures conditional independence. Thus observational studies face the
problem of identifying a set of Z variables that will ensure conditional independence so that the
impact of the treatment can be determined. A great deal of research, however, does this in a rather
cavalier way.
Even if SUTVA and some form of conditional independence are satisfied, the NRH framework, like
Lewis’s counterfactual theory to which it is a close relative, can only identify causal connections.
Additional information is needed to rule out spurious correlation and to establish the direction of
causation. Appeal can be made to temporal precedence or to what was manipulated to pin down
the direction of causation, but neither of these approaches provides full protection against a
common cause. More experiments or observations which study the impact of other variables
which suppress supposed causes or effects may be needed, and these have to be undertaken
imaginatively in ways that explore different possible worlds.
There is a further problem of moving from experiments to observational studies by using these
conditions. As shown in Chapter XX, the major method that has been used to make the statistical
corrections required for conditional independence is regression analysis in which the left-hand-side
variable is the outcome variable Y and the right-hand-side variables are the covariates Z and some
measure of the treatment and control such as a dummy variable X. The use of this technique has
led to two difficulties. First, unlike correlation analysis which is inherently symmetrical,
regression analysis is inherently asymmetrical. One variable has to be chosen as the left-hand-side
or “dependent” variable and others have to be chosen as the right-hand-side or “independent”
variables. It is all too easy for researchers to fall into the easy assumption that the left-hand-side
variable is the effect of the right-hand-side causes. Yet, once outside of the experimental
paradigm, there is no guidance about which variable should be considered the outcome or effect.
Observational studies very seldom have any built-in asymmetry that suggests the proper dependent
variable, but experiments do have this asymmetry which comes from one variable being
manipulated. Second, all the right-hand-side variables are treated symmetrically in regression.
Yet, the conditional independence framework treats covariates and treatments asymmetrically. If
conditional independence holds for the assignment of the treatment X and the outcome Y when the
covariates Z are controlled, it does not follow that we can interchange X and Z. Once again, it is
important to recognize that experiments identify putative causes through their manipulation of
them.
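One way to see what the regression correction does is the following sketch. The data are invented, and the partialling-out device is the standard Frisch-Waugh logic rather than anything the text itself invokes: regressing Y on a treatment dummy X while controlling for Z is equivalent to residualizing both Y and X on Z and then regressing residual on residual.

```python
# Hypothetical sketch (invented data): regression of Y on a treatment dummy X
# controlling for a covariate Z, done by partialling out (Frisch-Waugh):
# residualize Y and X on Z, then regress residual on residual.
def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

def residuals(x, y):
    """Residuals from the OLS regression of y on x (with intercept)."""
    b = slope(x, y)
    a = sum(y) / len(y) - b * sum(x) / len(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

z = [0, 1, 2, 3, 0, 1, 2, 3]                       # covariate
x = [0, 0, 0, 0, 1, 1, 1, 1]                       # treatment dummy
y = [1.5 * zi + 2.0 * xi for zi, xi in zip(z, x)]  # built-in effect of X: 2.0

rx = residuals(z, x)   # part of X not explained by Z
ry = residuals(z, y)   # part of Y not explained by Z
print(slope(rx, ry))   # coefficient on X controlling for Z: 2.0
```

Note the asymmetry the text insists on: X and Z enter the arithmetic identically, and nothing in the computation itself marks X as the cause; that designation must come from outside the regression, from manipulation or theory.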
Thus, those using the conditional independence framework for observational studies must do two
things. First, they must identify a variable X that has been or could be manipulated to affect some
outcome Y. The variable X is the putative cause and Y its effect. Then they must identify a set of
covariates Z which can be used to adjust the Y values so that the impact of X in the closest possible
world can be evaluated. In short, they must employ lessons from both the manipulation and
counterfactual theories of causality.
Conclusion: Causality and Explanation
Wesley Salmon ends his review of “Four Decades of Scientific Explanation” (1990) with a chapter
entitled “Peaceful Coexistence?” He finds that explanation has made a comeback after the logical
positivists and logical empiricists had written it off as humbug and metaphysics. There are two
major approaches to explanation. One is the unification approach that seeks general laws and a
reduction in the number of independent assumptions needed to explain what happens in the
world. Salmon calls this a “top-down” approach which has close kinship with the neo-Humean
theories of causality. The second approach builds explanations from the “bottom-up” analysis of
causation for singular events and the investigation of causal mechanisms. Although these two
approaches have fought with one another, Salmon considers them to be complementary aspects of
scientific understanding. Sometimes, it turns out, we explain things best by appealing to a very
general principle or law, but at other times, we explain them best when we appeal to specific
events and mechanisms. There is no need to enshrine one approach over the other. Both have
their uses.
Bibliography [Incomplete]
Abdullah, Dewan A. and Peter C. Rangazas, “Money and the Business Cycle: Another Look (in
Notes),” The Review of Economics and Statistics, Vol. 70, No. 4. (Nov., 1988), pp. 680-685.
Achen, Christopher H., “Toward Theories of Data: The State of Political Methodology,” in
Political Science: The State of the Discipline, Ada Finifter (editor), Washington, D.C.,
American Political Science Association, 1983.
Angrist, Joshua D., “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social
Security Administrative Records,” The American Economic Review, Vol. 80, No. 3. (Jun.,
1990), pp. 313-336.
Bartels, Larry M., Presidential Primaries and the Dynamics of Public Choice, Princeton, N.J.:
Princeton University Press, 1988.
Bartels, Larry and Henry E. Brady, “The State of Quantitative Political Methodology,” in
Political Science: The State of the Discipline, 2nd Edition, Ada Finifter (editor),
Washington, D.C.: American Political Science Association, 1993.
Beauchamp, Tom L. and Alexander Rosenberg, Hume and the Problem of
Causation, New York: Oxford University Press, 1981.
Bennett, Jonathan, Events and Their Names, Indianapolis: Hackett Publishing
Company, 1984.
Russell, Bertrand, “On the Notion of Cause,” in Mysticism and Logic and Other
Essays, New York: Longmans, Green and Co., 1918.
Brady, Henry E., “Knowledge, Strategy and Momentum in Presidential Primaries,” in Political
Analysis, John Freeman (editor), Ann Arbor: University of Michigan Press, 1996.
Brady, Henry E., Michael C. Herron, Walter R. Mebane, Jasjeet Singh Sekhon, Kenneth W.
Shotts, and Jonathan Wand, “Law and Data: The Butterfly Ballot Episode,” PS:
Political Science & Politics, v34, n1 (2001), pp. 59-69.
Brady, Henry E., Mary H. Sprague, Fredric C. Gey and Michael Wiseman, “The Interaction
of Welfare-Use and Employment Dynamics in Rural and Agricultural California Counties,”
2000.
California Work Pays Demonstration Project: County Welfare Administrative Data, Public Use
Version 4.1, Codebook, Berkeley, California: UC DATA Archive and Technical
Assistance, 2001.
Campbell, Donald T. and Julian C. Stanley, Experimental and Quasi-Experimental Designs for
Research, Chicago: Rand McNally, 1966.
Card, David and Alan B. Krueger, “Minimum Wages and Employment: A Case Study of the
Fast-Food Industry in New Jersey and Pennsylvania,” The American Economic Review,
Vol. 84, No. 4. (Sep., 1994), pp. 772-793.
Cartwright, Nancy, Nature's Capacities and Their Measurement, New York: Oxford
University Press, 1989
Chatfield, Chris,”Model Uncertainty, Data Mining and Statistical Inference,” Journal of the
Royal Statistical Society Series A (Statistics in Society), Vol. 158, No. 3. (1995), pp. 419-466.
Cook, Thomas D. and Donald T. Campbell, Quasi-Experimentation: Design & Analysis Issues
for Field Settings, Boston: Houghton Mifflin Company, 1979.
Cook, Thomas D. and Donald T. Campbell, “The Causal Assumptions of Quasi-Experimental Practice,” Synthese, Vol. 68, pp. 141-180, 1986
Cox, David Roxbee, “Causality: Some Statistical Aspects,” Journal of the Royal Statistical
Society. Series A (Statistics in Society), Vol. 155, No. 2 (1992), pp. 291-301.
Cox, Gary W., Making Votes Count : Strategic Coordination in the World's Electoral
Systems, New York: Cambridge University Press, 1997.
Dessler, David, “Beyond Correlations: Toward a Causal Theory of War,” International
Studies Quarterly, Vol. 35, No. 3. (Sep., 1991), pp. 337-355.
Elster, Jon, “A Plea for Mechanisms,” in Social Mechanisms, Peter Hedstrom and Richard
Swedberg (editors),Cambridge: Cambridge University Press, 1998.
Ehring, Douglas, Causation and Persistence: A Theory of Causation. New York:
Oxford University Press, 1997.
Fearon, James D., “Counterfactuals and Hypothesis Testing in Political Science” in
World Politics, Vol. 43, No. 2. (Jan 1991), pp. 169-195.
Ferber, Robert and Werner Z. Hirsch, “Social Experimentation and Economic Policy: A
Survey,” Journal of Economic Literature, Volume 16, Issue 4 (Dec., 1978), pp. 1379-1414.
Firebaugh, Glenn and Kevin Chen, “Vote Turnout of Nineteenth Amendment Women: The
Enduring Effect of Disenfranchisement,” American Journal of Sociology, Vol. 100, No. 4.
(Jan., 1995), pp. 972-996.
Fisher, Ronald Aylmer, Sir, The Design of Experiments, Edinburgh, London: Oliver and Boyd,
1935.
Fraker, Thomas and Rebecca Maynard, “The Adequacy of Comparison Group
Designs for Evaluations of Employment-Related Programs (in Symposium on the
Econometric Evaluation of Manpower Training Programs),” The Journal of
Human Resources, Vol. 22, No. 2 (1987), pp. 194-227.
Franke, Richard Herbert and James D. Kaul, “The Hawthorne Experiments: First
Statistical Interpretation,” American Sociological Review, Vol. 43, No. 5 (1978), pp. 623-643.
Freedman, David A., “Statistical Models and Shoe Leather” in Sociological
Methodology, Vol. 21 (1991), pp. 291-313.
Freedman, David A., “As Others See Us: A Case Study in Path Analysis” in Journal of
Educational Statistics, Vol. 12. No. 2 (1987), pp. 101-223, with discussion.
Freedman, David A., “From Association to Causation via Regression,” in V. R.
McKim and S. P. Turner (editors), Causality in Crisis? Notre Dame IN: University of
Notre Dame Press, 1997, pp. 113-161.
Freedman, David A., “From Association to Causation: Some Remarks on the History of
Statistics,” Statistical Science, Vol. 14 (1999), pp. 243–58.
Galison, Peter Louis, How Experiments End, Chicago : University of Chicago Press, 1987.
Gasking, Douglas, “Causation and Recipes,” Mind, New Series, Vol. 64, No. 256 (Oct. 1955), pp.
479-487.
Glennan, Stuart S., “Mechanisms and the Nature of Causation,” Erkenntnis, Vol.
44 (1996), pp. 49-71.
Goldthorpe, John H., “Causation, Statistics, and Sociology,” European Sociological
Review, Vol. 17, No. 1 (2001), pp. 1-20.
Goodman, Nelson, “The Problem of Counterfactual Conditionals,” Journal of
Philosophy, Vol. 44, No. 5. (Feb 1947), pp. 113-128.
Greene, William H., Econometric Analysis, Upper Saddle River, New Jersey: Prentice-Hall,
1997.
Harre, Rom and Edward H. Madden, Causal Powers: A Theory of Natural Necessity.
Imprint: Oxford, [Eng.]: B. Blackwell, c1975.
Hausman, Daniel M, Causal Asymmetries. Imprint: Cambridge, U.K.; New York:
Cambridge University Press, 1998.
Heckman, James J., “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, No.
1. (Jan., 1979), pp. 153-162.
Heckman, James J., “Randomization and Social Policy Evaluation,” in Evaluating Welfare and
Training Programs, Charles F. Manski and Irwin Garfinkel (editors), Cambridge, MA:
Harvard University Press, 1992.
Heckman, James J. and V. Joseph Hotz, “Choosing Among Alternative Non-experimental
Methods for Estimating the Impact of Social Programs: The Case of Manpower Training:
Rejoinder (in Applications and Case Studies), Journal of the American Statistical
Association, Vol. 84, No. 408. (Dec., 1989), pp. 878-880.
Heckman, James and Richard Robb, “Alternative Methods for Evaluating the Impact of
Interventions,” in Longitudinal Analysis of Labor Market Data, James Heckman and
Burton Singer (editors), New York: Wiley, 1995.
Heckman, James J. and Jeffrey A. Smith, “Assessing the Case for Social Experiments,” The
Journal of Economic Perspectives, Volume 9, Issue 2 (Spring 1995), pp 85-110.
Hedstrom, Peter and Richard Swedberg (editors), Social Mechanisms: An
Analytical Approach to Social Theory, New York: Cambridge University Press, 1998.
Hempel, Carl G., Aspects of Scientific Explanation, New York: Free Press, 1965.
Hill, A. Bradford, “The Environment and Disease: Association or Causation?, ” Proceedings of
the Royal Society of Medicine, Vol. 58 (1965), pp. 295-300.
Holland, Paul W., “Statistics and Causal Inference (in Theory and Methods),” Journal of the
American Statistical Association, Vol. 81, No. 396. (Dec., 1986), pp. 945-960.
Holland, Paul W., “Causal Inference, Path Analysis, and Recursive Structural Equations,”
Sociological Methodology, Vol. 18. (1988), pp. 449-484.
Holland, Paul W. and Donald B. Rubin, “Causal Inference in Retrospective
Studies,” Evaluation Review, Vol. 12 (1988), pp. 203-231.
Hotz, V. Joseph, Guido W. Imbens and Jacob A. Klerman, “The Long-Term Gains from
GAIN: A Re-Analysis of the Impacts of the California GAIN Program,” September 2001.
Hotz, V. Joseph, Charles H. Mullin, and Seth G. Sanders, “Bounding Causal Effects Using
Data From a Contaminated Natural Experiment: Analysing the Effects of Teenage
Childbearing,” The Review of Economic Studies, Vol. 64, No. 4, Special Issue: Evaluation
of Training and Other Social Programmes. (Oct., 1997), pp. 575-603.
Hume, David, A Treatise of Human Nature (1739), edited by L. A. Selby-Bigge and P.H.
Nidditch, Oxford: Clarendon Press, 1978.
Jenkins, Jeffery A., “Examining the Bonding Effects of Party: A Comparative Analysis of Roll-Call Voting in the U.S. and Confederate Houses,” American Journal of Political Science,
Vol. 43, No. 4. (Oct., 1999), pp. 1144-1165.
Jones, Stephen R. G., “Was There a Hawthorne Effect?” American Journal of
Sociology, Vol. 98. No. 3. (1992), pp. 451-468.
Judge, George G., R. Carter Hill, William E. Griffiths, Helmut Lütkepohl and Tsoung-Chao
Lee, Introduction to the Theory and Practice of Econometrics, New York: John Wiley and
Sons, 1988.
Lakoff, George and Mark Johnson, “Conceptual Metaphor in Everyday Language” in
The Journal of Philosophy, Vol. 77, No. 8. (Aug., 1980), pp. 453-486, (1980a).
Lakoff, George and Mark Johnson, Metaphors We Live By, Chicago: University of
Chicago Press, 1980 (1980b).
Lakoff, George and Mark Johnson, Philosophy in the Flesh: The Embodied Mind
and its Challenge to Western Thought, New York: Basic Books, 1999.
LaLonde, Robert J., “Evaluating the Econometric Evaluations of Training Programs with
Experimental Data,” The American Economic Review, Vol. 76, No. 4. (Sep., 1986), pp.
604-620.
Lewis, David, Counterfactuals, Cambridge: Harvard University Press, 1973 (1973a).
Lewis, David, “Causation,” Journal of Philosophy, Vol. 70, No. 17, (Oct 1973), pp.
556-567 (1973b).
Lewis, David, “Counterfactual Dependence and Time's Arrow,” Noûs, Vol. 13, No. 4,
Special Issue on Counterfactuals and Laws, pp. 455-476 (Nov 1979).
Lewis, David, Philosophical Papers, New York: Oxford University Press, Vol. 2,
1986.
Lichbach, Mark Irving, The Rebel's Dilemma, Ann Arbor : University of Michigan Press, 1995.
Lichbach, Mark Irving, The Cooperator's Dilemma, Ann Arbor : University of Michigan Press,
1996.
Lijphart, Arend, Electoral Systems and Party Systems : A Study of Twenty-seven Democracies,
1945-1990, Oxford ; New York: Oxford University Press, 1994.
Machamer, Peter, Lindley Darden, and Carl F. Craver, “Thinking about
Mechanisms” in Philosophy of Science, Vol. 67, No. 1 (2000), pp. 1-25.
Mackie, John L., “Causes and Conditions,” American Philosophical Quarterly,
2/4 (1965), pp. 245-64.
Manski, Charles F., “Identification Problems in the Social Sciences,” Sociological Methodology,
Vol. 23. (1993), pp. 1-56.
Manski, Charles F., Identification Problems in the Social Sciences,” Cambridge, Mass.: Harvard
University Press, 1995.
Marini, Margaret Mooney, and Burton Singer, “Causality in the Social Sciences,”
Sociological Methodology, Vol. 18 (1988), pp. 347-409.
Mauldon, Jane, Jan Malvin, Jon Stiles, Nancy Nicosia and Eva Seto, “Impact of California’s
Cal-Learn Demonstration Project: Final Report”, UC DATA Archive and Technical
Assistance, 2000.
Mellor, D. H., The Facts of Causation, London: Routledge, 1995.
Menzies, Peter and Huw Price, “Causation as a Secondary Quality,” British
Journal for the Philosophy of Science, Vol. 44, No. 2 (1993), pp. 187-203.
Metrick, Andrew, “Natural Experiment in "Jeopardy!",” The American Economic Review, Vol.
85, No. 1. (Mar., 1995), pp. 240-253.
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin, “Workers' Compensation and Injury
Duration: Evidence from a Natural Experiment,” The American Economic Review, Vol. 85,
No. 3. (Jun., 1995), pp. 322-340.
Mill, John Stuart, A System of Logic, Ratiocinative and Inductive, 8th Edition,
New York: Harper and Brothers, 1888.
Pearson, Karl, The Grammar of Science, 3rd Edition, Revised and Enlarged, Part 1.
– Physical, London: Adam and Charles Black, 1911.
Pindyck, Robert S. and Daniel L. Rubinfeld, Econometric Models and Economic Forecasts,
New York: McGraw-Hill, 1991.
Ragin, Charles C., The Comparative Method: Moving beyond Qualitiative and
Quantitative Strategies. Imprint: Berkeley: University of California Press, 1987.
Riccio, James et al., “GAIN: Benefits, Costs, and Three-Year Impacts of a Welfare-to-Work
Program”, Manpower Demonstration Research Corporation, 1994.
Rosenbaum, Paul R. and Donald B. Rubin, “The Central Role of the Propensity Score in
Observational Studies for Causal Effects,” Biometrika, Vol. 70, No. 1. (Apr., 1983), pp. 41-55.
Rosenzweig, Mark R. and Kenneth I. Wolpin, “Testing the Quantity-Quality Fertility Model: The
Use of Twins as a Natural Experiment,” Econometrica, Vol. 48, No. 1. (Jan., 1980), pp.
227-240.
Rubin, Donald B., “Estimating Causal Effects of Treatments in Randomized and Nonrandomized
Studies,” Journal of Educational Psychology, Vol. 66. No. 5 (1974), pp. 688-701.
Rubin, Donald B., “Bayesian Inference for Causal Effects: The Role of Randomization,” Annals
of Statistics, Vol. 6, No. 1. (Jan., 1978), pp. 34-58.
Rubin, Donald B., “Comment: Neyman (1923) and Causal Inference in Experiments and
Observational Studies.” Statistical Science, Vol. 5, No. 4 (1990), pp. 472-480.
Salmon, Wesley C., Four Decades of Scientific Explanation, Imprint: Minneapolis:
University of Minnesota Press, 1989.
Sheffrin, Steven M., Rational Expectations, Cambridge [Cambridgeshire] ; New York :
Cambridge University Press, 1983.
Simon, Herbert A., “On the Definition of the Causal Relation,” The Journal of
Philosophy, Vol. 49, No. 16(Jul 1952), pp. 517-528.
Simon, Herbert A. and Yumi Iwasaki, “Causal Ordering, Comparative Statics, and
Near Decomposability,” Journal of Econometrics, Vol. 39 (1988), pp. 149-173.
Sobel , Michael E., “Causal Inference in the Social and Behavioral Sciences,” in Handbook of
Statistical Modeling for the Social and Behavioral Sciences, Gerhard Arminger, Clifford C.
Clogg, and Michael E. Sobel (editors), New York: Plenum Press, 1995.
Sorenson, Aage B., “Theoretical Mechanisms and the Empirical Study of Social Processes,” in
Social Mechanisms, Peter Hedstrom and Richard Swedberg (editors), Cambridge:
Cambridge University Press, 1998.
Sosa, Ernest and Michael Tooley, Causation, edited by Ernest Sosa and Michael
Tooley. Imprint: Oxford; New York: Oxford University Press, 1993.
Sprinzak, Ehud, “Weber's Thesis as an Historical Explanation,” History and Theory, Vol. 11,
No. 3. (1972), pp. 294-320.
Tetlock, Philip E. and Aaron Belkin (editors), Counterfactual Thought Experiments in World
Politics: Logical, Methodological, and Psychological Perspectives, Imprint: Princeton,
N.J.: Princeton University Press, 1996.
von Wright, Georg Henrik, Explanation and Understanding. Ithaca, N.Y.: Cornell University
Press, 1971.
von Wright, Georg Henrik, Causality and Determinism, New York: Columbia University
Press, 1974.
Wand, Jonathan N., Kenneth W. Shotts, Jasjeet S. Sekhon, Walter R. Mebane,
Michael C. Herron, and Henry E. Brady, “The Butterfly Did It: The Aberrant Vote for
Buchanan in Palm Beach County, Florida,” American Political Science Review, Vol. 95,
No. 4 (Dec. 2001), pp. 793-810.
Wawro, Geoffrey, The Austro-Prussian War : Austria's war with Prussia and Italy in
1866, New York: Cambridge University Press, 1996.
Weber, Max, Selections in Translation, W.G. Runciman (editor) and E. Matthews (translator),
Cambridge: Cambridge University Press, 1906 [1978].
Appendix 1 -- Causal Independence Among the Causes of a Given Effect
Lewis claims that when C causes E but not the reverse “then it should be possible to claim the
falsity of the counterfactual ‘If E did not occur, then C would not occur.’” This counterfactual is
different from “if C occurs then E occurs” and from “if C does not occur then E does not occur”
which Lewis believes must both be true when C causes E. The required falsity of ‘If E did not
occur, then C would not occur’ adds a third condition for causality.
The intuition for this third requirement is that causes produce their effects but not vice-versa.
Consequently, nullifying a cause should nullify an effect because the effect cannot be produced
without the cause, but nullifying an effect should not nullify the cause because the presence or
absence of an effect can have no determinant impact on one of its causes. This third counterfactual
requirement also helps to rule out common causes.
Lewis presents a complicated rationale for why the counterfactual “If E did not occur, then C
would not occur” would be false in what he considers the closest possible world to the world in
which C truly causes E.72 Our reading of Lewis’s articles leads us to conclude that his theory is
contrived and somewhat outlandish.73 Moreover, in an experiment where the cause is effective,
both C and E fail to occur in the control condition, and it appears as if we can assert the truth,
from our empirical observations, of the proposition that “if E does not occur, then C does not
occur.” Thus, it seems as if the counterfactual theory and experimental observations could lead to
the conclusion that E causes C as well as the reverse.
What can be done? One avenue might be to abandon the counterfactual theory, but there are two
other paths that might lead us out of this unhappy result. One path is to assert that of all the
possible worlds in which the effect E does not occur, the control group should not be considered
the closest possible world to the treatment group.74 This approach, however, seems foolish. Of all
the worlds in which E does not occur, the only additional way that the control group differs from
the treatment group is that C does not occur. Hence, it does not seem possible that we can find any
world in which E does not occur that is any closer to the treatment group.
Another way the counterfactual might be false is if there is another possible world, just as close to
the treatment group as the control group, in which E does not occur, but C does occur. A
moment’s reflection suggests that in this world there must be another factor, call it W, that is
72
His arguments are in a series of papers (1979; 1986) which have been criticized by several authors (e.g.,
Bennett, 1984; Hausman, 1999, Chapter 6).
73
Among other things, Lewis requires “miracles” in his possible worlds which might make sense for a
metaphysical (ontological) definition of causation, but miracles seem well beyond the capacity of most
social scientists who want a practical method for discovering causation in their everyday work.
74
Note that we have no quarrel with the notion that the closest possible world to the treatment group in
which the cause does not occur is the control group, but we are questioning whether the closest possible
world to the treatment group in which the effect does not occur is the control group. There is no a priori
reason why the control group should be the closest to the treatment group in both circumstances.
necessary (actually INUS)75 for the effect, and the absence of this factor must prevent E from
occurring. Thus, if the cause is a short-circuit (C) and the effect (E) is the building burning down,
then W might be the requirement that the building is fabricated of wood. If a short-circuit occurs,
but the building is not fabricated of wood (say it is made of brick),76 then the building will not burn
down. That is, “if E does not occur, then C can still occur because W did not occur”.
Consequently there are now two possible worlds, equally close to the treatment world, in which
the building does not burn down. In one world, the building does not burn down, the short-circuit
occurs, and the building is made of brick. Thus we have (not E, C, not W). In the original
control group world, the building does not burn down, the short-circuit does not occur, and the
building is made of wood. Thus we have (not E, not C, and W). These two worlds can be considered
equally close to the treatment world in which the building burns down, there is a short-circuit,
and the building is made of wood (E, C, and W) because they differ in exactly the same number of
ways.77 Thus, it is not necessarily true that “if E did not occur, then C would not occur”
because it is also possible
that “if E did not occur, then W would not occur.” A counterfactual that is true is that “if E did not
occur, then either C would not occur or W would not occur.” The only requirement for this result
is that there must be more than one INUS cause for E. There must be at least two different causes
of a common effect which can occur independently of one another. Short-circuits, for example,
can occur outside wooden buildings and wooden buildings can occur without short circuits. The
existence of each is independent of the other. Hausman (1999, pages 64-70) makes a strong
argument that this “independence of causes” will always be true in any case of interest.
To determine the direction of causation, we can perform an experiment in which we construct
three possible worlds. We can assign one group C and W, another W and not C, and still another C
and not W. For completeness, we might also want to assign one group neither C nor W, but this is
not necessary for the argument. The result will be the pattern of four possible worlds depicted in
Table A-1. The entries in the table are the expected effects for each world given the true
(deterministic) causal relationships. Note that the entries in this table are symmetrical in C and W
(interchanging these two variables leads to the same entries in the table), but they are not
symmetrical in C and E or W and E. From an observational perspective, this asymmetry is the
reason that the data can support the claim that C (or W) causes E. Specifically, we observe from
three entries in the table (all but the bottom right-hand corner) that “if the building does not burn
down (not E), then either the short circuit did not occur (not C) or the building was not made of
wood (not W).” But the counterfactual conditional “if E does not occur, then C does not occur” is
not true, because the building could be saved from burning down by being made of brick – by W
not occurring (see the top right-hand entry in the table). Hence, we can obtain the falsity of the
counterfactual conditional that we need to show that C (and W for that matter) are causes of the
building burning down. That is, the pattern of entries in Table A-1 supports the claim that W and
75
Remember that INUS is really nothing more than the statement that W could cause E in conjunction with
the right set of things but that an entirely different set of things (not including W) could also cause E.
76
We assume that buildings can only be wood or brick so that “not W” means a brick building.
77
Without any clear-cut notion of how to define similarity, it seems reasonable to conclude that two
possible worlds differing in “exactly the same number of ways” from a third are equally close to the third
world.
C together cause E.
Table A-1 – Four Closest Possible Worlds

                                        TREATMENTS
                          No short circuit (not C)    Short circuit (C)
Brick building (not W)    No fire (not E)             No fire (not E)
Wood building (W)         No fire (not E)             Fire (E)
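The deterministic pattern in Table A-1 can be checked mechanically. The sketch below assumes the rule from the example, that a fire occurs exactly when a short circuit happens in a wooden building; the function name and layout are ours, not the author's:

```python
# Deterministic causal rule from the example: a fire (E) occurs exactly
# when a short circuit (C) happens in a wooden building (W).
def fire(short_circuit, wooden):
    return short_circuit and wooden

# Enumerate the four closest possible worlds of Table A-1.
worlds = [(c, w, fire(c, w)) for c in (False, True) for w in (False, True)]
for c, w, e in worlds:
    print(f"C={c}, W={w} -> E={e}")

# In every no-fire world either C or W is absent, so "if not E, then
# (not C or not W)" holds; but "if not E, then not C" fails in the world
# with a short circuit in a brick building.
```

Note that the table is symmetric in C and W (swapping the two arguments leaves the outcomes unchanged) but not in C and E, which is the asymmetry the text exploits.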
It is imperative, however, to remember that this causal claim rests heavily on the fact that the four
possible worlds in Table A-1 are as close as possible to one another given their differences in C
and W. A similar table with the same pattern of entries from observational data cannot be used to
make the same inference because the entries might be based upon quite different situations which
are not close to one another. Consider, for example, a universe different from ours in which short-circuits never cause fires even when the short-circuit occurs next to a piece of wood because short-circuits do not have the capacity to start fires. In this universe, we could still get the pattern in
Table A-1 from observational data if wooden buildings burned when they were hit by lightning
and if, by sheer happenstance, only wooden buildings with short-circuits were hit by lightning. In
statistical parlance, lightning would be a confounder that is correlated with short-circuits so that it
appears that short-circuits cause wooden buildings to burn down, when in truth it is the lightning
that does the work. Short-circuits and fires are the joint effects of a common cause which is
lightning. In these observational data, the cases in the bottom right-hand corner where fires occur
in wooden buildings with short-circuits are not the closest possible world to the cases in the bottom
left-hand corner where fires do not occur in wooden buildings without short-circuits because
lightning strikes the buildings in the first set of cases but not in the second set of cases. The two
sets of cases would be closer if lightning struck both (in which case both would burn down) or it
struck neither (in which case neither would burn down).
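A toy simulation makes the lightning confounder concrete. The numbers below (a 50% rate of short circuits, and lightning striking exactly the buildings with short circuits) are assumptions chosen for illustration, not estimates:

```python
import random

random.seed(0)

# Toy universe (assumed for illustration): short circuits are causally
# inert; only lightning burns wooden buildings. By sheer happenstance,
# lightning strikes exactly the wooden buildings with short circuits.
n = 10_000
with_c = without_c = fires_with_c = fires_without_c = 0
for _ in range(n):
    short_circuit = random.random() < 0.5
    lightning = short_circuit        # the confounding coincidence
    fire = lightning                 # every building here is wooden
    if short_circuit:
        with_c += 1
        fires_with_c += fire
    else:
        without_c += 1
        fires_without_c += fire

p_fire_given_c = fires_with_c / with_c
p_fire_given_not_c = fires_without_c / without_c
# Observationally, fire tracks the short circuit perfectly, reproducing
# the Table A-1 pattern, even though lightning does all the causal work.
```

The observational data alone cannot distinguish this universe from one in which short circuits genuinely cause fires, which is exactly the point of the passage above.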
This method of finding additional causes can also help to rule out the possibility that a causal
connection between two events results from a common cause for both of them, although ruling out
common causes can be a tricky business. Consider, for example, an experiment in which
randomly assigned special tutoring causes a rise in self-esteem and then an increase in test scores,
but the increase in self-esteem does not cause the increase in test scores. Rather it is the
substantive content of the special tutoring that causes the rise in test scores. Self-esteem might be
incorrectly considered the cause of the increased test scores in this situation because self-esteem is
randomly assigned and it precedes and is associated with the rise in test scores.
Two problems must be sorted out. One is the possibility that the treatment (which is special
tutoring with substantive content) is misnamed “increases in self-esteem.” The other is the failure
to identify the common cause. If the first mistake has been committed, then subsequent
manipulations of special tutoring thought of as manipulations of self-esteem will perpetuate the
mistake. But if attempts are made, as they should be, to look for other possible “independent”
causes of increases in test scores, such as the substantive content of the special tutoring, and to
manipulate the hypothesized causes (self-esteem and substantive content) separately, then it will
become clear that even the second of Lewis’s conditions does not hold for self-esteem, properly
described. Although self-esteem is associated with higher test scores in the initial situation, when
we construct an experimental condition in which the increases in self-esteem are counteracted but
the substantive content of the special tutoring remains, then test scores will still be high. Hence,
self-esteem cannot be responsible for the higher test scores. Note that this experimental condition
(in which only self-esteem is manipulated) is actually a closer possible world to the initial situation
than the experiment in which self-esteem is manipulated through the presence or absence of
special tutoring because the manipulation of special tutoring actually changes (at least) two aspects
of the world, the self-esteem of the subjects and the substantive content to which they have been
exposed. Thus, looking for independent causes is one way in which researchers can try to get
closer and closer possible worlds.
Perhaps the most important lesson from this discussion is that the introduction of a new factor W in
the experimental situation leads towards identifying the exact factors that cause the effect and the
mechanism whereby the effect occurs. It turns out that it is not just short-circuits that cause
buildings to burn down. Rather, the interaction of a short-circuit with a wooden building causes
fires. Once this is established, we can investigate the mechanism in more detail by considering
other sources of sparks and other sources of flammable material. Similarly, it turns out that it is
not self-esteem that causes grades to rise, rather it is the substantive content of the special tutoring.
More generally, this logic can be applied to any situation. In the butterfly ballot case, if we find
that less well-educated people were more likely to make mistakes (which was true), then we can
see whether less well-educated people make other kinds of mistakes with voting equipment such as
“overvoting” by inadvertently voting twice for the same office.78 Or we can check to see whether
the mistakes that were made were really due to the butterfly ballot itself or to some other feature of
the Palm Beach County election administration of which the ballot was just another effect. The
counterfactual theory of causation, therefore, leads us towards introducing other possible causes
and considering mechanisms that tie causes together.
78
See Brady et al (2001b).
Studying the Causes of Human Variability:
The Role of Conditional Independence
Good social science research must rule out alternative cause and effect relationships.1
This requires finding a way to move from association to causation. This step is a very big one.
For example, informal observation suggests that those countries using proportional
representation tend to have more political parties than those using the familiar American system of
single-member plurality voting districts. But it is wrong to move from this observation to the
claim that proportional representation (PR) leads to more political parties than plurality voting
systems if, as seems to be the case, PR systems are more common in those nations that include
numerous powerful groups that want political parties devoted to their needs. In these
circumstances, the powerful groups may have persuaded their governments to choose PR over
plurality systems because they thought it would help them form political parties.
A tangle of causal possibilities follows from the observed associations in this case. It is
possible that PR causes more parties to form, but it is also possible that powerful groups are the
major causal factor. Without PR, these groups might still spawn many parties, in which case the
association of the number of parties with PR is completely spurious. Alternatively, without PR,
the groups might be thwarted in their attempts to create political parties which they could form if
PR were used. In this case, PR interacts with powerful groups to produce many flourishing
parties. Whatever the truth, it is clearly hard to figure out from the pattern of associations.
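One of the tangled possibilities can be put in code. The minimal simulation below (all probabilities are illustrative assumptions) generates data in which powerful groups cause both the adoption of PR and a multiplicity of parties while PR itself is causally inert; the raw association nonetheless looks strong:

```python
import random

random.seed(2)

# Assumed data-generating story: powerful groups cause BOTH the adoption
# of PR and many parties; PR itself does nothing.
n = 20_000
rows = []
for _ in range(n):
    groups = random.random() < 0.5                     # powerful groups?
    pr = random.random() < (0.8 if groups else 0.2)    # groups push PR
    many_parties = groups                              # parties come from groups
    rows.append((pr, groups, many_parties))

def rate(rows, pr_value, groups_value=None):
    cell = [m for p, g, m in rows
            if p == pr_value and (groups_value is None or g == groups_value)]
    return sum(cell) / len(cell)

# Unconditionally, PR looks strongly "associated" with many parties ...
raw_gap = rate(rows, True) - rate(rows, False)
# ... but conditional on the confounder the association vanishes entirely.
gap_given_groups = rate(rows, True, True) - rate(rows, False, True)
gap_given_no_groups = rate(rows, True, False) - rate(rows, False, False)
```

Here conditioning on the confounder (powerful groups) eliminates the PR-parties association, which is what satisfying conditional independence would reveal; the difficulty in practice is knowing which factors to condition on.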
The problem here, as in a great deal of social science research, is that the requirement for
good causal inferences, called conditional independence by statisticians and the specification
assumption by econometricians, is not met. The putative cause (proportional representation in
this case) is correlated with another factor (numerous powerful political groups) that might
explain the effect (the number of political parties) and the other factor is not controlled in a way
that allows the researcher to rule it out as the true cause of a multiplicity of parties. Conditional
independence might be satisfied if the confounding factor is controlled, but even if many
confounding factors are identified, controlled, and ruled out, there still may be others that have
not been identified that can derail an inference. Satisfying the requirement for conditional
independence is the Achilles heel of all social science inference because it is so very hard to do.
This chapter shows why this is so.
1
Of course it also requires ensuring the representativeness of the phenomena under consideration,
conceptualizing and measuring the variables whose relationships are being studied, and guaranteeing, through
statistical significance testing, that putative relationships are not solely due to chance (Campbell and Stanley,
1966). All four tasks are important parts of the scientific enterprise, but there is something especially futile about a
representative study of the causes of, say, revolutions that pays careful attention to conceptualization and
measurement and that dutifully reports statistical significance, but which gets the causes wrong because it failed to
rule out obvious alternative explanations.
The fundamental problem for researchers who want to rule out alternative explanations is
the extent of human variability and the difficulty of controlling that variation. In this chapter, we
do four things towards understanding the ways that variability can be controlled. First we
describe the experimental approaches, dating from the 17th century, used by physical scientists to
develop and test new laws. We show that these scientists were well aware of the need for control
and for minimizing errors, but they dealt with a world in which statistical variation was not a great
problem. Nevertheless, one of the major methods for analyzing social science data, ordinary least
squares, was first developed to deal with the bothersome but relatively innocuous measurement
error in astronomical and physical data. Second, we show how the recognition of statistical
variability – a notion that went significantly beyond the idea of errors – by social and biological
scientists in the 19th century posed a conceptual problem that required new ways of doing science.
We show how 19th century statisticians found ways to describe, to summarize, and to understand
this variability. Third, we discuss how randomized
experiments provide a method for developing reliable law-like statements about the social and
biological world by providing a setting in which counterfactual possibilities can be clearly stated
and in which these counterfactual outcomes can be tested while ruling out confounding factors.
Experiments do this by ensuring that the outcome, conditional on the treatment, is independent of
all other factors that affect the outcome of the experiment. We explain what this means and how
experiments can provide evidence about counterfactuals. Fourth, we discuss the ways that
observational studies fall short of randomized experiments by not automatically satisfying
conditional independence. We consider what observational work must accomplish to develop
reliable knowledge about the social and biological world.
Experiments Under Ideal Conditions – Physical Science Research
Until the middle of the 19th century, the prototype of scientific progress was the development of a
physical law relating physical quantities such as velocity, force, pressure, temperature, or
momentum. The methods for establishing these laws were controlled experiments and
observational studies such as Galileo’s early 17th century experiments with falling bodies and his
observations of the moons of Jupiter. Both methods depended upon conceptual and theoretical
perspectives that suggested specific hypotheses. The experimental method also depended upon
these perspectives to determine the factors that had to be manipulated and controlled.
Boyle’s Gas Law – Robert Boyle’s (1627-1691) experiments in pneumatics in the middle
of the 17th century, which led to the eponymous law relating the volume and pressure of a gas to
one another, exemplify the ideal. Boyle was a careful experimenter, and his experiments were the
culmination of investigations that established the existence of air pressure caused by the sea of air
surrounding the earth. The fundamental instrument in these experiments, invented by Evangelista
Torricelli (1608-1647), was a column of mercury – a barometer – to measure air pressure. Critics
of Torricelli’s theory believed that air could not have enough weight to push up a column of
mercury, and the experiments that led to Boyle’s law were the result of Franciscus Linus’s now
fantastic, but then conceivable alternative hypothesis that “the space above the mercury column in
a Torricellian tube contained an invisible membrane or cord which he called a funiculus (Conant,
1957, page 50).” This membrane, Linus claimed, could draw up a column of mercury to about 29
inches, and it explained the observed behavior of the mercury in the Torricelli tube.
Boyle rejected the idea of the funiculus, but to provide the necessary proof he had to find a
way to show that air has sufficient “weight and spring ... to perform such great matters as the
counterpoising of a mercurial cylinder of 29 inches, as we teach that it may.” (Boyle, cited in
Conant, 1957, page 51). Boyle developed a series of ingenious experiments to do this by varying
the pressure exerted on a volume of air. By doing this, he showed that pressures other than about
29 inches of mercury were possible, thus dealing a blow to the theory of the funiculus. In the
process, he noted that the volume (V) of air decreased in a regular way as the pressure (P)
increased. Specifically, the pressure was related to volume (V) according to the following simple
relationship where C is a constant:
(1)    PV = C,
which can be rewritten as follows after taking logarithms and rearranging terms:
log(P) = log(C) - log(V).
If we let p = log(P), v = log(V), and c = log(C), then we get the following linear equation:
(2)    p = c - v.
Figure 1 plots the measurements of pressure versus volume from Boyle’s experiments (Conant,
1957, page 53) along with an ordinary least squares (OLS) fit to them.2 OLS finds a line that best
fits a scatter of points by minimizing the sum of the squared vertical errors. These errors are the
difference between the observed pressure for a given volume and the value that would be predicted
for that volume from the regression line. In Figure 1, the points are so close to the fitted line that
it is hard to see any discrepancies, but they do exist.
Although Boyle could have taken logarithms as we did, he could not have fit an OLS line to
his data because OLS was developed in the 18th century and perfected in the early 19th (Stigler,
1986). The inspiration for OLS was to provide better fits to physical data in which the deviations
were thought of as errors of measurement. A 19th century physicist would have readily used this
method to smooth out the errors of measurement and to determine the slope of the line (-1) and
the intercept (c) in equation (2). In Figure 1, the slope of the line is -1.0015 with a standard error
of 0.0017 so that its departure from the predicted value of minus one in equation (2) is
2
It is reasonable to ask why we plot pressure versus volume instead of the reverse. For each of his trials,
Boyle apparently chose a target volume and then varied the pressure until he hit that target. Berkson suggests in
this case that volume is the independent variable. In any case, the reverse regression has a coefficient of -0.9984
with a standard error of 0.0017 which is essentially identical to the first one. For a more thorough discussion of
the issues involved in this case, see Berkson, 1950.
insignificant. The fit is remarkably good, and the largest error is only about 1%. This strongly
suggests that Boyle’s care in constructing and using his experimental device led to very little
imprecision in the measurement of volume and pressure. But the error could have also had another
cause that is discussed below.
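The fit just described can be reproduced on synthetic data. The constant C, the volume grid, and the error size below are illustrative assumptions, not Boyle's actual measurements:

```python
import math
import random

random.seed(1)

# Synthetic pressure-volume pairs obeying PV = C with a small
# multiplicative measurement error, standing in for Boyle's table.
C = 1400.0
volumes = [12.0, 16.0, 20.0, 24.0, 28.0, 32.0, 40.0, 48.0]
pressures = [C / v * math.exp(random.gauss(0, 0.005)) for v in volumes]

# OLS in logs: regress p = log(P) on v = log(V); theory predicts a slope
# of -1 and an intercept of c = log(C).
x = [math.log(v) for v in volumes]
y = [math.log(p) for p in pressures]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
```

With measurement error this small, the estimated slope lands very close to the theoretical value of minus one, mirroring the near-perfect fit in Figure 1.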
Boyle’s result illustrates a number of features of controlled experiments. First, his work
involves a working hypothesis (about the “weight and spring” of the air), and there is an alternative
hypothesis (the funiculus). Second, the experiment is carefully controlled so that precise numerical
measurements can be made. Third, although he makes no explicit statement about how
temperature would affect the accuracy of his results, Boyle knew that heated air expands and
cooled air contracts. To check for the effects of temperature, he warmed his volume of air with a
candle and noted that there were only small changes in its volume. Conant notes that “This fact
must have assured Boyle that the minor fluctuations in the room temperature during the
experiment in which he varied the height of the column of mercury would not affect the
significance of his results (Conant, 1957, page 56).” Nevertheless, changes in room temperature
undoubtedly did occur during his experimentation, and they could be (partly) responsible for the
deviations from a perfect fit in Figure 1.
More Complete Gas Laws – In fact, the modern version of the gas law incorporates two
other laws, Charles’s law relating volume and temperature and Avogadro’s law regarding the amount
of matter:
(3)    PV = R N T
where R is the “gas constant,” N is the amount of matter (measured in moles, a specific number of
molecules), and T is temperature in degrees Kelvin. Taking logarithms and setting r=log(R),
n=log(N), and t = log(T), we get:
(4)    p = r + n - v + t.
Thus, Boyle developed his law by omitting two important variables (n and t), but he was saved
from making an incorrect inference for one of two reasons. Either the variables did not vary
because Boyle controlled them, or their range of variation was very small compared to the range of
variation in v and p. It is likely that n was controlled because he did a series of experiments with
the same amount of matter, but t undoubtedly varied during the course of his experiment. In this,
Boyle was helped by the fact that the amount of variation in t required to have an impact is very
large, and by the fact that he did check qualitatively to see that this was true. It is worth
contemplating, however, what might have happened if the range of temperature had been greater in
his experiments. We shall return to this problem later.
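A quick calculation with equation (3) shows why a small temperature drift was harmless. The gas constant below is the standard value; the volumes and temperatures are assumptions for illustration:

```python
# Ideal gas law PV = nRT: pressure of a sample of gas.
R = 0.082057  # gas constant in L*atm/(mol*K)

def pressure(volume_l, temp_k, n_mol=1.0):
    return n_mol * R * temp_k / volume_l

# Halving the volume doubles the pressure (Boyle's law) ...
p1 = pressure(20.0, 293.0)
p2 = pressure(10.0, 293.0)

# ... while a 2 K drift in room temperature changes pressure by
# well under 1% at a fixed volume.
drift = (pressure(10.0, 295.0) - p2) / p2
```

Had Boyle's room temperature instead swung by tens of degrees, the temperature term in equation (4) would have contaminated his pressure-volume relationship noticeably.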
Experiments such as the one undertaken by Boyle were successful, then, because of
precision in measurement and calculated (or lucky) efforts to control “confounding” variables such
as temperature. But there is a still deeper reason for their success. Boyle was lucky because he
was studying a subject for which the individual parts of matter could be isolated and would behave
the same as any other part of matter. Boyle did not have to worry that a sample of air from one
location would differ from that in another location. He did not have to worry that air might have
many different characteristics that would vary and confound his analysis.
Indeed, it turns out that he did not even have to worry about what kind of gas he was
studying because by Avogadro’s Law, equal volumes of all gases at the same temperature and
pressure have the same number of particles. Gases are extraordinarily homogeneous substances
once pressure and temperature are controlled. The behavior of liquids, for example, is much
harder to understand because different liquids behave quite differently. In truth, even gases have
very small differences in the forces between molecules (van der Waals forces) that lead to
departures from Avogadro’s Law, but these were undetectable until 19th century improvements in
the art of physical measurement.
Finding the Right Tools for Analyzing Data in the Social and Biological Sciences
Social and biological scientists, however, are faced with a more difficult situation as they try to
explain human variation. Even the most elementary social or biological phenomena display
extraordinary variability. Tall parents do not inevitably have tall children – common experience
tells us that tall parents can have children of all sizes. Susceptibility to disease varies enormously,
and not all people get sick even in the midst of a plague. Criminals are not necessarily poor, badly
educated, from broken families, or marked by distinctive body types. New agricultural techniques
do not lead to the same yield on all plots. Even if one tried to hold every factor constant in these
instances (the same nutrition for all children, the same sanitary conditions for those exposed to
disease, the same social conditions for possible criminals, the same fertilizers, rainfall, and other
conditions for agricultural plots), we would still find substantial variation in these phenomena. It does not
seem that there is any way to produce deterministic laws in the social sciences such as the one
obtained by Boyle.
In fact, upon initial examination, the biological and social world seems to be idiosyncratic,
anomalous, and hopelessly variable. Few, if any, law-like regularities are apparent. Only through a
series of halting steps that would require most of the 19th century would a strategy be developed
for dealing with this problem. These steps would lead researchers towards a recognition of
stochastic laws. First there would be a recognition that there were societal regularities that could
be described by mean tendencies. Second, for some important situations it would be recognized
that deviations from these tendencies took a lawlike form. Third, correlational and regression
methods would be developed for summarizing statistical laws that differed in important ways from
the deterministic laws discovered by Boyle and other physical scientists.3
3
As well as consulting many of the original articles myself, I have relied heavily upon three books for
this section of the paper: Stephen Stigler, The History of Statistics: The Measurement of Uncertainty before 1900,
Cambridge: Harvard University Press, 1986 (chapters 5, 8, 10). Theodore M. Porter, The Rise of Statistical
Thinking: 1820-1900, Princeton: Princeton University Press, 1986 (Chapters 2, 4, 5, 9), and Gerd Gigerenzer,
Step One: Categorizing the Average Person – Adolphe Quetelet (1796-1874) and Francis
Galton (1822-1911) would be the leaders in the first steps. Quetelet, born in Belgium and
educated in France, spent most of his professional life in Brussels. He was an energetic arranger of
statistical data, but his greatest contribution was his conception of social physics, partly derived
from his early training in astronomy, which served as a metaphor for his belief in the essential
lawfulness of social phenomena. His belief in these regularities was buttressed by the ever
growing efforts of industrializing nation-states to collect data that revealed surprising regularities
in births, marriages, deaths, crime, and many other phenomena. “Quetelet interpreted the
regularity of crime as proof that statistical laws may be true when applied to groups even though
they are false in relation to any particular individual (Porter, 1986, page 51).”
Statistical regularity, “the law of large numbers,” became for him the fundamental axiom of
social physics. He identified the average man, “l’homme moyen,” as the central concept in his
two-volume work published in 1835 entitled On Man and the Development of His Faculties, or an
Essay on Social Physics.4 For Quetelet, there were many average individuals, one for each way
of categorizing people. “It was the relationship between these average men that was the focus of
Quetelet’s attention, their rates of development and their differences and similarities (Stigler, 1986,
page 171).” The assumption that the average person differed from one category to another made
it imperative that Quetelet have a way to define analytically distinct groups. Yet, for any
categorization, people often varied as much within the category as between categories. Quetelet
had to solve this problem of variability within groups, and he had to do so in a way that would
provide some theoretical power.
Solving the problem would not be easy. The recognition of the average man provided a
focal point for the comparison of groups, but it made deviations from that average within the
group a more difficult problem for social and biological scientists. In physical science, phenomena
could be described in essentially complete ways such that physical laws would completely
determine their other characteristics. The same amount of any gas contained in the same volume
and at the same temperature would exert the same pressure. A range of different pressures was
not observed for a given volume and temperature. Stars observed on one night at a given time
would appear in the exact same locations on the next night with due allowance for the earth’s
movement. Except for measurement error, they did not appear to move randomly about these
locations. One way to describe these results is to say that physical phenomena followed
deterministic causal laws, but this description requires the additional baggage of determinism and
causal explanation. It is far simpler to say that physical laws provided ways to describe
homogenous groups of phenomena such that once some facts were known about them, then others
followed (almost) exactly. The same amount of any gas with a given volume and set temperature
did not exhibit a range of pressures. Stars did not deviate from their predicted positions.

3. Zeno Swijtink, Theodore Porter, Lorraine Daston, John Beatty, and Lorenz Kruger, The Empire of Chance, Cambridge: Cambridge University Press, 1989 (Chapter 2).
4. Sur l’homme et le developpement de ses facultes, ou essai de physique sociale, Paris: Bachelier.

In these cases, any deviations from the predicted values invariably seemed to be errors that could be
eliminated through better measurement.5 And even if individual measurements could not be
improved to eliminate deviations, statistical techniques such as ordinary least squares could be used
to average out errors.
But it seemed difficult, if not impossible, to develop descriptions of social and biological
phenomena that were homogeneous in any way. A description of parents’ heights did not
perfectly predict children’s heights, and additional information about the parents and the family still
left a large residual of unexplained variation. Extensive knowledge about the income and living
conditions of people did not perfectly predict whether they would become criminals or get ill.
Detailed measurements of past fertility and other characteristics of agricultural plots were not
enough to reliably predict their future yield. All descriptions seemed inevitably to lead to
heterogeneous values on other characteristics, and it did not seem that better measurements or
better theories would eventually solve this problem. There did not appear to be deterministic
social or biological laws. What could be done in these circumstances? What kind of laws were
appropriate for social and biological phenomena? Was there another way to define homogeneity
that would lead to useful results?
Although Quetelet’s first step showed that mean values could be used to describe groups
categorized by different characteristics, further progress seemed to require a new conception of
homogeneous groups. What could this conception be? Quetelet’s second contribution was to
suggest a way to do this that led to an interpretation for the normal distribution that made it more
than a theory of errors.
The normal error curve was well-known by the middle of the 19th century, dating back to
the work of Abraham De Moivre (1667-1754) in the 1730's and the generalization to a “central
limit theorem” in the work of Pierre Simon Laplace (1749-1827) in the 1770s. De Moivre showed
that the normal distribution was the limiting distribution of the binomial distribution that arose in
games of chance, and Laplace showed how the normal distribution could be thought of as the
result of a large number of independent and identical factors such as errors of measurement or
small deviations in ideal conditions that would cause measurements to deviate from their true
mean. The normal distribution produced by this central limit theorem was considered an ideal
model for the numerous factors such as eye fatigue, lens distortions, weather conditions, and
recording error that affected astronomical measurements. With this in mind, it was natural
(although by no means simple) for Carl Friedrich Gauss (1777-1855) and Laplace to join this
model to the method of least squares in the early part of the 19th century in order to improve the
analysis of astronomical data. The same rationale justifies our use of least squares in fitting
Boyle’s data in Figure 1.
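The error-model rationale can be sketched in a few lines of Python (a hypothetical illustration; the law, the noise level, and the data are all invented): a deterministic linear law observed with many small, independent, normally distributed errors is recovered almost exactly by least squares.

```python
import random
import statistics

random.seed(42)

# A hypothetical deterministic law observed with small, independent,
# normally distributed measurement errors -- the classical error model
# that justified joining least squares to the normal distribution.
true_intercept, true_slope = 2.0, 0.5
xs = [x / 10 for x in range(1, 101)]
ys = [true_intercept + true_slope * x + random.gauss(0, 0.1) for x in xs]

# Closed-form ordinary least squares for a single predictor.
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
slope = cov_xy / statistics.pvariance(xs)
intercept = mean_y - slope * mean_x

# Averaging over many noisy observations nearly recovers the law.
print(round(slope, 2), round(intercept, 2))
```

Because the errors are numerous, small, and independent, averaging them out leaves the estimated constants close to their true values, which is exactly why the method suited astronomical measurements and data like Boyle's.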
5. Or in extreme cases, through new theories.

Quetelet, however, developed a novel use of the normal distribution. He decided that the
normal distribution was an apt standard for judging the homogeneity of a category. “If a
collection of variable measurements were in fact homogeneous (that is, susceptible to the same
dominant causes, differing only in the more minor and random aspects that Quetelet would term
accidental causes), then Laplace’s theorem would tell us to expect the observations to follow the
normal law ... supposing the accidental causes sufficiently numerous (Stigler, 1986, pages 203-5).”
In effect, he turned the central limit theorem on its head. Instead of using it as a model of the way
errors followed a normal distribution that justified their averaging, Quetelet used the central limit
theorem as a way to judge the homogeneity or averageness of a group.6 He considered a group
whose distribution of a trait followed a normal distribution to be homogeneous because the
deviations within the group could have been produced by essentially random errors.
An example demonstrates the reasonableness of Quetelet’s approach. The heights of men
or women separately follow an approximately normal law, but if men and women are mixed
together, the result is not quite normal, which suggests, by Quetelet’s criterion, that they should be
analyzed separately. The conclusion is reasonable in this case, but Quetelet’s reliance on the
normal distribution to make this decision ultimately falters on practical and theoretical grounds.
First, it is remarkably hard to conclude that a set of data does or does not follow the normal
distribution. Second, the basic logic of his approach is forced. There are random “error”
processes that form distributions other than the normal (e.g., random arrival times for people
entering queues follow the Poisson distribution) and empirically important subgroups can exist
within normally distributed populations. Random errors, in short, are not the only explanation of
a normal distribution, and they do not always lead to a normal distribution. Nevertheless,
Quetelet’s acceptance of variation within groups was a big step forward, even if his explanation of
it (random errors) and his solution (looking for normally distributed data) were both flawed.
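Quetelet's criterion can be illustrated by simulation (all numbers here are invented, and checking the share of observations within one standard deviation is only a crude stand-in for a formal normality test): each sex separately behaves like a normal distribution, while the pooled group falls short of the normal benchmark of about 68.3 percent.

```python
import random
import statistics

random.seed(0)

# Invented, illustrative parameters for women's and men's heights (inches).
women = [random.gauss(64, 2.5) for _ in range(20000)]
men = [random.gauss(69, 2.5) for _ in range(20000)]
mixed = women + men

def share_within_one_sd(data):
    """Fraction of observations within one standard deviation of the mean;
    for a normal distribution this is about 0.683."""
    m, s = statistics.mean(data), statistics.pstdev(data)
    return sum(1 for v in data if abs(v - m) <= s) / len(data)

# Each sex alone is close to the normal value; the mixture is flatter,
# so by Quetelet's criterion the pooled group is not homogeneous.
print(round(share_within_one_sd(women), 3))
print(round(share_within_one_sd(mixed), 3))
```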
Step Two: Developing Stochastic Laws – The next step that had to be taken was the
recognition that stochastic variation, and not just error, was a natural feature of the biological and
social world. Francis Galton would take this step in his study of human characteristics such as
height and other bodily measurements. To take it, he would have to overcome the legacy of the
classical theory of errors, which had led researchers to assume that the normal curve was inevitably
the result of many factors operating independently. For if this were necessarily true, then “what
opportunity was there for a single factor, such as a parent, to have a measurable impact? And why
did population variability not increase from year to year? (Stigler, 1986, page 272).” Galton’s
solution involved two steps. First he showed that the normal distribution for an entire population
could arise as the weighted sum of many different normal distributions with different means, with
these means indicating a natural variability across subpopulations. Then he described a mechanism,
regression to the mean, whereby this variability would neither grow nor diminish from generation
to generation. In effect, he developed the first stochastic model of human phenomena.
6. The central limit theorem amounts to the result that if there are many small and independent causes of deviations, then the resulting distribution will be normal. Quetelet turned this around by presuming that if a distribution within a category was normal, then there must be many small and independent causes of deviations from the mean and the group within the category must be homogeneous.
In the 1870's and 1880's, Galton was concerned with understanding the relationships
among bodily measurements from the same people and their kin. To do this, he collected data
from people and their relatives. He invariably found that these measurements, suitably categorized
by sex, age, and other factors, were normally distributed. For example, Figure 2 presents a
histogram of height data of 205 parents and their 928 adult children from Galton’s 1886 study of
“Regression towards mediocrity in hereditary stature.” These data are approximately normal, as
we would expect,7 and there is substantial variability in both groups. Moreover, this variability
remains even after we control, in Figure 3, for parents’ heights by plotting children’s stature versus
that of their parents. This graph, appropriately called a “scatterplot” in modern statistical parlance,
looks much different from the one constructed from Boyle’s experiment. The points, represented
by the number of petals on “sunflowers,” do not all lie along a straight line. Although they have a
central tendency, they are scattered about. Galton recognized that this variation was not due to
errors. Rather it was the result of variability in the human population for which he offered “a
simple and far-reaching law that governs the hereditary transmission of, I believe, every one of
those simple quantities which all possess, though in unequal degrees.” (Galton, 1886, page 246).
Galton’s most important insight, described in his 1886 paper, was that this variability could
be built up, from one generation to the next, from the mixture of the separate normal distributions
for children produced from each group of parents with the same height. And the variability would
remain stable if the means of these separate distributions regressed to the overall mean so that the
average height of children of parents of above average stature was less than their parents and the
average height of children of parents of “mediocre” stature was greater than their parents. Figure
3 plots this regression line along with a dashed line that would indicate no such regression to the
mean. The dashed line is above the solid regression line for those parents who are below average
stature and below it for those parents who are above average stature. With this stochastic model,
Galton showed how the inter-generational stability of stature could be preserved without requiring,
as would a deterministic law, that children have exactly the same heights as their parents. Instead,
children could be different from their parents as long as there was enough regression to the mean
to ensure the same overall variance in heights in the new generation as in the old. And in this
situation of intergenerational stability, the amount by which children could vary from their parents
was related in a clear mathematical way to the amount of regression to the mean. One implied the
other.8
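Galton's stability argument can be sketched as a simulation (a hypothetical illustration: b = 0.46 is roughly the slope in his height data, but the variance and sample size are invented). Setting Var(e) = (1 − b²)Var(X) makes the population variance neither grow nor shrink from generation to generation.

```python
import math
import random
import statistics

random.seed(1)

# Stability condition: Y = b*X + e with Var(e) = (1 - b^2) * Var(X),
# so that Var(Y) = Var(X) in every generation.
b, var_x = 0.46, 6.5
sd_e = math.sqrt((1 - b ** 2) * var_x)

generation = [random.gauss(0, math.sqrt(var_x)) for _ in range(50000)]
variances = [statistics.pvariance(generation)]
for _ in range(5):
    # Regression toward the mean plus fresh, independent variability.
    generation = [b * x + random.gauss(0, sd_e) for x in generation]
    variances.append(statistics.pvariance(generation))

# The variance stays essentially constant: regression to the mean exactly
# offsets the new variability each generation adds.
print([round(v, 1) for v in variances])
```

With more regression to the mean (smaller b), more fresh variability Var(e) is needed to hold the variance constant, which is the mathematical link the text describes: one implies the other.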
7. Galton adjusted female heights by multiplying them by 1.08, and then he averaged the heights of both parents and included the heights of both male and female children.
8. Using modern notation, we can say that if Y is the deviation of a child’s height from the mean for all children and X is the deviation of the parents’ height from the mean for all parents, then we have the following regression equation: Y = bX + e, where b is the regression coefficient and e is the variability of children with parents of a given height X. We assume that e has mean zero for each X and constant variance Var(e), so that the variation in children from tall parents is the same as the variation in children from short parents. Galton was dealing with the situation of stability where not only the mean of Y and the mean of X are equal – on average parents and children have the same heights – but the variance Var(Y) of Y and the variance Var(X) of X are equal, so that the amount of variability in the two generations is the same. By the elementary properties of
Galton’s approach showed that it was possible for two characteristics to be related to one
another in a non-deterministic way. Homogeneous groups of parents could be defined based upon
their heights, and even though the children of these parents with the same stature would vary in
their heights, it was possible to say something lawlike about the relationship between the heights of
parents and children. Galton had shown how to specify stochastic laws. All that remained was for
him to find a way to characterize his law. One approach, of course, was to report the regression
coefficient, the slope of the line, in Figure 3. But there are two regression coefficients: one for
the regression of children’s heights on parents’ heights and one for the reverse regression. Which
one should be reported? In this case, it does not matter. The regression lines are identical because
the height measurements of parents and children are in the same units and have the same variance,9
but problems arise when Galton’s approach is extended to the relationship of the length of a
people’s arms to the lengths of their legs.10 In this case, the two regression coefficients are quite
different.
Step Three: Measuring Association through Correlations – Galton’s final major
contribution to statistics, in “Co-Relations and Their Measurement, Chiefly from Anthropometric
Data” (1888), showed that a common measure of relationship could be produced for any two
characteristics by rescaling both of them by what we would now call their standard deviation, and
taking the common value of the two possible regressions as a measure of their relationship. This
correlation index provided a single measure of the degree of association between two normally
distributed characteristics, and it ranged conveniently from -1 to +1. For Boyle’s data, the
correlation is 0.999965, and for Galton’s data in Figure 3 it is 0.460.11
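The step from two regression coefficients to a single correlation can be illustrated with invented “arm” and “leg” measurements: because the two variances differ, the regression of leg on arm and of arm on leg give different slopes, but after rescaling each variable by its standard deviation both slopes equal the one correlation index.

```python
import random
import statistics

random.seed(2)

# Hypothetical correlated measurements with unequal variances.
n = 20000
arm = [random.gauss(30, 2) for _ in range(n)]
leg = [0.9 * a + random.gauss(50, 4) for a in arm]

def cov(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Two different regression slopes: Cov/Var(X) versus Cov/Var(Y).
slope_leg_on_arm = cov(leg, arm) / statistics.pvariance(arm)
slope_arm_on_leg = cov(leg, arm) / statistics.pvariance(leg)

def standardize(xs):
    m, s = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - m) / s for x in xs]

# After rescaling, both variances are 1, so the covariance of the
# standardized variables is the correlation -- the common slope.
r = cov(standardize(arm), standardize(leg))

print(round(slope_leg_on_arm, 2), round(slope_arm_on_leg, 2), round(r, 2))
```

The correlation is also the geometric mean of the two slopes, since r² = (Cov/Var(X))·(Cov/Var(Y)), which is why it can serve as the single symmetric measure of association.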
Galton’s insights might seem relatively prosaic. After all, the normal error model and its
relationship to least squares was well-known by Gauss and Laplace in the early part of the
nineteenth century. What Galton added was an interpretation of the normal distribution as a
measure of variability – and not errors– in the population, and he developed the first stochastic
variance, we know that for the equation above, Var(Y) = b²Var(X) + Var(e), so that (1 − b²)Var(X) = Var(e) because Var(X) = Var(Y). Thus, once the amount of variation within generations Var(X) = Var(Y) is known, the regression coefficient b (and hence the amount of regression to the mean) and the amount that children can vary from their parents, Var(e), are mathematically related.
9. Given the assumptions of the preceding footnote, the regression coefficient of Y on X is Cov(Y,X)/Var(X), where Cov(Y,X) is the covariance of Y and X, and the regression coefficient of X on Y is Cov(Y,X)/Var(Y). Since Var(Y) = Var(X), the regression coefficients are identical.
10. Because, using the notation of the previous footnote, Var(X) does not necessarily equal Var(Y), so that Cov(Y,X)/Var(X) does not necessarily equal Cov(Y,X)/Var(Y).
11. Galton used the median and the median deviation where we would use the mean and standard deviation, and he did not use the covariance. But his 1888 paper describes a measure that is closely related to the modern correlation coefficient, Cov(Y,X)/[Var(X)Var(Y)]½. By rescaling each variable by dividing by its standard deviation, Galton ensured that each variable had unit variance so that the correlation coefficient and both regression coefficients were equal to Cov(Y,X).
model to explain how stability could coexist with variability. His model of the relationship
between parental height and children’s height is very simple, involving only one cause (parental
height) to produce one effect (children’s height), but Galton’s thinking about the problem was not
simplistic. In his 1886 paper, he goes to some lengths to rule out alternative causes. He studied
height because of “... its practical constancy during thirty-five years of middle life, its small
dependence on differences of bringing up, and its inconsiderable influence on the rate of mortality
(p. 249).” He shows that “the stature of the children depends closely on the average stature of the
two parents, and may be considered in practice as having nothing to do with their individual
heights (page 249),” and he provides data to show that “marriage selection takes little or no
account of shortness or tallness (Pages 250-51).” In sum, stature is a good subject for study
because “its discussion is little entangled with considerations of nurture, of the survival of the
fittest, or of marriage selection (p 251).” Modern researchers would want to consider these
factors in more detail (e.g., Floud, Wachter, and Gregory, 1990), but Galton chose a subject for
which his bivariate approach was very well-suited. Within a decade of his major publications, Karl
Pearson (1857-1936) and his student G. Udny Yule (1871-1951) would extend his framework to
the multivariate case.
Developing the Logic of Causal Inference in Observational Studies
Karl Pearson developed the modern “product moment” approach to correlation that every student
learns in introductory courses, and he constructed the institutional infrastructure that allowed
mathematical statistics to thrive (Stigler, 1986, Chapter 10; Porter, 1986, Chapter 9). Although he
is arguably the father of modern statistics, his approach to multivariate inference through the
development of generalized frequency curves that could be fit to multivariate data proved to be
less fruitful than the approach taken by Yule that involved the generalization of regression.
Regression as a Model of Association – Yule’s seminal papers, written from 1895 to 1899
while he was still in his twenties, involved the application of the new method of correlation to a
question that confronted 19th century social reformers and that seems remarkably up-to-date given
recent welfare reform efforts. Is pauperism (i.e., being supported by public welfare) increased by
providing “out-of-doors” relief to people in their own homes with no work requirements instead of
requiring more stigmatizing “indoors” relief given in workhouses? In modern terms, is the welfare
caseload increased when people are allowed to receive welfare without work requirements?
In a book published in 1894, Charles Booth claimed that lax work requirements had no
impact on welfare caseloads – there was no relationship between pauperism and the proportion of
total relief provided out-of-doors. Yule thought that Booth’s own data proved otherwise. Figure
4, based upon Table II (p. 609) in Yule’s 1895 paper – published when Yule was 24 years old –
shows the relationship between pauperism and the ratio of out-relief to in-relief in 1891 for
districts in Britain. In his comments on this table (and a similar one for data from 1871), Yule
notes that the use of “‘Galton’s function’ or coefficient of correlation” is somewhat problematic
because the joint distribution is not bivariate normal and “no theory of skew correlation has yet
been published (p. 604).” But even though “no great stress can be laid on the value of the
correlation coefficient (the surfaces not being normal), its magnitude at least may be suggestive (p.
604-605).” Yule reports a value of .388 using the new product-moment method developed by
Karl Pearson12 from which he concludes that “the rate of total pauperism is positively correlated
with the proportion of out-relief given (p. 605)” so that lax work requirements were associated
with larger welfare caseloads.
In a footnote to his claim, Yule reveals that he understands the difficulty of making causal
inferences from a correlation, and he demonstrates a sophisticated understanding of a causal
equilibrium between pauperism and the form of administration:
“This statement does not say either that the low mean proportion of out-relief is the cause
of the lesser mean pauperism or vice-versa; such terms seem best avoided where one is not
dealing with a catena of causation at all. To use a simile, due I believe to Professor
Marshall, the case is like that of a lot of balls– say half a dozen – resting in a bowl. Then
you cannot say that the position of ball No. 3 is the cause of the position of No. 5 or the
reverse. But the position of 3 is a function of the position of all the others including 5; and
the position of 5 is a function of the positions of all the others including 3: hence variations
in the positions of the two will be correlated, and it is to this term I prefer to adhere. To
be quite clear, I do not mean simply that out-relief determines pauperism in one union, and
pauperism out-relief in another, so that you cannot say which is which in the average; but I
mean that out-relief and pauperism mutually react in one and the same union (footnote 2, p.
605).”
This sophisticated understanding is, at best, only hinted at in Yule’s subsequent papers, where he
regresses pauperism on form of administration and other variables and he offers a causal
interpretation of the impact of form of administration and other variables on pauperism.
Booth’s reply (March, 1896) to Yule raised the stakes by noting that “I did not find much
which suggested the influence of the form of administration [out-of-doors or in-door relief] on
pauperism, but a good deal to show the influence on administration of the different shapes which
pauperism assumes, due to density or sparseness of population, to the presence of many old
people, to geographical or industrial characteristics, or to prosperity or the reverse as connected
with increase or decrease of population; and I came to the conclusion that good results follow
wherever an appropriate and well-considered policy is acted upon, whatever the policy may be
(page 71).” Booth’s comments suggest two possible problems with Yule’s analysis, although it is
unlikely that Booth had a clear picture of either one of them. One, to which Yule responds in
subsequent work, is that factors other than administration affect pauperism so that Yule’s inference
may be spurious. In his subsequent papers, Yule goes to substantial lengths to avoid spuriousness
by controlling for other variables.

12. My calculations for these data produce the same results up to the third decimal place. In a footnote, Yule notes that “Professor Pearson kindly permits me to make use of results obtained by him since this paper was written, to state that the coefficient of correlation remains equally significant for skew surfaces, although it no longer completely gives the form of the distribution.” (Page 604).

The second problem, to which Yule never really responds even
though he identified it quite clearly in his original paper, amounts to the assertion that Yule has the
causal arrow going the wrong way – the form of relief does not cause pauperism, rather the form
of poverty determines the form of relief. Adjudicating between Yule and Booth on this issue
requires allowing for the possibility of simultaneous causation – the form of administration might
cause pauperism and the pauperism might cause the form of administration – but the nature of this
problem and a possible solution for it would only become clear almost fifty years later with the
work of econometricians studying supply and demand (Haavelmo, 1943; Koopmans, 1945).
In a December, 1896 article, Yule takes on the first problem by breaking down the data by
different ways of measuring pauperism, by metropolitan, urban, mixed, and rural districts, by age
groups, and by poverty level. He shows that the coefficient of correlation between pauperism and
out-relief is significant in every circumstance except in the metropolitan areas, but he dismisses this
result on the grounds that the metropolitan data are known to be of poor quality. His most
interesting endeavor in this paper appears in a footnote where he considers the gross and partial
correlations for rural districts among three variables, “pauperism (proportion of the population in
receipt of relief of any kind), ratio of outdoor to indoor relief, and estimated earnings of
agricultural labourers in each union (p. 615).” He shows, as we would expect, that higher earnings
tend to reduce pauperism, and after controlling for earnings, out-relief still appears to increase
pauperism. This is a nice step towards controlling for confounding variables, but the net
impression from reading this footnote is that Yule is struggling with problems of confounding
variables and causality and only beginning to get a foothold. He concludes in a somewhat
contorted fashion by saying that “the question in the present case does not seem to me to be
whether pauperism is mainly due to an out-relief policy but whether there is any direct connection
between pauperism and out-relief, however slight.... My two notes have shown distinctly that there
is a connection, but do not show whether it is direct, or whether, e.g., I must simply attribute the
result, that pauperism is positively correlated with out-relief, to the fact that pauperism and out-relief are both positively correlated with poverty. I prefer not to follow Mr. Booth into what must
be at present mere guesswork on this point, but may remark that the figures quoted in my note on
p. 615 directly contradict any such hypothesis for rural unions. (p. 620).”13
Regression as a Model of Multivariate Causation – In December of the next year, Yule
(1897b) published a paper which, while it was titled “On the theory of Correlation,” was really the
first complete treatment of multiple regression analysis. In his first paragraph Yule announces his
intention to use regression to discover causal relationships:
“The investigation of causal relations between economic phenomena presents many
problems of peculiar difficulty, and offers many opportunities for fallacious conclusions.
Since the statistician can seldom or never make experiments for himself, he has to accept
the data of daily experience, and discuss as best he can the relations of a whole group of
changes; he cannot, like the physicist, narrow down the issue to the effect of one variation
at a time. The problems of statistics are in this sense far more complex than the problems
of physics (Yule, 1897b, p. 812).”

13. A multivariate regression analysis using Yule’s data produces the following equation, where each variable is assumed to be mean-deviated: Pauperism = .524(Out-Relief Ratio) − .592(Earnings). The coefficient of the out-relief ratio, .524, though smaller than the bivariate coefficient of .60 when Pauperism is regressed on just the Out-Relief Ratio, is still highly statistically significant with these data.
In this paper and in an earlier one (1897a), Yule proposes to achieve his goal by using correlations
and estimating linear regression equations to analyze the typically non-normal distributions found
in social statistics. Yule treats both bivariate and trivariate regression in detail and he gives
examples of bivariate regression using data on poor relief.
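Yule's logic of controlling for a confounder can be sketched with simulated data (the variables and coefficients are invented, not Yule's): a bivariate slope of “pauperism” on “out-relief” is inflated by their common dependence on “poverty,” while the trivariate regression recovers the direct effect.

```python
import random
import statistics

random.seed(3)

# Invented data in the spirit of Yule's problem: "poverty" drives both
# "out-relief" and "pauperism", confounding the bivariate relationship.
n = 20000
poverty = [random.gauss(0, 1) for _ in range(n)]
out_relief = [0.8 * p + random.gauss(0, 1) for p in poverty]
pauperism = [0.5 * o + 0.7 * p + random.gauss(0, 1)
             for o, p in zip(out_relief, poverty)]

def cov(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Bivariate slope of pauperism on out-relief alone: inflated by confounding.
b_bivariate = cov(pauperism, out_relief) / statistics.pvariance(out_relief)

# Trivariate regression: solve the 2x2 normal equations by Cramer's rule
# for the coefficient of out-relief, holding poverty constant.
s_oo = statistics.pvariance(out_relief)
s_pp = statistics.pvariance(poverty)
s_op = cov(out_relief, poverty)
s_yo = cov(pauperism, out_relief)
s_yp = cov(pauperism, poverty)
det = s_oo * s_pp - s_op ** 2
b_partial = (s_yo * s_pp - s_yp * s_op) / det

# The partial coefficient is close to the true direct effect (0.5 here);
# the bivariate coefficient mixes in the path through poverty.
print(round(b_bivariate, 2), round(b_partial, 2))
```

Of course, as the debate between Yule and Booth shows, the partial coefficient is only “right” if the causal arrow and the list of controls are right, which is precisely what a regression alone cannot establish.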
Yule’s ambitious program led his teacher, Pearson, to write him that he did not think that a
linear functional relationship was adequate to summarize social or biological data and that the
proper approach started with a frequency surface (Stigler, 1986, p. 351).14 But although Yule
noted that “the much more general problem of obtaining an expression completely describing the
frequency distribution is one that may sometimes become of importance (Yule, 1897b, p. 839),”
the difficulty of solving the problem for distributions other than the multivariate normal and the
simplicity of Yule’s approach meant that linear regression analysis would carry the day within the
social sciences. Regression analysis, however, typically meant that researchers chose one variable
as the left-hand-side or “dependent” variable even though there are two regressions for two
variables and K regressions for K variables. One of the consequences is that whereas the
symmetry of correlational analysis discouraged causal interpretations, the asymmetry inherent in
regression analysis led to causal interpretations in which the independent variables were assumed
to affect the dependent variable. Researchers would eventually realize that substantial thought
would have to be given to the choice of the dependent variable, and the techniques of causal
modeling would become much more sophisticated (Wright, 1934; Koopmans, 1950; Simon, 1954;
Wold and Jureen, 1952; Hood and Koopmans, 1953; Blalock, 1964; Joreskog, 1970; Goldberger
and Duncan, 1973). The resulting causal modeling tradition is considered a major achievement in
some quarters (Hendry and Morgan, 1995; Morgan, 1990) and a failed approach in others
(Freedman, 1987, 1991). In short, we are still debating the adequacy of Yule’s solution (McKim
and Turner, 1997).
14. Stigler quotes a letter from Pearson which makes the key point: “In physics you know by experience that the finer your methods of observation and your powers of observation, the more nearly you get your two variables related by a single valued equation and you are justified in trying to find the value of its constants.... The key to your method is, such a relation between the two variables actually exists in nature, it is the axiom from which you start. In biology you start with the exact opposite – no such single valued relation exists, but I understand by correlation the theory which endeavors to supply its place.” (Stigler, 1986, p. 351). Pearson’s observation, though correct, amounted to restating the problem. It did not indicate why Yule’s solution would not solve it.

Yule’s 1899 paper on “An Investigation into the Causes of Changes in Pauperism in
England, Chiefly during the Last Two Intercensal Decades (Part I.)” suggests the strengths and
limitations of the approach. For Stephen Stigler, a University of Chicago statistician widely
recognized as the major chronicler of the history of statistical methods, “the paper was in its way a
masterpiece, a careful, full-scale, applied regression analysis of social science data (Stigler, 1986,
p. 355).” For David Freedman, a University of California statistician widely recognized as a major
critic of causal modeling in the social sciences, the paper is “quite modern in spirit” (p. 118) and
fatally flawed (Freedman, 1997, p. 119). Both are right.
The remarkable features of Yule’s paper are its ambition and its recognition of many
problems that would bedevil all subsequent observational studies of its type. The ambition comes
in Yule’s desire to develop an explanatory model for pauperism. He begins his second paragraph
by speaking of causes, and he endeavors to classify “[t]he various causes that one may conceive to
effect changes in the rate of pauperism” (p. 249) under five headings: Changes in the
administration of the law, changes in economic conditions, changes of a general social character
such as overcrowding, changes of a moral character such as crime or illegitimacy, and changes in
the age structure of the population. A modern researcher would have trouble coming up with a
better list. Yule goes on to note that these causes might be interdependent and a method is needed
to decide between “different interpretations of the same facts (p. 250).” For example, a change in
pauperism might result from a change in the proportion of out-relief, but both pauperism and the
proportion of out-relief might be due to a common association of both with economic and social
changes. Some way, therefore, must be found to control for the other factors. “This,” he claims,
“the method I have used is perfectly competent to do (p. 250).” By including all of the other
causal factors in a regression equation along with the factor of interest, his method “gives the
change due to this factor when all the others are kept constant (p. 251).” Yule claims that he can
determine the net effect of one factor on another. He recognizes that there may still be problems
for he quickly adds that “There is still a certain chance of error depending on the number of factors
correlated both with pauperism and with proportion of out-relief which have been omitted, but
obviously this chance of error will be much smaller than before (p. 251).” The last part of this
sentence seems too optimistic even to the most ardent supporters of causal modeling, and it seems
positively wrong to critics of the methods. The chances of error are much greater than Yule
imagined, and there are heated debates about what can be learned from regression analysis
(Freedman, 1991). We review the problems in detail below.
Although Yule achieves a great deal in his paper,15 Freedman is right in complaining that
“there seem to be some important variables missing from the equation, including variables that
measure economic activity (p. 116-117).” Freedman also notes that some coefficients change signs
from one source of data to another, and Booth’s second concern may be a problem: out-relief may be the result, not the cause, of pauperism. Yule is not entirely unaware of some of these
difficulties, and he includes a section on “Unaccounted Changes” (p. 260) and spends many pages
15. Yule’s paper includes a number of innovations. Using data from 1871, 1881, and 1891, he takes
differences in order to explain the change in pauperism. He also estimates regression equations for rural, mixed,
urban, and metropolitan groups, thus anticipating time-series cross-sectional regressions. By trying to explain the
determinants of the out-relief ratio as well as pauperism, he anticipates causal models in which there are chains of
causation.
discussing his results. Freedman even finds a deft retraction of causal claims in a footnote where
Yule admits that in his tables, “Strictly, for ‘due to’ read ‘associated with.’” (footnote 25, p. 270).
But later Yule screws up his courage again and argues that “It seems impossible to attribute the
greater part, at all events, of the observed correlation between changes in pauperism and changes
in out-relief to anything but the direct influence of change of policy on change of pauperism, the
change in policy not being due to any external causes such as growth of population or economic
changes (p. 277).”
The comments published with Yule’s paper (p. 287-295) are remarkable for their similarity
to modern discussions of the topic, and it is worth cataloging them. There are criticisms of the
models used by Yule. Other variables (e.g., growing prudence, the distribution of age) might
explain the observed reductions in pauperism, and the statistical methods may not be appropriate
for non-normal distributions or for measures bounded between zero and one. There are concerns
that the causal mechanisms are obvious or too opaque. One commentator says that statistical
analysis of this sort only confirms what administrators already know, and another asks what
mechanisms explain the association of reductions of pauperism with decreases in out-relief – was it
“the rejection of applications unwarranted by real destitution, or was it due to the deeply-rooted
dread of the workhouse, which prevented application for relief in cases of real destitution? (p.
289)” There are concerns about the generality of the results. Restrictions on out-door relief
might only work “in a society where certain conditions already existed, those conditions being at
the present time a constant improvement in the economic and moral conditions of the community.”
(p. 293). Finally, there is a call for more data. With more data collection, “It might be found that
there were two kinds of pauperism to be dealt with: the pauperism which was more or less chronic,
where the people were receiving relief from year’s end to year’s end, and on the other hand a
pauperism which was more or less transient, where people received relief for short periods” (p. 293). In his reply, Yule notes that practically speaking, the burdens of the arithmetic mean that
only three or four factors can be considered in this type of analysis. Modern computers, of course,
have overcome this defect, but they have not overcome the possibility that one or more of the
assumptions of regression analysis might fail and vitiate the conclusions of a regression analysis.
The most worrisome problem is that some omitted factors might be correlated with included ones.
The Specification Assumption, Conditional Independence, and Regression
How Regression Can Go Wrong – It is worth stopping to describe this problem in detail
because it is the Achilles heel of regression analysis, and the extent of the problem is often
underestimated. It is best to make a few simplifying assumptions, none of which are essential to
the results.16 First, we assume that all variables have been centered about their means,17 which has
16. This analysis is adapted from Clogg and Haritou, 1997, pages 95-96, but the basic ideas go back to
Wold and Jureen, 1952 and Ezekiel, 1930. (Theil?)
17. That is, they have been mean-deviated by subtracting off the value of their means. As a result, these
mean-deviated variables have a mean of zero.
the effect of eliminating the intercept in the regression equation. Second, we assume that all
variables have been standardized to have unit variance by dividing by their standard deviations.18
Third, we consider a model with only one independent variable. Consider the bivariate regression:
(5)  Y = bX + e
where Y is the dependent variable, b is a regression coefficient, X is an independent variable, and e
is the error term consisting of all omitted variables. The dependent variable Y could be pauperism,
the independent variable X could be the out-relief ratio, and e could be omitted variables such as
wages, age distribution, type of area, moral climate, and so forth. Or Y could be pressure, X could
be volume, and e could be temperature, the amount of matter, or simple measurement error.
The standard OLS estimator b* for b is the ratio of the covariance of X and Y to the
variance of X: b* = Cov(X,Y)/Var(X). Because the variables have unit variance, b* is identical in
this case to the correlation coefficient of X and Y, b* = Cor(X,Y).19 A fundamental result in
statistics is that this estimator provides an unbiased estimate of the “true” value b if Cov(X,e) is
zero – if there is no covariance (or correlation) between the included independent variable X and
the omitted variables e. This assumption implies that none of the omitted variables are correlated
with the included variables. There are various names for this assumption. Econometric textbooks
(e.g., Greene, 1993) call it the specification assumption. Statisticians call it conditional
independence (e.g., Holland, 1986, p. 949). It is closely related to other conditions such as no
confounding and strong ignorability.20 It is always identified as a crucial assumption whose failure
can lead to very poor inferences.
What happens if it is not true? (For those not interested in a mathematical derivation,
please skip to the next paragraph and equation (9) to find the answer to this question.) If the
specification assumption fails, then the true b will be the solution to two equations formed by
taking variances on both sides of equation (5) and the covariance of both sides of (5) with X:21
Var(Y) = Var(bX + e) = b² Var(X) + 2 b Cov(X,e) + Var(e)
Cov(Y,X) = Cov(bX+e,X) = b Var(X) + Cov(X,e)
18. Thus, Var(Y) = 1 and Var(X) = 1.
19. As noted earlier, the Pearsonian correlation coefficient is Cor(X,Y) = Cov(X,Y)/[Var(X) Var(Y)]½.
20. The literature in this area is quite technical, with many different conditions and results. A recent
summary is in Stone, 1993.
21. This step may appear a bit mysterious to those unfamiliar with the algebra of variances and covariances, but it is nothing more than the application of several simple rules that are proved in elementary statistics courses. Namely, Var(X + Y) = Var(X) + 2 Cov(X,Y) + Var(Y); for a constant a, Var(aX) = a² Var(X); Cov(X + Y, Z) = Cov(X,Z) + Cov(Y,Z); and Cov(X,X) = Var(X).
Because the variables are standardized these equations reduce to:
(6)  1 = b² + 2 b Cov(X,e) + Var(e)
(7)  Cor(Y,X) = b + Cov(X,e)
In addition, we know by the definition of correlation that:
(8)  Cor(X,e) = Cov(X,e)/[Var(X) Var(e)]½ = Cov(X,e)/[Var(e)]½
Equations (6-8) are three equations in the five unknowns, b, Var(e), Cor(X,e), Cor(Y,X), and
Cov(X,e).
With some tedious algebra, including an application of the quadratic formula, two of the
unknowns can be eliminated to obtain an expression entirely in terms of the correlations of Y with
X and X with e.
(9)  b = Cor(Y,X) ± [1 - Cor²(Y,X)]½ Cor(X,e)/[1 - Cor²(X,e)]½
The first term on the right is the OLS estimator b* for b, and the second term shows how the true
value of the regression coefficient b for X departs from this estimate depending upon the degree to
which the specification assumption, Cor(X,e) = 0, fails to hold. As Cor(X,e) approaches plus or
minus one, the magnitude of b grows without bound. Only when Cor(X,e) is zero does the
specification assumption hold. Then the OLS coefficient is
the correct measure of the impact of X. Note that even if the observed correlation between Y and
X is zero, seemingly implying that X and Y are unrelated, the true value of b can be anywhere
from minus infinity to plus infinity.
This result shows why Cor(X,e) = 0 is called the specification assumption. Without it, the
estimated value of b can be wildly wrong. There is a mildly reassuring result in the literature which
shows that small departures from Cor(X,e) = 0 only cause small departures of b* from b (see
Wold, 1956, p. 43), but the simple truth is that big problems can occur when the specification
assumption fails.
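The size of this bias is easy to see in a small Monte Carlo. The sketch below is my own illustration (not Clogg and Haritou's), with an error term built to have Cov(X,e) = -0.5; per equation (7), the OLS slope then converges to b + Cov(X,e) rather than to the true b:

```python
# A minimal sketch (assumptions mine): omitted-variable bias in bivariate OLS.
# With standardized X and Cov(X,e) = -0.5, the OLS slope estimates
# b + Cov(X,e), not the true b.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
true_b = 1.0
cov_xe = -0.5                             # covariance of X with omitted factors

X = rng.standard_normal(n)
e = cov_xe * X + rng.standard_normal(n)   # error term correlated with X
Y = true_b * X + e

ols_b = np.cov(X, Y)[0, 1] / np.var(X)
print("true b:", true_b)
print("OLS b*: %.3f" % ols_b)             # close to true_b + cov_xe = 0.5
```

With a positively correlated omission the bias flips sign, and as Cor(X,e) approaches one in magnitude the estimate can be pushed arbitrarily far from b, as equation (9) indicates.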
Examples of How Regression Can Go Wrong – If Yule omitted some important factor,
such as moral climate, that affects pauperism and that is also correlated with the out-relief ratio, then his
estimate of the impact of out-relief on pauperism could have any sign or magnitude depending
upon the correlation between moral climate and out-relief. If Boyle had encountered a rather
unusual British day with temperatures ranging from -20 degrees Fahrenheit in the morning to 100
degrees in the midday and if he had set up his apparatus to deal with small volumes in the morning
and with large volumes in the middle of the day, then he would not have found that pressure times
volume equals a constant. Instead of obtaining a regression coefficient of minus one for the impact
of logged volume on logged pressure, he would have obtained a regression coefficient for volume
of about minus .82 – nowhere near his value of minus one.22 Luckily, the monotony, if not the
salubrity, of the British weather saved Boyle from missing his chance to have a law named after
him.
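Boyle's lucky escape can also be simulated. Everything below is hypothetical (two fixed temperatures, uniform volume ranges, and the ideal gas law standing in for his apparatus), but it shows how measuring small volumes on a cold morning and large volumes at warm midday biases the log-log slope away from minus one:

```python
# Sketch (hypothetical numbers, not Boyle's data): temperature confounded
# with volume biases the regression of log pressure on log volume.
import numpy as np

rng = np.random.default_rng(1)
n = 500
morning = np.arange(n) < n // 2
temp_K = np.where(morning, 244.0, 311.0)          # about -20 F vs. 100 F
volume = np.where(morning,
                  rng.uniform(1.0, 2.0, n),       # small vessels, cold
                  rng.uniform(4.0, 8.0, n))       # large vessels, hot
pressure = temp_K / volume                        # ideal gas, nR folded into units

logV, logP = np.log(volume), np.log(pressure)
biased = np.cov(logV, logP)[0, 1] / np.var(logV)
print("slope omitting temperature: %.2f" % biased)    # well above -1

within = np.cov(logV[morning], logP[morning])[0, 1] / np.var(logV[morning])
print("slope at a fixed temperature: %.2f" % within)  # about -1.00
```

Adding logged temperature as a second regressor, or holding temperature fixed as in the last two lines, recovers Boyle's slope of minus one.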
These results are theoretically disquieting, but perhaps we typically obtain similar
regression coefficients across many different data sets which would imply that the theoretical
problem is empirically trivial. Unfortunately, changes in magnitudes and even the signs of
coefficients are not unusual from one data set to the next (remember Freedman’s critique of Yule’s
work), but even if coefficients remained the same, stability of regression coefficients across data
sets could be misleading. If the same specification is used across these data sets and if the same
omitted variables are operating in the same way across them, then stable results provide little
evidence for a correct specification. If Boyle consistently sets up his apparatus to study small
volumes in the morning and large volumes in the midday and if the British weather continues to
have such extremes, then he will get the same result time after time. If the weather changes, which
is perhaps the best possible outcome for his research, his regression coefficients will become
unstable, suggesting that something is amiss. In fact, the oft-noted instability of regression
coefficients provides strong evidence that the specification is incorrect, but stability does not
necessarily indicate that the specification is correct unless there are reasons to believe that plausible
confounders have varied across the data sets without changing the regression results.
Another way to think about the problem is this. Regressions in observational studies tell us
about the mean values of Y when we select a set of subjects with characteristics X from a given
population. If we repeatedly select subjects in the same way from this population, the regression
equation will typically provide a similar result. Then it is easy, for example, to assume that, because pauperism is associated with high levels of out-relief, a change in out-relief will decrease
pauperism. But regression cannot guarantee this unless the specification assumption holds, and the
specification assumption amounts to being able to say that when X is changed, all other things
equal, then Y changes.
Freedman (1997, p. 118) puts it this way. There is a substantial difference between the
following two procedures:
“Procedure #1. Select subjects with X=x; look at the average of their Y’s.
Procedure #2. Intervene and set X=x for some subjects; look at the average of their Y’s.
The first involves the data set as you find it. The second involves an intervention.
(Emphasis added)”
22. The problem, of course, is that Boyle would have omitted an important variable, temperature, that was
highly correlated with his changes in volume. A regression of logged pressure on logged volume and logged
temperature would give the correct result.
Does this mean that only experimental studies that intervene while controlling all other factors can
provide good inferences and that observational studies can never do so? This is too pessimistic,
and Freedman’s terminology hints at part of the solution. Freedman distinguishes between what
we might call “selection studies” and “intervention studies.” This distinction is not the same as that
between observational and experimental studies. Experiments are invariably interventions, but
observational studies are not always merely selection studies. Many observational studies
consider, or adventitiously encounter, interventions such as new social programs, political
campaigns, or efforts to change behavior. Under some circumstances, these studies can lead to
useful inferences. In order to understand these circumstances, however, it is useful to see how
experimental studies solve the inference problem.
Developing the Logic of Causal Inference in Experimental Studies
Modern randomized experiments come out of the long tradition of agricultural research in which
experimenters addressed practical questions about varieties of seed, methods of ploughing, types
of fertilizer, and methods of planting. The first scientific field experiments probably occurred at
Rothamsted in England in 1839 (Wishart, 1934, p. 26), although Cochran (1976) discusses a
surprisingly modern approach that appeared in 1771 as a three-volume work, A Course of
Experimental Agriculture, published by Arthur Young. Young distrusted single trials because of
uncontrollable variability in the outcomes, and he insisted that experiments must be comparative, thereby identifying the two most important ideas of modern experimental methods.
Variability in Experimental Studies – Young’s work and the experiments at Rothamsted
revealed problems that were different from those faced by physical scientists who could often
control their experimental circumstances. Too many factors – the weather, the natural fertility of
the soil, drainage, and bird and insect damage – could not be controlled in agricultural research. In
an effort to test for the possible impacts of these uncontrolled factors, the early experiments at
Rothamsted placed “control” plots at opposite ends of the field to check for variations in soil
fertility. “If the control plots differed little in their final yield it was held to demonstrate that the
area was satisfactorily uniform, and could continue to be used with confidence (Wishart, 1934, p.
26).” But this strategy for control could be easily confounded if soil fertility increased (or
decreased) from each end of the field towards the middle, which contained the experimental plots.
Even though more sophisticated methods of control were developed (Student, 1909, 1923), none
was entirely satisfactory until R. A. Fisher (1890-1962) developed a new conception of a field
experiment. Fisher’s methods were based upon a new statistical technique, the analysis of
variance, and they involved an element of randomization in which experimental treatments were
assigned randomly to plots.
Fisher’s methods provided a way to deal with the inherent heterogeneity in the social and
biological worlds. This heterogeneity led to two problems that the observational approaches, even
the clever techniques proposed by Yule, had not solved. First, there had to be some definition of
what was meant by a treatment effect. If fertilizers were meant to increase yields, then how was
one to define the yield of a plot with and without fertilizer? In a physical experiment, such as tests
of Hooke’s law of the stretching of springs by weights, the deviation of the spring before and after
attaching a weight could be observed. The difference would be the net treatment effect. To
check that the effect was due to the weight, the spring’s deviation after the weight had been
removed could also be recorded. Alternatively, the impact of the weight could be compared with
two identical springs – one with a weight and one without. There was no such obvious strategy
with agricultural experiments. Experience showed that even if a fertilizer were generally regarded
as beneficial, it was possible for a plot to yield less with it than without it. The problem, of course,
is the heterogeneity of plots over space and over time and the difficulty of defining an effect under
these circumstances. The second problem was finding a way to control for this heterogeneity. The
two problems are linked, but they are different, and it is remarkable that they seemed to have been
solved by two different people, although each person might rightly claim to have perceived both
problems and to have offered solutions to them.
Defining the Impact of a Treatment – Jerzy Neyman (1894-1981) solved the problem
of defining the impact of a treatment in an article, taken from his doctoral dissertation, that was
published in a Polish journal in 1923.23 Neyman considered agricultural experiments in which the
yield on a field from one variety of a crop is compared with the yield from another variety.
Perhaps his most significant contribution in this paper was a formal notation that clarified the
nature of the inference problem. Neyman specified the outcome Y (e.g., the crop yield) of the
experiment for a unit u (e.g., a plot of land) for each possible condition i (e.g., varieties of a crop).
Today, we might think of the conditions as treatment i=t and control i=c. Thus, Yt(u) is the
outcome for unit u when it gets the treatment t, and Yc(u) is the outcome for unit u when it is in the
control condition c.
Figure 5 describes Neyman’s setup. He envisioned a situation with a large number of
different units, such as plots in a field, taken from the same population such as a field or a farm.
For the sake of pictorial economy, we list just four units in the four rows of Figure 5, but there
would typically be many more units. The possible outcomes for each of these units are listed in the
second and third columns. These outcomes depend upon the condition assigned to the unit. If the
unit gets the treatment, then its outcome is described by the value in the second column, Yt(u). If
the unit gets the control condition, then its outcome is described by the value in the third column,
Yc(u). The obvious measure of the impact of the treatment compared to the control condition for a
plot u is the difference between the outcome for u in the treatment condition and in the control
condition, Yt(u) and Yc(u). But the variability in social and biological phenomena might mean that
this difference, Yt(u) - Yc(u), will not be representative of the average effect of the treatment
compared to the control in the reference population because the reaction of one plot in a field or
one person in a group to a treatment is unlikely to be representative of the entire field or group.
23. The history of this article is interesting. Partly because it was published in Polish, it was little noticed,
except for a few citations, before 1990. In that year, an English translation of part of it led to a recognition of its
relationship to the work of Rubin (1974, 1978) who had extended the approach 50 years later (Rubin, 1990)
without knowing about Neyman’s paper. My discussion draws upon Rubin, 1978 and Holland, 1986 which
provide a modern version of Neyman’s model along with an extension to observational studies.
Another problem is that the difference cannot be computed because, for any experiment,
only one of Yt(u) and Yc(u) can be observed. Either the unit gets the treatment, or it gets the
control condition. It cannot get both. If it gets the treatment, then Yt(u) is observed. If it gets the
control condition, then Yc(u) is observed. For each unit only one of the two outcomes, Yt(u) or
Yc(u), will be, in fact, observed. The value of the other outcome is a counterfactual observation –
it can only be observed in a state of the world that does not occur. Yet having this observation
seems crucial for evaluating an experiment. Indeed, philosophers have made counterfactuals the
centerpiece of their explanations of causal inference (Goodman, 1983, Menzies, 2000).
Causal statements make assertions about what would have happened if the cause had not
been present. For example, Boyle’s law asserts that if the pressure had not been increased in
Boyle’s experiments, then the volume would not have decreased. Similarly, according to Yule, his
regression analysis shows that if out-relief had not increased in some districts, then pauperism
would not have increased as well. Both of these assertions involve counterfactual statements. In
neither of these cases, could the researcher actually compare what happened to the same unit with
and without the cause present. They had to find some other way to establish their causal
argument.
Both Boyle and Yule had to overcome the difficulty of not being able to observe a
counterfactual, but Boyle faced a much easier problem. The much lower variability in Yt(u) and
Yc(u) across units in his experiments, the ability to control other factors that might affect the
outcomes, and the homogeneity of his units (a large number of gas molecules) meant that
convincing comparisons could be made that substituted for knowing the counterfactual outcome.
In Boyle’s experiments, for example, he could use his apparatus to increase pressure and to then
decrease it within such a short period of time that very little else could change (such as the
temperature). As a result, he could, with some degree of confidence, compare different rows of
Figure 5. For example, if in row one the unit of air was in the control condition (say, “low
pressure”) with volume Yc(1) and in row two the unit was in the treatment condition (say, “high
pressure”), with volume Yt(2), then the difference Yt(2) - Yc(1), could be considered the change in
volume with the change in pressure. This difference is not at all the same as Yt(1) - Yc(1) which
involves a counterfactual, and the comparison Yt(2) - Yc(1) runs the risk of confounding, but Boyle
could make the conditions for unit 1 so close to those for unit 2 that the comparison seems
acceptable. Moreover, if there were concerns about the validity of this comparison, then the same
experiment could be tried again in rows three and four with nearly identical results. Of course, this
strategy might fail if every time Boyle increased the pressure, his physical activity heated up the
room so as to confound the result, but the almost perfect relationship between pressure and
volume that he obtained (see Figure 1) suggested that other factors would have to be working
rather artfully to confound him.
Boyle’s success depended upon a number of factors, but the most important one was that
he could limit the variability in outcomes. This suggests that finding a way to control the
variability in the outcomes of social and biological experiments might make comparisons possible
that would substitute for having counterfactual information. Neyman’s contribution was to find a
way to control this variability. His method was to take the average of the impact of the treatment
condition over all units and to compare this with the average over all units of the impact of the
control condition. Neyman defined the average outcome for the treatment group as the average of
the outcomes Yt(u) in the second column, or Yt* = Σu=1,...,4 Yt(u)/4, and he defined the average yield for the control group as the average of the Yc(u) in the third column, or Yc* = Σu=1,...,4 Yc(u)/4.
Obviously, the impact of the treatment compared to the control is simply the difference between
these two, or Yt* - Yc*. At first blush, this does not seem to advance the situation very much
because it involves averages which include some counterfactual information, but we shall see that
this approach is the first step towards finding a solution to the problem of only being able to
observe the impact of the treatment or the control condition for each plot.
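Neyman's bookkeeping is simple enough to write down directly. The four units below carry hypothetical potential outcomes (Figure 5's actual entries are not reproduced here):

```python
# Sketch of Neyman's potential-outcomes table (hypothetical values).
Y_t = [10.0, 9.0, 4.0, 3.0]   # Yt(u): outcome of unit u under treatment
Y_c = [8.0, 8.0, 3.0, 1.0]    # Yc(u): outcome of unit u under control

# Unit-level effects Yt(u) - Yc(u): definable on paper, never both observable.
unit_effects = [t - c for t, c in zip(Y_t, Y_c)]

# Neyman's target is the difference of the averages, Yt* - Yc*.
Yt_star = sum(Y_t) / len(Y_t)
Yc_star = sum(Y_c) / len(Y_c)
print("unit effects:", unit_effects)                    # [2.0, 1.0, 1.0, 2.0]
print("average effect Yt* - Yc*:", Yt_star - Yc_star)   # 1.5
```

The unit-level effects vary across units; the averages Yt* and Yc* are what randomization will later allow us to estimate.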
One of the remarkable features of Neyman’s approach is that it deals with heterogeneity in
a new way and it makes very minimal assumptions about the values of Yt(u) and Yc(u). Unlike
Quetelet, Galton, and others who thought that comparisons could only be made if the treatment
and control plots demonstrated some homogeneity, which they typically inferred was present when
the outcomes from each type of plot followed the normal distribution, Neyman required no such
thing. He states, complete with his own emphasis, that there is a misunderstanding “that
probability theory can be applied to solve problems similar to the one discussed only if the yields
from the different plots follow the Gaussian [normal] law” (p. 468). But “consistency with the
law of random errors [the normal law] should not justify a framework which is based on an
assumption of independence of the measurements” and “it is enough to assume that our
measurements are independent, and for that we need a large number of plots on the field” (p. 468).
For Neyman, the distribution of yields over a set of plots could be virtually anything
depending upon the physical features of the field, and he made no assumptions about the
distribution. In fact, Neyman conceived of his “experiment” very abstractly as the problem of
drawing balls from two urns, one for the treatment condition and one for the control condition.
Each urn has as many balls as plots and each ball is inscribed with the number of the plot and its
outcome under the condition. The outcomes in an urn could have any distribution including a
multimodal one if some plots are especially fertile, some only moderately so, and others not at all
fertile. The average of the outcomes written on the balls for each urn are equal to the averages Yt*
and Yc* described above. These averages help to reduce variability. But how can the experimenter
estimate them when they involve counterfactual observations?
Using Randomization to Get Estimates of Effects – Assume that the experimenter
chooses an equal number of balls from each urn so that there is the same number of plots in the
treatment and the control condition. The urns have the property that if a ball is taken from one of
them, then the ball having the same plot number in the other urn disappears (Neyman, p. 467).
This assumption, of course, is the requirement that only one possible world can be realized. Each
plot either gets the treatment or control condition. It is at this point that Neyman gets
tantalizingly close to the idea of random assignment. By choosing balls at random, Neyman is
essentially assigning treatment and control conditions randomly, but he never mentions the physical
act of randomization.24 Neyman seems to anticipate the random assignment of plots to treatment
and control conditions, but his paper never makes that suggestion. Instead, he offers a thoroughly
worked out “thought experiment.” In his 1926 paper, Fisher makes the suggestion explicit: “One
way of making sure that a valid estimate of error will be obtained is to arrange the plots
deliberately at random, so that no distinction can creep in between pairs of plots treated alike and
pairs treated differently (p.507).”
Although Fisher made the idea explicit, Neyman’s justification for it is clearer. By randomly
assigning treatment and control conditions, Neyman obtains random samples of Yt(u) and Yc(u) that
can be used to calculate good estimates of Yt* and Yc* if the number of plots is large enough. Both
random assignment and the large number of plots matter for this argument. To demonstrate this,
we begin with the example of four plots in Figure 5. The last six columns show all the possible
ways that experiments can be set up in which two plots receive the treatment condition and two
receive the control condition. There are six such ways for this to happen with four plots, two
conditions, and the requirement of equal numbers of plots for each condition. We call the six
columns “states-of-the-world” because they are mutually exclusive and exhaustive ways that the
world might look after the assignment of treatment and control conditions.
For concreteness, suppose that the first two plots (u=1,2) are very fertile and the second
two (u=3,4) are not. If design A were chosen by the experimenter, then even if the treatment has
no effect, the average for the first two plots will be larger than the average for the second two plots
simply because of their greater fertility. Even if the number of plots is increased without limit by
reproducing equal numbers of the first type of plots (“high fertility”) and the second type (“low
fertility”), the experiment will still give the wrong answer.
If plots are assigned at random, situation A will still occur one-sixth of the time, but other
situations will also occur, including four (B,C,E,F) in which fertile and infertile plots are evenly
divided between control and treatment groups and one (D) in which the control condition is
favored. If the number of plots is again increased in the way described above and each plot is
randomly assigned a condition, then the chance of randomly getting a state of the world favorable
to the treatment decreases still further, and in a way that can be easily calculated. As a result, it is possible to say how likely it is that an observed difference is due to chance rather than to some real effect. In short,
randomization ensures that variability will be averaged out and the true impact of the treatment
versus the control will be estimated. As the number of plots increases, the law of large numbers
averages out the variability and ensures that the estimates will be close to the actual values.25
24. Rubin says: “I am in full agreement with Scheffe’s (1956) description of Neyman’s mathematical model as corresponding to the completely randomized experiment, and I also agree with Dabrowska and Speed [1990] that the explicit suggestion to use the urn model to physically assign varieties to plots is absent” (p. 477).
25. Here Neyman was not as careful as a modern scholar. If the number of plots increases, then some assumption has to be made about how the Yt(u) and Yc(u) change as well. If, for example, the initial plots are
24
Hence, the difference of the estimates is a good estimate of the impact of the treatment versus the
control condition.
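The plot example can be made concrete with a short simulation. This is a sketch with made-up yield numbers (the fertile/infertile values of 10.0 and 2.0 are illustrative, not from the paper): it shows that design A is biased even with no true effect, while the average over all six equally likely random assignments is zero.

```python
import itertools
import statistics

# Hypothetical yields under NO treatment effect: plots 1-2 are fertile,
# plots 3-4 are not, and the outcome is the same under either condition.
yield_of = {1: 10.0, 2: 10.0, 3: 2.0, 4: 2.0}
plots = [1, 2, 3, 4]

# All ways of giving the treatment to exactly two of four plots:
# C(4,2) = 6 states of the world (A..F in Figure 5), so design A
# occurs one-sixth of the time under random assignment.
states = list(itertools.combinations(plots, 2))
assert len(states) == 6

# Design A: the fertile plots get the treatment, so the difference in
# means is large even though the treatment does nothing.
treated = {1, 2}
diff_A = (statistics.mean(yield_of[u] for u in treated)
          - statistics.mean(yield_of[u] for u in plots if u not in treated))
print("design A difference:", diff_A)  # 8.0, pure fertility bias

# Random assignment: averaged over the six equally likely states, the
# treatment-minus-control difference is exactly zero.
diffs = []
for state in states:
    t = statistics.mean(yield_of[u] for u in state)
    c = statistics.mean(yield_of[u] for u in plots if u not in state)
    diffs.append(t - c)
print("average over all random assignments:", statistics.mean(diffs))  # 0.0
```

The six differences (8, 0, 0, 0, 0, -8) also show how the spread across states lets one calculate how likely a given observed difference is under pure chance.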
Another way to think of this randomization is that it selects a random subset of observations
from all possible observations of the impacts of the treatment and control conditions. The
outcomes of all the possible observations from the treatment condition can be written as:
(10)    Yt(u) = Yt* + et(u),
where et(u) represents Yt(u)’s deviation from the mean Yt*. This equation says that the outcome for
any specific unit is equal to the average impact of the treatment plus some deviation from that
average. If there are N units, then there are N equations like (10). The average of all N of these
deviations et(u) will be zero across all units by the definition of the mean. If half the units are
randomly assigned the treatment, then the law of large numbers implies that the average of the (N/2)
deviations et(u) for the observed units will get closer and closer to zero as (N/2) increases. If the
treatment condition is not assigned randomly, then this will not necessarily be true. For example, if
all the plots with high yields are in the treatment condition, then all the observed outcomes for the
treatment condition will have positive values of the deviations et(u).
Similarly, all the possible observations for the control condition can be written as:

(11)    Yc(u) = Yc* + ec(u),
where ec(u) represents Yc(u)’s deviation from the mean Yc*. The average of all ec(u) will be zero
across all units by the definition of the mean, and the average of the deviations ec(u) for the (N/2)
observed units will be approximately zero if the treatment condition is assigned randomly and if
there are many units.
With this notation, we can show that randomized experiments satisfy the specification
condition described earlier. We can rewrite (10) by simply adding and subtracting Yc*:

(12)    Yt(u) = Yc* + (Yt* - Yc*) + et(u)
Remember that (11) involves N possible observations and (12) involves N more possible
observations. Assigning treatment and control conditions amounts to choosing one of the columns,
that is, one of the states of the world (A, B, C, D, E, F) in Figure 5. When this is done, half of the
possible observations become impossible.

25 (continued). ...located at Rothamsted, the next plots are in the arctic, those after that are in greenhouses, and so forth, then the law of large numbers will be unable to provide a stable estimate of average outcomes because the mean values will be jumping around. Thus, implicit in Neyman’s model is some understanding that the outcomes from the new plots look like the outcomes from the old plots. A sufficient condition is that all plots are taken from the same population (for which average outcomes exist for both treatment and control conditions).

Then, (11) and (12) can be written more economically
for each value of u by defining a variable X(u) that equals one if the unit is in the treatment
group (i=t) and zero if the unit is in the control group (i=c). By doing this, all counterfactual
observations will be dropped from the new equation, and only actual observations will be included.
Because we have designed the experiment so that there are equal numbers of units in each
condition, there are N/2 units from (11) that will be assigned to the control condition, and there are
N/2 units from (12) that will be assigned to the treatment condition. All the actual observations can
be written as:
(13)    Y(u) = Yc* + (Yt* - Yc*)X(u) + et(u)X(u) + ec(u)(1 - X(u))
             = a + b X(u) + e(u),

where a = Yc*, b = (Yt* - Yc*), and e(u) = et(u)X(u) + ec(u)(1 - X(u)). Note that the “t” or “c”
subscript has been dropped, and there will be N of these equations, one for each observation. The
intercept a is a measure of the average impact of the control condition and the slope b is a measure
of the net average impact of the treatment over the control condition. The error term e(u) is either
the error et(u) if X=1 or ec(u) if X=0. This equation has the form of a regression of the observed
values Y(u) on the observed values of X(u).
We know from the discussion above that the OLS estimate of the slope b will be unbiased if
Cov(X,e) = 0. Does randomization ensure this result in (13)? It is easy to show that without
randomization difficulties might arise. If the fertile plots are assigned to the treatment and the
infertile ones to the control condition, then when X=1 for the treatment plots, the values of e(u) will
be highly positive values of et(u) and when X=0 for the control plots, the values of e(u) will be
highly negative values of ec(u). Obviously, Cov(X,e) will be very positive because a value of X=1
will be associated with high values of e and a value of X=0 will be associated with low values of e.
We know from our previous discussion that this will lead to an upward bias in the OLS estimate of
the true impact of the treatment. The treatment will appear to work because it has been assigned to
the fertile plots. The reverse will occur if fertile plots are assigned to the control condition and
infertile ones to the treatment plots. But when there is randomization, we know that the average
value of the errors will be zero when X=1 and zero when X=0. Hence, the covariance of X and e
must be zero, and OLS will give an unbiased estimate of the average net effect. Randomized
experiments automatically satisfy the specification assumption.26
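The contrast can be illustrated with a small simulation (the numbers are assumptions of the sketch, not the paper's: a constant treatment effect of 2.0 and normally distributed fertility). Under random assignment the OLS slope recovers b; assigning the treatment to the fertile plots makes Cov(X, e) positive and inflates the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical potential outcomes: the control outcome Yc(u) varies with
# plot fertility, and the treatment adds a constant 2.0, so the true
# slope is b = Yt* - Yc* = 2.0.
N = 1000
fertility = rng.normal(0.0, 3.0, N)
Yc = 5.0 + fertility          # outcome under control
Yt = Yc + 2.0                 # outcome under treatment

def ols_slope(X, Y):
    # Slope from regressing Y on X with an intercept: Cov(X,Y)/Var(X).
    return np.cov(X, Y, bias=True)[0, 1] / np.var(X)

# Randomized assignment: X is independent of the deviations e(u), so
# Cov(X, e) is approximately zero and OLS recovers the true effect.
X_rand = rng.permutation(np.r_[np.ones(N // 2), np.zeros(N // 2)])
b_rand = ols_slope(X_rand, np.where(X_rand == 1, Yt, Yc))

# "Fertile plots get the treatment": now Cov(X, e) > 0 and OLS badly
# overstates the effect.
X_fert = (fertility > np.median(fertility)).astype(float)
b_fert = ols_slope(X_fert, np.where(X_fert == 1, Yt, Yc))

print(b_rand)   # close to the true effect of 2.0
print(b_fert)   # far above 2.0
```

With a binary X, the OLS slope is just the treatment-minus-control difference in means, so this is the same calculation as in the plot example.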
26. Two things will actually be true. The first has to do with a thought experiment in which the experiment is done repeatedly and the average result over all such experiments is considered. The expected values of the errors for each condition in these circumstances will be zero [E(e(u)|X=1) = 0 and E(e(u)|X=0) = 0] because randomization makes each of the six states of the world in Figure 5 equally likely, and the average of the errors for a given condition across the states of the world for all the units in Figure 5 will be zero. This condition ensures that the estimator of b, the net treatment impact, will be statistically unbiased no matter how many units are considered. The second has to do with a thought experiment in which the number of units increases without limit. This will increase the length of each column in Figure 5, and it will “split up” the existing columns into many sub-columns as different randomizations occur for the new units. Thus, adding one new unit will cause state of the
Problems with Experiments – Randomized experiments are classic intervention studies in
which units – plots or people – are assigned some value of X, and then the average impact of X is
observed. The fact that conditional independence is automatically satisfied with experiments makes
them very attractive. Why, then, don’t we do them more often? Probably the biggest reason is that
they are hard to do, especially in the social sciences. The limitations are both practical and ethical.
The practical limitations are like those that held up the first experimental test of Newton’s
theories of orbiting satellites until almost 300 years after their formulation: Sputnik was expensive
and complicated, and so are most social and biological experiments. The ethical limitations have to
do with the unacceptability of randomly assigning people to families, political parties, or guerilla
groups. But there are other limitations as well.
Consider, for example, what would happen if a “Boyle’s Law” experiment varied T and
measured P, but the apparatus was set up, unbeknownst to the investigator, so that V would adjust.
If V adjusted enough, then P might not vary at all. Alternatively, P might vary somewhat as T was
adjusted, but V might vary as well. If only P and T were being measured, then the experiment might
grossly underestimate the possible impacts of T. There are no violations of physical laws here, and
there is no failure of the experimental method. The method would be giving a true and correct
rendition of what occurs under the experimental circumstances which just happen to allow V to
vary. One of the problems, then, with experiments is that they only tell us what happens under the
conditions that happen to exist in the experiment.
Now consider an experiment which tries to increase the employment P of people by
providing them with some training T requiring substantial study and reading. Suppose that V
measures the violence of this group of people, and suppose that the treatment affects some subjects
by causing them to become employed (increasing P) but it affects others by getting them frustrated
and more violent (increasing V) because they cannot seem to learn. In fact, suppose that for any
individual, the relationship among these three variables is exactly the same as the gas law so that PV
is always proportional to T, but for some people only P can vary, for others only V can vary, and for
still others both can vary but to a different extent depending upon the person. Under these
circumstances, the treatment, T, could lead to highly variable employment outcomes depending
upon the mix of people in the program. In some instances the program might seem, on average, to
get people employed, and in others it might seem to harm them by decreasing their employment
possibilities. The inferences in each case would be correct for the population that was studied, but
it would be wrong to generalize from it. In fact, if the experimenter had also measured V, then it
would become clear that there was a structural, lawlike relationship among P, V, and T.
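A toy calculation makes the point. Assuming the stylized relation PV = kT above, with everyone starting at P = V = T = 1 (all values invented for illustration), the same lawlike micro-relation produces very different average "program effects" depending on the mix of P-responders and V-responders in the population.

```python
# Each person obeys the gas-law-like relation P * V = k * T. For a
# "P-responder" V is fixed and extra training T raises employment P;
# for a "V-responder" P is fixed and the same law forces V
# (violence/frustration) up instead.
def respond(P, V, T_new, k=1.0, mode="P"):
    # Hold one variable fixed and let the other absorb the change in T.
    if mode == "P":
        return (k * T_new / V, V)   # employment adjusts
    return (P, k * T_new / P)       # violence adjusts

# Everyone starts at P = V = T = 1 (so k = 1) and training doubles T.
def avg_employment_gain(share_P_responders, n=100):
    n_p = int(n * share_P_responders)
    gains = []
    for i in range(n):
        mode = "P" if i < n_p else "V"
        P_new, _ = respond(1.0, 1.0, 2.0, mode=mode)
        gains.append(P_new - 1.0)
    return sum(gains) / n

print(avg_employment_gain(0.9))  # job-ready mix: large average gain
print(avg_employment_gain(0.1))  # less job-ready mix: small average gain
```

Both averages are correct for the population studied; generalizing from either one alone would miss the underlying structural relationship among P, V, and T.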
26 (continued). ...world A to split into two possibilities – one in which the new unit gets the treatment condition and the other in which the new unit gets the control condition. In this situation, by the law of large numbers, the average of the errors for a given condition down each column in Figure 5 will be more and more likely to be close to zero as the number of units increases. This condition ensures that the estimator of b will be statistically consistent as the number of units increases.
This example might seem far-fetched, but American social policy has already generalized
from a series of experiments that might have been misleading in just this way. Through a series of
experiments in California, welfare researchers concluded that a “work-first” welfare program was
much better than a “training-first” program. The work-first program was based upon job
attachment theory which presumes that welfare recipients have the skills to be good workers, but
they have forgotten (or never learned) the habits required to get jobs. Consequently, getting
welfare recipients tied to jobs is the best way to move them out of poverty. Job attachment theory
has very different implications from human capital theory, which supposes that welfare recipients
lack the skills to be good workers. According to this theory, “job training” is needed to provide
welfare recipients with the skills they need to get jobs. Recent work by Hotz, Klerman, and
Imbens (19xx) suggests that work-first programs did well because they were implemented in
counties where jobs were plentiful and the welfare recipients had relatively high education and past
job experience, but if these same programs had been implemented in counties with a less job-ready
population, then work-first might not have been so successful. Indeed, it might just frustrate
welfare recipients who try to get jobs even though they are not suited for them.
There are other problems with experiments as well. Heckman (1992) has argued that
randomization may affect participation decisions so that the people who get involved in a
randomized experiment may differ from those who would get involved in a full-scale program. The
assumption that there is no effect of randomization on participation decisions “is not controversial
in the context of randomized agricultural experimentation” (p. 227), which is where Fisher’s
experimental model was developed. This model is the intellectual basis for modern social
experiments, but it may require some modification with human subjects. Heckman also argues that
experiments are an “inflexible vehicle for predicting outcomes in environments different from those
used to conduct the experiment” (p. 227).
Nevertheless, randomized experiments are still the gold standard for making valid
inferences. LaLonde (1986) and Fraker and Maynard (1987) have shown that when experimental
data are analyzed using standard observational methods, the results are quite different, and there are
good reasons to believe that the experimental results are much more trustworthy. Although
critics (Heckman and Hotz, 1989), using better observational methods, provide evidence that
“tempers the recent pessimism about nonexperimental evaluation procedures” (p. 863),
experimental methods still seem to be the most reliable way to make valid inferences.
Doing Observational Studies
Where does this leave observational studies? One, rather weak, answer is that we have no
choice but to use them to answer many questions for which experiments are either impractical or
unethical. A better answer is that observational studies can still produce reliable inferences if we are
careful to consider alternative causes and to rule out competing explanations. The basic tool for
this is disciplined comparisons where we try to find as many ways as possible to compare one
situation with another in order to rule out competing explanations. We can offer six kinds of tools
for improving this process. They range from better theory (models that provide mechanisms and
explanations), through better research design (thinking about the inference problem and employing
natural experiments and matching), to improved model-building (better model selection, more
concern with model uncertainty, and model replication through the use of multiple data-sets).
1. Better Theory: Models that provide mechanisms and explanations – One of the major
flaws in many observational studies (and experiments as well) is that there is often very little theory
to help guide the inferential task. Yet, most observational studies must make a passel of
assumptions – what variables to include, the functional form of relationships, the way error enters
the model – that can affect the ultimate inference. One of the best things that social scientists can
do is to develop better models that will provide guidance about these decisions. These models
should pay special attention to the “social mechanisms” (Hedstrom and Swedberg, 1998) that
generate and explain events. At the simplest level, this means that researchers should not be happy
with regression “models” that simply throw variables into a regression. It is nowhere near enough
to know that job training programs increase the work effort of welfare recipients, that the
possession of civic skills increases political participation, or that proportional voting systems
increase the number of political parties. Researchers must seek to understand the exact mechanisms
by which training increases work effort, civic skills increase participation, and proportional voting
methods lead to more parties. These mechanisms must include detailed descriptions of the decision-making problem facing individuals and the way that they solve this problem. For example, if some
candidates in American presidential primaries gain “momentum” from winning early primaries
(Bartels, 1988), then models of the individual level processes (e.g., increased name recognition or
strategic voting) that might lead to momentum should be developed (Brady, 1996) and experiments
should be undertaken to see whether these processes actually occur (Brady, 19xx).
Social science theories should seek to explain social phenomena in the same way that the
Maxwell-Boltzmann theory of gases explains the regularities of the gas laws by developing a micro-theory that unifies seemingly disparate phenomena – pressure and temperature – through the
concept of energy and Newton’s laws. The Maxwell-Boltzmann theory postulates a large number
of individual molecules that rush about at random with varying speeds and whose average speed is
affected by the amount of energy in the system. This theory implies that if the temperature of a gas
in a container is increased, the molecules become more energetic and their average speed increases,
thus increasing their momentum and the pressure they exert when they hit the sides of the vessel
that contains them. If the volume of the container gets smaller, then the number of collisions
between these particles and the wall of the container increases, also increasing the pressure. The
Maxwell-Boltzmann theory presents a unified way to understand the gas law, and it eliminates what
Clark Glymour calls contingency. The theory makes it clear that pressure must, as a consequence
of Newton’s laws, increase with temperature and decrease with volume (Glymour, 1980).
Explanations like these not only improve our ability to make inferences, they also provide additional
reasons to believe a theory.
This call for better theory may seem utopian, but social scientists have developed theories
that help guide the research process. Political scientists, for example, have gained considerable
understanding of electoral systems not only through detailed empirical studies (Lijphart, 1994) but
also through sophisticated modeling (Cox, 1997) which helps to explain empirical regularities.
Sociologists and others have developed theories of mass political behavior (Lichbach, 1995, 1996)
which explain the actions of rebels and cooperators alike. Economic theory provides guidance
about both macro-economic and micro-economic phenomena.
2. Better Research Design: Thinking about the inference problem – Researchers can never
worry enough about the validity of their inferences. In his book on How Experiments End (1987),
Peter Galison argues that experiments end when researchers believe they have a result that will
stand up in court because they cannot think of any credible ways to make it go away. A lot of the
work of inference is trying to think of ways to make results go away, and researchers should think
hard about this before, during, and after a study. Researchers who never experience sleepless nights
worrying “what if I am wrong?” should probably rethink their research strategies.
General frameworks for thinking about inference can help to generate lists of generic threats
to inference. Fisher’s classic The Design of Experiments (1935) is all about setting up experiments
in ways that will make the results stand up in court. The classic handbook for observational studies,
Campbell and Stanley’s Experimental and Quasi-Experimental Designs for Research (1966, see
also Cook and Campbell, 1979), provides an extraordinarily fertile list of threats to validity for
many different kinds of research designs. All researchers should be familiar with the Campbell-Stanley-Cook lists.
In the past 25 years, Rubin and his collaborators (Rubin, 1974, 1978, 1990; Holland and
Rubin, 1988; Holland, 1986; Rosenbaum and Rubin, 1983) have developed an elegant
generalization of the Neyman framework for inference that covers experiments and observational
studies. The central focus of this work has been a careful explication of the assignment or selection
method (see also Heckman, 1978, Heckman and Robb, 1985). This framework has led to concrete
methods for improving causal inference such as the use of propensity scores for matching
(Rosenbaum and Rubin, 1983) and the analysis of the conditions under which path modeling can be
successful (Holland, 1988). Every empirical researcher should become familiar with this
framework. Manski (1991, 1995) has explored what can be inferred from observational studies
when there are problems of extrapolation, selection, simultaneity, mixing, or reflection. An
understanding of these generic problems should also be part of every researcher’s tool-kit.
Familiarity with this literature should enable researchers to develop better designs for their
research which control for some of the major threats to valid inferences. Time-series studies, for
example, make it possible to determine whether putative causes really change before their supposed
effects. Time-series cross-sectional studies add the possibility of comparing across different units to
see if the same results occur. Life-history data provide more and better controls for individual
differences. None of these designs is foolproof, but they can provide some confidence that major
sources of confounding have been controlled.
3. Better Research Design: Employing natural experiments and using matching – Partly as
a result of doubts about observational studies, researchers have increasingly looked for “natural
experiments” in which essentially random events provide some inferential leverage akin to
randomized experiments. For example, the Vietnam War draft lottery randomly selected some
people for military service, which makes it possible to determine the impact of military service on
future earnings (Angrist, 1990), and miscarriages occur almost at random, so they can be used
to study the consequences of teenage childbearing on mothers’ incomes (Hotz, Mullin, and Sanders,
1997). This approach has been used to determine the consequences of workers’ compensation on
injury duration (Meyer, Viscusi, and Durbin, 1995), past disenfranchisement on future voting
(Firebaugh and Chen, 1995), uncertainty on decision-making (Metrick, 1995), ballot form on voting
choices (Wand, Shotts, Sekhon, Mebane, Herron, and Brady, 2001), the minimum wage on total
employment (Card and Krueger, 1994), the number of children in a family on their future life
prospects (Rosenzweig and Wolpin, 1980), and political parties on voting behavior in the U.S. and
Confederate Houses during the Civil War (Jenkins).
Another method that has shown some promise for yielding good inferences is sophisticated
matching methods based upon variants of the propensity score of Rosenbaum and Rubin (1983).
For example, a study of the impact of Workers’ Compensation on future wages might start from
workers who have been injured and then match them with a number of other workers in the same
firm with similar characteristics who have not been injured. The impact of the injury on earnings is
then simply the difference between the earnings of the injured and the non-injured workers. These data
could be analyzed with standard regression techniques by regressing future wages on worker and
firm characteristics with a dummy variable for injury, but this approach inevitably makes strong
assumptions about functional forms. Matching techniques appear to rely on fewer assumptions and
seem to provide good estimates of program impacts (Heckman, Ichimura, and Todd, 1997;
Friedlander, Greenberg, and Robins, 1995), but much more needs to be learned about their
strengths and limitations.
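The matching idea can be sketched as follows. This is a simplified illustration on simulated data, not the procedure of any of the cited studies: a logistic propensity model for injury is fit by Newton's method, and each injured worker is paired with the non-injured worker nearest on the estimated score (all variable names and parameter values are invented).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated workers: injury is more likely for low-tenure workers, and
# wages rise with tenure, so a raw injured-vs-uninjured comparison is
# confounded. The true effect of injury on wages here is -3.
n = 2000
tenure = rng.normal(10, 3, n)
p_injury = 1 / (1 + np.exp(0.4 * (tenure - 10)))      # low tenure -> injury
injured = rng.random(n) < p_injury
wage = 20 + 1.5 * tenure - 3.0 * injured + rng.normal(0, 1, n)

# Fit a logistic propensity model P(injured | tenure) by Newton's method.
X = np.column_stack([np.ones(n), tenure])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (injured - p))
pscore = 1 / (1 + np.exp(-X @ beta))

# 1-nearest-neighbour matching on the propensity score (with replacement):
# each injured worker is paired with the closest non-injured worker.
treated_idx = np.where(injured)[0]
control_idx = np.where(~injured)[0]
matches = control_idx[np.abs(pscore[control_idx][None, :]
                             - pscore[treated_idx][:, None]).argmin(axis=1)]
att = (wage[treated_idx] - wage[matches]).mean()
print(round(att, 2))  # near the true effect of -3
```

Because the score is monotone in tenure, matching on it balances tenure between the groups without assuming a functional form for the wage equation.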
4. Improved Model Building: Model selection through encompassing and specification
tests – If researchers must do observational studies, then they should be much more thoughtful
about model selection. A number of strategies and statistical tests have been developed to improve
this process. The “encompassing” methods of Hendry and his colleagues (Hendry and Richard,
1982; Mizon, 1984; Gilbert, 1990; Hendry, 1995, Chap. 14) emphasize evaluating alternative
theories by developing frameworks that encompass all the theories. Test statistics are then used to
evaluate which theories fit the data. Leamer (1990) worries about fragile inferences, and he asks
that “all empirical studies offer convincing evidence of inferential sturdiness” (page 88). He uses
Bayesian methods to determine how sensitive parameter estimates are to decisions about model
specification. Sims (1980), working in a time-series context, advocates using the minimal amount
of prior information and letting the data tell the story in vector auto-regressions in which each
variable is regressed on its own lagged values and the current and lagged values of all other
variables. Sims goes too far for my taste, but his concerns about the precarious state of our prior
knowledge are well worth considering. Pagan (1990) provides a nice comparison of all three
methods. In a series of publications, White (1982, 1990, 1994) has developed methods for
producing “robust” standard errors and for testing the specification of models. Although it is
already somewhat dated, the 1990 volume edited by Clive Granger provides readings on all these
approaches. Davidson and MacKinnon (1990) provide a summary of simple ways to perform
specification tests. Heckman and Hotz (1989) employ these methods to show that observational
methods can produce estimates of impacts close to experimental results.
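A simple specification test can be sketched in the spirit of Ramsey's RESET (a minimal illustration on simulated data, not the particular tests of White or of Heckman and Hotz): regress y on the model, then ask whether the squared fitted values add explanatory power.

```python
import numpy as np

rng = np.random.default_rng(2)

def reset_F(x, y):
    # RESET idea: a large F statistic for the added regressor yhat**2
    # signals a misspecified functional form.
    def rss(Z):
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        e = y - Z @ b
        return e @ e
    X0 = np.column_stack([np.ones(len(y)), x])
    yhat = X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]
    X1 = np.column_stack([X0, yhat ** 2])
    rss0, rss1 = rss(X0), rss(X1)
    df = len(y) - X1.shape[1]
    return (rss0 - rss1) / (rss1 / df)   # ~ F(1, df) if the model is right

x = rng.normal(0.0, 1.0, 500)
y_ok = 1 + 2 * x + rng.normal(0.0, 1.0, 500)            # linear model correct
y_bad = 1 + 2 * x + x ** 2 + rng.normal(0.0, 1.0, 500)  # curvature omitted
print(reset_F(x, y_ok))    # small: no evidence against the specification
print(reset_F(x, y_bad))   # large: specification rejected
```

The test has power here because the squared fitted values span the omitted quadratic term; like all such checks, it can only detect the misspecifications it happens to be sensitive to.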
5. Improved Model Building: Model uncertainty and model averaging – Once a model is
selected, the researcher should be very skeptical about it. At least since Edward Leamer’s challenge
to “Let’s Take the ‘Con’ Out of Econometrics” (1983), economists have been more sensitive to the
way in which their “specification searches” (Leamer, 1978) make a mockery of standard statistical
tests. The problem is very simple. Using the same dataset, researchers invariably try many different
specifications before they find the one that they report, but when they report it, they attach
significance levels to parameter estimates as if they had only tested this one specification. No
account is taken of the pre-testing.
There is ample evidence that this procedure leads to overfitting in which there is much
greater sense of confidence in the result than is warranted. In effect, all measures of fit are too
optimistic and all standard errors are too small (Draper, 1995; Chatfield, 1995). One way to
diagnose the extent of this problem is to put some data aside during the model selection phase and to
use it to validate the model (Picard and Cook, 1984) once it is chosen. Unfortunately, this only
makes the researcher aware of the problem; it does not solve it. Model averaging (Draper, 1995)
incorporates model uncertainty by averaging over a number of different model specifications.
Bartels provides an accessible introduction to the method (1997), and Bartels and Zaller (2001)
apply it to the presidential election forecasts. Chatfield (1995) provides a general discussion of
model uncertainty.
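Model averaging can be sketched with BIC-based weights, which serve here as a rough stand-in for the posterior model probabilities of formal Bayesian model averaging (the data and the three candidate specifications are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: y depends on x1 only, but the analyst entertains
# several specifications. Instead of reporting just the best-fitting
# model, average the x1 coefficient across models.
n = 300
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 0.5 * x1 + rng.normal(0, 1, n)

def fit(cols):
    # OLS fit; returns the coefficient on x1 and the model's BIC.
    Z = np.column_stack([np.ones(n)] + cols)
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    bic = n * np.log(e @ e / n) + Z.shape[1] * np.log(n)
    return b[1], bic

models = [fit([x1]), fit([x1, x2]), fit([x1, x2, x1 * x2])]
coefs, bics = map(np.array, zip(*models))
w = np.exp(-0.5 * (bics - bics.min()))   # approximate model weights
w /= w.sum()
bma_coef = float((w * coefs).sum())
print("BIC-weighted x1 coefficient:", round(bma_coef, 3))
```

The weighted estimate incorporates uncertainty about which specification is right rather than conditioning on a single model, which is the point Draper (1995) and Chatfield (1995) press.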
6. Improved Model Building: Model Replication with Multiple Data Sets and Multiple
Studies – The model selection methods described above can improve the quality of inferences, but
they still lead to just one model that might be flawed. Incorporating model uncertainty can usefully
increase our skepticism about any one model by considering an array of models. But these methods
still use only one dataset, and they cannot protect researchers from a mischievous Nature that fails
to vary or include important variables in a dataset. Therefore, the ultimate test of any finding is that
it can be reproduced in other datasets derived from varying circumstances and situations.
Ultimately, researchers should be looking for regularities across “many sets of data, drawn from
different populations” (Ehrenberg and Bound, 1993). The statistical analysis of many such studies,
called “meta-analysis” (Hedges and Olkin, 1985) is one way to provide stronger evidence for a
hypothesis. Perhaps even better, researchers should look for circumstances that might be likely to
disprove their hypothesis in order to test its limits.
Conclusions
Making an inference is engaging in an argument with nature. In the course of this argument,
we must presume that nature is mischievous, if not downright cunning and deceitful. There is no
reason to believe that our initial theories are correct or that the data we have are very illuminating.
We must constantly think of new queries to ask our adversary, and we must be skeptical of the
answers we get. Randomized experiments provide a way to reduce nature’s efforts to confound us,
but even they are not foolproof. In the end, David Freedman is right in saying that success depends
upon “the clarity of the prior reasoning, the bringing together of many different lines of evidence,
and the amount of shoe leather” (1991, p. 298) provided by the researcher. For observational
studies, there is no specific technique that will solve the problem of making good inferences. But
there are lots of things we can do better, and we have described many of them.
Figure 5 -- Outcomes Yi(u) Under Different Conditions and Experimental Set-Ups

Units    Conditions (i)         Possible States of the World with Only Two
 (u)   Treatment   Control      Units Getting Each Condition
                                   A       B       C       D       E       F
  1      Yt(1)      Yc(1)       Yt(1)   Yt(1)   Yt(1)   Yc(1)   Yc(1)   Yc(1)
  2      Yt(2)      Yc(2)       Yt(2)   Yc(2)   Yc(2)   Yc(2)   Yt(2)   Yt(2)
  3      Yt(3)      Yc(3)       Yc(3)   Yt(3)   Yc(3)   Yt(3)   Yt(3)   Yc(3)
  4      Yt(4)      Yc(4)       Yc(4)   Yc(4)   Yt(4)   Yt(4)   Yc(4)   Yt(4)
References [Incomplete]
Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from
Social Security Administrative Records.” The American Economic Review 80(June):
313-336.
Bartels, Larry M. 1997. “Specification Uncertainty and Model Averaging.” American Journal
Of Political Science 41(April):641-674.
Bartels, Larry M., and Zaller, John. 2001. “Presidential Vote Models: A Recount.” Political
Science & Politics 34(1):9-20.
Berkson, Joseph. 1950. “Are There Two Regressions?” Journal of the American Statistical
Association 45(June):164-180.
Blalock, H.M. Jr. 1964. Causal Inferences in Nonexperimental Research. Chapel Hill, NC:
University of North Carolina.
Booth, Charles. 1896. “Poor Law Statistics.” The Economic Journal 6(March):70-74.
Bronars, Stephen G. and Jeff Grogger. 1994. “The Economic Consequences of Unwed
Motherhood: Using Twin Births as a Natural Experiment.” The American Economic
Review 84(December):1141-1156.
Campbell, D.T. and J.C. Stanley. 1966. Experimental and Quasi-Experimental Designs for
Research. Chicago: Rand McNally.
Card, David and Alan B. Krueger. 1994. “Minimum Wages and Employment: A Case Study of
The Fast Food Industry in New Jersey and Pennsylvania.” The American Economic
Review 84(September):772-793.
Chatfield, Chris. 1995. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of
The Royal Statistical Society, Series A (Statistics in Society), 158(3):419-466.
Clogg, Clifford C. and Haritou, Adamantios. 1997. “The Regression Method of Causal Inference
and a Dilemma Confronting This Method.” In Causality in Crisis? Statistical Methods
and the Search for Causal Knowledge in the Social Sciences, ed. Vaughn R. McKim
and Stephen P. Turner, Notre Dame, IN: University of Notre Dame.
Cochran, William G. 1976. “Early Development of Techniques in Comparative
Experimentation.” In On the History of Statistics and Probability. Statistics Textbooks
and Monographs, ed. D.B. Owen, vol.17. New York: Marcel Dekker, Inc.
Conant, James Bryant. 1957. “Robert Boyle’s Experiments in Pneumatics, Edited by James
Bryant Conant.” In Harvard Case Studies in Experimental Science, Vol. 1, ed. James
Bryant Conant. Cambridge, MA: Harvard University Press.
Cox, Gary W. 1997. Making Votes Count: Strategic Coordination in the World’s Electoral
Systems. Cambridge, New York, and Melbourne: Cambridge University Press.
Davidson, Russell, and James G. MacKinnon. 1990. “Specification Tests Based on Artificial
Regressions.” Journal of the American Statistical Association 85(March):220-227.
Draper, David. 1995. “Assessment and Propagation of Model Uncertainty.” Journal of the
Royal Statistical Society, Series B (Methodological), 57(1):45-97.
Ehrenberg, A.S.C. 1968. “The Elements of Lawlike Relationships.” Journal of the Royal
Statistical Society, Series A (General), 131(3):280-302.
Ehrenberg, A.S.C., and J.A. Bound. 1993. “Predictability and Prediction.” Journal of the Royal
Statistical Society, Series A (Statistics in Society), 156(2):167-206.
Firebaugh, Glenn, and Kevin Chen. 1995. “Vote Turnout of Nineteenth Amendment Women:
The Enduring Effect of Disenfranchisement.” American Journal of Sociology 100
(January):972-996.
Fisher, R.A. [1926] 1972. “The Arrangement of Field Experiments.” In Collected Papers of R.A.
Fisher, vol. II—1925-31, ed. J.H. Bennett. Adelaide, Australia: University of Adelaide.
Fisher, Ronald A. [1935] 1971. The Design of Experiments. 9th ed. New York: Hafner Press.
Freedman, David A. 1991. “Statistical Models and Shoe Leather.” Sociological Methodology 21
(1991):291-313.
Freedman, David A. 1997. “From Association to Causation via Regression.” In Causality in
Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences,
ed. Vaughn R. McKim and Stephen P. Turner. Notre Dame, IN: University of Notre
Dame.
Galison, Peter. 1987. How Experiments End. Chicago: University of Chicago Press.
Galton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” Journal of the
Anthropological Institute of Great Britain and Ireland 15(1886):246-263.
Galton, Francis. 1888. “Co-Relations and Their Measurement, Chiefly from Anthropometric
Data.” Proceedings of the Royal Society of London 45(1888-1889):135-145.
Gigerenzer, Gerd, Zeno Swijtink, Theodore Porter, Lorraine Daston, John Beatty, and Lorenz
Krüger. 1989. The Empire of Chance: How Probability Changed Science and Everyday
Life. Cambridge, New York, and Melbourne: Cambridge University Press.
Gilbert, Christopher L. 1990. “Professor Hendry’s Econometric Methodology.” In Modelling
Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger.
New York: Oxford University Press.
Glymour, Clark. 1980. “Explanations, Tests, Unity and Necessity.” NOÛS, A.P.A. Western
Division Meetings, March 1980, 14(1):31-50.
Goldberger, Arthur S. and Otis Dudley Duncan. 1973. Structural Equation Models in the Social
Sciences. New York, San Francisco, and London: Seminar Press.
Haavelmo, Trygve. 1943. “The Statistical Implications of a System of Simultaneous Equations.”
Econometrica 11(January):1-12.
Heckman, James J., and V. Joseph Holtz. 1989. “Choosing Among Alternative Nonexperimental
Methods for Estimating the Impact of Social Programs: The Case of Manpower
Training.” Journal of the American Statistical Association 84(December):862-874.
Heckman, James J. 1992. “Randomization and Social Policy Evaluation.” In Evaluating Welfare
and Training Programs, ed. Charles F. Manski and Irwin Garfinkel. Cambridge, MA:
Harvard University Press.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. “Matching as an Econometric
Evaluation Estimator: Evidence from Evaluating a Job Training Programme.” The
Review of Economic Studies 64, Special Issue: Evaluation of Training and other Social
Programmes (October):605-654.
Hedström, Peter and Richard Swedberg, eds. 1998. Social Mechanisms: An Analytical
Approach to Social Theory. Cambridge, New York, and Melbourne: Cambridge
University Press.
Hendry, David F. and J-F. Richard. 1990. “On the Formulation of Empirical Models in
Dynamic Econometrics.” In Modelling Economic Series, 2nd ed. Advanced Texts in
Econometrics, ed. C.W.J. Granger. New York: Oxford University Press.
Hendry, David F. 1997. Dynamic Econometrics. 3rd ed. Advanced Texts in Econometrics. New
York: Oxford University Press.
Hodges, James S. 1987. “Uncertainty, Policy Analysis and Statistics.” Statistical Science 2
(August):259-275.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical
Association 81(December):945-960.
Holland, Paul W. 1988. “Causal Inference and Path Analysis.” Sociological Methodology 18
(1988):449-484.
Holland, Paul W. and Donald B. Rubin. 1988. “Causal Inference in Retrospective Studies.”
Evaluation Review 12:203-231.
Hood, Wm. C. and Tjalling C. Koopmans. [1953] 1970. Studies in Econometric Method. 3rd
ed. Cowles Foundation Monograph 14. New Haven, CT: Yale University Press.
Hotz, Joseph V., Charles H. Mullin, and Seth G. Sanders. 1997. “Bounding Causal Effects
Using Data From a Contaminated Natural Experiment: Analysis of the Effects of
Teenage Childbearing.” The Review of Economic Studies 64, Special Issue:
Evaluation of Training and Other Social Programmes (October):575-603.
Hotz, Joseph V., Guido W. Imbens, and Jacob A. Klerman. 2000. “The Long-Term Gains from
GAIN: A Re-Analysis of the Impacts of the California GAIN Program.” National
Bureau of Economic Research, Working Paper No. W8007, November 2000. Available
from National Bureau of Economic Research, http://papers.nber.org/papers/W8007.
Jenkins, Jeffery A. 1999. “Examining the Bonding Effects of Party: A Comparative Analysis
of Roll-Call Voting in the U.S. and Confederate Houses.” American Journal of
Political Science 43(October):1144-1165.
Koopmans, Tjalling C. 1949. “Identification Problems in Economic Model Construction.”
Econometrica 17(April):125-144.
Koopmans, Tjalling C., H. Rubin, and R.B. Leipnik. 1950. “Measuring the Equation Systems
of Dynamic Economics.” In Statistical Inference in Dynamic Economic Models, ed.
Tjalling C. Koopmans. New York: John Wiley.
Leamer, Edward E. 1978. Specification Searches: Ad Hoc Interference with Nonexperimental
Data. New York: John Wiley & Sons.
Leamer, Edward E. 1983. “Let’s Take the Con Out of Econometrics.” The American Economic
Review 73(March): 31-43.
Leamer, Edward E. 1985. “Sensitivity Analyses Would Help.” The American Economic Review
75(June):308-313.
Lichbach, Mark Irving. 1995. The Rebel’s Dilemma. Ann Arbor: University of Michigan Press.
Lichbach, Mark Irving. 1996. The Cooperator’s Dilemma. Ann Arbor: University of Michigan
Press.
Manski, Charles F. 1993. “Identification Problems in the Social Sciences.” Sociological
Methodology 23(1993):1-56.
Manski, Charles F. 1995. Identification Problems in the Social Sciences. Cambridge, MA:
Harvard University Press.
McKim, Vaughn R. and Stephen P. Turner, eds. 1997. Causality in Crisis? Statistical
Methods and the Search for Causal Knowledge in the Social Sciences, Notre Dame, IN:
University of Notre Dame.
Menzies, Peter. 2001. “Counterfactual Theories of Causation.” In The Stanford Encyclopedia of
Philosophy (Spring 2001 edition), ed. Edward N. Zalta. Available on-line at
http://plato.Stanford.edu/entries/causation-counterfactual/.
Metrick, Andrew. 1995. “A Natural Experiment in ‘Jeopardy!’” The American Economic
Review 85(March):240-253.
Meyer, Bruce D., W. Kip Viscusi, and David L. Durbin. 1995. “Workers’ Compensation and
Injury Duration: Evidence from a Natural Experiment.” The American Economic
Review 85(June):322-340.
Mizon, G.E. 1990. “The Encompassing Approach in Econometrics.” In Modelling
Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger.
New York: Oxford University Press.
Pagan, Adrian R. 1990. “Three Econometric Methodologies: A Critical Appraisal.” In Modelling
Economic Series, 2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger.
New York: Oxford University Press.
Picard, Richard R. and R. Dennis Cook. 1984. “Cross-Validation of Regression Models.”
Journal of the American Statistical Association 79(September):575-583.
Porter, Theodore M. 1986. The Rise of Statistical Thinking 1820-1900. Princeton, NJ:
Princeton University Press.
Rosenbaum, Paul R. and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in
Observational Studies for Causal Effects.” Biometrika 70(April):41-55.
Rosenzweig, Mark R. and Kenneth I. Wolpin. 1980. “Testing the Quantity-Quality Fertility
Model: The Use of Twins as a Natural Experiment.” Econometrica 48(January):227-240.
Rubin, Donald. B. 1974. “Estimating Causal Effects of Treatments in Randomized and
Nonrandomized Studies.” Journal of Educational Psychology 66:688-701.
Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.”
Annals of Statistics 6(January):34-58.
Rubin, Donald B. 1990. “[On the Application of Probability Theory to Agricultural Experiments.
Essay on Principles. Section 9.] Comment: Neyman (1923) and Causal Inference in
Experiments and Observational Studies.” Statistical Science 5(November):472-480.
Scheffé, Henry. 1956. “Alternative Models for the Analysis of Variance.” Annals of
Mathematical Statistics 27(2):251-271.
Simon, Herbert A. 1954. “Spurious Correlation: A Causal Interpretation.” Journal of the
American Statistical Association 49(September):467-479.
Sims, Christopher A. 1990. “Macroeconomics and Reality.” In Modelling Economic Series,
2nd ed. Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford
University Press.
Splawa-Neyman, Jerzy, D.M. Dabrowska, and T.P. Speed. 1990. “On the Application of Probability
Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical
Science 5(November):465-472.
Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty before
1900. Cambridge, MA: Belknap Press, Harvard University.
Stone, Richard. 1993. “The Assumptions on which Causal Inferences Rest.” Journal of the
Royal Statistical Society, Series B (Methodological), 55(2):455-466.
Student. 1909. “The Distribution of the Means of Samples which are Not Drawn at Random.”
Biometrika 7(July-October): 210-214.
Student. 1923. “On Testing Varieties of Cereals.” Biometrika 15(December):271-293.
Turner, Stephen P. 1997. “‘Net Effects’: A Short History.” In Causality in Crisis? Statistical
Methods and the Search for Causal Knowledge in the Social Sciences, ed. Vaughn R.
McKim and Stephen P. Turner. Notre Dame, IN: University of Notre Dame.
White, Halbert. 1982. “Maximum Likelihood Estimation of Misspecified Models.”
Econometrica 50 (January):1-26.
White, Halbert. 1990. “A Consistent Model Selection.” In Modelling Economic Series, 2nd ed.
Advanced Texts in Econometrics, ed. C.W.J. Granger. New York: Oxford University
Press.
White, Halbert. 1994. Estimation, Inference and Specification Analysis. Econometric Society
Monographs, no. 22. Cambridge, New York, and Melbourne: Cambridge University
Press.
Wishart, John. 1934. “Statistics in Agricultural Research.” Supplement to the Journal of the
Royal Statistical Society 1(1):26-61.
Wold, Herman. 1954. “Causality and Econometrics.” Econometrica 22(April):162-177.
Wold, Herman. 1956. “Causal Inference from Observational Data: A Review of End and
Means.” Journal of the Royal Statistical Society, Series A (General), 119(1):28-61.
Wold, H.O.A. and L. Jureen. 1953. Demand Analysis. New York: John Wiley.
Wright, S. 1934. “The Method of Path Coefficients.” Annals of Mathematical Statistics
5(1934):161-215.
Yule, G. Udny. 1895. “On the Correlation of Total Pauperism with Proportion of Out-Relief.”
The Economic Journal 5(December):603-611.
Yule, G. Udny. 1896. “Notes on the History of Pauperism in England and Wales from 1850,
Treated by the Method of Frequency-Curves; with an Introduction on the Method.”
Journal of the Royal Statistical Society 59(June):318-357.
Yule, G. Udny. 1896. “On the Correlation of Total Pauperism with Proportion of Out-Relief.”
The Economic Journal 6(December):613-623.
Yule, G. Udny. 1897a. “On the Significance of Bravais’ Formulae for Regression, &c., in the
Case of Skew Correlation.” Proceedings of the Royal Society of London
60(1897):477-489.
Yule, G. Udny. 1897b. “On the Theory of Correlation.” Journal of the Royal Statistical Society
60(December):812-854.
Yule, G. Udny. 1899. “An Investigation into the Causes of Changes in Pauperism in England,
Chiefly During the Last Two Intercensal Decades, Part I.” Journal of the Royal
Statistical Society 62(June):249-295.
Yule, G. Udny. 1907. “On the Theory of Correlation for any Number of Variables, Treated by
a New System of Notation.” Proceedings of the Royal Society of London, Series A,
Containing Papers of a Mathematical and Physical Character 79(May 14):182-193.
Figure 1 -- Boyle's Experimental Data: log of pressure plotted against log of volume (Rsq = 0.9999).
Figure 2 -- Distribution of Children's and Parents' Heights from Galton's Data (Mean = 68.20, Std. Dev. = 2.21, N = 1,856).
Figure 3 -- Regression of Children's versus Parents' Heights from Galton's Data (cases weighted by NUMBER).
Figure 4 -- Yule's Data on Pauperism and Out-Door Relief: percent paupers plotted against the ratio of out-door relief to indoor relief.