Covariate Balancing Propensity Score*

Kosuke Imai†   Marc Ratkovic‡

First Draft: March 3, 2012   This Draft: March 7, 2012

Abstract

The propensity score plays a central role in a variety of settings for causal inference. In particular, matching and weighting methods based on the estimated propensity score have become increasingly common in observational studies. Despite their popularity and theoretical appeal, the main practical difficulty of these methods is that the propensity score must be estimated. Researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects. In this paper, we introduce covariate balancing propensity score (CBPS) estimation, which simultaneously optimizes the covariate balance and the prediction of treatment assignment. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment, and estimate the CBPS within the generalized method of moments or empirical likelihood framework. We find that the CBPS dramatically improves the poor empirical performance of propensity score matching and weighting methods reported in the literature. We also show that the CBPS can be extended to a number of other important settings, including the estimation of the generalized propensity score for non-binary treatments, causal inference in longitudinal settings, and the generalization of experimental and instrumental variable estimates to a target population.

Key Words: causal inference, instrumental variables, inverse propensity score weighting, marginal structural models, observational studies, propensity score matching, randomized experiments

* Financial support from the National Science Foundation (SES-0550873; SES-0752050) is acknowledged. We thank Jens Hainmueller, Dylan Small, and seminar participants at UCLA and Princeton for helpful suggestions.
† Assistant Professor, Department of Politics, Princeton University, Princeton NJ 08544. Phone: 609-258-6601, Email: [email protected], URL: http://imai.princeton.edu
‡ Post-doctoral Fellow in Formal and Quantitative Methods, Department of Politics, Princeton University, Princeton NJ 08544.

1 Introduction

The propensity score, defined as the conditional probability of receiving treatment given covariates, plays a central role in a variety of settings for causal inference. In their seminal article, Rosenbaum and Rubin (1983) showed that if the treatment assignment is strongly ignorable given covariates, then an unbiased estimate of the average treatment effect can be obtained by adjusting for the propensity score alone rather than the entire vector of covariates, which is often of high dimension. Over the next three decades, a number of new methods based on the propensity score have been developed, and they have become an essential part of applied researchers' toolkit in numerous disciplines. In particular, the propensity score is used to adjust for observed confounding through matching (e.g., Rosenbaum and Rubin, 1985; Rosenbaum, 1989; Abadie and Imbens, 2006), subclassification (e.g., Rosenbaum and Rubin, 1984; Rosenbaum, 1991; Hansen, 2004), weighting (e.g., Rosenbaum, 1987; Robins et al., 2000; Hirano et al., 2003), regression (e.g., Heckman et al., 1998), or their combinations (e.g., Robins et al., 1995; Ho et al., 2007; Abadie and Imbens, 2011). Imbens (2004), Lunceford and Davidian (2004), and Stuart (2010) provide comprehensive reviews of these and other related methods.
Despite their popularity and theoretical appeal, the main practical difficulty of these methods is that the propensity score must be estimated. In fact, researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects (e.g., Kang and Schafer, 2007; Smith and Todd, 2005). This challenge highlights the paradoxical nature of the propensity score: it is designed to reduce the dimension of covariates, and yet its estimation requires modeling of high-dimensional covariates. In practice, applied researchers search for an appropriate propensity score model specification by repeating the process of changing their model and checking the resulting covariate balance. Imai et al. (2008) called this phenomenon the "propensity score tautology": the propensity score is correct if it balances covariates. To directly address this problem, it is essential to develop an automated and robust method for estimating the propensity score.

In this paper, we introduce the covariate balancing propensity score (CBPS) and show how to estimate the propensity score such that both the resulting covariate balance and the prediction of treatment assignment are optimized. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment. Specifically, we use a set of moment conditions implied by both the covariate balancing property (i.e., mean independence between treatment and covariates after inverse propensity score weighting) and the standard estimation procedure (i.e., maximum likelihood). These moment conditions are then combined to estimate the CBPS within the familiar generalized method of moments (GMM) or empirical likelihood (EL) framework (Hansen, 1982; Owen, 2001).

The CBPS has several attractive characteristics. First, the CBPS estimation guards against the potential misspecification of a parametric propensity score model by selecting parameter values that maximize the resulting covariate balance. In Section 3, we find that the CBPS dramatically improves the poor empirical performance of propensity score matching and weighting methods reported by Kang and Schafer (2007) and Smith and Todd (2005). Second, the CBPS can be extended to a number of other important settings in causal inference. In Section 4, we extend the CBPS to the generalized propensity score for non-binary treatments (Imbens, 2000; Imai and van Dyk, 2004), time-varying treatments and time-dependent confounding in longitudinal data (e.g., Robins et al., 2000), and the generalization of experimental and instrumental variable estimates to a target population (e.g., Cole and Stuart, 2010; Angrist and Fernandez-Val, 2010). Third, the CBPS inherits all the theoretical properties and methodologies that already exist in the GMM and EL literature. For example, GMM specification tests and moment selection procedures are directly applicable. Finally, because the proposed methodology simply improves the estimation of the propensity score, various propensity score methods such as matching and weighting can be implemented without modification.

The CBPS contributes to the recent developments in the causal inference literature on automated covariate balancing methods (e.g., Diamond and Sekhon, 2011; Iacus et al., 2011; Hainmueller, 2012; Ratkovic, 2012). In particular, the CBPS builds upon and is closely related to several existing methods.
First, Hainmueller (2012) proposes the entropy balancing method to construct a weight for each control observation such that the sample moments of observed covariates are identical between the treatment and weighted control groups. Unlike the entropy balancing method, however, the CBPS constructs balancing weights directly from the propensity score. In this respect, the CBPS is similar to the method proposed by Graham et al. (Forthcoming) under the empirical likelihood framework. Yet, the CBPS critically differs from these methods in that it simultaneously maximizes the covariate balance and the prediction of treatment assignment. In addition, the CBPS is related to the method proposed by Tan (2010), which adjusts the maximum likelihood estimate of the propensity score to generate observation-specific weights. Like the method of Graham et al., Tan's method incorporates the outcome model and achieves many desirable properties such as double robustness, local efficiency, and sample boundedness. In contrast, the CBPS focuses on the estimation of the propensity score without consulting the outcome data, which aligns with the original spirit of the propensity score methodology (Rubin, 2007). This separation from the outcome model enables the CBPS to be applicable to a variety of causal inference settings and to be used with the existing propensity score methods such as matching and weighting. Section 2.4 gives a more detailed comparison between the CBPS and these existing methods.

2 The Proposed Methodology

2.1 The Setup

Consider a simple random sample of N observations from a population P. For each unit i, we observe a binary treatment variable T_i and a K-dimensional column vector of pre-treatment covariates X_i, where the support of X_i is denoted by \mathcal{X}. The propensity score is defined as the conditional probability of receiving the treatment given the covariates X_i. Following Rosenbaum and Rubin (1983), we assume that the true propensity score is bounded away from 0 and 1,

0 < \Pr(T_i = 1 \mid X_i = x) < 1 \quad \text{for any } x \in \mathcal{X}.    (1)

Rosenbaum and Rubin (1983) showed that if we further assume the ignorability of treatment assignment, i.e.,

\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp T_i \mid X_i    (2)

where Y_i(t) represents the potential outcome under treatment status t, then the treatment assignment is ignorable given the propensity score,

\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp T_i \mid \pi(X_i).    (3)

This implies that the unbiased estimation of treatment effects is possible by conditioning on the propensity score alone rather than the entire covariate vector X_i, which is in practice frequently of high dimension. This key observation about its dimension reduction property led to the subsequent development of various propensity score methods, including matching, subclassification, and weighting.

In observational studies, however, the propensity score is unknown and must be estimated from the data. Typically, researchers assume a parametric propensity score model,

\Pr(T_i = 1 \mid X_i) = \pi_\beta(X_i)    (4)

where \beta \in \Theta is an L-dimensional column vector of unknown parameters. For example, a popular choice is the following logistic model,

\pi_\beta(X_i) = \frac{\exp(X_i^\top \beta)}{1 + \exp(X_i^\top \beta)}    (5)

in which case we have L = K. Researchers then maximize the empirical fit of the model so that the estimated propensity score predicts the observed treatment assignment well. This is often done by maximizing the likelihood function,

\hat{\beta}_{\mathrm{MLE}} = \operatorname*{argmax}_{\beta \in \Theta} \sum_{i=1}^{N} T_i \log \pi_\beta(X_i) + (1 - T_i) \log\{1 - \pi_\beta(X_i)\}.    (6)
Assuming that \pi_\beta(X_i) is differentiable with respect to \beta, this implies the following first-order condition,

\frac{1}{N} \sum_{i=1}^{N} s_\beta(T_i, X_i) = 0 \quad \text{where} \quad s_\beta(T_i, X_i) = \frac{T_i \, \pi_\beta'(X_i)}{\pi_\beta(X_i)} - \frac{(1 - T_i) \, \pi_\beta'(X_i)}{1 - \pi_\beta(X_i)}    (7)

and \pi_\beta'(X_i) = \partial \pi_\beta(X_i) / \partial \beta^\top. The major difficulty of this estimation strategy is that the propensity score model may be misspecified, leading to biased estimates of treatment effects. While in theory a more complex nonparametric model can be used to estimate the propensity score (e.g., McCaffrey et al., 2004), the high dimensionality of covariates X_i often makes its successful application challenging.

To directly address this issue, we develop covariate balancing propensity score (CBPS) estimation as a robust estimation method for a parametric propensity score model. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment (Rosenbaum and Rubin, 1983). Specifically, we operationalize the covariate balancing property using inverse propensity score weighting,

E\left\{ \frac{T_i \tilde{X}_i}{\pi_\beta(X_i)} - \frac{(1 - T_i) \tilde{X}_i}{1 - \pi_\beta(X_i)} \right\} = 0    (8)

where \tilde{X}_i = f(X_i) is an M-dimensional vector-valued function of X_i specified by the researcher. Equation (8) holds regardless of the choice of f(\cdot). For example, setting \tilde{X}_i = X_i balances the first moment of each covariate, while defining \tilde{X}_i = (X_i^\top \; X_i^{2\top})^\top balances both the first and second moments. The researcher may also wish to balance a subset of covariates or include interactions. Similarly, if the average treatment effect for the treated is of interest, we may wish to weight the control group observations such that their (weighted) covariate distribution matches that of the treatment group. In this case, the moment condition becomes,

E\left\{ T_i \tilde{X}_i - \frac{\pi_\beta(X_i)(1 - T_i) \tilde{X}_i}{1 - \pi_\beta(X_i)} \right\} = 0.    (9)

Two remarks are in order before we move to the discussion of estimation and inference. First, equation (7) represents another balancing condition, where inverse propensity score weighting of the first derivative of the propensity score yields an identical first moment between the treatment and control groups. This observation also justifies combining these moment conditions to estimate the propensity score. Second, we note that the covariate balancing property follows directly from the definition of the propensity score and does not require the ignorability assumption given in equation (2).

2.2 Estimation and Inference

We combine the above moment conditions based on the covariate balancing property (i.e., equation (8) or (9)) with the score condition of the likelihood inference (i.e., equation (7)) under the generalized method of moments (GMM) or empirical likelihood (EL) framework. Define the sample analogue of the covariate balancing moment condition given in equation (8) as,

\frac{1}{N} \sum_{i=1}^{N} w_\beta(T_i, X_i) \tilde{X}_i \quad \text{where} \quad w_\beta(T_i, X_i) = \frac{T_i - \pi_\beta(X_i)}{\pi_\beta(X_i)\{1 - \pi_\beta(X_i)\}}.    (10)

If the ATT rather than the ATE is the quantity of interest, we use the sample analogue of equation (9) and the weight becomes,

w_\beta(T_i, X_i) = \frac{N}{N_1} \cdot \frac{T_i - \pi_\beta(X_i)}{1 - \pi_\beta(X_i)}    (11)

where N_1 is the number of units in the treatment group. When combined with the score equation given in equation (7), the total number of moment conditions is L + M, which exceeds the number of parameters L.
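To make these quantities concrete, the following minimal sketch computes the score condition in equation (7) and the ATE balancing condition in equation (10) for a logistic propensity score model, and stacks them into the single moment vector used by the GMM estimation described in the next subsection. The function and variable names (propensity, score_condition, and so on) are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

# Minimal sketch of the moment conditions in equations (7)-(10) for a
# logistic propensity score model; names are illustrative.

def propensity(X, beta):
    """Logistic propensity score pi_beta(X_i) = expit(X_i' beta)."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

def score_condition(T, X, beta):
    """Sample mean of the score s_beta(T_i, X_i) in equation (7).
    For the logistic model the score simplifies to (T_i - pi) * X_i."""
    p = propensity(X, beta)
    return ((T - p)[:, None] * X).mean(axis=0)

def balance_condition(T, X, X_tilde, beta):
    """Sample mean of the ATE balancing moment in equations (8) and (10)."""
    p = propensity(X, beta)
    w = (T - p) / (p * (1.0 - p))          # w_beta(T_i, X_i)
    return (w[:, None] * X_tilde).mean(axis=0)

def stacked_moments(T, X, X_tilde, beta):
    """Score and balance conditions stacked into one moment vector."""
    return np.concatenate([score_condition(T, X, beta),
                           balance_condition(T, X, X_tilde, beta)])
```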
Thus, we can use them as over-identifying restrictions under the optimal GMM framework (Hansen, 1982),

\hat{\beta}_{\mathrm{GMM}} = \operatorname*{argmin}_{\beta \in \Theta} \; \bar{g}_\beta(T, X)^\top \Sigma_\beta(T, X)^{-1} \bar{g}_\beta(T, X)    (12)

where \bar{g}_\beta(T, X) is the sample mean of the moment conditions,

\bar{g}_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} g_\beta(T_i, X_i)    (13)

and g_\beta(T_i, X_i) stacks all moment conditions,

g_\beta(T_i, X_i) = \begin{pmatrix} s_\beta(T_i, X_i) \\ w_\beta(T_i, X_i) \tilde{X}_i \end{pmatrix}.    (14)

We use the "continuous-updating" GMM estimator of Hansen et al. (1996), which, unlike the two-step optimal GMM estimator, is invariant (e.g., to linear transformations of the moment conditions) and is found to have better finite sample properties. Our choice of a consistent covariance estimator for g_\beta(T_i, X_i) is given by,

\Sigma_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} E\{ g_\beta(T_i, X_i) \, g_\beta(T_i, X_i)^\top \mid X_i \}    (15)

where we integrate out the treatment variable T_i conditional on the pre-treatment covariates X_i. We find that this covariance estimator outperforms the sample covariance of the moment conditions because the latter does not penalize large weights. In particular, in the case of the logistic regression propensity score model, i.e., \pi_\beta(X_i) = \mathrm{logit}^{-1}(X_i^\top \beta), we have the following expression for \Sigma_\beta(T, X),

\Sigma_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} \pi_\beta(X_i)\{1 - \pi_\beta(X_i)\} X_i X_i^\top & X_i \tilde{X}_i^\top \\ \tilde{X}_i X_i^\top & [\pi_\beta(X_i)\{1 - \pi_\beta(X_i)\}]^{-1} \tilde{X}_i \tilde{X}_i^\top \end{pmatrix}    (16)

when we use the ATE weight given in equation (10), or

\Sigma_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} \pi_\beta(X_i)\{1 - \pi_\beta(X_i)\} X_i X_i^\top & \pi_\beta(X_i) X_i \tilde{X}_i^\top \\ \pi_\beta(X_i) \tilde{X}_i X_i^\top & \pi_\beta(X_i)/\{1 - \pi_\beta(X_i)\} \, \tilde{X}_i \tilde{X}_i^\top \end{pmatrix}    (17)

if the ATT weight given in equation (11) is used. With a set of reasonable starting values (e.g., \hat{\beta}_{\mathrm{MLE}}), we find that gradient-based optimization methods work well both in terms of speed and reliability. Finally, an alternative estimation strategy is based on the EL framework applied to the above moment conditions (Qin and Lawless, 1994), where the profile empirical likelihood ratio function is given by,

R(\beta) = \sup \left\{ \prod_{i=1}^{n} n p_i \;\middle|\; p_i \geq 0, \; \sum_{i=1}^{n} p_i = 1, \; \sum_{i=1}^{n} p_i \, g_\beta(T_i, X_i) = 0 \right\}.    (18)

This EL estimator shares many of the attractive properties of the above continuous-updating GMM estimator, notably invariance and higher-order bias properties.

2.3 Specification Test

Another advantage of combining the moment conditions in the GMM or EL framework is that we can use the test of over-identifying restrictions as a specification test for the propensity score model. For example, under the GMM framework, Hansen's J statistic is given by,

J = N \cdot \bar{g}_{\hat{\beta}_{\mathrm{GMM}}}(T, X)^\top \Sigma_{\hat{\beta}_{\mathrm{GMM}}}(T, X)^{-1} \bar{g}_{\hat{\beta}_{\mathrm{GMM}}}(T, X) \; \xrightarrow{d} \; \chi^2_{L+M}    (19)

where the null hypothesis is that the propensity score model is correctly specified. We emphasize that the failure to reject this null does not necessarily imply the correct specification of the propensity score model, because it may simply mean that the test lacks power (Imai et al., 2008). Nevertheless, the test could be useful for detecting model misspecification. Similarly, within the EL framework, the specification test can be conducted based on the likelihood ratio.
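As an illustration of the estimation strategy and specification test described above, the sketch below implements a continuous-updating GMM criterion and the J statistic under the ATE covariance of equation (16), reusing propensity() and stacked_moments() from the earlier sketch. It is a minimal illustration under those assumptions, not the authors' implementation; the small ridge term and the choice of starting values are conveniences of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def sigma_ate(X, X_tilde, beta):
    """Model-implied covariance of the stacked moments, as in equation (16)."""
    p = propensity(X, beta)
    blocks = []
    for i in range(X.shape[0]):
        x, xt = X[i][:, None], X_tilde[i][:, None]
        v = p[i] * (1 - p[i])
        top = np.hstack([v * x @ x.T, x @ xt.T])
        bot = np.hstack([xt @ x.T, (xt @ xt.T) / v])
        blocks.append(np.vstack([top, bot]))
    return np.mean(blocks, axis=0)

def cue_objective(beta, T, X, X_tilde, ridge=1e-6):
    """Continuous-updating criterion g_bar' Sigma(beta)^{-1} g_bar."""
    g_bar = stacked_moments(T, X, X_tilde, beta)
    S = sigma_ate(X, X_tilde, beta) + ridge * np.eye(len(g_bar))
    return g_bar @ np.linalg.solve(S, g_bar)

def fit_cbps(T, X, X_tilde):
    """Minimize the CUE criterion; beta_MLE would be a better starting value."""
    beta0 = np.zeros(X.shape[1])
    res = minimize(cue_objective, beta0, args=(T, X, X_tilde), method="BFGS")
    return res.x

def j_statistic(beta_hat, T, X, X_tilde):
    """Over-identification statistic J = N * g_bar' Sigma^{-1} g_bar."""
    return X.shape[0] * cue_objective(beta_hat, T, X, X_tilde)
```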
2.4 Related Methods

The CBPS is closely related to several existing methodologies. First, Hainmueller (2012) proposed the entropy balancing method, which weights each observation in order to achieve optimal balance. Since identifying N weights by balancing the sample mean of K covariates between the treatment and control groups is an underdetermined problem when K < N, the entropy balancing method minimizes the Kullback-Leibler divergence from a set of baseline weights chosen by researchers. The critical difference between the CBPS and the entropy balancing method is that the former directly models the propensity score, which in turn is used to construct weights for observations. While the weights that result from the entropy balancing method imply a propensity score, the CBPS makes it easier for applied researchers to model the propensity score directly by incorporating their substantive knowledge into a parametric model. Moreover, as illustrated in Section 4, the direct modeling of the propensity score widens the applicability of the CBPS to more complex situations. In addition, because we model the propensity score, the score equation is included as another set of moment conditions, leading to a test of over-identifying restrictions for model specification. In his 2008 working paper, Hainmueller (2008) briefly discusses the incorporation of covariate balancing conditions into the estimation of the propensity score under the EL framework as a potential extension of his method (Section IV C). However, his setup differs from ours in that the weights are constructed separately from the propensity score.

Second, Graham et al. (Forthcoming) propose an EL method, similar to the entropy balancing method, in which covariate balancing moment conditions are used to estimate the propensity score. Unlike these methods, however, the CBPS combines the score equation from the likelihood function with the covariate balancing conditions so that it simultaneously optimizes the prediction of treatment assignment and the covariate balance. Thus, the CBPS is based on a set of over-identifying restrictions, which enables the specification test for the propensity score model.

Third, under the likelihood framework, Tan (2010) identifies a set of constraints to generate observation-specific weights that enable desirable properties such as double robustness, local efficiency, and sample boundedness. These weights may not fall between 0 and 1 and hence cannot be interpreted as the propensity score. In addition, like the method of Graham et al., Tan's method incorporates the outcome model in a creative manner when constructing weights. In contrast, the CBPS focuses on the estimation of the propensity score without consulting the outcome data, which aligns with the original spirit of the propensity score methods (Rubin, 2007). As illustrated in Section 4, the direct connection between the CBPS and the propensity score widens the applicability of the CBPS. For the same reason, the CBPS can also be easily used in conjunction with the existing propensity score methods such as matching and weighting.

3 Simulation and Empirical Studies

In this section, we apply the proposed CBPS methodology to prominent simulation and empirical studies where propensity score methods have been shown to fail. We show that the CBPS can dramatically improve the poor performance of the propensity score weighting and matching methods reported in these previous studies.

3.1 Improved Performance of Propensity Score Weighting Methods

In a controversial article, Kang and Schafer (2007) present a set of simulation studies to examine the performance of propensity score weighting methods. They find that misspecification of the propensity score model can negatively affect the performance of various weighting methods. In particular, they show that while the doubly robust (DR) estimator of Robins et al.
(1994) provides a consistent estimate of the treatment effect if either the outcome model or the propensity score model is correct, the performance of the DR estimator can deteriorate when both models are slightly misspecified. This finding led to a rebuttal by Robins et al. (2007), who criticize the simulation setup and introduce alternative DR estimators. In this section, we exactly replicate the simulation study of Kang and Schafer (2007) except that we estimate the propensity score using our proposed methodology. We then examine whether or not the CBPS can improve the empirical performance of the propensity score weighting estimators.

In particular, Kang and Schafer (2007) used the following data generating process for their simulation study. There exist four pre-treatment covariates X_i^* = (X_{i1}^*, X_{i2}^*, X_{i3}^*, X_{i4}^*), each of which is independently, identically distributed (i.i.d.) according to the standard normal distribution. The true outcome model is a linear regression with these covariates, and the error term is an i.i.d. standard normal random variate such that the mean outcome equals 210, which is the quantity of interest to estimate. The true propensity score model is a logistic regression with X_i^* being the linear predictor such that the mean probability of receiving the treatment equals 0.5. Finally, only nonlinear transformations of the covariates are observed, and they are given by

X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) = \{\exp(X_{i1}^*/2), \; X_{i2}^*/(1 + \exp(X_{i1}^*)) + 10, \; (X_{i1}^* X_{i3}^*/25 + 0.6)^3, \; (X_{i2}^* + X_{i4}^* + 20)^2\}.

Kang and Schafer (2007) study four propensity score weighting estimators based on a propensity score model that is a logistic regression with X_i as the linear predictor. This propensity score model is misspecified because the true propensity score is a logistic regression with X_i^* as the linear predictor. The weighting estimators they examine are the Horvitz-Thompson estimator (HT; Horvitz and Thompson, 1952), the inverse propensity score weighting estimator (IPW; Hirano et al., 2003), the weighted least squares regression estimator (WLS; Robins et al., 2000; Freedman and Berk, 2008), and the doubly robust regression estimator (DR; Robins et al., 1994),

\hat{\mu}_{\mathrm{HT}} = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i Y_i}{\pi_{\hat{\beta}}(X_i)}    (20)

\hat{\mu}_{\mathrm{IPW}} = \sum_{i=1}^{n} \frac{T_i Y_i}{\pi_{\hat{\beta}}(X_i)} \bigg/ \sum_{i=1}^{n} \frac{T_i}{\pi_{\hat{\beta}}(X_i)}    (21)

\hat{\mu}_{\mathrm{WLS}} = \frac{1}{n} \sum_{i=1}^{n} X_i^\top \hat{\gamma}_{\mathrm{WLS}} \quad \text{where} \quad \hat{\gamma}_{\mathrm{WLS}} = \left( \sum_{i=1}^{n} \frac{T_i X_i X_i^\top}{\pi_{\hat{\beta}}(X_i)} \right)^{-1} \sum_{i=1}^{n} \frac{T_i X_i Y_i}{\pi_{\hat{\beta}}(X_i)}    (22)

\hat{\mu}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ X_i^\top \hat{\gamma}_{\mathrm{OLS}} + \frac{T_i (Y_i - X_i^\top \hat{\gamma}_{\mathrm{OLS}})}{\pi_{\hat{\beta}}(X_i)} \right\} \quad \text{where} \quad \hat{\gamma}_{\mathrm{OLS}} = \left( \sum_{i=1}^{n} T_i X_i X_i^\top \right)^{-1} \sum_{i=1}^{n} T_i X_i Y_i.    (23)

Both the weighted least squares and ordinary least squares regressions are misspecified because the true outcome model is linear in X_i^* rather than X_i. To estimate the propensity score, Kang and Schafer (2007) use the logistic regression with X_i being the linear predictor, i.e., \pi_\beta(X_i) = \mathrm{logit}^{-1}(X_i^\top \beta), which is misspecified because the true propensity score model is the logistic regression with X_i^* being the linear predictor. Our simulation study uses the same propensity score and outcome model specifications, but we investigate whether the CBPS improves the empirical performance of the weighting estimators. To estimate the CBPS, we use the same exact logistic regression specification but add the covariate balancing moment conditions, setting \tilde{X}_i = X_i, under the GMM framework of the proposed methodology outlined in Section 2.
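The data generating process and the four estimators in equations (20)-(23) can be sketched as follows. This is an illustrative sketch only: the coefficient values shown are those of the Kang and Schafer (2007) design as commonly reported and should be checked against the original article, and all function names are assumptions of the sketch rather than the paper's code. A Monte Carlo study along the lines described in the next paragraph would repeatedly call simulate() and compare weighting_estimators() under different fitted propensity scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Kang-Schafer style design (coefficients assumed; verify against the
    original article). Returns outcome, treatment, and observed covariates."""
    Z = rng.standard_normal((n, 4))                        # latent covariates X*
    y = 210 + Z @ np.array([27.4, 13.7, 13.7, 13.7]) + rng.standard_normal(n)
    p_true = 1 / (1 + np.exp(-(-Z[:, 0] + 0.5 * Z[:, 1]
                               - 0.25 * Z[:, 2] - 0.1 * Z[:, 3])))
    T = rng.binomial(1, p_true)
    X = np.column_stack([np.exp(Z[:, 0] / 2),
                         Z[:, 1] / (1 + np.exp(Z[:, 0])) + 10,
                         (Z[:, 0] * Z[:, 2] / 25 + 0.6) ** 3,
                         (Z[:, 1] + Z[:, 3] + 20) ** 2])   # observed covariates
    return y, T, X

def weighting_estimators(y, T, X, pscore):
    """HT, IPW, WLS, and DR estimators of equations (20)-(23);
    an intercept is added to the regression design as a convenience."""
    X1 = np.column_stack([np.ones(len(y)), X])
    w = T / pscore
    ht = np.mean(w * y)                                    # Horvitz-Thompson
    ipw = np.sum(w * y) / np.sum(w)                        # normalized IPW
    g_wls = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * y))
    wls = np.mean(X1 @ g_wls)                              # weighted least squares
    Xt, yt = X1[T == 1], y[T == 1]
    g_ols = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)          # OLS on the treated
    dr = np.mean(X1 @ g_ols + w * (y - X1 @ g_ols))        # doubly robust
    return ht, ipw, wls, dr
```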
Thus, our simulation study examines how replacing the standard logistic regression propensity score with the CBPS improves the empirical performance of the four commonly used weighting estimators. As in Kang and Schafer (2007), we conduct simulations under four scenarios: (1) both the propensity score and outcome models are correctly specified, (2) only the propensity score model is correctly specified, (3) only the outcome model is correctly specified, and (4) neither the propensity score nor the outcome model is correctly specified. For each scenario, two sample sizes, 200 and 1,000, are used; we conduct 10,000 Monte Carlo simulations and calculate the bias and root mean squared error (RMSE) for each estimator.

Table 1: Relative Performance of the Four Different Propensity Score Weighting Estimators Based on Different Propensity Score Estimation Methods under the Simulation Setting of Kang and Schafer (2007). The bias and root mean squared error (RMSE) are computed for the Horvitz-Thompson (HT), the inverse propensity score weighting (IPW), the inverse propensity score weighted least squares (WLS), and the doubly robust least squares (DR) estimators. The performance of the CBPS is compared with that of the standard logistic regression (GLM), the CBPS without the score equation (Balance), and the true propensity score (True). We consider four scenarios where the outcome and/or propensity score models are misspecified. The sample sizes are n = 200 and n = 1,000. The number of simulations is 10,000. Across the four weighting estimators, the CBPS is most robust to misspecification and dramatically improves the performance of GLM.

The results of our simulation study are presented in Table 1.
For each of the four scenarios, we examine the bias and RMSE of each weighting estimator based on four different propensity score methods: (a) the standard logistic regression with X_i being the linear predictor, as in the original simulation study (GLM), (b) the CBPS estimation with the covariate balancing moment conditions with respect to X_i but without the score condition (Balance), (c) the proposed CBPS estimation, and (d) the true propensity score (True), which is given by \pi_\beta(X_i) = \mathrm{logit}^{-1}(X_i^{*\top} \beta).

In the first scenario, where both models are correct, all four weighting methods have relatively low bias regardless of which propensity score estimation method we use. However, the HT estimator has a large variance and hence a large RMSE when either the standard logistic regression or the true propensity score is used. In contrast, when used with the CBPS, the same estimator has a much lower RMSE. A similar observation can be made for the IPW estimator, while the WLS and DR estimators are not sensitive to the choice of propensity score estimation method. The CBPS without the score equation improves the performance of the HT estimator, but the bias is somewhat larger when compared to the CBPS with the score equation.

The second simulation scenario shows the performance of the various estimators when the propensity score model is correct but the outcome model is misspecified. As expected, the results are quite similar to those of the first simulation scenario. When the propensity score model is correctly specified, the four weighting estimators have low bias. However, the CBPS significantly reduces the variance of the HT estimator even when compared with the true propensity score. This confirms the theoretical result in the literature that the estimated propensity score leads to a more efficient estimator of the average treatment effect than the true propensity score (e.g., Hahn, 1998; Hirano et al., 2003).

The third simulation scenario examines an interesting situation where the propensity score model is misspecified while the outcome models for the WLS and DR estimators are correctly specified. Here, we find that, as expected, the HT and IPW estimators, which rely solely on the propensity score, have large bias and RMSE when used with the standard logistic regression model. However, the CBPS significantly reduces their bias and RMSE regardless of whether it incorporates the score equation from the logistic likelihood function. In contrast, the bias and RMSE of the WLS and DR estimators remain low and essentially identical across the propensity score estimation methods. Together with the above results under the second scenario, these results confirm the double-robustness property, whereby the DR estimator performs well so long as either the propensity score or the outcome model is correctly specified.
Figure 1: Comparison of Likelihood and Imbalance between the Standard Logistic Regression (GLM) and the Covariate Balancing Propensity Score (CBPS). (Panel rows: Both Models Specified Correctly; Neither Model Specified Correctly. Panel columns: Log-Likelihood; Covariate Imbalance; Likelihood-Balance Tradeoff.) The two scenarios from the simulation results reported in Table 1 are shown for the sample size of n = 1,000: both the outcome and propensity score models are correctly specified (upper row), and both models are misspecified (bottom row). Each dot in a plot represents one Monte Carlo simulation draw. When both models are correct, CBPS reduces the multivariate standardized bias, which is a summary measure of imbalance across all covariates (middle upper plot), without sacrificing much likelihood, which is a measure of model fit (upper left plot). In contrast, when both models are incorrect, CBPS significantly improves balance while sacrificing some degree of model fit. The two plots in the right column illustrate this trade-off between likelihood and balance.

The final simulation scenario illustrates the most important point made by Kang and Schafer (2007): the performance of the DR estimator can deteriorate when both the propensity score and outcome models are incorrectly specified.
Under this scenario, all four estimators suffer from some degree of bias when used with the standard logistic regression model. The bias and RMSE are largest for the HT estimator, but the DR estimator also exhibits a significant amount of bias and variance. However, the CBPS, with or without the score equation, can substantially improve the performance of the DR estimator. Specifically, when the sample size is 1,000, the bias and RMSE are dramatically reduced. The CBPS also significantly improves the performance of the HT and IPW estimators, even when compared to the true propensity score. In sum, even when both the outcome and propensity score models are misspecified, the CBPS can yield robust estimates of treatment effects.

How can the CBPS dramatically improve the performance of the GLM? Figure 1 shows that the CBPS sacrifices likelihood in order to improve balance under two of the four simulation scenarios. We use the following multivariate version of the "standardized bias" (Rosenbaum and Rubin, 1985) to measure the overall covariate imbalance,

\mathrm{Imbalance} = \left\{ \left( \frac{1}{N} \sum_{i=1}^{N} w_{\hat{\beta}}(T_i, X_i) X_i \right)^{\!\top} \left( \frac{1}{N_1} \sum_{i=1}^{N} T_i X_i X_i^\top \right)^{\!-1} \left( \frac{1}{N} \sum_{i=1}^{N} w_{\hat{\beta}}(T_i, X_i) X_i \right) \right\}^{1/2}.    (24)

The figure shows that when both the outcome and propensity score models are correctly specified (the top row), the CBPS achieves better covariate balance (middle plot) without sacrificing much likelihood (left plot). In contrast, when both models are misspecified (the bottom row), the CBPS significantly improves covariate balance (middle plot) at the cost of some likelihood (left plot). The plots in the right column show the CBPS's trade-off between likelihood and covariate balance, where a larger improvement in covariate balance is associated with a larger loss of likelihood.

3.2 Improved Performance of Propensity Score Matching Methods

In an influential article, LaLonde (1986) empirically evaluated the ability of various estimators to obtain an unbiased estimate of the average treatment effect in the absence of randomized treatment assignment. From a randomized study of a job training program (the National Supported Work Demonstration, or NSW) where an unbiased estimate of the average treatment effect is available, LaLonde constructed an "observational study" by replacing the control group of the experimental NSW data with untreated observations from non-experimental data sets such as the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID). LaLonde showed that the estimators he evaluated failed to replicate the experimental benchmark, and this finding led to increasing interest in experimental evaluation among social scientists.

More than ten years later, Dehejia and Wahba (1999) revisited the LaLonde study and showed that the propensity score matching estimator was able to closely replicate the experimental benchmark. Dehejia and Wahba estimated the propensity score using the logistic regression and matched each treated observation with a control observation based on the estimated propensity score. This finding, however, came under intense criticism by Smith and Todd (2005). Smith and Todd argued that the impressive performance of the propensity score matching estimator reported by Dehejia and Wahba critically hinges on a particular subsample of the original LaLonde data they analyze. This subsample excludes most of the high-income workers from the non-experimental comparison sets, thereby making the selection bias much easier to overcome.
Indeed, Smith and Todd showed that if one focuses on the Dehejia and Wahba sample, other conventional estimators do as well as the propensity score matching estimator. Moreover, the propensity score matching estimator has difficulty replicating the experimental benchmark when applied to the original LaLonde data and is quite sensitive to the propensity score model specification.

In what follows, we investigate whether the CBPS can improve the poor performance of the propensity score matching estimator reported by Smith and Todd (2005). Specifically, we analyze the original LaLonde experimental sample (297 treated and 425 untreated observations) and use the PSID as the comparison data (2,490 observations). The data sets are obtained from Dehejia's website. The pre-treatment covariates in the data include age, education, race (white, black, or Hispanic), marital status, high school degree, 1974 earnings, and 1975 earnings. The outcome variable of interest is 1978 earnings. In this sample, the experimental benchmark for the average treatment effect is $886 with a standard error of $488. The only discrepancy between the publicly available data we analyze and the data analyzed by Smith and Todd is that the former does not include 1974 earnings in the experimental sample. Therefore, we impute this variable by regressing 1974 earnings on the remaining pre-treatment covariates in the non-experimental sample, and then using the fitted linear regression to predict 1974 earnings for the experimental sample. An affine transformation, truncated at zero, is applied to the predicted values such that the sample mean of 1974 earnings and the sample proportion of those who earn nothing in 1974 match closely with the corresponding results reported in Smith and Todd (2005).

We first replicate Smith and Todd's analysis by estimating the "evaluation bias," which is defined as the average effect of being in the experimental sample on 1978 earnings. Specifically, we estimate the conditional probability of being in the experimental sample given the pre-treatment covariates. Based on this estimated propensity score, we match the control observations in the experimental sample with similar observations in the non-experimental sample. Since neither group of workers received the job training program, the true average treatment effect is zero. Following the original analysis, we conduct one-to-one and one-to-ten nearest neighbor matching with replacement, where matching is done on the log-odds of the estimated propensity score. Furthermore, to examine the sensitivity to the propensity score model specification, we fit three different logistic regression models: a linear specification, a quadratic specification that includes the squares of the non-binary covariates, and the specification used by Smith and Todd (2005), which is based on Dehejia and Wahba's variable selection procedure and adds an interaction term between Hispanic and zero earnings in 1974 to the quadratic specification.
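A minimal sketch of the matching step just described is given below, assuming a vector of fitted propensity scores from either the standard logistic regression or the CBPS. Matching is done with replacement on the log-odds scale; the function name and arguments are illustrative, not the authors' code.

```python
import numpy as np

def match_att(y, T, pscore, k=1):
    """One-to-k nearest neighbor matching with replacement on the log-odds
    of the propensity score; returns the average treated-minus-matched
    difference in outcomes (k=1 or k=10 in the analysis described above)."""
    logodds = np.log(pscore) - np.log1p(-pscore)
    treated = np.where(T == 1)[0]
    control = np.where(T == 0)[0]
    effects = []
    for i in treated:
        dist = np.abs(logodds[control] - logodds[i])
        nearest = control[np.argsort(dist)[:k]]      # with replacement
        effects.append(y[i] - y[nearest].mean())
    return float(np.mean(effects))
```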
Finally, our standard errors are based on Abadie and Imbens (2006) rather than the bootstrap used by Smith and Todd, which may not yield a valid standard error (Abadie and Imbens, 2008).

Table 2: Estimated Evaluation Bias of One-to-one and One-to-ten Nearest Neighbor Propensity Score Matching Estimators with Replacement. The results (standard errors are in parentheses) represent the estimated average effect of being in the experimental sample on 1978 earnings, where the experimental control group is compared with the matched subset of the untreated non-experimental sample. If a matching estimator is successful, then its estimated effect should be close to the true effect, which is zero. The propensity score is estimated in three different ways using the logistic regression: standard logistic regression (GLM), the CBPS without the score equation (Balance), and the CBPS which combines the score equation and the balance conditions (CBPS). In addition, we consider three different logistic propensity score model specifications: the linear and the quadratic function of covariates, and the model specification used by Smith and Todd (2005). Across the three propensity score model specifications and the two matching estimators, the CBPS has the smallest bias and significantly improves on the standard logistic regression.

                           1-to-1 Nearest Neighbor        1-to-10 Nearest Neighbor
Model Specification        GLM      Balance    CBPS       GLM      Balance    CBPS
Linear                    −1643      −377      −188      −1329      −564      −392
                           (877)     (841)     (792)      (727)     (708)     (711)
Quadratic                 −2800     −1180       234      −1828      −675      −465
                           (935)     (932)     (799)      (714)     (739)     (686)
Smith and Todd (2005)     −2882      −879      −346      −1951      −735      −224
                           (950)     (850)     (830)      (725)     (720)     (745)

Table 2 presents the estimated evaluation bias of one-to-one and one-to-ten nearest neighbor propensity score matching with replacement across the three different propensity score model specifications described above. The results are compared across the propensity score estimation methods: the standard logistic regression (GLM), the CBPS with the balance conditions only (Balance), and the CBPS that combines the likelihood and balance conditions. Despite the small discrepancy in the data, our estimates based on the standard logistic regression are comparable with the estimates presented in Smith and Todd (2005). For one-to-one (one-to-ten) matching, our estimate is −2882 (−1951) with a standard error of 950 (725), while Smith and Todd's estimate is −2932 (−2119) with a standard error of 898 (787). Across all specifications and matching methods, the CBPS exhibits the smallest estimated bias and significantly improves the performance of the propensity score matching estimator based on the standard logistic regression. For example, with the same propensity score specification as the one used in Smith and Todd (2005), the estimated evaluation bias is reduced by almost 90% (−2882 is reduced to −346 for one-to-one matching) if one uses the CBPS instead of the standard logistic regression. The CBPS with the balance conditions alone also has a smaller bias than the standard logistic regression, but the CBPS that combines the likelihood and balance conditions appears to work best across the scenarios considered here. In addition, as seen in Section 3.1, the CBPS also seems to make the matching estimators less sensitive to changes in the propensity score model specification. To see where the dramatic improvement of the CBPS comes from, we examine the covariate balance across the estimation methods and model specifications.
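The balance summaries examined in the next paragraphs and reported in Table 3 can be computed along the following lines. This is a minimal illustrative sketch with unnormalized inverse-probability weights and assumed variable names, not the authors' implementation; the per-covariate measure is defined formally in the next paragraph and the multivariate measure is that of equation (24).

```python
import numpy as np

def standardized_bias(x, T, pscore):
    """Weighted difference in means of one covariate, scaled by its
    standard deviation among treated units (entire-sample version)."""
    diff = np.mean(T * x / pscore - (1 - T) * x / (1 - pscore))
    return diff / x[T == 1].std(ddof=1)

def multivariate_imbalance(X, T, pscore):
    """Multivariate standardized bias of equation (24)."""
    w = (T - pscore) / (pscore * (1 - pscore))
    d = (w[:, None] * X).mean(axis=0)
    S = (X[T == 1].T @ X[T == 1]) / T.sum()
    return float(np.sqrt(d @ np.linalg.solve(S, d)))
```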
Table 3 presents the standardized bias for each covariate, both in the entire sample (top panel) and in the (one-to-one) matched sample (bottom panel). The standardized bias for each covariate in the entire sample is defined as the inverse propensity score weighted difference in means between the treatment and control groups, divided by the standard deviation of the covariate within the treatment group. Similarly, for the matched sample, it equals the difference in means between the two groups, divided by the standard deviation among the treated observations. In addition to the univariate standardized bias, we also present the log-likelihood and the multivariate standardized bias statistic; the latter is defined in equation (24) for the entire sample, and for the matched sample we use the standard definition available in the literature (Rosenbaum and Rubin, 1985).

Table 3: Standardized Covariate Balance Across Different Propensity Score Estimation Methods and Model Specifications. We compare three different ways of estimating the propensity score: the standard logistic regression (GLM), the CBPS without the score equation (Balance), and the CBPS that combines the score equation with the balancing conditions (CBPS). For each estimation method, we consider three logistic regression specifications: the linear (Linear), the quadratic function of covariates (Quadratic), and the model specification used by Smith and Todd (2005). For each covariate, we report the inverse propensity score weighted difference in means between the treatment and control groups of the entire sample, divided by its standard deviation among the treated observations (upper panel), as well as the difference in means between the two groups of the (one-to-one) matched sample, divided by its standard deviation among the treated observations (bottom panel). In addition, the log-likelihood and a multivariate analogue of the above imbalance statistics are presented. The CBPS sacrifices some degree of likelihood in order to improve balance.

As seen in Section 3.1, the CBPS significantly improves the balance across covariates when compared to the standard logistic regression (GLM) while sacrificing some degree of likelihood. This is true across model specifications and in both the entire and matched samples. The CBPS without the score equation reduces covariate imbalance as well. By construction, it matches the inverse propensity score weighted means perfectly between the treatment and control groups. However, for the one-to-one matched sample, the CBPS has generally lower standardized bias and much higher likelihood. This may suggest that the CBPS without the score equation overfits the data and may explain the fact that the CBPS with both the score and balancing equations has a better performance in Table 2. In addition, we note that the over-identifying restrictions test described in Section 2.3 rejects the null hypothesis that the propensity score model specification is correct. The J-statistic is equal to 79 (22 degrees of freedom), 89 (30), and 99 (32) for the linear, quadratic, and Smith and Todd model specifications, respectively. This may explain the fact that the estimates based on the CBPS are still biased by several hundred dollars, though the magnitude of the bias is significantly reduced when compared with the standard logistic regression estimates.

Finally, we estimate the average treatment effect rather than the evaluation bias. Smith and Todd (2005) focused on the estimation of evaluation bias, but others, including LaLonde (1986) and Dehejia and Wahba (1999), studied the differences between the experimental benchmark and the estimates based on various statistical methods. As explained by Smith and Todd (2005), the LaLonde experimental sample combined with the PSID comparison sample we analyze presents a difficult selection bias problem to overcome. In the literature, for example, Diamond and Sekhon (2011) analyze this same sample using the Genetic Matching estimator and present an estimate of −571 (with a 95% confidence interval of [−2786, 1645]), which is not very close to the experimental benchmark of 886 with a standard error of 488. In contrast, when applied to other samples, various methods have yielded estimates that are much closer to the experimental benchmark (e.g., Dehejia and Wahba, 1999; Diamond and Sekhon, 2011; Hainmueller, 2012).

Table 4: Comparison between the Estimated Average Treatment Effect and the Experimental Benchmark. The experimental estimate is 886 with a standard error of 488. The top panel presents the matching estimates where the conditional probability of being in the experimental sample is used as the propensity score, whereas the bottom panel presents the estimates with the conditional probability of being in the experimental treatment group as the propensity score. We compare three different ways of estimating the propensity score: the standard logistic regression (GLM), the CBPS without the score equation (Balance), and the CBPS that combines the score equation with the balancing conditions (CBPS). For each estimation method, we consider two or three logistic regression specifications: the linear (Linear), the quadratic function of covariates (Quadratic), and, for the evaluation propensity score, the model specification used by Smith and Todd (2005). The standard errors are in parentheses. The CBPS with and without the score equation yields estimates that are much closer to the experimental benchmark when compared to the standard logistic regression.

Evaluation propensity score
                           1-to-1 Nearest Neighbor        1-to-10 Nearest Neighbor
Model specification        GLM      Balance    CBPS       GLM      Balance    CBPS
Linear                     −928        66       692      −1340       −93        84
                          (1080)     (966)     (989)      (873)     (843)     (898)
Quadratic                 −2825      −144      1419      −1533       −35       145
                          (1229)    (1023)     (979)      (879)     (894)     (849)
Smith and Todd (2005)     −2489      −422       554      −1506      −183       309
                          (1203)    (1039)     (977)      (858)     (843)     (863)

Treatment propensity score
                           1-to-1 Nearest Neighbor        1-to-10 Nearest Neighbor
Model specification        GLM      Balance    CBPS       GLM      Balance    CBPS
Linear                     −298       585       350       −616      −227        90
                          (1050)     (986)     (962)      (777)     (834)     (760)
Quadratic                  −675       861       291       −643        50       −38
                          (1106)    (1039)     (986)      (885)     (886)     (755)

We conduct two propensity score matching analyses. First, we use the conditional probability of being in the experimental sample as the propensity score. This corresponds to the propensity score model specifications and estimation methods used to generate the estimates of evaluation bias given in Table 2. The advantage of this approach is that we have a larger sample size because the experimental control group is used to estimate the propensity score. Second, we use the conditional probability of belonging to the experimental treatment group as the propensity score, using only the experimental treatment group and the PSID comparison sample.
This resembles the scenario faced by most applied researchers, where the experimental control group is not available. After generating the propensity score based on the various specifications, we conduct one-to-one and one-to-ten nearest neighbor matching with replacement based on the log-odds of the estimated propensity score, as done before. Note that the Smith and Todd specification is tailored to the evaluation propensity score and hence is not used for the treatment propensity score.

Table 4 presents the results based on the estimated evaluation propensity score in the top panel and those based on the estimated treatment propensity score in the bottom panel. One clear pattern emerges from these results. The CBPS, with or without the score equation, yields matching estimates that are much closer to the experimental benchmark when compared with the standard logistic regression (GLM). For the evaluation propensity score, the CBPS with the score equation does better than the CBPS without it. However, for the treatment propensity score, neither version dominates the other. In addition, even for the CBPS, the one-to-ten nearest neighbor matching estimator tends to yield estimates that are further away from the experimental benchmark than the one-to-one nearest neighbor matching estimator. This suggests that for some treated observations there may not be sufficiently many comparable control observations.

4 Extensions

So far, we have shown that the CBPS can dramatically improve the performance of propensity score weighting and matching estimators when estimating average treatment effects in observational studies. Another important advantage of the CBPS is that it can be extended to a number of other important causal inference settings by directly incorporating various balancing conditions. In this section, we briefly describe several potential extensions of the CBPS while leaving their empirical applications to future research.

4.1 The Generalized Propensity Score for Multi-Valued Treatments

First, we extend the CBPS to causal inference with multi-valued treatments. Suppose that we have a multi-valued treatment where the treatment variable T_i takes one of K integer values, i.e., T_i \in \mathcal{T} = \{0, \ldots, K - 1\} where K \geq 2. The binary treatment case considered in Section 2 is a special case with K = 2. Following Imbens (2000), we can define the generalized propensity score as the following multinomial probabilities,

\pi_\beta^k(X_i) = \Pr(T_i = k \mid X_i)    (25)

where all conditional probabilities sum to unity, i.e., \sum_{k=0}^{K-1} \pi_\beta^k(X_i) = 1. For example, one may use the multinomial logistic regression to model this generalized propensity score. As in the binary case, we have the moment condition that is based on the score function under the likelihood framework,

\frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{K-1} \frac{1\{T_i = k\}}{\pi_\beta^k(X_i)} \cdot \frac{\partial \pi_\beta^k(X_i)}{\partial \beta^\top} = 0.    (26)

Regarding the balancing property of the generalized propensity score, weighting the covariates by its inverse will balance them across all treatment levels. This fact yields the following (K − 1) sets of moment conditions,

\frac{1}{N} \sum_{i=1}^{N} \left\{ \frac{1\{T_i = k\} \tilde{X}_i}{\pi_\beta^k(X_i)} - \frac{1\{T_i = k - 1\} \tilde{X}_i}{\pi_\beta^{k-1}(X_i)} \right\} = 0    (27)

for each k = 1, \ldots, K − 1. As before, these moment conditions can be combined with that of equation (26) under the GMM or EL framework.
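A sketch of the balancing conditions in equation (27) is given below, assuming a multinomial logistic parameterization of the generalized propensity score with category 0 as the baseline. The function names and the coefficient matrix B are assumptions of this sketch rather than the paper's notation.

```python
import numpy as np

def multinomial_propensity(X, B):
    """Generalized propensity scores pi^k(X_i), k = 0, ..., K-1, under a
    multinomial logit with one coefficient column per non-baseline category."""
    scores = np.hstack([np.zeros((X.shape[0], 1)), X @ B])     # baseline k = 0
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized softmax
    return exps / exps.sum(axis=1, keepdims=True)

def balance_conditions_multilevel(T, X_tilde, P):
    """For each k = 1, ..., K-1, the weighted mean difference of X_tilde
    between treatment levels k and k-1, as in equation (27)."""
    conditions = []
    for k in range(1, P.shape[1]):
        w = (T == k) / P[:, k] - (T == k - 1) / P[:, k - 1]
        conditions.append((w[:, None] * X_tilde).mean(axis=0))
    return np.concatenate(conditions)
```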
4.2 Adjusting for Time-Dependent Confounding in Longitudinal Data

Next, we show that the CBPS is also applicable to causal inference with time-dependent confounding and time-varying treatments in the longitudinal setting. Specifically, we show how to extend the CBPS to marginal structural models, which employ inverse propensity score weighting under the assumption of no unmeasured confounding (Robins, 1999; Robins et al., 2000). Suppose that we have a simple random sample of N units from a population for each of whom we have repeated measurements of the outcome and treatment variables throughout a total of J time periods. Let $Y_{ij}$ and $T_{ij}$ denote the observed outcome and time-varying treatment for unit i at time j, respectively. We also observe time-dependent confounders $X_{ij}$, which may or may not be affected by the treatment status of the same unit in the previous time periods. Under this setting, for a given unit i in a given time period j, the propensity score is defined as the conditional probability of receiving the treatment given the treatment history up to time $j-1$ and the covariate history up to time $j$,
$$
\pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij}) \;=\; \Pr(T_{ij} = 1 \mid \overline{T}_{i,j-1}, \overline{X}_{ij}) \tag{28}
$$
where $\overline{T}_{ij} = \{T_{i0}, T_{i1}, \ldots, T_{ij}\}$ and $\overline{X}_{ij} = \{X_{i0}, X_{i1}, \ldots, X_{ij}\}$ represent the treatment and covariate history up to time period $j$, respectively. Thus, the maximum likelihood estimate of $\beta$ is given by,
$$
\hat{\beta}_{\mathrm{MLE}} \;=\; \operatorname*{argmax}_{\beta \in \Theta} \; \sum_{i=1}^N \sum_{j=1}^J T_{ij} \log \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij}) + (1 - T_{ij}) \log\{1 - \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})\}. \tag{29}
$$
Then, the moment condition based on the score function of this likelihood formulation equals,
$$
\frac{1}{JN} \sum_{i=1}^N \sum_{j=1}^J \left\{ \frac{T_{ij}\, \pi_\beta'(\overline{T}_{i,j-1}, \overline{X}_{ij})}{\pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} - \frac{(1 - T_{ij})\, \pi_\beta'(\overline{T}_{i,j-1}, \overline{X}_{ij})}{1 - \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} \right\} \;=\; 0. \tag{30}
$$
We combine this moment condition with the balancing property of the propensity score. In this context, weighting by the inverse of the propensity score should balance the covariate distribution between the treated and control units at each time period. Thus, we have a total of J additional sets of moment conditions, which are of the following form,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{T_{ij}\, \widetilde{Z}_{ij}}{\pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} - \frac{(1 - T_{ij})\, \widetilde{Z}_{ij}}{1 - \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} \right\} \;=\; 0 \tag{31}
$$
where $\widetilde{Z}_{ij} = f(\overline{T}_{i,j-1}, \overline{X}_{ij})$ is a vector-valued function of the treatment history up to time $j-1$ and the covariate history up to time $j$. Together with equation (30), these moment conditions can yield the GMM or EL estimate of the propensity score that simultaneously optimizes the resulting balance and the prediction of treatment assignment.
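For illustration only, the following sketch evaluates the J sets of balancing conditions in equation (31) at a fitted propensity score model. The names `T`, `Z_tilde`, and `pscore` are hypothetical: an N-by-J matrix of treatment indicators, an N-by-J-by-M array holding the functions of treatment and covariate histories, and an N-by-J matrix of fitted probabilities.

```python
import numpy as np


def longitudinal_balance_moments(T, Z_tilde, pscore):
    """Evaluate the J sets of balancing conditions in equation (31): inverse
    propensity score weighting should balance functions of the treatment and
    covariate histories at every time period (a sketch)."""
    N, J = T.shape
    blocks = []
    for j in range(J):
        w1 = T[:, j] / pscore[:, j]              # weights for treated units at period j
        w0 = (1 - T[:, j]) / (1 - pscore[:, j])  # weights for control units at period j
        blocks.append(np.mean((w1 - w0)[:, None] * Z_tilde[:, j, :], axis=0))
    return np.concatenate(blocks)
```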
4.3 Generalizing Experimental Results

Another possible extension of the CBPS concerns the generalization of experimental results to a target population. While the randomization of treatment assignment eliminates selection bias, experimental studies may suffer from a lack of external validity because experimental samples are not representative of a target population. Cole and Stuart (2010) and Stuart et al. (2011) use the propensity score to generalize experimental results (see also Imai and Ratkovic, 2011). Suppose that we have an experimental sample of $N_e$ units where the binary treatment variable $T_i$ is completely randomized. Let $S_i$ represent the sampling indicator where $S_i = 1$ if unit i is in the experimental sample and $S_i = 0$ otherwise. In this context, the “propensity score” is defined as the conditional probability of being in the experimental sample given the pre-treatment characteristics,
$$
\pi_\beta(X_i) \;=\; \Pr(S_i = 1 \mid X_i) \tag{32}
$$
where $X_i$ is a K-dimensional vector of covariates and $\beta$ is an L-dimensional vector of unknown parameters. In addition to the experimental sample, we assume that a random sample representative of the target population $\mathcal{P}$ is available and that its sample size is $N_{ne}$. Without loss of generality, we assume that the first $N_e$ units belong to the experimental sample, i.e., $S_i = 1$ for $i = 1, \ldots, N_e$, and the last $N_{ne}$ units belong to the non-experimental sample, i.e., $S_i = 0$ for $i = N_e + 1, \ldots, N$, where $N = N_e + N_{ne}$ represents the total sample size. The assumption that makes the generalization of experimental results possible is $\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp S_i \mid X_i$, which implies that the sample selection bias can be eliminated by conditioning on the pre-treatment characteristics $X_i$. Under this assumption, the propensity score $\pi_\beta(X_i)$ is estimated by fitting, for example, a logistic regression with $S_i$ as the response variable. Similar to equation (7), the moment condition from this model can be represented by the following score function,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{S_i\, \pi_\beta'(X_i)}{\pi_\beta(X_i)} - \frac{(1 - S_i)\, \pi_\beta'(X_i)}{1 - \pi_\beta(X_i)} \right\} \;=\; 0. \tag{33}
$$
In addition, there are two moment conditions based on the balancing property. First, if the propensity score is correct, then appropriately weighting the covariates in the experimental sample will make their distribution similar to that of the weighted non-experimental sample,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{S_i\, \widetilde{X}_i}{\pi_\beta(X_i)} - \frac{(1 - S_i)\, \widetilde{X}_i}{1 - \pi_\beta(X_i)} \right\} \;=\; 0 \tag{34}
$$
where $\widetilde{X}_i = f(X_i)$ is an M-dimensional vector-valued function of covariates. Second, in many cases, it can be assumed that the treatment is not available for the non-experimental sample and that the control condition in the experiment is equivalent to the condition to which the units in the non-experimental sample are exposed, i.e., $T_i = 0$ for $i = N_e + 1, \ldots, N$. In this case, we can impose an additional moment condition based on the fact that the outcome distribution for the experimental control group should match that of the weighted non-experimental sample. Formally, we have,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{N_e}{N_e - N_1} \cdot \frac{S_i (1 - T_i) Y_i}{\pi_\beta(X_i)} - \frac{(1 - S_i) Y_i}{1 - \pi_\beta(X_i)} \right\} \;=\; 0. \tag{35}
$$
Finally, the moment conditions given in equations (33)–(34) and (35) can be combined under the GMM or EL framework as done in Section 2 to estimate the propensity score.
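A minimal sketch of the moment conditions (34) and (35) follows, assuming `S`, `T`, `Y`, `X_tilde`, and `pscore` are hypothetical arrays holding the sampling indicator, treatment (zero for non-experimental units), outcome, covariate functions, and fitted sampling probabilities, and assuming that $N_1$ in equation (35) denotes the number of treated units in the experimental sample.

```python
import numpy as np


def generalization_moments(S, T, Y, X_tilde, pscore):
    """Evaluate the balancing condition (34) and the outcome condition (35)
    used to reweight an experimental sample to a target population (a sketch)."""
    Ne = S.sum()        # experimental sample size
    N1 = (S * T).sum()  # assumed: number of treated units in the experiment
    # (34): inverse-probability-weighted covariates in the experimental sample
    # should match the weighted non-experimental (target) sample.
    w_exp = S / pscore
    w_non = (1 - S) / (1 - pscore)
    balance = np.mean((w_exp - w_non)[:, None] * X_tilde, axis=0)
    # (35): weighted outcomes of the experimental control group should match
    # the weighted outcomes of the non-experimental sample.
    outcome = np.mean(Ne / (Ne - N1) * S * (1 - T) * Y / pscore
                      - (1 - S) * Y / (1 - pscore))
    return balance, outcome
```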
4.4 Generalizing the Instrumental Variables Estimates

The final extension we consider is the generalization of instrumental variables estimates. As shown by Angrist et al. (1996), the method of instrumental variables identifies the average treatment effect for the compliers, or the Local Average Treatment Effect (LATE). However, this LATE parameter has been criticized on external validity grounds (see e.g., Deaton, 2009; Heckman and Urzua, 2010; Imbens, 2010). Specifically, in the presence of treatment effect heterogeneity, the LATE does not generally equal the average treatment effect for the overall population (ATE). Recently, in order to extrapolate the LATE to the ATE, Angrist and Fernandez-Val (2010) propose to weight the conditional LATE by the “propensity score,” which in this context is defined as the conditional probability of being a complier given the covariates (see also Aronow and Sovey, 2011).

Formally, consider a simple random sample of N units from a population. Suppose that we have a binary instrumental variable $Z_i$ where $Z_i = 1$ ($Z_i = 0$) means that unit i is (not) encouraged to receive the treatment. Let $T_i$ represent a binary treatment variable as before and $T_i(z)$ be the potential treatment value under the encouragement status $z$. Following Angrist et al. (1996), we define a complier $C_i = c$ as a unit who receives the treatment only when encouraged ($(T_i(1), T_i(0)) = (1, 0)$), a never-taker $C_i = n$ as a unit who never takes the treatment ($(T_i(1), T_i(0)) = (0, 0)$), and an always-taker $C_i = a$ as a unit who always receives the treatment ($(T_i(1), T_i(0)) = (1, 1)$). The monotonicity assumption states that there is no defier who takes the treatment only when not encouraged ($(T_i(1), T_i(0)) = (0, 1)$). Under this setting, the propensity score is defined as,
$$
\pi_\beta(X_i) \;=\; \Pr(C_i = c \mid X_i) \tag{36}
$$
where $\beta \in \Theta$ is a vector of unknown parameters.

First, we consider the case of one-sided noncompliance where the units with $Z_i = 0$ are not eligible for the treatment and hence no always-taker exists. Under this setting, the propensity score can be estimated using the units who are encouraged ($Z_i = 1$). This group of units consists of either compliers ($T_i = 1$) or never-takers ($T_i = 0$), and hence the ML estimator can be written as,
$$
\hat{\beta}_{\mathrm{MLE}} \;=\; \operatorname*{argmax}_{\beta \in \Theta} \; \sum_{i=1}^N Z_i \left[ T_i \log \pi_\beta(X_i) + (1 - T_i) \log\{1 - \pi_\beta(X_i)\} \right]. \tag{37}
$$
Thus, the moment condition based on the score function is given by,
$$
\frac{1}{N} \sum_{i=1}^N Z_i \left\{ \frac{T_i\, \pi_\beta'(X_i)}{\pi_\beta(X_i)} - \frac{(1 - T_i)\, \pi_\beta'(X_i)}{1 - \pi_\beta(X_i)} \right\} \;=\; 0. \tag{38}
$$
Now, the moment conditions based on the balancing property of the propensity score are given by,
$$
\frac{1}{N} \sum_{i=1}^N \frac{N}{N_1} Z_i \left\{ \frac{T_i\, \widetilde{X}_i}{\pi_\beta(X_i)} - \frac{(1 - T_i)\, \widetilde{X}_i}{1 - \pi_\beta(X_i)} \right\} \;=\; 0 \tag{39}
$$
where $N_1 = \sum_{i=1}^N Z_i$ is the number of units who are encouraged. In addition, if the instrumental variable $Z_i$ is randomized, as is often the case for randomized experiments with noncompliance, then the distribution of weighted covariates in the encouragement group should match that of the non-encouragement group. This implies the following additional moment condition,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{N}{N_1} \cdot \frac{Z_i T_i\, \widetilde{X}_i}{\pi_\beta(X_i)} - \frac{N}{N - N_1} (1 - Z_i)\, \widetilde{X}_i \right\} \;=\; 0. \tag{40}
$$
These moment conditions can then in turn be leveraged to estimate the CBPS within the GMM or EL framework.
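For concreteness, the following is a sketch, not a definitive implementation, of the one-sided noncompliance case: the complier probability model is fit on the encouraged units, and the balancing conditions (39) and (40) are then evaluated at the fit. The arguments `Z`, `T`, `X`, and `X_tilde` are hypothetical arrays for the instrument, treatment, covariates, and covariate functions.

```python
import numpy as np
import statsmodels.api as sm


def complier_balance_moments(Z, T, X, X_tilde):
    """One-sided noncompliance: fit the complier probability model on the
    encouraged units and evaluate the balancing conditions (39) and (40)
    (a sketch)."""
    design = sm.add_constant(X)
    # Among encouraged units (Z = 1), T = 1 identifies compliers.
    fit = sm.Logit(T[Z == 1], design[Z == 1]).fit(disp=0)
    pscore = fit.predict(design)
    N, N1 = len(Z), Z.sum()
    # (39): within the encouraged group, inverse weighting balances covariates
    # between compliers and never-takers.
    m1 = np.mean((N / N1) * (Z * (T / pscore - (1 - T) / (1 - pscore)))[:, None]
                 * X_tilde, axis=0)
    # (40): weighted compliers among the encouraged should match the
    # (unweighted) non-encouraged group.
    m2 = np.mean(((N / N1) * Z * T / pscore
                  - (N / (N - N1)) * (1 - Z))[:, None] * X_tilde, axis=0)
    return np.concatenate([m1, m2])
```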
Next, we consider the case of two-sided noncompliance. Here, we allow for the existence of always-takers so that some units who are not encouraged may receive the treatment, i.e., $(T_i, Z_i) = (1, 0)$. Thus, the propensity score is now defined as a set of multinomial probabilities,
$$
\pi_\beta^c(X_i) \;=\; \Pr(C_i = c \mid X_i) \tag{41}
$$
$$
\pi_\gamma^a(X_i) \;=\; \Pr(C_i = a \mid X_i) \tag{42}
$$
where $\gamma$ is a vector of unknown parameters and the conditional probability of being a never-taker is given by $\Pr(C_i = n \mid X_i) = 1 - \pi_\beta^c(X_i) - \pi_\gamma^a(X_i)$. As shown by Imbens and Rubin (1997), the likelihood function has the following mixture structure,
$$
\prod_{i=1}^N \left[ \{\pi_\beta^c(X_i) + \pi_\gamma^a(X_i)\}^{T_i} \{1 - \pi_\beta^c(X_i) - \pi_\gamma^a(X_i)\}^{1 - T_i} \right]^{Z_i} \left[ \pi_\gamma^a(X_i)^{T_i} \{1 - \pi_\gamma^a(X_i)\}^{1 - T_i} \right]^{1 - Z_i}. \tag{43}
$$
Thus, the moment condition based on the score equation can be derived from this likelihood function. In addition, we can exploit the covariate balancing properties of the propensity score. In particular, units in each of the four observed cells defined by $(T_i, Z_i)$ can be weighted so that their covariate distribution matches the population distribution. The moment conditions are then given by,
$$
\frac{1}{N} \sum_{i=1}^N \frac{N}{N_1} Z_i \left\{ \frac{T_i\, \widetilde{X}_i}{\pi_\beta^c(X_i) + \pi_\gamma^a(X_i)} - \frac{(1 - T_i)\, \widetilde{X}_i}{1 - \pi_\beta^c(X_i) - \pi_\gamma^a(X_i)} \right\} \;=\; 0 \tag{44}
$$
$$
\frac{1}{N} \sum_{i=1}^N \frac{N}{N - N_1} (1 - Z_i) \left\{ \frac{T_i\, \widetilde{X}_i}{\pi_\gamma^a(X_i)} - \frac{(1 - T_i)\, \widetilde{X}_i}{1 - \pi_\gamma^a(X_i)} \right\} \;=\; 0 \tag{45}
$$
$$
\frac{1}{N} \sum_{i=1}^N T_i \left\{ \frac{N Z_i\, \widetilde{X}_i}{N_1\, \pi_\beta^c(X_i)} - \frac{N (1 - Z_i)\, \widetilde{X}_i}{(N - N_1)\, \pi_\gamma^a(X_i)} \right\} \;=\; 0. \tag{46}
$$
Again, these moment conditions can be combined to estimate the CBPS under the GMM or EL framework.

5 Concluding Remarks

Propensity score matching and weighting methods have become popular tools for applied researchers in various disciplines who conduct causal inference in observational studies. The propensity score methodology has also been extended to various other settings including longitudinal data, non-binary treatment regimes, and the generalization of experimental results. Despite this development, relatively little attention has been paid to the question of how the propensity score should be estimated (see McCaffrey et al., 2004, for an exception). This is unfortunate because researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects.

The proposed methodology, the covariate balancing propensity score (CBPS), enables the robust and efficient parametric estimation of the propensity score by directly incorporating the two key properties of the propensity score: a good estimate of the propensity score predicts treatment assignment and balances covariates at the same time. We directly exploit these properties and estimate the propensity score within the familiar framework of the Generalized Method of Moments (GMM) or Empirical Likelihood (EL). This means that existing theoretical results and methodologies for these frameworks can be directly applied to the CBPS as well.

While we provide some empirical evidence that the CBPS can dramatically improve the performance of propensity score weighting and matching methods, there exist a number of important issues that merit future research. First, the potential extensions outlined in this paper are of interest to applied researchers, and the empirical performance of the CBPS in these situations must be rigorously investigated. Second, the important issue of model selection has not been addressed. While the CBPS is relatively robust to model misspecification, its successful application requires a principled method of choosing an appropriate model specification as well as covariate balancing criteria. Existing methods within the GMM framework, such as those of Andrews (1999) and Caner (2009), are directly applicable, and yet their empirical performance in various causal inference contexts remains to be investigated.

References

Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 1, 235–267.
Abadie, A. and Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators. Econometrica 76, 6, 1537–1557.
Abadie, A. and Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business and Economic Statistics 29, 1, 1–11.
Andrews, D. (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica 67, 3, 543–563.
Angrist, J. and Fernandez-Val, I. (2010). ExtrapoLATE-ing: External validity and overidentification in the LATE framework. Working Paper No. 16566, National Bureau of Economic Research.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association 91, 434, 444–455.
Aronow, P. M. and Sovey, A. J. (2011). Beyond LATE: Estimation of the average treatment effect with an instrumental variable. Working paper, Department of Political Science, Yale University.
Caner, M. (2009). LASSO-type GMM estimator. Econometric Theory 25, 1, 270–290.
Cole, S. R. and Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology 172, 1, 107–115.
Deaton, A. (2009). Instruments of development: Randomization in the tropics, and the search for the elusive keys to economic development. Proceedings of the British Academy 162, 123–160.
Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94, 1053–1062.
Diamond, A. and Sekhon, J. (2011). Genetic matching for estimating causal effects: A new method of achieving balance in observational studies. Working Paper, Department of Political Science, University of California, Berkeley.
Freedman, D. A. and Berk, R. A. (2008). Weighting regressions by propensity scores. Evaluation Review 32, 4, 392–409.
Graham, B. S., Campos de Xavier Pinto, C., and Egel, D. (Forthcoming). Inverse probability tilting for moment condition models with missing data. Review of Economic Studies.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
Hainmueller, J. (2008). Synthetic matching for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Tech. rep., Department of Government, Harvard University.
Hainmueller, J. (2012). Entropy balancing for causal effects: Multivariate reweighting method to produce balanced samples in observational studies. Political Analysis 20, 1, 25–46.
Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association 99, 467, 609–618.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 4, 1029–1054.
Hansen, L. P., Heaton, J., and Yaron, A. (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14, 3, 262–280.
Heckman, J. J., Ichimura, H., and Todd, P. (1998). Matching as an econometric evaluation estimator. Review of Economic Studies 65, 2, 261–294.
Heckman, J. J. and Urzua, S. (2010). Comparing IV with structural models: What simple IV can and cannot identify. Journal of Econometrics 156, 1, 27–37.
Hirano, K., Imbens, G., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 4, 1307–1338.
Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15, 3, 199–236.
Horvitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 260, 663–685.
Iacus, S., King, G., and Porro, G. (2011). Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association 106, 493, 345–361.
Imai, K., King, G., and Stuart, E. A. (2008). Misunderstandings among experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society, Series A (Statistics in Society) 171, 2, 481–502.
Imai, K. and Ratkovic, M. (2011). Identification of treatment effect heterogeneity as a variable selection problem. Working paper available at http://imai.princeton.edu/research/svm.html.
Imai, K. and van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99, 467, 854–866.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika 87, 3, 706–710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics 86, 1, 4–29.
Imbens, G. W. (2010). Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature 48, 2, 399–423.
Imbens, G. W. and Rubin, D. B. (1997). Bayesian inference for causal effects in randomized experiments with noncompliance. Annals of Statistics 25, 1, 305–327.
Kang, J. D. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussions). Statistical Science 22, 4, 523–539.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76, 4, 604–620.
Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine 23, 19, 2937–2960.
McCaffrey, D. F., Ridgeway, G., and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9, 4, 403–425.
Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.
Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics 22, 1, 300–325.
Ratkovic, M. (2012). Achieving optimal covariate balance under general treatment regimes. Working Paper, Department of Politics, Princeton University.
Robins, J. (1999). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology, the Environment and Clinical Trials (eds. M. E. Halloran and D. A. Berry), 95–134. Springer, New York.
Robins, J., Sued, M., Lei-Gomez, Q., and Rotnitzky, A. (2007). Comment: Performance of double-robust estimators when ‘inverse probability’ weights are highly variable. Statistical Science 22, 4, 544–559.
Robins, J. M., Hernán, M. A., and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 5, 550–560.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 427, 846–866.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90, 106–121.
Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical Association 82, 398, 387–394.
Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association 84, 1024–1032.
Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B, Methodological 53, 597–610.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1, 41–55.
Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 387, 516–524.
Rosenbaum, P. R. and Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39, 33–38.
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine 26, 20–36.
Smith, J. A. and Todd, P. E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics 125, 1–2, 305–353.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science 25, 1, 1–21.
Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society, Series A (Statistics in Society) 174, 2, 369–386.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 97, 3, 661–682.