Covariate Balancing Propensity Score*

Kosuke Imai†   Marc Ratkovic‡

First Draft: March 3, 2012   This Draft: March 7, 2012

Abstract

The propensity score plays a central role in a variety of settings for causal inference. In particular, matching and weighting methods based on the estimated propensity score have become increasingly common in observational studies. Despite their popularity and theoretical appeal, the main practical difficulty of these methods is that the propensity score must be estimated. Researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects. In this paper, we introduce covariate balancing propensity score (CBPS) estimation, which simultaneously optimizes the covariate balance and the prediction of treatment assignment. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment, and estimate the CBPS within the generalized method of moments or empirical likelihood framework. We find that the CBPS dramatically improves the poor empirical performance of propensity score matching and weighting methods reported in the literature. We also show that the CBPS can be extended to a number of other important settings, including the estimation of the generalized propensity score for non-binary treatments, causal inference in longitudinal settings, and the generalization of experimental and instrumental variable estimates to a target population.

Key Words: causal inference, instrumental variables, inverse propensity score weighting, marginal structural models, observational studies, propensity score matching, randomized experiments

* Financial support from the National Science Foundation (SES-0550873; SES-0752050) is acknowledged. We thank Jens Hainmueller, Dylan Small, and seminar participants at UCLA and Princeton for helpful suggestions.
† Assistant Professor, Department of Politics, Princeton University, Princeton NJ 08544. Phone: 609-258-6601, Email: [email protected], URL: http://imai.princeton.edu
‡ Post-doctoral Fellow in Formal and Quantitative Methods, Department of Politics, Princeton University, Princeton NJ 08544.

1 Introduction

The propensity score, defined as the conditional probability of receiving treatment given covariates, plays a central role in a variety of settings for causal inference. In their seminal article, Rosenbaum and Rubin (1983) showed that if the treatment assignment is strongly ignorable given covariates, then an unbiased estimate of the average treatment effect can be obtained by adjusting for the propensity score alone rather than the entire vector of covariates, which is often of high dimension. Over the next three decades, a number of new methods based on the propensity score have been developed, and they have become an essential part of applied researchers' toolkit in numerous disciplines. In particular, the propensity score is used to adjust for observed confounding through matching (e.g., Rosenbaum and Rubin, 1985; Rosenbaum, 1989; Abadie and Imbens, 2006), subclassification (e.g., Rosenbaum and Rubin, 1984; Rosenbaum, 1991; Hansen, 2004), weighting (e.g., Rosenbaum, 1987; Robins et al., 2000; Hirano et al., 2003), regression (e.g., Heckman et al., 1998), or their combinations (e.g., Robins et al., 1995; Ho et al., 2007; Abadie and Imbens, 2011). Imbens (2004), Lunceford and Davidian (2004), and Stuart (2010) provide comprehensive reviews of these and other related methods.
Despite their popularity and theoretical appeal, the main practical difficulty of these methods is that the propensity score must be estimated. In fact, researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects (e.g., Kang and Schafer, 2007; Smith and Todd, 2005). This challenge highlights the paradoxical nature of the propensity score: it is designed to reduce the dimension of covariates, and yet its estimation requires modeling of high-dimensional covariates. In practice, applied researchers search for an appropriate propensity score model specification by repeating the process of changing their model and checking the resulting covariate balance. Imai et al. (2008) called this phenomenon the "propensity score tautology": the propensity score is correct if it balances covariates. To directly address this problem, it is essential to develop an automated and robust method for estimating the propensity score.

In this paper, we introduce the covariate balancing propensity score (CBPS) and show how to estimate the propensity score such that both the resulting covariate balance and the prediction of treatment assignment are optimized. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment. Specifically, we use a set of moment conditions implied by both the covariate balancing property (i.e., mean independence between treatment and covariates after inverse propensity score weighting) and the standard estimation procedure (i.e., maximum likelihood). These moment conditions are then combined to estimate the CBPS within the familiar generalized method of moments (GMM) or empirical likelihood (EL) framework (Hansen, 1982; Owen, 2001).

The CBPS has several attractive characteristics. First, the CBPS estimation guards against the potential misspecification of a parametric propensity score model by selecting parameter values that maximize the resulting covariate balance. In Section 3, we find that the CBPS dramatically improves the poor empirical performance of propensity score matching and weighting methods reported by Kang and Schafer (2007) and Smith and Todd (2005). Second, the CBPS can be extended to a number of other important settings in causal inference. In Section 4, we extend the CBPS to the generalized propensity score for non-binary treatments (Imbens, 2000; Imai and van Dyk, 2004), time-varying treatments and time-dependent confounding in longitudinal data (e.g., Robins et al., 2000), and the generalization of experimental and instrumental variable estimates to a target population (e.g., Cole and Stuart, 2010; Angrist and Fernandez-Val, 2010). Third, the CBPS inherits all the theoretical properties and methodologies that already exist in the GMM and EL literature. For example, GMM specification tests and moment selection procedures are directly applicable. Finally, because the proposed methodology simply improves the estimation of the propensity score, various propensity score methods such as matching and weighting can be implemented without modification.

The CBPS contributes to the recent developments in the causal inference literature on automated covariate balancing methods (e.g., Diamond and Sekhon, 2011; Iacus et al., 2011; Hainmueller, 2012; Ratkovic, 2012). In particular, the CBPS builds upon and is closely related to several existing methods.
First, Hainmueller (2012) proposes the entropy balancing method to construct a weight for each control observation such that the sample moments of observed covariates are identical between the treatment and weighted control groups. Unlike the entropy balancing method, however, the CBPS constructs balancing weights directly from the propensity score. In this respect, the CBPS is similar to the method proposed by Graham et al. (Forthcoming) under the empirical likelihood framework. Yet, the CBPS critically differs from these methods in that it simultaneously maximizes the covariate balance and the prediction of treatment assignment. In addition, the CBPS is related to the method proposed by Tan (2010), which adjusts the maximum likelihood estimate of the propensity score to generate observation-specific weights. Like the method of Graham et al., Tan's method incorporates the outcome model and achieves many desirable properties such as double robustness, local efficiency, and sample boundedness. In contrast, the CBPS focuses on the estimation of the propensity score without consulting the outcome data, which aligns with the original spirit of the propensity score methodology (Rubin, 2007). This separation from the outcome model enables the CBPS to be applicable to a variety of causal inference settings and to be used with the existing propensity score methods such as matching and weighting. Section 2.4 gives a more detailed comparison between the CBPS and these existing methods.

2 The Proposed Methodology

2.1 The Setup

Consider a simple random sample of N observations from a population P. For each unit i, we observe a binary treatment variable T_i and a K-dimensional column vector of pre-treatment covariates X_i, where the support of X_i is denoted by \mathcal{X}. The propensity score is defined as the conditional probability of receiving the treatment given the covariates X_i. Following Rosenbaum and Rubin (1983), we assume that the true propensity score is bounded away from 0 and 1,

0 < \Pr(T_i = 1 \mid X_i = x) < 1 \quad \text{for any } x \in \mathcal{X}.    (1)

Rosenbaum and Rubin (1983) showed that if we further assume the ignorability of treatment assignment, i.e.,

\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp T_i \mid X_i    (2)

where Y_i(t) represents the potential outcome under treatment status t, then the treatment assignment is ignorable given the propensity score,

\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp T_i \mid \pi(X_i).    (3)

This implies that the unbiased estimation of treatment effects is possible by conditioning on the propensity score alone rather than the entire covariate vector X_i, which is in practice frequently of high dimension. This key observation about its dimension reduction property led to the subsequent development of various propensity score methods, including matching, subclassification, and weighting.

In observational studies, however, the propensity score is unknown and must be estimated from the data. Typically, researchers assume a parametric propensity score model,

\Pr(T_i = 1 \mid X_i) = \pi_\beta(X_i)    (4)

where \beta \in \Theta is an L-dimensional column vector of unknown parameters. For example, a popular choice is the following logistic model,

\pi_\beta(X_i) = \frac{\exp(X_i^\top \beta)}{1 + \exp(X_i^\top \beta)}    (5)

in which case we have L = K. Researchers then maximize the empirical fit of the model so that the estimated propensity score predicts the observed treatment assignment well. This is often done by maximizing the likelihood function,

\hat{\beta}_{\mathrm{MLE}} = \operatorname*{argmax}_{\beta \in \Theta} \sum_{i=1}^{N} T_i \log \pi_\beta(X_i) + (1 - T_i) \log\{1 - \pi_\beta(X_i)\}.    (6)
Assuming that \pi_\beta(X_i) is differentiable with respect to \beta, this implies the following first-order condition,

\frac{1}{N} \sum_{i=1}^{N} s_\beta(T_i, X_i) = 0 \quad \text{where} \quad s_\beta(T_i, X_i) = \frac{T_i \, \pi_\beta'(X_i)}{\pi_\beta(X_i)} - \frac{(1 - T_i) \, \pi_\beta'(X_i)}{1 - \pi_\beta(X_i)}    (7)

and \pi_\beta'(X_i) = \partial \pi_\beta(X_i) / \partial \beta^\top. The major difficulty of this estimation strategy is that the propensity score model may be misspecified, leading to biased estimates of treatment effects. While in theory a more complex nonparametric model can be used to estimate the propensity score (e.g., McCaffrey et al., 2004), the high dimensionality of covariates X_i often makes its successful application challenging.

To directly address this issue, we develop covariate balancing propensity score (CBPS) estimation as a robust estimation method for a parametric propensity score model. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment (Rosenbaum and Rubin, 1983). Specifically, we operationalize the covariate balancing property using inverse propensity score weighting,

E\left\{ \frac{T_i \tilde{X}_i}{\pi_\beta(X_i)} - \frac{(1 - T_i) \tilde{X}_i}{1 - \pi_\beta(X_i)} \right\} = 0    (8)

where \tilde{X}_i = f(X_i) is an M-dimensional vector-valued function of X_i specified by the researcher. Equation (8) holds regardless of the choice of f(\cdot). For example, setting \tilde{X}_i = X_i balances the first moment of each covariate, while defining \tilde{X}_i = (X_i^\top \; X_i^{2\top})^\top balances both the first and second moments. The researcher may also wish to balance a subset of covariates or include interactions. Similarly, if the average treatment effect for the treated is of interest, we may wish to weight the control group observations such that their (weighted) covariate distribution matches that of the treatment group. In this case, the moment condition becomes,

E\left\{ T_i \tilde{X}_i - \frac{\pi_\beta(X_i)(1 - T_i) \tilde{X}_i}{1 - \pi_\beta(X_i)} \right\} = 0.    (9)

Two remarks are in order before we move to the discussion of estimation and inference. First, equation (7) represents another balancing condition, where inverse propensity score weighting of the first derivative of the propensity score yields an identical first moment between the treatment and control groups. This observation also justifies combining these moment conditions to estimate the propensity score. Second, we note that the covariate balancing property follows directly from the definition of the propensity score and does not require the ignorability assumption given in equation (2).

2.2 Estimation and Inference

We combine the above moment conditions based on the covariate balancing property (i.e., equation (8) or (9)) with the score condition of the likelihood inference (i.e., equation (7)) under the generalized method of moments (GMM) or empirical likelihood (EL) framework. Define the sample analogue of the covariate balancing moment condition given in equation (8) as,

\frac{1}{N} \sum_{i=1}^{N} w_\beta(T_i, X_i) \tilde{X}_i \quad \text{where} \quad w_\beta(T_i, X_i) = \frac{T_i - \pi_\beta(X_i)}{\pi_\beta(X_i)\{1 - \pi_\beta(X_i)\}}.    (10)

If the ATT rather than the ATE is the quantity of interest, we use the sample analogue of equation (9) and the weight becomes,

w_\beta(T_i, X_i) = \frac{N}{N_1} \cdot \frac{T_i - \pi_\beta(X_i)}{1 - \pi_\beta(X_i)}    (11)

where N_1 is the number of units in the treatment group. When combined with the score equation given in equation (7), the total number of moment conditions is L + M, which exceeds the number of parameters L.
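To make these quantities concrete, the following minimal sketch computes the score condition in equation (7) and the ATE balancing condition in equation (10) for a logistic propensity score model, and stacks them into the single moment vector used by the GMM estimation described in the next subsection. The function and variable names (propensity, score_condition, and so on) are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

# Minimal sketch of the moment conditions in equations (7)-(10) for a
# logistic propensity score model; names are illustrative.

def propensity(X, beta):
    """Logistic propensity score pi_beta(X_i) = expit(X_i' beta)."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

def score_condition(T, X, beta):
    """Sample mean of the score s_beta(T_i, X_i) in equation (7).
    For the logistic model the score simplifies to (T_i - pi) * X_i."""
    p = propensity(X, beta)
    return ((T - p)[:, None] * X).mean(axis=0)

def balance_condition(T, X, X_tilde, beta):
    """Sample mean of the ATE balancing moment in equations (8) and (10)."""
    p = propensity(X, beta)
    w = (T - p) / (p * (1.0 - p))          # w_beta(T_i, X_i)
    return (w[:, None] * X_tilde).mean(axis=0)

def stacked_moments(T, X, X_tilde, beta):
    """Score and balance conditions stacked into one moment vector."""
    return np.concatenate([score_condition(T, X, beta),
                           balance_condition(T, X, X_tilde, beta)])
```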
Thus, we can use them as over-identifying restrictions under the optimal GMM framework (Hansen, 1982),

\hat{\beta}_{\mathrm{GMM}} = \operatorname*{argmin}_{\beta \in \Theta} \; \bar{g}_\beta(T, X)^\top \Sigma_\beta(T, X)^{-1} \bar{g}_\beta(T, X)    (12)

where \bar{g}_\beta(T, X) is the sample mean of the moment conditions,

\bar{g}_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} g_\beta(T_i, X_i)    (13)

and g_\beta(T_i, X_i) stacks all moment conditions,

g_\beta(T_i, X_i) = \begin{pmatrix} s_\beta(T_i, X_i) \\ w_\beta(T_i, X_i) \tilde{X}_i \end{pmatrix}.    (14)

We use the "continuous-updating" GMM estimator of Hansen et al. (1996), which, unlike the two-step optimal GMM estimator, is invariant (e.g., to linear transformations of the moment conditions) and is found to have better finite sample properties. Our choice of a consistent covariance estimator for g_\beta(T_i, X_i) is given by,

\Sigma_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} E\{ g_\beta(T_i, X_i) \, g_\beta(T_i, X_i)^\top \mid X_i \}    (15)

where we integrate out the treatment variable T_i conditional on the pre-treatment covariates X_i. We find that this covariance estimator outperforms the sample covariance of the moment conditions because the latter does not penalize large weights. In particular, in the case of the logistic regression propensity score model, i.e., \pi_\beta(X_i) = \mathrm{logit}^{-1}(X_i^\top \beta), we have the following expression for \Sigma_\beta(T, X),

\Sigma_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} \pi_\beta(X_i)\{1 - \pi_\beta(X_i)\} X_i X_i^\top & X_i \tilde{X}_i^\top \\ \tilde{X}_i X_i^\top & [\pi_\beta(X_i)\{1 - \pi_\beta(X_i)\}]^{-1} \tilde{X}_i \tilde{X}_i^\top \end{pmatrix}    (16)

when we use the ATE weight given in equation (10), or

\Sigma_\beta(T, X) = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} \pi_\beta(X_i)\{1 - \pi_\beta(X_i)\} X_i X_i^\top & \pi_\beta(X_i) X_i \tilde{X}_i^\top \\ \pi_\beta(X_i) \tilde{X}_i X_i^\top & \pi_\beta(X_i)/\{1 - \pi_\beta(X_i)\} \, \tilde{X}_i \tilde{X}_i^\top \end{pmatrix}    (17)

if the ATT weight given in equation (11) is used. With a set of reasonable starting values (e.g., \hat{\beta}_{\mathrm{MLE}}), we find that gradient-based optimization methods work well both in terms of speed and reliability. Finally, an alternative estimation strategy is based on the EL framework applied to the above moment conditions (Qin and Lawless, 1994), where the profile empirical likelihood ratio function is given by,

R(\beta) = \sup \left\{ \prod_{i=1}^{n} n p_i \;\middle|\; p_i \geq 0, \; \sum_{i=1}^{n} p_i = 1, \; \sum_{i=1}^{n} p_i \, g_\beta(T_i, X_i) = 0 \right\}.    (18)

This EL estimator shares many of the attractive properties of the above continuous-updating GMM estimator, notably invariance and higher-order bias properties.

2.3 Specification Test

Another advantage of combining the moment conditions in the GMM or EL framework is that we can use the test of over-identifying restrictions as a specification test for the propensity score model. For example, under the GMM framework, Hansen's J statistic is given by,

J = N \cdot \bar{g}_{\hat{\beta}_{\mathrm{GMM}}}(T, X)^\top \Sigma_{\hat{\beta}_{\mathrm{GMM}}}(T, X)^{-1} \bar{g}_{\hat{\beta}_{\mathrm{GMM}}}(T, X) \; \xrightarrow{d} \; \chi^2_{L+M}    (19)

where the null hypothesis is that the propensity score model is correctly specified. We emphasize that the failure to reject this null does not necessarily imply the correct specification of the propensity score model, because it may simply mean that the test lacks power (Imai et al., 2008). Nevertheless, the test could be useful for detecting model misspecification. Similarly, within the EL framework, the specification test can be conducted based on the likelihood ratio.
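As an illustration of the estimation strategy and specification test described above, the sketch below implements a continuous-updating GMM criterion and the J statistic under the ATE covariance of equation (16), reusing propensity() and stacked_moments() from the earlier sketch. It is a minimal illustration under those assumptions, not the authors' implementation; the small ridge term and the choice of starting values are conveniences of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def sigma_ate(X, X_tilde, beta):
    """Model-implied covariance of the stacked moments, as in equation (16)."""
    p = propensity(X, beta)
    blocks = []
    for i in range(X.shape[0]):
        x, xt = X[i][:, None], X_tilde[i][:, None]
        v = p[i] * (1 - p[i])
        top = np.hstack([v * x @ x.T, x @ xt.T])
        bot = np.hstack([xt @ x.T, (xt @ xt.T) / v])
        blocks.append(np.vstack([top, bot]))
    return np.mean(blocks, axis=0)

def cue_objective(beta, T, X, X_tilde, ridge=1e-6):
    """Continuous-updating criterion g_bar' Sigma(beta)^{-1} g_bar."""
    g_bar = stacked_moments(T, X, X_tilde, beta)
    S = sigma_ate(X, X_tilde, beta) + ridge * np.eye(len(g_bar))
    return g_bar @ np.linalg.solve(S, g_bar)

def fit_cbps(T, X, X_tilde):
    """Minimize the CUE criterion; beta_MLE would be a better starting value."""
    beta0 = np.zeros(X.shape[1])
    res = minimize(cue_objective, beta0, args=(T, X, X_tilde), method="BFGS")
    return res.x

def j_statistic(beta_hat, T, X, X_tilde):
    """Over-identification statistic J = N * g_bar' Sigma^{-1} g_bar."""
    return X.shape[0] * cue_objective(beta_hat, T, X, X_tilde)
```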
2.4 Related Methods

The CBPS is closely related to several existing methodologies. First, Hainmueller (2012) proposed the entropy balancing method, which weights each observation in order to achieve optimal balance. Since identifying N weights by balancing the sample mean of K covariates between the treatment and control groups is an underdetermined problem when K < N, the entropy balancing method minimizes the Kullback-Leibler divergence from a set of baseline weights chosen by researchers. The critical difference between the CBPS and the entropy balancing method is that the former directly models the propensity score, which in turn is used to construct weights for observations. While the weights that result from the entropy balancing method imply a propensity score, the CBPS makes it easier for applied researchers to model the propensity score directly by incorporating their substantive knowledge into a parametric model. Moreover, as illustrated in Section 4, the direct modeling of the propensity score widens the applicability of the CBPS to more complex situations. In addition, because we model the propensity score, the score equation is included as another set of moment conditions, leading to a test of over-identifying restrictions for model specification. In his 2008 working paper, Hainmueller (2008) briefly discusses the incorporation of covariate balancing conditions into the estimation of the propensity score under the EL framework as a potential extension of his method (Section IV C). However, his setup differs from ours in that the weights are constructed separately from the propensity score.

Second, Graham et al. (Forthcoming) propose an EL method, similar to the entropy balancing method, in which covariate balancing moment conditions are used to estimate the propensity score. Unlike these methods, however, the CBPS combines the score equation from the likelihood function with the covariate balancing conditions so that it simultaneously optimizes the prediction of treatment assignment and the covariate balance. Thus, the CBPS is based on a set of over-identifying restrictions, which enables the specification test for the propensity score model.

Third, under the likelihood framework, Tan (2010) identifies a set of constraints to generate observation-specific weights that enable desirable properties such as double robustness, local efficiency, and sample boundedness. These weights may not fall between 0 and 1 and hence cannot be interpreted as the propensity score. In addition, like the method of Graham et al., Tan's method incorporates the outcome model in a creative manner when constructing weights. In contrast, the CBPS focuses on the estimation of the propensity score without consulting the outcome data, which aligns with the original spirit of the propensity score methods (Rubin, 2007). As illustrated in Section 4, the direct connection between the CBPS and the propensity score widens the applicability of the CBPS. For the same reason, the CBPS can also be easily used in conjunction with the existing propensity score methods such as matching and weighting.

3 Simulation and Empirical Studies

In this section, we apply the proposed CBPS methodology to prominent simulation and empirical studies where propensity score methods have been shown to fail. We show that the CBPS can dramatically improve the poor performance of the propensity score weighting and matching methods reported in these previous studies.

3.1 Improved Performance of Propensity Score Weighting Methods

In a controversial article, Kang and Schafer (2007) present a set of simulation studies to examine the performance of propensity score weighting methods. They find that misspecification of the propensity score model can negatively affect the performance of various weighting methods. In particular, they show that while the doubly robust (DR) estimator of Robins et al.
(1994) provides a consistent estimate of the treatment effect if either the outcome model or the propensity score model is correct, the performance of the DR estimator can deteriorate when both models are slightly misspecified. This finding led to a rebuttal by Robins et al. (2007), who criticize the simulation setup and introduce alternative DR estimators. In this section, we exactly replicate the simulation study of Kang and Schafer (2007) except that we estimate the propensity score using our proposed methodology. We then examine whether or not the CBPS can improve the empirical performance of the propensity score weighting estimators.

In particular, Kang and Schafer (2007) used the following data generating process for their simulation study. There exist four pre-treatment covariates X_i^* = (X_{i1}^*, X_{i2}^*, X_{i3}^*, X_{i4}^*), each of which is independently, identically distributed (i.i.d.) according to the standard normal distribution. The true outcome model is a linear regression with these covariates, and the error term is an i.i.d. standard normal random variate such that the mean outcome equals 210, which is the quantity of interest to estimate. The true propensity score model is a logistic regression with X_i^* being the linear predictor such that the mean probability of receiving the treatment equals 0.5. Finally, only nonlinear transformations of the covariates are observed, and they are given by

X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) = \{\exp(X_{i1}^*/2), \; X_{i2}^*/(1 + \exp(X_{i1}^*)) + 10, \; (X_{i1}^* X_{i3}^*/25 + 0.6)^3, \; (X_{i2}^* + X_{i4}^* + 20)^2\}.

Kang and Schafer (2007) study four propensity score weighting estimators based on a propensity score model that is a logistic regression with X_i as the linear predictor. This propensity score model is misspecified because the true propensity score is a logistic regression with X_i^* as the linear predictor. The weighting estimators they examine are the Horvitz-Thompson estimator (HT; Horvitz and Thompson, 1952), the inverse propensity score weighting estimator (IPW; Hirano et al., 2003), the weighted least squares regression estimator (WLS; Robins et al., 2000; Freedman and Berk, 2008), and the doubly robust regression estimator (DR; Robins et al., 1994),

\hat{\mu}_{\mathrm{HT}} = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i Y_i}{\pi_{\hat{\beta}}(X_i)}    (20)

\hat{\mu}_{\mathrm{IPW}} = \sum_{i=1}^{n} \frac{T_i Y_i}{\pi_{\hat{\beta}}(X_i)} \bigg/ \sum_{i=1}^{n} \frac{T_i}{\pi_{\hat{\beta}}(X_i)}    (21)

\hat{\mu}_{\mathrm{WLS}} = \frac{1}{n} \sum_{i=1}^{n} X_i^\top \hat{\gamma}_{\mathrm{WLS}} \quad \text{where} \quad \hat{\gamma}_{\mathrm{WLS}} = \left( \sum_{i=1}^{n} \frac{T_i X_i X_i^\top}{\pi_{\hat{\beta}}(X_i)} \right)^{-1} \sum_{i=1}^{n} \frac{T_i X_i Y_i}{\pi_{\hat{\beta}}(X_i)}    (22)

\hat{\mu}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ X_i^\top \hat{\gamma}_{\mathrm{OLS}} + \frac{T_i (Y_i - X_i^\top \hat{\gamma}_{\mathrm{OLS}})}{\pi_{\hat{\beta}}(X_i)} \right\} \quad \text{where} \quad \hat{\gamma}_{\mathrm{OLS}} = \left( \sum_{i=1}^{n} T_i X_i X_i^\top \right)^{-1} \sum_{i=1}^{n} T_i X_i Y_i.    (23)

Both the weighted least squares and ordinary least squares regressions are misspecified because the true outcome model is linear in X_i^* rather than X_i. To estimate the propensity score, Kang and Schafer (2007) use the logistic regression with X_i being the linear predictor, i.e., \pi_\beta(X_i) = \mathrm{logit}^{-1}(X_i^\top \beta), which is misspecified because the true propensity score model is the logistic regression with X_i^* being the linear predictor. Our simulation study uses the same propensity score and outcome model specifications, but we investigate whether the CBPS improves the empirical performance of the weighting estimators. To estimate the CBPS, we use the same exact logistic regression specification but add the covariate balancing moment conditions, setting \tilde{X}_i = X_i, under the GMM framework of the proposed methodology outlined in Section 2.
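The data generating process and the four estimators in equations (20)-(23) can be sketched as follows. This is an illustrative sketch only: the coefficient values shown are those of the Kang and Schafer (2007) design as commonly reported and should be checked against the original article, and all function names are assumptions of the sketch rather than the paper's code. A Monte Carlo study along the lines described in the next paragraph would repeatedly call simulate() and compare weighting_estimators() under different fitted propensity scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Kang-Schafer style design (coefficients assumed; verify against the
    original article). Returns outcome, treatment, and observed covariates."""
    Z = rng.standard_normal((n, 4))                        # latent covariates X*
    y = 210 + Z @ np.array([27.4, 13.7, 13.7, 13.7]) + rng.standard_normal(n)
    p_true = 1 / (1 + np.exp(-(-Z[:, 0] + 0.5 * Z[:, 1]
                               - 0.25 * Z[:, 2] - 0.1 * Z[:, 3])))
    T = rng.binomial(1, p_true)
    X = np.column_stack([np.exp(Z[:, 0] / 2),
                         Z[:, 1] / (1 + np.exp(Z[:, 0])) + 10,
                         (Z[:, 0] * Z[:, 2] / 25 + 0.6) ** 3,
                         (Z[:, 1] + Z[:, 3] + 20) ** 2])   # observed covariates
    return y, T, X

def weighting_estimators(y, T, X, pscore):
    """HT, IPW, WLS, and DR estimators of equations (20)-(23);
    an intercept is added to the regression design as a convenience."""
    X1 = np.column_stack([np.ones(len(y)), X])
    w = T / pscore
    ht = np.mean(w * y)                                    # Horvitz-Thompson
    ipw = np.sum(w * y) / np.sum(w)                        # normalized IPW
    g_wls = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * y))
    wls = np.mean(X1 @ g_wls)                              # weighted least squares
    Xt, yt = X1[T == 1], y[T == 1]
    g_ols = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)          # OLS on the treated
    dr = np.mean(X1 @ g_ols + w * (y - X1 @ g_ols))        # doubly robust
    return ht, ipw, wls, dr
```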
Thus, our simulation study examines how replacing the standard logistic regression propensity score with the CBPS improves the empirical performance of the four commonly used weighting estimators. As in Kang and Schafer (2007), we conduct simulations under four scenarios: (1) both the propensity score and outcome models are correctly specified, (2) only the propensity score model is correctly specified, (3) only the outcome model is correctly specified, and (4) neither the propensity score nor the outcome model is correctly specified. For each scenario, two sample sizes, 200 and 1,000, are used; we conduct 10,000 Monte Carlo simulations and calculate the bias and root mean squared error (RMSE) for each estimator.

Table 1: Relative Performance of the Four Different Propensity Score Weighting Estimators Based on Different Propensity Score Estimation Methods under the Simulation Setting of Kang and Schafer (2007). The bias and root mean squared error (RMSE) are computed for the Horvitz-Thompson (HT), the inverse propensity score weighting (IPW), the inverse propensity score weighted least squares (WLS), and the doubly robust least squares (DR) estimators. The performance of the CBPS is compared with that of the standard logistic regression (GLM), the CBPS without the score equation (Balance), and the true propensity score (True). We consider four scenarios where the outcome and/or propensity score models are misspecified. The sample sizes are n = 200 and n = 1,000. The number of simulations is 10,000. Across the four weighting estimators, the CBPS is most robust to misspecification and dramatically improves the performance of GLM.

The results of our simulation study are presented in Table 1.
For each of the four scenarios, we examine the bias and RMSE of each weighting estimator based on four different propensity score methods: (a) the standard logistic regression with X_i being the linear predictor, as in the original simulation study (GLM), (b) the CBPS estimation with the covariate balancing moment conditions with respect to X_i but without the score condition (Balance), (c) the proposed CBPS estimation, and (d) the true propensity score (True), which is given by \pi_\beta(X_i) = \mathrm{logit}^{-1}(X_i^{*\top} \beta).

In the first scenario, where both models are correct, all four weighting methods have relatively low bias regardless of which propensity score estimation method we use. However, the HT estimator has a large variance and hence a large RMSE when either the standard logistic regression or the true propensity score is used. In contrast, when used with the CBPS, the same estimator has a much lower RMSE. A similar observation can be made for the IPW estimator, while the WLS and DR estimators are not sensitive to the choice of propensity score estimation method. The CBPS without the score equation improves the performance of the HT estimator, but the bias is somewhat larger when compared to the CBPS with the score equation.

The second simulation scenario shows the performance of the various estimators when the propensity score model is correct but the outcome model is misspecified. As expected, the results are quite similar to those of the first simulation scenario. When the propensity score model is correctly specified, the four weighting estimators have low bias. However, the CBPS significantly reduces the variance of the HT estimator even when compared with the true propensity score. This confirms the theoretical result in the literature that the estimated propensity score leads to a more efficient estimator of the average treatment effect than the true propensity score (e.g., Hahn, 1998; Hirano et al., 2003).

The third simulation scenario examines an interesting situation where the propensity score model is misspecified while the outcome models for the WLS and DR estimators are correctly specified. Here, we find that, as expected, the HT and IPW estimators, which rely solely on the propensity score, have large bias and RMSE when used with the standard logistic regression model. However, the CBPS significantly reduces their bias and RMSE regardless of whether it incorporates the score equation from the logistic likelihood function. In contrast, the bias and RMSE of the WLS and DR estimators remain low and essentially identical across the propensity score estimation methods. Together with the above results under the second scenario, these results confirm the double-robustness property, whereby the DR estimator performs well so long as either the propensity score or the outcome model is correctly specified.
Figure 1: Comparison of Likelihood and Imbalance between the Standard Logistic Regression (GLM) and the Covariate Balancing Propensity Score (CBPS). (Panel rows: Both Models Specified Correctly; Neither Model Specified Correctly. Panel columns: Log-Likelihood; Covariate Imbalance; Likelihood-Balance Tradeoff.) The two scenarios from the simulation results reported in Table 1 are shown for the sample size of n = 1,000: both the outcome and propensity score models are correctly specified (upper row), and both models are misspecified (bottom row). Each dot in a plot represents one Monte Carlo simulation draw. When both models are correct, CBPS reduces the multivariate standardized bias, which is a summary measure of imbalance across all covariates (middle upper plot), without sacrificing much likelihood, which is a measure of model fit (upper left plot). In contrast, when both models are incorrect, CBPS significantly improves balance while sacrificing some degree of model fit. The two plots in the right column illustrate this trade-off between likelihood and balance.

The final simulation scenario illustrates the most important point made by Kang and Schafer (2007): the performance of the DR estimator can deteriorate when both the propensity score and outcome models are incorrectly specified.
Under this scenario, all four estimators suffer from some degree of bias when used with the standard logistic regression model. The bias and RMSE are largest for the HT estimator, but the DR estimator also exhibits a significant amount of bias and variance. However, the CBPS, with or without the score equation, can substantially improve the performance of the DR estimator. Specifically, when the sample size is 1,000, the bias and RMSE are dramatically reduced. The CBPS also significantly improves the performance of the HT and IPW estimators, even when compared to the true propensity score. In sum, even when both the outcome and propensity score models are misspecified, the CBPS can yield robust estimates of treatment effects.

How can the CBPS dramatically improve the performance of the GLM? Figure 1 shows that the CBPS sacrifices likelihood in order to improve balance under two of the four simulation scenarios. We use the following multivariate version of the "standardized bias" (Rosenbaum and Rubin, 1985) to measure the overall covariate imbalance,

\mathrm{Imbalance} = \left\{ \left( \frac{1}{N} \sum_{i=1}^{N} w_{\hat{\beta}}(T_i, X_i) X_i \right)^{\!\top} \left( \frac{1}{N_1} \sum_{i=1}^{N} T_i X_i X_i^\top \right)^{\!-1} \left( \frac{1}{N} \sum_{i=1}^{N} w_{\hat{\beta}}(T_i, X_i) X_i \right) \right\}^{1/2}.    (24)

The figure shows that when both the outcome and propensity score models are correctly specified (the top row), the CBPS achieves better covariate balance (middle plot) without sacrificing much likelihood (left plot). In contrast, when both models are misspecified (the bottom row), the CBPS significantly improves covariate balance (middle plot) at the cost of some likelihood (left plot). The plots in the right column show the CBPS's trade-off between likelihood and covariate balance, where a larger improvement in covariate balance is associated with a larger loss of likelihood.

3.2 Improved Performance of Propensity Score Matching Methods

In an influential article, LaLonde (1986) empirically evaluated the ability of various estimators to obtain an unbiased estimate of the average treatment effect in the absence of randomized treatment assignment. From a randomized study of a job training program (the National Supported Work Demonstration, or NSW) where an unbiased estimate of the average treatment effect is available, LaLonde constructed an "observational study" by replacing the control group of the experimental NSW data with untreated observations from non-experimental data sets such as the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID). LaLonde showed that the estimators he evaluated failed to replicate the experimental benchmark, and this finding led to increasing interest in experimental evaluation among social scientists.

More than ten years later, Dehejia and Wahba (1999) revisited the LaLonde study and showed that the propensity score matching estimator was able to closely replicate the experimental benchmark. Dehejia and Wahba estimated the propensity score using the logistic regression and matched each treated observation with a control observation based on the estimated propensity score. This finding, however, came under intense criticism by Smith and Todd (2005). Smith and Todd argued that the impressive performance of the propensity score matching estimator reported by Dehejia and Wahba critically hinges on a particular subsample of the original LaLonde data they analyze. This subsample excludes most of the high-income workers from the non-experimental comparison sets, thereby making the selection bias much easier to overcome.
Indeed, Smith and Todd showed that if one focuses on the Dehejia and Wahba sample, other conventional estimators do as well as the propensity score matching estimator. Moreover, the propensity score matching estimator has difficulty replicating the experimental benchmark when applied to the original LaLonde data and is quite sensitive to the propensity score model specification.

In what follows, we investigate whether the CBPS can improve the poor performance of the propensity score matching estimator reported by Smith and Todd (2005). Specifically, we analyze the original LaLonde experimental sample (297 treated and 425 untreated observations) and use the PSID as the comparison data (2,490 observations). The data sets are obtained from Dehejia's website. The pre-treatment covariates in the data include age, education, race (white, black, or Hispanic), marital status, high school degree, 1974 earnings, and 1975 earnings. The outcome variable of interest is 1978 earnings. In this sample, the experimental benchmark for the average treatment effect is $886 with a standard error of $488. The only discrepancy between the publicly available data we analyze and the data analyzed by Smith and Todd is that the former does not include 1974 earnings in the experimental sample. Therefore, we impute this variable by regressing 1974 earnings on the remaining pre-treatment covariates in the non-experimental sample, and then using the fitted linear regression to predict 1974 earnings for the experimental sample. An affine transformation, truncated at zero, is applied to the predicted values such that the sample mean of 1974 earnings and the sample proportion of those who earn nothing in 1974 match closely with the corresponding results reported in Smith and Todd (2005).

We first replicate Smith and Todd's analysis by estimating the "evaluation bias," which is defined as the average effect of being in the experimental sample on 1978 earnings. Specifically, we estimate the conditional probability of being in the experimental sample given the pre-treatment covariates. Based on this estimated propensity score, we match the control observations in the experimental sample with similar observations in the non-experimental sample. Since neither group of workers received the job training program, the true average treatment effect is zero. Following the original analysis, we conduct one-to-one and one-to-ten nearest neighbor matching with replacement, where matching is done on the log-odds of the estimated propensity score. Furthermore, to examine the sensitivity to the propensity score model specification, we fit three different logistic regression models: a linear specification, a quadratic specification that includes the squares of the non-binary covariates, and the specification used by Smith and Todd (2005), which is based on Dehejia and Wahba's variable selection procedure and adds an interaction term between Hispanic and zero earnings in 1974 to the quadratic specification.
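A minimal sketch of the matching step just described is given below, assuming a vector of fitted propensity scores from either the standard logistic regression or the CBPS. Matching is done with replacement on the log-odds scale; the function name and arguments are illustrative, not the authors' code.

```python
import numpy as np

def match_att(y, T, pscore, k=1):
    """One-to-k nearest neighbor matching with replacement on the log-odds
    of the propensity score; returns the average treated-minus-matched
    difference in outcomes (k=1 or k=10 in the analysis described above)."""
    logodds = np.log(pscore) - np.log1p(-pscore)
    treated = np.where(T == 1)[0]
    control = np.where(T == 0)[0]
    effects = []
    for i in treated:
        dist = np.abs(logodds[control] - logodds[i])
        nearest = control[np.argsort(dist)[:k]]      # with replacement
        effects.append(y[i] - y[nearest].mean())
    return float(np.mean(effects))
```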
Finally, our standard errors are based on Abadie and Imbens (2006) rather than the bootstrap used by Smith and Todd, which may not yield a valid standard error (Abadie and Imbens, 2008).

Table 2: Estimated Evaluation Bias of One-to-one and One-to-ten Nearest Neighbor Propensity Score Matching Estimators with Replacement. The results (standard errors are in parentheses) represent the estimated average effect of being in the experimental sample on 1978 earnings, where the experimental control group is compared with the matched subset of the untreated non-experimental sample. If a matching estimator is successful, then its estimated effect should be close to the true effect, which is zero. The propensity score is estimated in three different ways using the logistic regression: standard logistic regression (GLM), the CBPS without the score equation (Balance), and the CBPS which combines the score equation and the balance conditions (CBPS). In addition, we consider three different logistic propensity score model specifications: the linear and the quadratic function of covariates, and the model specification used by Smith and Todd (2005). Across the three propensity score model specifications and the two matching estimators, the CBPS has the smallest bias and significantly improves on the standard logistic regression.

                           1-to-1 Nearest Neighbor        1-to-10 Nearest Neighbor
Model Specification        GLM      Balance    CBPS       GLM      Balance    CBPS
Linear                    −1643      −377      −188      −1329      −564      −392
                           (877)     (841)     (792)      (727)     (708)     (711)
Quadratic                 −2800     −1180       234      −1828      −675      −465
                           (935)     (932)     (799)      (714)     (739)     (686)
Smith and Todd (2005)     −2882      −879      −346      −1951      −735      −224
                           (950)     (850)     (830)      (725)     (720)     (745)

Table 2 presents the estimated evaluation bias of one-to-one and one-to-ten nearest neighbor propensity score matching with replacement across the three different propensity score model specifications described above. The results are compared across the propensity score estimation methods: the standard logistic regression (GLM), the CBPS with the balance conditions only (Balance), and the CBPS that combines the likelihood and balance conditions. Despite the small discrepancy in the data, our estimates based on the standard logistic regression are comparable with the estimates presented in Smith and Todd (2005). For one-to-one (one-to-ten) matching, our estimate is −2882 (−1951) with a standard error of 950 (725), while Smith and Todd's estimate is −2932 (−2119) with a standard error of 898 (787). Across all specifications and matching methods, the CBPS exhibits the smallest estimated bias and significantly improves the performance of the propensity score matching estimator based on the standard logistic regression. For example, with the same propensity score specification as the one used in Smith and Todd (2005), the estimated evaluation bias is reduced by almost 90% (−2882 is reduced to −346 for one-to-one matching) if one uses the CBPS instead of the standard logistic regression. The CBPS with the balance conditions alone also has a smaller bias than the standard logistic regression, but the CBPS that combines the likelihood and balance conditions appears to work best across the scenarios considered here. In addition, as seen in Section 3.1, the CBPS also seems to make the matching estimators less sensitive to changes in the propensity score model specification. To see where the dramatic improvement of the CBPS comes from, we examine the covariate balance across the estimation methods and model specifications.
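The balance summaries examined in the next paragraphs and reported in Table 3 can be computed along the following lines. This is a minimal illustrative sketch with unnormalized inverse-probability weights and assumed variable names, not the authors' implementation; the per-covariate measure is defined formally in the next paragraph and the multivariate measure is that of equation (24).

```python
import numpy as np

def standardized_bias(x, T, pscore):
    """Weighted difference in means of one covariate, scaled by its
    standard deviation among treated units (entire-sample version)."""
    diff = np.mean(T * x / pscore - (1 - T) * x / (1 - pscore))
    return diff / x[T == 1].std(ddof=1)

def multivariate_imbalance(X, T, pscore):
    """Multivariate standardized bias of equation (24)."""
    w = (T - pscore) / (pscore * (1 - pscore))
    d = (w[:, None] * X).mean(axis=0)
    S = (X[T == 1].T @ X[T == 1]) / T.sum()
    return float(np.sqrt(d @ np.linalg.solve(S, d)))
```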
Table 3 presents the standardized bias for each covariate, both in the entire sample (top panel) and in the (one-to-one) matched sample (bottom panel). The standardized bias for each covariate in the entire sample is defined as the inverse propensity score weighted difference in means between the treatment and control groups, divided by the standard deviation of the covariate within the treatment group. Similarly, for the matched sample, it equals the difference in means between the two groups, divided by the standard deviation among the treated observations. In addition to the univariate standardized bias, we also present the log-likelihood and the multivariate standardized bias statistic; the latter is defined in equation (24) for the entire sample, and for the matched sample we use the standard definition available in the literature (Rosenbaum and Rubin, 1985).

Table 3: Standardized Covariate Balance Across Different Propensity Score Estimation Methods and Model Specifications. We compare three different ways of estimating the propensity score: the standard logistic regression (GLM), the CBPS without the score equation (Balance), and the CBPS that combines the score equation with the balancing conditions (CBPS). For each estimation method, we consider three logistic regression specifications: the linear (Linear), the quadratic function of covariates (Quadratic), and the model specification used by Smith and Todd (2005). For each covariate, we report the inverse propensity score weighted difference in means between the treatment and control groups of the entire sample, divided by its standard deviation among the treated observations (upper panel), as well as the difference in means between the two groups of the (one-to-one) matched sample, divided by its standard deviation among the treated observations (bottom panel). In addition, the log-likelihood and a multivariate analogue of the above imbalance statistics are presented. The CBPS sacrifices some degree of likelihood in order to improve balance.

As seen in Section 3.1, the CBPS significantly improves the balance across covariates when compared to the standard logistic regression (GLM) while sacrificing some degree of likelihood. This is true across model specifications and in both the entire and matched samples. The CBPS without the score equation reduces covariate imbalance as well. By construction, it matches the inverse propensity score weighted means perfectly between the treatment and control groups. However, for the one-to-one matched sample, the CBPS has generally lower standardized bias and much higher likelihood. This may suggest that the CBPS without the score equation overfits the data and may explain the fact that the CBPS with both the score and balancing equations has a better performance in Table 2. In addition, we note that the over-identifying restrictions test described in Section 2.3 rejects the null hypothesis that the propensity score model specification is correct. The J-statistic is equal to 79 (22 degrees of freedom), 89 (30), and 99 (32) for the linear, quadratic, and Smith and Todd model specifications, respectively. This may explain the fact that the estimates based on the CBPS are still biased by several hundred dollars, though the magnitude of the bias is significantly reduced when compared with the standard logistic regression estimates.

Finally, we estimate the average treatment effect rather than the evaluation bias. Smith and Todd (2005) focused on the estimation of evaluation bias, but others, including LaLonde (1986) and Dehejia and Wahba (1999), studied the differences between the experimental benchmark and the estimates based on various statistical methods. As explained by Smith and Todd (2005), the LaLonde experimental sample combined with the PSID comparison sample we analyze presents a difficult selection bias problem to overcome. In the literature, for example, Diamond and Sekhon (2011) analyze this same sample using the Genetic Matching estimator and present an estimate of −571 (with a 95% confidence interval of [−2786, 1645]), which is not very close to the experimental benchmark of 886 with a standard error of 488. In contrast, when applied to other samples, various methods have yielded estimates that are much closer to the experimental benchmark (e.g., Dehejia and Wahba, 1999; Diamond and Sekhon, 2011; Hainmueller, 2012).

Table 4: Comparison between the Estimated Average Treatment Effect and the Experimental Benchmark. The experimental estimate is 886 with a standard error of 488. The top panel presents the matching estimates where the conditional probability of being in the experimental sample is used as the propensity score, whereas the bottom panel presents the estimates with the conditional probability of being in the experimental treatment group as the propensity score. We compare three different ways of estimating the propensity score: the standard logistic regression (GLM), the CBPS without the score equation (Balance), and the CBPS that combines the score equation with the balancing conditions (CBPS). For each estimation method, we consider two or three logistic regression specifications: the linear (Linear), the quadratic function of covariates (Quadratic), and, for the evaluation propensity score, the model specification used by Smith and Todd (2005). The standard errors are in parentheses. The CBPS with and without the score equation yields estimates that are much closer to the experimental benchmark when compared to the standard logistic regression.

Evaluation propensity score
                           1-to-1 Nearest Neighbor        1-to-10 Nearest Neighbor
Model specification        GLM      Balance    CBPS       GLM      Balance    CBPS
Linear                     −928        66       692      −1340       −93        84
                          (1080)     (966)     (989)      (873)     (843)     (898)
Quadratic                 −2825      −144      1419      −1533       −35       145
                          (1229)    (1023)     (979)      (879)     (894)     (849)
Smith and Todd (2005)     −2489      −422       554      −1506      −183       309
                          (1203)    (1039)     (977)      (858)     (843)     (863)

Treatment propensity score
                           1-to-1 Nearest Neighbor        1-to-10 Nearest Neighbor
Model specification        GLM      Balance    CBPS       GLM      Balance    CBPS
Linear                     −298       585       350       −616      −227        90
                          (1050)     (986)     (962)      (777)     (834)     (760)
Quadratic                  −675       861       291       −643        50       −38
                          (1106)    (1039)     (986)      (885)     (886)     (755)

We conduct two propensity score matching analyses. First, we use the conditional probability of being in the experimental sample as the propensity score. This corresponds to the propensity score model specifications and estimation methods used to generate the estimates of evaluation bias given in Table 2. The advantage of this approach is that we have a larger sample size because the experimental control group is used to estimate the propensity score. Second, we use the conditional probability of belonging to the experimental treatment group as the propensity score, using only the experimental treatment group and the PSID comparison sample.
This resembles the scenario faced by most applied researchers, where the experimental control group is not available. After generating the propensity score based on the various specifications, we conduct one-to-one and one-to-ten nearest neighbor matching with replacement based on the log-odds of the estimated propensity score, as done before. Note that the Smith and Todd specification is tailored to the evaluation propensity score and hence is not used for the treatment propensity score.

Table 4 presents the results based on the estimated evaluation propensity score in the top panel and those based on the estimated treatment propensity score in the bottom panel. One clear pattern emerges from these results. The CBPS, with or without the score equation, yields matching estimates that are much closer to the experimental benchmark when compared with the standard logistic regression (GLM). For the evaluation propensity score, the CBPS with the score equation does better than the CBPS without it. However, for the treatment propensity score, neither version dominates the other. In addition, even for the CBPS, the one-to-ten nearest neighbor matching estimator tends to yield estimates that are further away from the experimental benchmark than the one-to-one nearest neighbor matching estimator. This suggests that for some treated observations there may not be sufficiently many comparable control observations.

4 Extensions

So far, we have shown that the CBPS can dramatically improve the performance of propensity score weighting and matching estimators when estimating average treatment effects in observational studies. Another important advantage of the CBPS is that it can be extended to a number of other important causal inference settings by directly incorporating various balancing conditions. In this section, we briefly describe several potential extensions of the CBPS while leaving their empirical applications to future research.

4.1 The Generalized Propensity Score for Multi-Valued Treatments

First, we extend the CBPS to causal inference with multi-valued treatments. Suppose that we have a multi-valued treatment where the treatment variable T_i takes one of K integer values, i.e., T_i \in \mathcal{T} = \{0, \ldots, K - 1\} where K \geq 2. The binary treatment case considered in Section 2 is a special case with K = 2. Following Imbens (2000), we can define the generalized propensity score as the following multinomial probabilities,

\pi_\beta^k(X_i) = \Pr(T_i = k \mid X_i)    (25)

where all conditional probabilities sum to unity, i.e., \sum_{k=0}^{K-1} \pi_\beta^k(X_i) = 1. For example, one may use the multinomial logistic regression to model this generalized propensity score. As in the binary case, we have the moment condition that is based on the score function under the likelihood framework,

\frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{K-1} \frac{1\{T_i = k\}}{\pi_\beta^k(X_i)} \cdot \frac{\partial \pi_\beta^k(X_i)}{\partial \beta^\top} = 0.    (26)

Regarding the balancing property of the generalized propensity score, weighting the covariates by its inverse will balance them across all treatment levels. This fact yields the following (K − 1) sets of moment conditions,

\frac{1}{N} \sum_{i=1}^{N} \left\{ \frac{1\{T_i = k\} \tilde{X}_i}{\pi_\beta^k(X_i)} - \frac{1\{T_i = k - 1\} \tilde{X}_i}{\pi_\beta^{k-1}(X_i)} \right\} = 0    (27)

for each k = 1, \ldots, K − 1. As before, these moment conditions can be combined with that of equation (26) under the GMM or EL framework.
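A sketch of the balancing conditions in equation (27) is given below, assuming a multinomial logistic parameterization of the generalized propensity score with category 0 as the baseline. The function names and the coefficient matrix B are assumptions of this sketch rather than the paper's notation.

```python
import numpy as np

def multinomial_propensity(X, B):
    """Generalized propensity scores pi^k(X_i), k = 0, ..., K-1, under a
    multinomial logit with one coefficient column per non-baseline category."""
    scores = np.hstack([np.zeros((X.shape[0], 1)), X @ B])     # baseline k = 0
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized softmax
    return exps / exps.sum(axis=1, keepdims=True)

def balance_conditions_multilevel(T, X_tilde, P):
    """For each k = 1, ..., K-1, the weighted mean difference of X_tilde
    between treatment levels k and k-1, as in equation (27)."""
    conditions = []
    for k in range(1, P.shape[1]):
        w = (T == k) / P[:, k] - (T == k - 1) / P[:, k - 1]
        conditions.append((w[:, None] * X_tilde).mean(axis=0))
    return np.concatenate(conditions)
```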
4.2 Adjusting for Time-Dependent Confounding in Longitudinal Data

Next, we show that the CBPS is also applicable to causal inference with time-dependent confounding and time-varying treatments in the longitudinal setting. Specifically, we show how to extend the CBPS to marginal structural models, which employ inverse propensity score weighting under the assumption of no unmeasured confounding (Robins, 1999; Robins et al., 2000). Suppose that we have a simple random sample of N units from a population for each of whom we have repeated measurements of the outcome and treatment variables throughout a total of J time periods. Let $Y_{ij}$ and $T_{ij}$ denote the observed outcome and time-varying treatment for unit i at time j, respectively. We also observe time-dependent confounders $X_{ij}$, which may or may not be affected by the treatment status of the same unit in the previous time periods. Under this setting, for a given unit i in a given time period j, the propensity score is defined as the conditional probability of receiving the treatment given the treatment history up to time $j-1$ and the covariate history up to time $j$,
$$
\pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij}) \;=\; \Pr(T_{ij} = 1 \mid \overline{T}_{i,j-1}, \overline{X}_{ij}) \tag{28}
$$
where $\overline{T}_{ij} = \{T_{i0}, T_{i1}, \ldots, T_{ij}\}$ and $\overline{X}_{ij} = \{X_{i0}, X_{i1}, \ldots, X_{ij}\}$ represent the treatment and covariate history up to time period $j$, respectively. Thus, the maximum likelihood estimate of $\beta$ is given by,
$$
\hat{\beta}_{\mathrm{MLE}} \;=\; \operatorname*{argmax}_{\beta \in \Theta} \; \sum_{i=1}^N \sum_{j=1}^J T_{ij} \log \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij}) + (1 - T_{ij}) \log\{1 - \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})\}. \tag{29}
$$
Then, the moment condition based on the score function of this likelihood formulation equals,
$$
\frac{1}{JN} \sum_{i=1}^N \sum_{j=1}^J \left\{ \frac{T_{ij}\, \pi_\beta'(\overline{T}_{i,j-1}, \overline{X}_{ij})}{\pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} - \frac{(1 - T_{ij})\, \pi_\beta'(\overline{T}_{i,j-1}, \overline{X}_{ij})}{1 - \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} \right\} \;=\; 0. \tag{30}
$$
We combine this moment condition with the balancing property of the propensity score. In this context, weighting by the inverse of the propensity score should balance the covariate distribution between the treated and control units at each time period. Thus, we have a total of J additional sets of moment conditions, which are of the following form,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{T_{ij}\, \widetilde{Z}_{ij}}{\pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} - \frac{(1 - T_{ij})\, \widetilde{Z}_{ij}}{1 - \pi_\beta(\overline{T}_{i,j-1}, \overline{X}_{ij})} \right\} \;=\; 0 \tag{31}
$$
where $\widetilde{Z}_{ij} = f(\overline{T}_{i,j-1}, \overline{X}_{ij})$ is a vector-valued function of the treatment history up to time $j-1$ and the covariate history up to time $j$. Together with equation (30), these moment conditions can yield the GMM or EL estimate of the propensity score that simultaneously optimizes the resulting balance and the prediction of treatment assignment.
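For illustration only, the following sketch evaluates the J sets of balancing conditions in equation (31) at a fitted propensity score model. The names `T`, `Z_tilde`, and `pscore` are hypothetical: an N-by-J matrix of treatment indicators, an N-by-J-by-M array holding the functions of treatment and covariate histories, and an N-by-J matrix of fitted probabilities.

```python
import numpy as np


def longitudinal_balance_moments(T, Z_tilde, pscore):
    """Evaluate the J sets of balancing conditions in equation (31): inverse
    propensity score weighting should balance functions of the treatment and
    covariate histories at every time period (a sketch)."""
    N, J = T.shape
    blocks = []
    for j in range(J):
        w1 = T[:, j] / pscore[:, j]              # weights for treated units at period j
        w0 = (1 - T[:, j]) / (1 - pscore[:, j])  # weights for control units at period j
        blocks.append(np.mean((w1 - w0)[:, None] * Z_tilde[:, j, :], axis=0))
    return np.concatenate(blocks)
```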
4.3 Generalizing Experimental Results

Another possible extension of the CBPS concerns the generalization of experimental results to a target population. While the randomization of treatment assignment eliminates selection bias, experimental studies may suffer from a lack of external validity because experimental samples are not representative of a target population. Cole and Stuart (2010) and Stuart et al. (2011) use the propensity score to generalize experimental results (see also Imai and Ratkovic, 2011). Suppose that we have an experimental sample of $N_e$ units where the binary treatment variable $T_i$ is completely randomized. Let $S_i$ represent the sampling indicator where $S_i = 1$ if unit i is in the experimental sample and $S_i = 0$ otherwise. In this context, the “propensity score” is defined as the conditional probability of being in the experimental sample given the pre-treatment characteristics,
$$
\pi_\beta(X_i) \;=\; \Pr(S_i = 1 \mid X_i) \tag{32}
$$
where $X_i$ is a K-dimensional vector of covariates and $\beta$ is an L-dimensional vector of unknown parameters. In addition to the experimental sample, we assume that a random sample representative of the target population $\mathcal{P}$ is available and that its sample size is $N_{ne}$. Without loss of generality, we assume that the first $N_e$ units belong to the experimental sample, i.e., $S_i = 1$ for $i = 1, \ldots, N_e$, and the last $N_{ne}$ units belong to the non-experimental sample, i.e., $S_i = 0$ for $i = N_e + 1, \ldots, N$, where $N = N_e + N_{ne}$ represents the total sample size. The assumption that makes the generalization of experimental results possible is $\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp S_i \mid X_i$, which implies that the sample selection bias can be eliminated by conditioning on the pre-treatment characteristics $X_i$. Under this assumption, the propensity score $\pi_\beta(X_i)$ is estimated by fitting, for example, a logistic regression with $S_i$ as the response variable. Similar to equation (7), the moment condition from this model can be represented by the following score function,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{S_i\, \pi_\beta'(X_i)}{\pi_\beta(X_i)} - \frac{(1 - S_i)\, \pi_\beta'(X_i)}{1 - \pi_\beta(X_i)} \right\} \;=\; 0. \tag{33}
$$
In addition, there are two moment conditions based on the balancing property. First, if the propensity score is correct, then appropriately weighting the covariates in the experimental sample will make their distribution similar to that of the weighted non-experimental sample,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{S_i\, \widetilde{X}_i}{\pi_\beta(X_i)} - \frac{(1 - S_i)\, \widetilde{X}_i}{1 - \pi_\beta(X_i)} \right\} \;=\; 0 \tag{34}
$$
where $\widetilde{X}_i = f(X_i)$ is an M-dimensional vector-valued function of covariates. Second, in many cases, it can be assumed that the treatment is not available for the non-experimental sample and that the control condition in the experiment is equivalent to the condition to which the units in the non-experimental sample are exposed, i.e., $T_i = 0$ for $i = N_e + 1, \ldots, N$. In this case, we can impose an additional moment condition based on the fact that the outcome distribution for the experimental control group should match that of the weighted non-experimental sample. Formally, we have,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{N_e}{N_e - N_1} \cdot \frac{S_i (1 - T_i) Y_i}{\pi_\beta(X_i)} - \frac{(1 - S_i) Y_i}{1 - \pi_\beta(X_i)} \right\} \;=\; 0. \tag{35}
$$
Finally, the moment conditions given in equations (33)–(34) and (35) can be combined under the GMM or EL framework as done in Section 2 to estimate the propensity score.
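A minimal sketch of the moment conditions (34) and (35) follows, assuming `S`, `T`, `Y`, `X_tilde`, and `pscore` are hypothetical arrays holding the sampling indicator, treatment (zero for non-experimental units), outcome, covariate functions, and fitted sampling probabilities, and assuming that $N_1$ in equation (35) denotes the number of treated units in the experimental sample.

```python
import numpy as np


def generalization_moments(S, T, Y, X_tilde, pscore):
    """Evaluate the balancing condition (34) and the outcome condition (35)
    used to reweight an experimental sample to a target population (a sketch)."""
    Ne = S.sum()        # experimental sample size
    N1 = (S * T).sum()  # assumed: number of treated units in the experiment
    # (34): inverse-probability-weighted covariates in the experimental sample
    # should match the weighted non-experimental (target) sample.
    w_exp = S / pscore
    w_non = (1 - S) / (1 - pscore)
    balance = np.mean((w_exp - w_non)[:, None] * X_tilde, axis=0)
    # (35): weighted outcomes of the experimental control group should match
    # the weighted outcomes of the non-experimental sample.
    outcome = np.mean(Ne / (Ne - N1) * S * (1 - T) * Y / pscore
                      - (1 - S) * Y / (1 - pscore))
    return balance, outcome
```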
4.4 Generalizing the Instrumental Variables Estimates

The final extension we consider is the generalization of instrumental variables estimates. As shown by Angrist et al. (1996), the method of instrumental variables identifies the average treatment effect for the compliers, or the Local Average Treatment Effect (LATE). However, this LATE parameter has been criticized on external validity grounds (see e.g., Deaton, 2009; Heckman and Urzua, 2010; Imbens, 2010). Specifically, in the presence of treatment effect heterogeneity, the LATE does not generally equal the average treatment effect for the overall population (ATE). Recently, in order to extrapolate the LATE to the ATE, Angrist and Fernandez-Val (2010) propose to weight the conditional LATE by the “propensity score,” which in this context is defined as the conditional probability of being a complier given the covariates (see also Aronow and Sovey, 2011).

Formally, consider a simple random sample of N units from a population. Suppose that we have a binary instrumental variable $Z_i$ where $Z_i = 1$ ($Z_i = 0$) means that unit i is (not) encouraged to receive the treatment. Let $T_i$ represent a binary treatment variable as before and $T_i(z)$ be the potential treatment value under the encouragement status $z$. Following Angrist et al. (1996), we define a complier $C_i = c$ as a unit who receives the treatment only when encouraged ($(T_i(1), T_i(0)) = (1, 0)$), a never-taker $C_i = n$ as a unit who never takes the treatment ($(T_i(1), T_i(0)) = (0, 0)$), and an always-taker $C_i = a$ as a unit who always receives the treatment ($(T_i(1), T_i(0)) = (1, 1)$). The monotonicity assumption states that there is no defier who takes the treatment only when not encouraged ($(T_i(1), T_i(0)) = (0, 1)$). Under this setting, the propensity score is defined as,
$$
\pi_\beta(X_i) \;=\; \Pr(C_i = c \mid X_i) \tag{36}
$$
where $\beta \in \Theta$ is a vector of unknown parameters.

First, we consider the case of one-sided noncompliance where the units with $Z_i = 0$ are not eligible for the treatment and hence no always-taker exists. Under this setting, the propensity score can be estimated using the units who are encouraged ($Z_i = 1$). This group of units consists of either compliers ($T_i = 1$) or never-takers ($T_i = 0$), and hence the ML estimator can be written as,
$$
\hat{\beta}_{\mathrm{MLE}} \;=\; \operatorname*{argmax}_{\beta \in \Theta} \; \sum_{i=1}^N Z_i \left[ T_i \log \pi_\beta(X_i) + (1 - T_i) \log\{1 - \pi_\beta(X_i)\} \right]. \tag{37}
$$
Thus, the moment condition based on the score function is given by,
$$
\frac{1}{N} \sum_{i=1}^N Z_i \left\{ \frac{T_i\, \pi_\beta'(X_i)}{\pi_\beta(X_i)} - \frac{(1 - T_i)\, \pi_\beta'(X_i)}{1 - \pi_\beta(X_i)} \right\} \;=\; 0. \tag{38}
$$
Now, the moment conditions based on the balancing property of the propensity score are given by,
$$
\frac{1}{N} \sum_{i=1}^N \frac{N}{N_1} Z_i \left\{ \frac{T_i\, \widetilde{X}_i}{\pi_\beta(X_i)} - \frac{(1 - T_i)\, \widetilde{X}_i}{1 - \pi_\beta(X_i)} \right\} \;=\; 0 \tag{39}
$$
where $N_1 = \sum_{i=1}^N Z_i$ is the number of units who are encouraged. In addition, if the instrumental variable $Z_i$ is randomized, as is often the case for randomized experiments with noncompliance, then the distribution of weighted covariates in the encouragement group should match that of the non-encouragement group. This implies the following additional moment condition,
$$
\frac{1}{N} \sum_{i=1}^N \left\{ \frac{N}{N_1} \cdot \frac{Z_i T_i\, \widetilde{X}_i}{\pi_\beta(X_i)} - \frac{N}{N - N_1} (1 - Z_i)\, \widetilde{X}_i \right\} \;=\; 0. \tag{40}
$$
These moment conditions can then in turn be leveraged to estimate the CBPS within the GMM or EL framework.
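For concreteness, the following is a sketch, not a definitive implementation, of the one-sided noncompliance case: the complier probability model is fit on the encouraged units, and the balancing conditions (39) and (40) are then evaluated at the fit. The arguments `Z`, `T`, `X`, and `X_tilde` are hypothetical arrays for the instrument, treatment, covariates, and covariate functions.

```python
import numpy as np
import statsmodels.api as sm


def complier_balance_moments(Z, T, X, X_tilde):
    """One-sided noncompliance: fit the complier probability model on the
    encouraged units and evaluate the balancing conditions (39) and (40)
    (a sketch)."""
    design = sm.add_constant(X)
    # Among encouraged units (Z = 1), T = 1 identifies compliers.
    fit = sm.Logit(T[Z == 1], design[Z == 1]).fit(disp=0)
    pscore = fit.predict(design)
    N, N1 = len(Z), Z.sum()
    # (39): within the encouraged group, inverse weighting balances covariates
    # between compliers and never-takers.
    m1 = np.mean((N / N1) * (Z * (T / pscore - (1 - T) / (1 - pscore)))[:, None]
                 * X_tilde, axis=0)
    # (40): weighted compliers among the encouraged should match the
    # (unweighted) non-encouraged group.
    m2 = np.mean(((N / N1) * Z * T / pscore
                  - (N / (N - N1)) * (1 - Z))[:, None] * X_tilde, axis=0)
    return np.concatenate([m1, m2])
```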
Next, we consider the case of two-sided noncompliance. Here, we allow for the existence of always-takers so that some units who are not encouraged may receive the treatment, i.e., $(T_i, Z_i) = (1, 0)$. Thus, the propensity score is now defined as a set of multinomial probabilities,
$$
\pi_\beta^c(X_i) \;=\; \Pr(C_i = c \mid X_i) \tag{41}
$$
$$
\pi_\gamma^a(X_i) \;=\; \Pr(C_i = a \mid X_i) \tag{42}
$$
where $\gamma$ is a vector of unknown parameters and the conditional probability of being a never-taker is given by $\Pr(C_i = n \mid X_i) = 1 - \pi_\beta^c(X_i) - \pi_\gamma^a(X_i)$. As shown by Imbens and Rubin (1997), the likelihood function has the following mixture structure,
$$
\prod_{i=1}^N \left[ \{\pi_\beta^c(X_i) + \pi_\gamma^a(X_i)\}^{T_i} \{1 - \pi_\beta^c(X_i) - \pi_\gamma^a(X_i)\}^{1 - T_i} \right]^{Z_i} \left[ \pi_\gamma^a(X_i)^{T_i} \{1 - \pi_\gamma^a(X_i)\}^{1 - T_i} \right]^{1 - Z_i}. \tag{43}
$$
Thus, the moment condition based on the score equation can be derived from this likelihood function. In addition, we can exploit the covariate balancing properties of the propensity score. In particular, units in each of the four observed cells defined by $(T_i, Z_i)$ can be weighted so that their covariate distribution matches the population distribution. The moment conditions are then given by,
$$
\frac{1}{N} \sum_{i=1}^N \frac{N}{N_1} Z_i \left\{ \frac{T_i\, \widetilde{X}_i}{\pi_\beta^c(X_i) + \pi_\gamma^a(X_i)} - \frac{(1 - T_i)\, \widetilde{X}_i}{1 - \pi_\beta^c(X_i) - \pi_\gamma^a(X_i)} \right\} \;=\; 0 \tag{44}
$$
$$
\frac{1}{N} \sum_{i=1}^N \frac{N}{N - N_1} (1 - Z_i) \left\{ \frac{T_i\, \widetilde{X}_i}{\pi_\gamma^a(X_i)} - \frac{(1 - T_i)\, \widetilde{X}_i}{1 - \pi_\gamma^a(X_i)} \right\} \;=\; 0 \tag{45}
$$
$$
\frac{1}{N} \sum_{i=1}^N T_i \left\{ \frac{N Z_i\, \widetilde{X}_i}{N_1\, \pi_\beta^c(X_i)} - \frac{N (1 - Z_i)\, \widetilde{X}_i}{(N - N_1)\, \pi_\gamma^a(X_i)} \right\} \;=\; 0. \tag{46}
$$
Again, these moment conditions can be combined to estimate the CBPS under the GMM or EL framework.

5 Concluding Remarks

Propensity score matching and weighting methods have become popular tools for applied researchers in various disciplines who conduct causal inference in observational studies. The propensity score methodology has also been extended to various other settings including longitudinal data, non-binary treatment regimes, and the generalization of experimental results. Despite this development, relatively little attention has been paid to the question of how the propensity score should be estimated (see McCaffrey et al., 2004, for an exception). This is unfortunate because researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects.

The proposed methodology, the covariate balancing propensity score (CBPS), enables the robust and efficient parametric estimation of the propensity score by directly incorporating the two key properties of the propensity score: a good estimate of the propensity score predicts treatment assignment and balances covariates at the same time. We directly exploit these properties and estimate the propensity score within the familiar framework of the Generalized Method of Moments (GMM) or Empirical Likelihood (EL). This means that existing theoretical results and methodologies for these frameworks can be directly applied to the CBPS as well.

While we provide some empirical evidence that the CBPS can dramatically improve the performance of propensity score weighting and matching methods, there exist a number of important issues that merit future research. First, the potential extensions outlined in this paper are of interest to applied researchers, and the empirical performance of the CBPS in these situations must be rigorously investigated. Second, the important issue of model selection has not been addressed. While the CBPS is relatively robust to model misspecification, its successful application requires a principled method of choosing an appropriate model specification as well as covariate balancing criteria. Existing methods within the GMM framework, such as those of Andrews (1999) and Caner (2009), are directly applicable, and yet their empirical performance in various causal inference contexts remains to be investigated.

References

Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 1, 235–267.
Abadie, A. and Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators. Econometrica 76, 6, 1537–1557.
Abadie, A. and Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business and Economic Statistics 29, 1, 1–11.
Andrews, D. (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica 67, 3, 543–563.
Angrist, J. and Fernandez-Val, I. (2010). ExtrapoLATE-ing: External validity and overidentification in the LATE framework. Working Paper No. 16566, National Bureau of Economic Research.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association 91, 434, 444–455.
Aronow, P. M. and Sovey, A. J. (2011). Beyond LATE: Estimation of the average treatment effect with an instrumental variable. Working paper, Department of Political Science, Yale University.
Caner, M. (2009). LASSO-type GMM estimator. Econometric Theory 25, 1, 270–290.
Cole, S. R. and Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology 172, 1, 107–115.
Deaton, A. (2009). Instruments of development: Randomization in the tropics, and the search for the elusive keys to economic development. Proceedings of the British Academy 162, 123–160.
Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94, 1053–1062.
Diamond, A. and Sekhon, J. (2011). Genetic matching for estimating causal effects: A new method of achieving balance in observational studies. Working Paper, Department of Political Science, University of California, Berkeley.
Freedman, D. A. and Berk, R. A. (2008). Weighting regressions by propensity scores. Evaluation Review 32, 4, 392–409.
Graham, B. S., Campos de Xavier Pinto, C., and Egel, D. (Forthcoming). Inverse probability tilting for moment condition models with missing data. Review of Economic Studies.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
Hainmueller, J. (2008). Synthetic matching for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Tech. rep., Department of Government, Harvard University.
Hainmueller, J. (2012). Entropy balancing for causal effects: Multivariate reweighting method to produce balanced samples in observational studies. Political Analysis 20, 1, 25–46.
Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association 99, 467, 609–618.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 4, 1029–1054.
Hansen, L. P., Heaton, J., and Yaron, A. (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14, 3, 262–280.
Heckman, J. J., Ichimura, H., and Todd, P. (1998). Matching as an econometric evaluation estimator. Review of Economic Studies 65, 2, 261–294.
Heckman, J. J. and Urzua, S. (2010). Comparing IV with structural models: What simple IV can and cannot identify. Journal of Econometrics 156, 1, 27–37.
Hirano, K., Imbens, G., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 4, 1307–1338.
Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15, 3, 199–236.
Horvitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 260, 663–685.
Iacus, S., King, G., and Porro, G. (2011). Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association 106, 493, 345–361.
Imai, K., King, G., and Stuart, E. A. (2008). Misunderstandings among experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society, Series A (Statistics in Society) 171, 2, 481–502.
Imai, K. and Ratkovic, M. (2011). Identification of treatment effect heterogeneity as a variable selection problem. Working paper available at http://imai.princeton.edu/research/svm.html.
Imai, K. and van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99, 467, 854–866.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika 87, 3, 706–710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics 86, 1, 4–29.
Imbens, G. W. (2010). Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature 48, 2, 399–423.
Imbens, G. W. and Rubin, D. B. (1997). Bayesian inference for causal effects in randomized experiments with noncompliance. Annals of Statistics 25, 1, 305–327.
Kang, J. D. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussions). Statistical Science 22, 4, 523–539.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76, 4, 604–620.
Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine 23, 19, 2937–2960.
McCaffrey, D. F., Ridgeway, G., and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9, 4, 403–425.
Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.
Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics 22, 1, 300–325.
Ratkovic, M. (2012). Achieving optimal covariate balance under general treatment regimes. Working Paper, Department of Politics, Princeton University.
Robins, J. (1999). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology, the Environment and Clinical Trials (eds. M. E. Halloran and D. A. Berry), 95–134. Springer, New York.
Robins, J., Sued, M., Lei-Gomez, Q., and Rotnitzky, A. (2007). Comment: Performance of double-robust estimators when ‘inverse probability’ weights are highly variable. Statistical Science 22, 4, 544–559.
Robins, J. M., Hernán, M. A., and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 5, 550–560.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 427, 846–866.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90, 106–121.
Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical Association 82, 398, 387–394.
Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association 84, 1024–1032.
Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B, Methodological 53, 597–610.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1, 41–55.
Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 387, 516–524.
Rosenbaum, P. R. and Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39, 33–38.
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine 26, 20–36.
Smith, J. A. and Todd, P. E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics 125, 1–2, 305–353.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science 25, 1, 1–21.
Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society, Series A (Statistics in Society) 174, 2, 369–386.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 97, 3, 661–682.