Getting the Mean Right is a Good Thing: Generalized Additive Models[1]

Nathaniel Beck[2] and Simon Jackman[3]

January 31, 1997

[1] We thank Michael Dimock, Gary Jacobson, Gary King and John Oneal for their data, Ari Zentner for research assistance, and Richard Carson, Peter Hall, Trevor Hastie, Gary King, Robert Kohn, Jonathan Nagler and Adrian Pagan for helpful discussions. The analysis of the democratic peace draws on material coauthored with Richard Tucker. We are grateful to Jonathan Nagler, the director of the Social Science Computing Facility at the University of California, Riverside, and Kolin Hand, its systems programmer, for allowing us to use their computing facility. Errors and omissions remain our own responsibility.

[2] Department of Political Science, University of California, San Diego, La Jolla, CA 92093. e-mail: [email protected]

[3] Department of Political Science and Reshaping Australian Institutions Project, Division of Economics and Politics, Research School of the Social Sciences, Australian National University, Canberra, ACT 0200, Australia; on leave from Department of Political Science, Stanford University. e-mail: [email protected]

Contents

Abstract
1 Introduction
2 Generalized Additive Models
  2.1 Interpreting GAMs
  2.2 Contrasting other non-parametric approaches
3 Scatterplot Smoothing
  3.1 Smoothing by local regression (loess)
  3.2 Cubic smoothing splines
  3.3 Inference
      Linear Smoothers
      Equivalent degrees of freedom
      Standard Errors
      Pointwise versus Global Confidence Intervals
      Specification Tests
      Asymptotic Properties
  3.4 Estimating GAMs
  3.5 Example — Pennsylvania State Senate data
  3.6 Bandwidth selection and the bias-variance tradeoff
4 GAMs in practice
  4.1 The effect of Perot on Congressional races: GAMs as an alternative to multiple regression
  4.2 Overdrafts and the Congressional Vote: Assessing Transformation
  4.3 Do governments face rising hazards? — an analysis in discrete, but smooth, time
  4.4 The democratic peace: monadic or dyadic
5 Conclusion
Appendices
  A Will GAMs find spurious non-linearities?
  B Software for GAMs
References

Abstract

Theory: Social scientists almost always use statistical models positing the dependent variable as a linear function of X, despite suspicions that the social and political world is not so parsimonious. Generalized additive models (GAMs) permit each independent variable to be modelled non-parametrically while requiring that the independent variables combine additively, striking a sensible balance between the flexibility of non-parametric techniques and the ease of interpretation and familiarity of linear regression.
GAMs thus offer social scientists a practical methodology for improving on the extant practice of "linearity by default". Method: We present the statistical concepts and tools underlying GAMs (e.g., scatterplot smoothing, non-parametrics more generally, and accompanying graphical methods), and summarize issues pertaining to estimation, inference, and the statistical properties of GAMs. Monte Carlo experiments assess the validity of tests of linearity accompanying GAMs. Re-analysis of published work in American politics, comparative politics, and international relations demonstrates the usefulness of GAMs in social science settings. Results: Our re-analyses of published work show that GAMs can extract substantive mileage beyond that yielded by linear regression, offering novel insights, particularly in terms of modelling interactions. The Monte Carlo experiments show there is little danger of GAMs spuriously finding non-linear structures.[1]

[1] All data analysis, Monte Carlo experiments, and statistical graphs were generated using S-PLUS, Version 3.3 (Statsci 1995). The routines and data are available at ftp://weber.uscd.edu/pub/nbeck/gam.

1 Introduction

Social scientists make many assumptions in the course of employing statistical models. Linearity must rank as one of the more ubiquitous assumptions, yet one that typically gets fairly short shrift. For instance, while econometrics texts invariably make the linear functional form their first assumption (e.g., Greene 1993, 143), the consequences of violating later assumptions garner far more space. This strikes us as odd, given that the optimality properties of least squares and maximum likelihood estimators depend upon the researcher having specified the correct functional form for the mean of the dependent variable; understanding the consequences of heteroskedasticity, for example, seems to us a second-order problem if the model parameters tap the wrong relationship to start with.

To be sure, introductory texts go on to show how a variety of specific non-linear functional forms can be estimated in the linear framework via various transformations (logarithmic, polynomial, reciprocal). But little attention is paid to how analysts might come to know the correct functional form relating the independent variables to the mean of the dependent variable. Simple non-linear forms, such as polynomials or logarithmic transformations, often appear as clumsy attempts to capture subtleties; as we argue below, these tricks of the trade impose "global solutions" on what are typically "local problems". And in practice, these simple non-linear functional forms are often difficult to estimate with much precision, falling prey to problems of multicollinearity.

The sources of this disparity in emphasis are fairly clear. First, social and political theories rarely suggest a specific functional form. We might occasionally encounter arguments for higher order terms or non-linear transformations; examples include returns to scale, diminishing returns, spatial contagion, cyclical patterns in time series data, non-linear constraints imposed by working with rates or proportions data, and non-monotone survival data. But if a priori non-linearities are rare, then rarer still is guidance about the specific form of these non-linearities. The more striking point here is not the absence of theories implying non-linearity or non-monotonicity. Rather, few social scientific theories offer any guidance as to functional form whatsoever.
Statements like "y increases with x" (monotonicity) are as specific as most social theories get. We are not arguing that non-linearities, non-monotonicities, and other seemingly complicated phenomena necessarily abound; we are saying that we do not know one way or the other. Our theories are typically silent on the issue, and empirical social scientists are neither trained nor encouraged to consider non-linearity a compelling issue.

Nonetheless, we do not see the work-a-day regression-runner as a dupe. Most of us understand that the linear regression model prevails in spite of suspicions that the world is not linear and additive. As we tell our students, linear regression prevails because it is simple — "parsimonious", easily estimated, and easily interpreted — and, with perhaps a hint of embarrassment, it seems plausible in the absence of any theoretical story suggesting a specific functional form. The truth of the matter is that there are seldom any prior expectations as to the functional form of E(y) = g(X). Moreover, it hardly seems persuasive to use the absence of theoretical arguments about non-linearity to buttress use of the linear regression model. "Linearity by default" seems an unduly narrow methodological practice in the face of weak or vague theoretical expectations, and given that alternative models can be readily implemented.

2 Generalized Additive Models

How then to take non-linearity seriously? Here we survey a regression-like model that directly confronts the possibility of non-linearity: generalized additive models (GAMs). These models and the ideas that underlie them have received considerable attention in the statistics literature, but are yet to percolate into the social sciences. Additive models re-cast the linear regression model

  y_i = α + ∑_{j=1}^{k} β_j X_{i,j} + ε_i    (1)

by modelling y as an additive combination of arbitrary univariate functions of the independent variables and a zero mean, independent and identically distributed stochastic disturbance:

  y_i = α + ∑_{j=1}^{k} g_j(X_{i,j}) + ε_i    (2)

where E(ε_i) = 0 and var(ε_i) = σ², i = 1, ..., n. Note that no distributional assumptions about the ε_i are necessary until the analyst moves from estimation to inference (hypothesis testing, constructing confidence intervals, etc.). Here we focus on two types of g_j(·): locally weighted regression smoothers and cubic smoothing splines. Several other classes of g_j have been considered in the literature, but our focus on these two smoothers does not unduly restrict the generality of our discussion.

Generalized additive models extend the framework in equation (2) in precisely the same way that generalized linear models (GLMs) (McCullagh and Nelder 1989) extend the linear regression model so as to accommodate qualitative dependent variables. A detailed description of the GLM framework is beyond our scope here; we note that the GLM framework provides a useful syntax and conceptual framework for modelling qualitative dependent variables. Further details as to how GAMs draw on the GLM framework appear in our applications, where we present GAMs for both continuous and qualitative dependent variables. In the Appendix we show how to estimate GAMs, with an emphasis on the implementation in S and S-PLUS, which in turn makes use of the GLM approach.

2.1 Interpreting GAMs

The absence of the regression parameters β_j in equation (2) reflects an important characteristic of GAMs.
One does not obtain a set of regression parameters from a GAM, but rather estimates of g_j(X_{i,j}) for every value of X_{i,j}, written ĝ_j(X_{i,j}). These estimates are obtained using a fitting algorithm known as backfitting, which we discuss in section 3.4. The coefficient on each g_j(X_j) is set to one by construction, and so it is the ĝ_j(X_j) that tell us about the relationship between X_j and the dependent variable.[1]

[1] The fitted g_j(X_j) each have zero mean by construction. Without this constraint the intercept in a GAM is unidentified; i.e., any of the fitted g_j(X_j) could be shifted by some arbitrary constant, accompanied by an offsetting shift in the intercept, and the resulting set of estimates would fit the data equally well.

In many applications we will assume that some of the predictors have linear effects, as in the standard regression setup. Thus we will typically estimate a semi-parametric model

  y_i = α + ∑_{l=1}^{m} Z_{i,l} β_l + ∑_{j=1}^{k} g_j(X_{i,j}) + ε_i.    (3)

The estimated non-parametric component of a GAM can be thought of as estimates of the functions transforming each X_j (j = 1, ..., k) so as to maximize the fit of their additive combination to the dependent variable, subject to constraints on the smoothness of the ĝ_j(X_j). The actual values of ĝ_j(X_j) are not substantively meaningful per se; what matters is the shape of the fitted functions. For this reason, graphical methods are used to interpret the non-parametric component of a GAM. A plot of X_j versus ĝ_j(X_j) reveals the nature of any estimated non-linearities in the relationship between X_j and the dependent variable, holding constant the other components in the model. Standard errors and confidence regions can be calculated and plotted about ĝ_j(X_j), providing a guide as to whether the fitted function is distinguishable from a linear fit, or increasing or decreasing in X_j (see section 3.3). While it may seem easier to examine tables of regression coefficients rather than plots of the ĝ_j, this ease is only obtained at the cost of (possibly) unwarranted, restrictive, and unnecessary assumptions of linearity. On the other hand, additivity ensures that the effects of each of a GAM's predictors can be interpreted net of the effects of the other predictors, just as in linear regression.
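To fix ideas, the following is a minimal sketch of how a semi-parametric model like equation (3) is specified in the S-style syntax of Hastie and Tibshirani's gam() function (built in to S-PLUS; provided in R by the gam package). The data frame and variable names here are hypothetical stand-ins, not any of the data analyzed below.

  library(gam)   # Hastie and Tibshirani's GAM implementation (R); built in to S-PLUS

  # y regressed on a linear term (z), a cubic smoothing spline in x1 (4 equivalent df),
  # and a loess smooth in x2 (span = 0.5):
  fit <- gam(y ~ z + s(x1, df = 4) + lo(x2, span = 0.5), data = dat)

  summary(fit)          # coefficients on linear terms, equivalent df, approximate tests
  plot(fit, se = TRUE)  # one panel per term: each fitted g_j with pointwise +/- 2 se bands

The plot() call produces exactly the kind of term-by-term graphical summaries just described.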
2.2 Contrasting other non-parametric approaches

Non-parametric statistics is an extremely large and active area of statistical research, providing many alternatives to the linear, additive regression model. Given this panoply of methods, our focus on GAMs warrants some justification.

GAMs seem to strike a sensible compromise between ease of interpretation and flexibility. Other alternatives to linear regression, like neural nets (White 1992; Ripley 1993), projection pursuit (Friedman and Stuetzle 1981), or alternating conditional expectations (Breiman and Friedman 1985; DeVeaux 1990), strike us as sacrificing ease of interpretation in order to better fit the data. For instance, projection pursuit attempts to find orthogonal projections through the space of the data that maximize model fit, but substantively interpreting the projections seems difficult. Alternating conditional expectations (ACE) finds optimal transformations of the X and y variables that maximize fit to the model, but again, the resulting fit strikes us as hard to interpret. It is often difficult to decide when these techniques are overfitting the data, a problem that is especially pressing for projection pursuit and neural networks, which can approximate any continuous function.[2]

[2] These methods are very powerful if we simply want good predictive power, as in pattern recognition, a problem seldom encountered in political science. Few political scientists are trying to decipher hand-written zip codes for the post office (e.g., Ripley 1993, 16).

These other non-parametric alternatives are also difficult to interpret in the natural units of any substantive research problem. For instance, the analogue of equation (2) for ACE is

  θ(y_i) = ∑_{j=1}^{k} φ_j(X_{i,j}) + ε_i,    (4)

while that for projection pursuit is

  y_i = ∑_{l=1}^{q} φ_l(a_l′ X_i) + ε_i.    (5)

Thus even if we had a good way to choose the φ_j, they would be interpretable only as the effect of an independent variable on a transformation of the dependent variable (for ACE), or as the effect of some linear combination of the independent variables on the dependent variable (for projection pursuit).[3] In addition, neither of these techniques allows for interactions among the independent variables. While the basic GAM framework is additive, we shall see in section 4.4 that GAMs provide a very flexible method for estimating models with pairwise interactive terms, far and away the most common type of interaction studied in political science.

[3] The ACE transformation of the dependent variable, θ(y_i), causes problems because the only restriction on it is that it be continuous. Thus there is no way to untransform, as there would be, say, for a simple logarithmic transformation; θ⁻¹ is generally not even defined! As a consequence it is very difficult to interpret estimates in natural units. Note that complicated non-linear transformations of the independent variables cause no interpretative problems; there is no need to invert these transformations.

GAMs should also be distinguished from semi-parametric methods which keep the assumption of linearity but allow the error term to be modelled non-parametrically (e.g., Western 1995). While this approach is valuable, we believe that "getting the mean function right" is more important than correct modelling of the error process. It is possible to combine the two approaches. In particular, it is easy to estimate GAMs using Huber's (1981) "M-estimates", which allow for platykurtic error distributions. Our focus here is on the mean function, and so we do not pursue M-estimators for GAMs further. We also distinguish two classes of flexible parametric approaches: Box-Cox transformations (Greene 1993, 329–34) and parametric splines (Greene 1993, 235–8). Neither of these has proven popular in political science, and neither is as flexible as the GAM.

As equation (2) shows, GAMs retain the additive property of the familiar regression model, and so can be interpreted "variable-by-variable", unlike projection pursuit or most neural nets. Transformations of variables are restricted to GAMs' right-hand side variables, again making substantive interpretation relatively easy. The results of a GAM, the ĝ_j(X_{i,j}), can be easily interpreted in the natural units of the research problem; plots of the independent variables against their smoothed fits to the dependent variable are the non-parametric counterpart of standard regression lines. GAMs speak to substantive research questions just as directly as everyday regression, but without imposing the assumption of linearity.
Furthermore, the specific transformations of right-hand side variables employed by GAMs have well-understood properties. These transformations are variable-specific, and the amount of smoothing is typically set in advance by the analyst, via calibration parameters that have straightforward interpretations (e.g., the "equivalent degrees of freedom" consumed by each variable-specific smoother; see below). And because GAMs are additive, the familiar linear regression model remains a convenient benchmark against which to assess how various GAMs fare on the tradeoff between predictive power ("getting the mean right") and substantive meaning. The methodology for assessing this tradeoff in the GAM context is for all practical purposes identical to that used to compare different linear regression models (see section 3.3).

GAMs seem to us the most convenient way of buying into what non-parametric approaches have to offer without making interpretation too difficult. With linear regression at one extreme and neural networks at the other, GAMs lie much closer to the former than the latter. We concede that there are research settings where neural nets or ACE or projection pursuit are to be preferred; but our belief at this point is that such situations are rare in political science. In short, GAMs provide much of the flexibility of non-parametric methods while allowing the analyst to keep a close eye on what the statistical procedures are actually doing to the data.

3 Scatterplot Smoothing

Like most non-parametric techniques, the statistical theory and accompanying mathematics for GAMs are complex, and we refer readers to the discussion in Hastie and Tibshirani (1990, ch. 5). However, most of the key intuitions about GAMs follow from ideas having to do with bivariate scatterplot smoothing.

Smoothing is an important tool for non-parametric regression, addressing one of the simplest yet most fundamental questions in data analysis: "what is our best guess of y, given x?" In the most general setting, a smoother yields a series of data reductions or summaries of y, each specific to (possibly overlapping) regions of the space of the predictors. The moving average or moving median is a frequently encountered smoother, summarizing a single variable with means or medians specific to overlapping time windows. The resulting summary series exhibits less variability than the raw series, again highlighting that smoothers are tools for data reduction. GAMs make use of smoothing, but subject to the constraint that the smoothed predictors combine additively in modelling y (see the discussion in section 2.2).

To define scatterplot smoothing, let x = (x_1, ..., x_n)′ be the observations on an independent variable and let y = (y_1, ..., y_n)′ be the observations on a dependent variable. Assume that the data are sorted by x. A scatterplot smoother takes x and y and returns a function ĝ(x) = ŷ.

The amount of smoothing in the moving-average time-series example is controlled by the number of time periods that are averaged or, more generally, by the size of the neighborhood or bandwidth over which the averaging is done. The larger this neighborhood, the more smoothing, and the more data reduction. For instance, consider taking the entire sample as the neighborhood. A moving average in this case is simply the sample mean of the observations, reducing the entire series to a constant.
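A minimal sketch (our own illustration, in S-style code with made-up data) makes the link between bandwidth and data reduction concrete:

  # Running-mean smoother: average the y values whose x values fall within a
  # symmetric window of half-width h around each target point.
  running.mean <- function(x, y, h) {
    sapply(x, function(x0) mean(y[abs(x - x0) <= h]))
  }

  set.seed(1)
  x <- sort(runif(100))
  y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

  g.narrow <- running.mean(x, y, h = 0.05)  # narrow bandwidth: jagged, faithful to local structure
  g.wide   <- running.mean(x, y, h = 0.50)  # wide bandwidth: smoother, more data reduction
  # With h large enough to span the whole sample, every fitted value is simply mean(y).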
Ordinary least squares (OLS) regression is a special case of a smoother, providing an infinite amount of smoothing.[4] This is because in OLS regression the relationship between y and x is taken to be global, not local, and hence the smoothing neighborhood is the entire dataset; the same relationship between x and y holds over all plausible values of x. As we show below, most scatterplot smoothers yield estimated g functions that approach a linear regression line as the amount of smoothing increases. At the other extreme, as the bandwidth gets smaller, the amount of smoothing decreases, and the estimated g function degrades to a function that simply interpolates the data, resulting in no data reduction at all.

[4] Smoothness here is defined as the inverse of the second derivative of the fitted function. Since linear regression fits straight lines, and straight lines have second derivatives that are zero, linear regression is considered to be infinitely smooth.

Our uses of GAMs rely extensively on two popular smoothers: loess (Cleveland and Devlin 1988), a local regression smoother, and cubic smoothing splines. A large body of statistical research establishes that these smoothers have many desirable properties. We summarize some of this research below, but refer interested readers to the reviews in Hastie and Tibshirani (1990), Goodall (1990), and Hastie and Loader (1993).

3.1 Smoothing by local regression (loess)

Given a target point x_0, loess yields ŷ|x_0 = ĝ(x_0) as follows (Cleveland 1993, 91–101):

1. Identify the k nearest neighbours of x_0, i.e., the k elements of x closest to x_0. This set is denoted N(x_0). The analyst controls k via a "span" argument, which defines the size of the neighborhood in terms of a proportion of the sample size: i.e., k = span × n.

2. Calculate Δ(x_0) = max_{N(x_0)} |x_0 − x_i|, the distance of the near-neighbor most distant from x_0.

3. Calculate weights w_i for each point in N(x_0), using the following tri-cube weight function:

     w_i = W( |x_0 − x_i| / Δ(x_0) )    (6)

   where

     W(z) = (1 − z³)³  for 0 ≤ z < 1;  W(z) = 0 otherwise.    (7)

   Note that W(x_0) = 1, and that the weights decline smoothly to zero over the set of nearest-neighbors, such that W(x_i) = 0 for all non near-neighbors. The use of the tri-cube weight function here is somewhat arbitrary; any weight function that has smooth contact with 0 on the boundary of N(x_0) will produce a smooth fitted function (Cleveland, Devlin and Grosse 1988, 112).

4. Regress y on x and a constant (for local linear fitting), using weighted least squares (WLS) with weights w_i as defined in the previous step. Quadratic or cubic polynomial regressions can also be used in these local regressions (Fan and Gijbels 1996), or even mixtures of low-order polynomials (Cleveland and Loader 1996).

5. The smoothed value ĝ(x_0) is the predicted value from the WLS fit at x_0.
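These steps translate almost line-for-line into code. The sketch below is our own illustrative implementation of the local linear fit at a single target point (it is not Cleveland's production algorithm, and assumes distinct x values):

  # Local linear regression at a target point x0, following steps 1-5 above.
  loess.at <- function(x0, xs, ys, span = 0.5) {
    k     <- ceiling(span * length(xs))          # step 1: size of the near-neighborhood
    d     <- abs(xs - x0)
    nbrs  <- order(d)[1:k]                       # indices of the k nearest neighbors
    delta <- max(d[nbrs])                        # step 2: distance to the furthest neighbor
    w     <- numeric(length(xs))                 # step 3: tri-cube weights, equations (6)-(7)
    w[nbrs] <- (1 - (d[nbrs] / delta)^3)^3
    fit   <- lm(ys ~ xs, weights = w)            # step 4: local weighted least squares
    predict(fit, newdata = data.frame(xs = x0))  # step 5: fitted value at x0
  }

  # The full smooth is obtained by sweeping the target point across the data:
  # g.hat <- sapply(x, loess.at, xs = x, ys = y, span = 0.5)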
Local regression can also be applied beyond the two-dimensional setting encountered in scatterplot smoothing. The smoothing algorithm presented in equations (6) and (7) generalizes readily to the case where y is modelled as a function of two or more explanatory variables. Consider applying loess to the model

  y_i = g(x_i, z_i) + ε_i.    (8)

A target point in the two-dimensional space of the predictors can be denoted (x_0, z_0), about which we can define a set of nearest-neighbors, again defined as a pre-specified proportion of the data. The Euclidean distances of near-neighbors from the target point are passed to the tri-cube kernel in equation (7), producing a set of weights for the (local linear) regression of y on x, z, and a constant. This fitted regression is then used to generate ŷ|(x_0, z_0) = ĝ(x_0, z_0). This is a very useful feature of loess, which we exploit in one of the applications below. In particular, the additive predictor g(x, z) in a GAM provides a way of capturing interaction effects non-parametrically, rather than simply having x and z condition the linear effects of each variable as in least squares regression.

Local regression is an excellent paradigm for smoothing. The use of nearest-neighborhoods lets the smoother adapt to local fluctuations in the density of x, while not necessarily sacrificing smoothness. In this sense loess can be understood as a tri-cube kernel scatterplot smoother, adapting to local fluctuations in the density of x. The combination of three features — nearest neighbors, a smooth weight function (the tri-cube kernel), and forming ŷ|x_0 via locally weighted regressions — helps local regression outperform many other scatterplot smoothers (e.g., moving averages, overlapping regressions, and so on).[5]

[5] Useful summaries of these properties and other strengths of local regression appear in Hastie and Loader (1993) and Cleveland and Loader (1996). According to the latter paper, local regression was first used as early as 1829 by Danish actuaries, but the first published reference appears to be Woolhouse (1870).

In particular, the adaptive features of local regression stand in marked contrast with smoothing by fixed kernels, which until recently were considered a good basis for smoothing (Hastie and Loader 1993). Fixed kernel smoothers summarize or interpolate data within a symmetric neighborhood of a target point x_0, with the data about the target point weighted according to a probability density function (e.g., the normal probability density function, with the target point as the mean and variance proportional to the bandwidth), ignoring the observed density of the data in that region.[6] Accordingly, fixed kernels potentially give biased estimates of E(y|x_0), especially around the endpoints of the observed data (Hastie and Tibshirani 1990, 31). In section 3.3 we show how local regression attains greater asymptotic efficiency than fixed kernel smoothers.

[6] Of course, if the realizations of x are equally spaced, then the adaptive features of local regression are irrelevant. In this special case fixed-kernel smoothing and nearest-neighbor approaches coincide (at least in the interior of the data), but such a fortuitous distribution of x values rarely occurs in practice. See Fan and Marron (1993, 129–130).

3.2 Cubic smoothing splines

Cubic smoothing splines are another popular choice for scatterplot smoothing and fitting GAMs. This smoother arises as the solution to the following optimization problem: among all functions g(x) with continuous first and second derivatives, find the one that minimizes the penalized residual sum of squares

  PRSS = ∑_{i=1}^{N} [y_i − g(x_i)]² + λ ∫_a^b [g″(t)]² dt = RSS + roughness penalty,    (9)

where λ is a fixed constant, and a ≤ x_1 ≤ ... ≤ x_N ≤ b (Hastie and Tibshirani 1990, 27). The first part of this expression is the familiar residual sum of squares (RSS), measuring fidelity to the data. The second term is weighted by the constant λ, and penalizes "roughness" or "flamboyancy" (Good and Gaskins 1971) in the target function. In this way λ is analogous to the span parameter in loess; i.e., higher values of λ result in smoother fits. As λ → ∞, the roughness penalty dominates the criterion in (9), forcing g″(x) → 0 and yielding the least squares regression line as the solution function (since the second derivative of a straight line is zero).
Conversely, as λ → 0, the roughness penalty disappears, allowing the RSS component of equation (9) to dominate; the resulting g(x) tends to a function that simply interpolates the data.

The g(x) that minimizes the criterion in (9) is a piecewise cubic polynomial, known as a cubic spline function, with knots (join points) at each of the interior data points (Schoenberg 1964; Reinsch 1967):[7] i.e.,

  s(x_0) = β_0 + β_1 x_0 + β_2 x_0² + β_3 x_0³ + ∑_{i=2}^{n−1} θ_i (x_0 − x_i)³₊    (10)

where the subscript on the last term denotes the positive part of x_0 − x_i.

[7] Splines are a widely used tool for smoothing and interpolating data and have non-statistical applications in settings such as architecture and engineering. de Boor (1978) is a useful general introduction to a large literature on splines. Wegman and Wright (1983) provide a survey of statistical applications, and Eubank (1988) and Green and Silverman (1994) are book-length treatments of non-parametric regression based on splines. Cubic smoothing splines should not be confused with parametric natural splines, where the analyst chooses the number and placement of knots. In the absence of guidance as to where to place the knots, most analysts locate a small number of knots uniformly over the data. Unlike smoothing splines, there seems to be no simple way to trade off parsimony and smoothness with natural splines.

It is well known that a polynomial of order n − 1 will exactly fit n data points (e.g., Lancaster and Šalkauskas 1986, 30), but the resulting function is almost guaranteed not to be smooth, nor to resemble any social or political process. Smoother functions result from fitting successively lower-order polynomials, with the infinitely smooth least squares regression line resulting when the fitted polynomial is of order one. But experimenting with the order of a polynomial fit to the entire data is not an especially elegant or efficient approach to smoothing. If the data exhibit localized fluctuations in the mean function, then a reasonably high-order polynomial may be required to do a fair job of fitting the data. This approach can consume degrees of freedom rapidly, and while the fitted polynomial may capture some local fluctuations in the mean well (and even relatively smoothly), it can fit quite poorly elsewhere. In short, using polynomials to smooth data is a global solution to a local problem.

Smoothing splines like that in equation (10) differ from globally-fitted polynomials in that the piecewise terms let the fitted function respond to local changes in the mean of the data. The combination of the global cubic polynomial with the piecewise cubic terms also ensures that the fitted spline function has a continuous second derivative at the knots. In part, the previous sentence merely restates that functions of the form in equation (10) solve the optimization problem in equation (9). But when we graphically inspect the resulting spline function, the continuity constraint on the second derivatives means that the fitted function also looks smoother. This seems to be because "our eyes are skilled at picking up second and lower order discontinuities, but not higher" (Hastie and Tibshirani 1990, 22–23).[8]
[8] Hastie and Tibshirani (1990, Figure 2.6) display a series of spline functions demonstrating this point. Spline functions with discontinuities in the fitted function and its first two derivatives simply "don't look smooth".

Not only do cubic smoothing splines provide a way of formally trading off smoothness ("parsimony") for fidelity to the mean function, but the operationalization of smoothness here jells with how we tend to perceive smoothness graphically.

It first appears that equation (10) cannot be estimated, since there are n − 2 parameters on the piecewise cubics (the θ_i), plus four parameters for the global cubic polynomial, yielding more parameters than data points. But the constraint imposed by the smoothing parameter λ in the roughness penalty term drastically reduces the "effective number of parameters". A more detailed discussion appears in Hastie and Tibshirani (1990, 27–29), but the underlying idea is that the pre-specified value of λ governs how sensitive the fitted function is to the piecewise components versus the parameters fitted globally. As the smoothness constraint becomes more binding, the estimates of the θ_i become highly inter-dependent and are driven towards zero. As λ gets larger still, all but two parameters in equation (10) are driven to zero, the exceptions being β_0 (the constant) and β_1 (the globally-fitted parameter on the linear term); i.e., least squares results when an infinitely smooth cubic spline is fit to the data.

While smoothing splines attack the scatterplot smoothing problem differently from local regression, they nonetheless work the same way as any other smoother, producing a more or less smooth fit to a scatterplot by controlling the sensitivity of the fitted function to local fluctuations in the mean of the data. And in fact, it is hard to know when local regression or cubic splines should be preferred to one another: "The smoothing spline and local regression methods appeal to rather different intuitive interpretations, and it is unlikely that either will have universally dominant performance" (Hastie and Loader 1993, 123). Because loess gives no weight to observations outside the set of nearest-neighbors in forming a local estimate of E(y), it is more robust against outlying values on X than smoothing splines (Cleveland 1979).

Our experience is that smoothing splines produce fewer small non-monotonicities than local regression. Thus where we are reasonably confident that local monotonicity in the mean function is correct, we use smoothing splines. For example, in our analysis of the effect of the Congressional overdraft scandal (section 4.2), we would not expect small increases in the number of overdrafts to help an incumbent, and so we use smoothing splines rather than local regression; the latter produces some small bumps that make no theoretical sense. But we also note that the general shape of the fits produced by local regression and smoothing splines is quite similar; we would worry about the use of either smoother if they revealed qualitatively different results.
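This kind of comparison is easy to carry out in practice. A minimal S-style sketch (ours; x and y stand for whatever scatterplot is at hand) overlays the two smoothers:

  plot(x, y)
  lines(lowess(x, y, f = 2/3), lty = 1)     # local regression fit (Cleveland's lowess)
  fit.ss <- smooth.spline(x, y, df = 4)     # cubic smoothing spline with 4 equivalent df
  lines(predict(fit.ss, sort(x)), lty = 2)
  # If the two curves suggest qualitatively different conclusions,
  # neither should be trusted without further investigation.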
In sum, the differences in the ways local regression and cubic smoothing splines operate are generally overwhelmed by choices as to how much smoothing to do with a given brand of smoother; put differently, "within smoother" variation seems to dominate "across smoother" variation.[9]

[9] One caveat is that local regression generalizes to higher-dimensional settings more readily than spline functions. For examples of fitting splines in two or more dimensions, see Green and Silverman (1994, ch. 7) or Wahba (1990, section 2.4).

3.3 Inference

Linear Smoothers

Local regression and cubic smoothing splines are both linear smoothers, with relatively well-understood statistical properties. Scatterplot smoothers proceed by averaging values of y about some target point x_0; in particular, a linear smoother estimates g(x_0) by

  ĝ(x_0) = ∑_{j=1}^{N} S_λ(x_0, x_j) y_j    (11)

where x_j and y_j are elements of the data vectors x and y, respectively, x_0 is the target point, and S_λ is a weight function parameterized by the smoothing parameter λ. Letting the smoother produce ĝ(x_i) for each i yields S, an n by n smoother matrix. Each row of S contains the weights used in generating the smoothed fit at each data point; in other words, the jth row of S is the vector of weights each observation picks up in contributing to the fitted smooth at x_j, sometimes referred to as the equivalent kernel. Using this notation a linear smoother can be written more compactly as

  ĝ = S_λ y,    (12)

where ĝ is the n by 1 vector of estimates of E(y|x) = g(x), conditional on the pre-specified smoothing parameter λ (Hastie and Tibshirani 1990, 44). Linear smoothers are thus linear in precisely the same way that the predicted values from a least squares regression are a linear function of the dependent variable, conditional on the data: i.e., for least squares linear regression, ŷ = Xβ̂ = Hy, where H = X(X′X)⁻¹X′ is the well-known "hat matrix", transforming the vector of observed responses y into the fitted responses ŷ (Weisberg 1985, 109–111). The smoother matrix S has many of the same properties as the hat matrix. In fact, H is a special case of a smoother matrix, recalling that least squares regression can be thought of as an infinitely smooth scatterplot smoother.

The representation of linear smoothers in equation (12) is extremely convenient. As the bandwidth of a smoother gets wider, off-diagonal elements of S get larger, as data points further away from the target point x_0 have greater influence in forming ĝ(x_0). At the other extreme, in the limiting case of a smoother that simply interpolates the data, S is a diagonal matrix (specifically, an identity matrix).

Equivalent degrees of freedom

This inverse relationship between the amount of smoothing and the extent to which S is diagonal gives rise to a very useful approximation for linear smoothers. It can be shown that the degrees of freedom consumed by a linear smoother S are well approximated by the trace (sum of the diagonal elements) of S (Hastie and Tibshirani 1990, 52–55 and Appendix 2). Recall that for least squares linear regression, the trace of the hat matrix is equal to the degrees of freedom consumed by the regression's predictors (including an intercept, if present). This quantity, tr(S), is referred to as the equivalent degrees of freedom in the literature on smoothing and GAMs. More precisely, tr(S) − 1 is an estimate of the degrees of freedom consumed by the non-parametric component of the smooth fit, since all scatterplot smoothers fit an intercept term.
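Because a linear smoother's fit is just Sy, the smoother matrix can be recovered column-by-column by smoothing the unit vectors, holding x and the smoothing parameter fixed; its trace then gives the equivalent degrees of freedom and, once an estimate of the error variance is in hand, the same matrix yields pointwise standard errors (see the Standard Errors discussion below). A sketch of this calculation, ours, using R/S-PLUS's smooth.spline() and assuming distinct x values:

  # Recover the smoother matrix S by smoothing each unit vector e_j,
  # then approximate the degrees of freedom consumed by the smoother with tr(S).
  n <- length(x)
  S <- matrix(0, n, n)
  for (j in 1:n) {
    e <- numeric(n); e[j] <- 1
    S[, j] <- predict(smooth.spline(x, e, spar = 0.8), x)$y   # j-th column of S
  }
  edf <- sum(diag(S))                                         # equivalent degrees of freedom, tr(S)

  g.hat  <- predict(smooth.spline(x, y, spar = 0.8), x)$y     # smoothed values, S y
  sigma2 <- sum((y - g.hat)^2) / (n - edf)                    # rough estimate of the error variance
  se     <- sqrt(diag(S %*% t(S)) * sigma2)                   # pointwise standard errors

The equivalent degrees of freedom should match (up to numerical differences) the df component reported by smooth.spline() itself.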
Figure 1 displays how the equivalent degrees of freedom consumed by a scatterplot smoother increase as the amount of smoothing decreases. The exact nature of this tradeoff is data-dependent, but the relationship depicted in Figure 1 is quite typical. Jagged fits consume more degrees of freedom; conversely, least squares is an infinitely smooth fit to the data.

Figure 1: Equivalent degrees of freedom, local linear regression (loess) and cubic smoothing splines. The amount of smoothing performed by loess is controlled by the span parameter (left panel), while λ weights the roughness penalty for cubic smoothing splines (right panel). Smoothness increases as both smoothers tend towards the least-squares regression fit (df = 2).

Standard Errors

The representation of the smoothing problem in (12) also allows a simple characterization of the variance-covariance matrix of the fitted smoothed values. Given the error variance σ²,

  cov(ĝ) = S S′ σ².    (13)

With an estimate of σ² appropriate to a particular application, pointwise standard errors are equal to the square roots of the diagonal elements of S S′ σ̂². With a distributional assumption (e.g., normal errors), confidence intervals can be constructed about the fitted smooth using the pointwise standard errors. Examples appear in Figure 4 and in the applications below.

Pointwise versus Global Confidence Intervals

It should be noted that pointwise standard errors summarize uncertainty about the location of a single fitted point, say ŷ_i|x_i. Uncertainty about the entire fitted function is harder to characterize, since it simultaneously involves the variances of each of the ŷ_i and their covariances (the off-diagonal elements of S S′ σ², which are non-zero for any non-degenerate smoother). The upshot is that pointwise confidence intervals do not necessarily capture all of the uncertainty as to where the smoothed fit actually lies.

About the only feasible method for characterizing the distribution of the entire fitted function ĝ(x) is a computationally intensive re-sampling scheme; i.e., under the assumption of normal errors, repeatedly sample the vector g from an n-dimensional normal distribution with mean vector ŷ = ĝ(x) and variance-covariance matrix S S′ σ̂², and plot each sampled function. Since the resulting scatterplot quickly gets cluttered, one approach is to display only "extreme" or "interesting" sampled functions. To construct a "confidence set" for the fitted function, one can reject sampled functions that exceed a critical value on some performance criterion (e.g., Hastie and Tibshirani 1990, 62–4). But imagine plotting the "shadow" or "projection" of the fitted function's n-dimensional 95% confidence set onto a two-dimensional scatterplot; it can be shown that the bounds of this shadow need not be wider than the bounds on the corresponding pointwise confidence interval (Hastie and Tibshirani 1990, 62). Our experience echoes this result. We find that the pointwise confidence intervals provide a very good approximation to the shape of the confidence set of the fitted function, and at far less cost. The shaded confidence intervals we display around fitted smooths are pointwise confidence intervals.

Specification Tests

Specification tests for scatterplot smoothers and GAMs can be performed under the same conditions as for linear regression or GLMs (i.e., large samples or normal errors). In particular, for a GAM with a continuous dependent variable, the sum of squared residuals can be used in a standard F test. The statistic for this test is computed identically to the usual linear regression F statistic. While the distribution of this statistic is only approximately F, simulation results indicate that the approximation is reasonable (Hastie and Tibshirani 1990, 67). In the Appendix we report new simulation results which indicate that the F approximation is generally good, as well as factors which appear to make the approximation better or worse. For models with a binary dependent variable we can compute the analogue of the log-likelihood. This is, as usual, the sum of the logs of the estimated probabilities of success (for the observed successes) and failure (for the observed failures). For both tests, statistics are computed using the approximate degrees of freedom consumed by the smoother. A common use of either test is to examine whether a specification with a smooth term is superior to one with a corresponding linear term; since the latter model is nested within the former, the appropriate F or χ² test can be used here.

Asymptotic Properties

Asymptotic results for smoothers find the nearest-neighbor approach underlying local regression to be superior to the fixed-kernel approaches described in section 3.1 (Fan 1992; Fan 1993). In particular, local regression has greater asymptotic efficiency than fixed-kernel approaches. Precise characterizations of the asymptotic properties of non-parametric regression estimators are difficult to summarize concisely; here we present a useful intuition. A kernel-based local linear regression estimator evaluated with n observations has asymptotic mean-square error

  AMSE[ĝ_n(x_0)] = [asy bias]² + asy variance ≈ (1/4) c_1² h_n⁴ [g″(x_0)]² + c_2 / (n h_n f(x_0)),    (14)

where h_n is the bandwidth of the kernel, f(x_0) is the density of x at the target point x_0, and c_1 and c_2 are constants, fixed for a given kernel (Fan 1993, 199). Part of why nearest-neighbor methods outperform fixed-kernel estimators is due to their adaptive bandwidth feature: when f(x_0) decreases (i.e., the data are sparse in the vicinity of the target point), the bandwidth increases, effectively keeping asymptotic variance low.[10] Equation (14) also makes clear that smoothers perform better if the underlying target function is relatively smooth to start with, where "smoothness" here is again defined with respect to the second derivative of the target function.[11] In the limiting case of a linear target function, local linear regression results in no bias whatsoever.

[10] In this sense, the notation in equation (14) is slightly misleading, implying that h_n is fixed across x_0. A more precise notation might be h_n(x_0), indicating that bandwidth may be conditional on the particular target point.

[11] As is standard in the statistical literature, the bias component of the mean-square error is approximated with a Taylor-series expansion out to the second derivative of the target function (hence the approximation symbol in equation (14)); i.e., the expression for the bias of the non-parametric estimator is itself an estimate. A higher-order expansion could be used, but the second-order expansion is sufficient to capture the key intuition and appears standard in the literature.

Local linear fitting will yield a biased estimate of the target function if the target function has non-zero second derivatives, and in general, local regression methods will be biased if the local regression is not of the same order as the target function. Note that smoothing by local constant fitting (e.g., kernel-based averaging, including the moving average as a special case) is akin to the researcher making the rather strong assumption that the target function is constant. Local polynomials can be used when the target function is suspected to be non-linear; mixing low-order polynomials is a current area of research (Cleveland and Loader 1996).

Another important asymptotic result for our purposes is that the rate of convergence for a p-dimensional GAM is no slower than that for scatterplot smoothing (Stone 1982); accordingly, increasing the number of predictors in a GAM does not result in slower rates of convergence, as it does for other non-additive non-parametric regression models (Hastie and Tibshirani 1990, 131–132).

This said, there is a reasonably strong sense of skepticism about the asymptotic results. Despite good asymptotic properties for local regression, biases at the boundaries of the X variables can be large and dominate the performance of the estimator in finite samples. For this reason, some statisticians downplay the usefulness of asymptotic properties of smoothers, preferring to focus on the good finite sample properties of local regression (e.g., Cleveland and Loader 1996) and cubic smoothing splines.

3.4 Estimating GAMs

GAMs are usually estimated using a backfitting algorithm, a computationally efficient approach to estimating the large linear systems often implied by non-parametric multivariate regression models. Backfitting is used in other non-parametric settings (e.g., projection pursuit and ACE) and excellent descriptions of the algorithm appear in the statistics literature (e.g., Breiman and Friedman 1985). Here we simply sketch how the algorithm works in the context of GAMs; a fuller description appears in Hastie and Tibshirani (1990, ch. 5). Given the model

  E(y | X_1, X_2) = g_1(X_1) + g_2(X_2),    (15)

backfitting alternates between estimating g_1 and g_2. With an initial ĝ_1, an estimate of g_2 can be obtained by smoothing the partial residual y − ĝ_1(X_1) on X_2, and an update of ĝ_1 can be obtained by smoothing y − ĝ_2(X_2) on X_1. This process continues until all of the estimated smooth functions cease changing. More formally, estimating a GAM with p predictors involves solving the system

  [ I    S_1  ...  S_1 ] [ g_1 ]   [ S_1 y ]
  [ S_2  I    ...  S_2 ] [ g_2 ] = [ S_2 y ]    (16)
  [ ...  ...  ...  ... ] [ ... ]   [  ...  ]
  [ S_p  S_p  ...  I   ] [ g_p ]   [ S_p y ]

for the vector (g_1, ..., g_p)′. This system can be written more compactly as

  P g = Q y,    (17)

which is the GAM analog of the least squares normal equations X′X β = X′y. Continuing the analogy with least squares suggests inverting P and pre-multiplying both sides of equation (17) by P⁻¹ to yield ĝ. This is seldom an efficient computing strategy, since P is an np by np matrix, and inverting this matrix consumes computing resources on the order of (np)³ (Hastie and Tibshirani 1990, 109). Backfitting exploits the special structure of P and its component matrices, reducing the computational burden tremendously.
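To make the logic concrete, here is a stripped-down backfitting sketch of ours for the two-term model in equation (15), using R's lowess() as the scatterplot smoother; production implementations (e.g., the gam() function in S-PLUS) are considerably more refined.

  # Scatterplot smooth returning fitted values in the original observation order.
  smooth.fit <- function(x, y) {
    lw <- lowess(x, y, f = 2/3)
    approx(lw$x, lw$y, xout = x)$y
  }

  # Backfitting for y = alpha + g1(x1) + g2(x2) + error.
  backfit <- function(x1, x2, y, tol = 1e-6, maxit = 50) {
    alpha <- mean(y)
    g1 <- g2 <- rep(0, length(y))
    for (it in 1:maxit) {
      g1.old <- g1; g2.old <- g2
      g1 <- smooth.fit(x1, y - alpha - g2); g1 <- g1 - mean(g1)  # centre each g_j: identification
      g2 <- smooth.fit(x2, y - alpha - g1); g2 <- g2 - mean(g2)
      if (max(abs(g1 - g1.old), abs(g2 - g2.old)) < tol) break   # stop when the g's stop changing
    }
    list(alpha = alpha, g1 = g1, g2 = g2)
  }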
3.5 Example — Pennsylvania State Senate data

We motivate these ideas using a small data set, presented graphically in Figure 2. On the horizontal axis is the Democratic plurality recorded using voting machines at some recent elections for the Pennsylvania State Senate. On the vertical axis is the Democratic plurality recorded among absentee ballots. These data were collected to help decide a court case after a large Democratic plurality among absentee ballots decided the outcome of a special election and, in turn, partisan control of the Pennsylvania State Senate. We stress that our interest in these data is purely expository, and so the contested special election is not shown in Figure 2; Bartels (1996) re-analyses these data and draws conclusions about the substantive issue of electoral fraud.

Figure 2: Pennsylvania State Senate data (Democratic margin on voting machines versus Democratic margin among absentee ballots). The straight line is an OLS fit, and the shaded area is its 95% confidence interval, assuming independent and identically distributed (iid) normal errors.

Figure 3 shows what is going on "behind the scenes" for three fitting techniques: least squares linear regression, the local regression smoother, and cubic smoothing splines. Each panel plots the so-called equivalent kernels for the three different fitting methods at various target points in the Pennsylvania State Senate data; the equivalent kernels are simply the weights assigned to the data points in forming ŷ|x_0 = ĝ(x_0), where x_0 is the target point. As Figure 3 makes clear, the equivalent kernels depend not just on the type of smoother being used, but also on the amount of smoothing and the location of the target point. This "built-in" adaptivity gives these smoothers good properties (low bias) in sparse regions of x, and in particular at the boundaries of x.

The equivalent kernels for local regression are plotted in the two middle panels of Figure 3. Two different spans are plotted (1/4 and 3/4), meaning that 25% and 75% of the data count as near-neighbors, respectively. Observations outside this neighborhood receive zero weight. Within the neighborhood, the tri-cube kernel, defined in equation (7), largely determines how the weights are formed. Smaller spans (bandwidths) result in tighter neighborhoods, down-weighting the influence of more distant observations in forming the local data summary, relative to the wider bandwidth smoother.

Figure 3: "Equivalent kernels" for least squares linear regression (panels a and b), local linear regression (loess, panels c and d, spans 1/4 and 3/4) and cubic smoothing splines (panels e and f, df = 10 and df = 4), using two target points (observations 9 and 20) from the Pennsylvania State Senate data.
Three different scatterplot smoothers are fit to the data presented in Figure 2 and the equivalent kernels for observations 9 and 20 are graphed here; each equivalent kernel is the vector of weights the observations pick up in contributing to the fit at the specified target point. Each equivalent kernel sums to one, though individual weights are negative in some cases. Note that for least squares linear regression, every observation makes a contribution to ŷ|x_0; contrast scatterplot smoothers, where the resulting data summaries are specific to neighborhoods of x_0.

The bottom two panels of Figure 3 show the equivalent kernels for two different cubic smoothing splines fit to the Pennsylvania State Senate data. The smoother consuming ten degrees of freedom forms tighter neighborhoods around each target point than does the smoother consuming four degrees of freedom. Remember that cubic smoothing splines emerge as the solution to a constrained optimization problem, and are not set up with a probability density function or weight function as the smoother's underlying kernel. Nonetheless, like all smoothers, smoothing splines wind up differentially weighting observations around the target point, in the manner shown in Figure 3.

For least squares regression, the equivalent kernel for the jth data point is simply the jth row of the hat matrix defined in section 3.3. The equivalent kernels for linear regression (top panels) highlight that least squares is a "global" fitting technique; all observations contribute to the least squares estimate of a specific ŷ|x_0, no matter how distant from the target point. The result of exploiting all the sample information is familiar: least squares has minimum variance among the class of linear, unbiased estimators. But it is precisely this question of bias that is at issue; if least squares linear regression isn't "getting the mean right" then we might question the value of its optimality properties. Under an easily-imagined set of circumstances, we might prefer "getting the mean (more) right" to the low variance of least squares. Scatterplot smoothers and their multivariate generalization, GAMs, offer a way for analysts to manage the tradeoff between bias and variance, an issue we pursue below.

3.6 Bandwidth selection and the bias-variance tradeoff

Consider smoothing by moving averages. More temporal aggregation yields a more precise data summary (since it is drawn from a greater number of observations), reflecting the wider bandwidth of the smoother. Less temporal aggregation — a narrower bandwidth — yields a less smooth data summary, but one that is more faithful to local fluctuations in the mean of the data. But since smoothers with narrower bandwidths yield local data summaries based on fewer observations, they are inherently less precise than smoothers with wider bandwidths. In general, a wider bandwidth risks bias while shrinking variability; a narrower bandwidth purchases less bias at the cost of precision.

All smoothers wrestle with this tradeoff, via a smoothing parameter that in turn governs bandwidth. Bandwidths are sometimes estimated or derived from the data, but usually are chosen in advance and then fine-tuned by the user on a case-by-case basis. For local regression, the span parameter governs the size of the neighborhood in which data points receive non-zero weights in the local regression. For smoothing splines, the parameter λ in (9) effectively governs the amount of smoothing.
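In practice, varying the bandwidth is just a change to one argument. A fragment of ours, using R's loess() on an arbitrary scatterplot (x, y):

  fit.rough  <- loess(y ~ x, span = 0.25, degree = 1)   # narrow bandwidth: follows local fluctuations
  fit.smooth <- loess(y ~ x, span = 0.75, degree = 1)   # wide bandwidth: smoother, more precise
  p.rough    <- predict(fit.rough, se = TRUE)           # pointwise fits and standard errors
  p.smooth   <- predict(fit.smooth, se = TRUE)
  # The rough fit has larger pointwise standard errors; the smooth fit risks bias.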
Figure 4 depicts the bias-variance tradeoff, for both local regression and cubic smoothing splines, using the Pennsylvania State Senate data first presented in Figure 2. The top left panel of Figure 4 uses a relatively small span (1/4), making the fitted function more sensitive to local changes in the mean function, but with lessened precision (wider confidence intervals). In the lower left panel, a wider bandwidth (span = 3/4) lessens sensitivity to local changes in the mean function (risking bias), though it produces a more precise fit (smaller confidence intervals). The right-hand panels repeat this comparison for two cubic smoothing splines fit to the same data, with equivalent degrees of freedom set to 4 and 10.

Figure 4: Bias-variance tradeoff, local linear regression (loess) and cubic smoothing splines, fit to the Pennsylvania State Senate data. The solid lines are the fitted smooth functions, and the shaded regions indicate 95% pointwise confidence regions, assuming iid normal errors. For comparison, the dotted lines mark the 95% confidence interval from the linear regression fit (Figure 2). The top panels show how smaller bandwidths (span = 1/4 and df = 10, respectively) increase sensitivity to local changes in the mean function, at the expense of precision. Note the wider confidence intervals in the top panels, compared to the smoother fitted functions in the bottom panels (span = 3/4, df = 4).

It is also illustrative to compare the confidence intervals on these smoothed fits with those obtained from a linear regression fit to the same data (Figure 2). The linear regression confidence intervals are narrower than those obtained with the more "flamboyant" non-parametric fits (though only slightly narrower than the confidence intervals in the bottom panels of Figure 4). Moreover, the linear regression confidence intervals lie far away from several data points. All this is to say that linear regression is not immune from the bias-variance tradeoff. The graphical comparisons in Figure 4 reveal that users of linear regression (implicitly) obtain low variance at the expense of possibly high bias.

How to choose these smoothing parameters? Based on our experience in applying smoothers in social science contexts, this seems a matter of judgement best decided on a case-by-case basis, informed by prior beliefs about the processes being modelled and by diagnostic plots for evidence of over- and under-fitting. We prefer this type of approach for scatterplot smoothing and fitting GAMs; graphical inspection lets us keep track of what the smoothing procedures are up to, more so than letting the software "black-box" the model fitting, say via cross-validation (e.g., Hastie and Tibshirani 1990, 42–52) or other automated procedures for choosing smoothing parameters. In addition, we typically have prior beliefs that the social and political processes under study are reasonably smooth; while the
world may be non-linear, we doubt it to be abruptly or flamboyantly nonlinear.12 In these cases we have good reasons for constraining the smoothing parameters to values that reflect these beliefs, rather than relying on the software to automatically choose parameters that minimize prediction error. All this is to say that we sometimes prefer "keeping the mean function parsimonious" (i.e., smoother) over blind adherence to "getting the mean right". If qualitative findings were sensitive to small and arbitrary decisions about smoothing, the whole smoothing and GAM technology would not be very useful. Fortunately, our findings are not sensitive to choices about the amount of smoothing to perform, within what appear to be reasonable ranges of smoothness. Obviously researchers should clearly indicate whether results depend on arbitrary decisions about smoothness. But even here GAMs are simply making explicit the effect of arbitrary choices that usually are left implicit. Users of OLS almost never question whether an assumption of perfect smoothness is reasonable. All of the findings in our applications are robust to exact choices about how much to smooth; on the other hand, since our results differ substantially from those obtained with simple linear models, it follows that the assumption of extreme smoothness underlying OLS is substantively consequential.

12 Einstein's observation that "God is subtle but he is not malicious" is a favorite epigraph in texts on non-parametric statistics.

4 GAMs in practice

GAMs seem an appealing way to analyze political and social data. It is now time to see how helpful they are in actual political science research. The examples that follow come from a variety of sub-fields of political science, and were chosen to show some of the advantages of the GAM approach. In the conclusion we assess the usefulness of GAMs in political research more generally.

4.1 The effect of Perot on Congressional races: GAMs as an alternative to multiple regression

Jacobson and Dimock (1994) analyze the effect of the House Bank scandal on the 1992 House elections. In one analysis they regress the challenger's vote in the 1992 election on the number of overdrafts written by the incumbent and a series of other "control" variables. The control variable of interest for our purpose here is the percentage of the presidential vote received by Perot. Jacobson and Dimock (1994, 617) used this variable because they "suspected that Perot voters were more disposed to vote against incumbents of either party."

In Table 1 we first present a re-estimation of Jacobson and Dimock's regression (their Table 9).13 The data comprise all 309 incumbents who ran for re-election in 1992. We regressed the proportion of the vote received by the challenger against the Perot vote in the district, the log of the number of overdrafts (plus one) for each incumbent and a series of control variables.14

13 The data as supplied by Jacobson and Dimock were updated after publication. Analyses reported here use the updated data. Our results differ by at most a few percent from the published results.

14 In the next subsection we assess the logarithmic transformation on the overdrafts variable. Here we focus solely on the effect of the Perot vote. Other control variables used in all analyses were whether the challenger had held prior office, the logs of both the challenger's and incumbent's expenditures, the presidential vote for the challenger's party in the district, a dummy variable for whether the incumbent was in a close race in 1990 (Marginal) and whether the incumbent was subject to a partisan redistricting between 1990 and 1992 (Part. Redist.). All data used are as supplied by Jacobson and Dimock; measurement details are in their 1994 article. Unreported analyses confirm that these predictors have linear effects and so will not be discussed further.

Table 1: Estimates of Challenger's Vote: 1992 Congressional Election

                        OLS               GAM: Perot Vote      GAM: Checks
                     b        se          b        se          b        se
Constant             4.34     3.49        5.34     3.44        4.64     3.48
Prior Office          .98      .87        1.07      .86         .81      .87
ln(Chal. Exp.)       2.66      .26        2.63      .26        2.64      .27
ln(Inc. Exp.)         .06      .60         .11      .60         .004     .59
Pres. Vote            .36      .036        .34      .04         .37      .04
ln(Overdrafts)        .80      .19         .85      .19        see figure 6
Marginal              .85      .84         .85      .83         .89      .84
Part. Redist.        3.48     1.24        3.51     1.22        3.48     1.23
Perot Vote            .18      .06       see figure 5           .18      .06
Sum Sq. Resid.     9887.3               9588.1                9754.3
Degrees of freedom    300                  298                   297

The results of the linear regression are in the first columns of Table 1.
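As a rough sketch (not the authors' actual code), the specifications compared in Table 1 could be written in the S syntax of Appendix B as follows; the data frame jd92 and the variable names are hypothetical stand-ins for the Jacobson and Dimock data:

    # All terms entered linearly (the OLS columns of Table 1)
    gam.linear <- gam(chal.vote ~ prior.office + log(chal.exp) + log(inc.exp) +
                      pres.vote + log(overdrafts + 1) + marginal + part.redist +
                      perot.vote, data=jd92)
    # Identical, except the Perot vote enters as a smoothing spline with
    # three degrees of freedom (the "GAM: Perot Vote" columns)
    gam.perot  <- gam(chal.vote ~ prior.office + log(chal.exp) + log(inc.exp) +
                      pres.vote + log(overdrafts + 1) + marginal + part.redist +
                      s(perot.vote, df=3), data=jd92)
    # anova() on the two fits gives the approximate F test of linearity
    anova(gam.linear, gam.perot)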
According to the linear regression model, the Perot vote did have an effect on the 1992 Congressional vote, with a 10 point increase in the Perot vote associated with just under a two point increase in the challenger's vote. The t-ratio for this coefficient is about 3, so the effect is clearly statistically significant. But while we suspect that Perot voters are more likely to vote against incumbents, we have no theory which says that this functional form is linear. To assess this, we re-estimated a series of GAMs. These were identical to the linear regression, except they allow for a smooth, but non-linear, relationship between the vote for Perot and the vote for the challenger. Since we want to ensure that our findings are not the result of an arbitrary choice about whether to use loess or cubic smoothing splines, as well as the degree of smoothing, we present a series of analyses which vary both the degree of smoothing and the choice of smoothing technique. Plots of the relationship between the challenger's vote and the Perot vote for a variety of smoothing choices are in panel (a) of Figure 5.15 In interpreting these plots, it should be remembered that the median district Perot vote was about 20%, ranging from 3% to 33%; the distribution of this vote by district was reasonably symmetric, with a slightly heavier left tail.

15 We stress that all these figures are based on GAMs which include all the other Jacobson and Dimock variables either linearly or logarithmically. The effect of these other variables is thus summarized by a regression coefficient. These coefficients (for the GAM using a smoothing spline with three degrees of freedom) are shown in the middle columns of Table 1. Specifying a non-linear effect for the Perot vote has little impact on the other coefficients. This need not be the case; we would have seen bigger effects if the Perot vote had been more highly correlated with the other control variables.

Figure 5: Non-parametric estimates of the effect of the Perot vote in a Congressional district on the challenger's vote. Panel (a) compares a variety of smooths (smoothing splines with 3, 4 and 8 degrees of freedom; loess with spans of .75, .5 and .25) and panel (b) compares the best fitting smooth (spline, df = 3) with a linear fit. The horizontal axis is the Perot vote; the vertical axis is the effect on the challenger's vote.

The linear model for the effect of Perot vote is clearly misleading. Using any of the graphed smooths, we see that an increase in the vote
for Perot has a linear effect on the challenger's vote until the Perot vote exceeds about 15% (that is, there is a linear impact for the lower quartile of districts). But once that threshold is exceeded, there is no further effect of the Perot vote on the Congressional election. Note that this finding is robust to choices about how to smooth. The coefficient on the linear Perot term is .18; but the non-linear effect is between three and four times as great until the 15% threshold is reached. In panel (b) of Figure 5 we compare the best fitting of the smooths, a cubic smoothing spline using 3 degrees of freedom, with the linear fit. The linear model underpredicts the challenger's percentage of the vote by over three points for districts at the 15% threshold. F tests overwhelmingly reject the null hypothesis of linearity in favor of any of the local regression fits, with p < .0001. This difference is clearly non-trivial. It also yields a different conclusion than Jacobson and Dimock's (1994, 618) finding that "the more support for Perot, the more support for the House challenger." Instead we have two different effects, separated by the 15% threshold. It is possible that each district has a relatively small number of anti-Washington voters, who would be more likely to vote both for Perot and for the non-incumbent House candidate. But districts where Perot did well (above the 15% threshold) are probably districts where the Perot campaign was organized. If so, such organization did not appear to have consequences for the House race. We would need individual level data to test this hypothesis, but without the GAM we would not even have seen the non-linear relationship between the Perot and House vote.16

16 We do not claim it would have been impossible to find such a relationship via analysis of least squares residuals and such; but we do claim it is at least as easy to find and test non-linear relationships with GAMs as it is to find linear relationships via linear regression.

4.2 Overdrafts and the Congressional Vote: Assessing Transformation

Jacobson and Dimock were interested in the effect of the House Bank scandal, not the Perot vote, on the Congressional vote. They chose to model the effect of overdrafts on the challenger's vote via a logarithmic transformation. Since many incumbents had no overdrafts, they used the natural log of the number of overdrafts plus one. While we might expect that the impact of overdrafts on the vote is monotonically increasing, it is also reasonable to expect decreasing returns to scale. But Jacobson and Dimock neither justify nor examine their choice of specific transformation. GAMs provide an ideal way of doing this.

The overdrafts variable presents difficulties in that it is highly skewed. While its range is from zero to 851, the median number of overdrafts is two, with 40% of incumbents who sought re-election having had no overdrafts and 90% having had fewer than 100 overdrafts. The few incumbents with a large number of overdrafts bring the mean number of overdrafts to over 30. While highly skewed variables are always a problem, the large number of incumbents with no overdrafts causes severe problems for the logarithmic transformation. While it is conventional to add one to all observations to avoid the log of zero, we could equally well add .01 or 10.
Substantively, this decision has serious consequences for estimating the impact of the independent variable at low levels of that variable; for Overdrafts, almost all observations are at low levels.17 The GAM completely eliminates the need for the common arbitrary decision about how to avoid the log of zero. The logarithmic transformation is also problematic in that it is fit globally, not locally. Decreasing returns to scale for large values of Overdrafts clearly requires the use of a downward bending transformation. But a log transformation struggles to fit both those incumbents with many overdrafts, as well as the large number of incumbents with few, if any, overdrafts.

To assess the relationship between overdrafts and challenger vote, we estimated a GAM relating challenger vote to a non-parametric estimate of Overdrafts, using a smoothing spline with four degrees of freedom for this estimate.18 Neither the logarithmic nor the spline model is nested inside the other, and so we cannot directly test one versus the other. We can embed both models in a larger model which contains both, and then test whether the data reject either specialized model (which is the common procedure for testing non-nested regression specifications). This "test" shows that the data prefer neither model, finding either one consistent with the data. Thus using the logarithmic transformation of Overdrafts, as Jacobson and Dimock did, does no great injustice to the data.

17 These consequences stem from the logarithmic function having large first and second derivatives near zero. Thus we will get very different estimated impacts if we add .01, 1 or 10 before transforming. Theory is completely silent as to which of these values is appropriate, and conventional data analysis allows for a large range of values which might be appropriately added to Overdrafts prior to the log transform.

18 All other control variables were entered linearly. Results were almost identical if we used the non-parametric Perot Vote. The non-parametric estimate was not sensitive to smoothing choices.

Figure 6: Comparison of logarithmic (dashed line) and cubic smoothing spline (df=4), effect of overdrafts on challenger's vote. An approximate 95% confidence interval for the cubic smoothing spline estimate is shaded. Panel (a) is for all incumbents while panel (b) focusses on the region containing 90% of the data (fewer than 100 overdrafts), allowing the eye to more clearly see the difference between the two models. Both panels are identical except for scale.

We can, however, use the estimated smoothing spline to improve our understanding of the effect of the House Bank scandal on the election. The estimated impact of the scandal using the logarithmic transformation is heavily influenced by the arbitrary decision to add one to Overdrafts before taking logs. The GAM approach avoids the need for this arbitrary decision. The estimated smoothing spline and logarithmic fits are shown in Figure 6. Comparing the spline to the logarithmic fit, we see that the logarithmic fit overstates the impact of overdrafts when the number of overdrafts is not great; it lies above the confidence interval of the spline fit until the number of overdrafts exceeds 50. Over 85% of incumbents seeking re-election were in this region.
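The comparison just described might be sketched in the S syntax of Appendix B as follows; jd92 is the hypothetical data frame used above, and controls stands in for the other Jacobson and Dimock regressors:

    # Jacobson and Dimock's global log transformation (adding one to avoid log(0))
    gam.log    <- gam(chal.vote ~ controls + log(overdrafts + 1), data=jd92)
    # Local, non-parametric alternative: a smoothing spline with four degrees of freedom
    gam.spline <- gam(chal.vote ~ controls + s(overdrafts, df=4), data=jd92)
    # Embed both in a larger model and ask whether the data reject either restriction
    gam.both   <- gam(chal.vote ~ controls + log(overdrafts + 1) +
                      s(overdrafts, df=4), data=jd92)
    anova(gam.log,    gam.both)
    anova(gam.spline, gam.both)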
The spline fit indicates that a very small number of overdrafts did not hurt incumbents very much, and may not have hurt them at all. The pointwise confidence interval for the impact of overdrafts excludes zero only for six or more overdrafts. At that point, the GAM analysis indicates that six overdrafts cost the incumbent about a quarter of a point; the logarithmic analysis indicates that six overdrafts cost the incumbent over a point and a half. The logarithmic analysis indicates that even one overdraft cost the incumbent about half a point; the corresponding GAM impact is four one hundredths of a point. Thus the logarithmic analysis shows that even a very small number of overdrafts might have affected close Congressional races; the GAM analysis shows that overdrafts only had a major effect on races for the minority of incumbents who had a relatively large number of overdrafts. Thus while the logarithmic transformation does not do any great injustice to the data, it does have substantive consequences that appear not to be correct.

4.3 Do governments face rising hazards? — an analysis in discrete, but smooth, time

An enduring controversy in comparative politics is whether the probability of a cabinet failing increases with the age of the cabinet. Warwick (1992) argued that the hazard rate of cabinet failure increased over the life of the cabinet, while Alt and King (1994) argued for a constant hazard rate (both controlling for various institutional rules).19 This controversy has both statistical and substantive implications. The incorrect assumption of a constant hazard rate leads to wrong standard errors, and wrong, or at best, inefficient, parameter estimates. The substantive issue is whether, controlling for institutional rules, there is a tendency for cabinets to become more fragile as they age. If so, we need to investigate what happens over the life of a cabinet to make it more fragile; if not, then fragility is purely determined by institutional rules. Parametric estimation of a continuous time Weibull model shows that the hazard rate is rising, at least for the model used by Alt and King (Beck 1995; Beck N.d.). But this finding is based on a strong assumption, that the hazard rate increases (or decreases) monotonically, following the specific Weibull shape.

19 In survival analysis, the hazard rate is, loosely speaking, the probability of a cabinet failing in some small time interval (t, t + Δt), given that it had survived up to time t.

One of us has used discrete time hazard models to investigate this issue (Beck 1995). This analysis can be dramatically improved by the use of GAMs. The discrete time approach models the probability of a government failing at the end of some month, given that it survived up to that month (the discrete hazard), as a logit on a series of covariates and a time marker. As is standard in this literature, the covariates we use are a measure of party polarization, whether the government is a majority government, how many prior formation attempts had been made, whether the coalition was formed right after an election, whether the government is a caretaker government and whether there is an investiture requirement.
Letting FAILURE be a dummy variable which marks whether a government fails in a given month, given that it survived up to that month, and removing governments which fail from further analysis, the discrete hazard model is a logit with

\log\left[\frac{P(\mathrm{FAILURE}_{it})}{1 - P(\mathrm{FAILURE}_{it})}\right] = \beta_0 + \beta_1 x_{1it} + \cdots + \beta_k x_{kit} + h(t). \qquad (18)

In Beck (1995), h(t) was modelled as a series of dummy variables, one for each time point (so h(t) = h_t). With monthly data, and cabinets surviving up to 48 months, this led to imprecise estimates of the coefficients of the individual h_t's; these are, not surprisingly, highly collinear. While it appeared that the coefficients on the h_t's increased with t, the large standard errors on these estimates made this finding uncertain. One solution to this problem would be to use a parametric model in time. But the shape of the time dependence is unknown and is exactly the point at issue. A low order polynomial approximation is overly restrictive, and is, in any event, inconsistent with typical shape assumptions made in survival analysis. GAMs provide a better solution. Instead of a series of dummy variables, or an arbitrary polynomial, we simply model h(t) as a smooth function of time. The smoothness assumption seems reasonable. While hazard rates may be time dependent, we would not expect them to make large jumps from month to month. Here we model h(t) as a smoothing spline consuming 4 degrees of freedom; our substantive findings were identical using a smooth with any reasonable number of degrees of freedom.

The smoothing spline for time is shown in panel (a) of Figure 7, with coefficient estimates in Table 2.20 The picture is clear. The hazard of a government falling increases over time, at least for the first two and a half years that it is in power. Examination of the 95% confidence interval shows that this increase is not simply random error. A χ² test clearly shows that Equation 18 with a smooth function of time is superior to one without a time variable; we can reject the null hypothesis that the smoothing spline of time does not belong in the specification at p = .001. Contrary to Alt and King, hazard rates are rising, not constant.

20 The metric on the y-axis in this panel is in terms of logits of the probability of government failure, the GAM's dependent variable; see equation 18. We present the results in terms of logits instead of the substantive probabilities of government failure because the latter are a function of all the covariates; i.e., it is more straightforward to inspect the partial effects of change in a given independent variable in terms of the logits. We can transform the predicted effects to changes in terms of probabilities via the inverse logit transformation, to produce graphs like that in panel (b).

Figure 7: The effect of cabinet duration on cabinet failure: discrete hazards. (a) Effect of duration on the logit of government failure (shaded area indicates an approximate 95% confidence interval; the y-intercept is arbitrary) and (b) comparison of estimated discrete and Weibull hazards. The horizontal axis in both panels is the month in office.

The plot of the smoothing spline is not directly interpretable in substantive terms, but it is easy to use the inverse logit transformation to compute the substantively meaningful discrete hazards (which are just the probabilities of cabinet failure in a given month).
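A sketch of how the discrete-time hazard GAM in equation 18 might be estimated in S (the data frame cabinet.months and the variable names are hypothetical; the covariates are those listed above and in Table 2):

    # One row per cabinet-month; failure = 1 in the month a government falls
    hazard.gam <- gam(failure ~ investiture + polarization + majority +
                      formation + post.election + caretaker + s(month, df=4),
                      family=binomial, data=cabinet.months)
    # The fitted smooth of month is h(t) on the logit scale (panel (a) of Figure 7)
    plot(hazard.gam, se=T)
    # Inverse logit transformation: converts a fitted logit eta into a discrete
    # hazard (the monthly probability of failure), as in panel (b) of Figure 7
    inv.logit <- function(eta) exp(eta) / (1 + exp(eta))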
A plot of the discrete hazard rate against time, computed at the mean of all the other covariates, is shown in panel (b) of Figure 7. The hazard rate increases, more or less linearly, over the first 30 months of a government, tripling from .03 to .09 over those months. Typical cabinets are three times as likely to fall after having been in power over two years than at the time of their initial formation. The hazard rate remains at the .09 level after that, so the probability of a government falling in any month after month 30, given that it survived to that month, is almost one in ten. We should remember that there are relatively few governments which fall after three years, so the uncertainty associated with the flat portion of the estimated hazard is large.

Table 2: Estimation of the Discrete Hazard of Cabinet Failure

                        With Smoothed Time       Without Time
Variable                b          se            b          se
Constant                -3.30      .20           -2.88      .17
INVESTITURE               .41      .14             .36      .14
POLARIZATION              .025     .006            .020     .006
MAJORITY                 -.63      .14            -.54      .14
FORMATION                 .14      .05             .12      .05
POST-ELECTION            -.90      .15            -.76      .14
CARETAKER                1.92      .32            1.62      .31
MONTH                  see figure 7
-2LogLike              1950.9                    1971.5
Degrees of Freedom     5385                      5389

Fully parametric duration models typically make an assumption about the shape of the hazard function. The most standard model, the Weibull, assumes monotonic hazards. Thus it is hard to know if the finding of rising hazards in a Weibull model is real or artifactual. The GAM discrete hazard model estimates the hazard function non-parametrically. Thus our finding of rising hazards gives credence to the Weibull findings. A comparison of the Weibull hazard with the non-parametrically estimated hazard is shown in panel (b) of Figure 7. Both hazard rates are similar for the first year and a half a government is in power; at that point the non-parametric discrete hazard rises faster than does the Weibull hazard. While this difference between the two hazard rates is not statistically significant, it does show that, if anything, the Weibull underestimates the degree of rising hazards. Thus the Weibull model, with rising hazards, does little injustice to the data. The GAM analysis clearly shows that the Alt and King model is subject to rising hazards. Even holding institutional rules constant, cabinets do appear to become more fragile as they age.

4.4 The democratic peace: monadic or dyadic

The democratic peace hypothesis asserts that democracies, while no less likely to be involved in wars in general, are less likely to fight other democracies. This hypothesis has spawned much quantitative research (see e.g., Ray 1995). The typical research design used to assess this hypothesis uses data on some subset of all pairs of nations, with annual observations on dyadic variables, including the binary dependent variable of whether the dyad is involved in a militarized conflict. These data are analyzed by a straightforward logit analysis. Since theory is not specific on exactly what it takes for a nation to be a democracy, researchers have either used arbitrary dichotomizations (or trichotomizations) of relatively continuous democracy measures or have created ad hoc continuous measures of dyadic democracy. The 2 dimensional loess smooth enables us to do better. Here we reanalyze conflict in the post-World War II period among contiguous dyads, originally reported in Oneal et al. (1996).21 Following Oneal et al.
(1996) and Oneal and Russett (1996), we estimate a logit of DISPUTE (whether or not there was a militarized dispute in the Correlates of War dataset) on: a measure of dyadic democracy (see below); the average annual growth in real GDP of the dyadic partners over the previous three years (GROWTH); economic interdependence measured by the relative importance of trade between the dyadic partners (INTERDEP); a dummy variable indicating whether the partners were allied (ALLIES); and the relative balance of forces between the partners, with each partner's capability measured by economic, military and demographic factors (CAPRATIO).22 We use geographically contiguous dyads in the period 1950–85, yielding 6678 dyad-years.

In the earlier work Oneal et al. (1996) used a simple dummy variable to measure dyadic democracy. Subsequently Oneal and Russett (1996) argued that the appropriate measure of dyadic democracy is the democracy score of the less democratic nation in the pair. This newer measure, MINDEM, is constructed by creating democracy scores for each member of the dyad (DEMA and DEMB) and taking the dyadic score as the lesser of the two. The democracy score for a nation is the difference between its measure on a democracy scale (running from zero to ten) and an autocracy scale (also running from zero to ten), so DEMA, DEMB and MINDEM run from negative ten to ten. While MINDEM is not an unreasonable measure of dyadic democracy, the democratic peace hypothesis is not stated so clearly as to make it obviously the correct measure.

21 This section is drawn from a longer analysis of time-series–cross-sectional data issues in international relations (Beck and Tucker 1996).

22 These measures, other than for democracy, were taken exactly from Oneal et al. Democracy measures were coded from the Polity 3 dataset (Gurr and Jaggers 1996). Measurement issues are discussed in Oneal et al. (1996) and Oneal and Russett (1996).

The GAM/interaction approach allows us to investigate the interrelationship of DEMA, DEMB, and the likelihood of a dyadic dispute in any given year. We simply enter a smooth term in the interaction of DEMA and DEMB (a 2 dimensional loess smooth) in a GAM of disputes.23 The smooth term used has a span of half the data; as usual, findings were not sensitive to smoothing choices. There is, however, an additional complication which must be dealt with before undertaking the GAM estimation. The simple logit setup of Oneal et al. assumes the dyadic observations are temporally independent. This assumption, as Oneal et al. recognize, is implausible. As shown in Beck and Tucker (1996), we can view the dispute time-series–cross-section data as duration data on how long dyads remain at peace.24 We can allow for non-independence by adding a smoothed function of the length of time since the last dispute, PEACEYRS, to the logit. We use a cubic smoothing spline with four degrees of freedom to model this smooth. As usual, results were not sensitive to specific smoothing decisions. It is important to estimate the effect of PEACEYRS non-parametrically, since we do not know how the probability of dispute varies over time. Results for the simple logit and GAM analyses are reported in Table 3. The smoothed fit of PEACEYRS (Figure 8) clearly shows that the logit on disputes suffers from non-independence.25 For the first seven or so years that a dyad is at peace, the probability of a militarized dispute declines rapidly.
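A minimal sketch of this specification in the S syntax of Appendix B (data frame and variable names hypothetical): the two democracy scores enter through a single two-dimensional loess term, and PEACEYRS through a cubic smoothing spline.

    # One row per dyad-year; dispute = 1 if a militarized dispute occurred
    dispute.gam <- gam(dispute ~ lo(dema, demb, span=.5) + growth + allies +
                       capratio + interdep + s(peaceyrs, df=4),
                       family=binomial, data=dyad.years)
    # The conventional logit replaces the smooth terms with MINDEM and drops
    # PEACEYRS (the "Logit" column of Table 3)
    dispute.logit <- glm(dispute ~ mindem + growth + allies + capratio + interdep,
                         family=binomial, data=dyad.years)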
We can gauge the substantive size of the PEACEYRS effect by noting that it is roughly eight times as large as the effect of the dyadic partners being allied. The coefficient on the binary variable ALLIES in the GAM analysis of Table 3 is .30; going from having been at war in the previous year to seven years of prior peace has a corresponding impact of almost 2.4 in the same logit units. We can use the inverse logit transformation to estimate these effects in terms of the probability of a dispute. For a dyad with other independent variables set to their median values, changing from being non-allied to allied decreases the probability of conflict from .055 to .042; moving from having been at war in the previous year to seven years of prior peace decreases this probability from .274 to .033.26 The probability of a militarized dispute declines more slowly from years eight to about twelve; after about twelve years, additional years of peace have essentially no effect on whether the dyad enters a militarized dispute.

23 Since the model is symmetric in the dyadic partners, we randomly assign one nation as "A" for each observation.

24 The basic argument is that time-series–cross-section data with a binary dependent variable is just another way of describing duration data; the information in the binary dependent variable is the length of time between disputes (the number of zeros between ones). The assumption of independence in the logit analysis is identical to the assumption of a constant hazard rate in the corresponding duration analysis. Thus including the smooth of PEACEYRS is equivalent to our inclusion of the smooth of cabinet age in the analysis of cabinet failure.

25 "Likelihood ratio" tests clearly indicate that the smooth of PEACEYRS belongs in the specification. These tests indicate that a model with a linear PEACEYRS term is superior to a model without that term, and a model with the smooth of PEACEYRS is superior to the model with linear PEACEYRS. p-values for all these tests are effectively zero. As in the cabinet duration analysis, the y-axis for this plot (and the z-axis in Figure 9) is on the logit scale (see note 20).

Table 3: Estimates of Logit Model of Post-World War II Disputes

                        Logit                    GAM
                        b          se            b          se
MINDEM                  -.027      .011
DEMA, DEMB                                       see figure 9
GROWTH                  -.14       .02           -.12       .03
ALLIES                  -.52       .11           -.30       .12
CAPRATIO                -.444      .25           -.68       .26
INTERDEP               -13.8      3.17           -6.42      2.85
PEACEYRS                                         see figure 8
Constant                -2.26      .11           -1.05      .10
-2LogLike              2793.0                    2337.4
Degrees of Freedom     6776                      6767

Note: Data are in dyad-year form. The dependent variable is whether the dyad engaged in a militarized dispute in a given year.

The conventional logit analysis shows that MINDEM affects the probability of dyadic dispute; as the less democratic partner becomes more democratic, the probability of a militarized dispute decreases. The 2 dimensional loess smooth gives more information about the role of each partner's level of democracy in inhibiting war. This smooth is shown in Figure 9. This figure is a 3 dimensional perspective plot, with the two democracy scores plotted on the x and y axes and the smoothed impact on disputes plotted on the z axis. Panel (a) of the figure is drawn from the perspective of the least democratic dyad, while (b) redraws the same figure from the perspective of the most democratic dyad.
Figure 10 shows the corresponding contour plot, another way of conveying the information in Figure 9.27 Moving from being the least to the most democratic dyad has an effect on disputes (in terms of logits) of about .9. Substantively, this is about three times the impact of the members of the dyad being allied. Setting the values of all other independent variables to their medians, moving from being the least to being the most democratic dyad decreases the probability of conflict from .06 to .025; we have already seen that being allied only decreases this probability by .013. A likelihood ratio test clearly indicates that the smooth of the democracy interaction belongs in the specification, and is superior to a specification with only linear and multiplicative terms in DEMA and DEMB. (The p-value for this test is essentially zero.)

26 The effects in terms of logits for PEACEYRS could be read off Figure 8; all reported effects were actually computed using the more exact methods discussed in Appendix B.

27 As any hiker knows, conveying information about three dimensions on a two dimensional page is not easy. The perspective plots are an attempt to render the effect of the two democracy scores on disputes (on the logit scale) assuming the reader were looking at the three dimensional surface with the eye in (a) being at the point (-10, -10, 0) and the eye in (b) being at (10, 10, 0). The contour plot is like the hiker's "topo" map or the temperature map on the national weather page. The lines or "isocurves" on the contour plot join up pairs of democracy scores which have equivalent effects on the probability of a militarized dispute. The numbers on the isocurves are in the same logit units as the z-axis of the perspective plots. The isocurves on a hiker's topo map indicate positions of equal altitude; the density of these curves around a given point indicates the steepness of the terrain near that point. The height of the perspective plot corresponding to a minimum democracy score of -10 for both partners is about zero; in Figure 10 the contour line closest to this point has a value of zero. At the other extreme, the height of the perspective plot corresponding to a maximum democracy score of 10 for both partners is almost -.9; the contour line closest to this point has a value of -.8. (Since the effect of democracy never is lower than -.88, there is no contour line corresponding to -.9.)

Figure 8: Effect of PEACEYRS on disputes. Smoothing spline with approximate 95% confidence interval shaded. The horizontal axis is years since the last dispute; the vertical axis is the effect on disputes, on the logit scale.

Figure 9: Local linear regression (loess) estimate of the joint effects of each partner's democracy score on militarized disputes. The estimated effects are defined over the two-dimensional space of the predictors, forming a surface. The plots are drawn from the perspective of (a) the least democratic dyads and (b) the most democratic dyads.
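The nested-model comparison just described can be sketched with the anova() facility discussed in Appendix B (data frame and variable names hypothetical; the test is approximate since the loess smooth does not strictly nest the multiplicative term):

    # Restricted model: DEMA and DEMB enter linearly, plus their product
    dispute.lin <- gam(dispute ~ dema + demb + dema:demb + growth + allies +
                       capratio + interdep + s(peaceyrs, df=4),
                       family=binomial, data=dyad.years)
    # Unrestricted model: the 2 dimensional loess smooth of the two democracy scores
    dispute.sm  <- gam(dispute ~ lo(dema, demb, span=.5) + growth + allies +
                       capratio + interdep + s(peaceyrs, df=4),
                       family=binomial, data=dyad.years)
    anova(dispute.lin, dispute.sm)   # approximate likelihood ratio comparison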
Of perhaps more interest, estimating a combined model with both MINDEM and the smooth interaction shows that even with MINDEM already in the specification, the smooth interaction term also belongs (p < .001).

Figure 10: Contour plot of estimated joint effects of each partner's democracy score on militarized disputes. The contour plot is an alternative way of presenting the information in Figure 9, showing the "topography" of the fitted surface. The axes are the democracy scores of the two partners, each running from -10 to 10.

The value of the GAM is not in obtaining improved fits, but rather in allowing us to delineate the impact of democracy on war, and in particular to assess whether each nation's democracy matters, or whether only dyadic democracy matters. Looking at the perspective plot in panel (a), we see that either partner's democracy has little or no effect on conflict when either partner is not democratic. The contour plot shows that democracy has little effect on conflict until both partners' democracy scores exceed zero (on the -10 to 10 scale). This is contrary to the recent Oneal and Russett argument that the probability of conflict declines over the entire range of democracy scores. This would seem to indicate that the original dichotomous measure of democracy is better than the new MINDEM. But this is not the case either; once both partners' scores exceed zero, an increase in either partner's score has a strong effect on conflict. In panel (b), we see that the surface drops rapidly when both partners are relatively democratic. The magnitude of the effect can be seen in either the steep drop off in the surface in Figure 9 or the close spacing of the contours in the northeast quadrant of Figure 10. Holding other variables constant, the difference in the probability of a dispute for the least and most democratic dyads is about the same as the difference in the probability of conflict immediately after a conflict and after three years of peace. Ninety percent of this effect occurs after both partners have positive democracy scores. Analyses relying on a dichotomous measure of dyadic democracy would miss this.28

The GAM analysis sheds light on whether the effect of democracy on war is dyadic or monadic. The dyadic hypothesis asserts that democracies themselves are not more pacific, only that pairs of democracies are more pacific; the monadic hypothesis asserts that democracies are more pacific, and democratic dyads are only as pacific as would be predicted from their two separate democracy levels (e.g., Rousseau et al. 1996). The GAM analysis shows that both positions are partially correct. At low levels of democracy it is dyadic democracy that matters; dyads containing a partner with a negative democracy score are equally likely to engage in conflict, regardless of whether the other partner is highly democratic or highly non-democratic. But for dyads where both partners have positive democracy scores, any increase in either partner's democracy score decreases the probability of conflict. This can be seen in either the shape of the surface in panel (b) of Figure 9, or, perhaps more clearly, in the contours of Figure 10. In the northeast (democratic) quadrant, these contours are negatively sloped 45-degree lines. This indicates that at high levels of democracy, either partner's increase in democracy is a perfect substitute for the other.
This shows that both the MINDEM and the dichotomous measure of dyadic democracy do not capture all the features of the democratic peace, and that neither the monadic nor the dyadic theories completely capture the nature of post-World War II conflict.

28 The plots show clearly that both the dichotomous and the minimum democracy measures are flawed. If the dichotomous measure were correct the perspective plot would simply show two vertical levels. If MINDEM were the correct measure the contours would all be vertical above the 45-degree line and horizontal below it, with decreasing conflict levels moving from left to right and from top to bottom throughout the plot. Figures 9 and 10 do not look like this.

5 Conclusion

Political methodology has made great strides in the last decade. Political scientists now routinely use models driven by a wide variety of complex statistical distributions with ease, and have started to use nonparametric methods to allow for more realistic error processes. Despite these more realistic methods, almost all political science models assume a linear relationship between covariates and mean response. We find this especially puzzling since there is almost always no theory that responses should follow this simple linear form. The generalized additive model seems an ideal tool for confronting this shortcoming. While very flexible in modelling the effects of specific variables, the GAM is not so pliable that it finds patterns where none exist. Thus it may be the ideal compromise between simple linear models and much more complex neural net and projection pursuit type models. The additivity constraint also makes the GAM easy to interpret, a property not shared by its more complicated cousins. Of course, analysts will choose among the numerous non-parametric regression techniques, perhaps choosing to ignore them altogether, drawing on their experiences with statistical modelling, and their intuitions about the particular substantive problem at hand. But the combination of flexibility and simplicity captured by the GAM means it can lay a strong claim to be the "workhorse" non-parametric regression method. Workhorses must be relatively easy to use; as we show in Appendix B, this is the case for GAMs.

While we think it most useful simply to present the non-parametric results as graphs of smooth curves, GAMs can also be used by those who are committed to parametric methods. They can be used either to test the adequacy of a chosen parametric transformation (including, perhaps most importantly, the linear transformation!) or to suggest a superior transformation. While we feel that global transformations are inferior to local estimates, a carefully chosen global transformation is superior to one chosen by convention or prayer.

We presented a variety of applications of the GAM in Section 4. In the analysis of the House Bank scandal the GAM was most helpful in uncovering the nature of the relationship between Perot votes and votes for the House challenger. With the GAM it was simple to see that there really were two relationships here: one for districts with few Perot voters and one for districts with many such voters. The GAM also helped us to better understand the substantive effect of the number of overdrafts drawn on 1992 House re-election contests; this effect is substantively smaller than had been previously claimed. And at the very least, our reanalysis also showed that those committed to parametric forms could better inform their choice of parametric form by using the GAM.
These reanalyses show it is straightforward for analysts familiar with regression and logit analysis to recast those models as GAMs. Our reanalyses of cabinet durations shows that GAMs are an ideal tool for the analysis of discrete time duration data, and can settle important substantive and statistical questions that are almost impossible to settle any other way. Standard practice at present is to impose specific forms of age effects, such as the Weibull. The GAM can either be used to justify such forms or to go beyond them. The reanalysis of the democratic peace shows that the use of smooth functions of time goes well beyond standard duration models. The reanalysis of the democratic peace data also shows how useful the GAM is for modelling bivariate interactions. While many theories suggest interactions in social and political data, they are hard to estimate precisely in a linear framework. The GAM makes such estimation easy. We know of no other tool that could have shown that the dyadic democratic peace hypothesis holds, but also showing that the hypothesis must be refined: for the more democratic nations, an increase in either nation’s level of democracy decreases the probability of conflict. We do not claim that the GAM is useful in all situations. In particular, it cannot be used where independent variables take on only a very few discrete values, although such variables can still be used as “linear controls” or conditioning variables in a GAM. Consider a hypothetical analysis of political behavior, relying on survey data. GAMs will probably be: (a) of little help in understanding the direct effects of categorical variables like gender, race, and religion; (b) potentially helpful in modelling the effects of scale measures of political attitudes and ideology; and (c) of great help with independent variables like age, income, and years of education. GAMs work equally well on interval and ordinal data, and do not even assume that the effect of a variable is monotonic; they essentially avoid 43 the entire level-of-measurement issue. Also, analysts of aggregate political and economic data or public policy frequently have more-or-less continuous independent variables, though little theory about the functional form relating these variables to a dependent variable of interest. Here the generalized additive model is ideal. There surely will be situations where other non-parametric methods, such as neural nets or ACE, are the method of choice. But only GAMs allow the analyst to work within the relatively familiar framework of the additive, regression model. Accordingly, the interpretation of GAMs is very similar to the interpretation of regression coefficients, and the methodology for model choice in the GAM context is the same as that methodology in the linear or maximum likelihood context. So while GAMs are non-parametric they remain intuitively comprehensible. Thus, for example, public policy audiences that now routinely use evidence based on linear regression should have little trouble using evidence derived from the GAM. The same cannot be said for the GAMs more exotic relatives. Analysing data involves making compromises. 
The linear regression model is one such compromise (though typically not presented as such), providing a simple statistical answer to the substantive question "does y increase/decrease with x?" When its underlying assumptions are valid, the linear regression model is rightly hailed as a powerful tool: at the cost of just one degree of freedom an analyst learns as much as can be learned from the sample data about β. But as our applications demonstrate, assumptions of constant, linear effects do not always hold true, and the linear form assumed for E(y | X) may well be a relationship of little substantive relevance. In such circumstances linear regression stops being a useful compromise, and may even blind researchers to what is actually going on in the data. Instead, GAMs put a range of compromises between parsimony and fidelity to the data within reach of empirical social scientists, with least squares linear regression at one extreme. Our title claims that "getting the mean right is a good thing". If we include statistical power and substantive interpretability as desiderata, then surely GAMs are good things too.

Appendix A: Will GAMs find spurious non-linearities? — A Monte Carlo Study

In this appendix we address the question of whether GAMs will find smooth (non-linear) patterns in data that are known to be generated as the sum of a linear relationship and a purely random factor. As described in Section 3.3, an F test can be used to assess whether the null hypothesis of linearity should be rejected. But this test is only approximate and has unknown finite sample properties. To examine the performance of this test we undertook a series of Monte Carlo experiments.

We initially generated N values of an independent variable, x, uniformly spaced over the interval [1, 10]. These values were fixed for each experiment. The error process, ε, was generated by taking N draws from a zero mean normal distribution. The dependent variable was then generated as the sum of a linear function of x and this error. A GAM was then fit to the data, and an F test for nonlinearity was conducted. This process was repeated one thousand times for each configuration of experimental conditions. The object of the experiments was to see how often the GAM rejected the null hypothesis of a linear model, that is, falsely reported that a nonlinear model was correct. Given that the data are known to be linear, we should reject this null hypothesis at the p = .05 significance level 5% of the time. In other words, we are testing whether the nominal size of these tests (in this case, 5%) corresponds with the actual size of the tests. We varied the sample size, the error variance, the type of smooth (loess or spline) and the smoothness of the fitted GAM (as measured by the equivalent degrees of freedom consumed by the smooth).

The experiments showed that the actual size of the test is between 5% and 10%, that is, somewhat larger than the nominal size of the test. The average actual size of the test over all our experiments was just over 7%. Given the complicated fitting procedure used by GAMs we were impressed with how well the test for non-linearity performed. But the test is insufficiently conservative. Thus analysts who believe in a strict testing methodology should use critical values of the F distribution in excess of the nominal 5% values in assessing the null hypothesis that the true functional form is linear.
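A schematic of one cell of these experiments, in the S syntax of Appendix B (the sample size, error standard deviation and linear truth are illustrative values rather than those used in the experiments, and the bookkeeping for tallying rejections from the anova() output is omitted):

    n.reps <- 1000
    n      <- 100
    x      <- seq(1, 10, length=n)        # fixed design, uniformly spaced on [1, 10]
    for (i in 1:n.reps) {
      e <- rnorm(n, mean=0, sd=2)         # purely random error
      y <- x + e                          # an illustrative linear truth
      sim.data <- data.frame(x=x, y=y)
      fit.lin  <- gam(y ~ x,    data=sim.data)
      fit.gam  <- gam(y ~ s(x), data=sim.data)   # default smoothing spline, df = 4
      lin.test <- anova(fit.lin, fit.gam)        # approximate F test of linearity
      # ...record whether the test rejects at the nominal .05 level...
    }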
We find most frequently that we can either reject the null of linearity at any level, or we can clearly not reject it at any level, so this small departure of the nominal and actual size of the test is of little practical importance. Our experiments showed that the actual size of the F test only varied with the lack of smoothness of the GAM, as measured by the equivalent degrees of freedom used in the smooth. Table 4 summarizes these results, showing that the equivalent degrees of freedom consumed by the GAM (i.e., the amount of smoothing) was the only variable to have any significant impact on the size of the test.29 When we estimated more flexible GAMs, that is, allowed the GAMs to more faithfully track local fluctuations in the mean E(y) of the data, we found that the test was more likely to incorrectly reject the null of linearity. Thus researchers should take special care before concluding that a relationship is non-linear when working with extremely flexible smooths. Even in the worst case, the actual size of the test is never over 10%. We also note that in our own work, smooths which consume many degrees of freedom are seldom required. The analyses in Section 4 did not require smooths using more than five degrees of freedom.

29 Full results are in the data archive for this paper.

Table 4: Analysis of Variance, Monte Carlo Experiments

Test Variable              df    Sum of Sq    % Sum of Sq       F     p-value
edf                         1      127.45        49.27       197.38     .00
n                           1         .52          .20          .81     .37
type                        1         .70          .27          .08     .30
t                           1         .22          .08          .33     .57
edf x n                     1         .06          .02          .10     .76
edf x type                  1         .03          .01          .05     .82
n x type                    1         .56          .22          .86     .35
edf x t                     1         .37          .14          .56     .45
n x t                       1         .01          .00          .01     .92
type x t                    1         .57          .22          .86     .35
edf x n x type              1         .20          .08          .30     .58
edf x n x t                 1         .21          .08          .32     .57
edf x type x t              1         .22          .09          .34     .56
n x type x t                1         .30          .11          .45     .50
edf x n x type x t          1         .00          .00          .00     .95
Residuals                 192      127.66        49.19

Note: The dependent variable is the size of the F test for non-linearity (proportion of rejections at p < .05 in 1000 replications). n = sample size (25, 100, 400, 1000); type = 0 for cubic smoothing splines, 1 for loess; edf = equivalent degrees of freedom (almost uniformly distributed between 2.3 and 7); t = t-ratio of the true (linear) data generating process, inversely proportional to error variance (0, .5, 1, 1.7, 2.5, 5).

Appendix B: Software for GAMs

A statistical computing environment with strong graphics capabilities is essential for using GAMs and scatterplot smoothers. S is such an environment, though most people use S via its enhanced commercial version, Splus. Hastie (1992) provides a lengthy exposition of estimating GAMs in S. Venables and Ripley (1994, ch. 10) is also an excellent guide. Estimating GAMs in S is extremely simple, and uses a simple syntax. Consider the Pennsylvania State Senate data introduced in Figure 2. The following S commands fit GAMs to these data:

    penngams <- gam(absentee ~ s(machine), data=ashen)
    penngaml <- gam(absentee ~ lo(machine), data=ashen)

In the first example, the S formula syntax absentee ~ s(machine) tells the program to model the relationship between Democratic margin among absentee ballots and the Democratic margin recorded with voting machines with a cubic smoothing spline. The default is for the cubic smoothing spline to consume 4 degrees of freedom. A more jagged fit is produced by the formula syntax absentee ~ s(machine, df=10). The analyst can control the amount of smoothing in terms of the parameter λ (see equation (9) in the text) using a spar argument in s().
Since the scale of λ, and hence spar, is data-dependent, we prefer controlling smoothness via the df argument. The second example directs the program to use loess as the smoother. The default span for lo() is .5; a smoother fit could be produced with lo(machine, span=.75), and a more jagged fit with lo(machine, span=.25). Local linear fitting is the default; local quadratic fitting can be specified by lo(machine, degree=2).

Each example tells the software to create the "GAM objects", penngams and penngaml, respectively (the <- is the S assignment operator). These objects can be passed to other S functions like predict, plot, summary and anova to obtain useful information about the estimated fit. In particular, the command plot(penngaml, ask=T) will give the user a menu of options for graphically inspecting the fitted GAM. The summary function provides a printed summary of the GAM, including tests of the null hypothesis of linearity. Two different GAM objects can be compared using the anova() command; if one of the models nests within the other, this function also provides a test that the difference between the models is statistically significant.

Multiple predictors can be introduced to a GAM straightforwardly. For example

    biggam <- gam(y ~ s(x1) + x2 + lo(x3, span=.25) + s(x4, df=12) +
                  lo(x5, x6, span=.5, degree=2) + x7)

models y semi-parametrically, specifying linear regression-like fits on X2 and X7, cubic smoothing splines for the fits on X1 and X4, a local linear regression fit on X3, and a local quadratic fit over the combination of X5 and X6.

GAMs for qualitative dependent variables draw directly on the GLM framework. GAMs borrow two important concepts from GLMs: a link function, h(μ), that describes how the mean of y, E(y) = μ, depends on the additive predictors; and a variance function that captures how the variance of y depends upon the mean, i.e., var(y) = φV(μ), with φ constant (Hastie and Pregibon 1992, 196–7). For instance, logistic regression models for binary outcomes can be dealt with by making the link function the logit transformation, h(μ) = log(μ/(1 − μ)), and specifying a binomial variance function, V(μ) = μ(1 − μ)/n. S makes extensive use of this GLM-type syntax in its support for GAMs. For example, if y is a binary outcome, a logistic additive model can be fitted using the command

    logitgam <- gam(y ~ s(x1) + lo(x2) + x3, family=binomial)

The family= option is part of the S GLM syntax, and the defaults accompanying this option are to fit the logit model (the logit link is the default for GAMs/GLMs in the binomial family). A probit GAM model can be fit by specifying

    probitgam <- gam(y ~ s(x1) + lo(x2) + x3, family=binomial(link=probit))

The robust M-estimator is obtained by

    rprobitgam <- gam(y ~ s(x1) + lo(x2) + x3, robust(family=binomial(link=probit)))

Other software packages like SAS, SPSS and STATA provide some support for scatterplot smoothing, but support for "full-blown" GAMs appears confined to S and Splus. One of us (Jackman) has implemented scatterplot smoothing by loess for GAUSS. This routine, along with all the S code for the models and graphs, is in the data archive for this paper.

References

Alt, James E. and Gary M. King. 1994. "Transfers of Government Power: The Meaning of Time Dependence." Comparative Political Studies. 27:190–210.

Bartels, Larry M. 1996. "Pooling Disparate Observations." American Journal of Political Science. 40:905–42.

Beck, Nathaniel. 1995.
“Continuous and Discrete Time Methods for Estimating Duration Dependence.” Paper presented at the Annual Meeting of the Midwest Political Science Association, Chicago. Beck, Nathaniel. N.d. Modelling Space and Time: The Event History Approach. In Research Strategies in Social Science, ed. Elinor Scarbrough and Eric Tanenbaum. Oxford University Press. Beck, Nathaniel and Richard Tucker. 1996. “Conflict in Space and Time: Time-Series–Cross-Section Analysis with a Binary Dependent Variable.” Paper presented at the Annual Meeting of the American Political Science Association, San Francisco. Breiman, L. and J.H. Friedman. 1985. “Estmating optimal transformations for multiple regression and correlation (with discussion).” Journal of the American Statistical Association. 80:580–619. Cleveland, William S. 1979. “Robust Locally-Weighted Regression and Scatterplot Smoothing.” Journal of the American Statistical Association. 74:829–36. Cleveland, William S. 1993. Visualizing Data. Summit, New Jersey: Hobart Press. Cleveland, William S. and Clive Loader. 1996. Smoothing by Local Regression: Principles and Methods. In Statistical Theory and Computational Aspects of Smoothing, ed. W. Härdle and M. G. Schimek. Heidelberg: Physica Verlag. Cleveland, William S. and S.J. Devlin. 1988. “Locally weighted regression: an approach to regression analysis by local fitting.” Journal of the American Statistical Association. 83:596–610. Cleveland, William S., Susan J. Devlin and Eric Grosse. 1988. “Regression by Local Fitting: Methods, Properties and Computational Algorithms.” Journal of Econometrics. 37:87–114. de Boor, C. 1978. A Practical Guide to Splines. New York: Springer-Verlag. 49 DeVeaux, R. 1990. Finding Transformations for Regression Using the ACE Algorithm. In Modern Methods of Data Analysis, ed. John Fox and J. Scott Long. Newbury Park, Ca.: Sage pp. 177–208. Eubank, Randall L. 1988. Spline smoothing and nonparametric regression. New York: Marcel Dekker. Fan, J. 1992. “Design-adaptive non-parametric regression.” Journal of the American Statistical Association. 87:998–1004. Fan, Jianqing. 1993. “Local linear regression smoothers and their minimax efficiencies.” Annals of Statistics. 21:196–216. Fan, Jianqing and Irene Gijbels. 1996. Local polynomial modelling and its applications. London: Chapman and Hall. Fan, Jianqing and J.S. Marron. 1993. “Comment on Hastie and Loader.” Statistical Science. 8:120–143. Friedman, Jerome H. and W. Stuetzle. 1981. “Projection pursuit regression.” Journal of the American Statistical Association. 76:817–23. Good, I.J. and R.A. Gaskins. 1971. “Nonparametric roughness penalties for probability densities.” Biometrika. 58:255–277. Goodall, C. 1990. A Survey of Smoothing Techniques. In Modern Methods of Data Analysis, ed. John Fox and J. Scott Long. Newbury Park, Ca.: Sage pp. 126–77. Green, Peter J. and Bernard W. Silverman. 1994. Nonparametric regression and generalized linear models: a roughness penalty approach. London: Chapman and Hall. Greene, William. 1993. Macmillan. Econometric Analysis. Second ed. New York: Gurr, Ted Robert and Keith Jaggers. 1996. “Polity III: Regime Change and Political Authority, 1800-1994.” Computer file, Inter-university Consortium for Polticial and Social Research, Ann Arbor, MI. Hastie, T.J. and R.J. Tibshirani. 1990. Generalized Additive Models. London: Chapman and Hall. Hastie, Trevor J. 1992. Generalized Additive Models. In Statistical Models in S, ed. John M. Chambers and Trevor J. Hastie. 
Pacific Grove, California: Wadsworth and Brooks/Cole Chapter 7. Hastie, Trevor J. and Clive Loader. 1993. “Local regression: automatic kernel carpentry (with discussion).” Statistical Science. 8:120–143. Hastie, Trevor J. and Daryl Pregibon. 1992. Generalized Linear Models. In Statistical Models in S, ed. John M. Chambers and Trevor J. Hastie. Pacific Grove, California: Wadsworth and Brooks/Cole Chapter 6. Huber, Peter J. 1981. Robust Statistics. New York: Wiley. 50 Jacobson, Gary and Michael Dimock. 1994. “Checking Out: The Effects of Bank Overdrafts on the 1992 House Elections.” American Journal of Political Science. 38(3):601–24. Lancaster, Peter and Kȩstutis S̆alkauskas. 1986. Curve and Surface Fitting: An Introduction. London: Academic Press. McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman and Hall. Oneal, John R. and Bruce Russett. 1996. “Exploring the Liberal Peace: Interdependence, Democracy and Conflict, 1950–85.” Unpublished manuscript, Department of Political Science, Yale University. Oneal, John R., Francis H. Oneal, Zeev Maoz and Bruce Russett. 1996. “The Liberal Peace: Interdependence, Democracy and International Conflict.” Journal of Peace Research. 33(1):11–29. Ray, James Lee. 1995. Democracy and International Conflict: An Evaluation of the Democratic Peace Proposition. Columbia, S.C.: University of South Carolina Press. Reinsch, C. 1967. “Smoothing by spline functions, 1.” Numerische Mathematik. 10:177–183. Ripley, Brian D. 1993. Statistical aspects of neural networks. In Networks and Chaos - Statistical and Probabilistic Aspects, ed. O.E. BarndorffNielsen, J.L. Jensen and W.S. Kendall. London: Chapman and Hall pp. 40–123. Rousseau, David L., Chrisopher Gelpi, Dan Reiter and Paul K. Huth. 1996. “Assessing the Dyadic Nature of the Democratic Peace: 1918–88.” American Political Science Review. 90(3):412–33. Schoenberg, I.J. 1964. “Spline functions and the problem of graduation.” Proceedings of the National Academcy of Sciences (USA). 52:947–50. Statsci. 1995. S-PLUS: Version 3.3 for Windows. Seattle, WA: Mathsoft, Inc. Stone, C.J. 1982. “Optimal global rates of convergence for nonparametric regression.” Annals of Statistics. 10:1040–1053. Venables, William N. and Brian D. Ripley. 1994. Modern Applied Statistics with S-Plus. New York: Springer-Verlag. Wahba, Grace. 1990. Spline Models for Observational Data. Philadelphia: SIAM. Warwick, Paul V. 1992. “Rising Hazards: An Underlying Dynamic of Parliamentary Government.” American Journal of Political Science. 36:857– 76. Wegman, Edward J. and Ian W. Wright. 1983. “Splines in Statistics.” Journal of the American Statistical Association. 78:351–365. Weisberg, Sanford. 1985. Applied Linear Regression. New York: Wiley. 51 Western, Bruce. 1995. “Concepts and Suggestions for Robust Regression Analysis.” American Journal of Political Science. 39:786–817. White, Halbert. 1992. Artificial Neural Networks: Approximation and Learning Theory. Oxford: Basil Blackwell. Woolhouse, W.S.B. 1870. “Explanation of a new method of adjusting mortality tables, with some observations upon Mr. Makeham’s modification of Gompertz’s theory.” Journal of the Institute of Actuaries. 15:389– 410. 52