Getting the Mean Right is a Good Thing: Generalized Additive Models[1]

Nathaniel Beck[2] and Simon Jackman[3]

January 31, 1997

[1] We thank Michael Dimock, Gary Jacobson, Gary King and John Oneal for their data, Ari Zentner for research assistance, and Richard Carson, Peter Hall, Trevor Hastie, Gary King, Robert Kohn, Jonathan Nagler and Adrian Pagan for helpful discussions. The analysis of the democratic peace draws on material coauthored with Richard Tucker. We are grateful to Jonathan Nagler, the director of the Social Science Computing Facility at the University of California, Riverside, and Kolin Hand, its systems programmer, for allowing us to use their computing facility. Errors and omissions remain our own responsibility.

[2] Department of Political Science, University of California, San Diego, La Jolla, CA 92093. e-mail: [email protected]

[3] Department of Political Science and Reshaping Australian Institutions Project, Division of Economics and Politics, Research School of the Social Sciences, Australian National University, Canberra, ACT 0200, Australia; on leave from Department of Political Science, Stanford University. e-mail: [email protected]

Contents

Abstract
1 Introduction
2 Generalized Additive Models
  2.1 Interpreting GAMs
  2.2 Contrasting other non-parametric approaches
3 Scatterplot Smoothing
  3.1 Smoothing by local regression (loess)
  3.2 Cubic smoothing splines
  3.3 Inference
      Linear Smoothers
      Equivalent degrees of freedom
      Standard Errors
      Pointwise versus Global Confidence Intervals
      Specification Tests
      Asymptotic Properties
  3.4 Estimating GAMs
  3.5 Example — Pennsylvania State Senate data
  3.6 Bandwidth selection and the bias-variance tradeoff
4 GAMs in practice
  4.1 The effect of Perot on Congressional races: GAMs as an alternative to multiple regression
  4.2 Overdrafts and the Congressional Vote: Assessing Transformation
  4.3 Do governments face rising hazards? — an analysis in discrete, but smooth, time
  4.4 The democratic peace: monadic or dyadic
5 Conclusion
Appendices
  A Will GAMs find spurious non-linearities?
  B Software for GAMs
References

Abstract

Theory: Social scientists almost always use statistical models positing the dependent variable as a linear function of X, despite suspicions that the social and political world is not so parsimonious. Generalized additive models (GAMs) permit each independent variable to be modelled non-parametrically while requiring that the independent variables combine additively, striking a sensible balance between the flexibility of non-parametric techniques and the ease of interpretation and familiarity of linear regression.
GAMs thus offer social scientists a practical methodology for improving on the extant practice of "linearity by default". Method: We present the statistical concepts and tools underlying GAMs (e.g., scatterplot smoothing, non-parametrics more generally, and accompanying graphical methods), and summarize issues pertaining to estimation, inference, and the statistical properties of GAMs. Monte Carlo experiments assess the validity of tests of linearity accompanying GAMs. Re-analysis of published work in American politics, comparative politics, and international relations demonstrates the usefulness of GAMs in social science settings. Results: Our re-analyses of published work show that GAMs can extract substantive mileage beyond that yielded by linear regression, offering novel insights, particularly in terms of modelling interactions. The Monte Carlo experiments show there is little danger of GAMs spuriously finding non-linear structures.[1]

[1] All data analysis, Monte Carlo experiments, and statistical graphs were generated using S-PLUS, Version 3.3 (Statsci 1995). The routines and data are available at ftp://weber.uscd.edu/pub/nbeck/gam.

1 Introduction

Social scientists make many assumptions in the course of employing statistical models. Linearity must rank as one of the more ubiquitous assumptions, yet one that typically gets fairly short shrift. For instance, while econometrics texts invariably make the linear functional form their first assumption (e.g., Greene 1993, 143), the consequences of violating later assumptions garner far more space. This strikes us as odd, given that the optimality properties of least squares and maximum likelihood estimators depend upon the researcher having specified the correct functional form for the mean of the dependent variable; understanding the consequences of heteroskedasticity, for example, seems to us a second-order problem if the model parameters tap the wrong relationship to start with.

To be sure, introductory texts go on to show how a variety of specific non-linear functional forms can be estimated in the linear framework via various transformations (logarithmic, polynomial, reciprocal). But little attention is paid to how analysts might come to know the correct functional form relating the independent variables to the mean of the dependent variable. Simple non-linear forms, such as polynomials or logarithmic transformations, often appear as clumsy attempts to capture subtleties; as we argue below, these tricks of the trade impose "global solutions" on what are typically "local problems". And in practice, these simple non-linear functional forms are often difficult to estimate with much precision, falling prey to problems of multicollinearity.

The sources of this disparity in emphasis are fairly clear. First, social and political theories rarely suggest a specific functional form. We might occasionally encounter arguments for higher order terms or non-linear transformations; examples include returns to scale, diminishing returns, spatial contagion, cyclical patterns in time series data, non-linear constraints imposed by working with rates or proportions data, and non-monotone survival data. But if a priori non-linearities are rare, then rarer still is guidance about the specific form of these non-linearities. The more striking point here is not the absence of theories implying non-linearity or non-monotonicity. Rather, few social scientific theories offer any guidance as to functional form whatsoever.
Statements like "y increases with x" (monotonicity) are as specific as most social theories get. We are not arguing that non-linearities, non-monotonicities, and other seemingly complicated phenomena necessarily abound; we are saying that we do not know one way or the other. Our theories are typically silent on the issue, and empirical social scientists are neither trained nor encouraged to consider non-linearity a compelling issue.

Nonetheless, we do not see the work-a-day regression-runner as a dupe. Most of us understand that the linear regression model prevails in spite of suspicions that the world is not linear and additive. As we tell our students, linear regression prevails because it is simple — "parsimonious", easily estimated, and easily interpreted — and, with perhaps a hint of embarrassment, it seems plausible in the absence of any theoretical story suggesting a specific functional form. The truth of the matter is that there are seldom any prior expectations as to the functional form of E(y) = g(X). Moreover, it hardly seems persuasive to use the absence of theoretical arguments about non-linearity to buttress use of the linear regression model. "Linearity by default" seems an unduly narrow methodological practice in the face of weak or vague theoretical expectations, and given that alternative models can be readily implemented.

2 Generalized Additive Models

How then to take non-linearity seriously? Here we survey a regression-like model that directly confronts the possibility of non-linearity: generalized additive models (GAMs). These models and the ideas that underlie them have received considerable attention in the statistics literature, but are yet to percolate into the social sciences. Additive models re-cast the linear regression model

  y_i = α + ∑_{j=1}^{k} β_j X_{i,j} + ε_i    (1)

by modelling y as an additive combination of arbitrary univariate functions of the independent variables and a zero mean, independent and identically distributed stochastic disturbance:

  y_i = α + ∑_{j=1}^{k} g_j(X_{i,j}) + ε_i    (2)

where E(ε_i) = 0 and var(ε_i) = σ², i = 1, ..., n. Note that no distributional assumptions about the ε_i are necessary until the analyst moves from estimation to inference (hypothesis testing, constructing confidence intervals, etc.). Here we focus on two types of g_j(·): locally weighted regression smoothers and cubic smoothing splines. Several other classes of g_j have been considered in the literature, but our focus on these two smoothers does not unduly restrict the generality of our discussion.

Generalized additive models extend the framework in equation (2) in precisely the same way that generalized linear models (GLMs) (McCullagh and Nelder 1989) extend the linear regression model so as to accommodate qualitative dependent variables. A detailed description of the GLM framework is beyond our scope here; we note that the GLM framework provides a useful syntax and conceptual framework for modelling qualitative dependent variables. Further details as to how GAMs draw on the GLM framework appear in our applications, where we present GAMs for both continuous and qualitative dependent variables. In the Appendix we show how to estimate GAMs, with an emphasis on the implementation in S and S-PLUS, which in turn makes use of the GLM approach.

2.1 Interpreting GAMs

The absence of the regression parameters β_j in equation (2) reflects an important characteristic of GAMs.
One does not obtain a set of regression parameters from a GAM, but rather estimates of g_j(X_{i,j}) for every value of X_{i,j}, written ĝ_j(X_{i,j}). These estimates are obtained using a fitting algorithm known as backfitting, which we discuss in section 3.4. The coefficient on each g_j(X_j) is set to one by construction, and so it is the ĝ_j(X_j) that tell us about the relationship between X_j and the dependent variable.[1]

[1] The fitted g_j(X_j) each have zero mean by construction. Without this constraint the intercept in a GAM is unidentified; i.e., any of the fitted g_j(X_j) could be shifted by some arbitrary constant, accompanied by an offsetting shift in the intercept, and the resulting set of estimates would fit the data equally well.

In many applications we will assume that some of the predictors have linear effects, as in the standard regression setup. Thus we will typically estimate a semi-parametric model

  y_i = α + ∑_{l=1}^{m} Z_{i,l} β_l + ∑_{j=1}^{k} g_j(X_{i,j}) + ε_i.    (3)

The estimated non-parametric component of a GAM can be thought of as estimates of the functions transforming each X_j (j = 1, ..., k) so as to maximize the fit of their additive combination to the dependent variable, subject to constraints on the smoothness of the ĝ_j(X_j). The actual values of ĝ_j(X_j) are not substantively meaningful per se; what matters is the shape of the fitted functions. For this reason, graphical methods are used to interpret the non-parametric component of a GAM. A plot of X_j versus ĝ_j(X_j) reveals the nature of any estimated non-linearities in the relationship between X_j and the dependent variable, holding constant the other components in the model. Standard errors and confidence regions can be calculated and plotted about ĝ_j(X_j), providing a guide as to whether the fitted function is distinguishable from a linear fit, or increasing or decreasing in X_j (see section 3.3). While it may seem easier to examine tables of regression coefficients rather than plots of the ĝ_j, this ease is only obtained at the cost of (possibly) unwarranted, restrictive, and unnecessary assumptions of linearity. On the other hand, additivity ensures that the effects of each of a GAM's predictors can be interpreted net of the effects of the other predictors, just as in linear regression.
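To fix ideas, the following is a minimal sketch of how a semi-parametric model like equation (3) is specified in the S-style syntax of Hastie and Tibshirani's gam() function (built in to S-PLUS; provided in R by the gam package). The data frame and variable names here are hypothetical stand-ins, not any of the data analyzed below.

  library(gam)   # Hastie and Tibshirani's GAM implementation (R); built in to S-PLUS

  # y regressed on a linear term (z), a cubic smoothing spline in x1 (4 equivalent df),
  # and a loess smooth in x2 (span = 0.5):
  fit <- gam(y ~ z + s(x1, df = 4) + lo(x2, span = 0.5), data = dat)

  summary(fit)          # coefficients on linear terms, equivalent df, approximate tests
  plot(fit, se = TRUE)  # one panel per term: each fitted g_j with pointwise +/- 2 se bands

The plot() call produces exactly the kind of term-by-term graphical summaries just described.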
2.2 Contrasting other non-parametric approaches

Non-parametric statistics is an extremely large and active area of statistical research, providing many alternatives to the linear, additive regression model. Given this panoply of methods, our focus on GAMs warrants some justification.

GAMs seem to strike a sensible compromise between ease of interpretation and flexibility. Other alternatives to linear regression, like neural nets (White 1992; Ripley 1993), projection pursuit (Friedman and Stuetzle 1981), or alternating conditional expectations (Breiman and Friedman 1985; DeVeaux 1990), strike us as sacrificing ease of interpretation in order to better fit the data. For instance, projection pursuit attempts to find orthogonal projections through the space of the data that maximize model fit, but substantively interpreting the projections seems difficult. Alternating conditional expectations (ACE) finds optimal transformations of the X and y variables that maximize fit to the model, but again, the resulting fit strikes us as hard to interpret. It is often difficult to decide when these techniques are overfitting the data, a problem that is especially pressing for projection pursuit and neural networks, which can approximate any continuous function.[2]

[2] These methods are very powerful if we simply want good predictive power, as in pattern recognition, a problem seldom encountered in political science. Few political scientists are trying to decipher hand-written zip codes for the post office (e.g., Ripley 1993, 16).

These other non-parametric alternatives are also difficult to interpret in the natural units of any substantive research problem. For instance, the analogue of equation (2) for ACE is

  θ(y_i) = ∑_{j=1}^{k} φ_j(X_{i,j}) + ε_i,    (4)

while that for projection pursuit is

  y_i = ∑_{l=1}^{q} φ_l(a_l′ X_i) + ε_i.    (5)

Thus even if we had a good way to choose the φ_j, they would be interpretable only as the effect of an independent variable on a transformation of the dependent variable (for ACE), or as the effect of some linear combination of the independent variables on the dependent variable (for projection pursuit).[3] In addition, neither of these techniques allows for interactions among the independent variables. While the basic GAM framework is additive, we shall see in section 4.4 that GAMs provide a very flexible method for estimating models with pairwise interactive terms, far and away the most common type of interaction studied in political science.

[3] The ACE transformation of the dependent variable, θ(y_i), causes problems because the only restriction on it is that it be continuous. Thus there is no way to untransform, as there would be, say, for a simple logarithmic transformation; θ⁻¹ is generally not even defined! As a consequence it is very difficult to interpret estimates in natural units. Note that complicated non-linear transformations of the independent variables cause no interpretative problems; there is no need to invert these transformations.

GAMs should also be distinguished from semi-parametric methods which keep the assumption of linearity but allow the error term to be modelled non-parametrically (e.g., Western 1995). While this approach is valuable, we believe that "getting the mean function right" is more important than correct modelling of the error process. It is possible to combine the two approaches. In particular, it is easy to estimate GAMs using Huber's (1981) "M-estimates", which allow for platykurtic error distributions. Our focus here is on the mean function, and so we do not pursue M-estimators for GAMs further. We also distinguish two classes of flexible parametric approaches: Box-Cox transformations (Greene 1993, 329–34) and parametric splines (Greene 1993, 235–8). Neither of these has proven popular in political science, and neither is as flexible as the GAM.

As equation (2) shows, GAMs retain the additive property of the familiar regression model, and so can be interpreted "variable-by-variable", unlike projection pursuit or most neural nets. Transformations of variables are restricted to GAMs' right-hand side variables, again making substantive interpretation relatively easy. The results of a GAM, the ĝ_j(X_{i,j}), can be easily interpreted in the natural units of the research problem; plots of the independent variables against their smoothed fits to the dependent variable are the non-parametric counterpart of standard regression lines. GAMs speak to substantive research questions just as directly as everyday regression, but without imposing the assumption of linearity.
Furthermore, the specific transformations of right-hand side variables employed by GAMs have well-understood properties. These transformations are variable-specific, and the amount of smoothing is typically set in advance by the analyst, via calibration parameters that have straightforward interpretations (e.g., the "equivalent degrees of freedom" consumed by each variable-specific smoother; see below). And because GAMs are additive, the familiar linear regression model remains a convenient benchmark against which to assess how various GAMs fare on the tradeoff between predictive power ("getting the mean right") and substantive meaning. The methodology for assessing this tradeoff in the GAM context is for all practical purposes identical to that used to compare different linear regression models (see section 3.3).

GAMs seem to us the most convenient way of buying into what non-parametric approaches have to offer without making interpretation too difficult. With linear regression at one extreme and neural networks at the other, GAMs lie much closer to the former than the latter. We concede that there are research settings where neural nets or ACE or projection pursuit are to be preferred; but our belief at this point is that such situations are rare in political science. In short, GAMs provide much of the flexibility of non-parametric methods while allowing the analyst to keep a close eye on what the statistical procedures are actually doing to the data.

3 Scatterplot Smoothing

Like most non-parametric techniques, the statistical theory and accompanying mathematics for GAMs are complex, and we refer readers to the discussion in Hastie and Tibshirani (1990, ch. 5). However, most of the key intuitions about GAMs follow from ideas having to do with bivariate scatterplot smoothing.

Smoothing is an important tool for non-parametric regression, addressing one of the simplest yet most fundamental questions in data analysis: "what is our best guess of y, given x?" In the most general setting, a smoother yields a series of data reductions or summaries of y, each specific to (possibly overlapping) regions of the space of the predictors. The moving average or moving median is a frequently encountered smoother, summarizing a single variable with means or medians specific to overlapping time windows. The resulting summary series exhibits less variability than the raw series, again highlighting that smoothers are tools for data reduction. GAMs make use of smoothing, but subject to the constraint that the smoothed predictors combine additively in modelling y (see the discussion in section 2.2).

To define scatterplot smoothing, let x = (x_1, ..., x_n)′ be the observations on an independent variable and let y = (y_1, ..., y_n)′ be the observations on a dependent variable. Assume that the data are sorted by x. A scatterplot smoother takes x and y and returns a function ĝ(x) = ŷ.

The amount of smoothing in the moving-average time-series example is controlled by the number of time periods that are averaged or, more generally, by the size of the neighborhood or bandwidth over which the averaging is done. The larger this neighborhood, the more smoothing, and the more data reduction. For instance, consider taking the entire sample as the neighborhood. A moving average in this case is simply the sample mean of the observations, reducing the entire series to a constant.
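A minimal sketch (our own illustration, in S-style code with made-up data) makes the link between bandwidth and data reduction concrete:

  # Running-mean smoother: average the y values whose x values fall within a
  # symmetric window of half-width h around each target point.
  running.mean <- function(x, y, h) {
    sapply(x, function(x0) mean(y[abs(x - x0) <= h]))
  }

  set.seed(1)
  x <- sort(runif(100))
  y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

  g.narrow <- running.mean(x, y, h = 0.05)  # narrow bandwidth: jagged, faithful to local structure
  g.wide   <- running.mean(x, y, h = 0.50)  # wide bandwidth: smoother, more data reduction
  # With h large enough to span the whole sample, every fitted value is simply mean(y).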
Ordinary least squares (OLS) regression is a special case of a smoother, providing an infinite amount of smoothing.[4] This is because in OLS regression the relationship between y and x is taken to be global, not local, and hence the smoothing neighborhood is the entire dataset; the same relationship between x and y holds over all plausible values of x. As we show below, most scatterplot smoothers yield estimated g functions that approach a linear regression line as the amount of smoothing increases. At the other extreme, as the bandwidth gets smaller, the amount of smoothing decreases, and the estimated g function degrades to a function that simply interpolates the data, resulting in no data reduction at all.

[4] Smoothness here is defined as the inverse of the second derivative of the fitted function. Since linear regression fits straight lines, and straight lines have second derivatives that are zero, linear regression is considered to be infinitely smooth.

Our uses of GAMs rely extensively on two popular smoothers: loess (Cleveland and Devlin 1988), a local regression smoother, and cubic smoothing splines. A large body of statistical research establishes that these smoothers have many desirable properties. We summarize some of this research below, but refer interested readers to the reviews in Hastie and Tibshirani (1990), Goodall (1990), and Hastie and Loader (1993).

3.1 Smoothing by local regression (loess)

Given a target point x_0, loess yields ŷ|x_0 = ĝ(x_0) as follows (Cleveland 1993, 91–101):

1. Identify the k nearest neighbours of x_0, i.e., the k elements of x closest to x_0. This set is denoted N(x_0). The analyst controls k via a "span" argument, which defines the size of the neighborhood in terms of a proportion of the sample size: i.e., k = span × n.

2. Calculate Δ(x_0) = max_{N(x_0)} |x_0 − x_i|, the distance of the near-neighbor most distant from x_0.

3. Calculate weights w_i for each point in N(x_0), using the following tri-cube weight function:

     w_i = W( |x_0 − x_i| / Δ(x_0) )    (6)

   where

     W(z) = (1 − z³)³  for 0 ≤ z < 1;  W(z) = 0 otherwise.    (7)

   Note that W(x_0) = 1, and that the weights decline smoothly to zero over the set of nearest-neighbors, such that W(x_i) = 0 for all non near-neighbors. The use of the tri-cube weight function here is somewhat arbitrary; any weight function that has smooth contact with 0 on the boundary of N(x_0) will produce a smooth fitted function (Cleveland, Devlin and Grosse 1988, 112).

4. Regress y on x and a constant (for local linear fitting), using weighted least squares (WLS) with weights w_i as defined in the previous step. Quadratic or cubic polynomial regressions can also be used in these local regressions (Fan and Gijbels 1996), or even mixtures of low-order polynomials (Cleveland and Loader 1996).

5. The smoothed value ĝ(x_0) is the predicted value from the WLS fit at x_0.
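These steps translate almost line-for-line into code. The sketch below is our own illustrative implementation of the local linear fit at a single target point (it is not Cleveland's production algorithm, and assumes distinct x values):

  # Local linear regression at a target point x0, following steps 1-5 above.
  loess.at <- function(x0, xs, ys, span = 0.5) {
    k     <- ceiling(span * length(xs))          # step 1: size of the near-neighborhood
    d     <- abs(xs - x0)
    nbrs  <- order(d)[1:k]                       # indices of the k nearest neighbors
    delta <- max(d[nbrs])                        # step 2: distance to the furthest neighbor
    w     <- numeric(length(xs))                 # step 3: tri-cube weights, equations (6)-(7)
    w[nbrs] <- (1 - (d[nbrs] / delta)^3)^3
    fit   <- lm(ys ~ xs, weights = w)            # step 4: local weighted least squares
    predict(fit, newdata = data.frame(xs = x0))  # step 5: fitted value at x0
  }

  # The full smooth is obtained by sweeping the target point across the data:
  # g.hat <- sapply(x, loess.at, xs = x, ys = y, span = 0.5)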
Local regression can also be applied beyond the two-dimensional setting encountered in scatterplot smoothing. The smoothing algorithm presented in equations (6) and (7) generalizes readily to the case where y is modelled as a function of two or more explanatory variables. Consider applying loess to the model

  y_i = g(x_i, z_i) + ε_i.    (8)

A target point in the two-dimensional space of the predictors can be denoted (x_0, z_0), about which we can define a set of nearest-neighbors, again defined as a pre-specified proportion of the data. The Euclidean distances of near-neighbors from the target point are passed to the tri-cube kernel in equation (7), producing a set of weights for the (local linear) regression of y on x, z, and a constant. This fitted regression is then used to generate ŷ|(x_0, z_0) = ĝ(x_0, z_0). This is a very useful feature of loess, which we exploit in one of the applications below. In particular, the additive predictor g(x, z) in a GAM provides a way of capturing interaction effects non-parametrically, rather than simply having x and z condition the linear effects of each variable as in least squares regression.

Local regression is an excellent paradigm for smoothing. The use of nearest-neighborhoods lets the smoother adapt to local fluctuations in the density of x, while not necessarily sacrificing smoothness. In this sense loess can be understood as a tri-cube kernel scatterplot smoother, adapting to local fluctuations in the density of x. The combination of three features — nearest neighbors, a smooth weight function (the tri-cube kernel), and forming ŷ|x_0 via locally weighted regressions — helps local regression outperform many other scatterplot smoothers (e.g., moving averages, overlapping regressions, and so on).[5]

[5] Useful summaries of these properties and other strengths of local regression appear in Hastie and Loader (1993) and Cleveland and Loader (1996). According to the latter paper, local regression was first used as early as 1829 by Danish actuaries, but the first published reference appears to be Woolhouse (1870).

In particular, the adaptive features of local regression stand in marked contrast with smoothing by fixed kernels, which until recently were considered a good basis for smoothing (Hastie and Loader 1993). Fixed kernel smoothers summarize or interpolate data within a symmetric neighborhood of a target point x_0, with the data about the target point weighted according to a probability density function (e.g., the normal probability density function, with the target point as the mean and variance proportional to the bandwidth), ignoring the observed density of the data in that region.[6] Accordingly, fixed kernels potentially give biased estimates of E(y|x_0), especially around the endpoints of the observed data (Hastie and Tibshirani 1990, 31). In section 3.3 we show how local regression attains greater asymptotic efficiency than fixed kernel smoothers.

[6] Of course, if the realizations of x are equally spaced, then the adaptive features of local regression are irrelevant. In this special case fixed-kernel smoothing and nearest-neighbor approaches coincide (at least in the interior of the data), but such a fortuitous distribution of x values rarely occurs in practice. See Fan and Marron (1993, 129–130).

3.2 Cubic smoothing splines

Cubic smoothing splines are another popular choice for scatterplot smoothing and fitting GAMs. This smoother arises as the solution to the following optimization problem: among all functions g(x) with continuous first and second derivatives, find the one that minimizes the penalized residual sum of squares

  PRSS = ∑_{i=1}^{N} [y_i − g(x_i)]² + λ ∫_a^b [g″(t)]² dt = RSS + roughness penalty,    (9)

where λ is a fixed constant, and a ≤ x_1 ≤ ... ≤ x_N ≤ b (Hastie and Tibshirani 1990, 27). The first part of this expression is the familiar residual sum of squares (RSS), measuring fidelity to the data. The second term is weighted by the constant λ, and penalizes "roughness" or "flamboyancy" (Good and Gaskins 1971) in the target function. In this way λ is analogous to the span parameter in loess; i.e., higher values of λ result in smoother fits. As λ → ∞, the roughness penalty dominates the criterion in (9), forcing g″(x) → 0 and yielding the least squares regression line as the solution function (since the second derivative of a straight line is zero).
Conversely, as λ → 0, the roughness penalty disappears, allowing the RSS component of equation (9) to dominate; the resulting g(x) tends to a function that simply interpolates the data.

The g(x) that minimizes the criterion in (9) is a piecewise cubic polynomial, known as a cubic spline function, with knots (join points) at each of the interior data points (Schoenberg 1964; Reinsch 1967):[7] i.e.,

  s(x_0) = β_0 + β_1 x_0 + β_2 x_0² + β_3 x_0³ + ∑_{i=2}^{n−1} θ_i (x_0 − x_i)³₊    (10)

where the subscript on the last term denotes the positive part of x_0 − x_i.

[7] Splines are a widely used tool for smoothing and interpolating data and have non-statistical applications in settings such as architecture and engineering. de Boor (1978) is a useful general introduction to a large literature on splines. Wegman and Wright (1983) provide a survey of statistical applications, and Eubank (1988) and Green and Silverman (1994) are book-length treatments of non-parametric regression based on splines. Cubic smoothing splines should not be confused with parametric natural splines, where the analyst chooses the number and placement of knots. In the absence of guidance as to where to place the knots, most analysts locate a small number of knots uniformly over the data. Unlike smoothing splines, there seems to be no simple way to trade off parsimony and smoothness with natural splines.

It is well known that a polynomial of order n − 1 will exactly fit n data points (e.g., Lancaster and Šalkauskas 1986, 30), but the resulting function is almost guaranteed not to be smooth, nor to resemble any social or political process. Smoother functions result from fitting successively lower-order polynomials, with the infinitely smooth least squares regression line resulting when the fitted polynomial is of order one. But experimenting with the order of a polynomial fit to the entire data is not an especially elegant or efficient approach to smoothing. If the data exhibit localized fluctuations in the mean function, then a reasonably high-order polynomial may be required to do a fair job of fitting the data. This approach can consume degrees of freedom rapidly, and while the fitted polynomial may capture some local fluctuations in the mean well (and even relatively smoothly), it can fit quite poorly elsewhere. In short, using polynomials to smooth data is a global solution to a local problem.

Smoothing splines like that in equation (10) differ from globally-fitted polynomials in that the piecewise terms let the fitted function respond to local changes in the mean of the data. The combination of the global cubic polynomial with the piecewise cubic terms also ensures that the fitted spline function has a continuous second derivative at the knots. In part, the previous sentence merely restates that functions of the form in equation (10) solve the optimization problem in equation (9). But when we graphically inspect the resulting spline function, the continuity constraint on the second derivatives means that the fitted function also looks smoother. This seems to be because "our eyes are skilled at picking up second and lower order discontinuities, but not higher" (Hastie and Tibshirani 1990, 22–23).[8]
[8] Hastie and Tibshirani (1990, Figure 2.6) display a series of spline functions demonstrating this point. Spline functions with discontinuities in the fitted function and its first two derivatives simply "don't look smooth".

Not only do cubic smoothing splines provide a way of formally trading off smoothness ("parsimony") for fidelity to the mean function, but the operationalization of smoothness here jells with how we tend to perceive smoothness graphically.

It first appears that equation (10) cannot be estimated, since there are n − 2 parameters on the piecewise cubics (the θ_i), plus four parameters for the global cubic polynomial, yielding more parameters than data points. But the constraint imposed by the smoothing parameter λ in the roughness penalty term drastically reduces the "effective number of parameters". A more detailed discussion appears in Hastie and Tibshirani (1990, 27–29), but the underlying idea is that the pre-specified value of λ governs how sensitive the fitted function is to the piecewise components versus the parameters fitted globally. As the smoothness constraint becomes more binding, the estimates of the θ_i become highly inter-dependent and are driven towards zero. As λ gets larger still, all but two parameters in equation (10) are driven to zero, the exceptions being β_0 (the constant) and β_1 (the globally-fitted parameter on the linear term); i.e., least squares results when an infinitely smooth cubic spline is fit to the data.

While smoothing splines attack the scatterplot smoothing problem differently from local regression, they nonetheless work the same way as any other smoother, producing a more or less smooth fit to a scatterplot by controlling the sensitivity of the fitted function to local fluctuations in the mean of the data. And in fact, it is hard to know when local regression or cubic splines should be preferred to one another: "The smoothing spline and local regression methods appeal to rather different intuitive interpretations, and it is unlikely that either will have universally dominant performance" (Hastie and Loader 1993, 123). Because loess gives no weight to observations outside the set of nearest-neighbors in forming a local estimate of E(y), it is more robust against outlying values on X than smoothing splines (Cleveland 1979).

Our experience is that smoothing splines produce fewer small non-monotonicities than local regression. Thus where we are reasonably confident that local monotonicity in the mean function is correct, we use smoothing splines. For example, in our analysis of the effect of the Congressional overdraft scandal (section 4.2), we would not expect small increases in the number of overdrafts to help an incumbent, and so we use smoothing splines rather than local regression; the latter produces some small bumps that make no theoretical sense. But we also note that the general shape of the fits produced by local regression and smoothing splines is quite similar; we would worry about the use of either smoother if they revealed qualitatively different results.
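This kind of comparison is easy to carry out in practice. A minimal S-style sketch (ours; x and y stand for whatever scatterplot is at hand) overlays the two smoothers:

  plot(x, y)
  lines(lowess(x, y, f = 2/3), lty = 1)     # local regression fit (Cleveland's lowess)
  fit.ss <- smooth.spline(x, y, df = 4)     # cubic smoothing spline with 4 equivalent df
  lines(predict(fit.ss, sort(x)), lty = 2)
  # If the two curves suggest qualitatively different conclusions,
  # neither should be trusted without further investigation.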
In sum, the differences in the ways local regression and cubic smoothing splines operate are generally overwhelmed by choices as to how much smoothing to do with a given brand of smoother; put differently, "within smoother" variation seems to dominate "across smoother" variation.[9]

[9] One caveat is that local regression generalizes to higher-dimensional settings more readily than spline functions. For examples of fitting splines in two or more dimensions, see Green and Silverman (1994, ch. 7) or Wahba (1990, section 2.4).

3.3 Inference

Linear Smoothers

Local regression and cubic smoothing splines are both linear smoothers, with relatively well-understood statistical properties. Scatterplot smoothers proceed by averaging values of y about some target point x_0; in particular, a linear smoother estimates g(x_0) by

  ĝ(x_0) = ∑_{j=1}^{N} S_λ(x_0, x_j) y_j    (11)

where x_j and y_j are elements of the data vectors x and y, respectively, x_0 is the target point, and S_λ is a weight function parameterized by the smoothing parameter λ. Letting the smoother produce ĝ(x_i) for each i yields S, an n by n smoother matrix. Each row of S contains the weights used in generating the smoothed fit at each data point; in other words, the jth row of S is the vector of weights each observation picks up in contributing to the fitted smooth at x_j, sometimes referred to as the equivalent kernel. Using this notation a linear smoother can be written more compactly as

  ĝ = S_λ y,    (12)

where ĝ is the n by 1 vector of estimates of E(y|x) = g(x), conditional on the pre-specified smoothing parameter λ (Hastie and Tibshirani 1990, 44). Linear smoothers are thus linear in precisely the same way that the predicted values from a least squares regression are a linear function of the dependent variable, conditional on the data: i.e., for least squares linear regression, ŷ = Xβ̂ = Hy, where H = X(X′X)⁻¹X′ is the well-known "hat matrix", transforming the vector of observed responses y into the fitted responses ŷ (Weisberg 1985, 109–111). The smoother matrix S has many of the same properties as the hat matrix. In fact, H is a special case of a smoother matrix, recalling that least squares regression can be thought of as an infinitely smooth scatterplot smoother.

The representation of linear smoothers in equation (12) is extremely convenient. As the bandwidth of a smoother gets wider, off-diagonal elements of S get larger, as data points further away from the target point x_0 have greater influence in forming ĝ(x_0). At the other extreme, in the limiting case of a smoother that simply interpolates the data, S is a diagonal matrix (specifically, an identity matrix).

Equivalent degrees of freedom

This inverse relationship between the amount of smoothing and the extent to which S is diagonal gives rise to a very useful approximation for linear smoothers. It can be shown that the degrees of freedom consumed by a linear smoother S are well approximated by the trace (sum of the diagonal elements) of S (Hastie and Tibshirani 1990, 52–55 and Appendix 2). Recall that for least squares linear regression, the trace of the hat matrix is equal to the degrees of freedom consumed by the regression's predictors (including an intercept, if present). This quantity, tr(S), is referred to as the equivalent degrees of freedom in the literature on smoothing and GAMs. More precisely, tr(S) − 1 is an estimate of the degrees of freedom consumed by the non-parametric component of the smooth fit, since all scatterplot smoothers fit an intercept term.
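Because a linear smoother's fit is just Sy, the smoother matrix can be recovered column-by-column by smoothing the unit vectors, holding x and the smoothing parameter fixed; its trace then gives the equivalent degrees of freedom and, once an estimate of the error variance is in hand, the same matrix yields pointwise standard errors (see the Standard Errors discussion below). A sketch of this calculation, ours, using R/S-PLUS's smooth.spline() and assuming distinct x values:

  # Recover the smoother matrix S by smoothing each unit vector e_j,
  # then approximate the degrees of freedom consumed by the smoother with tr(S).
  n <- length(x)
  S <- matrix(0, n, n)
  for (j in 1:n) {
    e <- numeric(n); e[j] <- 1
    S[, j] <- predict(smooth.spline(x, e, spar = 0.8), x)$y   # j-th column of S
  }
  edf <- sum(diag(S))                                         # equivalent degrees of freedom, tr(S)

  g.hat  <- predict(smooth.spline(x, y, spar = 0.8), x)$y     # smoothed values, S y
  sigma2 <- sum((y - g.hat)^2) / (n - edf)                    # rough estimate of the error variance
  se     <- sqrt(diag(S %*% t(S)) * sigma2)                   # pointwise standard errors

The equivalent degrees of freedom should match (up to numerical differences) the df component reported by smooth.spline() itself.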
Figure 1 displays how the equivalent degrees of freedom consumed by a scatterplot smoother increase as the amount of smoothing decreases. The exact nature of this tradeoff is data-dependent, but the relationship depicted in Figure 1 is quite typical. Jagged fits consume more degrees of freedom; conversely, least squares is an infinitely smooth fit to the data.

Figure 1: Equivalent degrees of freedom, local linear regression (loess) and cubic smoothing splines. The amount of smoothing performed by loess is controlled by the span parameter (left panel), while λ weights the roughness penalty for cubic smoothing splines (right panel). Smoothness increases as both smoothers tend towards the least-squares regression fit (df = 2).

Standard Errors

The representation of the smoothing problem in (12) also allows a simple characterization of the variance-covariance matrix of the fitted smoothed values. Given the error variance σ²,

  cov(ĝ) = S S′ σ².    (13)

With an estimate of σ² appropriate to a particular application, pointwise standard errors are equal to the square roots of the diagonal elements of S S′ σ̂². With a distributional assumption (e.g., normal errors), confidence intervals can be constructed about the fitted smooth using the pointwise standard errors. Examples appear in Figure 4 and in the applications below.

Pointwise versus Global Confidence Intervals

It should be noted that pointwise standard errors summarize uncertainty about the location of a single fitted point, say ŷ_i|x_i. Uncertainty about the entire fitted function is harder to characterize, since it simultaneously involves the variances of each of the ŷ_i and their covariances (the off-diagonal elements of S S′ σ², which are non-zero for any non-degenerate smoother). The upshot is that pointwise confidence intervals do not necessarily capture all of the uncertainty as to where the smoothed fit actually lies.

About the only feasible method for characterizing the distribution of the entire fitted function ĝ(x) is a computationally intensive re-sampling scheme; i.e., under the assumption of normal errors, repeatedly sample the vector g from an n-dimensional normal distribution with mean vector ŷ = ĝ(x) and variance-covariance matrix S S′ σ̂², and plot each sampled function. Since the resulting scatterplot quickly gets cluttered, one approach is to display only "extreme" or "interesting" sampled functions. To construct a "confidence set" for the fitted function, one can reject sampled functions that exceed a critical value on some performance criterion (e.g., Hastie and Tibshirani 1990, 62–4). But imagine plotting the "shadow" or "projection" of the fitted function's n-dimensional 95% confidence set onto a two-dimensional scatterplot; it can be shown that the bounds of this shadow need not be wider than the bounds on the corresponding pointwise confidence interval (Hastie and Tibshirani 1990, 62). Our experience echoes this result. We find that the pointwise confidence intervals provide a very good approximation to the shape of the confidence set of the fitted function, and at far less cost. The shaded confidence intervals we display around fitted smooths are pointwise confidence intervals.

Specification Tests

Specification tests for scatterplot smoothers and GAMs can be performed under the same conditions as for linear regression or GLMs (i.e., large samples or normal errors). In particular, for a GAM with a continuous dependent variable, the sum of squared residuals can be used in a standard F test. The statistic for this test is computed identically to the usual linear regression F statistic. While the distribution of this statistic is only approximately F, simulation results indicate that the approximation is reasonable (Hastie and Tibshirani 1990, 67). In the Appendix we report new simulation results which indicate that the F approximation is generally good, as well as factors which appear to make the approximation better or worse. For models with a binary dependent variable we can compute the analogue of the log-likelihood. This is, as usual, the sum of the logs of the estimated probabilities of success (for the observed successes) and failure (for the observed failures). For both tests, statistics are computed using the approximate degrees of freedom consumed by the smoother. A common use of either test is to examine whether a specification with a smooth term is superior to one with a corresponding linear term; since the latter model is nested within the former, the appropriate F or χ² test can be used here.

Asymptotic Properties

Asymptotic results for smoothers find the nearest-neighbor approach underlying local regression to be superior to the fixed-kernel approaches described in section 3.1 (Fan 1992; Fan 1993). In particular, local regression has greater asymptotic efficiency than fixed-kernel approaches. Precise characterizations of the asymptotic properties of non-parametric regression estimators are difficult to summarize concisely; here we present a useful intuition. A kernel-based local linear regression estimator evaluated with n observations has asymptotic mean-square error

  AMSE[ĝ_n(x_0)] = [asy bias]² + asy variance ≈ (1/4) c_1² h_n⁴ [g″(x_0)]² + c_2 / (n h_n f(x_0)),    (14)

where h_n is the bandwidth of the kernel, f(x_0) is the density of x at the target point x_0, and c_1 and c_2 are constants, fixed for a given kernel (Fan 1993, 199). Part of why nearest-neighbor methods outperform fixed-kernel estimators is due to their adaptive bandwidth feature: when f(x_0) decreases (i.e., the data are sparse in the vicinity of the target point), the bandwidth increases, effectively keeping asymptotic variance low.[10] Equation (14) also makes clear that smoothers perform better if the underlying target function is relatively smooth to start with, where "smoothness" here is again defined with respect to the second derivative of the target function.[11] In the limiting case of a linear target function, local linear regression results in no bias whatsoever.

[10] In this sense, the notation in equation (14) is slightly misleading, implying that h_n is fixed across x_0. A more precise notation might be h_n(x_0), indicating that bandwidth may be conditional on the particular target point.

[11] As is standard in the statistical literature, the bias component of the mean-square error is approximated with a Taylor-series expansion out to the second derivative of the target function (hence the approximation symbol in equation (14)); i.e., the expression for the bias of the non-parametric estimator is itself an estimate. A higher-order expansion could be used, but the second-order expansion is sufficient to capture the key intuition and appears standard in the literature.

Local linear fitting will yield a biased estimate of the target function if the target function has non-zero second derivatives, and in general, local regression methods will be biased if the local regression is not of the same order as the target function. Note that smoothing by local constant fitting (e.g., kernel-based averaging, including the moving average as a special case) is akin to the researcher making the rather strong assumption that the target function is constant. Local polynomials can be used when the target function is suspected to be non-linear; mixing low-order polynomials is a current area of research (Cleveland and Loader 1996).

Another important asymptotic result for our purposes is that the rate of convergence for a p-dimensional GAM is no slower than that for scatterplot smoothing (Stone 1982); accordingly, increasing the number of predictors in a GAM does not result in slower rates of convergence, as it does for other non-additive non-parametric regression models (Hastie and Tibshirani 1990, 131–132).

This said, there is a reasonably strong sense of skepticism about the asymptotic results. Despite good asymptotic properties for local regression, biases at the boundaries of the X variables can be large and dominate the performance of the estimator in finite samples. For this reason, some statisticians downplay the usefulness of asymptotic properties of smoothers, preferring to focus on the good finite sample properties of local regression (e.g., Cleveland and Loader 1996) and cubic smoothing splines.

3.4 Estimating GAMs

GAMs are usually estimated using a backfitting algorithm, a computationally efficient approach to estimating the large linear systems often implied by non-parametric multivariate regression models. Backfitting is used in other non-parametric settings (e.g., projection pursuit and ACE) and excellent descriptions of the algorithm appear in the statistics literature (e.g., Breiman and Friedman 1985). Here we simply sketch how the algorithm works in the context of GAMs; a fuller description appears in Hastie and Tibshirani (1990, ch. 5). Given the model

  E(y | X_1, X_2) = g_1(X_1) + g_2(X_2),    (15)

backfitting alternates between estimating g_1 and g_2. With an initial ĝ_1, an estimate of g_2 can be obtained by smoothing the partial residual y − ĝ_1(X_1) on X_2, and an update of ĝ_1 can be obtained by smoothing y − ĝ_2(X_2) on X_1. This process continues until all of the estimated smooth functions cease changing. More formally, estimating a GAM with p predictors involves solving the system

  [ I    S_1  ...  S_1 ] [ g_1 ]   [ S_1 y ]
  [ S_2  I    ...  S_2 ] [ g_2 ] = [ S_2 y ]    (16)
  [ ...  ...  ...  ... ] [ ... ]   [  ...  ]
  [ S_p  S_p  ...  I   ] [ g_p ]   [ S_p y ]

for the vector (g_1, ..., g_p)′. This system can be written more compactly as

  P g = Q y,    (17)

which is the GAM analog of the least squares normal equations X′X β = X′y. Continuing the analogy with least squares suggests inverting P and pre-multiplying both sides of equation (17) by P⁻¹ to yield ĝ. This is seldom an efficient computing strategy, since P is an np by np matrix, and inverting this matrix consumes computing resources on the order of (np)³ (Hastie and Tibshirani 1990, 109). Backfitting exploits the special structure of P and its component matrices, reducing the computational burden tremendously.
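To make the logic concrete, here is a stripped-down backfitting sketch of ours for the two-term model in equation (15), using R's lowess() as the scatterplot smoother; production implementations (e.g., the gam() function in S-PLUS) are considerably more refined.

  # Scatterplot smooth returning fitted values in the original observation order.
  smooth.fit <- function(x, y) {
    lw <- lowess(x, y, f = 2/3)
    approx(lw$x, lw$y, xout = x)$y
  }

  # Backfitting for y = alpha + g1(x1) + g2(x2) + error.
  backfit <- function(x1, x2, y, tol = 1e-6, maxit = 50) {
    alpha <- mean(y)
    g1 <- g2 <- rep(0, length(y))
    for (it in 1:maxit) {
      g1.old <- g1; g2.old <- g2
      g1 <- smooth.fit(x1, y - alpha - g2); g1 <- g1 - mean(g1)  # centre each g_j: identification
      g2 <- smooth.fit(x2, y - alpha - g1); g2 <- g2 - mean(g2)
      if (max(abs(g1 - g1.old), abs(g2 - g2.old)) < tol) break   # stop when the g's stop changing
    }
    list(alpha = alpha, g1 = g1, g2 = g2)
  }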
3.5 Example — Pennsylvania State Senate data

We motivate these ideas using a small data set, presented graphically in Figure 2. On the horizontal axis is the Democratic plurality recorded using voting machines at some recent elections for the Pennsylvania State Senate. On the vertical axis is the Democratic plurality recorded among absentee ballots. These data were collected to help decide a court case after a large Democratic plurality among absentee ballots decided the outcome of a special election and, in turn, partisan control of the Pennsylvania State Senate. We stress that our interest in these data is purely expository, and so the contested special election is not shown in Figure 2; Bartels (1996) re-analyses these data and draws conclusions about the substantive issue of electoral fraud.

Figure 2: Pennsylvania State Senate data (Democratic margin on voting machines versus Democratic margin among absentee ballots). The straight line is an OLS fit, and the shaded area is its 95% confidence interval, assuming independent and identically distributed (iid) normal errors.

Figure 3 shows what is going on "behind the scenes" for three fitting techniques: least squares linear regression, the local regression smoother, and cubic smoothing splines. Each panel plots the so-called equivalent kernels for the three different fitting methods at various target points in the Pennsylvania State Senate data; the equivalent kernels are simply the weights assigned to the data points in forming ŷ|x_0 = ĝ(x_0), where x_0 is the target point. As Figure 3 makes clear, the equivalent kernels depend not just on the type of smoother being used, but also on the amount of smoothing and the location of the target point. This "built-in" adaptivity gives these smoothers good properties (low bias) in sparse regions of x, and in particular at the boundaries of x.

The equivalent kernels for local regression are plotted in the two middle panels of Figure 3. Two different spans are plotted (1/4 and 3/4), meaning that 25% and 75% of the data count as near-neighbors, respectively. Observations outside this neighborhood receive zero weight. Within the neighborhood, the tri-cube kernel, defined in equation (7), largely determines how the weights are formed. Smaller spans (bandwidths) result in tighter neighborhoods, down-weighting the influence of more distant observations in forming the local data summary, relative to the wider bandwidth smoother.

Figure 3: "Equivalent kernels" for least squares linear regression (panels a and b), local linear regression (loess, panels c and d, spans 1/4 and 3/4) and cubic smoothing splines (panels e and f, df = 10 and df = 4), using two target points (observations 9 and 20) from the Pennsylvania State Senate data.
Three different scatterplot smoothers are fit to the data presented in Figure 2 and the equivalent kernels for observations 9 and 20 are graphed here; each equivalent kernel is the vector of weights the observations pick up in contributing to the fit at the specified target point. Each equivalent kernel sums to one, though individual weights are negative in some cases. Note that for least squares linear regression, every observation makes a contribution to ŷ|x_0; contrast scatterplot smoothers, where the resulting data summaries are specific to neighborhoods of x_0.

The bottom two panels of Figure 3 show the equivalent kernels for two different cubic smoothing splines fit to the Pennsylvania State Senate data. The smoother consuming ten degrees of freedom forms tighter neighborhoods around each target point than does the smoother consuming four degrees of freedom. Remember that cubic smoothing splines emerge as the solution to a constrained optimization problem, and are not set up with a probability density function or weight function as the smoother's underlying kernel. Nonetheless, like all smoothers, smoothing splines wind up differentially weighting observations around the target point, in the manner shown in Figure 3.

For least squares regression, the equivalent kernel for the jth data point is simply the jth row of the hat matrix defined in section 3.3. The equivalent kernels for linear regression (top panels) highlight that least squares is a "global" fitting technique; all observations contribute to the least squares estimate of a specific ŷ|x_0, no matter how distant from the target point. The result of exploiting all the sample information is familiar: least squares has minimum variance among the class of linear, unbiased estimators. But it is precisely this question of bias that is at issue; if least squares linear regression isn't "getting the mean right" then we might question the value of its optimality properties. Under an easily-imagined set of circumstances, we might prefer "getting the mean (more) right" to the low variance of least squares. Scatterplot smoothers and their multivariate generalization, GAMs, offer a way for analysts to manage the tradeoff between bias and variance, an issue we pursue below.

3.6 Bandwidth selection and the bias-variance tradeoff

Consider smoothing by moving averages. More temporal aggregation yields a more precise data summary (since it is drawn from a greater number of observations), reflecting the wider bandwidth of the smoother. Less temporal aggregation — a narrower bandwidth — yields a less smooth data summary, but one that is more faithful to local fluctuations in the mean of the data. But since smoothers with narrower bandwidths yield local data summaries based on fewer observations, they are inherently less precise than smoothers with wider bandwidths. In general, a wider bandwidth risks bias while shrinking variability; a narrower bandwidth purchases less bias at the cost of precision.

All smoothers wrestle with this tradeoff, via a smoothing parameter that in turn governs bandwidth. Bandwidths are sometimes estimated or derived from the data, but usually are chosen in advance and then fine-tuned by the user on a case-by-case basis. For local regression, the span parameter governs the size of the neighborhood in which data points receive non-zero weights in the local regression. For smoothing splines, the parameter λ in (9) effectively governs the amount of smoothing.
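In practice, varying the bandwidth is just a change to one argument. A fragment of ours, using R's loess() on an arbitrary scatterplot (x, y):

  fit.rough  <- loess(y ~ x, span = 0.25, degree = 1)   # narrow bandwidth: follows local fluctuations
  fit.smooth <- loess(y ~ x, span = 0.75, degree = 1)   # wide bandwidth: smoother, more precise
  p.rough    <- predict(fit.rough, se = TRUE)           # pointwise fits and standard errors
  p.smooth   <- predict(fit.smooth, se = TRUE)
  # The rough fit has larger pointwise standard errors; the smooth fit risks bias.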
Figure 4 depicts the bias-variance tradeoff, for both local regression and cubic smoothing splines, using the Pennsylvania State Senate data first presented in Figure 2. The top left panel of Figure 4 uses a relatively small span (1/4), making the fitted function more sensitive to local changes in the mean function, but with lessened precision (wider confidence intervals). In the lower left panel, a wider bandwidth (span = 3/4) lessens sensitivity to local changes in the mean function (risking bias), though it produces a more precise fit (smaller confidence intervals). The right-hand panels repeat this comparison for two cubic smoothing splines fit to the same data, with equivalent degrees of freedom set to 4 and 10.

Figure 4: Bias-variance tradeoff, local linear regression (loess) and cubic smoothing splines, fit to the Pennsylvania State Senate data. The solid lines are the fitted smooth functions, and the shaded regions indicate 95% pointwise confidence regions, assuming iid normal errors. For comparison, the dotted lines mark the 95% confidence interval from the linear regression fit (Figure 2). The top panels show how smaller bandwidths (span = 1/4 and df = 10, respectively) increase sensitivity to local changes in the mean function, at the expense of precision. Note the wider confidence intervals in the top panels, compared to the smoother fitted functions in the bottom panels (span = 3/4, df = 4).

It is also illustrative to compare the confidence intervals on these smoothed fits with those obtained from a linear regression fit to the same data (Figure 2). The linear regression confidence intervals are narrower than those obtained with the more "flamboyant" non-parametric fits (though only slightly narrower than the confidence intervals in the bottom panels of Figure 4). Moreover, the linear regression confidence intervals lie far away from several data points. All this is to say that linear regression is not immune from the bias-variance tradeoff. The graphical comparisons in Figure 4 reveal that users of linear regression (implicitly) obtain low variance at the expense of possibly high bias.

How to choose these smoothing parameters? Based on our experience in applying smoothers in social science contexts, this seems a matter of judgement best decided on a case-by-case basis, informed by prior beliefs about the processes being modelled and by diagnostic plots for evidence of over- and under-fitting. We prefer this type of approach for scatterplot smoothing and fitting GAMs; graphical inspection lets us keep track of what the smoothing procedures are up to, more so than letting the software "black-box" the model fitting, say via cross-validation (e.g., Hastie and Tibshirani 1990, 42–52) or other automated procedures for choosing smoothing parameters. In addition, we typically have prior beliefs that the social and political processes under study are reasonably smooth; while the
world may be non-linear, we doubt it to be abruptly or flamboyantly nonlinear.12 In these cases we have good reasons for constraining the smoothing parameters to values that reflect these beliefs, rather than relying on the software to automatically choose parameters that minimize prediction error. All this is to say that we sometimes prefer "keeping the mean function parsimonious" (i.e., smoother) over blind adherence to "getting the mean right". If qualitative findings were sensitive to small and arbitrary decisions about smoothing, the whole smoothing and GAM technology would not be very useful. Fortunately, our findings are not sensitive to choices about the amount of smoothing to perform, within what appear to be reasonable ranges of smoothness. Obviously researchers should clearly indicate whether results depend on arbitrary decisions about smoothness. But even here GAMs are simply making explicit the effect of arbitrary choices that usually are left implicit. Users of OLS almost never question whether an assumption of perfect smoothness is reasonable. All of the findings in our applications are robust to exact choices about how much to smooth; on the other hand, since our results differ substantially from those obtained with simple linear models, it follows that the assumption of extreme smoothness underlying OLS is substantively consequential.

12 Einstein's observation that "God is subtle but he is not malicious" is a favorite epigraph in texts on non-parametric statistics.

4 GAMs in practice

GAMs seem an appealing way to analyze political and social data. It is now time to see how helpful they are in actual political science research. The examples that follow come from a variety of sub-fields of political science, and were chosen to show some of the advantages of the GAM approach. In the conclusion we assess the usefulness of GAMs in political research more generally.

4.1 The effect of Perot on Congressional races: GAMs as an alternative to multiple regression

Jacobson and Dimock (1994) analyze the effect of the House Bank scandal on the 1992 House elections. In one analysis they regress the challenger's vote in the 1992 election on the number of overdrafts written by the incumbent and a series of other "control" variables. The control variable of interest for our purpose here is the percentage of the presidential vote received by Perot. Jacobson and Dimock (1994, 617) used this variable because they "suspected that Perot voters were more disposed to vote against incumbents of either party."

In Table 1 we first present a re-estimation of Jacobson and Dimock's regression (their Table 9).13 The data comprise all 309 incumbents who ran for re-election in 1992. We regressed the proportion of the vote received by the challenger against the Perot vote in the district, the log of the number of overdrafts (plus one) for each incumbent and a series of control variables.14

13 The data as supplied by Jacobson and Dimock were updated after publication. Analyses reported here use the updated data. Our results differ by at most a few percent from the published results.

14 In the next subsection we assess the logarithmic transformation on the overdrafts variable. Here we focus solely on the effect of the Perot vote. Other control variables used in all analyses were whether the challenger had held prior office, the logs of both the challenger's and incumbent's expenditures, the presidential vote for the challenger's party in the district, a dummy variable for whether the incumbent was in a close race in 1990 (Marginal) and whether the incumbent was subject to a partisan redistricting between 1990 and 1992 (Part. Redist.). All data used are as supplied by Jacobson and Dimock; measurement details are in their 1994 article. Unreported analyses confirm that these predictors have linear effects and so will not be discussed further.

Table 1: Estimates of Challenger's Vote: 1992 Congressional Election

                        OLS               GAM: Perot Vote      GAM: Checks
                     b        se          b        se          b        se
Constant             4.34     3.49        5.34     3.44        4.64     3.48
Prior Office          .98      .87        1.07      .86         .81      .87
ln(Chal. Exp.)       2.66      .26        2.63      .26        2.64      .27
ln(Inc. Exp.)         .06      .60         .11      .60         .004     .59
Pres. Vote            .36      .036        .34      .04         .37      .04
ln(Overdrafts)        .80      .19         .85      .19        see figure 6
Marginal              .85      .84         .85      .83         .89      .84
Part. Redist.        3.48     1.24        3.51     1.22        3.48     1.23
Perot Vote            .18      .06       see figure 5           .18      .06
Sum Sq. Resid.     9887.3               9588.1                9754.3
Degrees of freedom    300                  298                   297

The results of the linear regression are in the first columns of Table 1.
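As a rough sketch (not the authors' actual code), the specifications compared in Table 1 could be written in the S syntax of Appendix B as follows; the data frame jd92 and the variable names are hypothetical stand-ins for the Jacobson and Dimock data:

    # All terms entered linearly (the OLS columns of Table 1)
    gam.linear <- gam(chal.vote ~ prior.office + log(chal.exp) + log(inc.exp) +
                      pres.vote + log(overdrafts + 1) + marginal + part.redist +
                      perot.vote, data=jd92)
    # Identical, except the Perot vote enters as a smoothing spline with
    # three degrees of freedom (the "GAM: Perot Vote" columns)
    gam.perot  <- gam(chal.vote ~ prior.office + log(chal.exp) + log(inc.exp) +
                      pres.vote + log(overdrafts + 1) + marginal + part.redist +
                      s(perot.vote, df=3), data=jd92)
    # anova() on the two fits gives the approximate F test of linearity
    anova(gam.linear, gam.perot)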
According to the linear regression model, the Perot vote did have an effect on the 1992 Congressional vote, with a 10 point increase in the Perot vote associated with just under a two point increase in the challenger's vote. The t-ratio for this coefficient is about 3, so the effect is clearly statistically significant. But while we suspect that Perot voters are more likely to vote against incumbents, we have no theory which says that this functional form is linear. To assess this, we re-estimated a series of GAMs. These were identical to the linear regression, except they allow for a smooth, but non-linear, relationship between the vote for Perot and the vote for the challenger. Since we want to ensure that our findings are not the result of an arbitrary choice about whether to use loess or cubic smoothing splines, as well as the degree of smoothing, we present a series of analyses which vary both the degree of smoothing and the choice of smoothing technique. Plots of the relationship between the challenger's vote and the Perot vote for a variety of smoothing choices are in panel (a) of Figure 5.15 In interpreting these plots, it should be remembered that the median district Perot vote was about 20%, ranging from 3% to 33%; the distribution of this vote by district was reasonably symmetric, with a slightly heavier left tail.

15 We stress that all these figures are based on GAMs which include all the other Jacobson and Dimock variables either linearly or logarithmically. The effect of these other variables is thus summarized by a regression coefficient. These coefficients (for the GAM using a smoothing spline with three degrees of freedom) are shown in the middle columns of Table 1. Specifying a non-linear effect for the Perot vote has little impact on the other coefficients. This need not be the case; we would have seen bigger effects if the Perot vote had been more highly correlated with the other control variables.

Figure 5: Non-parametric estimates of the effect of the Perot vote in a Congressional district on the challenger's vote. Panel (a) compares a variety of smooths (smoothing splines with 3, 4 and 8 degrees of freedom; loess with spans of .75, .5 and .25) and panel (b) compares the best fitting smooth (spline, df = 3) with a linear fit. The horizontal axis is the Perot vote; the vertical axis is the effect on the challenger's vote.

The linear model for the effect of Perot vote is clearly misleading. Using any of the graphed smooths, we see that an increase in the vote
for Perot has a linear effect on the challenger's vote until the Perot vote exceeds about 15% (that is, there is a linear impact for the lower quartile of districts). But once that threshold is exceeded, there is no further effect of the Perot vote on the Congressional election. Note that this finding is robust to choices about how to smooth. The coefficient on the linear Perot term is .18; but the non-linear effect is between three and four times as great until the 15% threshold is reached. In panel (b) of Figure 5 we compare the best fitting of the smooths, a cubic smoothing spline using 3 degrees of freedom, with the linear fit. The linear model underpredicts the challenger's percentage of the vote by over three points for districts at the 15% threshold. F tests overwhelmingly reject the null hypothesis of linearity in favor of any of the local regression fits, with p < .0001. This difference is clearly non-trivial. It also yields a different conclusion than Jacobson and Dimock's (1994, 618) finding that "the more support for Perot, the more support for the House challenger." Instead we have two different effects, separated by the 15% threshold. It is possible that each district has a relatively small number of anti-Washington voters, who would be more likely to vote both for Perot and for the non-incumbent House candidate. But districts where Perot did well (above the 15% threshold) are probably districts where the Perot campaign was organized. If so, such organization did not appear to have consequences for the House race. We would need individual level data to test this hypothesis, but without the GAM we would not even have seen the non-linear relationship between the Perot and House vote.16

16 We do not claim it would have been impossible to find such a relationship via analysis of least squares residuals and such; but we do claim it is at least as easy to find and test non-linear relationships with GAMs as it is to find linear relationships via linear regression.

4.2 Overdrafts and the Congressional Vote: Assessing Transformation

Jacobson and Dimock were interested in the effect of the House Bank scandal, not the Perot vote, on the Congressional vote. They chose to model the effect of overdrafts on the challenger's vote via a logarithmic transformation. Since many incumbents had no overdrafts, they used the natural log of the number of overdrafts plus one. While we might expect that the impact of overdrafts on the vote is monotonically increasing, it is also reasonable to expect decreasing returns to scale. But Jacobson and Dimock neither justify nor examine their choice of specific transformation. GAMs provide an ideal way of doing this.

The overdrafts variable presents difficulties in that it is highly skewed. While its range is from zero to 851, the median number of overdrafts is two, with 40% of incumbents who sought re-election having had no overdrafts and 90% having had fewer than 100 overdrafts. The few incumbents with a large number of overdrafts bring the mean number of overdrafts to over 30. While highly skewed variables are always a problem, the large number of incumbents with no overdrafts causes severe problems for the logarithmic transformation. While it is conventional to add one to all observations to avoid the log of zero, we could equally well add .01 or 10.
Substantively, this decision has serious consequences for estimating the impact of the independent variable at low levels of that variable; for Overdrafts, almost all observations are at low levels.17 The GAM completely eliminates the need for the common arbitrary decision about how to avoid the log of zero. The logarithmic transformation is also problematic in that it is fit globally, not locally. Decreasing returns to scale for large values of Overdrafts clearly requires the use of a downward bending transformation. But a log transformation struggles to fit both those incumbents with many overdrafts, as well as the large number of incumbents with few, if any, overdrafts.

To assess the relationship between overdrafts and challenger vote, we estimated a GAM relating challenger vote to a non-parametric estimate of Overdrafts, using a smoothing spline with four degrees of freedom for this estimate.18 Neither the logarithmic nor the spline model is nested inside the other, and so we cannot directly test one versus the other. We can embed both models in a larger model which contains both, and then test whether the data reject either specialized model (which is the common procedure for testing non-nested regression specifications). This "test" shows that the data prefer neither model, finding either one consistent with the data. Thus using the logarithmic transformation of Overdrafts, as Jacobson and Dimock did, does no great injustice to the data.

17 These consequences stem from the logarithmic function having large first and second derivatives near zero. Thus we will get very different estimated impacts if we add .01, 1 or 10 before transforming. Theory is completely silent as to which of these values is appropriate, and conventional data analysis allows for a large range of values which might be appropriately added to Overdrafts prior to the log transform.

18 All other control variables were entered linearly. Results were almost identical if we used the non-parametric Perot Vote. The non-parametric estimate was not sensitive to smoothing choices.

Figure 6: Comparison of logarithmic (dashed line) and cubic smoothing spline (df=4), effect of overdrafts on challenger's vote. An approximate 95% confidence interval for the cubic smoothing spline estimate is shaded. Panel (a) is for all incumbents while panel (b) focusses on the region containing 90% of the data (fewer than 100 overdrafts), allowing the eye to more clearly see the difference between the two models. Both panels are identical except for scale.

We can, however, use the estimated smoothing spline to improve our understanding of the effect of the House Bank scandal on the election. The estimated impact of the scandal using the logarithmic transformation is heavily influenced by the arbitrary decision to add one to Overdrafts before taking logs. The GAM approach avoids the need for this arbitrary decision. The estimated smoothing spline and logarithmic fits are shown in Figure 6. Comparing the spline to the logarithmic fit, we see that the logarithmic fit overstates the impact of overdrafts when the number of overdrafts is not great; it lies above the confidence interval of the spline fit until the number of overdrafts exceeds 50. Over 85% of incumbents seeking re-election were in this region.
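The comparison just described might be sketched in the S syntax of Appendix B as follows; jd92 is the hypothetical data frame used above, and controls stands in for the other Jacobson and Dimock regressors:

    # Jacobson and Dimock's global log transformation (adding one to avoid log(0))
    gam.log    <- gam(chal.vote ~ controls + log(overdrafts + 1), data=jd92)
    # Local, non-parametric alternative: a smoothing spline with four degrees of freedom
    gam.spline <- gam(chal.vote ~ controls + s(overdrafts, df=4), data=jd92)
    # Embed both in a larger model and ask whether the data reject either restriction
    gam.both   <- gam(chal.vote ~ controls + log(overdrafts + 1) +
                      s(overdrafts, df=4), data=jd92)
    anova(gam.log,    gam.both)
    anova(gam.spline, gam.both)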
The spline fit indicates that a very small number of overdrafts did not hurt incumbents very much, and may not have hurt them at all. The pointwise confidence interval for the impact of overdrafts excludes zero only for six or more overdrafts. At that point, the GAM analysis indicates that six overdrafts cost the incumbent about a quarter of a point; the logarithmic analysis indicates that six overdrafts cost the incumbent over a point and a half. The logarithmic analysis indicates that even one overdraft cost the incumbent about half a point; the corresponding GAM impact is four one hundredths of a point. Thus the logarithmic analysis shows that even a very small number of overdrafts might have affected close Congressional races; the GAM analysis shows that overdrafts only had a major effect on races for the minority of incumbents who had a relatively large number of overdrafts. Thus while the logarithmic transformation does not do any great injustice to the data, it does have substantive consequences that appear not to be correct.

4.3 Do governments face rising hazards? — an analysis in discrete, but smooth, time

An enduring controversy in comparative politics is whether the probability of a cabinet failing increases with the age of the cabinet. Warwick (1992) argued that the hazard rate of cabinet failure increased over the life of the cabinet, while Alt and King (1994) argued for a constant hazard rate (both controlling for various institutional rules).19 This controversy has both statistical and substantive implications. The incorrect assumption of a constant hazard rate leads to wrong standard errors, and wrong, or at best, inefficient, parameter estimates. The substantive issue is whether, controlling for institutional rules, there is a tendency for cabinets to become more fragile as they age. If so, we need to investigate what happens over the life of a cabinet to make it more fragile; if not, then fragility is purely determined by institutional rules. Parametric estimation of a continuous time Weibull model shows that the hazard rate is rising, at least for the model used by Alt and King (Beck 1995; Beck N.d.). But this finding is based on a strong assumption, that the hazard rate increases (or decreases) monotonically, following the specific Weibull shape.

19 In survival analysis, the hazard rate is, loosely speaking, the probability of a cabinet failing in some small time interval (t, t + Δt), given that it had survived up to time t.

One of us has used discrete time hazard models to investigate this issue (Beck 1995). This analysis can be dramatically improved by the use of GAMs. The discrete time approach models the probability of a government failing at the end of some month, given that it survived up to that month (the discrete hazard), as a logit on a series of covariates and a time marker. As is standard in this literature, the covariates we use are a measure of party polarization, whether the government is a majority government, how many prior formation attempts had been made, whether the coalition was formed right after an election, whether the government is a caretaker government and whether there is an investiture requirement.
Letting FAILURE be a dummy variable which marks whether a government fails in a given month, given that it survived up to that month, and removing governments which fail from further analysis, the discrete hazard model is a logit with

\log\left[\frac{P(\mathrm{FAILURE}_{it})}{1 - P(\mathrm{FAILURE}_{it})}\right] = \beta_0 + \beta_1 x_{1it} + \cdots + \beta_k x_{kit} + h(t). \qquad (18)

In Beck (1995), h(t) was modelled as a series of dummy variables, one for each time point (so h(t) = h_t). With monthly data, and cabinets surviving up to 48 months, this led to imprecise estimates of the coefficients of the individual h_t's; these are, not surprisingly, highly collinear. While it appeared that the coefficients on the h_t's increased with t, the large standard errors on these estimates made this finding uncertain. One solution to this problem would be to use a parametric model in time. But the shape of the time dependence is unknown and is exactly the point at issue. A low order polynomial approximation is overly restrictive, and is, in any event, inconsistent with typical shape assumptions made in survival analysis. GAMs provide a better solution. Instead of a series of dummy variables, or an arbitrary polynomial, we simply model h(t) as a smooth function of time. The smoothness assumption seems reasonable. While hazard rates may be time dependent, we would not expect them to make large jumps from month to month. Here we model h(t) as a smoothing spline consuming 4 degrees of freedom; our substantive findings were identical using a smooth with any reasonable number of degrees of freedom.

The smoothing spline for time is shown in panel (a) of Figure 7, with coefficient estimates in Table 2.20 The picture is clear. The hazard of a government falling increases over time, at least for the first two and a half years that it is in power. Examination of the 95% confidence interval shows that this increase is not simply random error. A χ² test clearly shows that Equation 18 with a smooth function of time is superior to one without a time variable; we can reject the null hypothesis that the smoothing spline of time does not belong in the specification at p = .001. Contrary to Alt and King, hazard rates are rising, not constant.

20 The metric on the y-axis in this panel is in terms of logits of the probability of government failure, the GAM's dependent variable; see equation 18. We present the results in terms of logits instead of the substantive probabilities of government failure because the latter are a function of all the covariates; i.e., it is more straightforward to inspect the partial effects of change in a given independent variable in terms of the logits. We can transform the predicted effects to changes in terms of probabilities via the inverse logit transformation, to produce graphs like that in panel (b).

Figure 7: The effect of cabinet duration on cabinet failure: discrete hazards. (a) Effect of duration on the logit of government failure (shaded area indicates an approximate 95% confidence interval; the y-intercept is arbitrary) and (b) comparison of estimated discrete and Weibull hazards. The horizontal axis in both panels is the month in office.

The plot of the smoothing spline is not directly interpretable in substantive terms, but it is easy to use the inverse logit transformation to compute the substantively meaningful discrete hazards (which are just the probabilities of cabinet failure in a given month).
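A sketch of how the discrete-time hazard GAM in equation 18 might be estimated in S (the data frame cabinet.months and the variable names are hypothetical; the covariates are those listed above and in Table 2):

    # One row per cabinet-month; failure = 1 in the month a government falls
    hazard.gam <- gam(failure ~ investiture + polarization + majority +
                      formation + post.election + caretaker + s(month, df=4),
                      family=binomial, data=cabinet.months)
    # The fitted smooth of month is h(t) on the logit scale (panel (a) of Figure 7)
    plot(hazard.gam, se=T)
    # Inverse logit transformation: converts a fitted logit eta into a discrete
    # hazard (the monthly probability of failure), as in panel (b) of Figure 7
    inv.logit <- function(eta) exp(eta) / (1 + exp(eta))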
A plot of the discrete hazard rate against time, computed at the mean of all the other covariates, is shown in panel (b) of Figure 7. The hazard rate increases, more or less linearly, over the first 30 months of a government, tripling from .03 to .09 over those months. Typical cabinets are three times as likely to fall after having been in power over two years than at the time of their initial formation. The hazard rate remains at the .09 level after that, so the probability of a government falling in any month after month 30, given that it survived to that month, is almost one in ten. We should remember that there are relatively few governments which fall after three years, so the uncertainty associated with the flat portion of the estimated hazard is large.

Table 2: Estimation of the Discrete Hazard of Cabinet Failure

                        With Smoothed Time       Without Time
Variable                b          se            b          se
Constant                -3.30      .20           -2.88      .17
INVESTITURE               .41      .14             .36      .14
POLARIZATION              .025     .006            .020     .006
MAJORITY                 -.63      .14            -.54      .14
FORMATION                 .14      .05             .12      .05
POST-ELECTION            -.90      .15            -.76      .14
CARETAKER                1.92      .32            1.62      .31
MONTH                  see figure 7
-2LogLike              1950.9                    1971.5
Degrees of Freedom     5385                      5389

Fully parametric duration models typically make an assumption about the shape of the hazard function. The most standard model, the Weibull, assumes monotonic hazards. Thus it is hard to know if the finding of rising hazards in a Weibull model is real or artifactual. The GAM discrete hazard model estimates the hazard function non-parametrically. Thus our finding of rising hazards gives credence to the Weibull findings. A comparison of the Weibull hazard with the non-parametrically estimated hazard is shown in panel (b) of Figure 7. Both hazard rates are similar for the first year and a half a government is in power; at that point the non-parametric discrete hazard rises faster than does the Weibull hazard. While this difference between the two hazard rates is not statistically significant, it does show that, if anything, the Weibull underestimates the degree of rising hazards. Thus the Weibull model, with rising hazards, does little injustice to the data. The GAM analysis clearly shows that the Alt and King model is subject to rising hazards. Even holding institutional rules constant, cabinets do appear to become more fragile as they age.

4.4 The democratic peace: monadic or dyadic

The democratic peace hypothesis asserts that democracies, while no less likely to be involved in wars in general, are less likely to fight other democracies. This hypothesis has spawned much quantitative research (see e.g., Ray 1995). The typical research design used to assess this hypothesis uses data on some subset of all pairs of nations, with annual observations on dyadic variables, including the binary dependent variable of whether the dyad is involved in a militarized conflict. These data are analyzed by a straightforward logit analysis. Since theory is not specific on exactly what it takes for a nation to be a democracy, researchers have either used arbitrary dichotomizations (or trichotomizations) of relatively continuous democracy measures or have created ad hoc continuous measures of dyadic democracy. The 2 dimensional loess smooth enables us to do better. Here we reanalyze conflict in the post-World War II period among contiguous dyads, originally reported in Oneal et al. (1996).21 Following Oneal et al.
(1996) and Oneal and Russett (1996), we estimate a logit of DISPUTE (whether or not there was a militarized dispute in the Correlates of War dataset) on: a measure of dyadic democracy (see below); the average annual growth in real GDP of the dyadic partners over the previous three years (GROWTH); economic interdependence measured by the relative importance of trade between the dyadic partners (INTERDEP); a dummy variable indicating whether the partners were allied (ALLIES); and the relative balance of forces between the partners, with each partner's capability measured by economic, military and demographic factors (CAPRATIO).22 We use geographically contiguous dyads in the period 1950–85, yielding 6678 dyad-years.

In the earlier work Oneal et al. (1996) used a simple dummy variable to measure dyadic democracy. Subsequently Oneal and Russett (1996) argued that the appropriate measure of dyadic democracy is the democracy score of the less democratic nation in the pair. This newer measure, MINDEM, is constructed by creating democracy scores for each member of the dyad (DEMA and DEMB) and taking the dyadic score as the lesser of the two. The democracy score for a nation is the difference between its measure on a democracy scale (running from zero to ten) and an autocracy scale (also running from zero to ten), so DEMA, DEMB and MINDEM run from negative ten to ten. While MINDEM is not an unreasonable measure of dyadic democracy, the democratic peace hypothesis is not stated so clearly as to make it obviously the correct measure.

21 This section is drawn from a longer analysis of time-series–cross-sectional data issues in international relations (Beck and Tucker 1996).

22 These measures, other than for democracy, were taken exactly from Oneal et al. Democracy measures were coded from the Polity 3 dataset (Gurr and Jaggers 1996). Measurement issues are discussed in Oneal et al. (1996) and Oneal and Russett (1996).

The GAM/interaction approach allows us to investigate the interrelationship of DEMA, DEMB, and the likelihood of a dyadic dispute in any given year. We simply enter a smooth term in the interaction of DEMA and DEMB (a 2 dimensional loess smooth) in a GAM of disputes.23 The smooth term used has a span of half the data; as usual, findings were not sensitive to smoothing choices. There is, however, an additional complication which must be dealt with before undertaking the GAM estimation. The simple logit setup of Oneal et al. assumes the dyadic observations are temporally independent. This assumption, as Oneal et al. recognize, is implausible. As shown in Beck and Tucker (1996), we can view the dispute time-series–cross-section data as duration data on how long dyads remain at peace.24 We can allow for non-independence by adding a smoothed function of the length of time since the last dispute, PEACEYRS, to the logit. We use a cubic smoothing spline with four degrees of freedom to model this smooth. As usual, results were not sensitive to specific smoothing decisions. It is important to estimate the effect of PEACEYRS non-parametrically, since we do not know how the probability of dispute varies over time. Results for the simple logit and GAM analyses are reported in Table 3. The smoothed fit of PEACEYRS (Figure 8) clearly shows that the logit on disputes suffers from non-independence.25 For the first seven or so years that a dyad is at peace, the probability of a militarized dispute declines rapidly.
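A minimal sketch of this specification in the S syntax of Appendix B (data frame and variable names hypothetical): the two democracy scores enter through a single two-dimensional loess term, and PEACEYRS through a cubic smoothing spline.

    # One row per dyad-year; dispute = 1 if a militarized dispute occurred
    dispute.gam <- gam(dispute ~ lo(dema, demb, span=.5) + growth + allies +
                       capratio + interdep + s(peaceyrs, df=4),
                       family=binomial, data=dyad.years)
    # The conventional logit replaces the smooth terms with MINDEM and drops
    # PEACEYRS (the "Logit" column of Table 3)
    dispute.logit <- glm(dispute ~ mindem + growth + allies + capratio + interdep,
                         family=binomial, data=dyad.years)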
We can gauge the substantive size of the PEACEYRS effect by noting that it is roughly eight times as large as the effect of the dyadic partners being allied. The coefficient on the binary variable ALLIES in the GAM analysis of Table 3 is .30; going from having been at war in the previous year to seven years of prior peace has a corresponding impact of almost 2.4 in the same logit units. We can use the inverse logit transformation to estimate these effects in terms of the probability of a dispute. For a dyad with other independent variables set to their median values, changing from being non-allied to allied decreases the probability of conflict from .055 to .042; moving from having been at war in the previous year to seven years of prior peace decreases this probability from .274 to .033.26 The probability of a militarized dispute declines more slowly from years eight to about twelve; after about twelve years, additional years of peace have essentially no effect on whether the dyad enters a militarized dispute.

23 Since the model is symmetric in the dyadic partners, we randomly assign one nation as "A" for each observation.

24 The basic argument is that time-series–cross-section data with a binary dependent variable is just another way of describing duration data; the information in the binary dependent variable is the length of time between disputes (the number of zeros between ones). The assumption of independence in the logit analysis is identical to the assumption of a constant hazard rate in the corresponding duration analysis. Thus including the smooth of PEACEYRS is equivalent to our inclusion of the smooth of cabinet age in the analysis of cabinet failure.

25 "Likelihood ratio" tests clearly indicate that the smooth of PEACEYRS belongs in the specification. These tests indicate that a model with a linear PEACEYRS term is superior to a model without that term, and a model with the smooth of PEACEYRS is superior to the model with linear PEACEYRS. p-values for all these tests are effectively zero. As in the cabinet duration analysis, the y-axis for this plot (and the z-axis in Figure 9) is on the logit scale (see note 20).

Table 3: Estimates of Logit Model of Post-World War II Disputes

                        Logit                    GAM
                        b          se            b          se
MINDEM                  -.027      .011
DEMA, DEMB                                       see figure 9
GROWTH                  -.14       .02           -.12       .03
ALLIES                  -.52       .11           -.30       .12
CAPRATIO                -.444      .25           -.68       .26
INTERDEP               -13.8      3.17           -6.42      2.85
PEACEYRS                                         see figure 8
Constant                -2.26      .11           -1.05      .10
-2LogLike              2793.0                    2337.4
Degrees of Freedom     6776                      6767

Note: Data are in dyad-year form. The dependent variable is whether the dyad engaged in a militarized dispute in a given year.

The conventional logit analysis shows that MINDEM affects the probability of dyadic dispute; as the less democratic partner becomes more democratic, the probability of a militarized dispute decreases. The 2 dimensional loess smooth gives more information about the role of each partner's level of democracy in inhibiting war. This smooth is shown in Figure 9. This figure is a 3 dimensional perspective plot, with the two democracy scores plotted on the x and y axes and the smoothed impact on disputes plotted on the z axis. Panel (a) of the figure is drawn from the perspective of the least democratic dyad, while (b) redraws the same figure from the perspective of the most democratic dyad.
Figure 10 shows the corresponding contour plot, another way of conveying the information in Figure 9.27 Moving from being the least to the most democratic dyad has an effect on disputes (in terms of logits) of about .9. Substantively, this is about three times the impact of the members of the dyad being allied. Setting the values of all other independent variables to their medians, moving from being the least to being the most democratic dyad decreases the probability of conflict from .06 to .025; we have already seen that being allied only decreases this probability by .013. A likelihood ratio test clearly indicates that the smooth of the democracy interaction belongs in the specification, and is superior to a specification with only linear and multiplicative terms in DEMA and DEMB. (The p-value for this test is essentially zero.)

26 The effects in terms of logits for PEACEYRS could be read off Figure 8; all reported effects were actually computed using the more exact methods discussed in Appendix B.

27 As any hiker knows, conveying information about three dimensions on a two dimensional page is not easy. The perspective plots are an attempt to render the effect of the two democracy scores on disputes (on the logit scale) assuming the reader were looking at the three dimensional surface with the eye in (a) being at the point (-10, -10, 0) and the eye in (b) being at (10, 10, 0). The contour plot is like the hiker's "topo" map or the temperature map on the national weather page. The lines or "isocurves" on the contour plot join up pairs of democracy scores which have equivalent effects on the probability of a militarized dispute. The numbers on the isocurves are in the same logit units as the z-axis of the perspective plots. The isocurves on a hiker's topo map indicate positions of equal altitude; the density of these curves around a given point indicates the steepness of the terrain near that point. The height of the perspective plot corresponding to a minimum democracy score of -10 for both partners is about zero; in Figure 10 the contour line closest to this point has a value of zero. At the other extreme, the height of the perspective plot corresponding to a maximum democracy score of 10 for both partners is almost -.9; the contour line closest to this point has a value of -.8. (Since the effect of democracy never is lower than -.88, there is no contour line corresponding to -.9.)

Figure 8: Effect of PEACEYRS on disputes. Smoothing spline with approximate 95% confidence interval shaded. The horizontal axis is years since the last dispute; the vertical axis is the effect on disputes, on the logit scale.

Figure 9: Local linear regression (loess) estimate of the joint effects of each partner's democracy score on militarized disputes. The estimated effects are defined over the two-dimensional space of the predictors, forming a surface. The plots are drawn from the perspective of (a) the least democratic dyads and (b) the most democratic dyads.
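The nested-model comparison just described can be sketched with the anova() facility discussed in Appendix B (data frame and variable names hypothetical; the test is approximate since the loess smooth does not strictly nest the multiplicative term):

    # Restricted model: DEMA and DEMB enter linearly, plus their product
    dispute.lin <- gam(dispute ~ dema + demb + dema:demb + growth + allies +
                       capratio + interdep + s(peaceyrs, df=4),
                       family=binomial, data=dyad.years)
    # Unrestricted model: the 2 dimensional loess smooth of the two democracy scores
    dispute.sm  <- gam(dispute ~ lo(dema, demb, span=.5) + growth + allies +
                       capratio + interdep + s(peaceyrs, df=4),
                       family=binomial, data=dyad.years)
    anova(dispute.lin, dispute.sm)   # approximate likelihood ratio comparison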
Of perhaps more interest, estimating a combined model with both MINDEM and the smooth interaction shows that even with MINDEM already in the specification, the smooth interaction term also belongs (p < .001).

Figure 10: Contour plot of estimated joint effects of each partner's democracy score on militarized disputes. The contour plot is an alternative way of presenting the information in Figure 9, showing the "topography" of the fitted surface. The axes are the democracy scores of the two partners, each running from -10 to 10.

The value of the GAM is not in obtaining improved fits, but rather in allowing us to delineate the impact of democracy on war, and in particular to assess whether each nation's democracy matters, or whether only dyadic democracy matters. Looking at the perspective plot in panel (a), we see that either partner's democracy has little or no effect on conflict when either partner is not democratic. The contour plot shows that democracy has little effect on conflict until both partners' democracy scores exceed zero (on the -10 to 10 scale). This is contrary to the recent Oneal and Russett argument that the probability of conflict declines over the entire range of democracy scores. This would seem to indicate that the original dichotomous measure of democracy is better than the new MINDEM. But this is not the case either; once both partners' scores exceed zero, an increase in either partner's score has a strong effect on conflict. In panel (b), we see that the surface drops rapidly when both partners are relatively democratic. The magnitude of the effect can be seen in either the steep drop off in the surface in Figure 9 or the close spacing of the contours in the northeast quadrant of Figure 10. Holding other variables constant, the difference in the probability of a dispute for the least and most democratic dyads is about the same as the difference in the probability of conflict immediately after a conflict and after three years of peace. Ninety percent of this effect occurs after both partners have positive democracy scores. Analyses relying on a dichotomous measure of dyadic democracy would miss this.28

The GAM analysis sheds light on whether the effect of democracy on war is dyadic or monadic. The dyadic hypothesis asserts that democracies themselves are not more pacific, only that pairs of democracies are more pacific; the monadic hypothesis asserts that democracies are more pacific, and democratic dyads are only as pacific as would be predicted from their two separate democracy levels (e.g., Rousseau et al. 1996). The GAM analysis shows that both positions are partially correct. At low levels of democracy it is dyadic democracy that matters; dyads containing a partner with a negative democracy score are equally likely to engage in conflict, regardless of whether the other partner is highly democratic or highly non-democratic. But for dyads where both partners have positive democracy scores, any increase in either partner's democracy score decreases the probability of conflict. This can be seen in either the shape of the surface in panel (b) of Figure 9, or, perhaps more clearly, in the contours of Figure 10. In the northeast (democratic) quadrant, these contours are negatively sloped 45-degree lines. This indicates that at high levels of democracy, either partner's increase in democracy is a perfect substitute for the other.
This shows that both the MINDEM and the dichotomous measure of dyadic democracy do not capture all the features of the democratic peace, and that neither the monadic nor the dyadic theories completely capture the nature of post-World War II conflict.

28 The plots show clearly that both the dichotomous and the minimum democracy measures are flawed. If the dichotomous measure were correct the perspective plot would simply show two vertical levels. If MINDEM were the correct measure the contours would all be vertical above the 45-degree line and horizontal below it, with decreasing conflict levels moving from left to right and from top to bottom throughout the plot. Figures 9 and 10 do not look like this.

5 Conclusion

Political methodology has made great strides in the last decade. Political scientists now routinely use models driven by a wide variety of complex statistical distributions with ease, and have started to use nonparametric methods to allow for more realistic error processes. Despite these more realistic methods, almost all political science models assume a linear relationship between covariates and mean response. We find this especially puzzling since there is almost always no theory that responses should follow this simple linear form. The generalized additive model seems an ideal tool for confronting this shortcoming. While very flexible in modelling the effects of specific variables, the GAM is not so pliable that it finds patterns where none exist. Thus it may be the ideal compromise between simple linear models and much more complex neural net and projection pursuit type models. The additivity constraint also makes the GAM easy to interpret, a property not shared by its more complicated cousins. Of course, analysts will choose among the numerous non-parametric regression techniques, perhaps choosing to ignore them altogether, drawing on their experiences with statistical modelling, and their intuitions about the particular substantive problem at hand. But the combination of flexibility and simplicity captured by the GAM means it can lay a strong claim to be the "workhorse" non-parametric regression method. Workhorses must be relatively easy to use; as we show in Appendix B, this is the case for GAMs.

While we think it most useful simply to present the non-parametric results as graphs of smooth curves, GAMs can also be used by those who are committed to parametric methods. They can be used either to test the adequacy of a chosen parametric transformation (including, perhaps most importantly, the linear transformation!) or to suggest a superior transformation. While we feel that global transformations are inferior to local estimates, a carefully chosen global transformation is superior to one chosen by convention or prayer.

We presented a variety of applications of the GAM in Section 4. In the analysis of the House Bank scandal the GAM was most helpful in uncovering the nature of the relationship between Perot votes and votes for the House challenger. With the GAM it was simple to see that there really were two relationships here: one for districts with few Perot voters and one for districts with many such voters. The GAM also helped us to better understand the substantive effect of the number of overdrafts drawn on 1992 House re-election contests; this effect is substantively smaller than had been previously claimed. And at the very least, our reanalysis also showed that those committed to parametric forms could better inform their choice of parametric form by using the GAM.
These reanalyses show it is straightforward for analysts familiar with regression and logit analysis to recast those models as GAMs. Our reanalyses of cabinet durations shows that GAMs are an ideal tool for the analysis of discrete time duration data, and can settle important substantive and statistical questions that are almost impossible to settle any other way. Standard practice at present is to impose specific forms of age effects, such as the Weibull. The GAM can either be used to justify such forms or to go beyond them. The reanalysis of the democratic peace shows that the use of smooth functions of time goes well beyond standard duration models. The reanalysis of the democratic peace data also shows how useful the GAM is for modelling bivariate interactions. While many theories suggest interactions in social and political data, they are hard to estimate precisely in a linear framework. The GAM makes such estimation easy. We know of no other tool that could have shown that the dyadic democratic peace hypothesis holds, but also showing that the hypothesis must be refined: for the more democratic nations, an increase in either nation’s level of democracy decreases the probability of conflict. We do not claim that the GAM is useful in all situations. In particular, it cannot be used where independent variables take on only a very few discrete values, although such variables can still be used as “linear controls” or conditioning variables in a GAM. Consider a hypothetical analysis of political behavior, relying on survey data. GAMs will probably be: (a) of little help in understanding the direct effects of categorical variables like gender, race, and religion; (b) potentially helpful in modelling the effects of scale measures of political attitudes and ideology; and (c) of great help with independent variables like age, income, and years of education. GAMs work equally well on interval and ordinal data, and do not even assume that the effect of a variable is monotonic; they essentially avoid 43 the entire level-of-measurement issue. Also, analysts of aggregate political and economic data or public policy frequently have more-or-less continuous independent variables, though little theory about the functional form relating these variables to a dependent variable of interest. Here the generalized additive model is ideal. There surely will be situations where other non-parametric methods, such as neural nets or ACE, are the method of choice. But only GAMs allow the analyst to work within the relatively familiar framework of the additive, regression model. Accordingly, the interpretation of GAMs is very similar to the interpretation of regression coefficients, and the methodology for model choice in the GAM context is the same as that methodology in the linear or maximum likelihood context. So while GAMs are non-parametric they remain intuitively comprehensible. Thus, for example, public policy audiences that now routinely use evidence based on linear regression should have little trouble using evidence derived from the GAM. The same cannot be said for the GAMs more exotic relatives. Analysing data involves making compromises. 
The linear regression model is one such compromise (though typically not presented as such), providing a simple statistical answer to the substantive question "does y increase/decrease with x?" When its underlying assumptions are valid, the linear regression model is rightly hailed as a powerful tool: at the cost of just one degree of freedom an analyst learns as much as can be learned from the sample data about β. But as our applications demonstrate, assumptions of constant, linear effects do not always hold true, and the linear form assumed for E(y | X) may well be a relationship of little substantive relevance. In such circumstances linear regression stops being a useful compromise, and may even blind researchers to what is actually going on in the data. Instead, GAMs put a range of compromises between parsimony and fidelity to the data within reach of empirical social scientists, with least squares linear regression at one extreme. Our title claims that "getting the mean right is a good thing". If we include statistical power and substantive interpretability as desiderata, then surely GAMs are good things too.

Appendix A: Will GAMs find spurious non-linearities? — A Monte Carlo Study

In this appendix we address the question of whether GAMs will find smooth (non-linear) patterns in data that are known to be generated as the sum of a linear relationship and a purely random factor. As described in Section 3.3, an F test can be used to assess whether the null hypothesis of linearity should be rejected. But this test is only approximate and has unknown finite sample properties. To examine the performance of this test we undertook a series of Monte Carlo experiments.

We initially generated N values of an independent variable, x, uniformly spaced over the interval [1, 10]. These values were fixed for each experiment. The error process, ε, was generated by taking N draws from a zero mean normal distribution. The dependent variable was then generated as the sum of a linear function of x and this error. A GAM was then fit to the data, and an F test for nonlinearity was conducted. This process was repeated one thousand times for each configuration of experimental conditions. The object of the experiments was to see how often the GAM rejected the null hypothesis of a linear model, that is, falsely reported that a nonlinear model was correct. Given that the data are known to be linear, we should reject this null hypothesis at the p = .05 significance level 5% of the time. In other words, we are testing whether the nominal size of these tests (in this case, 5%) corresponds with the actual size of the tests. We varied the sample size, the error variance, the type of smooth (loess or spline) and the smoothness of the fitted GAM (as measured by the equivalent degrees of freedom consumed by the smooth).

The experiments showed that the actual size of the test is between 5% and 10%, that is, somewhat larger than the nominal size of the test. The average actual size of the test over all our experiments was just over 7%. Given the complicated fitting procedure used by GAMs we were impressed with how well the test for non-linearity performed. But the test is insufficiently conservative. Thus analysts who believe in a strict testing methodology should use critical values of the F distribution in excess of the nominal 5% values in assessing the null hypothesis that the true functional form is linear.
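A schematic of one cell of these experiments, in the S syntax of Appendix B (the sample size, error standard deviation and linear truth are illustrative values rather than those used in the experiments, and the bookkeeping for tallying rejections from the anova() output is omitted):

    n.reps <- 1000
    n      <- 100
    x      <- seq(1, 10, length=n)        # fixed design, uniformly spaced on [1, 10]
    for (i in 1:n.reps) {
      e <- rnorm(n, mean=0, sd=2)         # purely random error
      y <- x + e                          # an illustrative linear truth
      sim.data <- data.frame(x=x, y=y)
      fit.lin  <- gam(y ~ x,    data=sim.data)
      fit.gam  <- gam(y ~ s(x), data=sim.data)   # default smoothing spline, df = 4
      lin.test <- anova(fit.lin, fit.gam)        # approximate F test of linearity
      # ...record whether the test rejects at the nominal .05 level...
    }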
We find most frequently that we can either reject the null of linearity at any level, or we can clearly not reject it at any level, so this small departure of the nominal and actual size of the test is of little practical importance. Our experiments showed that the actual size of the F test only varied with the lack of smoothness of the GAM, as measured by the equivalent degrees of freedom used in the smooth. Table 4 summarizes these results, showing that the equivalent degrees of freedom consumed by the GAM (i.e., the amount of smoothing) was the only variable to have any significant impact on the size of the test.29 When we estimated more flexible GAMs, that is, allowed the GAMs to more faithfully track local fluctuations in the mean E(y) of the data, we found that the test was more likely to incorrectly reject the null of linearity. Thus researchers should take special care before concluding that a relationship is non-linear when working with extremely flexible smooths. Even in the worst case, the actual size of the test is never over 10%. We also note that in our own work, smooths which consume many degrees of freedom are seldom required. The analyses in Section 4 did not require smooths using more than five degrees of freedom.

29 Full results are in the data archive for this paper.

Table 4: Analysis of Variance, Monte Carlo Experiments

Test Variable              df    Sum of Sq    % Sum of Sq       F     p-value
edf                         1      127.45        49.27       197.38     .00
n                           1         .52          .20          .81     .37
type                        1         .70          .27          .08     .30
t                           1         .22          .08          .33     .57
edf x n                     1         .06          .02          .10     .76
edf x type                  1         .03          .01          .05     .82
n x type                    1         .56          .22          .86     .35
edf x t                     1         .37          .14          .56     .45
n x t                       1         .01          .00          .01     .92
type x t                    1         .57          .22          .86     .35
edf x n x type              1         .20          .08          .30     .58
edf x n x t                 1         .21          .08          .32     .57
edf x type x t              1         .22          .09          .34     .56
n x type x t                1         .30          .11          .45     .50
edf x n x type x t          1         .00          .00          .00     .95
Residuals                 192      127.66        49.19

Note: The dependent variable is the size of the F test for non-linearity (proportion of rejections at p < .05 in 1000 replications). n = sample size (25, 100, 400, 1000); type = 0 for cubic smoothing splines, 1 for loess; edf = equivalent degrees of freedom (almost uniformly distributed between 2.3 and 7); t = t-ratio of the true (linear) data generating process, inversely proportional to error variance (0, .5, 1, 1.7, 2.5, 5).

Appendix B: Software for GAMs

A statistical computing environment with strong graphics capabilities is essential for using GAMs and scatterplot smoothers. S is such an environment, though most people use S via its enhanced commercial version, Splus. Hastie (1992) provides a lengthy exposition of estimating GAMs in S. Venables and Ripley (1994, ch. 10) is also an excellent guide. Estimating GAMs in S is extremely simple, and uses a simple syntax. Consider the Pennsylvania State Senate data introduced in Figure 2. The following S commands fit GAMs to these data:

    penngams <- gam(absentee ~ s(machine), data=ashen)
    penngaml <- gam(absentee ~ lo(machine), data=ashen)

In the first example, the S formula syntax absentee ~ s(machine) tells the program to model the relationship between Democratic margin among absentee ballots and the Democratic margin recorded with voting machines with a cubic smoothing spline. The default is for the cubic smoothing spline to consume 4 degrees of freedom. A more jagged fit is produced by the formula syntax absentee ~ s(machine, df=10). The analyst can control the amount of smoothing in terms of the parameter λ (see equation (9) in the text) using a spar argument in s().
Since the scale of λ, and hence spar, is data-dependent, we prefer controlling smoothness via the df argument. The second example directs the program to use loess as the smoother. The default span for lo() is .5; a smoother fit could be produced with lo(machine, span=.75), and a more jagged fit with lo(machine, span=.25). Local linear fitting is the default; local quadratic fitting can be specified by lo(machine, degree=2).

Each example tells the software to create the "GAM objects", penngams and penngaml, respectively (the <- is the S assignment operator). These objects can be passed to other S functions like predict, plot, summary and anova to obtain useful information about the estimated fit. In particular, the command plot(penngaml, ask=T) will give the user a menu of options for graphically inspecting the fitted GAM. The summary function provides a printed summary of the GAM, including tests of the null hypothesis of linearity. Two different GAM objects can be compared using the anova() command; if one of the models nests within the other, this function also provides a test that the difference between the models is statistically significant.

Multiple predictors can be introduced to a GAM straightforwardly. For example

    biggam <- gam(y ~ s(x1) + x2 + lo(x3, span=.25) + s(x4, df=12) +
                  lo(x5, x6, span=.5, degree=2) + x7)

models y semi-parametrically, specifying linear regression-like fits on X2 and X7, cubic smoothing splines for the fits on X1 and X4, a local linear regression fit on X3, and a local quadratic fit over the combination of X5 and X6.

GAMs for qualitative dependent variables draw directly on the GLM framework. GAMs borrow two important concepts from GLMs: a link function, h(μ), that describes how the mean of y, E(y) = μ, depends on the additive predictors; and a variance function that captures how the variance of y depends upon the mean, i.e., var(y) = φV(μ), with φ constant (Hastie and Pregibon 1992, 196–7). For instance, logistic regression models for binary outcomes can be dealt with by making the link function the logit transformation, h(μ) = log(μ/(1 − μ)), and specifying a binomial variance function, V(μ) = μ(1 − μ)/n. S makes extensive use of this GLM-type syntax in its support for GAMs. For example, if y is a binary outcome, a logistic additive model can be fitted using the command

    logitgam <- gam(y ~ s(x1) + lo(x2) + x3, family=binomial)

The family= option is part of the S GLM syntax, and the defaults accompanying this option are to fit the logit model (the logit link is the default for GAMs/GLMs in the binomial family). A probit GAM model can be fit by specifying

    probitgam <- gam(y ~ s(x1) + lo(x2) + x3, family=binomial(link=probit))

The robust M-estimator is obtained by

    rprobitgam <- gam(y ~ s(x1) + lo(x2) + x3, robust(family=binomial(link=probit)))

Other software packages like SAS, SPSS and STATA provide some support for scatterplot smoothing, but support for "full-blown" GAMs appears confined to S and Splus. One of us (Jackman) has implemented scatterplot smoothing by loess for GAUSS. This routine, along with all the S code for the models and graphs, is in the data archive for this paper.

References

Alt, James E. and Gary M. King. 1994. "Transfers of Government Power: The Meaning of Time Dependence." Comparative Political Studies. 27:190–210.

Bartels, Larry M. 1996. "Pooling Disparate Observations." American Journal of Political Science. 40:905–42.

Beck, Nathaniel. 1995.
“Continuous and Discrete Time Methods for Estimating Duration Dependence.” Paper presented at the Annual Meeting of the Midwest Political Science Association, Chicago. Beck, Nathaniel. N.d. Modelling Space and Time: The Event History Approach. In Research Strategies in Social Science, ed. Elinor Scarbrough and Eric Tanenbaum. Oxford University Press. Beck, Nathaniel and Richard Tucker. 1996. “Conflict in Space and Time: Time-Series–Cross-Section Analysis with a Binary Dependent Variable.” Paper presented at the Annual Meeting of the American Political Science Association, San Francisco. Breiman, L. and J.H. Friedman. 1985. “Estmating optimal transformations for multiple regression and correlation (with discussion).” Journal of the American Statistical Association. 80:580–619. Cleveland, William S. 1979. “Robust Locally-Weighted Regression and Scatterplot Smoothing.” Journal of the American Statistical Association. 74:829–36. Cleveland, William S. 1993. Visualizing Data. Summit, New Jersey: Hobart Press. Cleveland, William S. and Clive Loader. 1996. Smoothing by Local Regression: Principles and Methods. In Statistical Theory and Computational Aspects of Smoothing, ed. W. Härdle and M. G. Schimek. Heidelberg: Physica Verlag. Cleveland, William S. and S.J. Devlin. 1988. “Locally weighted regression: an approach to regression analysis by local fitting.” Journal of the American Statistical Association. 83:596–610. Cleveland, William S., Susan J. Devlin and Eric Grosse. 1988. “Regression by Local Fitting: Methods, Properties and Computational Algorithms.” Journal of Econometrics. 37:87–114. de Boor, C. 1978. A Practical Guide to Splines. New York: Springer-Verlag. 49 DeVeaux, R. 1990. Finding Transformations for Regression Using the ACE Algorithm. In Modern Methods of Data Analysis, ed. John Fox and J. Scott Long. Newbury Park, Ca.: Sage pp. 177–208. Eubank, Randall L. 1988. Spline smoothing and nonparametric regression. New York: Marcel Dekker. Fan, J. 1992. “Design-adaptive non-parametric regression.” Journal of the American Statistical Association. 87:998–1004. Fan, Jianqing. 1993. “Local linear regression smoothers and their minimax efficiencies.” Annals of Statistics. 21:196–216. Fan, Jianqing and Irene Gijbels. 1996. Local polynomial modelling and its applications. London: Chapman and Hall. Fan, Jianqing and J.S. Marron. 1993. “Comment on Hastie and Loader.” Statistical Science. 8:120–143. Friedman, Jerome H. and W. Stuetzle. 1981. “Projection pursuit regression.” Journal of the American Statistical Association. 76:817–23. Good, I.J. and R.A. Gaskins. 1971. “Nonparametric roughness penalties for probability densities.” Biometrika. 58:255–277. Goodall, C. 1990. A Survey of Smoothing Techniques. In Modern Methods of Data Analysis, ed. John Fox and J. Scott Long. Newbury Park, Ca.: Sage pp. 126–77. Green, Peter J. and Bernard W. Silverman. 1994. Nonparametric regression and generalized linear models: a roughness penalty approach. London: Chapman and Hall. Greene, William. 1993. Macmillan. Econometric Analysis. Second ed. New York: Gurr, Ted Robert and Keith Jaggers. 1996. “Polity III: Regime Change and Political Authority, 1800-1994.” Computer file, Inter-university Consortium for Polticial and Social Research, Ann Arbor, MI. Hastie, T.J. and R.J. Tibshirani. 1990. Generalized Additive Models. London: Chapman and Hall. Hastie, Trevor J. 1992. Generalized Additive Models. In Statistical Models in S, ed. John M. Chambers and Trevor J. Hastie. 
Pacific Grove, California: Wadsworth and Brooks/Cole Chapter 7. Hastie, Trevor J. and Clive Loader. 1993. “Local regression: automatic kernel carpentry (with discussion).” Statistical Science. 8:120–143. Hastie, Trevor J. and Daryl Pregibon. 1992. Generalized Linear Models. In Statistical Models in S, ed. John M. Chambers and Trevor J. Hastie. Pacific Grove, California: Wadsworth and Brooks/Cole Chapter 6. Huber, Peter J. 1981. Robust Statistics. New York: Wiley. 50 Jacobson, Gary and Michael Dimock. 1994. “Checking Out: The Effects of Bank Overdrafts on the 1992 House Elections.” American Journal of Political Science. 38(3):601–24. Lancaster, Peter and Kȩstutis S̆alkauskas. 1986. Curve and Surface Fitting: An Introduction. London: Academic Press. McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman and Hall. Oneal, John R. and Bruce Russett. 1996. “Exploring the Liberal Peace: Interdependence, Democracy and Conflict, 1950–85.” Unpublished manuscript, Department of Political Science, Yale University. Oneal, John R., Francis H. Oneal, Zeev Maoz and Bruce Russett. 1996. “The Liberal Peace: Interdependence, Democracy and International Conflict.” Journal of Peace Research. 33(1):11–29. Ray, James Lee. 1995. Democracy and International Conflict: An Evaluation of the Democratic Peace Proposition. Columbia, S.C.: University of South Carolina Press. Reinsch, C. 1967. “Smoothing by spline functions, 1.” Numerische Mathematik. 10:177–183. Ripley, Brian D. 1993. Statistical aspects of neural networks. In Networks and Chaos - Statistical and Probabilistic Aspects, ed. O.E. BarndorffNielsen, J.L. Jensen and W.S. Kendall. London: Chapman and Hall pp. 40–123. Rousseau, David L., Chrisopher Gelpi, Dan Reiter and Paul K. Huth. 1996. “Assessing the Dyadic Nature of the Democratic Peace: 1918–88.” American Political Science Review. 90(3):412–33. Schoenberg, I.J. 1964. “Spline functions and the problem of graduation.” Proceedings of the National Academcy of Sciences (USA). 52:947–50. Statsci. 1995. S-PLUS: Version 3.3 for Windows. Seattle, WA: Mathsoft, Inc. Stone, C.J. 1982. “Optimal global rates of convergence for nonparametric regression.” Annals of Statistics. 10:1040–1053. Venables, William N. and Brian D. Ripley. 1994. Modern Applied Statistics with S-Plus. New York: Springer-Verlag. Wahba, Grace. 1990. Spline Models for Observational Data. Philadelphia: SIAM. Warwick, Paul V. 1992. “Rising Hazards: An Underlying Dynamic of Parliamentary Government.” American Journal of Political Science. 36:857– 76. Wegman, Edward J. and Ian W. Wright. 1983. “Splines in Statistics.” Journal of the American Statistical Association. 78:351–365. Weisberg, Sanford. 1985. Applied Linear Regression. New York: Wiley. 51 Western, Bruce. 1995. “Concepts and Suggestions for Robust Regression Analysis.” American Journal of Political Science. 39:786–817. White, Halbert. 1992. Artificial Neural Networks: Approximation and Learning Theory. Oxford: Basil Blackwell. Woolhouse, W.S.B. 1870. “Explanation of a new method of adjusting mortality tables, with some observations upon Mr. Makeham’s modification of Gompertz’s theory.” Journal of the Institute of Actuaries. 15:389– 410. 52