University of Vermont Department of Mathematics and Statistics Stat 229 Applied Logistic Regression

University of Vermont
Department of Mathematics and Statistics
Stat 229
Applied Logistic Regression & Applied Survival Analysis
Spring 2015
Instructor: David W. Hosmer
Office: 23C Hills
Office Hours: 2:30 - 3:30 Monday
Room: TBA, but most likely in the Davis Center
Time: 4:05 - 6:30 Monday
Class Room: VOTEY BLDG 361
Course Objectives: The first half of the course provides an introduction to the analysis of binary data
via the logistic regression model, and the second half focuses on the analysis of time to event data, often
called survival analysis. In each part the objective is to provide students with the statistical
techniques, computing skills and writing experience required to: (1) formulate an analysis plan, (2)
select the appropriate computer software package and routines, (3) perform the analysis, (4) display the
results of the analysis in a collection of tables and graphs and (5) summarize, in writing, the analysis
process with special emphasis on the interpretation of the results. It is assumed that students have
had experience with the linear normal errors regression model. This linear regression model will be used
to provide the basic paradigm for analyses in both parts.
Grading
The grade for the course will be based on the homework and the final exam as follows:
Exam/Final: 50 Percent
Homework: 50 Percent
Homework/Exams:
Homework will be assigned on a regular basis.
The final exam will be a take-home exam covering the entire semester. It will use a new
data set not used in class.
Texts:
A course booklet containing copies of the PowerPoint slides for each portion of the course will be
available for purchase. Students not wishing to purchase the texts can get by with the booklet.
However, for students pursuing a career that will involve statistical analyses the texts would be a useful
addition to a personal professional library. Since I have no idea how many, if any, will buy a text I
have not ordered any through the UVM Bookstore. You can get a good price at Amazon and there are
often used ones available.
Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression: Third
Edition, John Wiley & Sons Inc., New York, NY.
Hosmer, D.W., Lemeshow, S. and May, S. (2008) Applied Survival Analysis: Regression
Modeling of Time to Event Data: Second Edition, John Wiley & Sons Inc., New York, NY.
Data Sets:
All data sets for logistic regression used in the text are available at a web site at John Wiley & Sons, Inc.
http://wiley.mpstechnologies.com/wiley/BOBContent/searchLPBobContent.do
In addition, the data may also be found, by permission of John Wiley & Sons Inc., in the archive of
statistical data sets maintained at the University of Massachusetts at
http://www.umass.edu/statdata/statdata/stat-logistic.html
in the logistic regression section. Note: Data sets from both first and second editions are also available
here.
One may find code for STATA, SAS and SPSS for the examples in the Second edition of the textbook at
http://statistics.ats.ucla.edu/stat/sas/examples/ALR3
Other Logistic Regression References:
Agresti, A. (2002) Categorical Data Analysis: Second Edition, John Wiley & Sons Inc., New
York, NY.
Breslow, N.E. and Day, N.E. (1980) Statistical Methods in Cancer Research Volume I: The
Analysis of Case-Control Studies, Oxford University Press, New York, NY.
Breslow, N.E. and Day, N.E. (1987) Statistical Methods in Cancer Research Volume II: The
Analysis of Cohort Studies, Oxford University Press, New York, NY.
Collett, D. (2003) Modelling Binary Data: Second Edition, Chapman & Hall / CRC, Boca Raton,
FL.
Kleinbaum, D.G. and Klein, M. (2005) Logistic Regression: A Self-Learning Text: Second
Edition, Springer-Verlag, New York, NY.
Other Time to Event References:
Andersen, P.K., Borgan, O., Gill, R. and Keiding, N. (1993) Statistical Models Based on
Counting Processes, Springer-Verlag, New York, NY.
Collett, D. (1994) Modelling Survival Data in Medical Research, Chapman and Hall, New York,
NY.
Fleming, T. and Harrington, D. (1991) Counting Processes and Survival Analysis, John Wiley
and Sons Inc., New York, NY.
Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis of Failure Time Data: Second
Edition, John Wiley and Sons Inc., New York, NY.
Lawless, J. (2003) Statistical Models and Methods for Lifetime Data: Second Edition, John Wiley
and Sons, New York, NY.
Lee, E.T. and Wang, J.W. (2003) Statistical Methods for Survival Data Analysis: Third Edition,
John Wiley and Sons Inc., New York, NY.
Nelson, W. (2003) Applied Life Data Analysis: Second Edition, John Wiley and Sons, New
York, NY.
Course Outline
Logistic Regression
1 Introduction to the Logistic Regression Model (Chapter 1, ALR3*)
1.1 Introduction
1.2 Fitting the Logistic Regression Model
1.3 Testing for the Significance of the Coefficients
1.4 Confidence Interval Estimation
1.5 Other Methods of Estimation
2 Multiple Logistic Regression (Chapter 2, ALR3)
2.1 Introduction
2.2 The Multiple Logistic Regression Model
2.3 Fitting the Multiple Logistic Regression Model
2.4 Testing for the Significance of the Model
2.5 Confidence Interval Estimation
3 Interpretation of the Fitted Logistic Regression Model (Chapter 3, ALR3)
3.1 Introduction
3.2 Dichotomous Independent Variable
3.3 Polychotomous Independent Variable
3.4 Continuous Independent Variable
3.5 The Multivariable Model
3.6 Interaction and Confounding
3.7 Estimation of Odds Ratios in the Presence of Interaction
3.8 Interpretation of the Fitted Values
4 Model-Building Strategies and Methods for Logistic Regression (Chapter 4, ALR3)
4.1 Introduction
4.2 Variable Selection
4.3 Stepwise Logistic Regression (Optional)
4.4 Best Subsets Logistic Regression (Optional)
4.5 Numerical Problems
5 Assessing the Fit of the Model (Chapter 5, ALR3)
5.1 Introduction
5.2 Summary Measures of Goodness-of-Fit
5.2.1 Pearson Chi-Square Statistic and Deviance
5.2.2 The Hosmer-Lemeshow Tests
5.2.3 Classification Tables
5.2.4 Area Under the ROC Curve
5.2.5 R² Measures
5.3 Logistic Regression Diagnostics
5.6 Interpretation and Presentation of Results from a Fitted Logistic Regression Model
Course Outline
Survival Analysis
1. Introduction to Regression Modeling of Survival Data (Chapter 1, ASA2)
1.1. Survival time data
1.2. Censoring mechanisms
2. Descriptive Methods for Survival Data (Chapter 2, ASA2)
2.1. Survivorship function and its estimators
2.2. Use of the estimated survivorship function
2.3. Comparison of survivorship functions
2.4. Other functions of survival time and their estimators
3. Regression Models for Survival Time (Chapter 3, ASA2)
3.1. Semi-parametric regression models
3.2. Fitting the proportional hazards model
3.3. Estimating the survivorship function
4. Interpretation of a Fitted Proportional Hazards Model (Chapter 4, ASA2)
4.1. Nominal scaled covariates
4.2. Continuous covariates
4.3. Multiple covariate models
4.4. Covariate adjusted survivorship function
5. Model Development (Chapter 5, ASA2)
5.1. Selection of covariates
5.2. Selecting the scale of continuous covariates
5.3. Confounding and interactions
6. Assessment of Model Adequacy (Chapter 6, ASA2)
6.1. Definition of residuals
6.2. Proportional hazards assumption
6.3. Influential subjects
6.4. Overall tests of fit
*: ALR3: Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression:
Third Edition, John Wiley & Sons Inc., New York, NY.
ASA2: Hosmer, D.W., Lemeshow, S. and May, S. (2008) Applied Survival Analysis: Regression
Modeling of Time to Event Data: Second Edition, John Wiley & Sons Inc., New York, NY.
Abstracted from ALR3.
1.6 Logistic Regression Data Sets Used in Examples and Exercises (From ALR3)
A number of different data sets are used in the examples as well as the exercises for the purpose of
demonstrating various aspects of logistic regression modeling. Six of the data sets used throughout the
text are described below. Other data sets are introduced as needed in later chapters. Some of the data
sets were used in the previous editions of this text, for example the ICU and Low Birth Weight data,
while others are new to this edition. All data sets used in this text may be obtained from links to web
sites at John Wiley & Sons Inc. and the University of Massachusetts given in the Preface.
1.6.1 The ICU Study
The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on
survival of patients following admission to an adult intensive care unit (ICU). The major goal of this
study was to develop a logistic regression model to predict the probability of survival to hospital
discharge of these patients. A number of publications have appeared that have focused on various facets
of this problem. The reader wishing to learn more about the clinical aspects of this study should start
with Lemeshow, Teres, Avrunin, and Pastides (1988). For a more up-to-date discussion of modeling the
outcome of ICU patients the reader is referred to Lemeshow and Le Gall (1994) and to Lemeshow,
Teres, Klar, Avrunin, Gehlbach and Rapoport (1993). The actual observed variable values have been
modified to protect subject confidentiality. A code sheet for the variables to be considered in this text is
given in Table 1.5. We refer to this data set as the ICU data.
Table 1.5 Code Sheet for the Variables in the ICU Data
Variable | Description | Codes/Values | Name
1 | Identification code | ID Number | ID
2 | Vital status at hospital discharge | 1 = Lived, 0 = Died | STA
3 | Age | Years | AGE
4 | Gender | 0 = Male, 1 = Female | GENDER
5 | Race | 1 = White, 2 = Black, 3 = Other | RACE
6 | Service at ICU admission | 0 = Medical, 1 = Surgical | SER
7 | Cancer part of present problem | 0 = No, 1 = Yes | CAN
8 | History of chronic renal failure | 0 = No, 1 = Yes | CRN
9 | Infection probable at ICU admission | 0 = No, 1 = Yes | INF
10 | CPR prior to ICU admission | 0 = No, 1 = Yes | CPR
11 | Systolic blood pressure at ICU admission | mm Hg | SYS
12 | Heart rate at ICU admission | Beats/min | HRA
13 | Previous admission to an ICU within 6 months | 0 = No, 1 = Yes | PRE
14 | Type of admission | 0 = Elective, 1 = Emergency | TYPE
15 | Long bone, multiple, neck, single area, or hip fracture | 0 = No, 1 = Yes | FRA
16 | PO2 from initial blood gases | 0 = > 60, 1 = ≤ 60 | PO2
17 | PH from initial blood gases | 0 = ≥ 7.25, 1 = < 7.25 | PH
18 | PCO2 from initial blood gases | 0 = ≤ 45, 1 = > 45 | PCO
19 | Bicarbonate from initial blood gases | 0 = ≥ 18, 1 = < 18 | BIC
20 | Creatinine from initial blood gases | 0 = ≤ 2.0, 1 = > 2.0 | CRE
21 | Level of consciousness at ICU admission | 0 = No coma or deep stupor, 1 = Deep stupor, 2 = Coma | LOC
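The dichotomized blood gas variables in Table 1.5 are simple threshold rules, and the sketch below writes them out. It is illustrative only: the function name `code_blood_gases` and its argument names are hypothetical, and it assumes raw measurements are supplied in whatever units the table's cut points refer to. According to the code sheet, the ICU data already store the coded 0/1 values.

```python
def code_blood_gases(po2, ph, pco2, bicarbonate, creatinine):
    """Return the 0/1 codes of Table 1.5 for five initial blood gas measurements.

    Argument names are hypothetical; the cut points are copied from the code sheet.
    """
    return {
        "PO2": int(po2 <= 60),         # 0 = > 60,    1 = <= 60
        "PH": int(ph < 7.25),          # 0 = >= 7.25, 1 = < 7.25
        "PCO": int(pco2 > 45),         # 0 = <= 45,   1 = > 45
        "BIC": int(bicarbonate < 18),  # 0 = >= 18,   1 = < 18
        "CRE": int(creatinine > 2.0),  # 0 = <= 2.0,  1 = > 2.0
    }


if __name__ == "__main__":
    # Made-up measurements, chosen to sit on or near the cut points.
    print(code_blood_gases(po2=58, ph=7.31, pco2=47, bicarbonate=18, creatinine=2.0))
```

Note that the direction of the inequality differs by variable: for PO2, PH and BIC the code 1 marks a low value, while for PCO and CRE it marks a high value.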
1.6.2 The Low Birth Weight Study
Low birth weight, defined as birth weight less than 2500 grams, is an outcome that has been of concern
to physicians for years, because infant mortality rates and birth defect rates are higher
for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and
receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of
delivering a baby of normal birth weight.
Data were collected as part of a larger study at Baystate Medical Center in Springfield,
Massachusetts. This data set contains information on 189 births to women seen in the obstetrics clinic.
Fifty-nine of these births were low birth weight. The variables identified in the code sheet given in
Table 1.6 have been shown to be associated with low birth weight in the obstetrical literature. The goal
of the current study was to determine whether these variables were risk factors in the clinic population
being served by Baystate Medical Center. Actual observed variable values have been modified to
protect subject confidentiality. We refer to this data set as the LOWBWT data.
Table 1.6 Code Sheet for the Variables in the Low Birth Weight Data
Variable | Description | Codes/Values | Name
1 | Identification code | 1 - 189 | ID
2 | Low birth weight | 0 = ≥ 2500 g, 1 = < 2500 g | LOW
3 | Age of mother | Years | AGE
4 | Weight of mother at last menstrual period | Pounds | LWT
5 | Race | 1 = White, 2 = Black, 3 = Other | RACE
6 | Smoking status during pregnancy | 0 = No, 1 = Yes | SMOKE
7 | History of premature labor | 0 = None, 1 = One, 2 = Two, etc. | PTL
8 | History of hypertension | 0 = No, 1 = Yes | HT
9 | Presence of uterine irritability | 0 = No, 1 = Yes | UI
10 | Number of physician visits during the first trimester | 0 = None, 1 = One, 2 = Two, etc. | FTV
11 | Recorded birth weight | Grams | BWT
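Table 1.6 defines LOW directly from the recorded birth weight, so the outcome can be re-derived from BWT as a consistency check. The following is a minimal sketch, assuming birth weights are given in grams. The helper names `low_birth_weight` and `proportion_low` are hypothetical and are not part of the LOWBWT data set.

```python
LOW_BIRTH_WEIGHT_CUTOFF_G = 2500  # grams, from the definition in Section 1.6.2


def low_birth_weight(bwt_grams):
    """Code LOW as in Table 1.6: 1 if birth weight < 2500 g, otherwise 0."""
    return int(bwt_grams < LOW_BIRTH_WEIGHT_CUTOFF_G)


def proportion_low(birth_weights):
    """Fraction of births coded as low birth weight."""
    flags = [low_birth_weight(w) for w in birth_weights]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Made-up weights for illustration; these are not values from the study.
    print(proportion_low([2000, 3000, 2499, 2500]))
```

Applied to the full data set, the proportion would be expected to equal 59/189, about 0.31, the count of low birth weight births stated in the text.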
1.6.3 The Global Longitudinal Study of Osteoporosis in Women
The Global Longitudinal Study of Osteoporosis in Women (GLOW) is an international study of
osteoporosis in women over 55 years of age being coordinated at the Center for Outcomes Research (COR)
at the University of Massachusetts / Worcester by its Director, Dr. Frederick Anderson, Jr. The study
has enrolled over 60,000 women aged 55 and older in ten countries. The major goals of the study are to
use the data to provide insights into the management of fracture risk, patient experience with prevention
and treatment of fractures, and the distribution of risk factors, all among older women on an international
scale over the follow up period. Complete details on the study as well as a list of GLOW publications
may be found at the Center for Outcomes Research web site, www.outcomes-umassmed.org/glow.
Data used here come from six sites in the United States and include a few selected potential risk
factors for fracture from the baseline questionnaire. The outcome variable is any fracture in the first
year of follow up. The incident first-year fracture rate among the 21,000 subjects enrolled in these six
sites is about 4 percent. In order to have a data set of a manageable size, n = 500, for this text we have
oversampled the fractures and undersampled the non-fractures. As a result, associations and
conclusions from modeling these data do not apply to the study cohort as a whole. Data have been
modified to protect subject confidentiality. We thank Dr. Gordon Fitzgerald of COR for his help in
obtaining these data sets. A code sheet for the variables is shown in Table 1.7. This data set is named
the GLOW500 data.
Table 1.7 Code Sheet for Variables in the GLOW Study
Variable | Description | Codes/Values | Name
1 | Identification Code | 1–n | SUB_ID
2 | Study Site | 1-6 | SITE_ID
3 | Physician ID code | 128 unique codes | PHY_ID
4 | History of Prior Fracture | 1 = Yes, 0 = No | PRIORFRAC
5 | Age at Enrollment | Years | AGE
6 | Weight at enrollment | Kilograms | WEIGHT
7 | Height at enrollment | Centimeters | HEIGHT
8 | Body Mass Index | Kg/m² | BMI
9 | Menopause before age 45 | 1 = Yes, 0 = No | PREMENO
10 | Mother had hip fracture | 1 = Yes, 0 = No | MOMFRAC
11 | Arms are needed to stand from a chair | 1 = Yes, 0 = No | ARMASSIST
12 | Former or current smoker | 1 = Yes, 0 = No | SMOKE
13 | Self-reported risk of fracture | 1 = Less than others of the same age, 2 = Same as others of the same age, 3 = Greater than others of the same age | RATERISK
14 | Fracture Risk Score | Composite Risk Score* | FRACSCORE
15 | Any fracture in first year | 1 = Yes, 0 = No | FRACTURE
*: FRACSCORE = 0 × (AGE ≤ 60) + 1 × (60 < AGE ≤ 65) + 2 × (65 < AGE ≤ 70)
+ 3 × (70 < AGE ≤ 75) + 4 × (75 < AGE ≤ 80) + 5 × (80 < AGE ≤ 85)
+ 6 × (AGE > 85) + (PRIORFRAC = 1) + (MOMFRAC = 1)
+ (WEIGHT < 56.8) + 2 × (ARMASSIST = 1) + (SMOKE = 1)
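The FRACSCORE footnote is a sum of indicator terms, which may be easier to check when written as code. The sketch below is one possible transcription, not the study's own implementation: the function names `age_points` and `fracscore` are hypothetical, and each parenthesized condition is read as an indicator equal to 1 when true and 0 otherwise.

```python
def age_points(age):
    """Age component: 0 for AGE <= 60, one more point per five-year band, 6 for AGE > 85."""
    upper_bounds = [60, 65, 70, 75, 80, 85]
    for points, upper in enumerate(upper_bounds):
        if age <= upper:
            return points
    return 6


def fracscore(age, weight_kg, priorfrac, momfrac, armassist, smoke):
    """Composite fracture risk score as defined in the footnote to Table 1.7."""
    return (
        age_points(age)
        + int(priorfrac == 1)
        + int(momfrac == 1)
        + int(weight_kg < 56.8)      # WEIGHT is in kilograms
        + 2 * int(armassist == 1)    # ARMASSIST carries double weight
        + int(smoke == 1)
    )


if __name__ == "__main__":
    # Hypothetical subject: 62 years old, 70.3 kg, no other risk factors.
    print(fracscore(age=62, weight_kg=70.3, priorfrac=0, momfrac=0, armassist=0, smoke=0))
```

Under this reading the score ranges from 0 to 12, with age contributing at most 6 points.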
1.6.4 The Adolescent Placement Study
Fontanella, Early and Phillips (2008) present results from a study of determinants of aftercare placement
for psychiatrically hospitalized adolescents and have made the data, suitably modified to protect
confidentiality, available to us. It is not our intent to repeat the detailed analyses reported in their paper,
but rather to use the data to motivate and describe methods for modeling a multinomial or ordinal scaled
outcome using logistic regression models. As such, we selected a subset of variables, which are
described in Table 1.8. This data set is referred to as the APS data.
Table 1.8 Code Sheet for Variables in the Adolescent Placement Study
Variable | Description | Codes/Values | Name
1 | Identification Code | 1 - 508 | ID
2 | Placement | 0 = Outpatient, 1 = Day Treatment, 2 = Intermediate Residential, 3 = Residential | PLACE
3 | Placement Combined | 0 = Outpatient or Day Treatment, 1 = Intermediate Residential, 2 = Residential | PLACE3
4 | Age at Admission | Years | AGE
5 | Race | 0 = White, 1 = Non-white | RACE
6 | Gender | 0 = Female, 1 = Male | GENDER
7 | Neuropsychiatric Disturbance | 0 = None, 1 = Mild, 2 = Moderate, 3 = Severe | NEURO
8 | Emotional Disturbance | 0 = Not Severe, 1 = Severe | EMOT
9 | Danger to Others | 0 = Unlikely, 1 = Possible, 2 = Probable, 3 = Likely | DANGER
10 | Elopement Risk | 0 = No Risk, 1 = At Risk | ELOPE
11 | Length of Hospitalization | Days | LOS
12 | Behavioral Symptoms Score* | 0-9 | BEHAV
13 | State Custody | 0 = No, 1 = Yes | CUSTD
14 | History of Violence | 0 = No, 1 = Yes | VIOL
*: Behavioral symptom score is based on the sum of three symptom subscales
(oppositional behavior, impulsivity, and conduct disorder) from the CSPI.
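Table 1.8 describes PLACE3 as a collapsed version of PLACE in which the outpatient and day treatment categories are merged. A minimal sketch of that recoding follows. The mapping is inferred from the category labels in the code sheet, and the names `PLACE_TO_PLACE3` and `collapse_placement` are hypothetical.

```python
# Mapping inferred from the labels in Table 1.8 (an assumption, not taken from the data).
PLACE_TO_PLACE3 = {
    0: 0,  # Outpatient               -> Outpatient or Day Treatment
    1: 0,  # Day Treatment            -> Outpatient or Day Treatment
    2: 1,  # Intermediate Residential -> Intermediate Residential
    3: 2,  # Residential              -> Residential
}


def collapse_placement(place):
    """Recode the four-level PLACE variable into the three-level PLACE3."""
    try:
        return PLACE_TO_PLACE3[place]
    except KeyError:
        raise ValueError(f"PLACE must be 0, 1, 2 or 3, got {place!r}") from None


if __name__ == "__main__":
    print([collapse_placement(p) for p in range(4)])
```

Comparing the recoded values against the PLACE3 column supplied in the APS data would be a quick way to confirm that this reading of the code sheet is correct.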
1.6.5 The Burn Injury Study
The April 2008 release (Version 4.0) of the National Burn Repository research dataset (National Burn
Repository 2007 Report, Dataset Version 4.0 accessed on 12/05/2008 at:
http://www.ameriburn.org/2007NBRAnnualReport.pdf) includes information on a total of 306,304 burn
related hospitalizations that occurred between 1973 and 2007. Available information included patient
demographics, total burn surface area, presence of inhalation injury, and blinded trauma center identifiers.
The outcome of interest is survival to hospital discharge. Osler, Glantz and Hosmer (2010) selected a
subset of approximately 40,000 subjects treated between 2000 and 2007 at 40 different burn facilities to
develop a new predictive logistic regression model (see the paper for the details on how this subset was
selected). To obtain a much smaller data set for use in this text we oversampled subjects who died in
hospital and undersampled subjects who lived, giving a data set with n = 1000 and
15 percent in-hospital mortality. As such, all analyses and inferences contained in this text do not
apply to the sample of 40,000, the original data from the registry or the population of burn injury
patients as a whole. These data are used here to illustrate methods when prediction is the final goal as
well as to demonstrate various model building techniques. The variables are described in Table 1.9 and
the data are referred to as the BURN1000 data.
Table 1.9 Code Sheet for Variables in the Burn Study
Variable | Description | Codes/Values | Name
1 | Identification Code | 1 - 1000 | ID
2 | Burn facility | 1 - 40 | FACILITY
3 | Hospital Discharge Status | 0 = Alive, 1 = Dead | DEATH
4 | Age at admission | Years | AGE
5 | Gender | 0 = Female, 1 = Male | GENDER
6 | Race | 0 = Non-White, 1 = White | RACE
7 | Total burn surface area | 0 - 100% | TBSA
8 | Burn involved inhalation injury | 0 = No, 1 = Yes | INH_INJ
9 | Flame involved in burn injury | 0 = No, 1 = Yes | FLAME
1.6.6 The Myopia Study
Myopia, more commonly referred to as nearsightedness, is an eye condition where an individual has
difficulty seeing things at a distance. This condition arises primarily because the eyeball is too long. In an
eye that sees normally, the image of what is being viewed is transmitted to the back portion of the eye,
or retina, and hits the retina to form a clear picture. In the myopic eye, the image focuses in front of the
retina, so the resultant image on the retina itself is blurry. The blurry image creates problems with a
variety of distance viewing tasks (e.g., reading the blackboard, doing homework, driving, playing sports)
and requires wearing glasses or contact lenses to correct the problem. Myopia onset is typically between
the ages of 8 and 12 years with cessation of the underlying eye growth that causes it by age 15-16 years.
The risk factors for the development of myopia have been debated for a long time and include
genetic factors (e.g., family history of myopia) and the amount and type of visual activity that a child
performs (e.g., studying, reading, TV watching, computer or video game playing, and sports/outdoor
activity). There is strong evidence that having myopic parents increases the chance that a child will
become myopic and weaker evidence that certain types of visual activities (called “near work”, e.g.,
reading) increase the chance that a child will become myopic.
These data are a subset of data from the Orinda Longitudinal Study of Myopia (OLSM), a cohort
study of ocular component development and risk factors for the onset of myopia in children. OLSM
evolved into the Collaborative Longitudinal Evaluation of Ethnicity and Refractive Error (CLEERE)
Study. Both OLSM and CLEERE were funded by the National Institutes of Health/National Eye
Institute. OLSM was based at the University of California, Berkeley (see Zadnik, Mutti, Friedman and
Adams (1993) and Zadnik, Satariano, Mutti, Sholtz and Adams (1994)). Data collection began in the
1989-1990 school year and continued annually through the 2000-2001 school year. All data about the
parts that make up the eye (the ocular components) were collected during an examination during the
school day. Data on family history and visual activities were collected yearly in a survey completed by
a parent or guardian.
The dataset used in this text is from 618 of the subjects who had at least five years of follow up
and were not myopic when they entered the study. All data are from their initial exam and include 17
variables. In addition to the ocular data there is information on age at entry, year of entry, family history
of myopia and hours of various visual activities. The ocular data come from a subject’s right eye. A
subject was coded as myopic if they became myopic at any time during the first five years of follow up.
We refer to this data set as the MYOPIA data.
Table 1.10 Code Sheet for Variables in the Myopia Study
Variable | Variable Description | Values / Labels | Variable Name
1 | Subject identifier | 1 - 618 | ID
2 | Year subject entered the study | Year | STUDYYEAR
3 | Myopia within the first five years of follow up¹ | 0 = No, 1 = Yes | MYOPIC
4 | Age at first visit | Years | AGE
5 | Gender | 0 = Male, 1 = Female | GENDER
6 | Spherical Equivalent Refraction² | Diopter | SPHEQ
7 | Axial Length³ | mm | AL
8 | Anterior Chamber Depth⁴ | mm | ACD
9 | Lens Thickness⁵ | mm | LT
10 | Vitreous Chamber Depth⁶ | mm | VCD
11 | How many hours per week outside of school the child spent engaging in sports/outdoor activities | Hours per week | SPORTHR
12 | How many hours per week outside of school the child spent reading for pleasure | Hours per week | READHR
13 | How many hours per week outside of school the child spent playing video/computer games or working on the computer | Hours per week | COMPHR
14 | How many hours per week outside of school the child spent reading or studying for school assignments | Hours per week | STUDYHR
15 | How many hours per week outside of school the child spent watching television | Hours per week | TVHR
16 | Composite of near-work activities⁷ | Hours per week | DIOPTERHR
17 | Was the subject's mother myopic? | 0 = No, 1 = Yes | MOMMY
18 | Was the subject's father myopic? | 0 = No, 1 = Yes | DADMY
1: MYOPIC is defined as SPHEQ <= −0.75 D.
2: A measure of the eye's effective focusing power. Eyes that are “normal” (don’t require glasses or
contact lenses) have spherical equivalents between -0.25 diopters (D) and +1.00 D. The more
negative the spherical equivalent, the more myopic the subject.
3: The length of eye from front to back.
4: The length from front to back of the aqueous-containing space of the eye between the cornea and
the iris.
5: The length from front to back of the crystalline lens.
6: The length from front to back of the aqueous-containing space of the eye in front of the retina.
7: DIOPTERHR = 3 × (READHR + STUDYHR) + 2 × COMPHR + TVHR
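Footnotes 1 and 7 define two derived variables, MYOPIC and DIOPTERHR, and the sketch below transcribes both. It assumes that the factor 3 in footnote 7 applies to the sum READHR + STUDYHR, and that SPHEQ is measured in diopters. The function names `diopter_hours` and `is_myopic` are hypothetical.

```python
def diopter_hours(readhr, studyhr, comphr, tvhr):
    """Near-work composite per footnote 7, assuming 3 multiplies (READHR + STUDYHR)."""
    return 3 * (readhr + studyhr) + 2 * comphr + tvhr


def is_myopic(spheq):
    """Code MYOPIC per footnote 1: 1 if SPHEQ <= -0.75 D, otherwise 0."""
    return int(spheq <= -0.75)


if __name__ == "__main__":
    # Hypothetical weekly hours: 2 reading, 1 studying, 3 computer, 10 television.
    print(diopter_hours(readhr=2, studyhr=1, comphr=3, tvhr=10))
    print(is_myopic(-1.25), is_myopic(0.50))
```

The weights give reading and studying the largest contribution per hour and television the smallest.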