University of Vermont Department of Mathematics and Statistics Stat 229 Applied Logistic Regression
University of Vermont
Department of Mathematics and Statistics
Stat 229 Applied Logistic Regression & Applied Survival Analysis
Spring 2015

Instructor: David W. Hosmer
Office: 23C Hills
Office Hours: 2:30 - 3:30 Monday
Room: TBA, but most likely in the Davis Center
Time: 4:05 - 6:30 Monday
Class Room: VOTEY BLDG 361

Course Objectives:
The first half of the course provides an introduction to the analysis of binary data via the logistic regression model, and the second half focuses on the analysis of time to event data, often called survival analysis. In each part the objective is to provide students with the necessary statistical techniques, computing skills and writing experience required to: (1) Formulate an analysis plan, (2) Select the appropriate computer software package and routines, (3) Perform the analysis, (4) Display the results of the analysis in a collection of tables and graphs and (5) Summarize, in writing, the analysis process with special emphasis on the interpretation of the results. It is assumed that students will have had experience with the linear normal errors regression model. This linear regression model will be used to provide the basic paradigm for analyses in both parts.

Grading:
The grade for the course will be based on the homework and the final exam as follows:
Exam/Final 50 Percent
Homework 50 Percent

Homework/Exams:
Homework will be assigned on a regular basis. The final exam will be a take-home exam covering the entire semester. The exam will use a new data set not used in class.

Texts:
A course booklet containing copies of the PowerPoint slides for each portion of the course will be available for purchase. Students not wishing to purchase the texts can get by with the booklet. However, for students pursuing a career that will involve statistical analyses, the texts would be a useful addition to a personal professional library. Since I have no idea how many students, if any, will buy a text, I have not ordered any through the UVM Bookstore.
You can get a good price at Amazon and there are often used ones available.

Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression: Third Edition, John Wiley & Sons Inc., New York, NY.
Hosmer, D.W., Lemeshow, S. and May, S. (2008) Applied Survival Analysis: Regression Modeling of Time to Event Data: Second Edition, John Wiley & Sons Inc., New York, NY.

Data Sets:
All data sets for logistic regression used in the text are available at a web site at John Wiley & Sons, Inc.
http://wiley.mpstechnologies.com/wiley/BOBContent/searchLPBobContent.do
In addition, the data may also be found, by permission of John Wiley & Sons Inc., in the logistic regression section of the archive of statistical data sets maintained at the University of Massachusetts at
http://www.umass.edu/statdata/statdata/stat-logistic.html
Note: Data sets from both first and second editions are also available here.
One may find code for STATA, SAS and SPSS for the examples in the Second edition of the textbook at
http://statistics.ats.ucla.edu/stat/sas/examples/ALR3

Other Logistic Regression References:
Agresti, A. (2002) Categorical Data Analysis: Second Edition, John Wiley & Sons Inc., New York, NY.
Breslow, N.E. and Day, N.E. (1980) Statistical Methods in Cancer Research Volume I. The Analysis of Case-Control Studies, Oxford University Press, New York.
Breslow, N.E. and Day, N.E. (1987) Statistical Methods in Cancer Research Volume II. The Analysis of Cohort Studies, Oxford University Press, New York.
Collett, D. (2003) Modelling Binary Data: Second Edition, Chapman & Hall / CRC, Boca Raton, FL.
Kleinbaum, D.G. and Klein, M. (2005) Logistic Regression: A Self-Learning Text: Second Edition, Springer-Verlag, New York, NY.

Other Time to Event References:
Andersen, P.K., Borgan, O., Gill, R. and Keiding, N. (1993) Statistical Models Based on Counting Processes, Springer-Verlag, New York, NY.
Collett, D. (1994) Modelling Survival Data in Medical Research, Chapman and Hall, New York, NY.
Fleming, T. and Harrington, D. (1991) Counting Processes and Survival Analysis, John Wiley and Sons Inc., New York, NY.
Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis of Failure Time Data: Second Edition, John Wiley and Sons Inc., New York, NY.
Lawless, J. (2003) Statistical Models and Methods for Lifetime Data: Second Edition, John Wiley and Sons, New York, NY.
Lee, E.T. and Wang, J.W. (2003) Statistical Methods for Survival Data Analysis: Third Edition, John Wiley and Sons Inc., New York, NY.
Nelson, W. (2003) Applied Life Data Analysis: Second Edition, John Wiley and Sons, New York, NY.

Course Outline Logistic Regression

1 Introduction to the Logistic Regression Model (Chapter 1, ALR3*)
  1.1 Introduction
  1.2 Fitting the Logistic Regression Model
  1.3 Testing for the Significance of the Coefficients
  1.4 Confidence Interval Estimation
  1.5 Other Methods of Estimation
2 Multiple Logistic Regression (Chapter 2, ALR3)
  2.1 Introduction
  2.2 The Multiple Logistic Regression Model
  2.3 Fitting the Multiple Logistic Regression Model
  2.4 Testing for the Significance of the Model
  2.5 Confidence Interval Estimation
3 Interpretation of the Fitted Logistic Regression Model (Chapter 3, ALR3)
  3.1 Introduction
  3.2 Dichotomous Independent Variable
  3.3 Polychotomous Independent Variable
  3.4 Continuous Independent Variable
  3.5 The Multivariable Model
  3.6 Interaction and Confounding
  3.7 Estimation of Odds Ratios in the Presence of Interaction
  3.8 Interpretation of the Fitted Values
4 Model-Building Strategies and Methods for Logistic Regression (Chapter 4, ALR3)
  4.1 Introduction
  4.2 Variable Selection
  4.3 Stepwise Logistic Regression (Optional)
  4.4 Best Subsets Logistic Regression (Optional)
  4.5 Numerical Problems
5 Assessing the Fit of the Model (Chapter 5, ALR3)
  5.1 Introduction
  5.2 Summary Measures of Goodness-of-Fit
    5.2.1 Pearson Chi-Square Statistic and Deviance
    5.2.2 The Hosmer-Lemeshow Tests
    5.2.3 Classification Tables
    5.2.4 Area Under the ROC Curve
    5.2.5 R2 Measures
  5.3 Logistic Regression Diagnostics
  5.6 Interpretation and Presentation of Results from a Fitted Logistic Regression Model

Course Outline Survival Analysis

1. Introduction to Regression Modeling of Survival Data (Chapter 1, ASA2)
  1.1. Survival time data
  1.2. Censoring mechanisms
2. Descriptive Methods for Survival Data (Chapter 2, ASA2)
  2.1. Survivorship function and its estimators
  2.2. Use of the estimated survivorship function
  2.3. Comparison of survivorship functions
  2.4. Other functions of survival time and their estimators
3. Regression Models for Survival Time (Chapter 3, ASA2)
  3.1. Semi-parametric regression models
  3.2. Fitting the proportional hazards model
  3.3. Estimating the survivorship function
4. Interpretation of a Fitted Proportional Hazards Model (Chapter 4, ASA2)
  4.1. Nominal scaled covariates
  4.2. Continuous covariates
  4.3. Multiple covariate models
  4.4. Covariate adjusted survivorship function
5. Model Development (Chapter 5, ASA2)
  5.1. Selection of covariates
  5.2. Selecting the scale of continuous covariates
  5.3. Confounding and interactions
6. Assessment of Model Adequacy (Chapter 6, ASA2)
  6.1. Definition of residuals
  6.2. Proportional hazards assumption
  6.3. Influential subjects
  6.4. Overall tests of fit

*: ALR3: Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression: Third Edition, John Wiley & Sons Inc., New York, NY.
ASA2: Hosmer, D.W., Lemeshow, S. and May, S. (2008) Applied Survival Analysis: Regression Modeling of Time to Event Data: Second Edition, John Wiley & Sons Inc., New York, NY.

Abstracted from ALR3.

1.6 Logistic Regression Data Sets Used in Examples and Exercises (From ALR3)

A number of different data sets are used in the examples as well as the exercises for the purpose of demonstrating various aspects of logistic regression modeling. Six of the data sets used throughout the text are described below.
Other data sets are introduced as needed in later chapters. Some of the data sets were used in the previous editions of this text, for example the ICU and Low Birth Weight data, while others are new to this edition. All data sets used in this text may be obtained from links to web sites at John Wiley & Sons Inc. and the University of Massachusetts given in the Preface.

1.6.1 The ICU Study

The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared that have focused on various facets of this problem. The reader wishing to learn more about the clinical aspects of this study should start with Lemeshow, Teres, Avrunin, and Pastides (1988). For a more up-to-date discussion of modeling the outcome of ICU patients the reader is referred to Lemeshow and Le Gall (1994) and to Lemeshow, Teres, Klar, Avrunin, Gehlbach and Rapoport (1993). The actual observed variable values have been modified to protect subject confidentiality. A code sheet for the variables to be considered in this text is given in Table 1.5. We refer to this data set as the ICU data.
Table 1.5 Code Sheet for the Variables in the ICU Data

Variable | Description | Codes/Values | Name
1 | Identification code | ID Number | ID
2 | Vital Status at hospital discharge | 1 = Lived; 0 = Died | STA
3 | Age | Years | AGE
4 | Gender | 0 = Male; 1 = Female | GENDER
5 | Race | 1 = White; 2 = Black; 3 = Other | RACE
6 | Service at ICU admission | 0 = Medical; 1 = Surgical | SER
7 | Cancer part of present problem | 0 = No; 1 = Yes | CAN
8 | History of chronic renal failure | 0 = No; 1 = Yes | CRN
9 | Infection probable at ICU admission | 0 = No; 1 = Yes | INF
10 | CPR prior to ICU admission | 0 = No; 1 = Yes | CPR
11 | Systolic blood pressure at ICU admission | mm Hg | SYS
12 | Heart rate at ICU admission | Beats/min | HRA
13 | Previous admission to an ICU within 6 months | 0 = No; 1 = Yes | PRE
14 | Type of admission | 0 = Elective; 1 = Emergency | TYPE
15 | Long bone, multiple, neck, single area, or hip fracture | 0 = No; 1 = Yes | FRA
16 | PO2 from initial blood gases | 0 = > 60; 1 = ≤ 60 | PO2
17 | PH from initial blood gases | 0 = ≥ 7.25; 1 = < 7.25 | PH
18 | PCO2 from initial blood gases | 0 = ≤ 45; 1 = > 45 | PCO
19 | Bicarbonate from initial blood gases | 0 = ≥ 18; 1 = < 18 | BIC
20 | Creatinine from initial blood gases | 0 = ≤ 2.0; 1 = > 2.0 | CRE
21 | Level of consciousness at ICU admission | 0 = No coma or deep stupor; 1 = Deep stupor; 2 = Coma | LOC

1.6.2 The Low Birth Weight Study

Low birth weight, defined as birth weight less than 2500 grams, is an outcome that has been of concern to physicians for years. This is because infant mortality rates and birth defect rates are higher for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. Data were collected as part of a larger study at Baystate Medical Center in Springfield, Massachusetts. This data set contains information on 189 births to women seen in the obstetrics clinic. Fifty-nine of these births were low birth weight.
The variables identified in the code sheet given in Table 1.6 have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to determine whether these variables were risk factors in the clinic population being served by Baystate Medical Center. Actual observed variable values have been modified to protect subject confidentiality. We refer to this data set as the LOWBWT data.

Table 1.6 Code Sheet for the Variables in the Low Birth Weight Data

Variable | Description | Codes/Values | Name
1 | Identification code | 1 - 189 | ID
2 | Low birth weight | 0 = ≥ 2500 g; 1 = < 2500 g | LOW
3 | Age of mother | Years | AGE
4 | Weight of mother at last menstrual period | Pounds | LWT
5 | Race | 1 = White; 2 = Black; 3 = Other | RACE
6 | Smoking status during pregnancy | 0 = No; 1 = Yes | SMOKE
7 | History of premature labor | 0 = None; 1 = One; 2 = Two, etc. | PTL
8 | History of hypertension | 0 = No; 1 = Yes | HT
9 | Presence of uterine irritability | 0 = No; 1 = Yes | UI
10 | Number of physician visits during the first trimester | 0 = None; 1 = One; 2 = Two, etc. | FTV
11 | Recorded birth weight | Grams | BWT

1.6.3 The Global Longitudinal Study of Osteoporosis in Women

The Global Longitudinal Study of Osteoporosis in Women (GLOW) is an international study of osteoporosis in women over 55 years of age being coordinated at the Center for Outcomes Research (COR) at the University of Massachusetts / Worcester by its Director, Dr. Frederick Anderson, Jr. The study has enrolled over 60,000 women aged 55 and older in ten countries. The major goals of the study are to use the data to provide insights into the management of fracture risk, patient experience with prevention and treatment of fractures, and distribution of risk factors among older women on an international scale over the follow up period. Complete details on the study as well as a list of GLOW publications may be found at the Center for Outcomes Research web site, www.outcomes-umassmed.org/glow.
Data used here come from six sites in the United States and include a few selected potential risk factors for fracture from the baseline questionnaire. The outcome variable is any fracture in the first year of follow up. The incident first-year fracture rate among the 21,000 subjects enrolled in these six sites is about 4 percent. In order to have a data set of a manageable size, n = 500, for this text we have oversampled the fractures and undersampled the non-fractures. As a result, associations and conclusions from modeling these data do not apply to the study cohort as a whole. Data have been modified to protect subject confidentiality. We thank Dr. Gordon Fitzgerald of COR for his help in obtaining these data sets. A code sheet for the variables is shown in Table 1.7. This data set is named the GLOW500 data.

Table 1.7 Code Sheet for Variables in the GLOW Study

Variable | Description | Codes/Values | Name
1 | Identification Code | 1–n | SUB_ID
2 | Study Site | 1-6 | SITE_ID
3 | Physician ID code | 128 unique codes | PHY_ID
4 | History of Prior Fracture | 1 = Yes; 0 = No | PRIORFRAC
5 | Age at Enrollment | Years | AGE
6 | Weight at enrollment | Kilograms | WEIGHT
7 | Height at enrollment | Centimeters | HEIGHT
8 | Body Mass Index | Kg/m2 | BMI
9 | Menopause before age 45 | 1 = Yes; 0 = No | PREMENO
10 | Mother had hip fracture | 1 = Yes; 0 = No | MOMFRAC
11 | Arms are needed to stand from a chair | 1 = Yes; 0 = No | ARMASSIST
12 | Former or current smoker | 1 = Yes; 0 = No | SMOKE
13 | Self-reported risk of fracture | 1 = Less than others of the same age; 2 = Same as others of the same age; 3 = Greater than others of the same age | RATERISK
14 | Fracture Risk Score | Composite Risk Score* | FRACSCORE
15 | Any fracture in first year | 1 = Yes; 0 = No | FRACTURE

*: FRACSCORE = 0 × (AGE ≤ 60) + 1 × (60 < AGE ≤ 65) + 2 × (65 < AGE ≤ 70) + 3 × (70 < AGE ≤ 75) + 4 × (75 < AGE ≤ 80) + 5 × (80 < AGE ≤ 85) + 6 × (AGE > 85) + (PRIORFRAC = 1) + (MOMFRAC = 1) + (WEIGHT < 56.8) + 2 × (ARMASSIST = 1) + (SMOKE = 1)

1.6.4 The Adolescent Placement Study

Fontanella, Early and Phillips (2008) present results from a study of determinants of aftercare placement for psychiatrically hospitalized adolescents and have made the data, suitably modified to protect confidentiality, available to us. It is not our intent to repeat the detailed analyses reported in their paper, but rather to use the data to motivate and describe methods for modeling a multinomial or ordinal scaled outcome using logistic regression models. As such, we selected a subset of variables, which are described in Table 1.8. This data set is referred to as the APS data.

Table 1.8 Code Sheet for Variables in the Adolescent Placement Study

Variable | Description | Codes/Values | Name
1 | Identification Code | 1 - 508 | ID
2 | Placement | 0 = Outpatient; 1 = Day Treatment; 2 = Intermediate Residential; 3 = Residential | PLACE
3 | Placement Combined | 0 = Outpatient or Day Treatment; 1 = Intermediate Residential; 2 = Residential | PLACE3
4 | Age at Admission | Years | AGE
5 | Race | 0 = White; 1 = Non-white | RACE
6 | Gender | 0 = Female; 1 = Male | GENDER
7 | Neuropsychiatric Disturbance | 0 = None; 1 = Mild; 2 = Moderate; 3 = Severe | NEURO
8 | Emotional Disturbance | 0 = Not Severe; 1 = Severe | EMOT
9 | Danger to Others | 0 = Unlikely; 1 = Possible; 2 = Probable; 3 = Likely | DANGER
10 | Elopement Risk | 0 = No Risk; 1 = At Risk | ELOPE
11 | Length of Hospitalization | Days | LOS
12 | Behavioral Symptoms Score* | 0-9 | BEHAV
13 | State Custody | 0 = No; 1 = Yes | CUSTD
14 | History of Violence | 0 = No; 1 = Yes | VIOL

*: Behavioral symptom score is based on the sum of three symptom subscales (oppositional behavior, impulsivity, and conduct disorder) from the CSPI.

1.6.5 The Burn Injury Study

The April 2008 release (Version 4.0) of the National Burn Repository research dataset (National Burn Repository 2007 Report, Dataset Version 4.0 accessed on 12/05/2008 at: http://www.ameriburn.org/2007NBRAnnualReport.pdf) includes information on a total of 306,304 burn related hospitalizations that occurred between 1973 and 2007.
Available information included patient demographics, total burn surface area, presence of inhalation injury, and blinded trauma center identifiers. The outcome of interest is survival to hospital discharge. Osler, Glantz and Hosmer (2010) selected a subset of approximately 40,000 subjects treated between 2000 and 2007 at 40 different burn facilities to develop a new predictive logistic regression model (see the paper for the details on how this subset was selected). To obtain a much smaller data set for use in this text we oversampled subjects who died in hospital and undersampled subjects who lived, giving a data set with n = 1000 and 15 percent in-hospital mortality. As such, all analyses and inferences contained in this text do not apply to the sample of 40,000, the original data from the registry or the population of burn injury patients as a whole. These data are used here to illustrate methods when prediction is the final goal as well as to demonstrate various model building techniques. The variables are described in Table 1.9 and the data are referred to as the BURN1000 data.

Table 1.9 Code Sheet for Variables in the Burn Study

Variable | Description | Codes/Values | Name
1 | Identification Code | 1 - 1000 | ID
2 | Burn facility | 1 - 40 | FACILITY
3 | Hospital Discharge Status | 0 = Alive; 1 = Dead | DEATH
4 | Age at admission | Years | AGE
5 | Gender | 0 = Female; 1 = Male | GENDER
6 | Race | 0 = Non-White; 1 = White | RACE
7 | Total burn surface area | 0 - 100% | TBSA
8 | Burn involved inhalation injury | 0 = No; 1 = Yes | INH_INJ
9 | Flame involved in burn injury | 0 = No; 1 = Yes | FLAME

1.6.6 The Myopia Study

Myopia, more commonly referred to as nearsightedness, is an eye condition where an individual has difficulty seeing things at a distance. This condition arises primarily because the eyeball is too long. In an eye that sees normally, the image of what is being viewed is transmitted to the back portion of the eye, or retina, and hits the retina to form a clear picture.
In the myopic eye, the image focuses in front of the retina, so the resultant image on the retina itself is blurry. The blurry image creates problems with a variety of distance viewing tasks (e.g., reading the blackboard, doing homework, driving, playing sports) and requires wearing glasses or contact lenses to correct the problem. Myopia onset is typically between the ages of 8 and 12 years with cessation of the underlying eye growth that causes it by age 15-16 years.

The risk factors for the development of myopia have been debated for a long time and include genetic factors (e.g., family history of myopia) and the amount and type of visual activity that a child performs (e.g., studying, reading, TV watching, computer or video game playing, and sports/outdoor activity). There is strong evidence that having myopic parents increases the chance that a child will become myopic and weaker evidence that certain types of visual activities (called “near work”, e.g., reading) increase the chance that a child will become myopic.

These data are a subset of data from the Orinda Longitudinal Study of Myopia (OLSM), a cohort study of ocular component development and risk factors for the onset of myopia in children, which evolved into the Collaborative Longitudinal Evaluation of Ethnicity and Refractive Error (CLEERE) Study. Both OLSM and CLEERE were funded by the National Institutes of Health/National Eye Institute. OLSM was based at the University of California, Berkeley (see Zadnik, Mutti, Freidman and Adams (1993) and Zadnik, Satariano, Mutti, Sholtz and Adams (1994)). Data collection began in the 1989-1990 school year and continued annually through the 2000-2001 school year. All data about the parts that make up the eye (the ocular components) were collected during an examination during the school day. Data on family history and visual activities were collected yearly in a survey completed by a parent or guardian.
The dataset used in this text is from 618 of the subjects who had at least five years of follow up and were not myopic when they entered the study. All data are from their initial exam and include 17 variables. In addition to the ocular data there is information on age at entry, year of entry, family history of myopia and hours of various visual activities. The ocular data come from a subject’s right eye. A subject was coded as myopic if they became myopic at any time during the first five years of follow up. We refer to this data set as the MYOPIA data.

Table 1.10 Code Sheet for Variables in the Myopia Study

Variable | Variable Description | Values / Labels | Variable Name
1 | Subject identifier | 1 - 618 | ID
2 | Year subject entered the study | year | STUDYYEAR
3 | Myopia within the first five years of follow up (1) | 0 = No; 1 = Yes | MYOPIC
4 | Age at first visit | years | AGE
5 | Gender | 0 = Male; 1 = Female | GENDER
6 | Spherical Equivalent Refraction (2) | diopter | SPHEQ
7 | Axial Length (3) | mm | AL
8 | Anterior Chamber Depth (4) | mm | ACD
9 | Lens Thickness (5) | mm | LT
10 | Vitreous Chamber Depth (6) | mm | VCD
11 | How many hours per week outside of school the child spent engaging in sports/outdoor activities | Hours per week | SPORTHR
12 | How many hours per week outside of school the child spent reading for pleasure | Hours per week | READHR
13 | How many hours per week outside of school the child spent playing video/computer games or working on the computer | Hours per week | COMPHR
14 | How many hours per week outside of school the child spent reading or studying for school assignments | Hours per week | STUDYHR
15 | How many hours per week outside of school the child spent watching television | Hours per week | TVHR
16 | Composite of near-work activities (7) | Hours per week | DIOPTERHR
17 | Was the subject's mother myopic? | 0 = No; 1 = Yes | MOMMY
18 | Was the subject's father myopic? | 0 = No; 1 = Yes | DADMY

1: MYOPIC is defined as SPHEQ <= −0.75 D.
2: A measure of the eye's effective focusing power.
Eyes that are “normal” (don’t require glasses or contact lenses) have spherical equivalents between -0.25 diopters (D) and +1.00 D. The more negative the spherical equivalent, the more myopic the subject.
3: The length of eye from front to back.
4: The length from front to back of the aqueous-containing space of the eye between the cornea and the iris.
5: The length from front to back of the crystalline lens.
6: The length from front to back of the vitreous-containing space of the eye in front of the retina.
7: DIOPTERHR = 3 × (READHR + STUDYHR) + 2 × COMPHR + TVHR
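
The two derived variables defined in the code sheet footnotes (FRACSCORE in Table 1.7 and DIOPTERHR in Table 1.10) can be sketched in a few lines of Python. This is only an illustration, not code from the texts: the function names `frac_score` and `diopter_hours` are hypothetical, the arguments are assumed to follow the coding in the code sheets (0/1 indicators, weight in kilograms, hours per week), and the DIOPTERHR grouping 3 × (READHR + STUDYHR) is an assumption about where the parentheses in the footnote belong.

```python
def frac_score(age, priorfrac, momfrac, weight, armassist, smoke):
    """Composite fracture risk score following the Table 1.7 footnote.

    Hypothetical helper: indicators are coded 0/1, weight is in kilograms.
    """
    # Age adds 0 to 6 points in five-year bands:
    # <= 60 -> 0, (60, 65] -> 1, ..., (80, 85] -> 5, > 85 -> 6.
    age_cutoffs = (60, 65, 70, 75, 80, 85)
    age_points = sum(age > cutoff for cutoff in age_cutoffs)
    return (
        age_points
        + int(priorfrac == 1)
        + int(momfrac == 1)
        + int(weight < 56.8)
        + 2 * int(armassist == 1)  # needing arms to stand counts double
        + int(smoke == 1)
    )


def diopter_hours(readhr, studyhr, comphr, tvhr):
    """Near-work composite, assuming 3 x (READHR + STUDYHR) + 2 x COMPHR + TVHR."""
    return 3 * (readhr + studyhr) + 2 * comphr + tvhr


# A 72-year-old with a prior fracture, weight 55 kg, who needs her arms to stand:
# 3 (age band) + 1 (prior fracture) + 1 (weight) + 2 (arm assist) = 7
print(frac_score(age=72, priorfrac=1, momfrac=0, weight=55.0, armassist=1, smoke=0))

# 3 * (2 + 3) + 2 * 4 + 5 = 28
print(diopter_hours(readhr=2, studyhr=3, comphr=4, tvhr=5))
```

Recomputing these scores from the raw columns and comparing them with the stored FRACSCORE and DIOPTERHR values is a quick consistency check after reading either data set into software.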