MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA Cristina Botella Pérez
ISBN: 978-84-693-5427-8
Dipòsit Legal: T-1418-2010

WARNING. By consulting this thesis you accept the following conditions of use: Dissemination of this thesis through the TDX service (www.tesisenxarxa.net) has been authorized by the holders of the intellectual property rights only for private use within research and teaching activities.
Reproduction for profit is not authorized, nor is its dissemination or availability from a site outside the TDX service. Presenting its content in a window or frame outside the TDX service (framing) is not authorized. These rights apply to the presentation summary of the thesis as well as to its contents. When using or citing parts of the thesis, the name of the author must be indicated.

UNIVERSITAT ROVIRA I VIRGILI

MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA

Cristina Botella Pérez

DOCTORAL THESIS

Supervised by Dr. Joan Ferré Baldrich and Dr. Ricard Boqué Martí
Department of Analytical Chemistry and Organic Chemistry
Universitat Rovira i Virgili
Tarragona 2010

ROVIRA I VIRGILI UNIVERSITY
Department of Analytical Chemistry and Organic Chemistry

Dr. JOAN FERRÉ BALDRICH and Dr. RICARD BOQUÉ MARTÍ, associate professors of the Department of Analytical Chemistry and Organic Chemistry at Rovira i Virgili University,

CERTIFY:

The Doctoral Thesis entitled 'MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA', presented by CRISTINA BOTELLA PÉREZ to receive the degree of Doctor of the Rovira i Virgili University, has been carried out under our supervision in the Department of Analytical Chemistry and Organic Chemistry at Rovira i Virgili University, and all the results presented in this thesis were obtained in experiments conducted by the above-mentioned student.
Tarragona, March 2010

Dr. Joan Ferré Baldrich
Dr. Ricard Boqué Martí

In the middle of difficulty lies the opportunity
Albert Einstein

The last days of an almost five-year journey are arriving … a journey that has not been easy, full of feelings, experiences and moments shared with many people. People who have been with me for part or all of this thesis and whom I would not want to forget now that we seem to be reaching the end.

Thanks to Dr. Joan Ferré and Dr. Ricard Boqué for trusting me. Joan, Ricard, thank you for your advice and for giving me the opportunity to learn alongside you.

Thanks to all the members of the Chemometrics, Qualimetrics and Nanosensors group for these years. Thanks to the colleagues with whom I have shared the beginning, the whole or the end of the doctorate. And so, without wishing to forget anyone, thanks to the whole group for welcoming me the way you did. Thanks to Vero and Giselle for their support and encouragement, especially at the beginning. Thanks to Idoia, Vane, Santi, Jordi, Jaume, Carol and Kris for all the good moments and the laughter of the best coffee breaks. Joe, in the end my office mate, so many hours shared and so many good moments; I will keep them with me, thank you. Likewise, thanks to Marta S for the encouragement, for caring and for bringing smiles to this thesis. Montse, even from a distance, thank you. Thank you for your e-mails and your encouragement.

Also from a distance, thanks to Sílvia, Laia, David and Lluis; from our homeland you have accompanied me day by day. Your words have always been important.

Laura, Antonio, Rafa ... this journey has made sense thanks to you. THANK YOU for being who you are; never change. Laura, THANK YOU.
Thank you for caring about me, for our talks and for always having a word of support and encouragement ready; thank you for sharing these years with me. Antonio, what can I say … so many years together, THANK YOU. Thank you for the breaks, and for the laughter we have shared and that you managed to draw out of me on the bad days. Thank you for caring and for always being by my side. Rafa, as so many times before, now too I have no words, simply THANK YOU. Thank you for your support, your encouragement in the bad moments and your always well-chosen words. I will keep our long talks with me. Thank you for listening to me and for always being there.

Tomàs, the person who has shared this journey with me, who has supported me in the hardest moments and made it possible for me to reach the end, THANK YOU. Thank you for not letting me give up and for helping me look ahead at all times. I know it has not always been easy.

And what can I say about those who have almost done the thesis for me and with me … my parents, THANK YOU. Thank you for always being by my side and for supporting me in every one of my decisions, believing in my possibilities more than anyone.

To all of you, just one more word: THANK YOU.
Table of contents

Structure
Chapter 1. Introduction
  1.1 Genetic expression
  1.2 Microarrays
    1.2.1 Microarray platforms and experimentation
    1.2.2 Microarray data
    1.2.3 Microarray applications
Chapter 2. Thesis objectives
Chapter 3. Discussion of the implementation of the reject option in Probabilistic-Discriminant Partial Least Squares
  3.1 Introduction
  3.2 Probabilistic discriminant partial least squares
    3.2.1 The partial least squares model
    3.2.2 The probability density function of a class
  3.3 Class prediction
    3.3.1 Classification based on probabilities
    3.3.2 Classification based on risk
  3.4 Discussion of class prediction
  3.5 Probabilistic discriminant partial least squares with reject option
    3.5.1 Reject option as a class
    3.5.2 Reject option as a threshold
  3.6 Implications of reject option in classification performance evaluation
  3.7 Conclusions
Chapter 4. Classification from microarray data using p-DPLS with reject option
Chapter 5. Outlier detection and ambiguity detection for microarray data in p-DPLS regression
Chapter 6. Gene selection based on selectivity ratio for probabilistic discriminant partial least squares
Chapter 7. Multi-class classification of microarray gene expression data
Chapter 8. Conclusions
Appendix
  Datasets
  Abbreviations
  Publications
  Communications

Structure

This thesis is structured in eight chapters.

Chapter 1. Introduction. This chapter gives an overview of DNA microarrays, their origin, types and applications.
The steps involved in the generation of the microarray data, from hybridization to image acquisition and data pre-processing, are described. The need for multivariate data analysis is justified. Finally, the multivariate methods used for analyzing microarray data are cited, focusing on classification methods.

Chapter 2. Thesis Objectives. This chapter describes the aims of this thesis. These objectives are developed in the publications included in the following chapters.

Chapter 3. Discussion of the implementation of the reject option in p-DPLS. This chapter discusses the implementation of the reject option in p-DPLS. Firstly, the calculation of the p-DPLS model and the class prediction process based on the Bayes rule are detailed. Then, the limitations of the classification based on the Bayes rule are discussed. Two approaches to introducing a reject option that overcome these limitations are presented. Finally, the implications of the reject option for the evaluation of classifiers are discussed.

Chapter 4. Classification from microarray data using p-DPLS with reject option. This paper (C. Botella, J. Ferré, R. Boqué, Talanta, 80 (2009) 321-328) describes the implementation of a reject option in p-DPLS models in order to improve the classification of microarray data. The reject option allows a p-DPLS model to leave outliers and ambiguous samples unclassified. This ensures that only the samples whose classification is reliable enough are indeed classified. As a consequence, the number of misclassifications decreases and the accuracy of the classifier improves.

Chapter 5. Outlier detection and ambiguity detection for microarray data in p-DPLS regression. Outlier detection is often overlooked in microarray data analysis with factor-based classification methods. However, outlier diagnostics are required when implementing any classification method in real practice. In this paper (C.
Botella, J. Ferré, R. Boqué, Journal of Chemometrics (2010) Accepted) two procedures typically used in chemometrics are combined with the reject option (Chapter 4) to detect outliers and ambiguous samples in p-DPLS. The application of these diagnostics increases the accuracy of the p-DPLS models and avoids classifying samples from classes that were not modelled.

Chapter 6. Gene selection based on selectivity ratio for probabilistic discriminant partial least squares. Gene selection is a fundamental step in microarray data analysis. It allows both identifying the genes that characterize a certain disease and simplifying and improving classification models by discarding irrelevant genes. In this paper (C. Botella, J. Ferré, R. Boqué, (2010) submitted) a gene selection procedure that is specific for PLS is used to find the best subset of genes that discriminate between different subtypes of tumours and also between healthy and tumour samples. The procedure is based on selecting the genes that maximize the selectivity ratio (SR) index. The paper also shows that the calculated accuracy of a classifier can be largely influenced by how the dataset is split into a training set and a test set. Certain splits can lead to a wrong assessment of the validity of the gene selection algorithm. A repetitive procedure consisting of data split, gene selection, training and validation is proposed in order to test the goodness of the genes selected with the SR index.

Chapter 7. Multi-class classification of microarray gene expression data. In many cases, the samples to be classified from microarray data may belong to more than two subtypes of a disease. The p-DPLS approach used so far only allows discriminating between two subtypes. This chapter (C. Botella, J. Ferré, R. Boqué, (2010) submitted) describes a classification strategy to be used when there are more than two candidate classes. The method combines the predictions from one-versus-one p-DPLS models with the Linear Discriminant Analysis (LDA) classifier.
Chapter 8. Conclusions. This chapter sums up the improvements achieved by the methods presented in this thesis.

The Appendix contains a description of the datasets used in this thesis, the list of the abbreviations used, and the list of papers and presentations produced during this period.

CHAPTER 1
Introduction

1.1 Genetic expression

Deoxyribonucleic acid (DNA) molecules are the genetic material of most living organisms [1]. They are chains of nucleotides (Figure 1). A nucleotide consists of a phosphate group, a deoxyribose sugar molecule and a nitrogenous base (guanine, cytosine, adenine or thymine) [1]. Genes are sequences of hundreds or thousands of these nucleotides that encode the genetic information to make specific proteins [2].

Figure 1. DNA chain. Source: [3].

Protein formation involves a transcription process, in which the genes are transcribed into messenger RNA (mRNA) by the RNA polymerase enzyme [1, 4], followed by a translation process, in which the amino acids encoded by the mRNA codons are joined in the presence of transfer RNA (tRNA) and ribosomal RNA (rRNA) (Figure 2).

Figure 2. Transcription and translation processes in the making of a protein. Source: [3].
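The two steps described above can be illustrated with a toy sketch. It is only an illustration under stated assumptions: the input is taken to be the coding (sense) DNA strand, so the mRNA is obtained by replacing thymine with uracil, and the codon table lists just a handful of entries from the standard genetic code. The function names and the example sequence are hypothetical and do not come from the thesis.

```python
# Toy illustration of transcription followed by translation.
# Assumption: the input is the coding (sense) strand, so the mRNA has the
# same sequence with uracil (U) in place of thymine (T).
# Only a few codons of the standard genetic code are included.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": None,   # stop codon
}


def transcribe(coding_dna):
    """Return the mRNA corresponding to a coding DNA strand."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon until a stop codon is found."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon: the chain ends here
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCGCTTAA")  # hypothetical gene fragment
protein = translate(mrna)
print(mrna, protein)
```

A full implementation would use the complete 64-codon table and would derive the mRNA from the template strand; the shortcut here only keeps the example short.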
Genes regulate protein expression and, consequently, the metabolic processes of living organisms. Some genes are only expressed in particular cell types or at certain developmental stages [5], so these genes (or their expressed intermediate, mRNA) can be seen as markers to define particular cellular states, such as healthy or tumour [6].

1.2 Microarrays

Microarray technology is a powerful tool for simultaneously evaluating the expression level of thousands of genes in a cell [2] and, hence, the information that is encoded in the DNA [6].

A microarray is a microscopic slide that contains an ordered series of DNA, RNA, proteins or tissues. DNA microarrays are the most common [7]. A DNA microarray is generally a glass slide or a silicon chip on which thousands of gene sequences are printed (Figure 3). On every spot, many copies of a specified DNA sequence are chemically bonded to the surface of the slide [2]. The genes immobilized onto the slide are called the DNA probe. Over this DNA probe, the target DNA or the target RNA (depending on the microarray platform) obtained from the cell under study is hybridized (hydrogen bonded). The amount of hybridization is measured and related to the presence and expression of certain genes in the cell.

Figure 3. Parts of a microarray. 1. Slide, 2. Probe DNA, 3. Target DNA. Source: Affymetrix.

Figure 4 shows the workflow of a microarray experiment. The experimental process varies depending on the microarray platform that is used (see below). After the data have been measured and pre-processed, multivariate analysis is needed to deal with the large amount of data that every microarray experiment generates [7-10].
Figure 4. Microarray workflow process: experimental design (biological question, goal); experimental process (probe and target DNA preparation, printing, hybridization); data extraction (data acquisition, data pre-processing, quantification); data analysis (gene selection, cluster analysis, data classification).

1.2.1 Microarray platforms and experimentation

The first DNA array was developed by Ed Southern in 1975 [10]. Southern noticed that labelled nucleic acid molecules could be used to evaluate other molecules linked on a solid support. He used the array to verify the presence or absence of a specific DNA sequence from different sources and to identify the size of the restriction fragment.

In 1995, an in-situ probe synthesis method for photolithographically manufacturing DNA arrays was developed by Fodor et al. [11] and commercialized by Affymetrix Inc. At the same time, pre-synthesized DNA microarrays were popularized by Patrick O. Brown's laboratory at Stanford University [12], which published step-by-step plans for building a robotic DNA arrayer [13]. Together with the development of the Southern blot, this was one of the milestones in microarray development, because Brown's method made microarrays affordable for research laboratories, whereas the early methods for manufacturing miniaturized DNA arrays using in-situ probe synthesis required sophisticated and expensive robotic equipment.

Nowadays, there are two main microarray platforms, namely cDNA arrays (where c means complementary) and oligonucleotide arrays. They differ in the preparation and content of the probe, and also in the sample preparation [2, 5] (Table 1). Figure 5 shows the experimental procedure in a cDNA microarray experiment and in an in-situ oligonucleotide microarray experiment.

In cDNA microarrays, the probes are cDNAs typically 100-300 bases long.
A cDNA strand is a DNA strand synthesized using a reverse transcriptase enzyme, which makes a DNA sequence complementary to the mRNA present in cells [2]. Note that what are commonly called DNA microarrays are actually cDNA microarrays. The target sample consists of Cy5-labelled cDNA chains from the test sample and Cy3-labelled cDNA chains from the reference sample [2, 5]. After the sample has been hybridized, microarrays are washed for several minutes in decreasing salt buffers and finally dried, either by centrifugation of the slide or by a rinse in isopropanol followed by quick drying with nitrogen gas or filtered air [7]. The raw microarray data are obtained by exciting the fluorescent dyes at each spot and scanning the microarray. One intensity value is generated by the emission from the Cyanine 3 (Cy3) fluorophore and another from Cyanine 5 (Cy5). The total fluorescence emitted by the spot at each wavelength is proportional to the total amount of the dye in the spot. Hence, it is proportional to the total amount of reference or test sample hybridized. When the images of both dyes (colour channels) are mixed, the typical microarray picture is obtained [1, 7, 14]. The colours on the microarray image correspond to the four possible hybridization situations (Figure 6): no hybridization (black spot), reference sample hybridization (green spot), test sample hybridization (red spot), and test and reference sample hybridization (yellow spot). Different intensities of the colours indicate different levels of hybridization.

DNA microarray images from different samples are then transformed into gene expression data matrices. Each row of the matrix corresponds to a sample and each column corresponds to a gene. Each value characterizes the expression level of the particular gene in that particular sample.
The gene expression is given by the ratio between the intensities in the red and the green channels, which are directly related to the level of expression of the transcript [1].

In the in-situ oligonucleotide arrays, produced by Affymetrix, each gene is represented as a probe set of 10-25 oligonucleotide pairs¹ instead of one full-length or partial cDNA clone. These probes are synthesized directly on the surface of the support. The target sample is a biotin-labelled cDNA [2, 7]. In contrast to the spotted cDNA arrays, in this case the test and the reference samples are hybridized separately on different chips; data acquisition is then done by scanning the probe array. The scan yields about 8 × 8 pixels (on average) for each probe cell. A single intensity value for every probe cell, representative of the hybridization level of its target, is derived. Finally, the gene expression is given by the differences between PM and MM [1]. The gene expressions of all genes analysed for a sample are given in a row of the gene expression matrix.

Table 1. Types of microarrays. Source: [1].

Probe              Arraying technique    Microarray platform
cDNA               Robotic spotting      Spotted cDNA microarrays
Oligonucleotides   Robotic spotting      Spotted oligonucleotide microarrays
Oligonucleotides   In-situ synthesis     In-situ oligonucleotide microarrays

¹ The oligonucleotide pair (probe pair) comprises one oligonucleotide that perfectly matches the gene sequence (Perfect Match, PM) and a second oligonucleotide having one nucleotide mismatch in the middle of it (Mismatch, MM).
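As a minimal sketch of how the two kinds of expression value described above could be computed, the following example builds a small gene expression matrix (rows are samples, columns are genes) from two-channel intensities and summarizes one probe set from its PM and MM intensities. The function names and all intensity values are hypothetical, and averaging the PM - MM differences over the probe pairs is an assumption made here for illustration; the text only states that the expression is given by the differences between PM and MM.

```python
import math


def cdna_log_ratio(cy5, cy3):
    """log2 of the test (Cy5, red) over reference (Cy3, green) intensity of one spot."""
    return math.log2(cy5 / cy3)


def probe_set_signal(pm, mm):
    """Average PM - MM difference over the probe pairs of one gene (assumed summary)."""
    differences = [p - m for p, m in zip(pm, mm)]
    return sum(differences) / len(differences)


# Hypothetical (Cy5, Cy3) intensities: one list per sample, one pair per gene.
samples = {
    "sample_1": [(2000, 1000), (500, 1000)],
    "sample_2": [(800, 800), (3200, 800)],
}

# Gene expression matrix: each row is a sample, each column is a gene.
expression_matrix = [
    [cdna_log_ratio(cy5, cy3) for cy5, cy3 in spots]
    for spots in samples.values()
]

# Hypothetical probe set with three probe pairs for one gene on one chip.
signal = probe_set_signal(pm=[120, 150, 90], mm=[100, 110, 80])
print(expression_matrix, signal)
```

Commercial software uses more elaborate probe-set summaries than a plain average, so the second function should be read only as a placeholder for that step.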
Figure 5. cDNA and in-situ oligonucleotide microarray sample preparation, hybridization and data measurement. For cDNA arrays, total RNA is extracted from the cells or tissue of the test and reference samples, reverse transcribed and labelled (Cy5-labelled test cDNA, Cy3-labelled reference cDNA), then mixed and hybridized; laser excitation and emission give log2(Cy5/Cy3). For in-situ oligonucleotide arrays, total RNA from each sample goes through cDNA synthesis with a T7 promoter and in vitro transcription of the double-stranded cDNA to biotin-labelled cRNA, giving log(PM-MM). Both routes end in the gene expression data matrix (genes, samples).

1.2.2 Microarray data

The experimental steps involved in a microarray workflow, from microarray manufacture to microarray data extraction (Figures 4 and 5), may introduce noise and variability in the data. Common sources of variability in microarray experiments are variations related to microarray manufacturing and variations related to microarray scanning [7]. Variability related to microarray manufacturing is due to dye effects, slide effects or print-tip effects. The variability of microarray scanning is due to scanner manufacturing and to a non-specific background. The most common origins of both [15] are summarized in Table 2. To minimize the effect of the sources of variation that may affect microarray data, proper data pre-processing is fundamental in microarray data analysis. This pre-processing transforms the data to make them suitable for analysis [1]. Pre-processing of microarray data is done in the steps described next [16].

Table 2. Sources of variation of microarray data.
Microarray manufacturing
  Dye effects: different incorporation of dyes; dye instability; gene label interaction.
  Slide effects: printing variability; different pin efficiency over time; array coating; slide inhomogeneities; efficiency of the hybridization reaction; background noise on the slide.
  Spatial, print-tip or plate effects: different amounts of RNA of probes and DNA target sample; temperature and humidity; PCR amplification; sample preparation protocols.
Microarray scanning
  Scanner manufacture, for example a wrongly adjusted or misaligned laser.
  Non-specific background and overshining: non-specific radiation and signals from neighbouring spots.
  Image analysis: non-linear transmission characteristics, saturation effects and variations in spot shape.
Abbreviations. PCR: polymerase chain reaction.

Background subtraction

Signal intensities of a gene include contributions from non-specific hybridizations and other fluorescence from the glass. This background fluorescence is estimated from the pixels that are near the feature but are not part of a spot [17]. The local background for each channel and spot is evaluated on small regions surrounding the spot mask (region 2 in Figure 6). Then, the median or the mean of the pixel values in this region is calculated for each channel and subtracted from the spot intensity [14].

A less used alternative calculates a global background for each slide: the average of the negative control spot intensities is used as the background value, the empty spots being the negative control spots.

Figure 6. Scanned microarray image. 1. Feature pixels, 2. Background pixels, 3. Two-pixel exclusion region. Source: GENEPIX PRO [17].

In in-situ oligonucleotide arrays a local background is calculated for each probe and then a weighted combination of these backgrounds is subtracted from all the probes of the microarray.
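The local and global background corrections for spotted arrays can be sketched as follows. This is an illustrative example only: the pixel values are invented, the function names are hypothetical, and summarizing the feature pixels by their median is an assumption (the text allows the median or the mean for the background region and does not fix the summary of the spot itself).

```python
from statistics import mean, median


def local_background_corrected(feature_pixels, background_pixels):
    """Median feature intensity minus the median of the surrounding background pixels."""
    return median(feature_pixels) - median(background_pixels)


def global_background(negative_control_spots):
    """Slide-wide background: average intensity of the empty (negative control) spots."""
    return mean(negative_control_spots)


# Hypothetical pixel intensities of one spot in one channel.
feature = [1200, 1250, 1300, 1280, 1190]   # region 1: feature pixels
background = [100, 120, 110, 90]           # region 2: local background pixels
corrected = local_background_corrected(feature, background)

# Hypothetical intensities of three empty spots on the same slide.
slide_background = global_background([95, 105, 100])
print(corrected, slide_background)
```

The same correction would be applied separately to each channel of each spot before the two channels are compared.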
Treatment of missing values

Microarray datasets frequently contain missing values, either because the spot is empty (intensity = 0) or because the background intensity is higher than the spot intensity (background-corrected intensity < 0). These values need to be deleted, or estimated and replaced in a process called imputation, for subsequent data mining [18].

In the imputation, the missing values may be replaced by 1 (since log(1) = 0, which means no gene expression) or by the mean of the intensities of the gene among all the samples.

In particular, in Affymetrix datasets, when the intensity of the Mismatch probe cell is higher than the Perfect Match intensity, the probe pair has no physiological sense; in such a case a value called Change Threshold is used instead of the Mismatch intensity [7].

Filtering bad data

Filtering excludes from the data the observations that do not fulfil a pre-formulated presumption [4], for example intensity values that are too low to be trusted due to instrumental limitations of the scanner. Typically, the lowest intensity value of reliable microarray data, referred to as the "floor", is 10. Values below the "floor" are usually removed (filtered) from the data because they are not reliable enough. Similarly, the array elements at the high end of the fluorescence intensities may saturate the detector. The threshold referred to as the "ceiling" is set at 16,000, and values over the "ceiling" are removed too [4, 19].

Fold change, log2 (twofold)

In cDNA microarrays the expression of a gene in a sample is the ratio of the intensities in both channels for that gene. Although these ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and down-regulated genes differently.
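The filtering and imputation rules above, followed by the channel ratio just defined, can be sketched in a few lines. This is a hedged illustration rather than the procedure used in the thesis: the intensities are invented, the function names are hypothetical, and treating filtered values as missing values that are then imputed with the gene mean is an assumption that chains two steps the text describes separately.

```python
FLOOR = 10        # lowest trusted intensity ("floor") quoted in the text
CEILING = 16000   # saturation threshold ("ceiling") quoted in the text


def filter_intensity(value):
    """Return None (missing) for intensities outside [FLOOR, CEILING]."""
    if value < FLOOR or value > CEILING:
        return None
    return value


def impute_gene_mean(values):
    """Replace missing entries of one gene by the mean of its remaining samples."""
    present = [v for v in values if v is not None]
    if not present:  # nothing to average: leave the gene untouched
        return list(values)
    gene_mean = sum(present) / len(present)
    return [gene_mean if v is None else v for v in values]


# Hypothetical intensities of one gene in four samples, for both channels.
test_channel = [4, 400, 700, 20000]
reference_channel = [200, 200, 1400, 300]

test_clean = impute_gene_mean([filter_intensity(v) for v in test_channel])
reference_clean = impute_gene_mean([filter_intensity(v) for v in reference_channel])

# Expression ratio (test over reference) for each sample.
ratios = [t / r for t, r in zip(test_clean, reference_clean)]
print(test_clean, ratios)
```

In this example the second sample has a ratio of 2 and the third a ratio of 0.5, although both differ from the reference by the same factor.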
Genes up-regulated by a factor of 2 have an expression ratio of 2, whereas those down-regulated by the same factor have an expression ratio of 0.5. The most widely used transformation of the ratio is the base-2 logarithm, which treats up-regulated and down-regulated genes symmetrically, so that a gene up-regulated by a factor of 2 has a log2(ratio) = 1, a gene down-regulated by a factor of 2 has a log2(ratio) = -1, and a gene expressed at a constant level (with a ratio of 1) has a log2(ratio) = 0. Therefore, log2(ratio) will be used to represent expression levels [19].

In some cases the log transformation may be too "strong" and have the effect of increasing the importance of the low intensities. Then, a weaker transformation such as the cube root is used [6].

Normalization

Normalization consists of removing arbitrary variations in the measured gene expression levels of hybridized samples so that biological differences (different gene expressions) can be more easily distinguished. Table 3 summarises the main normalization criteria used and the systematic variation they remove.

The most used method is the LOcally WEighted Scatterplot Smoothing (LOWESS) correction [20] for non-linear data, and total intensity normalization or median subtraction otherwise. In in-situ microarray analysis a separate probe array experiment is performed, which is used by scaling techniques to minimize differences
Normalizationmethods Dye effects Slide effects spatial effects Scanner effects between arrays LOWESScorrectionforeach printͲtip[7,15,16] Linearcorrection [16,21] Totalintensitynormalization [19] TwodyesCy3andCy5 [7,15] Doubledyeexperimentation, dyingasampleoncewithCy5 andwithCy3inthesecond experiment[19,22] Ratiosvaluesescalationacross theslides[19,22] Housekeepinggenes[15] 1.2.3Microarrayapplications ThefirstmicroarraypaperfeaturedthesmallmustardplantArabidopsisthaliana[23], but the technology quickly spread to yeast [24], mouse [25], and human [26, 27] studies. Present main applications of microarrays [28] include the identification of genetic individuality of tissues or organisms (e.g. detection of single nucleotide polymorphisms,SNPs)[7,29],theinvestigationofcellularstatesandprocesses(such 29 UNIVERSITAT ROVIRA I VIRGILI MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA Cristina Botella Pérez ISBN:978-84-693-5427-8/DL:T-1418-2010 Introduction asthesporulationprocess)[30],thediagnosisofgeneticandinfectiousdiseases[31Ͳ 33],theidentificationofthesubtypesofacertaindisease[34,35],thedetectionof geneticwarningsigns[36]orthedrugselection[37]. Inthelargenumberofinvestigationareas,oncologyhasbecomethemainfieldofDNA microarrayapplications[38].Generalaspectsofcancerexpressionprofilinghavebeen extensively reviewed [39Ͳ41]. It has been shown that subclassification of tumours based on their molecular profiles may help to explain why these tumours respond differently to treatment. Golub et al. [34] were the first to use microarray gene expressiondatatodistinguishbetweenacutemyeloidleukemiaandacutelymphocytic leukemia. Posterior studies allowed distinguishing samples of adult versus paediatric leukemia[42],differentsubtypesofleukemia[43]andtheirmolecularcharacterization [44]. Recently, Su et al. [45] and Ross et al. 
[46] used large-scale RNA profiling to construct a molecular classification of different carcinomas (prostate, lung, ovary, colorectum, kidney, liver, pancreas, bladder/urethra, and gastroesophagus). Additional research on diagnosis by genetic profiling has been done for different cancers [47, 48]. In breast cancer, microarrays permitted differentiating between tumour types corresponding to BRCA1, BRCA2 and sporadic mutations [13, 49], between the estrogen receptors [50] and between the stages in cancer progression [31]. In melanoma, most of the efforts have been applied to differentiating between metastatic and non-metastatic tissues [51, 52], and in hepatocellular carcinoma the research has involved following cancer progression [53]. In other types of tumours, diagnosis has been the main target. This is the case of bladder cancer [54], cutaneous squamous cell cancer [55], and lung cancer [56]. In the fields of colon [57], prostate [23], liver [58], glioma [59] and epithelial [60] cancers, the research has focused on the differentiation between tumour and normal tissues, and in the case of lymphoma [35], medulloblastoma [61] and adenocarcinoma [62] on the differentiation between their subtypes.

In non-oncological clinical diagnosis, DNA microarrays are used to search for the expression patterns characteristic of complex genetic disorders [47] such as diabetes [33], obesity [63, 64], and schizophrenia [65].
Microarrays have also been used in transplantation research, for example in renal transplantation to generate gene expression profiles of renal biopsies for the diagnosis of acute rejection [66], and in the diagnosis of infectious diseases, to detect gene sequences in the genomes of Mycobacterium tuberculosis, HIV [67, 68] and other pathogens, with the aim of providing a diagnostic tool that detects the expression of antibiotic resistance genes or specific viral subtypes [38].

Another important application of DNA microarrays is the identification of the genes that are responsible for a certain disease [48]. One of the first papers that reported the use of microarrays for this purpose identified the genes differentially expressed between a rat strain with insulin resistance and a normal insulin-sensitive control strain [69]. Since then, microarrays have been applied to identify genes involved in many different cancers [70-72], in tumour progression [73] and in many other clinical fields such as neuronal diseases [74, 75]. In the last few years many methods have been developed to identify the most relevant genes for a certain diagnosis. Three major groups of methods exist: filters, wrappers and embedded techniques [76]. These methods have been based on Genetic Algorithms [77], Random Forests [78], SVM weights [79], t-tests or the Wilcoxon test [80], to cite a few. Most of these criteria are univariate (i.e. each feature is evaluated independently) and thus simple to interpret, but they omit interactions and correlations between genes during gene selection [81]. However, these interactions must be taken into account, since it has been shown that there exist pairs of genes that are coexpressed. Put simply, if the expression levels of two genes are similar, we can hypothesize that the respective genes are co-regulated and possibly functionally related [82]. More accurately, these coexpressions have been established from the correlation of expression profiles or from functional and chromosomal structural information [83, 84].

Linked with the identification of the genes that are responsible for a disease, microarrays have also been applied to find the mutations that are responsible for the disease phenotype. Although there are numerous methods for identifying mutations, microarrays may best satisfy the need for a rapid, accurate and cost-effective method for genetic polymorphism identification [47]. This identification has been presented as the foundation of pharmacogenomics. In the near future, pharmacogenomics aims to optimize the dose and drug formulation and to predict good and adverse clinical responses to individual drugs, using microarrays for personalized medicine [38, 47, 85].

The huge amount of data generated in each microarray experiment calls for multivariate techniques for its analysis. In one of the first studies with microarrays, Golub et al. [34] applied a two-cluster self-organizing map to group 38 samples of leukemia into two classes. Eisen et al. [86] used hierarchical clustering to find genes with similar functions. Hierarchical clustering has also been used to discover two molecularly distinct types of diffuse large B-cell lymphoma, in which the patients in the two subgroups showed significant differences in overall survival [35], and to categorize breast cancer into its subtypes [87]. PCA has been applied to discriminate between different tumour tissues, including colon carcinoma, breast carcinoma, central nervous system tumour, lung cancer, leukemia, melanoma, ovarian carcinoma, and prostate cancer [88]. The same analysis has been performed with k-means clustering [88].
Multivariate supervised classification methods are probably the most important tools for microarray data analysis. Such methods can be used to identify differentially expressed genes, to find subgroups of samples, to differentiate between different states of a tumour and to infer the class of a sample from its gene expression microarray data. In general terms, the aim of any classifier is to build a decision rule from a preclassified dataset and use it to assign a new unlabeled sample to one or none of the predefined classes. A large number of classification methods have been used in microarray data analysis. The main studies are summarized in Table 4.

Table 4. Classification references for microarray gene expression data.

Classification method: Objective of the study

SVM:
  Differentiate between ovarian cancer tissues, normal ovarian tissues and other normal tissues [89].
  Recognize five sets of genes in functional classes that were expected to be co-regulated: those mediating the tricarboxylic acid cycle, respiration, cytoplasmic ribosome biosynthesis, proteasome biosynthesis and histone biosynthesis [90].

TPCR:
  Discriminate between tumours from a variety of tissues and organs, e.g. between subtypes of leukemia and the mutations of breast cancer [91].
  Differentiate between round blue cell tumours of childhood (neuroblastoma, rhabdomyosarcoma, non-Hodgkin lymphoma and Ewing family of tumours) [91].

NN:
  Classify cancer samples into the same four groups of childhood cancer [92].
  Investigate the gene expression patterns associated with estrogen receptor status in sporadic breast cancer [93].

MCR-ALS:
  Classify types of leukemia [94].
  Differentiate between nine types of tumour samples (breast cancer, central nervous system tumour, colon carcinoma, lung cancer, leukemia, melanoma, ovarian carcinoma, prostate cancer and renal carcinoma) [94].

SOM + k-means clustering:
  Classify subtypes of oral cancer [95].

KNN:
  Select a subset of genes to classify subtypes of leukemia and a subset to discriminate between tumour and healthy colon samples [96].

PLS (for dimension reduction) + LD or QLD:
  Differentiate between tumour and healthy colon and ovarian samples [97].
  Differentiate between the subtypes of a cancer such as lymphoma or leukemia [97].

PLS + PHR:
  Predict patient survival probabilities [98].

PLS + RPLR:
  Classify samples of two types of leukemia [99].
  Differentiate between healthy and tumour colon samples [99].

DPLS:
  Differentiate between samples before and after chemotherapy [100].
  Identify the estrogen receptor status [100].
  Differentiate the states of a breast cancer tumour [101].
  Predict drug efficacy using expression data biomarkers [102].
  Identify the most relevant genes correlated with a certain tumour [103].
  Predict the quality of a DNA microarray spot [104].
  Classify tumour samples (different types of lymphoma and breast cancer) [105].
  Differentiate between healthy samples and samples of carcinoma, colon and prostate tumour [105].
  Identify genes whose expression appears to be synchronized with cell cycling [106].
  Identify genes with periodic fluctuations in expression levels coupled to the cell cycle in the budding yeast [106].
  Select the few gene expressions that are the most effective in discriminating tumour types (melanoma, colon, leukemia and renal tumour cells) [103, 107].
  Identify new lung cancer molecular markers with diagnostic value [108].

Abbreviations. SVM: Support vector machines, TPCR: Total principal component regression, NN: Neural networks, MCR-ALS: Multivariate curve resolution alternating least squares, SOM: Self-organizing maps, KNN: K-nearest neighbours, PLS: Partial least squares, LD: Logistic discrimination, QLD: Quadratic logistic discrimination, PHR: Proportional hazard regression, RPLR: Ridge penalized logistic regression, DPLS: Discriminant partial least squares.
Recently, interest in Discriminant Partial Least Squares (DPLS) has increased [109, 110]. This interest arises from the high computational efficiency, flexibility and versatility of the method for the microarray classification problems addressed, and from the existence of a variety of algorithmic variants [110]. Hence, improving the DPLS model in order to obtain better classification models and performance plays a key role in gene expression microarray data classification.

References

[1] Ting-Lee, M.L., Analysis of microarray gene expression data. 2004, USA: Kluwer Academic Publishers.
[2] Higgs, P.G. and T.K. Attwood, Bioinformatics and Molecular Evolution, ed. B.S. Ltd. 2006: Blackwell Publishing.
[3] U.S. Department of health and human services, The New Genetics, in NIH Publication No. 07-662. 2006.
[4] Baldi, P. and G.W. Hatfield, DNA microarrays and Gene expression. From Experiments to Data Analysis and Modeling. 2002, Cambridge: Cambridge University Press.
[5] Primrose, S.B. and R.M. Twyman, Principles of Gene Manipulation and Genomics (7th edition). 2006: Blackwell Publishing.
[6] Göhlmann, H. and W. Talloen, Gene Expression Studies Using Affymetrix Microarrays. Mathematical and Computational Biology Series, ed. C. Hall. 2009: Taylor & Francis Group, LLC.
[7] Pasanen, T., et al., DNA Microarray Data Analysis. 2003, Helsinki: Ed. CSC - The Finnish IT center for Science.
[8] Allison, D.B., et al., Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews (Genetics), 2006. 7: p. 55-65.
[9] Liew, A.W.-C., H. Yan, and M. Yang, Pattern Recognition techniques for the emerging field of bioinformatics: A review. Pattern Recognition, 2005. 38: p. 2055-2073.
[10] Southern, E., Detection of specific sequences among DNA fragments separated by gel electrophoresis. Journal of Molecular Biology, 1975. 98: p. 503-507.
[11] Fodor, S.P., et al., Multiplexed biochemical assays with biological chips. Nature, 1993. 364: p. 555-556.
[12] Schena, M., et al., Quantitative monitoring of gene expression patterns with complementary DNA microarray. Science, 1995. 270: p. 467-470.
[13] Hedenfalk, I., et al., Gene Expression profiles in hereditary breast cancer. The New England Journal of Medicine, 2001. 344: p. 539-548.
[14] Mada, H., Microarray Data Analysis (I), Part A: cDNA spotted Microarray. Material of Data Analysis Course. http://www.sinica.edu.tw/~hmwu/CourseSMDA/index.htm, Academia Sinica: Institute of Statistical Science: Taiwan.
[15] Schuchhardt, J., et al., Normalization strategies for cDNA microarrays. Nucleic Acids Research, 2000. 28: p. e47.
[16] Berrar, D., W. Dubitzky, and M. Granzow, A practical approach to microarray data analysis. 2004, USA: Kluwer Academic Publishers.
[17] http://www.moleculardevices.com/.
[18] Wang, D., et al., Effects of replacing the unreliable cDNA microarray measurements on the disease classification based on gene expression profiles and functional modules. Bioinformatics, 2006. 22: p. 2883-2889.
[19] Quackenbush, J., Extracting biology from high-dimensional biological data. J Exp Biol, 2007. 210: p. 1507-1517.
[20] Cleveland, W.S., Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 1979. 74: p. 829-836.
[21] Kepler, T.B., L. Crosby, and K.T. Morgan, Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology, 2002. 3: p. 1-12.
[22] Yang, Y.H., et al., Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 2002. 30: p. e15.
[23] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[24] Shalon, D., S.J. Smith, and P.O. Brown, A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 1996. 6: p. 639-645.
[25] Lockhart, D.J., et al., DNA Expression monitoring by hybridization to high density oligonucleotide arrays. Nature Biotechnology, 1996. 14: p. 1675-1680.
[26] Baldini, A. and D.C. Ward, In situ hybridization banding of human chromosomes with Alu-PCR products: a simultaneous karyotype for gene mapping studies. Genomics, 1991. 9: p. 770-774.
[27] Ried, T., et al., Multicolor fluorescence in situ hybridization for the simultaneous detection of probe sets for chromosomes 13, 18, 21, X and Y in uncultured amniotic fluid cells. Human Molecular Genetics, 1992. 1: p. 307-313.
[28] Lesk, A.M., Introduction to Bioinformatics (3rd Edition). 2008: Oxford University Press.
[29] Butcher, L.M., et al., SNPs, microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Human Molecular Genetics, 2005. 14: p. 1315-1325.
[30] Friedlander, G., et al., Modulation of the transcription regulatory program in yeast cells committed to sporulation. Genome Biology, 2006. 7: article R20.
[31] Veer, L.J.v.t., et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002. 415: p. 530-535.
[32] Thomas, R.S., et al., Identification of toxicologically predictive gene sets using cDNA microarrays. Molecular Pharmacology, 2001. 60: p. 1189-1194.
[33] Mootha, V.K., et al., PGC-1a-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature, 2003. 34(3): p. 266-273.
[34] Golub, T.R., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 1999. 285: p. 531-537.
[35] Alizadeh, A.A., et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 2000. 403: p. 503-511.
[36] Sebat, J., et al., Strong Association of De Novo Copy Number Mutations with Autism. Science, 2007. 316: p. 445-449.
[37] Chavan, P., K. Joshi, and B. Patwardhan, DNA Microarrays in Herbal Drug Research. eCAM, 2006. 3(7): p. 447-457.
[38] Aitman, T.J., Science, medicine, and the future: DNA microarrays in medical practice. The British Medical Journal, 2001. 323: p. 611-615.
[39] Cuperlovic-Culf, M., N. Belacel, and J. Ouellette, Determination of tumour marker genes from gene expression data. Drug Discovery Today Targets (Reviews), 2005. 10: p. 429-437.
[40] MacGregor, P.F. and J.A. Squire, Applications of microarrays to the analysis of gene expression in cancer. Clinical Chemistry, 2002. 48: p. 1170-1177.
[41] Wadlow, R. and S. Ramaswamy, DNA microarrays in clinical cancer research. Current Molecular Medicine, 2005. 5: p. 111-120.
[42] Kohlmann, A., et al., Pediatric acute lymphoblastic leukemia (ALL) gene expression signatures classify an independent cohort of adult ALL patients. Leukemia, 2004. 18: p. 63-71.
[43] Haferlach, T., et al., AML M3 and AML M3 variant each have a distinct gene expression signature but also share patterns different from other genetically defined AML subtypes. Genes Chromosomes Cancer, 2005: p. 113-127.
[44] Kohlmann, A., et al., Molecular characterisation of acute leukemias by use of microarray technology. Genes Chromosomes Cancer, 2003. 37: p. 396-405.
[45] Su, A.I., et al., Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 2001. 61: p. 7388-7393.
[46] Ross, D.T., et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 2000. 24: p. 227-235.
[47] Petrik, J., Diagnostic applications of microarrays. Transfusion Medicine. 16: p. 233-247.
[48] Frolov, A.E., Differential Gene Expression Analysis by DNA Microarray Technology and Its Application in Molecular Oncology. Molecular Biology, 2003. 37: p. 486-494.
[49] Sorlie, T., et al., Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 2001. 98: p. 10869-10874.
[50] West, M., et al., Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2001. 98: p. 11462-11467.
[51] Bittner, M., et al., Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 2000. 406: p. 536-540.
[52] Clark, E.A., et al., Genomic analysis of metastasis reveals an essential role for RhoC. Nature, 2000. 406: p. 532-535.
[53] Mao, H.J., et al., Monitoring microarray-based gene expression profile changes in hepatocellular carcinoma. World Journal of Gastroenterology, 2005. 11: p. 2811-2816.
[54] Dyrskjot, L., Classification of bladder cancer by microarray expression profiling: towards a general clinical use of microarrays in cancer diagnostics. Expert Reviews in Molecular Diagnostics, 2003. 3: p. 635-647.
[55] Dooley, T.P., et al., Biomarkers of human cutaneous squamous cell carcinoma from tissues and cell lines identified by DNA microarrays and qRT-PCR. Biochemical and Biophysical Research Communications, 2003. 306: p. 1026-1036.
[56] Gordon, G.J., R.V. Jensen, and L.L. Hsiao, Translation of microarray data into clinically relevant cancer diagnostic tests using expression ratios in lung cancer and mesothelioma. Cancer Research, 2002. 62: p. 4963-4967.
[57] Alon, U., et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Cell Biology, 1999. 96: p. 6745-6750.
[58] Chen, X., et al., Gene expression patterns in human liver cancers. Molecular Biology of the Cell, 2002. 13: p. 1929-1939.
[59] Boom, J.v.d., et al., Characterization of Gene Expression Profiles Associated with glioma progression using Oligonucleotide-based microarray analysis and Real-Time Reverse Transcription-Polymerase Chain Reaction. American Journal of Pathology, 2003. 163: p. 1033-1043.
[60] Kitahara, O., et al., Alterations of gene expression during colorectal carcinogenesis revealed by cDNA microarrays after laser-capture microdissection of tumour tissues and normal epithelia. Cancer Research, 2001. 61: p. 3544-3549.
[61] Pomeroy, S.L., et al., Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 2002. 415: p. 436-442.
[62] Bhattacharjee, A., et al., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 2001. 98: p. 13790-13795.
[63] Nadler, S.T., et al., The expression of adipogenic genes is decreased in obesity and diabetes mellitus. Proceedings of the National Academy of Sciences, 2000. 97: p. 11371-11376.
[64] Permana, P.A., A.D. Parigi, and P.A. Tataranni, Microarray gene expression profiling in obesity and insulin resistance. Nutrition, 2004. 20: p. 134-138.
[65] Hakak, Y., et al., Genome-wide expression analysis reveals dysregulation of myelination-related genes in chronic schizophrenia. Proceedings of the National Academy of Sciences, 2001. 98: p. 4746-4751.
[66] Mayeux, R., Mapping the new frontier: complex genetic disorders. The Journal of Clinical Investigation, 2005. 115: p. 1404-1407.
[67] Kozal, M.J., et al., Extensive polymorphisms observed in the HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nature Medicine, 1996. 2: p. 753-759.
[68] Gingeras, T.R., et al., Simultaneous genotyping and species identification using hybridization pattern recognition of generic mycobacterium DNA arrays. Genome Research, 1998. 8: p. 435-448.
[69] Aitman, T.J., et al., Identification of Cd36 (Fat) as an insulin resistance gene causing defective fatty acid and glucose metabolism in hypertensive rats. Nature Genetics, 1999. 21: p. 76-83.
[70] Otero, E., et al., DNA microarrays in oral cancer. Medicina Oral, 2004. 9: p. 288-292.
[71] Graveel, C.R., et al., Expression profiling and identification of novel genes in hepatocellular carcinomas. Oncogene, 2001. 20: p. 2704-2712.
[72] Brem, R., et al., Global analysis of differential gene expression after transformation with the v-H-ras oncogene in a murine tumor model. Oncogene, 2001. 20: p. 2854-2858.
[73] Okabe, H., et al., Genome-wide analysis of gene expression in human hepatocellular carcinomas using cDNA microarray: identification of genes involved in viral carcinogenesis and tumor progression. Cancer Research, 2001. 61: p. 2129-2137.
[74] Cavallaro, S., et al., Gene expression profiles during long-term memory consolidation. European Journal of Neuroscience, 2001. 13: p. 1809-1815.
[75] Zirlinger, M., G. Kreiman, and D.J. Anderson, Amygdala-enriched genes identified by microarray technology are restricted to specific amygdaloid subnuclei. Proceedings of the National Academy of Sciences, 2001. 98: p. 5270-5275.
[76] Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23: p. 2507-2517.
[77] Tang, E.K., P. Suganthan, and X. Yao, Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics, 2006. 7: article 95.
[78] Díaz-Uriarte, R. and S.A.d. Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7: article 3.
[79] Guyon, I., et al., Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002. 46: p. 389-422.
[80] Troyanskaya, O.G., et al., Nonparametric methods for identifying differentially expressed genes in microarrays. Bioinformatics, 2002. 18: p. 1454-1461.
[81] Li, G.-Z., et al., Partial Least Squares based dimension reduction with gene selection for tumour classification. 7th IEEE International Conference on Bioinformatics and Bioengineering, 2007: p. 1439-1444.
[82] Brazma, A. and J. Vilo, Gene expression data analysis. Federation of European Biochemical Societies Letters, 2000. 480: p. 17-24.
[83] Lee, H.K., et al., Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Research, 2004. 14: p. 1085-1094.
[84] Kluger, Y., et al., Relationship between gene co-expression and probe localization on microarray slides. BMC Genomics, 2003. 4: p. 49-54.
[85] Gunther, E.C., et al., Prediction of drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences, 2003. 100: p. 9608-9613.
[86] Eisen, M.B., et al., Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 1998. 95: p. 14863-14868.
[87] Perou, C.M., et al., Molecular portraits of human breast tumours. Nature, 2000. 406: p. 747-752.
[88] Crescenzi, M. and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data. FEBS Letters, 2001. 507: p. 114-118.
[89] Furey, T.S., et al., Support Vector Machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000. 16: p. 906-914.
[90] Brown, M.P.S., et al., Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 2000. 97: p. 262-267.
[91] Tan, Y., et al., Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Research, 2005. 33: p. 56-65.
[92] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
[93] Gruvberger, S., et al., Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Research, 2001. 61: p. 5979-5984.
[94] Jaumot, J., R. Tauler, and R. Gargallo, Exploratory data analysis of DNA microarrays by multivariate curve resolution. Analytical Biochemistry, 2006. 358: p. 76-89.
[95] Warner, G.C., et al., Molecular classification of oral cancer by cDNA Microarrays Identifies overexpressed genes correlated with nodal metastasis. International Journal Cancer, 2004. 110: p. 857-868.
[96] Li, L., et al., Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 2001. 17: p. 1131-1142.
[97] Nguyen, D.V. and D.M. Rocke, Tumor classification by partial least squares microarray gene expression data. Bioinformatics, 2002. 18: p. 39-50.
[98] Nguyen, D.V. and D.M. Rocke, Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics, 2002. 18: p. 1625-1632.
[99] Fort, G. and S. Lambert-Lacroix, Classification using Partial Least Squares with Penalized Logistic Regression. Bioinformatics, 2005. 21: p. 1104-1111.
[100] Pérez-Enciso, M. and M. Tenenhaus, Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 2003. 112: p. 581-592.
[101] Modlich, O., et al., Predictors of primary breast cancers responsiveness to preoperative Epirubicin/Cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signature. Journal of Translational Medicine, 2005. 3: article 32.
[102] Man, M.Z., et al., Evaluation methods for classifying Expression data. Journal of Biopharmaceutical Statistics, 2004. 14: p. 1065-1084.
[103] Musumarra, G., et al., Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. Journal of Chemometrics, 2004. 18: p. 125-132.
[104] Bylesjö, M., et al., MASQOT: a method for cDNA microarray spot quality control. BMC Bioinformatics, 2005. 6: p. 250.
[105] Boulesteix, A.-L., PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 2004. 3: article 33.
[106] Johansson, D., P. Lindgren, and A. Berglund, A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics. 19: p. 467-473.
[107] Musumarra, G., et al., A Bioinformatic Approach to the Identification of Candidate Genes for the Development of New Cancer Diagnostics. Biol. Chem., 2003. 384: p. 321-327.
[108] Musumarra, G., et al., Genome-based identification of diagnostic molecular markers for human lung carcinomas by PLS-DA. Computational Biology and Chemistry, 2005. 29: p. 183-195.
[109] Nguyen, D.V. and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002. 18: p. 1216-1226.
[110] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
CHAPTER 2
Thesis objectives

Microarrays allow the simultaneous analysis of thousands of gene expressions. Clinical diagnosis based on gene expression data has two main targets: 1) to achieve the correct diagnosis for a patient with the greatest confidence and 2) to identify the genes responsible for a particular disease. In data analysis terms, these objectives imply developing the best classification model in order to classify a sample into its true class with a low risk of misclassification, and identifying the relevant variables that allow discriminating among the classes under study.

Multivariate methods are required to analyse the huge amount of data generated in microarray experiments. Discriminant Partial Least Squares (DPLS) classification is commonly used in this field. The performance of this classification method depends on many settings, such as the data pre-processing, the number of factors, the number of variables and the presence of outliers. Taking these considerations into account, the aim of this thesis is to optimize the classification based on DPLS in order to classify clinical samples from their gene expression microarray data. In more detail, the objectives of the present thesis are:

1. To discuss the limitation of p-DPLS classification following the Bayes rule, which forces the classifier to always assign a sample to one of the modelled classes, and to propose different approaches to overcome this limitation.

2. To implement the reject option in the probabilistic Discriminant Partial Least Squares method (p-DPLS), used to classify the samples from their gene expression data. This gives the classification rule the ability to refuse to classify a sample when the risk of misclassification is too high, and avoids forcing the classification into one of the modelled classes.

3. To develop a new method for detecting ambiguous samples and outliers for p-DPLS, in order to improve the accuracy of the classification model. This will avoid classifying samples that would probably be misclassified because: 1) they share characteristics of the two classes modelled, 2) they do not belong to any of the modelled classes, 3) they have errors in instrumental data or 4) they have errors in their class codification.

4. To develop a new method for gene selection in order to reduce the data dimensionality (eliminating the redundant data and the noise) and to improve the classification model by decreasing the risk of misclassification.

5. To study the implications that the split of the datasets into training and test sets has on gene selection and on the performance of the classification models.

6. To extend the binary classification based on DPLS to multi-class classification. This should help to solve common clinical classification problems in which more than two subtypes of samples are involved.

CHAPTER 3
Discussion of the implementation of the reject option in p-DPLS

3.1 Introduction

Microarray gene expression data are characterized by a set of $P$ features or measurements obtained through observation, which are represented by the vector $\mathbf{x}$.
The objective of a classifier is to assign a class (category) label ($y$) to this sample based on its recorded $\mathbf{x}$. In the probabilistic Discriminant Partial Least Squares classification method (p-DPLS) [1], the PLS model translates $\mathbf{x}$ into a predicted value $\hat{y}$. This $\hat{y}$ and the probability density function (PDF) that describes the distribution of the $\hat{y}$'s of the training samples of each class are used to calculate the a posteriori probability that the sample belongs to each modelled class. Classification is then decided using the Bayes rule for minimum error [2].

The Bayes rule is commonly used as a criterion for classification. Its drawback is that the unknown sample is always classified, even if the sample is an outlier or is ambiguous (it has a similar a posteriori probability of belonging to both classes). In such situations, it would be better to refuse to classify the sample [3].

The objective of this chapter is to discuss the implementation of the reject option in p-DPLS. Section 3.2 introduces the formulation of the p-DPLS model. Then, the application of the Bayes rule for classifying in p-DPLS is shown in section 3.3. Section 3.4 discusses the limitations of using the Bayes rule in p-DPLS, limitations that are overcome by implementing a reject option. Two approaches for implementing the reject option in p-DPLS are discussed in section 3.5. Finally, section 3.6 discusses the necessary changes in the interpretation of the measures of classification performance when the classifier includes the reject option.

3.2 Probabilistic discriminant partial least squares

3.2.1 The partial least squares model

One task in data analysis is to describe the relationship between the observations in the predictor space ($\mathbf{X}$) and a dependent variable ($\mathbf{y}$) [4]. Partial least squares (PLS) is a regression method that specifically searches for a set of components (or factors) that perform a simultaneous decomposition of $\mathbf{X}$ and $\mathbf{y}$ with the constraint that these components explain as much as possible of the covariance between $\mathbf{X}$ and $\mathbf{y}$. Discriminant PLS (DPLS) applies PLS regression to binary classification problems, in which $\mathbf{y}$ codifies the class of the samples [5, 6]. With microarray gene expression data, $\mathbf{X}$ is an $N \times P$ matrix of $N$ samples and $P$ gene expressions and $\mathbf{y}$ is an $N \times 1$ vector of ones and zeros, where the integer 0 indicates that the sample belongs to class $\omega_0$ (e.g. "cancer type I") and the integer 1 indicates that the sample belongs to class $\omega_1$ (e.g. "cancer type II").

PLS decomposes $\mathbf{X}$ and $\mathbf{y}$ into:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E} \qquad (1)$$

$$\mathbf{y} = \mathbf{u}q + \mathbf{f} \qquad (2)$$

where $\mathbf{T}$ is the scores matrix, $\mathbf{P}$ is the loadings matrix, $\mathbf{u}$ is the vector of scores for $\mathbf{y}$ and $q$ is the loading [7]. $\mathbf{E}$ is the residual (error) matrix of the $\mathbf{X}$ matrix and $\mathbf{f}$ is the residual (error) vector of the $\mathbf{y}$ vector. An inner relationship is constructed that relates the scores of the $\mathbf{X}$ block to the scores of the $\mathbf{y}$ block:

$$\mathbf{u} = \mathbf{T}\mathbf{w} \qquad (3)$$

Once the model is calculated, the above equations can be combined to obtain a vector of regression coefficients for a given number of factors:

$$\hat{\mathbf{b}} = \mathbf{W}\left(\mathbf{P}^{\mathrm{T}}\mathbf{W}\right)^{-1} q \qquad (4)$$

where $\mathbf{W}$ is the matrix whose columns are the weights in Eq. (3).

The prediction for a sample is calculated as:

$$\hat{y} = \mathbf{x}^{\mathrm{T}}\hat{\mathbf{b}} \qquad (5)$$

Note that if $\hat{\mathbf{b}}$ has been calculated from mean-centred data, then $\mathbf{x}$ in Eq. (5) should be mean-centred, and the predicted $\hat{y}$ should be processed accordingly. Ideally, the prediction $\hat{y}$ for a sample of class $\omega_1$ should be 1 and for a sample of class $\omega_0$ should be 0.
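As an illustration of Eqs. (1)-(5), the following Python sketch fits a single-response PLS model and builds the regression vector of Eq. (4). It is only a minimal sketch under stated assumptions: the text does not say which PLS algorithm is used, so the NIPALS algorithm is swapped in here as one common choice, the y-loadings are collected in a vector with one entry per factor, and the function names (`pls1_fit`, `pls1_predict`), variable names and simulated data are all hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit a one-response PLS model with NIPALS (an assumed algorithm choice)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                 # mean-centred X block
    f = y - y_mean                 # mean-centred y block
    n_genes = E.shape[1]
    W = np.zeros((n_genes, n_factors))       # weights, one column per factor
    P_load = np.zeros((n_genes, n_factors))  # X loadings
    q = np.zeros(n_factors)                  # y loadings
    for a in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)     # unit-length weight vector
        t = E @ w                  # scores of the current factor
        tt = t @ t
        p = E.T @ t / tt
        q[a] = f @ t / tt
        E = E - np.outer(t, p)     # deflate X
        f = f - q[a] * t           # deflate y
        W[:, a], P_load[:, a] = w, p
    # Eq. (4): regression vector b = W (P^T W)^{-1} q
    b = W @ np.linalg.solve(P_load.T @ W, q)
    return b, x_mean, y_mean

def pls1_predict(X, b, x_mean, y_mean):
    """Eq. (5) on mean-centred x, with the centring of y undone afterwards."""
    return (X - x_mean) @ b + y_mean

# Simulated (hypothetical) data: 20 samples, 50 genes, 5 of them informative.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 10)      # class codes: 0 and 1
X = rng.normal(size=(20, 50))
X[y == 1, :5] += 1.5
b, x_mean, y_mean = pls1_fit(X, y, n_factors=2)
y_hat = pls1_predict(X, b, x_mean, y_mean)
print(np.round(y_hat, 2))
```

With noisy data such as these, the predictions typically scatter around the class codes 0 and 1 rather than matching them exactly.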
Since this is never the case, because of random variability and modelling error, a threshold is defined so that a sample whose prediction $\hat{y}$ is above this threshold is classified into class $\omega_1$, and otherwise it is classified into class $\omega_0$. The threshold can be defined with different degrees of rigour (e.g., it can be set arbitrarily at 0.5, or it can be derived by assuming that the $\hat{y}$'s of the training samples follow a Gaussian distribution and estimating that distribution from the mean and standard deviation of the $\hat{y}$'s of each class). In the following section, the threshold is defined from PDFs that describe each class. This has led to a new version of DPLS called probabilistic-DPLS (p-DPLS).

3.2.2 Probability density function of a class

In p-DPLS, one PDF is calculated that represents the PLS predictions characterizing the samples of class $\omega_0$ and one PDF is calculated that represents the range of predictions of samples of class $\omega_1$. The PDFs are calculated as follows. For the PLS model with $A$ factors, the training samples are predicted with Eq. (5). For each training sample $i$, a Gaussian function (also called kernel function) centred at the predicted value $\hat{y}_i$ is calculated as:

\[ f(\hat{y}_i) = \frac{1}{\mathrm{SEP}_i\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\hat{y}-\hat{y}_i}{\mathrm{SEP}_i}\right)^{2}\right] \tag{6} \]

where

\[ \mathrm{SEP}_i = \mathrm{RMSEC}\,\sqrt{1+h_i} \tag{7} \]

and

\[ \mathrm{RMSEC} = \sqrt{\frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}{N - A - \delta}} \tag{8} \]

$\mathrm{SEP}_i$ is the standard error of prediction for sample $i$, $h_i$ is the leverage of the sample, RMSEC is the root mean square error of calibration, $y_i$ is the known class of the training sample $i$ (i.e. the value 0 for a sample of class $\omega_0$ and the value 1 for a sample of class $\omega_1$) and $\delta$ is 1 if the data has been centred and 0 otherwise. Figure 1 shows the Gaussian functions calculated for three training samples of class $\omega_0$ and four samples of class $\omega_1$.

Figure 1. Gaussian functions ($f(\hat{y}_i)$) and PDFs ($p(\hat{y} \mid \omega_0)$, $p(\hat{y} \mid \omega_1)$) calculated for a hypothetical p-DPLS model.
Note that the width of the Gaussian kernel for each sample is different, because it depends on the leverage of the sample, and, ultimately, on the relative position of the sample in the multivariate space.

The PDFs for class $\omega_0$ and $\omega_1$ are calculated by averaging the individual kernel functions of the training samples of each class:

\[ p(\hat{y} \mid \omega_0) = \frac{1}{n_0}\sum_{i=1}^{n_0} f(\hat{y}_i) \tag{9} \]

\[ p(\hat{y} \mid \omega_1) = \frac{1}{n_1}\sum_{i=1}^{n_1} f(\hat{y}_i) \tag{10} \]

where $n_0$ and $n_1$ are the number of samples of class $\omega_0$ and class $\omega_1$ respectively.

For a test sample, the predicted value $\hat{y}_i$ is calculated with Eq. (5) for a DPLS model with $A$ factors. Then, the sample is classified according to its probability to belong to each one of the classes, as it is shown in the next section.

3.3 Class prediction

3.3.1 Classification based on probabilities

Classification based on a priori probability

Let $\{\omega_1 \ldots \omega_C\}$ be a finite set of $C$ classes. The a priori probability $P(\omega_c)$ is the probability of observing class $c$ when a new sample arrives [8]. It reflects our prior knowledge of how likely we are to get a sample of one class (e.g. "cancer type I") and not another kind of sample (e.g. "healthy" or "cancer type II") [9]. A priori probabilities are often considered equal for all the classes [10, 11] or calculated from the number of samples in the training set assuming that this set is representative of the population [12-14], with the constraint that $\sum_{c=1}^{C} P(\omega_c) = 1$ [8].

In p-DPLS, the a priori probabilities are $P(\omega_0) = n_0/N$ and $P(\omega_1) = n_1/N$ for class $\omega_0$ and class $\omega_1$ respectively, where $n_0$ is the number of training samples of class $\omega_0$, $n_1$ is the number of samples of class $\omega_1$ and $N = n_0 + n_1$.
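The construction of the class PDFs (Eqs. 6 to 10) and of the a priori probabilities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the predictions, leverages and the number of factors below are made-up values, and the function names are hypothetical, not part of any p-DPLS implementation.

```python
import math


def gaussian_kernel(y_hat, centre, sep):
    """Eq. (6): Gaussian kernel centred at one training prediction."""
    z = (y_hat - centre) / sep
    return math.exp(-0.5 * z * z) / (sep * math.sqrt(2.0 * math.pi))


def rmsec(predictions, labels, n_factors, centred=True):
    """Eq. (8): root mean square error of calibration."""
    delta = 1 if centred else 0
    sse = sum((p - y) ** 2 for p, y in zip(predictions, labels))
    return math.sqrt(sse / (len(labels) - n_factors - delta))


def class_pdf(y_hat, centres, seps):
    """Eqs. (9)-(10): average of the kernels of the samples of one class."""
    kernels = [gaussian_kernel(y_hat, c, s) for c, s in zip(centres, seps)]
    return sum(kernels) / len(kernels)


# Made-up training predictions (Eq. 5), known classes and leverages.
y_pred = [-0.1, 0.2, 0.1, 0.9, 1.1, 0.8, 1.2]
y_true = [0, 0, 0, 1, 1, 1, 1]
leverage = [0.10, 0.30, 0.20, 0.15, 0.25, 0.40, 0.20]

error = rmsec(y_pred, y_true, n_factors=2)             # A = 2 is an assumption
sep = [error * math.sqrt(1.0 + h) for h in leverage]   # Eq. (7)


def pdf_of_class(label):
    """Return p(y_hat | class) built from the training samples of that class."""
    centres = [p for p, y in zip(y_pred, y_true) if y == label]
    seps = [s for s, y in zip(sep, y_true) if y == label]
    return lambda y_hat: class_pdf(y_hat, centres, seps)


pdf0, pdf1 = pdf_of_class(0), pdf_of_class(1)

# A priori probabilities n0/N and n1/N.
prior0 = y_true.count(0) / len(y_true)
prior1 = y_true.count(1) / len(y_true)

print(round(error, 3), pdf0(0.0) > pdf1(0.0), pdf1(1.0) > pdf0(1.0))
```

Because every kernel has its own width, samples with a high leverage contribute a flatter, wider bump to the PDF of their class.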
Based on the a priori probability only, the classification rule in p-DPLS that minimizes the probability of error is to assign a sample to class $\omega_c$ if

\[ P(\omega_c) > P(\omega_{c'}) \qquad c' = 1 \ldots C;\; c \neq c' \tag{11} \]

The drawback of this rule is that it will always assign any new sample to the same class (the one with the highest a priori probability), although we know that samples from different classes may arrive. The information about the sample contained in $\mathbf{x}$ is ignored.

Classification based on probability density functions

A better classification decision can be made by using the measurement vector $\mathbf{x}$ that characterizes the incoming sample; in our case, the data $\mathbf{x}$ from a microarray experiment. In p-DPLS, $\mathbf{x}$ is first converted into the prediction $\hat{y}$ with Eq. (5) for the PLS model with $A$ factors. Then, the rule is to assign the sample $i$ with prediction $\hat{y}_i$ to the class $\omega_c$ if

\[ p(\hat{y}_i \mid \omega_c) > p(\hat{y}_i \mid \omega_{c'}) \qquad c' = 1 \ldots C;\; c \neq c' \tag{12} \]

where $p(\hat{y}_i \mid \omega_c)$ is the class-conditional PDF for class $c$ obtained from the $\hat{y}$'s of the training samples, evaluated at position $\hat{y}_i$ (section 3.2.2). Note that if, for a certain sample, $p(\hat{y}_i \mid \omega_0) = p(\hat{y}_i \mid \omega_1)$, the value of the PDF will not decide. Figure 2 shows different PDFs for two classes, $\omega_0$ and $\omega_1$, for different hypothetical p-DPLS models (e.g. calculated with different numbers $A$ of factors). For a given sample with the predicted value $\hat{y}_i$ (marked in Figure 2), the classification is done by comparing the values of each PDF at such $\hat{y}_i$ (arrows in Figure 2a). The sample is classified into the class with the largest $p(\hat{y}_i \mid \omega_c)$. Note that in the zone where the PDFs overlap, the values $p(\hat{y}_i \mid \omega_c)$ are similar for both classes (see the first two PDF images in Figure 2a). Hence, a small variation in $\hat{y}_i$ due to random error in $\mathbf{x}$ may change the class that has the largest $p(\hat{y}_i \mid \omega_c)$, and hence change the classification decision. The classification of samples in that zone (called ambiguous samples) will be discussed later.
Figure 2. Examples for hypothetical p-DPLS models (one model per row of panels, horizontal axis $\hat{y}$): a. PDFs, b. a posteriori probabilities, c. risk functions assuming $\lambda_{cc} = 0$ and $\lambda_{cc'} = 1$.

Classification based on a posteriori probability

A more elaborate classification decision combines the a priori probability and the prediction $\hat{y}_i$ of the incoming sample. The probability that this new sample belongs to class $c$ in a $C$-class problem is given by the Bayes a posteriori probability expression:

\[ P(\omega_c \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_c)\,P(\omega_c)}{p(\hat{y}_i)} \tag{13} \]

When applied to microarray data classification, $P(\omega_c \mid \hat{y}_i)$ is the probability that a cell or tissue characterized by its gene expression data $\mathbf{x}$ (from which $\hat{y}_i$ is obtained) is either from the "healthy" class or, otherwise, from the "tumour" class.
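As a small numerical sketch of Eq. (13), the a posteriori probabilities can be obtained by normalising the products of PDF value and a priori probability so that they sum to one over the classes. The PDF values and priors below are hypothetical numbers chosen for illustration, and `posteriors` is a made-up helper name.

```python
def posteriors(likelihoods, priors):
    """Eq. (13): P(w_c | y_hat_i) for every class c.

    likelihoods[c] is p(y_hat_i | w_c) and priors[c] is P(w_c).
    The function assumes that at least one product is non-zero.
    """
    joint = [p * prior for p, prior in zip(likelihoods, priors)]
    evidence = sum(joint)          # denominator of Eq. (13)
    return [j / evidence for j in joint]


# Hypothetical PDF values at one prediction and priors n0/N = 3/7, n1/N = 4/7.
post = posteriors(likelihoods=[0.2, 1.5], priors=[3 / 7, 4 / 7])
print(post)
```

With these numbers the sample is far more likely to belong to the second class, both because its PDF value is larger and because that class has the larger a priori probability.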
The denominator (known as the evidence or unconditional probability density function) is a scale factor that measures how frequently we will measure a sample with such $\hat{y}_i$:

\[ p(\hat{y}_i) = \sum_{c=1}^{C} p(\hat{y}_i \mid \omega_c)\,P(\omega_c) \tag{14} \]

The rule assigns the sample to the class with the largest a posteriori probability. So, a sample will be assigned to class $\omega_c$ if:

\[ P(\omega_c \mid \hat{y}_i) > P(\omega_{c'} \mid \hat{y}_i) \qquad c' = 1 \ldots C;\; c \neq c' \tag{15} \]

Or, since the evidence is the same for all the classes, if

\[ p(\hat{y}_i \mid \omega_c)\,P(\omega_c) > p(\hat{y}_i \mid \omega_{c'})\,P(\omega_{c'}) \qquad c' = 1 \ldots C;\; c \neq c' \tag{16} \]

For a two-class classification problem, as in p-DPLS, the a posteriori probabilities $P(\omega_c \mid \hat{y}_i)$ are:

\[ P(\omega_0 \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_0)\,P(\omega_0)}{p(\hat{y}_i)} \tag{17a} \]

\[ P(\omega_1 \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_1)\,P(\omega_1)}{p(\hat{y}_i)} \tag{17b} \]

where:

\[ p(\hat{y}_i) = p(\hat{y}_i \mid \omega_0)\,P(\omega_0) + p(\hat{y}_i \mid \omega_1)\,P(\omega_1) \tag{18} \]

Figure 2b shows the a posteriori probabilities calculated from the PDFs of Figure 2a for two classes along the $\hat{y}$ domain. The arrows indicate the a posteriori probability of the marked sample in each class.

Note that since the a posteriori probability is calculated as a ratio (Eqs. 17a-17b), it increases for one class as $\hat{y}$ moves away from the PDF of the other class. Hence, for a sample with predicted value $\hat{y}_i$, the classification is riskier when the PDFs overlap (first two rows of images of Figure 2). In contrast, when the distributions are more separated (images on the third and fourth rows in Figure 2), the classification action is taken with a higher probability of being correct.

3.3.2 Classification based on risk

Classification costs

Each classification decision has an associated cost. Let $\{\alpha_1 \ldots \alpha_C\}$ be the possible decisions, where $\alpha_c$ indicates that the sample is classified in class $\omega_c$. Let $\lambda(\alpha_c \mid \omega_{c'})$ be the cost incurred for making the decision $\alpha_c$ (classify in $\omega_c$) when the true class is $\omega_{c'}$. For short, $\lambda(\alpha_c \mid \omega_{c'})$ is represented as $\lambda_{cc'}$.

In practice, deciding the right costs for the classification problem is difficult and requires an expert opinion.
Costs result from combining several factors measured in different units (money, time or quality of life [8]), but a general approach is to consider that a correct classification has cost 0 (i.e., when a sample of class $c$ has been classified in class $c$, $\lambda_{cc} = 0$) and an incorrect classification has cost 1 (i.e., when a sample of class $c$ has been classified in class $c'$, $\lambda_{cc'} = 1$) [15-17]. Other approaches have been used. Santos-Pereira [18] proposed seven different combinations of costs to optimize the classification, based on the work published by Tortorella [19]. They introduced negative costs for correct classifications and positive costs for misclassifications. Deceux [15] presented costs of classifying the samples in three different classes, with values from 0.5 to 3 to penalize each classification. Another strategy is to assign different costs to each type of error and classification, i.e. classifying a sample as "healthy" when it is "tumour" is penalized differently, with a higher cost, than classifying a sample as "tumour" when it is "healthy" [9, 20].

The risk of classification

The risk of classification, called the conditional risk, $R(\alpha_c \mid \hat{y}_i)$, is defined as the expected loss (cost). Conditional means that the risk depends on the value that characterizes the sample (here $\hat{y}_i$, which derives from the observed $\mathbf{x}$ through the PLS model) and on which the classification is based. Depending on $\hat{y}_i$, we may run a higher or a lower risk.
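For a particular $\hat{y}_i$ and the action $\alpha_c$ taken, the loss incurred is $\lambda(\alpha_c \mid \omega_{c'})$, where $\omega_{c'}$ is the possible true class (i.e. one of the classes in which the samples may be classified). Since $P(\omega_{c'} \mid \hat{y}_i)$ is the probability that the true class for such $\hat{y}_i$ is $\omega_{c'}$, the expected loss associated with taking action $\alpha_c$ is [9]:

\[ R(\alpha_c \mid \hat{y}_i) = \sum_{c'=1}^{C} \lambda(\alpha_c \mid \omega_{c'})\,P(\omega_{c'} \mid \hat{y}_i) \tag{19} \]

For two classes, the risk of classification becomes:

\[ R(\alpha_0 \mid \hat{y}_i) = \lambda_{00}\,P(\omega_0 \mid \hat{y}_i) + \lambda_{01}\,P(\omega_1 \mid \hat{y}_i) \tag{20a} \]

\[ R(\alpha_1 \mid \hat{y}_i) = \lambda_{11}\,P(\omega_1 \mid \hat{y}_i) + \lambda_{10}\,P(\omega_0 \mid \hat{y}_i) \tag{20b} \]

Here action $\alpha_0$ is "classify the sample into class $\omega_0$" and action $\alpha_1$ is "classify the sample into class $\omega_1$". $\lambda_{01}$ is the loss incurred for deciding $\omega_0$ when the true class is $\omega_1$, $\lambda_{10}$ is the loss incurred for deciding $\omega_1$ when the true class is $\omega_0$, and $\lambda_{00}$ and $\lambda_{11}$ are the costs of correctly classifying the samples into class $\omega_0$ and class $\omega_1$, respectively.

Whenever we have a prediction $\hat{y}_i$ we can minimize the expected loss by selecting the action that minimizes the conditional risk. The decision rule based on risk is known as the Bayes rule for minimum risk [2]. This rule classifies the sample in class $\omega_c$ if

\[ R(\alpha_c \mid \hat{y}_i) < R(\alpha_{c'} \mid \hat{y}_i) \qquad c' = 1 \ldots C;\; c \neq c' \tag{21} \]

For binary classifiers like p-DPLS, Eq. (21) becomes: classify the sample into class

\[ \begin{cases} \omega_0 & \text{if } R(\alpha_0 \mid \hat{y}_i) < R(\alpha_1 \mid \hat{y}_i) \\ \omega_1 & \text{if } R(\alpha_1 \mid \hat{y}_i) < R(\alpha_0 \mid \hat{y}_i) \end{cases} \tag{22} \]

with $R(\alpha_0 \mid \hat{y}_i)$ and $R(\alpha_1 \mid \hat{y}_i)$ evaluated with Eqs. (20a)-(20b).

If we consider cost zero for a correct classification and cost one for any error (i.e., $\lambda_{00} = \lambda_{11} = 0$ and $\lambda_{01} = \lambda_{10} = 1$), the risk of classification becomes:

\[ R(\alpha_0 \mid \hat{y}_i) = \lambda_{01}\,P(\omega_1 \mid \hat{y}_i) = P(\omega_1 \mid \hat{y}_i) \tag{23a} \]

\[ R(\alpha_1 \mid \hat{y}_i) = \lambda_{10}\,P(\omega_0 \mid \hat{y}_i) = P(\omega_0 \mid \hat{y}_i) \tag{23b} \]

and the classification decision may be expressed in terms of a posteriori probabilities as

\[ \omega_0 \;\text{ if }\; \lambda_{10}\,P(\omega_0 \mid \hat{y}_i) > \lambda_{01}\,P(\omega_1 \mid \hat{y}_i); \quad \omega_1 \text{ otherwise} \tag{24} \]

Figure 2c shows the risk over the $\hat{y}$ domain for a binary classifier with $\lambda_{cc} = 0$ and $\lambda_{cc'} = 1$. Note that the risk curves are opposite to the a posteriori probability curves, i.e., a high a posteriori probability involves a low risk, and vice versa.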
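The conditional risk of Eq. (19) and the minimum-risk decision of Eq. (21) can be sketched as follows. The a posteriori probabilities and the asymmetric cost matrix are hypothetical values chosen only to show how the costs can change the decision; the function names are made up for this example.

```python
def conditional_risks(costs, posteriors):
    """Eq. (19): expected loss of each action.

    costs[c][k] is lambda_ck, the cost of deciding class c when the true
    class is k; posteriors[k] is P(w_k | y_hat_i).
    """
    return [sum(cost * p for cost, p in zip(row, posteriors)) for row in costs]


def minimum_risk_class(costs, posteriors):
    """Eq. (21): index of the action with the smallest conditional risk.

    On an exact tie the first class is returned, which is an arbitrary choice.
    """
    risks = conditional_risks(costs, posteriors)
    return risks.index(min(risks)), risks


posteriors = [0.7, 0.3]        # hypothetical P(w0 | y_hat), P(w1 | y_hat)

zero_one = [[0, 1],            # lambda_00, lambda_01
            [1, 0]]            # lambda_10, lambda_11
asymmetric = [[0, 5],          # deciding w0 for a true w1 costs five times more
              [1, 0]]

decision_01, risks_01 = minimum_risk_class(zero_one, posteriors)
decision_asym, risks_asym = minimum_risk_class(asymmetric, posteriors)
print(decision_01, risks_01)
print(decision_asym, risks_asym)
```

With zero-one costs the risks are the a posteriori probabilities of the other class (Eqs. 23a-23b), so the sample goes to class 0. With the asymmetric costs the same sample goes to class 1, because the error of deciding class 0 is penalized more heavily.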
Also note that the risk of classification in one of the classes decreases the further the prediction is from the PDF of the other class. For the marked test sample, in the top two models, the risk taken to classify the sample into class $\omega_0$, $R(\alpha_0 \mid \hat{y}_i)$, is similar to the risk taken to classify the sample into class $\omega_1$, $R(\alpha_1 \mid \hat{y}_i)$. In such a situation the chance of misclassification is high. By contrast, when the PDFs do not overlap (Figure 2c, bottom) the risk taken when classifying this sample in class $\omega_0$ is much higher than the risk taken when classifying it in class $\omega_1$ (i.e. $R(\alpha_0 \mid \hat{y}_i) \gg R(\alpha_1 \mid \hat{y}_i)$). Hence the sample will be classified in class $\omega_1$ with a low risk of classification.

The classification based on risk is a general rule from which the previous rules derive. To be meaningful, the classification based on risk requires the costs to be set objectively (e.g. in monetary units). If they are not known and the cost of misclassification is set equal to one and the cost of correct classification is set equal to zero, the classification based on risk is equivalent to the classification based only on a posteriori probabilities.

3.4 Discussion of class prediction

The Bayes rule is optimal in the sense that no other rule can yield a lower error probability. However, when $\hat{y}_i$ lies in the ambiguity region, or when the sample lies at the limits of the classes' domains, this rule may lead to questionable results. These situations are discussed below.

It is common in binary classification that the PDFs of class $\omega_0$ and class $\omega_1$ overlap (Figure 3a). The overlap arises either because the classification algorithm has limited discriminative power, or because some samples of both classes have similar measured $\mathbf{x}$. A sample whose prediction is in that region has similar values of the PDFs, $p(\hat{y}_i \mid \omega_0) \approx p(\hat{y}_i \mid \omega_1)$, and, assuming that the a priori probabilities are equal, also has similar values of the a posteriori probabilities, $P(\omega_0 \mid \hat{y}_i) \approx P(\omega_1 \mid \hat{y}_i)$.
Since there is not a clear difference, the sample could well belong to either of the two classes and the probability of misclassification is high. The overlap zone (dashed region in Figure 3) is called the ambiguity region.

Figure 3. Hypothetical p-DPLS model. a-b. PDFs. c-d. A posteriori probability functions. Class $\omega_0$ is represented by the green line and class $\omega_1$ by the yellow line. The dashed region is the ambiguity region. The low and high limits of each class are labelled LL and HL.

Another common situation arises when the sample's prediction is outside the range of the predictions of the training samples. This situation may happen at the extremes of the PDFs (Figures 3a and 3b) and also in the region between the PDFs if the PDFs do not overlap (Figure 3b). In these regions, the class-conditional probabilities $p(\hat{y}_i \mid \omega_c)$ are very low for both classes and the products $p(\hat{y}_i \mid \omega_0) \cdot P(\omega_0)$ and $p(\hat{y}_i \mid \omega_1) \cdot P(\omega_1)$ are also low. However, note that at the limits of the PDFs the a posteriori probability for one of the classes is high (Figures 3c and 3d) because it is calculated as a ratio.
For example, for $p(\hat{y}_i \mid \omega_0) \cdot P(\omega_0) = 10^{-7}$ and $p(\hat{y}_i \mid \omega_1) \cdot P(\omega_1) = 10^{-10}$, the a posteriori probability is $P(\omega_0 \mid \hat{y}_i) = 10^{-7}/(10^{-7} + 10^{-10}) \approx 1$. By letting the a posteriori probability decide, the sample would be classified into class $\omega_0$ with a high a posteriori probability. This result is satisfactory if the sample must necessarily belong to one of the two possible classes and the classification model has been designed to do so. However, the fact that the prediction of the sample is in the tail of the PDF, where only a very low percentage of training samples lie, suggests that the sample may be an outlier and may not even belong to the class. Hence, allowing the classifier to refuse to classify, instead of forcing it to make a classification decision, might be advantageous. This possibility is considered neither in the two-class Bayes rule of a posteriori probability (Eq. 15) nor in the minimum risk classification rule (Eq. 21), which will always classify the sample.

3.5 Probabilistic discriminant partial least squares with reject option

In many cases, such as in clinical diagnosis, the cost of a wrong classification may be so high that it may be better to suspend the decision (to refuse to classify the sample) and call for a further test [21] than to risk a wrong classification. The reject option is introduced in a classification rule to safeguard against excessive misclassifications [3] and to obtain the accuracy required by the user of the classification system [22]. The reject option avoids classifying the samples with a high probability of being wrongly classified [22], and only the classifications with a low risk are performed. Hence, the reject option converts potential misclassifications into rejections [23], which reduces the error rate. The reject option, however, has two limitations:

1. Some samples that would be correctly classified by the classification model may be converted into rejections.
2. The classification model becomes useless if too many samples are rejected.

Undoubtedly a tradeoff between errors and rejects must be achieved [18]. Several strategies have been developed to define the optimal reject option [11, 18, 21, 23, 24]. These strategies basically reduce to two approximations: either defining the reject option as a new class (reject class) to which the objects are assigned, or defining the reject option as a threshold so that the object is only classified if its a posteriori probability is higher than the threshold. These two approaches are discussed below.

3.5.1 Reject option as a class

The reject option may be introduced in the classification process as an additional class, the reject class ($\omega_r$). In such a case, the possible classification actions of the p-DPLS classifier are: classify the sample into class $\omega_0$ ($\alpha_0$), classify the sample into class $\omega_1$ ($\alpha_1$) and classify the sample into the reject class $\omega_r$ ($\alpha_r$).
Classification based on a posteriori probability

The a posteriori probabilities when the reject option is implemented as a class are defined as:

\[ P(\omega_0 \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_0)\,P(\omega_0)}{p(\hat{y}_i)} \tag{25a} \]

\[ P(\omega_1 \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_1)\,P(\omega_1)}{p(\hat{y}_i)} \tag{25b} \]

\[ P(\omega_r \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_r)\,P(\omega_r)}{p(\hat{y}_i)} \tag{25c} \]

where the scale factor defined in Eq. (14) becomes:

\[ p(\hat{y}_i) = p(\hat{y}_i \mid \omega_0)\,P(\omega_0) + p(\hat{y}_i \mid \omega_1)\,P(\omega_1) + p(\hat{y}_i \mid \omega_r)\,P(\omega_r) \tag{26} \]

The rule is to classify into:

\[ \begin{cases} \text{class } \omega_0 & \text{if } P(\omega_0 \mid \hat{y}_i) > \max\left(P(\omega_1 \mid \hat{y}_i),\, P(\omega_r \mid \hat{y}_i)\right) \\ \text{class } \omega_1 & \text{if } P(\omega_1 \mid \hat{y}_i) > \max\left(P(\omega_0 \mid \hat{y}_i),\, P(\omega_r \mid \hat{y}_i)\right) \\ \text{class } \omega_r & \text{if } P(\omega_r \mid \hat{y}_i) > \max\left(P(\omega_0 \mid \hat{y}_i),\, P(\omega_1 \mid \hat{y}_i)\right) \end{cases} \tag{27} \]

If the reject class is defined in this way, the a priori probabilities for class $\omega_0$ and class $\omega_1$ are calculated from the proportion of samples of each class in the training set. For the reject class, $P(\omega_r)$ is the a priori probability that a new sample that should be rejected arrives and $p(\hat{y}_i \mid \omega_r)$ defines the distribution of the $\hat{y}_i$ of any sample that should be rejected. Both $p(\hat{y}_i \mid \omega_r)$ and $P(\omega_r)$ are clearly difficult to calculate. Usually it is assumed that the reject class has a uniform distribution over the $\hat{y}$ domain [25] and since the a priori probability has only a multiplicative effect, only the product $p(\hat{y}_i \mid \omega_r) \cdot P(\omega_r)$ must be calculated. One criterion is to define $p(\hat{y}_i \mid \omega_r) \cdot P(\omega_r)$ as a threshold so that the 5% of the area in the tails of the PDFs is below this threshold [11] (dashed regions in Figure 4). In this way a sample whose $\hat{y}_i$ is at the tails of the PDFs is rejected. Figure 4 shows the PDFs for class $\omega_0$ and class $\omega_1$ for overlapped and non overlapped classes. The red horizontal line is the uniform distribution calculated for the reject class.
Note that this reject class defines two kinds of regions, the acceptance and the reject ones.

Figure 4. PDFs for overlapped (a) and non-overlapped (b) classes ($\omega_0$ and $\omega_1$) and the reject class ($\omega_r$). The reject class is defined as a uniform distribution. This is set as the 5% area in the tails of the PDFs.

The a posteriori probabilities (Eqs. 25a-25c) derived from the PDFs in Figure 4 are shown in Figure 5. When the PDFs are overlapped (Figure 5a), the reject class is useful at the ends of the PDFs and also in the ambiguous zone if the distribution defining the reject class is higher than the PDFs of the classes. However, samples in the ambiguous zone will not be rejected if the reject class is below the PDFs of the classes (as usually happens), because the a posteriori probability of the reject class will always be smaller than the probability of classification (i.e. $P(\omega_r \mid \hat{y}_i) < \max(P(\omega_0 \mid \hat{y}_i), P(\omega_1 \mid \hat{y}_i))$). For non-overlapped distributions (Figure 5b), in the region between the PDFs the probability that the sample belongs to the reject class is the largest of the three a posteriori probabilities, so a sample in that zone would be rejected. This is the behaviour to be expected because there are no training samples with such $\hat{y}_i$ values. The same happens at the extremes of these distributions (as at the extremes of overlapped distributions), where samples with such $\hat{y}_i$ will be rejected.

Figure 5. A posteriori probabilities for class $\omega_0$ (green), class $\omega_1$ (yellow), and the reject class $\omega_r$ (red) for a. overlapped classes and b. non-overlapped classes presented in Figure 4.
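The behaviour described above can be reproduced with a minimal sketch of Eqs. (25)-(27), assuming a uniform reject class whose product of density and a priori probability is a single constant. The value 0.05 used for that constant and the two test points are hypothetical, and `classify_with_reject_class` is a made-up helper name.

```python
def classify_with_reject_class(joint0, joint1, reject_level):
    """Eqs. (25)-(27) with a uniform reject class.

    joint0 and joint1 are p(y_hat | w_c) * P(w_c) at the sample's y_hat;
    reject_level is the constant p(y_hat | w_r) * P(w_r).
    """
    evidence = joint0 + joint1 + reject_level          # Eq. (26)
    posteriors = {
        "w0": joint0 / evidence,
        "w1": joint1 / evidence,
        "reject": reject_level / evidence,
    }
    # Eq. (27): the label with the largest a posteriori probability wins.
    return max(posteriors, key=posteriors.get), posteriors


REJECT_LEVEL = 0.05   # hypothetical height of the uniform reject distribution

# Sample in the tail of the PDFs (the 1e-7 versus 1e-10 case of section 3.4).
tail_label, _ = classify_with_reject_class(1e-7, 1e-10, REJECT_LEVEL)

# Sample in the ambiguous zone, where both PDFs are well above the reject level.
ambiguous_label, _ = classify_with_reject_class(0.30, 0.28, REJECT_LEVEL)

print(tail_label, ambiguous_label)
```

The tail sample is rejected, while the ambiguous sample is still classified, which is the limitation of the reject class noted in the text when its level lies below the PDFs of the classes.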
Different adaptations of the reject class have been described. Pereira et al. [18] introduced an indecision class in order to reject the samples, but this reject class is introduced neither in the evaluation of a posteriori probabilities nor in the conditional risk. Their approach may be regarded as introducing a reject threshold. Landgrebe et al. [10] considered the problem as one in which there is a well-defined target class and a poorly defined outlier class, and introduced the reject class only in the prediction step. In other words, in the training step there are two classes (target and outlier) and in the prediction or classification step an additional class is used, the reject class. This class is assumed to be uniformly distributed across the training classes' domains, and it is included in the evaluation of the probabilities. The criticism arises because in this approach the a priori probabilities used in the training step are different from the a priori probabilities used in the prediction step. Muzzolini et al. [11] introduced an ambiguous class to reduce the probability of an erroneous classification. This class identifies those samples that are classified as belonging to two or more classes with (near) equal probability. In addition, they introduced the reject distance to identify those samples that have little or no similarity with the predefined classes. The reject thresholds to identify such samples are determined by fixing the probability with which the samples are classified as belonging to the distance reject class. This is equivalent to rejecting the samples predicted outside a confidence interval fixed around each PDF (reject distance) [11].

3.5.2 Reject option as a threshold

A second alternative to introduce the reject option is to introduce a reject threshold.
Classification based on a posteriori probability

The a posteriori probabilities for each class over the $\hat{y}$ domain are calculated (Eqs. 17a-17b) using the PDFs (Figure 6a-6b). For such a posteriori probabilities a threshold of rejection is set at $(1 - t)$ (Figure 6c-6d), so that a sample is rejected if the maximum a posteriori probability is lower than this threshold value [22].

The classification rule based on a posteriori probabilities with reject option becomes classify into:

\[ \begin{cases} \text{class } \omega_0 & \text{if } P(\omega_0 \mid \hat{y}_i) > \max\left(P(\omega_1 \mid \hat{y}_i),\, (1 - t)\right) \\ \text{class } \omega_1 & \text{if } P(\omega_1 \mid \hat{y}_i) > \max\left(P(\omega_0 \mid \hat{y}_i),\, (1 - t)\right) \end{cases} \tag{28} \]

and reject the sample if:

\[ (1 - t) > \max\left(P(\omega_0 \mid \hat{y}_i),\, P(\omega_1 \mid \hat{y}_i)\right) \tag{29} \]

If the PDFs of the two classes are overlapped, the reject threshold divides the $\hat{y}$ domain into two regions: acceptance region and reject region (Figure 6c and 6d).

Figure 6. (a-b) PDFs for overlapped and no overlapped classes. (c-d) a posteriori probabilities with reject threshold derived from a-b PDFs.
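Eqs. (28) and (29) amount to a short decision function, sketched below. The value of $t$ and the a posteriori probabilities are hypothetical, the function name is made up, and the handling of exact ties (which the equations leave open) is an arbitrary choice of this example.

```python
def classify_with_threshold(post0, post1, t):
    """Eqs. (28)-(29): reject when the largest posterior is below 1 - t."""
    if max(post0, post1) < 1.0 - t:            # Eq. (29)
        return "reject"
    # Eq. (28); an exact tie between the classes is assigned to w1 here.
    return "w0" if post0 > post1 else "w1"


T = 0.2   # hypothetical value, giving a reject threshold 1 - t = 0.8

# Ambiguous sample: similar posteriors, so the classification is withheld.
ambiguous = classify_with_threshold(0.55, 0.45, T)

# Clear sample: one posterior is above the threshold.
clear = classify_with_threshold(0.95, 0.05, T)

# Tail sample of section 3.4: both products are tiny but their ratio is ~1.
tail_post0 = 1e-7 / (1e-7 + 1e-10)
tail = classify_with_threshold(tail_post0, 1.0 - tail_post0, T)

print(ambiguous, clear, tail)
```

The tail sample is accepted because its a posteriori probability is close to one, so a threshold on the a posteriori probability handles ambiguous samples but does not by itself flag samples far from the training predictions.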
According to Chow [23], the optimal reject threshold ($t$) is given by

\[ t = \frac{\lambda_r - \lambda_c}{\lambda_m - \lambda_c} \tag{30} \]

where $\lambda_r$ is the cost of rejecting a sample and $\lambda_c$ and $\lambda_m$ are the costs of a correct classification and of a misclassification, respectively. Generally $\lambda_m > \lambda_r > \lambda_c$, and in most cases $\lambda_c = 0$ (i.e. there is no cost if the classification is correct) [23, 26].

A limitation of classification based on Eq. (28) and Eq. (29) is that the threshold has no effect if the PDFs are not overlapped (Figures 6b and 6d), since there is not a significant ambiguous region. Note also that for the reject option to work properly, $(1 - t)$ must be higher than 0.5. If $(1 - t)$ is lower than 0.5, the largest of the two a posteriori probabilities, which is at least 0.5, will always be higher than the reject threshold, so that the classification rule based on a posteriori probabilities with reject option is simply the classical Bayes rule (see Figure 6a). In addition, the use of the a posteriori probability of the class and the reject threshold for rejection ignores the possibility of having samples from unknown classes. This situation may be partially overcome by setting limits on the PDFs (High Limit and Low Limit in Figure 3, as will be discussed in chapter 4). These limits avoid classifying samples that lie at the extremes of the classes.

Other approaches have been proposed to implement rejection based on thresholds. Fumera et al. [22] proposed setting an individual threshold for each class, thus avoiding rejecting too many samples of one of the classes if the number of samples of both classes is not balanced. Tortorella et al. [19, 21] also considered two thresholds, which were optimized by maximizing the classification utility function. This is an alternative to Chow's approach, which takes into account costs and minimizes the risk [18]. In order to optimize the reject threshold, Li et al.
[27] proposed to control the error instead of finding a trade-off between rejection rate and error rate. They reformulated the problem as: given an error rate for each class, design a classifier with the smallest rejection rate. A similar alternative was proposed by Hanczar et al. [28], although they controlled the conditional error rate of the classifier, not the error rate. Kressel et al. [29] optimized the reject threshold to get a minimal false positive rate and Herbei et al. [30] presented the rejection cost $t$ as the upper bound on the conditional probability of misclassification, optimized by minimizing the error rate while also keeping the reject rate minimal. These approaches often ignore the detection of outliers and the rejection of samples when the classes are not overlapped.

Further improvements on the application of the reject option are discussed in chapter 4.

3.6 Implications of reject option in classification performance evaluation

When a classifier involves the reject option, the performance measures of the classifier must be properly interpreted in order to take into account that samples can be rejected.

p-DPLS is a binary classifier.
This means that the classification decision is to choose between two classes, $\omega_1$ and $\omega_0$, that can be generically called Positive (P) class and Negative (N) class respectively. Hence, the result from p-DPLS can be that the sample is correctly classified in its class, either in class $\omega_1$ (True Positive, TP, i.e., a positive sample that is classified as positive) or in class $\omega_0$ (True Negative, TN, i.e., a negative sample that is classified as negative), or incorrectly classified, either in class $\omega_1$ (False Positive, FP, i.e., a negative sample that is incorrectly classified as positive) or in class $\omega_0$ (False Negative, FN, i.e., a positive sample that is classified as negative) (Table 1). When the reject option is implemented, the possible outputs of the classifier include that the sample may be rejected. A positive object that is rejected is called Reject Positive (RP) and, equivalently, a negative object that is rejected is called Reject Negative (RN). A sample may have been rejected because its classification was not reliable enough (the risk was too high) or because it was pointed out as an outlier (see chapters 4 and 5).

Table 1. Confusion matrix, outcomes of a binary classifier, as described by Kohavi and Provost [31].

                            True class
Predicted                   Positive ($\omega_1$)    Negative ($\omega_0$)
Positive ($\omega_1$)       TP                       FP
Negative ($\omega_0$)       FN                       TN
Rejected ($\omega_r$)       RP                       RN

The objective of p-DPLS or any other classifier is to classify correctly as many future samples as possible, i.e., minimize the number of false positives, false negatives and rejections. For simplicity, this is generally evaluated by the accuracy or the error rate of the classification model.

Accuracy is defined as the percentage of samples correctly classified:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{31} \]

If rejection is not an option, all samples are classified and the denominator of Eq. (31) is equal to the number of samples $I$ submitted to the classifier (i.e. $I = TN + FN + TP + FP$).
Hence, classically, accuracy is calculated by dividing the number of samples correctly classified by the total number of samples, $I$. When rejection is an option, Eq. (31) is still valid but note that the denominator is no longer equal to the total number of samples $I$, since some of them may have been rejected (i.e. $I = TN + FN + TP + FP + RP + RN$). Hence, the accuracy must be interpreted as the percentage of correctly classified samples with respect to the number of samples for which the classifier issued a class label [22]. Note that this is the most meaningful interpretation, although it is rarely considered in the works with reject option, in which the accuracy is calculated by dividing the number of samples classified correctly by the number of samples submitted to the classifier, either rejected or not [21].

Its significance resides in the fact that the experimenter wants the class label issued by the classifier to be correct. Hence, the performance measure should reflect the percentage of the samples for which the classifier assigned a class, and whether this has been done correctly or wrongly. In this way, the accuracy of the classifier with reject option can be higher than the accuracy of the classifier without reject option (note that if the accuracy were defined over the total number of samples, classifiers with reject option would never perform better than models without reject option, because the number of samples well classified using the reject option would be equal or lower).

Similarly, the error rate is defined as the percentage of samples that are assigned to the wrong class [32]:

\[ \mathrm{Error\ rate} = \frac{FP + FN}{TP + TN + FP + FN} \tag{32} \]

The error rate must also be reinterpreted, like the accuracy, when rejection is an option. Hence, the denominator of Eq. (32) is the total number of samples classified (without taking into account the rejected ones).

The sensitivity and the specificity are defined in similar terms [33].

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (33)$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (34)$$

The sensitivity is the number of positive samples (class ω1) correctly classified relative to the number of positive samples classified. Note that the expression of the denominator is the same with and without the reject option, but its meaning changes: without reject option, the number of positive samples classified is the total number of positive samples, whereas with reject option the number of positive samples classified (TP + FN) may differ from the total number of positive samples (TP + FN + RP). An analogous situation arises for the negative samples with the specificity. In this measure, the number of negative samples classified may differ from the total number of negative samples, since some of them may be rejected when the reject option is implemented.

Furthermore, when the reject option is introduced, new performance parameters appear [33]:

(Equations (35) and (36), which define these reject-related parameters, are not legible in the source.)

However, redefining the performance parameters is not enough to accurately evaluate classifiers with reject option. Note that a model that refuses to classify most of the samples but correctly classifies the remaining few will have a high accuracy; however, it is not useful. In addition, the drawback of parameters like accuracy is that, individually, they do not capture all the aspects that summarize the performance of the classifier (i.e. correct classifications, misclassifications and rejections). For that purpose, the cost is a more useful parameter. It is defined as:

$$\mathrm{Cost} = \lambda_m N_m + \lambda_r N_r + \lambda_c N_c \qquad (37)$$

where λm is the cost of a wrong classification, λr is the cost of rejecting a sample, λc is the cost of a correct classification, and Nm, Nr, Nc are the number of samples misclassified, rejected or correctly classified, respectively. The Cost makes it possible to take into account the rejections and, in addition, the cost that each classification decision implies [3, 17]. These costs (λ) must be optimized to keep the efficiency of the classifier.
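
The performance parameters above can be illustrated with a minimal Python sketch. The `Outcomes` container, the function names, the example counts and the cost values are all assumptions made for illustration; they are not taken from the thesis. The sketch evaluates Eqs. (31) to (34) over the classified samples only, and Eq. (37) over all samples.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts of the six outcomes in Table 1 (reject counts default to zero)."""
    tp: int
    tn: int
    fp: int
    fn: int
    rp: int = 0
    rn: int = 0


def accuracy(o: Outcomes) -> float:
    """Eq. (31): correct decisions over the samples that received a label."""
    return (o.tp + o.tn) / (o.tp + o.tn + o.fp + o.fn)


def error_rate(o: Outcomes) -> float:
    """Eq. (32): wrong decisions over the samples that received a label."""
    return (o.fp + o.fn) / (o.tp + o.tn + o.fp + o.fn)


def sensitivity(o: Outcomes) -> float:
    """Eq. (33): rejected positives (RP) do not enter the denominator."""
    return o.tp / (o.tp + o.fn)


def specificity(o: Outcomes) -> float:
    """Eq. (34): rejected negatives (RN) do not enter the denominator."""
    return o.tn / (o.tn + o.fp)


def cost(o: Outcomes, lam_m: float, lam_r: float, lam_c: float = 0.0) -> float:
    """Eq. (37): lam_m * N_m + lam_r * N_r + lam_c * N_c."""
    n_m = o.fp + o.fn  # misclassified
    n_r = o.rp + o.rn  # rejected
    n_c = o.tp + o.tn  # correctly classified
    return lam_m * n_m + lam_r * n_r + lam_c * n_c


# Hypothetical counts for 100 samples each (not data from the thesis).
moderate = Outcomes(tp=40, tn=45, fp=3, fn=2, rp=6, rn=4)
timid = Outcomes(tp=2, tn=3, fp=0, fn=0, rp=48, rn=47)

for name, o in [("moderate", moderate), ("timid", timid)]:
    print(name, round(accuracy(o), 3), cost(o, lam_m=1.0, lam_r=0.25))
```

With these illustrative numbers, the `timid` classifier, which rejects 95 of 100 samples, reaches an accuracy of 1.0 but a cost of 23.75, whereas the `moderate` classifier has a lower accuracy and a cost of 7.5. This is the situation described above: accuracy alone can favour a classifier that is not useful. The functions assume that at least one sample of the relevant kind was classified; otherwise the ratios are undefined.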

3.7 Conclusions

Probabilistic Discriminant Partial Least Squares (p-DPLS) is a binary classifier that has some advantages over other versions of DPLS: 1) it assumes neither an arbitrary classification threshold for the ŷ values nor a Gaussian distribution for the ŷ values of each class, and 2) it assigns the class label based on the Bayes classification rule of the a posteriori probability or, more generally, of minimum risk.

However, the strict application of the Bayes rule forces the classifier to always assign the sample to one of the predefined classes. This is a limitation for those samples that may be outliers or ambiguous, and hence have a large chance of being misclassified. The danger of misclassification can be reduced by implementing the reject option. In this chapter, two approaches to implementing the reject option in p-DPLS have been discussed. One of them introduces the reject option as a reject class. The second one introduces the reject option as a threshold. The best approach is to set a reject threshold. With this approach, the a priori probabilities or shapes of an extra class do not need to be assumed. However, the reject option set by the reject threshold alone is not able to reject outliers, so additional constraints must be considered.

It is also essential for any classifier that the classification performance is evaluated correctly. A general approach is to use the accuracy or the error rate. These parameters, however, have the weaknesses that they consider all incorrect decisions (or correct decisions) equally risky and that they treat all outcomes as equally likely [26]. Since the rejections are not evaluated, these parameters are not useful for evaluating classifiers with reject option. For such classifiers, the Cost parameter is a better approach.

References

[1] Pérez, N.F., J. Ferré, and R. Boqué, Calculation of the reliability of classification in Discriminant Partial Least-Squares Classification. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 122-128.
[2] Bayes, T., An Essay towards solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 1763. 53: p. 370-418.
[3] Chow, C.K., An optimum character recognition system using decision functions. IRE Trans. Electronic Computers, 1957. 16: p. 247-254.
[4] Eriksson, L., et al., Multi- and Megavariate Data Analysis. Principles and Applications. 2001: Umetrics AB.
[5] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[6] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, S. Kotz and N.L. Johnson, Editors. 1985, Wiley: New York. p. 581-591.
[7] Gemperline, P.J., L.D. Webber, and F.O. Cox, Raw Materials Testing Using Soft Independent Modelling of Class Analogy Analysis of Near-Infrared Reflectance. Anal. Chem., 1989. 61: p. 138-144.
[8] Webb, A., Statistical Pattern Recognition, 2nd edition. Wiley, 2002, Malvern, UK.
[9] Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification (2nd edition). Wiley Interscience, 2001, New York.
[10] Landgrebe, T., et al., The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 2006. 27: p. 908-917.
[11] Muzzolini, R., Y.-H. Yang, and R. Pierson, Classifier design with incomplete knowledge. Pattern Recognition, 1998. 31: p. 345-369.
[12] Botella, C., J. Ferré, and R. Boqué, Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009. 80: p. 321-328.
[13] Hills, M., Allocation Rules and their Error Rates. Journal of the Royal Statistical Society. Series B (Methodological), 1966. 28: p. 1-31.
[14] Bishop, C.M., Pattern Recognition and Machine Learning. Springer, 2006, New York.
[15] Denœux, T., Analysis of evidence-theoretic decision rules for pattern classification. Pattern Recognition, 1997. 30: p. 1095-1107.
[16] Lachenbruch, P.A. and M. Goldstein, Discriminant Analysis. Biometrics, 1979. 35: p. 69-85.
[17] Anderson, T.W., Introduction to Multivariate Statistical Analysis. 1958, New York: John Wiley and Sons.
[18] Santos-Pereira, C.M. and A.M. Pires, On optimal reject rules and ROC curves. Pattern Recognition Letters, 2005. 26: p. 943-952.
[19] Tortorella, F., An optimal reject rule for binary classifiers. In: Ferri, F.J. et al. (Eds.), Advances in Pattern Recognition: Joint IAPR International Workshops, SSPR 2000 and SPR 2000, Lecture Notes in Computer Science, vol. 1876. Springer-Verlag, Heidelberg. 2000: p. 611-620.
[20] Bishop, C.M., Pattern Recognition and Machine Learning. Springer Science+Business Media, 2006, Singapore.
[21] Tortorella, F., A ROC-based reject rule for dichotomizers. Pattern Recognition Letters, 2005. 26: p. 167-180.
[22] Fumera, G., F. Roli, and G. Giacinto, Multiple Reject Thresholds for Improving Classification Reliability, in Advances in Pattern Recognition, SSPR&SPR. 2000, Springer: Berlin-Heidelberg. p. 863-871.
[23] Chow, C.K., On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970. 16: p. 41-46.
[24] Fumera, G., I. Pillai, and F. Roli, Classification with Reject Option. Proceedings of the 12th International Conference on Image Analysis and Processing (ICIAP'03), 2003.
[25] Landgrebe, T., et al., A combining strategy for ill-defined problems, in Fifteenth Ann. Sympos. of the Pattern Recognition Association of South Africa. 2004.
[26] Brown, C.D. and H.T. Davis, Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 2006. 80: p. 24-38.
[27] Li, M. and I.K. Sethi, Confidence-based classifier design. Pattern Recognition, 2006. 39: p. 1230-1240.
[28] Hanczar, B. and E.R. Dougherty, Classification with reject option in gene expression data. Bioinformatics, 2008. 24: p. 1889-1895.
[29] Kressel, U., F. Lindner, and C. Wöhler, Classification System with reject class. 2004, DaimlerChrysler AG (DE): United States.
[30] Herbei, R. and M.H. Wegkamp, Classification with reject option. The Canadian Journal of Statistics, 2006. 34: p. 709-721.
[31] Kohavi, R. and F. Provost, Glossary of Terms. Machine Learning (Kluwer Academic Publishers), 1998. 30: p. 271-274.
[32] Smith, C.A.B., Some examples of discrimination. Ann. Eugen., 1947. 13: p. 272-282.
[33] Bradley, A.P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997. 30: p. 1145-1159.

CHAPTER 4

Classification from microarray data using p-DPLS with reject option

Talanta, 2009, Vol. 80 (1): 321-328

Microarrays allow the expression of thousands of genes in a cell to be evaluated simultaneously. One of the most relevant applications of these gene expression data is to classify the samples (e.g. cells or tissues) into one of the classes of interest.
Discriminant Partial Least-Squares (DPLS) is often used for such a purpose. However, most published results report the straight application of this method, disregarding the quality of each individual prediction and the possibility of detecting prediction outliers. The aim of this chapter is to improve DPLS for classifying microarray data. Firstly, we implement a new version of DPLS called probabilistic Discriminant Partial Least Squares (p-DPLS). This method bases the classification of a sample on kernel probability density functions (PDFs) and the Bayes rule of a posteriori probability. Secondly, a reject option is introduced so that the classifier can reject samples in the ambiguity region, based on Chow's rule, and can reject samples outside the defined limits of the classes. The ambiguity region is the zone where the PDFs that characterize each one of the classes overlap. In that zone, the model cannot discriminate well enough whether a sample belongs to one class or to the other, either because of limitations of the PLS model or because the samples actually share characteristics of the modelled classes. Hence, there is a high risk that any attempt to classify that sample would result in a misclassification. The second possibility of rejection is implemented at the ends of the classes' domains and also between the PDFs of non-overlapped classes. Samples in those regions have extreme predictions, outside the limits set for the classes, so they may be considered outliers. For such samples, we prefer to reject them rather than take the risk of misclassifying them. These two approaches will be detailed and discussed in the methods section.

The existence of a reject option increases the experimenter's confidence in the classification rule and improves the accuracy of the final classification models. Note that with the reject option only those samples whose classification is reliable are actually classified, while the samples either outside the limits or in the ambiguity region, which could lead to misclassifications, are rejected.

The p-DPLS with reject option was tested with two public datasets. With the Human Cancers dataset, the accuracy measured by leave-one-out cross-validation was improved from 97% to 99% when compared to p-DPLS without reject option. For the Breast Cancer dataset, the method could reject 100% of the test samples submitted to the classifier that did not belong to any of the modelled classes. These samples would have been misclassified if the reject option had not been considered.

This work is presented in paper form, as published in Talanta 2009, Vol. 80 (1) 321-328.

Classification from microarray data using probabilistic discriminant partial least squares with reject option

Cristina Botella, Joan Ferré*, Ricard Boqué

Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007. Tarragona, Spain

* Corresponding author: [email protected]

Talanta 2009, Vol. 80 (1) 321-328 (Edited for format)

Abstract

Microarrays are used to simultaneously determine the expressions of thousands of genes. An important application of microarrays is in the classification of samples into classes of interest (e.g. either healthy cells or tumour cells). Discriminant Partial Least-Squares (DPLS) has often been used for this purpose. In this paper, we describe an improvement to DPLS that uses kernel-based probability density functions and the Bayes rule to classify samples whilst keeping the option of not classifying the sample if this cannot be done with sufficient confidence. With this approach, those samples outside the boundaries of the known classes or from the ambiguity region between classes are rejected, and only samples with a high probability of being correctly classified are indeed classified.
The optimal model is found by simultaneously minimizing the misclassification and rejection costs. The method (p-DPLS with reject option) was tested with two datasets. For the Human Cancers dataset, the accuracy (obtained by leave-one-out cross-validation) was improved from 97% to 99% when compared to p-DPLS without reject option. For the Breast Cancer dataset, p-DPLS with reject option was able to reject 100% of the test samples that did not belong to any of the modelled classes. These samples would have been misclassified if the reject option had not been considered.

4.1 Introduction

Supervised classification is increasingly being applied to microarray gene expression data in order to predict tumour types [1-3], to differentiate between healthy and tumour samples [4-6] and to differentiate between pharmacological mechanisms [7], among other applications. Microarray data are characterized by thousands of variables (genes) and few samples, resulting in high redundancy and a high number of non-informative measurements. There has been a lot of interest in using factor-based multivariate classification methods such as Discriminant Partial Least Squares (DPLS) to analyze these data [8, 9]. DPLS uses a few latent variables rather than many measured variables, and this brings with it a series of advantages. DPLS takes variable correlations into account, filters noise and leads to classification rules with good predictive performance, especially when DPLS is implemented together with variable selection methods. DPLS has been used to differentiate between samples before and after chemotherapy [10], to determine the different states of a breast cancer tumour [11], to predict the efficacy of a drug by using expression data biomarkers [12], and to predict the quality of DNA-microarray spots [13].

Like other classification rules, DPLS must have two main qualities: it must provide reliable classifications of forthcoming samples and it must minimize the number of misclassifications (i.e. the expected error rate). Both of these are improved if the classifier is allowed to reject doubtful samples instead of always being forced to classify them in one of the modelled classes. By classifying only the most well-defined cases, both the accuracy of the classifier and the reliability of each classification are improved.

In this paper we implement the reject option in the recently developed p-DPLS classifier and show how it can be used for microarray data classification. p-DPLS is a variant of DPLS that uses kernel functions to calculate a probability density function (PDF) for each class. This allows a flexible implementation of the Bayes rule for classification, and also provides a measure of the reliability of the classification. Reliability is a primary concern in statistical classification, especially when this classification is used in critical health applications such as cancer diagnosis [14], an issue which has also led to several other studies [15].

In classification, rejection is advantageous when: (a) the new sample does not belong to any of the trained classes, (b) the new sample belongs to one of the classes but is very different from the samples used for training the classifier, or (c) the sample is in the boundary region between classes. Situation (a) occurs when the sample is an outlier. Forcing the classifier to decide among the modelled classes will produce a classification error (e.g. a cell does not belong to any of the modelled cell types but it is classified as one of them).
Situation (b) typically arises when the sampling of the training samples is incomplete or not representative. Finally, situation (c) may arise, for example, because of the limited discriminative power of the measured variables or because the classification algorithm has limited discriminative power. Although samples in situations (b) and (c) might finally be classified correctly, they might also be classified incorrectly because either they are unique samples or they are ambiguous samples and can belong to either of the classes, respectively.

The reject option aims to overcome situations (a) to (c) by rejecting the sample and not classifying it when the probability of error is too high. This is a safeguard against errors and improves the accuracy of the classifier, which is evaluated as the percentage of samples correctly classified among the number of samples classified [16]. This in turn leads to greater confidence in the samples that are finally classified. The reject option can be fine-tuned in order to avoid rejecting too many samples that would otherwise be classified correctly. Since too high a rejection rate would decrease the usefulness of the classifier, a compromise must be reached between improving the accuracy and reducing the usefulness of the classifier. There has been extensive research into the theoretical aspects of the reject option [14, 17-28], most of which relates to Chow's reject option [29], which implemented the reject option for the Bayes rule. Chow's reject option has recently been applied to microarray expression data [30].

There are still two limitations to jointly applying the Bayes and Chow rules. First, they are not adequate for the extreme (outlying) samples (situations (a)-(b)), which are typically found at the extremes of the probability density functions (PDFs).
These samples must be rejected according to a different criterion. Second, both rules require knowledge of the a priori probabilities and the PDFs of the classes [31], which makes applying these rules more difficult. In this paper, the first limitation is overcome by including distance-based thresholds, which is equivalent to selecting a confidence interval around each class and rejecting samples outside this interval [17]. The second limitation is overcome by calculating PDF-like functions in p-DPLS [32], which makes an approximate Bayesian classification easier.

4.2 Methods

4.2.1 Probabilistic DPLS

The DPLS method applies Partial Least-Squares (PLS) regression to binary classification problems, in which the dependent variable y codifies the class of each sample [8, 33]. A DPLS model is calculated by regressing y on X using the adequate number of factors. For microarray gene expression data, X is an N×P matrix of N samples and P gene expressions and y is an N×1 vector of ones and zeros, where the integer 0 codifies the sample as belonging to class ω0 (e.g. "cancer of type I") and the integer 1 codifies the sample as belonging to class ω1 (e.g. "cancer of type II"). For a sample i, the value predicted by the PLS model is ŷi = xiᵀb, where b is the vector of regression coefficients for the model of A factors and the adequate pre-processing is implicit (e.g. if b had been calculated from mean-centered data, then xi should be mean-centered, and the predicted ŷi should be back-transformed accordingly). With this coding of y, the prediction for a sample should be close to 0 if the sample belongs to class ω0, and it should be close to 1 if the sample belongs to class ω1. In order to better define the cut-off value between classes, Pérez et al. [32] developed p-DPLS, a probabilistic version of DPLS in which the uncertainty of the predicted value ŷ is accounted for in the calculation of the model. This method is described here for completeness.
The method starts by calculating a DPLS model of A factors with X and y. Then, this model is used to predict the training samples and, for each training sample i, a Gaussian function centred at the predicted value ŷi is calculated as:

$$f_i(\hat{y}) = \frac{1}{\mathrm{SEP}_i\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{\hat{y}-\hat{y}_i}{\mathrm{SEP}_i}\right)^{2}} \qquad (1)$$

$$\mathrm{SEP}_i = \mathrm{RMSEC}\,\sqrt{1 + h_i} \qquad (2)$$

$$\mathrm{RMSEC} = \sqrt{\frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}{N - A - \delta}} \qquad (3)$$

where SEPi is the standard error of prediction for sample i, hi is the leverage of the sample, RMSEC is the root mean square error of calibration, yi is the known class of the training sample i (i.e. value 0 for a sample of class ω0 and value 1 for a sample of class ω1) and δ is 1 if the data have been centred and 0 if they have not. Figure 1 shows the Gaussian functions calculated from the predictions of three training samples of class ω0 and four samples of class ω1. Note that the width of the Gaussian kernel for sample i depends on SEPi, which is particular to that sample and depends on the relative position of the sample in the multivariate space. Then, for classes ω0 and ω1, a PDF is calculated as the average of the individual kernel functions of the training samples of each class:

$$p(\hat{y}\mid\omega_0) = \frac{1}{n_0}\sum_{i=1}^{n_0} f_i(\hat{y}) \qquad (4)$$

$$p(\hat{y}\mid\omega_1) = \frac{1}{n_1}\sum_{i=1}^{n_1} f_i(\hat{y}) \qquad (5)$$

where n0 and n1 are the number of samples of class ω0 and class ω1, respectively, and each sum runs over the training samples of the corresponding class.

Figure 1. Simulated PDFs of class ω0 and class ω1 obtained from Equations (4) and (5). The kernel functions (Eq. (1)) are centred on the prediction ŷi of each training sample. According to the code assigned to the classes, the sample predictions of class ω0 and class ω1 should be located around the values 0 and 1, respectively.

4.2.2 Bayes rule for classification

In p-DPLS, the predicted class for sample i is obtained using the Bayes rule. The prediction ŷi for that sample is used to obtain the a posteriori probabilities P(ω0|ŷi) and P(ω1|ŷi).
These are the probabilities that the sample belongs either to class ω0 or to class ω1, once it is known that the sample's prediction is ŷi. For the two-class classification problem:

$$P(\omega_0\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_0)\,P(\omega_0)}{p(\hat{y}_i)} \qquad (6a)$$

$$P(\omega_1\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_1)\,P(\omega_1)}{p(\hat{y}_i)} \qquad (6b)$$

where p(ŷi|ω0) and p(ŷi|ω1) are the conditional probabilities evaluated from the PDFs of classes ω0 and ω1, and P(ω0) and P(ω1) are the a priori probabilities. Both a priori probabilities may be estimated as the proportion of samples of each class in the training set, provided that the set is representative of the overall population. That is, P(ω0) = n0/N and P(ω1) = n1/N, where N = n0 + n1. The denominator of Equations (6a) and (6b) is:

$$p(\hat{y}_i) = p(\hat{y}_i\mid\omega_0)\,P(\omega_0) + p(\hat{y}_i\mid\omega_1)\,P(\omega_1) \qquad (7)$$

The Bayes rule assigns the sample to the class in which it has the highest a posteriori probability [31]. The rule is:

$$\begin{aligned}
&\text{class } \omega_0 \ \text{ if } \ P(\omega_0\mid\hat{y}_i) > P(\omega_1\mid\hat{y}_i)\\
&\text{class } \omega_1 \ \text{ if } \ P(\omega_1\mid\hat{y}_i) > P(\omega_0\mid\hat{y}_i)
\end{aligned} \qquad (8)$$

Although this rule is optimal in the sense that no other rule can yield a lower error probability, it is not always satisfactory. For example, when ŷi is at one of the extremes of the PDF (Figure 2), both p(ŷi|ω0) and p(ŷi|ω1) are low, and the products p(ŷi|ω0)·P(ω0) and p(ŷi|ω1)·P(ω1) are also low, but the a posteriori probability for one of the classes (the ratio in Equations 6a and 6b) is high. This means that the further the prediction ŷi is from one class, the more likely it will be allocated to the other class. This is a reasonable result since the classifier only expects to receive samples from the two modelled classes. In most multivariate applications, however, samples from non-modelled classes (outliers) may also be inadvertently submitted to the classifier. The predictions for those samples will most probably be found at the tails of a PDF and hence give a misleadingly high a posteriori probability for one of the classes.
Consequently, forcing the two-class Bayes rule to classify any input sample may involve a high risk because outliers may be erroneously classified in one of the modelled classes.

Figure 2. Possible distributions of class ω0 and class ω1 with the distance reject limits, plotted as p(ŷi|ω0)·P(ω0) and p(ŷi|ω1)·P(ω1) against ŷ, with the regions assigned to ω0, ω1 and reject. a. Overlapped classes. b. Well separated classes. The markers (● and □) indicate possible predictions of unknown samples for which the Bayes rule gives questionable results. LL0, HL0, LL1 and HL1 are the limits for rejection based on the distance reject option.

Another situation in which the usefulness of the Bayes rule is limited occurs when the predicted value ŷ is in the boundary between classes (the ambiguity region). The dot in the centre of Figure 2a represents a sample whose characteristics are similar for both classes, with the result that the model cannot clearly distinguish whether it belongs to one class or to the other. Again, the sample will be assigned to the class to which, according to the Bayes rule, it has the highest probability of belonging. However, since the probability that the sample belongs to class ω0 is similar to the probability that it belongs to class ω1, there is a high risk of misclassification and the reliability of the classification is low.

These situations show that the reject option might be an advantageous addition to the decision rule. In this paper, we implement the reject option in p-DPLS. Both the classification reliability and the accuracy of the p-DPLS model are improved by identifying unreliable classifications and rejecting the sample instead of running the risk of misclassifying it.
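
The two weaknesses of the plain Bayes rule described in this section can be reproduced numerically. The following Python sketch is not the authors' implementation: the training predictions and SEP values are hypothetical (in practice SEPi would come from Eq. (2)), and the function names are invented for illustration. It builds the class PDFs as averages of Gaussian kernels (Eqs. (1), (4) and (5)) and evaluates the a posteriori probabilities (Eqs. (6a), (6b) and (7)) at three predictions.

```python
import numpy as np


def class_pdf(y, centers, seps):
    """Eqs. (1), (4), (5): mean of Gaussian kernels centred at the training
    predictions of one class, each with its own width SEP_i, evaluated at y."""
    centers = np.asarray(centers, dtype=float)
    seps = np.asarray(seps, dtype=float)
    kernels = np.exp(-0.5 * ((y - centers) / seps) ** 2) / (seps * np.sqrt(2.0 * np.pi))
    return float(kernels.mean())


def posteriors(y, train0, train1):
    """Eqs. (6a), (6b), (7): a posteriori probabilities of both classes at y.

    train0 and train1 are (centers, seps) pairs; the priors are taken as the
    class proportions in the training set, as suggested in the text.
    """
    n0, n1 = len(train0[0]), len(train1[0])
    joint0 = class_pdf(y, *train0) * n0 / (n0 + n1)
    joint1 = class_pdf(y, *train1) * n1 / (n0 + n1)
    evidence = joint0 + joint1  # Eq. (7); would underflow to 0 for very remote y
    return joint0 / evidence, joint1 / evidence, evidence


# Hypothetical training predictions and SEP values:
# three samples of class w0 and four samples of class w1.
train0 = ([-0.05, 0.02, 0.10], [0.08, 0.08, 0.08])
train1 = ([0.90, 0.98, 1.05, 1.10], [0.08, 0.08, 0.08, 0.08])

for y in (0.05, 0.50, 2.00):
    p0, p1, evidence = posteriors(y, train0, train1)
    print(f"y_hat={y:4.2f}  P(w0|y)={p0:.3f}  P(w1|y)={p1:.3f}  p(y)={evidence:.2e}")
```

With these illustrative values, the prediction at 2.00 lies far from every training sample, so p(ŷ) is vanishingly small, yet the a posteriori probability of ω1 is practically 1. The prediction at 0.50 receives a posteriori probabilities close to 0.5 for both classes. These are the extreme and the ambiguous cases, respectively, that motivate the reject option.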

4.2.3 Implementation of the reject option in p-DPLS

The reject option in the case of classification ambiguity (i.e. for overlapped PDFs) can be derived by adapting Chow's rule to the PDFs obtained in p-DPLS. Chow's rule sets a threshold t so that the sample is rejected if the highest a posteriori probability is lower than (1 − t). In other words, the sample is classified only if:

$$\max\bigl(P(\omega_0\mid\hat{y}_i),\, P(\omega_1\mid\hat{y}_i)\bigr) \geq (1 - t) \qquad (9)$$

Thus, only those samples whose classification is reliable enough are indeed classified. The other samples are rejected because they could be misclassified. The threshold that optimizes the trade-off between the error rate and the reject rate can be derived from the costs associated with each classification result [29]:

$$t = \frac{\lambda_r - \lambda_c}{\lambda_m - \lambda_c} \qquad (10)$$

where λm, λr and λc are the costs of incorrect classification, of rejection and of correct classification, respectively. The values that are assigned to these costs make the reject option tuneable. The cost of being wrong is higher than the cost of both rejecting and classifying correctly (λm > λr > λc). In fact, it is preferable to reject a sample and gather additional information than to classify the sample incorrectly. It is also generally assumed that classifying correctly has no cost (λc = 0). Note that Equation 9 is a generalization of the standard Bayes rule. In particular, for the extreme case in which the cost of rejection λr equals the cost of misclassification λm, the reject threshold is t = 1 and Chow's rule is reduced to the standard Bayes rule, in which samples are never rejected. More generally, a sample is never rejected if t ≥ 1 − 1/C, where C is the number of possible classes, because the highest a posteriori probability is always at least 1/C (for a binary classification, C = 2 and this bound is 0.5) [34].

The second reason for using the reject option is to avoid classifying extreme samples that have a large a posteriori probability but low values at both PDFs. In order to solve this problem, Dubuisson and Masson [18] added a distance reject criterion to Chow's ambiguity reject option.
This idea is implemented here for the p-DPLS model by imposing limits on the ŷ values, which define the extreme regions in which the samples will be rejected. The limits are chosen so that the sum of the area in the tails of each PDF is five percent of the total area of the distribution (i.e. the distance reject probability equals 0.05 for each class, see Figure 2) [19]. Since the limits depend on the shape of the distributions of each class, they are particular to each p-DPLS model with a given number of factors. In practice, when the PDFs are overlapped we have two operative limits, a High Limit (HL) and a Low Limit (LL), and when the PDFs are separated we have four limits (HL and LL for each class, see Figure 2).

Assuming the constraints for the distance reject and the ambiguity reject, the Bayes rule with reject option is:

$$\begin{aligned}
&\text{reject if } \ \hat{y}_i < LL_0\\
&\text{reject if } \ (HL_0 < \hat{y}_i < LL_1)\\
&\text{reject if } \ \hat{y}_i > HL_1\\
&\text{reject if } \ \max\bigl(P(\omega_0\mid\hat{y}_i),\, P(\omega_1\mid\hat{y}_i)\bigr) < (1 - t)\\
&\text{classify into } \omega_0 \ \text{ if } \ P(\omega_0\mid\hat{y}_i) > P(\omega_1\mid\hat{y}_i)\\
&\text{classify into } \omega_1 \ \text{ if } \ P(\omega_1\mid\hat{y}_i) > P(\omega_0\mid\hat{y}_i)
\end{aligned} \qquad (11)$$

4.2.4 Evaluation of the classification method performance

The p-DPLS models can be calculated for different numbers of factors that are needed to explain the relevant information. Thus, every p-DPLS model will produce different ŷ predictions for the calibration samples and, therefore, different PDFs, which, in turn, will influence the performance of the classifier. The performance of a classifier is commonly characterized by its error rate (or the classification rate, which is the percentage of correctly classified samples) when classifying a test set of unseen samples that were not used during the training phase. The actual class of every sample in the test set is compared to the class to which it is assigned by the classifier.
In general terms, however, it is not the misclassification (and rejection) rate that we want to minimize, but the misclassification (and rejection) cost [35], since the cost more accurately reflects the objective of the classification rule [36]. The Cost is here defined as:

$$\mathrm{Cost} = \lambda_m N_m + \lambda_r N_r \qquad (12)$$

where Nr is the number of rejected samples and Nm is the number of misclassified samples. The cost of correctly classifying a sample has been set to zero. Here, the minimization of the cost will be used to decide on the optimal number of factors in the p-DPLS model.

4.3 Results and Discussion

4.3.1 Datasets

The proposed classification rule (Eq. 11) was applied to two datasets, the Human Cancers dataset [37] and the Breast Cancer dataset [38]. These datasets have been studied extensively in the literature [39, 40] and also used to evaluate the performance of classification models [41-44]. The Human Cancers dataset consists of 282 microRNA (miRNA, non-coding RNA species) normalized expression profiles for 218 samples, including 46 healthy samples (class ω0) and 172 tumour samples (class ω1) from several healthy and tumour tissues (ovary, colon and lung, to mention a few). The dataset was divided into a training set and a test set by applying the Kennard-Stone algorithm [45] to the scores of the first 20 Principal Components (PCs), which were obtained from the Principal Component Analysis (PCA) of the raw gene expression matrix. For this dataset, the training set contained 153 samples (116 samples of class ω1 and 37 samples of class ω0), and the test set had 65 samples (56 of class ω1 and 9 of class ω0).
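
The training/test split described above can be sketched as follows. This is a generic Python version of the Kennard-Stone selection written for illustration, not the code used by the authors. The random matrix stands in for the real expression data, and mean-centring before the PCA, the use of Euclidean distances and the function name `kennard_stone` are assumptions of the sketch.

```python
import numpy as np


def kennard_stone(points, n_select):
    """Return the row indices of n_select points (n_select >= 2) chosen by
    Kennard-Stone: start from the two most distant points, then repeatedly add
    the point whose distance to its nearest selected point is largest."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean distances
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [k for k in range(len(points)) if k not in selected]
    while len(selected) < n_select:
        # distance from every remaining point to its nearest selected point
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected


rng = np.random.default_rng(0)
X = rng.normal(size=(218, 282))   # placeholder for the 218 x 282 expression matrix
Xc = X - X.mean(axis=0)           # mean-centring is an assumption of this sketch
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :20] * S[:20]       # scores on the first 20 principal components

train_idx = kennard_stone(scores, 153)
test_idx = sorted(set(range(len(X))) - set(train_idx))
print(len(train_idx), len(test_idx))
```

Because Kennard-Stone picks the samples that span the score space most uniformly, the training set tends to cover the extremes of the data and the test set falls inside that range. The class composition of each subset then follows from the data; the sketch does not attempt to reproduce the class counts reported above.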
The Breast Cancer dataset consists of 5361 normalized gene expression ratios. These were used in [38] to prove that a heritable mutation influences the gene expression profile of breast cancer. Seven samples of the BRCA1 mutation were used as class ω0, eight samples of the BRCA2 mutation were used as class ω1, and six samples of Sporadic mutation were used as test samples.

4.3.2 Human Cancers dataset

Although p-DPLS is a full-variable method, it can often be improved by carefully selecting the variables and removing irrelevant miRNA expressions that interfere with the discriminative power of the relevant miRNA [46]. For this dataset, the 100 variables with the highest VIP (variable importance for the projection) values were considered. VIP values were calculated as described in [8, 47]. These values quantify how each variable influences the response summed over all components and classes. For the selected variables, six p-DPLS models were calculated with 1 to 6 factors using mean-centered miRNA expression patterns (we will denote each model as p-DPLSA, where A is the number of factors). The a priori probabilities for these six models were P(ω0) = 37/153 = 0.24 and P(ω1) = 116/153 = 0.76.

The PDFs of classes ω0 and ω1 were calculated for each p-DPLS model (Eqs. 1 to 5). The test sample was classified by obtaining its ŷ prediction and then calculating the a posteriori probabilities (Eqs. 6a, 6b). Finally, the sample was either rejected or classified in the class with the highest a posteriori probability (Eq. 11). In this dataset, the high and low limits (HL and LL) for ŷi were defined so as to retain five percent of the total area of the PDF in the tails of the distributions. The costs were arbitrarily set to λc = 0, λr = 0.25, and λm = 1 because no information was available about the costs of each classification decision. Note that these costs are relative, and indicate that it is preferable to reject four samples than to classify one wrongly.
These values are illustrative and should be adjusted for each particular classification problem. With these values, the threshold value for rejection in the ambiguity zone is t = 0.25 (Eq. 10).

The models with A = 1 to A = 6 factors were validated by leave-one-out cross-validation (CV). In this process, sample i was left out of the training set, the p-DPLSA model was calculated, and the prediction ŷi for the left-out sample was obtained (note that the a priori probabilities were recalculated to take into account that one sample had been left out). This procedure was repeated for all the samples of the training set and for all the p-DPLS models.

Figure 3 and Table 1 show the cross-validation results obtained for the different p-DPLS models when the reject option (Eq. 11) is considered. Note that predictions for samples in class ω0 are around 0 and predictions for samples in class ω1 are around 1, but that the predictions partially overlap in models with fewer than four factors (underfitted models). As a result of the overlap, many samples are either rejected or wrongly classified and the cost of these models (Table 1) is high. For example, for p-DPLS2, 51% of the samples in class ω0 were rejected by CV and 27% were misclassified. On the other hand, the predictions from the models with four to six factors are grouped more tightly together. Consequently, these models have fewer misclassifications, fewer rejections, and lower classification costs.

[Figure 3: predictions ŷ (horizontal axis) for each number of factors A = 1 to 6 (vertical axis).]

Figure 3. Prediction of the training samples by CV for the different p-DPLS models with reject option. Squares: healthy samples (class ω0); green: correctly classified, blue: misclassified, red: rejected. Circles: tumour samples (class ω1); yellow: correctly classified, brown: misclassified, orange: rejected.
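The cost of Eq. 12 and its use for choosing the number of factors can be sketched as follows. The counts are the cross-validation results of Table 1 (t = 0.25), summed over both classes. The function names, the tie-breaking rule (prefer fewer factors when costs are equal) and the threshold formula t = (λr - λc)/(λm - λc) are assumptions of this sketch: the formula is the usual form of Chow's threshold and reproduces the t values quoted in the text, but Eq. 10 is not reproduced in this section.

```python
def reject_threshold(lam_c, lam_r, lam_m):
    """Assumed Chow-type threshold: t = (lam_r - lam_c) / (lam_m - lam_c)."""
    return (lam_r - lam_c) / (lam_m - lam_c)


def classification_cost(n_rejected, n_misclassified, lam_r=0.25, lam_m=1.0):
    """Eq. 12: Cost = lam_r * Nr + lam_m * Nm (correct classifications cost zero)."""
    return lam_r * n_rejected + lam_m * n_misclassified


# (rejected, misclassified) by leave-one-out CV for A = 1..6 factors,
# transcribed from Table 1 and summed over classes omega_0 and omega_1.
cv_counts = {1: (113, 2), 2: (22, 10), 3: (13, 4), 4: (12, 2), 5: (8, 1), 6: (8, 1)}

costs = {a: classification_cost(nr, nm) for a, (nr, nm) in cv_counts.items()}

# Lowest cost wins; on a tie the model with fewer factors is kept (assumption).
best_a = min(costs, key=lambda a: (costs[a], a))

print(costs)
print("t =", reject_threshold(0.0, 0.25, 1.0), "optimal number of factors =", best_a)
```

With these counts the five- and six-factor models have the same cost, so the simpler five-factor model is retained.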
Table 1. Classification of validation samples by leave-one-out cross-validation and test samples for the different p-DPLS models for t = 0.25. In brackets, classifications performed without considering the reject option.

Cross-validation samples

                  Samples of class ω0                     Samples of class ω1
Factors  Cost     Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1        30.3     0 (6)        24        10 (31)          2 (19)       89        25 (97)
2        15.5     10 (22)      19        8 (15)           0 (0)        3         113 (116)
3        7.3      4 (9)        13        20 (28)          0 (0)        0         116 (116)
4        5        2 (4)        8         27 (33)          0 (0)        4         112 (116)
5        3        1 (4)        4         32 (33)          0 (1)        4         112 (115)
6        3        1 (5)        7         29 (32)          0 (1)        1         115 (115)

Test samples

                  Samples of class ω0                     Samples of class ω1
Factors           Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1                 0 (2)        6         3 (7)            0 (0)        17        39 (55)
2                 2 (4)        5         2 (5)            0 (0)        0         56 (56)
3                 0 (2)        4         5 (7)            0 (0)        0         56 (56)
4                 0 (0)        2         7 (9)            0 (0)        0         56 (56)
5                 0 (0)        0         9 (9)            0 (0)        0         56 (56)
6                 0 (0)        0         9 (9)            0 (0)        0         56 (56)

In terms of classification cost, the optimal model is p-DPLS5 since no further improvement is obtained for the model with six factors. The PDFs for this model are presented in Figure 4a and the a posteriori probabilities across the ŷ domain are presented in Figure 4b. The limits were found to be LL0 = -0.43 and HL1 = 1.42. Thus, samples with a predicted value ŷi < -0.43 or ŷi > 1.42 would be flagged as outliers and rejected. These limits were different for each p-DPLSA model because the training sample predictions changed. According to the rejection criterion, eight samples (four from class ω0 and four from class ω1) were rejected, all of them in the ambiguity region (Table 1).
As an example, the dot in Figure 4 corresponds to the sample T_BRST_2 (tumour sample, class ω1) during the leave-one-out process. The prediction is ŷi = 0.44 and the calculated a posteriori probabilities are P(ω0|ŷi) = 0.59 and P(ω1|ŷi) = 0.41 (Eqs. 6a, 6b). Since both probabilities are similar, the confidence (reliability) that the classification is correct is low because a slight shift in ŷi due to measurement errors could have changed the assigned class. The application of the classic Bayes rule (Eq. 8) would assign the sample to the class with the highest a posteriori probability, meaning that the sample would be wrongly classified into class ω0. By allowing the reject option, defined here by Chow's rule (with t = 0.25), the sample was rejected and not classified because the highest a posteriori probability was below 1 - t (i.e. max(P(ω1|ŷi), P(ω0|ŷi)) < 0.75). In this case, the reject option prevented us from classifying a tumour sample as a healthy sample, and the expert would be prompted to make more tests before the final diagnosis. It is interesting to note, as we indicated before, that the reject option's performance depends on the relative costs assigned to the classification results. Thus, by setting different costs, the threshold (and hence the number of samples rejected) can be tuned to meet the experimenter's needs.

[Figure 4: panel a, class-conditional PDFs p(ŷ|ω0) and p(ŷ|ω1) against ŷ, with the limits LL0 and HL1; panel b, a posteriori probabilities P(ω0|ŷ) and P(ω1|ŷ) against ŷ, with the level (1 - t), the limits LL0 and HL1, and the prediction ŷ for T_BRST_2 marked.]

Figure 4. a. PDFs for the five-factor p-DPLS model obtained from the training samples during the LOOCV process when T_BRST_2 is used as the validation sample. b. A posteriori probabilities across the ŷ domain (Eqs. 6a and 6b) derived from the PDFs in a. The prediction and the a posteriori probability for sample T_BRST_2 during the LOOCV process are also shown.
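The decision rule of Eq. 11 can be written as a short function. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name, the representation of the class limits as a list of accepted intervals and the returned labels are hypothetical. The example call uses the values quoted above for sample T_BRST_2 and the limits LL0 = -0.43 and HL1 = 1.42 of the p-DPLS5 model.

```python
def classify_with_reject(y_hat, post0, post1, intervals, t):
    """Two-class Bayes rule with distance and ambiguity reject (after Eq. 11).

    intervals: accepted (low, high) ranges for y_hat. One range (LL0, HL1) when
    the class PDFs overlap; two ranges (LL0, HL0) and (LL1, HL1) when separated.
    """
    # Distance reject: the prediction lies outside every accepted range.
    if not any(low <= y_hat <= high for low, high in intervals):
        return "reject (distance)"
    # Ambiguity reject (Chow's rule): the largest posterior is below 1 - t.
    if max(post0, post1) < 1.0 - t:
        return "reject (ambiguity)"
    return "omega_0" if post0 > post1 else "omega_1"


limits_p_dpls5 = [(-0.43, 1.42)]  # overlapping PDFs: two operative limits

# Sample T_BRST_2: y_hat = 0.44, P(omega_0|y_hat) = 0.59, P(omega_1|y_hat) = 0.41.
print(classify_with_reject(0.44, 0.59, 0.41, limits_p_dpls5, t=0.25))

# With t = 0.5 the ambiguity reject can never trigger for two classes, which
# mimics the classic Bayes rule: the tumour sample is assigned to omega_0.
print(classify_with_reject(0.44, 0.59, 0.41, limits_p_dpls5, t=0.5))
```

Representing the limits as a list of intervals lets the same function cover both the overlapping case (two limits) and the separated case (four limits).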
For comparison, Table 1 shows in brackets the classification results when the classical Bayes rule is applied. For the model that best minimizes the cost, that is, p-DPLS5, five samples were misclassified if the reject option was not applied, whereas only one sample was misclassified (a healthy sample) when the reject constraints were applied. Thus, this model's classification accuracy (i.e. the ratio of the number of correctly classified samples to the number of samples classified) was improved from 97% (148/153) to 99% (144/145). Notice, however, that the reject option also rejected some samples that would otherwise be correctly classified: the number of correctly classified samples decreased from 148 to 144. This reduction in the number of correctly classified samples is the price to pay for safeguarding against errors, and follows the trend of the suggested costs of classifications, in which rejecting four samples was preferable to misclassifying one.

Different reject thresholds were tested by varying the classification costs (Table 2). When λr = 0.10, λm = 1 and λc = 0, the threshold was t = 0.10. As expected, the number of rejected samples increased because the cost of doing so decreased (i.e. we preferred to reject ten samples rather than classify one wrongly). However, the number of misclassified samples did not change, which means that the rule rejected samples that would have been correctly classified with t = 0.25. Thus, decreasing t to below 0.25 did not improve the model's classification performance for this dataset. On the other hand, when t was set to 0.35 (i.e. λr = 0.35, λm = 1, λc = 0), the results (not shown) were the same as those obtained for t = 0.25. Hence, t = 0.25 was considered optimal for this p-DPLS5 model.

The samples of the test set were also classified according to Eq. 11. For the p-DPLS5 model using t = 0.25, 100% of the samples were well classified and there were no rejects (Figure 5 and Table 1).
By setting the threshold to t = 0.10, two correctly classified healthy samples turned into rejects (Table 2). This was seen in the classification of the training samples above and highlights the need to set an adequate reject threshold in order to obtain an adequate trade-off between the rejects and the misclassifications. This will depend on the needs of the experimenter and the cost constraints in each particular application.

[Figure 5: predictions ŷ (horizontal axis) for each number of factors A = 1 to 6 (vertical axis).]

Figure 5. Classification of test samples for the different p-DPLS models with reject option. Squares: healthy samples (class ω0); green: correctly classified, blue: misclassified, red: rejected. Circles: tumour samples (class ω1); yellow: correctly classified, brown: misclassified, orange: rejected.

Table 2. Classification of validation samples via leave-one-out cross-validation and test samples for the different p-DPLS models for t = 0.10.
Cross-validation samples

                  Samples of class ω0                     Samples of class ω1
Factors  Cost     Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1        16.1     0            36        1                1            115       0
2        6.5      2            31        4                0            14        102
3        5.9      3            23        11               0            6         110
4        3.5      1            17        19               0            8         108
5        3.3      1            14        22               0            9         107
6        2.9      1            13        23               0            4         110

Test samples

                  Samples of class ω0                     Samples of class ω1
Factors           Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1                 0            9         0                0            56        0
2                 0            9         0                0            0         56
3                 0            7         2                0            0         56
4                 0            3         6                0            0         56
5                 0            2         7                0            0         56
6                 0            2         7                0            0         56

4.3.3 Breast Cancer dataset

This dataset demonstrates the rejection of test samples that are outside the class limits. The same methodology as for the Human Cancers dataset was used except that the Kennard-Stone algorithm was not applied. Instead, the samples of mutations BRCA1 and BRCA2 were used as a training set and the Sporadic mutation samples were used as a test set. The aim was to show that the classification rule could reject prediction samples from non-modelled classes. This would prevent the classification error that would otherwise occur if the classifier had to assign the samples to one of the two modelled classes. Detecting this type of outlier is fundamental to the application of any classification rule.

Probabilistic DPLS models were calculated for one to three factors by using log2 mean-centred gene expression data from BRCA1 (class ω0) and BRCA2 (class ω1) mutation samples. These data consisted of the 51 most relevant gene expressions according to [38]. These genes were found to be the most discriminative between the three mutations. The costs of classifying correctly, rejecting and misclassifying were set at λc = 0, λr = 0.25, and λm = 1 respectively. The one-factor p-DPLS model (p-DPLS1) was the optimal model with the lowest cost (i.e. cost of 0.5).
Models with two and three factors were overfitted, with costs of 2.5 and 3.25 respectively. These higher costs are due to the fact that most of the samples are rejected and, although there are no misclassifications, the classifiers become useless. For example, for the p-DPLS2 model, 10 of the 15 training samples were rejected during LOOCV. Similarly, the p-DPLS3 model rejected 13 of the training samples.

The p-DPLS1 model calculated with the 51 gene expressions selected from the literature was able to distinguish the samples of class ω0 from those of class ω1, thus providing well separated PDFs (Figure 6). Only the sample s1252_P2, of class ω0, and the sample s1816_P13, of class ω1, were rejected during LOOCV. The predictions of both samples were outside the limits of the classes (i.e. ŷ(s1252_P2) > HL0 and ŷ(s1816_P13) > HL1). Notice that because the PDFs did not overlap, there was no ambiguity region and the limits of the classes were defined by four operative limits. The classification performance of the p-DPLS1 model did not change when the reject threshold was varied to t = 0.35 and t = 0.10.

The p-DPLS1 model was used to classify the six test samples of sporadic mutation of breast cancer. This mutation was not modelled in the training step; hence, all these samples should be flagged as outliers and not classified. Classifying these samples in either of the two modelled classes would result in a classification error. Figure 6 shows the PDFs (Eqs. 4 and 5) of class ω0 and class ω1 for p-DPLS1 together with the predictions for the test samples. According to Eq. 11, all test samples were correctly detected as outliers and rejected since their predictions ŷi were between the limits HL0 (ŷ = 0.24) and LL1 (ŷ = 0.54). If the reject constraints had not been applied, the classifier would have assigned the test samples to the class with the highest a posteriori probability.
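To illustrate why the a posteriori probability alone cannot flag samples from non-modelled classes, the sketch below applies Bayes' theorem to the class-conditional densities reported in this section for sample s1572_P16 and then checks a prediction against the four operative limits of Figure 6. The a priori probabilities (7/15 and 8/15, the training proportions) and the prediction ŷ = 0.40 are assumptions chosen for illustration; the text only states that the test predictions lay between HL0 = 0.24 and LL1 = 0.54.

```python
def posteriors(lik0, lik1, prior0, prior1):
    """Bayes' theorem for two classes, from class-conditional densities."""
    evidence = lik0 * prior0 + lik1 * prior1
    return lik0 * prior0 / evidence, lik1 * prior1 / evidence


# Densities p(y_hat|omega_0) and p(y_hat|omega_1) reported for s1572_P16.
# ASSUMPTION: priors equal to the training proportions (7 BRCA1, 8 BRCA2).
post0, post1 = posteriors(6e-6, 2e-3, 7 / 15, 8 / 15)
print(round(post0, 4), round(post1, 4))  # one posterior is close to 1

# Operative limits of the p-DPLS1 model (Figure 6).
LL0, HL0, LL1, HL1 = -0.15, 0.24, 0.54, 1.50

# HYPOTHETICAL prediction in the gap between the classes (not a reported value).
y_hat = 0.40
inside_a_class = (LL0 <= y_hat <= HL0) or (LL1 <= y_hat <= HL1)
print("classified" if inside_a_class else "rejected as outlier")
```

Both densities are very small, yet their ratio still drives one posterior towards 1; only the distance reject, which compares ŷ with the class limits, identifies the sample as an outlier.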
In this case, the samples s1572_P16 and s1324_P17 would have been incorrectly classified into class ω1 (i.e. as BRCA2 mutation samples) and the remaining samples (s1649_P15, s1320_P18, s1542_P19 and s1281_P21) would have been incorrectly classified into class ω0 (as BRCA1 mutation samples). For these samples, the a posteriori probability for one class was near 1. For example, sample s1572_P16 had p(ŷi|ω0) = 6·10^-6 and p(ŷi|ω1) = 2·10^-3, which results in P(ω0|ŷi) ≈ 0 and P(ω1|ŷi) ≈ 1. Therefore, if it is believed that the a posteriori probability demonstrates the classification's reliability, then the high probability values obtained for the test samples would suggest that we can trust the classifications, despite the fact that all of them were incorrect. This shows that the classic Bayes rule is unreliable when both conditional probabilities p(ŷi|ω0) and p(ŷi|ω1) are low. Moreover, it should be noted that the predicted values are not as extreme as those expected for outliers. Hence, these samples are not directly suspicious because of their ŷi values. It was the reject option, which sets limits on the classes, that allowed these samples to be detected.

[Figure 6: class-conditional PDFs p(ŷ|ω0) and p(ŷ|ω1) against ŷ, with the limits LL0, HL0, LL1 and HL1.]

Figure 6. PDFs for the one-factor p-DPLS model. Limits: LL0 = -0.15, HL0 = 0.24, LL1 = 0.54 and HL1 = 1.50. Triangles represent the test samples classified with p-DPLS1.

4.4 Conclusions

Recently, the DPLS method has received much attention in the field of gene expression data analysis. We have applied a new version of DPLS, namely probabilistic DPLS (p-DPLS), to classify biological samples using their microRNA (miRNA) expression patterns and cDNA microarray data.
p-DPLS takes into account the uncertainty of the PLS predictions in the definition of the classification model. In this version, the possibility of rejection has been introduced. p-DPLS with reject option performs better than the original p-DPLS, because only those samples that have the highest probability of being correctly classified are indeed classified, whereas doubtful cases are rejected. The methodology involves evaluating the probability of each classification together with the overall cost of the classifications performed for each model. In addition, the reject option allows us to deal with situations in which the results of the Bayes rule may be questioned. Moreover, the classification rule with reject option can help the experimenter to check that a sample does not belong to any of the classes modelled in the training step and therefore to ensure that it is rejected rather than misclassified. Thus, the reject option enables the classifier to detect outliers, and this in turn provides a new approach for improving outlier detection methods in the near future.

Acknowledgements

The authors thank the Department of Universities, Research and the Information Society of the Catalan Government for providing Cristina Botella's doctoral fellowship, and the Spanish Ministry of Education and Science (project CTQ2007-66918/BQU). The authors would also like to acknowledge the useful comments of the referees.

References

[1] Alizadeh, A.A., et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 2000. 403: p. 503-511.
[2] Golub, T.R., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 1999. 286: p. 531-537.
[3] Li, L., et al., Gene Assessment and Sample Classification for Gene Expression Data Using a Genetic Algorithm/k-nearest Neighbor Method. Combinatorial Chemistry & High Throughput Screening, 2001. 4: p. 727-734.
[4] Brown, M.P.S., et al., Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 2000. 97: p. 262-267.
[5] Furey, T.S., et al., Support Vector Machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000. 16: p. 906-914.
[6] Nguyen, D.V. and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002. 18: p. 1216-1226.
[7] Gunther, E.C., et al., Prediction of drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences, 2003. 100: p. 9608-9613.
[8] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[9] Nguyen, D.V. and D.M. Rocke, Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 2002. 18: p. 39-50.
[10] Pérez-Enciso, M. and M. Tenenhaus, Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 2003. 112: p. 581-592.
[11] Modlich, O., et al., Predictors of primary breast cancers responsiveness to preoperative Epirubicin/Cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signature. Journal of Translational Medicine, 2005. 3: article 32.
[12] Man, M.Z., et al., Evaluation methods for classifying expression data. Journal of Biopharmaceutical Statistics, 2004. 14: p. 1065-1084.
[13] Bylesjö, M., et al., MASQOT: a method for cDNA microarray spot quality control. BMC Bioinformatics, 2005. 6: p. 250.
[14] Tax, D.M.J. and R.P.W. Duin, Growing a multi-class classifier with a reject option. Pattern Recognition Letters, 2008. 29: p. 1565-1570.
[15] Knauthe, B., et al., Visualization of quality parameters for classification of spectra in shooting crimes. Journal of Chemometrics, 2008. 22: p. 252-258.
[16] Fumera, G., F. Roli, and G. Giacinto, Multiple Reject Thresholds for Improving Classification Reliability, in Advances in Pattern Recognition, SSPR&SPR, Editor. 2000, Springer: Berlin - Heidelberg. p. 863-871.
[17] Devarakota, P.R.R., B. Mirbach, and B. Ottersten, Reliability estimation of a statistical classifier. Pattern Recognition Letters, 2008. 29: p. 243-253.
[18] Dubuisson, B. and M. Masson, A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 1993. 26: p. 155-165.
[19] Muzzolini, R., Y.-H. Yang, and R. Pierson, Classifier design with incomplete knowledge. Pattern Recognition, 1998. 31: p. 345-369.
[20] Ripley, B.D., Statistical ideas for selecting network architectures, in Neural Networks: Artificial Intelligence and Industrial Applications, B.K.a.S. Gielen, Editor. 1995, Springer. p. 183-190.
[21] Ripley, B.D., Pattern Recognition and Neural Networks. 2000, Cambridge, United Kingdom: Cambridge University Press.
[22] Tortorella, F., An optimal reject rule for binary classifiers. In: Ferri, F.J. et al. (Eds.), Advances in Pattern Recognition: Joint IAPR International Workshops, SSPR 2000 and SPR 2000, Lecture Notes in Computer Science, vol. 1876. Springer-Verlag, Heidelberg, 2000: p. 611-620.
[23] Fumera, G., I. Pillai, and F. Roli, Classification with Reject Option. Proceedings of the 12th International Conference on Image Analysis and Processing (ICIAP'03), 2003.
[24] Fumera, G. and F. Roli, Error Rejection in Linearly Combined Multiple Classifiers, in Proceedings of 2nd Int. Workshop on Multiple Classifier Systems (MCS 2001). 2001. Robinson College, Cambridge, UK.
[25] Fumera, G., F. Roli, and G. Giacinto, Reject option with multiple thresholds. Pattern Recognition, 2000. 33: p. 165-167.
[26] Cordella, L.P., et al., A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 1995. 6: p. 1140-1147.
[27] Landgrebe, T., et al., The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 2006. 27: p. 908-917.
[28] Landgrebe, T., et al., A combining strategy for ill-defined problems, in Fifteenth Ann. Sympos. of the Pattern Recognition Association of South Africa. 2004.
[29] Chow, C.K., On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970. 16: p. 41-46.
[30] Hanczar, B. and E.R. Dougherty, Classification with reject option in gene expression data. Bioinformatics, 2008. 24: p. 1889-1895.
[31] Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification (2nd edition), ed. Wiley-Interscience. 2001, New York.
[32] Pérez, N.F., J. Ferré, and R. Boqué, Calculation of the reliability of classification in Discriminant Partial Least-Squares Classification. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 122-128.
[33] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, K.a.N.L. Johnson, Editor. 1985, Wiley: New York. p. 581-591.
[34] Webb, A., Statistical Pattern Recognition, 2nd edition, ed. Wiley. 2002, Malvern, UK.
[35] Bradley, A.P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997. 30: p. 1145-1159.
[36] Li, M. and I.K. Sethi, Confidence-based classifier design. Pattern Recognition, 2006. 39: p. 1230-1240.
[37] Lu, J., et al., MicroRNA expression profiles classify human cancers. Nature Letters, 2005. 435: p. 834-838.
[38] Hedenfalk, I., et al., Gene Expression profiles in hereditary breast cancer. The New England Journal of Medicine, 2001. 344: p. 539-548.
[39] Zheng, Y. and C.K. Kwoh, Informative microRNA expression patterns for cancer classification. Data mining for biomedical applications, Proceedings, 2006. 3916: p. 143-154.
[40] Lin, J. and M. Li, Molecular profiling in the age of cancer genomics. Expert Review of Molecular Diagnostics, 2008. 8: p. 263-276.
[41] Boulesteix, A.-L., PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 2004. 3: article 33.
[42] Raza, M., et al., Comparative Study of Multivariate Classification Methods using Microarray Gene Expression Data for BRCA1/BRCA2 Cancer Tumors. Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), IEEE, 2005. 2: p. 475-480.
[43] Branden, K.V. and S. Verboven, Robust data imputation. Computational Biology and Chemistry, 2009. 33: p. 7-13.
[44] Pochet, N., et al., Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, 2004. 20: p. 3185-3195.
[45] Kennard, R.W. and L.A. Stone, Computer Aided Design of Experiments. Technometrics, 1969. 11: p. 137-148.
[46] Lu, Y. and J. Han, Cancer classification using gene expression data. Information Systems, 2003. 28: p. 243-268.
[47] Musumarra, G., et al., Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. Journal of Chemometrics, 2004. 18: p. 125-132.
CHAPTER 5

Outlier detection and ambiguity detection for microarray data in p-DPLS regression

Journal of Chemometrics 2010, Accepted

Microarray data are obtained after a complex series of experimental steps that go from hybridization to image analysis. Microarray manufacturing errors such as dye instability, different incorporation of the dyes, and slide, spatial and print-tip effects, together with scanning errors, may introduce unexpected data variability, which can make the data collected for one sample very different from the data from other samples of the same class. Additionally, the experimenter may be confronted with new samples that are not like any of the other samples that have been modelled (e.g., samples that do not belong to any of the modelled classes). All these samples are considered outliers, and can have a degrading impact on the calculated classification model (if they are training samples), can produce wrong evaluations of the classification performance of the model (if they are validation samples) and can lead to wrong classifications (if they are new samples to be classified).

Outlier detection is often overlooked in microarray data classification. However, it is essential that any classification method that is intended to have a real practical use be implemented together with appropriate outlier detection tools.
Basically, outliers can be detected because they have errors in the recorded data (x), because they have been identified erroneously (with erroneous y), because they have an abnormal x-y relation or because they belong to a different population than the samples we are trying to classify. In this work we develop outlier detection for the probabilistic discriminant partial least squares (p-DPLS) method by combining diagnostics based on leverage and x-residuals (common in PLS) with the reject option approach developed in Chapter 4.

The method was tested on two datasets: the prostate cancer dataset and the small round blue cell tumours of childhood dataset. Results showed that without outliers the p-DPLS classification models have better classification abilities, and that samples from classes not modelled during the training step are rejected, thus avoiding their misclassification.

The removal of outliers in the prostate cancer dataset reduced the Cost of classification per sample from 0.11 to 0.06, and the model increased the proportion of correct classifications of test samples from 95% to 100%. In the small round blue cell tumours of childhood dataset, the p-DPLS method with outlier detection implemented is able to correctly flag as outliers 95% of the samples in the prediction step. These samples did not belong to any of the classes modelled. When the outlier detection method was not implemented in the training step, only 5% of the test samples were flagged as outliers, and the remaining 95% were misclassified.

This work is presented in paper form, published in Journal of Chemometrics 2010 (Accepted).
Outlier detection and ambiguity detection for microarray data in probabilistic Discriminant Partial Least Squares Regression

C. Botella*, J. Ferré, R. Boqué

Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007 Tarragona, Spain

* Corresponding author: [email protected]

Journal of Chemometrics, 2010, Accepted. (Edited for format)

Abstract

The reject option plays an important role in the classification of microarray data. In this work, a reject option is implemented in the probabilistic discriminant partial least squares (p-DPLS) method in order to reject both outliers and ambiguous samples. Microarray data are highly susceptible to outliers because of the many steps involved in the experimental process. During the development of the classifier, outliers in the training data may strongly influence the model and degrade its performance. Some future samples to be classified may also be outliers that will most probably be misclassified. Ambiguous samples are samples that cannot be clearly assigned to any of the classes with high confidence. In this work, outlier detection and ambiguity detection are implemented taking into account the x-residuals, the leverage and the predicted ŷ. The method was applied to oligonucleotide microarray data and cDNA microarray data. For the first dataset (prostate cancer dataset), the outlier detection criteria allowed us to remove nine samples from the training set. The model without those samples had better classification ability, with a decrease in the classification Cost per sample from 0.10 to 0.07. The method was also used in a second dataset (small round blue cell tumours of childhood dataset) to detect prediction outliers, so that most of the outliers were rejected and misclassifications were reduced from 100% to 5%.
5.1 Introduction

Outlier detection plays a fundamental role in the development and application of multivariate classification methods for microarray data. Outliers are either samples, variables, or certain variables in certain samples that behave differently from the rest of the data. This paper focuses on sample outliers. Sample outliers may be training samples, validation samples, or future samples to be classified. The experimenter is interested in flagging them for different reasons. Outliers in the training set may have an excessive influence on the classification rule, unless robust methods of classification are used. Hence, it is interesting to know whether the classification rule is dominated by a few special samples, and to discover whether this influence can be adverse. Samples with large measurement errors or samples that belong to a different population than the samples we are trying to classify will degrade the rule. These "bad" outliers should be detected and removed, and the rule recalculated. Training outliers may also include "good" samples with unique information. These must be kept, since they will improve the model by expanding its application domain. Their detection will warn the experimenter that more samples of a similar type should be obtained in order to model that variability better. The study of the good outliers may also lead to the discovery of special variables (gene expressions) that may have a high discriminative power [1]. Outlier detection must also be applied when future samples are to be classified, which is the ultimate objective of the classification rule. Unknown samples that do not belong to any of the classes for which the classification rule was trained, or samples with large data errors, will be misclassified. The experimenter wants to be warned about these samples so that they can be rejected until more information is available.
In this sense, outlier detection increases the confidence the experimenter has in the classification protocol, since the samples that might be misclassified will hopefully be flagged. Finally, outlier detection must also be used to detect outliers in the validation set. Samples not representative of the future samples to be classified will likely produce an erroneous classification result that will worsen the estimated classification ability of the model. Hence, these samples should be detected, as is done for the unknown samples, and not considered when evaluating the performance of the model.

The particularities of microarray gene expression data and the many levels of variation introduced at the complex experimental stages, from hybridization to image analysis, make the use of outlier diagnostics necessary [2, 3]. First of all, the recorded microarray data depend on the biological variations of the population under study (intrinsic to all organisms and influenced by genetic or environmental factors). Technical variations introduced during the extraction, labelling or hybridization of samples, scanner settings and measurement errors associated with the reading of the fluorescent signals (which may be affected, for example, by dust on the array [4]) will also increase the data variability. Moreover, the large number of variables (gene expressions) compared to the relatively low number of objects makes the data analysis and the classification a nontrivial task. Fortunately, the combined use of data pre-processing and multivariate algorithms can extract the main systematic variation in the data and lead to satisfactory classification results.
For example, normalization methods, such as the lowess correction [5] or the total intensity normalization [6], can remove inconsistencies in the microarray data. However, not all the errors in the data can be mathematically removed, and outlier diagnostics are still needed in order to prevent misclassifications due to new unexpected data variations. Outlier detection is also needed to flag those samples from new unexpected classes (biological outliers) and those that present extreme biological variability.

Several methods have been used for detecting outliers in microarray data for particular classification rules. Paoli [7] improved the performance of Support Vector Machines (SVM) by selecting the optimal number of genes and treating the most relevant as outliers. Moffitt [8] constructed the SVM model by removing outliers via re-validation. Olsen [9] analysed the intensity scores of tissue microarrays of sarcoma phenotypes with Euclidean hierarchical cluster analysis, presenting as outliers those samples that did not cluster into any of the defined groups. The VizRank tool, which combines the k-nearest neighbours (k-NN) method with a range of visualizations, was also used to detect outliers [10]. Model et al. [11] pointed out that outliers in microarray data cannot always be detected visually and proposed a robust version of Principal Component Analysis (rPCA). Their objective was to exclude single outlier chips from the analysis and to detect systematic changes in experimental conditions as early as possible in order to facilitate a fast recalibration of the production process. Shieh [12] addressed the detection of outliers with highly different expression patterns in microarray data, also using PCA and a robust estimation of the Mahalanobis distance. Tomlins et al.
[13] proposed the cancer outlier profile analysis (COPA) method for detecting translocations from microarray data. For gene selection, genetic algorithms were proposed for outlier detection using a grid count tree [14]. Liu et al. studied different statistical methods to detect genes with differential expression across the different class samples [1], and Loo et al. used filter-based methods with the same objective [15]. In contrast, Tibshirani [16] and Wu [17] proposed alternative cancer outlier differential expression detection methods for detecting genes that, inside a disease group, exhibit unusually high gene expression in some but not all samples.

In this work, we develop outlier detection for discriminant partial least squares (DPLS). DPLS is one of the preferred methods for classification of microarray data [18]. In DPLS, the assigned class is decided from the predicted value ŷ when the measured microarray data are submitted to a PLS model. Hence, the outlier detection approaches that exist for PLS (already applied in multivariate calibration in chemical and industrial fields) can be applied. Pell used the plot of studentized residuals versus leverage to detect outliers in PLS, which was successful when either masking or swamping occurred [19]. Based on the work by Martens and Næs [21], Pell [20] also detected outliers from an F-ratio that compared the x-residuals of the validation samples to the x-residuals of the calibration samples. Chiang and Pell [22] presented the closest distance to center (CDC), a multiple outlier detection algorithm applied together with ellipsoidal multivariate trimming (MVT), taking into account only the x-data.
A methodology to detect prediction outliers in PLS was applied by projecting the new objects on the Sammon's mapping space containing the convex hull, which defines a boundary around each cluster and another around the whole calibration data [23]. Q and Hotelling's T² statistics [24] were also used to detect outliers in PLS, although the authors indicated that in some cases these indexes would not be enough.

Most of the approaches mentioned take into account only the x-response data to flag a sample as a potential outlier, since this is the only information available for unknown samples. Note also that the predicted value ŷ is rarely used to detect prediction outliers in PLS, since it is often difficult to set limits on the lowest and highest values of ŷ that can be accepted. Only those predictions that are really extreme can warn that the sample is an outlier. DPLS, however, has the particularity that the ŷ values (from which the class is decided) are located around the value that codifies the class (around 0 or 1 in the DPLS scheme used in this paper) and that probability density functions of the predictions can be established. This fact has been previously used to define a reject option for DPLS and microarray data [25]. The reject option made it possible to reject those samples that had extreme ŷ values or those with "normal" ŷ values but whose classification was ambiguous (i.e., samples that have a very similar probability of belonging to any of the modelled classes). In this paper, we provide a unified approach for outlier detection in DPLS for microarray data. This approach combines the new criterion based on the predicted value ŷ, particularly developed for DPLS, with the well-known diagnostics based on the leverage and the x-residuals commonly used in PLS.
5.2 Theory

5.2.1 Probabilistic Discriminant Partial Least Squares

DPLS is the application of PLS regression to classification problems. A DPLS model is calculated by regressing y, which codifies the class of the samples, on X using A latent variables (factors) [18, 26]. For microarray gene expression data, X is an N×P matrix of N samples and P gene expressions and y is an N×1 vector of ones and zeros, where 0 codifies that the sample belongs to class ω0 and 1 codifies that the sample belongs to class ω1. For an unknown sample with measured x-data, x_t, the value predicted by the DPLS model taking into account A factors is given by ŷ_t = x_t^T b, where b is the vector of regression coefficients and the pre-processing is implicit in the formula. With the mentioned coding, ŷ_t should ideally be close to zero if the sample belongs to class ω0 and close to one if the sample belongs to class ω1. The criterion for deciding the class from ŷ_t will influence the performance of the classification rule. The criterion used in this work is based on the probabilistic version of DPLS, p-DPLS [27]. The p-DPLS procedure starts by calculating a PLS model of A factors relating X and y. Then, the training samples are predicted with this model. For each training sample i, a potential function f(ŷ_i, SEP_i) is calculated with the shape of a Gaussian centred at the predicted value ŷ_i and with standard deviation equal to the standard error of prediction (SEP_i) of that sample. Next, the individual potential functions of all the samples of class ω0 are averaged to obtain the probability density function (PDF) that describes the predictions of class ω0 (Eq. 1):

$$p(\hat{y} \mid \omega_0) = \frac{1}{n_0} \sum_{i=1}^{n_0} f(\hat{y}_i, \mathrm{SEP}_i) \qquad (1)$$

where n0 is the number of samples of class ω0.
The PDF for class ω1 is calculated likewise, using n1, the number of samples of class ω1. A sample is classified by calculating its prediction ŷ and applying the Bayes theorem to the two PDFs, so that the sample is allocated to the class with the highest a posteriori probability. A consequence of the straight application of this rule is that a sample is always classified in one of the classes. So, samples from new unexpected classes will be misclassified, and those samples with either extremely low or extremely high values of ŷ (which may be outliers) will be assigned to one of the classes with a very large probability.

5.2.2 Reject option in p-DPLS

The purpose of the reject option in p-DPLS is to allow the classifier to reject a sample if it is likely to be misclassified. In other words, a class label is assigned only to those samples with the highest probability of being correctly classified. By not forcing the classifier to always decide for one of the two modelled classes, the misclassification rate of the model (measured as the number of misclassified samples with respect to the number of samples for which the classifier assigns a class) decreases, which gives the experimenter confidence in the outputs of the classification rule. The reject option in p-DPLS is implemented here for two main types of samples: outliers and ambiguous samples.
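The class PDFs of Section 5.2.1 (Eq. 1) and the Bayes allocation can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the training predictions and SEP values are made-up numbers, the function names are hypothetical, and equal prior probabilities are assumed (the source only refers to [25] for the details of the a posteriori probability). The last query value shows the behaviour criticised above: an extreme prediction is still assigned to one class with a posterior close to one.

```python
import numpy as np

def gaussian(y, mu, sd):
    """Gaussian potential function centred at mu with standard deviation sd."""
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def class_pdf(y, y_hat_train, sep_train):
    """Eq. 1: average of the potential functions of the training samples of one class."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mu = np.asarray(y_hat_train, dtype=float)
    sd = np.asarray(sep_train, dtype=float)
    return gaussian(y[:, None], mu[None, :], sd[None, :]).mean(axis=1)

def posteriors(y, class0, class1, prior0=0.5):
    """Bayes a posteriori probabilities of class 0 and class 1 given y.

    prior0 = 0.5 (equal priors) is an assumption made for this sketch.
    """
    p0 = prior0 * class_pdf(y, *class0)
    p1 = (1.0 - prior0) * class_pdf(y, *class1)
    total = p0 + p1
    return p0 / total, p1 / total

# Hypothetical training predictions and SEP values (illustration only).
yhat0, sep0 = np.array([-0.10, 0.05, 0.12]), np.array([0.15, 0.15, 0.15])
yhat1, sep1 = np.array([0.90, 1.00, 1.08]), np.array([0.15, 0.15, 0.15])

# A typical class-0 prediction, a borderline prediction and an extreme one.
post0, post1 = posteriors([0.1, 0.5, 5.0], (yhat0, sep0), (yhat1, sep1))
print(np.round(post0, 3), np.round(post1, 3))
```

With these numbers the borderline prediction (0.5) receives similar posteriors for both classes, which is the situation the ambiguity rejection addresses, while the extreme prediction (5.0) would be allocated to class 1 almost with certainty unless limits on the prediction are imposed.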
5.2.2.1 Rejection of outliers

Outliers are samples whose x-data have different features than the bulk of the training samples. Several reasons for this behaviour are (a) the sample belongs to a class that was not modelled, (b) the sample belongs to one of the modelled classes but the x-data have gross errors or contain unmodelled interferences, and (c) the sample belongs to one of the modelled classes but has correct extreme values of some variables. Samples in situation (a) should be detected and rejected, otherwise they will be wrongly classified in one of the two modelled classes. Samples in situations (b) and (c) will not necessarily be misclassified, but this uncommon behaviour will likely affect the classification result and hence we might prefer to reject the samples and ask for extended analysis instead of running the risk of misclassifying them. Outliers (hence, candidates for rejection) in p-DPLS are flagged based on the following four criteria: limits on the ŷ, leverage, ratio of residual variances and classification error.

a. Limits on the ŷ

In p-DPLS, the predictions ŷ of the training samples are used to calculate a distribution of predictions for each class (see Figure 1). These distributions are ideally centred on 0 and 1, the reference values used at the training stage. Uncommon x-data will produce ŷ values at the extremes of the PDF of a class. Hence, limits for ŷ are set around the majority of the ŷ of the training data. These limits define regions in the ŷ axis in which the sample is either classified in one class, in the other class, or rejected [25]. The limits are defined such that the area in the tails of each distribution is five percent of the total area of the distribution (i.e. 2.5% in each tail of the PDF of each class). These limits depend on the PDFs.
Hence, they are different for p-DPLS models with a different number of factors. In practice, when the PDFs overlap (Figure 1) there are two limits, a High Limit (HL) and a Low Limit (LL), and when the PDFs are separated there are four limits (a HL and a LL for each class) (Figure 4b). A sample with a ŷ predicted outside the limits will be flagged as an outlier. If the PDFs do not overlap, a sample with ŷ between HL0 and LL1 will be flagged as an inlier. This criterion improves on the direct application of the Bayes rule in the sense that, at the extremes of the PDFs, the a posteriori probability for one class is high, and hence the Bayes rule would assign the sample to that class with a high probability. By imposing the limits, the sample will now be rejected. Note also that the limits on the ŷ values will not account for all the outlier situations in p-DPLS, since they will not detect those outliers whose unusual x-data make ŷ fall inside a classification region, e.g. when the ŷ of a sample of class ω0 falls within the classification region of class ω1. These samples might be detected by the criteria described next.

b. Leverage

The leverage of sample t for a DPLS model calculated with mean-centred x-data is given by [28]:

$$h_t = \frac{1}{N} + \mathbf{t}_t^{T} (\mathbf{T}^{T}\mathbf{T})^{-1} \mathbf{t}_t \qquad (2)$$

where t_t denotes the score vector of the sample and T is the scores matrix of the mean-centred training data. The leverage measures the distance from the sample to the centre (mean) of the training set taking into account the correlation in the data. A low value of h_t indicates that the sample is similar to the average of the training samples. A high leverage indicates that the sample has an unusual x-vector (or score vector) relative to the training samples, so it is an x-outlier. In that case, the experimenter should be suspicious of the reliability of the classification and wait for additional studies.
Although no strict rules exist, it is common to declare as a high-leverage sample one with h_t > 3h̄, where h̄ is the average leverage value for the training samples (h̄ = 1/N + A/N) [29, 30].

c. Ratio of residual variances

In DPLS, there is a vector of x-residuals for each sample and number of factors A used in the model. The residuals are the difference between the measured x-data and the data predicted by the model with A factors. While the leverage refers to the position of the sample in the subspace of the factors used for regression, the residual refers to the orthogonal subspace, i.e., the factors not used for regression. Residuals that are much larger than most of the residuals of the training samples indicate that the sample is poorly described by the model for that number of factors and, hence, that it is an x-outlier. It must be pointed out, however, that large x-residuals do not necessarily imply a wrong ŷ, and, hence, a wrong classification. Actually, one of the advantages of factor-based methods such as PLS is that the factors retained in the model should account for the relevant variability in the x-data, while the remaining factors not used in the model should account for the irrelevant variability (the x-residuals). Hence, a large x-residual simply indicates that some part of the measured x-data is not modelled. However, there is a large chance that the source of these unmodelled data also contributed to the model space and influenced the ŷ.
These outliers are detected by comparing the unmodelled parts of the test sample to the unmodelled parts of the training samples using the A-factor p-DPLS model [31] with the ratio of residual variances:

$$V = \frac{s_t^2}{s_T^2} \qquad (3)$$

where s_t² is the residual variance for the test sample:

$$s_t^2 = \frac{\sum_{j=1}^{P} (x_{tj} - \hat{x}_{tj})^2}{P - A} \qquad (4)$$

and s_T² is the total variance for the training samples [21]:

$$s_T^2 = \frac{\sum_{i=1}^{N} \sum_{j=1}^{P} (x_{ij} - \hat{x}_{ij})^2}{NP - P - A \max(N, P)} \qquad (5)$$

An object with V > 3 is considered to be an outlier. A similar criterion was used in [31] to detect outliers in PLS. Note that the usual comparison of V with a tabulated F-value is not useful. The very large number of degrees of freedom involved [32] makes the tabulated F-value low, so most of the samples would be flagged as outliers, which is meaningless.

[Figure 1: six panels (a to f). Panels a, c and e plot p(ŷ|ωc) against ŷ, with the limits LL0 and HL1 and the ambiguity region marked. Panels b, d and f plot V against leverage.]

Figure 1. Probability density functions (PDFs) for the p-DPLS model with two factors obtained during leave-one-out cross-validation and influence plots for the training samples when a sample is used as a test. a-b. PDFs and influence plot when sample N43_normal is left out. c-d. PDFs and influence plot when sample T11_tumour is left out. e-f. PDFs and influence plot when sample N41_normal is left out. In a, c and e, the triangle (▲) identifies the prediction of the left-out sample. In b, d and f, the triangle (▲) identifies the left-out sample as compared to the rest of the training data. The vertical and the horizontal dotted lines indicate the limits for outlier detection.
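Criteria (a) to (c) can be sketched as three small functions. This is only an illustration under stated assumptions: the scores and x-residuals are random stand-ins rather than the output of a fitted PLS model, all function names are hypothetical, and the tail limits are obtained by numerical integration of the averaged-Gaussian PDF on a grid, which is one possible way of locating the 2.5% tails and not necessarily the one used by the authors.

```python
import numpy as np

def tail_limits(y_hat, sep, tail=0.025, n_grid=20001):
    """Low and high limits leaving `tail` of the class PDF area in each tail."""
    mu, sd = np.asarray(y_hat, dtype=float), np.asarray(sep, dtype=float)
    grid = np.linspace((mu - 6 * sd).min(), (mu + 6 * sd).max(), n_grid)
    kernels = np.exp(-0.5 * ((grid[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    cdf = np.cumsum(kernels.mean(axis=1))   # numerical CDF of the averaged PDF
    cdf /= cdf[-1]                          # normalise so that it ends at 1
    return np.interp(tail, cdf, grid), np.interp(1.0 - tail, cdf, grid)

def leverage(t_new, T):
    """Eq. 2: h_t = 1/N + t_t^T (T^T T)^-1 t_t for scores of mean-centred data."""
    n = T.shape[0]
    return 1.0 / n + t_new @ np.linalg.solve(T.T @ T, t_new)

def residual_variance_ratio(e_new, E_train, n_factors):
    """Eqs. 3 to 5: ratio of the sample residual variance to the training one."""
    n, p = E_train.shape
    s_t2 = np.sum(e_new ** 2) / (p - n_factors)
    s_T2 = np.sum(E_train ** 2) / (n * p - p - n_factors * max(n, p))
    return s_t2 / s_T2

# Hypothetical data (illustration only): N = 50 samples, P = 150 genes, A = 2 factors.
rng = np.random.default_rng(0)
N, P, A = 50, 150, 2
T = rng.normal(size=(N, A))
T -= T.mean(axis=0)                         # scores of mean-centred data have zero mean
E_train = rng.normal(size=(N, P))           # stand-in for the training x-residuals

h_bar = np.mean([leverage(t, T) for t in T])    # equals 1/N + A/N
h_far = leverage(np.array([4.0, 4.0]), T)       # sample far from the centre of the scores
low, high = tail_limits([0.0], [1.0])           # single Gaussian: about -1.96 and 1.96
V_typical = residual_variance_ratio(rng.normal(size=P), E_train, A)
V_large = residual_variance_ratio(3.0 * rng.normal(size=P), E_train, A)

print(h_far > 3 * h_bar, V_typical > 3, V_large > 3)
```

The average training leverage computed this way reproduces the value 1/N + A/N quoted in the text, because the trace of the projection onto the A score columns equals A.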
d. Classification error

Classification error is an easy-to-use outlier diagnostic during the training stage of a classification rule. When the x-y relation of a sample does not agree with the x-y relation described by the model, the sample is misclassified, and this is used to flag the sample as an outlier. This is the equivalent of a large prediction error in regression models. Unlike criteria (a) to (c), the classification error can only be used to detect outliers in the training and validation sets because it requires the true class to be known. Although it cannot be applied to new samples, the criterion is still very helpful for refining the classification model.

5.2.2.2 Rejection of ambiguous samples

Ambiguous samples are samples that share characteristics of both class ω0 and class ω1 because the measured x-variables are not discriminative enough for the algorithm used. When these samples are predicted by the DPLS model, their ŷ values are in the boundary between classes (ambiguity region, Figure 1), so the Bayesian probabilities of belonging to each of the classes, P(ωc|ŷ_t), are similar. Even small variations in the measured x-data can make the classifier assign the sample to either one class or the other. This increases the uncertainty of the classification result, so it may be preferable to reject that sample. This rejection is defined by the rule:

$$\text{if } \max\big(P(\omega_c \mid \hat{y}_t)\big) < (1 - t), \quad c = 0, 1 \qquad (6)$$

so that the sample is rejected if the a posteriori probability of belonging to any of the classes is lower than a reject threshold (1 − t). Note that the threshold can be set to reject any slightly doubtful sample. This improves the error rate of the classifier, since fewer samples will be misclassified, but, in turn, more samples will be rejected that otherwise could be correctly classified, which reduces the usefulness of the classifier.
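The ambiguity rule of Eq. 6 amounts to a one-line test on the two a posteriori probabilities. The sketch below assumes that the posteriors have already been computed; the function name and the value t = 0.05 are hypothetical choices for illustration, since the source does not state the threshold used.

```python
def classify_with_reject(post0, post1, t=0.05):
    """Return 0 or 1, or None when the sample is rejected as ambiguous (Eq. 6).

    post0, post1: a posteriori probabilities of class 0 and class 1.
    t: hypothetical value; the reject threshold is 1 - t.
    """
    if max(post0, post1) < 1.0 - t:
        return None                      # neither class is probable enough
    return 0 if post0 >= post1 else 1

print(classify_with_reject(0.97, 0.03))  # clear class-0 sample
print(classify_with_reject(0.60, 0.40))  # ambiguous sample, rejected
print(classify_with_reject(0.02, 0.98))  # clear class-1 sample
```

A smaller t rejects more samples and lowers the error rate among the classified ones, which is the tradeoff discussed in the text.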
Chow [33] derived an optimum rejection scheme that gives a tradeoff between reject rate and error rate. This rule was recently described for p-DPLS [25].

5.3 Results

5.3.1 Data

The prostate cancer dataset [34] consists of 50 non-tumour samples (class ω0) and 52 tumour samples (class ω1) with 12600 gene expressions (variables). From these gene expressions, the 150 with the highest variance weight [35] were selected to prevent irrelevant genes from interfering with the discrimination power of the relevant genes [36]. The dataset was divided into a training set (82 samples, 42 of class ω0 and 40 of class ω1) and a test set (20 samples, 8 of class ω0 and 12 of class ω1) using the Kennard-Stone algorithm [37]. This dataset is used to show the ability of the methodology to detect outliers in the training set and to show that the final classification model and the prediction of the test set improve when these outliers are deleted.

The small round blue cell tumours of childhood dataset [38] includes 2308 gene expressions of 12 samples of neuroblastoma (NB), 8 samples of non-Hodgkin lymphoma (BL), 23 samples of the Ewing family of tumours (EWS) and 20 samples of rhabdomyosarcoma (RMS). EWS samples (class ω0) and RMS samples (class ω1) were used for training and the remaining NB and BL samples as test samples. This dataset was used to show how the proposed method can reject new samples that do not belong to any of the modelled classes. Since the test samples do not belong to any of the modelled classes, they would be misclassified unless the reject option is implemented.
5.3.2 Prostate dataset

Briefly, the procedure was as follows. First, the p-DPLS model was calculated for a given number of factors using mean-centred gene expressions of the training samples. Then, the training samples were predicted and the predictions, ŷ, were used to calculate kernel Gaussians, which, in turn, defined a PDF for each class (Eq. 1). From the PDFs, the reject option limits for ŷ were set. The leverage and x-residuals of the training samples were also calculated. An unknown sample with measured x_t was first predicted (ŷ_t = x_t^T b), and the x-residuals, the leverage and the probability of classification for each modelled class (evaluated as the Bayes a posteriori probability detailed in [25]) were calculated. The sample was then either classified or, if it was flagged as an outlier (Section 5.2.2.1) or as ambiguous (Section 5.2.2.2, Eq. 6), rejected. Before classifying unknown samples, the optimal model was selected by leave-one-out cross-validation (LOOCV). In LOOCV, a sample is left out and the model is calculated using the remaining samples. Each left-out sample was treated as an unknown sample and was either classified or rejected as described above. Of the rejected samples, outliers were removed from the training set and the model was recalculated; ambiguous samples, however, were kept in the model since they introduced relevant variability.

The performance of the model was evaluated with the classification cost per sample:

$$\mathrm{Cost} = (\lambda_r N_r + \lambda_m N_m) / N \qquad (7)$$

where N_r is the number of samples rejected, N_m is the number of samples misclassified, λ_r and λ_m are the costs of rejecting a sample and of misclassifying it, respectively, and N is the total number of samples used to validate the model. The cost criterion, calculated during LOOCV, was used to compare the p-DPLS models with a different number of factors and to select the optimal model. Note that λ_r and λ_m may be fine-tuned to meet the requirements of the classification problem.
Since for this dataset there is no reference in the literature about the associated costs of rejecting or misclassifying a sample, we used λ_r = 0.25 and λ_m = 1, which indicates that we prefer to reject four samples rather than misclassify one. In this case, λ_r values lower than 0.25 did not improve the classification performance of the model.

For this dataset, preliminary p-DPLS models using 1 to 4 factors were calculated. Taking into account the cost per sample calculated by LOOCV, the optimal model had two factors. Samples N43_normal and N25_normal (both of class ω0) were flagged as outliers because their predictions were outside the accepted region for ŷ in their corresponding cross-validation segments. The prediction of sample N43_normal (Figure 1a-b) was ŷ = −0.59, lower than LL0 = −0.56, while sample N25_normal had ŷ = −0.66, lower than the limit LL0 = −0.50 established for its p-DPLS model (note that the limits HL and LL vary for each cross-validation segment since the p-DPLS model is calculated with different samples). These extreme predictions suggested the possibility of an unusual x-vector. This was later confirmed because the leverage of these samples exceeded three times the average leverage of the training set: sample N43_normal had h = 0.13 while h̄ = 0.024, and sample N25_normal had h = 0.23 while h̄ = 0.037. The reason for the high leverage is that five genes, those with Accession Numbers 36785_at, 221_s_at, 774_g_at, 31449_at and 38411_at, had higher intensities than in the rest of the samples of class ω0. These five differentially expressed variables (genes) were not considered relevant in this case, since the different intensities were only present in a few samples and so did not seem to respond to a differential characteristic of one class.

In addition to samples N43_normal and N25_normal, the leverage criterion also flagged sample N33_normal as an outlier (h = 0.12, while h̄ = 0.025), even though this sample did not have an unusual ŷ.
Six additional samples (N04_normal, T02_tumour, T05_tumour, T11_tumour, T15_tumour and T25_tumour) were rejected for having high x-residuals (V > 3). In these six samples, most of the gene expressions had higher intensities than the mean of the intensities of the training samples, so the samples were not well modelled by the factors of the DPLS model. The T11_tumour sample (class ω1) (Figure 1c-d), for example, was rejected because V = 9.45. Its prediction ŷ = 0.47 was closer to the predictions for class ω0 than to the predictions of class ω1, so the sample would have been classified wrongly (i.e., as non-tumour) if it had not been rejected. Notice that the prediction for this sample is not an extreme value, so the sample would not have been labelled as suspicious based only on the prediction.

In addition to the previous samples flagged as outliers, five samples were wrongly classified (Table 1). In these samples, the x-y relation did not agree with the trend modelled by the p-DPLS model. The reason for the wrong classification is that the intensities of the samples of class ω0 (non-tumour) are lower than those of class ω1 (tumour) for the majority of the samples of this dataset. The sample N38_normal (class ω0), however, had intensities in some of the variables higher than expected, more similar to the intensities of tumour samples (class ω1) than to the intensities of the samples of its true class (Figure 2a). For this reason, the sample was misclassified. The opposite happened with the misclassified samples of class ω1 (T39_tumour, T21_tumour, T49_tumour and T34_tumour). Some intensities were lower than most of the intensities of class ω0 (Figure 2b). This situation may result from either an incorrect codification of the samples (mislabelling), experimental problems (e.g.
bad intensity acquisition) or because these samples were truly different from the rest of the samples of their class (which would indicate that more representative samples of this type should be collected before they are included in the model).

[Figure 2: two panels (a, b) plotting intensities against variable number.]

Figure 2. a. Intensities of sample N38_normal of class ω0 (grey) and mean of intensities of class ω0 (black). b. Intensities of sample T21_tumour of class ω1 (grey) and mean of intensities of class ω1 (black).

During the cross-validation process, four additional samples were rejected because they were ambiguous. These samples did not have extreme values, so they were not likely to influence the model excessively and they were kept in the training set. However, since in the LOOCV process these samples acted as test samples, they were counted as rejects when calculating the performance of the classifier. An example is shown in Figure 1e and 1f. The figure shows the acceptance and reject regions for the cross-validation model when sample N41_normal is left out. Because its ŷ was in the ambiguity zone, this sample would have been rejected if it had been an unknown sample.

It should also be noted that the outlier detection process could suffer from the masking effect, whereby the presence of several outliers could hide the presence of some other outlier. Despite this, extreme samples could still be detected and the model was recalculated without those samples. The optimal model was again the p-DPLS model with 2 factors, with a decrease of the cost of classification per sample from 0.11 to 0.06 (Figure 3).
[Figure 3: cost per sample plotted against number of factors (1 to 4).]

Figure 3. Cost per sample for the training samples with λ_r = 0.25 and λ_m = 1, for the p-DPLS model with all the samples and for the p-DPLS models after removing outliers.

Table 1 shows the classification results for the models calculated with the original dataset and with the dataset after removing the rejected training samples. The LOOCV and test set classifications are first presented for the initial dataset using the p-DPLS model with two factors (columns 2 and 3) without reject option (i.e., there are no rejected samples). Columns 4 and 5 show the classifications when the reject option is enabled. Note that one false negative and one false positive of the classical model become rejects. In turn, five true negatives and six true positives also become rejects. This is because the high certainty required in the classification results causes the samples with uncertain classification to be rejected. Columns 6 to 9 show the results after the outliers in the training set had been removed. Comparing the classical p-DPLS models with and without outliers (columns 2-3 and 6-7), it is seen that the model without outliers misclassifies one sample fewer. This improvement is more notable when rejection is allowed (columns 4-5 versus 8-9). In this case, the LOOCV error rate (calculated as the number of samples misclassified divided by the number of samples classified) for the model with outliers is 5/69 = 0.07, higher than the error rate for the model without outliers (2/56 = 0.04). The reduction of misclassified samples is also observed in the test set. Columns 8 and 9 show the results of the model without outliers with reject option.
This outlier-free model predicts better than the models calculated with all the training samples without reject option. This optimal model misclassifies only two samples and also has fewer rejections, so the classification cost per sample is lower (Figure 3). The prediction of the test set is also better. The two misclassifications of p-DPLS calculated with the initial dataset are now rejections (based on the ambiguity rejection rule, Eq. 6). Compared with the p-DPLS model with reject option calculated with the initial dataset, the number of misclassifications and of rejections decreased, so the classification cost per sample decreases from 0.10 to 0.07. Hence, the removal of the outliers of the training set improved the p-DPLS model in the sense that both the training samples (via LOOCV) and the test samples were classified better.

Table 1. Prostate cancer dataset. Classification of validation and test samples for the p-DPLS model with two factors calculated with the initial training samples and after removing outliers.

      Initial dataset                         Dataset after removing outliers from the training set
      p-DPLS          p-DPLS with reject      p-DPLS          p-DPLS with reject
                      option                                  option
      LOOCV   test    LOOCV   test            LOOCV   test    LOOCV   test
FN    2       2       1       1               6       2       2       0
FP    5       0       4       0               0       0       0       0
TN    42      5       37      5               38      5       33      5
TP    33      13      27      13              24      13      21      13
RN    0       0       6       0               0       0       5       0
RP    0       0       7       1               0       0       7       2

** False Negative (FN): samples of class ω0 classified in class ω1. False Positive (FP): samples of class ω1 classified as class ω0. Reject Negative (RN): samples of class ω1 rejected. Reject Positive (RP): samples of class ω0 rejected. True Negative (TN): samples of class ω1 correctly classified. True Positive (TP): samples of class ω0 correctly classified.
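As a worked check of Eq. 7, the cost per sample can be recomputed from the LOOCV counts of Table 1 for the two models with reject option. The function name is hypothetical, and the total number of validated samples is taken here as the sum of each LOOCV column, which is an assumption of this sketch; the costs 0.25 and 1 are those stated in the text.

```python
def cost_per_sample(n_rejected, n_misclassified, n_total,
                    cost_reject=0.25, cost_misclassify=1.0):
    """Eq. 7: (cost_reject * N_r + cost_misclassify * N_m) / N."""
    return (cost_reject * n_rejected + cost_misclassify * n_misclassified) / n_total

# LOOCV counts with reject option, initial dataset: RN + RP = 6 + 7, FN + FP = 1 + 4.
n_initial = 1 + 4 + 37 + 27 + 6 + 7        # column sum: 82 samples
initial = cost_per_sample(6 + 7, 1 + 4, n_initial)

# LOOCV counts with reject option, after removing outliers: RN + RP = 5 + 7, FN + FP = 2 + 0.
n_cleaned = 2 + 0 + 33 + 21 + 5 + 7        # column sum: 68 samples
cleaned = cost_per_sample(5 + 7, 2 + 0, n_cleaned)

print(round(initial, 2), round(cleaned, 2))  # 0.1 0.07
```

Under these assumptions the two values round to 0.10 and 0.07, in agreement with the decrease in cost per sample reported for the prostate dataset.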
5.3.3 Small round blue cells tumour dataset

The same strategy as for the prostate cancer dataset was followed. In this case the p-DPLS models were calculated using the 96 most significant gene expressions according to reference [38]. Preliminary p-DPLS models were calculated with 1 to 3 factors using mean-centred gene expression data and then validated by LOOCV. The optimal model, with the lowest classification cost per sample, was the one-factor model. For this model, four training samples were detected as outliers. Three of them had large x-residuals, with values of s_t^2/s_T^2 of 4.98 (sample EWS_T13), 6.11 (sample RMS_T7) and 8.43 (sample RMS_T11), larger than the cut-off value of 3. Moreover, the prediction of sample RMS_T11 was ŷ=1.91, higher than the class limit HL1=1.35. The fourth outlier, sample EWS_T12, had a prediction ŷ=0.18, lower than the limit LL0=0.072. After deleting these four samples, the p-DPLS model was recalculated and used to predict the test samples. Without reject option, all the test samples would have been incorrectly classified by the model. With the reject option, 19 out of the 20 test samples were flagged as outliers by the ŷ limits because they were inliers. The other sample had its prediction ŷ in the acceptance region and hence it was classified, but erroneously. The classification performance would have been worse if the training outliers had not been removed from the p-DPLS model. Without excluding the training outliers, the PDFs of the model varied, and hence so did the ŷ limits for rejection (Figure 4). In that case (i.e., the p-DPLS model calculated with all the training samples) only 13 of the 20 test samples were rejected and the remaining 7 were considered valid by the model and classified either in class Z0 or in class Z1 (hence, wrongly classified). Note that the test samples have intermediate values of the x-variables between the two modelled classes EWS and RMS.
Since the samples are close to the centre of the multivariate space, their predictions were around 0.5, in the middle of the PDFs of the two modelled classes. In this case, none of the test samples could have been rejected either by the leverage criterion (the maximum leverage was h=0.01, while three times the mean leverage, 3h̄, was 0.15 for this model) or by the ratio of variances (all had V<3). This shows the complementary information that the ŷ limits, the leverage and the ratio of variances offer.

[Figure 4: two panels (a, b) plotting p(ŷ|Zc) against ŷ, with the limits LL0, HL0, LL1 and HL1 marked; panel b also marks the reject, Z0 and Z1 zones.]

Figure 4. Small round blue cells tumour dataset. PDFs of the p-DPLS model with one factor: a. with all the training samples, b. without the training outliers. Note how the PDFs (and hence the ŷ limits and the rejection and acceptance zones) change when the outliers in the training set are removed.

5.4 Conclusions

Classification rules for microarray data require appropriate rejection diagnostics. The several steps involved in the generation and measurement of microarray data, which may introduce important errors in the data, as well as the possibility of submitting samples from a non-modelled class to the classifier, make it necessary to use diagnostics to prevent misclassifications.
Rejection diagnostics act both in the training stage of the rule, by identifying those outliers that can degrade the performance of the rule, and in the prediction of new incoming samples, by identifying those samples that will likely be misclassified. Within this approach, the classification model is not forced to classify every future sample that arrives. This work extends the previous work on reject option for p-DPLS, which was based only on the predicted ŷ and has been shown to be not always sufficient to detect outliers. Both training and prediction outliers were now detected by taking into account the x-residuals, the leverage and the predicted ŷ. The possibility of using x-residuals is an advantage of classification methods based on latent variables such as p-DPLS. The deletion of the training outliers from the training set improved the classification model. At the prediction stage, samples were rejected either because they were outliers or because they were ambiguous.

Acknowledgements

The authors thank the Departament d'Universitats, Recerca i Societat de la Informació de Catalunya for providing Cristina Botella's doctoral fellowship, and the Spanish Ministerio de Educación y Ciencia for its support (project CTQ2007-66918/BQU).

References

[1] Liu, F. and B. Wu, Multi-group cancer outlier differential gene expression detection. Computational Biology and Chemistry, 2007. 31: p. 65-71.
[2] Li, C. and W.H. Wong, Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 2001. 98: p. 31-36.
[3] Gottardo, R., et al., Quality Control and Robust Estimation for cDNA Microarrays With Replicates.
Journal of the American Statistical Association, 2006. 101: p. 30-40.
[4] Churchill, G.A., Fundamentals of experimental design for cDNA microarrays. Nature Genetics, 2002. 32: p. 490-495.
[5] Cleveland, W.S., Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 1979. 74: p. 829-836.
[6] Yang, Y.H., et al., Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 2002. 30: p. e15.
[7] Paoli, S., et al., Integrating gene expression profiling and clinical data. International Journal of Approximate Reasoning, 2008. 47: p. 58-69.
[8] Moffitt, R., et al., Effect of Outlier Removal on Gene Marker Selection Using Support Vector Machines. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, 2005. 1: p. 917-920.
[9] Olsen, S.H., D.G. Thomas, and D.R. Lucas, Cluster analysis of immunohistochemical profiles in synovial sarcoma, malignant peripheral nerve sheath tumor, and Ewing sarcoma. Modern Pathology, 2006. 19: p. 659-668.
[10] Mramor, M., et al., Visualization-based cancer microarray data classification analysis. Bioinformatics, 2007. 23: p. 2147-2154.
[11] Model, F., et al., Statistical process control for large scale microarray experiments. Bioinformatics, 2002. 18: p. S155-S163.
[12] Shieh, A.D. and Y.S. Hung, Detecting Outlier Samples in Microarray Data. Statistical Applications in Genetics and Molecular Biology, 2009. 8: article 13.
[13] Tomlins, S.A., et al., Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 2005. 310: p. 644-648.
[14] Bandyopadhyay, S. and S. Santra, A genetic approach for efficient outlier detection in projected space. Pattern Recognition, 2008. 41: p. 1338-1349.
[15] Loo, L.-H., et al., New Criteria for Selecting Differentially Expressed Genes. IEEE Engineering in Medicine and Biology Magazine, 2007. 26: p. 17-26.
[16] Tibshirani, R. and T. Hastie, Outlier sums for differential gene expression analysis. Biostatistics, 2007. 8: p. 2-8.
[17] Wu, B., Cancer outlier differential gene expression detection. Biostatistics, 2007. 8: p. 566-575.
[18] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[19] Pell, R.J., Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometrics and Intelligent Laboratory Systems, 2000. 52: p. 87-104.
[20] Pell, R.J., L.S. Ramos, and R. Manne, The model space in partial least squares regression. Journal of Chemometrics, 2007. 21: p. 165-172.
[21] Martens, H. and T. Naes, Multivariate Calibration. 1989, New York: John Wiley & Sons.
[22] Chiang, L.H., R.J. Pell, and M.B. Seasholtz, Exploring process data with the use of robust outlier detection algorithms. Journal of Process Control, 2003. 13: p. 437-449.
[23] Pierna, J.A.F., et al., A methodology to detect outliers/inliers in prediction with PLS. Chemometrics and Intelligent Laboratory Systems, 2003. 68: p. 17-28.
[24] Lleti, R., et al., Outliers in partial least squares regression: Application to calibration of wine grade with mean infrared data. Analytica Chimica Acta, 2005. 544: p. 60-70.
[25] Botella, C., J. Ferré, and R. Boqué, Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009. 80: p. 321-328.
[26] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, K.a.N.L. Johnson, Editor. 1985, Wiley: New York. p. 581-591.
[27] Pérez, N.F., J. Ferré, and R. Boqué, Calculation of the reliability of classification in Discriminant Partial Least-Squares Classification. Journal of Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 122-128.
[28] Faber, N.K.M. and R. Bro, Standard error of prediction for multiway PLS: 1. Background and a simulation study. Chemometrics and Intelligent Laboratory Systems, 2002. 61: p. 133-149.
[29] Faber, N.K.M., Estimating the uncertainty in estimates of root mean square error of prediction: application to determining the size of an adequate test set in multivariate calibration. Chemometrics and Intelligent Laboratory Systems, 1999. 49: p. 79-89.
[30] Faber, N.K.M., A closer look at the bias-variance trade-off in multivariate calibration. Journal of Chemometrics, 1999. 13: p. 185-192.
[31] Fernández-Pierna, J.A., et al., Methods for outlier detection in prediction. Chemometrics and Intelligent Laboratory Systems, 2002. 63: p. 27-39.
[32] Maesschalck, R.D., et al., Decision criteria for soft independent modelling of class analogy applied to near infrared data. Chemometrics and Intelligent Laboratory Systems, 1999. 47: p. 65-77.
[33] Chow, C.K., On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970. 16: p. 41-46.
[34] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[35] Sharaf, M.A., D.L. Illman, and B.R. Kowalski, Chemometrics. 1986: Wiley-IEEE.
[36] Lu, Y. and J. Han, Cancer classification using gene expression data. Information Systems, 2003. 28: p. 243-268.
[37] Kennard, R.W. and L.A. Stone, Computer Aided Design of Experiments. Technometrics, 1969. 11: p. 137-148.
[38] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
CHAPTER 6

Gene selection based on selectivity ratio for probabilistic discriminant partial least squares

Submitted April 2010

Microarray data are often used to determine whether a cell or a tissue is healthy or tumorous, or whether it belongs to a subtype of a certain tumour. The quality of these classifications depends on the discriminating ability of the multivariate classification model. This ability decreases if irrelevant genes are included in the training data. Hence, gene selection plays a key role in the analysis of microarray data. In fact, gene selection accomplishes several purposes: 1) the identification of genes that are biologically relevant for the development of a certain disease, 2) the discovery of coexpressed genes in order to build metabolic pathways and 3) the reduction of the dimensionality of the data in order to make data analysis easier.

Many gene selection methods have been developed. Some are based on biological inferences and some have been developed from other types of data. In some cases, gene selection is based on criteria that can be valid for different types of classification models, such as using genetic algorithms to select the genes that minimize the prediction error of a certain classifier [1]. Different classification strategies can be plugged into this selection scheme, as long as the model takes in the selected genes and gives out a prediction error that characterizes the selected subset of genes.
Others, such as selecting the genes that are most correlated with the class label [2] or selecting genes based on statistical tests [3-4], ignore how the classification algorithm processes the data, so they may not favour the same systematic variations in the data that the algorithm will.

Since the basis of this thesis has been the application of DPLS, we sought a gene selection method that could enhance the characteristics that the DPLS algorithm uses from the data. Hence, in this work, we implement the selectivity ratio (SR) index in order to choose the most relevant subset of genes for classification with p-DPLS models. The selectivity ratio specifically evaluates the most relevant variables in PLS models. For each variable, this index is the ratio of the explained variance to the residual variance. The best genes are those with a high explained variance and a low residual variance. Hence, the genes with the highest SR are selected as significant and the remaining genes are discarded from the analysis.

This paper also discusses another important aspect related to gene selection, namely the influence that the split of the dataset into training and test sets has on the subset of selected genes and on the evaluation of the classifier performance. It is common practice to check the goodness of a gene selection algorithm by classifying a test set. For that purpose, the initial dataset is split into a training set and a test set, either randomly or using an algorithm such as the Kennard-Stone algorithm. Then, based on the training set, a subset or several subsets of genes are selected, and the classification model is calculated using only these genes. Next, the test set is classified. The subset of genes with the highest classification ability is taken as the best selection. However, these selected genes may be relevant only to discriminate the samples of this particular training set, and the test accuracy may be overoptimistic since the genes were selected based on the accuracy of classification of this particular test set.
In this chapter it is shown that the split of the data into training and test subsets influences the accuracy of the classification. Certain splits can lead to 100% of the test samples being classified correctly, while other splits classify correctly only 80% of the test set, thus giving a false indication of the true ability of the gene selection algorithm to select the best genes. In this work, many random splits into training and test sets have been used to define the final accuracy of the classification models.

These aspects are discussed and implemented for two datasets, the prostate cancer dataset and the non-small cell lung cancer dataset. For the prostate cancer dataset, the mean of the classification accuracies (by cross-validation) increased from 85% (all 5966 genes used) to 94% when only 17 selected genes were used. Equivalently, the mean of the accuracies for the test samples increased from 84% to 94%. For the non-small cell lung cancer dataset, the model calculated with only 17 of the 54675 original genes provided a cross-validation classification accuracy of 93%.

This work was submitted in April 2010.

[1] Tang, E.K., P. Suganthan, and X. Yao, Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics, 2006. 7: article 95.
[2] Mao, K.Z. and W. Tang, Correlation-Based Relevancy and Redundancy Measures for Efficient Gene Selection. Pattern Recognition in Bioinformatics, 2007. 4774: p. 230-241.
[3] Dai, J.J., L. Lieu, and D. Rocke, Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 2006. 5: article 6.
[4] Huang, X., et al., Borrowing information from relevant microarray studies for sample classification using weighted partial least squares. Computational Biology and Chemistry, 2005. 29: p. 204-211.
Gene selection in microarray data based on selectivity ratio index

C. Botella, J. Ferré*, R. Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007 Tarragona, Spain
* Corresponding author: [email protected]
Submitted April 2010. (Edited for format)

ABSTRACT

Most of the gene expressions measured in a microarray experiment are irrelevant for the final application of the data. Irrelevant genes may confound the classification models and decrease their performance. In this work, a gene selection method based on the selectivity ratio index is used. This index is specific to the DPLS method and has been used to select the best genes that discriminate between healthy and tumour prostate cancer tissues and that discriminate between different subtypes of non-small cell lung cancer. It is also shown that the split of the dataset into training and test sets influences both the genes selected and the estimated accuracy of the classification model. A wrong assessment of the accuracy of the model may lead either to rejecting a good subset of genes or to accepting a suboptimal subset. To overcome this influence, a repetitive strategy including data split, gene selection, validation and prediction is performed.
For the prostate dataset, models calculated with only 17 selected genes were able to classify the samples with accuracies around 94%, better than models calculated with all the gene expressions (5966), whose accuracies varied between 50 and 100% depending on the data split. For the non-small cell lung cancer dataset, the models calculated with the genes selected following the selectivity ratio index had better classification abilities, independently of the split of the data (accuracies from 94 to 98% for leave-one-out cross-validation), than the models calculated with all the genes.

6.1 Introduction

DNA microarrays simultaneously provide gene expressions for thousands of genes. Usually, only a few of the measurements describe informative genes, either overexpressed or underexpressed, while the rest describe unspecific variations or noise. Discovering the co-expressed genes is interesting in order to build metabolic pathways, to know the biological relevance of genes for clinical diagnosis and also to enhance the performance of classification algorithms [1]. Classification of cells and tissues according to their gene expression profiles is one of the main uses of microarray data. Multivariate classification is adversely affected by irrelevant genes, which interfere with the discriminative power of the relevant genes. Hence, gene selection is needed to enhance the accuracy of the classifiers, and it is especially relevant when the biochemical importance of the selected genes will be sought.

In recent years, many methods have been developed to identify the most relevant genes for certain types of diagnoses.
Three major groups of methods have been described: filters, wrappers and embedded techniques [2]. Some methods have been based on genetic algorithms [3], random forests [4], weights of support vector machines [5] and statistical tests such as the t-test or the Wilcoxon test [6], to cite a few.

DPLS is one of the most used classification methods for gene expression data [7]. Its most important feature is that it uses linear combinations of the original variables, which enables dimensionality reduction, noise filtering and outlier detection. Although DPLS does not necessarily require variable selection, it is preferable to input only the relevant variables and to discard those that can distort the calculated factor space. Several approaches for gene selection in PLS have been described. Tan et al. [8] selected genes using the sum of squared correlation coefficients between the gene expressions and the response variables. Czekaj and Walczak [1] used the stability of regression coefficients, and Li Shen [9], following the work of Guyon et al. [5], selected the genes with a high absolute value of the regression coefficient using a recursive feature elimination system. Petterson [10], based on the approach of Trygg et al. [11], used the first weight vector of a PLS model with one factor to estimate the importance of a gene for describing the dependent variable. Other criteria often used to select genes are the Variable Importance on Projection (VIP) [12], which is based on the weights of the DPLS model, and t- or F-statistics [13,14].
Since each classification method enhances particular features of the data, gene selection based on general criteria (e.g., selecting the genes that are most correlated with the class label) does not always provide optimal solutions. Recently, Rajalahti et al. [15] used the selectivity ratio (SR) index to discover the relevant variables in a mass spectral profile, detecting peptides in the low molecular mass range without problems of false biomarker candidates. The advantage of this index is that it can be calculated specifically for DPLS, so that the variables flagged as relevant also have the largest discriminative power for this type of classification model.

In this work we show the use of the selectivity ratio index to choose the most relevant genes when the classification is carried out using DPLS and microarray gene expression data. It is shown that the initial split of the dataset into a training and a test set may significantly influence the estimated classification performance of the classifiers, and hence the conclusion about the goodness of the selection criterion and of the selected subset of genes. An approach based on repetitive data split, gene selection, training of the classifier and validation is used in order to better estimate the ability of the selected genes to provide a good classifier.

6.2 Methods

6.2.1 Probabilistic discriminant partial least squares (p-DPLS)

Probabilistic Discriminant Partial Least Squares (p-DPLS) is a new version of Discriminant Partial Least Squares (DPLS) regression [16]. Briefly, p-DPLS starts by calculating a PLS model of A factors relating an N×P gene expression matrix (X) and an N×1 vector of ones and zeros that codifies the samples' class (y). Next, the training samples are predicted with this model. For each training sample, a potential function is calculated as a Gaussian centred at the predicted value ŷ and with standard deviation equal to the standard error of prediction (SEP) of that sample.
Next, the potential functions of the samples of the same class are averaged to obtain the probability density function (PDF) of class Z0 and of class Z1. A test sample is classified by calculating the a posteriori probability of each class, based on the prediction ŷ of the sample. The performance of DPLS depends on the relevance of the input genes. Below, the selectivity ratio index is introduced as a method for gene selection.

6.2.2 Selectivity ratio index

The selectivity ratio (SR) index [15] is based on the target rotation approach of Kvalheim and Karstang [17]. It is defined as the ratio of the explained variance ($v_{ex,p}$) to the residual variance ($v_{res,p}$) of a variable:

$$SR_p = \frac{v_{ex,p}}{v_{res,p}} \qquad (1)$$

A target projection model is calculated as

$$\mathbf{X} = \mathbf{t}_{TP}\,\mathbf{p}_{TP}^{T} + \mathbf{E}_{TP} = \mathbf{X}_{TP} + \mathbf{E}_{TP} \qquad (2)$$

where $\mathbf{t}_{TP}$ (N×1) are the target-projected scores and $\mathbf{p}_{TP}$ (P×1) are the target-projected loadings. These are obtained as

$$\mathbf{t}_{TP} = \mathbf{X}\,\mathbf{b}_{PLS} / \lVert \mathbf{b}_{PLS} \rVert \qquad (3)$$

$$\mathbf{p}_{TP}^{T} = \mathbf{t}_{TP}^{T}\,\mathbf{X} / (\mathbf{t}_{TP}^{T}\,\mathbf{t}_{TP}) \qquad (4)$$

where $\mathbf{b}_{PLS}$ (P×1) are the regression coefficients of the DPLS model calculated for A factors. From Eq. (2), the explained variance for variable p, $v_{ex,p}$, is calculated from the pth column of $\mathbf{X}_{TP}$ and the residual variance for variable p, $v_{res,p}$, is calculated from the pth column of $\mathbf{E}_{TP}$.

The genes with the highest SR are the ones that best define the relevant variations in the data.

6.2.3 Effect of data split on performance evaluation

Commonly, gene selection starts by splitting the dataset into a training set and a test set [18-20], either randomly or using a sample selection algorithm such as the Kennard and Stone algorithm [21].
Then, genes are selected so as to optimize a criterion calculated from the training set, and the goodness of the selected genes, and hence of the selection criterion, is checked either by cross-validation [22,23] or by predicting a test set [18-20,24]. Other debatable approaches, such as selecting the genes that best classify a test set, have also been used [25]. The limitation of the single-split approach is that a selection algorithm or a set of selected genes may be discarded because an unfortunate split of the dataset leads to low classification accuracies for the test set. Or, the other way round, a suboptimal set of genes can be accepted if the classification accuracy for that particular test set is high.

In order to overcome this situation, gene selection is done in this work from one thousand different training subsets selected randomly. For each training set, a DPLS model is calculated and the selectivity ratio index SR of each gene is evaluated. After the one thousand iterations, the mean of the SRs of each gene is calculated and the genes with the largest mean SR are selected. The usefulness of the selected genes is then checked by calculating the classification accuracy of five hundred new DPLS models calculated using the selected genes after randomly selecting the training and test sets again.

6.3 Results

6.3.1 Datasets

The prostate dataset [26] consists of 50 non-tumour samples (class Z0) and 52 tumour samples (class Z1) with 12600 gene expressions analysed for each sample. This dataset has been previously studied in gene selection studies and used to evaluate the performance of classification methods [4,27,28], to cite a few.

The non-small cell lung cancer (NSCLC) dataset [29] consists of 58 samples of the two major histological subtypes of lung cancer, 40 from adenocarcinoma (class Z0) and 18 from squamous cell carcinoma (class Z1), with 54675 gene expressions analysed for each sample.
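The selectivity ratio of Eqs. (1)-(4) and the repeated-split averaging of Section 6.2.3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the regression vector is obtained with a plain PLS1 NIPALS routine written here for the purpose, variances are taken as column sums of squares of mean-centred data, and all function names, the synthetic data and the reduced number of splits (50 instead of 1000) are hypothetical choices.

```python
import numpy as np


def pls1_coefficients(X, y, n_factors):
    """PLS1 regression vector (NIPALS) for mean-centred X (N x P) and y (N,)."""
    Xr, yr = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)            # weight vector
        t = Xr @ w                        # scores
        tt = t @ t
        p = Xr.T @ t / tt                 # x-loadings
        c = (yr @ t) / tt                 # y-loading
        Xr -= np.outer(t, p)              # deflate X
        yr -= c * t                       # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


def selectivity_ratio(X, b):
    """SR of every column of mean-centred X for regression vector b."""
    t_tp = X @ b / np.linalg.norm(b)          # Eq. (3)
    p_tp = X.T @ t_tp / (t_tp @ t_tp)         # Eq. (4)
    X_tp = np.outer(t_tp, p_tp)               # explained part, Eq. (2)
    E_tp = X - X_tp                           # residual part, Eq. (2)
    return np.sum(X_tp**2, axis=0) / np.sum(E_tp**2, axis=0)   # Eq. (1)


def mean_sr_over_splits(X, y, n_factors=2, n_splits=1000, train_frac=0.5, seed=0):
    """Average SR over random training subsets holding train_frac of each class."""
    rng = np.random.default_rng(seed)
    sr_sum = np.zeros(X.shape[1])
    for _ in range(n_splits):
        train = np.concatenate([
            rng.choice(idx, size=int(round(train_frac * idx.size)), replace=False)
            for idx in (np.flatnonzero(y == 0), np.flatnonzero(y == 1))
        ])
        Xc = X[train] - X[train].mean(axis=0)  # mean-centre on the training set
        yc = y[train] - y[train].mean()
        sr_sum += selectivity_ratio(Xc, pls1_coefficients(Xc, yc, n_factors))
    return sr_sum / n_splits


# Synthetic check: 40 samples, 30 "genes", only the first three carry class information.
rng = np.random.default_rng(1)
y_demo = np.repeat([0.0, 1.0], 20)
X_demo = rng.normal(size=(40, 30))
X_demo[:, :3] += 4.0 * y_demo[:, None]
mean_sr = mean_sr_over_splits(X_demo, y_demo, n_factors=2, n_splits=50)
top3 = sorted(np.argsort(mean_sr)[-3:].tolist())
print(top3)
```

Because the target projection is a rank-one model, the SR of a variable depends only on how strongly its column correlates with the target-projected scores, which is why averaging over many splits dampens the chance correlations that a single split can produce.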
6.3.2 Discussion

6.3.2.1 Prostate cancer dataset

The dataset was pre-processed as in [26]. The floor value was set at 10 and the ceiling value at 16000, and the genes with (Imaxp − Iminp) < 50 and (Imaxp/Iminp) < 5 were removed, where Imaxp and Iminp are the maximum and minimum intensities of the gene, respectively. The intensities of the 5966 remaining genes were then log2 transformed.

This dataset was randomly split into a training set and a test set with the only constraint that the training set should contain 50% of the samples of each class from the initial dataset. Then two-factor DPLS models were calculated with mean-centred data and the SR index was calculated for each gene. The number of factors was initially determined as the one with the lowest root mean square error of cross-validation using all the genes. It was later checked that a different reasonable number of factors of the DPLS model did not affect the genes that were selected as relevant after the one thousand repetitions. The procedure was repeated one thousand times and the average SR index of each gene was calculated. The 10, 17 and 35 genes with the highest average SR for these models were selected as potentially relevant (Table 1). Figure 1 shows the mean SR for the fifty genes with the highest index. After the first 17 selected genes, the remaining genes have similar SR. Hence, the discriminative power of the rest of the genes is not relevant enough to justify their inclusion in the model. Nevertheless, the best 10, 17 and 35 genes were selected in order to compare them with previous selection results using the random probabilistic model building genetic algorithm (RPMBGA) criterion [25].
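The pre-processing and the class-balanced random split described at the start of this subsection can be sketched as follows. This is an illustrative reading, not the original code: the variation filter applies the "and" of the text literally (a gene is removed only when both conditions hold), whereas other studies may remove a gene when either condition holds. The function names and the toy intensity matrix are hypothetical.

```python
import numpy as np


def preprocess_intensities(raw, floor=10.0, ceil=16000.0, min_diff=50.0, min_ratio=5.0):
    """Threshold, filter low-variation genes and log2-transform (samples x genes)."""
    clipped = np.clip(raw, floor, ceil)                 # floor and ceiling values
    i_max = clipped.max(axis=0)
    i_min = clipped.min(axis=0)
    # Literal reading of the text: remove genes meeting BOTH conditions
    low_variation = ((i_max - i_min) < min_diff) & ((i_max / i_min) < min_ratio)
    keep = ~low_variation
    return np.log2(clipped[:, keep]), keep


def stratified_half_split(y, rng):
    """Random split with 50% of the samples of each class in the training set."""
    train = np.concatenate([
        rng.permutation(np.flatnonzero(y == c))[: np.sum(y == c) // 2]
        for c in np.unique(y)
    ])
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test


# Toy intensities: 3 samples x 4 genes (hypothetical values)
raw = np.array([[5.0, 100.0, 10.0, 10.0],
                [20000.0, 120.0, 40.0, 55.0],
                [100.0, 130.0, 45.0, 30.0]])
logged, keep = preprocess_intensities(raw)

# Class sizes as in the prostate dataset: 50 non-tumour and 52 tumour samples
y = np.array([0] * 50 + [1] * 52)
train, test = stratified_half_split(y, np.random.default_rng(0))
```

Returning the boolean mask alongside the transformed matrix keeps track of which original probes survive the filter, which is needed later to report gene identifiers.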
[Figure 1: plot of the mean SR over the 1000 iterations (y-axis) against variable (x-axis, 0 to 50).]

Figure 1. Mean of SR over the 1000 iterations for the fifty genes with the highest SR.

Table 1. The 35 most relevant genes according to the SR index calculated from 1000 p-DPLS models.

Id of genes selected (10 genes, 17 genes, 35 genes):
37639_at, 1767_s_at, 39756_at, 33137_at, 32598_at, 36601_at, 769_s_at, 32076_at, 40282_s_at, 37720_at, 36491_at, 1521_at, 41468_at, 575_s_at, 38410_at, 35742_at, 38406_f_at, 39315_at, 38087_s_at, 32206_at, 41288_at, 34840_at, 40024_at, 1740_g_at, 38634_at, 31444_s_at, 38051_at, 34407_at, 32243_g_at, 33904_at, 33198_at, 33362_at, 1513_at, 37366_at, 40856_at

* Note that to avoid redundancy, the 17 genes are the 10 in the first column plus the 7 in the second column, and analogously the 35 genes are the 10 in the first column plus the 7 in the second and the 18 in the third and fourth columns.

The ability of the selected genes to discriminate between tumour and non-tumour samples was evaluated for models calculated using only the 10, 17 and 35 relevant genes [24]. In order to make the results less dependent on the data split, the classification performance was calculated for five hundred p-DPLS models. These models were calculated from randomly generated training and test sets with 50% of the samples of each class in each set. For a better comparison, the samples in the training set and test set in each repetition are the same for the models calculated with 10, 17 and 35 genes. The histograms in Figure 2 summarize the validation accuracies of the five hundred models calculated with 10, 17 or 35 genes. For each model (a selected subset of training samples and genes) the leave-one-out cross-validation (LOOCV) accuracy and the test set accuracy were evaluated.
If the subset of genes is adequate, one would expect both accuracies to be high and similar, independently of the samples used to calculate the model.

Figure 2a shows that the models calculated with 10 genes had LOOCV accuracies from 85% to 100% depending on the data split. Test set accuracies also ranged from 85% to 100%. Most of the models had a LOOCV accuracy of 96% and a test set accuracy of 92%. These high values of both accuracies indicate that the subset of genes accounted for the main differences between non-tumour and tumour prostate cancer samples. The fact that the histogram is sharp indicates that the high accuracy was maintained for most of the models and was largely independent of the split of the samples into training and test sets. Note also that a single unfortunate split can lead to low values of both LOOCV accuracy (88%) and test accuracy (90%), which could lead to rejecting the selected subset of genes in favour of previously reported subsets because it did not improve the performance. Also note that some data splits can lead to models with a large difference between the LOOCV classification accuracy and the test set classification accuracy (e.g. LOOCV accuracy of 88% and test set accuracy of 100%). These results highlight the relevance that the data split may have when determining the usefulness of a selected subset of genes or of a given classification rule.

Similar remarks can be drawn from Figures 2b and 2c for models calculated with the optimal 17 and 35 genes following the SR criterion. For 17 genes, the most frequent LOOCV accuracy is 94%, with a test accuracy of 92% (Figure 2b). For the subset of 35 genes, most of the models have high LOOCV and test accuracies of 94% (Figure 2c).

The models calculated with genes selected with the selectivity ratio index were compared with models calculated from genes reported in the literature.
Figure 2d shows the accuracies of the five hundred models calculated using the optimal genes reported in reference [25]. Note that although the subsets of genes were chosen with a different criterion (RPMBGA) and for a different classifier (support vector machines), they can also give DPLS models with high accuracies. However, the histograms are not as sharp as in Figure 2 (a-c), so the quality of the models depends much more on the split of the data into training and test sets than when the genes are selected with the SR index. Reference [25] reported test set accuracies of 98% calculated for one single dataset split. Note that for DPLS those genes can give accuracies as high as 100% for certain dataset splits, but most of the models have around 92% accuracy. This suggests that these genes perform worse with p-DPLS than the subset selected with the SR index.

For the subsets of 17 and 35 genes, the accuracies varied from 85% to 100% (Figures 2e-2f). Note that in that case the accuracies obtained depended even more on the training and test sets into which the dataset was split, and the histograms were flatter.

When using the raw dataset without gene selection (5966 genes), the validation accuracies range from 50% to 100% for different data splits (Figure 3). The mean LOOCV accuracy was 85% and the mean test accuracy was 84%. The lower accuracies as compared to using subsets of selected genes can be attributed to the fact that the models are taking into account false correlations. Given the large number of genes, some uninteresting genes may become correlated with the class label for a certain data split, so that the model will assign a high modelling importance to those genes. The test set, which does not show the same correlation pattern, is then classified with a high error.
The almost flat histogram suggests that the accuracies change often depending on the split into training and test sets and hence that using all the genes does not provide models that systematically perform well.

Figure 2. Prostate dataset training (LOOCV) and test accuracy frequencies (per unit) for the five hundred p-DPLS models calculated with 10 (a, d), 17 (b, e) and 35 (c, f) genes chosen with the SR criterion (a-c) and by RPMBGA (d-f).

Figure 3. Prostate cancer dataset. Training and test accuracy frequencies (per unit) for the five hundred p-DPLS models calculated with all the genes remaining after preprocessing (5966 genes).

6.3.2.2 Non-small cell lung cancer dataset

The non-small cell lung cancer dataset consists of 54675 gene expressions from 58 samples of adenocarcinoma (AC) and squamous cell carcinoma (SCC). Following the procedure described for the prostate dataset, one thousand random training and test subsets were generated and the SR index for each gene was calculated for each of the models to discriminate between AC and SCC samples. The 17 and 30 genes with the highest average SR index over the one thousand models were selected as relevant (Table 2). These numbers of genes were chosen in order to compare the results with previously reported results [29].

Table 2. The 30 most relevant genes according to the SR index calculated from 1000 p-DPLS models.
Id of genes selected (columns: 17 genes, 30 genes):
206032_at, 204455_at, 1559606_at, 1555501_s_at, 206033_at, 217528_at, 205595_at, 219507_at, 211194_s_at, 206164_at, 217272_s_at, 228806_at, 216918_s_at, 225822_at, 221796_at, 206156_at, 244107_at, 226832_at, 235075_at, 207382_at, 221795_at, 214680_at, 57703_at, 222892_s_at, 204136_at, 206266_s_at, 230464_at, 203097_s_at, 206165_s_at, 201818_at

* The 30 genes are the 17 in the first two columns plus the 13 in the third and fourth columns.

The selected genes were used to calculate five hundred p-DPLS models using random training and test sets. These models were also compared with the models calculated with the genes selected in a previous work [29].

Figure 4 summarizes the validation accuracies of the five hundred models obtained by LOOCV and by predicting the test set for subsets of 17 and 30 genes. The accuracies for LOOCV and for test data ranged from 85% to 100%. Note that most of the models with the 17 genes having maximal SR have LOOCV and test accuracies from 94% to 98% (Figure 4a). This is even more notable when the 30 genes are used (Figure 4b), for which the number of models with test accuracies outside this range is insignificant. In contrast, for the 17 and 30 genes selected in [29] the p-DPLS models have varying accuracies, from 88% to 98%, without dominant training and test accuracy values (Figures 4c-4d). Again, this points out the importance that the data split has on the evaluated accuracies. Note also that the means of the test accuracies obtained by the models calculated with the genes selected following the SR criterion are slightly better than those obtained with the genes selected in [29] (from 92% to 93% for the 17-gene subset and from 92% to 93% for the 30-gene subset).

Figure 4. Training and test accuracy frequencies (per unit) for the five hundred p-DPLS models calculated with 17 (a, c) and 30 (b, d) genes selected by the SR criterion (a-b) or in the reference work (c-d).
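The repeated-split validation used for both datasets can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data are synthetic, the helper names (`stratified_half_split`, `nearest_mean_accuracy`) are hypothetical, and a nearest-class-mean rule stands in for p-DPLS so that the block depends on NumPy only. It shows how drawing five hundred random splits with 50% of each class in each set yields a distribution of test accuracies rather than a single value.

```python
import numpy as np

def stratified_half_split(labels, rng):
    """Return (train_idx, test_idx) with about 50% of each class in each set."""
    labels = np.asarray(labels)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return np.sort(np.array(train)), np.sort(np.array(test))

def nearest_mean_accuracy(X, y, train, test):
    """Test accuracy of a nearest-class-mean rule (assumed stand-in for p-DPLS)."""
    classes = np.unique(y[train])
    means = np.stack([X[train][y[train] == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance of every test sample to every class mean.
    d = ((X[test][:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    pred = classes[d.argmin(axis=1)]
    return float((pred == y[test]).mean())

rng = np.random.default_rng(0)
# Synthetic two-class data (assumption): 58 samples, 17 "genes",
# with the second class shifted by one unit in every gene.
y = np.array([0] * 30 + [1] * 28)
X = rng.normal(size=(58, 17)) + y[:, None]

# Five hundred random stratified splits, one test accuracy per split.
accuracies = np.array([
    nearest_mean_accuracy(X, y, *stratified_half_split(y, rng))
    for _ in range(500)
])
print(f"mean={accuracies.mean():.3f}  "
      f"min={accuracies.min():.3f}  max={accuracies.max():.3f}")
```

A histogram of `accuracies` plays the role of one panel of the accuracy figures: a sharp histogram indicates that performance hardly depends on the split, a flat one that it does.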
6.4 Conclusions

The selectivity ratio index has been used to select the best subset of discriminant genes for microarray data classification with p-DPLS. The methodology reduces the influence of the samples selected as training samples on the final classification accuracies, and the genes selected give models with very similar classification abilities independent of the data split. We have also shown that the accuracies of the models may depend to a large extent on the particular samples in the training set and that using a single test set to validate the gene subset may result in either too optimistic or too pessimistic conclusions.

Acknowledgements

The authors thank the support of the Departament d’Universitats, Recerca i Societat de la Informació de Catalunya for providing Cristina Botella’s doctoral fellowship, and of the Spanish Ministerio de Educación y Ciencia (project CTQ2007-66918/BQU).

References

[1] Czekaj, T., W. Wu, and B. Walczak, Classification of genomic data: Some aspects of feature selection. Talanta, 2008. 76: p. 564-574.
[2] Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23: p. 2507-2517.
[3] Tang, E.K., P. Suganthan, and X. Yao, Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics, 2006. 7: article 95.
[4] Díaz-Uriarte, R. and S.A.d. Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7: article 3.
[5] Guyon, I., et al., Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002. 46: p. 389-422.
[6] Troyanskaya, O.G., et al., Nonparametric methods for identifying differentially expressed genes in microarrays. Bioinformatics, 2002. 18: p. 1454-1461.
[7] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[8] Tan, Y., et al., Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Research, 2005. 33: p. 56-65.
[9] Shen, L., PLS and SVD based penalized logistic regression for cancer classification using microarray data. Proceedings of the 3rd Asia-Pacific Bioinformatics conference, 2005: p. 219-228.
[10] Pettersson, F. and A. Berglund, Interpretation and validation of PLS models for microarray data. Chemometrics and Chemoinformatics ACS Symposium series, 2005. 894: p. 31-40.
[11] Trygg, J., O2-PLS for qualitative and quantitative analysis in multivariate calibration. Journal of Chemometrics, 2002. 16: p. 283-293.
[12] Musumarra, G., et al., Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. Journal of Chemometrics, 2004. 18: p. 125-132.
[13] Dai, J.J., L. Lieu, and D. Rocke, Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 2006. 5: article 6.
[14] Huang, X., et al., Borrowing information from relevant microarray studies for sample classification using weighted partial least squares. Computational Biology and Chemistry, 2005. 29: p. 204-211.
[15] Rajalahti, T., et al., Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 35-48.
[16] Botella, C., J. Ferré, and R. Boqué, Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009. 80: p. 321-328.
[17] Kvalheim, O.M. and T.V. Karstang, Interpretation of latent-variable regression models. Chemometrics and Intelligent Laboratory Systems, 1989. 7: p. 39-51.
[18] Horng, J.-T., et al., An expert system to classify microarray gene expression data using gene selection by decision tree. Expert Systems with Applications, 2009. 36: p. 9072-9081.
[19] Yoon, Y., et al., Direct integration of microarrays for selecting informative genes and phenotype classification. Information Science, 2008. 178: p. 88-105.
[20] Li, L., et al., Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 2001. 17: p. 1131-1142.
[21] Kennard, R.W. and L.A. Stone, Computer Aided Design of Experiments. Technometrics, 1969. 11: p. 137-148.
[22] Hossain, A., et al., A flexible approximate likelihood ratio test for detecting differential expression in microarray data. Computational Statistics & Data Analysis, 2009. 53: p. 3685-3695.
[23] Li, G.-Z., et al., Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis. BMC Genomics, 2008. 9: p. S24-S38.
[24] Paul, T.K. and H. Iba, Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009. 6: p. 353-367.
[25] Paul, T.K. and H. Iba, Gene selection for classification of cancers using probabilistic model building genetic algorithm. BioSystems, 2005. 82: p. 208-225.
[26] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[27] Dettling, M., BagBoosting for tumour classification with gene expression data. Bioinformatics, 2004. 20: p. 3583-3593.
[28] Jeffery, I.B., D.G. Higgins, and A.C. Culhane, Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics, 2006. 7: p. 359-375.
[29] Kuner, R., et al., Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer, 2009. 63: p. 32-38.

CHAPTER 7
Multi-class classification of microarray gene expression data
Submitted May 2010

Microarray gene expression data were initially used for binary differentiation, e.g., to classify a sample or a cell as healthy or tumour. Commonly, however, diseases with a genetic origin have more than two subtypes, so the problem of classifying a sample from gene expression data is more often than not a multi-class classification problem.

Although some classification algorithms can easily handle many classes (e.g., k-nearest neighbours classification), others (e.g. some versions of DPLS) are designed to deal with two classes only. In order to be able to use the powerful binary classifiers available for multi-class classification, new strategies have to be devised. One of these strategies is to perform binary classifications between pairs of classes, and then combine the results to obtain the final class label. This one-versus-one strategy is often better than modelling one class against all the others (the one-versus-all strategy). The reason is that in the one-versus-all strategy, different subtypes of samples are grouped into the same class, which must be differentiated from the target class.
In contrast, the one-versus-one strategy allows the model to focus on the genes that actually differentiate one particular class from another particular class.

A difficulty in the one-versus-one strategy is that a new sample will be submitted to all the binary models that make up the classification system. For the binary models that modelled the class of the sample, the prediction should be that the sample belongs to the modelled class. For all the other models, the sample is an outlier and should be detected as such. Hence, the combination of the results of the binary classifiers in order to obtain the final assigned class is a fundamental step.

In the present work multi-class classification is performed in two steps by combining partial least squares (PLS) regression and linear discriminant analysis (LDA). In the initial step, one-versus-one PLS models provide a prediction (a single value) for each sample and for each model. Each one-versus-one PLS model can only discriminate between two different classes. However, the predictions of samples from the classes not modelled by a given PLS model may span the whole prediction domain, and hence these samples may be misclassified. So, the multi-class classification is done in a second step with the LDA classifier applied to the predictions of the samples for all the one-versus-one PLS models.

The methodology was used to classify samples of the leukemia and small round blue cell tumour datasets. The classification accuracies were 97%, using only 15 genes, and 100%, with 17 genes, respectively.

This paper was submitted in May 2010.

Multi-class classification of microarray gene expression data

C. Botella*, J. Ferré, R. Boqué

Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007.
Tarragona, Spain

* Corresponding author: [email protected]

Submitted May 2010. (Edited for format)

ABSTRACT

When classification from microarray gene expression data is a multi-class problem, the outputs of binary classifiers such as discriminant partial least squares (DPLS) must be combined to obtain the final classification result. In this work a new methodology for multi-class classification that combines partial least squares (PLS) and linear discriminant analysis (LDA) has been developed. The method also includes a gene selection step based on the selectivity ratio index so that the best performing genes for each binary PLS model are selected. When the methodology was applied to the leukemia dataset, which has three classes, 97% of the samples were correctly classified using only 15 genes in the PLS models. For the round blue cell tumour dataset, which has four classes, 100% of the samples were correctly classified using only 17 genes in the PLS models.

7.1. Introduction

An important challenge in the use of large-scale gene expression data for biological classification occurs when the dataset involves multiple classes [1]. So far, most of the research on classification of microarray data has focused on two major classes only (e.g. normal versus cancer tissue, response to treatment versus no response). However, practical cancer diagnosis requires differentiating among more than two types or subtypes and, hence, multi-class classification techniques are needed [2].

Multi-class classification can be approached in two ways.
One way is the use of algorithms that treat multi-class problems directly, such as k-Nearest Neighbours (kNN), Linear Discriminant Analysis (LDA) or Neural Networks (NN). A second way is to decompose the multi-class problem into multiple binary classification problems and use binary classification algorithms, such as Discriminant Partial Least Squares (DPLS) or Total Principal Component Regression (TPCR). These binary classification models can be calculated by modelling either one class versus the others (one versus all, OVA), one class versus each other class (one versus one, OVO) or using hierarchical partitioning [3, 4]. Then, the results of the binary classifiers are combined to obtain the assigned class label.

Several novel methods have been developed for multi-class classification with microarray data. Tan et al. in [5] used TPCR, which takes into account the information of the dependent variables and also the errors in the dependent and independent variables. Ooi et al. in [1] used genetic algorithms (GA) for gene selection, and classification was based on the maximum likelihood. They obtained better classification accuracies than previously published methods and reduced the number of genes needed for classification. Leng et al. in [6] proposed Sparse Optimal Score (SOS), based on Fisher LDA, as a multicategory classifier and classified three public datasets satisfactorily. Tibshirani et al. [7] proposed the nearest shrunken centroid method for cancer class prediction. With the same multi-class classification objective, some studies proposed derivations of SVM for multi-class classification. Lee et al. designed an optimal multicategory SVM [8], and Peng et al. in [2] and Liu et al. in [9] combined GA and one versus one SVM. In contrast, de Souza et al. in [10] applied GA and one versus all SVM.
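The difference between the OVO and OVA decompositions can be made concrete with a small sketch. The function names `ovo_pairs` and `ova_problems` are hypothetical and the class labels are placeholders; the only point illustrated is that C classes give C(C-1)/2 OVO models but C OVA models, each of which pools all remaining classes.

```python
from itertools import combinations

def ovo_pairs(classes):
    """One binary problem per unordered pair of classes (one versus one)."""
    return list(combinations(classes, 2))

def ova_problems(classes):
    """One binary problem per class against all remaining classes pooled."""
    return [(c, [o for o in classes if o != c]) for c in classes]

three = ["Z1", "Z2", "Z3"]
four = ["Z1", "Z2", "Z3", "Z4"]

print(ovo_pairs(three))       # [('Z1', 'Z2'), ('Z1', 'Z3'), ('Z2', 'Z3')]
print(len(ovo_pairs(four)))   # 6
print(ova_problems(three)[0]) # ('Z1', ['Z2', 'Z3'])
```

With three classes both strategies need three models, but each OVA model must separate one class from a heterogeneous mixture of the other two.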
DPLS has proven useful for binary classification of microarray data but it has not been much studied for multi-class classification. Nguyen et al. [11] used PLS as a dimension reduction technique prior to classification with logistic discrimination or Quadratic Discriminant Analysis. DPLS2 was used by Tan et al. [12] to classify multi-class public datasets using the OVA strategy. However, this strategy may lack biological sense for microarray data analysis when, for instance, healthy samples must be grouped together with tumour samples and discriminated from other tumour types.

In this work we describe the application of PLS combined with LDA for multi-class classification. Several OVO PLS models are calculated and LDA is applied to the predictions of the samples on each of these models. The advantage of using OVO models is that each model maximizes the differences between the two modelled classes. Additionally, gene selection is performed for each PLS model to increase the discriminant ability. The selection is based on the highest selectivity ratio index [13], which is specially suited for PLS. The method has been applied to two datasets, the leukemia dataset [14] and the small round blue cell tumour dataset [15].

7.2 Methods

7.2.1 Multi-class classification method: Partial Least Squares-Linear discriminant analysis

The multi-class classification in C classes is done by combining PLS regression and LDA (Figure 1) and may be validated by leave-one-out cross-validation (LOOCV) or by a test set.

PLS is a regression method based on maximizing the covariance between X and y [16]. The gene expression microarray data, X, is an N×P matrix of N samples and P gene expressions, and y is a vector of zeros and ones that codifies the classes of the samples. In this paper, one-versus-one PLS models are calculated, so X only contains samples from two modelled classes, for instance class Z1 (e.g.
“tumour type I”) and class Z2 (e.g. “tumour type II”). The zeros in y codify the samples of class Z1 and the ones in y codify the samples of class Z2. With these settings, PLS models for every combination of two classes Zi vs. Zj (i = 1, …, C; j > i) are calculated (Figure 1(c)).

For a sample to be classified in one of the C classes, its prediction in each DPLS model is calculated as:

$$\hat{y} = \mathbf{x}^{\mathrm{T}} \mathbf{b} \tag{1}$$

where b is the vector of regression coefficients for the model of A factors and x is the gene expression vector for such sample. Note that if b has been calculated from mean-centered data then x should be mean-centered and ŷ should be processed accordingly.

The sample to be classified is predicted in all the OVO PLS models (Figure 1(c)), thus obtaining a vector of predictions ŷ (Figure 1(d)). For instance, if there are three subtypes of samples, three PLS models are calculated: class Z1 versus class Z2, class Z1 versus class Z3 and class Z2 versus class Z3. The prediction of a sample in these three models generates ŷ = [ŷ12 ŷ13 ŷ23] (the subscripts indicate the classes accounted for in each model), which describes the behaviour of the sample in the multi-class classifier. Ideally, if the sample belongs to class Z1, ŷ12 and ŷ13 should be close to zero and ŷ23 should be far above 1 or far below 0 so that the sample could be detected as an outlier in the model of class Z2 vs. class Z3. Actually, this is not always the case and outliers may have predictions along the entire ŷ domain, mixed with the predictions of the modelled classes. Similarly, a sample of class Z2 should have a ŷ12 close to one, a ŷ23 close to zero and an undetermined value of ŷ13. Finally, a sample of class Z3 should have ŷ13 and ŷ23 close to one and an undetermined value of ŷ12. LDA is then applied to ŷ.

LDA finds discriminant functions (directions) such that the distance between the classes’ mean vectors is maximized when the data are projected onto such functions. Let ŷ be the vector of predictions obtained for the sample that must be classified.
A discriminant score (m) is calculated for that sample in each discriminant function as:

$$m_c(\hat{\mathbf{y}}) = (\hat{\mathbf{y}} - \boldsymbol{\mu}_c)^{\mathrm{T}} \mathbf{S}_{\mathrm{pooled}}^{-1} (\hat{\mathbf{y}} - \boldsymbol{\mu}_c) - 2 \ln \pi_c \tag{2}$$

where μc is the mean vector of the predictions of the training samples of class c and πc is the a priori probability of class c, calculated as the number of samples of the class (nc) over the total number of samples (N):

$$\pi_c = \frac{n_c}{N} \tag{3}$$

and S_pooled is the covariance matrix evaluated as:

$$\mathbf{S}_{\mathrm{pooled}} = \frac{1}{N} \sum_{c=1}^{C} n_c \mathbf{S}_c \tag{4}$$

where Sc is

$$\mathbf{S}_c = \frac{1}{n_c} \sum_{i=1}^{n_c} (\hat{\mathbf{y}}_i - \boldsymbol{\mu}_c)(\hat{\mathbf{y}}_i - \boldsymbol{\mu}_c)^{\mathrm{T}} \tag{5}$$

Then the sample is classified in the class for which it has the lowest classification score (Figure 1(e)).

[Figure 1: flow diagram with panels a to e: initial dataset; gene selection (genes selected following SR to discriminate between each pair of classes); OVO PLS models (class Z1 vs. Z2, class Z1 vs. Z3, class Z2 vs. Z3) calculated on the matrices with the selected genes; prediction of the training samples; LDA classifier calculation.]

Figure 1. Scheme of a three-class PLS-LDA training classification process: a. Initial dataset. b. OVO PLS models with A factors (initial guess) are calculated and genes are selected with the SR index for each model. c. The optimal OVO PLS models are calculated with the selected genes. d. All the training samples are predicted in each OVO PLS model, obtaining a matrix Ŷ. e. The LDA classifier is calculated, using Ŷ as independent variables and y as the class code. Note that P12, P23 and P13 represent the same number of genes but not necessarily the same genes. The optimal number of factors in each OVO PLS model is the one that minimizes the RMSECV criterion.
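The LDA step on the OVO PLS predictions can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' MATLAB code: the score is taken as the squared Mahalanobis distance to the class mean minus twice the natural logarithm of the prior, the pooled covariance is the class-size-weighted average of the class covariances, the function names (`fit_lda`, `lda_scores`, `classify`) are hypothetical, and the training predictions are synthetic values scattered around idealised targets.

```python
import numpy as np

def fit_lda(Y_hat, labels):
    """Class means, priors and inverse pooled covariance of the PLS predictions."""
    Y_hat = np.asarray(Y_hat, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    N, k = Y_hat.shape
    means, priors = [], []
    S_pooled = np.zeros((k, k))
    for c in classes:
        Yc = Y_hat[labels == c]
        mu = Yc.mean(axis=0)
        D = Yc - mu
        S_pooled += D.T @ D          # equals n_c * S_c
        means.append(mu)
        priors.append(len(Yc) / N)   # pi_c = n_c / N
    S_pooled /= N
    return classes, np.array(means), np.array(priors), np.linalg.inv(S_pooled)

def lda_scores(y_hat, means, priors, S_inv):
    """One score per class; the lowest score wins."""
    d = y_hat - means                # shape (C, k)
    return np.einsum("ck,kl,cl->c", d, S_inv, d) - 2.0 * np.log(priors)

def classify(y_hat, model):
    classes, means, priors, S_inv = model
    scores = lda_scores(np.asarray(y_hat, dtype=float), means, priors, S_inv)
    return classes[np.argmin(scores)]

# Synthetic training predictions [y12, y13, y23] (assumption): 20 samples per
# class around idealised targets, 0.5 standing for an "undetermined" value.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 0.5],   # class Z1
                    [1.0, 0.5, 0.0],   # class Z2
                    [0.5, 1.0, 1.0]])  # class Z3
labels = np.repeat(["Z1", "Z2", "Z3"], 20)
Y_train = np.repeat(centres, 20, axis=0) + rng.normal(scale=0.1, size=(60, 3))

model = fit_lda(Y_train, labels)
print(classify([0.05, 0.10, 0.40], model))
```

Because LDA uses the whole vector of predictions, a sample whose prediction in one binary model is ambiguous can still be assigned correctly from its predictions in the other models.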
7.2.2 Selectivity ratio index

The Selectivity Ratio (SR) index flags the most relevant variables for PLS. It is based on a target rotation approach [17] and is detailed in reference [13]. The SR index is defined as the ratio of the explained variance (v_ex,p) to the residual variance (v_res,p) of a variable (p):

$$\mathrm{SR}_p = \frac{v_{\mathrm{ex},p}}{v_{\mathrm{res},p}} \tag{6}$$

taking into account that PLS decomposes X as:

$$\mathbf{X} = \mathbf{t}_{\mathrm{TP}} \mathbf{p}_{\mathrm{TP}}^{\mathrm{T}} + \mathbf{E}_{\mathrm{TP}} = \mathbf{X}_{\mathrm{TP}} + \mathbf{E}_{\mathrm{TP}} \tag{7}$$

where t_TP (N×1) are the target-projected scores and p_TP (P×1) are the target-projected loadings. The explained variance for each variable p is calculated from column p of the reconstructed X_TP, and the residual variance is calculated from column p of the residual matrix E_TP. Note that t_TP and p_TP in equation 7 are calculated following the procedure in [13]. The genes with the highest SR_p index are selected as the most relevant to discriminate between the two classes modelled by the PLS model. Note that each OVO PLS model has its optimal subset of genes that best discriminate between the two modelled classes. The number of genes in the subset may differ from one PLS model to another. To avoid an additional optimization step, the methodology implemented here used the same number of genes for all the PLS models, although the genes were not necessarily the same.

7.3 Datasets

The leukemia dataset [14] consists of 72 samples of acute lymphoblastic leukemias carrying a chromosomal translocation, which gives rise to three subtypes of samples: acute lymphoblastic leukemia (ALL, 24 samples, class Z1), mixed lineage leukemia (MLL, 20 samples, class Z2) and acute myeloid leukemia (AML, 28 samples, class Z3). For each sample 12582 gene expressions were obtained. This dataset was pre-processed as described in [14].
The small round blue cell tumour (SRBCT) dataset [15] consists of 63 training samples from four different cell subtypes: 23 samples are from the Ewing family of tumours (EWS, class Z1), 20 are rhabdomyosarcomas (RMS, class Z2), 12 are neuroblastomas (NB, class Z3) and the remaining 8 are Burkitt lymphomas (BL, class Z4). The independent test set has 20 samples: 6 of class Z1, 5 of class Z2, 6 of class Z3 and 3 of class Z4. For each training and test sample 2308 genes were analysed. The algorithms were run in MATLAB® software.

7.4 Results

7.4.1 Leukemia dataset

Three OVO PLS models were calculated for A factors: a model of ALL vs. MLL, a model of ALL vs. AML, and a model of MLL vs. AML. For each PLS model, the genes having the highest SR index were selected. Three groups of 15, 50 and 100 genes were tested so that the results could be compared with previous results [14, 18]. The OVO PLS models were recalculated using the selected genes, and the optimal number of factors was the one that minimized the root mean square error of leave-one-out cross-validation (RMSECV). Note that this number of factors may differ from the one used in the preliminary model used for selecting the genes. The three optimal PLS models were used to predict all the training samples. A matrix Ŷ (72×3) of predictions was then obtained and used for training the LDA classifier. A sample to be classified was first predicted with the three PLS models, thus obtaining a vector ŷ (3×1) of predictions. This vector was supplied to the LDA classifier to obtain the final classification.

In this dataset a test set was not available, so leave-one-out cross-validation (LOOCV) was carried out. Hence, all the samples were used once as a test sample, obtaining for each one a (3×1) vector of predictions, and yielding a matrix Ŷt (72×3) of predictions in total. This matrix was used to predict the class with the LDA classifier calculated with the training samples in the previous step.
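The gene-selection step for one binary PLS model can be sketched as below. This is a minimal illustration under assumptions, not the authors' code: the PLS1 regression vector is obtained with a textbook NIPALS loop, the target-projected scores are taken as the mean-centred data projected onto the normalised regression vector (the usual target-projection construction), the function names are hypothetical, and the data are synthetic with three informative "genes".

```python
import numpy as np

def pls1_coefficients(X, y, n_factors):
    """Regression vector b of a PLS1 model (NIPALS) on mean-centred X and y."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_factors))
    P = np.zeros((n_vars, n_factors))
    q = np.zeros(n_factors)
    for a in range(n_factors):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)        # weight vector
        t = Xr @ w                    # scores
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        q[a] = yr @ t / tt            # y loading
        Xr = Xr - np.outer(t, p)      # deflate X
        yr = yr - q[a] * t            # deflate y
        W[:, a], P[:, a] = w, p
    return W @ np.linalg.solve(P.T @ W, q)

def selectivity_ratio(X, y, n_factors):
    """SR_p = explained / residual variance of each variable after target projection."""
    Xc = X - X.mean(axis=0)
    b = pls1_coefficients(X, y, n_factors)
    t_tp = Xc @ (b / np.linalg.norm(b))   # target-projected scores (N,)
    p_tp = Xc.T @ t_tp / (t_tp @ t_tp)    # target-projected loadings (P,)
    X_tp = np.outer(t_tp, p_tp)           # explained part of X
    E_tp = Xc - X_tp                      # residual part of X
    return (X_tp ** 2).sum(axis=0) / (E_tp ** 2).sum(axis=0)

# Synthetic two-class example (assumption): 40 samples, 50 genes,
# only genes 0, 1 and 2 differ between the classes.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 50))
X[:, :3] += 3.0 * y[:, None]

sr = selectivity_ratio(X, y, n_factors=2)
top = np.argsort(sr)[::-1][:3]            # indices of the three highest SR values
print(sorted(top.tolist()))
```

In the procedure described above this ranking would be computed separately for each OVO model, and the PLS model would then be recalculated on the retained genes only.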
Modelling with the 15 most relevant genes

Figure 2a shows the LOOCV predictions from the three binary PLS models calculated with only the 15 most discriminant genes, selected according to the SR index. Note that the model of ALL vs. MLL can correctly discriminate samples of class ALL (whose predictions are around 0) from samples of class MLL (whose predictions are around 1). However, it cannot differentiate the samples of class AML. Ideally, these samples should behave differently and have extreme predictions, so that they could be detected as outliers. Instead, their predictions are between the values 0 and 1, so the predictions of the PLS model alone are not enough for correctly classifying all the samples. A similar situation happened with the MLL samples in the model ALL vs. AML and with the ALL samples in the model MLL vs. AML (Figure 2a).

Next, LDA was applied to the predictions ŷ of each LOOCV sample. Figure 2b shows the validation samples already classified by LDA in the space of the PLS predictions. The LOOCV classification accuracy was 97.2%, higher than the 95% LOOCV accuracy previously reported for this dataset [14] using kNN and selecting the genes following a signal-to-noise criterion. An accuracy of 97.2% means that the method misclassified only 2 of the 72 samples. These two misclassified samples, MLL_2 and MLL_15, are samples of class MLL that were assigned to class AML. Figure 3 shows the discriminant scores of the LDA classifier for the first two discriminant functions. Note that for these samples the discriminant score in the second discriminant function is not high enough for them to be assigned to their true class MLL.
Both samples have raw intensities lower than the intensities of the samples of their true class MLL and more similar to the intensities of the samples of class AML. As a consequence, the discriminant scores and the predicted ŷ’s for these two samples were more similar to the ŷ’s for class AML. More concretely (Table 1), MLL_2 has predictions ŷ12 = 0.62 and ŷ13 = 0.94, which are almost equal to the mean of the predictions of the samples of class AML (mean ŷ12 = 0.68 and mean ŷ13 = 0.97) and differ considerably from the mean of the predictions for the samples of its true class (mean ŷ12 = 0.94 and mean ŷ13 = 0.61). The predictions for the model of MLL vs. AML did not contribute significantly to the classification of the MLL_2 sample, having a value between the predictions of both classes.

[Figure 2: (a) LOOCV predictions ŷ for the PLS model of class ALL vs. class MLL, the PLS model of class ALL vs. class AML and the PLS model of class MLL vs. class AML; (b) three-dimensional plot of the ŷ values of the OVO PLS models.]

Figure 2a. Predictions of LOOCV samples for OVO PLS models. 2b. Samples classified according to LDA based on the LOOCV predictions of the OVO PLS models calculated with the 15 genes selected with the highest SR index. () samples of class ALL correctly classified, (ż) samples of class MLL correctly classified, (×) samples of class AML correctly classified, and () misclassified samples.

[Figure 3: scatter plot of the second discriminant function against the first discriminant function.]

Figure 3. Discriminant scores of the LDA classifier calculated for the first two discriminant functions. () samples of class ALL, (ż) samples of class MLL, (×) samples of class AML, and () misclassified samples.
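The numbers quoted above for sample MLL_2 can be checked with a short worked calculation. This is only an illustration: it uses plain Euclidean distance on the two reported predictions (ŷ12, ŷ13) rather than the LDA score, which also involves the covariance matrix and ŷ23, and the variable names are hypothetical.

```python
import math

# Reported predictions (y12, y13) of sample MLL_2 and the class means quoted in the text.
mll_2 = (0.62, 0.94)
class_means = {"AML": (0.68, 0.97), "MLL": (0.94, 0.61)}

# Euclidean distance from MLL_2 to each class mean in the (y12, y13) plane.
distances = {name: math.dist(mll_2, mean) for name, mean in class_means.items()}
for name, d in distances.items():
    print(name, round(d, 3))          # AML 0.067, MLL 0.46

nearest = min(distances, key=distances.get)
print(nearest)                        # AML

# LOOCV accuracy with 2 misclassified samples out of 72.
print(round(100 * 70 / 72, 1))        # 97.2
```

The sample lies roughly seven times closer to the AML mean than to the mean of its true class, which is consistent with its assignment to class AML.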
Modelling with the 50 most relevant genes

When the number of genes selected to calculate the PLS models was 50, the classification performance was similar to that for 15 genes, except for one additional sample that was misclassified. The predictions of each class are more clustered around their target values, which should improve the discrimination between the classes. However, the two outliers detected when the classification was performed with 15 genes, MLL_2 and MLL_15, were again outliers. In addition, the sample AML_11 was also pointed out as an outlier. This resulted in a LOOCV classification accuracy of 95.8%. In this case, then, increasing the number of genes worsened the classification. This contrasts with previous results where the best accuracies were obtained with 50 genes [18].

Figures 4a and 4b show the predictions and the LOOCV results for the models calculated with 50 genes. The two misclassified samples of class MLL (MLL_2 and MLL_15) behave as in the models calculated with 15 genes. AML_11 is an AML sample whose intensities for these 50 selected genes are higher than expected for a sample of class AML. This did not happen when only 15 genes were used. These high intensities influenced the predicted ŷ, which was similar to the predictions of the MLL samples and very different from the predictions of the samples of its true class.

When the number of genes increased to 100 the classification performance was like the performance of the models with 50 genes, and the three samples pointed out above as outliers were again misclassified.

[Figure 4: (a) LOOCV predictions ŷ for the PLS model of class ALL vs. class MLL, the PLS model of class ALL vs. class AML and the PLS model of class MLL vs. class AML; (b) three-dimensional plot of the ŷ values of the OVO PLS models.]
Figure 4a. Predictions of LOOCV samples for OVO PLS models calculated with the 50 genes with the highest SR index. 4b. Classification of LDA from the OVO PLS predictions. The symbols distinguish correctly classified samples of class ALL, class MLL and class AML (×), and misclassified samples.

7.4.2 Small round blue cell tumour dataset

Following the procedure described for the leukemia dataset, OVO PLS models were calculated. By combining the four different classes, six PLS models were calculated. For each one, the best 17 discriminant genes, obtained using the SR index, were selected. The optimal number of factors for each one of the six PLS models was determined based on the minimum RMSECV. The optimal PLS models were used to predict all the training samples, which were then submitted to the LDA classifier. Figure 5 shows the predictions for the test samples for three of the six PLS models, along with the classification performed by LDA from those predictions. From the OVO PLS predictions, LDA was able to classify all test samples correctly. Note that in reference [15] a test accuracy of 100% was achieved using 96 genes. With PLS-LDA, the same performance is achieved using only 17 genes, selected independently for each one of the OVO PLS models.

[Figure 5. Axes: ŷ of model EWS vs. BL, ŷ of model EWS vs. RMS, ŷ of model EWS vs. NB.]

Figure 5. Predictions from three of the six PLS models and the classification performed by LDA. The symbols distinguish samples of class EWS (×), class RMS, class NB and class BL. All of the samples are correctly classified.
7.5 Conclusions

LDA applied to the predictions of one-versus-one PLS models allows multi-class classification of microarray gene expression data with good performance. By selecting the most discriminant genes independently for each PLS model, the accuracies are similar to those previously published but are obtained using fewer genes. In addition, the use of only a few genes allows a better subsequent interpretation of the biological meaning of the genes and their relation with a particular illness.

Acknowledgements

The authors thank the support of the Departament d’Universitats, Recerca i Societat de la Informació de Catalunya for providing Cristina Botella’s doctoral fellowship, and of the Spanish Ministerio de Educación y Ciencia (project CTQ2007-66918/BQU).

References

[1] Ooi, C.H. and P. Tan, Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, 2003. 19: p. 37-44.
[2] Peng, S., et al., Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Letters, 2003. 555: p. 358-362.
[3] Statnikov, A., et al., A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2005. 21: p. 631-643.
[4] Yeang, C.H., et al., Molecular classification of multiple tumour types. Bioinformatics, 2001. 17: p. S316-S322.
[5] Tan, Y., et al., Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Research, 2005. 33: p. 56-65.
[6] Leng, C., Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 2008. 32: p. 417-425.
[7] Tibshirani, R., et al., Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 2002. 99: p. 6567-6572.
[8] Lee, Y., Y. Lin, and G. Wahba, Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association, 2004. 99: p. 67-81.
[9] Liu, J.J., et al., Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics, 2005. 21: p. 2691-2697.
[10] Souza, B.F.d. and A.P.d.L.F.d. Carvalho, Gene selection based on multi-class support vector machines and genetic algorithms. Genetics and Molecular Research, 2005. 4: p. 599-607.
[11] Nguyen, D.V. and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002. 18: p. 1216-1226.
[12] Tan, Y., et al., Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models. Computational Biology and Chemistry, 2004. 28: p. 235-244.
[13] Botella, C., J. Ferré, and R. Boqué, Gene selection in microarray data based on the selectivity ratio index. Submitted, 2010.
[14] Armstrong, S.A., et al., MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 2002. 30: p. 41-47.
[15] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
[16] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, S. Kotz and N.L. Johnson, Editors. 1985, Wiley: New York. p. 581-591.
[17] Kvalheim, O.M. and T.V.
Karstang, Interpretation of latent-variable regression models. Chemometrics and Intelligent Laboratory Systems, 1989. 7: p. 39-51.
[18] Yang, T.Y., Efficient multi-class cancer diagnosis algorithm, using a global similarity pattern. Computational Statistics and Data Analysis, 2009. 53: p. 756-765.

CHAPTER 8
Conclusions

1. Probabilistic Discriminant Partial Least Squares (p-DPLS) has been applied to the binary classification of microarray gene expression data.

The probabilistic Discriminant Partial Least Squares (p-DPLS) method has been successfully applied to the classification of microarray gene expression data. In the training step, a PLS model is calculated from the microarray data matrix X and the vector y of 0's and 1's that codifies two classes. Next, the training data are predicted with the PLS model for a selected number of factors and their predictions ŷ are used to estimate two probability density functions (PDFs), one for each modelled class. These PDFs define the range of predictions that characterizes each class. In the prediction step, the prediction ŷ of the sample to be classified and the PDFs are used to calculate the a posteriori probability that the sample belongs to each one of the modelled classes. The sample is then assigned to the class with the highest probability.

There are several reasons that make p-DPLS suitable for classifying microarray data. Microarray data involve thousands of variables and a much smaller number of samples. Many of these variables are redundant, falsely correlated or irrelevant to distinguish between classes.
The PLS model compresses the large data matrix X into a few latent variables by focussing on the variables in X that are most correlated with the vector of class codes y. Hence, the classifier uses the systematic relevant data variability, so that the prediction ŷ of a sample and the final classification result are minimally affected by irrelevant genes. In addition, since only a few latent variables are used, a noise filtering effect is achieved.

Another advantage of p-DPLS over other algorithms that perform discriminant PLS lies in the calculation of the PDFs of each class and in how the class label is assigned. The classical discriminant PLS approach decides the class label based only on whether ŷ is higher or lower than an arbitrary threshold (e.g. 0.5). More elaborate procedures assume that the ŷ's of each class are normally distributed, and the mean and standard deviation of the ŷ's are used to estimate a Gaussian distribution for each class. The threshold is then the ŷ where the PDFs of both classes coincide or (if a priori probabilities are taken into account) where the a posteriori probabilities are the same. Neither of these approaches has been useful for microarray data. First, there is no reason for setting an arbitrary threshold. Second, the number of samples available for analysis is usually limited and often one class may have many more samples than the other. As a result, the predictions of the PLS model are usually not clustered around the target values 0 and 1 that codify the classes, but are slightly biased and not normally distributed (see, for example, the predictions in Figure 6 of chapter 4). In p-DPLS the type of distribution of the ŷ's does not need to be assumed and the PDFs are calculated by combining kernel functions. Hence, the PDFs better describe the distribution of the predictions of each class.
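As a rough illustration of the kernel-based PDFs and the Bayes assignment described here, the sketch below builds each class PDF as an average of Gaussian kernels centred on the training predictions. It is not the thesis code: the Gaussian kernel shape, the function names, and the use of one smoothing width per training sample (taken to be the standard uncertainty of its prediction) are assumptions made for the example.

```python
import numpy as np


def kernel_pdf(y_grid, y_train, u_train):
    """Class PDF as the average of Gaussian kernels, one per training prediction.

    y_train : predictions (y-hat) of the training samples of one class
    u_train : smoothing width of each kernel (assumed: prediction uncertainty)
    """
    y_grid = np.atleast_1d(np.asarray(y_grid, dtype=float))[:, None]
    y_train = np.asarray(y_train, dtype=float)[None, :]
    u_train = np.asarray(u_train, dtype=float)[None, :]
    z = (y_grid - y_train) / u_train
    kernels = np.exp(-0.5 * z ** 2) / (u_train * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)


def posterior(y_new, preds_by_class, u_by_class, priors):
    """A posteriori probability of each class (rows) for each y_new (columns)."""
    dens = np.array([
        prior * kernel_pdf(y_new, y, u)
        for y, u, prior in zip(preds_by_class, u_by_class, priors)
    ])
    # Bayes theorem: prior-weighted densities normalised over the classes.
    return dens / dens.sum(axis=0)
```

A sample is then assigned to the class with the largest value in its column of `posterior(...)`. No normality of the ŷ's is assumed, and a prediction far from every training prediction gives near-zero densities for all classes, which is one motivation for the outlier limits discussed in the following sections.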
In addition, the kernel functions use the uncertainty of the predictions as smoothing parameter, so that the relative position of the samples in the multivariate space also contributes to the calculated PDFs through the leverage and the fit of the model.

Another advantage of the p-DPLS method used in this thesis is that limits for the range of possible ŷ's of each class can be set, which allows outlier detection (see section 2 below) and the implementation of a reject option that declines to classify a sample when the a posteriori probabilities for both classes are too similar (see section 2 below). The latent variable structure of the PLS model also offers enhanced outlier detection capabilities based on the leverage and residual variance (see section 4 below).

A final advantage of p-DPLS is that diverse variable selection methodologies, already used in PLS regression, can be used to select the most relevant genes for classification. One of these methodologies has been implemented in p-DPLS, as explained in section 5 below.

2. A reject option was implemented in probabilistic Discriminant Partial Least Squares (p-DPLS).

The classification in p-DPLS is based on the Bayes Theorem, so that the sample is assigned to the class with the highest a posteriori probability. The straight application of this rule means that a sample will always be assigned to one of the modelled classes, even when the sample is suspect. One of these situations occurs when the prediction of the new sample is at the extremes of the PDF of one class.
Such a sample is so different from the training samples (it is an outlier) that it might be misclassified. The second situation occurs when the PDFs of the two classes partially overlap and the sample has a prediction ŷ in the overlap zone (called the ambiguous region). That sample has characteristics of both classes, so the a posteriori probabilities of belonging to either class are similar and its classification is not reliable enough. While the samples in these two situations should preferably not be classified, the strict application of the Bayes Theorem forces their assignment to one of the modelled classes. In this thesis, the possibility of not classifying a sample has been implemented in p-DPLS. This is called the reject option. The reject option is generally overlooked in p-DPLS. However, it avoids classifications with a low reliability, by declining to classify both outliers and ambiguous samples. This increases the confidence of the experimenter that the classification model yields correct results when a class label is issued for a new sample.

In this work, the reject option for ambiguous samples has been implemented in p-DPLS as a reject threshold (following Chow’s rule), and the reject option for outliers has been implemented by setting limits to the allowed ŷ values for each class.

A drawback of the reject option is that some rejected samples would have been classified correctly if the reject option had not been implemented. Hence, when the reject threshold and limits are set, a trade-off between the number of samples incorrectly classified, correctly classified and rejected must be achieved.

In this thesis, p-DPLS with reject option has been successfully applied to classify oligonucleotide and miRNA microarray data by rejecting samples that would have been classified incorrectly.
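The two reject mechanisms described above (limits on ŷ for outliers, a threshold on the a posteriori probability for ambiguous samples) can be sketched as a single decision function. This is only an illustration under stated assumptions: the function name, the default threshold of 0.9, and checking the limits of the provisionally assigned class are choices made here, not values taken from the thesis.

```python
def classify_with_reject(y_hat, post, limits, reject_threshold=0.9):
    """Return the index of the assigned class, or None if the sample is rejected.

    y_hat  : PLS prediction of the sample
    post   : a posteriori probabilities, one per modelled class
    limits : (low, high) range of allowed y_hat for each class (assumed given)
    """
    best = max(range(len(post)), key=lambda k: post[k])
    low, high = limits[best]
    if not (low <= y_hat <= high):
        return None  # outlier: prediction outside the range allowed for the class
    if post[best] < reject_threshold:
        return None  # ambiguous: highest posterior below the reject threshold
    return best
```

Raising `reject_threshold` or narrowing `limits` rejects more samples, which is the trade-off between misclassified, correctly classified and rejected samples noted in the text.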
With the reject option, the misclassification rate of the model was reduced from 100% to 10% for the Small Round Blue Cell Cancer dataset (test samples from classes not modelled during the training step), and from 3% to less than 1% for the Human Cancers dataset (training samples classified by cross-validation).

3. The performance evaluation of classifiers must be reconsidered when a reject option is allowed.

A p-DPLS classifier must be evaluated to assure its quality. Common measures of a classifier's performance are the accuracy and the error rate. These parameters are usually calculated as the number of correct (or erroneous) classifications over the total number of samples classified.

When rejection is not an option, the total number of samples classified is equal to the number of samples that have been submitted to the classifier. In contrast, when rejection is an option, performance values such as the accuracy or the error rate are still useful but must be reinterpreted to be meaningful. They are calculated in the same way, as the number of correct (or erroneous) classifications over the total number of samples classified. However, the number of samples for which the classifier has given a class label (classified) may differ from the total number of samples submitted to the classifier (the difference is the number of samples that have been rejected).
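The distinction between samples submitted and samples classified can be made concrete with a small helper. This is a sketch with hypothetical names, in which a rejected sample is marked by the label `None`.

```python
def accuracy_with_reject(true_labels, assigned_labels):
    """Accuracy over the classified samples, plus the fraction rejected.

    A rejected sample is marked with None in assigned_labels.
    """
    classified = [(t, a) for t, a in zip(true_labels, assigned_labels)
                  if a is not None]
    n_submitted = len(true_labels)
    n_rejected = n_submitted - len(classified)
    n_correct = sum(t == a for t, a in classified)
    # The denominator is the number of samples that received a class label.
    accuracy = n_correct / len(classified) if classified else float("nan")
    return accuracy, n_rejected / n_submitted
```

For five submitted samples of which two are rejected and two of the remaining three are correct, the accuracy is 2/3 and the rejected fraction is 0.4; dividing by the five submitted samples instead would give 0.4 and would penalise every rejection as if it were an error.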
The reasoning behind this reinterpretation is that the analyst wants, first of all, the class label issued by the classifier to be correct. Hence, the performance measure should reflect the percentage of correct classifications among the samples for which the classifier assigned a class, which are the ones for which a decision is taken (e.g., 'tumour type 1', 'tumour type 2'). After that, the analyst may accept that the classifier rejects some “difficult” samples (of course, the classifier should classify as many samples as possible and reject as few as possible). In addition, if the accuracy were defined over the total number of samples, classifiers with reject option would always appear to perform no better than models without reject option, because the number of samples correctly classified using the reject option would be equal or lower.

The performance measures are also used to decide among several classifiers. For example, in p-DPLS, different classifiers are obtained by selecting a different number of factors in the PLS model. When the reject option is allowed, the error rate alone may not be a sufficient criterion to compare classifiers, since the rejected samples are not included in the count. In that sense, a classifier that rejects most of the samples and correctly classifies the remaining ones will have a high accuracy, although it is clearly not useful for classification.

A better criterion for evaluating the performance of a classifier is the Cost parameter, which takes into account the number of rejected samples. The Cost evaluates the number of correct classifications, the misclassifications and also the rejections of the model, taking into account the individual cost of each of these actions and providing a single value representative of the performance of the classifier or the classification model. The Cost has been used in this thesis to compare the performance of p-DPLS with reject option and to determine the optimal number of factors for the p-DPLS model.
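A cost of this kind can be sketched as a weighted count per submitted sample. The exact definition used in the thesis is not reproduced in this chapter, so the formula below, the individual costs (0 for a correct classification, 1 for an error, 0.5 for a rejection), the normalisation by the number of submitted samples and the function names are all assumptions for illustration only.

```python
def cost(n_correct, n_error, n_reject,
         c_correct=0.0, c_error=1.0, c_reject=0.5):
    """Average cost per submitted sample (assumed weights, see lead-in)."""
    n_total = n_correct + n_error + n_reject
    total = c_correct * n_correct + c_error * n_error + c_reject * n_reject
    return total / n_total


def best_by_cost(candidates, **weights):
    """Pick the candidate (e.g. a number of PLS factors) with the lowest cost.

    candidates maps a label to a tuple (n_correct, n_error, n_reject).
    """
    return min(candidates, key=lambda k: cost(*candidates[k], **weights))
```

With these assumed weights, a model that classifies 10 samples correctly and rejects 90 has a cost of 0.45, higher than a model with 90 correct, 5 errors and 5 rejections (0.075), even though the first one has an accuracy of 100% over the classified samples.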
The Cost has also been used to evaluate whether removing outliers improves the p-DPLS models (see section 4).

4. Outlier detection in p-DPLS has been implemented as a reject option.

Microarray data may contain outliers caused by the many steps involved in obtaining the data. Moreover, samples that belong to classes that have not been modelled may also be submitted to the p-DPLS classifier. Hence, outlier detection is a necessary tool for the practical implementation of p-DPLS. Outliers in p-DPLS were detected in this work by combining leverage, variances and predicted values (ŷ) of the p-DPLS model. This method for outlier detection makes it possible not only to reject samples with errors in the instrumental data (x), in the codification (y) or with an erroneous x-y relation, but also to identify that an incoming sample does not belong to any of the classes in the training set.

In the Small Round Blue Cell tumours dataset, 90% of the samples of a class not modelled in the training step were detected as outliers using this method. All of these samples would have been misclassified if the reject option had not been used. In the prostate dataset, outlier elimination improves the classification model, decreasing the Cost per classification from 0.11 to 0.06. The outlier elimination also has a beneficial effect on the accuracy of the classification of unknown (test) samples, which increases from 95% to 100% by rejecting a sample that had been wrongly classified.

5. Gene selection was implemented in p-DPLS with reject option.

Most of the thousands of gene expressions in microarray datasets are irrelevant to classify samples. Irrelevant data may degrade the classifier’s performance and hinder the understanding of the genes that discriminate the classes. For these reasons, variable selection is required.
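One way to rank genes for such a selection is the selectivity ratio (SR), the index applied in this thesis. The sketch below follows the usual target-projection definition of the SR (variance of each variable explained by the target-projected component divided by its residual variance); the exact implementation used in the thesis is not shown in this chapter, so the function names and the assumption that the regression vector `b` of a fitted PLS model is already available are illustrative.

```python
import numpy as np


def selectivity_ratio(X, b):
    """Selectivity ratio of each variable (column) of a column-centred X.

    b is the regression vector of a fitted PLS model (assumed given).
    """
    X = np.asarray(X, dtype=float)
    b = np.asarray(b, dtype=float)
    w = b / np.linalg.norm(b)        # normalised regression vector
    t = X @ w                        # target-projected scores
    p = X.T @ t / (t @ t)            # target-projected loadings
    X_tp = np.outer(t, p)            # part of X explained by the target component
    E = X - X_tp                     # residual part of X
    return (X_tp ** 2).sum(axis=0) / (E ** 2).sum(axis=0)


def top_genes(sr, k):
    """Indices of the k variables with the highest selectivity ratio."""
    return np.argsort(sr)[::-1][:k]
```

Under this definition a gene whose variation is almost entirely aligned with the class-discriminating direction gets a large SR, while a gene dominated by unrelated variation gets an SR close to zero, so keeping the top-ranked genes (17 in the models reported here) discards most of the irrelevant variables.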
In this work the selectivity ratio index has been applied as a gene selection method to select the relevant variables in PLS. This made it possible to identify the most relevant genes to discriminate subtypes of prostate cancer and non-small cell lung types of cancer with high accuracy, independently of the training and test sets used.

For the prostate dataset, models with only 17 selected genes had a mean LOOCV accuracy of 94%, compared to the 85% accuracy obtained for the p-DPLS model without gene selection (5966 genes). Equivalently, the mean of the accuracies for the test set improved to 92% from the 84% obtained without gene selection. When the number of selected genes increased from 17 to 35, the accuracy did not improve. Similarly, for the non-small cell lung cancer dataset, the genes used in the classification were reduced from 54675 to 17, achieving a mean LOOCV accuracy of 93%. In this case, increasing the number of selected genes from 17 to 30 did not improve the classification accuracy either.

The most adequate method for proving the validity of a selected subset of genes (and, in turn, the validity of the gene selection algorithm and of the gene selection criterion) has also been studied. Most variable selection methods start by splitting the dataset into a training and a test set. Such a split influences the calculated accuracy of the classification model and also influences the conclusion about the validity of the selected subset of genes. If the selected genes and the conclusions are based on a single split, overly pessimistic or overoptimistic results can be found. A single unfortunate split can lead to low accuracies (around 88%) and, by contrast, a fortunate split can lead to overoptimistic accuracies (around 100%).
For this reason, a repetitive strategy of training set and test set splits, gene selection, p-DPLS model calculation and validation was carried out to measure the performance of the selected genes. The genes selected following this strategy provided models much less influenced by the split of the data.

6. Linear Discriminant Analysis has been combined with PLS to solve multi-class classification problems.

Multi-class classifiers are required for microarray data classification since most of the cells or tissues to be classified may belong to more than two classes.

p-DPLS is suitable for analysing microarray data due to advantages such as the use of latent variables and the noise reduction (detailed in section 1), which are important in order to improve the multi-class classification. However, p-DPLS is a binary classifier; hence, it can only discriminate between two classes at a time. One usual option is to reduce the multi-class classification problem to binary classification problems, following a one-versus-one or a one-versus-all strategy, but these strategies are not always enough to achieve an adequate multi-class classification. The drawback is that DPLS discriminates between two modelled classes, but the predicted ŷ values of incoming samples that may not belong to either of these two classes span the whole ŷ domain (e.g. Figure 2a of chapter 7). Hence, these samples are confused with the samples of the modelled classes, assigned to one of them and, so, misclassified.

In this thesis a method that combines PLS and linear discriminant analysis (LDA) has been developed for multi-class classification. The method also involves a selection of the most discriminant genes for each of the PLS models. This strategy reduces the data dimension and performs the multi-class classification with high accuracy using a few genes. This method has been applied to the leukemia and the small round blue cell tumour datasets.
Leukemia data consist of three different types of samples (AML, ALL and MLL) that generally have poor prognosis, and the small round blue cell tumour dataset includes four subtypes (NB, RMS, NHL and EWS), the accurate diagnosis of which is essential because the treatment options, responses to therapy and prognoses vary widely depending on it. For both datasets, the accuracies achieved were very high: classification accuracies of 97% and 100%, respectively, using 15 genes to classify the leukemia dataset and 17 genes for the small round blue cell tumour dataset.

Appendix

Datasets

Human cancers dataset

The Human Cancers dataset was published by Lu et al. in [1]. The normalized dataset is available at [2] together with supplementary information [1]. The dataset consists of 282 microRNAs (miRNA, non-coding RNA species) measured in 218 samples (46 healthy and 172 tumour) from twenty tissues (ovary, colon, lung, prostate, bladder, breast, follicular lymphoma, kidney, liver, brain, melanoma, mesothelioma, stomach, uterus, acute myelogenous leukaemia, diffuse large-B cell lymphoma, B-cell ALL, mycosis fungoides, mixed lineage leukaemia and T-cell ALL).

The published dataset had been normalized as detailed in the Supplementary_Notes document:

1. Well-to-well scaling: the reading from each well was scaled such that the total of the two post-labeling controls, in that well, became 4500 (a median value based on a pilot study).
2. Sample scaling: the normalized readings were scaled such that the total of the 6 pre-labeling controls in each sample reached 27,000 (a median value based on a pilot study).
3. Floor threshold was set at 32.
4. Data were log2 transformed.

The normalized downloadable data file is a tab-delimited text file (miGCM_218.gct) of 218 samples and 217 gene expressions (left after filtering). The first row of the matrix indicates the tissue ID, and the first and the second columns detail the gene name and gene description, respectively.

In the original work, the dataset was used to demonstrate the feasibility and utility of monitoring the expression of miRNAs in human cancer tissue. This dataset has been used in other studies. Lodes et al. [3] used miRNAs as markers for cancer detection, and it has been pointed out that miRNAs may be the future of pharmacogenomics [4].

In this thesis it has been used to evaluate the performance of the probabilistic DPLS classifier with reject option.

Breast cancer dataset

The Breast Cancer dataset was published by Hedenfalk et al. in [5]. The dataset after filtering (3226 genes) is available in [6].

The downloadable data are the normalized gene expression ratios of 21 samples from three different mutations (BRCA1, BRCA2 and sporadic mutation). The format description document, on the same web page, describes the downloadable data. The downloadable data file is a tab-delimited text file, in which the first row indicates the Patient ID for each experiment (1 to 21).
The second row provides the mutation classification for each experiment (BRCA1, BRCA2, Sporadic) and the third row provides the experiment ID (s1996, s1822, etc.). Columns 1 to 3 are related to the gene IDs and their location in the plate. Columns 4 to 24 contain gene expression ratios for each gene in each experiment.

The gene expression ratios are derived from the fluorescent intensity (proportional to the gene expression level) of a tumor sample (BRCA1, BRCA2, or Sporadic) divided by the fluorescent intensity of a common reference sample (MCF-10A cell line). The common reference sample is used in all 21 microarray experiments.

The genes are filtered based on: (a) average fluorescent intensity (level of expression) greater than 2,500 (gray level) across all 21 samples, (b) average spot area greater than 40 pixels across all 21 samples, and (c) no more than one sample in which the spot area is zero pixels.

The ratios included in the downloadable data file were normalized for each experiment such that the majority of the gene expression ratios from a pre-selected internal control gene set were around 1.0. No log transformation was applied to the downloadable data.

This dataset was previously used to evaluate the performance of classification models [7, 8], to test gene selection methods [9, 10], to evaluate multiclass classification models [6] and to check imputation methods [11], to cite a few.

We have used this dataset to demonstrate the usefulness of p-DPLS with reject option to reject samples from classes not modeled in the training step.

Prostate dataset

The prostate cancer dataset was published by Singh et al. in [12] and it is available at [13]. After filtering, it has 50 non-tumour samples and 52 tumour samples with 12600 gene expressions.
The pre-processing was detailed in the supplementary information document (SuppInfo_CCv3.pdf). Briefly, the data were scaled to a reference intensity (mean average difference of all genes present in the microarrays). The genes with average differences below 10 were filtered. Equivalently, the maximum threshold was set at 16000. After thresholding, the relative variation of expression for each gene was determined by dividing the maximum expression (Max) of the gene among all samples by the minimum expression (Min). The absolute variation in expression was determined by subtracting the minimum (Min) from the maximum (Max). The genes with (Max/Min) < 5 or (Max-Min) < 50 were also filtered.

The downloadable matrix is a tab-delimited text file that contains expression values in Affymetrix's scaled average difference units. Rows 1 to 3 contain the identification of the samples, the scale factor of each microarray (sample) and the number of genes, respectively. Associated with each average difference expression number there is a P, M, or A label that indicates whether RNA for the gene is present, marginal, or absent, respectively (as determined by the GeneChip software), based upon the matched and mismatched probes for the genes.

This dataset was previously studied in gene selection studies and used to evaluate the performance of classification methods. To cite a few, Dettling et al. [14] used this dataset (and others) to demonstrate that when bagging was used as a module in boosting, the resulting classifier consistently improved the predictive performance; Díaz-Uriarte et al. in [15] used this dataset to check gene selection and the performance of a classification using random forest; and Jeffery et al. in [16] used this dataset to compare different gene selection methods (and the lists of genes generated by each one) and different classifiers.
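The thresholding and variation filter described for the prostate dataset can be sketched as below. This is an illustration, not the original pre-processing script: it assumes the expression matrix has genes in rows and samples in columns, and it treats the floor and ceiling thresholds as clipping of the expression values, which is an interpretation of the text; the function name and default values (taken from the figures quoted above) are otherwise hypothetical.

```python
import numpy as np


def threshold_and_filter(X, floor=10.0, ceiling=16000.0,
                         min_fold=5.0, min_diff=50.0):
    """Clip expression values and drop genes with too little variation.

    X : array of shape (n_genes, n_samples)
    Returns the filtered matrix and the boolean mask of kept genes.
    """
    X = np.clip(np.asarray(X, dtype=float), floor, ceiling)
    gene_max = X.max(axis=1)
    gene_min = X.min(axis=1)
    # Genes with Max/Min < min_fold or Max - Min < min_diff are filtered out.
    keep = (gene_max / gene_min >= min_fold) & (gene_max - gene_min >= min_diff)
    return X[keep], keep
```

The same function covers the variation filter described for the leukemia dataset by calling it with `floor=100`, `ceiling=16000`, `min_fold=5` and `min_diff=500`.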
In this thesis, this dataset has been used to check the outlier detection and gene selection methods implemented in the p-DPLS classifier.

Small round blue cell tumour dataset

The small round blue cell tumours of childhood dataset was published by Khan et al. in [17] and it is available at [18]. The pre-processing of the data is detailed in the Supplemental Methods document.

Initially, the expression levels from 6567 genes were measured for each one of the 88 analyzed samples (of which 63 were labelled as calibration samples and 25 were blind tests). In the analysis the red intensity (ri) and the relative red intensity (rri) were used. Genes were omitted if ri was less than 20 for any of the samples. This mainly removed spots for which the image analysis failed. With this cut only 2308 genes were left.

The final downloadable dataset is a tab-delimited text file that contains the natural logarithm of the relative red intensity (rri) for all the 88 samples and 2308 genes.

This dataset was previously used to check gene selection methods [19, 20], to compare different linear discriminant methods [21] or to evaluate multi-class classification methods [22].

We have used this dataset to check the ability of the proposed outlier detection method to detect samples from classes not modeled in the training step of the p-DPLS models. Furthermore, it has been used to demonstrate the ability of PLS combined with linear discriminant analysis (LDA) to perform multi-class classification.

Non-small cell lung cancer

The non-small cell lung cancer (NSCLC) dataset was published by Kuner et al. in [23]. The dataset consists of 58 samples of the two major histological subtypes of lung cancer, 40 from adenocarcinoma and 18 from squamous cell carcinoma. For each one, 54675 gene expressions were analysed. The data were normalized by the gcRMA method published by Wu et al. in [24].
From the initial 60 hybridizations, two microarray hybridizations (PatID 42 and 421) failed the quality criteria due to local hybridization artefacts and were excluded from further analysis.

The data are available at the NCBI GEO database [25] with the dataset identification GSE10245. Raw data are provided as supplementary files, one for each sample. All samples are grouped in a matrix in the Series Matrix File. This is a tab-delimited file with the hybridizations of the 58 samples.

This dataset was recently published (year 2009) and, as far as we know, it has not yet been used to check classifiers or gene selection methods. It has only been used as a reference in biological studies of lung cancer.

We have used the non-small cell lung cancer dataset to verify the usefulness of the proposed gene selection method and to show the influence that the initial divisions of the datasets (i.e. the splits of the dataset into a training and a test set) have on the accuracies of the classification models.

Leukemia dataset

The leukemia dataset was published by Armstrong et al. in [26] and it is available at [27]. This dataset consists of 72 samples of acute lymphoblastic leukemias carrying a chromosomal translocation that gives rise to three subtypes of samples: 24 samples of acute lymphoblastic leukemia (ALL), 20 samples of mixed lineage leukemia (MLL) and 28 samples of acute myeloid leukemia (AML). For each sample 12582 gene expressions were analysed.

The downloadable data is a tab-delimited text file. The file contains Affymetrix "average difference" expression values for all samples. The data are already scaled as detailed in the File info document. Linear scaling is used to reduce technical noise due to global intensity differences between scans. Linear regression of all "Present" genes (Affymetrix "P" calls) was used to determine the scaling factor for each scan (the first ALL scan was used as a reference).
The scaling factor was applied to expression values (regardless of A/P call). Scaling factors ranged from 0.93 to 2.1; all scaling factors are shown in the scan id file.

Once the dataset has been obtained, the user must pre-process it as described by the authors in [26]: a floor threshold and a ceiling threshold were set at 100 units and 16000 units, respectively. After this pre-processing, gene expression values were subjected to the variation filter. The variation filter tests for fold-change and absolute variation over samples by comparing max/min and max-min intensities. The max/min filter was set at 5 and the max-min filter at 500 for all experiments.

This dataset had previously been used to compare different gene selection methods [20] and to check different multi-class classification methods and strategies [20, 28, 29].

We have used this dataset to show the ability of the multi-class classifier proposed in this thesis, which combines PLS and LDA.

[1] Lu, J., et al., MicroRNA expression profiles classify human cancers. Nature Letters, 2005. 435: p. 834-838.
[2] http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=114.
[3] Lodes, M.J., et al., Detection of Cancer with Serum miRNAs on an Oligonucleotide Microarray. PLOS One, 2009. 4: p. e6229.
[4] Mishra, P.J. and J.R. Bertino, MicroRNA polymorphisms: the future of pharmacogenomics, molecular epidemiology and individualized medicine. Pharmacogenomics, 2009. 10: p. 399-416.
[5] Hedenfalk, I., et al., Gene Expression profiles in hereditary breast cancer. The New England Journal of Medicine, 2001. 344: p. 539-548.
[6] http://research.nhgri.nih.gov/microarray/NEJM_Supplement/
[7] Boulesteix, A.-L., PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 2004. 3: p. article 33.
[8] Raza, M., et al., Comparative Study of Multivariate Classification Methods using Microarray Gene Expression Data for BRCA1/BRCA2 Cancer Tumors. Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), IEEE, 2005. 2: p. 475-480.
[9] Pettersson, F. and A. Berglund, Interpretation and validation of PLS models for microarray data. Chemometrics and Chemoinformatics, ACS Symposium series, 2005. 894: p. 31-40.
[10] McLachlan, G.J., R.W. Bean, and L.B.-T. Jones, A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics, 2006. 22: p. 1608-1615.
[11] Branden, K.V. and S. Verboven, Robust data imputation. Computational Biology and Chemistry, 2009. 33: p. 7-13.
[12] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[13] http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.
[14] Dettling, M., BagBoosting for tumour classification with gene expression data. Bioinformatics, 2004. 20: p. 3583-3593.
[15] Díaz-Uriarte, R. and S.A.d. Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7: article 3.
[16] Jeffery, I.B., D.G. Higgins, and A.C. Culhane, Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics, 2006. 7: p. 359-375.
[17] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
[18] http://research.nhgri.nih.gov/microarray/Supplement/.
[19] Zhu, S., et al., Feature Selection for Gene Expression Using Model-Based Entropy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010. 7: p. 25-36.
[20] Mohamad, M.S., et al., Three-Stage Method for Selecting Informative Genes for Cancer Classification. IEEJ Transactions on Electrical and Electronic Engineering, 2009. 4: p. 725-730.
[21] Huang, D., et al., Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data. Journal of Experimental & Clinical Cancer Research, 2009. 28: p. 149:156.
[22] Chetty, G. and M. Chetty, Multiclass Microarray Gene Expression Analysis Based on Mutual Dependency Models. Pattern Recognition in Bioinformatics, Proceedings. Lecture Notes in Bioinformatics, 2009. 5780: p. 46-55.
[23] Kuner, R., et al., Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer, 2009. 63: p. 32-38.
[24] Wu, Z., et al., A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association, 2004. 99: p. 909-17.
[25] http://www.ncbi.nlm.nih.gov/geo/.
[26] Armstrong, S.A., et al., MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 2002. 30: p. 41-47.
[27] http://research.dfci.harvard.edu/korsmeyer/MLL.htm.
[28] Anand, A. and P.N. Suganthan, Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates. Journal of Theoretical Biology, 2009. 259: p. 533-540.
[29] Wang, X. and O. Gotoh, Accurate molecular classification of cancer using simple rules. BMC Medical Genomics, 2009. 2: p. 64-87.
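The gene filters described in this appendix (the red-intensity cut for the small round blue cell tumour dataset, and the floor/ceiling thresholds followed by the variation filter for the leukemia dataset) can be illustrated with a short Python sketch. It is not the code used by the original authors or in this thesis. The function names, the genes-by-samples array layout and the inclusive comparisons (a gene is kept when max/min >= 5 and max-min >= 500) are assumptions made for illustration only.

```python
import numpy as np


def srbct_intensity_filter(ri, rri, min_intensity=20.0):
    """Hypothetical helper: drop genes whose red intensity (ri) is below
    the cut in ANY sample, then take the natural log of rri.

    ri, rri: arrays of shape (n_genes, n_samples).
    Returns (log_rri_of_kept_genes, boolean_keep_mask).
    """
    ri = np.asarray(ri, dtype=float)
    rri = np.asarray(rri, dtype=float)
    # A gene survives only if every sample reaches the minimum intensity.
    keep = (ri >= min_intensity).all(axis=1)
    return np.log(rri[keep]), keep


def leukemia_variation_filter(x, floor=100.0, ceiling=16000.0,
                              min_fold=5.0, min_diff=500.0):
    """Hypothetical helper: apply floor/ceiling thresholds, then keep genes
    that pass both the fold-change (max/min) and the absolute-variation
    (max - min) criteria across samples.

    x: array of shape (n_genes, n_samples) with scaled expression values.
    Returns (thresholded_values_of_kept_genes, boolean_keep_mask).
    """
    # Values below the floor are raised to it; values above the ceiling
    # are lowered to it.
    x = np.clip(np.asarray(x, dtype=float), floor, ceiling)
    gene_max = x.max(axis=1)
    gene_min = x.min(axis=1)  # never below `floor`, so the ratio is safe
    # Assumption: both criteria must hold, with inclusive comparisons.
    keep = (gene_max / gene_min >= min_fold) & (gene_max - gene_min >= min_diff)
    return x[keep], keep
```

Both functions return the boolean mask alongside the filtered matrix, so the same gene selection can be applied to the gene annotation table or to a separate test set.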
Abbreviations

AC        Adenocarcinoma
ALL       Acute lymphoblastic leukemia
AML       Acute myeloid leukemia
BL        Burkitt lymphomas
BRCA1     Breast cancer gene 1
BRCA2     Breast cancer gene 2
CDC       Closest distance to center
cDNA      Complementary deoxyribonucleic acid
CV        Cross validation
Cy3       Cyanine 3
Cy5       Cyanine 5
DA        Discriminant analysis
DNA       Deoxyribonucleic acid
DPLS      Discriminant partial least squares
EWS       Ewing family of tumours
FN        False negative
FP        False positive
GA        Genetic algorithms
HL        High limit
KNN       K nearest neighbours
LDA       Linear discriminant analysis
LL        Low limit
LOOCV     Leave one out cross-validation
LOWESS    Locally weighted scatterplot smoothing
MAplot    Ratio-intensity plot
miRNA     MicroRNA, noncoding RNA species
MLL       Mixed lineage leukemia
mRNA      Messenger ribonucleic acid
MVT       Ellipsoidal multivariate trimming
NB        Neuroblastoma
NN        Neural networks
NSCLC     Non-small cell lung cancer
OVA       One versus all
OVO       One versus one
PCA       Principal component analysis
Pcs       Principal components
PDF       Probability density function
p-DPLS    Probabilistic discriminant partial least squares
RMS       Rhabdomyosarcoma
RMSEC     Root mean square of calibration
RMSECV    Root mean square of cross validation
RMSEP     Root mean square of prediction
RN        Reject negative
RNA       Ribonucleic acid
RP        Reject positive
RPMBGA    Random probabilistic model building genetic algorithm
rRNA      Ribosomal ribonucleic acid
SCC       Squamous cell carcinoma
SEP       Standard error of prediction
SOS       Sparse optimal score
SR        Selectivity ratio
SRBCT     Small round blue cell tumour
SVM       Support vector machines
TN        True negative
TNR       True negative rate
TP        True positive
TPCR      Total principal component regression
TPR       True positive rate
tRNA      Transfer ribonucleic acid
VIP       Variable importance on projection
Publications

Cristina Botella, Joan Ferré, Ricard Boqué. Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009, 80(1): 321-329.

Cristina Botella, Joan Ferré, Ricard Boqué. Outlier detection and ambiguity detection for microarray data in probabilistic Discriminant Partial Least Squares Regression. Journal of Chemometrics, 2010, Accepted.

Cristina Botella, Joan Ferré, Ricard Boqué. Gene selection in microarray data based on selectivity ratio. 2010, Submitted.

Cristina Botella, Joan Ferré, Ricard Boqué. Multi-class classification of microarray gene expression data. 2010, Submitted.

Communications

Cristina Botella, Joan Ferré and Ricard Boqué
A new criterion for selecting the optimal number of factors in Discriminant-Partial Least Squares (DPLS). Application to microarray gene expression data.
VI Colloquium Chemiometricum Mediterraneum, Saint-Maximin, France. 2007
Poster communication

Cristina Botella, Joan Ferré and Ricard Boqué
A new performance criterion for classification methods for microarray gene expression data.
CAMDA (Critical Assessment of Microarray Data Analysis), Valencia, Spain. 2007
Poster communication

Cristina Botella, Joan Ferré and Ricard Boqué
Classification of tumour cells from gene expression data using Probabilistic DPLS with reject option.
III Workshop de Quimiometria, Burgos, Spain. 2008
Oral communication

Cristina Botella, Joan Ferré and Ricard Boqué
Reject option implementing outlier detection and ambiguity detection in the classification of microarray gene expression data.
11th Scandinavian Symposium on Chemometrics, Loen, Norway. 2009
Poster communication

Cristina Botella, Joan Ferré and Ricard Boqué
Gene selection in microarray data based on selectivity ratio.
VII Colloquium Chemiometricum Mediterraneum, Granada, Spain. 2010
Poster communication