...

MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA Cristina Botella Pérez

ISBN: 978-84-693-5427-8
Dipòsit Legal: T-1418-2010
WARNING. By consulting this thesis you accept the following conditions of use:
Dissemination of this thesis through the TDX (www.tesisenxarxa.net) and TDR (www.tesisenred.net)
services has been authorized by the holders of the intellectual property rights only for private
use within research and teaching activities. Reproduction for profit is not authorized, nor is its
dissemination or availability from a site outside the TDX or TDR service. Presenting its content
in a window or frame external to TDX or TDR (framing) is not authorized. These rights cover both
the presentation summary of the thesis and its contents. When using or citing parts of the thesis,
the name of the author must be indicated.
UNIVERSITAT ROVIRA I VIRGILI
MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA
Cristina Botella Pérez
ISBN:978-84-693-5427-8/DL:T-1418-2010
MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA
Cristina Botella Pérez
DOCTORAL THESIS
Supervised by
Dr. Joan Ferré Baldrich and Dr. Ricard Boqué Martí
Department of Analytical Chemistry and Organic Chemistry
Universitat Rovira i Virgili
Tarragona 2010
ROVIRA I VIRGILI UNIVERSITY
Department of Analytical Chemistry and Organic Chemistry
Dr. JOAN FERRÉ BALDRICH and Dr. RICARD BOQUÉ MARTÍ, associate professors of
the Department of Analytical Chemistry and Organic Chemistry at Rovira i Virgili
University
CERTIFY:
The Doctoral Thesis entitled "MULTIVARIATE CLASSIFICATION OF GENE
EXPRESSION MICROARRAY DATA", presented by CRISTINA BOTELLA PÉREZ to
receive the degree of Doctor of the Rovira i Virgili University, has been carried out
under our supervision, in the Department of Analytical Chemistry and Organic
Chemistry at Rovira i Virgili University, and all the results presented in this thesis were
obtained in experiments conducted by the above-mentioned student.
Tarragona, March 2010
Dr. Joan Ferré Baldrich
Dr. Ricard Boqué Martí
In the middle of difficulty
lies the opportunity
Albert Einstein
The last days of almost five years of travel are arriving... a road that has not been easy, full of
feelings, experiences and moments shared with many people. People who have been with me for
part or all of this thesis and whom I would not want to forget now that we seem to be reaching
the end.
Thanks to Dr. Joan Ferré and Dr. Ricard Boqué for trusting me.
Joan, Ricard, thank you for the advice and for giving me the opportunity to learn at your side.
Thanks to all the members of the Chemometrics, Qualimetrics and Nanosensors group for these years.
Thanks to the colleagues with whom I shared the beginning, the whole or the end of the doctorate.
And so, without wishing to forget anyone, thanks to the whole group for welcoming me as you did.
Thanks to Vero and Giselle for their support and encouragement, especially at the beginning. Thanks
to Idoia, Vane, Santi, Jordi, Jaume, Carol and Kris for all the good moments and the laughter of the
best coffee breaks.
Joe, in the end my office mate, so many hours shared and so many good moments; I will keep them
with me, thank you. Likewise, thanks to Marta S, for the encouragement, for caring and for putting
smiles into this thesis.
Montse, even from a distance, thank you. Thank you for your e-mails and your encouragement.
Also from a distance, thanks to Sílvia, Laia, David and Lluis; from our homelands you have
accompanied me day by day. Your words have always been important.
Laura, Antonio, Rafa ... this road has made sense thanks to you. THANK YOU for being the way
you are, never change.
Laura, THANK YOU. Thank you for caring about me, for our talks and for always having a
word of support and encouragement ready; thank you for sharing these years with me.
Antonio, what can I say... so many years together, THANK YOU. Thank you for the breaks, for the
laughs we have shared and that you managed to get out of me on the bad days. Thank you for
caring and for always being by my side.
Rafa, as so many times before, I have no words now either, simply THANK YOU. Thank you for your
support, your encouragement in the bad moments and your always well-chosen words. I will keep
our long talks with me. Thank you for listening to me and for always being there.
Tomàs, the person who has shared this road with me, who supported me in the hardest moments
and made it possible for me to reach the end, THANK YOU. Thank you for not letting me give up and
for helping me to look ahead at all times. I know it has not always been easy.
And what can I say about those who have almost done the thesis for me and with me... my parents,
THANK YOU. Thank you for always being by my side and supporting me in every one of my decisions,
believing in my possibilities more than anyone.
To all of you, just one more word: THANK YOU.
Table of contents
Structure
Chapter 1. Introduction
  1.1 Genetic expression
  1.2 Microarrays
    1.2.1 Microarray platforms and experimentation
    1.2.2 Microarray data
    1.2.3 Microarray applications
Chapter 2. Thesis objectives
Chapter 3. Discussion of the implementation of the reject option in Probabilistic-Discriminant Partial Least Squares
  3.1 Introduction
  3.2 Probabilistic discriminant partial least squares
    3.2.1 The partial least squares model
    3.2.2 The probability density function of a class
  3.3 Class prediction
    3.3.1 Classification based on probabilities
    3.3.2 Classification based on risk
  3.4 Discussion of class prediction
  3.5 Probabilistic discriminant partial least squares with reject option
    3.5.1 Reject option as a class
    3.5.2 Reject option as a threshold
  3.6 Implications of reject option in classification performance evaluation
  3.7 Conclusions
Chapter 4. Classification from microarray data using p-DPLS with reject option
Chapter 5. Outlier detection and ambiguity detection for microarray data in p-DPLS regression
Chapter 6. Gene selection based on selectivity ratio for probabilistic discriminant partial least squares
Chapter 7. Multi-class classification of microarray gene expression data
Chapter 8. Conclusions
Appendix
  Datasets
  Abbreviations
  Publications
  Communications
Structure
This thesis is structured in eight chapters.
Chapter 1. Introduction. This chapter gives an overview of DNA microarrays, their
origin, types and applications. The steps involved in the generation of the microarray
data, from hybridization to image acquisition and data pre-processing, are described.
The need for multivariate data analysis is justified. Finally, the multivariate methods used
for analyzing microarray data are cited, focusing on classification methods.
Chapter 2. Thesis Objectives. This chapter describes the aims of this thesis. These
objectives are developed in the publications included in the following chapters.
Chapter 3. Discussion of the implementation of the reject option in p-DPLS. This
chapter discusses the implementation of the reject option in p-DPLS. Firstly, the
calculation of the p-DPLS model and the class prediction process based on the Bayes
rule are detailed. Then, the limitations of classification based on the Bayes rule are
discussed. Two approaches to introducing a reject option that overcome these
limitations are presented. Finally, the implications of the
reject option for the evaluation of classifiers are discussed.
Chapter 4. Classification from microarray data using p-DPLS with reject option. This
paper (C. Botella, J. Ferré, R. Boqué, Talanta, 80 (2009) 321-328) describes the
implementation of a reject option in p-DPLS models in order to improve the
classification of microarray data. The reject option allows a p-DPLS model not to classify
outliers and ambiguous samples. This ensures that only the samples whose classification
is reliable enough are indeed classified. As a consequence, the number of
misclassifications decreases and the accuracy of the classifier improves.
Chapter 5. Outlier detection and ambiguity detection for microarray data in p-DPLS
regression. Outlier detection is often overlooked in microarray data analysis with factor-based
classification methods. However, outlier diagnostics are required when
implementing any classification method in real practice. In this paper (C. Botella, J.
Ferré, R. Boqué, Journal of Chemometrics (2010) Accepted) two procedures, typically
used in chemometrics, are combined with the reject option (chapter 4) to detect
outliers and ambiguous samples in p-DPLS. The application of these diagnostics
increases the accuracy of the p-DPLS models and avoids classifying samples from classes
that were not modelled.
Chapter 6. Gene selection based on selectivity ratio for probabilistic discriminant
partial least squares. Gene selection is a fundamental step in microarray data analysis.
It allows both identifying the genes that characterize a certain disease and
simplifying and improving classification models by discarding irrelevant genes. In this
paper (C. Botella, J. Ferré, R. Boqué, (2010) submitted) a gene selection procedure that
is specific for PLS is used to find the best subset of genes that discriminate between
different subtypes of tumours and also between healthy and tumour samples. The
procedure is based on selecting the genes that maximize the selectivity ratio (SR) index.
The paper also shows that the calculated accuracy of a classifier can be largely
influenced by how the dataset is split into a training set and a test set. Certain splits
can lead to a wrong assessment of the validity of the gene selection algorithm. A
repetitive procedure consisting of data split, gene selection, training and validation is
proposed in order to test the goodness of the genes selected with the SR index.
Chapter 7. Multi-class classification of microarray gene expression data. In most cases,
samples to be classified from microarray data may belong to more than two subtypes of
a disease. The p-DPLS approach used so far only allows discriminating between two
subtypes. This chapter (C. Botella, J. Ferré, R. Boqué, (2010) submitted) describes a
classification strategy to be used when there are more than two candidate classes. The
method combines the predictions from one-versus-one p-DPLS models with the Linear
Discriminant Analysis (LDA) classifier.
Chapter 8. Conclusions. This chapter sums up the improvements achieved by the
methods presented in this thesis.
The Appendix contains a description of the datasets used in this thesis, the list of the
abbreviations used, and the list of papers and presentations produced during this
period.
CHAPTER 1 Introduction
1.1 Genetic expression
Deoxyribonucleic acid (DNA) molecules are the genetic material of most living organisms [1].
They are chains of nucleotides (Figure 1). A nucleotide consists of a phosphate group, a
deoxyribose sugar molecule and a nitrogenous base (guanine, cytosine, adenine or thymine)
[1]. Genes are sequences of hundreds or thousands of these nucleotides that encode
the genetic information to make specific proteins [2].
Figure 1. DNA chain. Source: [3].
Protein formation involves a transcription process, in which the genes are transcribed into
messenger RNA (mRNA) by the RNA polymerase enzyme [1, 4], followed by a translation
process, in which the amino acids encoded by the mRNA codons are joined in the
presence of transfer RNA (tRNA) and ribosomal RNA (rRNA) (Figure 2).
Figure 2. Transcription and translation processes in the making of a protein. Source: [3].
Genes regulate protein expression and, consequently, the metabolic processes
of living organisms. Some genes are only expressed in particular cell types or at
certain developmental stages [5], so these genes (or their expressed intermediate,
mRNA) can be seen as markers that define particular cellular states, such as healthy or
tumour [6].
1.2 Microarrays
Microarray technology is a powerful tool for simultaneously evaluating the
expression level of thousands of genes in a cell [2] and, hence, the
information that is encoded in the DNA [6].
A microarray is a microscope slide that contains an ordered series of
DNA, RNA, proteins or tissues. DNA microarrays are the most common [7]. A DNA
microarray is generally a glass slide or a silicon chip on which thousands of gene
sequences are printed (Figure 3). On every spot, many copies of a specified DNA
sequence are chemically bonded to the surface of the slide [2]. The genes immobilized
onto the slide are called the DNA probe. The target DNA or target RNA (depending on
the microarray platform) obtained from the cell under study is hybridized (hydrogen
bonded) over this DNA probe. The amount of hybridization is measured and related to
the presence and expression of certain genes in the cell.
Figure 3. Parts of a microarray. 1. Slide, 2. Probe DNA, 3. Target DNA. Source: Affymetrix.
Figure 4 shows the workflow of a microarray experiment. The experimental
process varies depending on the microarray platform that is used (see below). After
the data have been measured and pre-processed, multivariate analysis is needed to deal
with the large amount of data that every microarray experiment generates [7-10].
Figure 4. Microarray workflow process.
  Experimental design: biological question; goal.
  Experimental process: probe and target DNA preparation; printing; hybridization.
  Data extraction: data acquisition; data pre-processing; quantification.
  Data analysis: gene selection; cluster analysis; data classification.
1.2.1 Microarray platforms and experimentation
The first DNA array was developed by Ed Southern in 1975 [10]. Southern noticed that
labelled nucleic acid molecules could be used to evaluate other molecules attached to a
solid support. He used the array to verify the presence or absence of a specific
DNA sequence from different sources and to identify the size of the restriction
fragment.
In 1995, an in-situ probe synthesis method for photolithographically manufacturing DNA
arrays was developed by Fodor et al. [11] and commercialized by Affymetrix Inc. At the
same time, pre-synthesized DNA microarrays were popularized by Patrick O. Brown's
laboratory at Stanford University [12], which published step-by-step plans for building a
robotic DNA arrayer [13]. This was, together with the development of the Southern
blot, one of the milestones in microarray development, because Brown's
method made microarrays affordable for research laboratories, whereas the early
methods for manufacturing miniaturized DNA arrays using in-situ probe synthesis
required sophisticated and expensive robotic equipment.
Nowadays, there are two main microarray platforms, namely cDNA arrays (where c
means complementary) and oligonucleotide arrays. They differ in the preparation and
content of the probe, and also in the sample preparation [2, 5] (Table 1). Figure 5
shows the experimental procedure in a cDNA microarray experiment and in an in-situ
oligonucleotide microarray experiment.
In cDNA microarrays, the probes are cDNAs typically 100-300 bases long. A cDNA
strand is a DNA strand synthesized using a reverse transcriptase enzyme, which makes
a DNA sequence complementary to the mRNA present in cells [2]. Note that the
commonly called DNA microarrays are actually cDNA microarrays. The target sample
consists of Cy5-labelled cDNA chains from the test sample and Cy3-labelled cDNA chains
from the reference sample [2, 5]. After the sample has been hybridized, microarrays
are washed for several minutes in buffers of decreasing salt concentration and finally
dried, either by centrifugation of the slide or by a rinse in isopropanol followed by quick
drying with nitrogen gas or filtered air [7]. The raw microarray data are obtained by
exciting the fluorescent dyes at each spot and scanning the microarray. One intensity
value is generated by the emission from the Cyanine 3 (Cy3) fluorophore and another
from Cyanine 5 (Cy5). The total fluorescence emitted by the spot at each wavelength is
proportional to the total amount of the dye in the spot. Hence, it is proportional to the
total amount of reference or test sample hybridized. When the images of both dyes
(colour channels) are merged, the typical microarray picture is obtained [1, 7, 14]. The
colours in the microarray image correspond to the four possible situations of
microarray hybridization (Figure 6): no hybridization (black spot), reference sample
hybridization (green spot), test sample hybridization (red spot) and test and
reference sample hybridization (yellow spot). Different intensities of the colours
indicate different levels of hybridization.
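The four colour situations described above can be illustrated with a minimal Python sketch. The function name `spot_colour` and the detection threshold of 100 intensity units are hypothetical choices made only for this illustration; the text does not specify how a channel is judged to be "on".

```python
def spot_colour(cy5, cy3, threshold=100.0):
    """Name the colour of a spot in the merged two-channel image.

    cy5: intensity of the red channel (test sample).
    cy3: intensity of the green channel (reference sample).
    threshold: hypothetical intensity above which a channel counts as hybridized.
    """
    test_on = cy5 >= threshold       # test sample hybridized
    ref_on = cy3 >= threshold        # reference sample hybridized
    if test_on and ref_on:
        return "yellow"              # both samples hybridized
    if test_on:
        return "red"                 # only the test sample hybridized
    if ref_on:
        return "green"               # only the reference sample hybridized
    return "black"                   # no hybridization


for cy5, cy3 in [(50.0, 20.0), (50.0, 800.0), (900.0, 20.0), (900.0, 800.0)]:
    print(cy5, cy3, spot_colour(cy5, cy3))
```

In a real image the colour is continuous, since the intensity of each channel reflects the level of hybridization; the hard threshold here only serves to separate the four cases.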
DNA microarray images from different samples are then transformed into gene
expression data matrices. Each row of the matrix corresponds to a sample and each
column corresponds to a gene. Each value characterizes the expression level of the
particular gene in that particular sample. The gene expression is given by the ratio
between the intensities in the red and the green channels, which are directly related to
the level of expression of the transcript [1].
In the in-situ oligonucleotide arrays, produced by Affymetrix, each gene is represented
by a probe set of 10-25 oligonucleotide pairs¹ instead of one full-length or partial cDNA
clone. These probes are synthesized directly on the surface of the support. The target
sample is a biotin-labelled cRNA (Figure 5) [2, 7]. In contrast to the spotted cDNA arrays, in this
case the test and the reference sample are hybridized separately on different chips;
then, data acquisition is done by scanning the probe array. The scan yields about 8 × 8 pixels (on
average) for each probe cell. A single intensity value for every probe cell, representative
of the hybridization level of its target, is derived. Finally, the gene expression is given
by the differences between PM and MM [1]. The gene expressions of all genes analysed for a
sample are given in a row of the gene expression matrix.
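As a rough illustration of how a probe set could be reduced to one expression value, the sketch below averages the PM − MM differences over the probe pairs of a gene. The averaging step and the function name `probe_set_signal` are assumptions made for this example; the text only states that the expression is given by the differences between PM and MM.

```python
def probe_set_signal(pm, mm):
    """Average the PM - MM differences over the probe pairs of one gene.

    pm, mm: equally long lists with the Perfect Match and Mismatch
    intensities of each probe pair in the probe set.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    differences = [p - m for p, m in zip(pm, mm)]
    return sum(differences) / len(differences)


# Hypothetical probe set with three probe pairs.
print(probe_set_signal([120.0, 200.0, 150.0], [20.0, 50.0, 40.0]))
```

One such value per gene, computed for every gene on the chip, would fill one row of the gene expression matrix.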
Table 1. Types of microarrays. Source: [1].
Probe              Arraying technique   Microarray platform
cDNA               Robotic spotting     Spotted cDNA microarrays
Oligonucleotides   Robotic spotting     Spotted oligonucleotide microarrays
Oligonucleotides   In-situ synthesis    In-situ oligonucleotide microarrays

¹ The oligonucleotide pair (probe pair) comprises one oligonucleotide that perfectly matches the gene
sequence (Perfect Match, PM) and a second oligonucleotide having one nucleotide mismatch in the middle
of it (Mismatch, MM).
Figure 5. cDNA and in-situ oligonucleotide microarray sample preparation, hybridization and data
measurement.
  cDNA arrays: RNA extraction from test and reference cells or tissue; reverse transcription and
  labelling (Cy5-labelled and Cy3-labelled cDNA); mixing and hybridization; laser excitation and
  emission reading for Cy5 and Cy3; expression as log2(Cy5/Cy3).
  In-situ oligonucleotide arrays: RNA extraction (total RNA) from test and reference samples; cDNA
  synthesis with a T7 promoter; double-stranded cDNA; in vitro transcription to biotin-labelled cRNA;
  expression as log(PM − MM).
  Output: gene expression data matrix (genes, samples).
1.2.2 Microarray data
The experimental steps involved in a microarray workflow, from microarray
manufacture to microarray data extraction (Figures 4 and 5), may introduce noise and
variability in the data. Common sources of variability in microarray experiments are
variations related to microarray manufacturing and variations related to microarray
scanning [7]. Variability related to microarray manufacturing is due to dye effects, slide
effects or print-tip effects. The variability of microarray scanning is due to scanner
manufacturing and to non-specific background. The most common origins of both
[15] are summarized in Table 2. To minimize the effect of the sources of variation that
may affect microarray data, proper data pre-processing is fundamental in microarray
data analysis. This pre-processing transforms the data to make them suitable for
analysis [1]. Pre-processing of microarray data is done in the steps described next [16].
Table 2. Sources of variation in microarray data.
Microarray manufacturing
  Dye effects
    - Different incorporation of dyes
    - Dye instability
    - Gene-label interaction
  Slide effects
    - Printing variability
    - Different pin efficiency over time
    - Array coating
    - Slide inhomogeneities
    - Efficiency of the hybridization reaction
    - Background noise on the slide
    - Different amounts of RNA of probes and DNA target sample
  Spatial, print-tip or plate effects
    - Temperature and humidity
    - PCR amplification
    - Sample preparation protocols
Microarray scanning
    - Scanner manufacture, for example a wrongly adjusted or misaligned laser
    - Non-specific background and overshining, non-specific radiation and signals from neighbouring spots
    - Image analysis, non-linear transmission characteristics, saturation effects and variations in spot shape
Abbreviations. PCR: polymerase chain reaction.
Background subtraction
Signal intensities of a gene include contributions from non-specific hybridization and
other fluorescence from the glass. This background fluorescence is estimated from
the pixels that are near the feature but are not part of a spot [17]. The local background
for each channel and spot is evaluated from small regions surrounding the spot
mask (region 2 in Figure 6). Then, the median or the mean of the pixel values in this region
is calculated for each channel and subtracted from the spot intensity [14].
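The local background correction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot intensity is summarized here with the same statistic (median or mean) as the background, and the function name and pixel values are hypothetical.

```python
from statistics import mean, median


def background_corrected(spot_pixels, background_pixels, use_median=True):
    """Subtract the local background from the spot intensity for one channel.

    spot_pixels: pixel values inside the spot mask (feature pixels).
    background_pixels: pixel values in the region surrounding the spot.
    use_median: summarize with the median if True, with the mean otherwise.
    """
    centre = median if use_median else mean
    return centre(spot_pixels) - centre(background_pixels)


# Hypothetical pixel values for a single spot in one channel.
spot = [900, 1000, 1100]
background = [90, 100, 110]
print(background_corrected(spot, background))
```

The same call would be repeated for each channel and each spot. A negative result means that the background is brighter than the spot.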
A less common alternative calculates a global background for each slide: the average
intensity of the negative control spots, which are the empty spots, is used as the
background value.
Figure 6. Scanned microarray image. 1. Feature pixels, 2. Background pixels, 3. Two-pixel exclusion region.
Source: GenePix Pro [17].
In in-situ oligonucleotide arrays a local background is calculated for each probe and
then a weighted combination of these backgrounds is subtracted from all the probes
of the microarray.
Treatment of missing values
Microarray datasets frequently contain missing values, either because the spot is
empty (intensity = 0), or because the background intensity is higher than the spot
intensity (background-corrected intensity < 0). These values need to be deleted or
estimated and replaced, in a process called imputation, before subsequent data mining
[18].
In imputation, the missing values may be replaced by 1 (since log(1) = 0, which
means no gene expression) or by the mean of the intensities of the gene
across all the samples.
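The second imputation strategy (replacing a missing value by the mean of that gene over all samples) might look like the following sketch. It assumes that missing values are marked with `None` and that the matrix is stored as a list of rows (samples) with one entry per gene; when a gene has no observed value at all, the sketch falls back to the value 1 mentioned above. These conventions are illustrative choices, not part of the original text.

```python
def impute_gene_mean(matrix):
    """Replace missing values (None) by the mean of the gene over all samples.

    matrix: list of rows, one row per sample and one column per gene.
    Returns a new matrix; the input is left unchanged.
    """
    result = [list(row) for row in matrix]
    n_genes = len(matrix[0])
    for j in range(n_genes):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Fall back to 1 (log(1) = 0) when the gene was never observed.
        fill = sum(observed) / len(observed) if observed else 1.0
        for row in result:
            if row[j] is None:
                row[j] = fill
    return result


data = [[10.0, None],
        [20.0, 4.0],
        [None, 6.0]]
print(impute_gene_mean(data))
```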
In Affymetrix datasets in particular, when the intensity of the Mismatch probe cell is
higher than the Perfect Match intensity, the probe pair has no physiological meaning. In such
a case, a value called Change Threshold is used instead of the Mismatch intensity [7].
Filtering bad data
Filtering excludes from the data the observations that do not fulfil a pre-formulated
presumption [4], for example intensity values that are too low to be trusted because of
instrumental limitations of the scanner. Typically, the lowest intensity value of
reliable microarray data, referred to as the "floor", is 10. Values below the floor are usually
removed (filtered) from the data because they are not reliable enough. Similarly,
array elements at the high end of the fluorescence intensities may saturate the
detector. The threshold referred to as the "ceiling" is set at 16,000, and values above
the ceiling are removed too [4, 19].
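A minimal Python sketch of this floor and ceiling filter is shown below. The thresholds 10 and 16,000 are the values quoted in the text; marking removed values with `None` and the function name `filter_intensities` are assumptions made for the example.

```python
FLOOR = 10.0       # lowest intensity considered reliable
CEILING = 16000.0  # intensities above this may have saturated the detector


def filter_intensities(values, floor=FLOOR, ceiling=CEILING):
    """Mark as missing (None) the intensities outside [floor, ceiling]."""
    return [v if floor <= v <= ceiling else None for v in values]


print(filter_intensities([5, 10, 250, 16000, 20000]))
```

The filtered positions can then be handled like any other missing value, for instance by imputation.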
Fold change, log2 (two-fold)
In cDNA microarrays the expression of a gene in a sample is the ratio of the intensities
in both channels for that gene. Although these ratios provide an intuitive measure of
expression changes, they have the disadvantage of treating up- and down-regulated
genes differently. Genes up-regulated by a factor of 2 have an expression ratio of 2,
whereas those down-regulated by the same factor have an expression ratio of 0.5. The
most widely used transformation of the ratio is the logarithm base 2, which treats
up-regulated and down-regulated genes symmetrically, so that a gene up-regulated by a
factor of 2 has a log2(ratio) = 1, a gene down-regulated by a factor of 2 has a log2(ratio)
= −1, and a gene expressed at a constant level (with a ratio of 1) has a log2(ratio) = 0.
So, log2(ratio) will be used to represent expression levels [19].
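The symmetry of the log2 transformation can be checked with a short Python sketch. The function name `log2_ratio` and the intensity values are illustrative only.

```python
import math


def log2_ratio(cy5, cy3):
    """Return log2 of the test (Cy5) to reference (Cy3) intensity ratio."""
    return math.log2(cy5 / cy3)


# Up-regulated by 2, down-regulated by 2, and unchanged.
for cy5, cy3 in [(200.0, 100.0), (100.0, 200.0), (150.0, 150.0)]:
    print(cy5 / cy3, log2_ratio(cy5, cy3))
```

The raw ratios 2 and 0.5 lie at different distances from 1, whereas their log2 values lie at the same distance from 0 on either side.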
In some cases the log transformation may be too "strong" and have the effect of
increasing the importance of the low intensities. Then, a weaker transformation such as a
cube root is used [6].
Normalization
Normalization consists of removing arbitrary variations in the measured gene
expression levels of hybridized samples so that biological differences (different gene
expressions) can be more easily distinguished. Table 3 summarises the main
normalization criteria used and the systematic variation they remove.
The most widely used method is the LOcally WEighted Scatterplot Smoothing (LOWESS)
correction [20] for non-linear data, and total intensity normalization or median
subtraction otherwise. In the analysis of in-situ microarrays a separate probe array
experiment is performed, which is used by scaling techniques to minimize differences
in overall signal intensities between the two arrays, allowing for a more reliable
detection of biologically relevant changes in the samples [1, 7].
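The two simpler normalizations named above can be sketched as follows. Both functions are illustrations under stated assumptions rather than the procedures of the cited references: median subtraction is applied to the log ratios of one array, and total intensity normalization rescales the Cy3 channel so that both channels have the same total intensity. LOWESS is not shown because it requires a local regression routine.

```python
from statistics import median


def median_centre(log_ratios):
    """Subtract the array median so that the log ratios are centred on 0."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]


def total_intensity_scale(cy5, cy3):
    """Rescale Cy3 so that its total intensity equals that of Cy5."""
    factor = sum(cy5) / sum(cy3)
    return [g * factor for g in cy3]


print(median_centre([0.5, 1.5, 2.5]))
print(total_intensity_scale([100.0, 300.0], [50.0, 150.0]))
```

Both corrections rest on the assumption that most genes are not differentially expressed, so that a global shift between channels reflects a systematic effect and not biology.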
Table 3. Strategies for microarray data normalization.
Columns (systematic variation removed): dye effects; slide effects; spatial effects; scanner effects; between arrays.
Normalization methods (rows):
  - LOWESS correction for each print-tip [7, 15, 16]
  - Linear correction [16, 21]
  - Total intensity normalization [19]
  - Two dyes, Cy3 and Cy5 [7, 15]
  - Double dye experimentation, dyeing a sample once with Cy5 and with Cy3 in the second experiment [19, 22]
  - Scaling of ratio values across the slides [19, 22]
  - Housekeeping genes [15]
(The marks relating each method to the effects it removes are not preserved in this transcript.)
1.2.3 Microarray applications
The first microarray paper featured the small mustard plant Arabidopsis thaliana [23],
but the technology quickly spread to yeast [24], mouse [25], and human [26, 27]
studies.
The main present applications of microarrays [28] include the identification of the genetic
individuality of tissues or organisms (e.g. detection of single nucleotide
polymorphisms, SNPs) [7, 29], the investigation of cellular states and processes (such
as the sporulation process) [30], the diagnosis of genetic and infectious diseases [31-33],
the identification of the subtypes of a certain disease [34, 35], the detection of
genetic warning signs [36] and drug selection [37].
Among the many areas of investigation, oncology has become the main field of DNA
microarray applications [38]. General aspects of cancer expression profiling have been
extensively reviewed [39-41]. It has been shown that subclassification of tumours
based on their molecular profiles may help to explain why these tumours respond
differently to treatment. Golub et al. [34] were the first to use microarray gene
expression data to distinguish between acute myeloid leukemia and acute lymphocytic
leukemia. Subsequent studies allowed distinguishing samples of adult versus paediatric
leukemia [42], different subtypes of leukemia [43] and their molecular characterization
[44]. Recently, Su et al. [45] and Ross et al. [46] used large-scale RNA profiling to
construct a molecular classification of different carcinomas (prostate, lung, ovary,
colorectum, kidney, liver, pancreas, bladder/urethra, and gastroesophagus). Additional
research on diagnosis by genetic profiling has been done for different cancers [47, 48].
In breast cancer, microarrays permitted differentiating between tumour types
corresponding to BRCA1, BRCA2 and sporadic mutations [13, 49], between
estrogen receptors [50] and between the stages of
cancer progression [31]. In melanoma, most of the effort has been applied to
differentiating between metastatic and non-metastatic tissues [51, 52], and in
hepatocellular carcinoma the research has involved tracking cancer progression
[53]. In other types of tumours, diagnosis has been the main target. This is the case
of bladder cancer [54], cutaneous squamous cell cancer [55], and lung cancer [56]. In
the fields of colon [57], prostate [23], liver [58], glioma [59] and epithelial [60] cancers,
the research has focused on the differentiation between tumour and normal tissues,
and in the case of lymphoma [35], medulloblastoma [61] and adenocarcinoma [62] on
the differentiation between their different subtypes.
In non-oncological clinical diagnosis, DNA microarrays are used to search for the
expression patterns characteristic of complex genetic disorders [47] such as diabetes
[33], obesity [63, 64], and schizophrenia [65]. Microarrays have also been used in
transplantation research, for example in renal transplantation to generate gene
expression profiles of renal biopsies for the diagnosis of acute rejection [66], and in
the diagnosis of infectious diseases, to detect gene sequences in the genomes of
Mycobacterium tuberculosis, HIV [67, 68], and other pathogens, with the aim of
providing a diagnostic tool that detects the expression of antibiotic resistance genes or
specific viral subtypes [38].
Another important application of DNA microarrays is the identification of the genes
that are responsible for a certain disease [48]. One of the first papers that reported the
use of microarrays for this purpose identified the genes differentially expressed
between a rat strain with insulin resistance and a normal insulin-sensitive control strain
[69]. Since that study, microarrays have been applied to identify genes involved in
many different cancers [70-72], in tumour progression [73] and in many other
clinical fields such as neuronal diseases [74, 75]. In the last few years many methods
have been developed to identify the most relevant genes for a certain diagnosis. Three
major groups of methods exist: filters, wrappers and embedded techniques [76]. These
methods have been based on Genetic Algorithms [77], Random Forests [78], the weights
of SVM [79], t-tests or the Wilcoxon test [80], to cite a few. Most of these criteria are
univariate (i.e. each feature is evaluated independently) and thus simple to interpret, but
they omit interactions and correlations between genes during gene selection [81].
These interactions should nevertheless be taken into account, since it has been shown that
there exist pairs of genes that are coexpressed. Put simply, if we find that the
expression levels of two genes are similar, we can hypothesize that the
respective genes are co-regulated and possibly functionally related [82]. More
accurately, these coexpressions have been demonstrated on the basis of the correlation of
expression profiles or of functional and chromosomal structural information [83, 84].
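The contrast between a univariate filter and the correlation between genes can be sketched in a few lines of Python. The fragment ranks genes one at a time with a Welch t-statistic and then computes the Pearson correlation between two expression profiles, which is the kind of relationship a univariate criterion does not see. The gene names, the toy expression values and the helper functions are illustrative assumptions, not data or code from the cited studies.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic comparing one gene's expression in two classes."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def pearson(u, v):
    """Pearson correlation between two expression profiles of equal length."""
    mu, mv = mean(u), mean(v)
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    return cov / sqrt(sum((x - mu) ** 2 for x in u) * sum((y - mv) ** 2 for y in v))


def rank_genes(expression, labels):
    """Univariate filter: rank genes by |t|, each gene scored independently."""
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: one expression profile per gene, one value per sample.
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 3.0, 2.8],
    "gene_b": [2.0, 2.4, 1.8, 6.2, 6.0, 5.6],  # twice gene_a: coexpressed
    "gene_c": [1.5, 0.4, 1.1, 0.9, 1.4, 0.6],  # unrelated to the class labels
}
labels = [0, 0, 0, 1, 1, 1]

ranking = rank_genes(expression, labels)
redundancy = pearson(expression["gene_a"], expression["gene_b"])
```

In this toy case the filter places both gene_a and gene_b ahead of gene_c, although the two carry the same information; only the correlation reveals that they are redundant.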
Linked with the identification of the genes that are responsible for a disease,
microarrays have also been applied to find mutations that are responsible for the
disease phenotype. Although there are numerous methods for identifying
mutations, microarrays may best satisfy the need for a rapid, accurate and cost-effective
method for genetic polymorphism identification [47]. This identification has been
presented as the foundation of pharmacogenomics. In the near future, pharmacogenomics
aims to optimize the dose and drug formulation and to predict good and adverse
clinical responses to individual drugs, using microarrays for personalized medicine [38,
47, 85].
The huge amount of data generated in each microarray experiment calls for the use of
multivariate techniques for their analysis. In one of the first studies with microarrays,
Golub et al. [34] applied a two-cluster self-organizing map to group 38 samples of
leukemia into two classes. Eisen et al. [86] used hierarchical clustering to find
genes with similar functions. Hierarchical clustering has also been used to discover two
molecularly distinct types of diffuse large B-cell lymphoma, in which the patients in the
two subgroups showed significant differences in overall survival [35], and to categorize
breast cancer into its subtypes [87]. PCA has been applied to discriminate between
different tumour tissues, including colon carcinoma, breast carcinoma, central nervous
system tumour, lung cancer, leukemia, melanoma, ovarian carcinoma, and prostate
cancer [88]. The same analysis has been performed with k-means clustering [88].
Multivariate supervised classification methods are probably the most important tools
for microarray data analysis. Such methods can be used to identify differentially expressed
genes, to find subgroups of samples, to differentiate between different states of a
tumour and to infer the class of a sample from its gene expression microarray data. In
general terms, the aim of any classifier is to build a decision rule from a preclassified
dataset and use it to assign a new unlabeled sample to one or none of the predefined
classes. A large number of classification methods have been used in microarray data
analysis. The main studies are summarized in Table 4.
Table 4. Classification references for microarray gene expression data.

| Classification method | Objective of the study |
|---|---|
| SVM | Differentiate between ovarian cancer tissues, normal ovarian tissues and other normal tissues [89]. |
| SVM | Recognize five sets of genes in functional classes that were expected to be co-regulated: those mediating the tricarboxylic acid cycle, respiration, cytoplasmic ribosome biosynthesis, proteasome biosynthesis and histone biosynthesis [90]. |
| TPCR | Discriminate between tumours from a variety of tissues and organs, e.g. between subtypes of leukemia and the mutations of breast cancer [91]. |
| TPCR | Differentiate between round blue cell tumours of childhood (neuroblastoma, rhabdomyosarcoma, non-Hodgkin lymphoma and Ewing family of tumours) [91]. |
| NN | Classify cancer samples into the same four groups of childhood cancer [92]. |
| NN | Investigate the gene expression patterns associated with estrogen receptor status in sporadic breast cancer [93]. |
| MCR-ALS | Classify types of leukemia [94]. |
| MCR-ALS | Differentiate between nine types of tumour samples (breast cancer, central nervous system tumour, colon carcinoma, lung cancer, leukemia, melanoma, ovarian carcinoma, prostate cancer and renal carcinoma) [94]. |
| SOM + k-means clustering | Classify subtypes of oral cancer [95]. |
| KNN | Select a subset of genes to classify subtypes of leukemia and a subset to discriminate between tumour and healthy colon samples [96]. |
| PLS (for dimension reduction) + LD or QLD | Differentiate between tumour and healthy samples of colon and ovarian tissue [97]. |
| PLS (for dimension reduction) + LD or QLD | Differentiate between the subtypes of a cancer such as lymphoma or leukemia [97]. |
| PLS + PHR | Predict patient survival probabilities [98]. |
| PLS + RPLR | Classify samples of two types of leukemia [99]. |
| PLS + RPLR | Differentiate between healthy and tumour colon samples [99]. |
| DPLS | Differentiate between samples before and after chemotherapy [100]. |
| DPLS | Identify the estrogen receptor status [100]. |
| DPLS | Differentiate the states of a breast cancer tumour [101]. |
| DPLS | Predict the drug efficacy using expression data biomarkers [102]. |
| DPLS | Identify the most relevant genes correlated with a certain tumour [103]. |
| DPLS | Predict the quality of a DNA microarray spot [104]. |
| DPLS | Classify tumour samples (different types of lymphoma and breast cancer) [105]. |
| DPLS | Differentiate between healthy samples and samples of carcinoma, colon and prostate tumour [105]. |
| DPLS | Identify genes whose expression appears to be synchronized with cell cycling [106]. |
| DPLS | Identify genes with periodic fluctuations in expression levels coupled to the cell cycle in the budding yeast [106]. |
| DPLS | Select a few gene expressions that are the most effective in discriminating tumoral types (melanoma, colon, leukemia and renal tumour cells) [103, 107]. |
| DPLS | Identify new lung cancer molecular markers with diagnostic value [108]. |
Abbreviations. SVM: Support vector machines, TPCR: Total principal component regression, NN: Neural
networks, MCR-ALS: Multivariate curve resolution alternating least squares, SOM: Self-organizing maps,
KNN: K-nearest neighbours, PLS: Partial least squares, LD: Logistic discrimination, QLD: Quadratic logistic
discrimination, PHR: Proportional hazard regression, RPLR: Ridge penalized logistic regression, DPLS:
Discriminant partial least squares.
Recently, interest in using Discriminant Partial Least Squares (DPLS) has increased
[109, 110]. This interest arises from the high computational efficiency, great flexibility
and versatility of the method for the microarray classification problems addressed, and
from the existence of a variety of algorithmic variants [110]. Hence, improving the
DPLS model in order to obtain better classification models and performance plays a
key role in gene expression microarray data classification.
References
[1] Ting-Lee, M.L., Analysis of microarray gene expression data. 2004, USA: Kluwer Academic Publishers.
[2] Higgs, P.G. and T.K. Attwood, Bioinformatics and Molecular Evolution, ed. B.S. Ltd. 2006: Blackwell Publishing.
[3] U.S. Department of Health and Human Services, The New Genetics, in NIH Publication No. 07-662. 2006.
[4] Baldi, P. and G.W. Hatfield, DNA microarrays and Gene expression. From Experiments to Data Analysis and Modeling. 2002, Cambridge: Cambridge University Press.
[5] Primrose, S.B. and R.M. Twyman, Principles of Gene Manipulation and Genomics (7th edition). 2006: Blackwell Publishing.
[6] Göhlmann, H. and W. Talloen, Gene Expression Studies Using Affymetrix Microarrays. Mathematical and Computational Biology Series, ed. C. Hall. 2009: Taylor & Francis Group, LLC.
[7] Pasanen, T., et al., DNA Microarray Data Analysis. 2003, Helsinki: Ed. CSC - The Finnish IT center for Science.
[8] Allison, D.B., et al., Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews (Genetics), 2006. 7: p. 55-65.
[9] Liew, A.W.-C., H. Yan, and M. Yang, Pattern Recognition techniques for the emerging field of bioinformatics: A review. Pattern Recognition, 2005. 38: p. 2055-2073.
[10] Southern, E., Detection of specific sequences among DNA fragments separated by gel electrophoresis. Journal of Molecular Biology, 1975. 98: p. 503-507.
[11] Fodor, S.P., et al., Multiplexed biochemical assays with biological chips. Nature, 1993. 364: p. 555-556.
[12] Schena, M., et al., Quantitative monitoring of gene expression patterns with complementary DNA microarray. Science, 1995. 270: p. 467-470.
[13] Hedenfalk, I., et al., Gene Expression profiles in hereditary breast cancer. The New England Journal of Medicine, 2001. 344: p. 539-548.
[14] Mada, H., Microarray Data Analysis (I), Part A: cDNA spotted Microarray. Material of Data Analysis Course. http://www.sinica.edu.tw/~hmwu/CourseSMDA/index.htm, Academia Sinica: Institute of Statistical Science: Taiwan.
[15] Schuchhardt, J., et al., Normalization strategies for cDNA microarrays. Nucleic Acids Research, 2000. 28: p. e47.
[16] Berrar, D., W. Dubitzky, and M. Granzow, A practical approach to microarray data analysis. 2004, USA: Kluwer Academic Publishers.
[17] http://www.moleculardevices.com/.
[18] Wang, D., et al., Effects of replacing the unreliable cDNA microarray measurements on the disease classification based on gene expression profiles and functional modules. Bioinformatics, 2006. 22: p. 2883-2889.
[19] Quackenbush, J., Extracting biology from high-dimensional biological data. J Exp Biol, 2007. 210: p. 1507-1517.
[20] Cleveland, W.S., Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 1979. 74: p. 829-836.
[21] Kepler, T.B., L. Crosby, and K.T. Morgan, Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology, 2002. 3: p. 1-12.
[22] Yang, Y.H., et al., Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 2002. 30: p. e15.
[23] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[24] Shalon, D., S.J. Smith, and P.O. Brown, A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 1996. 6: p. 639-645.
[25] Lockhart, D.J., et al., DNA Expression monitoring by hybridization to high density oligonucleotide arrays. Nature Biotechnology, 1996. 14: p. 1675-1680.
[26] Baldini, A. and D.C. Ward, In situ hybridization banding of human chromosomes with Alu-PCR products: a simultaneous karyotype for gene mapping studies. Genomics, 1991. 9: p. 770-774.
[27] Ried, T., et al., Multicolor fluorescence in situ hybridization for the simultaneous detection of probe sets for chromosomes 13, 18, 21, X and Y in uncultured amniotic fluid cells. Human Molecular Genetics, 1992. 1: p. 307-313.
[28] Lesk, A.M., Introduction to Bioinformatics (3rd Edition). 2008: Oxford University Press.
[29] Butcher, L.M., et al., SNPs, microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Human Molecular Genetics, 2005. 14: p. 1315-1325.
[30] Friedlander, G., et al., Modulation of the transcription regulatory program in yeast cells committed to sporulation. Genome Biology, 2006. 7: article R20.
[31] Veer, L.J.v.t., et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002. 415: p. 530-535.
[32] Thomas, R.S., et al., Identification of toxicologically predictive gene sets using cDNA microarrays. Molecular Pharmacology, 2001. 60: p. 1189-1194.
[33] Mootha, V.K., et al., PGC-1a-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature, 2003. 34(3): p. 266-273.
[34] Golub, T.R., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 1999. 285: p. 531-537.
[35] Alizadeh, A.A., et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 2000. 403: p. 503-511.
[36] Sebat, J., et al., Strong Association of De Novo Copy Number Mutations with Autism. Science, 2007. 316: p. 445-449.
[37] Chavan, P., K. Joshi, and B. Patwardhan, DNA Microarrays in Herbal Drug Research. eCAM, 2006. 3(7): p. 447-457.
[38] Aitman, T.J., Science, medicine, and the future: DNA microarrays in medical practice. The British Medical Journal, 2001. 323: p. 611-615.
[39] Cuperlovic-Culf, M., N. Belacel, and J. Ouellette, Determination of tumour marker genes from gene expression data. Drug Discovery Today Targets (Reviews), 2005. 10: p. 429-437.
[40] MacGregor, P.F. and J.A. Squire, Applications of microarrays to the analysis of gene expression in cancer. Clinical Chemistry, 2002. 48: p. 1170-1177.
[41] Wadlow, R. and S. Ramaswamy, DNA microarrays in clinical cancer research. Current Molecular Medicine, 2005. 5: p. 111-120.
[42] Kohlmann, A., et al., Pediatric acute lymphoblastic leukemia (ALL) gene expression signatures classify an independent cohort of adult ALL patients. Leukemia, 2004. 18: p. 63-71.
[43] Haferlach, T., et al., AML M3 and AML M3 variant each have a distinct gene expression signature but also share patterns different from other genetically defined AML subtypes. Genes Chromosomes Cancer, 2005: p. 113-127.
[44] Kohlmann, A., et al., Molecular characterisation of acute leukemias by use of microarray technology. Genes Chromosomes Cancer, 2003. 37: p. 396-405.
[45] Su, A.I., et al., Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 2001. 61: p. 7388-7393.
[46] Ross, D.T., et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 2000. 24: p. 227-235.
[47] Petrik, J., Diagnostic applications of microarrays. Transfusion Medicine. 16: p. 233-247.
[48] Frolov, A.E., Differential Gene Expression Analysis by DNA Microarray Technology and Its Application in Molecular Oncology. Molecular Biology, 2003. 37: p. 486-494.
[49] Sorlie, T., et al., Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 2001. 98: p. 10869-10874.
[50] West, M., et al., Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2001. 98: p. 11462-11467.
[51] Bittner, M., et al., Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 2000. 406: p. 536-540.
[52] Clark, E.A., et al., Genomic analysis of metastasis reveals an essential role for RhoC. Nature, 2000. 406: p. 532-535.
[53] Mao, H.J., et al., Monitoring microarray-based gene expression profile changes in hepatocellular carcinoma. World Journal of Gastroenterology, 2005. 11: p. 2811-2816.
[54] Dyrskjot, L., Classification of bladder cancer by microarray expression profiling: towards a general clinical use of microarrays in cancer diagnostics. Expert Reviews in Molecular Diagnostics, 2003. 3: p. 635-647.
[55] Dooley, T.P., et al., Biomarkers of human cutaneous squamous cell carcinoma from tissues and cell lines identified by DNA microarrays and qRT-PCR. Biochemical and Biophysical Research Communications, 2003. 306: p. 1026-1036.
[56] Gordon, G.J., R.V. Jensen, and L.L. Hsiao, Translation of microarray data into clinically relevant cancer diagnostic tests using expression ratios in lung cancer and mesothelioma. Cancer Research, 2002. 62: p. 4963-4967.
[57] Alon, U., et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Cell Biology, 1999. 96: p. 6745-6750.
[58] Chen, X., et al., Gene expression patterns in human liver cancers. Molecular Biology of the Cell, 2002. 13: p. 1929-1939.
[59] Boom, J.v.d., et al., Characterization of Gene Expression Profiles Associated with glioma progression using Oligonucleotide-based microarray analysis and Real-Time Reverse Transcription-Polymerase Chain Reaction. American Journal of Pathology, 2003. 163: p. 1033-1043.
[60] Kitahara, O., et al., Alterations of gene expression during colorectal carcinogenesis revealed by cDNA microarrays after laser-capture microdissection of tumour tissues and normal epithelia. Cancer Research, 2001. 61: p. 3544-3549.
[61] Pomeroy, S.L., et al., Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 2002. 415: p. 436-442.
[62] Bhattacharjee, A., et al., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 2001. 98: p. 13790-13795.
[63] Nadler, S.T., et al., The expression of adipogenic genes is decreased in obesity and diabetes mellitus. Proceedings of the National Academy of Sciences, 2000. 97: p. 11371-11376.
[64] Permana, P.A., A.D. Parigi, and P.A. Tataranni, Microarray gene expression profiling in obesity and insulin resistance. Nutrition, 2004. 20: p. 134-138.
[65] Hakak, Y., et al., Genome-wide expression analysis reveals dysregulation of myelination-related genes in chronic schizophrenia. Proceedings of the National Academy of Sciences, 2001. 98: p. 4746-4751.
[66] Mayeux, R., Mapping the new frontier: complex genetic disorders. The Journal of Clinical Investigation, 2005. 115: p. 1404-1407.
[67] Kozal, M.J., et al., Extensive polymorphisms observed in the HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nature Medicine, 1996. 2: p. 753-759.
[68] Gingeras, T.R., et al., Simultaneous genotyping and species identification using hybridization pattern recognition of generic mycobacterium DNA arrays. Genome Research, 1998. 8: p. 435-448.
[69] Aitman, T.J., et al., Identification of Cd36 (Fat) as an insulin resistance gene causing defective fatty acid and glucose metabolism in hypertensive rats. Nature Genetics, 1999. 21: p. 76-83.
[70] Otero, E., et al., DNA microarrays in oral cancer. Medicina Oral, 2004. 9: p. 288-292.
[71] Graveel, C.R., et al., Expression profiling and identification of novel genes in hepatocellular carcinomas. Oncogene, 2001. 20: p. 2704-2712.
[72] Brem, R., et al., Global analysis of differential gene expression after transformation with the v-H-ras oncogene in a murine tumor model. Oncogene, 2001. 20: p. 2854-2858.
[73] Okabe, H., et al., Genome-wide analysis of gene expression in human hepatocellular carcinomas using cDNA microarray: identification of genes involved in viral carcinogenesis and tumor progression. Cancer Research, 2001. 61: p. 2129-2137.
[74] Cavallaro, S., et al., Gene expression profiles during long-term memory consolidation. European Journal of Neuroscience, 2001. 13: p. 1809-1815.
[75] Zirlinger, M., G. Kreiman, and D.J. Anderson, Amygdala-enriched genes identified by microarray technology are restricted to specific amygdaloid subnuclei. Proceedings of the National Academy of Sciences, 2001. 98: p. 5270-5275.
[76] Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23: p. 2507-2517.
[77] Tang, E.K., P. Suganthan, and X. Yao, Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics, 2006. 7: article 95.
[78] Díaz-Uriarte, R. and S.A.d. Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7: article 3.
[79] Guyon, I., et al., Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002. 46: p. 389-422.
[80] Troyanskaya, O.G., et al., Nonparametric methods for identifying differentially expressed genes in microarrays. Bioinformatics, 2002. 18: p. 1454-1461.
[81] Li, G.-Z., et al., Partial Least Squares based dimension reduction with gene selection for tumour classification. 7th IEEE International Conference on Bioinformatics and Bioengineering, 2007: p. 1439-1444.
[82] Brazma, A. and J. Vilo, Gene expression data analysis. Federation of European Biochemical Societies Letters, 2000. 480: p. 17-24.
[83] Lee, H.K., et al., Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Research, 2004. 14: p. 1085-1094.
[84] Kluger, Y., et al., Relationship between gene co-expression and probe localization on microarray slides. BMC Genomics, 2003. 4: p. 49-54.
[85] Gunther, E.C., et al., Prediction of drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences, 2003. 100: p. 9608-9613.
[86] Eisen, M.B., et al., Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 1998. 95: p. 14863-14868.
[87] Perou, C.M., et al., Molecular portraits of human breast tumours. Nature, 2000. 406: p. 747-752.
[88] Crescenzi, M. and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data. FEBS Letters, 2001. 507: p. 114-118.
[89] Furey, T.S., et al., Support Vector Machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000. 16: p. 906-914.
[90] Brown, M.P.S., et al., Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 2000. 97: p. 262-267.
[91] Tan, Y., et al., Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Research, 2005. 33: p. 56-65.
[92] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
[93] Gruvberger, S., et al., Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Research, 2001. 61: p. 5979-5984.
[94] Jaumot, J., R. Tauler, and R. Gargallo, Exploratory data analysis of DNA microarrays by multivariate curve resolution. Analytical Biochemistry, 2006. 358: p. 76-89.
[95] Warner, G.C., et al., Molecular classification of oral cancer by cDNA Microarrays Identifies overexpressed genes correlated with nodal metastasis. International Journal of Cancer, 2004. 110: p. 857-868.
[96] Li, L., et al., Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 2001. 17: p. 1131-1142.
[97] Nguyen, D.V. and D.M. Rocke, Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 2002. 18: p. 39-50.
[98] Nguyen, D.V. and D.M. Rocke, Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics, 2002. 18: p. 1625-1632.
[99] Fort, G. and S. Lambert-Lacroix, Classification using Partial Least Squares with Penalized Logistic Regression. Bioinformatics, 2005. 21: p. 1104-1111.
[100] Pérez-Enciso, M. and M. Tenenhaus, Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 2003. 112: p. 581-592.
[101] Modlich, O., et al., Predictors of primary breast cancers responsiveness to preoperative Epirubicin/Cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signature. Journal of Translational Medicine, 2005. 3: article 32.
[102] Man, M.Z., et al., Evaluation methods for classifying Expression data. Journal of Biopharmaceutical Statistics, 2004. 14: p. 1065-1084.
[103] Musumarra, G., et al., Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. Journal of Chemometrics, 2004. 18: p. 125-132.
[104] Bylesjö, M., et al., MASQOT: a method for cDNA microarray spot quality control. BMC Bioinformatics, 2005. 6: p. 250.
[105] Boulesteix, A.-L., PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 2004. 3: article 33.
[106] Johansson, D., P. Lindgren, and A. Berglund, A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics. 19: p. 467-473.
[107] Musumarra, G., et al., A Bioinformatic Approach to the Identification of Candidate Genes for the Development of New Cancer Diagnostics. Biol. Chem., 2003. 384: p. 321-327.
[108] Musumarra, G., et al., Genome-based identification of diagnostic molecular markers for human lung carcinomas by PLS-DA. Computational Biology and Chemistry, 2005. 29: p. 183-195.
[109] Nguyen, D.V. and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002. 18: p. 1216-1226.
[110] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
CHAPTER 2 Thesis Objectives
Microarrays allow the simultaneous analysis of the expression of thousands of genes. Clinical
diagnosis based on gene expression data has two main targets: 1) to achieve the
correct diagnosis for a patient with the greatest confidence and 2) to identify the
genes responsible for a particular disease. In data analysis terms, these objectives
imply developing the best classification model in order to classify a sample in its true
class with a low risk of misclassification and to identify the relevant variables that allow
discriminating among the classes under study.
Multivariate methods are required to analyse the huge amount of data generated in
microarray experiments. Discriminant Partial Least Squares (DPLS) classification is
commonly used in this field. The performance of this classification method depends on
many settings such as the data pre-processing, the number of factors, the number of
variables and the presence of outliers. Taking these considerations into account, the
aim of this thesis is to optimize the classification based on DPLS in order to classify
clinical samples from their gene expression microarray data. In more detail, the
objectives of the present thesis are:
1. To discuss the limitation of p-DPLS classification following the Bayes rule, which
forces the classifier to always assign a sample to one of the modelled classes, and
propose different approaches to overcome this limitation.
2. To implement the reject option in the probabilistic Discriminant Partial
Least Squares method (p-DPLS), used to classify the samples from their gene
expression data. This gives the classification rule the ability to refuse to
classify a sample when the risk of misclassification is too high, and avoids
forcing the classification into one of the modelled classes.
3. To develop a new method for detecting ambiguous samples and outliers for
p-DPLS, in order to improve the accuracy of the classification model. This will avoid
classifying samples that would probably be misclassified because 1) they share
characteristics of the two modelled classes, 2) they do not belong to any of the
modelled classes, 3) they have errors in their instrumental data or 4) they have errors in
their class codification.
4. To develop a new method for gene selection in order to reduce the data
dimensionality (eliminating the redundant data and the noise) and to improve
the classification model by decreasing the risk of misclassification.
5. To study the implications that the split of the datasets into training and test
sets has on gene selection and on the performance of the classification models.
6. To extend the binary classification based on DPLS to multi-class classification.
This should help to solve common clinical classification problems in which more
than two subtypes of samples are involved.
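The question raised in objective 5 can be made concrete with a small sketch: the dataset is split repeatedly into stratified training and test sets, genes are selected using the training samples only, and the selected subsets are compared across splits. The difference-of-class-means filter, the simulated data and all function names below are assumptions chosen for illustration; they stand in for, and are not, the gene selection method developed in this thesis.

```python
import random
from statistics import mean


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping the class proportions in both sets."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)


def top_genes(X, labels, train_idx, k):
    """Select the k genes with the largest class-mean difference (training set only)."""
    scores = []
    for g in range(len(X[0])):
        m0 = mean(X[i][g] for i in train_idx if labels[i] == 0)
        m1 = mean(X[i][g] for i in train_idx if labels[i] == 1)
        scores.append((abs(m1 - m0), g))
    return {g for _, g in sorted(scores, reverse=True)[:k]}


# Hypothetical simulated data: 20 samples x 30 genes, only gene 0 is informative.
rng = random.Random(0)
labels = [0] * 10 + [1] * 10
X = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in labels]
for row, y in zip(X, labels):
    row[0] += 10.0 * y

selections = []
for _ in range(10):
    train_idx, test_idx = stratified_split(labels, 0.3, rng)
    selections.append(top_genes(X, labels, train_idx, k=3))

always_selected = set.intersection(*selections)  # stable across splits
ever_selected = set.union(*selections)           # selected in at least one split
```

Genes that appear in `ever_selected` but not in `always_selected` owe their selection to a particular split rather than to a reproducible class difference, which is one way of looking at the dependence that objective 5 sets out to study.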
CHAPTER 3 Discussion of the implementation of the reject option in p-DPLS
3.1 Introduction
Each sample in a microarray gene expression dataset is characterized by a set of P features or
measurements obtained through observation, which are represented by the vector x.
The objective of a classifier is to assign a class (category) label (y) to this sample based
on its recorded x. In the probabilistic Discriminant Partial Least Squares classification
method (p-DPLS) [1], the PLS model translates x into a predicted value ŷ. This ŷ and the
probability density function (PDF) that describes the distribution of the ŷ's of the
training samples of each class are used to calculate the a posteriori probability that the
sample belongs to each modelled class. Classification is then decided using the Bayes
rule for minimum error [2].
The Bayes rule is commonly used as a criterion for classification. Its drawback is that the
unknown sample is always classified, even if the sample is either an outlier or
ambiguous (it has a similar a posteriori probability of belonging to both classes). In such
situations, it would be better to refuse to classify the sample [3].
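A minimal sketch of this idea, under simplifying assumptions, is given here in Python. The class PDFs of the predicted values are taken to be Gaussian, and two hypothetical thresholds are introduced: `min_posterior` rejects ambiguous samples and `min_density` rejects samples that are unlikely under the winning class. Neither the Gaussian form nor these thresholds are taken from the thesis; they only illustrate how a reject option changes the outcome of the Bayes rule.

```python
from math import exp, pi, sqrt


def gaussian_pdf(value, mean, std):
    """Normal density, standing in for the class PDF of the predicted values."""
    return exp(-0.5 * ((value - mean) / std) ** 2) / (std * sqrt(2.0 * pi))


def posteriors(y_hat, class_params, priors):
    """A posteriori probability of each class given the predicted value."""
    joint = [p * gaussian_pdf(y_hat, m, s) for (m, s), p in zip(class_params, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def classify(y_hat, class_params, priors, min_posterior=None, min_density=None):
    """Bayes rule for minimum error, optionally with a reject option.

    Returns the index of the assigned class, or None if the sample is rejected.
    """
    post = posteriors(y_hat, class_params, priors)
    best = max(range(len(post)), key=post.__getitem__)
    if min_posterior is not None and post[best] < min_posterior:
        return None  # ambiguous: no class is clearly more probable
    if min_density is not None:
        m, s = class_params[best]
        if gaussian_pdf(y_hat, m, s) < min_density:
            return None  # outlier: far from the training samples of every class
    return best


# Hypothetical class models: (mean, standard deviation) of the training predictions.
class_params = [(0.0, 0.2), (1.0, 0.2)]
priors = [0.5, 0.5]

forced = classify(3.0, class_params, priors)                     # plain Bayes rule
guarded = classify(3.0, class_params, priors, min_density=1e-3)  # with reject option
```

For a predicted value of 3.0, far from both class means, the plain Bayes rule still returns class 1 because its a posteriori probability is close to one, whereas the guarded call returns None.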
TheobjectiveofthischapteristodiscusstheimplementationoftherejectoptioninpͲ
DPLS.Section3.2introducestheformulationofthepͲDPLSmodel.Then,theapplication
oftheBayesruleforclassifyinginpͲDPLSisshowninsection3.3.Section3.4discusses
the limitations of using the Bayes rule in pͲDPLS. Limitations that are overcome by
implementingarejectoption.Twoapproximationsforimplementingtherejectoptionin
pͲDPLSarediscussedinsection3.5.Finally,section3.6discussesthenecessarychanges
intheinterpretationofthemeasuresofclassificationperformancewhentheclassifier
includestherejectoption.
3.2 Probabilistic discriminant partial least squares
3.2.1 The partial least squares model
One task in data analysis is to describe the relationship between the observations in the
predictor space (X) and a dependent variable (y) [4]. Partial least squares (PLS) is a
regression method that specifically searches for a set of components (or factors) that
perform a simultaneous decomposition of X and y with the constraint that these
components explain as much as possible of the covariance between X and y. Discriminant
PLS (DPLS) applies PLS regression to binary classification problems, in which y codifies
the class of the samples [5, 6]. With microarray gene expression data, X is an N×P matrix
of N samples and P gene expressions and y is an N×1 vector of ones and zeros, where the
integer 0 indicates that the sample belongs to class ω_0 (e.g. "cancer type I") and the
integer 1 indicates that the sample belongs to class ω_1 (e.g. "cancer type II").
PLS decomposes X and y into:
$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E} \tag{1}$$
$$\mathbf{y} = \mathbf{u}q + \mathbf{f} \tag{2}$$
where T is the scores matrix, P is the loadings matrix, u is the vector of scores for y and q
is the loading [7]. E is the (error) residual matrix of the X-matrix and f is the vector of
(error) residuals of the y-vector. An inner relationship is constructed that relates the
scores of the X block to the scores of the y block:
$$\mathbf{u} = \mathbf{T}\mathbf{w} \tag{3}$$
Once the model is calculated, the above equations can be combined to obtain a vector
of regression coefficients for a given number of factors:
$$\hat{\mathbf{b}} = \mathbf{W}(\mathbf{P}^{\mathrm{T}}\mathbf{W})^{-1} q \tag{4}$$
where W is the matrix whose columns are the weights in Eq. (3).
The prediction for a sample is calculated as:
$$\hat{y} = \mathbf{x}^{\mathrm{T}}\hat{\mathbf{b}} \tag{5}$$
Note that if b̂ has been calculated from mean-centered data, then x in Eq. (5) should be
mean-centered, and the predicted ŷ should be processed accordingly. Ideally, the
prediction ŷ for a sample of class ω_1 should be 1 and for a sample of class ω_0 should be
0. Since this is never the case, because of random variability and modelling error, a
threshold is defined so that a sample whose prediction ŷ is above this threshold is
classified into class ω_1, and otherwise it is classified into class ω_0. The threshold can be
defined with different degrees of rigour (e.g., the threshold is arbitrarily set at 0.5, or it is
derived by assuming that the ŷ's of the training samples follow a Gaussian distribution and
estimating the distribution using the mean and standard deviation of the ŷ's of each
class). In the following section, the threshold is defined from PDFs that describe each
class. This has led to a new version of DPLS called probabilistic-DPLS (p-DPLS).
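As an illustration of Eqs. (1) to (5), the following is a minimal sketch of a single-response PLS fit followed by the arbitrary 0.5 threshold. It assumes a NIPALS-style PLS1 algorithm on mean-centred data, which the text does not specify; the function names `pls1_fit` and `pls1_predict` and the simulated data are hypothetical, and q is stored as one loading per factor.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """NIPALS-style PLS1 on mean-centred data (an assumed algorithm)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean           # centred blocks, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)                # weight vector of this factor
        t = Xr @ w                            # X scores
        p = Xr.T @ t / (t @ t)                # X loadings
        q_a = yr @ t / (t @ t)                # y loading
        Xr = Xr - np.outer(t, p)              # remove the factor from X
        yr = yr - q_a * t                     # remove the factor from y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b_hat = W @ np.linalg.solve(P.T @ W, q)   # Eq. (4)
    return b_hat, x_mean, y_mean

def pls1_predict(X, b_hat, x_mean, y_mean):
    """Eq. (5) applied to centred x, with the mean of y added back."""
    return (X - x_mean) @ b_hat + y_mean

# Simulated data: 20 samples per class, 50 "genes", 5 of them informative
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(20, 50))
X1 = rng.normal(0.0, 1.0, size=(20, 50))
X1[:, :5] += 1.5
X = np.vstack([X0, X1])
y = np.r_[np.zeros(20), np.ones(20)]

b_hat, x_mean, y_mean = pls1_fit(X, y, n_factors=2)
y_hat = pls1_predict(X, b_hat, x_mean, y_mean)
labels = (y_hat > 0.5).astype(int)            # arbitrary threshold at 0.5
```

The fixed 0.5 threshold is only the simplest of the options mentioned above; p-DPLS replaces it with a decision based on class PDFs.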
3.2.2 Probability density function of a class
In p-DPLS, one PDF is calculated that represents the PLS predictions characterizing the
samples of class ω_0 and one PDF is calculated that represents the range of predictions of
samples of class ω_1. The PDFs are calculated as follows. For the PLS model with A
factors, the training samples are predicted with Eq. (5). For each training sample i, a
Gaussian function (also called kernel function) centred at the predicted value ŷ_i is
calculated as:
$$f(\hat{y}_i) = \frac{1}{SEP_i\sqrt{2\pi}} \cdot e^{-\frac{1}{2}\left(\frac{\hat{y}-\hat{y}_i}{SEP_i}\right)^{2}} \tag{6}$$
where
$$SEP_i = RMSEC\sqrt{1 + h_i} \tag{7}$$
and
$$RMSEC = \sqrt{\frac{\sum_{i=1}^{N}(\hat{y}_i - y_i)^{2}}{N - A - \delta}} \tag{8}$$
SEP_i is the standard error of prediction for sample i, h_i is the leverage of the sample,
RMSEC is the root mean square error of calibration, y_i is the known class of the training
sample i (i.e. the value 0 for a sample of class ω_0 and the value 1 for a sample of class
ω_1) and δ is 1 if the data have been centred and 0 otherwise. Figure 1 shows the Gaussian
functions calculated for three training samples of class ω_0 and four samples of class ω_1.
[Figure 1: individual kernels f(ŷ_i), each of width SEP_i, and the class PDFs p(ŷ|ω_0) and p(ŷ|ω_1), plotted against ŷ.]
Figure 1. Gaussian functions (f(ŷ_i)) and PDFs (p(ŷ|ω_0), p(ŷ|ω_1)) calculated for a hypothetical p-DPLS model.
Note that the width of the Gaussian kernel for each sample is different, because it depends on the leverage of
the sample and, ultimately, on the relative position of the sample in the multivariate space.
The PDFs for class ω_0 and ω_1 are calculated by averaging the individual kernel functions
of the training samples of each class:
$$p(\hat{y}\mid\omega_0) = \frac{1}{n_0}\sum_{i=1}^{n_0} f(\hat{y}_i) \tag{9}$$
$$p(\hat{y}\mid\omega_1) = \frac{1}{n_1}\sum_{i=1}^{n_1} f(\hat{y}_i) \tag{10}$$
where n_0 and n_1 are the number of samples of class ω_0 and class ω_1 respectively.
For a test sample, the predicted value ŷ_i is calculated with Eq. (5) for a DPLS model with
A factors. Then, the sample is classified according to its probability of belonging to each
one of the classes, as shown in the next section.
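The construction of the class PDFs in Eqs. (6) to (10) can be sketched as follows. This is only an illustration under stated assumptions: the training predictions, the leverages and the helper names `rmsec` and `class_pdf` are hypothetical, and a one-factor, mean-centred model is assumed.

```python
import numpy as np

def rmsec(y_hat, y, n_factors, centred=True):
    """Eq. (8): root mean square error of calibration."""
    dof = len(y) - n_factors - (1 if centred else 0)
    return np.sqrt(np.sum((y_hat - y) ** 2) / dof)

def class_pdf(grid, y_hat_class, sep_class):
    """Eqs. (6), (9), (10): average of one Gaussian kernel per training sample."""
    z = (grid[:, None] - y_hat_class[None, :]) / sep_class[None, :]
    kernels = np.exp(-0.5 * z ** 2) / (sep_class[None, :] * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical training set: three samples of class 0 and four of class 1
y = np.array([0, 0, 0, 1, 1, 1, 1], dtype=float)
y_hat = np.array([-0.10, 0.05, 0.20, 0.80, 0.95, 1.10, 1.20])
h = np.array([0.10, 0.05, 0.20, 0.15, 0.05, 0.10, 0.30])   # assumed leverages

sep = rmsec(y_hat, y, n_factors=1) * np.sqrt(1.0 + h)      # Eq. (7)

grid = np.linspace(-3.0, 4.0, 7001)
pdf0 = class_pdf(grid, y_hat[y == 0], sep[y == 0])
pdf1 = class_pdf(grid, y_hat[y == 1], sep[y == 1])

# Each class PDF should integrate to approximately one
step = grid[1] - grid[0]
area0, area1 = pdf0.sum() * step, pdf1.sum() * step
```

Because each kernel has its own SEP_i, samples with a high leverage contribute wider and flatter kernels to the PDF of their class.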
3.3 Class prediction
3.3.1 Classification based on probabilities
Classification based on a priori probability
Let {ω_1 … ω_C} be a finite set of C classes. The a priori probability P(ω_c) is the probability of
observing class c when a new sample arrives [8]. It reflects our prior knowledge of how
likely we are to get a sample of one class (e.g. "cancer type I") and not another kind of
sample (e.g. "healthy" or "cancer type II") [9]. A priori probabilities are often considered
equal for all the classes [10, 11] or calculated from the number of samples in the
training set assuming that this set is representative of the population [12-14], with the
constraint that $\sum_{c=1}^{C} P(\omega_c) = 1$ [8].
In p-DPLS, the a priori probabilities are P(ω_0) = n_0/N and P(ω_1) = n_1/N for class ω_0 and class
ω_1 respectively, where n_0 is the number of training samples of class ω_0, n_1 is the number
of samples of class ω_1 and N = n_0 + n_1.
Based on the a priori probability only, the classification rule in p-DPLS that minimizes the
probability of error is to assign a sample to class ω_c if
$$P(\omega_c) > P(\omega_{c'}) \qquad c' = 1 \ldots C;\ c \neq c' \tag{11}$$
The drawback of this rule is that it will always assign any new sample to the same class
(the one with the highest a priori probability), although we know that samples from
different classes may arrive. The information about the sample contained in x is ignored.
Classification based on probability density functions
A better classification decision can be made by using the measurement vector x that
characterizes the incoming sample; in our case, the data x from a microarray
experiment. In p-DPLS, x is first converted into the prediction ŷ with Eq. (5) for the PLS
model with A factors. Then, the rule is to assign the sample i with prediction ŷ_i to the
class ω_c if
$$p(\hat{y}_i\mid\omega_c) > p(\hat{y}_i\mid\omega_{c'}) \qquad c' = 1 \ldots C;\ c \neq c' \tag{12}$$
where p(ŷ_i|ω_c) is the class-conditional PDF for class c obtained from the ŷ's of the
training samples evaluated at position ŷ_i (section 3.2.2). Note that if, for a certain
sample, p(ŷ_i|ω_0) = p(ŷ_i|ω_1), the value of the PDF will not decide. Figure 2 shows different
PDFs for two classes, ω_0 and ω_1, for different hypothetical p-DPLS models (e.g. calculated
with a different number A of factors). For a given sample with the predicted value ŷ_i (marked in Figure 2),
the classification is done by comparing the values of each PDF at that ŷ_i (arrows in
Figure 2a). The sample is classified into the class with the largest p(ŷ_i|ω_c). Note that in
the zone where the PDFs overlap, the values p(ŷ_i|ω_c) are similar for both classes (see the
first two PDF images in Figure 2a). Hence, a small variation in ŷ_i due to random error in
x may change the class that has the largest p(ŷ_i|ω_c), and hence change the classification
decision. The classification of samples in that zone (called ambiguous samples) will be
discussed later.
[Figure 2: four hypothetical p-DPLS models, one per row. Column a: PDFs p(ŷ_i|ω_0) and p(ŷ_i|ω_1). Column b: a posteriori probabilities P(ω_0|ŷ_i) and P(ω_1|ŷ_i). Column c: risk functions for classifying into ω_0 and into ω_1. Horizontal axis: ŷ, from -1.5 to 2.5.]
Figure 2. Example for hypothetical p-DPLS models: a. PDFs; b. a posteriori probabilities; c. risk functions
assuming λ_cc = 0 and λ_cc' = 1.
Classification based on a posteriori probability
A more elaborate classification decision combines the a priori probability and the
prediction ŷ_i of the incoming sample. The probability that this new sample belongs to
class c in a C-class problem is given by the Bayes a posteriori probability expression:
$$P(\omega_c\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_c)\,P(\omega_c)}{p(\hat{y}_i)} \tag{13}$$
When applied to microarray data classification, P(ω_c|ŷ_i) is the probability that a cell or
tissue characterized by its gene expression data x (from which ŷ_i is obtained) belongs to
class ω_c, for example the "healthy" class or the "tumour" class.
The denominator (known as evidence or unconditional probability density function) is a
scale factor that measures how frequently we will measure a sample with such ŷ_i:
$$p(\hat{y}_i) = \sum_{c=1}^{C} p(\hat{y}_i\mid\omega_c)\,P(\omega_c) \tag{14}$$
The rule assigns the sample to the class with the largest a posteriori probability. So, a
sample will be assigned to class ω_c if:
$$P(\omega_c\mid\hat{y}_i) > P(\omega_{c'}\mid\hat{y}_i) \qquad c' = 1 \ldots C;\ c \neq c' \tag{15}$$
Or, since the evidence is the same for all the classes, if
$$p(\hat{y}_i\mid\omega_c)\,P(\omega_c) > p(\hat{y}_i\mid\omega_{c'})\,P(\omega_{c'}) \qquad c' = 1 \ldots C;\ c \neq c' \tag{16}$$
For a two-class classification problem, as in p-DPLS, the a posteriori probabilities P(ω_c|ŷ_i)
are:
$$P(\omega_0\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_0)\,P(\omega_0)}{p(\hat{y}_i)} \tag{17a}$$
$$P(\omega_1\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_1)\,P(\omega_1)}{p(\hat{y}_i)} \tag{17b}$$
where:
$$p(\hat{y}_i) = p(\hat{y}_i\mid\omega_0)\,P(\omega_0) + p(\hat{y}_i\mid\omega_1)\,P(\omega_1) \tag{18}$$
Figure 2b shows the a posteriori probabilities calculated from the PDFs of Figure 2a for
two classes along the ŷ domain. The arrows indicate the a posteriori probability of the
marked sample in each class.
Note that since the a posteriori probability is calculated as a ratio (Eq. 17a-b), it
increases for one class as ŷ moves away from the PDF of the other class. Hence, for a
sample with predicted value ŷ_i, the classification is riskier when the PDFs overlap
(first two rows of images of Figure 2). In contrast, when the distributions are more
separated (images in the third and fourth rows of Figure 2), the classification action is taken
with a higher probability of being correct.
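The two-class calculation of Eqs. (17a), (17b) and (18) amounts to a few lines of arithmetic. The sketch below uses made-up PDF values at a single ŷ_i and priors taken from a hypothetical training set with n_0 = 3 and n_1 = 4; the function name `posteriors` is illustrative.

```python
import numpy as np

def posteriors(pdf_values, priors):
    """Eqs. (13)-(14): a posteriori probabilities from PDF values and priors."""
    joint = np.asarray(pdf_values, dtype=float) * np.asarray(priors, dtype=float)
    evidence = joint.sum()                 # Eq. (18) in the two-class case
    return joint / evidence

# Assumed values of p(y_hat_i | w0) and p(y_hat_i | w1) at one prediction
pdf_at_y = [0.30, 1.20]
priors = [3 / 7, 4 / 7]                    # n0/N and n1/N

post = posteriors(pdf_at_y, priors)
decision = int(np.argmax(post))            # Eq. (15): largest a posteriori probability
```

With these numbers the sample is assigned to class ω_1, and the two a posteriori probabilities add up to one, as required by the normalization in Eq. (18).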
3.3.2 Classification based on risk
Classification costs
Each classification decision has an associated cost. Let {α_1 … α_C} be the possible decisions,
where α_c indicates that the sample is classified in class ω_c. Let λ(α_c|ω_c') be the cost
incurred for making the decision α_c (classify in ω_c) when the true class is ω_c'. For short,
λ(α_c|ω_c') is represented as λ_cc'.
In practice, deciding the right costs for the classification problem is difficult and
requires an expert opinion. Costs result from combining several factors measured in
different units (money, time or quality of life [8]), but a general approach is to
consider that a correct classification has cost 0 (i.e., when a sample of class c has been
classified in class c, λ_cc = 0) and an incorrect classification has cost 1 (i.e., when a sample
of class c has been classified in class c', λ_cc' = 1) [15-17]. Other approaches have been
used. Santos-Pereira [18] proposed seven different combinations of costs to optimize
the classification, based on the work published by Tortorella [19]. They introduced
negative costs for correct classifications and positive costs for misclassifications. Deceux
[15] presented costs of classifying the samples in three different classes, with values
from 0.5 to 3 to penalize each classification. Another strategy is to assign different costs
to each type of error and classification, i.e. classifying a sample as "healthy" when it is
"tumor" is penalized differently, with a higher cost, than classifying a sample as "tumor"
when it is "healthy" [9, 20].
The risk of classification
The risk of classification, called the conditional risk, R(α_c|ŷ_i), is defined as the expected
loss (cost). Conditional means that the risk depends on the value that characterizes the
sample (here ŷ_i, which derives from the observed x through the PLS model) on which the
classification is based. Depending on ŷ_i, we may run a higher or a lower risk. For a
particular ŷ_i and the action α_c taken, the loss incurred is λ(α_c|ω_c'), where ω_c' is the
possible true class (i.e. one of the classes in which the samples may be classified). Since P(ω_c'|ŷ_i) is
the probability that the true class for such ŷ_i is ω_c', the expected loss associated with
taking action α_c is [9]:
$$R(\alpha_c\mid\hat{y}_i) = \sum_{c'=1}^{C} \lambda(\alpha_c\mid\omega_{c'}) \cdot P(\omega_{c'}\mid\hat{y}_i) \tag{19}$$
For two classes, the risk of classification becomes:
$$R(\alpha_0\mid\hat{y}_i) = \lambda_{00}\,P(\omega_0\mid\hat{y}_i) + \lambda_{01}\,P(\omega_1\mid\hat{y}_i) \tag{20a}$$
$$R(\alpha_1\mid\hat{y}_i) = \lambda_{11}\,P(\omega_1\mid\hat{y}_i) + \lambda_{10}\,P(\omega_0\mid\hat{y}_i) \tag{20b}$$
Here action α_0 is "classify the sample into class ω_0" and action α_1 is "classify the sample
into class ω_1". λ_01 is the loss incurred for deciding ω_0 when the true class is ω_1, λ_10 is the
loss incurred for deciding ω_1 when the true class is ω_0, and λ_00 and λ_11 are the costs of
correctly classifying the samples into class ω_0 and class ω_1, respectively.
Whenever we have a prediction ŷ_i we can minimize the expected loss by selecting the
action that minimizes the conditional risk. The decision rule based on risk is known as
the Bayes rule for minimum error [2]. The Bayes minimum risk rule classifies
the sample in class ω_c if
$$R(\alpha_c\mid\hat{y}_i) < R(\alpha_{c'}\mid\hat{y}_i) \qquad c' = 1 \ldots C;\ c \neq c' \tag{21}$$
For binary classifiers like p-DPLS, Eq. (21) becomes: classify the sample into class
$$\begin{aligned} &\omega_0 \ \text{if} \ R(\alpha_0\mid\hat{y}_i) < R(\alpha_1\mid\hat{y}_i) \\ &\omega_1 \ \text{if} \ R(\alpha_1\mid\hat{y}_i) < R(\alpha_0\mid\hat{y}_i) \end{aligned} \tag{22}$$
with R(α_0|ŷ_i) and R(α_1|ŷ_i) evaluated with Eqs. (20a) and (20b).
If we consider cost zero for a correct classification and cost one for any error (i.e., λ_00 =
λ_11 = 0 and λ_01 = λ_10 = 1), the risk of classification becomes:
$$R(\alpha_0\mid\hat{y}_i) = \lambda_{01}\,P(\omega_1\mid\hat{y}_i) = P(\omega_1\mid\hat{y}_i) \tag{23a}$$
$$R(\alpha_1\mid\hat{y}_i) = \lambda_{10}\,P(\omega_0\mid\hat{y}_i) = P(\omega_0\mid\hat{y}_i) \tag{23b}$$
and the classification decision may be expressed in terms of a posteriori probabilities as
$$\text{decide } \omega_0 \ \text{if} \ \lambda_{10}\,P(\omega_0\mid\hat{y}_i) > \lambda_{01}\,P(\omega_1\mid\hat{y}_i); \ \text{otherwise decide } \omega_1 \tag{24}$$
Figure 2c shows the risk over the ŷ domain for a binary classifier with λ_cc = 0 and λ_cc' = 1.
Note that the risk curves are opposite to the a posteriori probability curves, i.e., a high a
posteriori probability involves a low risk, and vice versa. Also note that the risk of
classification in one of the classes decreases the further away the prediction is from the
PDF of the other class. For the marked test sample, in the top two models, the risk taken to
classify the sample into class ω_0, R(α_0|ŷ_i), is similar to the risk to classify the sample into
class ω_1, R(α_1|ŷ_i). In such a situation the chance of misclassification is high. By contrast,
when the PDFs do not overlap (Figure 2c, bottom) the risk taken when classifying
this sample in class ω_0 is much higher than the risk taken when classifying it in class ω_1
(i.e. R(α_0|ŷ_i) >> R(α_1|ŷ_i)). Hence the sample will be classified in class ω_1 with a low risk of
classification.
The classification based on risk is a general rule from which the previous rules derive. To
be meaningful, the classification based on risks requires the costs to be set objectively
(e.g. in monetary units). If they are not known and the cost of misclassification is set
equal to one and the cost of correct classification is set equal to zero, the classification
based on risk is equivalent to the classification based only on a posteriori probabilities.
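The conditional risk of Eq. (19) is a matrix-vector product between a cost matrix and the vector of a posteriori probabilities. The following sketch uses invented probabilities and an invented asymmetric cost matrix purely for illustration; `conditional_risk` is a hypothetical helper name.

```python
import numpy as np

def conditional_risk(cost, posterior):
    """Eq. (19): cost[a, c] is the loss of taking action a when the true class is c."""
    return np.asarray(cost, dtype=float) @ np.asarray(posterior, dtype=float)

# Zero-one costs: the risk of each action is the posterior of the other class (Eqs. 23a-b)
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
post = np.array([0.2, 0.8])
risk = conditional_risk(zero_one, post)
decision = int(np.argmin(risk))                 # Eq. (21): minimum conditional risk

# Assumed asymmetric costs: deciding class 0 when the truth is class 1 costs 5
asymmetric = np.array([[0.0, 5.0],
                       [1.0, 0.0]])
post_b = np.array([0.7, 0.3])
risk_b = conditional_risk(asymmetric, post_b)
decision_b = int(np.argmin(risk_b))
```

In the second case the a posteriori probability favours class ω_0, but the higher cost of that particular error moves the minimum-risk decision to class ω_1, which is the effect of assigning different costs to each type of error.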
3.4 Discussion of class prediction
The Bayes rule is optimal in the sense that no other rule can yield a lower error
probability. However, when ŷ_i lies in the ambiguity region, or when the sample lies
at the limits of the classes' domains, this rule may lead to questionable results. These
situations are discussed below.
In binary classification it is common for the PDFs of class ω_0 and class ω_1 to overlap
(Figure 3a). The overlap arises either because the classification algorithm has limited
discriminative power, or because some samples of both classes have similar measured x.
A sample whose prediction is in that region has similar values of the PDFs, p(ŷ_i|ω_0) ≈
p(ŷ_i|ω_1), and, assuming that the a priori probabilities are equal, also has similar values of the
a posteriori probabilities, P(ω_0|ŷ_i) ≈ P(ω_1|ŷ_i). Since there is not a clear difference, the
sample could well belong to either of the two classes and the probability of
misclassification is high. The overlap zone (dashed region in Figure 3) is called the ambiguity
region.
[Figure 3: panels a and b plot p(ŷ|ω_0)·P(ω_0) and p(ŷ|ω_1)·P(ω_1) against ŷ; panels c and d plot P(ω_0|ŷ) and P(ω_1|ŷ) against ŷ. The limits LL 0, HL 0, LL 1 and HL 1 are marked on the ŷ axis.]
Figure 3. Hypothetical p-DPLS model. a-b. PDFs; c-d. a posteriori probability functions. Class ω_0 is represented by
the green line and class ω_1 by the yellow line. The dashed region is the ambiguity region.
Another common situation arises when the sample's prediction is outside the range of
the predictions of the training samples. This situation may happen at the extremes of
the PDFs (Figure 3a and 3b) and also in the region between the PDFs if the PDFs do not
overlap (Figure 3b). In these regions, the class-conditional probabilities p(ŷ_i|ω_c) are very
low for both classes and the products p(ŷ_i|ω_0)·P(ω_0) and p(ŷ_i|ω_1)·P(ω_1) are also low.
However, note that at the limits of the PDFs the a posteriori probability for one of the
classes is high (Figures 3c and 3d) because it is calculated as a ratio. For example, for
p(ŷ_i|ω_0)·P(ω_0) = 10^-7 and p(ŷ_i|ω_1)·P(ω_1) = 10^-10, the a posteriori probability is P(ω_0|ŷ_i) =
10^-7/(10^-7 + 10^-10) ≈ 1. By letting the a posteriori probability decide, the sample would be
classified into class ω_0 with a high a posteriori probability. This result is satisfactory if the
sample must necessarily belong to one of the two possible classes and the classification
model has been designed to do so. However, the fact that the prediction of the sample
is in the tail of the PDF, where only a very low percentage of training samples lie,
suggests that the sample may be an outlier and may not even belong to the class. Hence,
allowing the classifier to refuse to classify, instead of forcing it to make a classification
decision, might be advantageous. This possibility is considered neither in the two-class
Bayes rule of a posteriori probability (Eq. 15) nor in the minimum risk
classification rule (Eq. 21), both of which will always classify the sample.
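The numerical example above can be checked directly. The snippet below only repeats the arithmetic of the text with the same two products; the variable names are illustrative.

```python
# Products p(y_hat_i | w_c) * P(w_c) taken from the example in the text
joint0 = 1e-7
joint1 = 1e-10

evidence = joint0 + joint1          # Eq. (18): very small, the sample is far from both classes
posterior0 = joint0 / evidence      # Eq. (17a): close to one despite the tiny evidence
posterior1 = joint1 / evidence
```

The a posteriori probability of class ω_0 is about 0.999 even though the evidence is of the order of 10^-7, which is why the ratio alone cannot reveal that the sample is atypical.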
3.5 Probabilistic discriminant partial least squares with reject option
In many cases, such as in clinical diagnosis, the cost of a wrong classification may be so
high that it may be better to suspend the decision (to refuse to classify the sample) and
call for a further test [21] than to risk obtaining a wrong classification. The reject option
is introduced in a classification rule to safeguard against excessive misclassifications [3]
and to obtain the accuracy required by the user of the classification system [22]. The
reject option avoids classifying the samples with a high probability of being wrongly
classified [22], and only the classifications with a low risk are performed. Hence, the
reject option converts potential misclassifications into rejections [23], which reduces the
error rate. The reject option, however, has two limitations:
1. Some samples that would be correctly classified by the classification model
may be converted into rejections.
2. The classification model becomes useless if too many samples are rejected.
A tradeoff between errors and rejects must therefore be achieved [18]. Several
strategies have been developed to define the optimal reject option [11, 18, 21, 23, 24].
These strategies basically reduce to two approaches: either defining the reject
option as a new class (reject class) to which the objects are assigned, or defining
the reject option as a threshold so that the object is only classified if its a posteriori
probability is higher than the threshold. These two approaches are discussed below.
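3.5.1 Reject option as a class
The reject option may be introduced in the classification process as an additional class,
the reject class (ω_r). In such a case, the possible classification actions of the p-DPLS
classifier are: classify the sample into class ω_0 (α_0), classify the sample into class ω_1 (α_1)
and classify the sample into the reject class ω_r (α_r).
Classification based on a posteriori probability
The a posteriori probabilities when the reject option is implemented as a class are
defined as:
$$P(\omega_0\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_0)\,P(\omega_0)}{p(\hat{y}_i)} \tag{25a}$$
$$P(\omega_1\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_1)\,P(\omega_1)}{p(\hat{y}_i)} \tag{25b}$$
$$P(\omega_r\mid\hat{y}_i) = \frac{p(\hat{y}_i\mid\omega_r)\,P(\omega_r)}{p(\hat{y}_i)} \tag{25c}$$
where the scale factor defined in Eq. (14) becomes:
$$p(\hat{y}_i) = p(\hat{y}_i\mid\omega_0)\,P(\omega_0) + p(\hat{y}_i\mid\omega_1)\,P(\omega_1) + p(\hat{y}_i\mid\omega_r)\,P(\omega_r) \tag{26}$$
The rule is to classify into:
$$\begin{aligned} &\text{class } \omega_0 \ \text{if} \ P(\omega_0\mid\hat{y}_i) > \max\big(P(\omega_1\mid\hat{y}_i),\, P(\omega_r\mid\hat{y}_i)\big) \\ &\text{class } \omega_1 \ \text{if} \ P(\omega_1\mid\hat{y}_i) > \max\big(P(\omega_0\mid\hat{y}_i),\, P(\omega_r\mid\hat{y}_i)\big) \\ &\text{class } \omega_r \ \text{if} \ P(\omega_r\mid\hat{y}_i) > \max\big(P(\omega_0\mid\hat{y}_i),\, P(\omega_1\mid\hat{y}_i)\big) \end{aligned} \tag{27}$$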
If the reject class is defined in this way, the a priori probabilities for class ω_0 and class ω_1
are calculated from the proportion of samples of each class in the training set. For the
reject class, P(ω_r) is the a priori probability that a new sample that should be rejected
arrives and p(ŷ_i|ω_r) defines the distribution of the ŷ_i of any sample that should be
rejected. Both p(ŷ_i|ω_r) and P(ω_r) are clearly difficult to calculate. Usually it is assumed
that the reject class has a uniform distribution over the ŷ domain [25] and, since the a
priori probability has only a multiplicative effect, only the product p(ŷ_i|ω_r)·P(ω_r) must be
calculated. One criterion is to define p(ŷ_i|ω_r)·P(ω_r) as a threshold so that 5% of the area in
the tails of the PDFs is below this threshold [11] (dashed regions in Figure 4). In this way a
sample whose ŷ_i is at the tails of the PDFs is rejected. Figure 4 shows the PDFs for class ω_0
and class ω_1 for overlapped and non-overlapped classes. The red horizontal line is the
uniform distribution calculated for the reject class. Note that this reject class defines two
kinds of regions, the acceptance and the reject ones.
[Figure 4: panels a and b plot p(ŷ|ω_0)·P(ω_0), p(ŷ|ω_1)·P(ω_1) and the horizontal line p(ŷ|ω_r)·P(ω_r) against ŷ. Regions along the ŷ axis in panel a: Reject, Acceptance, Reject. Regions in panel b: Reject, Acceptance, Reject, Acceptance, Reject.]
Figure 4. PDFs for overlapped and non-overlapped classes (ω_0 and ω_1) and the reject class (ω_r). The reject class
is defined as a uniform distribution. This is set as the 5% area in the tails of the PDFs.
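The rule of Eq. (27) with a uniform reject class can be sketched as below. The value used for the product p(ŷ_i|ω_r)·P(ω_r) is an arbitrary assumption for illustration, not the result of the 5% tail criterion, and the function name `classify_with_reject_class` is hypothetical.

```python
def classify_with_reject_class(joint0, joint1, reject_level):
    """Eqs. (25)-(27): joint_c = p(y_hat_i|w_c)*P(w_c); reject_level = p(y_hat_i|w_r)*P(w_r)."""
    evidence = joint0 + joint1 + reject_level            # Eq. (26)
    post = {
        "w0": joint0 / evidence,                         # Eq. (25a)
        "w1": joint1 / evidence,                         # Eq. (25b)
        "reject": reject_level / evidence,               # Eq. (25c)
    }
    label = max(post, key=post.get)                      # Eq. (27): largest posterior wins
    return label, post

REJECT_LEVEL = 0.01   # assumed constant height of the uniform reject class

tail_label, tail_post = classify_with_reject_class(1e-7, 1e-10, REJECT_LEVEL)
core_label, core_post = classify_with_reject_class(0.80, 0.001, REJECT_LEVEL)
ambiguous_label, ambiguous_post = classify_with_reject_class(0.21, 0.20, REJECT_LEVEL)
```

The tail sample is rejected and the sample in the core of class ω_0 is accepted. The ambiguous sample is still classified, because the reject level lies below both class curves in the overlap zone.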
The a posteriori probabilities (Eq. 25a-25c) derived from the PDFs in Figure 4 are shown
in Figure 5. When the PDFs are overlapped (Figure 5a), the reject class is useful at the
ends of the PDFs and also in the ambiguous zone if the distribution defining the reject
class is higher than the PDFs of the classes. However, samples in the ambiguous zone
will not be rejected if the reject class is below the PDFs of the classes (as usually
happens) because the a posteriori probability of the reject class will always be smaller
than the probability of classification (i.e. P(ω_r|ŷ_i) < max(P(ω_0|ŷ_i), P(ω_1|ŷ_i))). For
non-overlapped distributions (Figure 5b), between the PDFs the probability that the sample
belongs to the reject class is the largest of the three a posteriori probabilities, so a sample
in that zone would be rejected. This is the behaviour to be expected because there are no
training samples with such ŷ_i values. The same happens at the extremes of these
distributions (as at the extremes of overlapped distributions), where samples with such ŷ_i
will be rejected.
[Figure 5: panels a and b plot P(ω_0|ŷ), P(ω_1|ŷ) and P(ω_r|ŷ) against ŷ, from -1.5 to 2.5.]
Figure 5. A posteriori probabilities for class ω_0 (green), class ω_1 (yellow), and the reject class ω_r (red) for a.
overlapped classes and b. non-overlapped classes presented in Figure 4.
Different adaptations of the reject class have been described. Pereira et al. [18]
introduced an indecision class in order to reject the samples, but this reject class is not
introduced in the evaluation of a posteriori probabilities nor in the conditional risk. Their
approach may be regarded as equivalent to introducing a reject threshold. Landgrebe et al. [10]
considered the problem as one in which there is a well-defined target class and a poorly
defined outlier class, and introduced the reject class only in the prediction step. In other
words, in the training step there are two classes (target and outlier) and in the
prediction or classification step an additional class is used, the reject class. This class is
assumed to be uniformly distributed across the training classes' domains, and it is
included in the evaluation of the probabilities. The criticism arises because in this
approach the a priori probabilities used in the training step are different from the a
priori probabilities used in the prediction step. Muzzolini et al. [11] introduced an
ambiguous class to reduce the probability of an erroneous classification. This class
identifies those samples that are classified as belonging to two or more classes with
(near) equal probability. In addition, they introduced the reject distance to identify
those samples that have little or no similarity with the predefined classes. The reject
thresholds to identify such samples are determined by fixing the probability with which the
samples are classified as belonging to the distance reject class. This is equivalent to
rejecting the samples predicted outside a confidence interval fixed around each PDF (reject
distance) [11].
3.5.2 Reject option as a threshold
A second way of introducing the reject option is to use a reject threshold.
Classification based on a posteriori probability
The a posteriori probabilities for each class over the ŷ domain are calculated (Eqs. 17a-
17b) using the PDFs (Figure 6a-6b). For such a posteriori probabilities a threshold of
rejection is set at (1 - t) (Figure 6c-6d), so that a sample is rejected if the maximum a
posteriori probability is lower than this threshold value [22].
The classification rule based on a posteriori probabilities with reject option becomes:
classify into
$$\begin{aligned} &\text{class } \omega_0 \ \text{if} \ P(\omega_0\mid\hat{y}_i) > \max\big(P(\omega_1\mid\hat{y}_i),\, (1 - t)\big) \\ &\text{class } \omega_1 \ \text{if} \ P(\omega_1\mid\hat{y}_i) > \max\big(P(\omega_0\mid\hat{y}_i),\, (1 - t)\big) \end{aligned} \tag{28}$$
and reject the sample if:
$$(1 - t) > \max\big(P(\omega_0\mid\hat{y}_i),\, P(\omega_1\mid\hat{y}_i)\big) \tag{29}$$
If the PDFs of the two classes are overlapped, the reject threshold divides the ŷ domain
into two regions: acceptance region and reject region (Figure 6c and 6d).
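Eqs. (28) and (29) translate into a short decision function. The sketch below is illustrative: the function name and the example probabilities are assumptions, and exact ties (which the equations leave undefined) are resolved arbitrarily.

```python
def classify_with_threshold(post0, post1, t):
    """Eqs. (28)-(29): reject when the largest posterior is below (1 - t)."""
    accept_level = 1.0 - t
    if max(post0, post1) < accept_level:     # Eq. (29)
        return "reject"
    # Eq. (28); an exact tie between the two posteriors is sent to w1 here
    return "w0" if post0 > post1 else "w1"

ambiguous = classify_with_threshold(0.55, 0.45, t=0.3)   # (1 - t) = 0.7
clear = classify_with_threshold(0.90, 0.10, t=0.3)
no_effect = classify_with_threshold(0.55, 0.45, t=0.6)   # (1 - t) = 0.4
```

The ambiguous sample is rejected when (1 - t) = 0.7, whereas with (1 - t) = 0.4 the same sample is classified, since in a two-class problem the largest a posteriori probability is never below 0.5.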
[Figure 6: panels a and b plot p(ŷ|ω_0)·P(ω_0) and p(ŷ|ω_1)·P(ω_1) against ŷ; panels c and d plot P(ω_0|ŷ) and P(ω_1|ŷ) against ŷ together with the horizontal reject threshold (1 - t). Regions along the ŷ axis in panel c: Acceptance, Reject, Acceptance. Regions in panel d: Acceptance, Acceptance.]
Figure 6. (a-b) PDFs for overlapped and non-overlapped classes. (c-d) A posteriori probabilities with reject
threshold derived from the PDFs in a-b.
According to Chow [23], the optimal reject threshold (t) is given by
$$t = \frac{\lambda_r - \lambda_c}{\lambda_m - \lambda_c} \tag{30}$$
where λ_r is the cost of rejecting a sample and λ_c and λ_m are the costs of a correct
classification and of a misclassification, respectively. Generally λ_m > λ_r > λ_c, and in most cases
λ_c = 0 (i.e. there is no cost if the classification is correct) [23, 26].
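As a worked instance of Eq. (30), the snippet below computes the threshold for one invented set of costs. The cost values and the function name `chow_threshold` are assumptions chosen only to show the arithmetic.

```python
def chow_threshold(cost_reject, cost_error, cost_correct=0.0):
    """Eq. (30): t = (lambda_r - lambda_c) / (lambda_m - lambda_c)."""
    return (cost_reject - cost_correct) / (cost_error - cost_correct)

# Assumed costs: a rejection costs one fifth of a misclassification, a correct answer is free
t = chow_threshold(cost_reject=0.2, cost_error=1.0)
accept_level = 1.0 - t      # minimum a posteriori probability required to classify
```

With these costs t = 0.2, so a sample is classified only if its largest a posteriori probability is at least 0.8. Making rejection cheaper lowers t and therefore raises the level required for acceptance.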
AlimitationofclassificationbasedonEq.(28)andEq.(29)isthatthethresholdhasno
effectifthePDFsarenotoverlapped(Figures6band6d),sincethereisnotasignificant
ambiguous region. Note also that for the reject option work properly, (1––t) must be
higherthan0.5.If(1––t)islowerthan0.5theprobabilitytoclassifythesampleinoneof
theclasseswillalwaysbehigherthantherejectthresholdsothattheclassificationrule
based on a posteriori probabilities with reject option is simply the classical Bayes rule
(seeFigure6a).Inaddition,theuseoftheaposterioriprobabilityoftheclassandthe
reject threshold for rejection ignores the possibility of having samples from unknown
classes. This situation may be partially overcomed by setting limits on the PDFs (High
Limit and Low Limit in Figure 3 as will be discussed on chapter 4). These limits avoid
classifyingsamplesthatlieontheextremesoftheclasses.
Other approaches have been proposed to implement rejection based on thresholds.
Fumeraetal.[22]proposedtosetanindividualthresholdforeachclass,thusavoiding
rejecting too many samples of one of the classes if the number of samples of both
classesisnotbalanced.Tortorellaetal.[19,21]consideredalsotwothresholds,which
wereoptimizedbymaximizingtheclassificationutilityfunction.Thisisanalternativeto
the Chow’’s approach. Chow takes into account costs and minimizes the risk [18]. In
order to optimize the reject threshold, Li et al. [27] proposed to control the error
insteadoffindingatradeͲoffbetweenrejectionrateanderrorrate.Theyreformulated
theproblemas:givenanerrorrateforeachclass,designaclassifierwiththesmallest
rejection rule. A similar alternative was proposed by Hanczar et al. [28], although they controlled the conditional error rate of the classifier rather than the error rate. Kressel et al. [29] optimized the reject threshold to obtain a minimal false positive rate, and Herbei et al. [30] presented the rejection cost t as the upper bound on the conditional probability of misclassification, optimized by minimizing the error rate while also keeping the reject rate minimal. These approaches often ignore the detection of outliers and the rejection of samples when the classes do not overlap.
Further improvements on the application of the reject option are discussed in chapter 4.
3.6 Implications of the reject option in classification performance evaluation
When a classifier involves the reject option, the performance measures of the classifier must be properly interpreted in order to take into account that samples can be rejected. p-DPLS is a binary classifier. This means that the classification decision is a choice between two classes, ω1 and ω0, that can be generically called the Positive (P) class and the Negative (N) class, respectively. Hence, the result from p-DPLS can be that the sample is correctly classified in its class, either in class ω1 (True Positive, TP, i.e., a positive sample that is classified as positive) or in class ω0 (True Negative, TN, i.e., a negative sample that is classified as negative), or incorrectly classified, either in class ω1 (False Positive, FP, i.e., a negative sample that is incorrectly classified as positive) or in class ω0 (False Negative, FN, i.e., a positive sample that is classified as negative) (Table 1). When the reject option is implemented, the possible outputs of the classifier include that the sample may be rejected. A positive object that is rejected is called Reject Positive (RP) and, equivalently, a negative object that is rejected is called Reject Negative (RN). A sample may have been
rejected because its classification was not reliable enough (the risk was too high) or because it was flagged as an outlier (see chapters 4 and 5).
Table 1. Confusion matrix: outcomes of a binary classifier, as described by Kohavi and Provost in [31].
                     True class
Predicted            Positive (ω1)    Negative (ω0)
Positive (ω1)        TP               FP
Negative (ω0)        FN               TN
Rejected (ωr)        RP               RN
The objective of p-DPLS, or of any other classifier, is to classify correctly as many future samples as possible, i.e., to minimize the number of false positives, false negatives and rejections. For simplicity, this is generally evaluated by the accuracy or the error rate of the classification model.
Accuracy is defined as the percentage of samples correctly classified:
$$\mathrm{Accuracy} = \frac{TN + TP}{TN + FN + TP + FP} \qquad (31)$$
If rejection is not an option, all samples are classified and the denominator of Eq. (31) is equal to the number of samples I submitted to the classifier (i.e. I = TN+FN+TP+FP). Hence, classically, accuracy is calculated by dividing the number of samples correctly classified by the total number of samples, I. When rejection is an option, Eq. (31) is still valid, but note that the denominator is no longer equal to the total number of samples I, since some of them may have been rejected (i.e. I = TN+FN+TP+FP+RP+RN). Hence, the accuracy must be interpreted as the percentage of correctly classified samples with respect to the number of samples for which the classifier issued a class label [22]. Note that this is the most meaningful interpretation, although it is rarely considered in works with reject option, in which the accuracy is calculated by dividing the number of
samples classified correctly by the number of samples submitted to the classifier, whether rejected or not [21].
This interpretation matters because the experimenter wants the class label issued by the classifier to be correct. Hence, the performance measure should reflect, for the samples to which the classifier assigned a class, the percentage for which it did so correctly or wrongly. In this way, the accuracy of the classifier with reject option can be higher than the accuracy of the classifier without reject option (note that if the accuracy were defined over the total number of samples, classifiers with reject option would never perform better than models without reject option, because the number of samples well classified using the reject option would be equal or lower).
Similarly, the error rate is defined as the percentage of samples that are assigned to the wrong class [32]:
$$\mathrm{Error\ rate} = \frac{FN + FP}{TN + FN + TP + FP} \qquad (32)$$
Like the accuracy, the error rate must also be reinterpreted when rejection is an option. Hence, the denominator of Eq. (32) is the total number of samples classified (without taking into account the rejected ones).
The sensitivity and the specificity are defined in similar terms [33]:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (33)$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (34)$$
The sensitivity is evaluated as the number of positive samples (class ω1) correctly classified with respect to the number of positive samples classified. Note that the denominator expression is maintained in both cases: without reject option the number of
positive samples classified is the total number of positive samples, whereas with reject option the number of positive samples classified (TP+FN) may differ from the total number of positive samples (TP+FN+RP). An analogous situation arises for the negative samples with the specificity. In this measure, the number of negative samples classified may differ from the total number of negative samples, since some of them may be rejected when the reject option is implemented.
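The reinterpretation of Eqs. (31) to (34) can be sketched as follows. This is an illustrative example under stated assumptions: the function names, the dictionary keys and the counts are hypothetical, and the reject rate is included only for comparison since the text does not define it as an equation.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when nothing was counted."""
    return numerator / denominator if denominator else float("nan")


def reject_aware_metrics(tp, tn, fp, fn, rp=0, rn=0):
    """Performance measures computed over the samples that received a label."""
    classified = tp + tn + fp + fn      # denominator of Eqs. (31) and (32)
    submitted = classified + rp + rn    # I, all samples given to the classifier
    return {
        "accuracy": _ratio(tp + tn, classified),        # Eq. (31)
        "error_rate": _ratio(fp + fn, classified),      # Eq. (32)
        "sensitivity": _ratio(tp, tp + fn),             # Eq. (33)
        "specificity": _ratio(tn, tn + fp),             # Eq. (34)
        # Accuracy over all submitted samples, the alternative convention
        # mentioned in the text, and the fraction of rejected samples.
        "accuracy_over_submitted": _ratio(tp + tn, submitted),
        "reject_rate": _ratio(rp + rn, submitted),
    }


# Hypothetical counts for 100 submitted samples, 5 of them rejected.
metrics = reject_aware_metrics(tp=45, tn=47, fp=1, fn=2, rp=3, rn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the accuracy over the classified samples is higher than the accuracy over the submitted samples, which is the difference between the two conventions discussed above.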
Furthermore, when the reject option is introduced, new performance parameters appear [33]:
$$\mathrm{Positive\ predictive\ value} = \frac{TP}{RP} \qquad (35)$$
$$\mathrm{Negative\ predictive\ value} = \frac{TN}{RN} \qquad (36)$$
However, redefining the performance parameters is not enough to accurately evaluate classifiers with reject option. Note that a model that refuses to classify most of the samples but classifies the remaining few correctly will have a high accuracy; however, it is not useful. In addition, the drawback of parameters like accuracy is that, individually, they are not enough to evaluate all the aspects that summarize the performance of the classifier (i.e. correct classifications, misclassifications and rejections). For that purpose, the cost is a more useful parameter. It is defined as:
$$\mathrm{Cost} = \lambda_m N_m + \lambda_r N_r + \lambda_c N_c \qquad (37)$$
where λm is the cost of a wrong classification, λr is the cost of rejecting a sample, λc is the cost of a correct classification and Nm, Nr, Nc are the numbers of samples misclassified, rejected and correctly classified, respectively. The Cost makes it possible to take into account the rejections and, in addition, the cost that each classification implies [3, 17]. These costs (λ) must be optimized to keep the efficiency of the classifier.
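A short sketch of Eq. (37) shows why the cost separates classifiers that accuracy alone cannot. The function name, the default cost values and the three sets of counts below are hypothetical and chosen only for illustration.

```python
def classification_cost(n_misclassified, n_rejected, n_correct,
                        cost_misclass=1.0, cost_reject=0.2, cost_correct=0.0):
    """Total cost of Eq. (37) for a set of classification outcomes."""
    return (cost_misclass * n_misclassified
            + cost_reject * n_rejected
            + cost_correct * n_correct)


# Three hypothetical classifiers evaluated on the same 100 samples.
no_reject = classification_cost(n_misclassified=8, n_rejected=0, n_correct=92)
some_reject = classification_cost(n_misclassified=3, n_rejected=5, n_correct=92)
mostly_reject = classification_cost(n_misclassified=0, n_rejected=95, n_correct=5)

print(no_reject, some_reject, mostly_reject)
```

Under these assumed costs, the classifier that rejects almost everything has no misclassifications among the few samples it labels, yet it has the highest cost of the three, which matches the argument above.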
3.7 Conclusions
Probabilistic Discriminant Partial Least Squares (p-DPLS) is a binary classifier that has some advantages over other versions of DPLS: 1) it assumes neither an arbitrary classification threshold for the ŷ's nor a Gaussian distribution for the ŷ's of each class, and 2) it assigns the class label based on the Bayes classification rule of the a posteriori probability or, more generally, of minimum risk.
However, the strict application of the Bayes rule forces the classifier to always assign the sample to one of the predefined classes. This is a limitation for those samples that may be outliers or ambiguous, and hence have a large chance of being misclassified. The danger of misclassification can be reduced by implementing the reject option. In this chapter, two approaches to implementing the reject option in p-DPLS have been discussed. One of them introduces the reject option as a reject class. The second one introduces the reject option as a threshold. The best approach to introducing the reject option is to set a reject threshold. With this approach, the a priori probabilities or shapes of an extra class do not need to be assumed. However, the reject option set by the reject threshold alone is not able to reject outliers; so, additional constraints must be considered.
It is also essential for any classifier that its classification performance be evaluated correctly. A general approach is to use the accuracy or the error rate. These parameters, however, have the weaknesses that they consider all incorrect decisions (or correct decisions) equally risky and that they treat all outcomes as equally likely [26]. Since the rejections are not evaluated, these parameters are not useful for evaluating classifiers with reject option. For such classifiers, the Cost parameter is a better approach.
References
[1] Pérez, N.F., J. Ferré, and R. Boqué, Calculation of the reliability of classification in Discriminant Partial Least-Squares Classification. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 122-128.
[2] Bayes, T., An Essay towards solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 1763. 53: p. 370-418.
[3] Chow, C.K., An optimum character recognition system using decision functions. IRE Trans. Electronic Computers, 1957. 16: p. 247-254.
[4] Eriksson, L., et al., Multi- and Megavariate Data Analysis. Principles and Applications. 2001: Umetrics AB.
[5] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[6] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, S. Kotz and N.L. Johnson, Editors. 1985, Wiley: New York. p. 581-591.
[7] Gemperline, P.J., L.D. Webber, and F.O. Cox, Raw Materials Testing Using Soft Independent Modelling of Class Analogy Analysis of Near-Infrared Reflectance. Anal. Chem., 1989. 61: p. 138-144.
[8] Webb, A., Statistical Pattern Recognition, 2nd edition, ed. Wiley. 2002, Malvern, UK.
[9] Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification (2nd edition), ed. Wiley-Interscience. 2001, New York.
[10] Landgrebe, T., et al., The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 2006. 27: p. 908-917.
[11] Muzzolini, R., Y.-H. Yang, and R. Pierson, Classifier design with incomplete knowledge. Pattern Recognition, 1998. 31: p. 345-369.
[12] Botella, C., J. Ferré, and R. Boqué, Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009. 80: p. 321-328.
[13] Hills, M., Allocation Rules and their Error Rates. Journal of the Royal Statistical Society. Series B (Methodological), 1966. 28: p. 1-31.
[14] Bishop, C.M., Pattern Recognition and Machine Learning, ed. Springer. 2006, New York.
[15] Denœux, T., Analysis of evidence-theoretic decision rules for pattern classification. Pattern Recognition, 1997. 30: p. 1095-1107.
[16] Lachenbruch, P.A. and M. Goldstein, Discriminant Analysis. Biometrics, 1979. 35: p. 69-85.
[17] Anderson, T.W., Introduction to Multivariate Statistical Analysis. 1958, New York: John Wiley and Sons.
[18] Santos-Pereira, C.M. and A.M. Pires, On optimal reject rules and ROC curves. Pattern Recognition Letters, 2005. 26: p. 943-952.
[19] Tortorella, F., An optimal reject rule for binary classifiers. In: Ferri, F.J. et al. (Eds.), Advances in Pattern Recognition: Joint IAPR International Workshops, SSPR 2000 and SPR 2000, Lecture Notes in Computer Science, vol. 1876. Springer-Verlag, Heidelberg. 2000: p. 611-620.
[20] Bishop, C.M., Pattern Recognition and Machine Learning. Springer Science+Business Media. 2006, Singapore.
[21] Tortorella, F., A ROC-based reject rule for dichotomizers. Pattern Recognition Letters, 2005. 26: p. 167-180.
[22] Fumera, G., F. Roli, and G. Giacinto, Multiple Reject Thresholds for Improving Classification Reliability, in Advances in Pattern Recognition, SSPR&SPR, Editor. 2000, Springer: Berlin - Heidelberg. p. 863-871.
[23] Chow, C.K., On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970. 16: p. 41-46.
[24] Fumera, G., I. Pillai, and F. Roli, Classification with Reject Option. Proceedings of the 12th International Conference on Image Analysis and Processing (ICIAP'03), 2003.
[25] Landgrebe, T., et al., A combining strategy for ill-defined problems, in Fifteenth Ann. Sympos. of the Pattern Recognition Association of South Africa. 2004.
[26] Brown, C.D. and H.T. Davis, Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 2006. 80: p. 24-38.
[27] Li, M. and I.K. Sethi, Confidence-based classifier design. Pattern Recognition, 2006. 39: p. 1230-1240.
[28] Hanczar, B. and E.R. Dougherty, Classification with reject option in gene expression data. Bioinformatics, 2008. 24: p. 1889-1895.
[29] Kressel, U., F. Lindner, and C. Wöhler, Classification System with reject class. 2004, DaimlerChrysler AG (DE): United States.
[30] Herbei, R. and M.H. Wegkamp, Classification with reject option. The Canadian Journal of Statistics, 2006. 34: p. 709-721.
[31] Kohavi, R. and F. Provost, Glossary of Terms. Machine Learning, Kluwer Academic Publishers, 1998. 30: p. 271-274.
[32] Smith, C.A.B., Some examples of discrimination. Ann. Eugen., 1947. 13: p. 272-282.
[33] Bradley, A.P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997. 30: p. 1145-1159.
CHAPTER 4 Classification from microarray data using p-DPLS with reject option
Talanta, 2009, Vol. 80 (1): 321-328
Microarrays allow the expression of thousands of genes in a cell to be evaluated simultaneously. One of the most relevant applications of these gene expressions is to classify the samples (e.g. cells or tissues) into one of the classes of interest. Discriminant Partial Least-Squares (DPLS) is often used for such a purpose. However, most published results report the straight application of this method, with disregard for the quality of each individual prediction and the possibility of detecting prediction outliers. The aim of this chapter is to improve DPLS for classifying microarray data. Firstly, we implement a new version of DPLS called probabilistic Discriminant Partial Least Squares (p-DPLS). This method bases the classification of a sample on kernel probability density functions (PDFs) and the Bayes rule of a posteriori probability. Secondly, a reject option is introduced so that the classifier can reject samples in the ambiguity region, based on Chow's rule, and can reject samples outside the defined limits of the classes. The ambiguity region is the zone where the PDFs that characterize each one of the classes overlap. In that zone, the model cannot discriminate well enough whether a sample belongs to one class or to the other, either because of limitations of the PLS model or because the samples actually share characteristics of the modelled classes. Hence, there is a high risk that any attempt to classify that sample could result in a misclassification. The second possibility of rejection is implemented at the ends of the classes' domains and also between PDFs for non-overlapped classes. Samples in those regions have extreme predictions, outside the limits set for the classes, so they may be considered as outliers. For such samples, we prefer to refuse to classify them instead of taking the risk of misclassifying them. These two approaches will be detailed and discussed in the methods section.
The existence of a reject option increases the experimenter's confidence in the classification rule and improves the accuracy of the final classification models. Note that with reject option only those samples whose classification is reliable are actually classified, while the samples either outside the limits or in the ambiguity region, which could lead to misclassifications, are rejected.
The p-DPLS with reject option was tested with two public datasets. With the Human Cancers dataset, the accuracy measured by leave-one-out cross-validation was improved from 97% to 99% when compared to p-DPLS without reject option. For the Breast Cancer dataset, the method could reject 100% of the test samples submitted to the classifier that did not belong to any of the modelled classes. These samples would have been misclassified if the reject option had not been considered.
This work is presented in paper form, as published in Talanta 2009, Vol. 80 (1) 321-328.
Classification from microarray data using probabilistic discriminant partial least squares with reject option
Cristina Botella, Joan Ferré*, Ricard Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007 Tarragona, Spain
* Corresponding author: [email protected]
Talanta 2009, Vol. 80 (1) 321-328 (Edited for format)
Abstract
Microarrays are used to simultaneously determine the expressions of thousands of genes. An important application of microarrays is in the classification of samples into classes of interest (e.g. either healthy cells or tumour cells). Discriminant Partial Least-Squares (DPLS) has often been used for this purpose. In this paper, we describe an improvement to DPLS that uses kernel-based probability density functions and the Bayes rule to classify samples whilst keeping the option of not classifying the sample if this cannot be done with sufficient confidence. With this approach, those samples outside the boundaries of the known classes or from the ambiguity region between classes are rejected and only samples with a high probability of being correctly classified are indeed classified. The optimal model is found by simultaneously minimizing the misclassification and rejection costs. The method (p-DPLS with reject option) was tested with two datasets. For the Human Cancers dataset the accuracy (obtained by leave-one-out cross-validation) was improved from 97% to 99% when compared to p-DPLS without reject option. For the Breast Cancer dataset, p-DPLS with reject option was able to reject 100% of the test samples that did not belong to any of the modelled classes. These samples would have been misclassified if the reject option had not been considered.
4.1 Introduction
Supervised classification is increasingly being applied to microarray gene expression data in order to predict tumour types [1-3], to differentiate between healthy and tumour samples [4-6] and to differentiate between pharmacological mechanisms [7], among other applications. Microarray data are characterized by thousands of variables (genes) and few samples, resulting in high redundancy and a high number of non-informative measurements. There has been a lot of interest in using factor-based multivariate classification methods such as Discriminant Partial Least Squares (DPLS) to analyze these data [8, 9]. DPLS uses a few latent variables rather than many measured variables, and this brings with it a series of advantages. DPLS takes variable correlations into account, filters noise and leads to classification rules with good predictive performance, especially when DPLS is implemented together with variable selection methods. DPLS has been used to differentiate between samples before and after chemotherapy [10], to determine the different states of a breast cancer tumour [11], to predict the efficacy of a drug by using expression data biomarkers [12], and to predict the quality of DNA-microarray spots [13].
Like other classification rules, DPLS must have two main qualities: it must provide reliable classifications of forthcoming samples and it must minimize the number of misclassifications (i.e. the expected error rate). Both of these are improved if the classifier is allowed to reject doubtful samples instead of always being forced to classify them in one of the modelled classes. By classifying only the most well-defined cases, both the accuracy of the classifier and the reliability of each classification are improved.
In this paper we implement the reject option in the recently developed p-DPLS classifier and show how it can be used for microarray data classification. p-DPLS is a variant of DPLS which uses kernel functions to calculate a probability density function (PDF) for each class. This allows a flexible implementation of the Bayes rule for classification, and also provides a measure of the reliability of the classification. Reliability is a primary concern in statistical classification, especially when this classification is used in critical health applications such as cancer diagnosis [14], an issue which has also led to several other studies [15].
In classification, rejection is advantageous when: (a) the new sample does not belong to any of the trained classes, (b) the new sample belongs to one of the classes but is very different from the samples used for training the classifier, or (c) the sample is in the boundary region between classes. Situation (a) occurs when the sample is an outlier. Forcing the classifier to decide among the modelled classes will produce a classification error (e.g. a cell does not belong to any of the modelled cell types but it is classified as one of them). Situation (b) typically arises when the sampling of the training samples is incomplete or not representative. Finally, situation (c) may arise, for example, because of the limited discriminative power of the measured variables or because the classification algorithm has limited discriminative power. Although samples in situations (b) and (c) might finally be classified correctly, they might also be classified incorrectly because they are unique samples or because they are ambiguous samples that can belong to either of the classes, respectively.
The reject option aims to overcome situations (a) to (c) by rejecting the sample and not classifying it when the probability of error is too high. This is a safeguard against errors and improves the accuracy of the classifier, which is evaluated as the percentage of samples correctly classified among the number of samples classified [16]. This in turn leads to greater confidence in the samples that are finally classified. The reject option
can be fine-tuned in order to avoid rejecting too many samples that would otherwise be classified correctly. Since too high a rejection rate would decrease the usefulness of the classifier, a compromise must be reached between improving the accuracy and reducing the usefulness of the classifier. There has been extensive research into the theoretical aspects of the reject option [14, 17-28], most of which relates to Chow's reject option [29], which implemented the reject option for the Bayes rule. Chow's reject option has recently been applied to microarray expression data [30].
There are still two limitations to jointly applying the Bayes and Chow rules. First, they are not adequate for the extreme (outlying) samples (situations (a)-(b)), which are typically found at the extremes of the probability density functions (PDFs). These samples must be rejected according to a different criterion. Second, both rules require knowledge of the a priori probabilities and the PDFs of the classes [31], which makes applying these rules more difficult. In this paper, the first limitation is overcome by including distance-based thresholds, which is equivalent to selecting a confidence interval around each class and rejecting samples outside this interval [17]. The second limitation is overcome by calculating PDF-like functions in p-DPLS [32], which makes an approximate Bayesian classification easier.
4.2 Methods
4.2.1 Probabilistic DPLS
The DPLS method applies Partial Least-Squares (PLS) regression to binary classification problems, in which the dependent variable y codifies the class of each sample [8, 33]. A DPLS model is calculated by regressing y on X using the adequate number of factors.
For microarray gene expression data, X is an N×P matrix of N samples and P gene expressions and y is an N×1 vector of ones and zeros, where the integer 0 codifies the sample as belonging to class ω0 (e.g. "cancer of type I") and the integer 1 codifies the sample as belonging to class ω1 (e.g. "cancer of type II"). For a sample i, the value predicted by the PLS model is ŷi = xiᵀb, where the b's are the regression coefficients for the model of A factors and the adequate pre-processing is implicit (e.g. if the b's had been calculated from mean-centered data, then xi should be mean-centered, and the predicted ŷi should be unprocessed accordingly). With the coding of y, the prediction for a sample should be close to 0 if the sample belongs to class ω0, and it should be close to 1 if the sample belongs to class ω1. In order to better define the cut-off value between classes, Pérez et al. [32] developed p-DPLS, a probabilistic version of DPLS in which the uncertainty of the predicted value ŷ is accounted for in the calculation of the model. This method is described here for completeness. The method starts by calculating a DPLS model of A factors with X and y. Then, this model is used to predict the training samples and, for each training sample i, a Gaussian function centred at the predicted value ŷi is calculated as:
$$f_i(\hat{y}) = \frac{1}{SEP_i \sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{\hat{y} - \hat{y}_i}{SEP_i}\right)^2} \qquad (1)$$
$$SEP_i = RMSEC \sqrt{1 + h_i} \qquad (2)$$
$$RMSEC = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n - A - \delta}} \qquad (3)$$
where SEPi is the standard error of prediction for sample i, hi is the leverage of the sample, RMSEC is the root mean square error of calibration, yi is the known class of the training sample i (i.e. value 0 for a sample of class ω0 and value 1 for a sample of class ω1) and δ is 1 if the data have been centred and 0 if they have not. Figure 1 shows the
Gaussian functions calculated from the predictions of three training samples of class ω0 and four samples of class ω1. Note that the width of the Gaussian kernel for sample i depends on SEPi, which is particular to that sample and depends on the relative position of the sample in the multivariate space. Then, for classes ω0 and ω1, a PDF is calculated as the average of the individual kernel functions of the training samples of each class:
$$p(\hat{y} \mid \omega_0) = \frac{1}{n_0} \sum_{i=1}^{n_0} f_i(\hat{y}) \qquad (4)$$
$$p(\hat{y} \mid \omega_1) = \frac{1}{n_1} \sum_{i=1}^{n_1} f_i(\hat{y}) \qquad (5)$$
where n0 and n1 are the numbers of samples of class ω0 and class ω1, respectively.
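The construction of a class PDF from per-sample Gaussian kernels can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the RMSEC value, the leverages and the training predictions are all hypothetical, and the PLS step that would produce them is assumed to have been run already.

```python
import math


def sep(rmsec, leverage):
    """Standard error of prediction of one sample: RMSEC * sqrt(1 + h_i)."""
    return rmsec * math.sqrt(1.0 + leverage)


def gaussian_kernel(y, y_hat_i, sep_i):
    """Gaussian centred at the training prediction y_hat_i with width sep_i."""
    z = (y - y_hat_i) / sep_i
    return math.exp(-0.5 * z * z) / (sep_i * math.sqrt(2.0 * math.pi))


def class_pdf(y, predictions, seps):
    """Average of the kernels of the training samples of one class."""
    kernels = [gaussian_kernel(y, p, s) for p, s in zip(predictions, seps)]
    return sum(kernels) / len(kernels)


# Hypothetical training predictions and leverages for three samples of one
# class coded as 0, with an assumed RMSEC of 0.2.
predictions_w0 = [-0.05, 0.10, 0.02]
seps_w0 = [sep(0.2, h) for h in (0.10, 0.20, 0.15)]

print(class_pdf(0.0, predictions_w0, seps_w0))  # density near the class centre
print(class_pdf(1.0, predictions_w0, seps_w0))  # much lower density far from it
```

Because each kernel integrates to one, their average is itself a density, and a sample with a high leverage contributes a wider and flatter kernel than a sample close to the centre of the calibration set.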
Figure 1. Simulated PDFs of class ω0 and class ω1 obtained from Equations (4) and (5). The kernel functions (Eq. (1)) are centred on the prediction ŷi of each training sample. According to the code assigned to the classes, the sample predictions of class ω0 and class ω1 should be located around the values 0 and 1, respectively.
4.2.2 Bayes rule for classification
In p-DPLS, the predicted class for sample i is obtained using the Bayes rule. The prediction ŷi for that sample is used to obtain the a posteriori probabilities P(ω0|ŷi) and P(ω1|ŷi). These are the probabilities that the sample belongs either to class ω0 or to class ω1, once it is known that the sample's prediction is ŷi. For the two-class classification problem:
$$P(\omega_0 \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_0) \, P(\omega_0)}{p(\hat{y}_i)} \qquad (6a)$$
$$P(\omega_1 \mid \hat{y}_i) = \frac{p(\hat{y}_i \mid \omega_1) \, P(\omega_1)}{p(\hat{y}_i)} \qquad (6b)$$
where p(ŷi|ω0) and p(ŷi|ω1) are the conditional probabilities evaluated from the PDFs of classes ω0 and ω1, and P(ω0) and P(ω1) are the a priori probabilities. Both a priori probabilities may be estimated as the proportion of samples of each class in the training set, provided that the set is representative of the overall population. That is, P(ω0) = n0/N and P(ω1) = n1/N, where N = n0 + n1. The denominator of Equations (6a) and (6b) is:
$$p(\hat{y}_i) = p(\hat{y}_i \mid \omega_0) \, P(\omega_0) + p(\hat{y}_i \mid \omega_1) \, P(\omega_1) \qquad (7)$$
The Bayes rule assigns the sample to the class in which it has the highest a posteriori probability [31]. The rule is:
$$\text{Assign the sample to } \begin{cases} \text{class } \omega_0 & \text{if } P(\omega_0 \mid \hat{y}_i) > P(\omega_1 \mid \hat{y}_i) \\ \text{class } \omega_1 & \text{if } P(\omega_1 \mid \hat{y}_i) > P(\omega_0 \mid \hat{y}_i) \end{cases} \qquad (8)$$
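Equations (6a) to (8) can be sketched in a few lines. The function names and the example densities are hypothetical; the class densities are assumed to have been evaluated beforehand from the class PDFs, and the handling of an exact tie is an arbitrary choice because the rule above does not specify it.

```python
def posteriors(p_y_w0, p_y_w1, n0, n1):
    """A posteriori probabilities of both classes for one prediction."""
    prior0 = n0 / (n0 + n1)    # P(w0), proportion of class-0 training samples
    prior1 = n1 / (n0 + n1)    # P(w1), proportion of class-1 training samples
    evidence = p_y_w0 * prior0 + p_y_w1 * prior1    # denominator, Eq. (7)
    if evidence == 0.0:
        raise ValueError("both class densities are zero at this prediction")
    return p_y_w0 * prior0 / evidence, p_y_w1 * prior1 / evidence


def bayes_class(post0, post1):
    """Class with the highest posterior; a tie goes to class 1 (arbitrary)."""
    return 0 if post0 > post1 else 1


# Hypothetical class densities at one prediction, with 30 and 20 training samples.
post0, post1 = posteriors(p_y_w0=1.6, p_y_w1=0.4, n0=30, n1=20)
print(post0, post1, bayes_class(post0, post1))
```

The two posteriors always sum to one, so the rule always returns a class, which is the behaviour that the reject option is meant to relax.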
Although this rule is optimal in the sense that no other rule can yield a lower error probability, it is not always satisfactory. For example, when ŷi is at one of the extremes of the PDF (Figure 2), both p(ŷi|ω0) and p(ŷi|ω1) are low, and the products p(ŷi|ω0)·P(ω0) and p(ŷi|ω1)·P(ω1) are also low, but the a posteriori probability for one of the classes (the ratio in Equations 6a and 6b) is high. This means that the further the prediction ŷi is from one class, the more likely it will be allocated to the other class. This is a reasonable result since the classifier only expects to receive samples from the two modelled classes. In most multivariate applications, however, samples from non-modelled classes (outliers) may also be inadvertently submitted to the classifier. The predictions for those samples will most probably be found at the tails of a PDF and hence give a misleadingly high a posteriori probability for one of the classes. Consequently, forcing the two-class Bayes rule to classify any input sample may involve a high risk because outliers may be erroneously classified in one of the modelled classes.
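A small numerical illustration of this effect, under simplifying assumptions: two single Gaussians stand in for the kernel-based class PDFs, the priors are equal, and the means, widths and the outlying prediction are hypothetical values chosen for the example.

```python
import math


def normal_density(y, mean, sd):
    """Gaussian density, used here as a stand-in for a class PDF."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Assumed class PDFs: class w0 around 0 and class w1 around 1, equal priors.
y_outlier = 3.0
d0 = normal_density(y_outlier, 0.0, 0.2)
d1 = normal_density(y_outlier, 1.0, 0.2)
post1 = d1 * 0.5 / (d0 * 0.5 + d1 * 0.5)

print(d0, d1)   # both densities are vanishingly small at the outlier
print(post1)    # yet the posterior for class w1 is essentially 1
```

The ratio hides how small both densities are, which is why a criterion based on the a posteriori probability alone cannot flag such a sample.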
[Figure 2: panels a and b plot p(ŷi|ω0)·P(ω0) and p(ŷi|ω1)·P(ω1) against ŷ, with the regions assigned to ω0, ω1 and reject delimited by the limits LL0, HL0, LL1 and HL1.]
Figure 2. Possible distributions of class ω0 and class ω1 with the distance reject limits. a. Overlapped classes. b. Well separated classes. ● and □ indicate possible predictions of unknown samples for which the Bayes rule gives questionable results. LL0, HL0, LL1 and HL1 are the limits for rejection based on the distance reject option.
Another situation in which the usefulness of the Bayes rule is limited occurs when the predicted value ŷ is in the boundary between classes (the ambiguity region). The dot in the centre of Figure 2a represents a sample whose characteristics are similar for both classes, with the result that the model cannot clearly distinguish whether it belongs to one class or to the other. Again, the sample will be assigned to the class to which, according to the Bayes rule, it has the highest probability of belonging. However, since the probability that the sample belongs to class ω0 is similar to the probability that it belongs to class ω1, there is a high risk of misclassification and the reliability of the classification is low.
These situations show that the reject option might be an advantageous addition to the decision rule. In this paper, we implement the reject option in p-DPLS. Both the classification reliability and accuracy of the p-DPLS model are improved by identifying unreliable classifications and rejecting the sample instead of running the risk of misclassifying it.
4.2.3 Implementation of the reject option in p-DPLS
The reject option in the case of classification ambiguity (i.e. for overlapped PDFs) can be derived by adapting Chow's rule to the PDFs obtained in p-DPLS. Chow's rule sets a threshold t so that the sample is rejected if the highest a posteriori probability is lower than (1 − t). In other words, the sample is classified only if:
$$\max\left(P(\omega_0 \mid \hat{y}_i), \, P(\omega_1 \mid \hat{y}_i)\right) > (1 - t) \qquad (9)$$
Thus, only those samples whose classification is reliable enough are indeed classified. The other samples are rejected because they could be misclassified. The threshold that
optimizes the trade-off between the error rate and reject rate can be derived from the
costs associated with each classification result [29]:

\[
t = \frac{\lambda_r - \lambda_c}{\lambda_m - \lambda_c}
\tag{10}
\]

where λm, λr and λc are the costs of incorrect classification, of rejection and of correct
classification, respectively. The values that are assigned to these costs make the reject
option tuneable. The cost of being wrong is higher than the cost of both rejecting and
classifying correctly (λm > λr > λc). In fact, it is preferable to reject a sample and gather
additional information than to classify the sample incorrectly. It is also generally
assumed that classifying correctly has no cost (λc = 0). Note that Equation 9 is a
generalization of the standard Bayes rule. In particular, for the extreme case in which
the cost of rejection λr equals the cost of misclassification λm, the reject threshold is
t = 1 and Chow's rule is reduced to the standard Bayes rule, in which samples are never
rejected. A sample is also never rejected if t > 1 - 1/C, where C is the number of possible
classes, because the highest a posteriori probability cannot be lower than 1/C (for a
binary classification, C = 2 and the bound is t > 0.5) [34].
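A minimal sketch of Eqs. 9 and 10 may help to fix ideas. The function names `chow_threshold` and `passes_chow_rule` and the a posteriori probabilities used below are illustrative assumptions, not part of the p-DPLS implementation; only the cost values λr = 0.25 and λm = 1 are the ones used later in this chapter.

```python
def chow_threshold(cost_reject, cost_misclassify, cost_correct=0.0):
    """Reject threshold t of Eq. 10, computed from the three costs."""
    if cost_misclassify <= cost_correct:
        raise ValueError("a misclassification must cost more than a correct classification")
    return (cost_reject - cost_correct) / (cost_misclassify - cost_correct)


def passes_chow_rule(posteriors, t):
    """Eq. 9: the sample is classified only if its highest posterior exceeds 1 - t."""
    return max(posteriors) > 1.0 - t


t = chow_threshold(cost_reject=0.25, cost_misclassify=1.0)
print(t)                                    # 0.25
# Hypothetical posteriors (not taken from the datasets).
print(passes_chow_rule((0.60, 0.40), t))    # False: 0.60 does not exceed 0.75
print(passes_chow_rule((0.90, 0.10), t))    # True
# Equal rejection and misclassification costs give t = 1, so nothing is rejected.
print(passes_chow_rule((0.50, 0.50), chow_threshold(1.0, 1.0)))  # True
```

Lowering the rejection cost lowers t and raises the bar (1 - t) that the highest a posteriori probability has to clear, so more samples are rejected.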
The second reason for using the reject option is to avoid classifying extreme samples
that have a large a posteriori probability but low values at both PDFs. In order to solve
this problem, Dubuisson and Masson [18] added a distance reject criterion to Chow's
ambiguity reject option. This idea is implemented here for the p-DPLS model by
imposing limits on the ŷ values, which define the extreme regions in which the samples
will be rejected. The limits are chosen so that the sum of the area in the tails of each
PDF is five percent of the total area of the distribution (i.e. the distance reject
probability equals 0.05 for each class, see Figure 2) [19]. Since the limits depend on the
shape of the distributions of each class, they are particular for each p-DPLS model with
a given number of factors. In practice, when the PDFs are overlapped we have two
operative limits, a High Limit (HL) and a Low Limit (LL), and when the PDFs are
separated we have four limits (HL and LL for each class, see Figure 2).
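The way such limits can be obtained is sketched below under two loudly stated assumptions that are not taken from this section: the class PDF is treated as an equally weighted mixture of Gaussians (one per training prediction), and the five percent is split evenly between the two tails. The centres and standard deviations are hypothetical values, and the function names are illustrative only.

```python
import math


def mixture_cdf(y, centres, sds):
    """CDF at y of an equally weighted Gaussian mixture (assumed form of the class PDF)."""
    total = sum(0.5 * (1.0 + math.erf((y - c) / (s * math.sqrt(2.0))))
                for c, s in zip(centres, sds))
    return total / len(centres)


def quantile(p, centres, sds, tol=1e-9):
    """Value y at which mixture_cdf(y) equals p, located by bisection."""
    lo = min(centres) - 10.0 * max(sds)
    hi = max(centres) + 10.0 * max(sds)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, centres, sds) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def distance_reject_limits(centres, sds, tail_area=0.05):
    """Low and high limits leaving tail_area / 2 in each tail (assumed even split)."""
    return (quantile(tail_area / 2.0, centres, sds),
            quantile(1.0 - tail_area / 2.0, centres, sds))


# Hypothetical training predictions and their standard deviations for one class.
centres = [-0.10, 0.00, 0.05, 0.12]
sds = [0.08, 0.10, 0.09, 0.11]
low, high = distance_reject_limits(centres, sds)
print(round(low, 2), round(high, 2))
```

Because the limits are quantiles of the class PDF, they move whenever the training predictions change, which is why each model with a given number of factors has its own limits.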
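Assuming the constraints for the distance reject and the ambiguity reject, the Bayes
rule with reject option is:

\[
\begin{aligned}
&\text{Reject if } \hat{y}_i < LL_0 \ \text{ or if } (HL_0 < \hat{y}_i < LL_1) \ \text{ or if } \hat{y}_i > HL_1 \\
&\qquad \text{or if } \max\bigl(P(\omega_0 \mid \hat{y}_i),\, P(\omega_1 \mid \hat{y}_i)\bigr) < (1 - t) \\
&\text{Otherwise classify into } \omega_0 \text{ if } P(\omega_0 \mid \hat{y}_i) > P(\omega_1 \mid \hat{y}_i) \\
&\qquad \text{or into } \omega_1 \text{ if } P(\omega_1 \mid \hat{y}_i) > P(\omega_0 \mid \hat{y}_i)
\end{aligned}
\tag{11}
\]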
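The rule of Eq. 11 can be written as a short function. This is a sketch only: the name `classify_with_reject`, the return labels and all ŷ and a posteriori values in the examples are hypothetical. The first set of limits is the one reported for the one-factor Breast Cancer model in Figure 6; the second set, with HL0 above LL1, is an invented example of overlapped classes.

```python
def classify_with_reject(y_hat, post0, post1, limits, t):
    """Apply Eq. 11 and return 'reject', 'w0' or 'w1'.

    limits is the tuple (LL0, HL0, LL1, HL1). When the PDFs overlap,
    HL0 > LL1 and the condition HL0 < y_hat < LL1 can never be met.
    """
    ll0, hl0, ll1, hl1 = limits
    if y_hat < ll0 or y_hat > hl1:
        return "reject"          # distance reject: beyond the outer limits
    if hl0 < y_hat < ll1:
        return "reject"          # distance reject: gap between separated classes
    if max(post0, post1) < 1.0 - t:
        return "reject"          # ambiguity reject (Chow's rule)
    # Eq. 11 does not cover exact ties; here they fall to w1 by convention.
    return "w0" if post0 > post1 else "w1"


separated = (-0.15, 0.24, 0.54, 1.50)    # limits quoted in Figure 6
overlapped = (-0.43, 0.90, 0.10, 1.42)   # hypothetical inner limits

print(classify_with_reject(0.05, 0.98, 0.02, separated, 0.25))   # w0
print(classify_with_reject(0.40, 0.01, 0.99, separated, 0.25))   # reject
print(classify_with_reject(1.80, 0.00, 1.00, separated, 0.25))   # reject
print(classify_with_reject(0.50, 0.55, 0.45, overlapped, 0.25))  # reject
```

The second call shows the point of the distance reject: the a posteriori probability is close to 1, yet the sample is rejected because its prediction lies between the two classes.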
4.2.4 Evaluation of the classification method performance
The p-DPLS models can be calculated for a different number of factors that are needed
to explain the relevant information. Thus, every p-DPLS model will produce different ŷ
predictions for the calibration samples and, therefore, different PDFs, which, in
turn, will influence the performance of the classifier. The performance of a classifier is
commonly characterized by its error rate (or the classification rate, which is the
percentage of correctly classified samples) when classifying a test set of unseen
samples that were not used during the training phase. The actual class of every sample
in the test set is compared to the class to which it is assigned by the classifier. In
general terms, however, it is not the misclassification (and rejection) rate that we want
to minimize, but the misclassification (and rejection) cost [35], since the cost more
accurately reflects the objective of the classification rule [36]. The Cost is here defined
as:

\[
\mathrm{Cost} = \lambda_m N_m + \lambda_r N_r
\tag{12}
\]

where Nr is the number of rejected samples and Nm is the number of misclassified
samples. The cost of correctly classifying a sample has been set to zero. Here, the
minimization of the cost will be used to decide on the optimal number of factors in the
p-DPLS model.
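As an illustration of how Eq. 12 drives the choice of the number of factors, the sketch below evaluates the cost for the cross-validation counts that Table 1 reports for the Human Cancers dataset (misclassified and rejected samples summed over both classes, with λm = 1 and λr = 0.25). The function name and the tie-breaking rule (prefer the model with fewer factors when costs are equal) are our assumptions.

```python
def classification_cost(n_misclassified, n_rejected,
                        cost_misclassify=1.0, cost_reject=0.25):
    """Cost of Eq. 12; correct classifications cost nothing."""
    return cost_misclassify * n_misclassified + cost_reject * n_rejected


# Number of factors -> (misclassified, rejected), totals over both classes (Table 1, CV).
cv_counts = {1: (2, 113), 2: (10, 22), 3: (4, 13), 4: (2, 12), 5: (1, 8), 6: (1, 8)}

costs = {a: classification_cost(m, r) for a, (m, r) in cv_counts.items()}
# Assumed tie-break: among equal costs, keep the model with fewer factors.
best = min(costs, key=lambda a: (costs[a], a))

print(costs)  # {1: 30.25, 2: 15.5, 3: 7.25, 4: 5.0, 5: 3.0, 6: 3.0}
print(best)   # 5
```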
4.3 Results and Discussion
4.3.1 Datasets
The proposed classification rule (Eq. 11) was applied to two datasets, the Human
Cancers dataset [37] and the Breast Cancer dataset [38]. These datasets have been
studied extensively in the literature [39, 40] and also used to evaluate the performance
of classification models [41-44]. The Human Cancers dataset consists of 282 microRNA
(miRNA, non-coding RNA species) normalized expression profiles for 218 samples,
including 46 healthy samples (class ω0) and 172 tumour samples (class ω1) from several
healthy and tumour tissues (ovary, colon and lung, to mention a few). The dataset was
divided into a training set and a test set by applying the Kennard-Stone algorithm [45]
to the scores of the first 20 Principal Components (PCs), which were obtained from the
Principal Component Analysis (PCA) of the raw gene expression matrix. For this
dataset, the training set contained 153 samples (116 samples of class ω1 and 37
samples of class ω0), and the test set had 65 samples (56 of class ω1 and 9 of class ω0).

The Breast Cancer dataset consists of 5361 normalized gene expression ratios. These
were used in [38] to prove that a heritable mutation influences the gene expression
profile of breast cancer. Seven samples of the BRCA1 mutation were used as class ω0,
eight samples of the BRCA2 mutation were used as class ω1, and six samples of the
Sporadic mutation were used as test samples.
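For readers unfamiliar with the Kennard-Stone split mentioned above for the Human Cancers dataset, a minimal sketch follows. It assumes Euclidean distances between score vectors and uses a handful of hypothetical two-dimensional scores rather than the 20 PC scores of the real data; the function name is illustrative, and details such as whether the split was made per class are not stated here and are not modelled.

```python
import math


def kennard_stone(points, n_select):
    """Indices of n_select points chosen by the Kennard-Stone algorithm."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two points that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda ij: dist[ij[0]][ij[1]])
    selected = [first, second]
    remaining = set(range(n)) - set(selected)
    while len(selected) < n_select:
        # Add the candidate whose nearest selected point is farthest away.
        nxt = max(sorted(remaining),
                  key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Hypothetical PCA scores (two components only, for readability).
scores = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
training = kennard_stone(scores, 3)
test = [i for i in range(len(scores)) if i not in training]
print(training, test)  # [0, 3, 4] [1, 2]
```

The selected (training) samples span the score space as evenly as possible, and the samples left over form the test set.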
4.3.2 Human Cancers dataset
Although p-DPLS is a full-variable method, it can often be improved by carefully
selecting the variables and removing irrelevant miRNA expressions that interfere with
the discriminative power of the relevant miRNA [46]. For this dataset, the 100
variables with the highest VIP values (variable importance for the projection) were
considered. VIP values were calculated as described in [8, 47]. These values quantify
how each variable influences the response summed over all components and classes.
For the selected variables, six p-DPLS models were calculated with 1 to 6 factors using
mean-centered miRNA expression patterns (we will denote each model as p-DPLSA,
where A is the number of factors). The a priori probabilities for these six models were
P(ω0) = 37/153 = 0.24 and P(ω1) = 116/153 = 0.76.

The PDFs of classes ω0 and ω1 were calculated for each p-DPLS model (Eqs. 1 to 5). Each
test sample was classified by obtaining its ŷ prediction and then calculating the a
posteriori probabilities (Eqs. 6a, 6b). Finally, the sample was either rejected or classified
in the class with the highest a posteriori probability (Eq. 11). In this dataset, the high
and low limits (HL and LL) for ŷi were defined so as to retain five percent of the
total area of the PDF in the tails of the distributions. The costs were arbitrarily set to
λc = 0, λr = 0.25 and λm = 1 because no information was available about the costs of each
classification decision. Note that these costs are relative, and indicate that it is
preferable to reject four samples than to classify one wrongly. These values are
illustrative and should be adjusted for each particular classification problem. With
these values, the threshold value for rejection in the ambiguity zone is t = 0.25 (Eq. 10).

The models with A = 1 to A = 6 factors were validated by leave-one-out cross-validation
(CV). In this process, sample i was left out of the training set, the p-DPLSA model was
calculated, and the prediction ŷi for the left-out sample was obtained (note that the a
priori probabilities were recalculated to take into account that one sample had been
left out). This procedure was repeated for all the samples of the training set and for all
the p-DPLS models.

Figure 3 and Table 1 show the cross-validation results obtained for the different
p-DPLS models when the reject option (Eq. 11) is considered. Note that predictions for
samples in class ω0 are around 0 and predictions for samples in class ω1 are around 1,
but that the predictions partially overlap in models with fewer than four factors
(underfitted models). As a result of the overlap, many samples are either rejected or
wrongly classified and the cost of these models (Table 1) is high. For example, for
p-DPLS2, 51% of the samples in class ω0 were rejected by CV and 27% were misclassified.
On the other hand, the predictions from the models with four to six factors are
grouped more tightly together. Consequently, these models have fewer misclassifications,
fewer rejections, and lower classification costs.
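The leave-one-out loop, including the recalculation of the a priori probabilities as class frequencies of the reduced training set, can be sketched as follows. The callable `fit_predict` is a hypothetical stand-in for fitting a p-DPLS model and predicting ŷ for the left-out sample; the stand-in used in the example simply returns the mean training label and is not the p-DPLS algorithm.

```python
from collections import Counter


def loo_priors(labels, left_out):
    """A priori probabilities from class frequencies after removing one sample."""
    counts = Counter(lab for i, lab in enumerate(labels) if i != left_out)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def leave_one_out(samples, labels, fit_predict):
    """Yield (index, prediction, priors) for every left-out sample.

    fit_predict(train_samples, train_labels, x) is supplied by the caller
    (hypothetical interface standing in for model fitting and prediction).
    """
    for i, x in enumerate(samples):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        yield i, fit_predict(train_x, train_y, x), loo_priors(labels, i)


# Priors when one healthy sample is left out of the 37 + 116 training samples.
labels = [0] * 37 + [1] * 116
priors = loo_priors(labels, left_out=0)
print(round(priors[0], 3), round(priors[1], 3))  # 0.237 0.763

# Toy run with a stand-in predictor (mean of the training labels).
results = list(leave_one_out([0.1, 0.2, 0.9], [0, 0, 1],
                             lambda tx, ty, x: sum(ty) / len(ty)))
```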
[Figure 3: predictions ŷ (horizontal axis) plotted for each number of factors A (vertical axis, 1 to 6).]
Figure 3. Prediction of the training samples by CV for the different p-DPLS models with reject option.
Squares: healthy samples (class ω0); green: correctly classified, blue: misclassified, red: rejected.
Circles: tumour samples (class ω1); yellow: correctly classified, brown: misclassified, orange: rejected.
Table 1. Classification of validation samples by leave-one-out cross-validation and test samples for the
different p-DPLS models for t = 0.25. In brackets, classifications performed without considering the reject
option.

Cross-Validation Samples
                  Samples of class ω0                     Samples of class ω1
Factors  Cost     Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1        30.3     0 (6)        24        10 (31)          2 (19)       89        25 (97)
2        15.5     10 (22)      19        8 (15)           0 (0)        3         113 (116)
3        7.3      4 (9)        13        20 (28)          0 (0)        0         116 (116)
4        5        2 (4)        8         27 (33)          0 (0)        4         112 (116)
5        3        1 (4)        4         32 (33)          0 (1)        4         112 (115)
6        3        1 (5)        7         29 (32)          0 (1)        1         115 (115)

Test Samples
                  Samples of class ω0                     Samples of class ω1
Factors           Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1                 0 (2)        6         3 (7)            0 (0)        17        39 (55)
2                 2 (4)        5         2 (5)            0 (0)        0         56 (56)
3                 0 (2)        4         5 (7)            0 (0)        0         56 (56)
4                 0 (0)        2         7 (9)            0 (0)        0         56 (56)
5                 0 (0)        0         9 (9)            0 (0)        0         56 (56)
6                 0 (0)        0         9 (9)            0 (0)        0         56 (56)
In terms of classification cost, the optimal model is p-DPLS5 since no further
improvement is obtained for the model of six factors. The PDFs for this model are
presented in Figure 4a and the a posteriori probabilities across the ŷ domain are
presented in Figure 4b. The limits were found to be LL0 = -0.43 and HL1 = 1.42. Thus,
samples with a predicted value ŷi < -0.43 or ŷi > 1.42 would be flagged as outliers and
rejected. These limits were different for each p-DPLSA model because the training
sample predictions changed. According to the rejection criterion, eight samples (four
from class ω0 and four from class ω1) were rejected, all of them in the ambiguity region
(Table 1). As an example, the dot in Figure 4 corresponds to the sample T_BRST_2
(tumour sample, class ω1) during the leave-one-out process. The prediction is ŷi = 0.44
and the calculated a posteriori probabilities are P(ω0|ŷi) = 0.59 and P(ω1|ŷi) = 0.41 (Eqs.
6a, 6b). Since both probabilities are similar, the confidence (reliability) that the
classification is correct is low because a slight shift in ŷi due to measurement errors
could have changed the assigned class. The application of the classic Bayes rule (Eq. 8)
would assign the sample to the class with the highest a posteriori probability, meaning
that the sample would be wrongly classified into class ω0. By allowing the reject
option, defined here by Chow's rule (with t = 0.25), the sample was rejected and not
classified because the highest a posteriori probability was below 1 - t (i.e.
max(P(ω1|ŷi), P(ω0|ŷi)) < 0.75). In this case, the reject option prevented us from
classifying a tumour sample as a healthy sample, and the expert would be prompted to
make more tests before the final diagnosis. It is interesting to note, as we indicated
before, that the reject option's performance depends on the relative costs assigned to
the classification results. Thus, by setting different costs, the threshold (and hence the
number of samples rejected) will be tuned to meet the experimenter's needs.
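The decision for T_BRST_2 can be checked directly against Eq. 9. The first line below uses only the values quoted above; the second line is a consequence we derive from Eqs. 9 and 10 under λc = 0, not a result reported for the dataset: it gives the threshold above which this sample would have been classified (wrongly) instead of rejected.

```latex
\[
\max\bigl(P(\omega_0 \mid \hat{y}_i),\, P(\omega_1 \mid \hat{y}_i)\bigr)
  = \max(0.59,\ 0.41) = 0.59 < 1 - t = 0.75
  \quad\Longrightarrow\quad \text{reject}
\]
\[
0.59 > 1 - t \iff t > 0.41
  \quad\Longrightarrow\quad \lambda_r > 0.41\,\lambda_m \quad (\lambda_c = 0)
\]
```

In other words, the sample stays rejected for any rejection cost below 41% of the misclassification cost, which includes the thresholds t = 0.10, 0.25 and 0.35.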
[Figure 4: panel a plots the PDFs p(ŷ|ω0) and p(ŷ|ω1) against ŷ, with the limits LL0 and HL1; panel b plots the a posteriori probabilities P(ω0|ŷ) and P(ω1|ŷ) against ŷ, with the limits LL0 and HL1, the level (1 - t) and the values for sample T_BRST_2.]
Figure 4. a. PDFs for the five-factor p-DPLS model obtained from the training samples during the LOOCV
process when T_BRST_2 is used as the validation sample. b. A posteriori probabilities across the ŷ domain
(Eqs. 6a and 6b) derived from the PDFs in a. The prediction and the a posteriori probability for sample
T_BRST_2 during the LOOCV process are also shown.
For comparison, Table 1 shows in brackets the classification results when the classical
Bayes rule is applied. For the model that best minimizes the cost, that is, p-DPLS5, five
samples were misclassified if the reject option was not applied, whereas only one
sample was misclassified (a healthy sample) when the reject constraints were applied.
Thus, this model's classification accuracy (i.e. the ratio of the number of samples well
classified to the number of samples classified) was improved from 97% (148/153) to 99%
(144/145). Notice, however, that the reject option also rejected some samples that
would otherwise be correctly classified: the number of samples well classified
decreased from 148 to 144. This reduction in the number of well classified samples is
the price to pay for safeguarding against errors, and follows the trend of the suggested
costs of classifications, in which rejecting four samples was preferable to misclassifying
one.
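The accuracy figures can be reproduced from the cross-validation counts that Table 1 lists for p-DPLS5 (148 correct and 5 misclassified without the reject option; 144 correct, 1 misclassified and 8 rejected with it). The helper below is an illustrative sketch; its name and the extra rates it reports are our additions.

```python
def summarize(n_correct, n_misclassified, n_rejected):
    """Accuracy among classified samples, plus reject and error rates over all samples."""
    n_total = n_correct + n_misclassified + n_rejected
    n_classified = n_correct + n_misclassified
    return {
        "accuracy": n_correct / n_classified if n_classified else float("nan"),
        "reject_rate": n_rejected / n_total,
        "error_rate": n_misclassified / n_total,
    }


bayes_only = summarize(n_correct=148, n_misclassified=5, n_rejected=0)
with_reject = summarize(n_correct=144, n_misclassified=1, n_rejected=8)

print(round(bayes_only["accuracy"], 3))    # 0.967
print(round(with_reject["accuracy"], 3))   # 0.993
print(round(with_reject["reject_rate"], 3))  # 0.052
```

Reporting the reject rate next to the accuracy makes the trade-off explicit: the gain in accuracy is obtained by withholding a decision for about 5% of the samples.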
Different reject thresholds were tested by varying the classification costs (Table 2).
When λr = 0.10, λm = 1 and λc = 0, the threshold was t = 0.10. As expected, the number of
rejected samples increased because the cost of doing so decreased (i.e. we preferred
to reject ten samples rather than classify one wrongly). However, the number of
misclassified samples did not change, which means that the rule rejected samples that
would have been correctly classified with t = 0.25. Thus, decreasing t to below 0.25 did
not improve the model's classification performance for this dataset. On the other
hand, when t was set to 0.35 (i.e. λr = 0.35, λm = 1, λc = 0), the results (not shown) were
the same as those obtained for t = 0.25. Hence, t = 0.25 was considered optimal for this
p-DPLS5 model.

The samples of the test set were also classified according to Eq. 11. For the p-DPLS5
model using t = 0.25, 100% of the samples were well classified and there were no
rejects (Figure 5 and Table 1). By setting the threshold to t = 0.10, two correctly
classified healthy samples turned into rejects (Table 2). This was seen in the
classification of the training samples above and highlights the need to set an adequate
reject threshold in order to obtain an adequate trade-off between the rejects and the
misclassifications. This will depend on the needs of the experimenter and the cost
constraints in each particular application.
[Figure 5: predictions ŷ (horizontal axis) plotted for each number of factors A (vertical axis, 1 to 6).]
Figure 5. Classification of test samples for the different p-DPLS models with reject option. Squares: healthy
samples (class ω0); green: correctly classified, blue: misclassified, red: rejected. Circles: tumour
samples (class ω1); yellow: correctly classified, brown: misclassified, orange: rejected.
Table 2. Classification of validation samples via leave-one-out cross-validation and test samples for the
different p-DPLS models for t = 0.10.

Cross-Validation Samples
                  Samples of class ω0                     Samples of class ω1
Factors  Cost     Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1        16.1     0            36        1                1            115       0
2        6.5      2            31        4                0            14        102
3        5.9      3            23        11               0            6         110
4        3.5      1            17        19               0            8         108
5        3.3      1            14        22               0            9         107
6        2.9      1            13        23               0            4         110

Test Samples
                  Samples of class ω0                     Samples of class ω1
Factors           Wrongly      Rejected  Correctly        Wrongly      Rejected  Correctly
                  classified             classified       classified             classified
1                 0            9         0                0            56        0
2                 0            9         0                0            0         56
3                 0            7         2                0            0         56
4                 0            3         6                0            0         56
5                 0            2         7                0            0         56
6                 0            2         7                0            0         56
4.3.3 Breast Cancer dataset
This dataset demonstrates the rejection of test samples that are outside the class
limits. The same methodology as for the Human Cancers dataset was used except that
the Kennard-Stone algorithm was not applied. Instead, the samples of mutations
BRCA1 and BRCA2 were used as a training set and the Sporadic mutation samples were
used as a test set. The aim was to show that the classification rule could reject
prediction samples from non-modelled classes. This would prevent the classification
error that would otherwise occur if the classifier had to assign the samples to one of
the two modelled classes. Detecting this type of outlier is fundamental to the
application of any classification rule.

Probabilistic DPLS models were calculated for one to three factors by using log2
mean-centred gene expression data from BRCA1 (class ω0) and BRCA2 (class ω1) mutation
samples. These data consisted of the 51 most relevant gene expressions according to
[38]. These genes were found to be the most discriminative between the three
mutations. The costs of classifying correctly, rejecting and misclassifying were set at
λc = 0, λr = 0.25 and λm = 1, respectively. The one-factor p-DPLS model (p-DPLS1) was the
optimal model with the lowest cost (i.e. cost of 0.5). Models with two and three
factors were overfitted, with costs of 2.5 and 3.25, respectively. These higher costs are
due to the fact that most of the samples are rejected and, although there are no
misclassifications, the classifiers become useless. For example, for the p-DPLS2 model,
10 of the 15 training samples were rejected during LOOCV. Similarly, the p-DPLS3
model rejected 13 of the training samples.

The p-DPLS1 model calculated with the 51 gene expressions selected in the literature was
able to distinguish the samples of class ω0 from those of class ω1, thus providing well
separated PDFs (Figure 6). Only the sample s1252_P2, of class ω0, and the sample
s1816_P13, of class ω1, were rejected during LOOCV. The predictions of both samples
were outside the limits of the classes (i.e. ŷs1252_P2 > HL0 and ŷs1816_P13 > HL1). Notice that
because the PDFs were not overlapped, there was no ambiguity region and the limits
of the classes were defined by four operative limits. The classification performance of
the p-DPLS1 model did not change when the reject threshold was varied to t = 0.35 and
t = 0.10.

The p-DPLS1 model was used to classify the six test samples of sporadic mutation of
breast cancer. This mutation was not modelled in the training step; hence, all these
samples should be flagged as outliers and not classified. Classifying these samples in
either of the two modelled classes would result in a classification error. Figure 6 shows
the PDFs (Eqs. 4 and 5) of class ω0 and class ω1 for p-DPLS1 together with the
predictions for the test samples. According to Eq. 11, all test samples were correctly
detected as outliers and rejected since their predictions ŷi were between the limits HL0
(ŷ = 0.24) and LL1 (ŷ = 0.54). If the reject constraints had not been applied, the classifier
would have assigned the test samples to the class with the highest a posteriori
probability. In this case, the samples s1572_P16 and s1324_P17 would have been
incorrectly classified into class ω1 (i.e. as BRCA2 mutation samples) and the remaining
samples (s1649_P15, s1320_P18, s1542_P19 and s1281_P21) would have been
incorrectly classified into class ω0 (as BRCA1 mutation samples). For these samples, the
a posteriori probability for one class was near 1. For example, sample s1572_P16 had
p(ŷi|ω0) = 6·10^-6 and p(ŷi|ω1) = 2·10^-3, which results in P(ω0|ŷi) ≈ 0 and P(ω1|ŷi) ≈ 1.
Therefore, if it is believed that the a posteriori probability demonstrates the
classification's reliability, then the high values of probability obtained for the test
samples would suggest that we can trust the classifications, despite the fact that all of
them were incorrect. This shows that the classic Bayes rule is unreliable when both
conditional probabilities p(ŷi|ω0) and p(ŷi|ω1) are low. Moreover, it should be noted
that the predicted values are not as extreme as those expected for outliers. Hence,
these are not directly suspicious samples because of their ŷi values. It was the reject
option, which set limits on the classes, that allowed these samples to be detected.
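The effect described for sample s1572_P16 follows directly from the Bayes formula. In the sketch below the two densities are the values quoted above, whereas the a priori probabilities 7/15 and 8/15 are our assumption (class frequencies of the training set, as was done for the Human Cancers dataset); the function name is illustrative.

```python
def posteriors(density0, density1, prior0, prior1):
    """A posteriori probabilities of two classes from class densities and priors."""
    evidence = density0 * prior0 + density1 * prior1
    return density0 * prior0 / evidence, density1 * prior1 / evidence


# Densities reported for s1572_P16; priors assumed from 7 BRCA1 and 8 BRCA2 samples.
p0, p1 = posteriors(6e-6, 2e-3, prior0=7 / 15, prior1=8 / 15)
print(round(p0, 3), round(p1, 3))  # 0.003 0.997
```

Both densities are very small, so the sample is poorly described by either class, yet their ratio alone fixes the a posteriori probabilities and pushes one of them close to 1. This is why the a posteriori probability cannot by itself signal such outliers, and why limits on ŷ are needed.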
[Figure 6: PDFs p(ŷ|ω0) and p(ŷ|ω1) plotted against ŷ, with the limits LL0, HL0, LL1 and HL1 marked.]
Figure 6. PDFs for the one-factor p-DPLS model. Limits: LL0 = -0.15, HL0 = 0.24, LL1 = 0.54 and HL1 = 1.50.
Triangles represent the test samples classified with p-DPLS1.
4.4 Conclusions
Recently, the DPLS method has received much attention in the field of gene expression
data analysis. We have applied a new version of DPLS, namely probabilistic DPLS
(p-DPLS), to classify biological samples using their microRNA (miRNA) expression patterns
and cDNA microarray data. p-DPLS takes into account the uncertainty of the PLS
predictions in the definition of the classification model. In this version, the possibility
of rejection has been introduced. p-DPLS with reject option performs better than the
original p-DPLS, because only those samples that have the highest probability of being
correctly classified are indeed classified, whereas doubtful cases are rejected. The
methodology involves evaluating the probability of each classification together with
the overall cost of the classifications performed for each model. In addition, the reject
option allows us to deal with situations in which the results of the Bayes rule may be
questioned. Moreover, the classification rule with reject option can help the
experimenter to check that a sample does not belong to any of the classes modelled in
the training step and therefore to ensure that it is rejected rather than misclassified.
Thus, the reject option enables the classifier to detect outliers, and this in turn
provides a new approach for improving outlier detection methods in the near future.

Acknowledgements
The authors thank the Department of Universities, Research and the Information
Society of the Catalan Government for providing Cristina Botella's doctoral fellowship,
and the Spanish Ministry of Education and Science (project CTQ2007-66918/BQU).
The authors would also like to acknowledge the useful comments of the referees.

References
[1] Alizadeh, A.A., et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 2000. 403: p. 503-511.
[2] Golub, T.R., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 1999. 285: p. 531-537.
[3] Li, L., et al., Gene Assessment and Sample Classification for Gene Expression Data Using a Genetic Algorithm/k-nearest Neighbor Method. Combinatorial Chemistry & High Throughput Screening, 2001. 4: p. 727-734.
[4] Brown, M.P.S., et al., Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 2000. 97: p. 262-267.
[5] Furey, T.S., et al., Support Vector Machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000. 16: p. 906-914.
[6] Nguyen, D.V. and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002. 18: p. 1216-1226.
[7] Gunther, E.C., et al., Prediction of drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences, 2003. 100: p. 9608-9613.
[8] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[9] Nguyen, D.V. and D.M. Rocke, Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 2002. 18: p. 39-50.
[10] Pérez-Enciso, M. and M. Tenenhaus, Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 2003. 112: p. 581-592.
[11] Modlich, O., et al., Predictors of primary breast cancers responsiveness to preoperative Epirubicin/Cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signature. Journal of Translational Medicine, 2005. 3: article 32.
[12] Man, M.Z., et al., Evaluation methods for classifying Expression data. Journal of Biopharmaceutical Statistics, 2004. 14: p. 1065-1084.
[13] Bylesjö, M., et al., MASQOT: a method for cDNA microarray spot quality control. BMC Bioinformatics, 2005. 6: p. 250.
[14] Tax, D.M.J. and R.P.W. Duin, Growing a multi-class classifier with a reject option. Pattern Recognition Letters, 2008. 29: p. 1565-1570.
[15] Knauthe, B., et al., Visualization of quality parameters for classification of spectra in shooting crimes. Journal of Chemometrics, 2008. 22: p. 252-258.
[16] Fumera, G., F. Roli, and G. Giacinto, Multiple Reject Thresholds for Improving Classification Reliability, in Advances in Pattern Recognition (SSPR & SPR). 2000, Springer: Berlin-Heidelberg. p. 863-871.
[17] Devarakota, P.R.R., B. Mirbach, and B. Ottersten, Reliability estimation of a statistical classifier. Pattern Recognition Letters, 2008. 29: p. 243-253.
[18] Dubuisson, B. and M. Masson, A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 1993. 26: p. 155-165.
[19] Muzzolini, R., Y.-H. Yang, and R. Pierson, Classifier design with incomplete knowledge. Pattern Recognition, 1998. 31: p. 345-369.
[20] Ripley, B.D., Statistical ideas for selecting network architectures, in Neural Networks: Artificial Intelligence and Industrial Applications, B. Kappen and S. Gielen, Editors. 1995, Springer. p. 183-190.
[21] Ripley, B.D., Pattern Recognition and Neural Networks. 2000, Cambridge, United Kingdom: Cambridge University Press.
[22] Tortorella, F., An optimal reject rule for binary classifiers, in Advances in Pattern Recognition: Joint IAPR International Workshops, SSPR 2000 and SPR 2000, F.J. Ferri et al., Editors. Lecture Notes in Computer Science, vol. 1876. 2000, Springer-Verlag: Heidelberg. p. 611-620.
[23] Fumera, G., I. Pillai, and F. Roli, Classification with Reject Option. Proceedings of the 12th International Conference on Image Analysis and Processing (ICIAP'03), 2003.
[24] Fumera, G. and F. Roli, Error Rejection in Linearly Combined Multiple Classifiers, in Proceedings of 2nd Int. Workshop on Multiple Classifier Systems (MCS 2001). 2001. Robinson College, Cambridge, UK.
[25] Fumera, G., F. Roli, and G. Giacinto, Reject option with multiple thresholds. Pattern Recognition, 2000. 33: p. 165-167.
[26] Cordella, L.P., et al., A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 1995. 6: p. 1140-1147.
[27] Landgrebe, T., et al., The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 2006. 27: p. 908-917.
[28] Landgrebe, T., et al., A combining strategy for ill-defined problems, in Fifteenth Ann. Sympos. of the Pattern Recognition Association of South Africa. 2004.
[29] Chow, C.K., On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970. 16: p. 41-46.
[30] Hanczar, B. and E.R. Dougherty, Classification with reject option in gene expression data. Bioinformatics, 2008. 24: p. 1889-1895.
[31] Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification (2nd edition). 2001, Wiley-Interscience: New York.
[32] Pérez, N.F., J. Ferré, and R. Boqué, Calculation of the reliability of classification in Discriminant Partial Least-Squares Classification. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 122-128.
[33] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, S. Kotz and N.L. Johnson, Editors. 1985, Wiley: New York. p. 581-591.
[34] Webb, A., Statistical Pattern Recognition, 2nd edition. 2002, Wiley: Malvern, UK.
[35] Bradley, A.P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997. 30: p. 1145-1159.
[36] Li, M. and I.K. Sethi, Confidence-based classifier design. Pattern Recognition, 2006. 39: p. 1230-1240.
[37] Lu, J., et al., MicroRNA expression profiles classify human cancers. Nature Letters, 2005. 435: p. 834-838.
[38] Hedenfalk, I., et al., Gene Expression profiles in hereditary breast cancer. The New England Journal of Medicine, 2001. 344: p. 539-548.
[39] Zheng, Y. and C.K. Kwoh, Informative microRNA expression patterns for cancer classification. Data mining for biomedical applications, Proceedings, 2006. 3916: p. 143-154.
[40] Lin, J. and M. Li, Molecular profiling in the age of cancer genomics. Expert Review of Molecular Diagnostics, 2008. 8: p. 263-276.
[41] Boulesteix, A.-L., PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 2004. 3: article 33.
[42] Raza, M., et al., Comparative Study of Multivariate Classification Methods using Microarray Gene Expression Data for BRCA1/BRCA2 Cancer Tumors. Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), IEEE, 2005. 2: p. 475-480.
[43] Branden, K.V. and S. Verboven, Robust data imputation. Computational Biology and Chemistry, 2009. 33: p. 7-13.
[44] Pochet, N., et al., Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, 2004. 20: p. 3185-3195.
[45] Kennard, R.W. and L.A. Stone, Computer Aided Design of Experiments. Technometrics, 1969. 11: p. 137-148.
[46] Lu, Y. and J. Han, Cancer classification using gene expression data. Information Systems, 2003. 28: p. 243-268.
[47] Musumarra, G., et al., Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. Journal of Chemometrics, 2004. 18: p. 125-132.
CHAPTER 5 Outlier detection and ambiguity detection for
microarray data in p-DPLS regression
Journal of Chemometrics 2010, Accepted
Microarray data are obtained after a complex series of experimental steps that go
from hybridization to image analysis. Microarray manufacturing errors like dye
instability, different incorporation of the dyes, slide, spatial and printͲtip effects
together with scanning errors may introduce unspected data variability, which can
make the collected data for one sample very different than the data from other
samplesofthesameclass.Additionally,theexperimentermaybeconfrontedwithnew
samples that are not like any of the other samples that have been modelled (e.g.,
samples that do not belong to any of the modelled classes). All these samples are
consideredasoutliers,andcanhaveadegradingimpactinthecalculatedclassification
model (if they are training samples), can produce wrong evaluations of the
classificationperformanceofthemodel(ifthesamplesarevalidationsamples)andcan
leadtowrongclassifications(ifthesamplesarenewsamplestobeclassified).
Outlier detection is often unnoticed in microarray data classification. However it is
essentialthatanyclassificatitionmethodthatisintendedtohavearealpracticaluse
beimplementedtogetherwithappropriateoutlierdetectiontools.
Basically, all the outliers can be detected either because they have errors in the
recordeddata(x),becausetheyhavebeenidentifiederroneously(witherroneousy),
because they have abnormal xͲy relation or because they belong to a different
populationthanthesampleswearetryingtoclassify.Inthisworkwedevelopoutlier
detection for probabilistic discriminant partial least squares (pͲDPLS) method by
combining diagnostics based on leverage and xͲresiduals (common in PLS) and the
rejectoptionapproachdevelopedinchapter4.
The method was tested on two datasets: the prostate cancer dataset and the small
roundbluecelltumoursofchildhooddataset.Resultsshowedthatwithoutoutliersthe
pͲDPLS classification models have better classification abilities and samples from
classes not modelled during the training step are rejected instead of classified, thus avoiding
their misclassification.
The removal of outliers in the prostate cancer dataset reduced the cost of
classification per sample from 0.11 to 0.06, and the model increased the proportion of
correct classifications of test samples from 95% to 100%. In the small round blue cell
tumours of childhood dataset, p-DPLS with the outlier detection method implemented
correctly flagged as outliers 95% of the samples in the prediction step. These
samples did not belong to any of the modelled classes. When the outlier detection
method was not implemented in the training step, only 5% of the test samples
were flagged as outliers, and the remaining 95% were misclassified.
This work is presented in the form of a paper accepted by Journal of Chemometrics
(2010).
Outlier detection and ambiguity detection for
microarray data in probabilistic Discriminant
Partial Least Squares Regression
C. Botella*, J. Ferré, R. Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University.
Marcel·lí Domingo s/n, 43007. Tarragona, Spain
* Corresponding author: [email protected]
Journal of Chemometrics, 2010, Accepted. (Edited for format)
Abstract
The reject option plays an important role in the classification of microarray data. In this
work, a reject option is implemented in the probabilistic discriminant partial least squares
(p-DPLS) method so that both outliers and ambiguous samples are rejected instead of classified.
Microarray data are highly prone to outliers because of the many steps involved in
the experimental process. During the development of the classifier, outliers in the
training data may strongly influence the model and degrade its performance. Some
future samples to be classified may also be outliers that will most probably be
misclassified. Ambiguous samples are samples that cannot be assigned to any
of the classes with high confidence. In this work, outlier detection and ambiguity
detection are implemented taking into account the x-residuals, the leverage and the
predicted ŷ. The method was applied to oligonucleotide microarray data and cDNA
microarray data. For the first dataset (prostate cancer dataset), the outlier detection
criteria allowed us to remove nine samples from the training set. The model without
those samples had better classification ability, with a decrease in the classification cost
per sample from 0.10 to 0.07. The method was also used on a second dataset (small
round blue cell tumours of childhood dataset) to detect prediction outliers, so that
most of the outliers were rejected and misclassifications were reduced from
100% to 5%.
5.1 Introduction
Outlier detection plays a fundamental role in the development and application of
multivariate classification methods for microarray data. Outliers are either samples,
variables, or certain variables in certain samples that behave differently from
the rest of the data. This paper focuses on sample outliers. Sample outliers may be
training samples, validation samples, or future samples to be classified. The
experimenter is interested in flagging them for different reasons. Outliers in the
training set may have an excessive influence on the classification rule, unless robust
methods of classification are used. Hence, it is interesting to know whether the
classification rule is dominated by a few special samples, and to discover whether this
influence is adverse. Samples with large measurement errors or samples that belong to a
different population than the samples we are trying to classify will degrade the rule.
These "bad" outliers should be detected and removed, and the rule recalculated. Training
outliers may also include "good" samples with unique information. These must be
kept, since they will improve the model by expanding its application domain. Their
detection will warn the experimenter that more samples of a similar type should be
obtained in order to model that variability better. Study of the good outliers may also
lead to the discovery of special variables (gene expressions) that may have a high
discriminative power [1]. Outlier detection must also be applied when future samples
are to be classified, which is the ultimate objective of the classification rule. Unknown
samples that do not belong to any of the classes for which the classification rule was
trained, or samples with large data errors, will be misclassified. The experimenter wants
to be warned about these samples so that they can be rejected until more
information is available. In this sense, outlier detection increases the confidence the
experimenter has in the classification protocol, since the samples that might be
misclassified will hopefully be flagged. Finally, outlier detection must also be used to
detect outliers in the validation set. Samples not representative of the future samples
to be classified will likely produce an erroneous classification result that will worsen
the estimated classification ability of the model. Hence, these samples should be detected,
as is done for the unknown samples, and not considered when evaluating the performance of
the model.
The particularities of microarray gene expression data and the many levels of variation
introduced at the complex experimental stages, from hybridization to image analysis,
make the use of outlier diagnostics necessary [2, 3]. First of all, the recorded
microarray data depend on the biological variations of the population under study
(intrinsic to all organisms and influenced by genetic or environmental factors).
Technical variations introduced during the extraction, labelling or hybridization of
samples, scanner settings and measurement errors associated with the reading of the
fluorescent signals (which may be affected, for example, by dust on the array [4]) will
also increase the data variability. Moreover, the large number of variables (gene
expressions) compared to the relatively low number of objects makes the data analysis
and the classification a nontrivial task. Fortunately, the combined use of data
pre-processing and multivariate algorithms can extract the main systematic variation in the
data and lead to satisfactory classification results. For example, normalization
methods, such as the lowess correction [5] or the total intensity normalization [6], can
remove inconsistencies in the microarray data. However, not all the errors in the data
can be mathematically removed, and outlier diagnostics are still needed in order to
prevent misclassifications due to new unexpected data variations. Outlier detection is
also needed to flag samples from new unexpected classes (biological outliers)
and samples that present extreme biological variability.
Several methods have been used for detecting outliers in microarray data for particular
classification rules. Paoli [7] improved the performance of Support Vector Machines
(SVM) by selecting the optimal number of genes and treating the most relevant as
outliers. Moffitt [8] constructed the SVM model by removing outliers via re-validation.
Olsen [9] analysed the intensity scores of tissue microarrays of sarcoma phenotypes
with Euclidean hierarchical cluster analysis, presenting as outliers those samples that
did not cluster into any of the defined groups. The VizRank tool, which combines the
k-nearest neighbours (k-NN) method with a range of visualizations, was also used to
detect outliers [10]. Model et al. [11] pointed out that outliers in microarray data
cannot always be detected visually and proposed a robust version of Principal
Component Analysis (rPCA). Their objective was to exclude single outlier chips from
the analysis and to detect systematic changes in experimental conditions as early as
possible in order to facilitate a fast recalibration of the production process. Shieh [12]
addressed the detection of outliers with highly different expression patterns in microarray
data, also using PCA and a robust estimate of the Mahalanobis distance. Tomlins et al.
[13] proposed the cancer outlier profile analysis (COPA) method for detecting
translocations from microarray data. For gene selection, genetic algorithms were
proposed for outlier detection using a grid count tree [14]. Liu et al. studied different
statistical methods to detect genes with differential expression across the different
class samples [1], and Loo et al. used filter-based methods with the same objective
[15]. In contrast, Tibshirani [16] and Wu [17] proposed alternative cancer outlier
differential expression detection methods for detecting genes that, inside a disease
group, exhibit unusually high gene expression in some but not all samples.
In this work, we develop outlier detection for discriminant partial least squares (DPLS).
DPLS is one of the preferred methods for classification of microarray data [18]. In DPLS,
the assigned class is decided from the predicted value ŷ when the measured
microarray data are submitted to a PLS model. Hence, the outlier detection
approaches that exist for PLS (already applied in multivariate calibration in chemical
and industrial fields) can be applied. Pell used the studentized residuals versus
leverage plot to detect outliers in PLS, which was successful when either masking or
swamping occurred [19]. Pell [20], based on the work by Martens and Næs [21], also
detected outliers from an F-ratio which compared the x-residuals of the validation samples to
the x-residuals of the calibration samples. Chiang and Pell [22] presented the closest
distance to center (CDC), a multiple outlier detection algorithm applied together with
ellipsoidal multivariate trimming (MVT), taking into account only the x-data. A
methodology to detect prediction outliers in PLS was applied by projecting the new
objects onto the Sammon's mapping space containing the convex hull which defines a
boundary around each cluster and another around the whole calibration data [23]. The Q
and Hotelling's T² statistics [24] were also used to detect outliers in PLS, although the
authors indicated that in some cases these indexes would not be enough.
Most of the approaches mentioned take into account only the x-response data to
flag a sample as a potential outlier, since it is the only information available for
unknown samples. Note also that the predicted value ŷ is rarely used to detect
prediction outliers in PLS, since it is often difficult to set limits on the lowest and
highest values of ŷ that can be accepted. Only predictions that are really extreme
can warn that the sample is an outlier. DPLS, however, has the particularity that the ŷ
values (from which the class is decided) are located around the value that codifies the
class (around 0 or 1 in the DPLS scheme used in this paper) and that probability density
functions of the predictions can be established. This fact has been previously used to
define a reject option for DPLS and microarray data [25]. The reject option made it possible
to reject those samples that had extreme ŷ values or those with "normal" ŷ
values but whose classification was ambiguous (i.e., samples that have a very similar
probability of belonging to any of the modelled classes). In this paper, we provide a
unified approach for outlier detection in DPLS for microarray data. This approach
combines the new criterion based on the predicted value ŷ, developed specifically for
DPLS, with the well-known diagnostics based on the leverage and the x-residuals
commonly used in PLS.
5.2 Theory
5.2.1 Probabilistic Discriminant Partial Least Squares
DPLS is the application of PLS regression to classification problems. A DPLS model is
calculated by regressing y, which codifies the class of the samples, on X using A latent
variables (factors) [18, 26]. For microarray gene expression data, X is an N×P matrix of
N samples and P gene expressions and y is an N×1 vector of ones and zeros, where 0
codifies that the sample belongs to class ω0 and 1 codifies that the sample belongs
to class ω1. For an unknown sample with measured x-data, xt, the value predicted by
the DPLS model with A factors is given by ŷt = xtᵀb, where b is the vector
of regression coefficients and the pre-processing is implicit in the formula. With the
mentioned coding, ŷt should ideally be close to zero if the sample belongs to class ω0
and close to one if the sample belongs to class ω1. The criterion for deciding the class
from ŷt will influence the performance of the classification rule. The criterion used in
this work is based on the probabilistic version of DPLS, p-DPLS [27]. The p-DPLS
procedure starts by calculating a PLS model of A factors relating X and y. Then, the
training samples are predicted with this model. For each training sample i, a potential
function f(ŷi, SEPi) is calculated with the shape of a Gaussian centred at the predicted
value ŷi and with standard deviation equal to the standard error of prediction (SEPi) of that
sample. Next, the individual potential functions of all the samples of class ω0 are
averaged to obtain the probability density function (PDF) that describes the
predictions of class ω0 (Eq. 1):
$$p(\hat{y}_i \mid \omega_0) = \frac{\sum_{i=1}^{n_0} f(\hat{y}_i, \mathrm{SEP}_i)}{n_0} \qquad (1)$$
where n0 is the number of samples of class ω0. The PDF for class ω1 is calculated
likewise, using n1, the number of samples of class ω1. A sample is classified by
calculating its prediction ŷ and applying the Bayes theorem to the two PDFs, so that the
sample is allocated to the class with the highest a posteriori probability. A consequence
of the straight application of this rule is that a sample is always classified in one of the
classes. So, samples from new unexpected classes will be misclassified, and
samples with either extremely low or extremely high values of ŷ (which may be
outliers) will be assigned to one of the classes with a very large probability.
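As a rough illustration of Eq. 1 and of the Bayes allocation, the following Python sketch averages Gaussian potential functions into a class PDF and computes the a posteriori probabilities. The function names, the toy predictions and SEP values, and the equal prior probabilities are assumptions made for this example; they are not taken from the paper.

```python
import math


def gaussian(y, mu, sd):
    """Gaussian potential function centred at mu with standard deviation sd."""
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def class_pdf(y, preds, seps):
    """Eq. 1: average of the potential functions of the samples of one class."""
    return sum(gaussian(y, m, s) for m, s in zip(preds, seps)) / len(preds)


def posterior(y, classes, priors=None):
    """Bayes a posteriori probability of each class given the prediction y.

    classes: list of (predictions, SEPs) pairs, one pair per class.
    priors:  equal priors are assumed when none are given (an assumption of
             this sketch, not a statement about the original method).
    """
    if priors is None:
        priors = [1.0 / len(classes)] * len(classes)
    weighted = [p * class_pdf(y, preds, seps)
                for p, (preds, seps) in zip(priors, classes)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical training predictions and SEPs for classes omega_0 and omega_1
class0 = ([-0.05, 0.10, 0.02], [0.15, 0.15, 0.15])
class1 = ([0.95, 1.08, 0.90], [0.15, 0.15, 0.15])

post = posterior(0.2, [class0, class1])
assigned = post.index(max(post))          # class with the highest posterior
extreme = posterior(5.0, [class0, class1])  # prediction far from both classes
```

With these hypothetical values, a prediction as extreme as ŷ = 5 still receives an a posteriori probability close to 1 for class ω1, which is the behaviour that the reject option is meant to prevent.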
5.2.2 Reject option in p-DPLS
The purpose of the reject option in p-DPLS is to allow the classifier to reject a sample if
it is likely to be misclassified. In other words, a class label is assigned only to those
samples with the highest probability of being correctly classified. By not forcing the
classifier to always decide for one of the two modelled classes, the
misclassification rate of the model (measured as the number of misclassified
samples with respect to the number of samples for which the classifier assigns a class)
decreases, which gives the experimenter confidence in the outputs of the
classification rule. The reject option in p-DPLS is implemented here for two main types
of samples: outliers and ambiguous samples.
5.2.2.1 Rejection of outliers
Outliers are samples whose x-data have different features than the bulk of the training
samples. Several reasons for this behaviour are (a) the sample belongs to a class that
was not modelled, (b) the sample belongs to one of the modelled classes but the
x-data have gross errors or contain unmodelled interferences, and (c) the sample
belongs to one of the modelled classes but has correct extreme values of some
variables. Samples in situation (a) should be detected and rejected, otherwise they will
be wrongly classified in one of the two modelled classes. Samples in situations (b) and
(c) will not necessarily be misclassified, but this uncommon behaviour will likely affect
the classification result, and hence we might prefer to reject the samples and ask for
extended analysis instead of running the risk of misclassifying them. Outliers (hence,
candidates to be rejected) in p-DPLS are flagged based on the following four
criteria: limits on ŷ, leverage, ratio of residual variances and classification error.
a. Limits on ŷ
In p-DPLS, the predictions ŷ of the training samples are used to calculate a distribution
of predictions for each class (see Figure 1). These distributions are ideally centered on
0 and 1, the reference values used at the training stage. Uncommon x-data will
produce ŷ values at the extremes of the PDF of a class. Hence, limits for ŷ are set
around the majority of the ŷ of the training data. These limits define regions on the ŷ
axis in which the sample is either classified in one class, classified in the other class, or
rejected [25]. The limits are defined such that the area in the tails of each distribution
is five percent of the total area of the distribution (i.e. 2.5% in each tail of the PDF of
each class). These limits depend on the PDFs. Hence, they are different for p-DPLS
models with a different number of factors. In practice, when the PDFs overlap
(Figure 1) there are two limits, a High Limit (HL) and a Low Limit (LL), and when the
PDFs are separated there are four limits (an HL and an LL for each class) (Figure 4b). A
sample with a ŷ predicted outside the limits will be flagged as an outlier. If the PDFs do
not overlap, a sample with ŷ between HL0 and LL1 will be flagged as an inlier. This
criterion improves on the direct application of the Bayes rule in the sense that, at the
extremes of the PDFs, the a posteriori probability for one class is high, and hence the
Bayes rule would assign the sample to that class with a high probability. By imposing
the limits, the sample will now be rejected. Note also that the limits on the ŷ
values will not account for all the outlier situations in p-DPLS, since they will not detect
those outliers whose unusual x-data make ŷ fall inside a classification region, e.g.
when the ŷ of a sample of class ω0 falls within the classification region of class ω1.
These samples might be detected by the criteria described next.
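One possible way to obtain the limits on ŷ is sketched below in Python. Because each class PDF is an average of Gaussians, its cumulative distribution is the average of the Gaussian cumulative distributions, and the 2.5% and 97.5% points can be located by bisection. The bisection search, the function names, the toy values and the assumption that class ω0 lies below class ω1 on the ŷ axis are choices made for this illustration, not details given in the text.

```python
import math


def mixture_cdf(y, preds, seps):
    """CDF of a class PDF: average of the Gaussian CDFs of its kernels."""
    return sum(0.5 * (1.0 + math.erf((y - m) / (s * math.sqrt(2.0))))
               for m, s in zip(preds, seps)) / len(preds)


def quantile(q, preds, seps, tol=1e-9):
    """Value of y at which the class CDF equals q, found by bisection."""
    lo = min(preds) - 10.0 * max(seps)
    hi = max(preds) + 10.0 * max(seps)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, preds, seps) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def class_limits(preds, seps, tail=0.025):
    """Low and high limits that leave `tail` of the area in each tail."""
    return quantile(tail, preds, seps), quantile(1.0 - tail, preds, seps)


def outside_limits(y, limits0, limits1):
    """True when y is beyond the outer limits or in the gap between classes.

    Assumes class omega_0 lies below class omega_1 on the y axis. When the
    PDFs overlap (HL0 >= LL1) the gap test can never be true.
    """
    ll0, hl0 = limits0
    ll1, hl1 = limits1
    if y < ll0 or y > hl1:
        return True
    return hl0 < y < ll1


# Hypothetical training predictions and SEPs (well-separated classes)
limits0 = class_limits([-0.05, 0.10, 0.02], [0.15, 0.15, 0.15])
limits1 = class_limits([0.95, 1.08, 0.90], [0.15, 0.15, 0.15])
flags = [outside_limits(y, limits0, limits1) for y in (0.0, 0.5, 2.0)]
```

In this toy case the two PDFs are separated, so a prediction of 0.5 falls between HL0 and LL1 and is flagged together with the extreme prediction of 2.0, while a prediction of 0.0 is accepted.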
b. Leverage
The leverage of sample t for a DPLS model calculated with mean-centered x-data is
given by [28]:
$$h_t = \frac{1}{n} + \mathbf{t}_t^{\mathrm{T}} (\mathbf{T}^{\mathrm{T}} \mathbf{T})^{-1} \mathbf{t}_t \qquad (2)$$
where tt denotes the score vector of sample t, T is the scores matrix of the mean-centered
training data and n is the number of training samples. The leverage measures the distance
from the sample to the center (mean) of the training set, taking into account the
correlation in the data. A low value of
ht indicates that the sample is similar to the average of the training samples. A high
leverage indicates that the sample has an unusual x-vector (or score vector) relative to
the training samples, so it is an x-outlier. In that case, the experimenter should be
suspicious of the reliability of the classification and wait for additional studies. Although no
strict rules exist, it is common to declare as a high-leverage sample one with
ht > 3h̄, where h̄ is the average leverage value for the training samples
(h̄ = 1/N + A/N) [29, 30].
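The leverage criterion can be sketched in a few lines of Python. The sketch assumes that the columns of the score matrix T are mutually orthogonal (as is the case for scores obtained with the usual NIPALS PLS algorithm), so that TᵀT is diagonal and no general matrix inversion is needed. The function names and the toy score values are hypothetical.

```python
def leverages(train_scores, new_scores=None):
    """Eq. 2 for orthogonal score columns, where T'T is diagonal.

    train_scores: list of N rows, each holding the A scores of one
                  mean-centred training sample.
    new_scores:   optional rows of scores for samples to be predicted;
                  when omitted, the training leverages are returned.
    """
    n = len(train_scores)
    a = len(train_scores[0])
    # Diagonal of T'T: sum of squares of each score column
    col_ss = [sum(row[k] ** 2 for row in train_scores) for k in range(a)]
    rows = train_scores if new_scores is None else new_scores
    return [1.0 / n + sum(row[k] ** 2 / col_ss[k] for k in range(a))
            for row in rows]


def is_high_leverage(h, n_samples, n_factors, factor=3.0):
    """Flag h when it exceeds `factor` times the average leverage (1 + A) / N."""
    h_mean = 1.0 / n_samples + n_factors / n_samples
    return h > factor * h_mean


# Hypothetical centred scores with orthogonal columns (N = 4, A = 2)
T = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
h_train = leverages(T)
h_new = leverages(T, [[3.0, 0.0]])[0]
flagged = is_high_leverage(h_new, n_samples=4, n_factors=2)
```

The training leverages add up to 1 + A, which is why their average equals 1/N + A/N.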
c. Ratio of residual variances
In DPLS, there is a vector of x-residuals for each sample and number of factors A used
in the model. The residuals are the difference between the measured x-data and the
data predicted by the model with A factors. While the leverage refers to the position of
the sample in the subspace of the factors used for regression, the residual refers to the
orthogonal subspace, i.e., the factors not used for regression. Residuals that are much
larger than most of the residuals of the training samples indicate that the sample is
poorly described by the model for that number of factors and, hence, it is an x-outlier.
It must be pointed out, however, that large x-residuals do not necessarily imply a
wrong ŷ and, hence, a wrong classification. Actually, one of the advantages of
factor-based methods such as PLS is that the factors retained in the model should
account for the relevant variability in the x-data, while the remaining factors not used
in the model should account for the irrelevant variability (the x-residuals). Hence, a
large x-residual simply indicates that some part of the measured x-data is not
modelled. However, there is a large chance that the source of these unmodelled data
also had a contribution in the model space and influenced ŷ. These outliers are
detected by comparing the unmodelled parts of the test sample to the unmodelled
parts of the training samples using the A-factor p-DPLS model [31] with the ratio of
residual variances:
$$V = \frac{s_t^2}{s_T^2} \qquad (3)$$
where st² is the residual variance for the test sample:
$$s_t^2 = \frac{\sum_{j=1}^{P} (x_{tj} - \hat{x}_{tj})^2}{P - A} \qquad (4)$$
and sT² is the total residual variance for the training samples [21]:
$$s_T^2 = \frac{\sum_{i=1}^{N} \sum_{j=1}^{P} (x_{ij} - \hat{x}_{ij})^2}{N \cdot P - P - A \cdot \max(N, P)} \qquad (5)$$
An object with V > 3 is considered to be an outlier. A similar criterion was used in [31] to
detect outliers in PLS. Note that the usual comparison of V with a tabulated F-value is
not useful. The very large number of degrees of freedom involved [32] makes the
tabulated F-value low, so that most of the samples would be flagged as outliers, which is
meaningless.
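A minimal Python sketch of Eqs. 3 to 5 follows. It takes the x-residuals as given (how they are obtained from the PLS model is outside the sketch), and the function names and the toy residual values are assumptions made for the example.

```python
def residual_variance_test(e_test, n_factors):
    """Eq. 4: residual variance of one sample from its P x-residuals."""
    p = len(e_test)
    return sum(e * e for e in e_test) / (p - n_factors)


def residual_variance_train(e_train, n_factors):
    """Eq. 5: total residual variance of the N training samples."""
    n = len(e_train)
    p = len(e_train[0])
    dof = n * p - p - n_factors * max(n, p)
    return sum(e * e for row in e_train for e in row) / dof


def variance_ratio(e_test, e_train, n_factors):
    """Eq. 3: ratio V of the test and training residual variances."""
    return (residual_variance_test(e_test, n_factors)
            / residual_variance_train(e_train, n_factors))


# Hypothetical residuals: N = 3 training samples, P = 4 variables, A = 1 factor
E_train = [[0.1, 0.1, 0.1, 0.1] for _ in range(3)]
e_new = [0.3, 0.3, 0.3, 0.3]

V = variance_ratio(e_new, E_train, n_factors=1)
is_x_outlier = V > 3.0
```

Here the training residual variance is 0.12 / 4 = 0.03 and the test residual variance is 0.36 / 3 = 0.12, so V = 4 and the hypothetical sample exceeds the limit of 3.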
[Figure 1 image not reproduced. Panels a, c and e plot p(ŷ|ωc) against ŷ, marking the limits LL0 and HL1 and the ambiguity region; panels b, d and f plot V against leverage.]
Figure 1. Probability density functions (PDFs) for the p-DPLS model with two factors obtained during
leave-one-out cross-validation and influence plots for the training samples when a sample is used as a test.
a-b. PDFs and influence plot when sample N43_normal is left out. c-d. PDFs and influence plot when sample
T11_tumour is left out. e-f. PDFs and influence plot when sample N41_normal is left out. In a, c and e, the
triangle identifies the prediction of the left-out sample. In b, d and f, the triangle identifies the
left-out sample as compared to the rest of the training data. The vertical and the horizontal dotted lines
indicate the limits for outlier detection.
d. Classification error
Classification error is an easy-to-use outlier diagnostic during the training stage of a
classification rule. When the x-y relation of a sample does not agree with the x-y
relation described by the model, the sample is misclassified, and this is used to flag the
sample as an outlier. This is the equivalent of a large prediction error in regression
models. Unlike criteria (a) to (c), the classification error can only be used to
detect outliers in the training and validation sets because it requires the true class to
be known. Although it cannot be applied to new samples, the criterion is still very
helpful for refining the classification model.
5.2.2.2 Rejection of ambiguous samples
Ambiguous samples are samples that share characteristics of both class ω0 and class ω1
because the measured x-variables are not discriminative enough for the algorithm
used. When these samples are predicted by the DPLS model, their ŷ values are at the
boundary between classes (ambiguity region, Figure 1), so the Bayesian probabilities of
belonging to each of the classes, P(ωc|ŷt), are similar. Even small variations in the
measured x-data can make the classifier assign the sample to either one class or the
other. This increases the uncertainty of the classification result, so it may be preferable
to reject that sample. This rejection is defined by the rule:
$$\text{reject if } \max\big(P(\omega_c \mid \hat{y}_i)\big) < (1 - t), \quad c = 0, 1 \qquad (6)$$
so that the sample is rejected if the a posteriori probability of belonging to any of the
classes is lower than a reject threshold (1 - t). Note that the threshold can be set to
reject any slightly doubtful sample. This improves the error rate of the classifier, since
fewer samples will be misclassified, but, in turn, more samples will be rejected that
could otherwise be correctly classified, which reduces the usefulness of the classifier.
Chow [33] derived an optimum rejection scheme that gives a tradeoff between reject
rate and error rate. This rule was recently described for p-DPLS [25].
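The rejection rule of Eq. 6 amounts to a single comparison, as the following Python sketch shows. The function name, the use of None to signal a rejection and the example probabilities are assumptions made for illustration.

```python
def classify_or_reject(posteriors, t):
    """Eq. 6: return the index of the most probable class, or None to reject.

    posteriors: a posteriori probabilities P(omega_c | y), one per class.
    t:          rejection parameter; the sample is rejected when the largest
                posterior is below the threshold (1 - t).
    """
    best = max(range(len(posteriors)), key=lambda c: posteriors[c])
    if posteriors[best] < 1.0 - t:
        return None  # ambiguous sample: no class is assigned
    return best


# Hypothetical posteriors for a clear sample and an ambiguous sample
clear = classify_or_reject([0.90, 0.10], t=0.2)      # threshold 0.8
ambiguous = classify_or_reject([0.55, 0.45], t=0.2)  # below the threshold
```

Lowering t raises the threshold (1 - t), so more samples are rejected and fewer are misclassified, which is the trade-off described above.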
5.3 Results
5.3.1 Data
The prostate cancer dataset [34] consists of 50 non-tumour samples (class ω0) and 52
tumour samples (class ω1) with 12600 gene expressions (variables). From these gene
expressions, the 150 with the highest variance weight [35] were selected to
prevent irrelevant genes from interfering with the discrimination power of the relevant
genes [36]. The dataset was divided into a training set (82 samples, 42 of class ω0 and
40 of class ω1) and a test set (20 samples, 8 of class ω0 and 12 of class ω1) using the
Kennard-Stone algorithm [37]. This dataset is used to show the ability of the
methodology to detect outliers in the training set and to show that the final
classification model and the prediction of the test set improve when these outliers are
deleted.
The small round blue cell tumours of childhood dataset [38] includes 2308 gene
expressions of 12 samples of neuroblastoma (NB), 8 samples of non-Hodgkin
lymphoma (BL), 23 samples of the Ewing family of tumours (EWS) and 20 samples of
rhabdomyosarcoma (RMS). EWS samples (class ω0) and RMS samples (class ω1) were
used for training and the remaining NB and BL samples as test samples. This dataset
was used to show how the proposed method can reject new samples that do not
belong to any of the modelled classes. Since the test samples do not belong to any of
the modelled classes, they would be misclassified unless the reject option is
implemented.
5.3.2 Prostate dataset
Briefly, the procedure was as follows. First, the p-DPLS model was calculated for a given
number of factors using mean-centered gene expressions of the training samples. Then,
the training samples were predicted and the predictions, ŷ, were used to calculate
kernel Gaussians, which, in turn, defined a PDF for each class (Eq. 1). From the PDFs, the
reject option limits for ŷ were set. The leverage and x-residuals of the training samples
were also calculated. An unknown sample with measured xt was first predicted (ŷt = xtᵀb),
and the x-residuals, the leverage and the probability of classification for each modelled
class (evaluated as the Bayes a posteriori probability detailed in [25]) were calculated.
The sample was then either classified, or rejected if it was flagged as an outlier
(Section 5.2.2.1) or as ambiguous (Section 5.2.2.2, Eq. 6). Before classifying unknown samples,
the optimal model was selected by leave-one-out cross-validation (LOOCV). In LOOCV, a
sample is left out and the model is calculated using the remaining samples. Each left-out
sample was treated as an unknown sample and was either classified or rejected as
described above. Of the rejected samples, outliers were removed from the training set
and the model was recalculated; ambiguous samples, however, were kept in the
model since they introduced relevant variability.
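The leave-one-out loop with removal of flagged outliers can be sketched as follows in Python. To keep the example short and self-contained, the classifier is a deliberately simple stand-in (nearest class mean on a single variable, with a distance cut-off for outliers) and is not a p-DPLS model; the function names, the cut-off and the data are all hypothetical.

```python
def loocv(samples, labels, fit, assess):
    """Leave each sample out, refit on the rest and assess the left-out one."""
    outcomes = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        outcomes.append(assess(fit(train_x, train_y), samples[i]))
    return outcomes


def fit_means(train_x, train_y):
    """Stand-in model: the mean of each class (NOT a p-DPLS model)."""
    means = {}
    for c in sorted(set(train_y)):
        values = [x for x, y in zip(train_x, train_y) if y == c]
        means[c] = sum(values) / len(values)
    return means


def assess_nearest(means, x, max_dist=2.0):
    """Nearest-mean class, or "outlier" when x is far from every class mean."""
    best = min(means, key=lambda c: abs(x - means[c]))
    return "outlier" if abs(x - means[best]) > max_dist else best


# Hypothetical one-variable data; the last sample of class 1 is aberrant
x = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 5.0]
y = [0, 0, 0, 1, 1, 1, 1]

first_pass = loocv(x, y, fit_means, assess_nearest)

# Remove the samples flagged as outliers and repeat the cross-validation
keep = [i for i, out in enumerate(first_pass) if out != "outlier"]
x_clean = [x[i] for i in keep]
y_clean = [y[i] for i in keep]
second_pass = loocv(x_clean, y_clean, fit_means, assess_nearest)
```

In this toy run the aberrant sample pulls the mean of class 1 away from its other members, which are then assigned to class 0 in the first pass; once the flagged sample is removed, every remaining sample is classified correctly in the second pass.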
The performance of the model was evaluated with the classification cost per sample:
$$\mathrm{Cost} = (\lambda_r N_r + \lambda_m N_m) / N \qquad (7)$$
where Nr is the number of samples rejected, Nm is the number of samples misclassified,
λr and λm are the costs of rejecting a sample or misclassifying it, respectively, and N is
the total number of samples used to validate the model. The cost criterion, calculated
during LOOCV, was used to compare the p-DPLS models with a different number of
factors and to select the optimal model. Note that λr and λm may be fine-tuned to meet
the requirements of the classification problem. Since for this dataset there is no
reference in the literature about the associated costs of rejecting or misclassifying a
sample, we used λr = 0.25 and λm = 1, which indicates that we prefer to reject four
samples rather than misclassify one. In this case, λr values lower than 0.25 did not
improve the classification performance of the model.
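Eq. 7 with the costs used here can be written as a short Python function. The function name and the sample counts in the example are hypothetical; only the default costs λr = 0.25 and λm = 1 come from the text.

```python
def classification_cost(n_rejected, n_misclassified, n_total,
                        cost_reject=0.25, cost_misclassify=1.0):
    """Eq. 7: classification cost per sample."""
    return (cost_reject * n_rejected
            + cost_misclassify * n_misclassified) / n_total


# Hypothetical validation of 20 samples: 4 rejected and 1 misclassified
cost = classification_cost(n_rejected=4, n_misclassified=1, n_total=20)

# Four rejections weigh the same as one misclassification
same = (classification_cost(4, 0, 20) == classification_cost(0, 1, 20))
```

With these counts the cost per sample is (0.25 · 4 + 1 · 1) / 20 = 0.1.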
For this dataset, preliminary p-DPLS models using 1 to 4 factors were calculated. Taking
into account the cost per sample calculated by LOOCV, the optimal model had two
factors. Samples N43_normal and N25_normal (both of class ω0) were flagged as
outliers because their predictions were outside the accepted region for ŷ for their
corresponding cross-validation segments. The prediction of sample N43_normal (Figure
1a-b) was ŷ = −0.59, lower than LL0 = −0.56, while sample N25_normal had ŷ = −0.66, lower
than the limit LL0 = −0.50 established for its p-DPLS model (note that the limits HL and LL
vary for each cross-validation segment since the p-DPLS model is calculated with different
samples). These extreme predictions suggested the possibility of an unusual x vector.
This was later confirmed because the leverage of these samples exceeded three times
the average leverage of the training set: sample N43_normal had h = 0.13 while h̄ = 0.024,
and sample N25_normal had h = 0.23 while h̄ = 0.037. The reason for the high leverage is
that five genes, those with Accession Numbers 36785_at, 221_s_at, 774_g_at,
31449_at and 38411_at, had higher intensities than in the rest of the samples of class ω0.
In this case, these five differentially expressed variables (genes) were not considered
relevant, since the different intensities were only present in a few samples and so did not
seem to reflect a differential characteristic of one class.
In addition to samples N43_normal and N25_normal, the leverage criterion also
flagged sample N33_normal as an outlier (h = 0.12, while h̄ = 0.025), even though this sample
did not have an unusual ŷ.
Six additional samples (N04_normal, T02_tumour, T05_tumour, T11_tumour,
T15_tumour and T25_tumour) were rejected for having high x-residuals (V > 3). These
six samples had most of the gene expressions with higher intensities than the mean of the intensities of the training samples, so the samples were not well modelled by the factors of the DPLS model. The T11_tumour sample (class ω1) (Figure 1c-d), for example, was rejected because V = 9.45. Its prediction ŷ = 0.47 was closer to the predictions for class ω0 than to the predictions for class ω1, so the sample would have been classified wrongly (i.e., as non-tumour) if it had not been rejected. Notice that the prediction for this sample is not an extreme value, so the sample would not have been labelled as suspicious based only on the prediction.

In addition to the previous samples flagged as outliers, five samples were wrongly classified (Table 1). In these samples, the x-y relation did not agree with the trend modelled by the p-DPLS model. The reason for the wrong classification is that the intensities of the samples of class ω0 (non-tumour) are lower than those of class ω1 (tumour) for the majority of the samples of this dataset. The sample N38_normal (class ω0), however, had intensities in some of the variables that were higher than expected, more similar to the intensities of tumour samples (class ω1) than to the intensities of the samples of its true class (Figure 2a). For this reason, the sample was misclassified. The opposite happened with the misclassified samples of class ω1 (T39_tumour, T21_tumour, T49_tumour and T34_tumour). Some intensities were lower than most of the intensities of class ω0 (Figure 2b). This situation may result from an incorrect codification of the samples (mislabelling), from experimental problems (e.g. bad intensity acquisition), or from these samples being truly different from the rest of the samples of their class (which would indicate that more representative samples of this type should be collected before they are included in the model).
[Figure 2, panels a and b. Axes: Intensities vs. Variable number]
Figure 2. a. Intensities of sample N38_normal of class ω0 (grey) and mean of intensities of class ω0 (black). b. Intensities of sample T21_tumour of class ω1 (grey) and mean of intensities of class ω1 (black).
During the cross-validation process, four additional samples were rejected for classification because they were ambiguous. These samples did not have extreme values, so they were not likely to influence the model excessively and they were kept in the training set. However, since these samples acted as test samples in the LOOCV process, they were counted as rejects in the calculation of the performance of the classifier. An example is shown in Figure 1e and 1f. The figure shows the acceptance and reject regions for the cross-validation model when sample N41_normal is left out. Because its ŷ was in the
ambiguity zone, this sample would have been rejected for classification if it had been an unknown sample.

It should also be noted that the outlier detection process could suffer from the masking effect, whereby the presence of several outliers hides the presence of another outlier. Despite this, extreme samples could still be detected and the model was recalculated without those samples. The optimal model was again the p-DPLS model with 2 factors, with a decrease of the classification cost per sample from 0.11 to 0.06 (Figure 3).
[Figure 3. Axes: Cost per sample vs. Number of factors (1 to 4)]
Figure 3. Cost per sample for the training samples with λr = 0.25 and λm = 1. Two series are plotted: the p-DPLS model with all the samples, and the p-DPLS models after removing outliers.
Table 1 shows the classification results for the models calculated with the original dataset and with the dataset after removing the rejected training samples. The LOOCV and test set classifications are first presented for the initial dataset using the p-DPLS model with two factors (columns 2 and 3) without reject option (i.e., there are no rejected samples). Columns 4 and 5 show the classifications when the reject option is
enabled. Note that one false negative and one false positive of the classical model become rejects. In turn, five true negatives and six true positives also become rejects. This is because the high certainty required in the classification results causes the samples with uncertain classification to be rejected. Columns 6 to 9 show the results after the outliers in the training set had been removed. Comparing the classical p-DPLS models with and without outliers (columns 2-3 and 6-7), it is seen that the model without outliers misclassifies one sample fewer. This improvement is more notable when rejection is allowed (columns 4-5 versus 8-9). In this case, the LOOCV error rate (calculated as the number of samples misclassified divided by the number of samples classified) for the model with outliers is 5/69 = 0.07, higher than the error rate for the cleaned model (2/56 = 0.04). The reduction of misclassified samples is also observed in the test set. Columns 8 and 9 show the results of the cleaned model with reject option. This cleaned model predicts better than the models calculated with all the training samples without reject option. This optimal model classifies wrongly only two samples, and also has fewer rejections, so the classification cost per sample is lower (Figure 3). The prediction of the test set is also better. The two misclassifications of p-DPLS calculated with the initial dataset are now rejections (based on the ambiguity rejection rule, Eq. 6). Compared with the p-DPLS model with reject option calculated with the initial dataset, the number of misclassifications and of rejections decreased, so the classification cost per sample decreases from 0.10 to 0.07. Hence, the removal of the outliers of the training set improved the p-DPLS model in the sense of classifying better both the training samples via LOOCV and the test samples.
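The error rates and costs quoted above can be reproduced from the LOOCV counts in Table 1. The sketch below is only illustrative: the cost formula (λm times the misclassified samples plus λr times the rejected samples, divided by the total number of samples) is an assumption that happens to reproduce the reported 0.10 and 0.07, and the function name is hypothetical.

```python
def summarize(fn, fp, tn, tp, rn, rp, cost_reject=0.25, cost_misclass=1.0):
    """Error rate among classified samples and (assumed) cost per sample."""
    misclassified = fn + fp
    classified = fn + fp + tn + tp
    rejected = rn + rp
    total = classified + rejected
    error_rate = misclassified / classified
    # Assumed cost definition: weighted misclassifications and rejections per sample
    cost = (cost_misclass * misclassified + cost_reject * rejected) / total
    return error_rate, cost

# LOOCV counts with reject option, taken from Table 1
initial = summarize(fn=1, fp=4, tn=37, tp=27, rn=6, rp=7)   # initial dataset
cleaned = summarize(fn=2, fp=0, tn=33, tp=21, rn=5, rp=7)   # after removing outliers

print("initial: error rate %.2f, cost %.2f" % initial)
print("cleaned: error rate %.2f, cost %.2f" % cleaned)
```

With λr = 0.25 and λm = 1, four rejections cost as much as one misclassification, which is the trade-off stated for this dataset.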
Table 1. Prostate cancer dataset. Classification of validation and test samples for the p-DPLS model with two factors calculated with the initial training samples and after removing outliers.

     | Initial dataset                                  | Dataset after removing outliers from the training set
     | p-DPLS         | p-DPLS with reject option       | p-DPLS         | p-DPLS with reject option
     | LOOCV | test   | LOOCV | test                    | LOOCV | test   | LOOCV | test
FN   | 2     | 2      | 1     | 1                       | 6     | 2      | 2     | 0
FP   | 5     | 0      | 4     | 0                       | 0     | 0      | 0     | 0
TN   | 42    | 5      | 37    | 5                       | 38    | 5      | 33    | 5
TP   | 33    | 13     | 27    | 13                      | 24    | 13     | 21    | 13
RN   | 0     | 0      | 6     | 0                       | 0     | 0      | 5     | 0
RP   | 0     | 0      | 7     | 1                       | 0     | 0      | 7     | 2

** False Negative (FN): samples of class ω0 classified in class ω1; False Positive (FP): samples of class ω1 classified as class ω0; Reject Negative (RN): samples of class ω1 rejected; Reject Positive (RP): samples of class ω0 rejected; True Negative (TN): samples of class ω1 correctly classified; True Positive (TP): samples of class ω0 correctly classified.
5.3.3 Small round blue cells tumour dataset
The same strategy as for the prostate cancer dataset was followed. In this case the p-DPLS models were calculated using the 96 most significant gene expressions according to reference [38]. Preliminary p-DPLS models were calculated with 1 to 3 factors using mean-centered gene expression data and then validated by LOOCV. The optimal model, with the lowest cost of classification per sample, was the one-factor model. For this model, four training samples were detected as outliers. Three of them had large x-residuals, with values of s_t^2/s_T^2 of 4.98 (sample EWS_T13), 6.11 (sample RMS_T7) and 8.43 (sample RMS_T11), larger than the cut-off value of 3. Moreover, the prediction of sample RMS_T11 was ŷ = 1.91, higher than the class limit HL1 = 1.35. The fourth outlier, sample EWS_T12, had a prediction ŷ = -0.18, lower than the limit LL0 = -0.072. After deleting
these four samples, the p-DPLS model was recalculated and used to predict the test samples. Without reject option, all the test samples would have been incorrectly classified by the model. With the reject option, 19 out of the 20 test samples were pointed out as outliers by the ŷ limits because they were inliers. The other sample had its prediction ŷ in the acceptance region and hence it was classified, but erroneously. The classification performance would have been worse if the outliers had not been removed from the p-DPLS model. Without excluding the training outliers, the PDFs of the model varied, and hence so did the ŷ limits for rejection (Figure 4). In that case (i.e., the p-DPLS model calculated with all the training samples) only 13 of the 20 test samples were rejected and the remaining 7 were considered valid by the model and classified either in class ω0 or in class ω1 (hence, wrongly classified). Note that the test samples have intermediate values of the x-variables between the two modelled classes EWS and RMS. Since the samples are close to the centre of the multivariate space, their predictions were around 0.5, in the middle of the PDFs of the two modelled classes. In this case, none of the test samples could have been rejected either by the leverage criterion (the maximum leverage was h = 0.01, while 3h̄ was 0.15 for this model) or by the ratio of variances (all had V < 3). This shows the complementary information that the ŷ limits, the leverage and the ratio of variances offer.
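How the three diagnostics complement each other can be sketched as a single decision function. This is a hypothetical illustration under stated assumptions: the order in which the checks are applied, the function name, and the inner limits HL0 = 0.3 and LL1 = 0.7 are made up for the example; only the cut-offs (3h̄ for leverage, 3 for the ratio of variances) and the values LL0 = -0.072, HL1 = 1.35, h = 0.01, 3h̄ = 0.15 and V = 8.43 echo numbers quoted in the text.

```python
def classify_with_reject(y_hat, leverage, var_ratio, limits, mean_leverage,
                         var_cutoff=3.0):
    """Classify one sample or reject it using leverage, x-residuals and y-hat limits."""
    ll0, hl0, ll1, hl1 = limits
    if leverage > 3.0 * mean_leverage:
        return "reject (leverage)"
    if var_ratio > var_cutoff:
        return "reject (x-residuals)"
    if ll0 <= y_hat <= hl0:
        return "class 0"
    if ll1 <= y_hat <= hl1:
        return "class 1"
    return "reject (prediction)"

# Hypothetical acceptance limits (LL0, HL0, LL1, HL1); HL0 and LL1 are invented here.
limits = (-0.072, 0.3, 0.7, 1.35)

# An inlier: unremarkable leverage and residuals, but y-hat between the two classes.
inlier = classify_with_reject(0.5, leverage=0.01, var_ratio=1.0,
                              limits=limits, mean_leverage=0.05)
# A sample with a large ratio of variances is rejected regardless of its prediction.
poor_fit = classify_with_reject(1.0, leverage=0.01, var_ratio=8.43,
                                limits=limits, mean_leverage=0.05)
print(inlier, "|", poor_fit)
```

The inlier case mirrors the test samples described above: only the ŷ limits catch it, because its leverage and x-residuals look normal.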
[Figure 4, panels a and b. Axes: p(ŷ|ωc) vs. ŷ, with the limits LL0, HL0, LL1 and HL1 marked; panel b also marks the regions reject, ω0, reject, ω1, reject along the ŷ axis]
Figure 4. Small round blue cells tumour dataset. PDFs of the p-DPLS model with one factor: a. with all the training samples, b. without the training outliers. Note how the PDFs (and hence the ŷ limits and the rejection and acceptance zones) change when the outliers in the training set are removed.
5.4 Conclusions
Classification rules for microarray data require appropriate rejection diagnostics. The several steps involved in the generation and measurement of microarray data, which may introduce important errors in the data, as well as the possibility of submitting samples from a non-modelled class to the classifier, make it necessary to use diagnostics to prevent misclassifications. Rejection diagnostics act both in the training stage of the rule, by identifying those outliers that can degrade the performance of the rule, and in the prediction of new incoming samples, by identifying those samples that will likely be misclassified. Within this approach, the classification model is not forced to classify any
future sample that arrives. This work extends the previous work on the reject option for p-DPLS that was based only on the predicted ŷ, which has been shown to be not always sufficient to detect outliers. Both training and prediction outliers were now detected by taking into account the x-residuals, the leverage and the predicted ŷ. The possibility of using x-residuals is an advantage of classification methods based on latent variables such as p-DPLS. The deletion of the training outliers from the training set improved the classification model. At the prediction stage, samples were rejected for classification either because they were outliers or because they were ambiguous.

Acknowledgements
The authors acknowledge the support of the Departament d'Universitats, Recerca i Societat de la Informació de Catalunya, which provided Cristina Botella's doctoral fellowship, and of the Spanish Ministerio de Educación y Ciencia (project CTQ2007-66918/BQU).
References
[1] Liu, F. and B. Wu, Multi-group cancer outlier differential gene expression detection. Computational Biology and Chemistry, 2007. 31: p. 65-71.
[2] Li, C. and W.H. Wong, Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 2001. 98: p. 31-36.
[3] Gottardo, R., et al., Quality Control and Robust Estimation for cDNA Microarrays With Replicates. Journal of the American Statistical Association, 2006. 101: p. 30-40.
[4] Churchill, G.A., Fundamentals of experimental design for cDNA microarrays. Nature Genetics, 2002. 32: p. 490-495.
[5] Cleveland, W.S., Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 1979. 74: p. 829-836.
[6] Yang, Y.H., et al., Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 2002. 30: p. e15.
[7] Paoli, S., et al., Integrating gene expression profiling and clinical data. International Journal of Approximate Reasoning, 2008. 47: p. 58-69.
[8] Moffitt, R., et al., Effect of Outlier Removal on Gene Marker Selection Using Support Vector Machines. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, 2005. 1: p. 917-920.
[9] Olsen, S.H., D.G. Thomas, and D.R. Lucas, Cluster analysis of immunohistochemical profiles in synovial sarcoma, malignant peripheral nerve sheath tumor, and Ewing sarcoma. Modern Pathology, 2006. 19: p. 659-668.
[10] Mramor, M., et al., Visualization-based cancer microarray data classification analysis. Bioinformatics, 2007. 23: p. 2147-2154.
[11] Model, F., et al., Statistical process control for large scale microarray experiments. Bioinformatics, 2002. 18: p. S155-S163.
[12] Shieh, A.D. and Y.S. Hung, Detecting Outlier Samples in Microarray Data. Statistical Applications in Genetics and Molecular Biology, 2009. 8: article 13.
[13] Tomlins, S.A., et al., Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 2005. 310: p. 644-648.
[14] Bandyopadhyay, S. and S. Santra, A genetic approach for efficient outlier detection in projected space. Pattern Recognition, 2008. 41: p. 1338-1349.
[15] Loo, L.-H., et al., New Criteria for Selecting Differentially Expressed Genes. IEEE Engineering in Medicine and Biology Magazine, 2007. 26: p. 17-26.
[16] Tibshirani, R. and T. Hastie, Outlier sums for differential gene expression analysis. Biostatistics, 2007. 8: p. 2-8.
[17] Wu, B., Cancer outlier differential gene expression detection. Biostatistics, 2007. 8: p. 566-575.
[18] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[19] Pell, R.J., Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometrics and Intelligent Laboratory Systems, 2000. 52: p. 87-104.
[20] Pell, R.J., L.S. Ramos, and R. Manne, The model space in partial least squares regression. Journal of Chemometrics, 2007. 21: p. 165-172.
[21] Martens, H. and T. Naes, Multivariate Calibration. 1989, New York: John Wiley & Sons.
[22] Chiang, L.H., R.J. Pell, and M.B. Seasholtz, Exploring process data with the use of robust outlier detection algorithms. Journal of Process Control, 2003. 13: p. 437-449.
[23] Pierna, J.A.F., et al., A methodology to detect outliers/inliers in prediction with PLS. Chemometrics and Intelligent Laboratory Systems, 2003. 68: p. 17-28.
[24] Lleti, R., et al., Outliers in partial least squares regression: Application to calibration of wine grade with mean infrared data. Analytica Chimica Acta, 2005. 544: p. 60-70.
[25] Botella, C., J. Ferré, and R. Boqué, Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009. 80: p. 321-328.
[26] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, S. Kotz and N.L. Johnson, Editors. 1985, Wiley: New York. p. 581-591.
[27] Pérez, N.F., J. Ferré, and R. Boqué, Calculation of the reliability of classification in Discriminant Partial Least-Squares Classification. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 122-128.
[28] Faber, N.K.M. and R. Bro, Standard error of prediction for multiway PLS: 1. Background and a simulation study. Chemometrics and Intelligent Laboratory Systems, 2002. 61: p. 133-149.
[29] Faber, N.K.M., Estimating the uncertainty in estimates of root mean square error of prediction: application to determining the size of an adequate test set in multivariate calibration. Chemometrics and Intelligent Laboratory Systems, 1999. 49: p. 79-89.
[30] Faber, N.K.M., A closer look at the bias-variance trade-off in multivariate calibration. Journal of Chemometrics, 1999. 13: p. 185-192.
[31] Fernández-Pierna, J.A., et al., Methods for outlier detection in prediction. Chemometrics and Intelligent Laboratory Systems, 2002. 63: p. 27-39.
[32] Maesschalck, R.D., et al., Decision criteria for soft independent modelling of class analogy applied to near infrared data. Chemometrics and Intelligent Laboratory Systems, 1999. 47: p. 65-77.
[33] Chow, C.K., On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 1970. 16: p. 41-46.
[34] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[35] Sharaf, M.A., D.L. Illman, and B.R. Kowalski, Chemometrics. 1986: Wiley-IEEE.
[36] Lu, Y. and J. Han, Cancer classification using gene expression data. Information Systems, 2003. 28: p. 243-268.
[37] Kennard, R.W. and L.A. Stone, Computer Aided Design of Experiments. Technometrics, 1969. 11: p. 137-148.
[38] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
CHAPTER 6
Gene selection based on selectivity ratio for probabilistic discriminant partial least squares
Submitted April 2010
Microarray data are often used to determine whether a cell or a tissue is healthy or tumour, or whether it belongs to a subtype of a certain tumour. The quality of these classifications depends on the discriminating ability of the multivariate classification model. This ability decreases if irrelevant genes are included in the training data. Hence, gene selection plays a key role in the analysis of microarray data. In fact, gene selection accomplishes several purposes: 1) the identification of genes that are biologically relevant for the development of a certain disease, 2) the discovery of coexpressed genes in order to build metabolic pathways and 3) the reduction of the dimensionality of the data in order to make data analysis easier.

Many gene selection methods have been developed. Some are based on biological inferences and some have been developed for other types of data. In some cases, gene selection is based on criteria that can be valid for different types of classification models, such as using genetic algorithms to select the genes that minimize the prediction error of a certain classifier [1]. Different classification strategies can be plugged into this selection scheme, as long as the model takes in certain selected genes and gives out a prediction error that characterizes the selected subset of genes. Others, such as selecting the genes that are most correlated with the class label [2] or selection based on statistical tests [3-4], ignore how the classification algorithm processes the data, so they may not favour the same systematic variations in the data that the algorithm does.

Since the basis of this thesis has been the application of DPLS, we sought a gene selection method that could enhance the characteristics that the DPLS algorithm uses from the data. Hence, in this work, we implement the selectivity ratio (SR) index in order to choose the most relevant subset of genes for classification with p-DPLS models. The selectivity ratio specifically evaluates the most relevant variables in PLS models. For each variable, this index is the ratio of the explained variance with respect to the
residual variance. The best genes are those with a high explained variance and a low residual variance. Hence, the genes with the highest SR are selected as significant and the remaining genes are discarded from the analysis.

This paper also discusses another important aspect related to gene selection, namely the influence that the split of the dataset into training and test sets has on the subset of selected genes and on the evaluation of the classifier performance. It is common practice to check the goodness of a gene selection algorithm by classifying a test set. For that purpose, the initial dataset is split into a training set and a test set, either randomly or using an algorithm such as the Kennard-Stone algorithm. Then, based on the training set, a subset or several subsets of genes are selected, and the classification model is calculated using only these genes. Next, the test set is classified. The subset of genes with the highest classification ability is taken as the best selection. These selected genes may be relevant only to discriminate the samples of this particular training set, and the test accuracy may be overoptimistic since the genes were selected based on the accuracy of classification of this particular test set.

In this chapter it is shown that the split of the data into training and test subsets influences the accuracy of the classification. Certain splits can lead to 100% of the test samples being classified correctly, while other splits lead to only 80% of the test set being classified correctly, thus giving a false indication of the true ability of the gene selection algorithm to select the best genes. In this work, many random splits into training and test sets have been used to define the final accuracy of the classification models.

These aspects are discussed and implemented for two datasets, a prostate cancer dataset and a non-small cell lung cancer dataset. For the prostate cancer dataset, the mean of the classification accuracies (by cross-validation) increased from 85% (all 5966 genes used) to 94% when only 17 selected genes were used. Equivalently, the
mean of accuracies for the test samples increased from 84% to 94%. For the non-small cell lung cancer dataset, the model calculated with only 17 of the 54675 original genes provided a cross-validation classification accuracy of 93%.

This work was submitted in April 2010.

References
[1] Tang, E.K., P. Suganthan, and X. Yao, Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics, 2006. 7: article 95.
[2] Mao, K.Z. and W. Tang, Correlation-Based Relevancy and Redundancy Measures for Efficient Gene Selection. Pattern Recognition in Bioinformatics, 2007. 4774: p. 230-241.
[3] Dai, J.J., L. Lieu, and D. Rocke, Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 2006. 5: article 6.
[4] Huang, X., et al., Borrowing information from relevant microarray studies for sample classification using weighted partial least squares. Computational Biology and Chemistry, 2005. 29: p. 204-211.
Gene selection in microarray data based on selectivity ratio index
C. Botella, J. Ferré*, R. Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007 Tarragona, Spain
* Corresponding author: [email protected]
Submitted April 2010. (Edited for format)

ABSTRACT
Most of the gene expressions measured in a microarray experiment are irrelevant for the final application of the data. Irrelevant genes may confound the classification models and decrease their performance. In this work, a gene selection method based on the selectivity ratio index is used. This index is specific for the DPLS method and has been used to select the best genes that discriminate between healthy and tumour prostate tissues and between different subtypes of non-small cell lung cancer. It is also shown that the split of the dataset into training and test sets influences both the genes selected and the estimated accuracy of the classification model. A wrong assessment of the accuracy of the model may lead either to rejecting a good subset of genes or to accepting a suboptimal subset. To overcome this influence, a repetitive strategy including data split, gene selection, validation and prediction is performed. For the prostate dataset, models calculated with only 17 selected genes were able to classify the samples with accuracies around 94%, better than models calculated with all the gene expressions (5966), whose accuracies varied between 50 and 100% depending on the data split. For the non-small cell lung cancer dataset the models calculated with the genes selected following the selectivity ratio index had
better classification abilities, independently of the split of the data (accuracies from 94 to 98% for leave-one-out cross-validation), than the models calculated with all the genes.

6.1 Introduction
DNA microarrays simultaneously provide gene expressions for thousands of genes. Usually, only a few of the measurements describe informative genes that are either overexpressed or underexpressed, while the rest describe unspecific variations or noise. Discovering the co-expressed genes is of interest in order to build metabolic pathways, to know the biological relevance of genes for clinical diagnosis and also to enhance the performance of classification algorithms [1]. Classification of cells and tissues according to their gene expression profiles is one of the main uses of microarray data. Multivariate classification is adversely affected by irrelevant genes, which interfere with the discriminative power of the relevant genes. Hence, gene selection is needed to enhance the accuracy of the classifiers, and it is especially relevant when the biochemical importance of the selected genes will be sought.

In recent years, many methods have been developed to identify the most relevant genes for certain types of diagnoses. Three major groups of methods have been described: filters, wrappers and embedded techniques [2]. Some methods have been based on genetic algorithms [3], random forests [4], weights of support vector machines [5] and statistical tests such as the t-test or the Wilcoxon test [6], to cite a few.

DPLS is one of the most used classification methods for gene expression data [7]. DPLS's most important feature is that it uses linear combinations of the original
variables, which enables dimensionality reduction, noise filtering and outlier detection. Although DPLS does not necessarily require variable selection, it is preferable to input only the relevant variables and to discard those that can distort the calculated factor space. Several approaches for gene selection in PLS have been described. Tan et al. [8] selected genes using the sum of squared correlation coefficients between the gene expressions and the response variables. Czekaj and Walczak [1] used the stability of regression coefficients, and Li Shen [9], following the work of Guyon et al. [5], selected the genes with a high absolute value of the regression coefficient using a recursive feature elimination system. Petterson [10], based on the approach of Trygg et al. [11], used the first weight vector of a PLS model with one factor to estimate the importance of a gene for describing the dependent variable. Other criteria often used to select genes are the Variable Importance on Projection (VIP) [12], which is based on the weights of the DPLS model, and t- or F-statistics [13,14].

Since each classification method enhances particular features of the data, gene selection based on general criteria (e.g., selecting the genes that are most correlated with the class label) does not always provide optimal solutions. Recently, Rajalahti et al. [15] used the selectivity ratio (SR) index to discover the relevant variables in a mass spectral profile, detecting peptides in the low molecular mass range without problems of false biomarker candidates. The advantage of this index is that it can be calculated specifically for DPLS, so that the variables pointed out as relevant also have the largest discriminative power for this type of classification model.

In this work we show the use of the selectivity ratio index to choose the most relevant genes when the classification is carried out using DPLS and microarray gene expression data. It is shown that the initial split of the dataset into a training and a test set may significantly influence the estimated classification performance of the classifiers, and hence the conclusion about the goodness of the selection criterion and of the selected
subset of genes. An approach based on repetitive data split, gene selection, training of the classifier and validation is used in order to better estimate the ability of the selected genes to provide a good classifier.

6.2 Methods
6.2.1 Probabilistic discriminant partial least squares (p-DPLS)
Probabilistic Discriminant Partial Least Squares (p-DPLS) is a new version of Discriminant Partial Least Squares (DPLS) regression [16]. Briefly, p-DPLS starts by calculating a PLS model of A factors relating an N×P gene expression matrix (X) and an N×1 vector of ones and zeros that codifies the samples' class (y). Next, the training samples are predicted with this model. For each training sample, a potential function is calculated as a Gaussian centred at the predicted value ŷ and with standard deviation equal to the standard error of prediction (SEP) of that sample. Next, the potential functions of the samples of the same class are averaged to obtain the probability density function (PDF) of class ω0 and of class ω1. The classification of a test sample is done by calculating the a posteriori probability of each class, based on the prediction ŷ of the sample. The performance of DPLS depends on the relevance of the input genes. Below, the selectivity ratio index is introduced as a method for gene selection.
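The construction of the class PDFs described above can be sketched in a few lines. This is a minimal illustration rather than the p-DPLS implementation: the predictions and SEP values are made-up numbers, the function names are hypothetical, and equal prior probabilities for the two classes are an assumption (the text only states that a posteriori probabilities are calculated).

```python
import numpy as np

def class_pdf(y_values, y_hat_class, sep_class):
    """Average of Gaussian potential functions, one per training sample of a class.

    Each Gaussian is centred at the sample's prediction y_hat and has a standard
    deviation equal to that sample's standard error of prediction (SEP).
    """
    y_values = np.atleast_1d(np.asarray(y_values, dtype=float))
    y_hat_class = np.asarray(y_hat_class, dtype=float)
    sep_class = np.asarray(sep_class, dtype=float)
    z = (y_values[:, None] - y_hat_class[None, :]) / sep_class[None, :]
    gaussians = np.exp(-0.5 * z ** 2) / (sep_class[None, :] * np.sqrt(2.0 * np.pi))
    return gaussians.mean(axis=1)

def posterior_probabilities(y_new, class0, class1, prior0=0.5):
    """A posteriori probability of each class for a prediction y_new (Bayes' rule)."""
    d0 = class_pdf(y_new, *class0)[0] * prior0
    d1 = class_pdf(y_new, *class1)[0] * (1.0 - prior0)
    return d0 / (d0 + d1), d1 / (d0 + d1)

# Made-up training predictions and SEPs: (y_hat values, SEP values) per class
class0 = ([0.0, 0.1], [0.2, 0.2])
class1 = ([0.9, 1.0], [0.2, 0.2])

p0_mid, p1_mid = posterior_probabilities(0.5, class0, class1)  # midway between classes
p0_low, p1_low = posterior_probabilities(0.0, class0, class1)  # close to class 0
print(p0_mid, p1_mid, p0_low, p1_low)
```

A prediction midway between the two classes receives similar posterior probabilities for both, which is the kind of ambiguous sample a reject option is meant to catch.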
6.2.2 Selectivity ratio index
The selectivity ratio (SR) index [15] is based on the target rotation approach of Kvalheim and Karstang [17]. It is defined as the ratio of the explained variance (v_ex,p) to the residual variance (v_res,p) of a variable:

$$\mathrm{SR}_p = \frac{v_{\mathrm{ex},p}}{v_{\mathrm{res},p}} \qquad (1)$$

A target projection model is calculated as

$$\mathbf{X} = \mathbf{t}_{TP}\,\mathbf{p}_{TP}^{T} + \mathbf{E}_{TP} = \mathbf{X}_{TP} + \mathbf{E}_{TP} \qquad (2)$$

where t_TP (N×1) are the target-projected scores and p_TP (P×1) are the target-projected loadings. These are obtained as

$$\mathbf{t}_{TP} = \frac{\mathbf{X}\,\mathbf{b}_{PLS}}{\lVert \mathbf{b}_{PLS} \rVert} \qquad (3)$$

$$\mathbf{p}_{TP}^{T} = \frac{\mathbf{t}_{TP}^{T}\,\mathbf{X}}{\mathbf{t}_{TP}^{T}\,\mathbf{t}_{TP}} \qquad (4)$$

where b_PLS (P×1) are the regression coefficients of the DPLS model calculated for A factors. From Eq. (2), the explained variance for variable p, v_ex,p, is calculated from the pth column of X_TP and the residual variance for variable p, v_res,p, is calculated from the pth column of E_TP.

The genes with the highest SR are the ones that best define the relevant variations in the data.
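Eqs. (1) to (4) translate almost line by line into code. The sketch below assumes that X is already mean-centred and that the regression vector b is supplied by some PLS routine; the variances are computed as column sums of squares, which leaves the ratio unchanged. The function name and the tiny two-variable example are illustrative only.

```python
import numpy as np

def selectivity_ratio(X, b):
    """Selectivity ratio of each variable (column of X) for a regression vector b."""
    X = np.asarray(X, dtype=float)
    b = np.asarray(b, dtype=float)
    t_tp = X @ b / np.linalg.norm(b)        # target-projected scores, Eq. (3)
    p_tp = X.T @ t_tp / (t_tp @ t_tp)       # target-projected loadings, Eq. (4)
    X_tp = np.outer(t_tp, p_tp)             # explained part of X, Eq. (2)
    E_tp = X - X_tp                         # residual part of X, Eq. (2)
    v_ex = (X_tp ** 2).sum(axis=0)          # explained variance per variable
    v_res = (E_tp ** 2).sum(axis=0)         # residual variance per variable
    return v_ex / v_res                     # Eq. (1)

# Two orthogonal, mean-centred toy variables and an arbitrary regression vector
s = np.array([1.0, -1.0, 1.0, -1.0])
n = np.array([1.0, 1.0, -1.0, -1.0])
X = np.column_stack([s, n])
sr = selectivity_ratio(X, b=[2.0, 1.0])
print(sr)  # the variable weighted more heavily by b gets the larger SR
```

In this toy case the first variable has SR = 4 and the second SR = 0.25: the regression vector points mostly along the first variable, so most of its variance is explained by the target projection.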
6.2.3 Effect of data split on performance evaluation
Commonly, gene selection starts by splitting the dataset into a training set and a test set [18-20], either randomly or using a sample selection algorithm such as the Kennard and Stone algorithm [21]. Then, genes are selected so as to optimize a criterion calculated from the training set, and the goodness of the selected genes, and hence of the selection criterion, is checked either by cross-validation [22,23] or by predicting a test set [18-20,24]. Other debatable approaches, such as selecting the genes that best classify a test set, have also been used [25]. The limitation of the single-split approach is that a selection algorithm or a set of selected genes may be discarded because an
146
UNIVERSITAT ROVIRA I VIRGILI
MULTIVARIATE CLASSIFICATION OF GENE EXPRESSION MICROARRAY DATA
Cristina Botella Pérez
ISBN:978-84-693-5427-8/DL:T-1418-2010
6.2Methods
unfortunatesplitofthedatasetleadstolowclassificationaccuraciesforthetestset.
Ortheotherwayround,asuboptimalsetofgenescanbeacceptediftheclassification
abilityofthatparticulartestsetishigh.
In order to overcome this situation, gene selection is done in this work from one
thousand different training subsets selected randomly. For each training set, a DPLS
modelisevaluatedandtheselectivityratioindexSRforeachgeneisevaluated.After
theonethousanditerations,themeanoftheSR'sofeachgeneiscalculatedandthe
geneswiththelargestmeanSRareselected.Theusefulnessofthegenesselectedis
then checked by calculating the classification accuracy of new five hundred DPLS
modelscalculatedusingtheselectedgenesafterrandomlyselectingthetrainingand
testsetsagain.
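The repeated-split selection can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the regression vector is obtained with a plain NIPALS PLS1 routine fitted to the 0/1 class vector, the training subset is drawn with the same fraction of each class, and all function and parameter names are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, n_factors):
    """Regression vector of a PLS1 model (NIPALS) for mean-centred X and y."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                         # scores
        p = Xk.T @ t / (t @ t)             # X loadings
        q_a = (yk @ t) / (t @ t)           # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - q_a * t                  # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def selectivity_ratio(X, b):
    """Ratio of explained to residual sum of squares of each column of X."""
    t = X @ b / np.linalg.norm(b)
    X_tp = np.outer(t, X.T @ t / (t @ t))
    return np.sum(X_tp ** 2, axis=0) / np.sum((X - X_tp) ** 2, axis=0)

def mean_sr_over_splits(X, y, n_factors=2, n_iter=1000, train_fraction=0.5, seed=0):
    """Mean SR of each gene over n_iter random class-balanced training subsets."""
    rng = np.random.default_rng(seed)
    sr_sum = np.zeros(X.shape[1])
    for _ in range(n_iter):
        train = np.concatenate([
            rng.choice(np.flatnonzero(y == c),
                       size=int(round(train_fraction * np.sum(y == c))),
                       replace=False)
            for c in (0, 1)
        ])
        Xc = X[train] - X[train].mean(axis=0)   # mean-centre with the training subset
        yc = y[train] - y[train].mean()
        sr_sum += selectivity_ratio(Xc, pls1_coefficients(Xc, yc, n_factors))
    return sr_sum / n_iter
```

The genes would then be ranked by the returned mean SR, for example with `np.argsort(mean_sr)[::-1]`, and the top-ranked ones kept.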
6.3 Results
6.3.1 Datasets
The prostate dataset [26] consists of 50 non-tumour samples (class ω0) and 52 tumour samples (class ω1) with 12600 gene expressions analysed for each sample. This dataset has previously been used in gene selection studies and to evaluate the performance of classification methods ([4, 27, 28], to cite a few).
The non-small cell lung cancer (NSCLC) dataset [29] consists of 58 samples of the two major histological subtypes of lung cancer, 40 from adenocarcinoma (class ω0) and 18 from squamous cell carcinoma (class ω1), with 54675 gene expressions analysed for each sample.
6.3.2 Discussion
6.3.2.1 Prostate cancer dataset
The dataset was pre-processed as in [26]. The floor value was set at 10, the ceiling value at 16000, and the genes with (Imax_p - Imin_p) < 50 and (Imax_p / Imin_p) < 5 were removed, where Imax_p and Imin_p are the maximum and minimum intensities of gene p, respectively. The intensities of the final 5966 genes left were then log2 transformed.
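The pre-processing steps can be summarised in a short sketch. It assumes an intensity matrix with samples in rows and genes in columns, assumes that the variation filter is evaluated after thresholding, and applies the filter exactly as worded above (a gene is removed when both conditions hold); the function and parameter names are hypothetical.

```python
import numpy as np

def preprocess(intensities, floor=10.0, ceiling=16000.0, min_diff=50.0, min_ratio=5.0):
    """Threshold, filter and log2-transform an (N samples x P genes) intensity matrix."""
    clipped = np.clip(intensities, floor, ceiling)   # floor and ceiling values
    i_max = clipped.max(axis=0)
    i_min = clipped.min(axis=0)
    # As worded in the text: a gene is removed when both conditions hold.
    remove = ((i_max - i_min) < min_diff) & ((i_max / i_min) < min_ratio)
    keep = ~remove
    return np.log2(clipped[:, keep]), np.flatnonzero(keep)
```

The second returned value holds the column indices of the genes that survive the filter, so that gene identifiers can be tracked after filtering.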
This dataset was randomly split into a training set and a test set with the only constraint that the training set should contain 50% of the samples of each class from the initial dataset. Then a two-factor DPLS model was calculated with mean-centered data and the SR index was calculated for each gene. The number of factors was initially determined as the one with the lowest root mean square error of cross-validation using all the genes. It was later checked that a different reasonable number of factors of the DPLS model did not affect the genes that were selected as relevant after the one thousand repetitions. The procedure was repeated one thousand times and the average SR index of each gene was calculated. The 10, 17 and 35 genes with the highest average SR for these models were selected as potentially relevant (Table 1).
Figure 1 shows the mean SR for the fifty genes with the highest index. After the first 17 selected genes, the remaining genes have similar SR. Hence, the discriminative power of the rest of the genes is not relevant enough to justify their inclusion in the model. Nevertheless, the best 10, 17 and 35 genes were selected in order to compare them with previous selection results obtained using the random probabilistic model building genetic algorithm (RPMBGA) criterion [25].
Figure 1. Mean of SR among the 1000 iterations for the fifty genes with highest SR. (Horizontal axis: variable, 0 to 50; vertical axis: mean of SR among the 1000 iterations, 0 to 3.)
Table 1. The 35 most relevant genes according to the SR index calculated from 1000 p-DPLS models.
Id of genes selected (column headings: 10 genes, 17 genes, 35 genes):
37639_at, 1767_s_at, 39756_at, 33137_at, 32598_at, 36601_at, 769_s_at, 32076_at, 40282_s_at, 37720_at, 36491_at, 1521_at, 41468_at, 575_s_at, 38410_at, 35742_at, 38406_f_at, 39315_at, 38087_s_at, 32206_at, 41288_at, 34840_at, 40024_at, 1740_g_at, 38634_at, 31444_s_at, 38051_at, 34407_at, 32243_g_at, 33904_at, 33198_at, 33362_at, 1513_at, 37366_at, 40856_at
* Note that, to avoid redundancy, the 17 genes are the 10 in the first column plus the 7 in the second column, and analogously the 35 genes are the 10 in the first column plus the 7 in the second and the 18 in the third and fourth columns.
The ability of the selected genes to discriminate between tumour and non-tumour samples was evaluated for models calculated using only the 10, 17 and 35 relevant genes [24]. In order to make the results less dependent on the data split, the classification performance was calculated for five hundred p-DPLS models. These models were calculated from randomly generated training and test sets with 50% of the samples of each class in each set. For a better comparison, the samples in the training set and test set in each repetition are the same for the models calculated with 10, 17 and 35 genes.
The histograms in Figure 2 summarize the validation accuracies of the five hundred models calculated with 10, 17 or 35 genes. For each model (a selected subset of training samples and genes), the leave-one-out cross-validation (LOOCV) accuracy and the test set accuracy were evaluated. If the subset of genes is adequate, one would expect both accuracies to be high and similar, independently of the samples used to calculate the model.
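The evaluation protocol (repeated class-balanced splits, with a LOOCV accuracy and a test set accuracy for each split) can be sketched independently of the classifier. In the sketch below the classifier is passed as a function; the nearest-mean rule is only a stand-in used to keep the example self-contained and is not p-DPLS, and all names are hypothetical.

```python
import numpy as np

def nearest_mean_fit_predict(X_train, y_train, X_new):
    """Stand-in classifier (NOT p-DPLS): assign each sample to the nearest class mean."""
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_new[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def loocv_accuracy(X, y, fit_predict):
    """Leave-one-out cross-validation accuracy within a training set."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += int(fit_predict(X[mask], y[mask], X[i:i + 1])[0] == y[i])
    return hits / len(y)

def repeated_split_accuracies(X, y, fit_predict, n_splits=500, seed=0):
    """(LOOCV accuracy, test accuracy) for n_splits random 50% class-balanced splits."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_splits):
        train = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=np.sum(y == c) // 2, replace=False)
            for c in (0, 1)
        ])
        test = np.setdiff1d(np.arange(len(y)), train)
        cv_acc = loocv_accuracy(X[train], y[train], fit_predict)
        test_acc = np.mean(fit_predict(X[train], y[train], X[test]) == y[test])
        results.append((cv_acc, float(test_acc)))
    return np.array(results)
```

A two-dimensional histogram of the returned pairs gives the kind of summary shown in Figure 2: a sharp peak indicates that the accuracies hardly depend on the split, whereas a flat histogram indicates a strong dependence.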
Figure 2a shows that the models calculated with 10 genes had LOOCV accuracies from 85% to 100% depending on the data split. Test set accuracies also ranged from 85% to 100%. Most of the models had a LOOCV accuracy of 96% and a test set accuracy of 92%. These high values of both accuracies indicate that the subset of genes accounted for the main differences between non-tumour and tumour prostate samples. The fact that the histogram is sharp indicates that the high accuracy was maintained for most of the models and was largely independent of the split of the samples into training and test sets. Note also that a single unfortunate split can lead to low values of both LOOCV accuracy (88%) and test accuracy (90%), which could lead one to reject the selected subset of genes in favour of previously reported subsets because it did not improve the performance. Also note that some data splits can lead to models with a large difference between the LOOCV classification accuracy and the test set classification accuracy (e.g. LOOCV accuracy of 88% and test set accuracy of 100%). These results highlight the relevance that the data split may have when determining the usefulness of a selected subset of genes or the usefulness of a given classification rule.
Similar remarks can be drawn from Figures 2b and 2c for models calculated with the optimal 17 and 35 genes following the SR criterion. For 17 genes, the most frequent LOOCV accuracy is 94%, with a test accuracy of 92% (Figure 2b). For the subset of 35 genes, most of the models have high LOOCV and test accuracies of 94% (Figure 2c).
The models calculated with genes selected with the selectivity ratio index were compared with models calculated from genes selected in the literature. Figure 2d shows the accuracies of the five hundred models calculated using the optimal genes reported in reference [25]. Note that although the subsets of genes were chosen with a different criterion (RPMBGA) and for a different classifier (support vector machines), they can also give DPLS models with high accuracies. However, the histograms are not as sharp as in Figure 2(a-c), so the quality of the models depends much more on the split of the data into training and test sets than when the genes are selected with the SR index. Reference [25] reported test set accuracies of 98% calculated for one single dataset split. Note that for DPLS those genes can give accuracies as high as 100% for certain dataset splits, but most of the splits give around 92% accuracy. This suggests that p-DPLS performs worse with those genes than with the subset selected with the SR index.
For the RPMBGA subsets of 17 and 35 genes, the accuracies varied from 85% to 100% (Figure 2e-2f). Note that in that case the accuracies obtained depended even more on the training and test sets into which the dataset was split, and the histograms were flatter.
When using the dataset without gene selection (5966 genes), the validation accuracies range from 50% to 100% for different data splits (Figure 3). The mean LOOCV accuracy was 85% and the mean test accuracy was 84%. The lower accuracies as compared to using subsets of selected genes can be attributed to the fact that the models are taking into account false correlations. Given the large number of genes, some uninteresting genes may become correlated with the class label for a certain data split, so that the model will assign a high modelling importance to those genes. The test set, which does not show the same correlation pattern, is then classified with a high error. The almost flat histogram suggests that the accuracies change often depending on the split into training and test sets, and hence that using all the genes does not provide models that systematically perform well.
Figure 2. Prostate dataset. Training (LOOCV) and test accuracy frequencies (per unit) for the five hundred p-DPLS models calculated with 10 (a, d), 17 (b, e) and 35 (c, f) genes chosen with the SR criterion (a-c) and by RPMBGA (d-f).
Figure 3. Prostate cancer dataset. Training and test accuracy frequencies (per unit) for the five hundred p-DPLS models calculated with all the genes remaining after preprocessing (5966 genes).
6.3.2.2 Non-small cell lung cancer dataset
The non-small cell lung cancer dataset consists of 54675 gene expressions from 58 samples of adenocarcinoma (AC) and squamous cell carcinoma (SCC). Following the procedure described for the prostate dataset, one thousand random training and test subsets were generated and the SR index of each gene was calculated for each of the models to discriminate between AC and SCC samples. The 17 and 30 genes with the highest average SR index over the one thousand models were selected as relevant (Table 2). These numbers of genes were chosen in order to compare the results with previously reported results [29].
Table 2. The 30 most relevant genes according to the SR index calculated from 1000 p-DPLS models.
Id of genes selected (column headings: 17 genes, 30 genes):
206032_at, 204455_at, 1559606_at, 1555501_s_at, 206033_at, 217528_at, 205595_at, 219507_at, 211194_s_at, 206164_at, 217272_s_at, 228806_at, 216918_s_at, 225822_at, 221796_at, 206156_at, 244107_at, 226832_at, 235075_at, 207382_at, 221795_at, 214680_at, 57703_at, 222892_s_at, 204136_at, 206266_s_at, 230464_at, 203097_s_at, 206165_s_at, 201818_at
* The 30 genes are the 17 in the first two columns plus the 13 in the third and fourth columns.
The selected genes were used to calculate five hundred p-DPLS models using random training and test sets. These models were also compared with the models calculated with the genes selected in a previous work [29].
Figure 4 summarizes the validation accuracies of the five hundred models obtained by LOOCV and by predicting the test set for subsets of 17 and 30 genes. The accuracies for LOOCV and for test data ranged from 85% to 100%. Note that most of the models with the 17 genes of maximal SR have LOOCV and test accuracies from 94% to 98% (Figure 4a). This is even more notable when the 30 genes are used (Figure 4b), for which the number of models with test accuracies outside this range is insignificant. In contrast, for the 17 and 30 genes selected in [29] the p-DPLS models have varying accuracies, from 88% to 98%, without dominant training and test accuracy values (Figures 4c-4d). Again, this points out the importance that the data split has on the evaluated accuracies. Note also that the mean of the test accuracies obtained by the models calculated with the genes selected following the SR criterion is slightly better than that obtained with the genes selected in [29] (from 92% to 93% for the 17-gene subset and from 92% to 93% for the 30-gene subset).
Figure 4. Training and test accuracy frequencies (per unit) for the five hundred p-DPLS models calculated with 17 (a, c) and 30 (b, d) genes selected by the SR criterion (a-b) or in the reference work (c-d).
6.4 Conclusions
The selectivity ratio index has been used to select the best subset of discriminant genes for microarray data classification with p-DPLS. The methodology reduces the influence of the samples selected as training samples on the final classification accuracies, and the selected genes give models with very similar classification abilities independently of the data split. We have also shown that the accuracies of the models may depend to a large extent on the particular samples in the training set and that using a single test set to validate the gene subset may result in either too optimistic or too pessimistic conclusions.
Acknowledgements
The authors thank the Departament d'Universitats, Recerca i Societat de la Informació de Catalunya for providing Cristina Botella's doctoral fellowship, and the Spanish Ministerio de Educación y Ciencia for its support (project CTQ2007-66918/BQU).
References
[1] Czekaj, T., W. Wu, and B. Walczak, Classification of genomic data: Some aspects of feature selection. Talanta, 2008. 76: p. 564-574.
[2] Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23: p. 2507-2517.
[3] Tang, E.K., P. Suganthan, and X. Yao, Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics, 2006. 7: article 95.
[4] Díaz-Uriarte, R. and S.A.d. Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7: article 3.
[5] Guyon, I., et al., Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002. 46: p. 389-422.
[6] Troyanskaya, O.G., et al., Nonparametric methods for identifying differentially expressed genes in microarrays. Bioinformatics, 2002. 18: p. 1454-1461.
[7] Boulesteix, A.-L. and K. Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 2007. 8: p. 32-44.
[8] Tan, Y., et al., Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Research, 2005. 33: p. 56-65.
[9] Shen, L., PLS and SVD based penalized logistic regression for cancer classification using microarray data. Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, 2005: p. 219-228.
[10] Pettersson, F. and A. Berglund, Interpretation and validation of PLS models for microarray data. Chemometrics and Chemoinformatics, ACS Symposium Series, 2005. 894: p. 31-40.
[11] Trygg, J., O2-PLS for qualitative and quantitative analysis in multivariate calibration. Journal of Chemometrics, 2002. 16: p. 283-293.
[12] Musumarra, G., et al., Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. Journal of Chemometrics, 2004. 18: p. 125-132.
[13] Dai, J.J., L. Lieu, and D. Rocke, Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 2006. 5: article 6.
[14] Huang, X., et al., Borrowing information from relevant microarray studies for sample classification using weighted partial least squares. Computational Biology and Chemistry, 2005. 29: p. 204-211.
[15] Rajalahti, T., et al., Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometrics and Intelligent Laboratory Systems, 2009. 95: p. 35-48.
[16] Botella, C., J. Ferré, and R. Boqué, Classification from microarray data using probabilistic discriminant partial least squares with reject option. Talanta, 2009. 80: p. 321-328.
[17] Kvalheim, O.M. and T.V. Karstang, Interpretation of latent-variable regression models. Chemometrics and Intelligent Laboratory Systems, 1989. 7: p. 39-51.
[18] Horng, J.-T., et al., An expert system to classify microarray gene expression data using gene selection by decision tree. Expert Systems with Applications, 2009. 36: p. 9072-9081.
[19] Yoon, Y., et al., Direct integration of microarrays for selecting informative genes and phenotype classification. Information Science, 2008. 178: p. 88-105.
[20] Li, L., et al., Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 2001. 17: p. 1131-1142.
[21] Kennard, R.W. and L.A. Stone, Computer Aided Design of Experiments. Technometrics, 1969. 11: p. 137-148.
[22] Hossain, A., et al., A flexible approximate likelihood ratio test for detecting differential expression in microarray data. Computational Statistics & Data Analysis, 2009. 53: p. 3685-3695.
[23] Li, G.-Z., et al., Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis. BMC Genomics, 2008. 9: p. S24-S38.
[24] Paul, T.K. and H. Iba, Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009. 6: p. 353-367.
[25] Paul, T.K. and H. Iba, Gene selection for classification of cancers using probabilistic model building genetic algorithm. BioSystems, 2005. 82: p. 208-225.
[26] Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
[27] Dettling, M., BagBoosting for tumour classification with gene expression data. Bioinformatics, 2004. 20: p. 3583-3593.
[28] Jeffery, I.B., D.G. Higgins, and A.C. Culhane, Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics, 2006. 7: p. 359-375.
[29] Kuner, R., et al., Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer, 2009. 63: p. 32-38.
CHAPTER 7 Multi-class classification
of microarray gene expression data
Submitted May 2010
Microarray gene expression data were initially used for binary differentiation, e.g., to classify a sample or a cell as healthy or tumour. Commonly, however, diseases with a genetic origin have more than two subtypes, so the problem of classifying a sample from gene expression data is more often than not a multi-class classification problem.
Although some classification algorithms can easily handle many classes (e.g., k-nearest neighbours classification), others (e.g. some versions of DPLS) are designed to deal with two classes only. In order to be able to use the powerful binary classifiers available for multi-class classification, new strategies have to be devised. One of these strategies is to perform binary classifications between pairs of classes, and then combine the results to obtain the final class label. This one-versus-one strategy is often better than modelling one class against all the others (the one-versus-all strategy). The reason is that in the one-versus-all strategy, different subtypes of samples are grouped into the same class, which must be differentiated from the target class. In contrast, the one-versus-one strategy allows the model to focus on the genes that actually differentiate one particular class from another particular class.
A difficulty in the one-versus-one strategy is that a new sample will be submitted to all the binary models that make up the classification system. For the binary models that modelled the class of the sample, the prediction should be that the sample belongs to the modelled class. For all the other models, the sample is an outlier and should be detected as such. Hence, the combination of the results of the binary classifiers in order to obtain the final assigned class is a fundamental step.
In the present work multi-class classification is performed in two steps by combining partial least squares (PLS) regression and linear discriminant analysis (LDA). In the initial step, one-versus-one PLS models provide a prediction (a single value) for each sample and for each model. Each one-versus-one PLS model can only discriminate between two different classes. However, the predictions of samples from the classes not modelled by a given PLS model may span the whole domain, and hence these samples may be misclassified. So, the multi-class classification is done in a second step with the LDA classifier applied to the predictions of the samples in all the one-versus-one PLS models.
The methodology was used to classify samples of the leukemia and small round blue cell tumour datasets. The classification accuracies were 97%, using only 15 genes, and 100% with 17 genes, respectively.
This paper was submitted in May 2010.
Multi-class classification of microarray gene expression data
C. Botella*, J. Ferré, R. Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University. Marcel·lí Domingo s/n, 43007 Tarragona, Spain
* Corresponding author: [email protected]
Submitted May 2010. (Edited for format)
ABSTRACT
When classification from microarray gene expression data is a multi-class problem, the outputs of binary classifiers such as discriminant partial least squares (DPLS) must be combined to obtain the final classification result. In this work a new methodology for multi-class classification that combines partial least squares (PLS) and linear discriminant analysis (LDA) has been developed. The method also includes a gene selection step based on the selectivity ratio index so that the best performing genes for each binary PLS model are selected. When the methodology was applied to the leukemia dataset, which has three classes, 97% of the samples were correctly classified using only 15 genes in the PLS models. For the small round blue cell tumour dataset, which has four classes, 100% of the samples were correctly classified using only 17 genes in the PLS models.
7.1 Introduction
An important challenge in the use of large-scale gene expression data for biological classification occurs when the dataset involves multiple classes [1]. So far, most of the research on classification of microarray data has focused on two major classes only (e.g. normal versus cancer tissue, response to treatment versus no response). However, practical cancer diagnosis requires differentiating among more than two types or subtypes and, hence, multi-class classification techniques are needed [2].
Multi-class classification can be approached in two ways. One way is the use of algorithms that treat multi-class problems directly, such as k-Nearest Neighbours (kNN), Linear Discriminant Analysis (LDA) or Neural Networks (NN). A second way is to decompose the multi-class problem into multiple binary classification problems and use binary classification algorithms, such as Discriminant Partial Least Squares (DPLS) or Total Principal Component Regression (TPCR). These binary classification models can be calculated by modelling either one class versus the others (one versus all, OVA), one class versus each other class (one versus one, OVO) or using hierarchical partitioning [3, 4]. Then, the results of the binary classifiers are combined to obtain the assigned class label.
Several novel methods have been developed for multi-class classification with microarray data. Tan et al. [5] used TPCR, which takes into account the information of the dependent variables and also the errors in the dependent and independent variables. Ooi et al. [1] used genetic algorithms (GA) for gene selection, with classification based on maximum likelihood. They obtained better classification accuracies than previously published methods and reduced the number of genes needed for classification. Leng et al. [6] proposed Sparse Optimal Score (SOS), based on Fisher LDA, as a multicategory classifier and classified three public datasets satisfactorily. Tibshirani et al. [7] proposed the nearest shrunken centroid method for cancer class prediction. With the same multi-class classification objective, some studies proposed derivations of SVM. Lee et al. designed an optimal multicategory SVM [8], and Peng et al. [2] and Liu et al. [9] combined GA and one-versus-one SVM. In contrast, de Souza et al. [10] applied GA and one-versus-all SVM.
DPLS has proven useful for binary classification of microarray data but it has not been much studied for multi-class classification. Nguyen et al. [11] used PLS as a dimension reduction technique for subsequent classification with logistic discrimination or quadratic discriminant analysis. DPLS2 was used by Tan et al. [12] to classify multi-class public datasets using the OVA strategy. However, this strategy may lack biological sense for microarray data analysis when, for instance, healthy samples must be grouped together with tumour samples and discriminated from other tumour types.
In this work we describe the application of PLS combined with LDA for multi-class classification. Several OVO PLS models are calculated and LDA is applied to the predictions of the samples in each of these models. The advantage of using OVO models is that each model maximizes the differences between the two modelled classes. Additionally, gene selection is performed for each PLS model to increase the discriminant ability. The selection is based on the highest selectivity ratio index [13], which is specially suited for PLS. The method has been applied to two datasets, the leukemia dataset [14] and the small round blue cell tumour dataset [15].
7.2 Methods
7.2.1 Multi-class classification method: Partial Least Squares-Linear discriminant analysis
The multi-class classification in C classes is done by combining PLS regression and LDA (Figure 1) and may be validated by leave-one-out cross-validation (LOOCV) or by a test set.
PLS is a regression method based on maximizing the covariance between X and y [16]. The gene expression microarray data, X, is an N×P matrix of N samples and P gene expressions, and y is a vector of zeros and ones that codifies the classes of the samples. In this paper, one-versus-one PLS models are calculated, so X only contains samples from the two modelled classes, for instance class ω1 (e.g. "tumour type I") and class ω2 (e.g. "tumour type II"). The zeros in y codify the samples of class ω1 and the ones in y codify the samples of class ω2. With these settings, PLS models for every combination of two classes ωi vs. ωj (i = 1, ..., C; j > i) are calculated (Figure 1(c)).
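The one-versus-one decomposition amounts to enumerating the C(C-1)/2 pairs of classes and building one 0/1-coded training set per pair. A minimal sketch follows, assuming a label array that holds the class of every sample; the function name is hypothetical.

```python
from itertools import combinations
import numpy as np

def ovo_training_sets(X, labels):
    """Yield ((class_i, class_j), X_ij, y_ij) for every pair of classes, i before j."""
    classes = sorted(set(labels.tolist()))
    for ci, cj in combinations(classes, 2):
        mask = (labels == ci) | (labels == cj)      # keep only the two modelled classes
        y_ij = (labels[mask] == cj).astype(float)   # 0 codes class i, 1 codes class j
        yield (ci, cj), X[mask], y_ij
```

For three classes the generator yields the pairs (1, 2), (1, 3) and (2, 3), in that order, which matches the ordering of the prediction vector used later in this section.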
For a sample to be classified in one of the C classes, its prediction in each PLS model is calculated as:
\[ \hat{y} = \mathbf{x}^{\mathrm{T}} \hat{\mathbf{b}} \tag{1} \]
where b̂ is the vector of regression coefficients for the model of A factors and x is the gene expression vector of the sample. Note that if b̂ has been calculated from mean-centered data, then x should be mean-centered and ŷ should be processed accordingly.
The sample to be classified is predicted in all the OVO PLS models (Figure 1(c)), thus obtaining a vector of predictions ŷ (Figure 1(d)). For instance, if there are three subtypes of samples, three PLS models are calculated: class ω1 versus class ω2, class ω1 versus class ω3 and class ω2 versus class ω3. The prediction of a sample in these three models generates ŷ = [ŷ12 ŷ13 ŷ23] (the subscripts indicate the classes accounted for in each model), which describes the behaviour of the sample in the multi-class classifier. Ideally, if the sample belongs to class ω1, ŷ12 and ŷ13 should be close to zero and ŷ23 should be far above 1 or far below 0 so that the sample could be detected as an outlier in the model of class ω2 vs. class ω3. In practice, this is not always the case, and outliers may have predictions along the entire ŷ domain, mixed with the predictions of the modelled classes. Similarly, a sample of class ω2 should have a ŷ12 close to one, a ŷ23 close to zero and an undetermined value of ŷ13. Finally, a sample of class ω3 should have ŷ13 and ŷ23 close to one and an undetermined value of ŷ12. LDA is then applied to ŷ.
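Building the vector of predictions can be sketched as below. The sketch assumes that every OVO model was fitted on mean-centred data, so the sample is centred with the training means and the mean of y is added back to the prediction, which is one reading of "processed accordingly". The dictionary keys are hypothetical names introduced only for this illustration.

```python
import numpy as np

def ovo_prediction_vector(x, models):
    """Stack the predictions of one sample in every OVO PLS model.

    Each model is a dict with hypothetical keys: 'genes' (indices of the selected
    genes), 'x_mean' and 'y_mean' (training means used for centring) and 'b'
    (regression vector of the model).
    """
    preds = []
    for m in models:
        x_sel = x[m["genes"]] - m["x_mean"]                 # centre with the training means
        preds.append(float(x_sel @ m["b"]) + m["y_mean"])   # undo the centring of y
    return np.array(preds)
```

With three classes the returned array has three entries, in the order ŷ12, ŷ13, ŷ23 if the models are stored in that order.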
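LDA finds discriminant functions (directions) such that the distance between the classes' mean vectors is maximized when the data are projected onto such functions. Let ŷ be the vector of predictions obtained for the sample that must be classified. A discriminant score (m) is calculated for that sample and for each class c as:
\[ m(\hat{\mathbf{y}}) = (\hat{\mathbf{y}} - \boldsymbol{\mu}_c)^{\mathrm{T}}\, \mathbf{S}_{pooled}^{-1}\, (\hat{\mathbf{y}} - \boldsymbol{\mu}_c) - 2 \ln \pi_c \tag{2} \]
where μ_c is the mean vector of the predictions of the training samples of class c, and π_c is the a priori probability of class c, calculated as the number of samples of the class over the total number of samples:
\[ \pi_c = \frac{n_c}{N} \tag{3} \]
and S_pooled is the covariance matrix evaluated as:
\[ \mathbf{S}_{pooled} = \frac{1}{N} \sum_{c=1}^{C} n_c\, \mathbf{S}_c \tag{4} \]
where S_c is
\[ \mathbf{S}_c = \frac{1}{n_c} \sum_{n=1}^{n_c} (\hat{\mathbf{y}}_n - \boldsymbol{\mu}_c)(\hat{\mathbf{y}}_n - \boldsymbol{\mu}_c)^{\mathrm{T}} \tag{5} \]
and ŷ_n is the vector of predictions of the nth training sample of class c.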
Then the sample is classified in the class for which it has the lowest classification score (Figure 1(e)).
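The LDA step of Eqs. (2) to (5) can be sketched with NumPy as follows. The sketch assumes a matrix with one row of OVO predictions per training sample and a label array with the true classes; the function names are hypothetical, and the pooled covariance matrix is assumed to be invertible.

```python
import numpy as np

def fit_lda(Y_hat, labels):
    """Class means, priors and inverse pooled covariance from training predictions."""
    classes = np.unique(labels)
    n_total = len(labels)
    means, priors = [], []
    pooled = np.zeros((Y_hat.shape[1], Y_hat.shape[1]))
    for c in classes:
        Yc = Y_hat[labels == c]
        mu = Yc.mean(axis=0)
        dev = Yc - mu
        S_c = dev.T @ dev / len(Yc)           # class covariance, Eq. (5)
        pooled += len(Yc) * S_c / n_total     # pooled covariance, Eq. (4)
        means.append(mu)
        priors.append(len(Yc) / n_total)      # a priori probability, Eq. (3)
    return classes, np.array(means), np.array(priors), np.linalg.inv(pooled)

def classify(y_hat, classes, means, priors, pooled_inv):
    """Assign the class with the lowest discriminant score, Eq. (2)."""
    diffs = y_hat - means
    scores = np.einsum("ci,ij,cj->c", diffs, pooled_inv, diffs) - 2.0 * np.log(priors)
    return classes[np.argmin(scores)], scores
```

A new sample is classified by first computing its vector of OVO predictions and then calling `classify` with the fitted quantities.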
Figure 1. Scheme of the training process of a three-class PLS-LDA classifier (panels: initial dataset; gene selection; OVO PLS models and matrices with selected genes; training samples prediction; LDA classifier calculation). a. Initial dataset. b. OVO PLS models with A factors (initial guess) are calculated and genes are selected with the SR index for each model. c. The optimal OVO PLS models are calculated with the selected genes. d. All the training samples are predicted in each OVO PLS model, obtaining a ŷ matrix. e. The LDA classifier is calculated, using ŷ as independent variables and y as the class code. Note that P12, P23 and P13 represent the same number of genes but not necessarily the same genes. The optimal number of factors in the OVO PLS models is the one that minimizes the RMSECV criterion.
7.2.2 Selectivity ratio index
The Selectivity Ratio (SR) index flags the most relevant variables for PLS. It is based on a target rotation approach [17] and is detailed in reference [13]. The SR index is defined as the ratio of the explained variance (v_ex,p) to the residual variance (v_res,p) of a variable (p):
\[ SR_p = \frac{v_{\mathrm{ex},p}}{v_{\mathrm{res},p}} \tag{6} \]
Consider that PLS decomposes X as:
$\mathbf{X} = \mathbf{t}_{TP}\mathbf{p}_{TP}^{\mathrm{T}} + \mathbf{E}_{TP} = \mathbf{X}_{TP} + \mathbf{E}_{TP}$ (7)
where $\mathbf{t}_{TP}$ (N×1) are the target-projected scores and $\mathbf{p}_{TP}$ (P×1) are the target-projected loadings. The explained variance for each variable p is calculated from column p of the reconstructed $\mathbf{X}_{TP}$, and the residual variance is calculated from column p of the residual matrix $\mathbf{E}_{TP}$. Note that $\mathbf{t}_{TP}$ and $\mathbf{p}_{TP}$ in equation 7 are calculated following the procedure in [13]. The genes with the highest $SR_p$ index are selected as the most relevant to discriminate between the two classes modelled by the PLS model. Note that each OVO PLS model has its own optimal subset of genes that best discriminate between the two modelled classes. The number of genes in the subset may differ from one PLS model to another. To avoid an additional optimization step, the methodology implemented here used the same number of genes for all the PLS models, although the genes were not necessarily the same.
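As an illustration of equation (6), the following sketch computes the SR index of each variable from a reconstructed matrix and a residual matrix that are assumed to be already available. The function names and the toy matrices are hypothetical, and the variance of each column is taken here as its sum of squares, which assumes mean-centred data. The target-projection step of reference [13] is not reproduced.

```python
def selectivity_ratio(X_tp, E_tp):
    """SR_p = v_ex,p / v_res,p for every column p (equation 6)."""
    n_vars = len(X_tp[0])
    sr = []
    for p in range(n_vars):
        v_ex = sum(row[p] ** 2 for row in X_tp)   # explained variance of variable p
        v_res = sum(row[p] ** 2 for row in E_tp)  # residual variance of variable p
        sr.append(v_ex / v_res if v_res > 0 else float("inf"))
    return sr


def top_genes(sr, k):
    """Indices of the k variables with the highest SR index."""
    return sorted(range(len(sr)), key=lambda p: sr[p], reverse=True)[:k]


# Toy matrices (invented values): 2 samples x 3 genes
X_tp = [[1.0, 0.1, 2.0], [-1.0, -0.1, -2.0]]
E_tp = [[0.5, 0.5, 0.5], [-0.5, -0.5, -0.5]]
sr = selectivity_ratio(X_tp, E_tp)
selected = top_genes(sr, 2)
```

In this toy case the third and first genes are retained, because their explained variance is large relative to their residual variance.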
7.3 Datasets
The leukemia dataset [14] consists of 72 samples of acute lymphoblastic leukemias carrying a chromosomal translocation, divided into three subtypes: acute lymphoblastic leukemia (ALL, 24 samples, class ω1), mixed lineage leukemia (MLL, 20 samples, class ω2) and acute myeloid leukemia (AML, 28 samples, class ω3). For each sample, 12582 gene expressions were obtained. This dataset was pre-processed as described in [14].
The small round blue cell tumour (SRBCT) dataset [15] consists of 63 training samples from four different cell subtypes: 23 samples are from the Ewing family of tumours (EWS, class ω1), 20 are rhabdomyosarcomas (RMS, class ω2), 12 are neuroblastomas (NB, class ω3) and
the remaining 8 are Burkitt lymphomas (BL, class ω4). The independent test set has 20 samples: 6 of class ω1, 5 of class ω2, 6 of class ω3 and 3 of class ω4. For each training and test sample, 2308 genes were analysed. The algorithms were run in Matlab® software.
7.4 Results
7.4.1 Leukemia dataset
Three OVO PLS models were calculated for A factors: a model of ALL vs. MLL, a model of ALL vs. AML, and a model of MLL vs. AML. For each PLS model, the genes with the highest SR index were selected. Three group sizes of 15, 50 and 100 genes were tested so that the results could be compared with previous results [14, 18]. The OVO PLS models were recalculated using the selected genes, and the optimal number of factors was the one that minimized the root mean square error of leave-one-out cross-validation (RMSECV). Note that this number of factors may differ from the one used in the preliminary model used for selecting the genes. The three optimal PLS models were used to predict all the training samples. A matrix Ŷ (72×3) of predictions was then obtained and used for training the LDA classifier. A sample to be classified was first predicted with the three PLS models, thus obtaining a vector ŷ (3×1) of predictions. This vector was supplied to the LDA classifier to obtain the final classification.
For this dataset a test set was not available, so leave-one-out cross-validation (LOOCV) was carried out. Hence, all the samples were used once as a test sample, obtaining for each one a (3×1) vector of predictions, and yielding a matrix Ŷt (72×3) of predictions in total. This matrix was used to predict the class with the LDA classifier calculated with the training samples in the previous step.
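The leave-one-out idea can be sketched generically as follows. This is only an illustration: the PLS and LDA steps are replaced by a hypothetical stand-in classifier (nearest class mean on a single variable), and all function names and data values are invented.

```python
def loocv_predictions(samples, labels, fit, predict):
    """Leave-one-out: each sample is predicted by a model trained on all the others."""
    predicted = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predicted.append(predict(model, samples[i]))
    return predicted


# Hypothetical stand-in classifier (NOT the PLS-LDA of this chapter)
def fit_nearest_mean(xs, ys):
    """Store the mean of the training values of each class."""
    means = {}
    for c in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(vals) / len(vals)
    return means


def predict_nearest_mean(means, x):
    """Assign the class whose mean is closest to x."""
    return min(means, key=lambda c: abs(x - means[c]))


samples = [0.0, 0.1, 0.2, 1.0, 1.1, 0.9]
labels = ["ALL", "ALL", "ALL", "MLL", "MLL", "MLL"]
predicted = loocv_predictions(samples, labels, fit_nearest_mean, predict_nearest_mean)
```

Each sample is left out exactly once, so the list of predictions has one entry per sample and can be compared directly with the true labels.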
Modelling with the 15 most relevant genes
Figure 2a shows the LOOCV predictions from the three binary PLS models calculated with only the 15 most discriminant genes, selected according to the SR index. Note that the model of ALL vs. MLL correctly discriminates the samples of class ALL (whose predictions are around 0) from the samples of class MLL (whose predictions are around 1). However, it cannot differentiate the samples of class AML. Ideally, these samples should behave differently and have extreme predictions, so that they could be detected as outliers. Instead, their predictions lie between 0 and 1, so the predictions of the PLS model alone are not enough for correctly classifying all the samples. A similar situation occurred with the MLL samples in the model ALL vs. AML and with the ALL samples in the model MLL vs. AML (Figure 2a).
Next, LDA was applied to the predictions ŷ of each LOOCV sample. Figure 2b shows the validation samples already classified by LDA in the space of the PLS predictions. The LOOCV classification accuracy was 97.2%, higher than the 95% LOOCV accuracy previously reported for this dataset [14] using kNN and selecting the genes following a signal-to-noise criterion. An accuracy of 97.2% means that the method misclassified only 2 of the 72 samples. These two misclassified samples, MLL_2 and MLL_15, are samples of class MLL that were assigned to class AML. Figure 3 shows the discriminant scores of the LDA classifier for the first two discriminant functions. Note that for these samples the discriminant score in the second discriminant function is not high enough for them to be assigned to their true class MLL. Both samples have raw intensities lower than the intensities of the samples of their true class MLL and more similar to the intensities of the samples of class AML. As a consequence, the discriminant scores and the predicted ŷ's for these two samples were more similar to the ŷ's of class AML. More concretely (Table 1), MLL_2 has predictions ŷ12 = 0.62 and ŷ13 = 0.94, which are almost equal to the mean of the predictions of the samples of class AML (mean ŷ12 = 0.68 and mean ŷ13 = 0.97) and differ considerably from the mean of the predictions for the samples of its true class
(mean ŷ12 = 0.94 and mean ŷ13 = 0.61). The predictions for the model of MLL vs. AML did not contribute significantly to the classification of the MLL_2 sample, having a value between the predictions of both classes.
[Figure 2: (a) LOOCV predictions ŷ, on an axis from about -0.2 to 1.2, for the PLS models of class MLL vs. class AML, class ALL vs. class AML and class ALL vs. class MLL; (b) three-dimensional plot of the samples in the space of the ŷ values of the OVO PLS models.]
Figure 2. a. Predictions of the LOOCV samples for the OVO PLS models. b. Samples classified according to LDA based on the LOOCV predictions of the OVO PLS models calculated with the 15 genes with the highest SR index. Separate markers denote correctly classified samples of class ALL, correctly classified samples of class MLL, correctly classified samples of class AML (×) and misclassified samples.
[Figure 3: scatter plot of the second discriminant function (vertical axis) against the first discriminant function (horizontal axis, from about -60 to 0).]
Figure 3. Discriminant scores of the LDA classifier calculated for the first two discriminant functions. Separate markers denote samples of class ALL, samples of class MLL, samples of class AML (×) and misclassified samples.
Modelling with the 50 most relevant genes
When the number of genes selected to calculate the PLS models was 50, the classification performance was similar to that obtained with 15 genes, except for one additional sample that was misclassified. The predictions of each class are more clustered around their target values, which should improve the discrimination between the classes. However, the two outliers detected when the classification was performed with 15 genes, MLL_2 and MLL_15, were again outliers. In addition, the sample AML_11 was also flagged as an outlier. This resulted in a LOOCV classification accuracy of 95.8%. In this case, then, increasing the number of genes worsened the classification. This contrasts with previous results where the best accuracies were obtained with 50 genes [18].
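The two LOOCV accuracies reported for this dataset follow directly from the number of misclassified samples out of 72. A short check, with a hypothetical helper name:

```python
def loocv_accuracy(n_samples, n_misclassified):
    """Percentage of samples classified correctly."""
    return 100.0 * (n_samples - n_misclassified) / n_samples


acc_15_genes = round(loocv_accuracy(72, 2), 1)  # MLL_2 and MLL_15 misclassified
acc_50_genes = round(loocv_accuracy(72, 3), 1)  # AML_11 misclassified as well
```

Two errors give 97.2% and three errors give 95.8%, matching the values quoted for the 15-gene and 50-gene models.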
Figures 4a and 4b show the predictions and the LOOCV results for the models calculated with 50 genes. The two misclassified samples of class MLL (MLL_2 and MLL_15) behave as in the models calculated with 15 genes. AML_11 is an AML
sample whose intensities for these 50 selected genes are higher than expected for a sample of class AML. This did not happen when only 15 genes were used. These high intensities influenced the predicted ŷ, which was similar to the predictions of the MLL samples and very different from the predictions of the samples of its true class.
When the number of genes increased to 100, the classification performance was similar to that of the models with 50 genes, and the three samples identified above as outliers were again misclassified.
[Figure 4: (a) LOOCV predictions ŷ, on an axis from about -0.2 to 1.2, for the PLS models of class MLL vs. class AML, class ALL vs. class AML and class ALL vs. class MLL; (b) three-dimensional plot of the samples in the space of the ŷ values of the OVO PLS models.]
Figure 4. a. Predictions of the LOOCV samples for the OVO PLS models calculated with the 50 genes with the highest SR index. b. Classification by LDA from the OVO PLS predictions. Separate markers denote correctly classified samples of class ALL, correctly classified samples of class MLL, correctly classified samples of class AML (×) and misclassified samples.
7.4.2 Small round blue cell tumour dataset
Following the procedure described for the leukemia dataset, OVO PLS models were calculated. By combining the four different classes in pairs, six PLS models were obtained. For each one, the 17 most discriminant genes, obtained using the SR index, were selected. The optimal number of factors for each of the six PLS models was determined based on the minimum RMSECV. The optimal PLS models were used to predict all the training samples, and these predictions were then submitted to the LDA classifier. Figure 5 shows the predictions for the test samples for three of the six PLS models, along with the classification performed by LDA from those predictions. From the OVO PLS predictions, LDA was able to classify all the test samples correctly. Note that in reference [15] a test accuracy of 100% was achieved using 96 genes. With PLS-LDA, the same performance is achieved using only 17 genes, selected independently for each one of the OVO PLS models.
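The number of OVO models follows from pairing the classes: C classes give C(C-1)/2 pairs, that is, three models for the leukemia dataset and six for the SRBCT dataset. A brief sketch, with a hypothetical helper name:

```python
from itertools import combinations


def ovo_pairs(classes):
    """All unordered pairs of classes, one per one-versus-one model."""
    return list(combinations(classes, 2))


srbct_models = ovo_pairs(["EWS", "RMS", "NB", "BL"])
leukemia_models = ovo_pairs(["ALL", "MLL", "AML"])
```

Each pair corresponds to one binary PLS model, and each model contributes one column to the matrix of predictions passed to LDA.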
[Figure 5: three-dimensional plot of the test samples with axes ŷ of model EWS vs. BL, ŷ of model EWS vs. RMS and ŷ of model EWS vs. NB.]
Figure 5. Predictions from three of the six PLS models and the classification performed by LDA. Separate markers denote samples of class EWS (×), samples of class RMS, samples of class NB and samples of class BL. All of the samples are correctly classified.
7.5 Conclusions
LDA applied to the predictions of one-versus-one PLS models allows multi-class classification of microarray gene expression data with good performance. By selecting the most discriminant genes independently for each PLS model, the accuracies are similar to those previously published but are obtained with fewer genes. In addition, the use of only a few genes allows a better subsequent interpretation of the biological meaning of the genes and their relation with a particular illness.
Acknowledgements
The authors thank the Departament d'Universitats, Recerca i Societat de la Informació de Catalunya for providing Cristina Botella's doctoral fellowship, and the Spanish Ministerio de Educación y Ciencia for its support (project CTQ2007-66918/BQU).
References
[1] Ooi, C.H. and P. Tan, Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, 2003. 19: p. 37-44.
[2] Peng, S., et al., Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Letters, 2003. 555: p. 358-362.
[3] Statnikov, A., et al., A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2005. 21: p. 631-643.
[4] Yeang, C.H., et al., Molecular classification of multiple tumour types. Bioinformatics, 2001. 17: p. S316-S322.
[5] Tan, Y., et al., Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Research, 2005. 33: p. 56-65.
[6] Leng, C., Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 2008. 32: p. 417-425.
[7] Tibshirani, R., et al., Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 2002. 99: p. 6567-6572.
[8] Lee, Y., Y. Lin, and G. Wahba, Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association, 2004. 99: p. 67-81.
[9] Liu, J.J., et al., Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics, 2005. 21: p. 2691-2697.
[10] Souza, B.F.d. and A.P.d.L.F.d. Carvalho, Gene selection based on multi-class support vector machines and genetic algorithms. Genetics and Molecular Research, 2005. 4: p. 599-607.
[11] Nguyen, D.V. and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002. 18: p. 1216-1226.
[12] Tan, Y., et al., Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models. Computational Biology and Chemistry, 2004. 28: p. 235-244.
[13] Botella, C., J. Ferré, and R. Boqué, Gene selection in microarray data based on the selectivity ratio index. Submitted, 2010.
[14] Armstrong, S.A., et al., MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 2002. 30: p. 41-47.
[15] Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
[16] Wold, H., Partial least squares, in Encyclopedia of Statistical Sciences, S. Kotz and N.L. Johnson, Editors. 1985, Wiley: New York. p. 581-591.
[17] Kvalheim, O.M. and T.V. Karstang, Interpretation of latent-variable regression models. Chemometrics and Intelligent Laboratory Systems, 1989. 7: p. 39-51.
[18] Yang, T.Y., Efficient multi-class cancer diagnosis algorithm, using a global similarity pattern. Computational Statistics and Data Analysis, 2009. 53: p. 756-765.
CHAPTER 8 Conclusions
1. Probabilistic Discriminant Partial Least Squares (p-DPLS) has been applied to the binary classification of microarray gene expression data.
The probabilistic Discriminant Partial Least Squares (p-DPLS) method has been successfully applied to the classification of microarray gene expression data. In the training step, a PLS model is calculated from the microarray data matrix X and the vector y of 0's and 1's that codifies two classes. Next, the training data are predicted with the PLS model for a selected number of factors and their predictions ŷ are used to estimate two probability density functions (PDFs), one for each modelled class. These PDFs define the range of predictions that characterizes each class. In the prediction step, the prediction ŷ of the sample to be classified and the PDFs are used to calculate the a posteriori probability that the sample belongs to each one of the modelled classes. The sample is then assigned to the class with the highest probability.
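The prediction step described above can be sketched with a simple kernel density estimate and Bayes' theorem. This is a simplified illustration under stated assumptions, not the p-DPLS implementation of this thesis: a single fixed kernel width is used for all samples (the hypothetical parameter `bandwidth`), the a priori probabilities are taken as the class frequencies, and all names and values are invented.

```python
import math


def kernel_pdf(y_hat, train_preds, bandwidth):
    """Average of Gaussian kernels centred on the training predictions of one class."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((y_hat - t) / bandwidth) ** 2) / norm
               for t in train_preds) / len(train_preds)


def posteriors(y_hat, preds_by_class, bandwidth=0.1):
    """A posteriori probability of each class for a prediction y_hat.

    Assumes y_hat is close enough to the training predictions for at
    least one class density to be non-zero.
    """
    n_total = sum(len(p) for p in preds_by_class.values())
    joint = {c: (len(p) / n_total) * kernel_pdf(y_hat, p, bandwidth)
             for c, p in preds_by_class.items()}
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}


# Toy training predictions for the classes coded 0 and 1 (invented values)
training = {0: [-0.05, 0.0, 0.1], 1: [0.9, 1.0, 1.05]}
post = posteriors(0.05, training)
assigned = max(post, key=post.get)
tie = posteriors(0.5, {0: [0.0], 1: [1.0]})
```

A prediction close to 0 receives almost all of its a posteriori probability from class 0, whereas a prediction exactly halfway between two symmetric classes receives 0.5 from each.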
There are several reasons that make p-DPLS suitable for classifying microarray data. Microarray data involve thousands of variables and a much smaller number of samples. Many of these variables are redundant, falsely correlated or irrelevant for distinguishing between classes. The PLS model compresses the large data matrix X into a few latent variables by focussing on the variables in X that are most correlated with the vector of class codes y. Hence, the classifier uses the systematic, relevant data variability, so that the prediction ŷ of a sample and the final classification result are minimally affected by irrelevant genes. In addition, since only a few latent variables are used, a noise filtering effect is achieved.
Another advantage of p-DPLS over other algorithms that perform discriminant PLS lies in the calculation of the PDFs of each class and in how the class label is assigned. The classical discriminant PLS approach decides the class label based only on whether ŷ is higher or lower than an arbitrary threshold (e.g. 0.5). More elaborate procedures assume that the ŷ's of each class are normally distributed, and the mean and standard
deviation of the ŷ's are used to estimate a Gaussian distribution for each class. The threshold is then the ŷ where the PDFs of both classes coincide or (if a priori probabilities are taken into account) where the a posteriori probabilities are the same. Neither of these approaches has been useful for microarray data. First, there is no reason for setting an arbitrary threshold. Second, the number of samples available for analysis is usually limited and often one class has many more samples than the other. As a result, the predictions of the PLS model are usually not clustered around the target values 0 and 1 that codify the classes, but are slightly biased and not normally distributed (see, for example, the predictions in Figure 6 of chapter 4). In p-DPLS the type of distribution of the ŷ's does not need to be assumed, and the PDFs are calculated by combining kernel functions. Hence, the PDFs better describe the distribution of the predictions of each class. In addition, the kernel functions use the uncertainty of the predictions as the smoothing parameter, so that the relative position of the samples in the multivariate space also contributes to the calculated PDFs through the leverage and the fit of the model.
Another advantage of the p-DPLS method used in this thesis is that limits for the range of possible ŷ's of each class can be set, which allows outlier detection (see section 2 below) and the implementation of a reject option that declines to classify a sample when the a posteriori probabilities for both classes are too similar (see section 2 below). The latent variable structure of the PLS model also offers enhanced outlier detection capabilities based on the leverage and the residual variance (see section 4 below).
A final advantage of p-DPLS is that diverse variable selection methodologies, already used in PLS regression, can be used to select the most relevant genes for classification. One of these methodologies has been implemented in p-DPLS, as explained in section 5 below.
2. A reject option was implemented in probabilistic Discriminant Partial Least Squares (p-DPLS).
The classification in p-DPLS is based on Bayes' theorem, so that the sample is assigned to the class with the highest a posteriori probability. The straight application of this rule means that a sample will always be assigned to one of the modelled classes, even when the sample is suspect. One of these situations occurs when the prediction of the new sample is at the extremes of the PDF of one class. Such a sample is so different from the training samples (it is an outlier) that it might be misclassified. The second situation occurs when the PDFs of the two classes partially overlap and the sample has a prediction ŷ in the overlap zone (called the ambiguous region). That sample has characteristics of both classes, so its a posteriori probabilities of belonging to each of the classes are similar and its classification is not reliable enough. While the samples in these two situations should preferably not be classified, the strict application of Bayes' theorem forces their assignment to one of the modelled classes. In this thesis, the possibility of not classifying a sample has been implemented in p-DPLS. This is called the reject option. The reject option in p-DPLS is generally overlooked. However, it avoids classifications with a low reliability, by declining to classify both outliers and ambiguous samples. This increases the confidence of the experimenter that the classification model yields correct results when a class label is issued for a new sample.
In this work, the reject option for ambiguous samples has been implemented in p-DPLS as a reject threshold (following Chow's rule), and the reject option for outliers has been implemented by setting limits to the allowed ŷ values for each class.
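The two kinds of rejection can be sketched as a small decision function. It is an illustration under assumptions: the threshold value, the limits and the choice of checking ŷ against the limits of the most probable class are hypothetical, and the values actually used in this thesis are not reproduced.

```python
def classify_with_reject(posterior, y_hat, limits, reject_threshold=0.9):
    """Return a class label, or a rejection message.

    posterior        : dict of a posteriori probabilities per class
    y_hat            : PLS prediction of the sample
    limits           : dict with the allowed (low, high) range of y_hat per class
    reject_threshold : minimum a posteriori probability to issue a label
                       (Chow's rule; the value 0.9 is an assumption)
    """
    best = max(posterior, key=posterior.get)
    low, high = limits[best]
    if not (low <= y_hat <= high):
        return "rejected: outlier"      # prediction outside the allowed range
    if posterior[best] < reject_threshold:
        return "rejected: ambiguous"    # both classes are too similar in probability
    return best


limits = {0: (-0.3, 0.6), 1: (0.4, 1.3)}  # invented limits
clear = classify_with_reject({0: 0.95, 1: 0.05}, 0.1, limits)
ambiguous = classify_with_reject({0: 0.55, 1: 0.45}, 0.5, limits)
outlier = classify_with_reject({0: 0.01, 1: 0.99}, 1.8, limits)
```

Raising the threshold or narrowing the limits rejects more samples, which reduces misclassifications at the expense of leaving more samples without a label.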
A drawback of the reject option is that some of the rejected samples would have been classified correctly if the reject option had not been implemented. Hence, when the reject threshold and
limits are set, a trade-off between the number of samples incorrectly classified, correctly classified and rejected must be achieved.
In this thesis, p-DPLS with reject option has been successfully applied to classify oligonucleotide and miRNA microarray data by rejecting samples that would have been classified incorrectly. With the reject option, the misclassification rate of the model for the Small Round Blue Cell Cancer dataset was reduced from 100% to 10% for test samples from classes not modelled during the training step, and for the Human Cancers dataset from 3% to less than 1% for the training samples classified by cross-validation.
3. The performance evaluation of classifiers must be reconsidered when a reject option is allowed.
A p-DPLS classifier must be evaluated to assure its quality. Common measures of a classifier's performance are the accuracy and the error rate. These parameters are usually calculated as the number of correct (or erroneous) classifications over the total number of samples classified.
When rejection is not an option, the total number of samples classified is equal to the number of samples that have been submitted to the classifier. In contrast, when rejection is an option, performance values such as the accuracy or the error rate are still useful but must be reinterpreted to be meaningful. They are still calculated as the number of correct (or erroneous) classifications over the total number of samples classified. However, the number of samples for which the classifier has given a class label (classified) may differ from the total number of samples submitted to the classifier (the difference is the number of samples that have been rejected).
The reasoning behind this reinterpretation is that the analyst wants, first of all, the class label issued by the classifier to be correct. Hence, the performance measure should reflect
the percentage of correct labels among the samples for which the classifier assigned a class, which are the ones for which a decision is taken (e.g., 'tumour type 1', 'tumour type 2'). After that, the analyst may accept that the classifier rejects some "difficult" samples (of course, the classifier should classify as many samples as possible and reject as few as possible). In addition, if the accuracy were defined over the total number of samples, classifiers with reject option would never appear to perform better than models without reject option, because the number of samples correctly classified using the reject option would be equal or lower.
The performance measures are also used to decide among several classifiers. For example, in p-DPLS, different classifiers are obtained by selecting a different number of factors in the PLS model. When the reject option is allowed, the error rate alone may not be a sufficient criterion to compare classifiers, since the rejected samples are not included in the count. In that sense, a classifier that rejects most of the samples and correctly classifies the remaining ones will have a high accuracy, although it is clearly not useful for classification.
A better criterion for evaluating the performance of a classifier is the Cost parameter, which takes into account the number of rejected samples. The Cost evaluates the number of correct classifications, misclassifications and rejections of the model, taking into account the individual cost of each of these actions and providing a single value representative of the performance of the classifier or the classification model. The Cost has been used in this thesis to compare the performance of p-DPLS with reject option and to determine the optimal number of factors for the p-DPLS model. The Cost has also been used to evaluate whether removing outliers improves the p-DPLS models (see section 4).
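The difference between accuracy over classified samples and a cost-based measure can be sketched as follows. The cost values (0 for a correct label, 1 for an error, 0.5 for a rejection) and the averaging over all submitted samples are assumptions made for this illustration; the Cost definition used in this thesis is given in the corresponding chapters and is not reproduced here.

```python
def performance_with_reject(outcomes, cost_error=1.0, cost_reject=0.5, cost_correct=0.0):
    """Accuracy over classified samples, reject rate and mean cost per sample.

    outcomes : list with one entry per submitted sample,
               each being "correct", "error" or "reject"
    """
    n_correct = outcomes.count("correct")
    n_error = outcomes.count("error")
    n_reject = outcomes.count("reject")
    n_classified = n_correct + n_error
    accuracy = n_correct / n_classified if n_classified else float("nan")
    reject_rate = n_reject / len(outcomes)
    cost = (cost_correct * n_correct + cost_error * n_error
            + cost_reject * n_reject) / len(outcomes)
    return accuracy, reject_rate, cost


# A reasonable classifier: 17 correct, 1 error, 2 rejected
acc_a, rej_a, cost_a = performance_with_reject(["correct"] * 17 + ["error"] + ["reject"] * 2)
# A useless classifier: rejects almost everything but never errs
acc_b, rej_b, cost_b = performance_with_reject(["correct"] + ["reject"] * 19)
```

The second classifier has a perfect accuracy over the single sample it labels, but a much higher cost, which is the behaviour the Cost criterion is meant to expose.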
4. Outlier detection in p-DPLS has been implemented as a reject option.
Microarray data may contain outliers caused by the many steps involved in obtaining the data. Moreover, samples that belong to classes that have not been modelled may also be submitted to the p-DPLS classifier. Hence, outlier detection is a necessary tool for the practical implementation of p-DPLS. Outliers in p-DPLS were detected in this work by combining leverage, variances and predicted values (ŷ) of the p-DPLS model. This method of outlier detection allows rejecting not only samples with errors in the instrumental data (x) or in the codification (y), or samples with an erroneous x-y relation, but also incoming samples that do not belong to any of the classes in the training set.
In the Small Round Blue Cell Tumour dataset, 90% of the samples of a class not modelled in the training step were detected as outliers using this method. All of these samples would have been misclassified if the reject option had not been used. In the prostate dataset, outlier elimination improved the classification model, decreasing the Cost per classification from 0.11 to 0.06. Outlier elimination also had a beneficial effect on the accuracy of the classification of unknown (test) samples, which increased from 95% to 100% by rejecting a sample that had been wrongly classified.
5. Gene selection was implemented in p-DPLS with reject option
Most of the thousands of gene expressions in microarray datasets are irrelevant for classifying samples. Irrelevant data may degrade the classifier's performance and hinder the understanding of the genes that discriminate the classes. For these reasons, variable selection is required.
In this work the selectivity ratio index has been applied as a gene selection method to select the relevant variables in PLS. This allowed identifying the most relevant genes to discriminate subtypes of prostate cancer and types of non-small cell lung cancer with high accuracy, independently of the training and test sets used.
For the prostate dataset, models with only 17 selected genes had a mean LOOCV accuracy of 94%, compared to the 85% accuracy obtained for the p-DPLS model without gene selection (5966 genes). Equivalently, the mean of the accuracies for the test set improved to 92% from the 84% obtained without gene selection. When the number of selected genes increased from 17 to 35, the accuracy did not improve. Similarly, for the non-small cell lung cancer dataset, the genes used in the classification were reduced from 54675 to 17, achieving a mean LOOCV accuracy of 93%. In this case, increasing the number of selected genes from 17 to 30 did not improve the classification accuracy either.
The most adequate method for proving the validity of a selected subset of genes (and, in turn, the validity of the gene selection algorithm and of the gene selection criterion) has also been studied. Most variable selection methods start by splitting the dataset into a training set and a test set. Such a split influences the calculated accuracy of the classification model and also influences the conclusion about the validity of the selected subset of genes. If the selected genes and the conclusions are based on a single split, overly pessimistic or overoptimistic results can be found. A single unfortunate split can lead to low accuracies (around 88%) and, by contrast, a fortunate split can lead to overoptimistic accuracies (around 100%). For this reason, a repetitive strategy of training set and test set splits, gene selection, p-DPLS model calculation and validation was carried out to measure the performance of the selected genes. The genes selected following this strategy provided models much less influenced by the split of the data.
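The repeated-split strategy can be sketched as follows. The number of repetitions, the test fraction and the function names are hypothetical, and the gene selection and p-DPLS modelling carried out within each split are represented only by a list of invented accuracies.

```python
import random


def repeated_splits(n_samples, n_repeats, test_fraction=0.3, seed=0):
    """Generate (train_indices, test_indices) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        splits.append((sorted(idx[n_test:]), sorted(idx[:n_test])))
    return splits


def mean_accuracy(accuracies):
    """Average accuracy over all splits, less sensitive to a single lucky or unlucky split."""
    return sum(accuracies) / len(accuracies)


splits = repeated_splits(20, 5)
# In a real run, each split would involve gene selection, model calculation and
# validation; the three accuracies below are invented for illustration.
summary = mean_accuracy([0.88, 1.0, 0.94])
```

Reporting the mean over many splits, instead of the value of one split, is what makes the conclusion about the selected genes less dependent on how the data happened to be divided.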
6. Linear Discriminant Analysis has been combined with PLS to solve multi-class classification problems.
Multi-class classifiers are required for microarray data classification since most of the cells or tissues to be classified may belong to more than two classes.
p-DPLS is suitable for analysing microarray data due to advantages such as the use of latent variables and the noise reduction (detailed in section 1), which are important for improving multi-class classification. However, p-DPLS is a binary classifier and can therefore only discriminate between two classes at a time. One usual option is to reduce multi-class classification problems to binary ones, following a one-versus-one or a one-versus-all strategy, but these strategies are not always enough to achieve an adequate multi-class classification. The drawback is that DPLS discriminates between the two modelled classes, but the predicted ŷ values of incoming samples that do not belong to either of these two classes span the whole ŷ domain (e.g., Figure 2a in chapter 7). Hence, these samples are confused with the samples of the modelled classes, assigned to one of them and, so, misclassified.
In this thesis a method that combines PLS and linear discriminant analysis (LDA) has been developed for multi-class classification. The method also involves a selection of the most discriminant genes for each of the PLS models. This strategy reduces the data dimension and performs the multi-class classification with high accuracy using only a few genes. This method has been applied to the leukemia and the small round blue cell tumour datasets. The leukemia data consist of three different types of samples (AML, ALL and MLL) that generally have a poor prognosis, and the small round blue cell tumours include four subtypes (NB, RMS, NHL and EWS), whose accurate diagnosis is essential because the treatment options, responses to therapy and prognoses vary widely depending on it. For both datasets, the accuracies achieved were very high:
97% and 100% classification accuracy, respectively, using 15 genes to classify the leukemia dataset and 17 genes for the small round blue cell tumour dataset.
Appendix
Datasets
Human cancers dataset
The Human Cancers dataset was published by Lu et al. in [1]. The normalized dataset is
available at [2] together with supplementary information [1]. The dataset consists of
282 microRNA (miRNA, non coding RNA species) of 218 samples (46 healthy and 172
tumour) from twenty tissues (ovary, colon, lung, prostate, bladder, breast, follicular
lymphoma, kidney, liver, brain, melanoma, mesothelioma, stomach, uterus, acute
myelogenous leukaemia, diffuse large B-cell lymphoma, B-cell ALL, mycosis fungoides,
mixed lineage leukaemia and T-cell ALL).
The published dataset had been normalized as detailed in the Supplementary_Notes
document:
1. Well-to-well scaling: the reading from each well was scaled such that
   the total of the two post-labeling controls, in that well, became 4500
   (a median value based on a pilot study).
2. Sample scaling: the normalized readings were scaled such that the total
   of the 6 pre-labeling controls in each sample reached 27,000 (a median
   value based on a pilot study).
3. Floor threshold was set at 32.
4. Data were log2 transformed.
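The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Lu et al.: the function names and the toy readings are invented, and the same rescaling helper is assumed to serve both the well-to-well step (target 4500) and the sample step (target 27,000).

```python
import math

def scale_to_control_total(readings, control_readings, target_total):
    """Rescale readings so that the control readings would sum to target_total."""
    control_sum = sum(control_readings)
    if control_sum <= 0:
        raise ValueError("control readings must sum to a positive value")
    factor = target_total / control_sum
    return [r * factor for r in readings]

def floor_and_log2(readings, floor=32.0):
    """Apply the floor threshold (step 3), then the log2 transform (step 4)."""
    return [math.log2(max(r, floor)) for r in readings]

# Toy sample (invented values): three probe readings and six pre-labeling
# controls that sum to 13,500, so the sample scaling factor is 2.0.
probes = [10.0, 64.0, 1024.0]
pre_controls = [2250.0] * 6
scaled = scale_to_control_total(probes, pre_controls, 27000.0)
normalized = floor_and_log2(scaled)
```

Applying the floor before the logarithm keeps very low readings from producing large negative values after the transform.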
The normalized downloadable data file is a tab-delimited text file (miGCM_218.gct), of
218 samples and 217 gene expressions (left after filtering). The first row of the matrix
indicates the tissue ID, and the first and the second columns detail the gene name and
gene description, respectively.
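A file with this layout can be read with the Python standard library alone. The sketch below assumes that the file follows the usual GCT version 1.2 convention (a version line, a line with the number of rows and columns, then a header row starting with Name and Description); the function name and the toy identifiers are hypothetical and do not come from miGCM_218.gct.

```python
import csv
import io

def parse_gct(text):
    """Parse GCT v1.2 text into (sample_ids, gene_names, descriptions, matrix)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if len(rows) < 3 or not rows[0] or not rows[0][0].startswith("#1."):
        raise ValueError("not a GCT file: missing version line or header")
    n_genes, n_samples = int(rows[1][0]), int(rows[1][1])
    sample_ids = rows[2][2:2 + n_samples]   # header row: Name, Description, samples
    data_rows = rows[3:3 + n_genes]
    if len(data_rows) != n_genes:
        raise ValueError("fewer data rows than declared")
    gene_names = [row[0] for row in data_rows]
    descriptions = [row[1] for row in data_rows]
    matrix = [[float(v) for v in row[2:2 + n_samples]] for row in data_rows]
    return sample_ids, gene_names, descriptions, matrix

# Toy file with invented identifiers (2 genes x 3 samples).
toy_gct = "\n".join([
    "#1.2",
    "2\t3",
    "Name\tDescription\tColon_1\tLung_1\tLung_2",
    "mir-A\thypothetical miRNA A\t5.0\t6.5\t7.0",
    "mir-B\thypothetical miRNA B\t8.0\t9.0\t5.0",
])
sample_ids, gene_names, descriptions, matrix = parse_gct(toy_gct)
```

The matrix is returned gene by gene (one inner list per row of the file), so it has to be transposed before being used as a samples-by-genes data matrix.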
In the original work, the dataset was used to demonstrate the feasibility and utility of
monitoring the expression of miRNAs in human cancer tissue. This dataset has been
used in other studies. Lodes et al. [3] used the miRNA as markers for cancer detection
and it has been pointed out that miRNAs may be the future of pharmacogenomics [4].
In this thesis it has been used to evaluate the performance of the probabilistic DPLS with
reject option classifier.
Breast cancer dataset
The Breast Cancer dataset was published by Hedenfalk et al. in [5]. The dataset after
filtering (3226 genes) is available in [6].
The downloadable data are the normalized gene expression ratios of 21 samples
from three different mutation classes (BRCA1, BRCA2 and sporadic). The format
description document, in the same web page, describes the downloadable data. The
downloadable data file is a tab-delimited text file, in which the first row indicates the
Patient ID for each experiment (1 to 21). The second row provides the mutation
classification for each experiment (BRCA1, BRCA2, Sporadic) and the third row provides
the experiment ID (s1996, s1822, etc.). Columns 1 to 3 are related to the gene IDs and
their localization in the plate. Columns 4 to 24 contain gene expression ratios for each
gene in each experiment.
The gene expression ratios are derived from the fluorescent intensity (proportional to
the gene expression level) of a tumor sample (BRCA1, BRCA2, or Sporadic) divided by
the fluorescent intensity of a common reference sample (MCF-10A cell line). The
common reference sample is used in all 21 microarray experiments.
The genes are filtered based on: (a) average fluorescent intensity (level of expression)
greater than 2,500 (gray level) across all 21 samples, (b) average spot area greater than
40 pixels across all 21 samples, and (c) no more than one sample in which the spot area
is zero pixels.
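The three criteria can be expressed as a single predicate evaluated per gene. The following is a minimal sketch, assuming that "average across all samples" means the arithmetic mean over the samples; the function name, argument names and toy values are invented for illustration.

```python
def passes_spot_filter(intensities, spot_areas,
                       min_mean_intensity=2500.0,
                       min_mean_area=40.0,
                       max_zero_area_samples=1):
    """Return True if a gene satisfies criteria (a), (b) and (c)."""
    mean_intensity = sum(intensities) / len(intensities)
    mean_area = sum(spot_areas) / len(spot_areas)
    zero_area_count = sum(1 for area in spot_areas if area == 0)
    return (mean_intensity > min_mean_intensity        # criterion (a)
            and mean_area > min_mean_area              # criterion (b)
            and zero_area_count <= max_zero_area_samples)  # criterion (c)

# Toy genes measured on a few samples only (the real data have 21 samples).
bright_gene = passes_spot_filter([3000, 2800, 2600], [50, 45, 60])
dim_gene = passes_spot_filter([2000, 2400, 2600], [50, 45, 60])
patchy_gene = passes_spot_filter([3000, 3000, 3000, 3000], [0, 0, 100, 100])
```

Here only the first gene is kept: the second fails the intensity criterion and the third has two samples with a spot area of zero pixels.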
The ratios, included in the downloadable data file, for each experiment were normalized
such that the majority of the gene expression ratios from a pre-selected internal control
gene set were around 1.0. No log transformation was done in the downloadable data.
This dataset was previously used to evaluate the performance of classification models
[7, 8], for gene selection methods testing [9, 10], for multiclass classification models
evaluation [6] and to check imputation methods [11], to cite a few.
We have used this dataset to demonstrate the usefulness of p-DPLS with reject option
to refuse to classify samples from classes not modeled in the training step.
Prostate dataset
The prostate cancer dataset was published by Singh et al. in [12] and it is available at
[13]. After filtering, it has 50 non-tumour samples and 52 tumour samples with 12600
gene expressions.
The pre-processing was detailed in the supplementary information document
(SuppInfo_CCv3.pdf). Briefly, the data were scaled to a reference intensity (mean average
difference of all genes present in the microarrays). The genes with average differences
below 10 were filtered. Equivalently, the maximum threshold was set at 16000. After
thresholding, the relative variation of expression for each gene was determined by
dividing the maximum expression (Max) of the gene among all samples by the minimum
expression (Min). The absolute variation in expression was determined by subtracting
the minimum (Min) from the maximum (Max). The genes with (Max/Min) < 5 or
(Max - Min) < 50 were also filtered.
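The thresholding and the variation filter can be illustrated as follows. This sketch rests on one assumption that should be checked against the supplementary information: the limits of 10 and 16000 are treated here as floor and ceiling values to which the data are clipped, because the text speaks of "thresholding". The function names and toy values are hypothetical.

```python
def clip(values, floor, ceiling):
    """Clip each value to the interval [floor, ceiling]."""
    return [min(max(v, floor), ceiling) for v in values]

def passes_variation_filter(values, floor=10.0, ceiling=16000.0,
                            min_fold=5.0, min_diff=50.0):
    """Keep a gene unless Max/Min < min_fold or Max - Min < min_diff."""
    clipped = clip(values, floor, ceiling)   # floor > 0, so Min is never zero
    highest, lowest = max(clipped), min(clipped)
    return highest / lowest >= min_fold and highest - lowest >= min_diff

# Toy expression profiles of three genes across three samples (invented).
variable_gene = passes_variation_filter([5.0, 40.0, 200.0])     # 20-fold, diff 190
flat_gene = passes_variation_filter([100.0, 120.0, 300.0])      # only 3-fold
low_gene = passes_variation_filter([2.0, 8.0, 55.0])            # diff of 45 after clipping
```

Clipping before computing Max/Min matters: without the floor, near-zero or negative average differences would produce arbitrarily large or undefined fold changes.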
The downloadable matrix is a tab-delimited text file that contains expression values in
Affymetrix's scaled average difference units. Rows 1 to 3 contain the identification of
the samples, the scale factor of each microarray (sample) and the number of genes,
respectively. Associated to each average difference expression number there is a P, M,
or A label that indicates whether RNA for the gene is present, marginal, or absent,
respectively (as determined by the GeneChip software), based upon the matched and
mismatched probes for the genes.
This dataset was previously studied in gene selection studies and used to evaluate the
performance of classification methods. To cite a few, Dettling et al. [14] used this
dataset (and others) to demonstrate that when bagging was used as a module in
boosting, the resulting classifier consistently improved the predictive performance;
Díaz-Uriarte et al. in [15] used this dataset to check gene selection and the performance
of a classification using random forest; and Jeffery et al. in [16] used this dataset to
compare different gene selection methods (and the lists of genes generated by each
one) and different classifiers.
In this thesis, this dataset has been used to check the outlier detection and gene skeleton
methods implemented in the p-DPLS classifier.
Small round blue cell tumour dataset
The small round blue cell tumours of childhood dataset was published by Khan et al. in
[17] and it is available at [18]. The pre-processing of the data is detailed in the
Supplemental Methods document.
Initially, the expression levels from 6567 genes were measured for each one of the 88
analyzed samples (of which 63 were labelled as calibration samples and 25 were blind
tests). In the analysis the red intensity (ri) and the relative red intensity (rri) were used.
Genes were omitted if for any of the samples ri was less than 20. This mainly removed
spots for which the image analysis failed. With this cut only 2308 genes were left.
The final downloadable dataset is a tab-delimited text file that contains the natural
logarithm of the relative red intensity (rri) for all the 88 samples and 2308 genes.
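The quality cut and the transformation described above amount to the following sketch. The data structure (a dictionary mapping each gene to its per-sample (ri, rri) pairs), the function name and the toy values are assumptions made for illustration; rri is assumed to be strictly positive.

```python
import math

def filter_and_log(genes, min_ri=20.0):
    """Drop genes with ri < min_ri in any sample; return ln(rri) for the rest.

    genes maps a gene name to a list of (ri, rri) pairs, one pair per sample.
    """
    kept = {}
    for name, spots in genes.items():
        if any(ri < min_ri for ri, _ in spots):
            continue  # at least one unreliable spot: omit the whole gene
        kept[name] = [math.log(rri) for _, rri in spots]
    return kept

# Toy data with two genes and two samples (invented values).
toy_genes = {
    "gene_1": [(100.0, 1.0), (50.0, 2.0)],
    "gene_2": [(10.0, 2.0), (300.0, 0.5)],   # ri = 10 in the first sample
}
log_rri = filter_and_log(toy_genes)
```

Because a single low ri value is enough to discard a gene, the cut is strict, which is consistent with the reduction from 6567 to 2308 genes reported above.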
This dataset was previously used to check gene selection methods [19, 20], to compare
different linear discriminant methods [21] or to evaluate multi-class
classification methods [22].
We have used this dataset to check the ability of the proposed outlier detection method to
detect samples from classes not modeled in the training step of the p-DPLS models.
Furthermore it has been used to demonstrate the ability of PLS combined with
linear discriminant analysis (LDA) to perform multi-class classification.
Non-small cell lung cancer
The non-small cell lung cancer (NSCLC) dataset was published by Kuner et al. in [23]. The
dataset consists of 58 samples of the two major histological subtypes of lung cancer, 40
from adenocarcinoma and 18 from squamous cell carcinoma. For each one, 54675
gene expressions were analysed. The data were normalized by the gcRMA method
published by Wu et al. in [24]. From the initial 60 hybridizations two microarray
hybridizations (PatID 42 and 421) failed the quality criteria due to local hybridization
artefacts and were excluded from further analysis.
The data are available at NCBI GEO database [25] with the dataset identification
GSE10245. Raw data are provided as supplementary files, one for each sample. All
samples are grouped in a matrix in the Series Matrix File. This is a tab-delimited file with
the hybridizations of the 58 samples.
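As an illustration of how the expression table can be extracted from a Series Matrix File, the sketch below relies on the usual GEO layout, in which metadata lines start with "!" and the table is enclosed between the lines !series_matrix_table_begin and !series_matrix_table_end. The function name and the toy probe and sample identifiers are invented; missing values (empty cells) are not handled.

```python
import csv

def parse_series_matrix(text):
    """Return (sample_ids, probe_ids, values) from GEO series matrix text."""
    lines = text.splitlines()
    begin = lines.index("!series_matrix_table_begin")
    end = lines.index("!series_matrix_table_end")
    # csv.reader strips the double quotes that GEO puts around identifiers.
    table = list(csv.reader(lines[begin + 1:end], delimiter="\t"))
    sample_ids = table[0][1:]                 # header row starts with ID_REF
    probe_ids = [row[0] for row in table[1:]]
    values = [[float(v) for v in row[1:]] for row in table[1:]]
    return sample_ids, probe_ids, values

# Toy file with invented identifiers (2 probes x 2 samples).
toy_matrix = "\n".join([
    '!Series_geo_accession\t"GSE10245"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"sample_1"\t"sample_2"',
    '"probe_a"\t7.25\t8.5',
    '"probe_b"\t3.0\t4.75',
    "!series_matrix_table_end",
])
sample_ids, probe_ids, values = parse_series_matrix(toy_matrix)
```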
This dataset was recently published (year 2009) and, as far as we know, it has not been
used yet to check classifiers or gene selection methods. It has only been used as a
reference in biological studies of lung cancer.
We have used the non-small cell lung cancer dataset to verify the usefulness of the gene
selection method proposed and to show the influence that the initial divisions of the
datasets (i.e. the splits of the dataset into a training and a test set) have on the
accuracies of the classification models.
Leukemia dataset
The leukemia dataset was published by Armstrong et al. in [26] and it is available at [27].
This dataset consists of 72 samples of acute lymphoblastic leukemias carrying a
chromosomal translocation that gives rise to three subtypes of samples: 24 samples of
acute lymphoblastic leukemia (ALL), 20 samples of mixed lineage leukemia (MLL) and 28
samples of acute myeloid leukemia (AML). For each sample 12582 gene expressions
were analysed.
The downloadable data file is a tab-delimited text file. The file contains Affymetrix "average
difference" expression values for all samples. The data are already scaled as detailed in
the File info document. Linear scaling is used to reduce technical noise due to global
intensity differences between scans. Linear regression of all "Present" genes (Affymetrix
"P" calls) was used to determine the scaling factor for each scan (the first ALL scan used
as a reference). The scaling factor was applied to expression values (regardless of A/P
call). Scaling factors ranged from 0.93 to 2.1; all scaling factors are shown in the scan id
file.
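One possible way to obtain such a scaling factor is sketched below. The File info document is not reproduced here, so two details are assumptions: the regression is taken to be a least-squares fit through the origin of the reference scan on the scan to be scaled, and only genes called "P" in both scans are used. The function names and toy values are hypothetical.

```python
def scaling_factor(reference, scan, calls_reference, calls_scan):
    """Least-squares slope through the origin of reference versus scan,
    using only genes called "P" (present) in both scans (an assumption)."""
    pairs = [(r, s)
             for r, s, cr, cs in zip(reference, scan, calls_reference, calls_scan)
             if cr == "P" and cs == "P"]
    sum_ss = sum(s * s for _, s in pairs)
    if sum_ss == 0:
        raise ValueError("no usable 'P' genes to estimate the scaling factor")
    return sum(r * s for r, s in pairs) / sum_ss

def apply_scaling(scan, factor):
    """Scale every expression value, regardless of its A/P call."""
    return [v * factor for v in scan]

# Toy scans with four genes (invented values); the last gene is "A" in the reference.
reference = [100.0, 200.0, 300.0, 50.0]
scan = [50.0, 100.0, 150.0, 999.0]
factor = scaling_factor(reference, scan, ["P", "P", "P", "A"], ["P", "P", "P", "P"])
scaled_scan = apply_scaling(scan, factor)
```

In this toy case the scan is half as bright as the reference on the present genes, so the estimated factor is 2.0 and it is then applied to all four values.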
Once the dataset has been obtained, the user must pre-process it according to the authors in
[26] as follows: a floor threshold and a ceiling threshold were set at 100 units and at
16000 units respectively. After this pre-processing, gene expression values were
subjected to the variation filter. The variation filter tests for a fold-change and absolute
variation over samples, by comparing max/min and max-min intensities. The max/min
filter was set at 5 and the max-min at 500 for all experiments.
This dataset had been previously used to compare different gene selection methods [20]
and to check different multi-class classification methods and strategies [20, 28, 29].
We have used this dataset to show the ability of the multi-class classifier proposed in
this thesis by combining PLS and LDA.
‡ˆ‡”‡…‡•
[1]
Lu,J.,etal.,MicroRNAexpressionprofilesclassifyhumancancers.NatureLetters,2005.435:p.834Ͳ
[2]
http://www.broadinstitute.org/cgibin/cancer/publications/pub_paper.cgi?mode=view&paper_id=1
838.
14.
[3]
Lodes,M.J.,etal.,DetectionofCancerwithSerummiRNAsonanOligonucleotideMicroarray.PLOS
One,2009.4:p.e6229.
[4]
Mishra,P.J.andJ.R.Bertino,MicroRNApolymorphisms:thefutureofpharmacogenomics,
molecularepidemiologyandindividualizedmedicine.Pharmacogenomics,2009.10:p.399Ͳ416.
[5]
Hedenfalk,I.,etal.,GeneExpressionprofilesinhereditarybreastcancer.TheNewEnglandJournal
ofMedicine,2001.344:p.539Ͳ548.
[6]http://research.nhgri.nih.gov/microarray/NEJM_Supplement/
[7]
Boulesteix, A.ͲL., PLS dimension reduction for classification with microarray data. Statistical
ApplicationsinGeneticsandMolecularBiology,2004.3:p.article33.
[8]
Raza, M., et al., Comparative Study of Multivariate Classification Methods using Microarray Gene
Expression Data for BRCA1/BRCA2 Cancer Tumors. Proceedings of the Third International
ConferenceonInformationTechnologyandApplications(ICITA'05),IEEE.,2005.2:p.475Ͳ480.
[9]
Pettersson, F. and A. Berglund, Interpretation and validation of PLS models for microarray data.
ChemometricsandChemoinformaticsACSSymposiumseries,2005.894:p.31Ͳ40.
[10]
McLachlan, G.J., R.W. Bean, and L.B.ͲT. Jones, A simple implementation of a normal mixture
approach to differential gene expression in multiclass microarrays. Bioinformatics, 2006. 22: p.
1608Ͳ1615.
[11]
Branden, K.V. and S. Verboven, Robust data imputation. Computational Biology and Chemistry,
2009.33:p.7Ͳ13
[12]
Singh,D.,etal.,Geneexpressioncorrelatesofclinicalprostatecancerbehavior.CancerCell,2002.
1:p.203Ͳ209.
[13]
http://www.broadinstitute.org/cgiͲbin/cancer/datasets.cgi.
[14]
Dettling,M.,BagBoostingfortumourclassificationwithgeneexpressiondata.Bioinformatics,2004.
20:p.3583Ͳ3593.
[15]
DíazͲUriarte, R. and S.A.d. Andrés, Gene selection and classification of microarray data using
randomforest.BMCBioinformatics,2006.7:article3.
[16]
Jeffery,I.B.,D.G.Higgins,andA.C.Culhane,Comparisonandevaluationofmethodsforgenerating
differentiallyexpressedgenelistsfrommicroarraydata.BMCBioinformatics,2006.7:p.359Ͳ375.
[17] Khan, J., et al., Classification and diagnostic prediction of cancers using gene
     expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
[18] http://research.nhgri.nih.gov/microarray/Supplement/.
[19] Zhu, S., et al., Feature Selection for Gene Expression Using Model-Based Entropy.
     IEEE/ACM Transactions on computational biology and bioinformatics, 2010. 7: p. 25-36.
[20] Mohamad, M.S., et al., Three-Stage Method for Selecting Informative Genes for Cancer
     Classification. IEEJ Transactions on Electrical and Electronic Engineering, 2009. 4:
     p. 725-730.
[21] Huang, D., et al., Comparison of linear discriminant analysis methods for the
     classification of cancer based on gene expression data. Journal of Experimental &
     Clinical Cancer Research, 2009. 28: p. 149:156.
[22] Chetty, G. and M. Chetty, Multiclass Microarray Gene Expression Analysis Based on Mutual
     Dependency Models. Pattern Recognition in Bioinformatics, Proceedings. Lecture notes in
     bioinformatics, 2009. 5780: p. 46-55.
[23] Kuner, R., et al., Global gene expression analysis reveals specific patterns of cell
     junctions in non-small cell lung cancer subtypes. Lung Cancer, 2009. 63: p. 32-38.
[24] Wu, Z., et al., A model-based background adjustment for oligonucleotide expression
     arrays. Journal of the American Statistical Association, 2004. 99: p. 909-17.
[25] http://www.ncbi.nlm.nih.gov/geo/.
[26] Armstrong, S.A., et al., MLL translocations specify a distinct gene expression profile
     that distinguishes a unique leukemia. Nature Genetics, 2002. 30: p. 41-47.
[27] http://research.dfci.harvard.edu/korsmeyer/MLL.htm.
[28] Anand, A. and P.N. Suganthan, Multiclass cancer classification by support vector
     machines with class-wise optimized genes and probability estimates. Journal of
     Theoretical Biology, 2009. 259: p. 533-540.
[29] Wang, X. and O. Gotoh, Accurate molecular classification of cancer using simple rules.
     BMC Medical Genomics, 2009. 2: p. 64-87.
Abbreviations
AC        Adenocarcinoma
ALL       Acute lymphoblastic leukemia
AML       Acute myeloid leukemia
BL        Burkitt lymphomas
BRCA1     Breast cancer gene 1
BRCA2     Breast cancer gene 2
CDC       Closest distance to center
cDNA      Complementary deoxyribonucleic acid
CV        Cross validation
Cy3       Cyanine 3
Cy5       Cyanine 5
DA        Discriminant analysis
DNA       Deoxyribonucleic acid
DPLS      Discriminant partial least squares
EWS       Ewing family of tumours
FN        False negative
FP        False positive
GA        Genetic algorithms
HL        High limit
KNN       K nearest neighbours
LDA       Linear discriminant analysis
LL        Low limit
LOOCV     Leave one out cross-validation
LOWESS    Locally weighted scatterplot smoothing
MAplot    Ratio-intensity plot
miRNA     MicroRNA, non coding RNA species
MLL       Mixed lineage leukemia
mRNA      Messenger ribonucleic acid
MVT       Ellipsoidal multivariate trimming
NB        Neuroblastoma
NN        Neural networks
NSCLC     Non-small cell lung cancer
OVA       One versus all
OVO       One versus one
PCA       Principal component analysis
Pcs       Principal components
PDF       Probability density function
p-DPLS    Probabilistic discriminant partial least squares
RMS       Rhabdomyosarcoma
RMSEC     Root mean square of calibration
RMSECV    Root mean square of cross validation
RMSEP     Root mean square of prediction
RN        Reject negative
RNA       Ribonucleic acid
RP        Reject positive
RPMBGA    Random probabilistic model building genetic algorithm
rRNA      Ribosomal ribonucleic acid
SCC       Squamous cell carcinoma
SEP       Standard error of prediction
SOS       Sparse optimal score
SR        Selectivity ratio
SRBCT     Small round blue cell tumour
SVM       Support vector machines
TN        True negative
TNR       True negative rate
TP        True positive
TPCR      Total principal component regression
TPR       True positive rate
tRNA      Transfer ribonucleic acid
VIP       Variable importance on projection
Publications
Cristina Botella, Joan Ferré, Ricard Boqué. Classification from microarray data using
probabilistic discriminant partial least squares with reject option. Talanta, 2009, 80(1):
321-329.
Cristina Botella, Joan Ferré, Ricard Boqué. Outlier detection and ambiguity detection for
microarray data in probabilistic Discriminant Partial Least Squares Regression. Journal of
Chemometrics, 2010, Accepted.
Cristina Botella, Joan Ferré, Ricard Boqué. Gene selection in microarray data based on
selectivity ratio. 2010, Submitted.
Cristina Botella, Joan Ferré, Ricard Boqué. Multi-class classification of microarray gene
expression data. 2010, Submitted.
Communications
Cristina Botella, Joan Ferré and Ricard Boqué
A new criterion for selecting the optimal number of factors in Discriminant-Partial Least
Squares (DPLS). Application to microarray gene expression data.
VI Colloquium Chemiometricum Mediterraneum, Saint-Maximin, France. 2007
Poster communication
Cristina Botella, Joan Ferré and Ricard Boqué
A new performance criterion for classification methods for microarray gene expression
data.
CAMDA (Critical Assessment of Microarray Data Analysis), Valencia, Spain. 2007
Poster communication
Cristina Botella, Joan Ferré and Ricard Boqué
Classification of tumour cells from gene expression data using Probabilistic DPLS with
reject option.
III Workshop de Quimiometria, Burgos, Spain. 2008
Oral communication
Cristina Botella, Joan Ferré and Ricard Boqué
Reject option implementing outlier detection and ambiguity detection in the
classification of microarray gene expression data.
11th Scandinavian Symposium on Chemometrics, Loen, Norway. 2009
Poster communication
Cristina Botella, Joan Ferré and Ricard Boqué
Gene selection in microarray data based on selectivity ratio.
VII Colloquium Chemiometricum Mediterraneum, Granada, Spain. 2010
Poster communication