Comments
Transcript
STAT 503X Case Study 2: Italian Olive Oils
STAT 503X Case Study 2: Italian Olive Oils 1 Description This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. (An analysis of this data is given in (Forina, Armanino, Lanteri & Tiscornia 1983)). There are 9 collection areas, 4 from southern Italy (North and South Apulia, Calabria, Sicily), two from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria). The data available are: Region Area Palmitic Acid Palmitoleic Acid Stearic Acid Oleic Acid Linoleic Acid Linolenic Acid Arachidic Acid Eicosenoic Acid South, North or Sardinia Sub-regions within the larger regions (North and South Apulia, Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria Percentage ×100 in sample Percentage ×100 in sample Percentage ×100 in sample Percentage ×100 in sample Percentage ×100 in sample Percentage ×100 in sample Percentage ×100 in sample Percentage ×100 in sample The primary question is “How do we distinguish the oils from different regions and areas in Italy based on their combinations of the fatty acids?” 1 2 2 Suggested Approaches Approach Data Restructuring Summary Statistics Visual Inspection The aim of visual methods is to understand separations, help decide the best classification method, and hence interpret solutions Numerical Analysis The aim of numerical solutions is to get the best predictive results. Reason This is very clean data so I don’t see any need to restructure. To get at location and scale information for each variable, and by groups Univariate Plots, Bivariate Plots, Touring Plots Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Classification and Regression Trees (CART), Forests, Feed-Forward Neural Networks and Support Vector Machines 3 Type of questions addressed What is the average percent composition of eicosenoic acid overall? Is there a difference in the average percentage of eicosenoic acid for olives from different growing regions? Are there differences in the fatty acid composition between the olives from different growing regions? “Can define the percentage composition of fatty acids that distinguishes olives from different growing regions?” 3 Actual Approaches 3.1 Summary Statistics The following tables contain information on the minimum, maximum, median, mean and standard deviation for the fatty acids, in total and broken down by different growing region. All the 572 observations (values reported as percentages): Min. Median Max. Mean Std. Dev. palmitic 6.10 12.01 17.53 12.32 1.69 palmitoleic 0.15 1.10 2.80 1.26 0.53 stearic 1.52 2.23 3.75 2.29 0.37 oleic 63.00 73.02 84.10 73.12 4.06 linoleic 4.48 10.30 14.70 9.81 2.43 linolenic 0.00 0.33 0.74 0.32 0.13 arachidic 0.0 0.61 1.05 0.58 0.22 eicosenoic 0.01 0.17 0.58 0.16 0.14 From the summary statistics we notice that olive oils are mostly constituted of oleic acid. Palmitic accounts for about 12% on average and the rest less than 10% on average. South Sardinia North Nth Apul Calabria Sth Apul Sicily Inla Sard Coas Sard East Lig West Lig Umbria n 323 98 151 25 56 206 36 65 33 50 50 51 palmitic 13.32 11.11 10.95 10.27 13.02 13.96 12.3 10.98 11.38 11.45 10.3 10.86 palmitoleic 1.55 0.97 0.84 0.62 1.21 1.84 1.05 0.95 1.01 0.84 1.08 0.60 stearic 2.29 2.26 2.31 2.35 2.63 2.11 2.74 2.17 2.44 2.41 2.57 1.94 oleic 71.00 72.68 77.93 78.20 73.07 69.11 73.58 73.61 70.86 77.46 76.74 79.56 linoleic 10.33 11.97 7.27 7.06 8.19 11.66 8.35 11.25 13.37 6.89 8.97 5.97 linolenic 0.38 0.27 0.22 0.43 0.46 0.35 0.42 0.29 0.24 0.26 0.05 0.34 arachidic 0.63 0.73 0.38 0.72 0.64 0.60 0.76 0.74 0.72 0.64 0.07 0.42 eicosenoic 0.27 0.02 0.02 0.35 0.28 0.24 0.38 0.02 0.02 0.02 0.02 0.02 Southern oils have much higher eicosenoic acid on average eicosenoic and slightly higher palmitic and palmitoleic acid content. The north and sardinian oils have some difference in the average oleic, linoleic and arachidic acids. Among the southern oils there is some difference in most of the averages. Northern oils have some difference in most of the averages. 4 3.2 Visual Inspection The objective here is to find differences in the measured variables amongst the classes. Differences here might be actual separations between classes on a variable or linear combination of variables. The approach is relatively simple: 1. Use color and or symbol to code the categorical class information into plot. 2. Begin with low-dimensional plots (histogram, density plot, dot plot, scatterplot) of the measured variables and work up to high-dimensional plots (parallel coordinate plot, tours), exploring class structure in relation to data space. 3.2.1 Regions - Univariate Plots Using 1D Plot mode sequentially work through the variables, either by manually sepecting variables or cycling through automatically, to examine separations between regions. Its possible to neatly separate the oils from southern Italy from the other two regions using just one variable, eicosenoic acid. Figure 1 displays a textured dotplot and an ASH plot of this variable. The oils from southern Italy are removed, and we concentrate on separating the oils from northern Italy and Sardinia. Although a clear separation between these two regions cannot be found using one variable, two of the variables, oleic and linoleic acid, appear to be important for the separation (Figure 1). 3.2.2 Regions - Bivariate Plots Starting from the two variables identified by the univariate plots as important for separating northern Italian oils from Sardinian oils, the remaining variables are explored in relation two these two using scatterplots. Oleic acid and linoleic acid show some, but not cleanly, separated regions. Arachidic acid and linoleic acid display a clear separation between the regions, but it is a very non-linear boundary. 3.2.3 Regions - Multivariate Plots Starting from the three variables found from bivariate plots to be important for separating northern Italian and Sardinian oils we use a higher-dimensional technique to explore them. Using either Rotation, Tour1D or Tour2D examine the separation between the two regions in the 3-dimensional space. Figure 2 shows the results of using Tour1D on the three variables. The two regions can be separated cleanly by a linear combination of linoleic and arachidic acid, roughly corresponding to 0.957 × linoleic + 0.289 × arachidic. 3.2.4 Areas - Northern Italy There are three areas in the region, Umbria, East and West Liguria. From univariate plots there are no clear separations between areas, although several variables, for example, linoleic acid (Figure 3, show some differences. In bivariate plots, two variables, stearic and linoleic show some differences between the areas. Examining combinations of variables in Tour1D shows that West Liguria is almost separable from the other two areas using palmitoleic, stearic, linoleic and arachidic acids. Umbria and East Liguria are separable, except for one sample, in a combination of most of the variables. To get this result, the projection pursuit controls were used followed by manual manipulation to assess the importance of each variable in the separation of the areas. 5 Figure 1: Separation between the 3 regions of Italian olive oil data in univariate plots. 6 Figure 2: Separation between the northern Italian and Sardinian oils in bivariate plots and multivariate plots. 3.2.5 Areas - Sardinia Separating the oils from the coastal and inland areas of Sardinia can be simply done by using two variables, oleic and linoleic acid. 3.2.6 Areas - Southern Italy There are four areas in the south Italy region, North and South Apulia, Calabria, Sicily. From univariate plots and bivariate plots there are no clear separations between all four areas. Simplify the work by working with fewer groups, take two of the areas at a time, or three at a time, and plot the data in univariate and bivariate plots. The best results are obtained by removing Sicily. If the oils from this area are removed then the remaining 3 areas are almost separable using the variables palmitic, palmitoleic, stearic and oleic acids. Figure 4 shows plots taken from the Tour2D of these 4 variables revealing the separations between the areas, North and South Apulia, and Calabria. However, the fatty acid content of oils from Sicily is confounded with those of the other areas. 3.2.7 Taking stock What we have learned from this data is that the olive oils from different geographic regions have dramatically different fatty acid content. The three larger geographic regions, North, South, Sardinia, are well-separated based on eicosenoic, linoleic and arachidic acids. The oils from areas in northern Italy are mostly separable from each other using all the variables. The oils from the inland and coastal areas of Sardinia have different amounts of oleic and linoleic acids. The oils from three of the areas in southern Italy are almost separable. And one is left with the curious content of the oils from Sicily. Why are these oils indistinguishable from the oils of all the other areas in the south? Is there a problem with the quality of these samples? 7 Figure 3: Separation in the oils from areas of northern Italy, univariate, bivariate and multivariate plots. 8 Figure 4: The areas of southern Italy are mostly separable, except for Sicily. 3.3 Numerical Analysis The data is broken into training and test samples for this part of the analysis. Approximately 25% of cases within each group are reserved for the test samples. These are the cases used for the test sample, where the data has been sorted by region and area: 1 76 151 218 305 375 460 530 7 80 156 219 310 376 462 532 12 95 165 225 313 386 466 541 15 101 175 227 314 392 468 543 16 102 177 241 323 405 470 545 22 105 182 242 330 406 474 546 27 106 183 246 333 415 476 551 32 110 185 257 338 416 480 559 34 116 186 259 341 418 481 567 35 118 187 263 342 420 482 570 36 119 190 266 347 421 487 41 122 192 274 351 423 492 50 134 194 280 352 426 493 54 137 201 284 356 428 500 61 140 202 289 358 435 501 68 147 211 291 359 440 509 70 148 213 292 369 451 519 75 150 217 297 374 458 522 Region Area # Train # Test South N. Apul. 19 6 South Calabria 42 14 South S. Apul 158 48 South Sicily 27 9 Sard Inland 49 16 Sard Coast 25 8 North E. Lig. 38 12 North W. Lig. 38 12 North Umbria 40 11 We will use what we learned from the visual analysis to guide the numerical analysis. Some methods restrict classification to just two groups, so we will use this restriction with all the classification methods. We will also drop the Sicilian oils from the study - because there is some question raised about the validity of these oils. Here is the binary classification steps we will follow: 9 3.3.1 Methods Used Classification trees generate a classification tree by sequentially doing binary splits on variables. Splits are decided on according to the variable and split value which produces the lower measure of impurity. There are several common measures of impurity, Gini and deviance. For example, for a two class problem with 400 observations in each class, denoted as (400,400), variable 1 produces a split of (300,100) and (100,300), and variable 2 produces a split (200,400) and (200,0). The split on variable 2 will produce a lower impurity value because one branch is more pure. There are numerous implementations: R packages such as rpart and standalone packages such as C4.5, C5.0. Classical Linear discriminant analysis (LDA) assumes that the populations come from multivariate normal distributions with different mean but equal variance-covariance matrices. Linear boundaries result, and it is possible to reduce the dimension of the data into the space of maximum separation using canonical coordinates. The MASS package in R has functions for LDA. Quadratic discriminant analysis (QDA) assumes that the populations come from multivariate normal distributions with different mean and different variance-covariance. Non-linear boundaries are the result. The MASS package in R has functions for LDA. Feed-forward neural networks provide a flexible way to generalize linear regression functions. A simple network model as produced by nnet code in R (Venables & Ripley 1994) may be represented by the equation: y = φ(α + s X wh φ(αh + p X wih xi )) i=1 h=1 where x is the vector of explanatory variable values, y is the target value, p is the number of variables, s is 10 the number of nodes in the single hidden layer and φ is a fixed function, usually a linear or logistic function. This model has a single hidden layer, and univariate output values. SVM have recently gained widespread attention due to their success at prediction in classification problems. The subject started in the late seventies (Vapnik 1979). The definitive reference is Vapnik (1999), and Burges (1998) gives a simpler tutorial into the subject. The main difference between this and the previously described classification techniques is that SVMs really only work for separating between 2 groups. There are some modifications for multiple groups but these are little more than one might do manually by working pairwise through the groups. SVM algorithms work by finding a subset of points in the two groups that can be separated, which are then known as the supportPvectors. These support vectors can be used to define NS a separating hyperplane, w.x + b = 0, where w = i=1 αi yi xi , NS is the number of support vectors, and PNS α comes from the constraint i=1 αi yi = 0. Actually we search for the support vectors which gives the hyperplane which gives the biggest separation between the two classes. Note that it is posssible to incorporate non-linear kernel functions allows for defining non-linear separations, and also that modifications allow SVMs to perform well when the classes are not separable. We use the software SVM Light: binary and documentation at http://svmlight.joachims.org/ 3.3.2 The Classification Separating Regions In separating the southern oils from the other two regions, and then northern oils from Sardinian oils, here is the classification tree solution: If eicosenoic acid $>$ 7 then the region is South (1) Else If linoleic $>$ 1053.5 then the region is Sardinia (2) Else the region is North (3) There are no missclassifications on the first split but the second split is problematic. It is a perfect separation for this data but there is no gap between the two groups. This is easily seen from the plot of the two variables used, Figure 5. So at this point we will shift to using a different method to build a classification rule for northen oils from Sardinian oils. Here is the R code: # South vs others d.olive.train<-d.olive[indx.tr,-c(1,2)] d.olive.test<-d.olive[indx.tst,-c(1,2)] c1<-rep(-1,572) c1[d.olive[,1]!=1]<-1 c1.train<-c1[indx.tr] c1.test<-c1[indx.tst] # North vs Sardinia d.olive.train<-d.olive[indx.tr,] c1.train<-c(rep(-1,572))[indx.tr] c1.train[d.olive.train[,1]==3]<-1 c1.train<-c1.train[d.olive.train[,1]!=1] d.olive.train<-d.olive.train[d.olive.train[,1]!=1,-c(1,2)] d.olive.test<-d.olive[indx.tst,] c1.test<-c(rep(-1,572))[indx.tst] 11 Figure 5: (Left)Plot illustrating the results of tree classifier. (Right) Second boundary from LDA solution, dashed boundary is the tree solution on the LDA predicted values. c1.test[d.olive.test[,1]==3]<-1 c1.test<-c1.test[d.olive.test[,1]!=1] d.olive.test<-d.olive.test[d.olive.test[,1]!=1,-c(1,2)] # Trees # Load the rpart library olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class") table(c1.train,predict(olive.rp,type="class")) table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class")) Here is the output from R: # South vs others n= 436 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 436 190 -1 (0.5642202 0.4357798) 2) eicosenoic>=7 246 0 -1 (1.0000000 0.0000000) * 3) eicosenoic< 7 190 0 1 (0.0000000 1.0000000) * c1.train -1 1 -1 246 0 1 0 190 c1.test -1 1 12 -1 77 0 1 0 59 # Sardinia vs North n= 190 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 190 74 1 (0.3894737 0.6105263) 2) linoleic>=1053.5 74 0 -1 (1.0000000 0.0000000) * 3) linoleic< 1053.5 116 0 1 (0.0000000 1.0000000) * c1.train -1 1 -1 74 0 1 0 116 c1.test -1 1 -1 24 0 1 0 35 LDA results in errors in the training sample (error rate = 1/190 = 0.005) and no errors in the test sample. The predicted values are well-separated (see figure) but the boundary is in the wrong place! (See Figure 5, right plot.) palm -0.13 palm’oleic -0.27 stear 0.10 oleic 0.91 lino -0.92 lino’nic -0.24 arach -0.64 eico 0.03 Table 1: Correlations between predicted values and the predictors. The variables oleic, linoleic and arachidic are highly correlated with the predicted values (Table 1). The predicted values show a big gap between the northern oils and the sardinian oils but the boundary is set too close to the sardinian group because LDA assumes equal variances. olive.lda1<-lda(d.olive.train,c1.train) olive.lda1 table(c1.train,predict(olive.lda1,d.olive.train)$class) table(c1.test,predict(olive.lda1,d.olive.test)$class) # plot the predictions of the current sample olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x olive.lda.proj<-predict(olive.lda1,d.olive[,-c(1,2)],dimen=1)$x par(pty="s") plot(d.olive[,10],olive.lda.proj,type="n", xlab="eicosenoic",ylab="LDA Discrim 1") 13 points(d.olive[d.olive[,1]==1,10],olive.lda.proj[d.olive[,1]==1],pch="x",col=2) points(d.olive[d.olive[,1]==2,10],olive.lda.proj[d.olive[,1]==2],pch="s",col=3) points(d.olive[d.olive[,1]==3,10],olive.lda.proj[d.olive[,1]==3],pch=16,col=6) abline(v=7) lines(c(0,7),c(-0.49,-0.49)) text(50,7,"South") text(-2,8,"Sardinia",pos=4) text(-2,-6,"North",pos=4) # Predicted values are generated by # x’sigma^(-1)(mean1-mean2)-((mean1+mean2)/2)’sigma^(-1)(mean1-mean2) # So these lines should recreate the predicted values. xmn<-apply(d.olive.train,2,mean) x<-as.matrix(d.olive.train-matrix(rep(xmn,190),nrow=190,byrow=T))%*% as.matrix(olive.lda1$scaling) # compute correlations between variables and the predicted values cor(olive.x,d.olive.train[,1]) cor(olive.x,d.olive.train[,2]) cor(olive.x,d.olive.train[,3]) cor(olive.x,d.olive.train[,4]) cor(olive.x,d.olive.train[,5]) cor(olive.x,d.olive.train[,6]) cor(olive.x,d.olive.train[,7]) cor(olive.x,d.olive.train[,8]) These are the results: Prior probabilities of groups: -1 1 0.3894737 0.6105263 Group means: palmitic palmitoleic stearic oleic linoleic linolenic arachidic -1 1111.514 95.82432 224.5811 7270.824 1193.757 27.90541 73.85135 1 1095.578 83.52586 231.3879 7790.284 729.250 21.53448 36.79310 eicosenoic -1 1.972973 1 2.017241 Coefficients of linear discriminants: LD1 palmitic 0.0003340794 palmitoleic 0.0152893597 stearic 0.0142798870 oleic 0.0010155503 linoleic -0.0097371901 linolenic -0.0104774449 arachidic -0.0236128945 14 eicosenoic -0.1193127354 c1.train -1 1 -1 74 0 1 1 115 c1.test -1 1 -1 24 0 1 0 35 Quadratic discriminant analysis provides a perfect classification but its more difficult to determine the location of the boundary between the two groups. # Quadratic discriminant analysis olive.qda1<-qda(d.olive.train,c1.train) olive.qda1 table(c1.train,predict(olive.qda1,d.olive.train)$class) table(c1.test,predict(olive.qda1,d.olive.test)$class) Trees calculated on the predicted values provide a perfect classification, although the boundary seems a little too close to the northern oils (Figure 5). # Check if trees will give a better boundary rpart(c1.train~olive.x,method="class") par(lty=2) lines(c(0,7),c(-1.49,-1.49)) par(lty=1) This gives the classification of the regions as: If eicosenoic acid > 7 then the region is South (1) Else If 0.0003×palmitic +0.0153×palmitoleic +0.01423×stearic +0.0010×oleic −0.0097×linoleic 0.0105×linolenic −0.0236×arachidic −0.1193×eicosenoic - 2.12 > -1.49 then the region is Sardinia (2) Else the region is North (3) There is no error associated with this rule. Sardinian Oils Split the samples into training and test sets. # Sardinian oils d.olive.train<-d.olive[indx.tr,] c1.train<-c(rep(-1,572))[indx.tr] c1.train[d.olive.train[,2]==6]<-1 c1.train<-c1.train[d.olive.train[,1]==2] d.olive.train<-d.olive.train[d.olive.train[,1]==2,-c(1,2)] d.olive.test<-d.olive[indx.tst,] c1.test<-c(rep(-1,572))[indx.tst] 15 c1.test[d.olive.test[,2]==6]<-1 c1.test<-c1.test[d.olive.test[,1]==2] d.olive.test<-d.olive.test[d.olive.test[,1]==2,-c(1,2)] Figure 6: Separation of the inland and coastal Sardinian oils using trees (horizontal axis) and LDA (vertical axis). A classification tree produces a perfect classification of both training and test samples but the separation between groups is rather small relative to the split for a slightly more complex solution provided by LDA (Figure 6). Here is the code: # Trees olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class") olive.rp table(c1.train,predict(olive.rp,type="class")) table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class")) # Linear discriminant analysis olive.lda1<-lda(d.olive.train,c1.train) olive.lda1 table(c1.train,predict(olive.lda1,d.olive.train)$class) table(c1.test,predict(olive.lda1,d.olive.test)$class) # plot the predictions of the current sample olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x par(mfrow=c(1,1),pty="s") 16 plot(c(d.olive.train[,5],d.olive.test[,5]),c(olive.x,olive.x2),type="n", xlab="Tree",ylab="LDA Discrim") points(c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]), c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),pch=16,col=2) points(c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]), c(olive.x[c1.train==1],olive.x2[c1.test==1]),pch="x",col=4) abline(v=1246.5) abline(h=0.73) text(1150,-4,"Inland",pos=4) text(1260,7,"Coastal",pos=4) The solution for Sardinian oils then is: If 0.0097×palmitic +0.0018×palmitoleic +0.0276×stearic +0.00075×oleic +0.0271×linoleic +0.0750×linolenic −0.0260×arachidic −0.0400×eicosenoic - 55.04 > 0.73 then the sample is from inland Sardinia. palm 0.52 palm’oleic 0.28 stear 0.73 oleic -0.96 lino 0.98 lino’nic -0.47 arach -0.17 eico -0.01 Table 2: Correlations between variables and predicted values for Sardinian oils. Based on correlations with the predicted values (Table 2) the important variables for separating inland and coastal Sardinia are oleic, linoleic and stearic acids, with palmitic and linolenic being somewhat important. Northern Oils Separating these northern oils is more difficult, based on our visual analysis. East Liguria vs West Liguria,Umbria The classification difficulty is reflected in the results of tree classifiers: the error in the training set is 8/116 = 0.07 and the error in the test set is 8/35 = 0.23. But is is a simple solution using just two variables, palmitic and arachidic. (See Figure 7 left plot.) The LDA solution is a littler better, with 6/116 = 0.05 errors in the training set and 3/35 = 0.09 in the test set (Figure 7). # East Liguria vs others # Trees olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class") olive.rp table(c1.train,predict(olive.rp,type="class")) table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class")) plot(c(d.olive.train[,7],d.olive.test[,7]), c(d.olive.train[,1],d.olive.test[,1]),type="n", xlab="arachidic",ylab="palmitic") points(c(d.olive.train[c1.train==-1,7],d.olive.test[c1.test==-1,7]), c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),pch=16,col=2) points(c(d.olive.train[c1.train==1,7],d.olive.test[c1.test==1,7]), c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),pch="x",col=4) 17 abline(v=59.5) lines(c(0,59.5),c(1165,1165)) text(80,1400,"East Liguria",pos=4) text(0,800,"West Liguria, Umbria",pos=4) # Linear discriminant analysis olive.lda1<-lda(d.olive.train,c1.train) olive.lda1 table(c1.train,predict(olive.lda1,d.olive.train)$class) table(c1.test,predict(olive.lda1,d.olive.test)$class) olive.x<-predict(olive.lda1,d.olive.train)$x olive.x2<-predict(olive.lda1,d.olive.test)$x xmn<-apply(d.olive.train,2,mean) x<-as.matrix(d.olive.train-matrix(rep(xmn,79),nrow=79,byrow=T))%*% as.matrix(olive.lda1$scaling) xmn%*%as.matrix(olive.lda1$scaling) plot(c(olive.x,olive.x2),c(c1.train,c1.test),type="n", xlab="LDA Discrim",ylab="Area") points(c(olive.x[c1.train==-1],olive.x2[c1.test==-1]), c(c1.train[c1.train==-1],c1.test[c1.test==-1]),pch=16,col=2) points(c(olive.x[c1.train==1],olive.x2[c1.test==1]), c(c1.train[c1.train==1],c1.test[c1.test==1]),pch="x",col=4) abline(v=0.76) text(2,0.5,"East Liguria",pos=4) text(-4,-0.5,"West Liguria, Umbria",pos=4) # compute correlations between variables and the predicted values cor(olive.x,d.olive.train[,1]) cor(olive.x,d.olive.train[,2]) cor(olive.x,d.olive.train[,3]) cor(olive.x,d.olive.train[,4]) cor(olive.x,d.olive.train[,5]) cor(olive.x,d.olive.train[,6]) cor(olive.x,d.olive.train[,7]) cor(olive.x,d.olive.train[,8]) Turning to feed-forward neural networks. After several runs a good result the net converges, to a solution that has no error in the training set but two errors in the test set. The solution is as follows: Area = 1.01-2.01×(-0.22× palmitic-0.01× palmitoleic-0.13× stearic +0.05× oleic -0.21× linoleic +0.61× linolenic-0.05× arachidic +0.03× eicosenoic) -0.01×(0.07× palmitic+0.03× palmitoleic+0.06× stearic -0.02× oleic +0.05× linoleic -0.05× linolenic+0.04× arachidic) -2.00×(-0.10× palmitic+0.10× palmitoleic+0.11× stearic +0.07× oleic -0.01× linoleic +0.20× linolenic-0.26× arachidic -0.01× eicosenoic) The code is as follows: 18 Figure 7: (Left) Tree solution for East Liguria vs West Liguria,Umbria, (Middle) LDA solution for East Liguria vs West Liguria,Umbria, (Right) LDA solution for West Liguria vs Umbria. olive.nn<-nnet(c1.train~.,d.olive.train,size=3,linout=T,decay=5e-4, range=0.6,maxit=1000) # weights: 31 initial value 101.133439 iter 10 value 100.792293 iter 20 value 88.353498 ... iter 560 value 0.004900 final value 0.004900 converged > table(c1.train,round(predict(olive.nn,d.olive.train))) c1.train -1 1 -1 79 0 1 0 37 > table(c1.test,round(predict(olive.nn,d.olive.test))) c1.test -3 -1 0 1 -1 0 21 1 0 1 1 0 0 12 Support vector machines (SVM) produce 2 errors in the training set and 6 errors in the test set. The Windows version of SVM light is used. To run SVM the data needs to be outputted to file in a special format. The svm learn and svm classify executables are run in a command window. The parameters needs to be adjusted some to get the best results. Linear kernels are used. The neural network solution is the best but not substantially different from the LDA solution in test set error. So we will use the LDA solution which is a little simpler. West Liguria vs Umbria Trees result in no errors in the training set and 2/22 = 0.09 errors in the test set. LDA results in no errors in either training or test sets (Figure 7). 19 # West Liguria vs Umbria d.olive.train2<-d.olive[indx.tr,] c2.train<-c(rep(-1,572))[indx.tr] c2.train[d.olive.train2[,2]==8]<-1 c2.train<-c2.train[d.olive.train2[,1]==3&d.olive.train2[,2]!=7] d.olive.train2<-d.olive.train2[d.olive.train2[,1]==3&d.olive.train2[,2]!=7, -c(1,2)] d.olive.test2<-d.olive[indx.tst,] c2.test<-c(rep(-1,572))[indx.tst] c2.test[d.olive.test2[,2]==8]<-1 c2.test<-c2.test[d.olive.test2[,1]==3&d.olive.test2[,2]!=7] d.olive.test2<-d.olive.test2[d.olive.test2[,1]==3&d.olive.test2[,2]!=7,-c(1,2)] The solution for northern oils then is: If 0.0285×palmitic +0.0009×palmitoleic +0.0317×stearic +0.0187×oleic + 0.0180×linoleic −0.0264×linolenic +0.0768×arachidic +0.0795×eicosenoic - 200.0 > 0.76 then the sample is from East Liguria Else if 0.0124×palmitic +0.0343×palmitoleic +0.0261×stearic +0.0122×oleic + 0.0172×linoleic −0.0627×linolenic −0.0342×arachidic −0.2244×eicosenoic - 128.0 > -0.03 then the sample is from West Liguria Else it is from Umbria. E.Lig vs Oth W. Lig vs Umb palm 0.67 -0.34 palm’oleic 0.03 0.80 stear 0.21 0.76 oleic -0.26 -0.88 lino -0.26 0.94 lino’nic 0.25 -0.93 arach 0.76 -0.85 eico -0.10 0.05 Table 3: Correlations between variables and predicted values for northern oils. Based on correlations with the predicted values (Table 3) the important variables for separating East Liguria are palmitic and arachidic, and for separating West Liguria from Umbria the important variables are palmitoleic, stearic, oleic, linoleic, linolenic, arachidic. Southern Oils The areas of the southern region promised to be the most difficult to classify. We ignore Sicily to begin. Calabria vs Others With trees there are 4/219 = 0.018 errors in the training set, and 2/68 = 0.029 errors in the test set. The plots in Figure 8 illustrate the splits. This is about the best solution of all the methods, surprisingly! There are too many errors in the training set, but the most important error to guard against is the test error, and trees give the smallest test error. LDA results in 4/219 = 0.18 errors in the training set and 5/68 = 074 errors in the test set. QDA results in only 1/219 = 0.004 in the training set but 3/68 = 0.044 in the test set. The best results for FFNN were 0 errors in the training set but 3/68 − 0.044 errors in the test set. SVM had 1 error in the training set and 4 errors in the test set. This is the code: 20 Figure 8: The boundaries resulting from a tree classifier on Calabria vs other southern areas. # Southern oils - without Sicily # Calabria vs Others d.olive.train<-d.olive[indx.tr,] c1.train<-c(rep(-1,572))[indx.tr] c1.train[d.olive.train[,2]==2]<-1 c1.train<-c1.train[d.olive.train[,1]==1&d.olive.train[,2]!=4] d.olive.train<-d.olive.train[d.olive.train[,1]==1&d.olive.train[,2]!=4,-c(1,2)] d.olive.test<-d.olive[indx.tst,] c1.test<-c(rep(-1,572))[indx.tst] c1.test[d.olive.test[,2]==2]<-1 c1.test<-c1.test[d.olive.test[,1]==1&d.olive.test[,2]!=4] d.olive.test<-d.olive.test[d.olive.test[,1]==1&d.olive.test[,2]!=4,-c(1,2)] # Calabria vs Others # Trees olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class") table(c1.train,predict(olive.rp,type="class")) table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class")) par(pty="s",mfrow=c(1,2)) plot(c(d.olive.train[,1],d.olive.test[,1]), c(d.olive.train[,5],d.olive.test[,5]), type="n",xlab="palmitic",ylab="linoleic",main="First split") points(c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]), 21 c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),pch=16,col=2) points(c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]), c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),pch="x",col=4) abline(h=946) text(1000,1400,"Others",pos=4) plot(c(d.olive.train[d.olive.train[,5]<946,1], d.olive.test[d.olive.test[,5]<946,1]), c(d.olive.train[d.olive.train[,5]<946,6], d.olive.test[d.olive.test[,5]<946,6]), type="n",xlab="palmitic",ylab="linolenic",main="Next two splits") points(c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,1], d.olive.test[c1.test==-1&d.olive.test[,5]<946,1]), c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,6], d.olive.test[c1.test==-1&d.olive.test[,5]<946,6]),pch=16,col=2) points(c(d.olive.train[c1.train==1&d.olive.train[,5]<946,1], d.olive.test[c1.test==1&d.olive.test[,5]<946,1]), c(d.olive.train[c1.train==1&d.olive.train[,5]<946,6], d.olive.test[c1.test==1&d.olive.test[,5]<946,6]),pch="x",col=4) abline(v=1116) lines(c(1116,1500),c(37,37)) text(900,70,"Others",pos=4) text(1300,65,"Calabria",pos=4) # Linear discriminant analysis olive.lda1<-lda(d.olive.train,c1.train) olive.lda1 table(c1.train,predict(olive.lda1,d.olive.train)$class) table(c1.test,predict(olive.lda1,d.olive.test)$class) olive.x<-predict(olive.lda1,d.olive.train)$x olive.x2<-predict(olive.lda1,d.olive.test)$x # Quadratic discriminant analysis olive.qda1<-qda(d.olive.train,c1.train) olive.qda1 table(c1.train,predict(olive.qda1,d.olive.train)$class) table(c1.test,predict(olive.qda1,d.olive.test)$class) # FFNN # Load nnet library olive.nn<-nnet(c1.train~.,d.olive.train,size=5,linout=T,decay=5e-4, range=0.6,maxit=1000) table(c1.train,round(predict(olive.nn,d.olive.train))) table(c1.test,round(predict(olive.nn,d.olive.test))) olive.x<-predict(olive.nn,d.olive.train) olive.x2<-predict(olive.nn,d.olive.test) Here are the results: 22 # trees n= 219 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 219 42 -1 (0.8082192 0.1917808) 2) linoleic>=946 153 0 -1 (1.0000000 0.0000000) * 3) linoleic< 946 66 24 1 (0.3636364 0.6363636) 6) palmitic< 1116 18 0 -1 (1.0000000 0.0000000) * 7) palmitic>=1116 48 6 1 (0.1250000 0.8750000) 14) linolenic< 37 8 3 -1 (0.6250000 0.3750000) * 15) linolenic>=37 40 1 1 (0.0250000 0.9750000) * c1.train -1 1 -1 176 1 1 3 39 c1.test -1 1 -1 54 0 1 2 12 # LDA lda.data.frame(d.olive.train, c1.train) Prior probabilities of groups: -1 1 0.8082192 0.1917808 Group means: palmitic palmitoleic stearic oleic linoleic linolenic arachidic -1 1353.311 170.2655 212.2655 7007.458 1121.831 35.54802 61.56497 1 1298.119 120.5952 263.4762 7319.881 808.119 45.54762 64.02381 eicosenoic -1 25.38418 1 28.83333 Coefficients of linear discriminants: LD1 palmitic -0.03604224 palmitoleic -0.05131146 stearic -0.03084686 oleic -0.04136986 linoleic -0.04599853 linolenic 0.06430998 arachidic -0.08761155 eicosenoic -0.09149567 23 c1.train -1 1 -1 173 4 1 0 42 c1.test -1 1 -1 50 4 1 1 13 # QDA c1.train -1 1 -1 176 1 1 0 42 c1.test -1 1 -1 54 0 1 3 11 # FFNN # weights: 51 initial value 144.303546 iter 10 value 69.334956 ..... iter 250 value 0.542601 final value 0.542600 converged c1.train -1 1 -1 177 0 1 0 42 c1.test -1 0 1 -1 53 0 1 1 1 1 12 North vs South Apulia These two areas turn out to be very easy to separate, using trees. The training and test sets are both perfectly classified using palmitoleic acid (Figure 9). Here is the code: # North Apulia vs South Apulia d.olive.train2<-d.olive[indx.tr,] c2.train<-c(rep(-1,572))[indx.tr] c2.train[d.olive.train2[,2]==1]<-1 c2.train<-c2.train[d.olive.train2[,1]==1&d.olive.train2[,2]!=2& d.olive.train2[,2]!=4] d.olive.train2<-d.olive.train2[d.olive.train2[,1]==1&d.olive.train2[,2]!=2& d.olive.train2[,2]!=4,-c(1,2)] d.olive.test2<-d.olive[indx.tst,] c2.test<-c(rep(-1,572))[indx.tst] 24 Figure 9: North and South Apulia can be separated very cleanly using palmitoleic acid. c2.test[d.olive.test2[,2]==1]<-1 c2.test<-c2.test[d.olive.test2[,1]==1&d.olive.test2[,2]!=2& d.olive.test2[,2]!=4] d.olive.test2<-d.olive.test2[d.olive.test2[,1]==1&d.olive.test2[,2]!=2& d.olive.test2[,2]!=4,-c(1,2)] # Trees olive.rp<-rpart(c2.train~.,data.frame(d.olive.train2),method="class") olive.rp table(c2.train,predict(olive.rp,type="class")) table(c2.test,predict(olive.rp,data.frame(d.olive.test2),type="class")) par(mfrow=c(1,1)) hist(c(d.olive.train2[,2],d.olive.test2[,2]),30,col=2, xlab="palmitoleic",main="North vs South Apulia",xlim=c(0,300)) text(50,30,"North",pos=4) text(250,30,"South",pos=4) abline(v=111.5) Here are the results: n= 177 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 177 19 -1 (0.8926554 0.1073446) 2) palmitoleic>=111.5 158 0 -1 (1.0000000 0.0000000) * 3) palmitoleic< 111.5 19 0 1 (0.0000000 1.0000000) * 25 c2.train -1 1 -1 158 0 1 0 19 c2.test -1 1 -1 48 0 1 0 6 Now lets review Sicily Sicily, the problem area, cannot be cleanly separated from the other areas using trees. The error is 15/246 = 0.061 in the training set, and 7/77 = 0.091 in the test set. What about FFNN or SVM? Here is the code: olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class") olive.rp table(c1.train,predict(olive.rp,type="class")) table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class")) Here are the results: n= 246 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 246 27 -1 (0.89024390 0.10975610) 2) eicosenoic< 34.5 199 6 -1 (0.96984925 0.03015075) * 3) eicosenoic>=34.5 47 21 -1 (0.55319149 0.44680851) 6) stearic< 261 30 8 -1 (0.73333333 0.26666667) 12) arachidic< 84.5 23 3 -1 (0.86956522 0.13043478) * 13) arachidic>=84.5 7 2 1 (0.28571429 0.71428571) * 7) stearic>=261 17 4 1 (0.23529412 0.76470588) * c1.train -1 1 -1 213 6 1 9 18 c1.test -1 1 -1 65 3 1 4 5 Thus the solution for the areas of the southern region (without Sicily) is: If linoleic>=946 If palmitoleic >= 111.5 then the area is South Apulia Else the area is North Apulia Else (linoleic<946) If palmitic<1116 If palmitoleic >= 111.5 then the area is South Apulia 26 Else the area is North Apulia Else (palmitic>=1116) If linolenic<37 If palmitoleic >= 111.5 then the area is South Apulia Else the area is North Apulia Else (linolenic>37) the area is Calabria The test error associated with this classification is 3%. 27 4 Summary The classification rule (with values given in % ×100) is then: If eicosenoic acid > 7 then the region is South If eicosenoic acid < 34.5 then the area is Sicily Else If stearic < 261 If arachidic < 84.5 the area is Sicily If stearic ≥ 261 OR if (stearic < 261 and arachidic ≥ 84.5) If linoleic ≥ 946 If palmitoleic ≥ 111.5 the area is South Apulia Else the area is North Apulia Else (linoleic <946) If palmitic ≥ 1116 If palmitoleic ≥ 111.5 the area is South Apulia Else the area is North Apulia Else (palmitic ≥ 1116) If linolenic < 37 If palmitoleic ≥ 111.5 then the area is South Apulia Else the area is North Apulia Else (linolenic > 37) the area is Calabria Else If 0.0003×palmitic +0.0153×palmitoleic +0.01423×stearic +0.0010×oleic −0.0097×linoleic 0.0105×linolenic −0.0236×arachidic −0.1193×eicosenoic - 2.12 > -1.49 then the region is Sardinia If 0.0097×palmitic +0.0018×palmitoleic +0.0276×stearic +0.00075×oleic +0.0271×linoleic+ 0.0750×linolenic −0.0260×arachidic −0.0400×eicosenoic - 55.04 > 0.73 then the sample is from inland Sardinia Else the sample is from coastal Sardinia Else the region is North If 0.0285×palmitic +0.0009×palmitoleic +0.0317×stearic +0.0187×oleic + 0.0180×linoleic −0.0264×linolenic +0.0768×arachidic +0.0795×eicosenoic - 200.0 > 0.76 then the sample is from East Liguria Else if 0.0124×palmitic +0.0343×palmitoleic +0.0261×stearic +0.0122×oleic + 0.0172×linoleic −0.0627×linolenic −0.0342×arachidic −0.2244×eicosenoic - 128.0 > -0.03 then the sample is from West Liguria Else it is from Umbria The error on the test set overall is (3 + 7 + 2)/136 = 0.09 including Sicily and 5/127 = 0.04 without Sicily, and separately on the areas is given in Table 4. The important variables in each of the separations are given in Table 5. There are several qualifications about the study: • We are assuming that the data samples have been collected appropriately, that the samples are representative of the oils of the areas they are purported to be produced in. • There is a big difference in the sample sizes for each class. South Apulia is the most abundantly represented and form the internet search of material this is one of the most prolific olive oil producing 28 Classes Overall (with Sicily) Overall (without Sicily) South vs Others North vs Sardinia Inland vs Coastal Sardinia East Liguria vs Others Umbria vs West Liguria Sicily vs Others Calabria vs Others North vs South Apulia Test Error 12/136=0.09 5/127=0.04 0/136 =0 0/59 =0 0/24=0 3/35=0.09 0/23=0 7/77=0.09 2/68=0.03 0/54=0 Table 4: Prediction errors Separation Regions South/Others Sardinia/North Sardinia Coast/Inland North E.Liguria/Others W.Liguria/Umbria South Sicily/Others Calab/Others N.Apulia/S.Apulia palmitic palmitoleic stear oleic linoleic x x x x x x x x x x linolenic arachidic eicosenoic x x x x x x x x x x x x Table 5: This table marks the important variables for each separation. areas. North Apulia has the least samples, only 25. Perhaps the sample size is indicative of the productivity and should perhaps be incorporated into the classifiers. 29 References Burges, C. J. C. (1998), ‘A Tutorial Support Vector Machines for Pattern Recognition’, Data Mining and Knowledge Discovery 2, 121–167. Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of olive oils from their fatty acid composition, in H. Martens & H. Russwurm Jr., eds, ‘Food Research and Data Analysis’, Applied Science Publishers, London, pp. 189–214. Vapnik, V. (1979), Estimation of Dependence Based on Empirical Data (In Russian), Nauka, Moscow. Vapnik, V. (1999), The Nature of Statistical Learning Theory (Statistics for Engineering and Information Science), Springer-Verlag, New York, NY. Venables, W. N. & Ripley, B. (1994), Modern Applied Statistics with S-Plus, Springer-Verlag, New York. 30