STAT 503X Case Study 2: Italian Olive Oils

by user

on 06 июля 2016

Category:

statistics

>> Downloads: 11

views

Report

Comments

Description

Download STAT 503X Case Study 2: Italian Olive Oils

Transcript

STAT 503X Case Study 2: Italian Olive Oils

STAT 503X Case Study 2: Italian Olive Oils
1
Description
This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic,
linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. (An analysis of this data is
given in (Forina, Armanino, Lanteri & Tiscornia 1983)). There are 9 collection areas, 4 from southern Italy
(North and South Apulia, Calabria, Sicily), two from Sardinia (Inland and Coastal) and 3 from northern
Italy (Umbria, East and West Liguria).
The data available are:
Region
Area
Palmitic Acid
Palmitoleic Acid
Stearic Acid
Oleic Acid
Linoleic Acid
Linolenic Acid
Arachidic Acid
Eicosenoic Acid
South, North or Sardinia
Sub-regions within the larger regions (North and South Apulia,
Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and
West Liguria
Percentage ×100 in sample
Percentage ×100 in sample
Percentage ×100 in sample
Percentage ×100 in sample
Percentage ×100 in sample
Percentage ×100 in sample
Percentage ×100 in sample
Percentage ×100 in sample
The primary question is “How do we distinguish the oils from different regions and areas in Italy based
on their combinations of the fatty acids?”
1
2
2
Suggested Approaches
Approach
Data Restructuring
Summary Statistics
Visual Inspection The
aim of visual methods
is to understand separations, help decide
the best classification
method,
and hence
interpret solutions
Numerical Analysis
The aim of numerical
solutions is to get the
best predictive results.
Reason
This is very clean data so I don’t see
any need to restructure.
To get at location and scale information for each variable, and by groups
Univariate Plots, Bivariate Plots,
Touring Plots
Linear
Discriminant
Analysis
(LDA), Quadratic Discriminant
Analysis (QDA), Classification and
Regression Trees (CART), Forests,
Feed-Forward Neural Networks and
Support Vector Machines
3
Type of questions addressed
What is the average percent composition of eicosenoic acid overall? Is
there a difference in the average percentage of eicosenoic acid for olives
from different growing regions?
Are there differences in the fatty acid
composition between the olives from
different growing regions?
“Can define the percentage composition of fatty acids that distinguishes olives from different growing
regions?”
3
Actual Approaches
3.1
Summary Statistics
The following tables contain information on the minimum, maximum, median, mean and standard deviation
for the fatty acids, in total and broken down by different growing region.
All the 572 observations (values reported as percentages):
Min.
Median
Max.
Mean
Std. Dev.
palmitic
6.10
12.01
17.53
12.32
1.69
palmitoleic
0.15
1.10
2.80
1.26
0.53
stearic
1.52
2.23
3.75
2.29
0.37
oleic
63.00
73.02
84.10
73.12
4.06
linoleic
4.48
10.30
14.70
9.81
2.43
linolenic
0.00
0.33
0.74
0.32
0.13
arachidic
0.0
0.61
1.05
0.58
0.22
eicosenoic
0.01
0.17
0.58
0.16
0.14
From the summary statistics we notice that olive oils are mostly constituted of oleic acid. Palmitic
accounts for about 12% on average and the rest less than 10% on average.
South
Sardinia
North
Nth Apul
Calabria
Sth Apul
Sicily
Inla Sard
Coas Sard
East Lig
West Lig
Umbria
n
323
98
151
25
56
206
36
65
33
50
50
51
palmitic
13.32
11.11
10.95
10.27
13.02
13.96
12.3
10.98
11.38
11.45
10.3
10.86
palmitoleic
1.55
0.97
0.84
0.62
1.21
1.84
1.05
0.95
1.01
0.84
1.08
0.60
stearic
2.29
2.26
2.31
2.35
2.63
2.11
2.74
2.17
2.44
2.41
2.57
1.94
oleic
71.00
72.68
77.93
78.20
73.07
69.11
73.58
73.61
70.86
77.46
76.74
79.56
linoleic
10.33
11.97
7.27
7.06
8.19
11.66
8.35
11.25
13.37
6.89
8.97
5.97
linolenic
0.38
0.27
0.22
0.43
0.46
0.35
0.42
0.29
0.24
0.26
0.05
0.34
arachidic
0.63
0.73
0.38
0.72
0.64
0.60
0.76
0.74
0.72
0.64
0.07
0.42
eicosenoic
0.27
0.02
0.02
0.35
0.28
0.24
0.38
0.02
0.02
0.02
0.02
0.02
Southern oils have much higher eicosenoic acid on average eicosenoic and slightly higher palmitic and
palmitoleic acid content. The north and sardinian oils have some difference in the average oleic, linoleic and
arachidic acids.
Among the southern oils there is some difference in most of the averages.
Northern oils have some difference in most of the averages.
4
3.2
Visual Inspection
The objective here is to find differences in the measured variables amongst the classes. Differences here
might be actual separations between classes on a variable or linear combination of variables. The approach
is relatively simple:
1. Use color and or symbol to code the categorical class information into plot.
2. Begin with low-dimensional plots (histogram, density plot, dot plot, scatterplot) of the measured variables and work up to high-dimensional plots (parallel coordinate plot, tours), exploring class structure
in relation to data space.
3.2.1
Regions - Univariate Plots
Using 1D Plot mode sequentially work through the variables, either by manually sepecting variables or
cycling through automatically, to examine separations between regions. Its possible to neatly separate the
oils from southern Italy from the other two regions using just one variable, eicosenoic acid. Figure 1 displays
a textured dotplot and an ASH plot of this variable.
The oils from southern Italy are removed, and we concentrate on separating the oils from northern Italy
and Sardinia. Although a clear separation between these two regions cannot be found using one variable,
two of the variables, oleic and linoleic acid, appear to be important for the separation (Figure 1).
3.2.2
Regions - Bivariate Plots
Starting from the two variables identified by the univariate plots as important for separating northern Italian
oils from Sardinian oils, the remaining variables are explored in relation two these two using scatterplots.
Oleic acid and linoleic acid show some, but not cleanly, separated regions. Arachidic acid and linoleic acid
display a clear separation between the regions, but it is a very non-linear boundary.
3.2.3
Regions - Multivariate Plots
Starting from the three variables found from bivariate plots to be important for separating northern Italian
and Sardinian oils we use a higher-dimensional technique to explore them. Using either Rotation, Tour1D
or Tour2D examine the separation between the two regions in the 3-dimensional space. Figure 2 shows
the results of using Tour1D on the three variables. The two regions can be separated cleanly by a linear
combination of linoleic and arachidic acid, roughly corresponding to 0.957 × linoleic + 0.289 × arachidic.
3.2.4
Areas - Northern Italy
There are three areas in the region, Umbria, East and West Liguria. From univariate plots there are
no clear separations between areas, although several variables, for example, linoleic acid (Figure 3, show
some differences. In bivariate plots, two variables, stearic and linoleic show some differences between the
areas. Examining combinations of variables in Tour1D shows that West Liguria is almost separable from
the other two areas using palmitoleic, stearic, linoleic and arachidic acids. Umbria and East Liguria are
separable, except for one sample, in a combination of most of the variables. To get this result, the projection
pursuit controls were used followed by manual manipulation to assess the importance of each variable in the
separation of the areas.
5
Figure 1: Separation between the 3 regions of Italian olive oil data in univariate plots.
6
Figure 2: Separation between the northern Italian and Sardinian oils in bivariate plots and multivariate
plots.
3.2.5
Areas - Sardinia
Separating the oils from the coastal and inland areas of Sardinia can be simply done by using two variables,
oleic and linoleic acid.
3.2.6
Areas - Southern Italy
There are four areas in the south Italy region, North and South Apulia, Calabria, Sicily. From univariate
plots and bivariate plots there are no clear separations between all four areas. Simplify the work by working
with fewer groups, take two of the areas at a time, or three at a time, and plot the data in univariate and
bivariate plots. The best results are obtained by removing Sicily. If the oils from this area are removed then
the remaining 3 areas are almost separable using the variables palmitic, palmitoleic, stearic and oleic acids.
Figure 4 shows plots taken from the Tour2D of these 4 variables revealing the separations between the areas,
North and South Apulia, and Calabria. However, the fatty acid content of oils from Sicily is confounded
with those of the other areas.
3.2.7
Taking stock
What we have learned from this data is that the olive oils from different geographic regions have dramatically
different fatty acid content. The three larger geographic regions, North, South, Sardinia, are well-separated
based on eicosenoic, linoleic and arachidic acids. The oils from areas in northern Italy are mostly separable
from each other using all the variables. The oils from the inland and coastal areas of Sardinia have different
amounts of oleic and linoleic acids. The oils from three of the areas in southern Italy are almost separable.
And one is left with the curious content of the oils from Sicily. Why are these oils indistinguishable from
the oils of all the other areas in the south? Is there a problem with the quality of these samples?
7
Figure 3: Separation in the oils from areas of northern Italy, univariate, bivariate and multivariate plots.
8
Figure 4: The areas of southern Italy are mostly separable, except for Sicily.
3.3
Numerical Analysis
The data is broken into training and test samples for this part of the analysis. Approximately 25% of cases
within each group are reserved for the test samples. These are the cases used for the test sample, where the
data has been sorted by region and area:
1
76
151
218
305
375
460
530
7
80
156
219
310
376
462
532
12
95
165
225
313
386
466
541
15
101
175
227
314
392
468
543
16
102
177
241
323
405
470
545
22
105
182
242
330
406
474
546
27
106
183
246
333
415
476
551
32
110
185
257
338
416
480
559
34
116
186
259
341
418
481
567
35
118
187
263
342
420
482
570
36
119
190
266
347
421
487
41
122
192
274
351
423
492
50
134
194
280
352
426
493
54
137
201
284
356
428
500
61
140
202
289
358
435
501
68
147
211
291
359
440
509
70
148
213
292
369
451
519
75
150
217
297
374
458
522
Region
Area
# Train # Test
South N. Apul.
19
6
South Calabria
42
14
South
S. Apul
158
48
South
Sicily
27
9
Sard
Inland
49
16
Sard
Coast
25
8
North
E. Lig.
38
12
North
W. Lig.
38
12
North
Umbria
40
11
We will use what we learned from the visual analysis to guide the numerical analysis. Some methods
restrict classification to just two groups, so we will use this restriction with all the classification methods.
We will also drop the Sicilian oils from the study - because there is some question raised about the validity
of these oils. Here is the binary classification steps we will follow:
9
3.3.1
Methods Used
Classification trees generate a classification tree by sequentially doing binary splits on variables. Splits are
decided on according to the variable and split value which produces the lower measure of impurity. There
are several common measures of impurity, Gini and deviance. For example, for a two class problem with
400 observations in each class, denoted as (400,400), variable 1 produces a split of (300,100) and (100,300),
and variable 2 produces a split (200,400) and (200,0). The split on variable 2 will produce a lower impurity
value because one branch is more pure. There are numerous implementations: R packages such as rpart
and standalone packages such as C4.5, C5.0.
Classical Linear discriminant analysis (LDA) assumes that the populations come from multivariate normal
distributions with different mean but equal variance-covariance matrices. Linear boundaries result, and it
is possible to reduce the dimension of the data into the space of maximum separation using canonical
coordinates. The MASS package in R has functions for LDA.
Quadratic discriminant analysis (QDA) assumes that the populations come from multivariate normal
distributions with different mean and different variance-covariance. Non-linear boundaries are the result.
The MASS package in R has functions for LDA.
Feed-forward neural networks provide a flexible way to generalize linear regression functions. A simple
network model as produced by nnet code in R (Venables & Ripley 1994) may be represented by the equation:
y = φ(α +
s
X
wh φ(αh +
p
X
wih xi ))
i=1
h=1
where x is the vector of explanatory variable values, y is the target value, p is the number of variables, s is
10
the number of nodes in the single hidden layer and φ is a fixed function, usually a linear or logistic function.
This model has a single hidden layer, and univariate output values.
SVM have recently gained widespread attention due to their success at prediction in classification problems. The subject started in the late seventies (Vapnik 1979). The definitive reference is Vapnik (1999), and
Burges (1998) gives a simpler tutorial into the subject. The main difference between this and the previously
described classification techniques is that SVMs really only work for separating between 2 groups. There
are some modifications for multiple groups but these are little more than one might do manually by working
pairwise through the groups. SVM algorithms work by finding a subset of points in the two groups that can
be separated, which are then known as the supportPvectors. These support vectors can be used to define
NS
a separating hyperplane, w.x + b = 0, where w = i=1
αi yi xi , NS is the number of support vectors, and
PNS
α comes from the constraint i=1 αi yi = 0. Actually we search for the support vectors which gives the
hyperplane which gives the biggest separation between the two classes. Note that it is posssible to incorporate non-linear kernel functions allows for defining non-linear separations, and also that modifications allow
SVMs to perform well when the classes are not separable. We use the software SVM Light: binary and
documentation at http://svmlight.joachims.org/
3.3.2
The Classification
Separating Regions
In separating the southern oils from the other two regions, and then northern oils from Sardinian oils,
here is the classification tree solution:
If eicosenoic acid $>$ 7 then the region is South (1)
Else
If linoleic $>$ 1053.5 then the region is Sardinia (2)
Else the region is North (3)
There are no missclassifications on the first split but the second split is problematic. It is a perfect
separation for this data but there is no gap between the two groups. This is easily seen from the plot of the
two variables used, Figure 5. So at this point we will shift to using a different method to build a classification
rule for northen oils from Sardinian oils.
Here is the R code:
# South vs others
d.olive.train<-d.olive[indx.tr,-c(1,2)]
d.olive.test<-d.olive[indx.tst,-c(1,2)]
c1<-rep(-1,572)
c1[d.olive[,1]!=1]<-1
c1.train<-c1[indx.tr]
c1.test<-c1[indx.tst]
# North vs Sardinia
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,1]==3]<-1
c1.train<-c1.train[d.olive.train[,1]!=1]
d.olive.train<-d.olive.train[d.olive.train[,1]!=1,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]
11
Figure 5: (Left)Plot illustrating the results of tree classifier. (Right) Second boundary from LDA solution,
dashed boundary is the tree solution on the LDA predicted values.
c1.test[d.olive.test[,1]==3]<-1
c1.test<-c1.test[d.olive.test[,1]!=1]
d.olive.test<-d.olive.test[d.olive.test[,1]!=1,-c(1,2)]
# Trees
# Load the rpart library
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))
Here is the output from R:
# South vs others
n= 436
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 436 190 -1 (0.5642202 0.4357798)
2) eicosenoic>=7 246
0 -1 (1.0000000 0.0000000) *
3) eicosenoic< 7 190
0 1 (0.0000000 1.0000000) *
c1.train -1
1
-1 246
0
1
0 190
c1.test -1 1
12
-1 77 0
1
0 59
# Sardinia vs North
n= 190
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 190 74 1 (0.3894737 0.6105263)
2) linoleic>=1053.5 74 0 -1 (1.0000000 0.0000000) *
3) linoleic< 1053.5 116 0 1 (0.0000000 1.0000000) *
c1.train -1
1
-1 74
0
1
0 116
c1.test -1 1
-1 24 0
1
0 35
LDA results in errors in the training sample (error rate = 1/190 = 0.005) and no errors in the test sample.
The predicted values are well-separated (see figure) but the boundary is in the wrong place! (See Figure 5,
right plot.)
palm
-0.13
palm’oleic
-0.27
stear
0.10
oleic
0.91
lino
-0.92
lino’nic
-0.24
arach
-0.64
eico
0.03
Table 1: Correlations between predicted values and the predictors.
The variables oleic, linoleic and arachidic are highly correlated with the predicted values (Table 1). The
predicted values show a big gap between the northern oils and the sardinian oils but the boundary is set too
close to the sardinian group because LDA assumes equal variances.
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
# plot the predictions of the current sample
olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x
olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x
olive.lda.proj<-predict(olive.lda1,d.olive[,-c(1,2)],dimen=1)$x
par(pty="s")
plot(d.olive[,10],olive.lda.proj,type="n",
xlab="eicosenoic",ylab="LDA Discrim 1")
13
points(d.olive[d.olive[,1]==1,10],olive.lda.proj[d.olive[,1]==1],pch="x",col=2)
points(d.olive[d.olive[,1]==2,10],olive.lda.proj[d.olive[,1]==2],pch="s",col=3)
points(d.olive[d.olive[,1]==3,10],olive.lda.proj[d.olive[,1]==3],pch=16,col=6)
abline(v=7)
lines(c(0,7),c(-0.49,-0.49))
text(50,7,"South")
text(-2,8,"Sardinia",pos=4)
text(-2,-6,"North",pos=4)
# Predicted values are generated by
# x’sigma^(-1)(mean1-mean2)-((mean1+mean2)/2)’sigma^(-1)(mean1-mean2)
# So these lines should recreate the predicted values.
xmn<-apply(d.olive.train,2,mean)
x<-as.matrix(d.olive.train-matrix(rep(xmn,190),nrow=190,byrow=T))%*%
as.matrix(olive.lda1$scaling)
# compute correlations between variables and the predicted values
cor(olive.x,d.olive.train[,1])
cor(olive.x,d.olive.train[,2])
cor(olive.x,d.olive.train[,3])
cor(olive.x,d.olive.train[,4])
cor(olive.x,d.olive.train[,5])
cor(olive.x,d.olive.train[,6])
cor(olive.x,d.olive.train[,7])
cor(olive.x,d.olive.train[,8])
These are the results:
Prior probabilities of groups:
-1
1
0.3894737 0.6105263
Group means:
palmitic palmitoleic stearic
oleic linoleic linolenic arachidic
-1 1111.514
95.82432 224.5811 7270.824 1193.757 27.90541 73.85135
1 1095.578
83.52586 231.3879 7790.284 729.250 21.53448 36.79310
eicosenoic
-1
1.972973
1
2.017241
Coefficients of linear discriminants:
LD1
palmitic
0.0003340794
palmitoleic 0.0152893597
stearic
0.0142798870
oleic
0.0010155503
linoleic
-0.0097371901
linolenic
-0.0104774449
arachidic
-0.0236128945
14
eicosenoic
-0.1193127354
c1.train -1
1
-1 74
0
1
1 115
c1.test -1 1
-1 24 0
1
0 35
Quadratic discriminant analysis provides a perfect classification but its more difficult to determine the
location of the boundary between the two groups.
# Quadratic discriminant analysis
olive.qda1<-qda(d.olive.train,c1.train)
olive.qda1
table(c1.train,predict(olive.qda1,d.olive.train)$class)
table(c1.test,predict(olive.qda1,d.olive.test)$class)
Trees calculated on the predicted values provide a perfect classification, although the boundary seems a
little too close to the northern oils (Figure 5).
# Check if trees will give a better boundary
rpart(c1.train~olive.x,method="class")
par(lty=2)
lines(c(0,7),c(-1.49,-1.49))
par(lty=1)
This gives the classification of the regions as:
If eicosenoic acid > 7 then the region is South (1)
Else
If 0.0003×palmitic +0.0153×palmitoleic +0.01423×stearic +0.0010×oleic −0.0097×linoleic 0.0105×linolenic −0.0236×arachidic −0.1193×eicosenoic - 2.12 > -1.49 then the region is Sardinia (2)
Else the region is North (3)
There is no error associated with this rule.
Sardinian Oils
Split the samples into training and test sets.
# Sardinian oils
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,2]==6]<-1
c1.train<-c1.train[d.olive.train[,1]==2]
d.olive.train<-d.olive.train[d.olive.train[,1]==2,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]
15
c1.test[d.olive.test[,2]==6]<-1
c1.test<-c1.test[d.olive.test[,1]==2]
d.olive.test<-d.olive.test[d.olive.test[,1]==2,-c(1,2)]
Figure 6: Separation of the inland and coastal Sardinian oils using trees (horizontal axis) and LDA (vertical
axis).
A classification tree produces a perfect classification of both training and test samples but the separation
between groups is rather small relative to the split for a slightly more complex solution provided by LDA
(Figure 6). Here is the code:
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))
# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
# plot the predictions of the current sample
olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x
olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x
par(mfrow=c(1,1),pty="s")
16
plot(c(d.olive.train[,5],d.olive.test[,5]),c(olive.x,olive.x2),type="n",
xlab="Tree",ylab="LDA Discrim")
points(c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),
c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),
c(olive.x[c1.train==1],olive.x2[c1.test==1]),pch="x",col=4)
abline(v=1246.5)
abline(h=0.73)
text(1150,-4,"Inland",pos=4)
text(1260,7,"Coastal",pos=4)
The solution for Sardinian oils then is:
If 0.0097×palmitic +0.0018×palmitoleic +0.0276×stearic +0.00075×oleic +0.0271×linoleic +0.0750×linolenic
−0.0260×arachidic −0.0400×eicosenoic - 55.04 > 0.73 then the sample is from inland Sardinia.
palm
0.52
palm’oleic
0.28
stear
0.73
oleic
-0.96
lino
0.98
lino’nic
-0.47
arach
-0.17
eico
-0.01
Table 2: Correlations between variables and predicted values for Sardinian oils.
Based on correlations with the predicted values (Table 2) the important variables for separating inland and
coastal Sardinia are oleic, linoleic and stearic acids, with palmitic and linolenic being somewhat important.
Northern Oils
Separating these northern oils is more difficult, based on our visual analysis.
East Liguria vs West Liguria,Umbria
The classification difficulty is reflected in the results of tree classifiers: the error in the training set is
8/116 = 0.07 and the error in the test set is 8/35 = 0.23. But is is a simple solution using just two variables,
palmitic and arachidic. (See Figure 7 left plot.)
The LDA solution is a littler better, with 6/116 = 0.05 errors in the training set and 3/35 = 0.09 in the
test set (Figure 7).
# East Liguria vs others
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))
plot(c(d.olive.train[,7],d.olive.test[,7]),
c(d.olive.train[,1],d.olive.test[,1]),type="n",
xlab="arachidic",ylab="palmitic")
points(c(d.olive.train[c1.train==-1,7],d.olive.test[c1.test==-1,7]),
c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,7],d.olive.test[c1.test==1,7]),
c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),pch="x",col=4)
17
abline(v=59.5)
lines(c(0,59.5),c(1165,1165))
text(80,1400,"East Liguria",pos=4)
text(0,800,"West Liguria, Umbria",pos=4)
# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
olive.x<-predict(olive.lda1,d.olive.train)$x
olive.x2<-predict(olive.lda1,d.olive.test)$x
xmn<-apply(d.olive.train,2,mean)
x<-as.matrix(d.olive.train-matrix(rep(xmn,79),nrow=79,byrow=T))%*%
as.matrix(olive.lda1$scaling)
xmn%*%as.matrix(olive.lda1$scaling)
plot(c(olive.x,olive.x2),c(c1.train,c1.test),type="n",
xlab="LDA Discrim",ylab="Area")
points(c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),
c(c1.train[c1.train==-1],c1.test[c1.test==-1]),pch=16,col=2)
points(c(olive.x[c1.train==1],olive.x2[c1.test==1]),
c(c1.train[c1.train==1],c1.test[c1.test==1]),pch="x",col=4)
abline(v=0.76)
text(2,0.5,"East Liguria",pos=4)
text(-4,-0.5,"West Liguria, Umbria",pos=4)
# compute correlations between variables and the predicted values
cor(olive.x,d.olive.train[,1])
cor(olive.x,d.olive.train[,2])
cor(olive.x,d.olive.train[,3])
cor(olive.x,d.olive.train[,4])
cor(olive.x,d.olive.train[,5])
cor(olive.x,d.olive.train[,6])
cor(olive.x,d.olive.train[,7])
cor(olive.x,d.olive.train[,8])
Turning to feed-forward neural networks. After several runs a good result the net converges, to a solution
that has no error in the training set but two errors in the test set. The solution is as follows:
Area = 1.01-2.01×(-0.22× palmitic-0.01× palmitoleic-0.13× stearic +0.05× oleic -0.21× linoleic
+0.61× linolenic-0.05× arachidic +0.03× eicosenoic) -0.01×(0.07× palmitic+0.03× palmitoleic+0.06×
stearic -0.02× oleic +0.05× linoleic -0.05× linolenic+0.04× arachidic) -2.00×(-0.10× palmitic+0.10×
palmitoleic+0.11× stearic +0.07× oleic -0.01× linoleic +0.20× linolenic-0.26× arachidic -0.01×
eicosenoic)
The code is as follows:
18
Figure 7: (Left) Tree solution for East Liguria vs West Liguria,Umbria, (Middle) LDA solution for East
Liguria vs West Liguria,Umbria, (Right) LDA solution for West Liguria vs Umbria.
olive.nn<-nnet(c1.train~.,d.olive.train,size=3,linout=T,decay=5e-4,
range=0.6,maxit=1000)
# weights: 31
initial value 101.133439
iter 10 value 100.792293
iter 20 value 88.353498
...
iter 560 value 0.004900
final value 0.004900
converged
> table(c1.train,round(predict(olive.nn,d.olive.train)))
c1.train -1 1
-1 79 0
1
0 37
> table(c1.test,round(predict(olive.nn,d.olive.test)))
c1.test -3 -1 0 1
-1 0 21 1 0
1
1 0 0 12
Support vector machines (SVM) produce 2 errors in the training set and 6 errors in the test set. The
Windows version of SVM light is used. To run SVM the data needs to be outputted to file in a special
format. The svm learn and svm classify executables are run in a command window. The parameters
needs to be adjusted some to get the best results. Linear kernels are used.
The neural network solution is the best but not substantially different from the LDA solution in test set
error. So we will use the LDA solution which is a little simpler.
West Liguria vs Umbria
Trees result in no errors in the training set and 2/22 = 0.09 errors in the test set. LDA results in no
errors in either training or test sets (Figure 7).
19
# West Liguria vs Umbria
d.olive.train2<-d.olive[indx.tr,]
c2.train<-c(rep(-1,572))[indx.tr]
c2.train[d.olive.train2[,2]==8]<-1
c2.train<-c2.train[d.olive.train2[,1]==3&d.olive.train2[,2]!=7]
d.olive.train2<-d.olive.train2[d.olive.train2[,1]==3&d.olive.train2[,2]!=7,
-c(1,2)]
d.olive.test2<-d.olive[indx.tst,]
c2.test<-c(rep(-1,572))[indx.tst]
c2.test[d.olive.test2[,2]==8]<-1
c2.test<-c2.test[d.olive.test2[,1]==3&d.olive.test2[,2]!=7]
d.olive.test2<-d.olive.test2[d.olive.test2[,1]==3&d.olive.test2[,2]!=7,-c(1,2)]
The solution for northern oils then is:
If 0.0285×palmitic +0.0009×palmitoleic +0.0317×stearic +0.0187×oleic +
0.0180×linoleic −0.0264×linolenic +0.0768×arachidic +0.0795×eicosenoic - 200.0 > 0.76
then the sample is from East Liguria
Else
if 0.0124×palmitic +0.0343×palmitoleic +0.0261×stearic +0.0122×oleic +
0.0172×linoleic −0.0627×linolenic −0.0342×arachidic −0.2244×eicosenoic - 128.0 > -0.03
then the sample is from West Liguria
Else it is from Umbria.
E.Lig vs Oth
W. Lig vs Umb
palm
0.67
-0.34
palm’oleic
0.03
0.80
stear
0.21
0.76
oleic
-0.26
-0.88
lino
-0.26
0.94
lino’nic
0.25
-0.93
arach
0.76
-0.85
eico
-0.10
0.05
Table 3: Correlations between variables and predicted values for northern oils.
Based on correlations with the predicted values (Table 3) the important variables for separating East
Liguria are palmitic and arachidic, and for separating West Liguria from Umbria the important variables
are palmitoleic, stearic, oleic, linoleic, linolenic, arachidic.
Southern Oils
The areas of the southern region promised to be the most difficult to classify. We ignore Sicily to begin.
Calabria vs Others
With trees there are 4/219 = 0.018 errors in the training set, and 2/68 = 0.029 errors in the test set. The
plots in Figure 8 illustrate the splits. This is about the best solution of all the methods, surprisingly! There
are too many errors in the training set, but the most important error to guard against is the test error, and
trees give the smallest test error. LDA results in 4/219 = 0.18 errors in the training set and 5/68 = 074
errors in the test set. QDA results in only 1/219 = 0.004 in the training set but 3/68 = 0.044 in the test
set. The best results for FFNN were 0 errors in the training set but 3/68 − 0.044 errors in the test set. SVM
had 1 error in the training set and 4 errors in the test set.
This is the code:
20
Figure 8: The boundaries resulting from a tree classifier on Calabria vs other southern areas.
# Southern oils - without Sicily
# Calabria vs Others
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,2]==2]<-1
c1.train<-c1.train[d.olive.train[,1]==1&d.olive.train[,2]!=4]
d.olive.train<-d.olive.train[d.olive.train[,1]==1&d.olive.train[,2]!=4,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]
c1.test[d.olive.test[,2]==2]<-1
c1.test<-c1.test[d.olive.test[,1]==1&d.olive.test[,2]!=4]
d.olive.test<-d.olive.test[d.olive.test[,1]==1&d.olive.test[,2]!=4,-c(1,2)]
# Calabria vs Others
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))
par(pty="s",mfrow=c(1,2))
plot(c(d.olive.train[,1],d.olive.test[,1]),
c(d.olive.train[,5],d.olive.test[,5]),
type="n",xlab="palmitic",ylab="linoleic",main="First split")
points(c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),
21
c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),
c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),pch="x",col=4)
abline(h=946)
text(1000,1400,"Others",pos=4)
plot(c(d.olive.train[d.olive.train[,5]<946,1],
d.olive.test[d.olive.test[,5]<946,1]),
c(d.olive.train[d.olive.train[,5]<946,6],
d.olive.test[d.olive.test[,5]<946,6]),
type="n",xlab="palmitic",ylab="linolenic",main="Next two splits")
points(c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,1],
d.olive.test[c1.test==-1&d.olive.test[,5]<946,1]),
c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,6],
d.olive.test[c1.test==-1&d.olive.test[,5]<946,6]),pch=16,col=2)
points(c(d.olive.train[c1.train==1&d.olive.train[,5]<946,1],
d.olive.test[c1.test==1&d.olive.test[,5]<946,1]),
c(d.olive.train[c1.train==1&d.olive.train[,5]<946,6],
d.olive.test[c1.test==1&d.olive.test[,5]<946,6]),pch="x",col=4)
abline(v=1116)
lines(c(1116,1500),c(37,37))
text(900,70,"Others",pos=4)
text(1300,65,"Calabria",pos=4)
# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
olive.x<-predict(olive.lda1,d.olive.train)$x
olive.x2<-predict(olive.lda1,d.olive.test)$x
# Quadratic discriminant analysis
olive.qda1<-qda(d.olive.train,c1.train)
olive.qda1
table(c1.train,predict(olive.qda1,d.olive.train)$class)
table(c1.test,predict(olive.qda1,d.olive.test)$class)
# FFNN
# Load nnet library
olive.nn<-nnet(c1.train~.,d.olive.train,size=5,linout=T,decay=5e-4,
range=0.6,maxit=1000)
table(c1.train,round(predict(olive.nn,d.olive.train)))
table(c1.test,round(predict(olive.nn,d.olive.test)))
olive.x<-predict(olive.nn,d.olive.train)
olive.x2<-predict(olive.nn,d.olive.test)
Here are the results:
22
# trees
n= 219
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 219 42 -1 (0.8082192 0.1917808)
2) linoleic>=946 153 0 -1 (1.0000000 0.0000000) *
3) linoleic< 946 66 24 1 (0.3636364 0.6363636)
6) palmitic< 1116 18 0 -1 (1.0000000 0.0000000) *
7) palmitic>=1116 48 6 1 (0.1250000 0.8750000)
14) linolenic< 37 8 3 -1 (0.6250000 0.3750000) *
15) linolenic>=37 40 1 1 (0.0250000 0.9750000) *
c1.train -1 1
-1 176 1
1
3 39
c1.test -1 1
-1 54 0
1
2 12
# LDA
lda.data.frame(d.olive.train, c1.train)
Prior probabilities of groups:
-1
1
0.8082192 0.1917808
Group means:
palmitic palmitoleic stearic
oleic linoleic linolenic arachidic
-1 1353.311
170.2655 212.2655 7007.458 1121.831 35.54802 61.56497
1 1298.119
120.5952 263.4762 7319.881 808.119 45.54762 64.02381
eicosenoic
-1
25.38418
1
28.83333
Coefficients of linear discriminants:
LD1
palmitic
-0.03604224
palmitoleic -0.05131146
stearic
-0.03084686
oleic
-0.04136986
linoleic
-0.04599853
linolenic
0.06430998
arachidic
-0.08761155
eicosenoic -0.09149567
23
c1.train -1 1
-1 173 4
1
0 42
c1.test -1 1
-1 50 4
1
1 13
# QDA
c1.train -1 1
-1 176 1
1
0 42
c1.test -1 1
-1 54 0
1
3 11
# FFNN
# weights: 51
initial value 144.303546
iter 10 value 69.334956
.....
iter 250 value 0.542601
final value 0.542600
converged
c1.train -1 1
-1 177 0
1
0 42
c1.test -1 0 1
-1 53 0 1
1
1 1 12
North vs South Apulia
These two areas turn out to be very easy to separate, using trees. The training and test sets are both
perfectly classified using palmitoleic acid (Figure 9).
Here is the code:
# North Apulia vs South Apulia
d.olive.train2<-d.olive[indx.tr,]
c2.train<-c(rep(-1,572))[indx.tr]
c2.train[d.olive.train2[,2]==1]<-1
c2.train<-c2.train[d.olive.train2[,1]==1&d.olive.train2[,2]!=2&
d.olive.train2[,2]!=4]
d.olive.train2<-d.olive.train2[d.olive.train2[,1]==1&d.olive.train2[,2]!=2&
d.olive.train2[,2]!=4,-c(1,2)]
d.olive.test2<-d.olive[indx.tst,]
c2.test<-c(rep(-1,572))[indx.tst]
24
Figure 9: North and South Apulia can be separated very cleanly using palmitoleic acid.
c2.test[d.olive.test2[,2]==1]<-1
c2.test<-c2.test[d.olive.test2[,1]==1&d.olive.test2[,2]!=2&
d.olive.test2[,2]!=4]
d.olive.test2<-d.olive.test2[d.olive.test2[,1]==1&d.olive.test2[,2]!=2&
d.olive.test2[,2]!=4,-c(1,2)]
# Trees
olive.rp<-rpart(c2.train~.,data.frame(d.olive.train2),method="class")
olive.rp
table(c2.train,predict(olive.rp,type="class"))
table(c2.test,predict(olive.rp,data.frame(d.olive.test2),type="class"))
par(mfrow=c(1,1))
hist(c(d.olive.train2[,2],d.olive.test2[,2]),30,col=2,
xlab="palmitoleic",main="North vs South Apulia",xlim=c(0,300))
text(50,30,"North",pos=4)
text(250,30,"South",pos=4)
abline(v=111.5)
Here are the results:
n= 177
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 177 19 -1 (0.8926554 0.1073446)
2) palmitoleic>=111.5 158 0 -1 (1.0000000 0.0000000) *
3) palmitoleic< 111.5 19 0 1 (0.0000000 1.0000000) *
25
c2.train -1 1
-1 158 0
1
0 19
c2.test -1 1
-1 48 0
1
0 6
Now lets review Sicily
Sicily, the problem area, cannot be cleanly separated from the other areas using trees. The error is
15/246 = 0.061 in the training set, and 7/77 = 0.091 in the test set. What about FFNN or SVM?
Here is the code:
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))
Here are the results:
n= 246
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 246 27 -1 (0.89024390 0.10975610)
2) eicosenoic< 34.5 199 6 -1 (0.96984925 0.03015075) *
3) eicosenoic>=34.5 47 21 -1 (0.55319149 0.44680851)
6) stearic< 261 30 8 -1 (0.73333333 0.26666667)
12) arachidic< 84.5 23 3 -1 (0.86956522 0.13043478) *
13) arachidic>=84.5 7 2 1 (0.28571429 0.71428571) *
7) stearic>=261 17 4 1 (0.23529412 0.76470588) *
c1.train -1 1
-1 213 6
1
9 18
c1.test -1 1
-1 65 3
1
4 5
Thus the solution for the areas of the southern region (without Sicily) is:
If linoleic>=946
If palmitoleic >= 111.5 then the area is South Apulia
Else the area is North Apulia
Else (linoleic<946)
If palmitic<1116
If palmitoleic >= 111.5 then the area is South Apulia
26
Else the area is North Apulia
Else (palmitic>=1116)
If linolenic<37
If palmitoleic >= 111.5 then the area is South Apulia
Else the area is North Apulia
Else (linolenic>37) the area is Calabria
The test error associated with this classification is 3%.
27
4
Summary
The classification rule (with values given in % ×100) is then:
If eicosenoic acid > 7 then the region is South
If eicosenoic acid < 34.5 then the area is Sicily
Else
If stearic < 261
If arachidic < 84.5 the area is Sicily
If stearic ≥ 261 OR if (stearic < 261 and arachidic ≥ 84.5)
If linoleic ≥ 946
If palmitoleic ≥ 111.5 the area is South Apulia
Else the area is North Apulia
Else (linoleic <946)
If palmitic ≥ 1116
If palmitoleic ≥ 111.5 the area is South Apulia
Else the area is North Apulia
Else (palmitic ≥ 1116)
If linolenic < 37
If palmitoleic ≥ 111.5 then the area is South Apulia
Else the area is North Apulia
Else (linolenic > 37) the area is Calabria
Else
If 0.0003×palmitic +0.0153×palmitoleic +0.01423×stearic +0.0010×oleic −0.0097×linoleic 0.0105×linolenic −0.0236×arachidic −0.1193×eicosenoic - 2.12 > -1.49 then the region is Sardinia
If 0.0097×palmitic +0.0018×palmitoleic +0.0276×stearic +0.00075×oleic +0.0271×linoleic+
0.0750×linolenic −0.0260×arachidic −0.0400×eicosenoic - 55.04 > 0.73 then the sample is
from inland Sardinia
Else the sample is from coastal Sardinia
Else the region is North
If 0.0285×palmitic +0.0009×palmitoleic +0.0317×stearic +0.0187×oleic +
0.0180×linoleic −0.0264×linolenic +0.0768×arachidic +0.0795×eicosenoic - 200.0 > 0.76
then the sample is from East Liguria
Else
if 0.0124×palmitic +0.0343×palmitoleic +0.0261×stearic +0.0122×oleic +
0.0172×linoleic −0.0627×linolenic −0.0342×arachidic −0.2244×eicosenoic - 128.0 > -0.03
then the sample is from West Liguria
Else it is from Umbria
The error on the test set overall is (3 + 7 + 2)/136 = 0.09 including Sicily and 5/127 = 0.04 without
Sicily, and separately on the areas is given in Table 4. The important variables in each of the separations
are given in Table 5.
There are several qualifications about the study:
• We are assuming that the data samples have been collected appropriately, that the samples are representative of the oils of the areas they are purported to be produced in.
• There is a big difference in the sample sizes for each class. South Apulia is the most abundantly
represented and form the internet search of material this is one of the most prolific olive oil producing
28
Classes
Overall (with Sicily)
Overall (without Sicily)
South vs Others
North vs Sardinia
Inland vs Coastal Sardinia
East Liguria vs Others
Umbria vs West Liguria
Sicily vs Others
Calabria vs Others
North vs South Apulia
Test Error
12/136=0.09
5/127=0.04
0/136 =0
0/59 =0
0/24=0
3/35=0.09
0/23=0
7/77=0.09
2/68=0.03
0/54=0
Table 4: Prediction errors
Separation
Regions
South/Others
Sardinia/North
Sardinia
Coast/Inland
North
E.Liguria/Others
W.Liguria/Umbria
South
Sicily/Others
Calab/Others
N.Apulia/S.Apulia
palmitic
palmitoleic
stear
oleic
linoleic
x
x
x
x
x
x
x
x
x
x
linolenic
arachidic
eicosenoic
x
x
x
x
x
x
x
x
x
x
x
x
Table 5: This table marks the important variables for each separation.
areas. North Apulia has the least samples, only 25. Perhaps the sample size is indicative of the
productivity and should perhaps be incorporated into the classifiers.
29
References
Burges, C. J. C. (1998), ‘A Tutorial Support Vector Machines for Pattern Recognition’, Data Mining and
Knowledge Discovery 2, 121–167.
Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of olive oils from their fatty
acid composition, in H. Martens & H. Russwurm Jr., eds, ‘Food Research and Data Analysis’, Applied
Science Publishers, London, pp. 189–214.
Vapnik, V. (1979), Estimation of Dependence Based on Empirical Data (In Russian), Nauka, Moscow.
Vapnik, V. (1999), The Nature of Statistical Learning Theory (Statistics for Engineering and Information
Science), Springer-Verlag, New York, NY.
Venables, W. N. & Ripley, B. (1994), Modern Applied Statistics with S-Plus, Springer-Verlag, New York.
30