Classification II: Decision Trees and SVMs
Digging into Data
March 3, 2014
Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah
Roadmap
Classification: machines labeling data for us
Last time: naïve Bayes and logistic regression
This time:
- Decision Trees: simple, nonlinear, interpretable
- SVMs: (another) example of a linear classifier; state-of-the-art classification
Examples in Rattle (Logistic, SVM, Trees)
Discussion: Which classifier should I use for my problem?
Outline
1. Decision Trees
2. Learning Decision Trees
3. Vector space classification
4. Linear Classifiers
5. Support Vector Machines
6. Recap
Trees
Suppose that we want to construct a set of rules to represent the data:
- we can represent the data as a series of if-then statements
- here, "if" splits inputs into two categories
- "then" assigns a value
- when "if" statements are nested, the structure is called a tree

[Tree diagram:
X1 > 5?
├─ True: X1 > 8?
│   ├─ True → FALSE
│   └─ False → TRUE
└─ False: X2 < −2?
    ├─ True: X3 > 0?
    │   ├─ True → TRUE
    │   └─ False → FALSE
    └─ False → TRUE]
Trees
Example: data (X1, X2, X3, Y) with X1, X2, X3 real and Y Boolean.
First, see if X1 > 5:
1. if TRUE, see if X1 > 8
   - if TRUE, return FALSE
   - if FALSE, return TRUE
2. if FALSE, see if X2 < −2
   - if TRUE, see if X3 > 0
     - if TRUE, return TRUE
     - if FALSE, return FALSE
   - if FALSE, return TRUE
Trees

(Same tree as above.)

Example 1: (X1, X2, X3) = (1, 1, 1) → TRUE
Example 2: (X1, X2, X3) = (10, 3, 0) → FALSE
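To make the if-then structure concrete, here is a minimal sketch of this example tree as a Python function (the function and variable names are ours, not part of the course materials):

```python
def example_tree(x1, x2, x3):
    """The example tree above: nested if-then statements returning the Boolean label Y."""
    if x1 > 5:
        # True branch: test X1 > 8
        return False if x1 > 8 else True
    # False branch: test X2 < -2
    if x2 < -2:
        return x3 > 0        # X3 > 0 -> TRUE, otherwise FALSE
    return True

print(example_tree(1, 1, 1))    # Example 1 -> True
print(example_tree(10, 3, 0))   # Example 2 -> False
```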
Trees
Terminology:
- branches: one side of a split
- leaves: terminal nodes that return values

Why trees?
- trees can be used for regression or classification
  - regression: the returned value is a real number
  - classification: the returned value is a class
- unlike linear regression, SVMs, naive Bayes, etc., trees fit local models
  - in large spaces, global models may be hard to fit
  - results may be hard to interpret
- fast, interpretable predictions
Example: Predicting Electoral Results
2008 Democratic primary:
Hillary Clinton
Barack Obama
Given historical data, how will a county (a small administrative unit inside an American state) vote?
- can extrapolate to state-level data
- might suggest regions to focus on for increasing voter turnout
- would like to know how variables interact
Example: Predicting Electoral Results
[Figure: Classification tree for county-level outcomes in the 2008 Democratic Party primary (as of April 16), by Amanda Cox for the New York Times.]
Decision Trees
Decision tree representation:
Each internal node tests an attribute
Each branch corresponds to attribute value
Each leaf node assigns a classification
How would we represent each of the following as a function of X, Y? (a sketch for the XOR case follows below)
X AND Y (both must be true)
X OR Y (either can be true)
X XOR Y (one and only one is true)
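XOR is the interesting case: it needs a test on both attributes, i.e. a two-level tree. A quick sketch in plain Python (names are ours):

```python
def xor_tree(x, y):
    """X XOR Y as nested if-then statements (a two-level decision tree)."""
    if x:
        return not y   # X true: label is NOT Y
    return y           # X false: label is Y

print(xor_tree(True, False), xor_tree(True, True))   # True False
```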
When to Consider Decision Trees
Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Examples:
Equipment or medical diagnosis
Credit risk analysis
Modeling calendar scheduling preferences
Outline
1. Decision Trees
2. Learning Decision Trees
3. Vector space classification
4. Linear Classifiers
5. Support Vector Machines
6. Recap
Top-Down Induction of Decision Trees
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes

Which attribute is best?

[Figure: two candidate splits of S = [29+, 35−]: A1 = ? gives t: [21+, 5−] and f: [8+, 30−]; A2 = ? gives t: [18+, 33−] and f: [11+, 2−].]
Entropy: Reminder
[Plot: Entropy(S) as a function of p₊, rising from 0 at p₊ = 0 to 1.0 at p₊ = 0.5 and falling back to 0 at p₊ = 1.0.]

S is a sample of training examples
p₊ is the proportion of positive examples in S
Entropy
How spread out is the distribution of S?

Entropy(S) ≡ −p₊ log₂ p₊ − p₋ log₂ p₋

where p₊ and p₋ are the proportions of positive and negative examples in S.
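As a quick check of the formula, a small sketch in plain Python (function names are ours):

```python
import math

def entropy(p_pos):
    """Entropy of a sample with proportion p_pos of positive examples."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:                    # treat 0 * log2(0) as 0
            total -= p * math.log2(p)
    return total

print(entropy(0.5))   # 1.0: maximally spread out
print(entropy(1.0))   # 0.0: all examples are positive
```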
Information Gain
Which feature A would be a more useful rule in our decision tree?

Gain(S, A) = expected reduction in entropy due to sorting on A

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)

[Figure: the two candidate splits again: A1 = ? splits [29+, 35−] into t: [21+, 5−] and f: [8+, 30−]; A2 = ? splits it into t: [18+, 33−] and f: [11+, 2−].]
H(S) = −(29/64) log₂(29/64) − (35/64) log₂(35/64) ≈ 0.99

Gain(S, A1) = 0.99 − (26/64)[−(21/26) log₂(21/26) − (5/26) log₂(5/26)]
                   − (38/64)[−(8/38) log₂(8/38) − (30/38) log₂(30/38)]
            ≈ 0.99 − 0.29 − 0.44 ≈ 0.27

Gain(S, A2) = 0.99 − (51/64)[−(18/51) log₂(18/51) − (33/51) log₂(33/51)]
                   − (13/64)[−(11/13) log₂(11/13) − (2/13) log₂(2/13)]
            ≈ 0.99 − 0.75 − 0.13 ≈ 0.12

A1 gives the larger gain, so it is the better attribute to split on.
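The worked example can be reproduced with a short script (a sketch; the rounded values match the ones above):

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples."""
    h, total = 0.0, pos + neg
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def gain(parent, splits):
    """Information gain of splitting parent = (pos, neg) into the given (pos, neg) subsets."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in splits)

print(round(entropy(29, 35), 2))                      # 0.99
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # 0.27  (A1)
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # 0.12  (A2)
```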
ID3 Algorithm
Start at the root, look for the best attribute
Repeat for subtrees at each attribute outcome
Stop when information gain is below a threshold
Bias: prefers shorter trees (Occam's Razor)
→ a short hypothesis that fits the data is unlikely to be a coincidence
→ a long hypothesis that fits the data might be a coincidence
Prevents overfitting (more later)
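A compact sketch of that loop for binary features and Boolean labels (all names are ours; a real implementation would also apply the information-gain threshold and handle continuous attributes):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    g = entropy(labels)
    for value in set(x[feature] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[feature] == value]
        g -= len(subset) / len(labels) * entropy(subset)
    return g

def id3(examples, labels, features):
    if len(set(labels)) == 1:                       # perfectly classified: stop
        return labels[0]
    if not features:                                # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(examples, labels, f))
    tree = {best: {}}
    for value in set(x[best] for x in examples):    # one descendant per value of `best`
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

# Tiny hypothetical dataset: label = X AND Y
X = [{"x": 0, "y": 0}, {"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 1}]
Y = [False, False, False, True]
print(id3(X, Y, ["x", "y"]))
```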
Outline
1. Decision Trees
2. Learning Decision Trees
3. Vector space classification
4. Linear Classifiers
5. Support Vector Machines
6. Recap
Thinking Geometrically
Suppose you have two classes: vacations and sports
Suppose you have four documents
Sports: Doc1 = {ball, ball, ball, travel}, Doc2 = {ball, ball}
Vacations: Doc3 = {travel, ball, travel}, Doc4 = {travel}

What does this look like in vector space?
Put the documents in vector space
[Plot: each document as a point in the (ball, travel) plane: Doc1 at (3, 1), Doc2 at (2, 0), Doc3 at (1, 2), Doc4 at (0, 1).]
Vector space representation of documents
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 10,000s of dimensions and more
How can we do classification in this space?
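For the four toy documents above, the term-count vectors can be built directly; with only two terms, each document is just a point in the (ball, travel) plane (a sketch, names are ours):

```python
docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],
    "Doc2": ["ball", "ball"],
    "Doc3": ["travel", "ball", "travel"],
    "Doc4": ["travel"],
}
terms = ["ball", "travel"]          # the axes of the vector space

vectors = {name: [tokens.count(t) for t in terms] for name, tokens in docs.items()}
for name, vec in vectors.items():
    print(name, vec)                # e.g. Doc1 [3, 1]
```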
Vector space classification
As before, the training set is a set of documents, each labeled with its class.
In vector space classification, this set corresponds to a labeled set of points
or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don’t overlap.
We define lines, surfaces, hypersurfaces to divide regions.
Classes in the vector space

[Figure: documents plotted in the vector space, grouped into three class regions labeled UK, China, and Kenya, plus a new, unlabeled document ⋆.]

Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators, ⋆ should be assigned to China.
How do we find separators that do a good job at classifying new documents like ⋆?
Outline
1. Decision Trees
2. Learning Decision Trees
3. Vector space classification
4. Linear Classifiers
5. Support Vector Machines
6. Recap
Linear classifiers
Definition:
- A linear classifier computes a linear combination or weighted sum Σᵢ wᵢxᵢ of the feature values.
- Classification decision: Σᵢ wᵢxᵢ > θ?
- ...where θ (the threshold) is a parameter.

(First, we only consider binary classifiers.)
Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
We call this the separator or decision boundary.
We find the separator based on the training set.
Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
Assumption: The classes are linearly separable.

Before, we just talked about equations. What's the geometric intuition?
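The classification decision itself is tiny; a sketch of the rule Σᵢ wᵢxᵢ > θ with made-up weights and threshold:

```python
def linear_classify(weights, x, theta):
    """Return True if the weighted sum of feature values exceeds the threshold."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return score > theta

# Hypothetical 2D separator: 2*x1 + 1*x2 = 3
print(linear_classify([2.0, 1.0], [3, 1], theta=3.0))   # True  (score 7)
print(linear_classify([2.0, 1.0], [0, 1], theta=3.0))   # False (score 1)
```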
A linear classifier in 1D
A linear classifier in 1D is a point x described by the equation w₁d₁ = θ, i.e. x = θ/w₁.
Points (d₁) with w₁d₁ ≥ θ are in the class c.
Points (d₁) with w₁d₁ < θ are in the complement class c̄.

[Figure: a number line with the point x = θ/w₁ separating the two classes.]
A linear classifier in 2D
A linear classifier in 2D is a line described by the equation w₁d₁ + w₂d₂ = θ.
Points (d₁, d₂) with w₁d₁ + w₂d₂ ≥ θ are in the class c.
Points (d₁, d₂) with w₁d₁ + w₂d₂ < θ are in the complement class c̄.

[Figure: example of a 2D linear classifier, a line separating the plane into the two class regions.]
A linear classifier in 3D
A linear classifier in 3D is a plane described by the equation w₁d₁ + w₂d₂ + w₃d₃ = θ.
Points (d₁, d₂, d₃) with w₁d₁ + w₂d₂ + w₃d₃ ≥ θ are in the class c.
Points (d₁, d₂, d₃) with w₁d₁ + w₂d₂ + w₃d₃ < θ are in the complement class c̄.

[Figure: example of a 3D linear classifier, a plane separating the space into the two class regions.]
Naive Bayes and Logistic Regression as linear classifiers
Multinomial Naive Bayes is a linear classifier (in log space) defined by:

Σᵢ₌₁ᴹ wᵢdᵢ = θ

where wᵢ = log[P̂(tᵢ|c)/P̂(tᵢ|c̄)], dᵢ = number of occurrences of tᵢ in d, and θ = −log[P̂(c)/P̂(c̄)]. Here, the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary.

Logistic regression is the same (we only put it into the logistic function to turn it into a probability).

Takeaway
Naïve Bayes, logistic regression and SVM (which we'll get to in a second) are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).
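A sketch of how those Naive Bayes weights and threshold could be computed from class counts for the sports/vacations toy data (add-one smoothing is our assumption, not something stated on the slide):

```python
import math

def nb_linear(term_counts_c, term_counts_cbar, prior_c, prior_cbar):
    """w_i = log[P(t_i|c) / P(t_i|c_bar)] and theta = -log[P(c) / P(c_bar)]."""
    vocab = set(term_counts_c) | set(term_counts_cbar)
    n_c, n_cbar = sum(term_counts_c.values()), sum(term_counts_cbar.values())
    weights = {}
    for t in vocab:
        p_t_c = (term_counts_c.get(t, 0) + 1) / (n_c + len(vocab))          # add-one smoothing
        p_t_cbar = (term_counts_cbar.get(t, 0) + 1) / (n_cbar + len(vocab))
        weights[t] = math.log(p_t_c / p_t_cbar)
    theta = -math.log(prior_c / prior_cbar)
    return weights, theta

# c = Sports (Doc1, Doc2), c_bar = Vacations (Doc3, Doc4)
w, theta = nb_linear({"ball": 5, "travel": 1}, {"ball": 1, "travel": 3}, 0.5, 0.5)
doc = {"ball": 2, "travel": 0}                       # counts d_i for a new document
score = sum(w[t] * doc[t] for t in doc)              # sum_i w_i * d_i
print("sports" if score > theta else "vacations")    # ball-heavy -> sports
```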
Which hyperplane?
For linearly separable training sets: there are infinitely many separating
hyperplanes.
They all separate the training set perfectly . . .
. . . but they behave differently on test data.
Error rates on new data are low for some, high for others.
How do we find a low-error separator?
Outline
1. Decision Trees
2. Learning Decision Trees
3. Vector space classification
4. Linear Classifiers
5. Support Vector Machines
6. Recap
Support vector machines
Machine-learning research in the last two decades has improved classifier
effectiveness.
New generation of state-of-the-art classifiers: support vector machines
(SVMs), boosted decision trees, regularized logistic regression, neural
networks, and random forests
Applications to IR problems, particularly text classification
SVMs: A kind of large-margin classifier
Vector space based machine-learning method aiming to find a decision boundary
between two classes that is maximally far from any point in the training data
(possibly discounting some points as outliers or noise)
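A sketch of a linear SVM on a tiny 2D dataset, using scikit-learn (our choice of tool; the course demos use Rattle). It reports the margin width 2/‖w‖ and the support vectors that define the separator:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],      # class 0
              [4.0, 4.0], [5.0, 3.5], [4.5, 5.0]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)            # very large C ~ hard margin

w = clf.coef_[0]
print("separator: w =", w, "b =", clf.intercept_[0])
print("margin width:", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```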
Support Vector Machines

2-class training data
decision boundary → linear separator
criterion: being maximally far away from any data point → determines classifier margin
linear separator position defined by support vectors

[Figure: two classes of points with the maximum margin decision hyperplane between them; the margin is maximized and the support vectors lie on its boundary.]
Why maximize the margin?
Points near the decision surface → uncertain classification decisions (50% either way).
A classifier with a large margin makes no low-certainty classification decisions.
Gives a classification safety margin w.r.t. slight errors in measurement or document variation.

[Figure: maximum margin decision hyperplane with support vectors; the margin is maximized.]
Why maximize the margin?
SVM classifier: large margin around decision boundary
compare to decision hyperplane: place a fat separator between classes
- unique solution
- decreased memory capacity
- increased ability to correctly generalize to test data
SVM extensions
Slack variables: the separator does not have to be a perfect line
Kernels: different geometries
Loss functions: different penalties for getting the answer wrong

[Figures: example decision boundaries illustrating slack variables and kernels.]
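A sketch of the first two extensions with scikit-learn (again our choice of tool): C controls the slack variables (smaller C tolerates more margin violations), and the kernel changes the geometry of the decision boundary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # circular class boundary

soft_linear = SVC(kernel="linear", C=0.1).fit(X, y)   # slack: some points end up misclassified
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)              # kernel: nonlinear geometry

print("linear kernel training accuracy:", soft_linear.score(X, y))
print("rbf kernel training accuracy:   ", rbf.score(X, y))   # should be much higher here
```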
Outline
1. Decision Trees
2. Learning Decision Trees
3. Vector space classification
4. Linear Classifiers
5. Support Vector Machines
6. Recap
Classification
Many commercial applications
There are many applications of text classification for corporate Intranets,
government departments, and Internet publishers.
Often greater performance gains come from exploiting domain-specific features than from changing from one machine learning method to another. (Homework 3)
Choosing what kind of classifier to use
When building a text classifier, the first question is: how much training data is currently available?
- None? Hand-write rules or use active learning
- Very little? Naïve Bayes
- A fair amount? SVM
- A huge amount? Doesn't matter, use whatever works

Need the result to be interpretable? Decision trees
Recap
Is there a learning method that is optimal for all text classification problems?
No, because there is a tradeoff between bias and variance.

Factors to take into account:
- How much training data is available?
- How simple/complex is the problem? (linear vs. nonlinear decision boundary)
- How noisy is the problem?
- How stable is the problem over time?
  - For an unstable problem, it's better to use a simple and robust classifier.

You'll be investigating the role of features in HW3!