Classification II: Decision Trees and SVMs Digging into Data March 3, 2014
Transcript
Classification II: Decision Trees and SVMs
Digging into Data, March 3, 2014
Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah

Roadmap
Classification: machines labeling data for us.
Last time: naïve Bayes and logistic regression.
This time:
- Decision trees: simple, nonlinear, interpretable
- SVMs: (another) example of a linear classifier; state-of-the-art classification
- Examples in Rattle (logistic regression, SVM, trees)
- Discussion: which classifier should I use for my problem?

Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap

Trees
Suppose that we want to construct a set of rules to represent the data:
- we can represent the data as a series of if-then statements
- here, "if" splits the inputs into two categories
- "then" assigns a value
- when "if" statements are nested, the structure is called a tree
[Figure: a decision tree that first tests X1 > 5, then either X1 > 8 or X2 < -2, and finally X3 > 0, with TRUE/FALSE labels at the leaves.]

Trees
Example: data (X1, X2, X3, Y), where X1, X2, X3 are real-valued and Y is Boolean.
First, see if X1 > 5:
- if TRUE, see if X1 > 8:
  - if TRUE, return FALSE
  - if FALSE, return TRUE
- if FALSE, see if X2 < -2:
  - if TRUE, return TRUE
  - if FALSE, see if X3 > 0:
    - if TRUE, return TRUE
    - if FALSE, return FALSE

Example 1: (X1, X2, X3) = (1, 1, 1) -> TRUE
Example 2: (X1, X2, X3) = (10, 3, 0) -> FALSE

Trees
Terminology:
- branches: one side of a split
- leaves: terminal nodes that return values

Why trees?
- Trees can be used for regression or classification:
  - regression: the returned value is a real number
  - classification: the returned value is a class
- Unlike linear regression, SVMs, naïve Bayes, etc., trees fit local models:
  - in large spaces, global models may be hard to fit and their results may be hard to interpret
- Trees give fast, interpretable predictions.
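The nested if-then structure translates directly into code. Here is a minimal sketch (Python is used purely for illustration; it is not part of the slides) of the example tree above, evaluated on the two example points:

```python
def example_tree(x1, x2, x3):
    """The example decision tree from the slides, returning a Boolean label."""
    if x1 > 5:
        return not (x1 > 8)   # X1 > 8 -> FALSE, otherwise TRUE
    if x2 < -2:
        return True
    return x3 > 0             # X3 > 0 -> TRUE, otherwise FALSE

print(example_tree(1, 1, 1))    # Example 1: (1, 1, 1)  -> True
print(example_tree(10, 3, 0))   # Example 2: (10, 3, 0) -> False
```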
Example: Predicting Electoral Results
2008 Democratic primary: Hillary Clinton vs. Barack Obama.
Given historical data, how will a county (a small administrative unit inside an American state) vote?
- we can extrapolate to state-level data
- it might give regions to focus on for increasing voter turnout
- we would like to know how the variables interact

Figure 1: Classification tree for county-level outcomes in the 2008 Democratic Party primary (as of April 16), by Amanda Cox for The New York Times.

Decision Trees
Decision tree representation:
- each internal node tests an attribute
- each branch corresponds to an attribute value
- each leaf node assigns a classification
How would we represent, as a function of X and Y:
- X AND Y (both must be true)
- X OR Y (either can be true)
- X XOR Y (one and only one is true)

When to Consider Decision Trees
- Instances describable by attribute-value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data
Examples: equipment or medical diagnosis, credit risk analysis, modeling calendar scheduling preferences.

Outline: 2 Learning Decision Trees

Top-Down Induction of Decision Trees
Main loop:
1 A <- the "best" decision attribute for the next node
2 Assign A as the decision attribute for the node
3 For each value of A, create a new descendant of the node
4 Sort the training examples to the leaf nodes
5 If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
Which attribute is best?
[Figure: the sample S = [29+, 35-]. Attribute A1 splits it into [21+, 5-] (t) and [8+, 30-] (f); attribute A2 splits it into [18+, 33-] (t) and [11+, 2-] (f).]

Entropy: Reminder
[Figure: Entropy(S) plotted against p_+; it is 0 when p_+ is 0 or 1 and reaches its maximum of 1.0 at p_+ = 0.5.]
S is a sample of training examples; p_+ is the proportion of positive examples in S and p_- the proportion of negative examples.

Entropy
How spread out is the distribution of S?
Entropy(S) ≡ -p_+ log2(p_+) - p_- log2(p_-)

Information Gain
Which feature A would be a more useful rule in our decision tree?
Gain(S, A) = the expected reduction in entropy due to sorting on A:
Gain(S, A) ≡ Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
[Figure: the same S = [29+, 35-], split by A1 into [21+, 5-] / [8+, 30-] and by A2 into [18+, 33-] / [11+, 2-].]
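As a sanity check on these definitions, here is a short sketch (Python, for illustration only) that implements entropy and information gain and applies them to the [29+, 35-] sample split by A1 and A2; it reproduces the worked calculation that follows.

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def information_gain(parent, children):
    """Expected reduction in entropy from splitting parent into children.
    Each argument is a (pos, neg) pair; children is a list of such pairs."""
    total = sum(parent)
    return entropy(*parent) - sum(
        (pos + neg) / total * entropy(pos, neg) for pos, neg in children
    )

S = (29, 35)
print(entropy(*S))                               # ~0.99
print(information_gain(S, [(21, 5), (8, 30)]))   # A1: ~0.27
print(information_gain(S, [(18, 33), (11, 2)]))  # A2: ~0.12, so A1 is the better split
```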
Worked example for the split above:
H(S) = -(29/64) log2(29/64) - (35/64) log2(35/64) ≈ 0.994

Gain(S, A1) = H(S) - (26/64)·H([21+, 5-]) - (38/64)·H([8+, 30-]) ≈ 0.994 - 0.287 - 0.441 ≈ 0.27

Gain(S, A2) = H(S) - (51/64)·H([18+, 33-]) - (13/64)·H([11+, 2-]) ≈ 0.994 - 0.747 - 0.126 ≈ 0.12

A1 yields the larger information gain, so it is the better attribute to split on.

ID3 Algorithm
- Start at the root and look for the best attribute.
- Repeat for the subtrees at each attribute outcome.
- Stop when the information gain is below a threshold.
- Bias: prefers shorter trees (Occam's Razor)
  - a short hypothesis that fits the data is unlikely to be a coincidence
  - a long hypothesis that fits the data might be a coincidence
- Prevents overfitting (more later).

Outline: 3 Vector space classification

Thinking Geometrically
Suppose you have two classes: vacations and sports.
Suppose you have four documents:
- Sports: Doc1: {ball, ball, ball, travel}, Doc2: {ball, ball}
- Vacations: Doc3: {travel, ball, travel}, Doc4: {travel}
What does this look like in vector space?

Put the documents in vector space
[Figure: the four documents plotted on the axes Ball (x) and Travel (y): Doc1 at (3, 1), Doc2 at (2, 0), Doc3 at (1, 2), Doc4 at (0, 1).]

Vector space representation of documents
- Each document is a vector, one component for each term.
- Terms are axes.
- High dimensionality: 10,000s of dimensions and more.
- How can we do classification in this space?
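To make the vector space concrete, here is a small sketch (Python; the code is illustrative and not from the slides) that maps the four documents above to term-count vectors on the two axes, ball and travel:

```python
from collections import Counter

docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],  # Sports
    "Doc2": ["ball", "ball"],                    # Sports
    "Doc3": ["travel", "ball", "travel"],        # Vacations
    "Doc4": ["travel"],                          # Vacations
}

# One vector component per term; here the only axes are (ball, travel).
for name, words in docs.items():
    counts = Counter(words)
    print(name, (counts["ball"], counts["travel"]))
# Doc1 (3, 1)   Doc2 (2, 0)   Doc3 (1, 2)   Doc4 (0, 1)
```

A real text collection would have one axis per vocabulary term, i.e. tens of thousands of dimensions, but the mechanics are the same.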
Vector space classification
- As before, the training set is a set of documents, each labeled with its class.
- In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
- Premise 1: documents in the same class form a contiguous region.
- Premise 2: documents from different classes don't overlap.
- We define lines, surfaces, hypersurfaces to divide the regions.

Classes in the vector space
[Figure: documents from three classes, UK, China, and Kenya, shown as points in the vector space, together with an unlabeled document marked "?".]
- Should the document "?" be assigned to China, UK, or Kenya?
- Find separators between the classes.
- Based on these separators, "?" should be assigned to China.
- How do we find separators that do a good job at classifying new documents like "?"?

Outline: 4 Linear Classifiers

Linear classifiers
Definition:
- A linear classifier computes a linear combination or weighted sum Σ_i w_i x_i of the feature values.
- Classification decision: Σ_i w_i x_i > θ?, where θ (the threshold) is a parameter.
- (First, we only consider binary classifiers.)
Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities). We call this the separator or decision boundary.
We find the separator based on the training set. Methods for finding the separator: logistic regression, naïve Bayes, linear SVM.
Assumption: the classes are linearly separable.
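In code, the classification decision is just a weighted sum compared against the threshold. A minimal sketch (the weights and threshold below are made-up numbers, purely for illustration):

```python
def linear_classify(weights, x, theta):
    """Assign x to class c if the weighted sum of its features exceeds theta."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return score > theta

weights = [2.0, -1.0]   # hypothetical weights for two features
theta = 1.0             # hypothetical threshold

print(linear_classify(weights, [1.5, 0.5], theta))   # score 2.5  -> True  (class c)
print(linear_classify(weights, [0.2, 1.0], theta))   # score -0.6 -> False (complement class)
```

Logistic regression, naïve Bayes, and linear SVMs all produce a rule of exactly this form; they differ only in how the weights and threshold are learned.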
Before, we just talked about equations. What's the geometric intuition?

A linear classifier in 1D
- A linear classifier in 1D is a point x described by the equation w1 d1 = θ, i.e. x = θ / w1.
- Points (d1) with w1 d1 ≥ θ are in the class c.
- Points (d1) with w1 d1 < θ are in the complement class c̄.

A linear classifier in 2D
- A linear classifier in 2D is a line described by the equation w1 d1 + w2 d2 = θ.
- Points (d1, d2) with w1 d1 + w2 d2 ≥ θ are in the class c.
- Points (d1, d2) with w1 d1 + w2 d2 < θ are in the complement class c̄.
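In the 2D case the decision boundary can be written out explicitly: solving w1 d1 + w2 d2 = θ for d2 gives the separating line. A small sketch with hypothetical numbers:

```python
# Hypothetical 2D linear classifier: 1.0 * d1 + 2.0 * d2 = 3.0
w1, w2, theta = 1.0, 2.0, 3.0

def boundary_d2(d1):
    """d2-coordinate of the separating line at a given d1."""
    return (theta - w1 * d1) / w2

def in_class_c(d1, d2):
    """Points with weighted sum >= theta (here: on or above the line) are in class c."""
    return w1 * d1 + w2 * d2 >= theta

print(boundary_d2(1.0))       # 1.0: the boundary passes through (1.0, 1.0)
print(in_class_c(2.0, 2.0))   # 6.0 >= 3.0 -> True  (class c)
print(in_class_c(0.0, 1.0))   # 2.0 >= 3.0 -> False (complement class)
```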
A linear classifier in 3D
- A linear classifier in 3D is a plane described by the equation w1 d1 + w2 d2 + w3 d3 = θ.
- Points (d1, d2, d3) with w1 d1 + w2 d2 + w3 d3 ≥ θ are in the class c.
- Points (d1, d2, d3) with w1 d1 + w2 d2 + w3 d3 < θ are in the complement class c̄.

Naive Bayes and Logistic Regression as linear classifiers
Multinomial Naive Bayes is a linear classifier (in log space) whose decision boundary is defined by
Σ_{i=1}^{M} w_i d_i = θ
where w_i = log[P̂(t_i|c) / P̂(t_i|c̄)], d_i = the number of occurrences of t_i in d, and θ = -log[P̂(c) / P̂(c̄)]. Here the index i, 1 ≤ i ≤ M, refers to the terms of the vocabulary.
Logistic regression has the same linear form (we only pass the score through the logistic function to turn it into a probability).
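A sketch of this reduction in code (the probability estimates below are made-up, purely to show the mechanics): the per-term weights are log probability ratios and the threshold comes from the class priors.

```python
from math import log

# Hypothetical smoothed estimates for a 3-term vocabulary.
p_t_given_c    = [0.5, 0.3, 0.2]   # P(t_i | c)
p_t_given_cbar = [0.2, 0.3, 0.5]   # P(t_i | c-bar)
p_c, p_cbar = 0.4, 0.6             # class priors

w = [log(a / b) for a, b in zip(p_t_given_c, p_t_given_cbar)]
theta = -log(p_c / p_cbar)

def nb_assigns_to_c(term_counts):
    """Multinomial NB as a linear rule: assign d to c if sum_i w_i * d_i > theta."""
    return sum(wi * di for wi, di in zip(w, term_counts)) > theta

print(nb_assigns_to_c([3, 1, 0]))   # dominated by term 1 (favors c)     -> True
print(nb_assigns_to_c([0, 1, 4]))   # dominated by term 3 (favors c-bar) -> False
```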
Takeaway
Naïve Bayes, logistic regression, and the SVM (which we'll get to in a second) are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).

Which hyperplane?
For linearly separable training sets there are infinitely many separating hyperplanes. They all separate the training set perfectly . . . but they behave differently on test data. Error rates on new data are low for some, high for others. How do we find a low-error separator?

Outline: 5 Support Vector Machines

Support vector machines
- Machine-learning research in the last two decades has improved classifier effectiveness.
- New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, neural networks, and random forests.
- Applications to IR problems, particularly text classification.
SVMs: a kind of large-margin classifier. A vector-space-based machine-learning method that aims to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).

Support Vector Machines
[Figure: 2-class training data with the maximum-margin separator.]
- 2-class training data
- decision boundary -> linear separator
- criterion: being maximally far away from any data point -> determines the classifier margin
- the linear separator's position is defined by the support vectors
- the margin is maximized
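As a sketch of what this looks like in practice, here is a linear SVM fit on a tiny, made-up 2D dataset. The lecture demonstrates classifiers in Rattle; the scikit-learn code below is only an illustrative stand-in.

```python
from sklearn.svm import SVC

# Tiny, made-up, linearly separable 2-class dataset in 2D.
X = [[1, 1], [2, 1], [1, 2],    # class 0
     [4, 4], [5, 4], [4, 5]]    # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)   # linear kernel -> a maximum-margin line
clf.fit(X, y)

# The separator's position is determined entirely by the support vectors.
print(clf.support_vectors_)
print(clf.predict([[2, 2], [5, 5]]))   # -> [0 1]
```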
Why maximize the margin?
- Points near the decision surface represent uncertain classification decisions (50% either way).
- A classifier with a large margin makes no low-certainty classification decisions.
- This gives a classification safety margin with respect to slight errors in measurement or document variation.
[Figure: the maximum-margin decision hyperplane, its support vectors, and the maximized margin.]

Why maximize the margin?
SVM classifier: a large margin around the decision boundary.
- Compared to a plain decision hyperplane, we place a fat separator between the classes:
  - a unique solution
  - decreased memory capacity
  - increased ability to correctly generalize to test data

SVM extensions
- Slack variables: the data need not be separable by a perfect line; some points may violate the margin.
- Kernels: different geometries for the decision boundary.
- Loss functions: different penalties for getting the answer wrong.
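These extensions appear as parameters in most SVM implementations. For instance (again an illustrative scikit-learn sketch with made-up data), the amount of slack is controlled by C and the geometry by the kernel choice:

```python
from sklearn.svm import SVC

# Made-up 2D data that no straight line can separate: class 1 sits inside class 0.
X = [[0, 0], [0.5, 0], [0, 0.5], [-0.5, 0], [0, -0.5],
     [3, 0], [0, 3], [-3, 0], [0, -3], [2, 2]]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# kernel: changes the geometry of the boundary (here a radial basis function).
# C: controls slack; a smaller C tolerates more margin violations.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.1, 0.1], [2.5, 2.5]]))   # inner point -> 1, outer point -> 0
```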
Outline: 6 Recap

Classification: many commercial applications
There are many applications of text classification for corporate intranets, government departments, and Internet publishers.
Often there are greater performance gains from exploiting domain-specific features than from changing from one machine learning method to another. (Homework 3)

Choosing what kind of classifier to use
When building a text classifier, the first question is: how much training data is currently available?
- None? Hand-write rules or use active learning.
- Very little? Naïve Bayes.
- A fair amount? SVM.
- A huge amount? Doesn't matter; use whatever works.
- Need it to be interpretable? Decision trees.

Recap
Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between bias and variance.
Factors to take into account:
- How much training data is available?
- How simple/complex is the problem? (linear vs. nonlinear decision boundary)
- How noisy is the problem?
- How stable is the problem over time?
  - For an unstable problem, it's better to use a simple and robust classifier.
You'll be investigating the role of features in HW3!