Data Mining – Output: Knowledge Representation Chapter 3

Representing Structural Patterns
• There are many different ways of representing patterns
• Two were covered in Chapter 1 – decision trees and classification rules
• A learned pattern is a form of “knowledge representation” (even if the knowledge does not seem very impressive)
Decision Trees
• Make decisions by following branches down the tree until a leaf is found (see the traversal sketch below)
• Classification is based on the contents of the leaf
• Non-leaf nodes usually involve testing a single attribute
– Usually one branch per value of a nominal attribute, or branches over ranges of a numeric attribute (most commonly a two-way split: values below some threshold vs. values at or above it)
– Less commonly, the test compares two attribute values, or some function of multiple attributes
• Once an attribute has been used, it is common for it not to be used again at a lower level of the same branch
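To make the traversal concrete, here is a minimal Python sketch of a decision tree and the follow-branches-until-a-leaf procedure described above; the attribute names, split values, and classes are made-up examples, not taken from the book.

def classify(node, instance):
    """Follow branches down the tree until a leaf is found."""
    while "leaf" not in node:
        attr = node["attribute"]
        if "threshold" in node:                       # two-way numeric split
            branch = "<" if instance[attr] < node["threshold"] else ">="
        else:                                         # nominal: one branch per value
            branch = instance[attr]
        node = node["branches"][branch]
    return node["leaf"]

# A tiny hand-built tree: test 'outlook' first, then a numeric split on 'humidity'.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity", "threshold": 75,
                  "branches": {"<": {"leaf": "play"}, ">=": {"leaf": "no play"}}},
        "overcast": {"leaf": "play"},
        "rainy": {"leaf": "no play"},
    },
}

print(classify(tree, {"outlook": "sunny", "humidity": 70}))   # -> play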
Decision Trees
• Missing values
– May be treated as just another possible value of a nominal attribute – appropriate if a missing value may itself mean something
– Alternatively, follow the most popular branch when the value is missing from a test instance
– More complicated approach – rather than going all-or-nothing, ‘split’ the test instance into pieces, one per branch, weighted in proportion to the popularity of the branches among the training instances – recombination at the end uses a vote based on the weights (see the sketch below)
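A minimal sketch of the instance-splitting idea: when the tested attribute is missing, the instance is sent down every branch with a weight proportional to that branch's popularity, and the leaf votes are recombined by weight. The tree, the popularity fractions, and the class labels are assumptions for illustration.

from collections import defaultdict

def classify_weighted(node, instance, weight=1.0, votes=None):
    """Accumulate weighted class votes, splitting on missing attribute values."""
    votes = votes if votes is not None else defaultdict(float)
    if "leaf" in node:
        votes[node["leaf"]] += weight
        return votes
    value = instance.get(node["attribute"])
    if value is None:                                 # missing: follow every branch
        for branch, child in node["branches"].items():
            classify_weighted(child, instance,
                              weight * node["popularity"][branch], votes)
    else:                                             # known value: follow one branch
        classify_weighted(node["branches"][value], instance, weight, votes)
    return votes

# 'popularity' = assumed fraction of training instances that went down each branch.
tree = {"attribute": "outlook",
        "popularity": {"sunny": 0.4, "overcast": 0.3, "rainy": 0.3},
        "branches": {"sunny": {"leaf": "no play"},
                     "overcast": {"leaf": "play"},
                     "rainy": {"leaf": "play"}}}

votes = classify_weighted(tree, {"outlook": None})
print(max(votes, key=votes.get), dict(votes))         # -> play {'no play': 0.4, 'play': 0.6}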
Classification Rules
• Popular alternative to decision trees
• LHS / antecedent / precondition – tests that determine whether the rule applies
– Tests are usually ANDed together
– Could be a general logical condition (AND/OR/NOT), but learning such rules is MUCH less constrained
• RHS / consequent / conclusion – the answer – usually the class (but could be a probability distribution)
• Rules with the same conclusion essentially represent an OR
• Rules may form an ordered set, or be independent
• If independent, a policy may need to be established for when more than one rule matches (a conflict-resolution strategy) or when no rule matches (see the sketch below)
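A small sketch of how such rules might be represented and applied, showing first-match semantics for an ordered set versus a majority-vote conflict-resolution policy for independent rules; the rules, attributes, and default class are invented for illustration.

# Each rule is (antecedent tests ANDed together, conclusion).
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no play"),
    ({"outlook": "rainy", "windy": "TRUE"}, "no play"),
    ({"outlook": "overcast"}, "play"),
]

def matches(antecedent, instance):
    return all(instance.get(a) == v for a, v in antecedent.items())

def classify_ordered(rules, instance, default="play"):
    for antecedent, conclusion in rules:              # first matching rule wins
        if matches(antecedent, instance):
            return conclusion
    return default                                    # policy for no matching rule

def classify_independent(rules, instance, default="play"):
    conclusions = [c for a, c in rules if matches(a, instance)]
    if not conclusions:
        return default
    return max(set(conclusions), key=conclusions.count)   # majority-vote conflict resolution

print(classify_ordered(rules, {"outlook": "sunny", "humidity": "high"}))  # -> no play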
Rules / Trees
• Rules can easily be created from a tree – but the result is not the simplest possible set of rules
• Transforming rules into a tree is not straightforward (see the “replicated subtree” problem – next two slides)
• In many cases rules are more compact than trees – particularly if a default rule is possible
• Rules may appear to be independent nuggets of knowledge (and hence less complicated than trees) – but if the rules are an ordered set, they are much more complicated than they appear
Rules for a simple disjunction:
If a and b then x
If c and d then x
Figure 3.1 Decision tree for a simple disjunction.

Rules requiring a replicated subtree:
If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b
(In the figure, each gray triangle actually contains the whole gray subtree below it.)
Figure 3.3 Decision tree with a replicated subtree.
Association Rules
• Association rules are not intended to be used together as a set – in fact, the value lies in the knowledge itself – there is probably no automatic use of the rules
• Large numbers of possible rules
Association Rule Evaluation
• Coverage – the number of instances the rule predicts correctly – also called support
• Accuracy – the proportion of the instances to which the rule applies that it predicts correctly – also called confidence (see the sketch below)
• Coverage is sometimes expressed as a percentage of the total number of instances
• Usually the method or the user specifies minimum coverage and accuracy for the rules to be generated
• Some possible rules imply others – present only the most strongly supported
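A short sketch of computing these two measures for a single rule over a handful of made-up instances: support counts the instances the rule predicts correctly, and confidence divides that by the number of instances the rule applies to.

data = [
    {"outlook": "rainy", "windy": "FALSE", "play": "no"},
    {"outlook": "rainy", "windy": "TRUE",  "play": "no"},
    {"outlook": "sunny", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy", "windy": "FALSE", "play": "no"},
]

def evaluate(antecedent, consequent, data):
    applies = [row for row in data
               if all(row.get(a) == v for a, v in antecedent.items())]
    correct = [row for row in applies
               if all(row.get(c) == v for c, v in consequent.items())]
    support = len(correct)                             # coverage: instances predicted correctly
    confidence = len(correct) / len(applies) if applies else 0.0
    return support, confidence

print(evaluate({"outlook": "rainy"}, {"play": "no"}, data))   # -> (3, 1.0)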
Example – My Weather – Apriori Algorithm
Apriori
Minimum support: 0.15
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Best rules found:
1. outlook=rainy 5 ==> play=no 5 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. temperature=hot windy=FALSE 3 ==> play=no 3 conf:(1)
4. temperature=hot play=no 3 ==> windy=FALSE 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=no 3 conf:(1)
6. outlook=rainy humidity=normal 3 ==> play=no 3 conf:(1)
7. outlook=rainy temperature=mild 3 ==> play=no 3 conf:(1)
8. temperature=mild play=no 3 ==> outlook=rainy 3 conf:(1)
9. temperature=hot humidity=high windy=FALSE 2 ==> play=no 2 conf:(1)
10. temperature=hot humidity=high play=no 2 ==> windy=FALSE 2 conf:(1)
Rules with Exceptions
• Skip
Rules involving Relations
• More than just the values of attributes may be important – relationships between attribute values may matter
• See the book’s example on the next slide
Figure 3.6 The shapes problem (shaded: standing; unshaded: lying).
More Complicated – Winston’s Blocks World
• House – a 3-sided block and a 4-sided block, AND the 3-sided block is on top of the 4-sided block
• Solutions frequently involve learning rules that include variables/parameters
– E.g. 3sided(block1) & 4sided(block2) & ontopof(block1, block2) → house (see the sketch below)
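Read as executable code, the learned rule's variables simply become parameters; a minimal sketch, where the dict-based block representation (a 'sides' count and an 'on_top_of' field) is an assumption made for illustration.

def is_house(block1, block2):
    """The relational rule: 3sided(block1) & 4sided(block2) & ontopof(block1, block2)."""
    return (block1["sides"] == 3 and
            block2["sides"] == 4 and
            block1["on_top_of"] == block2["name"])

roof = {"name": "roof", "sides": 3, "on_top_of": "base"}
base = {"name": "base", "sides": 4, "on_top_of": None}
print(is_house(roof, base))   # -> True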
Easier and Sometimes Useful
• Introduce new attributes during data preparation
• The new attribute represents a relationship
– E.g. for the standing/lying task, a new boolean attribute widthgreater? could be introduced, filled in for each instance during data prep (see the sketch below)
– E.g. in the numeric weather data, a “WindChill” attribute could be introduced, calculated from temperature and wind speed (if numeric), or a “Heat Index” based on temperature and humidity
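A minimal data-preparation sketch for the first example: the relational test width > height is computed once per instance and stored as a new boolean attribute widthgreater, after which ordinary attribute-value learners can use it; the width/height numbers are made up.

instances = [
    {"width": 2, "height": 6, "class": "standing"},
    {"width": 7, "height": 3, "class": "lying"},
]

for inst in instances:
    inst["widthgreater"] = inst["width"] > inst["height"]   # new boolean attribute added during data prep

print(instances[0]["widthgreater"], instances[1]["widthgreater"])   # -> False True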
Numeric Prediction
• The standard of comparison for numeric prediction is the statistical technique of regression
• E.g. for the CPU performance data, the regression equation below was derived (a small usage sketch follows the equation)
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
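Written as code, the equation is just a weighted sum of the attribute values plus a constant; a small sketch using the coefficients above (the example attribute values passed in are made up).

def predict_prp(MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX):
    """Linear regression prediction for PRP using the coefficients above."""
    return (-56.1 + 0.049 * MYCT + 0.015 * MMIN + 0.006 * MMAX
            + 0.630 * CACH - 0.270 * CHMIN + 1.46 * CHMAX)

# Made-up attribute values for one machine:
print(predict_prp(MYCT=125, MMIN=256, MMAX=6000, CACH=16, CHMIN=4, CHMAX=32))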
Trees for Numeric Prediction
• The tree branches as in a decision tree (branches may be based on ranges of attribute values)
• Regression tree – each leaf node contains the average of the training-set values that the leaf applies to
• Model tree – each leaf node contains a regression equation for the instances that the leaf applies to (see the sketch after the figures)
Figure 3.7(b) Models for the CPU performance data: regression tree.
Figure 3.7(c) Models for the CPU performance data: model tree.
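A toy sketch contrasting the two leaf types: a regression-tree leaf stores a single number (the mean of its training values), while a model-tree leaf stores a linear model evaluated per instance. The split attribute (CACH), the mean, and the leaf model's coefficients are assumed for illustration.

regression_leaf = {"prediction": 19.3}                          # mean of the training values at this leaf
model_leaf = {"model": lambda inst: 4.5 + 1.2 * inst["CHMAX"]}  # per-leaf regression equation

def predict(inst):
    # One branch on an attribute range, as in a decision tree.
    leaf = regression_leaf if inst["CACH"] <= 8 else model_leaf
    return leaf["prediction"] if "prediction" in leaf else leaf["model"](inst)

print(predict({"CACH": 4, "CHMAX": 16}))    # regression-tree-style leaf -> 19.3
print(predict({"CACH": 32, "CHMAX": 16}))   # model-tree-style leaf -> 23.7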
Instance Based Representation
• The concept is not really represented at all (except via the examples)
• Real-world analogy – some radio stations don’t define what they play in words; their promos basically say “WXXX music is:” <songs>
• Training examples are merely stored (a kind of “rote learning”)
• Answers are given by finding the training example(s) most similar to the test instance at testing time
• Has been called “lazy learning” – no work is done until an answer is needed
Instance Based – Finding the Most Similar Example
• Nearest neighbor – each new instance is compared to all stored instances, with a “distance” calculated over the attributes of each instance
• The class of the nearest-neighbor instance is used as the prediction <see next slide and come back>
• OR the k nearest neighbors vote, possibly with a weighted vote
• Per-attribute distances are combined using city-block (Manhattan) or Euclidean (“as the crow flies”) distance (a code sketch follows the next figure)
Nearest Neighbor
[Figure: a test instance T plotted among training instances labeled x, y, and z; the prediction comes from the nearest stored instance(s).]
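A compact sketch of k-nearest-neighbor prediction with a choice of city-block (Manhattan) or Euclidean distance and a majority vote among the k nearest stored instances; the training points and classes are made up.

from collections import Counter

def distance(a, b, metric="euclidean"):
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sum(diffs) if metric == "cityblock" else sum(d * d for d in diffs) ** 0.5

def knn_predict(train, test_point, k=3, metric="euclidean"):
    """Vote among the k stored instances closest to the test point."""
    nearest = sorted(train, key=lambda xy: distance(xy[0], test_point, metric))[:k]
    return Counter(cls for _, cls in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "x"), ((1.2, 0.8), "x"), ((5.0, 5.0), "y"), ((5.2, 4.9), "y")]
print(knn_predict(train, (0.9, 1.1), k=3))   # -> x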
Additional Details
• The distance/similarity function must deal with binary/nominal attributes – usually by an all-or-nothing match – but mild should be a better match to hot than cool is!
• The distance/similarity function is simpler if the data is normalized in advance. E.g. a $10 difference in household income is not significant, while a 1.0 difference in GPA is big
• The distance/similarity function should weight different attributes differently – a key task is determining those weights (see the sketch below)
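A small sketch of the normalization and weighting points: numeric attributes are rescaled to [0, 1] with min-max normalization so that no attribute dominates by virtue of its units, and per-attribute weights can then stretch the dimensions that matter most. The income/GPA values and the weights are invented.

def min_max_normalize(rows):
    """Rescale each numeric column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def weighted_distance(a, b, weights):
    """Euclidean distance with a weight per attribute."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5

rows = [[42000, 3.9], [45000, 2.1], [120000, 3.0]]    # [household income, GPA]
norm = min_max_normalize(rows)
print(weighted_distance(norm[0], norm[1], weights=[1.0, 2.0]))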
Further Wrinkles
• May not need to save all instances
– Very normal instances may not all need to be saved
– Some approaches actually do some generalization
But …
• There is not really a structural pattern that can be pointed to
• However, many people in many tasks/domains will respect arguments based on “previous cases” (diagnosis and law among them)
• The book points out that the instances plus the distance metric combine to form class boundaries
– With 2 attributes, these boundaries can actually be visualized <see next slide>
Figure 3.8 Different ways of partitioning the instance space (panels a–d).
Clustering
• Clusters may be representable graphically
• If dimensionality is high, the best representation may only be tabular – showing which instances are in which clusters
• Show Weka – run njcrimenominal with EM and then visualize the results
• Some algorithms associate instances with clusters probabilistically – for every instance, they list the probability of membership in each of the clusters (see the sketch after the figure)
• Some algorithms produce a hierarchy of clusters, which can be visualized using a tree diagram
• After clustering, the clusters may be used as classes for classification
[Figure panels (a)–(d): clusters shown as groupings of instances a–k, a table of per-instance membership probabilities for clusters 1–3, and a dendrogram.]
Figure 3.9 Different ways of representing clusters.
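A sketch of producing the probabilistic-membership representation with a mixture model, assuming scikit-learn and NumPy are available; the 2-D data points are made up, and each printed row gives one instance's probability of belonging to each cluster.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 1.0]])
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

for instance, probs in zip(X, gm.predict_proba(X)):
    print(instance, np.round(probs, 2))    # one row of cluster-membership probabilities per instance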
End Chapter 3