1-a
Decision trees
part II
LESSON TOPICS
• CHAID method: Chi-squared Automatic Interaction Detection
• Chi-square test
• Bonferroni correction factor
• Examples
Principal features of the CHAID method
CHAID merges the categories of a predictor that are homogeneous with respect to the dependent variable, but keeps distinct all the categories that are heterogeneous.
CHAID uses the Bonferroni multiplier to make the adjustments needed for simultaneous statistical inference.
Unlike other iterative partitioning methods, CHAID is limited to ordinal and nominal variables.
It uses the chi-square test to verify independence between variables (together with the Bonferroni factor) in order to assess the significance of a partition.
Chi-square test of independence
$$\chi^2 = \sum_i \sum_j \frac{(n_{ij} - n^*_{ij})^2}{n^*_{ij}}$$
where $n_{ij}$ is the empirical frequency corresponding to the combination of category $i$ of the first variable with category $j$ of the second variable, and
$$n^*_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$
is the corresponding theoretical frequency under the hypothesis of independence between the two variables ($n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, $n$ is the overall total).
EXAMPLE
Families according to residence and personal computer ownership (empirical frequencies)

Ownership of            Geographic zone
personal computer    North-Center    South    Total
YES                       150         100       250
NO                        500         250       750
Total                     650         350      1000
Families according to residence and personal computer ownership (theoretical frequencies)

Ownership of            Geographic zone
personal computer    North-Center    South    Total
YES                      162.5        87.5     250.0
NO                       487.5       262.5     750.0
Total                    650.0       350.0    1000.0
Test calculations:
$$\chi^2 = \frac{(500-487.5)^2}{487.5} + \frac{(100-87.5)^2}{87.5} + \frac{(150-162.5)^2}{162.5} + \frac{(250-262.5)^2}{262.5} \approx 3.66$$
Bonferroni adjustment factor
• Let us consider the dependent variable R and the predictors B, with five categories, and A, with two.
• Let α be the type I error probability of the independence test on the two-way table of B and R (for example α = 0.05).
There are $2^4 - 1 = 15$ different ways to reduce B to a dichotomous variable.
If the 15 hypothesis tests were independent, the probability of making at least one type I error would be:
$$1 - (1-\alpha)^{15} > \alpha$$
In the above example, 15 is called the Bonferroni factor.
If α is small and M is the number of tests,
$$1 - (1-\alpha)^{M} \approx M\alpha$$
For the predictor A the probability of making a type I error is simply α.
In the CHAID method we compare the value of α associated with the test of independence for the variable A with the value of α for the variable B corrected by the Bonferroni factor.
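The count of 15 dichotomies and the two error probabilities can be checked numerically. The sketch below is illustrative only: the category labels `b1` to `b5` and the helper `count_dichotomies` are hypothetical names, and the 15 tests are assumed to be independent, as in the text.

```python
from itertools import combinations

def count_dichotomies(categories):
    """Count the ways to split the categories into two non-empty groups."""
    cats = list(categories)
    rest = cats[1:]
    count = 0
    # Keep the first category in group 1 so each split is counted once;
    # group 1 takes k of the others, group 2 keeps at least one category.
    for k in range(len(rest)):
        count += sum(1 for _ in combinations(rest, k))
    return count

alpha = 0.05
M = count_dichotomies(["b1", "b2", "b3", "b4", "b5"])  # 2**4 - 1
familywise = 1 - (1 - alpha) ** M   # exact value if the M tests were independent
bonferroni = M * alpha              # Bonferroni approximation

print(M)                     # 15
print(round(familywise, 4))
print(round(bonferroni, 2))
```

The product Mα is always at least as large as 1 - (1-α)^M, so the Bonferroni correction is conservative; the two values are close only when Mα is small.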
Basic components of CHAID:
1. A categorical dependent variable
2. A set of independent variables, also categorical, whose combinations are used to define the partitions
3. A set of parameters
At each step of the analysis, each subgroup is examined and the best predictor is selected, defined as the one with the smallest value of α after correction by the Bonferroni factor.
Kinds of predictive variables in CHAID
1. Monotonic
2. Free
3. Floating
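The predictor type determines which pairs of categories are candidates for merging. The sketch below assumes the usual CHAID reading of the three types, which the slides do not spell out: a monotonic predictor is ordinal and only adjacent categories may be merged; a free predictor is nominal and any two categories may be merged; a floating predictor is ordinal except for one category (typically the missing value) that may be merged with any other. The function `mergeable_pairs` and its arguments are hypothetical names.

```python
from itertools import combinations

def mergeable_pairs(categories, kind, floating_category=None):
    """List the pairs of categories that may be merged for a predictor type.

    categories: labels in their natural order
    kind: "monotonic", "free" or "floating"
    floating_category: the label allowed to float (only for kind="floating")
    """
    if kind == "free":
        # Nominal predictor: every pair is allowed.
        return list(combinations(categories, 2))
    if kind == "monotonic":
        # Ordinal predictor: only neighbours in the ordering.
        return list(zip(categories, categories[1:]))
    if kind == "floating":
        ordered = [c for c in categories if c != floating_category]
        pairs = list(zip(ordered, ordered[1:]))
        # The floating category may join any of the ordered categories.
        pairs += [(floating_category, c) for c in ordered]
        return pairs
    raise ValueError(f"unknown predictor kind: {kind}")

print(mergeable_pairs(["1", "2", "3", "?"], "floating", floating_category="?"))
```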
The CHAID
algorithm:
Step 1: Merging
Step 2: Splitting
Step 3: Stopping
Merging
For each predictor:
1. Construct the complete two-way table.
2. For each pair of categories that can be merged, compute the chi-square test. If a pair is not significant, merge it and go to step 3. If all the remaining pairs are significant, go to step 4.
3. For each category resulting from the merge of three or more original categories, check with the chi-square test whether any original category can be separated from the others. Return to step 2.
4. Merge the categories that have too few observations, taking those which have the highest value of α.
5. Calculate the value of α corrected by the Bonferroni factor on the table resulting from the merging process.
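The core of the merging loop (steps 1 and 2) can be sketched as follows. This is a simplified illustration, not the full procedure: it assumes a binary dependent variable (so each pairwise test has one degree of freedom), a monotonic predictor (only adjacent categories are tested), and it merges the least significant pair first; the re-splitting check of step 3 and the small-category rule of step 4 are omitted. The counts, the threshold `alpha_merge` and the function names are all hypothetical.

```python
import math

def pair_p_value(a, b):
    """p-value of the chi-square test on the 2x2 table built from two
    categories; a and b are (responders, non_responders).
    With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))."""
    table = [list(a), list(b)]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    return math.erfc(math.sqrt(chi2 / 2))

def merge_categories(counts, alpha_merge=0.05):
    """counts: ordered list of (label, responders, non_responders)."""
    groups = [([label], yes, no) for label, yes, no in counts]
    while len(groups) > 1:
        # Test every pair of adjacent groups.
        pvals = [pair_p_value(groups[i][1:], groups[i + 1][1:])
                 for i in range(len(groups) - 1)]
        best = max(range(len(pvals)), key=pvals.__getitem__)
        if pvals[best] <= alpha_merge:
            break  # all remaining pairs are significant
        (la, ya, na), (lb, yb, nb) = groups[best], groups[best + 1]
        groups[best:best + 2] = [(la + lb, ya + yb, na + nb)]
    return groups

# Hypothetical counts: four ordered categories of one predictor.
data = [("A", 10, 90), ("B", 12, 88), ("C", 30, 70), ("D", 33, 67)]
for labels, yes, no in merge_categories(data):
    print(labels, yes, no)
```

On these made-up counts, A and B end up merged, as do C and D, while the two resulting groups stay distinct.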
Splitting
• Take as the best predictor the one with the smallest value of α corrected by the Bonferroni factor.
• If that predictor's corrected value of α is not significant, do not split the subgroup.
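The splitting rule reduces to picking the smallest corrected α and comparing it with a threshold. In the sketch below the p-values, the threshold `alpha_split` and the function name `choose_split` are hypothetical; the Bonferroni factors 1 and 15 echo the predictors A and B discussed earlier.

```python
def choose_split(candidates, alpha_split=0.05):
    """candidates: {predictor: (p_value, bonferroni_factor)}.

    Returns (predictor, corrected_alpha); predictor is None when even the
    best corrected value is not significant, meaning no split is made.
    """
    # Corrected value: p-value times the Bonferroni factor, capped at 1.
    adjusted = {name: min(1.0, p * m) for name, (p, m) in candidates.items()}
    best = min(adjusted, key=adjusted.get)
    if adjusted[best] < alpha_split:
        return best, adjusted[best]
    return None, adjusted[best]

# B has the smaller raw p-value, but after correction A wins.
print(choose_split({"A": (0.02, 1), "B": (0.004, 15)}))
```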
Stopping
Return to step 1 and analyze the next subgroup. Stop when every subgroup has been analyzed or has too few observations.
Example of the CHAID method
Dependent variable: response rate to a promotional offer of a magazine subscription
Independent variables:
• Age of the head of the family, 5 categories, floating (AGE)
• Gender, 2 categories, monotonic (GENDER)
• Presence of children, 2 categories, monotonic (KIDS)
• Family income, 8 categories, monotonic (INCOME)
• Credit card, 2 categories, monotonic (BANKCARD)
• Number of household members, 6 categories, floating (HHSIZE)
• Occupational status, 4 categories, free (OCCUP)
Representation of the partition process by a dendrogram

Total (0.02; 81,040)
  split on HHSIZE
    1 (0.03; 25,384) [segment 1]
    2-3 (0.13; 16,132), split on OCCUP
      W (0.36; 1,758) [segment 2]
      BO? (0.10; 14,374) [segment 3]
    4-5 (0.00; 6,198) [segment 4]
    ? (-0.04; 33,326), split on GENDER
      M (-0.04; 25,531) [segment 5]
      F (-0.05; 7,795) [segment 6]
Interpretation of results
Comparison of response according to the variable household size (HHSIZE) before and after merging

                               % of responses
HHSIZE           Frequency   Before merging   After merging
1                  25384          1.09             1.09
2                  11240          1.49             1.52
3                   4892          1.59             1.52
4                   3187          1.79             1.92
5                   3011          2.06             1.92
Missing value      33326          0.87             0.87
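The "after merging" percentages are the frequency-weighted averages of the "before merging" percentages of the categories that were joined. A short check, using only the figures in the table above (the helper name `merged_rate` is made up):

```python
# HHSIZE category: (frequency, % of responses before merging)
before = {
    "1": (25384, 1.09),
    "2": (11240, 1.49),
    "3": (4892, 1.59),
    "4": (3187, 1.79),
    "5": (3011, 2.06),
    "Missing": (33326, 0.87),
}

def merged_rate(categories):
    """Frequency-weighted response rate (%) of a merged group."""
    total = sum(before[c][0] for c in categories)
    return sum(before[c][0] * before[c][1] for c in categories) / total

print(round(merged_rate(["2", "3"]), 2))  # 1.52
print(round(merged_rate(["4", "5"]), 2))  # 1.92
print(sum(freq for freq, _ in before.values()))  # 81040
```

The total of 81,040 households matches the root of the dendrogram.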
Ranking of segments according to response rate

Rank   Number      Response rate (%)   Description
1      Segment 2        2.39           Household with two or three members, head white collar
2      Segment 4        1.92           Household with four or more members
3      Segment 3        1.42           Household with two or three members, head with occupational status other than white collar
4      Segment 1        1.09           Household with one member
5      Segment 6        1.06           Household with missing number of members, head female
6      Segment 5        0.81           Household with missing number of members, head male