DATA ANALYSIS Module Code: CA660 Lecture Block 7: Non-parametrics
WHAT ABOUT NON-PARAMETRICS? How Useful are ‘Small Data’?
• General points
- No clear theoretical probability distribution, so empirical distributions are needed
- So, less knowledge of the form of the data*, e.g. ranks instead of values
- Quick and dirty
- Need not focus on parameter estimation or testing; when this is done, it is frequently based on “less-good” parameters/estimators, e.g. medians; otherwise test “properties”, e.g. randomness, symmetry, quality etc.
- Weaker assumptions, implicit in *
- Smaller sample sizes ‘typical’
- Different data, implicit from other points. Levels of measurement: Nominal and Ordinal are typical for non-parametric/distribution-free methods

ADVANTAGES/DISADVANTAGES
Advantages
- Power may be better using N-P, if assumptions are weaker
- Smaller samples and less work etc., as stated
Disadvantages - also implicit from earlier points, specifically:
- Loss of information/power etc. when more is known about the data, or when the assumptions do apply
- Separate tables for each test
General bases/principles:
- Binomial - cumulative tables, ordinal data
- Normal - large samples
- Kolmogorov-Smirnov for empirical distributions - shift in median/shape
- Confidence intervals - more work to establish. Use confidence regions and tolerance intervals
- Errors - Type I, Type II. Power as usual
- Relative efficiency - asymptotic, e.g. look at the ratio of sample sizes needed to achieve the same power

STARTING SIMPLY: THE ‘SIGN TEST’
• Example. Suppose we want to test whether weights of a certain item are likely to be more or less than 220 g. From 12 measurements, selected at random, count how many are above and how many below. Obtain 9 (+), 3 (-).
• Null hypothesis H0: Median = 220. “Test” on the basis of counts of signs.
• Binomial situation, n = 12, p = 0.5. For this distribution $P\{3 \le X \le 9\} = 0.961$ while $P\{X \le 2 \text{ or } X \ge 10\} = 1 - 0.961 = 0.039$. The observed count of 9 lies outside this critical region, so the result is not strongly significant.
• Notes: The “location” tested need not be the median (distributions are described by location, dispersion and shape).
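The binomial probabilities in the sign-test example can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides: the function names are illustrative, and the two-sided p-value (doubling the smaller tail) is one common convention, assumed here.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """P{X = k} for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def sign_test_two_sided(n_plus, n_minus):
    """Two-sided sign-test p-value under H0: P(+) = P(-) = 0.5.

    Assumption: the p-value is twice the smaller tail, capped at 1.
    """
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    tail = sum(binom_pmf(i, n) for i in range(k + 1))
    return min(1.0, 2 * tail)

n = 12
central = sum(binom_pmf(k, n) for k in range(3, 10))  # P{3 <= X <= 9}
critical = 1 - central                                # P{X <= 2 or X >= 10}
p_value = sign_test_two_sided(9, 3)                   # observed 9 (+), 3 (-)

print(f"P(3 <= X <= 9)          = {central:.4f}")
print(f"P(X <= 2 or X >= 10)    = {critical:.4f}")
print(f"two-sided p for 9+/3-   = {p_value:.4f}")
```

The size of the critical region (about 4%) and the p-value for the observed split are different quantities: the first fixes the test, the second measures how extreme 9 versus 3 actually is.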
Location = median, quartile or other percentile. There are many variants of the sign test, including e.g. runs of + and - signs to test for “randomness”.

PERMUTATION/RANDOMIZATION TESTS
• Example: Suppose we have 8 subjects, 4 to be selected at random for new training. All 8 are ranked in order of level of ability after a given period, ranking from 1 (best) to 8 (worst). P{subjects ranked 1, 2, 3, 4 took new training} = ??
• Clearly any 4 subjects could be chosen. Selecting r = 4 units from n = 8 gives $\binom{n}{r} = \binom{8}{4} = 70$ possible selections.
• If the new scheme is ineffective, all sets of ranks are equally likely: P{1, 2, 3, 4} = 1/70
• More formally, sum the ranks in each grouping. Low sums indicate that the training is effective, high sums that it is not.

Sum: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
No.:  1  1  2  3  5  5  7  7  8  7  7  5  5  3  2  1  1

• A critical region of size 2/70 is given by rank sums 10 and 11, while one of size 4/70 comes from rank sums 10, 11, 12 (both “nominal” 5%)
• Testing H0: new training scheme gives no improvement vs H1: some improvement

MORE INFORMATION: WILCOXON ‘SIGNED RANK’
• Direction and magnitude: H0: median = 220 (symmetry assumed)
• Arrange all sample deviations from the hypothesised median in order of magnitude and replace by ranks (1 = smallest deviation, n = largest). A high sum for the positive (or negative) ranks, relative to the other, means H0 is unlikely.

Weights       126  142  156  228  245  246  370  419  433  454  478  503
Diffs.        -94  -78  -64    8   25   26  150  199  213  234  258  283
Rearranged      8   25   26  -64  -78  -94  150  199  213  234  258  283
Signed ranks    1    2    3   -4   -5   -6    7    8    9   10   11   12

Clearly $S_{negative} = 15$, which is less than $S_{positive} = 63$.
Tables are of the form: reject H0 if the lower of $S_{negative}$, $S_{positive}$ $\le$ tabled value; e.g. here, n = 12 at the $\alpha$ = 5% level, tabled value = 13, so do not reject H0.

LARGE SAMPLES and C.I.
• Normal approximation for S, the smaller in magnitude of the rank sums:
$$Z = \frac{\text{Obs} - \text{Exp}}{\text{SE}} = \frac{S + \tfrac{1}{2} - \tfrac{1}{4}n(n+1)}{\sqrt{n(n+1)(2n+1)/24}} \sim N(0,1)$$
where Z (or U) is the S.N.D., so C.I. as usual
• General for C.I.: the basic idea is to take pairs of observations, calculate the mean of each pair and omit the largest/smallest of the $\tfrac{1}{2}n(n+1)$ pair means.
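Both worked examples above (the rank-sum permutation distribution for 4 of 8 subjects, and the signed-rank sums for the weights data) are small enough to enumerate directly. The following is a minimal sketch, not the lecturer's own code; the variable names are illustrative, and the continuity correction of 1/2 in the normal approximation is an assumption about the form intended on the slide.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# --- Permutation distribution: rank sum of 4 subjects chosen from ranks 1..8
rank_sum_counts = Counter(sum(c) for c in combinations(range(1, 9), 4))
total = sum(rank_sum_counts.values())  # number of equally likely selections
p_le_11 = (rank_sum_counts[10] + rank_sum_counts[11]) / total
p_le_12 = p_le_11 + rank_sum_counts[12] / total

print("selections:", total)
for s in sorted(rank_sum_counts):
    print(s, rank_sum_counts[s])
print(f"P(sum <= 11) = {p_le_11:.4f}, P(sum <= 12) = {p_le_12:.4f}")

# --- Wilcoxon signed-rank sums for the weights example, H0: median = 220
weights = [126, 142, 156, 228, 245, 246, 370, 419, 433, 454, 478, 503]
diffs = [w - 220 for w in weights]
ordered = sorted(diffs, key=abs)  # this sample has no zero or tied |diff|
s_neg = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)
s_pos = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)

# --- Large-sample normal approximation for the smaller rank sum
# Assumption: continuity correction of +1/2, since S lies below its expectation.
n = len(diffs)
s = min(s_neg, s_pos)
z = (s + 0.5 - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)

print(f"S- = {s_neg}, S+ = {s_pos}, Z = {z:.3f}")
```

With only 12 observations the exact table is the better guide; the Z value is shown only to illustrate how the approximation is formed. Samples with ties or zero differences would need mid-ranks, which this sketch does not handle.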
Usually computer-based: re-sampling or graphical techniques.
• Alternative forms are common for non-parametrics, e.g. for Wilcoxon signed ranks use $W = |S_p - S_n|$, the magnitude of the difference between the positive and negative rank sums. This needs a different table.
• Ties complicate distributions and significance. Assign mid-ranks.

KOLMOGOROV-SMIRNOV and EMPIRICAL DISTRIBUTIONS
• Purpose: to compare a set of measurements (two groups with each other, or one group with the expected form) and to analyse differences.
• Cannot assume Normality of the underlying distribution (usual shape), so need enough sample values to base the comparison on (e.g. $\ge$ 4, 2 groups)
• Major features: sensitivity to differences in both shape and location of medians (does not distinguish which is different)
• Uses the empirical c.d.f., not the p.d.f.; looks for consistency by comparing the population curve (expected case) with the empirical curve (sample values). The E.D.F. is a step function, with the value at each step taken from the data:
$$S(x) = \frac{\text{No. of sample values} \le x}{n}$$
S(x) should never be too far from F(x), the “expected” form
• Test basis is the maximum difference, $\max_i |F(x_i) - S(x_i)|$

Criticisms/Comparison: K-S with other ($\chi^2$) Goodness of Fit Tests for distributions
Main criticism of Kolmogorov-Smirnov:
- wastes information in using only the difference of greatest magnitude (in cumulative form)
General advantages/disadvantages of K-S:
- easy to apply
- relatively easy to obtain C.I.
- generally deals well with continuous data. Discrete data also possible, but test criteria are not exact, so can be inefficient
- for two groups, need same number of observations
- distinction between location/shape differences not established
Note: $\chi^2$ applies to both discrete and continuous data, and to grouped data, but “arbitrary” grouping can be a problem: it affects the sensitivity of H0 rejection.

COMPARISON OF 2 INDEPENDENT SAMPLES: WILCOXON-MANN-WHITNEY
• Parallel with the parametric (classical) case again.
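For the Wilcoxon-Mann-Whitney comparison introduced under this heading, the joint ranking, rank sums and U statistics can be computed directly. The sketch below uses the two weight samples from the worked example in these slides, with $U = S - \tfrac{1}{2}k(k+1)$ for a sample of size k as defined on the slide. Variable names are illustrative, and the dictionary-based ranking assumes no tied values, which holds for these data.

```python
# First sample (n = 12) and second sample (m = 16) from the weights example
sample_n = [126, 142, 156, 228, 245, 246, 370, 419, 433, 454, 478, 503]
sample_m = [29, 39, 60, 78, 82, 112, 125, 170, 192, 224,
            263, 275, 276, 286, 369, 756]

# Joint ranking, 1 = smallest; valid as written only when there are no ties
combined = sorted(sample_m + sample_n)
rank_of = {value: rank for rank, value in enumerate(combined, start=1)}

m, n = len(sample_m), len(sample_n)
S_m = sum(rank_of[v] for v in sample_m)  # rank sum of the second sample
S_n = sum(rank_of[v] for v in sample_n)  # rank sum of the first sample

U_m = S_m - m * (m + 1) // 2
U_n = S_n - n * (n + 1) // 2

print(f"S_m = {S_m}, S_n = {S_n}")
print(f"U_m = {U_m}, U_n = {U_n}, U_m + U_n = {U_m + U_n} (= m*n = {m * n})")
```

The smaller of the two U values is the one compared with the tabled critical value.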
H0: Samples from the same population (medians the same) vs H1: Medians not the same
• For two samples, of size m and n, calculate the joint ranking and the rank sum for each sample, giving $S_m$ and $S_n$. These should be similar if the populations sampled are also similar.
• $S_m + S_n$ = sum of all ranks = $\tfrac{1}{2}(m+n)(m+n+1)$, and the result is tabulated for
$$U_m = S_m - \tfrac{1}{2}m(m+1), \qquad U_n = S_n - \tfrac{1}{2}n(n+1)$$
• Clearly, $U_m = mn - U_n$, so need only calculate one from first principles
• Tables typically give, for various m, n, the value which the smaller U must not exceed in order to reject H0. 1-tailed/2-tailed.
• Easier: use the sum of smaller ranks or fewer values.
• Example in brief: if the sum of ranks = 12, say, the probability is based on the no. of possible ways of obtaining a 12 out of the total no. of possible sums

Example - W-M-W
• For the example on weights earlier, assume we now have a 2nd sample set also:
29 39 60 78 82 112 125 170 192 224 263 275 276 286 369 756
• Combined ranks for the two samples are:

Value  29  39  60  78  82 112 125 126 142 156 170 192 224 228 245
Rank    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Value 246 263 275 276 286 369 370 419 433 454 478 503 756
Rank   16  17  18  19  20  21  22  23  24  25  26  27  28

Here m = 16, n = 12 and $S_m$ = 1 + 2 + 3 + … + 21 + 28 = 187, so $U_m = 51$ and $U_n = 141$. (Clearly, can check by calculating $U_n$ directly also.) For a 2-tailed test at the 5% level, the tabled value is 53 and our value $U_m = 51$ is less, i.e. more extreme, so reject H0. Medians are different here.

MANY SAMPLES - Kruskal-Wallis
• Direct extension of W-M-W. Tests H0: Medians are the same.
• Rank all observations from all samples together, from smallest (rank 1) to highest (rank N), for N values. Ties are given mid-ranks.
• $r_{ij}$ is the rank of observation $x_{ij}$ and $s_i$ = sum of ranks in the ith sample (group), of size $n_i$
• Compute treatment and total SSQ of ranks, uncorrected, given as
$$S_t^2 = \sum_i \frac{s_i^2}{n_i}, \qquad S_r^2 = \sum_{i,j} r_{ij}^2$$
• For no ties, this simplifies: $S_r^2 = N(N+1)(2N+1)/6$
• Subtract off the correction for the average from each, given by $C = \tfrac{1}{4}N(N+1)^2$
• Test statistic
$$T = \frac{(N-1)\,[S_t^2 - C]}{S_r^2 - C}$$
i.e. approx. $\chi^2_{t-1}$ for moderate/large N, where t is the number of samples. Simplifies further if no ties.

PAIRING/RANDOMIZED BLOCKS - Friedman
• Blocks of units, so e.g. two treatments allocated at random within a block = matched pairs; can use a variant of the sign test (on differences)
• Many samples or units = Friedman (simplest case of R.B. design)
• Recall that comparisons within pairs/blocks are more precise than between, so including a Blocks term “removes” the block effect as a source of variation.
• Friedman’s test replaces observations by ranks (within blocks) to achieve this. (Thus, ranked data can also be used directly.)
Have $x_{ij}$ = response for treatment i (i = 1, 2, ..., t) in block j (j = 1, 2, ..., b), ranked within blocks. The sum of ranks obtained for each treatment is $s_i$, i = 1, ..., t.
For rank $r_{ij}$ (or mid-rank if tied), the raw (uncorrected) rank SSQ is $S_r^2 = \sum_{i,j} r_{ij}^2$

Friedman contd.
• With no ties, the analysis simplifies
• Need also SSQ (all treatments appear in blocks): $S_t^2 = \frac{1}{b}\sum_i s_i^2$
• Again, the correction factor is analogous to that for Kruskal-Wallis: $C = \tfrac{1}{4}bt(t+1)^2$
• and the common form of the Friedman test statistic is
$$T_1 = \frac{b(t-1)(S_t^2 - C)}{S_r^2 - C} \sim \chi^2_{t-1}$$
for t, b not very small; otherwise need exact tables.

Other Parallels with Parametric cases
• Correlation - Spearman’s Rho (Pearson’s product-moment coefficient calculated using ranks or mid-ranks):
$$r_s = \frac{\sum_i r_i s_i - C}{\sqrt{\left(\sum_i r_i^2 - C\right)\left(\sum_i s_i^2 - C\right)}}, \qquad \text{where } C = \tfrac{1}{4}n(n+1)^2$$
with $r_i$, $s_i$ the ranks on the two variables; used to compare e.g. ranks on two assessments/tests.
• Regression - LSE robust in general. Some use of “median methods”, such as Theil’s (not dealt with here, so assume usual least squares form).

NON-PARAMETRIC C.I. in Informatics: BOOTSTRAP
• Bootstrapping = re-sampling technique used to obtain the empirical distribution of an estimator in the construction of a non-parametric C.I.
- Effective when the distribution is unknown or complex
- More computation than parametric approaches, and may fail when the sample size of the original experiment is small
- Re-sampling implies sampling from a sample, usually to estimate empirical properties (such as variance, distribution, C.I. of an estimator) and to obtain the EDF of a test statistic. Common methods are Bootstrap, Jackknife, shuffling
- Aim = approximate numerical solutions (like confidence regions). Can handle bias in this way, e.g. to find the MLE of variance $\sigma^2$, mean unknown
- Both Bootstrap and Jackknife are used, Bootstrap more often for C.I.

Bootstrap/Non-parametric C.I. contd.
• Basis: both Bootstrap and others rely on the fact that the sample cumulative distribution function (CDF or just DF) = MLE of a population distribution function F(x)
• Define a Bootstrap sample as a random sample, size n, drawn with replacement from a sample of n objects
• For S the original sample, $S = (x_1, x_2, \ldots, x_n)$, P{drawing each item, object or group} = 1/n
• Bootstrap sample $S^B$ is obtained from the original, s.t. sampling n times with replacement gives $S^B = (x_1^B, x_2^B, \ldots, x_n^B)$
• Power relies on the fact that a large number of re-sampling samples can be obtained from a single original sample, so if the process is repeated b times, obtain $S_j^B$, j = 1, 2, ..., b, with each of these being a bootstrap replication

Contd.
• Estimator: obtained from each sample. If $\hat\theta_j^B = F(S_j^B)$ is the estimate for the jth replication, then the bootstrap mean and variance are
$$\bar\theta^B = \frac{1}{b}\sum_{j=1}^{b}\hat\theta_j^B, \qquad \hat V^B = \frac{1}{b-1}\sum_{j=1}^{b}\left(\hat\theta_j^B - \bar\theta^B\right)^2$$
while $\text{Bias}^B = \bar\theta^B - \hat\theta$
• CDF of estimator = $P\{\hat\theta_b \le x\}$ for b replications, so the C.I. with confidence coefficient $\alpha$ for some percentile is then
$$\left\{\text{CDF}^{-1}[0.5(1-\alpha)],\ \text{CDF}^{-1}[0.5(1+\alpha)]\right\}$$
• Normal approx. for mean, large b:
$$\bar\theta^B \pm U_{(1-\alpha)/2}\sqrt{\hat V^B}, \qquad U \sim N(0,1)$$
where U is the standardised Normal deviate ($t_{b-1}$ distribution if the no. of bootstrap replications is small).

Example
• Recall (gene and marker) or (sex and purchasing) example
• MLE, 1000 bootstrapping replications might give results:

                              First estimator     Second estimator
Parametric
  Variance                    0.0001357           0.00099
  95% C.I.                    (0, 0.0455)         (0.162, 0.286)
  95% Interval (Likelihood)   (0.06, 0.056)       (0.17, 0.288)
Bootstrap
  Variance                    0.0001666           0.0009025
  Bias                        0.0000800           0.0020600
  95% C.I. (Normal)           (0, 0.048)          (0.1675, 0.2853)
  95% C.I. (Percentile)       (0, 0.054)          (0.1815, 0.2826)

SUMMARY: Non-Parametric Use
• Simple property: Sign Tests - large number of variants. Simple basis
• Paired data: Wilcoxon Signed Rank - compare medians. Conditions/Assumptions: no. pairs $\ge$ 6; distributions same shape
• Independent data: Mann-Whitney U - compare medians of 2 independent groups. Conditions/Assumptions: N $\ge$ 4; distributions same shape
• Correlation: as before. Parallels parametric case.
• Distributions: Kolmogorov-Smirnov - compares either location or shape, but the two are not separately distinguished. Conditions/Assumptions: N $\ge$ 4; if 2 groups are compared, need equal no. of observations
• Many Group/Sample Comparisons:
- Friedman: compare medians. Conditions: data in Randomised Block design. Distributions same shape.
- Kruskal-Wallis: independent groups. Conditions: completely randomised. Groups can be of unequal size. Distributions same shape.
• Regression: robust, as noted, so use parametric form.
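The bootstrap procedure described on the earlier slides (re-sample n values with replacement, recompute the estimator, repeat b times, then read the C.I. off the percentiles of the replications) can be sketched as follows. This is an illustrative sketch only: the function name `bootstrap`, the choice of the median as estimator, the reuse of the weights sample, and the simple order-statistic percentile rule are all assumptions, not taken from the slides.

```python
import random
from statistics import mean, median, variance

def bootstrap(sample, estimator, b=1000, conf=0.95, seed=0):
    """Bootstrap mean, variance, bias and percentile C.I. for an estimator.

    Each replication draws len(sample) values with replacement, so every
    original value has probability 1/n of being picked at each draw.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    n = len(sample)
    reps = sorted(
        estimator([rng.choice(sample) for _ in range(n)]) for _ in range(b)
    )
    boot_mean = mean(reps)
    boot_var = variance(reps)              # divisor b - 1
    bias = boot_mean - estimator(sample)   # bootstrap mean minus original estimate
    # Percentile interval: order statistics near 0.5(1 - conf) and 0.5(1 + conf)
    lo = reps[int(0.5 * (1 - conf) * b)]
    hi = reps[min(b - 1, int(0.5 * (1 + conf) * b))]
    return boot_mean, boot_var, bias, (lo, hi)

weights = [126, 142, 156, 228, 245, 246, 370, 419, 433, 454, 478, 503]
boot_mean, boot_var, bias, ci = bootstrap(weights, median)
print(f"bootstrap mean = {boot_mean:.1f}, variance = {boot_var:.1f}, bias = {bias:.1f}")
print(f"95% percentile C.I. for the median: {ci}")
```

With only 12 observations the replications take few distinct values, which illustrates the slide's caution that the method may fail for small original samples.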