2. Statistical Inference
ENEE 757 | CMSC 818V
Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
http://ter.ps/757
https://www.facebook.com/SDSAtUMD

Today's Lecture
•  Where we've been
–  Introduction to network and distributed systems security
•  Where we're going today
–  Statistical inference
–  Machine learning
•  Where we're going next
–  Adversary models and cryptography review

Random Variables and Random Numbers
•  A random variable X is a rule for assigning a number to each experimental outcome
–  Example: X is the number of heads after n coin tosses
–  We're often trying to determine the probability distribution of X
•  Assign probabilities p1, p2, … pn to values x1, x2, … xn of X
•  Computers produce pseudo-random numbers
–  Sequence of numbers (orbit) starting from a seed value
–  The numbers in the sequence follow certain mathematical properties, e.g. uniform distribution

Empirical Distributions
What does the data look like?
•  Probability density function (PDF) of the values you measure
–  PDF(x) is the probability density of the metric at the value x
–  ∫_a^b PDF(x) dx = Pr[a ≤ metric ≤ b]
[Figure: example PDF with the mode, median, and mean marked; percentile markers (25th, median, 75th) shown alongside]
–  Estimation from empirical data (Matlab: ksdensity; R: density)
•  Cumulative distribution function (CDF)
–  CDF(x) is the probability that the metric takes a value less than x
–  CDF(x) = ∫_{−∞}^{x} PDF(u) du = Pr[metric ≤ x]
–  Estimation (R: ecdf; see the sketch below)
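A minimal sketch in R of the estimation steps above, using simulated measurements (any real sample of a metric can replace x):

  # Estimate the PDF and CDF of a measured metric from a sample
  set.seed(42)
  x <- rlnorm(1000, meanlog = 3, sdlog = 0.5)   # simulated, skewed metric

  pdf_est <- density(x)                # kernel density estimate of the PDF
  cdf_est <- ecdf(x)                   # empirical CDF (a step function)

  cdf_est(30)                          # Pr[metric <= 30]
  quantile(x, c(0.25, 0.50, 0.75))     # 25th percentile, median, 75th percentile
  plot(pdf_est); plot(cdf_est)         # visualize both estimates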
Pseudo-Random Number Generation
•  Example: linear congruential (LC) generator (a minimal sketch in R follows below)
–  Xi+1 = (A * Xi + B) mod M
–  Sensitive to the choice of parameters A, B and M
–  With A = 214013, B = 2531011, M = 2^32, the orbit is a complete permutation: every 32-bit integer is generated exactly once
•  When operations are done on 32-bit integers, there is no need for an explicit modulus operation
–  Not a cryptographically secure random number generator
•  More on this later
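To make this concrete, here is a minimal sketch of the LC generator in R (parameters as on the slide; plain doubles represent every intermediate product here exactly, since all values stay below 2^53):

  # Linear congruential generator: X[i+1] = (A * X[i] + B) mod M
  lcg_next <- function(x, A = 214013, B = 2531011, M = 2^32) (A * x + B) %% M

  lcg_orbit <- function(seed, n) {     # generate n values starting from a seed
    out <- numeric(n)
    x <- seed
    for (i in seq_len(n)) {
      x <- lcg_next(x)
      out[i] <- x
    }
    out
  }

  orbit <- lcg_orbit(seed = 1, n = 10000)
  hist(orbit / 2^32)                   # scaled to [0,1): roughly uniform

Scaling by 2^32 maps the orbit into [0, 1), which gives a quick visual check of the uniformity property mentioned above.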
Statistical Inference
•  You must understand how to interpret data correctly
•  Statistical inference: methods for drawing conclusions about a population from sample data
•  Two key methods
–  Confidence intervals
–  Hypothesis tests (significance tests)

Confidence Intervals
What is the range of likely values?
•  95% confidence interval for the sample mean
–  If we repeated the experiment 100 times, we expect that this interval would include the mean 95/100 times
–  CI = μ ± 1.96 σ/√n, where μ: mean, σ: standard deviation, n: number of elements
•  Why 95%?
–  No good reason, but widely used
•  You can compute confidence intervals for many statistical measures
–  Variance, slope of regression line, effect size, etc.

Hypothesis Tests
Is a result statistically significant?
•  Compare an experimental group and a control group
–  H0: Null Hypothesis = No difference between the groups
–  H1: Alternative Hypothesis = Significant difference between the groups
•  Hypothesis tests (see the sketch below)
–  t-test: are the means significantly different? (R: t.test)
•  One-tailed (μ1 > μ2), two-tailed (μ1 ≠ μ2)
•  Paired (difference between pairs of measurements)
–  χ² goodness-of-fit test: does the empirical data match a probability distribution (or some other hypothesis about the data)? (R: chisq.test)
–  Analysis of Variance (ANOVA): is there a difference among a number of treatments? Which factors contribute most to the observed variability? (R: anova)
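A minimal sketch of a t-test in R, on simulated control and treatment measurements (the lecture does not fix a data set); t.test reports both a p-value and a 95% confidence interval for the difference in means:

  # Two-sample t-test: does the treatment change the mean?
  set.seed(1)
  control   <- rnorm(50, mean = 100, sd = 10)
  treatment <- rnorm(50, mean = 105, sd = 10)

  res <- t.test(treatment, control)    # two-tailed by default
  res$p.value                          # p < 0.05 => statistically significant
  res$conf.int                         # 95% CI for the difference in means

  t.test(treatment, control, alternative = "greater")   # one-tailed variant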
Hypothesis Tests – How Different is Different?
Is a result statistically significant?
•  How do we know the difference in two treatments is not just due to chance?
–  We don't. But we can calculate the odds that it is.
•  The p-value = likelihood of the observed result if H0 is true
–  In repeated experiments at this sample size, how often would you see a result at least this extreme, assuming the null hypothesis?
–  p < 0.05: the difference observed is statistically significant
–  p > 0.05: the result is inconclusive
–  Why 5%? Again, no good reason, but widely used.
!  A non-significant difference is not the same as no difference
!  A significant difference is not always an interesting difference

Sampling
What can you tell about a population by observing a sub-sample?
•  Sometimes you may choose your sample size (or sampling rate)
–  Rule of thumb: 10% is usually OK for large data
–  Strategies (see the sketch below):
•  Uniform sampling: randomly keep 1 out of 10 data points (R: sample)
•  Stratified sampling: for each city, keep an equal number of rows
–  Useful trick: sample based on the output of a cryptographic hash (e.g. MD5)
•  Output bits of the hash are uniformly distributed regardless of the input
•  Bootstrapping: how to extrapolate a property Q
–  Want Q(sample) → Q(whole population)
–  Key idea: observe the distribution of Q on several sub-samples
•  How well can you extrapolate Q(sub-sample) → Q(sample)?
–  Useful when the sample size is insufficient for inference
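A minimal sketch of uniform sampling and bootstrapping in R (the data frame and the extrapolated property Q, here the median, are illustrative assumptions):

  # Uniform sampling: keep 10% of the rows at random
  set.seed(7)
  df  <- data.frame(latency = rexp(10000, rate = 0.1))
  sub <- df[sample(nrow(df), size = 0.1 * nrow(df)), , drop = FALSE]

  # Bootstrap: observe the distribution of Q on resampled sub-samples
  Q <- median
  boot_est <- replicate(1000, Q(sample(sub$latency, replace = TRUE)))
  quantile(boot_est, c(0.025, 0.975))  # rough 95% interval for Q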
Correlation and Regression
Are two factors related?
•  Correlation coefficient R (R: cor; see the sketch below)
~  1: positive correlation (when X grows, Y grows too)
~  −1: negative correlation (when X grows, Y goes down)
~  0: no correlation
–  p-value for the test of R ≠ 0; depends on the sample size (R: cor.test)
!  Compute the correlation coefficient only if you think that the relationship between X and Y is linear
!  Correlation is not causation
•  Regression (R: lm)
–  Fit the linear model y = ax + b
•  Typically using the least squares method
•  Some methods are robust to outliers (R package: minpack.lm)
[Figure: scatter plot with a fitted regression line; corr. coeff. R = −0.87, slope a = −0.005, intercept b = 37.29]
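A minimal sketch in R, on simulated data resembling the fit in the figure (the variable names and coefficients are illustrative):

  # Correlation coefficient and least-squares linear fit
  set.seed(3)
  x <- runif(100, 0, 5000)
  y <- 37.29 - 0.005 * x + rnorm(100, sd = 2)

  cor(x, y)                            # correlation coefficient R
  cor.test(x, y)                       # significance test for the correlation

  fit <- lm(y ~ x)                     # linear model y = a*x + b
  coef(fit)                            # intercept b and slope a
  plot(x, y); abline(fit)              # data with the fitted line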
The "Curse" of Big Data
When you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power. (Vincent Granville)
•  Marginal cost of collecting more data is essentially zero!
–  But while this decreases variance, it amplifies bias
–  Example: you log all clicks to your website to model user behavior, but this only samples current users, not the users you want to attract
–  Vincent Granville's example: http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data
•  Taleb's "Black Swan" events
–  The turkey's model of human behavior

Threats to Validity in Security Experiments
•  Construct validity
–  Use metrics and measures that actually model the hypotheses
–  Example: only cyber attackers know the ground truth about attacks
•  Internal validity
–  Establish a causal connection between independent and dependent variables
–  Example: recording the behavior of malware on a clean testbed, or of goodware on an infected testbed?
•  Content validity
–  Exclude irrelevant data points and include all the relevant ones
–  Example: a malware corpus reflects the most common attacks; what about zero-day malware?
•  External validity
–  Results must generalize to systems beyond the scope of the study
–  Example: my detection technique works well today; will it work tomorrow?

Machine Learning
Systems that automatically learn programs from data (P. Domingos, CACM 2012)
•  Supervised learning: have inputs and associated outputs
–  Learn relationships between them using available training data (also called "labeled data", "ground truth")
–  Predict future values
–  Classification: the output (learned attribute) is categorical
–  Regression: the output (learned attribute) is numeric
•  Unsupervised learning: have only inputs
–  Learn "latent" labels
–  Clustering: identify natural groups in the data

Rules
Weather and golf
outlook   temp humidity windy play
overcast  cool normal   TRUE  yes
overcast  hot  high     FALSE yes
overcast  hot  normal   FALSE yes
overcast  mild high     TRUE  yes
rainy     cool normal   TRUE  no
rainy     mild high     TRUE  no
rainy     cool normal   FALSE yes
rainy     mild high     FALSE yes
rainy     mild normal   FALSE yes
sunny     hot  high     FALSE no
sunny     hot  high     TRUE  no
sunny     mild high     FALSE no
sunny     cool normal   FALSE yes
sunny     mild normal   TRUE  yes
•  Want to decide when to play
–  Create rules based on attributes
•  Example with 1 attribute: if (outlook == "rainy") then play = "no" else play = "yes"
–  Errors: 6/14
•  Can refine the rule by adding conditions on other attributes
–  Create a decision tree

Resulting Decision Tree
•  Putting the decision tree together
–  Several attribute selection strategies (e.g. maximize information gain)
–  Create branches for each value of the attribute
–  Discretize continuous attributes (choose the partition with the highest gain)
–  R package: rpart (see the sketch below)
•  Not a perfect classification (still makes some incorrect decisions)

Confusion Matrix
How to determine if the classifier does a good job?
•  You need a training set (ground truth) and a testing set
–  Or you can split your ground truth into two data sets
–  Even better: K-fold cross-validation
•  Select K samples without replacement and train the classifier multiple times
•  You can make a mistake in two different ways

          Predicted −                             Predicted +
True −    True Negative (TN): correct decision    False Positive (FP): Type 1 error
True +    False Negative (FN): Type 2 error       True Positive (TP): correct decision
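A minimal sketch of growing and checking a decision tree on the weather table above, using rpart (the control setting is an assumption needed only because the table is so small; rpart's defaults refuse to split 14 rows):

  library(rpart)

  weather <- data.frame(
    outlook  = c("overcast","overcast","overcast","overcast","rainy","rainy",
                 "rainy","rainy","rainy","sunny","sunny","sunny","sunny","sunny"),
    temp     = c("cool","hot","hot","mild","cool","mild","cool","mild","mild",
                 "hot","hot","mild","cool","mild"),
    humidity = c("normal","high","normal","high","normal","high","normal","high",
                 "normal","high","high","high","normal","normal"),
    windy    = c(TRUE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,
                 FALSE,TRUE,FALSE,FALSE,TRUE),
    play     = c("yes","yes","yes","yes","no","no","yes","yes","yes",
                 "no","no","no","yes","yes"),
    stringsAsFactors = TRUE)

  fit  <- rpart(play ~ ., data = weather, method = "class",
                control = rpart.control(minsplit = 2))   # allow splits on tiny data
  pred <- predict(fit, weather, type = "class")
  table(actual = weather$play, predicted = pred)         # confusion matrix

Evaluating on the training set, as here, is optimistic; as the confusion-matrix slide notes, an honest evaluation needs a held-out test set or K-fold cross-validation.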
Evaluating Results
Is it better to have low FPs or low FNs?
•  There is usually a trade-off between FPs and FNs
–  Reducing type 1 errors causes more type 2 errors, and vice-versa
•  Specificity = TN / (FP + TN)
–  Ability to rule out true negatives
–  Also called the true negative rate
•  Sensitivity = TP / (TP + FN)
–  Ability to identify true positives
–  Also called the true positive rate
•  Can plot a Receiver Operating Characteristic (ROC) curve (see the sketch below)
–  R package: ROCR
•  Example: evaluating keystroke dynamics [Killourhy & Maxion, DSN'09]
[Screenshot of the Killourhy & Maxion paper. §5.14 describes the k-means detector (due to Kang et al.): training runs the k-means algorithm with k = 3 on the training vectors, and the anomaly score of a test vector is its Euclidean distance to the nearest of the three centroids. §6 describes the evaluation methodology: each of the 14 detectors is trained on the timing vectors of the first 200 password repetitions typed by a designated genuine user (one of 51 subjects) and tested against the genuine user and the remaining subjects as impostors. Figure 1 shows an example ROC curve (hit rate vs. false-alarm rate, with the zero-miss and equal-error rates marked) for the Nearest Neighbor (Mahalanobis) detector with subject 19 as the genuine user; proximity to the top-left corner of the graph is a visual measure of performance.]
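A minimal sketch of an ROC curve with the ROCR package, using simulated anomaly scores and labels as stand-ins for a detector's output:

  library(ROCR)

  set.seed(9)
  labels <- c(rep(0, 200), rep(1, 200))         # 0 = genuine user, 1 = impostor
  scores <- c(rnorm(200, 0), rnorm(200, 1.5))   # higher score = more anomalous

  pred <- prediction(scores, labels)
  perf <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(perf)                                    # hit rate vs. false-alarm rate
  performance(pred, "auc")@y.values[[1]]        # area under the ROC curve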
Unsupervised Learning
•  Agglomerative hierarchical clustering (R: hclust; see the sketch below)
–  No ground truth; the goal is to identify patterns that describe the data
–  Start from individual points and progressively merge nearby clusters
–  Distance metric (e.g. Euclidean, rank correlation, Gower)
–  Linkage: how to aggregate pairwise point distances into cluster distances
•  Average? Minimum (single)? Maximum (complete)? Variance decrease (Ward)?
[Figure: dendrogram of 1970s cars (features: MPG, weight, drive ratio)]
!  Choose classification or clustering features carefully
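A minimal sketch of agglomerative hierarchical clustering in R; the built-in mtcars data (1970s cars) stands in for the slide's example, with comparable features:

  # Cluster cars on MPG, weight, and rear-axle (drive) ratio
  features <- scale(mtcars[, c("mpg", "wt", "drat")])  # put features on one scale

  d   <- dist(features, method = "euclidean")  # pairwise distance matrix
  fit <- hclust(d, method = "ward.D2")         # Ward linkage (variance decrease)

  plot(fit)                                    # dendrogram of the merges
  cutree(fit, k = 3)                           # cut into 3 clusters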
Other Machine Learning Algorithms
•  Supervised learning
–  We saw: decision trees
–  Other classifiers: naïve Bayes, Support Vector Machines (SVM)
–  Natural language processing
•  Text mining (R package: tm)
•  Natural Language Toolkit (NLTK)
•  Unsupervised learning
–  We saw: hierarchical clustering
–  Other clustering techniques: k-means, k-medoids, time series clustering
–  Dimensionality reduction: principal component analysis (PCA)

Additional References
•  NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm
•  Andrew Ng's Coursera class on machine learning: https://class.coursera.org/ml-005/lecture/preview
•  ENEE 759G: Data Mining and Knowledge Discovery
•  Machine learning tools
–  For R: http://cran.r-project.org/web/views/MachineLearning.html
–  For Hadoop: Mahout (http://mahout.apache.org/)

Inference Story – Witty Worm
•  Vulnerability in the ICQ filtering module of ISS BlackICE intrusion detectors
–  Bug: sprintf(buf, "%s", pkt);
Why is this bad?
•  Should specify the size of the string, e.g. sprintf(buf, "%9s", pkt); otherwise a long packet overflows buf
–  Debugging code accidentally left in the released product
•  Witty worm
–  Exploit = single UDP packet from source port 4000 (ICQ)
–  Payload contains "(^.^ insert witty message here ^.^)", deletes random disk sectors
–  After infection, scans a random permutation of the IP address space for vulnerable hosts
•  Chronology of Witty
–  Mar 8, 2004: vulnerability discovered by eEye
–  Mar 18, 2004: high-level description published
–  36 hours later: worm released
–  75 mins later: all 12,000 vulnerable machines infected!

Detection with Network Telescopes
•  Network telescopes monitor packets sent to unused blocks of Internet address space
–  Useful for detecting scanning worms, DDoS
•  CAIDA/UCSD Network Telescope
–  Monitors the 23.0.0.0/8 block of IPv4 address space
•  All addresses with a particular first byte
•  Saw ~4 out of 1000 packets sent by each Witty infectee (why?)
•  Telescopes allow inferring basic properties of worm propagation
–  Scanning rate (aggregate and per-host average), host effective bandwidth
–  Distortions: worm propagation may overwhelm the monitor

What Else Can You Infer …
… if you understand how the malware works?

Pseudocode of Witty (1) [Kumar, Paxson, Weaver]
1.  srand(get_tick_count())                      // seed pseudo-random generator
2.  for (i = 0; i < 20,000; i++)
3.    destIP     ← rand()[0..15] | rand()[0..15]
4.    destPort   ← rand()[0..15]                 // each Witty packet contains bits
5.    packetSize ← 768 + rand()[0..8]            // from 4 consecutive pseudo-random numbers
6.    packetContents ← top of stack
7.    send packet to destIP/destPort
8.  if (open(physicaldisk, rand()[13..15])) write(rand()[0..14] || 0x4E20); goto 1
9.  else goto 2
Witty's PRNG [Kumar, Paxson, Weaver]
•  Recall the linear congruential PRNG: Xi+1 = (A * Xi + B) mod M
•  Can reconstruct the entire state of the generator from a single packet, and predict future & past values
–  destIP ← (Xi)[0..15] | (Xi+1)[0..15]
–  destPort ← (Xi+2)[0..15]
–  Given the top 16 bits of Xi, try all possible lower 16 bits and check if they yield Xi+1 and Xi+2 consistent with the observations
•  For each packet recorded by the telescope, try 2^16 possibilities to recover the PRNG state (see the sketch below)

Estimating Infectee's Bandwidth [Kumar, Paxson, Weaver]
•  Suppose two consecutively received packets from a particular infectee have states Xi and Xj
•  Compute j−i
–  Count the number of PRNG "turns" between Xi and Xj
–  Infer the number of packets sent by the infectee between the two observations: (j−i)/4 (why?)
•  sendto() in Windows is blocking (means what?)
•  Access bandwidth of infectee = (j−i)/4 * packet size / ΔT
–  Does this work in the presence of packet loss?
•  Can also compute effective bandwidth
–  Take the observed packet rate from a source IP
–  Correct by the fraction of the IPv4 space sampled by the telescope
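A toy sketch in R of the state-recovery idea above (simplified as on the slide: it assumes we observe the top 16 bits of three consecutive states, e.g. from the destIP and destPort fields, and it reuses the LC parameters from earlier in the lecture; the real analysis works from actual packet fields):

  A <- 214013; B <- 2531011; M <- 2^32
  lcg_next <- function(x) (A * x + B) %% M
  top16    <- function(x) x %/% 2^16

  X  <- 123456789                       # hidden internal state X[i]
  o1 <- top16(X)                        # observed top 16 bits of X[i]
  o2 <- top16(lcg_next(X))              # ... of X[i+1]
  o3 <- top16(lcg_next(lcg_next(X)))    # ... of X[i+2]

  # Brute-force the unknown lower 16 bits of X[i]: 2^16 candidates
  cand <- o1 * 2^16 + 0:(2^16 - 1)
  cand <- cand[top16(lcg_next(cand)) == o2]            # consistent with X[i+1]
  cand <- cand[top16(lcg_next(lcg_next(cand))) == o3]  # ... and with X[i+2]
  cand                                  # recovers the full state X[i]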
Effective vs. Access Bandwidth [Kumar, Paxson, Weaver]
•  30% distortion of telescope observations due to congestion on the inbound link
[Figure 7 from the paper: scatter plot of estimated effective bandwidth vs. access bandwidth (bits per sec., log-log scale), with 10 Mbps and 100 Mbps reference marks and a y = 0.7x trend line. As expected, all points lie below the diagonal, indicating that the effective bandwidth never exceeds the access bandwidth, and is often lower by a significant factor.]
[Figure 8 from the paper: aggregate worm traffic, in packets per second over time, logged at the telescope.]
What Else Can We Infer? [Kumar, Paxson, Weaver]
•  Compute the seeds used for reseeding
–  srand(get_tick_count()) – determine host uptime
•  Compute the exact random number used for each subsequent disk-wipe test
–  Can determine whether it succeeded or failed, and thus the number of drives attached to each infectee
•  Compute the who-infected-whom tree
–  Compare when packets were sent to a given address and when this address started sending packets

Bug in Witty's PRNG [Kumar, Paxson, Weaver]
•  Witty uses a permutation PRNG, but only uses the 16 highest bits of each number
–  Knuth's advice: higher-order bits of an LC PRNG provide better uniform randomness
–  Result: the orbit is not a complete permutation; it misses approximately 10% of the IP address space and visits 10% twice
•  … but telescope data indicates that some hosts in the missed space still got infected
–  How?

Witty's Hitlist and Patient Zero [Kumar, Paxson, Weaver]
•  Some hosts in the unscanned space got infected very early in the outbreak
–  Potentially on a hitlist (pre-computed list of IP addresses to be targeted first)
•  Prevalent /16 = U.S. military base (Fort Huachuca)
–  Authors' conclusion: Witty targeted the US military!
•  A peculiar "infectee" shows up in the telescope observation data early in the Witty outbreak
–  Sending packets with destination IP addresses that could not have been generated by Witty's PRNG (runs different code)
–  Probably patient zero: the source of the Witty outbreak
•  IP address belongs to a European retail ISP; information passed to law enforcement

Was There a Hitlist? [Robert Graham]
Gotta be a hitlist, right? Typical worm propagation curve (more on this later)
•  Alternative explanation: the initially infected BlackICE copies were running as network intrusion detectors in promiscuous mode, monitoring a huge fraction of DoD address space
–  20% of the entire Internet => likely to "receive" packets faster than individual hosts
–  Had IP addresses in the Fort Huachuca IP range
–  Explanation supported by analysis of padding in Witty packets
http://blog.erratasec.com/2014/03/witty-worm-no-seed-population-involved.html

Things to Take Away
•  You can make interesting inferences if you combine
–  Empirical measurements
–  Knowledge of the underlying process generating the data
•  Think about the threats to the validity of your interpretation
–  Low p-values are not enough; you must avoid spurious correlations
•  Attack / target attribution is generally challenging
–  Analysis may be affected by spurious correlations
–  Attacker may leave traces that point fingers at someone else

Sources
•  Kumar et al., "Exploiting underlying structure for detailed reconstruction of an internet-scale event," IMC 2005
•  Various slides from Bill Howe and Vitaly Shmatikov

Review of Lecture
•  What did we learn?
–  Confidence intervals
–  Hypothesis tests
–  Correlation and regression
–  Classification
–  Clustering
•  What's next?
–  Adversary models and cryptography review
•  Deadline reminder
–  Post the pilot project proposal on Piazza by the end of today
–  If you want to propose a topic that is not on the list from the course web page, come see me during office hours after class