Security Data Science (SDS) Prof. Tudor Dumitraș Assistant Professor, ECE

Security Data Science (SDS)
ENEE 759D | ENEE 459D | CMSC 858Z
Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD
Introducing Your Instructor
Tudor Dumitraș
Office: AVW 3425
Email: [email protected]
Course Website: http://ter.ps/759d
Office Hours: Mon 2-3 pm
2
My Background
• Ph.D. at Carnegie Mellon University
– Research in distributed systems and fault-tolerant middleware
• Worked at Symantec Research Labs
– Built WINE platform for Big Data experiments in security
– WINE currently used by academic researchers and
Symantec engineers
• Joined UMD faculty
• Research and teaching on applied security and systems
– Focus on solving security problems with data analysis techniques
3
SDS In A Nutshell
• Course objectives
– Ability to understand and interpret scholarly publications, to explain their
key ideas, and to provide constructive feedback
– Ability to apply some of these ideas in practice
• Topics
Vulnerabilities and exploits
Spam infrastructures
Failures of cryptosystems
Pay per install
Internet worms
Attacks against physical infrastructure
Denial of service
Targeted attacks
Botnets
Economic implications of cybercrime
• Grading
– 50% paper reviews and class participation
– 50% projects
4
We Are Swimming in Data
• Data created/reproduced in 2010: 1,200 exabytes
• Data collected to find the Higgs boson: 1 gigabyte / s
• Yahoo: 200 petabytes across 20 clusters
• Security:
– Global spam in 2011: 62 billion / day
– Malware variants created in 2011: 403 million
5
Why So Much Data?
• We can store it
– 6¢ / GB
– 29¢ / GB (SAS HDD)
• We can generate it
– Most data is machine-generated
– Most malware samples are variants of other malware, generated
automatically (repacking, obfuscation)
What to do with all this data?
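To get a feel for these magnitudes, the figures quoted on the last two slides can be turned into rates and costs. The following is a back-of-the-envelope sketch in Python, assuming decimal units (1 PB = 10^6 GB), a 365-day year, and that the 6¢ / GB price applies uniformly; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_PB = 10**6  # assumption: decimal units

# Figures taken from the slides
spam_per_day = 62e9           # global spam in 2011, messages per day
variants_per_year = 403e6     # malware variants created in 2011
yahoo_storage_pb = 200        # Yahoo: 200 petabytes
higgs_gb_per_second = 1       # data collected to find the Higgs boson
price_per_gb_usd = 0.06       # 6 cents per GB

# Derived quantities
spam_per_second = spam_per_day / SECONDS_PER_DAY
variants_per_day = variants_per_year / 365  # assumption: spread evenly over the year
higgs_tb_per_day = higgs_gb_per_second * SECONDS_PER_DAY / 1000
yahoo_cost_usd = yahoo_storage_pb * GB_PER_PB * price_per_gb_usd

print(f"Spam messages per second: {spam_per_second:,.0f}")
print(f"New malware variants per day: {variants_per_day:,.0f}")
print(f"Higgs data per day: {higgs_tb_per_day:.1f} TB")
print(f"Cost to store 200 PB: ${yahoo_cost_usd:,.0f}")
```

Under these assumptions, the spam volume works out to roughly 718,000 messages per second, and storing 200 PB at 6¢ / GB costs about $12 million.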
6
Three Stories about Data
7
WHAT QUESTIONS TO ASK ON A FIRST DATE?
The Power of Big Data
8
If You Want to Know …
Do my date and I have long-term potential?
9
If You Want to Know …
Do my date and I have long-term potential?
… ask:
275,000 user submitted questions
34,260 real world couples
Q: Do you like horror movies?
Q: Have you ever traveled around another country alone?
Q: Wouldn't it be fun to chuck it all and go live on a sailboat?
3.7×
Top 3 user rated
questions, about:
• God
• Sex
• Smoking
Psychology
Likelihood of
coincidence
Data
10
Online Dating and Big Data
• eHarmony
– Analyzes hundreds of behavioral variables, most collected automatically
– CTO: former search engineer at Yahoo!
• OkCupid
– “We do math to get you dates”
– Founded by Harvard math & CS majors
• PlentyOfFish
– “Building this matching system was harder than [being] cited in the paper that won the Fields Medal”
Source: CNN Money
11
Early 1900s: Most Factories Had Private Generators
Source: Nicholas Carr
Electricity was critical for business, but not widely available
12
Is he an engineer? Does she date engineers?
Data analytics provide remarkable insight
Applications in many disciplines
Source: OkCupid
13
What Is Data Science?
• Also known as …
… Big Data analytics
… Machine intelligence
… Data-intensive computing
… Data wrangling
… Data munging
… Data jujitsu
Source: Drew Conway
14
IMPROVING MACHINE TRANSLATION
The Unreasonable Effectiveness of Data
15
2005 NIST Machine Translation Competition
English-Arabic competition
• Google’s first entry
– None of the engineers spoke Arabic
• Simple statistical approach
• Trained using United Nations
documents
– 200 million translated words
– 1 trillion monolingual words
16
For many hard problems
there appears to be a
threshold of sufficient data
A. Halevy, et al., CACM 2009.
17
What is Security Data Science?
• Also known as …
… Security analytics
… Surveillance analytics
• Applying data science methods to security problems
18
Security Principles in 60 Seconds
[J. Saltzer & M. Schroeder, SOSP 1973]
• Economy of mechanism: Keep the protection mechanism as
simple and small as possible
• Fail-safe defaults: Base access decisions on permission rather
than exclusion
• Complete mediation: Check every access to every object
• Open design: Do not keep the design secret
• Separation of privilege: Require two keys to unlock, not one
• Least privilege: Grant every program/user the least set of
privileges necessary to complete the job
• Least common mechanism: Minimize the amount of mechanism
common to more than one user and depended on by all users
• Psychological acceptability: Design interfaces for ease of use
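Three of the principles above (fail-safe defaults, complete mediation, least privilege) can be illustrated with a few lines of code. This is a minimal sketch, not taken from the Saltzer and Schroeder paper; the `AccessController` class and the user and resource names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessController:
    # Maps (user, resource) to the set of actions explicitly granted.
    grants: dict = field(default_factory=dict)

    def grant(self, user: str, resource: str, action: str) -> None:
        self.grants.setdefault((user, resource), set()).add(action)

    def is_allowed(self, user: str, resource: str, action: str) -> bool:
        # Fail-safe default: anything not explicitly granted is denied.
        return action in self.grants.get((user, resource), set())

    def access(self, user: str, resource: str, action: str) -> str:
        # Complete mediation: every access goes through the same check.
        if not self.is_allowed(user, resource, action):
            raise PermissionError(f"{user} may not {action} {resource}")
        return f"{user} performed {action} on {resource}"


acl = AccessController()
# Least privilege: alice only needs to read the report, so only "read" is granted.
acl.grant("alice", "report.txt", "read")

print(acl.access("alice", "report.txt", "read"))
print(acl.is_allowed("alice", "report.txt", "write"))  # never granted
print(acl.is_allowed("bob", "report.txt", "read"))     # unknown user
```

The decision is based on permission rather than exclusion: there is no deny list to forget to update, so a new user or a new action is rejected until someone grants it.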
19
Security in Practice
(Source: C. Nachenberg, Symantec)
• 1986: Simple computer viruses
– Defense: anti-virus
• 1990: Polymorphic viruses (decryption logic + encrypted malicious code)
– Defense: “universal” decoder, emulation
• 1995: Macro viruses
– Defense: AV vendor cooperation, digital signatures for macros
• 1999: Worms
– Defense: Vulnerability-specific signatures
• 2004: Web-based malware
– Defense: behavior blocking
• 2006: Auto-generated malware
– Defense: reputation based security
• 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.)
– Defense: ??
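The move away from exact-match signatures is easier to see with a toy example. The sketch below uses a harmless string only; the `repack` helper and its one-byte XOR encoding are stand-ins for real packers and are not something the slides describe. It shows why a file-hash signature stops matching once every copy is re-encoded with a different key, even though the decoded content is identical.

```python
import hashlib


def repack(payload: bytes, key: int) -> bytes:
    """Toy 'packer': XOR every byte with a one-byte key and prepend the key."""
    return bytes([key]) + bytes(b ^ key for b in payload)


def unpack(packed: bytes) -> bytes:
    """Inverse of repack: read the key from the first byte and decode the rest."""
    key, body = packed[0], packed[1:]
    return bytes(b ^ key for b in body)


payload = b"the same harmless payload"
variants = [repack(payload, key) for key in (1, 2, 3)]

# One exact-match "signature" (SHA-256 of the file bytes) per variant.
signatures = {hashlib.sha256(v).hexdigest() for v in variants}

print(f"{len(variants)} variants, {len(signatures)} distinct signatures")
print("All variants decode to the same payload:",
      all(unpack(v) == payload for v in variants))
```

Each variant needs its own signature, which is one reason automatically generated malware pushed defenses toward approaches that do not depend on the exact bytes of a file.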
20
UNDERSTANDING ZERO-DAY ATTACKS
The Need for Security Data Science
21
Zero-Day Attacks: Recent Examples
Zero-day attack = cyber attack exploiting a software vulnerability
before the public disclosure of the vulnerability
2011: Attack against RSA
2010: Stuxnet
2009: Operation Aurora
against Google
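The definition above reduces to a date comparison: an attack counts as zero-day if the exploit was used before the vulnerability was publicly disclosed. A minimal sketch, with hypothetical dates and function names chosen for illustration:

```python
from datetime import date


def is_zero_day(first_attack: date, disclosure: date) -> bool:
    """True if the exploit was used in attacks before public disclosure."""
    return first_attack < disclosure


def zero_day_window_days(first_attack: date, disclosure: date) -> int:
    """Days the attack went on before disclosure (0 if it started afterwards)."""
    return max((disclosure - first_attack).days, 0)


# Hypothetical dates, not taken from any real incident
disclosure = date(2010, 3, 1)
early_attack = date(2010, 1, 4)
late_attack = date(2010, 3, 8)

print(is_zero_day(early_attack, disclosure), zero_day_window_days(early_attack, disclosure))
print(is_zero_day(late_attack, disclosure), zero_day_window_days(late_attack, disclosure))
```

In practice the hard part is not the comparison but obtaining a trustworthy `first_attack` date, which requires observing the exploit in the wild.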
22
Price of Zero-Day Exploits on the Black Market
The Economist, March 2013
23
The Elderwood Project
Group with “seemingly unlimited” supply of zero-day exploits
(Source: Symantec)
Excerpt from the Symantec report:
Hydraq Trojan also displayed this obfuscation.
Additional links joining the various exploits together included a shared command-and-control infrastructure.
Trojans dropped by different exploits were connecting to the same servers to retrieve commands from the
attackers. Some compromised websites used in the watering hole attacks had two different exploits injected into
them one after the other. Yet another connection is the use of similar encryption in documents and malicious
executables. A technique used to pass data to a SWF file was re-used in multiple attacks. Finally, the same family
of Trojan was dropped from multiple different exploits.
Figure 7 illustrates the connections between the various exploits.
Figure 7: Links between different exploits
24
Zero-Day Attacks: Open Questions
Decade-long open questions
• How common are zero-day attacks?
• How long can they remain undiscovered?
• What happens after disclosure?
Prior work: [Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012]
[Figure: vulnerability timeline: Creation → Exploit used in attacks → Vulnerability disclosed (“day zero”) → Security patch released → All hosts patched. The “zero-day attack” label marks the period before disclosure.]
25
Zero-Day Attacks: Open Questions (cont’d)
Decade-long questions: Why still open?
• Rare events, hard to observe in small data sets
• Need data analysis at scale
[Figure: number of malware variants (log scale, 1 to 100,000) per vulnerability vs. time in weeks relative to disclosure (t0), from -100 to 150 weeks. Before disclosure: targeted attacks (rare events). After disclosure: large-scale attacks.]
[Vulnerabilities shown: CVE-2008-0015, CVE-2008-2249, CVE-2009-0084, CVE-2009-0561, CVE-2009-0658, CVE-2009-1134, CVE-2009-2501, CVE-2009-3126, CVE-2009-4324, CVE-2010-0028, CVE-2010-0480, CVE-2010-1241, CVE-2010-2862, CVE-2010-2883, CVE-2011-1331]
[Timeline: Creation → Exploit used in attacks → Vulnerability disclosed (“day zero”) → Security patch released → All hosts patched]
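The kind of analysis behind this slide can be sketched as follows: align each observation to the disclosure date of the vulnerability it exploits, then count distinct malware variants before and after "day zero". The records, the placeholder identifiers (`CVE-A`, `CVE-B`), and the field layout below are hypothetical; a real study would run this over a far larger telemetry data set.

```python
from collections import defaultdict
from datetime import date

# Hypothetical disclosure dates, keyed by placeholder vulnerability IDs
disclosure = {
    "CVE-A": date(2010, 3, 1),
    "CVE-B": date(2010, 6, 1),
}

# Hypothetical telemetry: (vulnerability, file hash of the variant, date seen)
observations = [
    ("CVE-A", "a1", date(2010, 1, 4)),
    ("CVE-A", "a2", date(2010, 3, 8)),
    ("CVE-A", "a3", date(2010, 3, 15)),
    ("CVE-A", "a2", date(2010, 3, 20)),  # same variant seen again
    ("CVE-B", "b1", date(2010, 6, 10)),
    ("CVE-B", "b2", date(2010, 6, 11)),
]


def weeks_from_disclosure(cve: str, seen: date) -> int:
    """Week offset relative to disclosure; negative means before day zero."""
    return (seen - disclosure[cve]).days // 7


def summarize(records):
    """Per vulnerability: (distinct variants before disclosure, distinct variants after)."""
    variants = defaultdict(lambda: {"before": set(), "after": set()})
    for cve, file_hash, seen in records:
        side = "before" if seen < disclosure[cve] else "after"
        variants[cve][side].add(file_hash)
    return {cve: (len(v["before"]), len(v["after"])) for cve, v in variants.items()}


summary = summarize(observations)
zero_day_cves = sorted(cve for cve, (before, _) in summary.items() if before > 0)
first_seen_week = {
    cve: min(weeks_from_disclosure(c, seen) for c, _, seen in observations if c == cve)
    for cve in disclosure
}

print(summary)
print("Exploited before disclosure:", zero_day_cves)
print("First seen (weeks from disclosure):", first_seen_week)
```

Counting distinct file hashes rather than raw observations keeps repeated sightings of one variant from inflating the totals. With only a handful of pre-disclosure records per vulnerability, small data sets can easily miss them entirely, which is the point of the slide.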
26
Research in Security Data Science
Challenge 1: Find the needle in the haystack
– Example: Identify and measure zero-day attacks
[Figure: number of malware variants (log scale, 10 to 10^5) vs. time in weeks relative to disclosure (T0), from -100 to 150 weeks. Targeted attacks before disclosure are rare events among the 403 million new malware variants created in 2011.]
Challenge 2: Ensure generally applicable and repeatable results
– The threat landscape changes frequently
Challenge 3: Deal with new and advanced threats
– Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation
[…]
Your thesis topic goes here
27
What is Security Data Science? (re-visited)
• Systems knowledge: develop technologies needed to store and
process massive data sets
• Statistics & machine learning knowledge: analyze the data and
extract information
• Security knowledge: ask the right questions about cyber attacks
• Data scientists are in high demand in the cybersecurity industry
Booz Allen may be recruiting more
[data scientists] than Google or Facebook
The Economist, June 2013
28
Course Content
• Introduction to Security Data Science
• Hands-on emphasis – this is largely an unexplored research area
– Team-based projects
– Reviews of scholarly publications
– No textbook
• Specific things you can expect to learn
– Selected topics in security
– System skills: Experiment design, data analysis, scalability
– Team skills: Cooperating to achieve your team goals
– Speaking/writing skills: Presenting paper/project findings, providing
constructive feedback
29
This is an Advanced Course
• You are responsible for holding up your end of the educational
bargain
– I expect you to attend classes and to complete reading assignments
– I expect you to learn how to analyze data and to try things out for yourself
– I expect you to know how to find research literature on security topics
• The required readings provide starting points
– I expect you to manage your time
• In general there will be one written assignment due before each lecture
• Learning material in this course requires participation
– This is not a sit-back-and-listen kind of course; class participation is required
for understanding the material and makes up a part of your grade!
• Different grading criteria for graduate and undergraduate students
Reading Assignments
• Readings: 1-2 papers before each lecture
– Not light reading – some papers require several readings to understand
– For next time: C. Kanich et al., 'Spamalytics: An Empirical Analysis of Spam
Marketing Conversion,' ACM CCS, 2008.
– Check course web page (still in flux) for next readings and links to papers
• Homeworks: review the papers you read using a defined template
– Submit homework by email to [email protected]
• We might switch to a Web based submission system in the future
– Due at 6 pm the evening before class
– BibTeX template: Summary, Contributions, Weaknesses, Opinion (optional)
– I will provide feedback on some of your written critiques; no email means your
writeup is satisfactory
• In-class discussion: stand up and talk about the papers
– Volunteers are preferred
– Students randomly selected if no volunteers
31
Discuss …
Do my date and I have long-term potential?
… ask:
275,000 user submitted questions
34,260 real world couples
Q: Do you like horror movies?
Q: Have you ever traveled around another country alone?
Q: Wouldn't it be fun to chuck it all and go live on a sailboat?
3.7×
Top 3 user rated
questions, about:
• God
• Sex
• Smoking
Psychology
Likelihood of
coincidence
Data
32
Course Projects
• Pilot project: two-week individual projects
– Propose a security problem and a data set that you could analyze to solve it
• Some ideas are available on the web page
– Conduct preliminary data analysis and write a report
– Propose projects by September 9th (soft deadline)
– Submit report by September 18th
• Group project: ten-week group project
– Deeper investigation of promising approaches
– Submit written report and present findings during last week of class
• 2 checkpoints along the way (schedule on the course web page)
– Form teams and propose projects by September 30th
• Peer reviews: review at least 2 project reports from other students
– Use skills learned from paper reviews
– Post project proposals, reports and reviews on Piazza
33
Pre-Requisite Knowledge
• Good programming skills
– Knowledge of languages commonly used in data analysis, like Matlab or R,
is a plus
– To brush up: ‘Data Analysis and Visualization with MATLAB for Beginners’
seminar, on September 12 at 5pm, Room 1110 Kim Engineering Building
• Ability to come up to speed on advanced security topics
– Covered in the paper readings
– Basic knowledge of security (CMSC 414, ENEE 459C or equivalent) is a plus
• Ability to come up to speed on data analytics
– Lectures provide light-duty tutorials, but you will need to pick up the
details as you go along
34
Policies
• “Showing up is 80% of life” – Woody Allen
– Participation in in-class discussions is required for full credit
– You can get an “A” with a few missed assignments, but reserve these for
emergencies (conference trips, waking up sick, etc.)
– Notify the instructor if you need to miss a class, and submit your
homework on time
• UMD’s Code of Academic Integrity applies, modified as follows:
– Complete your homework entirely on your own. After you hand in your
homework, you are welcome (and encouraged) to discuss it with others
– Discuss the problems and concepts involved in the project, but produce
your own project implementation, report and presentation
• Group projects are the result of team work
• See class web site for the official version
35
Classroom Protocol
• Please arrive on time; lecture begins promptly
– I also promise to end on time
– Handouts, readings and homework templates are posted on the class web page
• Questions are encouraged
– If you don’t understand, ask; probably other students are struggling too
– Explain the content of your reading assignment, and the underlying
reasoning, to the rest of the class
– Your reasons don't have to be “right” – you just have to be able to explain
them
• There is no way to cover everything
– If there is an interesting aspect that we do not cover in class, feel free to
incorporate that in your projects
36
Grading Criteria
• Straight scale: A≥90; B≥80; C≥70; D<70
– 50% Written paper critique and class discussion
• 24 assignments x 2 points each + 2 points for this lecture
– 50% Projects
• 30 points for group project, 10 points for pilot project, 10 points for project reviews
– 10% Subjective evaluation
• Expectations
– Graduate students: you can explain the contributions and weaknesses of the
papers you read
– Undergraduates: you demonstrate a general understanding of the papers
• Unsatisfactory participation means:
– You did not read the papers
– You did not produce a working implementation for your project, or you do not
understand how the implementation works
37
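The point breakdown on the Grading Criteria slide can be written out as a short calculation. This is a sketch under stated assumptions: the function names are illustrative, scores are clamped to the maxima listed on the slide, and the 10% subjective evaluation is left out because the slide does not say how it combines with the other 100 points.

```python
def review_points(reviews_submitted: int, attended_first_lecture: bool) -> int:
    """24 assignments x 2 points each, plus 2 points for the first lecture."""
    capped = max(0, min(reviews_submitted, 24))
    return 2 * capped + (2 if attended_first_lecture else 0)


def project_points(group: float, pilot: float, peer_reviews: float) -> float:
    """Up to 30 (group project) + 10 (pilot project) + 10 (project reviews)."""
    return min(group, 30) + min(pilot, 10) + min(peer_reviews, 10)


def letter_grade(score: float) -> str:
    """Straight scale from the slide: A>=90; B>=80; C>=70; D<70."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "D"


# Example: four missed paper reviews, full marks on all projects
score = review_points(20, True) + project_points(30, 10, 10)
print(score, letter_grade(score))
```

With full project marks, a student who misses four paper reviews still ends up at 92 points, which matches the policy that an “A” is possible with a few missed assignments.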
Review of Lecture
• What did we learn?
– Data analytics provide real benefits
– Analyzing large data sets allows tackling long-standing hard problems
– Difference between security principles and security in practice
– Examples of security problems that require insights from large data sets
• I want to emphasize
– This is a systems course, not a pen-and-paper course
– You will be expected to build a real, working, data analysis tool
• What’s next?
– Basic statistics and experimental design
– Pilot project: proposal, approach, expectations
• Deadline reminder
– Post pilot project proposal on Piazza by Monday (soft deadline)
– First homework due on Sunday at 6 pm
38
Dive In
http://ter.ps/759d
39