Computing and the life sciences David Huen University of Wolverhampton

Computing and the life sciences
David Huen
University of Wolverhampton
Structure
• Some background biology
• How computing is used in some areas of
biology
– I can only describe a small subset briefly
Memorial university
Conversations with computers
The cell as a machine
http://multiple-sclerosis-research.blogspot.co.uk
DNA as code
• Heritable information is encoded on DNA
• Almost like code
• DNA is code that, when executed on an
appropriate machine, builds a very similar
machine.
Central dogma
wikipedia
Cell as a machine
Regulatory
Effector
State repertoire
External
stimulus
New
functionality
Regulatory
Effector
State change
External
stimulus
Regulatory
Effector
Persistence: memory
Changed state repertoire
External
stimulus
Different
functionality
Genomics
External
stimulus
Regulatory
Effector
Transcriptomics
Proteomics
Metabolomics
Metrics
• What different transcripts/proteins/molecule
types are out there?
• How much of each is there?
Challenges in genomics
Genomics: cheap sequencing
Drinking at the waterfall…
Nature Biotechnology 30, 627–630 (2012)
Genetic risk of disease
• Our DNA is (usually) unique.
– ~1 in 1000 bases => ~6 million differences.
– Some positions in the genome vary frequently:
common variants.
– Most variants are rare (<3%)
• These differences underlie genetic risk.
– Twin studies
High profile case of genetic risk
• Mother died from breast cancer
at 56 (common disease: 1 in 8)
• Estimated genetic risk: 1.8x
• Chose BRCA1 (and most likely
BRCA2) screening.
• Most likely by sequencing
BRCA1/BRCA2 genes
• 1-3% of breast cancers
• 87% lifetime risk of breast cancer
• 50% lifetime risk of ovarian
cancer
• Double mastectomy
Most genetic risk is not like that!
• Very few genes have large risks!
– BRCA1/BRCA2 are atypical
• Genetic risk can be significant through small
contributions from many genes
– Risk from almost all known “risky” genes is small
(<2.5x)
– Common variation (>3% of population) accounts
for only a small part of genetic risk: rare variants
matter!
Why might we want population-wide
whole genome sequencing?
• Cheapest genetic testing methods suitable only
for common variants
• Cheaper gene-specific sequencing gets expensive
rapidly as the number of genes to be screened
increases.
– It will be much cheaper to sequence everything within
a decade.
• Combining the sequence data with
morbidity/mortality statistics will identify new
risk factors.
Why might this not be a good idea…
• The quality of data exceeds anything the coppers
have with their DNA database.
– Just ~20% of the population typed/sequenced will
allow identification of almost anyone through
relatedness. Sequencing your relations leads to
considerable information about you!
• Could you live with the knowledge of your
genetic risk?
• Could your insurance company?
• Criminal risk profiling?
• Paternity?
The basics of DNA
• DNA is an oriented polymer assembled from four different
subunits (denoted A,C,G,T)
• DNA can be represented as a string with symbols
drawn from the alphabet {A,C,G,T}
Figure from http://ccrhawaii.org/index.php/nucleic-acid-techniques/23-nucleic-acid-hybridization-a-expression-analysis/23c-in-situhybridization-a-dna-microarrays/23c-content-tutorial
The basics of DNA
• Most natural DNA is in the form of a double helix in which
two molecules of DNA are bound together in an
antiparallel arrangement.
• Double-stranded DNA
The basics of DNA
• The strands are held together by base complementarity.
• A-T
• C-G
• The strands hold redundant information
DNA – to a programmer
A – T
C - G
TATTGACG
ATAACTGC
TATTGACG
CGTCAATA
One double-stranded fragment has TWO representations!
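The two representations above can be related in a few lines of Python. This is a minimal sketch: the function names `reverse_complement` and `canonical` are illustrative, and choosing the lexicographically smaller string as the "canonical" form is one common convention, not something the slide prescribes.

```python
# Map each base to its complementary partner (A-T, C-G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in its own reading direction."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(seq: str) -> str:
    """Pick one of the two equivalent strings so both strands compare equal."""
    return min(seq, reverse_complement(seq))

top = "TATTGACG"
bottom = reverse_complement(top)
print(top, bottom)                           # the two strings from the slide
print(canonical(top) == canonical(bottom))   # same fragment, same key
```

Storing or hashing the canonical form means a fragment is counted once regardless of which strand happened to be read.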
DNA sequencing
• All sequencing technologies read SINGLE-STRANDED
DNA
– Read length: how far you can read
– Error rate: <<1%-15%
• Paired reads: technique that allows you to read
sequences at both ends of the same double-stranded
DNA helix
AAGTC…
AAGTC-------------------->
<--------------------AGGTC
CTGGA…
Sequence
Alignment
What is alignment?
• Approximate string search problem
• Alignment is probably the most used
procedure in genomics
Alignment
CACCTGACTCCTGTGGAGAA
CACCTGACTCCTGTGGAGAA
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
Alignment
TTCTCCACAGGAGTCAGGTG
?
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
Reverse strand alignment
TTCTCCACAGGAGTCAGGTG
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
GTGGACTGAGGACACCTCTT
Reverse strand alignment
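A brute-force sketch of alignment as approximate string search on both strands, using the reads and reference from the slides. It assumes ungapped matching scored by mismatch count only; production aligners such as bowtie2 use indexed search and allow gaps, so this illustrates the problem rather than their method. The function names and the `max_mismatches` threshold are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align(read, reference, max_mismatches=2):
    """Try the read and its reverse complement at every offset.

    Returns a list of (offset, strand, mismatches) for acceptable hits.
    """
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for offset in range(len(reference) - len(query) + 1):
            window = reference[offset:offset + len(query)]
            mismatches = hamming(query, window)
            if mismatches <= max_mismatches:
                hits.append((offset, strand, mismatches))
    return hits

REFERENCE = "ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT"
print(align("CACCTGACTCCTGTGGAGAA", REFERENCE))  # forward-strand read
print(align("TTCTCCACAGGAGTCAGGTG", REFERENCE))  # read from the other strand
```

Both reads are placed at offset 6 with one mismatch; the second is found only because its reverse complement is searched as well.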
Why alignment?
• Approximate string search problem
• Alignment is probably the most used
procedure in genomics
• How do I obtain the genome sequence of an
individual (e.g. for medical risk purposes)?
Contemporary sequencing
Sequence fragment ends
Contemporary resequencing
Sequence fragment ends
Align (approximate
substring search)
Reference sequence
Resequencing via alignment
GTGCACCTGACTCCTGTGGA
CACCTGACTCCTGTGGAGAA
TGACTCCTGTGGAGAAGT
TCCTGTGGAGAAGTCTG
TGTGGAGAAGTCTGCCGTT
GAGAAGTCTGCCGTTACT
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
Resequencing via alignment
Sample
ATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACT
GTGCACCTGACTCCTGTGGA
CACCTGACTCCTGTGGAGAA
TGACTCCTGTGGAGAAGT
TCCTGTGGAGAAGTCTG
TGTGGAGAAGTCTGCCGTT
GAGAAGTCTGCCGTTACT
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
Alignment allows me to reestablish the order and strand
orientation of reads easily
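The resequencing idea can be sketched as: place each read at its best-matching offset on the reference, pile up the bases, and take a majority vote per position. The sketch below uses the six reads from the slide and assumes forward-strand reads, no gaps and no sequencing errors; positions with no coverage fall back to the reference base, which is a simplification of this example and not a rule from the slides.

```python
from collections import Counter

REFERENCE = "ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT"
READS = [
    "GTGCACCTGACTCCTGTGGA",
    "CACCTGACTCCTGTGGAGAA",
    "TGACTCCTGTGGAGAAGT",
    "TCCTGTGGAGAAGTCTG",
    "TGTGGAGAAGTCTGCCGTT",
    "GAGAAGTCTGCCGTTACT",
]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def best_offset(read, reference):
    """Offset on the reference with the fewest mismatches (ungapped)."""
    offsets = range(len(reference) - len(read) + 1)
    return min(offsets, key=lambda o: hamming(read, reference[o:o + len(read)]))

def resequence(reads, reference):
    """Majority-vote consensus of aligned reads, reference where uncovered."""
    piles = [Counter() for _ in reference]
    for read in reads:
        offset = best_offset(read, reference)
        for i, base in enumerate(read):
            piles[offset + i][base] += 1
    return "".join(
        pile.most_common(1)[0][0] if pile else ref_base
        for pile, ref_base in zip(piles, reference)
    )

sample = resequence(READS, REFERENCE)
variants = [(i, r, s) for i, (r, s) in enumerate(zip(REFERENCE, sample)) if r != s]
print(sample)
print(variants)  # (0-based position, reference base, sample base)
```

On this data the consensus reproduces the sample sequence shown on the slide, with a single A to T difference from the reference.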
Why alignment?
• Alignment will probably be the most used
procedure in genomics
• How do I obtain the genome sequence of an
individual (e.g. for medical risk purposes)?
• All human genomes are now obtained by
resequencing via alignment (approximate
substring search)
The problem of ploidy
ChrA
ChrB
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
ATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACT
GTGCACCTGACTCCTGAGGA
CACCTGACTCCTGTGGAGAA
TGACTCCTGTGGAGAAGT
TCCTGAGGAGAAGTCTG
TGTGGAGAAGTCTGCCGTT
GAGAAGTCTGCCGTTACT
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
The problem of ploidy
ChrA
ChrB
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
ATGGTGCACCTGACTCCTG?GGAGAAGTCTGCCGTTACT
GTGCACCTGACTCCTGAGGA
CACCTGACTCCTGAGGAGAA
TGACTCCTGAGGAGAAGT
TCCTGAGGAGAAGTCTG
TGAGGAGAAGTCTGCCGTT
GAGAAGTCTGCCGTTACT
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
When sequencing 3x10^9 bases, a “perfect” sequence requires a
sampling error rate << 1 in a billion (~30x depth).
Wheat is hexaploid!
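A rough way to see where the ~30x figure comes from is to ask how likely it is that one chromosome copy is never sampled at a given position. The sketch below assumes every copy differs at the site, reads are drawn uniformly at random from the copies, and there are no sequencing errors; `p_allele_missed` is an illustrative name.

```python
from math import comb

def p_allele_missed(depth, ploidy=2):
    """Probability that at least one of `ploidy` copies is unseen in `depth` reads.

    Inclusion-exclusion over the set of copies that receive no reads.
    """
    return sum(
        (-1) ** (j + 1) * comb(ploidy, j) * ((ploidy - j) / ploidy) ** depth
        for j in range(1, ploidy + 1)
    )

for depth in (10, 20, 30):
    print(depth, p_allele_missed(depth), p_allele_missed(depth, ploidy=6))
```

For a diploid genome this reduces to 2 x (1/2)^depth, which at depth 30 is about 2 in a billion. Under the same assumptions a hexaploid genome needs far more depth to reach a comparable figure.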
The problem of error
Chr1
Chr2
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
ATGGTGCACCTGACTCCTG?GGAGAAGTCTGCCGTTACT
GTGCACCTGACTCCTGTGGA
CACCTGACTCCTGAGGAGAA
TGACTCCTGAGGAGAAGT
TCCTGAGGAGAAGTCTG
TGAGGAGAAGTCTGCCGTT
GAGAAGTCTGCCGTTACT
Reference ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
The error rate of next-generation sequencing is higher than
traditional sequencing.
The requirement for depth
• Research grade ~30x
• Clinical grade ~80x
– Random variation of depth requires high average
depth to achieve specific minimum local depth.
• Your variant calling software needs to assess
whether a variant exists at every position.
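The link between average depth and minimum local depth can be illustrated with a Poisson model of coverage. This is an assumption made for the sketch: real coverage is usually more variable than Poisson, so these fractions are optimistic. The threshold of 15 reads is a hypothetical value chosen only for illustration.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total

def fraction_below(min_depth, mean_depth):
    """Expected fraction of positions covered by fewer than min_depth reads."""
    return poisson_cdf(min_depth - 1, mean_depth)

MIN_LOCAL_DEPTH = 15  # hypothetical requirement of a variant caller
for mean_depth in (30, 80):
    print(mean_depth, fraction_below(MIN_LOCAL_DEPTH, mean_depth))
```

Multiplying the fraction by the 3x10^9 positions of a human genome gives the expected number of under-covered positions, which is why a higher average depth is needed to guarantee a minimum everywhere.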
Implications of depth
• Human genome: 3x10^9 bases
• Required sequencing: 240x10^9 bases (80x)
• Fastest aligner: 78 secs for 10^5 100 bp reads
(bowtie2 on 1 Xeon thread)
– Bowtie2 is not the most sophisticated alignment
you can do.
• Expected processing: 667 CPU-hr
• Multicore compute servers.
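The arithmetic behind these figures can be checked with a short script. It reads the benchmark as 78 seconds per 10^5 reads of 100 bp on one thread (an assumption about the slide's notation) and ignores everything except raw alignment time.

```python
GENOME_BASES = 3e9        # human genome size from the slide
DEPTH = 80                # clinical-grade depth from the slide
READ_LENGTH = 100         # bp per read
SECONDS_PER_BATCH = 78    # benchmark time, assumed to be per batch below
READS_PER_BATCH = 1e5     # assumed batch size (10^5 reads)

bases_to_sequence = GENOME_BASES * DEPTH
reads = bases_to_sequence / READ_LENGTH
cpu_hours = reads / READS_PER_BATCH * SECONDS_PER_BATCH / 3600

print(f"bases to sequence: {bases_to_sequence:.3g}")
print(f"reads to align:    {reads:.3g}")
print(f"alignment time:    {cpu_hours:.0f} CPU-hours")
```

With these inputs the estimate is about 520 CPU-hours, the same order of magnitude as the 667 CPU-hr quoted on the slide.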
What if we wanted population-wide
genome sequencing?
• UK birth-rate: 688000 births/yr (2011)
• Clinical grade sequencing
• What are the hardware requirements?
Extensive full genome sequencing?
• Computation cost not constraining: 667 CPU-hours for alignment (@$3.50/hr = $74)
• 1 32-thread server: ~400 alignments/year
– 1720 servers should do it.
• Downstream risk analysis – even 10x increase
still cuts it.
– Research big data trawling: horrendous
computational load.
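A back-of-envelope check of the server numbers. It assumes the $3.50/hr price applies to a whole 32-thread server (which is what makes the ~$74 figure work out), that servers run all year at full utilisation, and that alignment scales perfectly across threads.

```python
import math

CPU_HOURS_PER_GENOME = 667
THREADS_PER_SERVER = 32
DOLLARS_PER_SERVER_HOUR = 3.50   # assumed to be the price of one whole server
BIRTHS_PER_YEAR = 688_000
HOURS_PER_YEAR = 24 * 365

server_hours = CPU_HOURS_PER_GENOME / THREADS_PER_SERVER
cost_per_genome = server_hours * DOLLARS_PER_SERVER_HOUR
genomes_per_server_year = HOURS_PER_YEAR / server_hours
servers_needed = math.ceil(BIRTHS_PER_YEAR / genomes_per_server_year)

print(f"cost per genome:         ${cost_per_genome:.0f}")
print(f"genomes per server-year: {genomes_per_server_year:.0f}")
print(f"servers needed:          {servers_needed}")
```

The unrounded rate is roughly 420 genomes per server per year; rounding down to ~400 as on the slide gives the 1720 servers quoted.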
Extensive full genome sequencing?
• Storage
– raw images: TBs!
– unprocessed reads: 200Gb
– 688000 births/yr => 137.6 Pb
– diffs: ?
– Legal liability, security and privacy?
• Network transfer
– 137.6 Pb (bases)
– 0.115 PB moved during Olympics 2012
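The storage figure can be reproduced and converted to bytes as follows. Note that Gb and Pb on the slide count bases, not bytes. The conversion assumes a bare 2 bits per base, which is an assumption of this sketch: real read files also carry quality scores and read names, so they are considerably larger.

```python
BASES_PER_GENOME = 200e9      # unprocessed reads per person, from the slide
BIRTHS_PER_YEAR = 688_000
BITS_PER_BASE = 2             # assumption: four symbols packed into 2 bits
OLYMPICS_2012_PB = 0.115      # data moved during the 2012 Olympics, from the slide

total_bases = BASES_PER_GENOME * BIRTHS_PER_YEAR
total_petabytes = total_bases * BITS_PER_BASE / 8 / 1e15

print(f"bases per year:     {total_bases / 1e15:.1f} Pb")
print(f"packed size:        {total_petabytes:.1f} PB")
print(f"multiples of the Olympics transfer: {total_petabytes / OLYMPICS_2012_PB:.0f}")
```

Even with this optimistic packing, one year of births corresponds to hundreds of times the Olympics transfer volume.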
Typical datacenter
Sequencers
1 TB/day
100 TB
2500 TB
3000 CPU
cores
Genome
Assembly
Software development
• Older aligners were good for fewer, longer
(200-1000 base) reads.
• Current aligners are good for many, short (18-250 base) reads.
• New sequencing technologies will eventually
have many, long (>>1000 base) reads. Need
to develop efficient aligners for this!
Contemporary “shotgun” sequencing
Shear and separate
strands (~100’s bp)
Read (erroneously)
Assemble
Genome assembly
• All-vs-all partial string match problem
• Almost all software uses one or more of
– de Bruijn graph
– String graph
– Burrows-Wheeler transform-based indices
• Very computationally demanding
• Only used for research – not directly used
clinically.
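To give a flavour of the de Bruijn graph approach, here is a toy sketch. It assumes error-free reads that all come from the same strand and a sequence with no repeats of length k-1 or more, none of which holds for real data; the reads are the ones from the resequencing slides, and k = 8 is an arbitrary illustrative choice.

```python
from collections import defaultdict

def de_bruijn_contig(reads, k):
    """Build a de Bruijn graph on (k-1)-mers and walk its single linear path."""
    successors = defaultdict(set)
    in_degree = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            left, right = read[i:i + k - 1], read[i + 1:i + k]
            if right not in successors[left]:   # count each edge once
                successors[left].add(right)
                in_degree[right] += 1

    starts = [node for node in successors if in_degree[node] == 0]
    if len(starts) != 1:
        raise ValueError("graph is not a single linear path")

    node = starts[0]
    contig = node
    visited = {node}
    # Extend while the path is unambiguous (one way out, one way in).
    while len(successors.get(node, ())) == 1:
        (nxt,) = successors[node]
        if nxt in visited or in_degree[nxt] != 1:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig

READS = [
    "GTGCACCTGACTCCTGTGGA",
    "CACCTGACTCCTGTGGAGAA",
    "TGACTCCTGTGGAGAAGT",
    "TCCTGTGGAGAAGTCTG",
    "TGTGGAGAAGTCTGCCGTT",
    "GAGAAGTCTGCCGTTACT",
]
print(de_bruijn_contig(READS, k=8))
```

Unlike resequencing, no reference is used: the contig is recovered purely from overlaps between reads. Sequencing errors, both strands and repeats turn the single path into a tangled graph, which is where the computational cost comes from.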
Traditional genome sequencing
Coarse fragmenting (35-100 kb)
Order and orient
Sequence each fragment
So why does shotgun sequencing
work?
• Paired end sequencing
• Libraries with different insert sizes
Bridging via big paired reads
gap
contig
gap
contig
contig
SCAFFOLD
Typical hardware requirements
• Human genome scale (preliminary assembly)
– ABySS: 21 8-core/16 GB RAM nodes.
– SOAPdenovo: 1 32-core/512 GB RAM node.
– SGA: 1427 CPU-hours @54 GB RAM (vs 479 CPU-hours @118 GB RAM for ABySS)
Metagenome
Assembly
Metagenomics
• Microbial flora has a strong impact on
health
• Gut flora is the largest population of
microbes associated with us.
• Extract DNA
• Sequence
• ASSEMBLE (>200 genomes mixed together)
Metagenomic sequencing
Shear and sequence
Assemble
Example
• Human Microbiome Project
– Aims to sequence up to 3000 microbial genomes
• Culturable: “easy”
• Unculturable: metagenomic problem
– To date: 1.2 Tb of sequence obtained
Software development
• We have working software
• Performance is not good
– Slow or high RAM requirements
– Quality of assemblies can be poor (shrapnel or
nonsense)
– Very sensitive to parameter settings
– Lots of human intervention to wrangle them into
shape!
Transcriptomics
Why transcriptomics?
• Cells are often regulated by modulating the
abundance of specific transcripts.
– Transcripts => proteins
• There are thousands of distinct transcripts in a
cell.
• Different cell types have very different
abundances of transcripts.
Splicing
Genome
exon
intron
exon
Primary
transcript
Spliced
transcript
Transcriptomic techniques
• Microarrays (now cheap)
– Provide an analogue output related to abundance
of a small subsequence (that is part of a known
transcript)
• RNA-sequencing (becoming cheaper)
– Sequence the transcripts
– Abundance inferred from number of fragments
attributable to a specific transcript
– Can discover transcripts de novo
Microarrays
Transcripts
“Probes”
Signal
• ‘000s of sequence-specific probes
• Fluorescence
readout (non-linear)
RNA-seq
RNA
Fragment and sequence
Align to genome
Transcript
Count to quantify
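The "count to quantify" step can be sketched as assigning each aligned read to the gene whose exons contain it, then normalising by transcript length. All gene names, exon coordinates and read positions below are made up for illustration, and assigning a read by its start position alone is a simplification.

```python
from collections import Counter

# Hypothetical annotation: gene -> exon intervals (0-based, end-exclusive).
EXONS = {
    "geneA": [(0, 300), (500, 900)],
    "geneB": [(1200, 1700)],
}
# Hypothetical alignment start positions of reads on the genome.
READ_STARTS = [10, 120, 510, 620, 880, 950, 1250, 1300]

def gene_at(position):
    """Return the gene whose exon contains the position, or None."""
    for gene, exons in EXONS.items():
        if any(start <= position < end for start, end in exons):
            return gene
    return None

def count_reads(read_starts):
    """Count reads per gene; reads outside any exon are ignored."""
    counts = Counter()
    for position in read_starts:
        gene = gene_at(position)
        if gene is not None:
            counts[gene] += 1
    return counts

def reads_per_kb(counts):
    """Normalise counts by total exon length so long genes are not favoured."""
    return {
        gene: counts[gene] / (sum(end - start for start, end in exons) / 1000)
        for gene, exons in EXONS.items()
    }

counts = count_reads(READ_STARTS)
print(counts)
print(reads_per_kb(counts))
```

Length normalisation matters because a longer transcript yields more fragments at the same abundance.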
Why transcriptomics?
• Addresses the question “What is the cell
doing?”
• Way of classifying disease types, esp. cancer.
– Is gene expressed? Has it got mutations?
– Stratified medicine
• Gradually moving from the lab into the clinic.
Data integration
Textual information
Humans vs computers
Humans
• Can operate with fuzzy
semantics, even semantic
switching.
• Have mental models of
phenomena that are not
explicitly articulated.
• Deal well with uncertainty
and/or error
Computer algorithms need
• Explicit semantics
• Consistent models
• Consistent APIs
However, computers can work on
much larger datasets than humans
can possibly deal with!
The integration problem
• Huge amounts of data are being generated in
biomedical fields each day
– Storage
• Will cost of storage exceed cost of (re)generation?
– Metadata
• Ontologies
– Textual data – e.g. research publications
– Programmatic access
– Interchange standards
– Semantic web. Really?
– Privacy and security
The clinical data integration problem
• If large scale sequencing is used in a clinical
context…
– What do we need to store?
• Legal liability.
• Non-modifiable filestore?
– What standards do we need for diagnostic purposes?
– How will privacy be protected?
– Where genetic data has implications for related
individuals, what are the clinical obligations?
– Research access? (Non-)commercial?
If you are interested in this area…
• Many universities offer M.Sc degrees in this
field.
• Most accept CS/IT first degrees.
– Some specialise in CS/IT conversion, e.g.
Edinburgh
• Many practitioners have Ph.D.s in the field.