Computing and the life sciences David Huen University of Wolverhampton
Structure
• Some background biology
• How computing is used in some areas of biology
  – I can only describe a small subset briefly

Memorial university

Conversations with computers

The cell as a machine
http://multiple-sclerosis-research.blogspot.co.uk

DNA as code
• Heritable information is encoded on DNA
• Almost like code
• DNA is code that, when executed on an appropriate machine, builds a very similar machine.

Central dogma
wikipedia

Cell as a machine
[Diagram: Regulatory; Effector; State repertoire; External stimulus; New functionality]
[Diagram: Regulatory; Effector; State change; External stimulus]
[Diagram: Regulatory; Effector; Persistence: memory; Changed state repertoire; External stimulus; Different functionality]
[Diagram: Genomics; External stimulus; Regulatory; Effector; Transcriptomics; Proteomics; Metabolomics]

Metrics
• What different transcripts/proteins/molecule types are out there?
• How much of each is there?

Challenges in genomics

Genomics: cheap sequencing

Drinking at the waterfall…
Nature Biotechnology 30, 627–630 (2012)

Genetic risk of disease
• Our DNA is (usually) unique.
  – ~1 in 1000 bases => ~6 million differences.
  – Some positions in the genome vary frequently: common variants.
  – Most variants are rare (<3%)
• These differences underlie genetic risk.
  – Twin studies

High profile case of genetic risk
• Mother died from breast cancer at 56 (common disease: 1 in 8)
• Estimated genetic risk: 1.8x
• Chose BRCA1 (and most likely BRCA2) screening.
• Most likely by sequencing BRCA1/BRCA2 genes
• 1-3% of breast cancers
• 87% lifetime risk of breast cancer
• 50% lifetime risk of ovarian cancer
• Double mastectomy

Most genetic risk is not like that!
• Very few genes have large risks!
  – BRCA1/BRCA2 are atypical
• Genetic risk can be significant through small contributions from many genes
  – Risk from almost all known “risky” genes is small (<2.5x)
  – Common variation (>3% of population) accounts for only a small part of genetic risk: rare variants matter!

Why might we want population-wide whole genome sequencing?
• The cheapest genetic testing methods are suitable only for common variants
• Cheaper gene-specific sequencing gets expensive rapidly as the number of genes to be screened increases.
  – It will be much cheaper to sequence everything within a decade.
• Combining the sequence data with morbidity/mortality statistics will identify new risk factors.

Why might this not be a good idea…
• The quality of data exceeds anything the coppers have with their DNA database.
  – Just ~20% of the population typed/sequenced will allow identification of almost anyone through relatedness. Sequencing your relations leads to considerable information about you!
• Could you live with the knowledge of your genetic risk?
• Could your insurance company?
• Criminal risk profiling?
• Paternity?

The basics of DNA
• DNA is an oriented polymer assembled from four different subunits (denoted A, C, G, T)
• DNA can be represented as a string with symbols drawn from the alphabet {A,C,G,T}
Figure from http://ccrhawaii.org/index.php/nucleic-acid-techniques/23-nucleic-acid-hybridization-a-expression-analysis/23c-in-situhybridization-a-dna-microarrays/23c-content-tutorial

The basics of DNA
• Most natural DNA is in the form of a double helix in which two molecules of DNA are bound together in an antiparallel arrangement.
• Double-stranded DNA

The basics of DNA
• The strands are held together by base complementarity.
  • A-T
  • C-G
• The strands hold redundant information

DNA – to a programmer
A – T
C – G
The double-stranded fragment (top strand, with its complement beneath):
  TATTGACG
  ATAACTGC
The same fragment as two strings, each strand read in its own direction:
  TATTGACG
  CGTCAATA
One double-stranded fragment has TWO representations!
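
For a programmer, the "two representations" point can be sketched in a few lines of Python. This is an illustration only: the function names `reverse_complement` and `canonical` are hypothetical (they are not from the slides), and choosing the lexicographically smaller string as the single key is just one common convention, assumed here for the example.

```python
# Illustrative sketch; function names are hypothetical, not taken from the slides.

# Base complementarity from the slide: A pairs with T, C pairs with G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own direction."""
    # Complement every base, then reverse because the strands are antiparallel.
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Map both representations of a fragment to one key.

    Assumption for this sketch: pick the lexicographically smaller string.
    """
    return min(seq, reverse_complement(seq))


if __name__ == "__main__":
    # The fragment from the slide and its second representation.
    print(reverse_complement("TATTGACG"))  # CGTCAATA
    # Both strings describe the same double-stranded fragment.
    print(canonical("TATTGACG") == canonical("CGTCAATA"))  # True
```

Any code that counts, indexes or compares DNA strings has to decide how to treat the two forms; searching for a read usually means searching for both the read and its reverse complement.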
DNA sequencing
• All sequencing technologies read SINGLE-STRANDED DNA
  – Read length: how far you can read
  – Error rate: from <<1% to 15%
• Paired reads: technique that allows you to read sequences at both ends of the same double-stranded DNA helix
AAGTC…
AAGTC-------------------->
<--------------------AGGTC
                     CTGGA…

Sequence Alignment

What is alignment?
• Approximate string search problem
• Alignment is probably the most used procedure in genomics

Alignment
           CACCTGACTCCTGTGGAGAA
                 CACCTGACTCCTGTGGAGAA
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT

Alignment
           TTCTCCACAGGAGTCAGGTG
           ?
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT

Reverse strand alignment
           TTCTCCACAGGAGTCAGGTG
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
                 GTGGACTGAGGACACCTCTT

Why alignment?
• Approximate string search problem
• Alignment is probably the most used procedure in genomics
• How do I obtain the genome sequence of an individual (e.g. for medical risk purposes)?

Contemporary sequencing
[Diagram: Sequence fragment ends]

Contemporary resequencing
[Diagram: Sequence fragment ends; Align (approximate substring search); Reference sequence]

Resequencing via alignment
              GTGCACCTGACTCCTGTGGA
                 CACCTGACTCCTGTGGAGAA
                     TGACTCCTGTGGAGAAGT
                         TCCTGTGGAGAAGTCTG
                            TGTGGAGAAGTCTGCCGTT
                                GAGAAGTCTGCCGTTACT
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT

Resequencing via alignment
Sample     ATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACT
              GTGCACCTGACTCCTGTGGA
                 CACCTGACTCCTGTGGAGAA
                     TGACTCCTGTGGAGAAGT
                         TCCTGTGGAGAAGTCTG
                            TGTGGAGAAGTCTGCCGTT
                                GAGAAGTCTGCCGTTACT
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
Alignment allows me to reestablish the order and strand orientation of reads easily

Why alignment?
• Alignment will probably be the most used procedure in genomics
• How do I obtain the genome sequence of an individual (e.g. for medical risk purposes)?
• All human genomes are now obtained by resequencing via alignment (approximate substring search)

The problem of ploidy
ChrA       ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
ChrB       ATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACT
              GTGCACCTGACTCCTGAGGA
                 CACCTGACTCCTGTGGAGAA
                     TGACTCCTGTGGAGAAGT
                         TCCTGAGGAGAAGTCTG
                            TGTGGAGAAGTCTGCCGTT
                                GAGAAGTCTGCCGTTACT
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT

The problem of ploidy
ChrA       ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
ChrB       ATGGTGCACCTGACTCCTG?GGAGAAGTCTGCCGTTACT
              GTGCACCTGACTCCTGAGGA
                 CACCTGACTCCTGAGGAGAA
                     TGACTCCTGAGGAGAAGT
                         TCCTGAGGAGAAGTCTG
                            TGAGGAGAAGTCTGCCGTT
                                GAGAAGTCTGCCGTTACT
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
When sequencing 3x10^9 bases, a “perfect” sequence requires a sampling error rate << 1 in a billion (~30x depth).
Wheat is hexaploid!

The problem of error
Chr1       ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
Chr2       ATGGTGCACCTGACTCCTG?GGAGAAGTCTGCCGTTACT
              GTGCACCTGACTCCTGTGGA
                 CACCTGACTCCTGAGGAGAA
                     TGACTCCTGAGGAGAAGT
                         TCCTGAGGAGAAGTCTG
                            TGAGGAGAAGTCTGCCGTT
                                GAGAAGTCTGCCGTTACT
Reference  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACT
The error rate of next-generation sequencing is higher than that of traditional sequencing.

The requirement for depth
• Research grade ~30x
• Clinical grade ~80x
  – Random variation of depth requires high average depth to achieve specific minimum local depth.
• Your variant calling software needs to assess whether a variant exists at every position.

Implications of depth
• Human genome: 3x10^9 bases
• Required sequencing: 240x10^9 bases (80x)
• Fastest aligner: 78 secs for 10^5 100 bp reads (bowtie2 on 1 Xeon thread)
  – Bowtie2 is not the most sophisticated alignment you can do.
• Expected processing: 667 CPU-hr
• Multicore compute servers.

What if we wanted population-wide genome sequencing?
• UK birth-rate: 688000 births/yr (2011)
• Clinical grade sequencing
• What are the hardware requirements?

Extensive full genome sequencing?
• Computation cost not constraining: 667 CPU-hours for alignment (@$3.50/hr = $74)
• One 32-thread server: ~400 alignments/year
  – 1720 servers should do it.
• Downstream risk analysis
  – even 10x increase still cuts it.
  – Research big data trawling: horrendous computational load.

Extensive full genome sequencing?
• Storage
  – raw images: TBs!
  – unprocessed reads: 200 Gb
  – 688000 births/yr => 137.6 Pb
  – diffs: ?
  – Legal liability, security and privacy?
• Network transfer
  – 137.6 Pb (bases)
  – 0.115 PB moved during Olympics 2012

Typical datacenter
[Diagram: Sequencers; 1 TB/day; 100 TB; 2500 TB; 3000 CPU cores]

Genome Assembly

Software development
• Older aligners were good for fewer, longer (200-1000 base) reads.
• Current aligners are good for many, short (18-250 base) reads.
• New sequencing technologies will eventually have many, long (>>1000 base) reads. Need to develop efficient aligners for this!

Contemporary “shotgun” sequencing
[Diagram: Shear and separate strands (~100s of bp); Read (erroneously); Assemble]

Genome assembly
• All-vs-all partial string match problem
• Almost all software uses one or more of
  – de Bruijn graph
  – String graph
  – Burrows-Wheeler transform-based indices
• Very computationally demanding
• Only used for research
  – not directly used clinically.

Traditional genome sequencing
[Diagram: Coarse fragmenting (35-100 kb); Order and orient; Sequence each fragment]

So why does shotgun sequencing work?
• Paired end sequencing
• Libraries with different insert sizes

Bridging via big paired reads
[Diagram: gap; contig; gap; contig; contig; SCAFFOLD]

Typical hardware requirements
• Human genome scale (preliminary assembly)
  – ABySS: 21 8-core/16 GB RAM nodes.
  – SOAPdenovo: 1 32-core/512 GB RAM node.
  – SGA: 1427 CPU-hours @54 GB RAM (vs 479 CPU-hours @118 GB RAM for ABySS)

Metagenome Assembly

Metagenomics
• Microbial flora has a strong impact on health
• Gut flora is the largest population of microbes associated with us.
• Extract DNA
• Sequence
• ASSEMBLE (>200 genomes mixed together)

Metagenomic sequencing
[Diagram: Shear and sequence; Assemble]

Example
• Human Microbiome Project
  – Aims to sequence up to 3000 microbial genomes
    • Culturable: “easy”
    • Unculturable: metagenomic problem
  – To date: 1.2 Tb of sequence obtained

Software development
• We have working software
• Performance is not good
  – Slow or high RAM requirements
  – Quality of assemblies can be poor (shrapnel or nonsense)
  – Very sensitive to parameter settings
  – Lots of human intervention to wrangle them into shape!

Transcriptomics

Why transcriptomics?
• Cells are often regulated by modulating the abundance of specific transcripts.
  – Transcripts => proteins
• There are thousands of distinct transcripts in a cell.
• Different cell types have very different abundances of transcripts.

Splicing
[Diagram: Genome; exon; intron; exon; Primary transcript; Spliced transcript]

Transcriptomic techniques
• Microarrays (now cheap)
  – Provide an analogue output related to abundance of a small subsequence (that is part of a known transcript)
• RNA-sequencing (becoming cheaper)
  – Sequence the transcripts
  – Abundance inferred from number of fragments attributable to a specific transcript
  – Can discover transcripts de novo

Microarrays
[Diagram: Transcripts; “Probes”; Signal]
• 1000s of sequence-specific probes
• Fluorescence readout (non-linear)

RNA-seq
[Diagram: RNA; Fragment and sequence; Align to genome; Transcript; Count to quantify]

Why transcriptomics?
• Addresses the question “What is the cell doing?”
• Way of classifying disease types, esp. cancer.
  – Is gene expressed? Has it got mutations?
  – Stratified medicine
• Gradually moving from the lab into the clinic.

Data integration

Textual information

Humans vs computers
Humans
• Can operate with fuzzy semantics, even semantic switching.
• Have mental models of phenomena that are not explicitly articulated.
• Deal well with uncertainty and/or error
Computer algorithms need
• Explicit semantics
• Consistent models
• Consistent APIs
However, computers can work on much larger datasets than humans can possibly deal with!

The integration problem
• Huge amounts of data are being generated in biomedical fields each day
  – Storage
    • Will cost of storage exceed cost of (re)generation?
  – Metadata
    • Ontologies
  – Textual data – e.g. research publications
  – Programmatic access
  – Interchange standards
  – Semantic web. Really?
  – Privacy and security

The clinical data integration problem
• If large scale sequencing is used in a clinical context…
  – What do we need to store?
    • Legal liability.
    • Non-modifiable filestore?
  – What standards do we need for diagnostic purposes?
  – How will privacy be protected?
  – Where genetic data has implications for related individuals, what are the clinical obligations?
  – Research access? (Non-)commercial?

If you are interested in this area…
• Many universities offer M.Sc. degrees in this field.
• Most accept CS/IT first degrees.
  – Some specialise in CS/IT conversion, e.g. Edinburgh
• Many practitioners have Ph.D.s in the field.