
PhD Defense - Olivia Choudhury

Start: 6/15/2017 at 1:00PM
End: 6/15/2017 at 4:30PM
Location: 258 Fitzpatrick
Attendees: Faculty and students are welcome to attend the presentation portion of the defense. Light refreshments will be served.

Olivia Choudhury
Dissertation Defense
June 15, 2017, 1:00 pm, 258 Fitzpatrick
Adviser: Dr. Scott Emrich
Committee:
Dr. Kevin Bowyer, Dr. Jeanne Romero-Severson, Dr. Douglas Thain


"Expediting Analysis and Improving Fidelity of Big Data Genomics"

Abstract

Genomics, or the study of genome-derived data, has had widespread impact in applications including medicine, forensic science, human evolution, environmental science, and social science. The plummeting cost of genome sequencing in the last decade has spurred an exponential growth of genomic data. The rate of data generation from these sequencing techniques has outpaced the growth in computing throughput predicted by Moore’s Law, causing a major bottleneck in data processing and analysis. Emerging genome data is also characterized by missing and erroneous values, which reduce data fidelity and limit its applicability for downstream analysis. These challenges form the basis of the following research questions: (i) Can we design frameworks that expedite data analysis and enable efficient utilization of computational resources? (ii) Can we develop accurate and efficient algorithms to improve data fidelity in genomic applications?

We address the first problem by developing a parallel data analysis framework that accelerates large-scale comparative genomics applications. We find that optimal data partitioning and caching significantly improve the performance of such a framework. We further construct a predictive model to estimate runtime configurations that facilitate optimal utilization of cloud and cluster-based resources while executing data-intensive applications.

The fidelity of genomic data derived from next-generation sequencing techniques affects downstream applications such as genome-wide association studies (GWAS) and genome assembly. For imputation of missing genotype data, we design an accurate, fast, and lightweight algorithm for both model (with a reference genotype panel) and non-model (without a reference genotype panel) organisms. To correct erroneous long reads generated by emerging sequencing techniques, we formulate a hybrid correction algorithm that determines a correction policy based on an optimal combination of base quality and similarity of aligned short reads. We extend the core algorithm by proposing an iterative learning paradigm that further improves its performance.

Our proposed data analysis framework is accessible to the scientific community and has been used to study the genomes of ecologically important plant species and malaria vector mosquitoes. The predictive models exhibit high accuracy in determining optimal parameters of operation on commercial cloud services such as Amazon EC2 and Microsoft Azure. Finally, the imputation and error correction algorithms outperform state-of-the-art alternatives when tested on real data sets from plants, malaria vector mosquitoes, and humans. Hence, in this thesis, we present novel solutions to expedite data-parallel genomic applications while optimizing cloud and cluster-based resource utilization. We also design novel, accurate, and efficient algorithms to impute missing data and correct erroneous data in emerging genomic applications.