Our group is interested in advancing precision medicine and health using big data analytics. We develop new algorithms and models for the big data in biobanks and electronic health records. In particular, we seek insights from big data that are often not attainable with smaller data sets.
Population genetics informatics
Modern biobanks include genotypes for 0.1%-1% of an entire large population. At this scale, genetic relatedness among samples is unavoidable and widespread. However, current methods are not efficient enough to uncover genetic relatedness at such a scale. We developed ultra-efficient methods for detecting Identical-by-Descent (IBD) segments, a primary embodiment of genetic relatedness. Our RaPID method detects all IBD segments over a certain length orders of magnitude faster than existing methods, while offering higher power, higher accuracy, and sharper IBD segment boundaries.
We believe that identifying IBD segments in population-scale cohorts is the first step towards constructing a population-scale genealogy, which will be a fundamental infrastructure for future human society.
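To make the notion of an IBD segment concrete, here is a minimal sketch of a naive pairwise scan, not the RaPID algorithm. It assumes two phased haplotypes encoded as equal-length sequences of 0/1 alleles and reports every run of identical sites of at least a chosen length. The function name `shared_segments` and the parameter `min_sites` are hypothetical, and length is counted in sites rather than genetic distance, with no tolerance for genotyping errors.

```python
from typing import List, Sequence, Tuple


def shared_segments(
    hap_a: Sequence[int], hap_b: Sequence[int], min_sites: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) site ranges where the two haplotypes
    are identical for at least `min_sites` consecutive sites.

    Illustrative only: exact matching, length measured in sites.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")

    segments: List[Tuple[int, int]] = []
    run_start = None  # index where the current matching run began
    for i, (a, b) in enumerate(zip(hap_a, hap_b)):
        if a == b:
            if run_start is None:
                run_start = i
        else:
            # A mismatch closes the current run; keep it if long enough.
            if run_start is not None and i - run_start >= min_sites:
                segments.append((run_start, i))
            run_start = None

    # Close a run that extends to the last site.
    if run_start is not None and len(hap_a) - run_start >= min_sites:
        segments.append((run_start, len(hap_a)))
    return segments


if __name__ == "__main__":
    hap_a = [0, 1, 1, 0, 1, 0, 0, 1]
    hap_b = [0, 1, 1, 0, 0, 0, 0, 1]
    print(shared_segments(hap_a, hap_b, min_sites=3))
```

Applying such a scan to every pair of haplotypes costs time proportional to the number of pairs times the number of sites, so it grows quadratically with cohort size. This is the scaling problem that motivates more efficient methods at biobank scale.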
RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. A Naseri, X Liu, K Tang, S Zhang, D Zhi. Genome biology 20 (1), 143
Efficient haplotype matching between a query and a panel for genealogical search. A Naseri, E Holzhauser, D Zhi, S Zhang. Bioinformatics 35 (14), i233-i241
Modeling of electronic health record (EHR) using deep learning
Patients’ health records and other health information are increasingly being collected and made available. This allows the development of representation models that describe the inherent health status and treatment history of a patient. With access to multiple EHR databases covering more than 50 million patients, we develop deep learning methods to uncover the logic of medical practice and to help improve the efficiency of clinical care.
Laila Rasmy, Yonghui Wu, Ningtao Wang, Xin Geng, W. Jim Zheng, Fei Wang, Hulin Wu, Hua Xu, and Degui Zhi. 2018. “A Study of Generalizability of Recurrent Neural Network-Based Predictive Models for Heart Failure Onset Risk Using a Large and Heterogeneous EHR Data Set.” Journal of Biomedical Informatics 84 (August): 11–16.
Modern bioinformatics using deep learning
Deep learning is a powerful paradigm for modeling the complex, multi-modality data that modern biomedical research faces. We explore a variety of bioinformatics problems using deep learning approaches.
Gene2vec: distributed representation of genes based on co-expression. J Du, P Jia, Y Dai, C Tao, Z Zhao, D Zhi*. BMC genomics 20 (1), 82
RaPID. Random Projection-based IBD Detection.
pytorch_ehr. Open-source code for modeling EHR data, based on PyTorch.
HapSeq2. Our method for genotype calling and phasing for WGS data.
msBayes. Statistical Quantification of Methylation Levels by Next-generation Sequencing.