Our group is interested in advancing precision medicine and health using big data analytics. We develop new algorithms and models for the big data in biobanks and electronic health records. In particular, we derive insights from big data that are often not attainable with smaller data sets.


Population genetics informatics

Modern biobanks include genotypes of 0.1% to 1% of an entire large population. At this scale, genetic relatedness among samples is ubiquitous. However, existing methods are not efficient enough to uncover genetic relatedness at such a scale. We developed ultra-efficient methods for detecting identical-by-descent (IBD) segments, a primary embodiment of genetic relatedness. Our RaPID method detects all IBD segments over a certain length orders of magnitude faster than existing methods, while offering higher power, higher accuracy, and sharper IBD segment boundaries.
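
To make the scaling problem concrete, here is a minimal sketch of a brute-force baseline. It is not RaPID and not any published method. It assumes phased haplotypes encoded as lists of 0/1 alleles, treats any run of at least `min_sites` identical consecutive sites as a candidate shared segment, and ignores genotyping error and genetic distance. The function names `shared_segments` and `all_pairs_segments` are hypothetical.

```python
from itertools import combinations


def shared_segments(hap_a, hap_b, min_sites):
    """Return half-open (start, end) site ranges where two haplotypes
    are identical for at least min_sites consecutive sites."""
    segments = []
    run_start = None
    n = min(len(hap_a), len(hap_b))
    for i in range(n):
        if hap_a[i] == hap_b[i]:
            if run_start is None:
                run_start = i  # a new identical run begins here
        else:
            if run_start is not None and i - run_start >= min_sites:
                segments.append((run_start, i))
            run_start = None
    # Close a run that extends to the last site.
    if run_start is not None and n - run_start >= min_sites:
        segments.append((run_start, n))
    return segments


def all_pairs_segments(panel, min_sites):
    """Scan every pair of haplotypes: O(N^2 * M) work for N haplotypes
    and M sites, which is what becomes infeasible at biobank scale."""
    result = {}
    for (i, hap_i), (j, hap_j) in combinations(enumerate(panel), 2):
        segs = shared_segments(hap_i, hap_j, min_sites)
        if segs:
            result[(i, j)] = segs
    return result


# Toy panel: three haplotypes over eight sites (illustrative data only).
panel = [
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 0],
]
candidates = all_pairs_segments(panel, min_sites=4)
```

Because every pair of haplotypes is compared site by site, the cost grows quadratically with the number of samples, which is the bottleneck that efficient IBD detection methods aim to avoid.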

We believe that identifying IBD segments in population-scale cohorts is the first step towards constructing a population-scale genealogy, which will be a fundamental infrastructure for future human society.

Representative publications

RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. A Naseri, X Liu, K Tang, S Zhang, D Zhi. Genome Biology 20 (1), 143

Efficient haplotype matching between a query and a panel for genealogical search. A Naseri, E Holzhauser, D Zhi, S Zhang. Bioinformatics 35 (14), i233-i241

Modeling of electronic health record (EHR) using deep learning

Patients’ health records and other health information are increasingly being collected and made available. This makes it possible to develop representation models that describe the inherent health status and treatment history of a patient. With access to multiple EHR databases covering over 50 million patients, we develop deep learning methods to uncover the logic of medical practice and to help improve the efficiency of clinical care.

Representative publications

Laila Rasmy, Yonghui Wu, Ningtao Wang, Xin Geng, W. Jim Zheng, Fei Wang, Hulin Wu, Hua Xu, and Degui Zhi. 2018. “A Study of Generalizability of Recurrent Neural Network-Based Predictive Models for Heart Failure Onset Risk Using a Large and Heterogeneous EHR Data Set.” Journal of Biomedical Informatics 84 (August): 11–16.

Modern bioinformatics using deep learning

Deep learning is a powerful paradigm for modeling the complex, multi-modality data that modern biomedical research faces. We explore a variety of bioinformatics problems using deep learning approaches.

Representative publications

Gene2vec: distributed representation of genes based on co-expression. J Du, P Jia, Y Dai, C Tao, Z Zhao, D Zhi*. BMC Genomics 20 (1), 82


RaPID. Random Projection-based IBD Detection (RaPID).

pytorch_ehr. Open-source code for modeling EHR data, based on PyTorch.

HapSeq2. Our method for genotype calling and phasing for WGS data.

msBayes. Statistical Quantification of Methylation Levels by Next-generation Sequencing.



Ardalan Naseri Research Scientist

  • Areas of Interest: Bioinformatics
  • CV

Degui Zhi Associate Professor

  • Areas of Interest: Genome Informatics, Statistical Genetics, Machine Learning
  • Links: Google Scholar
  • Software: HapSeq2

Jing Zhang Graduate Student Researcher

  • Areas of Interest: big data visualization

Laila Rasmy Gindy Bekhet PhD student

Mia Tran Master's student

  • Areas of Interest: Deep learning model for EHR continuous variables

Ryan Lewis Graduate Student Researcher

  • Areas of Interest: Population Genetics Informatics
  • CV

Ziqian Xie Postdoc (Jointly with Rui Chen)

  • Areas of Interest: Deep learning

Past Members

Ginny (Jie) Zhu PhD student

  • Areas of Interest: Machine learning, Health data science
  • CV
  • Links: LinkedIn

Soyeon Kim Postdoc

  • Areas of Interest: Statistics
  • CV

Swati Goyal Graduate Research Assistant

  • Areas of Interest: Analyzing Twitter data related to infectious diseases
  • CV

Bijie Bie Postdoc

  • Areas of Interest: Communication studies, data science for social media

Guodong Wu PhD student

  • Areas of Interest: Statistical Genetics, Penalized Regression

Samad Jahandideh Postdoc

  • Areas of Interest: Structural Bioinformatics, Machine Learning
  • Links: Google Scholar
  • CV

Xin Geng Postdoc

Xueyan (Snow) Zhao Postdoc

  • Areas of Interest: Statistical Genetics, PheWAS
  • CV