BCB '21: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Full Citation in the ACM Digital Library

SESSION: Sequence analysis

Session details: Sequence analysis

  • Cuncong Zhong

A k-mer query tool for assessing population diversity in pangenomes

  • Hang Su
  • Ziwei Chen
  • Maya L Najarian
  • Martin T. Ferris
  • Fernando Pardo-Manuel de Villena
  • Leonard McMillan

Inexpensive and fast genome sequencing has yielded multiple genome assemblies that, taken together, can be considered a single pangenome model. However, applying conventional alignment-based sequence analysis to the assemblies of a pangenome is computationally expensive and largely redundant. Here, we present an alignment-free method that analyzes the relationship of any new sample relative to a given pangenome model using selected k-mer queries. We select a representative set of k-mers from the pangenome as probes and determine their frequencies in the raw short-read sequence data. The selection of probes is designed to cover every base of the pangenome, maximize sharing, and identify informative probes that discriminate between haplotypes. The k-mer frequencies are determined using an FM-index built over the raw sequence data of the new sample. Prior to the k-mer search, the probes are reordered to maximize the shared suffixes between successive k-mers, thus reducing the overall run time compared to executing each search independently. We aggregate the forward and reverse k-mer probe counts, save them in the appropriate rows of a count matrix, and remap them back to their locations in the pangenome. The resulting probe database serves as a valuable resource for representing population-scale sequence variations based on the pangenome model.
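The suffix-sharing reordering lends itself to a simple illustration. The abstract does not spell out the exact algorithm, so the sketch below uses one plausible heuristic: sorting probes by their reversed sequence, which places k-mers with long shared suffixes next to each other, so a backward FM-index search can reuse work between neighbors.

```python
def reorder_probes(probes):
    """Order k-mer probes so that successive probes tend to share long
    suffixes. Sorting by the reversed k-mer groups shared suffixes
    together (an illustrative heuristic, not the authors' algorithm)."""
    return sorted(probes, key=lambda kmer: kmer[::-1])

def shared_suffix(a, b):
    """Length of the longest common suffix of two k-mers."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

probes = ["ACGT", "TCGT", "GGGA", "ACGA", "TGGA"]
ordered = reorder_probes(probes)
# Suffix overlap between each pair of successive probes after reordering:
gains = [shared_suffix(x, y) for x, y in zip(ordered, ordered[1:])]
```

On this toy probe set the reordering raises the total suffix overlap between neighbors, which is the quantity a backward search can exploit.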

PriSeT: efficient de novo primer discovery

  • Marie Hoffmann
  • Michael T. Monaghan
  • Knut Reinert

Motivation: DNA metabarcoding is commonly used to infer the species composition of environmental samples, whereby a short, homologous DNA sequence is amplified and sequenced from all members of the community. Samples can comprise hundreds of organisms that can be closely or very distantly related. DNA metabarcoding combines polymerase chain reaction (PCR) and next-generation sequencing (NGS), and sequences are taxonomically identified based on their match to a reference database. Ideally, each species of interest would have a unique DNA barcode. This short, variable sequence needs to be flanked by conserved regions that can be used as primer-binding sites. PCR primer pairs would amplify a variable barcode in a broad evolutionary range of taxa. To date, no tools exist that computationally search and analyze the effectiveness of new primer pairs for large unaligned sequence data sets. More specifically, we solve the following problem: Given a set of reference sequences R = {R1, R2, ..., Rm}, find a primer set P that allows for a high taxonomic coverage. This goal can be achieved by filtering for frequent primers and ranking by coverage or variation, i.e., the number of unique barcodes for further analysis. Here we present the software PriSeT, an offline primer-discovery tool that is capable of processing large libraries and is robust against mislabeled or low-quality references. It avoids the construction of a multiple sequence alignment of R. Instead, PriSeT uses encodings of frequent k-mers that allow bit-parallel processing and other optimizations.
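The bit-parallel k-mer processing mentioned above typically rests on a 2-bit-per-base encoding; as an illustration (the abstract does not give PriSeT's exact encoding), a single 64-bit integer can hold any k-mer with k ≤ 32, making comparisons and hashing cheap integer operations:

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"

def encode_kmer(kmer):
    """Pack a DNA k-mer (k <= 32) into an integer, 2 bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENCODE[base]
    return code

def decode_kmer(code, k):
    """Unpack a 2-bit-encoded k-mer back into a string."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[code & 3])
        code >>= 2
    return "".join(reversed(bases))
```

With this representation, frequent-k-mer filtering reduces to counting integers, and comparing two primer candidates is a single machine-word comparison.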

Results: We first evaluated PriSeT on references (mostly 18S rRNA genes) from 19 clades covering eukaryotic organisms that are typical for freshwater plankton samples. PriSeT recovered several published primer sets as well as additional, more chemically suitable primer sets. For these new sets, we compared frequency, taxonomic coverage, and amplicon variation with published primer sets. For 11 clades we found de novo primer pairs that cover more taxa than the published ones, and for six clades de novo primers resulted in greater sequence (i.e., DNA barcode) variation. We also applied PriSeT to SARS-CoV-2 genomes and computed 114 new primer pairs with the additional constraint that the sequences have no co-occurrences in closely related taxa. These primer sets would be suitable for empirical testing.



pplacerDC: a new scalable phylogenetic placement method

  • Elizabeth Koning
  • Malachi Phillips
  • Tandy Warnow

Motivation: Phylogenetic placement (i.e., the insertion of a sequence into a phylogenetic tree) is a basic step in several bioinformatics pipelines, including taxon identification in metagenomic analysis and large-scale phylogeny estimation. The most accurate current method is pplacer, which attempts to optimize the placement using maximum likelihood, but it frequently fails on datasets where the phylogenetic tree has 5000 leaves. APPLES is currently the most scalable method, and EPA-ng, although more scalable than pplacer and more accurate than APPLES, also fails on many 50,000-taxon trees. Here we describe pplacerDC, a divide-and-conquer approach that enables pplacer to be used when the phylogenetic tree is very large.

Results: Our study shows that pplacerDC has excellent accuracy and scalability: it matches pplacer where pplacer can run, improves accuracy compared to APPLES and EPA-ng, and is able to run on datasets with up to 100,000 sequences.

Availability: The pplacerDC code is available on GitHub at

Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

  • Yael Ben-Ari
  • Dan Flomin
  • Lianrong Pu
  • Yaron Orenstein
  • Ron Shamir

High-throughput sequencing techniques generate large volumes of DNA sequencing data at ultra-fast speed and extremely low cost. As a consequence, sequencing techniques have become ubiquitous in biomedical research and are used in hundreds of genomic applications. Efficient data structures and algorithms have been developed to handle the large datasets produced by these techniques. The prevailing method to index DNA sequences in those data structures and algorithms is by using k-mers (k-long substrings) known as minimizers. Minimizers are the smallest k-mers selected in every consecutive window of a fixed length in a sequence, where the smallest is determined according to a predefined order, e.g., lexicographic. Recently, a new k-mer order based on a universal hitting set (UHS) was suggested. While several studies have shown that orders based on a small UHS have improved properties, the utility of using them in high-throughput sequencing analysis tasks has been demonstrated to date in only one application of k-mer counting. Here, we demonstrate the practical benefit of UHSs in the genome assembly task. Reconstructing a genome from billions of short reads is a fundamental task in high-throughput sequencing analyses. De Bruijn graph construction is a key step in genome assembly, which often requires very large amounts of memory and long computation time. A critical bottleneck lies in the partitioning of DNA sequences into bins. The sequences in each bin are assembled separately, and the final de Bruijn graph is constructed by merging the bin-specific subgraphs. We incorporated a UHS-based order in the bin partition step of the Minimum Substring Partitioning algorithm. Using a UHS-based order instead of lexicographic- or random-ordered minimizers produced lower density minimizers with more balanced bin partitioning, which led to a reduction in both runtime and memory usage.
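For readers unfamiliar with minimizers, the window-based selection described above can be sketched as follows. Lexicographic order is shown; a UHS-based order would simply substitute a precomputed k-mer ranking for the string comparison:

```python
def minimizers(seq, k, w):
    """Select the minimizer (here: lexicographically smallest k-mer,
    ties broken by leftmost position) in every window of w consecutive
    k-mers, reporting each selected (k-mer, position) once."""
    selected = []
    n_kmers = len(seq) - k + 1
    for start in range(max(0, n_kmers - w + 1)):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        kmer, pos = min(window)  # smallest k-mer, then smallest position
        if not selected or selected[-1] != (kmer, pos):
            selected.append((kmer, pos))
    return selected
```

Because sequences are binned by their minimizers, an order that selects fewer, more evenly distributed minimizers (as a compact UHS does) directly translates into more balanced bins in the partitioning step.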

SESSION: Electronic health records

Session details: Electronic health records

  • Gaurav Pandey

COP-E-CAT: cleaning and organization pipeline for EHR computational and analytic tasks

  • Aishwarya Mandyam
  • Elizabeth C. Yoo
  • Jeff Soules
  • Krzysztof Laudanski
  • Barbara E. Engelhardt

In order to ensure that analyses of complex electronic healthcare record (EHR) data are reproducible and generalizable, it is crucial for researchers to use comparable preprocessing, filtering, and imputation strategies. We introduce COP-E-CAT: Cleaning and Organization Pipeline for EHR Computational and Analytic Tasks, an open-source processing and analysis package for MIMIC-IV, a ubiquitous benchmark EHR dataset. COP-E-CAT allows users to select filtering characteristics and preprocess covariates to generate data structures for use in downstream analysis tasks. This user-friendly approach shows promise in facilitating reproducibility and comparability among studies that leverage the MIMIC-IV data, and enhances EHR accessibility to a wider spectrum of researchers than current data processing methods. We demonstrate the versatility of our workflow by describing three use cases: ensemble prediction, reinforcement learning, and dimension reduction. The software is available at:

Supervised multi-specialist topic model with applications on large-scale electronic health record data

  • Ziyang Song
  • Xavier Sumba Toral
  • Yixin Xu
  • Aihua Liu
  • Liming Guo
  • Guido Powell
  • Aman Verma
  • David Buckeridge
  • Ariane Marelli
  • Yue Li

Motivation: Electronic health record (EHR) data provides a new avenue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic generative process of the EHR data needs to be modelled.

Materials and Methods: We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on the patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S.

Results: We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) Congenital Heart Disease (CHD) diagnostic prediction among 154,775 patients; (2) Chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a means to assess disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services.

Availability and implementation: MixEHR-S source code and scripts of the experiments are freely available at

Concurrent imputation and prediction on EHR data using bi-directional GANs: Bi-GANs for EHR imputation and prediction

  • Mehak Gupta
  • Thao-Ly T. Phan
  • H. Timothy Bunnell
  • Rahmatollah Beheshti

Working with electronic health records (EHRs) is known to be challenging for several reasons. These reasons include not having: 1) similar lengths (per visit), 2) the same number of observations (per patient), and 3) complete entries in the available records. These issues hinder the performance of predictive models created using EHRs. In this paper, we approach these issues by presenting a model for the combined task of imputing and predicting values for the irregularly observed and varying-length EHR data with missing entries. Our proposed model (dubbed Bi-GAN) uses a bidirectional recurrent network in a generative adversarial setting. In this architecture, the generator is a bidirectional recurrent network that receives the EHR data and imputes the existing missing values. The discriminator attempts to discriminate between the actual and the imputed values generated by the generator. Using the input data in its entirety, Bi-GAN learns how to impute missing elements in-between (imputation) or outside of the input time steps (prediction). Our method has three advantages over the state-of-the-art methods in the field: (a) a single model performs both the imputation and prediction tasks; (b) the model can perform predictions using time series of varying length with missing data; (c) it does not require knowing the observation and prediction time windows during training and can be used for predictions with different observation and prediction window lengths, for short- and long-term predictions. We evaluate our model on two large EHR datasets to impute and predict body mass index (BMI) values and show its superior performance in both settings.

Privacy preserving neural networks for electronic health records de-identification

  • Tanbir Ahmed
  • Md Momin Al Aziz
  • Noman Mohammed
  • Xiaoqian Jiang

Over the last decade, significant improvements and efforts in digitizing healthcare have provided us with a sizeable collection of electronic medical records. These Electronic Health Records (EHRs), especially the clinical narratives, brim with hidden knowledge to be discovered but also contain sensitive information about the patient. For this reason, medical institutions are legally prohibited from publishing these data to the public in raw form. Hence, the recent surge of work on de-identification aims to detect and remove sensitive information from clinical narratives. Since 2016, we have seen several deep learning-based approaches for de-identification that achieve over 98% accuracy. However, these models are trained on sensitive information and can unwittingly memorize some of their training data, and a careful analysis of these models can reveal patients' data. In this work, we propose a differentially private ensemble framework for de-identification, allowing medical researchers to collaborate by publicly publishing the de-identification models. To the best of our knowledge, this is the first privacy-preserving machine learning approach for the de-identification of EHRs. Experiments on three different datasets showed competitive results compared to the state-of-the-art methods with guaranteed differential privacy.
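One common way to build a differentially private ensemble is PATE-style noisy vote aggregation; the sketch below illustrates that general mechanism only (the paper's exact construction may differ, and the labels shown are hypothetical). Each teacher model votes for a label, Laplace noise is added to every vote count, and only the winning label is released:

```python
import math
import random

def noisy_argmax(votes, labels, epsilon, rng=random.Random(0)):
    """PATE-style differentially private aggregation: count teacher
    votes per label, add Laplace(1/epsilon) noise to each count, and
    release only the argmax (illustrative sketch, not the paper's
    exact framework)."""
    counts = {lab: 0 for lab in labels}
    for v in votes:
        counts[v] += 1

    def laplace(scale):
        # Inverse-CDF sampling of the Laplace distribution.
        u = rng.random() - 0.5
        return -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))

    noisy = {lab: counts[lab] + laplace(1.0 / epsilon) for lab in labels}
    return max(noisy, key=noisy.get)
```

When the teachers largely agree, the noise rarely flips the winner, so utility stays high while any single training record's influence on the released label is bounded.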

DBNet: a novel deep learning framework for mechanical ventilation prediction using electronic health records

  • Kai Zhang
  • Xiaoqian Jiang
  • Mahboubeh Madadi
  • Luyao Chen
  • Sean Savitz
  • Shayan Shams

The outbreak of the Coronavirus disease (COVID-19) pandemic has caused millions of deaths and put immense pressure on the health care system, especially the supply of mechanical ventilators. It is critical for clinicians to identify in a timely manner the patients whose status may deteriorate in the near future and who will therefore need mechanical ventilators. We propose a prediction model to estimate the probability of requiring mechanical ventilation for in-hospital patients at least 24 hours after their admission. Our model is a multi-modal encoder-decoder attention model that makes full use of the electronic health record (EHR) database. The EHR database consists of heterogeneous data tables of different formats (diagnoses, drug administrations, medicine prescriptions, lab tests, vital sign observations, clinical procedures, and demographics). We leverage the attention mechanism to increase model performance and promote result interpretability. The attention mechanism also serves as a missing-data imputation technique, which is often needed for irregularly sampled temporal data. We name the model DBNet because it takes the database as input. DBNet is evaluated on a large cohort of COVID-19 patients, and the results show it outperforms state-of-the-art baseline deep learning models in predicting the future requirement of mechanical ventilation. It also outperforms several machine learning models even with sophisticated feature engineering. Due to its ability to handle multiple tables as well as longitudinal data, DBNet is not limited to this single application and can be generalized to other healthcare prediction tasks.

SESSION: Systems biology

Session details: Systems biology

  • Leonid Chindelevitch

Gazelle: transcript abundance query against large-scale RNA-seq experiments

  • Xiaofei Zhang
  • Ye Yu
  • Chan Hee Mok
  • James N. MacLeod
  • Jinze Liu

Almost every sequencing data repository has witnessed exponential growth of high-throughput sequencing data. To date, most exploratory analysis of these large datasets requires heavyweight data-processing pipelines that are both resource- and labor-intensive.

Very recently, various algorithms have been developed to enable arbitrary sequence queries over large collections of sequencing data. These algorithms were designed to support presence/absence queries, i.e., screening for RNA-seq samples containing a given transcript sequence. Their utility is rather limited, as they cannot retrieve abundance information for a query sequence. Such abundance information is critical in real applications for understanding how variation in transcript expression associates with different biological conditions or disease subtypes.

In this paper, we present Gazelle, a sequence query engine that enables fast and quantified queries against large-scale RNA-seq experiments. Gazelle exploits the advantages of two different types of hashing algorithms and seamlessly combines them into one integrated structure to support highly efficient and accurate sequence queries with abundance. We evaluated the performance of Gazelle on three datasets to benchmark its efficiency and accuracy as well as its utility in real-life applications. Our results show that Gazelle achieves near-perfect k-mer query accuracy, supports on-demand sequence queries against moderately large sequence databases, and renders abundance estimates highly consistent with RT-qPCR as well as traditional transcript quantification methods such as Kallisto.
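The core idea of an abundance-aware sequence query can be illustrated with a plain hash table standing in for Gazelle's integrated hashing structures (the engine's actual data structures and estimator are not described in the abstract, so this is a conceptual sketch): index the k-mer counts of the read set once, then estimate a query sequence's abundance from the counts of its constituent k-mers.

```python
from collections import Counter
from statistics import median

def kmer_counts(reads, k):
    """Index a read set by exact k-mer counts (a dict stands in for
    the hashing structures a real query engine would use)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def query_abundance(counts, query, k):
    """Estimate the abundance of a query sequence as the median count
    of its constituent k-mers (robust to isolated sequencing errors)."""
    return median(counts[query[i:i + k]] for i in range(len(query) - k + 1))

counts = kmer_counts(["ACGTAC", "ACGTAC", "TTTTTT"], k=3)
```

A presence/absence query would only report whether the k-mers exist; retaining the counts is what enables the quantified queries the paper targets.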

MultiRBP: multi-task neural network for protein-RNA binding prediction

  • Jonathan Karin
  • Hagai Michel
  • Yaron Orenstein

Protein-RNA binding plays vital roles in post-transcriptional gene regulation. High-throughput in vitro binding measurements were generated for more than 200 RNA-binding proteins, enabling the development of computational methods to predict binding to any RNA transcript of interest. In recent years, deep learning-based methods have been developed to predict RNA binding in vitro achieving state-of-the-art results. However, all methods train a single model per protein, under-utilizing the similarities in binding preferences shared by multiple RNA-binding proteins. In this work, we developed MultiRBP, a deep learning-based method to predict RNA binding of hundreds of proteins to a given RNA sequence. The innovation of MultiRBP is in its multi-task nature, i.e., predicting binding for hundreds of proteins at the same time. We trained MultiRBP on the RNAcompete dataset, the most comprehensive dataset of in vitro binding measurements. Our method outperformed extant methods in both in vitro and in vivo RNA-binding prediction. Our method achieved an average Pearson correlation of 0.692±0.17 for in vitro binding prediction, and a median AUROC of 0.668±0.09 for in vivo binding prediction. Moreover, by visualizing the learned binding preferences, MultiRBP provided more interpretable visualization than a single-task model. The code is publicly available at

A spatiotemporal model of polarity and spatial gradient establishment in Caulobacter crescentus

  • Chunrui Xu
  • Yang Cao

Bacterial cells have sophisticated intracellular organization of proteins in space and time, which allows for stress response, signal transduction, cell differentiation, and morphogenesis. The mechanisms of spatial localization and their contributions to cell development and adaptability are not fully understood. In this work, we use the bacterial model organism Caulobacter crescentus to investigate the establishment of polarity and asymmetry. We apply a reaction-diffusion model to simulate the spatiotemporal dynamics of the scaffolding proteins PodJ and PopZ, which account for the formation of distinct poles in C. crescentus. Additionally, we use this mathematical model to investigate the nonuniform distribution of the key kinase DivJ and phosphatase PleC and to determine their contributions to the spatial gradient of the response regulators DivK and CtrA.
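To see how a reaction-diffusion model produces a pole-to-pole gradient, consider a minimal one-species sketch (parameters and the single-species setup are illustrative assumptions, not the authors' PodJ/PopZ model): a protein produced only at one pole diffuses along the cell axis and decays, yielding a monotone spatial gradient at steady state.

```python
def reaction_diffusion_step(u, D, dx, dt, production, decay):
    """One explicit Euler step of du/dt = D*u_xx + production - decay*u
    on a 1-D grid with no-flux boundaries (illustrative scheme only)."""
    n = len(u)
    new = u[:]
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]        # no-flux boundary
        right = u[i + 1] if i < n - 1 else u[i]   # no-flux boundary
        lap = (left - 2 * u[i] + right) / dx ** 2
        new[i] = u[i] + dt * (D * lap + production[i] - decay * u[i])
    return new

# Production localized at the old pole (position 0); D*dt/dx^2 = 0.2
# keeps the explicit scheme stable.
n = 20
u = [0.0] * n
source = [1.0] + [0.0] * (n - 1)
for _ in range(5000):
    u = reaction_diffusion_step(u, D=1.0, dx=0.5, dt=0.05,
                                production=source, decay=0.1)
```

After relaxing to steady state, the concentration decreases monotonically from the source pole, which is the qualitative behavior that underlies spatial gradients such as the DivK and CtrA distributions studied in the paper.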

Predicting aneurysmal degeneration of type B aortic dissection with computational fluid dynamics

  • Bradley Feiger
  • Erick Lorenzana
  • David Ranney
  • Muath Bishawi
  • Julie Doberne
  • Andrew Vekstein
  • Soraya Voigt
  • Chad Hughes
  • Amanda Randles

Stanford Type B aortic dissection (TBAD) is a deadly cardiovascular disease with mortality rates as high as 50% in complicated cases. Patients with TBAD are often medically managed, but in ~20--40% of cases, patients experience aneurysmal degeneration in the dissected aorta, and surgical intervention is required. In this work, we simulated blood flow using computational fluid dynamics (CFD) to determine relationships between hemodynamics and aneurysmal degeneration, providing an important step towards predicting the need for intervention prior to significant aneurysm occurrence. Currently, surgeons intervene in TBAD cases based on the aneurysm's growth rate and overall size, as well as a variety of other factors such as malperfusion, thrombosis, and pain, but predicting future risk of aneurysmal degeneration would allow earlier intervention, leading to improved patient outcomes. Here, we hypothesized that hemodynamic metrics play an important role in the formation of aneurysms and that these metrics could be used to predict future aneurysmal degeneration in this patient population. Our retrospective dataset consisted of 16 patients with TBAD, of whom eight required intervention due to aneurysmal degeneration and eight were medically managed. The patients with surgical intervention were examined in our study prior to the formation of an aneurysm. For each patient, we segmented and reconstructed the aortic geometry and simulated blood flow using the lattice Boltzmann method. We then compared hemodynamic metrics between the two groups of patients, including time-averaged wall shear stress, oscillatory shear index, relative residence time, and flow fractions to the true and false lumen. We found significant differences in each metric between the true and false lumen. We also showed that the flow fraction to the false lumen was higher in patients with aneurysmal degeneration (p = 0.02). These results are an important step towards developing more precise methods to predict future aneurysmal degeneration and the need for intervention in TBAD patients.
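Two of the metrics named above have standard definitions that are easy to state from a time series of wall shear stress (WSS) vectors at a wall point: TAWSS is the mean WSS magnitude, and the oscillatory shear index measures how much the WSS direction reverses over the cardiac cycle. The sketch below uses those textbook definitions (the paper's own post-processing pipeline is not described in the abstract):

```python
import math

def tawss_and_osi(wss_series):
    """Compute TAWSS and OSI from a series of 2-D wall shear stress
    vectors at one wall point, using the standard definitions:
      TAWSS = mean |tau|,
      OSI   = 0.5 * (1 - |mean tau| / mean |tau|).
    OSI is 0 for unidirectional flow and 0.5 for fully oscillatory flow."""
    n = len(wss_series)
    mean_vec = [sum(v[i] for v in wss_series) / n for i in (0, 1)]
    mean_mag = sum(math.hypot(*v) for v in wss_series) / n
    osi = 0.5 * (1.0 - math.hypot(*mean_vec) / mean_mag)
    return mean_mag, osi
```

Regions combining low TAWSS with high OSI also drive relative residence time upward, which is why these three metrics are typically reported together in aneurysm studies.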

SESSION: Genomic variation

Session details: Genomic variation

  • Mario Cannataro

Frontier: finding the boundaries of novel transposable element insertions in genomes

  • Anwica Kashfeen
  • Leonard McMillan

Transposable Elements (TEs) are DNA subsequences that have historically copied themselves throughout a genome. Apart from constituting a large fraction of all eukaryotic genomes, TEs are a significant source of genetic variation and are directly responsible for many diseases. TEs are also one of the most difficult genomic regions to analyze. A typical approach for identifying TE insertions (TEi) involves the detection of split-reads, which requires checking whether each read can be split into TE and non-TE parts. Identification of the TE part depends on a model for each distinct TE class, and these classes vary significantly both within and between species. Previous methods for detecting segregating TEis depend on template libraries, and their computational cost increases with the number of templates. Here we propose Frontier, a novel template-free method for identifying the split-reads containing TEi boundaries. We leverage the pervasiveness of TE sequences to identify candidate reads that might include the boundary of an insertion. We then apply machine learning methods to further classify whether the read includes actual TE-like sequence. For each predicted TEi boundary we apply a second classifier to infer the corresponding TE type (LINE, SINE, Alu, ERV/LTR). Both classifiers achieve high precision (> 0.9), recall (> 0.8), and F1 score (> 0.8) when applied to real data. The resulting trained model can detect and classify about 50 million frontier reads in less than an hour. The Frontier code is available on GitHub.

Statistical analysis of GC-biased gene conversion and recombination hotspots in eukaryotic genomes: a phylogenetic hidden Markov model-based approach

  • Meijun Gao
  • Kevin J. Liu

Genetic recombination in eukaryotes can occur with or without crossover, where the latter event is referred to as gene conversion. New discoveries in the genomic and post-genomic era have shed new light on the complex interplay between recombination and other evolutionary processes such as point mutations. In particular, the G/C content of genomic regions can increase over evolutionary time due to recombination in the form of gene conversion, a phenomenon known as GC-biased gene conversion (gBGC), and gBGC is increasingly appreciated as serving an important role in genome evolution throughout the eukaryotic Tree of Life. These findings have largely relied on computational advances for analyzing recombinant sequences for indirect signatures of gBGC. However, deeper insights into the functional and evolutionary significance of gBGC require a unified framework that accounts for variable-across-sites recombination and point mutation processes.

In this study, we introduce PHYNCH (or "PHYlogeNetiC-HMM for analyzing gBGC and recombination hotspots"). PHYNCH utilizes a statistical model that combines a hidden Markov model to capture local genealogical variation due to recombination and gene conversion with a finite-sites model of sequence evolution along a local genealogy. Inference and learning under the new model are used to detect and analyze local patterns of gBGC and recombination hotspots within genomic sequences. We validate the performance of PHYNCH using simulated benchmarking data. Furthermore, we use PHYNCH to create a new genomic map of gBGC and recombination in rice.
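The HMM component can be illustrated with a deliberately stripped-down two-state decoder: one state emits GC-rich "hotspot" sequence, the other emits background sequence, and Viterbi decoding segments the input. This toy omits the parts that make PHYNCH a phylo-HMM (local genealogies and the finite-sites substitution model); the parameters below are arbitrary illustration values.

```python
import math

def viterbi_gc(seq, p_stay=0.9, gc_hot=0.8, gc_cold=0.4):
    """Two-state Viterbi decoding of GC-rich 'hotspot' vs background
    segments (a toy stand-in for the phylo-HMM machinery)."""
    states = ("cold", "hot")
    gc = {"cold": gc_cold, "hot": gc_hot}

    def emit(state, base):
        p = gc[state]
        return math.log((p if base in "GC" else 1 - p) / 2)

    log_stay, log_switch = math.log(p_stay), math.log(1 - p_stay)
    v = {s: math.log(0.5) + emit(s, seq[0]) for s in states}
    back = []
    for base in seq[1:]:
        nv, choice = {}, {}
        for s in states:
            prev = max(states, key=lambda t: v[t] + (log_stay if t == s else log_switch))
            nv[s] = v[prev] + (log_stay if prev == s else log_switch) + emit(s, base)
            choice[s] = prev
        back.append(choice)
        v = nv
    # Trace back the most probable state path.
    state = max(states, key=v.get)
    path = [state]
    for choice in reversed(back):
        state = choice[state]
        path.append(state)
    return list(reversed(path))

path = viterbi_gc("ATATATAT" + "GCGCGCGC" + "ATATATAT")
```

The self-transition probability controls segment granularity: the closer `p_stay` is to 1, the longer a GC-rich run must be before switching states pays for the transition penalty.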

Novel genomic duplication models through integer linear programming

  • Jarosław Paszek
  • Oliver Eulenstein
  • Paweł Górecki

Unveiling ancient whole-genome duplications, or WGDs, in the evolutionary history of species is elementary to understand how gene families have formed over time and genomes evolved. A classic framework of WGD models for deciphering ancient species in which genome duplications occurred is based on reconciling multiple gene trees with a species tree. Reconciling gene trees with a species tree reveals evolutionary scenarios describing how genes have evolved along species tree branches through speciation and single duplication events. Clustering single duplication events from different gene trees occurring in the same species can reveal duplication episodes indicative of remnants of ancient WGDs. WGD models can be categorized into restricted and unrestricted models. Restricted models only consider scenarios where single duplications are limited by the timing of their ancestor speciation, while unrestricted models consider all possible evolutionary scenarios. Representing two extremes of the overall spectrum of possible scenarios, unrestricted models are biased towards locating duplication episodes close to the root of the species tree, while restricted models tend to locate episodes close to the most recent species that theoretically could have contained them.

Adding flexibility for improved biological realism, in this work we develop and analyze a novel framework of WGD models encompassing the whole range of intermediate locations by defining, implementing, and testing models under multiple constraint strategies. We achieve this by formulating the first ILP model for the NP-hard problem of computing duplication episodes under the classic unrestricted WGD model from Fellows et al. and then incorporating constraints into this formulation reflecting WGD models for intermediate locations. Finally, we demonstrate the exemplary performance of our models and show that our ILP formulations allow solving typical problem instances occurring in practice.

SESSION: Health monitoring & phenotyping

Session details: Health monitoring & phenotyping

  • Yonghui Wu

Transformer-based unsupervised patient representation learning based on medical claims for risk stratification and analysis

  • Xianlong Zeng
  • Simon Lin
  • Chang Liu

Claims data, containing medical codes, service information, and incurred expenditures, can be a good resource for estimating an individual's health condition and medical risk level. In this study, we developed the Transformer-based Multimodal AutoEncoder (TMAE), an unsupervised learning framework that can learn efficient patient representations by encoding meaningful information from claims data. TMAE is motivated by the practical need in healthcare to stratify patients into different risk levels to improve care delivery and management. Compared to previous approaches, TMAE is able to 1) model inpatient, outpatient, and medication claims collectively, 2) handle irregular time intervals between medical events, 3) alleviate the sparsity issue of rare medical codes, and 4) incorporate medical expenditure information. We trained TMAE using a real-world pediatric claims dataset containing more than 600,000 patients and compared its performance with various approaches in two clustering tasks. Experimental results demonstrate that TMAE has superior performance compared to all baselines. Multiple downstream applications are also conducted to illustrate the effectiveness of our framework. The promising results confirm that the TMAE framework is scalable to large claims data and is able to generate efficient patient embeddings for risk stratification and analysis.

Signal quality detection towards practical non-touch vital sign monitoring

  • Zongxing Xie
  • Bing Zhou
  • Fan Ye

Non-touch vital sign sensing is gaining popularity because it does not require users' cooperative efforts (e.g., charging, wearing) and is thus convenient for longitudinal monitoring. In recent radio-based heart and respiration rate (HR and RR) sensing using Wi-Fi, millimeter wave (mmWave), or ultra-wideband (UWB), inevitable user movements or background moving objects cause large disturbances to the much weaker respiratory and heart signals. Such "corrupted" signals must be detected and excluded to avoid making erroneous measurements. Despite several attempts, reliable signal quality detection (SQD) remains unresolved. In this paper, we spent over 80 hours manually examining 50,268 data samples collected from 8 participants. We find that heart and respiration signals are not always simultaneously available, which breaks an important assumption in prior work. We propose a 2-bit SQD to classify their "availability" separately. We further quantify the contributions of and correlation among a comprehensive set of features in both the time and frequency domains, and use a forward selection strategy to identify an optimal and much smaller feature set for multiple common classification algorithms. Extensive experiments show that our 2-bit SQD achieves 91/95% precision and 88/91% recall in detecting available RR/HR signals, as compared to a flat spectrum detector (FSD) [3] and a spectrum-averaged harmonic path detector (SHAPA) [24] from prior work, and reduces the 80th-percentile RR/HR errors from 10/18 bpm to 3.5/4.0 bpm, a 3-4-fold reduction.
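The forward selection strategy mentioned above follows a standard greedy pattern, sketched here with a user-supplied scoring function (the feature names and toy scorer are hypothetical; in the paper the score would be something like cross-validated classifier accuracy):

```python
def forward_select(features, score, max_features=None):
    """Greedy forward feature selection: repeatedly add the feature
    whose inclusion most improves score(subset), stopping when no
    candidate improves the current best score."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical example: two informative features, one noise feature,
# scored by a toy criterion that mildly penalizes subset size.
toy_score = lambda s: len(set(s) & {"hr_peak", "rr_peak"}) - 0.1 * len(s)
chosen = forward_select(["hr_peak", "rr_peak", "noise"], toy_score)
```

The size penalty in the toy scorer is what makes selection stop before absorbing the uninformative feature, mirroring how a real pipeline stops when validation accuracy no longer improves.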

DeepNote-GNN: predicting hospital readmission using clinical notes and patient network

  • Sara Nouri Golmaei
  • Xiao Luo

With the increasing availability of Electronic Health Records (EHRs) and advances in deep learning techniques, developing deep predictive models that use EHR data to solve healthcare problems has gained momentum in recent years. The majority of clinical predictive models benefit from structured data in EHR (e.g., lab measurements and medications). Still, learning clinical outcomes from all possible information sources is one of the main challenges when building predictive models. This work focuses on two sources of information that have been underused by researchers: unstructured data (e.g., clinical notes) and a patient network. We propose a novel hybrid deep learning model, DeepNote-GNN, that integrates clinical notes information and patient network topological structure to improve 30-day hospital readmission prediction. DeepNote-GNN is a robust deep learning framework consisting of two modules: DeepNote and patient network. DeepNote extracts deep representations of clinical notes using a feature aggregation unit on top of a state-of-the-art Natural Language Processing (NLP) technique, BERT. By exploiting these deep representations, a patient network is built, and a Graph Neural Network (GNN) is used to train the network for hospital readmission predictions. Performance evaluation on the MIMIC-III dataset demonstrates that DeepNote-GNN achieves superior results compared to the state-of-the-art baselines on the 30-day hospital readmission task. We extensively analyze the DeepNote-GNN model to illustrate the effectiveness and contribution of each of its components. The model analysis shows that the patient network has a significant contribution to the overall performance, and DeepNote-GNN is robust and can consistently perform well on the 30-day readmission prediction task.

Pheno-mapper: an interactive toolbox for the visual exploration of phenomics data

  • Youjia Zhou
  • Methun Kamruzzaman
  • Patrick Schnable
  • Bala Krishnamoorthy
  • Ananth Kalyanaraman
  • Bei Wang

High-throughput technologies to collect field data have made observations possible at scale in several branches of life sciences. The data collected can range from the molecular level (genotypes) to physiological (phenotypic traits) and environmental observations (e.g., weather, soil conditions). These vast swathes of data, collectively referred to as phenomics data, represent a treasure trove of key scientific knowledge on the dynamics of the underlying biological system. However, extracting information and insights from these complex datasets remains a significant challenge owing to their multidimensionality and lack of prior knowledge about their complex structure. In this paper, we present Pheno-Mapper, an interactive toolbox for the exploratory analysis and visualization of large-scale phenomics data. Our approach uses the mapper framework to perform a topological analysis of the data, and subsequently render visual representations with built-in data analysis and machine learning capabilities. We demonstrate the utility of this new tool on real-world plant (e.g., maize) phenomics datasets. In comparison to existing approaches, the main advantage of Pheno-Mapper is that it provides rich, interactive capabilities in the exploratory analysis of phenomics data, and it integrates visual analytics with data analysis and machine learning in an easily extensible way. In particular, Pheno-Mapper allows the interactive selection of subpopulations guided by a topological summary of the data and applies data mining and machine learning to these selected subpopulations for in-depth exploration.

SESSION: Structural bioinformatics

Session details: Structural bioinformatics

  • Lingling An

Modeling protein structures from predicted contacts with modern molecular dynamics potentials: accuracy, sensitivity, and refinement

  • Russell B. Davidson
  • Mathialakan Thavappiragasam
  • T. Chad Effler
  • Jess Woods
  • Dwayne A. Elias
  • Jerry M. Parks
  • Ada Sedova

Protein structure prediction has become increasingly popular and successful in recent years. An essential step for fragment-free, template-free methods is the generation of a final three-dimensional protein model from a set of predicted amino acid contacts, which are often described by inter-residue pairwise atomic distances. Here we explore the use of modern, open-source molecular dynamics (MD) engines, which have been continually developed with all-atom Hamiltonians over the last three decades to model biomolecular structure and dynamics, to generate accurate protein structures starting from a set of inferred pairwise distances. Additionally, we test the ability of MD empirical physical potentials to correct inaccuracies in the predicted geometries. We rigorously characterize the effect of modeling parameters on the results and the effect of different amounts of error in the predicted distances on the final structures, and we test the ability of post-processing analysis to sort the best models out of a set of statistical replicas. We find that with both exact and noisy distances the method can produce excellent structural models, and that the molecular dynamics force field appears to help correct errors in distance predictions, resisting the effects of applied noise.
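Distance-restrained modeling of this kind typically adds a penalty on top of the force field for each restrained atom pair. The harmonic form below is a common illustrative choice; the force constant and coordinates are invented, and the paper's actual restraint setup is not shown here.

```python
import math

# Harmonic distance restraints: penalize the deviation of each restrained
# atom pair from its predicted target distance.

def restraint_energy(coords, restraints, k=1.0):
    """coords: {atom: (x, y, z)}; restraints: [(atom_i, atom_j, target_distance)]."""
    energy = 0.0
    for i, j, d0 in restraints:
        d = math.dist(coords[i], coords[j])
        energy += k * (d - d0) ** 2
    return energy

# Two CA atoms 5.0 apart, restrained to a predicted distance of 4.0.
coords = {"CA1": (0.0, 0.0, 0.0), "CA2": (3.0, 4.0, 0.0)}
e = restraint_energy(coords, [("CA1", "CA2", 4.0)])
```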

Computational modeling of SARS-CoV-2 Nsp1 binding to human ribosomal 40S complex

  • Linkel Boateng
  • Anita Nag
  • Homayoun Valafar

Molecular docking and Targeted Molecular Dynamics Simulations were conducted to elucidate the binding properties of SARS-CoV-2 nonstructural protein 1 (nsp1) on the human ribosomal 40S complex. Nsp1 serves as a host shutoff factor by blocking ribosome assembly on host mRNAs, thereby suppressing host gene expression. Recently, cryo-electron microscopy structures of both the 40S and 80S ribosomes purified in the presence of SARS-CoV-2 nsp1 revealed the presence of the C-terminal region of nsp1 in the mRNA binding site of the 40S ribosome. These structures give the first insight into the molecular mechanism of nsp1-mediated suppression of host protein translation. In this study we have utilized the most recent partial structures of nsp1 bound to the 40S ribosome as the reference point for a Targeted Molecular Dynamics Simulation of the entire nsp1 bound to the 40S complex. Our final bound structure of nsp1 exhibits the previously reported helix-turn-helix conformation of the C-terminal region and satisfies all the previously reported proximity restraints. Finally, we have established the interaction and stability of this final bound state of the full nsp1 and the 40S complex. The observation that the C-terminal region of nsp1 folds into a helix-turn-helix structure to occupy the mRNA binding site of the 40S ribosome enables further inquiry into the structure of the entire ribosome-bound nsp1, in particular the relative positioning of its two termini.

SESSION: Single cell omics

Session details: Single cell omics

  • Zhongming Zhao

FastCount: a fast gene count software for single-cell RNA-seq data

  • Jinpeng Liu
  • Xinan Liu
  • Ye Yu
  • Chi Wang
  • Jinze Liu

Motivation: The advent of single-cell RNA-seq (scRNA-seq) enables scientists to characterize the transcriptomic response of cells under different conditions and to understand expression heterogeneity at the single-cell level. One of the fundamental steps in scRNA-seq analysis is to summarize raw sequencing reads into a list of gene counts for each individual cell. However, this step remains the most time-consuming and resource-intensive part of the analysis workflow due to the large amount of data produced in a scRNA-seq experiment. It is further complicated by the special handling of cell barcode and unique molecular identifier (UMI) information in the read sequences. For example, gene count summarization of 10X Chromium sequencing by the standard Cell Ranger count often takes many hours to finish when running on a computing cluster. Although several alignment-free algorithms have been developed to improve efficiency, their derived gene counts suffer from poor concordance with Cell Ranger count and algorithm-specific bias [1].

Results: In this work, we present a lightweight k-mer-based gene counting algorithm, FastCount, to support efficient UMI counting from single-cell RNA-seq data. We demonstrate that FastCount is over an order of magnitude faster than Cell Ranger count while achieving competitive accuracy on 10X Genomics single-cell RNA-seq data. FastCount is a stand-alone program implemented in C++. The source code is located at
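The core of UMI counting is that reads sharing the same (cell barcode, gene, UMI) triple are collapsed to a single molecule before counting. The record layout below is illustrative; FastCount's internal representation and error handling are not shown.

```python
from collections import defaultdict

# UMI-aware gene counting: collapse duplicate molecules, then count
# distinct UMIs per (cell, gene) pair.

def count_umis(reads):
    """reads: iterable of (cell_barcode, umi, gene) tuples -> {(cell, gene): count}."""
    molecules = {(cell, gene, umi) for cell, umi, gene in reads}
    counts = defaultdict(int)
    for cell, gene, _umi in molecules:
        counts[(cell, gene)] += 1
    return dict(counts)

reads = [
    ("AAAC", "TTG", "GeneA"),   # PCR duplicate of the next read
    ("AAAC", "TTG", "GeneA"),
    ("AAAC", "CCA", "GeneA"),   # a second molecule of GeneA
    ("AAAC", "TTG", "GeneB"),
]
counts = count_umis(reads)
```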

Fast and memory-efficient scRNA-seq k-means clustering with various distances

  • Daniel N. Baker
  • Nathan Dyjack
  • Vladimir Braverman
  • Stephanie C. Hicks
  • Ben Langmead

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures such as Jensen-Shannon divergence, Kullback-Leibler divergence, and the Bhattacharyya distance, which can be applied directly to count data and probability distributions.

Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++, and mini-batch k-means can cluster a 4-million-cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.

Availability: The open source library is at Code used for experiments is at
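Clustering under KL divergence rather than Euclidean distance amounts to normalizing counts to probabilities and assigning each cell to the center with the smallest divergence. The pseudocount below stands in for the priors that minicore handles more carefully; this is an illustration, not minicore's code.

```python
import math

# Assign a cell's count vector to the nearest cluster center under
# Kullback-Leibler divergence.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(counts, pseudo=1e-9):
    """Turn raw counts into a probability vector, with a small pseudocount."""
    total = sum(counts) + pseudo * len(counts)
    return [(c + pseudo) / total for c in counts]

def assign(cell_counts, centers):
    p = normalize(cell_counts)
    return min(range(len(centers)), key=lambda k: kl(p, normalize(centers[k])))

centers = [[10, 0, 0], [0, 5, 5]]
label = assign([8, 1, 0], centers)   # mostly gene 0 -> closer to center 0
```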

A hybrid deep neural network for robust single-cell genome-wide DNA methylation detection

  • Russell A. Li
  • Zhandong Liu

DNA methylation is an epigenetic mechanism that occurs when methyl groups are added to the 5th carbons of DNA cytosine residues. The process primarily takes place at CpG sites within the genome for the purpose of regulating gene expression. Most cancerous cells result from aberrant DNA methylation, and the process is also linked to neurological disorders such as Alzheimer's and Parkinson's diseases. To discern the link between DNA methylation patterns and diseases, the methylation status of CpG sites throughout the genome must be known. Existing practical sequencing techniques can only map out methylation statuses for 10% to 40% of CpG sites. To address this deficiency, we have developed a hybrid deep neural network to estimate missing methylation statuses across the entire genome. The network is built from convolutional neural network layers and bidirectional LSTM layers; it extracts features from raw DNA sequences and creatively utilizes information contained in neighboring CpG sites. Our network achieved accuracy rates of 91% to 93% on the task of DNA methylation status identification, a statistically significant improvement over existing leading computational methods.
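Convolutional layers over raw DNA sequence usually receive a one-hot-encoded window around each CpG site. The channel ordering below is an arbitrary choice for illustration, not taken from the paper.

```python
# One-hot encode a DNA window: each base becomes a 4-dimensional indicator
# vector; unknown bases (e.g. N) become all zeros.

BASES = "ACGT"

def one_hot(seq):
    """Return a len(seq) x 4 matrix of floats."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

window = one_hot("ACGN")
```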

Copy number variation detection using single cell sequencing data

  • Fatima Zare
  • Jacob Stark
  • Sheida Nabavi

Single-cell sequencing (SCS) has emerged as a critical means of discovering important biological knowledge. Data analysis plays an essential role in extracting accurate and meaningful information from SCS data. However, compared to bulk sequencing, SCS introduces new challenges in data analysis. In this paper, we present a novel CNV detection algorithm for SCS data. The proposed method first finds the optimal window size for generating the read count signal using the AIC approach and removes outliers from the read count signal. Then, using a novel segmentation method based on the Total Variation approach, it identifies significant change points and detects CNV segments. Finally, it clusters cells hierarchically based on their CNV patterns and employs Z-scores to improve CNV detection across cells. We used real and simulated data to evaluate the performance of the proposed method and compared it with other commonly used CNV detection methods. We show that the proposed method outperforms existing CNV detection methods in terms of sensitivity and false discovery rate.
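The Z-score step can be illustrated on a binned read-count signal: windows whose depth deviates strongly from the mean suggest copy-number gains or losses. The cutoff and counts below are invented, and the paper's Total Variation segmentation is far more involved than this sketch.

```python
import statistics

# Flag candidate CNV windows by z-scoring binned read counts.

def zscores(counts):
    mu = statistics.mean(counts)
    sd = statistics.pstdev(counts)
    return [(c - mu) / sd for c in counts]

def call_cnv(counts, cutoff=1.5):
    """Return indices of windows whose |z| exceeds the cutoff."""
    return [i for i, z in enumerate(zscores(counts)) if abs(z) >= cutoff]

# Window 3 has roughly double the coverage of its neighbours (possible gain).
flagged = call_cnv([100, 98, 102, 200, 101, 99])
```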

SESSION: Machine learning & drug design

Session details: Machine learning & drug design

  • Ariful Azad

SPEAR: self-supervised post-training enhancer for molecule optimization

  • Tianfan Fu
  • Cao Xiao
  • Kexin Huang
  • Lucas M. Glass
  • Jimeng Sun

The molecular optimization task is to generate molecules that are similar to a target molecule but with better chemical properties. Deep Generative Models (DGMs) have shown initial success in automatic molecule optimization. However, the training of DGMs often suffers from limited labeled molecule pairs due to the ad-hoc and restricted molecule pair construction. To address this challenge and leverage the entire unpaired molecule database, we propose the Self-Supervised Post-training EnhAnceR method (SPEAR) to enhance any graph-based DGM for molecule optimization. SPEAR mines molecular structure knowledge and learns the molecule generation procedure in a purely self-supervised fashion. Unlike most self-supervised deep learning models that rely on pre-training for better molecule representation, SPEAR is applied as a post-processing step that enhances molecule optimization at inference time, without additional training of the DGM. SPEAR can be efficiently incorporated into any DGM as part of the inference procedure. We evaluated SPEAR against several state-of-the-art DGMs; SPEAR improved the performance of all of them, obtaining a 5--21% relative improvement in success rate over the corresponding DGM models.

A value-based approach for training of classifiers with high-throughput small molecule screening data

  • Natalia Khuri
  • Sarah Parsons

In many practical applications of machine learning, models are built using experimental data that are noisy, biased, and of low quality. Binary classifiers trained with such data have low performance in independent and prospective tests. This work builds upon techniques for estimating the value of training data and evaluates a batch-based data valuation. Comparative experiments conducted with seven challenging benchmarks demonstrate that classification performance can be improved by 10% to 25% in independent tests using value-based training of classifiers. Additionally, between 97% and 100% of class labels can be detected among low-valued training samples. Finally, results show that simpler and faster learning methods, such as generalized linear models, perform as well as complex gradient-boosted trees when the training data comprise only the high-valued samples extracted from high-throughput small molecule screens.

Predicting drug resistance in M. tuberculosis using a long-term recurrent convolutional network

  • Amir Hosein Safari
  • Nafiseh Sedaghat
  • Hooman Zabeti
  • Alpha Forna
  • Leonid Chindelevitch
  • Maxwell Libbrecht

Motivation: Drug resistance in Mycobacterium tuberculosis (MTB) is a growing threat to human health worldwide. One way to mitigate the risk of drug resistance is to enable clinicians to prescribe the right antibiotic drugs to each patient through methods that predict drug resistance in MTB using whole-genome sequencing (WGS) data. Existing machine learning methods for this task typically convert the WGS data from a given bacterial isolate into features corresponding to single-nucleotide polymorphisms (SNPs) or short sequence segments of a fixed length K (K-mers). Here, we introduce a gene burden-based method for predicting drug resistance in TB. We define one numerical feature per gene corresponding to the number of mutations in that gene in a given isolate. This representation greatly reduces the number of model parameters. We further propose a model architecture that considers both gene order and locality structure through a Long-term Recurrent Convolutional Network (LRCN) architecture, which combines convolutional and recurrent layers.

Results: We find that using these strategies yields a substantial, statistically significant improvement over state-of-the-art methods on a large dataset of M. tuberculosis isolates, and suggest that this improvement is driven by our method's ability to account for the order of the genes in the genome and their organization into operons.

Availability: The implementations of our feature preprocessing pipeline1 and our LRCN model2 are publicly available, as is our complete dataset3.

Supplementary information: Additional data are available in the Supplementary Materials document4.
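The gene-burden representation described in the Motivation above reduces WGS variant calls to one number per gene. The gene names and variant records below are made up for illustration.

```python
from collections import Counter

# Gene-burden features: for a given isolate, count the mutations falling in
# each gene, in a fixed gene order.

def gene_burden(variants, genes):
    """variants: iterable of (position, gene) calls; genes: fixed gene order."""
    hits = Counter(gene for _pos, gene in variants)
    return [hits[g] for g in genes]

genes = ["katG", "rpoB", "gyrA"]
features = gene_burden([(315, "katG"), (450, "rpoB"), (451, "rpoB")], genes)
```

Feeding the LRCN these per-gene counts in genomic order is what lets the model exploit gene order and operon structure.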

LSHvec: a vector representation of DNA sequences using locality sensitive hashing and FastText word embeddings

  • Lizhen Shi
  • Bo Chen

Drawing on the analogy between natural language and the "genomic sequence language", we explored the applicability of word embeddings from natural language processing (NLP) to represent DNA reads in metagenomics studies. Here, the k-mer is the equivalent of the word in NLP and has been widely used in analyzing sequence data. However, directly replacing word embeddings with k-mer embeddings is problematic for two reasons. First, the number of distinct k-mers is far greater than the number of different words in our vocabulary, making the model too large to store in memory. Second, sequencing errors create many novel k-mers (noise), which significantly degrade model performance. In this work, we introduce LSHvec, a model that leverages Locality Sensitive Hashing (LSH) for k-mer encoding to overcome these challenges. After k-mers are LSH-encoded, we adopt skip-gram with negative sampling to learn k-mer embeddings. Experiments on labeled metagenomic datasets demonstrate that k-mer encoding using LSH not only accelerates training and reduces the memory required to store the model, but also achieves higher accuracy than alternative encoding methods. We validate that LSHvec is robust on reads with high sequencing error rates and works well with any sequencing technology. In addition, the trained low-dimensional k-mer embeddings can potentially be used for accurate metagenomic read clustering and taxonomic classification. Finally, we demonstrate the unprecedented capability of LSHvec by participating in the second round of the CAMI challenges and show that LSHvec is able to handle metagenome datasets exceeding terabytes in size through distributed training across multiple nodes.
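The intuition behind LSH encoding of k-mers can be sketched with random-hyperplane hashing: similar k-mers tend to fall in the same bucket, so an error-containing k-mer often maps near its true neighbor instead of becoming an unrelated token. The base encoding and number of hyperplanes below are arbitrary illustrative choices, not LSHvec's actual scheme, and the subsequent skip-gram training is not shown.

```python
import random

# Random-hyperplane LSH over a crude numeric view of a k-mer: the hash code
# is the sign pattern of projections onto random directions.

BASE_VALS = {"A": (1, 0), "C": (0, 1), "G": (-1, 0), "T": (0, -1)}

def kmer_vector(kmer):
    vec = []
    for b in kmer:
        vec.extend(BASE_VALS[b])
    return vec

def lsh_code(kmer, planes):
    v = kmer_vector(kmer)
    return tuple(1 if sum(a * b for a, b in zip(v, p)) >= 0 else 0 for p in planes)

rng = random.Random(0)
k = 8
planes = [[rng.gauss(0, 1) for _ in range(2 * k)] for _ in range(4)]
code_a = lsh_code("ACGTACGT", planes)
code_b = lsh_code("ACGTACGA", planes)   # one mismatch at the last base
```

Because the two k-mers differ in one base, their projections differ only slightly, so their 4-bit codes frequently (though not always) collide.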

SESSION: Medical imaging

Session details: Medical imaging

  • Sanjay Purushotham

AW-Net: automatic muscle structure analysis on B-mode ultrasound images for injury prevention

  • Hugo Michard
  • Bertrand Luvison
  • Quoc-Cuong Pham
  • Antonio J. Morales-Artacho
  • Gaël Guilhem

Muscle injuries have deep physical and mental impacts on athletic individuals. Not only are injuries a heavy setback for sportspersons because they reduce practice time, but they also have significant repercussions on their personal lives [7, 17]. Knowing an individual's muscle state, so as to adapt their practice to their needs and prevent potential injuries, is nowadays a high-stakes issue. Muscle geometry, referred to as muscle architecture, is a crucial determinant of performance (i.e., muscle strength, power, shortening velocity). For the past decade, it has been proposed that hamstring muscle architecture (e.g., muscle fascicle length) may influence an individual's exposure to muscle injury [2]. B-mode ultrasound imaging is a non-invasive technique widely used to extract muscle properties by identifying the pennation angle and fascicle length in the images, but it requires manual post-processing that is time-consuming and prone to approximations. This paper presents a fully automated method to analyze muscle structure and properties from B-mode ultrasound images, more specifically by estimating the aponeurosis and fascicle architecture. In this study, we focus on segmenting aponeuroses and estimating a vector-field model of the fascicle structure with a novel multi-task model, Attention W-Net (AW-Net), based on the U-Net architecture with attention gates. We enriched two public ultrasound datasets on lower-leg muscles with aponeurosis annotations and made these modifications publicly available. Trained on these datasets, our model outperforms state-of-the-art methods and demonstrates precise estimation of muscle structure and properties in a fully automated fashion. Additionally, by augmenting the training set with 288 proprietary ultrasound images of the semimembranosus and biceps femoris muscles collected from athletic individuals and annotated by medical and sport science experts, our method generalizes to other muscles with further improved performance.
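Once the fascicle direction and the aponeurosis direction are available as image-plane vectors, the pennation angle is simply the angle between them. The vectors below are illustrative; AW-Net predicts a full fascicle vector field rather than a single direction.

```python
import math

# Pennation angle: the angle between an estimated fascicle direction and
# the aponeurosis direction, both as 2D vectors in the image plane.

def pennation_angle(fascicle, aponeurosis):
    dot = fascicle[0] * aponeurosis[0] + fascicle[1] * aponeurosis[1]
    norm = math.hypot(*fascicle) * math.hypot(*aponeurosis)
    return math.degrees(math.acos(dot / norm))

# A fascicle rising at 45 degrees relative to a horizontal aponeurosis.
angle = pennation_angle((1.0, 1.0), (1.0, 0.0))
```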

Assigning ICD-O-3 codes to pathology reports using neural multi-task training with hierarchical regularization

  • Anthony Rios
  • Eric B. Durbin
  • Isaac Hands
  • Ramakanth Kavuluru

Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of the information is stored as unstructured data in pathology reports. Thus, to process the information, we require either automated extraction techniques or manual curation. Moreover, many of the cancer-related concepts appear infrequently in real-world training datasets, and the limited data makes automated extraction difficult. This study introduces a novel technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to a 4% absolute improvement for histology codes.

Segmenting thoracic cavities with neoplastic lesions: a head-to-head benchmark with fully convolutional neural networks

  • Zhao Li
  • Rongbin Li
  • Kendall J. Kiser
  • Luca Giancardo
  • W. Jim Zheng

Automatic segmentation of thoracic cavity structures in computed tomography (CT) is a key step for applications ranging from radiotherapy planning to imaging biomarker discovery with radiomics approaches. State-of-the-art segmentation can be provided by fully convolutional neural networks such as the U-Net or V-Net. However, there is a very limited body of work comparing the performance of these architectures on chest CTs with significant neoplastic disease. In this work, we compared four different types of fully convolutional architectures using the same pre-processing and post-processing pipelines. These methods were evaluated on a dataset of CT images and thoracic cavity segmentations from 402 cancer patients. We found that these methods achieved very high segmentation performance under three evaluation criteria: Dice coefficient, average symmetric surface distance (ASSD), and 95% Hausdorff distance (HD95). Overall, the two-stage 3D U-Net model performed slightly better than the other models, with Dice coefficients for the left and right lung reaching 0.947 and 0.952, respectively. However, the 3D U-Net model achieved the best performance under HD95 for the right lung and under ASSD for both lungs. These results demonstrate that current state-of-the-art deep learning models can work very well for segmenting not only healthy lungs but also lungs containing cancerous lesions at different stages. The comprehensive lung masks from these evaluated methods enable the creation of imaging-based biomarkers representing both healthy lung parenchyma and neoplastic lesions, allowing the segmented areas to be used in downstream analysis, e.g., treatment planning, prognosis, and survival prediction.
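The Dice coefficient used above measures mask overlap as twice the intersection divided by the total mask sizes. A minimal version on flat binary masks (toy data, not the paper's evaluation code):

```python
# Dice coefficient between two binary segmentation masks, flattened to
# lists for simplicity: 2 * |A intersect B| / (|A| + |B|).

def dice(a, b):
    inter = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * inter / total if total else 1.0

pred  = [1, 1, 0, 0]
truth = [1, 0, 0, 0]
score = dice(pred, truth)   # one overlapping voxel out of three foreground voxels
```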

COVID-19 classification using thermal images: thermal images capability for identifying COVID-19 using traditional machine learning classifiers

  • Martha Rebeca Canales-Fiscal
  • Rocío Ortiz López
  • Regina Barzilay
  • Víctor Treviño
  • Servando Cardona-Huerta
  • Luis Javier Ramírez-Treviño
  • Adam Yala
  • José Tamez-Peña

Medical images have been proposed as a diagnostic tool for SARS-CoV-2. The image modality most investigated for this purpose is computed tomography (CT); however, it has some disadvantages: it uses ionizing radiation, requires dedicated installations along with a complicated process that limits the number of possible tests per machine, and its economic costs can be prohibitively high for screening a large population. For these reasons, the aim of this study is to investigate thermal images as an alternative modality for the diagnosis of COVID-19. The methodology consisted of extracting radiomics and moment features from six images obtained from thermal video clips, to which optical flow and super-resolution were applied; these features were then classified using traditional machine learning methods. Accuracies were in the range of 0.433-0.524. These first results on thermal images suggest that this image modality is unlikely to be favorable for COVID-19 detection.

A CNN-based cell tracking method for multi-slice intravital imaging data

  • Kenji Fujimoto
  • Tsubasa Mizugaki
  • Utkrisht Rajkumar
  • Hironori Shigeta
  • Shigeto Seno
  • Yutaka Uchida
  • Masaru Ishii
  • Vineet Bafna
  • Hideo Matsuda

Cell migration is one of the important criteria for determining the effects of inflammatory and/or chemical stimulation on cells. Accurate detection of cell movement with traditional methods, such as optical flow, is difficult because the cells' fluorescence intensities and shapes are similar to one another. Therefore, we adopt a tracking approach based on a convolutional neural network (CNN) using time-lapse multi-slice images observed with 2-photon excitation microscopy. Existing CNN-based cell tracking methods often focus on tracking targets in 2-dimensional (2D) space, since the costs of computation and annotation increase drastically in 3-dimensional (3D) settings. Those methods usually convert 3D microscopic images to 2D ones via maximum intensity projection (MIP). However, as MIP does not preserve depth information, it is difficult to accurately track depth-directionally overlapping cells in MIP images. To cope with this problem, we propose a novel CNN-based cell tracking method for multi-slice 3D images. Our method trains a CNN using MIP images annotated with cell locations, similarly to the existing methods. In the tracking phase, however, our method estimates not only each cell's location but also its depth, using multiple slices at different depths. Using our method, we track leukocyte migration in multi-slice time-lapse images observed with 2-photon excitation microscopy. As a result, we show that our method outperforms existing cell tracking methods, including the multi-domain network (MDNet) with MIP images.
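The frame-to-frame association step common to tracking pipelines can be sketched as greedy nearest-neighbour matching: link each detection to the closest unclaimed detection in the next frame. Real trackers, including the CNN-based method above, use much richer appearance and depth cues; the coordinates and IDs here are invented.

```python
import math

# Greedy nearest-neighbour association between detections in consecutive
# frames, processing the globally closest pairs first.

def associate(prev, curr, max_dist=5.0):
    """prev, curr: {cell_id: (x, y)}. Returns {prev_id: curr_id}."""
    links, taken = {}, set()
    pairs = sorted((math.dist(p, c), pid, cid)
                   for pid, p in prev.items() for cid, c in curr.items())
    for d, pid, cid in pairs:
        if d <= max_dist and pid not in links and cid not in taken:
            links[pid] = cid
            taken.add(cid)
    return links

prev = {"cellA": (0.0, 0.0), "cellB": (10.0, 0.0)}
curr = {"det1": (1.0, 0.0), "det2": (9.0, 1.0)}
links = associate(prev, curr)
```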

SESSION: Graphs & networks

Session details: Graphs & networks

  • Byung-Jun Yoon

Transfer learning for predicting virus-host protein interactions for novel virus sequences

  • Jack Lanchantin
  • Tom Weingarten
  • Arshdeep Sekhon
  • Clint Miller
  • Yanjun Qi

Viruses such as SARS-CoV-2 infect the human body by forming interactions between virus proteins and human proteins. However, experimental methods to find protein interactions are inadequate: large scale experiments are noisy, and small scale experiments are slow and expensive. Inspired by the recent successes of deep neural networks, we hypothesize that deep learning methods are well-positioned to aid and augment biological experiments, hoping to help identify more accurate virus-host protein interaction maps. Moreover, computational methods can quickly adapt to predict how virus mutations change protein interactions with the host proteins.

We propose DeepVHPPI, a novel deep learning framework combining a self-attention-based transformer architecture and a transfer learning training strategy to predict interactions between human proteins and virus proteins that have novel sequence patterns. We show that our approach outperforms the state-of-the-art methods significantly in predicting Virus-Human protein interactions for SARS-CoV-2, H1N1, and Ebola. In addition, we demonstrate how our framework can be used to predict and interpret the interactions of mutated SARS-CoV-2 Spike protein sequences.

Availability: We make all of our data and code available on GitHub

GNNfam: utilizing sparsity in protein family predictions using graph neural networks

  • Anuj Godase
  • Md. Khaledur Rahman
  • Ariful Azad

We present GNNfam, a pipeline for predicting protein families from protein sequences. GNNfam aligns proteins using pairwise sequence aligner LAST, creates a sparse graph based on the alignment scores, and employs graph neural networks (GNNs) to predict protein families. Unlike alignment-free deep learning methods such as DeepFam, GNNfam can control the sparsity of the protein similarity graph to prune uninformative edges. We develop three pruning strategies to improve the prediction accuracy, convergence, and running time of the downstream graph neural networks. We also demonstrate that semi-supervised GNNs outperform traditional graph clustering-based methods by a large margin. When trained with three labeled sequence datasets from the SCOPe and COG databases, GNNfam achieves more than 90% test accuracy when predicting protein families and performs significantly better than clustering, embedding and other deep learning methods. GNNfam is available at
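One simple pruning strategy for a similarity graph is to keep, for each protein, only its strongest alignments. The sketch below is in the spirit of, but not identical to, GNNfam's three strategies; the scores and names are invented.

```python
# Sparsify a weighted similarity graph: keep an edge only if it is among
# either endpoint's top-k scoring edges.

def prune_top_k(edges, k=1):
    """edges: dict {(u, v): score} with u < v. Returns the pruned dict."""
    best = {}
    for (u, v), s in edges.items():
        best.setdefault(u, []).append((s, v))
        best.setdefault(v, []).append((s, u))
    keep = set()
    for node, nbrs in best.items():
        for s, other in sorted(nbrs, reverse=True)[:k]:
            keep.add(tuple(sorted((node, other))))
    return {e: s for e, s in edges.items() if tuple(sorted(e)) in keep}

edges = {("p1", "p2"): 90.0, ("p1", "p3"): 10.0, ("p2", "p3"): 80.0}
pruned = prune_top_k(edges, k=1)
```

The weak p1-p3 edge is dropped because it is neither endpoint's best alignment.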

A multi-resolution graph convolution network for contiguous epitope prediction

  • Lisa Oh
  • Bowen Dai
  • Chris Bailey-Kellogg

Computational methods for predicting binding interfaces between antigens and antibodies (epitopes and paratopes) are faster and cheaper than traditional experimental structure determination methods. A sufficiently reliable computational predictor that could scale to large sets of available antibody sequence data could thus inform and expedite many biomedical pursuits, such as better understanding immune responses to vaccination and natural infection and developing better drugs and vaccines. However, current state-of-the-art predictors produce discontiguous predictions, e.g., predicting the epitope in many different spots on an antigen, even though in reality an epitope typically comprises a single localized region. We seek to produce contiguous predicted epitopes, accounting for long-range spatial relationships between residues. We therefore build a novel Graph Convolution Network (GCN) that performs graph convolutions at multiple resolutions so as to represent and constrain long-range spatial dependencies. In evaluation on a standard epitope prediction benchmark, we see a significant boost with the multi-resolution approach compared to a previous state-of-the-art GCN predictor, with half of the test cases increasing in AUC-PR by an average of 0.15 and the other half decreasing by only 0.05. We further introduce a clustering algorithm that takes advantage of the contiguity yielded by our model, grouping the raw predictions into a small set of discrete potential epitopes. We show that within the top 3 clusters, 73% of test cases contain a cluster covering most of the actual epitope, demonstrating the utility of contiguous predictions for guiding experimental methods by yielding a small set of reasonable hypotheses for further investigation.

ShareTrace: an iterative message passing algorithm for efficient and effective disease risk assessment on an interaction graph

  • Erman Ayday
  • Youngjin Yoo
  • Anisa Halimi

We propose a novel privacy-preserving COVID-19 risk assessment algorithm that can make a fundamental contribution to the development of next-generation resilient public health and health care systems. The proposed algorithm, ShareTrace, uses a hyperlocal interaction graph to capture direct and indirect physical interactions among users. By combining user-reported symptoms propagated through the hyperlocal interaction graph via a novel message passing algorithm, ShareTrace is able to pick up early warning signals based on the combination of interactions with others and symptoms. The proposed algorithm is inspired by belief propagation and the iterative decoding of low-density parity-check codes over factor graphs. Our evaluation on synthetic data shows the efficiency and efficacy of the proposed solution.
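The flavor of such message passing can be shown with one damped max-propagation round per iteration: each user's risk becomes the maximum of their own reported risk and an attenuated version of their neighbours' risks. The damping factor and update rule are illustrative, not ShareTrace's exact message semantics.

```python
# Iterative risk propagation on an interaction graph.

def propagate(risk, graph, damping=0.5, rounds=2):
    """risk: {user: score in [0, 1]}; graph: {user: [neighbours]}."""
    risk = dict(risk)
    for _ in range(rounds):
        updated = {}
        for user, nbrs in graph.items():
            incoming = [risk[n] * damping for n in nbrs]
            updated[user] = max([risk[user]] + incoming)
        risk = updated
    return risk

# "a" reports symptoms; risk reaches "c" indirectly through "b".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
risk = propagate({"a": 0.8, "b": 0.0, "c": 0.0}, graph)
```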

Investigating statistical analysis for network motifs

  • Zican Li
  • Wooyoung Kim

Network motifs are frequent and statistically significant subgraph patterns in a network. Their statistical significance is generally determined by the explicit generation of many random graphs followed by subgraph sampling and the computation of a P-value or Z-score, an approach called EXPLICIT. This step absorbs most of the computational time in network motif detection, as typically 1,000 random graphs are generated and analyzed. Here, we investigated a method called DIRECT, which was introduced as an alternative to EXPLICIT to speed up the process by removing the need to explicitly generate random graphs. Although DIRECT's efficiency was described in theory, it had never been applied to network motif detection in practice. Therefore, we investigated, implemented, and applied DIRECT with a different statistical measurement to determine network motifs. Experimental results demonstrate that DIRECT is a good alternative to EXPLICIT: it is much faster at detecting small network motifs, and its results are generally consistent with those of EXPLICIT.
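The EXPLICIT route boils down to comparing a subgraph's observed frequency against its frequency distribution over an ensemble of random graphs. A minimal Z-score computation (the counts below are invented; a real run would use ~1,000 randomized networks):

```python
import statistics

# Z-score of an observed subgraph count against counts from random graphs.

def motif_zscore(observed, random_counts):
    mu = statistics.mean(random_counts)
    sd = statistics.pstdev(random_counts)
    return (observed - mu) / sd

# The candidate subgraph appears 50 times in the real network but only
# ~10 times in each randomized network, so it is a strong motif candidate.
z = motif_zscore(50, [10, 12, 8, 11, 9])
```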


SESSION: COVID-19

Session details: COVID-19

  • Mohd Anwar

Temporal analysis of social determinants associated with COVID-19 mortality

  • Shayom Debopadhaya
  • John S. Erickson
  • Kristin P. Bennett

This study examines how the social determinants associated with COVID-19 mortality change over time. Using US county-level data from July 5 and December 28, 2020, the effect of 19 high-risk factors on the COVID-19 mortality rate was quantified at each time point with negative binomial mixed models. These high-risk factors were then used as controls in two association studies between 40 social determinants and COVID-19 mortality rates using data from the same time points. The results indicate that counties with certain ethnic minorities and age groups, immigrants, a higher prevalence of diseases such as pediatric asthma, diabetes, and cardiovascular disease, socioeconomic inequalities, and higher social association are associated with increased COVID-19 mortality rates. Meanwhile, more mental health providers, access to exercise, higher income, chronic lung disease in adults, suicide, and excessive drinking are associated with decreased mortality. Our temporal analysis also reveals a possibly decreasing impact of socioeconomic disadvantage and air quality and an increasing effect of factors such as age, which suggests either that public health policies have been effective in protecting disadvantaged populations over time or that analyses using earlier data exaggerated certain effects. Overall, we continue to recognize that social inequality still places disadvantaged groups at risk, and we identify possible relationships between lung disease, mental health, and COVID-19 that need to be explored on a clinical level.

COVID-19 diagnosis using model agnostic meta-learning on limited chest X-ray images

  • Tarun Naren
  • Yuanda Zhu
  • May Dongmei Wang

In the past year, detection of coronavirus infection has proven to be a challenging task. The gold standard for detection, real-time reverse transcription polymerase chain reaction (RT-PCR) testing, has several shortcomings, including high false negative rates, long turnaround times, and limited availability. Applying machine learning for automatic analysis of chest X-rays can overcome these issues, but the limited amount of training data inhibits the development of robust deep neural networks. In this paper, we demonstrate the feasibility of few-shot learning for classifying COVID-19 chest X-rays using a Model-Agnostic Meta-Learning (MAML) algorithm. We compare the improved variant of MAML, named MAML++, to other state-of-the-art machine learning strategies and demonstrate its robust and superior classification accuracy. In addition, we explore the effect of the number of images made available to the sub-learners used for training MAML++ and show that increasing the number of images leads to diminishing returns in performance. Lastly, we compare MAML++ to the original MAML algorithm and discuss the shortcomings of MAML-based algorithms in classification problems.

Surveillance of COVID-19 pandemic using social media: a reddit study in North Carolina

  • Christopher Whitfield
  • Yang Liu
  • Mohd Anwar

The coronavirus disease (COVID-19) pandemic has changed various aspects of people's lives and behaviors. At this stage, there is no way to control the natural progression of the disease other than adopting mitigation strategies such as wearing masks, keeping distance, and washing hands. Moreover, in this time of social distancing, social media plays a key role in connecting people and providing a platform for them to express their feelings. In this study, we tap into social media to surveil the uptake of mitigation and detection strategies and to capture issues and concerns about the pandemic. In particular, we explore the research question, "How much can be learned about the public uptake of mitigation strategies and concerns about the COVID-19 pandemic by applying natural language processing to Reddit posts?" After extracting COVID-related posts from the four largest subreddit communities of North Carolina over six months, we performed NLP-based preprocessing to clean the noisy data. We employed a custom named-entity recognition (NER) system and a Latent Dirichlet Allocation (LDA) method for topic modeling on the Reddit corpus. We observed that mask, flu, and testing are the most prevalent named entities in the "Personal Protective Equipment", "symptoms", and "testing" categories, respectively. We also observed that the most discussed topics relate to testing, masks, and employment, and that mitigation measures are the most prevalent theme of discussion across all subreddits.

A multi-instance support vector machine with incomplete data for clinical outcome prediction of COVID-19

  • Lodewijk Brand
  • Lauren Zoe Baker
  • Hua Wang

In order to manage the public health crisis associated with COVID-19, it is critically important that healthcare workers can quickly identify high-risk patients in order to provide effective treatment with limited resources. Statistical learning tools have the potential to help predict serious infection early in the progression of the disease. However, many of these techniques cannot take full advantage of temporal data on a per-patient basis because they treat the problem as single-instance classification, and they rely on complete data to make their predictions. In this work, we present a novel approach that handles the temporal and missing-data problems simultaneously; our proposed Simultaneous Imputation-Multi Instance Support Vector Machine illustrates how multiple-instance learning and low-rank data imputation can be combined to accurately predict clinical outcomes of COVID-19 patients. We compare our approach against recent methods on a public dataset with a cohort of 361 COVID-19-positive patients. In addition to improved prediction performance early in the progression of the disease, our method identifies a collection of biomarkers associated with the liver, immune system, and blood that deserve additional study and may provide insight into causes of patient mortality due to COVID-19. We publish the source code for our method online.

SESSION: Clinical trials & outcome prediction

Session details: Clinical trials & outcome prediction

  • Kaiman Zeng

Synthesized difference in differences

  • Eric V. Strobl
  • Thomas A. Lasko

We consider estimating the conditional average treatment effect for everyone by eliminating confounding and selection bias. Unfortunately, randomized clinical trials (RCTs) eliminate confounding but impose strict exclusion criteria that prevent sampling of the entire clinical population. Observational datasets are more inclusive but suffer from confounding. We therefore analyze RCT and observational data simultaneously in order to extract the strengths of each. Our solution builds upon Difference in Differences (DD), an algorithm that eliminates confounding from observational data by comparing outcomes before and after treatment administration. DD requires a parallel-slopes assumption that may not hold in practice when confounding shifts across time. We instead propose Synthesized Difference in Differences (SDD), which infers the correct (possibly non-parallel) slopes by linearly adjusting a conditional version of DD using additional RCT data. The algorithm achieves state-of-the-art performance across multiple synthetic and real datasets, even when the RCT excludes the majority of patients.
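The classic Difference in Differences estimator that SDD builds on can be written in a few lines; the outcome values below are hypothetical, purely to show the arithmetic.

```python
def did_estimate(y_treat_pre, y_treat_post, y_ctrl_pre, y_ctrl_post):
    """Classic DD: the treated group's pre/post change minus the control
    group's change cancels time-invariant confounding, but only under the
    parallel-slopes assumption that SDD is designed to relax."""
    return (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)

# hypothetical mean outcomes: treated 5.0 -> 9.0, control 5.5 -> 7.5
effect = did_estimate(5.0, 9.0, 5.5, 7.5)  # -> 2.0
```

When the confounders shift the two groups' slopes apart over time, this subtraction no longer recovers the true effect, which is the failure mode SDD corrects with RCT data.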

Match2: hybrid self-organizing map and deep learning strategies for treatment effect estimation

  • Xiao Shou
  • Tian Gao
  • Dharmashankar Subramanian
  • Kristin P. Bennett

Estimating treatment effects from observational data through covariate matching remains an active research area in causal inference. Although existing methods may provide accurate results on simulated datasets, knowing how to tune the parameters to accurately estimate treatments in practice can be a challenge, since the ground truth is not known. We provide an explainable hybrid neural network and self-organizing map (SOM) approach, Match2. Using a supervised learning paradigm, our method simultaneously learns a meaningful latent representation with respect to treatment assignment and a nonlinear neighborhood preserving mapping via SOM in the latent space. To select the appropriate latent dimension, we propose a data-driven strategy based on the minimum validation loss for the treatment classification subproblem. Unlike other matching methods, the hybrid SOM-neural network can be used as the basis for visualizing and quantifying the quality of the matches. The user can understand the quality of the matches to provide confidence in the results and detect any potential problems. We design a novel metric to examine the overall quality of matching along with the visualization. We demonstrate strong performance on four benchmark datasets compared to non-neural-network baselines. Integrating a SOM component may potentially benefit other state-of-the-art neural network models for causal effect estimation by gaining interpretability while retaining prediction/estimation accuracy.

CytoSet: predicting clinical outcomes via set-modeling of cytometry data

  • Haidong Yi
  • Natalie Stanley

Single-cell flow and mass cytometry technologies are being increasingly applied in clinical settings, as they enable the simultaneous measurement of multiple proteins across millions of cells within a multi-patient cohort. In this work, we introduce CytoSet, a deep learning model that can directly predict a patient's clinical outcome from a collection of cells obtained through a blood or tissue sample. Unlike previous work, CytoSet explicitly models the cells profiled in each patient sample as a set, allowing for the use of recently developed permutation-invariant architectures. We show that CytoSet achieves state-of-the-art classification performance across a variety of flow and mass cytometry benchmark datasets. The strong classification performance is further complemented by demonstrated robustness to the number of sub-sampled cells per patient and to the depth of the model, enabling CytoSet to scale adequately to hundreds of patient samples. The strong performance achieved by the set-based architectures used in CytoSet suggests that clinical cytometry data can be appropriately interpreted and studied as sets. The code is publicly available at
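A permutation-invariant architecture of the kind the abstract refers to can be sketched in the Deep Sets style below; the layer sizes, ReLU embedding, and sum pooling are illustrative assumptions, not CytoSet's exact design.

```python
import numpy as np

def set_score(cells, W_phi, W_rho):
    """Embed each cell independently, pool with a symmetric (sum) operation,
    then score the pooled sample-level representation. The sum makes the
    output invariant to the order of the cells in the sample."""
    h = np.maximum(cells @ W_phi, 0.0)  # per-cell embedding with ReLU
    pooled = h.sum(axis=0)              # permutation-invariant pooling
    return float(pooled @ W_rho)        # sample-level logit

rng = np.random.default_rng(0)
cells = rng.normal(size=(100, 5))       # 100 cells x 5 protein markers
W_phi = rng.normal(size=(5, 8))
W_rho = rng.normal(size=8)
s1 = set_score(cells, W_phi, W_rho)
s2 = set_score(cells[::-1], W_phi, W_rho)  # same cells, shuffled order
```

Because the pooling step is symmetric, shuffling the cells in a sample leaves the prediction unchanged, which is what justifies treating each patient sample as a set.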

Towards an extensible ontology for streaming sensor data for clinical trials

  • Robert Lyons
  • Geoffrey Ross Low
  • Clare Bates Congdon
  • Melissa Ceruolo
  • Marissa Ballesteros
  • Steven Cambria
  • Paolo DePetrillo

The use of wearable sensors for clinical trials can lead to better data collection and a better patient experience during trials, and can further allow more patients to participate in trials by allowing more remote monitoring and fewer site visits. However, extracting maximum value from the data collected via streaming sensors presents some specific technical challenges, including processing the data in real time, and storing the sensor data in a representation that facilitates the use of biomarker algorithms that can be used and reused with different similar sensors, at different scales, and across different clinical trials. Here we present our initial work on SORBET, a Sensor Ontology for Reusable Biometric Expressions and Transformations. Our design strategy is presented, along with the initial design and examples. While this ontology has been created for the Medidata Sensor Cloud product, it is our hope that others working in this space will join us in extending and hardening this ontology, as we expand it to incorporate more sensors and more needs for clinical trials research.

Transformer-based named entity recognition for parsing clinical trial eligibility criteria

  • Shubo Tian
  • Arslan Erdengasileng
  • Xi Yang
  • Yi Guo
  • Yonghui Wu
  • Jinfeng Zhang
  • Jiang Bian
  • Zhe He

The rapid adoption of electronic health record (EHR) systems has made clinical data available in electronic format for research and many downstream applications. Electronic screening of potentially eligible patients using these clinical databases is a critical need for improving trial recruitment efficiency. Nevertheless, manually translating free-text eligibility criteria into database queries is labor-intensive and inefficient. To facilitate automated screening, free-text eligibility criteria must be structured and coded into a computable format using controlled vocabularies, and named entity recognition (NER) is thus an important first step. In this study, we evaluate four state-of-the-art transformer-based NER models on two publicly available annotated corpora of eligibility criteria released by Columbia University (the Chia data) and Facebook Research (the FRD data). Four transformer-based models (BERT, ALBERT, RoBERTa, and ELECTRA) pretrained on general English corpora were compared with variants pretrained on PubMed citations, clinical notes from the MIMIC-III dataset, and eligibility criteria extracted from all registered clinical trials. Experimental results show that RoBERTa pretrained on MIMIC-III clinical notes and eligibility criteria yielded the highest strict and relaxed F-scores on both the Chia data (0.658/0.798) and the FRD data (0.785/0.916). With these promising NER results, further investigation into building a reliable natural language processing (NLP)-assisted pipeline for automated electronic screening is needed.


SESSION: Cancer

Session details: Cancer

  • Oznur Tastan

Cancer molecular subtype classification by graph convolutional networks on multi-omics data

  • Bingjun Li
  • Tianyu Wang
  • Sheida Nabavi

Cancer has been the second leading cause of death in the United States for decades, and an accurate classification of cancers' molecular profiles is a key predictor of patients' survival. Recently, The Cancer Genome Atlas research network has identified a new cancer taxonomy based on molecular tumor subtypes across 33 types of cancer. Several studies have reported classification models for traditional tissue-of-origin cancer type classification or for classifying subtypes within a cancer type. In this study, we propose a novel end-to-end deep learning model that incorporates prior biological knowledge and integrates multi-omics data to classify pan-cancer molecular subtypes. Our proposed model consists of three sections: i) a graph convolutional network that takes a gene interaction network, representing prior knowledge, as its input graph, where genes are nodes and multi-omics data are the node features, to extract localized features; ii) a fully connected neural network that extracts global features from the data; and iii) a classification layer that takes the combination of localized and global features as input. We examined building the input graph from gene-gene interaction networks, protein-protein interaction networks, and gene co-expression networks. We also investigated the effect of input graph size (number of genes/nodes) on the performance of the model. We evaluated the proposed model in terms of prediction accuracy, precision, recall, and F1 score, and compared it with three state-of-the-art deep learning models and two conventional machine learning models. The results show that the proposed model outperforms the baseline models at every number of genes. Our model achieves not only better prediction accuracy but also a lower false-negative rate, which is important for cancer patients' treatment. Our model also shows the benefit of employing multi-omics data compared with single-omic data.
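The graph-convolution component in part i) can be sketched as a single Kipf-and-Welling-style layer over a gene interaction graph; the symmetric normalization and ReLU here are standard choices assumed for illustration and may differ from the paper's exact architecture.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: A is the gene-gene adjacency matrix,
    H holds per-gene multi-omics node features, W is a learned weight
    matrix. Each gene aggregates features from its interaction neighbors."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)  # localized features, ReLU

# toy graph: 3 genes, 4 omics features per gene, 2 output channels
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.ones((3, 4))
W = np.ones((4, 2))
out = gcn_layer(A, H, W)  # one (genes x channels) feature map
```

Stacking such layers is what lets the model extract "localized" features along the interaction network before they are merged with the fully connected branch's global features.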

Deep neural network models to automate incident triage in the radiation oncology incident learning system

  • Priyankar Bose
  • William C. Sleeman
  • Khajamoinuddin Syed
  • Michael Hagan
  • Jatinder Palta
  • Rishabh Kapoor
  • Preetam Ghosh

Radiotherapy treatment for cancer patients involves a complex workflow among radiation physicists, therapists, dosimetrists, physicians, and nurses. Multiple hand-offs between these care team members often lead to errors of varying severity. Such errors are logged in incident reports stored in the Radiation Oncology Incident Learning System. Here, we present an automated incident triage and severity determination pipeline that can predict high- and low-severity incidents. Incident reports are collected from the US Veterans Health Affairs (VHA) and Virginia Commonwealth University (VCU) radiation oncology centers. Natural language processing (NLP) and deep learning (DL) methods, such as CNNs and BiLSTMs, are used to predict severity from the 'Incident Description' information. Other features, such as 'Incident Type', 'Action taken by reporter', and 'Incident discovered at', are used to infer the best-performing model. Random oversampling and minority class oversampling are employed to address the large class imbalance ratios in the data.

We observed that CNN performs best on both the VHA data (0.83 F1-score) and the combined VCU+VHA data (0.83 F1-score), while CNN with minority oversampling performs better on the VCU data (0.60 F1-score) using the 'Incident Description' feature. Different feature combinations suggest that the two-feature model using 'Incident Description' and 'Action taken by reporter' performs better with CNN on both the VHA (0.84 F1-score) and combined VCU+VHA data (0.81 F1-score). Multiple features were considered for the first time, and the two-feature CNN model emerges as the best suited for automating the radiotherapy incident triage and prioritization process.

Extracapsular extension identification for head and neck cancer using multi-scale 3D deep neural network

  • Yibin Wang
  • W. Neil. Duggar
  • Toms V. Thomas
  • P. Russell Roberts
  • Linkan Bian
  • Haifeng Wang

Extracapsular extension (ECE) is a strong predictor of survival outcomes for patients with head and neck squamous cell carcinoma (HNSCC). ECE occurs when metastatic tumor cells within the lymph node break through the nodal capsule into surrounding tissues. It is crucial to identify the occurrence of ECE, as it changes staging and management for the patients. Current clinical ECE detection, which relies on radiologists' visual identification, is extremely labor-intensive, time-consuming, and error-prone, and consequently pathologic confirmation is required. Therefore, we aim to identify ECE automatically by introducing a novel 3D deep neural network (DNN) with multi-scale input that analyzes the presence or absence of ECE and correlates it with gold-standard histopathological findings. Both local and global features are extracted. Experimental tests show that our proposed model is capable of ECE classification, and the test results are enhanced with performance visualization.

SESSION: Ontologies & databases

Session details: Ontologies & databases

  • Fereydoun Hormozdiari

KGDAL: knowledge graph guided double attention LSTM for rolling mortality prediction for AKI-D patients

  • Lucas Jing Liu
  • Victor Ortiz-Soriano
  • Javier A. Neyra
  • Jin Chen

With the rapid accumulation of electronic health record (EHR) data, deep learning (DL) models have exhibited promising performance on patient risk prediction. Recent advances have also demonstrated the effectiveness of knowledge graphs (KGs) in providing valuable prior knowledge for further improving DL model performance. However, it is still unclear how a KG can be used to encode high-order relations among clinical concepts, and how DL models can make full use of the encoded concept relations to solve real-world healthcare problems and interpret the outcomes. We propose a novel knowledge graph guided double attention LSTM model named KGDAL for rolling mortality prediction for critically ill patients with acute kidney injury requiring dialysis (AKI-D). KGDAL constructs a KG-based two-dimensional attention over both the time and feature spaces. In experiments with two large healthcare datasets, we compared KGDAL with a variety of rolling mortality prediction models and conducted an ablation study to test the effectiveness, efficacy, and contribution of the different attention mechanisms. The results showed that KGDAL clearly outperformed all the compared models. KGDAL-derived patient risk trajectories may also assist healthcare providers in making timely decisions and taking action. The source code, sample data, and manual of KGDAL are available at

Low resource recognition and linking of biomedical concepts from a large ontology

  • Sunil Mohan
  • Rico Angell
  • Nicholas Monath
  • Andrew McCallum

Tools to explore the scientific literature are essential for scientists, especially in biomedicine, where about a million new papers are published every year. Many such tools let users search for specific entities (e.g., proteins, diseases) by tracking their mentions in papers. PubMed, the best-known database of biomedical papers, relies on human curators to add these annotations. This can take several weeks for new papers, and not all papers get tagged. Machine learning models have been developed to facilitate the semantic indexing of scientific papers, but their performance on the more comprehensive ontologies of biomedical concepts does not reach the levels of typical entity recognition problems studied in NLP. In large part this is due to the low-resource setting: the ontologies are large, descriptive text defining most entities is lacking, and labeled data can cover only a small portion of the ontology. In this paper, we develop a new model that overcomes these challenges by (1) generalizing to entities unseen at training time, and (2) incorporating linking predictions into the mention segmentation decisions. Our approach achieves new state-of-the-art results for the UMLS ontology in both traditional recognition/linking (+8 F1 pts) and semantic indexing-based evaluation (+10 F1 pts).

Joint learning for biomedical NER and entity normalization: encoding schemes, counterfactual examples, and zero-shot evaluation

  • Jiho Noh
  • Ramakanth Kavuluru

Named entity recognition (NER) and entity normalization (EN) form an indispensable first step in many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying the more informative relations among them that drive disease etiology, progression, and treatment. In this effort we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second is to use additional counterfactual training examples to address models learning spurious correlations between surrounding context and normalized concepts in the training data. We conduct elaborate experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that our first strategy outperforms the standard coding scheme on entity normalization. The second, data augmentation strategy uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are most prominent when evaluating in zero-shot settings, on concepts never encountered during training.
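The first strategy, decoupling encoding tags, amounts to splitting each BIO-style label into its positional and type parts, roughly as follows (a minimal sketch of the idea, not the authors' exact code):

```python
def decouple_tag(tag):
    """Split a standard encoding tag such as 'B-Drug' into a positional tag
    ('B') and a type tag ('Drug'); the outside tag 'O' carries no type."""
    if tag == "O":
        return "O", None
    pos, etype = tag.split("-", 1)
    return pos, etype

pairs = [decouple_tag(t) for t in ["B-Drug", "I-Drug", "O"]]
```

Predicting the two parts separately lets the model share positional knowledge (where spans start and continue) across all entity types instead of learning it once per type.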

HYPON: embedding biomedical ontology with entity sets

  • Zhuoyan Li
  • Sheng Wang

Constructing high-quality biomedical ontologies is one of the first steps in studying new concepts, such as emerging infectious diseases. Manually curated ontologies are often noisy, especially for new knowledge that requires domain expertise. In this paper, we propose a novel ontology embedding approach, HYPON, to automate this process. In contrast to conventional approaches, we embed biomedical ontologies in hyperbolic space to better model their hierarchical structure. Importantly, our method considers both the graph structure and the varied-size sets of entities, which are largely overlooked by existing methods. We demonstrate substantial improvement over thirteen competing approaches on eleven biomedical ontologies, including two recently curated COVID-19 ontologies.

BioNumQA-BERT: answering biomedical questions using numerical facts with a deep language representation model

  • Ye Wu
  • Hing-Fung Ting
  • Tak-Wah Lam
  • Ruibang Luo

Biomedical question answering (QA) is playing an increasingly significant role in medical knowledge translation. However, current biomedical QA datasets and methods have limited capacity, as they commonly neglect the role of numerical facts. In this paper, we construct BioNumQA, a novel biomedical QA dataset in which research questions are answered using relevant numerical facts, for biomedical QA model training and testing. To leverage the new dataset, we design a new method, BioNumQA-BERT, that introduces a novel numerical encoding scheme into the popular biomedical language model BioBERT to represent the numerical values in the input text. Our experiments show that BioNumQA-BERT significantly outperforms other state-of-the-art models, including DrQA, BERT, and BioBERT (39.0% vs. 29.5%, 31.3%, and 33.2%, respectively, in strict accuracy). To improve the generalization ability of BioNumQA-BERT, we further pretrained it on a large biomedical text corpus, achieving 41.5% strict accuracy. BioNumQA and BioNumQA-BERT establish a new baseline for biomedical QA. The dataset, source code, and pretrained model of BioNumQA-BERT are available at
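The abstract does not specify the numerical encoding scheme, but one common approach it could resemble is normalizing numeric tokens into a fixed scientific-notation form so mantissa and magnitude become explicit to the language model; the sketch below is purely a hypothetical illustration, not BioNumQA-BERT's actual scheme.

```python
def encode_numeric_token(tok):
    """Map a numeric token to a normalized scientific-notation string;
    non-numeric tokens pass through unchanged. (Hypothetical encoding,
    for illustration only.)"""
    try:
        value = float(tok)
    except ValueError:
        return tok
    return f"{value:.2e}"

encoded = [encode_numeric_token(t) for t in ["dose", "1250", "0.5"]]
```

The point of any such scheme is that "1250" and "1,250 mg" no longer look like unrelated subword sequences: their magnitudes are directly comparable in the encoded text.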

POSTER SESSION: BCB conference poster presentations

LDEncoder: reference deep learning-based feature detector for transfer learning in the field of epigenomics

  • Gun Woo (Warren) Park
  • Kevin Bryson

We propose a reference feature extractor that can be used for methylation data and potentially other epigenomic data sources. It can be used in a trans-omics manner to bridge epigenomics and transcriptomics: by maintaining an internal latent space, it can solve classification and regression problems across omics. DNA methylation, an epigenomic mark that is altered by external factors including changes in the environment, plays multiple roles, including the regulation of gene expression. The goal of the reference feature extractor is to extract important features from DNA methylation data while encoding them in a low-dimensional feature space. To achieve this, a pan-cancer dataset with a wide variety of data was used to train the model. Thanks to the low-dimensional encoding, downstream tasks can be solved while utilising significantly fewer parameters. The current state-of-the-art can work in a trans-omics setting but has not generalised to other settings [1--3]; for example, TDImpute [4] needed an extra decision-making model to complete the classification task and did not utilise the latent feature representation inferred inside the model. The multi-layer perceptron used in this approach, called LDEncoder, has a low encoding dimension (512) that represents the high-dimensional DNA methylation data in a significantly lower-dimensional feature space, so a new classification or regression problem can be solved by transfer learning from this 512-dimensional input. This significantly reduces the time and computational resources needed. In effect, transforming DNA methylation data to gene expression data (RNA-seq) through a bottleneck enables the lower-dimensional encoding of the data.
We also evaluated the performance of various models and techniques inspired by successful ones in computer vision, including model parameter savers based on the best validation loss and CpG site sorting, and found promising results, as shown in Table 1. We further evaluate the generalisability of the model through cancer/non-cancer prediction and breast cancer molecular subtype prediction.

Sequence model evaluation framework for STARR-seq peak calling

  • Christopher R. Beal
  • John G. Peters
  • Ronald J. Nowling

Enhancers are short regions of non-coding DNA that increase transcription rates of genes despite being located distantly from the genes themselves [5]. Enhancers are identified through experimental techniques such as ChIP-Seq or CUT&RUN with H3K4me1 and H3K27ac histone modifications, self-transcribing active regulatory region sequencing (STARR-Seq), and massively parallel reporter assays (MPRA). Machine learning models have been used in conjunction with experimental data to identify enhancer activity from sequences [3], predict enhancer-transcription factor interactions [4], and decode the enhancer regulatory language [2].

We describe a framework that connects peak calling errors to the prediction accuracy of sequence models. The key assumptions of our framework are that (1) enhancers have consistent sequence patterns that can be used to separate enhancers from control sequences, (2) errors in the training data impact prediction accuracies in predictable ways, and (3) prediction accuracy is a useful proxy for evaluating peak calling accuracy. In the framework, data sets are constructed from peak (positive) and randomly sampled (control) sequences. Machine learning models are trained and evaluated on the sequences in a cross-chromosome (cross-fold) setup. Lastly, the precision of the originating peaks is evaluated by calculating true and false positive rates.
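The last step of the framework, scoring the originating peaks, reduces to computing true and false positive rates from model predictions on peak and control sequences; a minimal sketch with toy labels (not data from the study):

```python
def tpr_fpr(y_true, y_pred):
    """y_true: 1 for peak (positive) sequences, 0 for random controls;
    y_pred: the sequence model's binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

tpr, fpr = tpr_fpr([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

Under assumption (3), a peak set that yields a high TPR and low FPR for the trained model is taken as evidence that the peak caller's parameters were well chosen.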

We applied our framework to evaluate peaks for D. melanogaster STARR-Seq data [1] called with the MACS software [6]. Although designed for ChIP-Seq data, MACS can be used to process other types of data, but users must be careful about parameter choices. We evaluated different parameter combinations with our framework and visual comparisons of called peaks. True and false positive rates ranged from a high of 88.0% to a low of 74.7% and from a low of 18.6% to a high of 49.4%, respectively. The default MACS parameters produced the highest true and lowest false positive rates, suggesting that the default parameters are also suitable for STARR-Seq data. Our results demonstrate the utility of our framework through a practical application and provide a base for future development.

Developing a modified version of generative adversarial network to predict the potential anti-viral drug of COVID-19

  • Md. Sadek Hossain Asif

Coronavirus disease (COVID-19), caused by a new strain of the SARS-CoV family discovered in 2019, transmits through droplets generated by an infected person. The virus's production rate on the endoplasmic reticulum leads to apoptosis of the human cell. Several attempts to create an anti-viral drug to combat the virus, such as chloroquine and remdesivir [1], are ongoing. A potential cure is an inhibitor (antibody) that attaches itself to the virus and prevents it from binding to the human cell's receptors, thereby preventing it from spreading. In this work, the author first develops a set of candidate drugs using deep reinforcement learning and then, using the coronavirus as a target, docks each candidate drug against it to determine the one with the best binding affinity. The drug with the best binding affinity is a potential cure for the virus.

Using electronic health records to accurately predict COVID-19 health outcomes through a novel machine learning pipeline

  • Alice Feng

Current COVID-19 predictive models primarily focus on predicting the risk of mortality and rely on COVID-19 specific medical data, such as chest imaging, obtained after COVID-19 diagnosis. In this project, we developed an innovative supervised machine learning pipeline using longitudinal Electronic Health Records (EHR) to accurately predict COVID-19 related health outcomes, including mortality, ventilation, and days in hospital or ICU. In particular, we developed unique and effective data processing algorithms, including data cleaning, initial feature screening, vector representation, and feature normalization. Then we trained models using state-of-the-art machine learning strategies combined with different parameter settings and feature selection. Based on routinely collected EHR, our machine learning pipeline not only consistently outperformed those developed by other research groups using the same data set, but also achieved mortality prediction accuracy similar to models trained on medical data available only after COVID-19 diagnosis. In addition, we identified top COVID-19 risk factors, which are consistent with epidemiologic findings.

Hybridized distance- and contact- based hierarchical protein structure modeling using DConStruct

  • Rahmatullah Roche
  • Sutanu Bhattacharya
  • Debswapna Bhattacharya

Crystallography and NMR system (CNS) is a widely used method for predicting the 3D structures of proteins from inter-residue distance or contact maps. However, the decade-old CNS is an experimental structure determination method that was originally developed for solving macromolecular geometry from experimental restraints, as opposed to predictive structure modeling. Thus, relying on CNS for structure modeling may undermine ab initio folding performance. Here we propose a CNS-free protein structure modeling method called DConStruct [1], which performs 3-stage hierarchical predictive modeling with iterative self-correction driven purely by the geometric restraints induced by inter-residue interactions and secondary structures. Starting from a residue-residue interaction map and secondary structure, DConStruct hierarchically estimates the correct overall fold of a target protein in coarse-grained mode, progressively optimizing local and non-local interactions while enhancing the secondary structure topology in a self-correcting manner. Multiple large-scale benchmarking experiments show that our proposed method substantially improves folding accuracy for both soluble and membrane proteins compared to state-of-the-art approaches. The open-source DConStruct software package, licensed under the GNU General Public License v3, is freely available at

Do microscopy imaging frequency and experiment duration impact the analysis of T cell movement?

  • Viktor Zenkov
  • James O'Connor
  • Hayley McNamara
  • Ian Cockburn
  • Vitaly Ganusov

For years we have investigated whether vaccine-induced CD8 T cells in the liver hunt Plasmodium sporozoites (malaria) or find infected hepatocytes randomly. Using previous T cell position data collected with intravital microscopy, we performed numerous analyses, yielding two main conclusions. Firstly, using a new metric for detecting attraction based on the von Mises-Fisher distribution in 3D, together with statistical analyses, we concluded that cells move randomly until a cluster of cells forms around a parasite; then some cells begin to move with bias. Secondly, using simulations in which we controlled the amount of attraction but otherwise replicated the movement of real cells, we concluded that cells are able to move with enough attraction to find parasites, but not enough to have statistically detectable attraction in experimental data. We have thus constructed methods to test how cells move and, moreover, determined the strength and the limitations of our metrics.
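
A minimal sketch of a directional-bias test in this spirit: unit step directions are summarized by their mean resultant length, and a Rayleigh-type statistic (approximately chi-squared with 3 degrees of freedom under 3D uniformity) separates random from biased movement. The mixing scheme used to generate biased directions is a crude stand-in for sampling from a von Mises-Fisher distribution, not the authors' method.

```python
import math
import random

random.seed(1)

def random_unit():
    # Uniform direction on the sphere via a normalized Gaussian triple.
    v = [random.gauss(0, 1) for _ in range(3)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def biased_unit(target=(0.0, 0.0, 1.0), bias=0.6):
    # Mix a uniform direction with the target direction and renormalize --
    # an illustrative stand-in for von Mises-Fisher sampling.
    v = random_unit()
    w = [(1 - bias) * v[i] + bias * target[i] for i in range(3)]
    n = math.sqrt(sum(x * x for x in w))
    return [x / n for x in w]

def mean_resultant_length(vectors):
    s = [sum(v[i] for v in vectors) for i in range(3)]
    return math.sqrt(sum(x * x for x in s)) / len(vectors)

def rayleigh_statistic(vectors):
    # 3 * n * Rbar^2 is approximately chi-squared (3 df) under 3D uniformity.
    return 3 * len(vectors) * mean_resultant_length(vectors) ** 2

n = 200
uniform_steps = [random_unit() for _ in range(n)]   # random movement
biased_steps = [biased_unit() for _ in range(n)]    # movement toward a "parasite"
stat_uniform = rayleigh_statistic(uniform_steps)
stat_biased = rayleigh_statistic(biased_steps)
```

A large statistic relative to the chi-squared(3) reference indicates directional bias; for randomly moving cells it stays near its null expectation of 3.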

New opportunities arose in the past year as we received new position data, which differ from the older data in critical ways. The new data have more frequent imaging (every 12--20 seconds; old data: 90--120 seconds) and shorter experiments (30 minutes in total length; old data: 120 minutes). The different experiment parameters also affect some properties of the recorded data, including the calculated speeds, which we showed to be artificially low in the old data due to mathematical properties of the calculation.

We repeated our work using the new data. Repeating the statistical analyses, we found that cells still move randomly until a cluster of cells forms around a parasite, then begin to move with bias. Repeating the simulations, now using parameters replicating the movement of the new cells, we found that cells are still able to move with enough attraction to find parasites, but not enough to have statistically detectable attraction using current methodology.

In summary, despite differing parameters such as imaging frequencies and experiment durations, we make the same conclusions about T cell movement. This serves as both a biological result and a commentary on the potential effects of differing experimental parameters.

Explaining large-for-gestational-age births: a random forest classifier with a novel local interpretation method

  • Yuhan Du
  • Anthony R Rafferty
  • Fionnuala M McAuliffe
  • Catherine Mooney

We proposed a novel local interpretation method for a random forest classifier based on feature occurrence frequency in trees that give the same prediction as the random forest classifier. The method shows promising results when applied to our random forest classifier for large-for-gestational-age births. Further validation of the method is required.

Browsing weighted interactome models using GeneDive

  • Mike Wong
  • Nayana Laxmeshwar
  • Rachit Joshi
  • Anagha Kulkarni

Interactome models are key to understanding metabolic functions and mechanisms of medical interventions. There are two problems that need to be solved to deliver and use high-quality interactome models: (1) network curation at scale; and (2) bringing large-scale interactomes to human-scale understanding. To address these problems we present GeneDive, a computational systems biology interactome browser that makes large gene, disease, and drug interaction networks easily accessible and usable. GeneDive can be accessed at:

Expert curation of interaction networks is considered the gold standard for quality, but is limited in scope due to prohibitive costs. Automated molecular interaction network labeling is relatively inexpensive and can perform at scale, but lacks the quality of manual curation. Hybrid approaches that leverage the scalability of automated methods and the quality of expert curation have the potential to substantially expand our understanding of interactome models. The GeneDive web application is designed to facilitate such hybrid curation efforts.

Machine learning (ML) based approaches have been shown to assign high confidence scores to interactions with strong evidence in the scientific literature, while assigning lower scores to interactions with less evidence. Subnetworks containing only high confidence scores have been shown to match strongly with expert curated networks. This suggests that interactions with mid-ranged scores may be yet undiscovered or not as well-researched. Edges in interaction networks can be weighted by these confidence scores to represent interactions found in the literature and the abundance of relevant support from said literature. A weighted interactome browser, like GeneDive, can recall such understudied subnetworks and their surrounding context to prioritize curation efforts. Once these subnetworks are curated, a positive feedback loop can be applied: (1) the ML algorithm can be retrained with the latest curations, improving accuracy; and (2) the curation effort can identify the next "low-hanging fruit" discoveries within the browser. Iterative application of this process can rapidly grow the number of high-quality interactome models available to the biomedical community.

In addition to improving the underlying data quality and scale for weighted interactome models, GeneDive also helps biomedical researchers use said models for biomedical discoveries. In previously published work, we outlined eight use cases where GeneDive could be used to help clinicians and researchers gain a better understanding through search and visualization of the interactome model.

GeneDive provides the following key features to facilitate discoveries in weighted interactome browsing: multiple search modalities, Cytoscape network visualization, a tabular view of interactions with supporting evidence and confidence scores, highlighting, free-text filtering on all fields, and filtering by confidence score. The latest improvements to the GeneDive web application include the ability to import user-provided data using an extensible semi-structured schema (previously, only fixed-schema user-provided data could be imported) and the ability to add the extensible fields to the tabular view via a plugin architecture. Multiple datasets can be consolidated without compromising data ownership and privacy. GeneDive promotes longitudinal and exploratory investigations that can be reproduced to enable branched hypothesis testing.

Scalable non-invasive pediatric cerebral visual impairment screening with the higher visual function question inventory (HVFQI)

  • Mike Wong
  • Saeideh Ghahghaei
  • Arvind Chandna
  • Anagha Kulkarni

Cerebral Visual Impairment (CVI), vision loss due to brain injury in early childhood, is the leading cause (approximately 40%) of bilateral visual impairment in children in industrialized countries and is the most rapidly growing cause among children in developing countries. Typical causes of CVI include abnormal brain development or brain damage, often resulting from birth-related complications such as hypoxic ischemic encephalopathy, meningitis, hydrocephalus, and head injury. The current gold standard for clinical diagnosis requires a trained clinician to administer visual-motor integration tests in conjunction with a thorough review of a patient's clinical history. This approach does not scale to meet the needs of undiagnosed children with CVI. Recently, non-invasive screening methods, such as administering a higher visual function question inventory (HVFQI), have been shown to accurately capture the observations of teachers and guardians of children. Analysis of those observations has been shown to correlate very highly with visual-motor integration tests (p values < 0.01). In this poster, we present a clinical database and information system, the HVFQI web app, which delivers a scalable, non-invasive pediatric CVI screening that is currently administered by a clinician, but has the potential to be self-administered in the near future, with follow-up clinical consultation for participants that screen positive or near-positive.

The HVFQI web app is an online clinical diagnostic tool and database system designed to gather participant responses to over 50 questions and follow-up questions. No personally identifying information is stored in the database. The web app provides three critical functions: (1) scalable and accurate administration of the HVFQI; (2) global coordination of screening efforts; and (3) a consolidated database for efficacy analysis studies and rapid iteration of the HFVQI to maximize impact, accuracy, and accessibility.

The HVFQI consists of a question inventory, a scoring rubric, and a conditional intervention strategy list. The question inventory is carefully curated and designed to capture observations from teachers and guardians that may indicate specific pediatric higher visual function deficits (HVFDs). In many instances HVFDs are accompanied by normal visual acuity, making diagnosis difficult and requiring multiple questions to elicit them. These questions are presented at random to avoid leading the participant. The responses are scored according to the HVFQI scoring rubric, which indicates which HVFDs may be present and suggests relevant intervention strategies. The rubric responds to varying degrees of affirmative responses on a 5-category Likert scale, which includes 3 non-applicable responses with different causes. The rubric also responds to categorical responses for multiple-choice and multiple-answer questions. The web app prepares a report of relevant intervention strategies for the participant, along with the questions and responses as context.
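
As a data structure, a rubric of this general shape could be sketched as below. All questions, domains, weights, thresholds, and strategies here are hypothetical placeholders; the actual HVFQI inventory and scoring rubric are not reproduced in this abstract.

```python
# Hypothetical illustration of a Likert-based scoring rubric.
# Question ids, domains, weights, thresholds, and strategies are invented.
LIKERT = {"never": 0, "rarely": 1, "sometimes": 2, "often": 3, "always": 4}
NOT_APPLICABLE = {"n/a: not observed", "n/a: too young", "n/a: unsure"}

RUBRIC = {
    # question id -> (HVFD domain, weight)
    "q1": ("visual search", 1.0),
    "q2": ("visual search", 1.0),
    "q3": ("motion perception", 1.5),
}

STRATEGIES = {
    "visual search": "declutter the visual environment",
    "motion perception": "reduce movement in the field of view",
}

def score_responses(responses, threshold=0.5):
    totals, maxima = {}, {}
    for qid, answer in responses.items():
        if answer in NOT_APPLICABLE:
            continue  # N/A responses are excluded from the domain score
        domain, weight = RUBRIC[qid]
        totals[domain] = totals.get(domain, 0.0) + weight * LIKERT[answer]
        maxima[domain] = maxima.get(domain, 0.0) + weight * max(LIKERT.values())
    flagged = {d for d in totals if totals[d] / maxima[d] >= threshold}
    return flagged, [STRATEGIES[d] for d in sorted(flagged)]

flagged, plan = score_responses(
    {"q1": "often", "q2": "always", "q3": "n/a: unsure"}
)
```

Excluding N/A responses from both the numerator and the maximum keeps domain scores comparable across participants who answered different subsets of questions.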

The web app is also designed to coordinate international clinical efforts. The web application administrators, led by a super-administrator, can create, manage, and remove centers and staff corresponding to physical (or virtual) locations as demand grows. Each center has a local administrator who can authorize staff to administer the HVFQI and to interpret and discuss the results and relevant strategies with participants.

The data are stored in a central secure database, accessible only to authorized researchers and the site administrator. All data are anonymous and non-personally identifying. Authorized researchers can analyze the results for their local centers to estimate efficacy and submit suggestions to update the HVFQI to meet the needs of local participants. Researchers authorized by the super-administrator have full access to the entire database and can perform global analyses for rapid improvement.

The HVFQI web app is the result of a collaboration between the SeeLab of the Smith-Kettlewell Eye Research Institute, and the Kulkarni Group of the Computer Science Department at San Francisco State University.

Application of natural language processing and machine learning to radiology reports

  • Seoungdeok Jeon
  • Zachary Colburn
  • Joshua Sakai
  • Ling-Hong Hung
  • Ka Yee Yeung

After radiologists review a set of chest X-rays (CXRs), they write a short report describing their observations and interpretations. Because these reports are free-text documents, there is a risk of miscommunication, which can result in worse patient outcomes. We applied text mining methods to radiology reports in the MIMIC Chest X-ray (MIMIC-CXR) database [5], which contains 227,835 de-identified free-text radiology reports. We selected relevant terms (features) and developed predictive models that take a radiology report as input and return the probability that the report describes a positive diagnosis for pneumonia, a common respiratory condition characterized by the accumulation of fluid in the lungs. Subsequently, we evaluated the performance of different predictive models using the area under the curve (AUC) and the Brier score.

Due to the large number of reports in the MIMIC-CXR database, we generated and evaluated predictive models by randomly selecting 500, 1000, 2000, and 3000 reports. Specifically, we randomly selected reports and assigned 70% to the training set and 30% to the test set, created a term-document matrix giving the frequencies of sets of 1 or 2 consecutive words (1-grams or 2-grams) using the R package tm [2], performed feature selection to identify terms that differentiate between classes, and trained models using different classification methods, including k-nearest neighbor (KNN), random forest [4], gradient boosting machine [3], xgboost [1], and adaboost [6]. We repeated the process six times and computed the average assessment statistics. Our results indicate that all the models perform similarly on the test set except for KNN. KNN had the worst performance, with an average Brier Score (ABS) of 0.313 and an average AUC of 0.645. The other algorithms performed well: random forest (ABS=0.174, AUC=0.836), gradient boosting (ABS=0.175, AUC=0.820), xgboost (ABS=0.177, AUC=0.814), and adaboost (ABS=0.163, AUC=0.815). The high performance suggests machine learning models have the potential to impact patient care in radiology.
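
The core steps of such a pipeline (1-/2-gram term counts, a probabilistic classifier, Brier scoring) can be sketched with standard-library Python. The toy reports and the multinomial naive Bayes classifier below are illustrative stand-ins for the MIMIC-CXR data and the R tm / boosting pipeline actually used.

```python
# Sketch: 1-/2-gram features + multinomial naive Bayes + Brier score.
# Toy reports stand in for MIMIC-CXR; naive Bayes stands in for the
# ensemble methods evaluated in the abstract.
import math
from collections import Counter

def ngrams(text):
    words = text.lower().split()
    return words + [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

def train_nb(docs, labels, alpha=1.0):
    vocab = set()
    counts = {0: Counter(), 1: Counter()}
    prior = Counter(labels)
    for doc, y in zip(docs, labels):
        grams = ngrams(doc)
        counts[y].update(grams)
        vocab.update(grams)
    V = len(vocab)
    totals = {y: sum(counts[y].values()) for y in (0, 1)}

    def predict_proba(doc):
        # Laplace-smoothed log-likelihoods, converted back to P(label=1).
        logp = {}
        for y in (0, 1):
            lp = math.log(prior[y] / len(labels))
            for g in ngrams(doc):
                lp += math.log((counts[y][g] + alpha) / (totals[y] + alpha * V))
            logp[y] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return math.exp(logp[1] - m) / z

    return predict_proba

def brier(probs, labels):
    # Mean squared difference between predicted probability and outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

reports = [
    "right lower lobe opacity consistent with pneumonia",
    "patchy consolidation concerning for pneumonia",
    "no acute cardiopulmonary process lungs are clear",
    "clear lungs no focal consolidation",
]
labels = [1, 1, 0, 0]
model = train_nb(reports, labels)
probs = [model(r) for r in reports]
```

The Brier score rewards calibrated probabilities rather than just correct rankings, which is why the abstract reports it alongside AUC.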

Formulating a gene signature for diagnosis of autoimmune and infectious diseases

  • Riya Gupta
  • Aditya M Rao
  • Lara Murphy Jones
  • Purvesh Khatri

When patients with an underlying autoimmune condition such as juvenile idiopathic arthritis or lupus report life-threatening symptoms, physicians need to quickly determine whether these symptoms are caused by an acute infection or a complication of their autoimmune condition. As immunosuppressive drugs are harmful to someone undergoing an infection, accurate and timely diagnosis is critical. In recent years, host-response-based diagnostics have shown promise in accurately and non-invasively diagnosing a number of infectious and autoimmune diseases.

Here, we collected and curated blood transcriptome profiles of 14,587 patients from 42 countries across 122 independent datasets and grouped them into infectious, autoimmune, and healthy control categories. Using a novel statistical framework, we created two gene signatures from this data: one to differentiate patients with autoimmune or infectious diseases from healthy individuals and another to differentiate between patients with autoimmune or infectious diseases. Both signatures achieve an area under the receiver operating characteristics curve (AUROC) of >0.87 on completely independent datasets. Because our training and testing data included heterogeneity across many factors, these gene signatures can be utilized in diverse clinical populations. Furthermore, these signatures can aid physicians across a broad range of clinical scenarios, where existing diagnostics are invasive, expensive, or non-specific.

SPAN and JBR: analysis and visualization toolkit for peak calling

  • Oleg Shpynov
  • Roman Chernyatchik
  • Petr Tsurinov
  • Maxim Artyomov

The widespread application of ChIP-seq has led to a growing need for analysis and comparison of multiple epigenetic profiles, for instance in human studies where multiple replicates are a common element of design. Peak calling is one of the fundamental steps of ChIP-seq analysis, followed by motif analysis, gene set enrichment analysis, etc. The most widely used peak calling tools, such as MACS2 [1] or SICER [2], perform the analysis independently for each sample; small differences in signal quality lead to very different numbers of peaks for individual samples, making group-level analysis and comparison between samples difficult. On the other hand, when samples are pooled together or processed with a single joint statistical model, individual-level statistical differences are ignored.

In our benchmarks, MACS2 produced from 5,000 to 18,000 peaks for the promoter-associated mark H3K4me3, with up to 3-fold differences for other marks, for the same cell types taken from different donors. In contrast, results from our semi-supervised peak caller SPAN were more consistent for all histone marks (e.g., 16,000--19,000 H3K4me3 peaks) across samples with substantially different signal-to-noise ratios. This approach was also successfully applied for the bulk peak calling step in a pipeline for integrative analysis of multiple single-cell ATAC-seq datasets [4].

The semi-supervised peak calling approach is implemented as part of the JBR genome browser, a stand-alone application that allows for accessible and streamlined annotation, analysis, and data visualization and is fully integrated with SPAN. Markup annotation can be created directly within the JBR genome browser and used to calibrate sample-specific peak calling parameters of SPAN by leveraging the annotation information. Moreover, JBR supports standard genome browser capabilities similar to IGV [5], and was designed for efficient data processing, resulting in fast viewing and analysis of multiple replicates, up to thousands of tracks.

SPAN can be applied to a broad range of ChIP-seq datasets of different quality, as well as to chromatin accessibility (ATAC-seq) data, as demonstrated on both real and simulated datasets. A simulation experiment [6] confirmed that SPAN produces good results even without the semi-supervised annotation step and can be used as a unified general-purpose peak calling method. Accelerated execution and integrated peak calling make SPAN and JBR a next-generation visualization and analysis toolkit for epigenetic data.

FARM: hierarchical association rule mining and visualization method

  • Petr Tsurinov
  • Oleg Shpynov
  • Nina Lukashina
  • Daria Likholetova
  • Maxim Artyomov

Association search is one method of data analysis. The Association Rule Mining (ARM) approach can construct association rules from observational data, but the most widely used algorithm, Apriori, typically produces a large number of unstructured results without any ranking or statistical significance. We propose FARM (Fishbone Association Rule Mining), a novel method for mining association rules, to address these challenges.

The first challenge is the huge number of unstructured rules: a large rule set is costly to inspect, and the absence of structure gives no indication of which features are most important. FARM produces rules hierarchically, which makes the priority of features clearly visible. At each step, FARM tries to increase the complexity of a hierarchical rule by adding features such that an optimization metric (e.g., conviction) grows, while also checking that information gain is achieved. Significance filtering is then used to focus on statistically significant results. FARM checks statistical significance with a hold-out approach: the dataset is split into two parts, the first for rule construction and the second for validation. Constructed rules are first filtered by a chi-squared test, then validated, and finally checked using statistical testing with multiple-comparisons correction. The resulting statistically significant hierarchical rules are presented in a human-readable way using an Ishikawa diagram, which visualizes a causal-like hierarchy with the target at the fishbone head and ordered predicates along the ribs, corresponding naturally to our rule structure. Final rules are included in the resulting diagrams, and interactive filters let FARM users show only the most significant rules, or rules with minimal required characteristics. The analysis can be run through a dedicated web service, which makes FARM convenient for anyone who wants to try it.
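
The conviction metric, conviction(X → Y) = (1 − supp(Y)) / (1 − conf(X → Y)), and the greedy hierarchical growth step can be illustrated as follows. The data and feature names are invented, and this sketch omits FARM's information-gain check and significance filtering.

```python
# Illustrative conviction-driven rule growth; not FARM's actual implementation.
def support(rows, items):
    return sum(all(r[i] for i in items) for r in rows) / len(rows)

def confidence(rows, lhs, target):
    lhs_n = sum(all(r[i] for i in lhs) for r in rows)
    both = sum(all(r[i] for i in lhs) and r[target] for r in rows)
    return both / lhs_n if lhs_n else 0.0

def conviction(rows, lhs, target):
    # conviction(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y))
    conf = confidence(rows, lhs, target)
    if conf == 1.0:
        return float("inf")  # rule is never violated
    return (1 - support(rows, [target])) / (1 - conf)

def grow_rule(rows, features, target):
    # Greedily add one feature at a time while conviction keeps increasing --
    # a simplified version of the hierarchical growth step described above.
    rule, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in rule:
                continue
            c = conviction(rows, rule + [f], target)
            if c > best:
                rule, best = rule + [f], c
                improved = True
    return rule, best

# Toy data: y holds exactly when both a and b hold.
rows = ([{"a": 1, "b": 1, "y": 1}] * 3
        + [{"a": 1, "b": 0, "y": 0}] * 2
        + [{"a": 0, "b": 1, "y": 0}] * 2
        + [{"a": 0, "b": 0, "y": 0}] * 2)
rule, best = grow_rule(rows, ["a", "b"], "y")
```

The greedy growth naturally orders features by how much each addition strengthens the rule, which is exactly the ordering a fishbone diagram displays along its ribs.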

We applied FARM to previously published public datasets, recovering rules that matched the results of the original papers. We then used FARM in our recent paper [1], where we found associations between methylome changes and regulatory regions in the genome.

FARM has proven convenient to use and promising in its ability to detect significant rules and visualize them clearly. We believe that FARM will accelerate discovery by providing a complete solution for analyzing and visualizing data patterns.

Mechanistic model demonstrates importance of autocrine IL-8 secretion by neutrophils

  • Wangui Mbuguiro
  • Feilim Mac Gabhann

IL-8 (CXCL8) is a potent chemoattractant and pro-angiogenic factor that is involved in maintaining homeostasis and is implicated in a wide range of inflammatory disorders. IL-8 was initially understood to be produced by monocytes to induce neutrophil migration through binding the surface receptors IL-8RA (CXCR1) and IL-8RB (CXCR2). Although neutrophil secretion of IL-8 has been reported as ranging from 0--10 molecules/cell/second [1,2], it is unclear how this may affect neutrophil activation. In this study, we create and parameterize a mechanistic model of IL-8 signaling using data from in vitro cell culture experiments. We use this model to estimate receptor internalization rates, which had not been previously reported. Through sensitivity analyses and additional simulations, we find that neutrophil secretion regulates the level of IL-8RB available, especially in pM-range environments.
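
A toy version of such a mechanistic model can be written as mass-action ODEs for ligand (L), free receptor (R), and bound complex (C), integrated with a simple Euler scheme. The rate constants below are illustrative, not the paper's fitted values; the point is only that turning on autocrine secretion produces nonzero receptor occupancy.

```python
# Toy mass-action model of ligand-receptor dynamics with autocrine secretion.
# dL/dt = secretion - kon*L*R + koff*C - kdeg_l*L
# dR/dt = ksyn     - kon*L*R + koff*C - kdeg_r*R
# dC/dt =            kon*L*R - koff*C - kint*C   (kint = internalization)
# All parameter values are illustrative placeholders.
def simulate(secretion, t_end=500.0, dt=0.05):
    kon, koff, kint = 0.1, 0.05, 0.02
    ksyn, kdeg_r, kdeg_l = 0.02, 0.02, 0.01
    L, R, C = 0.0, 1.0, 0.0
    for _ in range(int(t_end / dt)):
        bind, unbind = kon * L * R, koff * C
        L += dt * (secretion - bind + unbind - kdeg_l * L)
        R += dt * (ksyn - bind + unbind - kdeg_r * R)
        C += dt * (bind - unbind - kint * C)
    return L, R, C

quiet = simulate(secretion=0.0)       # no autocrine secretion: no complex forms
autocrine = simulate(secretion=0.005) # secretion drives receptor occupancy
```

In a model of this shape, the internalization rate `kint` sets how quickly bound receptor is removed from the surface, which is why it is identifiable from occupancy data and worth estimating.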

PubTrends: a scientific literature explorer

  • Oleg Shpynov
  • Kapralov Nikolai

With an ever-increasing number of scientific papers published each year, it becomes more difficult for researchers to explore unfamiliar or fast-growing research areas. This greatly inhibits the potential for cross-disciplinary research. When approaching a new subject, a researcher often starts with a search for relevant papers in a dedicated search engine - Google Scholar, Scopus, etc. Besides search by title, author names, or keywords, these services also provide the user with basic statistics such as a paper's citation count. However, it is hard to organize the information from these papers, especially with the current rise in publication numbers. For instance, during the COVID-19 pandemic, thousands of papers were published in the first months alone.

Bibliometrics methods [1] allow for structuring the information in papers by highlighting the most frequent keywords in the selected area. The number of citations is also widely used to rank papers according to their estimated scientific impact, and citation and co-citation networks for groups of papers may reflect the underlying structure of the research field [2]. Papers are more likely to be co-cited if they belong to closely related research topics, and network analysis allows the extraction of meaningful clusters representing different scientific directions.
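
A minimal sketch of co-citation clustering along these lines: papers cited together in the same bibliographies are linked, weak links are dropped, and connected components serve as a crude stand-in for the clustering methods the text refers to. The paper ids and threshold are illustrative.

```python
# Co-citation network from bibliographies, clustered by connected components.
from collections import defaultdict
from itertools import combinations

def cocitation_graph(reference_lists, min_weight=2):
    # Papers frequently cited together in the same bibliographies get an edge.
    weights = defaultdict(int)
    for refs in reference_lists:
        for a, b in combinations(sorted(set(refs)), 2):
            weights[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:  # drop incidental co-citations
            graph[a].add(b)
            graph[b].add(a)
    return graph

def components(graph):
    # Connected components as a crude stand-in for topic clusters.
    seen, clusters = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

bibliographies = [
    ["seq1", "seq2"], ["seq1", "seq2"], ["seq1", "seq2", "ml1"],
    ["ml1", "ml2"], ["ml1", "ml2"],
]
clusters = components(cocitation_graph(bibliographies))
```

The `min_weight` threshold plays the role of the similarity cutoff in real co-citation analysis: one shared bibliography is weak evidence, repeated co-citation is a topical link.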

Another approach for structuring the information is based on the methods for natural language processing. In particular, we have shown that citation context can be used for summarization of papers and, subsequently, research topics [3].

To our knowledge, no freely available tools combine different approaches for structuring the information in research papers in an easy-to-use web service. Most bibliometrics tools require manual preparation of a paper dataset, and most summarization methods lack a ready-to-use implementation.

We present PubTrends, a scientific publication exploration tool capable of analyzing the intellectual structure of a research field and finding similar papers. The service is available at and works with papers from the PubMed [4] database of biomedical texts. Within the search results for a given query or a list of paper ids, it shows the most cited papers, frequent keywords, and the most relevant authors and journals. We combine bibliometrics methods for citation analysis with natural language processing algorithms to compute similarity between papers, followed by topic extraction via clustering.

An integrated viewer for the citation graph and the paper similarity network, with rich visualization and filtering capabilities, allows quick navigation through the different aspects of the research field. Finally, we apply deep learning methods to interactively generate automated literature reviews.

In addition to implementation details and examples, we demonstrate that the topic extraction algorithms produce relevant results by comparing them with topics extracted from selected review papers in Nature Reviews journals.

Probing automated treatment of urinary tract infections for bias: a case-study where machine learning perpetuates structural differences and racial disparities

  • Garrett Yoon
  • Vincent J. Major

Urinary tract infections (UTI) are a common indication for antibiotic treatment worldwide. Fluoroquinolone antibiotics are widely prescribed for these infections, despite being considered 'second-line' treatments. The use of these therapies contributes to increased levels of antibiotic resistance. A recent paper [1] described a machine learning system to recommend the narrowest antibiotic predicted to be appropriate for an individual's UTI. Such data-driven techniques integrated with clinical decision support may play a role in antibiotic stewardship and slow the onset of resistance.

Decision making algorithms may inadvertently contain bias that should be vetted before implementation. Prior work has found unintended discriminatory practices in widely used healthcare algorithms [2]. The UTI treatment system (and the data used to develop it) was investigated for potential bias.

MVAR: a mouse variation registry

  • Bahá El Kassaby
  • Govindarajan Kunde-Ramamoorthy
  • Francisco Castellanos
  • Carol Bult

Model organisms are essential to understanding the biological and disease consequences of human genome variation. Bioinformatics resources that support meaningful comparisons of mouse and human genotype-to-phenotype data and knowledge are needed to support the translation from bench to bedside and back again [1].

There is no genome variation resource for mouse comparable to resources available for human genome variation data such as ExAC [2], ClinVar [3], or ClinGen [4]. NCBI resources such as dbSNP and ClinVar no longer accept data from model organisms. The European Variation Archive (EVA) serves as a repository of SNP data for mouse; however, it does not accept imputed variation data or the curated phenotype annotations associated with variation data that are central to data interpretation and analysis. Although the Mouse Genome Informatics database (MGI) [5] serves as a comprehensive mouse allele registry and curates information about the association of mouse variants with phenotypes and disease, the variation data in MGI are not currently available in a format consistent with the Human Genome Variation Society (HGVS) standards [6]. The Mouse Variation Registry (MVAR) will integrate all mouse genome variation data and includes processes to automatically canonicalize variants so that each is uniquely represented in the database, with comprehensive annotation and its distribution across strains.

The starting dataset used as input into MVAR was downloaded in VCF format [7] (as a 42GB gzipped file) from the Mouse Genomes Project [8] and contains about 81M single-nucleotide variants (SNVs), ~9M deletions, and ~8M insertions. Other data will be obtained from MGI, the Mouse Mutant Repository Database (MMRDB), the Diversity Outbred Database (DODB), and from computationally imputed SNP data.

The MVAR data ingest workflow has been developed to normalize, prepare, and annotate the input variation data. With the help of the GATK framework [9], the first step of the pipeline normalizes the data, i.e., left-aligns each variant and decomposes multi-allelic variants (rows of data with more than one alternate allele). The next step uses the Ensembl Variant Effect Predictor (VEP) [10], which annotates the variation data with its corresponding HGVS nomenclature and existing external IDs. The final step uses the Jannovar library [11] to enrich the data with functional-consequence annotations. After the data have been pre-processed through the pipeline, they are inserted into a MySQL database with the help of custom tools developed to create the canonical variant representations.
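
The decomposition and trimming steps can be illustrated on a toy VCF-style record. This sketch only splits multi-allelic records and trims shared bases; full left alignment, which requires shifting indels against the reference genome sequence (as GATK does), is omitted.

```python
# Toy illustration of multi-allelic decomposition and allele trimming.
# Full left alignment against the reference genome is intentionally omitted.
def decompose(record):
    # Split a multi-allelic VCF-style record into bi-allelic records.
    chrom, pos, ref, alts = record
    return [(chrom, pos, ref, alt) for alt in alts.split(",")]

def normalize(chrom, pos, ref, alt):
    # Trim shared trailing bases, then shared leading bases (adjusting pos),
    # so that each variant has a minimal, unique representation.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return chrom, pos, ref, alt

# One multi-allelic record: CTT -> CT (1bp deletion) and CTT -> C (2bp deletion).
records = decompose(("chr1", 100, "CTT", "CT,C"))
normed = [normalize(*r) for r in records]
```

Canonicalizing every variant to one minimal representation is what lets a registry detect that two submissions written differently describe the same allele.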

MVAR supports programmatic access to the registry through an API for interoperability. This API is used by a user-friendly web application with rich user interfaces to query the database and display results. The API is also available as a resource for other services or applications over HTTP with JSON data payloads. Widely used industry frameworks such as Angular and Groovy Grails were leveraged to build the MVAR web application.

To conclude, the lack of a comprehensive, annotated genome variation resource for mouse is a significant barrier to comparing variation and its biological consequences between mouse and human, and it limits the impact of many research and resource development programs. The MVAR project seeks to address this resource gap by bringing together investigators who have active projects in the area of genome variation in mouse, human, or both. Many of the investigators on this project have developed independent resources to curate or manage genome variation; this project aims to unify these efforts and build a common data resource. Future work will include the incorporation of structural variants into the MVAR registry.

Inferring interaction networks from microbial time series data: it's not just finding a statistic

  • Caroline Cannistra
  • Alex Yuan
  • Wenying Shou

When a scientist performs a statistical hypothesis test, there are three steps to consider: computing a statistic from the data, finding or estimating the probability distribution of that statistic under the null hypothesis, and comparing the computed statistic to its null distribution, resulting in a p-value. In the field of microbial ecology, various statistics are used to infer networks of interactions between microbes from time-series data, such as the correlation between time series, estimated coefficients when the data are fitted to a generalized Lotka-Volterra model, and local similarity. However, there has been little discussion of different methods for finding a null distribution of these statistics, and these null distributions are often found in an ad hoc and undefended way. We assess commonly used statistics, in combination with different methods of estimating null distributions, to infer causal relationships between simulated microbes in time-series data.

We estimate null distributions in two ways. The first is by generating surrogate time series data, which preserves most statistical properties of our data while adhering to the null model. Since there are many methods of generating surrogate data, and they all come with different assumptions about the original data, we have chosen to test multiple surrogate data types with different statistics. The second null distribution estimation method we benchmark is a parametric test that is paired with the Pearson correlation statistic and produces a null distribution with the data's estimated "effective degrees of freedom", thus accounting for autocorrelation. We show that our choice of null distribution estimation method can be used to minimize false positives, while choice of statistic affects the test's statistical power. As such, both "pieces" of the statistical test ought to be considered.
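
The surrogate-data idea can be illustrated with a minimal sketch. This toy version uses circular shifts, one of many surrogate types and chosen here only for simplicity, to build a null distribution for the Pearson correlation of two time series (the parametric effective-degrees-of-freedom test is not shown):

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def circular_shift_pvalue(x, y, n_surrogates=999, seed=0):
    """Null distribution from circularly shifted copies of y, which keeps
    each series' autocorrelation but breaks any cross-series coupling."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    hits = 0
    for _ in range(n_surrogates):
        k = rng.randrange(1, len(y))
        surrogate = y[k:] + y[:k]          # one circular-shift surrogate
        if abs(pearson(x, surrogate)) >= observed:
            hits += 1
    return (hits + 1) / (n_surrogates + 1)  # permutation-style p-value
```

The choice of surrogate type encodes the null model's assumptions, which is exactly why multiple surrogate types deserve benchmarking.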

Gene-disease-drug link prediction using tripartite graphs

  • Cheng Chen
  • Stephen K. Grady
  • Sally R. Ellingson
  • Michael A. Langston

The development of new ethical drugs is expensive in terms of both time and resources. A single drug can take up to a decade to bring to market, with costs soaring to over a billion dollars [1]. Drug repositioning has thus become an attractive alternative to the development of new compounds, with growing interest in the use of in silico repositioning predictions. Bipartite graphs and efficient biclique enumeration algorithms [2] can be used to study protein-drug and other crucial interactions. Extensions of this approach to higher dimensions have been hobbled, however, by a lack of effective analytics. In the present work, we take advantage of highly innovative and efficient tripartite graph algorithms [3]. We employ one partite set for genes, proteins, or other gene products, another for diseases, and a third for drugs of interest, with inter-partite edges denoting known or inferred interactions.

As a proof of principle, we constructed a tripartite interaction graph containing 1999 proteins, 5522 diseases, and 1359 drugs. Data for this graph was taken from three previously-studied bipartite interaction graphs found in the Stanford Biomedical Network Dataset Collection [4]. With the aid of the aforementioned novel algorithms, we were able to extract 173 maximal tricliques from this graph, with each triclique containing at least five vertices in every partite set. Proteins within these tricliques tended to share known biological function, diseases tended to affect the same organs and tissues, and drugs tended to share fingerprint similarity.
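
The defining property of a triclique can be stated compactly. The sketch below is only a naive membership check on toy data, not the efficient enumeration algorithms of [3]:

```python
def is_triclique(genes, diseases, drugs, e_gd, e_dr, e_gr):
    """True if every cross-partite pair is an edge, i.e. the three vertex
    sets induce a complete tripartite subgraph."""
    return (all((g, d) in e_gd for g in genes for d in diseases) and
            all((d, r) in e_dr for d in diseases for r in drugs) and
            all((g, r) in e_gr for g in genes for r in drugs))
```

A maximal triclique is then one to which no vertex can be added without violating this property; enumerating them efficiently is the hard part that the specialized algorithms address.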

Prediction requires edge imputation, of course, and so we then applied extensions of the well-known paraclique algorithm [5] to this graph. The aim of this technique is to find dense subgraphs amidst real and noisy biological data. In particular, paraclique is able to pinpoint edges that may be missing, which is exactly what we seek here. Using this approach, we found 115 putative protein-disease interactions, 144 putative protein-drug interactions, and 3,120 putative disease-drug interactions. We then used AutoDock Vina on the putative protein-drug interactions to simulate molecular docking and compute binding scores as in [6]. We found these scores to be significantly lower than those computed over all missing protein-drug edges (t-test, p-value threshold of 0.05), which suggests an increased likelihood of real interactions. In future work, we will compare and determine the feasibility of incorporating this methodology with alternative tools such as machine learning and graph neural networks.

Exploring target specificity of antimicrobial peptides through deep learning embeddings

  • Lauren Losin
  • Daniel Veltri

In the face of increasing bacterial resistance to antibiotics, antimicrobial peptides (AMPs) have emerged as encouraging candidates for the development of new drugs. Machine learning approaches can be applied in this area to characterize large sets of AMPs based on their bacterial targets, activity measures, and other sequence features. Such methods enable wet-laboratory researchers to optimize the speed and accuracy of their work by focusing on prioritized candidates [5].

Prior work on computational AMP recognition has largely focused on binary sequence classification (predicting AMP vs non-AMP) but is beginning to venture into de novo peptide design [5]. This work takes steps toward further understanding AMP function and specificity by learning sequence embeddings based on both molecular sequence and activity measures against different bacterial targets. The model uses a Siamese network architecture [1] to learn from pairs of AMPs and predict how their activity differs across 10 genera of bacteria. Unlike many other approaches, we also consider N- and C-terminal modifications to sequences.

Training and testing data originate from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP) [4] and were parsed to consider monomer AMPs with activity measurements recorded as minimum inhibitory concentration (MIC). Due to the large heterogeneity of bacteria at the species level, responses were grouped by genus and MIC values averaged. Based on the percentage of all AMPs with a mean MIC response available, the top 10 genera were considered. That data set was split into training (4,170 AMPs), validation (1,142 AMPs), and testing (535 AMPs) partitions. To reduce the chance of data leakage between testing and training data, the CD-HIT server [2] was used (after removing termini modifications) to ensure all testing sequences share < 90% identity with all training/validation sequences. Each partition was further arranged into pairs of sequences sharing the same target, with responses calculated as the difference in mean MIC values.
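
The pairing step can be sketched as follows. The record layout (sequence, target genus, mean MIC) is hypothetical and stands in for the parsed DBAASP fields:

```python
from itertools import combinations

def make_pairs(records):
    """records: (sequence, target_genus, mean_mic) tuples.
    Sequences sharing a target are paired, and the regression response is
    the difference of their mean MIC values."""
    by_target = {}
    for seq, target, mic in records:
        by_target.setdefault(target, []).append((seq, mic))
    pairs = []
    for target, group in by_target.items():
        for (s1, m1), (s2, m2) in combinations(group, 2):
            pairs.append((s1, s2, target, m1 - m2))
    return pairs
```

Each pair and its MIC difference then becomes one supervised training example for the Siamese network.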

The Siamese network consists of an embedding layer and a long short-term memory layer [3] trained in a supervised setting. It compares AMP sequence pairs to train a shared set of weights. All input sequences are padded to the same length, and a tokenizer is used to encode both amino acids and termini modifications. The model outputs sequence embeddings based on the difference in MIC for each AMP pair.

To obtain insight into AMP activity and specificity, separate models are trained for gram-positive and gram-negative genera. Trained embeddings for each model are then plotted and compared to visualize how bacterial membrane structure can influence AMP sequence composition. These results present another step towards making AMP deep learning models more informative and understandable to the research community.

Designing novel antimicrobial peptides against multi-drug resistant bacteria

  • Shravani Bobde
  • Fahad Alsaab
  • Guangshun Wang
  • Monique L. van Hoek

Antimicrobial peptides (AMPs) are ubiquitous amongst living organisms and are part of the innate immune system, with the ability to kill pathogens directly or indirectly by modulating the immune system. AMPs have potential as novel therapeutics against bacteria due to their rapid mechanism of action, which hinders the development of bacterial resistance. Additionally, there is a dire need for therapeutics with activity specifically against gram-negative bacterial infections, which are dangerous and difficult to treat. Development of new antibiotics has slowed in recent years, and novel therapeutics like AMPs with a focus on gram-negative bacteria are needed. We designed 8 novel AMPs, termed PHNX peptides, using ab initio computational design (database-filtering technology on the APD3 dataset of natural AMPs against gram-negative bacteria, as described by Wang et al.), assessed their theoretical function using published machine learning algorithms, and measured their activity in our laboratory. These AMPs demonstrated greater activity against gram-negative MDR Escherichia coli than against MRSA (methicillin-resistant Staphylococcus aureus) and showed low hemolytic activity against human red blood cells.

The genetics of human aging: predicting age and age-related diseases by deep mining high dimensional biomarker data

  • Hannah Guan

Recent research efforts have shown compelling evidence of DNA methylation alterations in aging and age-related disease. The traditional formulation of DNA methylation aging suffers from multiple hypothesis testing due to the interacting, high-dimensional, and non-linear nature of the data. Neural network analyses have proven effective for biological age prediction because of their ability to learn interacting and nonlinear relationships. However, the high dimensionality of DNA methylation data often results in overfitting and poor generalization in neural networks. To address this problem, we developed a neural network model that selects input features based on their correlations with biological age. We compared it with traditional statistical regressions and other dimension-reduction approaches for neural networks, such as networks with LASSO and elastic-net regularization and dropout networks. The results showed that our model decreased the age prediction error to 2.7 years, outperforming all other models. In addition, we studied age acceleration in two age-related conditions (Down Syndrome and Schizophrenia). Our model confirms age acceleration in Down Syndrome with a much smaller variance compared to existing studies and finds extrinsic epigenetic age acceleration (EEAA) in Schizophrenia, a weak pattern that other models could not detect. Our research is among the first to adapt neural network algorithms to biological aging prediction. It can be applied to a wide range of high-dimensional biomarker data and can ultimately improve understanding of the aging process and benefit public health.
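
The correlation-based feature selection described above can be sketched as follows. This is an illustrative simplification, ranking CpG sites by plain Pearson correlation with chronological age, not the study's trained model:

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_features(X, ages, k):
    """X[i][j]: methylation value of CpG site j in sample i.
    Rank sites by |r| with age and keep the indices of the top k."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, ages)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

Only the selected columns are then fed to the network, which shrinks the input dimension before any weights are trained.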

WORKSHOP SESSION: ParBio workshop paper presentations

CNN models for eye state classification using EEG with temporal ordering

  • Femi William
  • Feng Zhu

Most studies on eye-state (open or closed) detection apply machine learning techniques to subject-dependent eye-state datasets; subject-independent data, with large physiological variation between individuals, have not been well explored. Temporal ordering information is important for predicting eye state because EEG is a time-sequence dataset. In this research, we keep the temporal ordering of the data in place. We create multiple CNN models and select optimal filters and depth. Our CNN feature models are effective for both subject-dependent and subject-independent eye-state EEG classification. We achieved the best subject-dependent results with 4-layer CNNs, with accuracy rates of 96.51% on dataset I and 100% on dataset II. For subject-independent studies, the best classification accuracies were 80.47% on dataset I and 90.15% on dataset II.

Data mining for electroencephalogram signal processing and analysis

  • Rossana Mancuso
  • Marzia Settimo
  • Mario Cannataro

Electroencephalography (EEG) produces a complex signal that requires advanced signal processing and feature extraction methodologies to be interpreted correctly. EEG is usually used to record and estimate the brain's electrical activity, and it is employed in the detection and forecasting of epileptic and non-epileptic seizures and neurodegenerative pathologies. In this article, we give an overview of past, present, and emerging computational techniques for preprocessing and analyzing EEG signals.

In particular, this work briefly reviews the state of research in this field, trying to understand the needs of EEG analysis in the medical domain, with a special focus on neurodegenerative pathologies and on epileptic and non-epileptic diseases. After presenting the main pre-processing, feature selection, and extraction phases, we focus on classification processes and on data mining techniques applied to classify EEGs. We then discuss how EEG analysis can be implemented to investigate, predict, and diagnose cognitive diseases and epilepsy.

Towards dynamic simulation of a whole cell model

  • Jae-Seung Yeom
  • Konstantia Georgouli
  • Robert Blake
  • Ali Navid

Whole-cell models (WCMs) aim to integrate the sum of our knowledge about the mechanistic processes of an organism, which are inherently multi-scale and dynamic. Such comprehensive models would enable us to address many challenging questions, such as understanding interactions and coupling between pathways, examining system properties, and identifying gaps in our biological knowledge. WCMs integrate a diverse array of intracellular pathways through an equally diverse assortment of computational methods. Among these methods, stochastic simulation supports the most detailed models; however, it is also the most time-consuming to execute. Furthermore, WCMs involve some of the largest known biochemical reaction networks. To speed up the simulation and accelerate the development of such a model, we present a parallel implementation of the stochastic simulation algorithm (SSA) and its application to a whole-cell reaction network.
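
The stochastic simulation algorithm referred to above is typically Gillespie's direct method. A minimal serial sketch looks like the following; the paper's contribution is a parallel implementation, which is not shown here:

```python
import math
import random

def gillespie(x0, reactions, t_max, seed=0):
    """Minimal Gillespie direct-method SSA.

    reactions: list of (propensity_fn(state) -> float, state_change) pairs,
    where state_change maps species name -> integer copy-number delta."""
    rng = random.Random(seed)
    t, state = 0.0, dict(x0)
    trajectory = [(t, dict(state))]
    while t < t_max:
        props = [fn(state) for fn, _ in reactions]
        total = sum(props)
        if total == 0.0:                              # no reaction can fire
            break
        t += -math.log(1.0 - rng.random()) / total    # exponential waiting time
        r, acc = rng.random() * total, 0.0
        for p, (_, change) in zip(props, reactions):  # pick a reaction in
            acc += p                                  # proportion to its
            if r < acc:                               # propensity
                for species, delta in change.items():
                    state[species] += delta
                break
        trajectory.append((t, dict(state)))
    return trajectory
```

Because every event is simulated individually, run time grows with the number of reaction firings, which is exactly why SSA becomes the bottleneck at whole-cell network sizes.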

WORKSHOP SESSION: HPC-BOD workshop paper presentations

SUPREME: a cancer subtype prediction methodology integrating multiple biological datatypes using graph convolutional neural networks

  • Ziynet Nesibe Kesimoglu
  • Serdar Bozdag

Cancer is the second leading cause of death in the world, and cancer-specific networks are still not fully understood. With advancing technology, various biological datasets from cancer tissues have been generated to better characterize cancer biology. Using these datasets, subtypes of various cancers have been discovered and cancer subtype prediction tools have been developed. Several cancer subtype prediction studies rely on only one type of biological data, such as gene expression, mutation, microRNA expression, or DNA methylation. However, each of these data types explains a unique aspect of the underlying biology, so developing integrative computational methods has been an important problem in bioinformatics. Deep learning has the ability to find significant relations in high-dimensional data. Benefiting from deep learning, graph convolutional neural networks (GCNNs) were recently developed to model data by performing convolution in non-Euclidean domains such as graphs, and they have been applied in many studies. In this study, we aim to develop a cancer subtype prediction methodology called SUPREME that integrates multiple types of biological data to predict the subtypes of cancer patients. Our objective is to deduce the signals from different biological datasets, considering patient similarities, with GCNNs, and to integrate the datatype-specific signals to predict the cancer subtypes of patients. We performed a preliminary analysis using 1022 breast cancer patients along with gene expression, copy number aberration, mutation, and DNA methylation data from The Cancer Genome Atlas project. We generated individual networks and trained a GCNN model for each network separately. Our preliminary results showed that the individual networks had good prediction power compared to a multilayer perceptron even before the integration step. This motivated us to analyze these networks more deeply and to integrate them, benefiting from convolution over all the data types at once, potentially leading to more accurate models.
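
A single graph-convolution layer of the kind GCNNs stack computes H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). The toy implementation below is illustrative only and unrelated to SUPREME's actual code:

```python
def gcn_layer(adj, H, W):
    """One graph-convolution layer on an n-node graph.
    adj: n x n 0/1 adjacency matrix; H: n x f node features; W: f x f' weights."""
    n = len(adj)
    # add self-loops: A_tilde = A + I
    A = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # symmetric normalization D^(-1/2) A_tilde D^(-1/2)
    d = [sum(row) ** -0.5 for row in A]
    A_hat = [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)]

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, v) for v in row] for row in Z]   # ReLU activation
```

Each layer mixes a node's features with those of its neighbors, so stacking layers lets a patient's prediction draw on increasingly distant patients in the similarity network.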

GWAS analysis to compute genetic markers of progression to Alzheimer's disease

  • Yashu Vashishath
  • Serdar Bozdag

Prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is a challenging task due to the involvement of many genetic and environmental factors leading to neurodegeneration. Several Genome-Wide Association Studies (GWAS) have been conducted to understand the genetic risk factors of AD; however, few GWAS have examined the genetic factors of conversion from MCI to AD. In this study, we aim to find potential genetic markers of conversion from MCI to AD and build a machine learning model to predict the conversion. To this end, we used genetic variation data of 809 patients in the Alzheimer's Disease Neuroimaging Initiative (ADNI). We processed the genetic data by merging the SNP information of all patients into a single set of PLINK files, which served as input for the GWAS analysis. SNPs were filtered based on call rate, minor allele frequency, and Hardy-Weinberg equilibrium cut-offs. Samples were also filtered to remove relatedness using sample call rate, inbreeding coefficient, kinship coefficient, and linkage disequilibrium cut-offs. After filtration, we calculated the negative log p-value for the remaining SNPs in association with the phenotype (i.e., converter vs. non-converter). We repeated this step for other traits such as ventricle size, hippocampus size, and Tau protein concentration to record the different loci on chromosomes influencing each trait. The presence or absence of significant SNPs will be used to create a deep learning model that predicts the progression of AD in a patient using only genetic information.
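
The SNP-level filters mentioned above (call rate, minor allele frequency, Hardy-Weinberg equilibrium) can be sketched as follows. The thresholds are illustrative defaults, not the study's actual cut-offs:

```python
import math

def snp_passes_qc(genotypes, call_rate_min=0.95, maf_min=0.01, hwe_p_min=1e-6):
    """genotypes: per-sample minor-allele counts (0/1/2), None = missing."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0 or n / len(genotypes) < call_rate_min:
        return False                                   # call-rate filter
    freq = sum(called) / (2 * n)
    if min(freq, 1 - freq) < maf_min:
        return False                                   # MAF filter
    # Hardy-Weinberg equilibrium: chi-square test with 1 degree of freedom
    expected = [(1 - freq) ** 2 * n, 2 * freq * (1 - freq) * n, freq ** 2 * n]
    observed = [called.count(0), called.count(1), called.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))             # 1-df survival function
    return hwe_p >= hwe_p_min
```

In practice tools like PLINK apply the same three filters genome-wide before any association statistic is computed.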

Search feasibility in distributed MS-proteomics big data

  • Umair Mohammad
  • Fahad Saeed

Making large-scale Mass Spectrometry (MS) data FAIR (Findable, Accessible, Interoperable, Reusable) and democratizing access for the omics research community requires advanced access and reuse mechanisms. In this work, we proposed a novel distributed data-access infrastructure and developed a simulation test-bed to show the feasibility of this solution. In contrast to existing centralized approaches, participating nodes are relied upon to execute the search algorithm, and search based on the comparison of raw spectra is supported as opposed to simple metadata-based searches. Simulation results using networking, stochastic modelling, and queuing theory illustrated that search times were reduced by up to 600 times for up to fifty billion spectra. Proteomics is vital because of the importance of proteins to life and their role in state-of-the-art medicine such as custom drug delivery and cancer treatment. MS-based proteomics involves the fragmentation of proteins into peptide ions to generate raw MS spectra. Traditionally, scientists have relied on metadata-based searches of centralized repositories followed by complex database searches and protein sequencing. Though useful, this technique may miss datasets because of poor metadata or the sheer amount of effort and computational time needed. Recently, direct raw-spectra search has been proposed with the development of centralized tools such as PeptideAtlas. However, PeptideAtlas hosts 13,000 spectra, whereas systems supporting billions of spectra are needed. Let us assume users can submit one or more query spectra for search to a central controller. In the proposed distributed paradigm, the controller forwards the queries to several nodes hosting multiple MS/MS datasets in total; each node runs the search algorithm against each spectrum in its local MS/MS dataset and sends the results, as URLs/pointers and associated scores, back to the controller.
The controller will then collate the results and transmit them back to the users. To simulate system performance, we focused on the distributed process between the controller and the participating nodes. We modeled the nodes using computational devices present in typical research labs, the communication links as the average achievable by combined fiber/Ethernet links, and the data loads based on typical storage sizes of spectra and URLs. By running Monte Carlo simulations, we obtained the response time to a single query for various scenarios, and assuming an M/M/1 queue, we simulated the time degradation due to multiple requests by compounding over the number of requests with a load-degradation factor. Testing results for fifty billion spectra indicated that 500 distributed nodes can provide search results in 10 s and 2000 nodes in 5 s, reductions of 100 and 200 times, respectively, compared to a centralized approach requiring 1000 s. Considering the typical capabilities of modern servers and computers, a load factor of 0.001% was tested and indicated that the system provides constant-time performance up to 10k concurrent queries. Lastly, accounting for communication-link degradation demonstrated that a trade-off can be achieved between performance and number of nodes. Therefore, it is worth investigating the implementation of a distributed big-data access infrastructure for proteomics.
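
The basic M/M/1 response-time relation underlying the simulation can be written directly from queuing theory; the load-degradation compounding over concurrent requests is omitted from this sketch:

```python
def mm1_response_time(service_rate, arrival_rate):
    """Mean response time W = 1 / (mu - lambda) of a stable M/M/1 queue,
    with rates in queries per second."""
    rho = arrival_rate / service_rate        # utilization
    if rho >= 1:
        raise ValueError("unstable queue: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)
```

Adding nodes effectively raises the aggregate service rate, which is why response time keeps dropping as the spectra are spread over more machines.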

Real-time peptide identification from high-throughput mass-spectrometry data

  • Sumesh Kumar
  • Fahad Saeed

Peptide deduction remains one of the most challenging research problems in the large-scale study of proteomes using high-throughput mass spectrometers. The identification of large numbers of proteins from complex biological samples is carried out in two steps: 1) tryptic digestion of the protein sample to isolate constituent peptides, followed by generation of MS/MS data using high-throughput mass spectrometers; and 2) comparison of the mass-spectrometry data against a repository of known peptides using methods such as database-search tools. Advances in MS instrumentation now allow generation of high-resolution data in massive volume and velocity, making traditional MS-based algorithms a bottleneck in the overall workflow. The new generation of state-of-the-art database-search tools is capable of producing high-quality matches with impressively low FDR; however, the search usually takes between a few weeks and a few months depending on the size of the database and the search parameters. To accelerate overall search times, several studies have targeted this computational bottleneck by exploiting specialized hardware architectures, including HPC compute clusters and GPUs. Even with these accelerated pipelines, true real-time processing and deduction of peptides from MS data remains far from realization. One bottleneck preventing true real-time processing of MS data is the communication cost of the existing workflows: moving the data from storage to computational nodes and across hierarchies of system memory dominates the overall search process in MS data analysis. Therefore, techniques that minimize the communication cost by enabling the computational search process to execute near the source of data generation are highly desirable.
In particular, a specialized computer architecture built on FPGAs to process high-resolution MS data as soon as it is generated by a mass spectrometer can alleviate the latency of data storage and movement. FPGA-based designs can exploit the inherent data parallelism and minimize communication overhead through a custom pipeline design aimed at reducing the number of main-memory accesses. In this paper, we propose to design and develop an FPGA-based hardware accelerator. Our design consists of asynchronous parallel processing elements that implement efficient dataflow operations using configurable data caching, a contention-aware bus arbiter, and double buffering. Our results show a 600x reduction in the average number of DRAM accesses and an average 24x speed-up in overall computation compared with a CPU. These results were obtained by processing publicly available MS data; real-time performance can be achieved if the search operations are moved close to the source of data generation. In this regard, a streaming network-based hardware accelerator that reads raw data directly from the mass spectrometer, processes it in real time in a streaming fashion, and produces peptide deductions can greatly enhance the scale of proteomics.

Important Dates

Call for            Submission Deadline    Notification of Acceptance
Papers              April 30               May 29
Workshops           April 7                April 14
Tutorials           April 30               May 7
Highlights          May 10                 June 14
Posters             May 14                 May 27
Late-break Posters  May 20                 June 15

