Identifying the target genes of transcription factors (TFs) is key to understanding transcriptional regulation. However, our understanding of genome-wide TF targeting profiles is limited by the cost of large-scale experiments and their intrinsic complexity, so computational methods are useful for predicting unobserved associations. Here, we developed tREMAP, a new one-class collaborative filtering algorithm based on regularized, weighted nonnegative matrix tri-factorization. The algorithm predicts unobserved target genes for TFs from known gene-TF associations and a protein-protein interaction network. Our benchmark study shows that tREMAP significantly outperforms its counterpart REMAP, a bi-factorization-based algorithm, for TF target gene prediction on all four performance metrics (AUC, MAP, MPR, and HLU). When evaluated on independent data sets, the prediction accuracy is 37.8% for the top 495 predicted associations, an enrichment factor of 4.19 over random guessing. Furthermore, many of the novel associations predicted by tREMAP are supported by evidence from the literature. Although we use only canonical TF-target gene interaction data in this study, tREMAP can be directly applied to tissue-specific data sets. tREMAP provides a framework for integrating multiple omics data to further improve TF target gene prediction, and is thus a potentially useful tool for studying gene regulatory networks. The benchmark data set and the source code of tREMAP are freely available at https://github.com/hansaimlim/REMAP/tree/master/TriFacREMAP.
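As a concrete illustration of the factorization underlying this family of methods, the following numpy sketch implements plain nonnegative matrix tri-factorization (X ≈ U S Vᵀ) with standard multiplicative updates. This is a minimal sketch only: the weighting, regularization, and one-class handling that define tREMAP are omitted, and the toy matrix and ranks are invented for illustration.

```python
import numpy as np

def tri_factorize(X, k1, k2, n_iter=1000, eps=1e-9, seed=0):
    """Approximate X >= 0 as U @ S @ V.T with nonnegative factors,
    using multiplicative updates that reduce the Frobenius error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k1))
    S = rng.random((k1, k2))
    V = rng.random((n, k2))
    for _ in range(n_iter):
        U *= (X @ V @ S.T) / (U @ S @ V.T @ V @ S.T + eps)
        S *= (U.T @ X @ V) / (U.T @ U @ S @ V.T @ V + eps)
        V *= (X.T @ U @ S) / (V @ S.T @ U.T @ U @ S + eps)
    return U, S, V

# toy example: a low-rank nonnegative matrix is recovered closely
X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0]) + 0.01
U, S, V = tri_factorize(X, k1=2, k2=2)
err = np.linalg.norm(X - U @ S @ V.T) / np.linalg.norm(X)
```

In a REMAP-style setting, X would be the sparse gene-TF association matrix and the factors would additionally be constrained by the protein-protein interaction network.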
We study the problem of computing a minimal subset of nodes of a given asynchronous Boolean network that need to be controlled to drive its dynamics from an initial steady state (or attractor) to a target steady state. Due to the phenomenon of state-space explosion, a simple global approach that performs computations on the entire network may not scale well for large networks. We believe that efficient algorithms for such networks must exploit the structure of the networks together with their dynamics. Taking such an approach, we derive a decomposition-based solution to the minimal control problem which can be significantly faster than existing approaches on large networks. We apply our solution to both real-life biological networks and randomly generated networks, demonstrating promising results.
Dynamic biological networks model changes in network topology over time. However, the topologies of these networks are often not available at specific time points. Existing algorithms for studying dynamic networks often ignore this problem and focus only on the time points at which experimental data are available. In this paper, we develop a novel alignment-based network construction algorithm, ANCA, that constructs dynamic networks at the missing time points by exploiting information from a reference dynamic network. Our experiments on synthetic and real networks demonstrate that ANCA predicts the missing target networks accurately and scales to large biological networks in practical time. Our analysis of an E. coli protein-protein interaction network shows that ANCA successfully identifies key temporal changes in the biological networks. Our analysis also suggests that, by focusing on topological differences in the network, our method can be used to find important genes and temporal functional changes in biological networks.
We introduce an algorithm for selectively aligning high-throughput sequencing reads to a transcriptome, with the goal of improving transcript-level quantification in difficult or adversarial scenarios. This algorithm attempts to bridge the gap between fast, non-alignment-based algorithms and more traditional alignment procedures. We adopt a hybrid approach that is able to produce accurate alignments while still retaining much of the efficiency of non-alignment-based algorithms. To achieve this, we combine edit-distance-based verification with a highly sensitive read-mapping procedure. Additionally, unlike the strategy adopted in most aligners, which first align the ends of paired-end reads independently, we introduce a notion of co-mapping. This procedure exploits relevant information between the "hits" from the left and right ends of paired-end reads before full mappings for each are generated, improving the efficiency of filtering likely-spurious alignments. Finally, we demonstrate the utility of selective alignment in improving the accuracy of efficient transcript-level quantification from RNA-seq reads. Specifically, we show that selective alignment is able to resolve certain complex mapping scenarios that can confound existing non-alignment-based procedures, while simultaneously eliminating spurious alignments that fast mapping approaches can produce. Selective alignment is implemented in C++11 as a part of Salmon, and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon/tree/selective-alignment.
In biological sequences, tandem repeats consist of tens to hundreds of residues of a repeated pattern, such as atgatgatgatgatg ('atg' repeated), often the result of replication slippage. Over time, these repeats decay so that the original sharp pattern of repetition is somewhat obscured, but even degenerate repeats pose a problem for sequence annotation: when two sequences both contain shared patterns of similar repetition, the result can be a false signal of sequence homology. We describe an implementation of a new hidden Markov model for detecting tandem repeats that shows substantially improved sensitivity in labeling decayed repetitive regions, maintains low and reliable false annotation rates across a wide range of sequence compositions, and produces scores that follow a stable distribution. On typical genomic sequence, the time and memory requirements of the resulting tool (ULTRA) are competitive with the most heavily used tool for repeat masking (TRF). ULTRA is released under an open-source license and lays the groundwork for inclusion of the model in sequence alignment tools and annotation pipelines.
The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject's genome. As long-read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short-read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end, we developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools, we computed the "mappability spectrum" for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
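The per-position computation described above can be illustrated with a brute-force sketch: grow the read starting at a position until it occurs exactly once in the sequence. This uses exact matching only; the actual toolkit additionally models error rates and uses efficient edit-distance computation, which this toy omits, and the genome string is invented.

```python
def min_unique_read_length(genome, pos, max_len=None):
    """Smallest read length L such that genome[pos:pos+L] occurs exactly
    once in the genome (exact matching only, forward strand only)."""
    max_len = max_len or len(genome) - pos
    for L in range(1, max_len + 1):
        read = genome[pos:pos + L]
        # count overlapping occurrences of the candidate read
        count = sum(genome.startswith(read, i)
                    for i in range(len(genome) - L + 1))
        if count == 1:
            return L
    return None  # position cannot be uniquely covered at any length

genome = "atgatgatgc"
lengths = [min_unique_read_length(genome, p) for p in range(len(genome))]
```

Positions inside the 'atg' repeat need long reads to be uniquely covered (position 0 needs a 7-mer), while positions in unique context need very short ones.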
The genetic cross is a fundamental, flexible, and widely-used experimental technique to create new mutant strains from existing ones. Surprisingly, the problem of how to efficiently compute a sequence of crosses that can make a desired target mutant from a set of source mutants has received scarce attention. In this paper, we make three contributions to this question. First, we formulate several natural problems related to efficient synthesis of a target mutant from source mutants. Our formulations capture experimentally-useful notions of verifiability (e.g., the need to confirm that a mutant contains mutations in the desired genes) and permissibility (e.g., the requirement that no intermediate mutants in the synthesis be inviable). Second, we develop combinatorial techniques to solve these problems. We prove that checking the existence of a verifiable, permissible synthesis is NP-complete in general. We complement this result with three polynomial-time or fixed-parameter tractable algorithms for optimal synthesis of a target mutant for special cases of the problem that arise in practice. Third, we apply these algorithms to simulated data and to synthetic data. We use results from simulations of a mathematical model of the cell cycle to replicate realistic experimental scenarios where a biologist may be interested in creating several mutants in order to verify model predictions. Our results show that the consideration of permissible mutants can affect the existence of a synthesis or the number of crosses in an optimal one. Our algorithms gracefully handle the restrictions that permissible mutants impose. Results on synthetic data show that our algorithms scale well with increases in the size of the input and the fixed parameters.
Inspired by recent efforts to model cancer evolution with phylogenetic trees, we consider the problem of finding a consensus tumor evolution tree from a set of conflicting input trees. In contrast to traditional phylogenetic trees, the tumor trees we consider contain features such as mutation labels on internal vertices (in addition to the leaves) and allow multiple mutations to label a single vertex. We describe several distance measures between these tumor trees and present an algorithm to solve the consensus problem called GraPhyC. Our approach uses a weighted directed graph where vertices are sets of mutations and edges are weighted using a function that depends on the number of times a parental relationship is observed between their constituent mutations in the set of input trees. We find a minimum weight spanning arborescence in this graph and prove that the resulting tree minimizes the total distance to all input trees for one of our presented distance measures. We evaluate our GraPhyC method using both simulated and real data. On simulated data we show that our method outperforms a baseline method at finding an appropriate representative tree. Using a set of tumor trees derived from both whole-genome and deep sequencing data from a Chronic Lymphocytic Leukemia patient we find that our approach identifies a tree not included in the set of input trees, but that contains characteristics supported by other reported evolutionary reconstructions of this tumor.
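The arborescence step at the heart of this kind of consensus construction can be sketched with networkx, whose `minimum_spanning_arborescence` implements Edmonds' algorithm. The edge counts and the weighting function below are hypothetical stand-ins for illustration, not the GraPhyC weighting described in the paper.

```python
import networkx as nx

# Hypothetical parent->child observation counts from three input tumor trees.
# Edge weight = (number of trees) - (times the edge was observed), so
# frequently observed parental relations are cheap to keep in the consensus.
n_trees = 3
observed = {("root", "A"): 3, ("root", "B"): 1,
            ("A", "B"): 2, ("A", "C"): 2, ("B", "C"): 1}

G = nx.DiGraph()
for (u, v), count in observed.items():
    G.add_edge(u, v, weight=n_trees - count)

# minimum-weight spanning arborescence = consensus tree candidate
consensus = nx.minimum_spanning_arborescence(G)
edges = set(consensus.edges())
```

With these counts, the cheapest arborescence keeps root→A (seen in all trees) and attaches B and C under A, the parents they were observed under most often.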
We present an automated pipeline capable of distinguishing the phenotypes of myeloid-derived suppressor cells (MDSC) in healthy and tumor-bearing tissues in mice using flow cytometry data. In contrast to earlier work where samples are analyzed individually, we analyze all samples from each tissue collectively using a representative template for it. We demonstrate with 43 flow cytometry samples collected from three tissues, naive bone-marrow, spleens of tumor-bearing mice, and intra-peritoneal tumor, that a set of templates serves as a better classifier than popular machine learning approaches including support vector machines and neural networks. Our "interpretable machine learning" approach goes beyond classification and identifies distinctive phenotypes associated with each tissue, information that is clinically useful. Hence the pipeline presented here leads to better understanding of the maturation and differentiation of MDSCs using high-throughput data.
While many transcriptional profiling experiments measure dynamic processes that change over time, few include enough time points to adequately capture temporal changes in expression. This is especially true for data from human subjects, for which relevant samples may be hard to obtain, and for developmental processes where dynamics are critically important. Although most expression data sets sample at a single time point, it is possible to use accompanying temporal information to create a virtual time series by combining data from different individuals. We introduce TEMPO, a pathway-based outlier detection approach for finding pathways showing significant temporal changes in expression patterns from such combined data. We present findings from applications to existing microarray and RNA-seq data sets. TEMPO identifies temporal dysregulation of biologically relevant pathways in patients with autism spectrum disorders, Huntington's disease, Alzheimer's disease, and COPD. Its findings are distinct from those of standard temporal or gene set analysis methodologies. Overall, our experiments demonstrate that there is enough signal to overcome the noise inherent in such virtual time series, and that a temporal pathway approach can identify new functional, temporal, or developmental processes associated with specific phenotypes. Availability: An R package implementing this method and full results tables are available at bcb.cs.tufts.edu/tempo/.
Differential analysis is a central part of RNA-Seq analysis. Conventional differential analysis methods match tumor samples to normal samples from the same tumor type. Such methods fail to differentiate tumor types because they lack knowledge from other tumor types. The Pan-Cancer Atlas provides abundant information on 33 prevalent tumor types, which can be used as prior knowledge to generate tumor-specific biomarkers. In this paper, we embedded the high-dimensional RNA-Seq data into 2-D images and used a convolutional neural network to classify the 33 tumor types, achieving a final accuracy of 95.59%. Furthermore, based on the idea of Guided Grad-CAM, we generated a significance heat-map over all genes for each class. By performing functional analysis on the genes with high intensities in the heat-maps, we validated that these top genes are related to tumor-specific pathways, and some of them are already used as biomarkers, which demonstrates the effectiveness of our method. To our knowledge, we are the first to apply a convolutional neural network to the Pan-Cancer Atlas for classification of tumor types, and the first to use each gene's contribution to the classification as a measure of its importance for identifying candidate biomarkers. Our experimental results show that our method performs well and could also be applied to other genomics data.
We present a new distributed computing algorithm, Parallel Pattern Discovery (PPD), for constrained Non-negative Matrix Factorization (NMF). Our implementation offers the ability to constrain a specific pattern during optimization while minimizing reconstruction error. Parallel Pattern Discovery operates within a distributed environment using a message passing interface. Distribution of the PPD algorithm provides better scalability and allows operation in single- or multiple-system environments. The algorithm was tested on a set of time-series, dose-dependent mRNA gene expression data. Parallel Pattern Discovery was found to accurately identify patterns within the data and reconstruct the original matrices. Our NMF algorithm achieved a lower reconstruction error than standard NMF algorithms. Development focused on running PPD as part of a system that identifies significantly contributing genes. Parallel Pattern Discovery is first run to find patterns from biological data. It is followed by Gene Set Enrichment (GSE), which takes the pattern data and relates it back to genetic pathways.
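For readers unfamiliar with the underlying factorization, the following scikit-learn sketch shows plain (unconstrained, single-machine) NMF recovering a toy expression matrix built from two patterns. It illustrates only the reconstruction-error objective; PPD's pattern constraints and MPI-based distribution are not shown, and the toy data are invented.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy "expression" matrix: 6 genes x 4 conditions, built from 2 patterns
rng = np.random.default_rng(1)
W_true = rng.random((6, 2))
H_true = rng.random((2, 4))
X = W_true @ H_true

# factorize X ~ W @ H with nonnegative W, H and measure the error
model = NMF(n_components=2, init="nndsvda", max_iter=500)
W = model.fit_transform(X)
H = model.components_
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because the toy matrix is exactly rank 2, the relative reconstruction error is close to zero; PPD additionally fixes one of the patterns to a user-supplied shape during the fit.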
Chest X-ray is one of the most commonly available and affordable radiological examinations in clinical practice. Detecting thoracic diseases on chest X-rays remains a challenging task for machine intelligence, due to 1) the highly varied appearance of lesion areas on X-rays from patients with different thoracic diseases and 2) the shortage of accurate pixel-level annotations by radiologists for model training. Existing machine learning methods are unable to deal with the challenge that thoracic diseases usually occur in localized, disease-specific areas. In this article, we propose a weakly supervised deep learning framework equipped with squeeze-and-excitation blocks, multi-map transfer, and max-min pooling for classifying common thoracic diseases as well as localizing suspicious lesion regions on chest X-rays. Comprehensive experiments and discussions are performed on the ChestX-ray14 dataset. Both numerical and visual results demonstrate the effectiveness of the proposed model and its better performance against state-of-the-art pipelines.
The socioeconomic losses caused by extreme daytime drowsiness are enormous. Hence, a system that monitors instantaneous drowsiness and can be used in any environment is needed to improve work efficiency and safety. In this paper, we propose a novel framework to detect extreme drowsiness using a short time segment (~2 s) of EEG, which well represents immediate activity changes depending on a person's arousal, drowsiness, and sleep state. To develop the framework, we use the multitaper power spectral density (MPSD) for feature extraction, along with extreme gradient boosting (XGBoost) as a machine learning classifier. In addition, we suggest a novel drowsiness labeling method that combines the advantages of the psychomotor vigilance task and the electrooculography technique. Through experimental evaluation, we show that the adopted MPSD and XGBoost techniques outperform other techniques used in previous studies. Finally, we identify that spectral components (theta, alpha, and gamma) and channels (Fp1, Fp2, T3, T4, O1, and O2) play an important role in our drowsiness detection framework, which could be extended to mobile devices.
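The MPSD feature-extraction step can be sketched with scipy's DPSS tapers: average the periodograms obtained from K orthogonal tapers, then summarize the spectrum as band powers that a classifier such as XGBoost would consume. The sampling rate, taper parameters, and synthetic signal below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, fs, NW=3, K=5):
    """Multitaper PSD estimate: average periodograms over K DPSS tapers."""
    tapers = dpss(len(x), NW, Kmax=K)                 # shape (K, len(x))
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    psd = spectra.mean(axis=0) / fs
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return freqs, psd

def band_power(freqs, psd, lo, hi):
    mask = (freqs >= lo) & (freqs < hi)
    return psd[mask].sum()

# 2-second synthetic "EEG" segment with a strong 10 Hz (alpha) component
fs = 256
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 10 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
freqs, psd = multitaper_psd(x, fs)
alpha = band_power(freqs, psd, 8, 13)
gamma = band_power(freqs, psd, 30, 45)
```

Band powers such as `alpha` and `gamma`, computed per channel, form the feature vector handed to the gradient-boosting classifier.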
Cyclic mononucleotides, in particular 3',5'-cyclic guanosine monophosphate (cGMP) and 3',5'-cyclic adenosine monophosphate (cAMP), are molecular signals that mediate a myriad of biological responses in organisms across the tree of life. In plants, they transduce signals such as hormones and peptides perceived at receptors on the cell surface into the cytoplasm to orchestrate a cascade of biochemical reactions that enable them to grow and develop, and to adapt to light, hormones, salt and drought stresses as well as pathogens. However, their generating enzymes (guanylyl cyclases, GCs, and adenylyl cyclases, ACs) have only recently been discovered and are still poorly understood. Here, we employed a computational approach to probe the physicochemical properties of the catalytic centers of these enzymes, and this knowledge was used to create a web-based tool, ACPred (http://gcpred.com/acpred), for the prediction of AC functional centers from amino acid sequence. Understanding the nature of such catalytic centers has enabled the creation of predictive tools such as ACPred, which will, in turn, facilitate the discovery of novel cellular components across different systems.
Identifying cell types is one of the significant applications of single-cell RNA sequencing (scRNAseq) technology, which provides insights into cellular-level mechanisms and variation. Most existing methods for identifying cell types use only the expression matrix for clustering the cells; however, a few studies show the benefits of incorporating relationships between genes into the cell clustering procedure. In this study, we propose a new method, Gene Mover's Distance (GMD), which is based on the nonparametric Earth Mover's Distance (EMD) and leverages a novel word-embedding approach to cluster cells. In this method, both the intrinsic distances between genes and their expression values are used to compute a novel distance metric for clustering. We employed a word2vec embedding model trained on a biological corpus to capture the relationships between genes, and used EMD to compute the distance between cells by treating a cell as a group of weighted points (genes). We used three single-cell datasets to validate the proposed method and to evaluate its performance in comparison with three state-of-the-art clustering methods. Results indicate that GMD outperformed these methods in clustering single cells in terms of the Adjusted Rand Index and the Fowlkes-Mallows Index.
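The EMD at the core of such a metric is a small transportation linear program: move expression mass between genes at a cost given by the gene-gene ground distance. The sketch below solves it with scipy's `linprog`; the 3-gene distance matrix is a hand-made stand-in for distances derived from word2vec gene embeddings, and the two "cells" are invented.

```python
import numpy as np
from scipy.optimize import linprog

def emd(a, b, M):
    """Earth Mover's Distance between weight vectors a and b (each summing
    to 1) under ground-distance matrix M, solved as a transportation LP."""
    n, m = M.shape
    A_eq, b_eq = [], []
    # row sums of the flow must equal the source weights a
    for i in range(n):
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1
        A_eq.append(row); b_eq.append(a[i])
    # column sums of the flow must equal the target weights b
    for j in range(m):
        col = np.zeros(n * m); col[j::m] = 1
        A_eq.append(col); b_eq.append(b[j])
    res = linprog(M.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# two "cells" over 3 genes; M holds hypothetical gene-embedding distances
M = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
cell1 = np.array([1.0, 0.0, 0.0])   # all expression mass on gene 0
cell2 = np.array([0.0, 0.0, 1.0])   # all expression mass on gene 2
d = emd(cell1, cell2, M)
```

Moving all mass from gene 0 to gene 2 costs 2.0 here, whereas a plain Euclidean distance between the expression vectors would ignore how related the genes are.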
Bacteria with resistance genes are becoming ever more common, and new methods of discovering antibiotics are being developed. One of these methods involves creating random peptides and testing their antimicrobial activity. Developing antibiotics from these peptides requires understanding which sequence motifs are toxic to bacteria. To determine whether the toxic peptides of a randomly generated peptide library can be classified based solely on sequence motifs, we created the PepSeq Pipeline: new software that uses a Random Forest algorithm to extract motifs from a peptide library. We found that this pipeline can accurately classify 56% of the toxic peptides in the peptide library using motifs extracted from the model. Testing on simulated data with less noise, we could classify up to 94% of the toxic peptides. The pipeline extracted significant toxic motifs in every library that was tested, but its ability to classify all toxic peptides depended on the number of motifs in the library. Once extracted, these motifs can be used both to understand the biology behind why certain peptides are toxic and to create novel antibiotics. The code and data used in this analysis can be found at https://github.com/tjense25/pep-seq-pipeline.
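A toy version of the motif-extraction idea, with invented data: represent each peptide by overlapping 2-mer counts, plant a known "toxic" motif in half of a random library, and let a Random Forest recover it via feature importances. The alphabet, motif, and library below are fabricated for illustration and unrelated to the PepSeq data.

```python
import itertools
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)
AA = "ACDK"
KMERS = ["".join(p) for p in itertools.product(AA, repeat=2)]

def features(pep):
    # overlapping 2-mer counts serve as the motif features
    return [sum(pep[i:i + 2] == k for i in range(len(pep) - 1)) for k in KMERS]

peptides, labels = [], []
for _ in range(60):
    pep = "".join(random.choice("ACD") for _ in range(8))
    toxic = random.random() < 0.5
    if toxic:
        i = random.randrange(len(pep) - 1)
        pep = pep[:i] + "KK" + pep[i + 2:]   # plant the "toxic" motif
    peptides.append(pep)
    labels.append(int(toxic))

X = [features(p) for p in peptides]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
kk_importance = model.feature_importances_[KMERS.index("KK")]
train_acc = model.score(X, labels)
```

Since the planted motif's count is the only perfectly predictive feature, its importance dominates the ranking, which is the kind of signal the pipeline reports back as a candidate motif.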
Advances in DNA sequencing technologies have paved the way for metagenomics, defined as the collective sequencing of co-existing microbial communities in a host environment. Several researchers and clinicians have embarked on studying the role of these microbes with respect to human health and disease, whereas others are using metagenomics to monitor the impact of external factors (e.g., the Gulf oil spill) on ecosystems. The lack of accurate and efficient analytical methods is an impediment to identifying the function and presence of microbial organisms within different clinical samples, reducing our ability to elucidate microbial-host interactions and discover novel therapeutics. In this paper, we present a Multiple Instance Learning (MIL) based computational pipeline to predict clinical phenotype from metagenomic data. We use data from well-known metagenomic studies of liver cirrhosis and Inflammatory Bowel Disease (IBD) to evaluate our approach. We show that our proposed pipeline outperforms competing MIL as well as non-MIL methods.
One of the grand challenges of modern biology is to understand how genotypes (G) and environments (E) interact to affect phenotypes (P), i.e., G × E → P. Phenomics is the emerging field that aims to study large and complex data sets encompassing combinations of genotypes, environments, and phenotype readings. A phenomenon of crucial interest in this context is that of divergent subpopulations, i.e., how certain subgroups of the population show differential behavior under different types of environmental conditions. We consider the fundamental task of identifying such "interesting" subpopulation-level behavior by analyzing high-dimensional phenomics data sets from a large and diverse population. However, delineating such subpopulations is a challenging task due to the large size, high dimensionality, and complexity of phenomics data. We present a new framework to extract such subpopulation-level information from phenomics data. Our approach is based on principles from algebraic topology, a branch of mathematics that studies the shape and structure of data in a robust manner. In particular, our framework identifies and quantifies "flares", which are structural branching features in data that characterize divergent behavior of subpopulations, in an unsupervised manner. We present algorithms to detect and rank flares, and demonstrate the utility of the proposed framework on two real-world plant phenomics data sets.
Connectomics alterations associated with subtle forms of cerebrovascular neuropathology-such as cerebral microbleeds (CMBs)-can result in substantial neurological and/or cognitive deficits in victims of traumatic brain injury (TBI). Quantifying CMB-related connectome changes in mild TBI (mTBI) patients requires ingenious neuroinformatics to integrate structural magnetic resonance imaging (sMRI) with diffusion-weighted imaging (DWI) for patient-tailored profiling while preserving the data scientist's ability to implement population studies. Such solutions, however, can assist the refinement of rehabilitation protocols and streamline large-scale analysis while accommodating the heterogeneity of mTBI. This study describes a pipeline for the multimodal integration of sMRI/DWI/DTI to quantify white matter (WM) neural network circuitry alterations associated with mTBI-related CMBs. The approach incorporates WM streamline matching, topology-compliant streamline prototyping and along-tract analysis within a unified framework. When applied to the analysis of neuroimaging data acquired from both mTBI and healthy control volunteers, the approach facilitates the identification of patient-specific CMB-related connectomic changes while incorporating the ability to perform group analyses. This pipeline for the identification and profiling of connectopathies can assist the adaptation of clinical rehabilitation protocols to patients' individual needs.
Discovery of disease biomarkers is a key step in translating advances in genomics into clinical practice. There is growing evidence that changes in gut microbial composition are associated with the onset and progression of Type 2 Diabetes (T2D), obesity, and Inflammatory Bowel Disease (IBD). Reliable identification of the most informative features (i.e., microbes) for discriminating metagenomics samples from two or more groups (i.e., phenotypes) is a major challenge in computational metagenomics. We propose a Network-Based Biomarker Discovery (NBBD) framework for detecting disease biomarkers from metagenomics data. NBBD has two major customizable modules: i) a network inference module for inferring ecological networks from the abundances of microbial operational taxonomic units (OTUs); and ii) a node importance scoring module for comparing the networks constructed for the chosen phenotypes and assigning each node a score based on the degree to which its topological properties differ across the two networks. We empirically evaluated the proposed NBBD framework, using five network inference methods for inferring gut microbial networks combined with six node topological properties, on the identification of IBD biomarkers from a large dataset comprising 657 IBD and 316 healthy-control metagenomic biopsy samples. Our results show that NBBD is very competitive with some of the state-of-the-art feature selection methods, including the widely used method based on random forest variable importance scores.
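A minimal sketch of the node-scoring idea, using degree centrality as the single topological property (NBBD supports six) and two tiny hand-made OTU networks in place of inferred ecological networks; the node names and edges are invented.

```python
import networkx as nx

def degree_diff_scores(G1, G2):
    """Score each node by how much its normalized degree centrality
    differs between the two phenotype networks."""
    c1 = nx.degree_centrality(G1)
    c2 = nx.degree_centrality(G2)
    nodes = set(G1) | set(G2)
    return {n: abs(c1.get(n, 0.0) - c2.get(n, 0.0)) for n in nodes}

# hypothetical OTU co-occurrence networks for two phenotypes
healthy = nx.Graph([("otu1", "otu2"), ("otu1", "otu3"), ("otu1", "otu4")])
disease = nx.Graph([("otu2", "otu3")])

scores = degree_diff_scores(healthy, disease)
biomarker = max(scores, key=scores.get)
```

Here `otu1`, a hub in the healthy network that disappears entirely in the disease network, receives the top score and would be reported as a candidate biomarker.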
Accurate reporting of causes of death on death certificates is essential for formulating appropriate disease control, prevention, and emergency response by national health-protection institutions such as the Centers for Disease Control and Prevention (CDC). In this study, we use publicly available expert-formulated rules for cause of death to determine the extent of discordance between the death certificates in national mortality data and the expert knowledge base. We also report the most commonly occurring invalid causal pairs that physicians put on death certificates. We use sequence rule mining to find patterns that are most frequent on death certificates and compare them with the rules from the expert knowledge base. Based on our results, 20.1% of the common patterns derived from entries on death certificates were discordant. The most probable causes of this discordance are missing steps and non-specific ICD-10 codes on the death certificates.
Deriving pseudo causal relations from medical text data lies at the heart of medical literature mining. Existing studies have utilized extraction models to find pseudo causal relations in single sentences, while the knowledge created by causation transitivity, which often spans multiple sentences, has not been considered. Furthermore, we observe that many pseudo causal relations follow the rule of causation transitivity, which makes it possible to discover unseen causal relations and generate new causal relation hypotheses. In this paper, we address these two issues by proposing a factor graph model that incorporates three clues to discover causation expressions in text data. We propose four types of triad structures to represent the rules of causation transitivity among causal relations. Our proposed model, called CausalTriad, uses textual and structural knowledge to infer pseudo causal relations from the triad structures. Experimental results on two datasets demonstrate that (a) CausalTriad is effective for pseudo causal relation discovery within and across sentences; (b) CausalTriad is highly capable of recognizing implicit pseudo causal relations; and (c) CausalTriad can infer missing/new pseudo causal relations from text data.
Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. In this study, we explore data from dermatology outpatients to build a risk model for the disease. Using millions of patient records with thousands of data points in each record, we show that we can build a melanoma risk model from real-world Electronic Health Record (EHR) data without any expert knowledge or manually engineered features. While other risk models for melanoma have been developed, this is the first to use routinely collected EHR data rather than expert features targeted specifically at melanoma. The random forest model achieves similar or better performance than these previous models (AUC 0.79, sensitivity 0.71, specificity 0.72), which allows larger populations of patients to be screened for melanoma risk without specialized and time-consuming data collection. Important features from the model can be extracted and studied, and features influencing a specific prediction can be explained to providers and patients. The process for building this model can be further refined to improve performance, and can also be used for risk prediction of other diseases.
Multiple sequence alignment (MSA) is a classic problem in computational genomics. In typical use, MSA software is expected to align a collection of homologous genes, such as orthologs from multiple species or duplication-induced paralogs within a species. Recent focus on the importance of alternatively-spliced isoforms in disease and cell biology has highlighted the need to create MSAs that more effectively accommodate isoforms. MSAs are traditionally constructed using scoring criteria that prefer alignments with occasional mismatches over alignments with long gaps. Alternatively spliced protein isoforms effectively contain exon-length insertions or deletions (indels) relative to each other, and demand an alternative approach. Some improvements can be achieved by making indel penalties much smaller, but this is merely a patchwork solution. In this work we present Mirage, a novel MSA software package for the alignment of alternatively spliced protein isoforms. Mirage aligns isoforms to each other by first mapping each protein sequence to its encoding genomic sequence, and then aligning isoforms to one another based on the relative genomic coordinates of their constitutive codons. Mirage is highly effective at mapping proteins back to their encoding exons, and these protein-genome mappings lead to extremely accurate intra-species alignments; splice site information in these alignments is used to improve the accuracy of inter-species alignments of isoforms. Mirage alignments have also revealed the ubiquity of dual-coding exons, in which an exon conditionally encodes multiple open reading frames as overlapping spliced segments of frame-shifted genomic sequence.
Understanding how a mutation affects a protein's structural stability can guide pharmaceutical drug design initiatives that aim to engineer medicines for combating a variety of diseases. Conducting wet-lab mutagenesis experiments on physical proteins can provide precise insights about the role of a residue in maintaining a protein's stability, but such experiments are time- and cost-intensive. Computational methods for modeling and predicting the effects of mutations are available, with several machine learning approaches achieving good predictions. However, most such methods, including ensemble-based approaches that rely on multiple classifier models instead of a single-expert system, depend on large datasets for training the model. In this work, we motivate and demonstrate the utility of several voting-based models that combine the predictions of Support Vector Regression (SVR), Random Forest (RF), and Deep Neural Network (DNN) models for inferring the effects of single amino acid substitutions. The three models rely on rigidity analysis results for a dataset of proteins for which we use wet-lab experimental data, achieving prediction accuracies with Pearson correlation values of 0.76. We show that our voting approaches achieve a higher Pearson correlation, as well as a lower RMSE score, than any of the SVR, RF, and DNN models alone.
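The voting idea above amounts to averaging the per-mutation predictions of the base models and scoring the result against experimental values. A minimal stdlib-only sketch (the model names, ddG values, and helper names are illustrative, not from the paper):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def voting_predict(predictions_per_model):
    """Average the per-mutation predictions of several base models."""
    return [mean(preds) for preds in zip(*predictions_per_model)]

# Hypothetical ddG predictions (kcal/mol) from three base models
# (stand-ins for the SVR, RF, and DNN outputs).
svr = [1.1, -0.4, 2.0, 0.3]
rf  = [0.9, -0.2, 1.6, 0.5]
dnn = [1.3, -0.6, 1.8, 0.1]
ensemble = voting_predict([svr, rf, dnn])
```

In practice the base models would be trained on rigidity-analysis features; the voting step itself is model-agnostic.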
Computational techniques for binding-affinity prediction and molecular docking have long been considered in terms of their utility for drug discovery. With the advent of deep learning, new supervised learning techniques have emerged which can utilize the wealth of experimental binding data already available. Here we demonstrate the ability of a fully convolutional neural network to classify molecules from their Simplified Molecular-Input Line-Entry System (SMILES) strings for binding affinity to HIV proteins. The network is evaluated on two tasks to distinguish a set of molecules which are experimentally verified to bind and inhibit HIV-1 Protease and HIV-1 Reverse Transcriptase from a random sample of drug-like molecules. We report 98% and 93% classification accuracy on the respective tasks using a computationally efficient model which outperforms traditional machine learning baselines. Our model is suitable for virtual screening of a large set of drug-like molecules for binding to HIV or other protein targets.
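As a sketch of the kind of input featurization such a network consumes, a SMILES string can be encoded as a fixed-size one-hot matrix over a character vocabulary (the vocabulary, dimensions, and function name here are illustrative assumptions; the paper's actual encoding details may differ):

```python
def one_hot_smiles(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x len(vocab) one-hot matrix,
    zero-padded on the right -- a typical input shape for a 1-D ConvNet."""
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

# A toy character vocabulary; real models build it from the training set.
VOCAB = sorted(set("CN(=O)c1ccccc1"))
m = one_hot_smiles("CC(=O)N", VOCAB, max_len=10)   # acetamide, padded to 10
```

The padded matrix can then be fed to convolutional layers that learn local substructure patterns directly from the string.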
The analysis of electroencephalogram (EEG) signals plays a crucial role in epileptic seizure detection. Researchers have proposed many machine learning and deep learning based automatic epileptic seizure detection methods. However, these schemes, especially the deep learning based ones, suffer from the need to label huge amounts of training data. Moreover, in epileptic seizure detection, physicians pay more attention to abnormal signals than normal signals, and thus the misclassification costs for the two should differ. To address these issues, we propose a cost-sensitive deep active learning scheme to detect epileptic seizures. In particular, we develop a new generic double-deep neural network (double-DNN) to obtain the cost-sensitive utility for the sample selection strategy in the labeling process. We further employ three types of fundamental neural networks, i.e., one-dimensional convolutional neural networks (1D CNNs), recurrent neural networks with long short-term memory (LSTM) units, and recurrent neural networks with gated recurrent units (GRU), in the double-DNN and evaluate their performances. Experimental results show that the proposed scheme can reduce the amount of labeled samples by up to 33% and 80% compared with uncertainty sampling and random sampling, respectively.
The analysis of animal cross section images, such as cross sections of laboratory mice, is critical in assessing the effect of experimental drugs, e.g., the biodistribution of candidate compounds in the preclinical drug development stage. Tissue distribution of radiolabeled candidate therapeutic compounds can be quantified using techniques like Quantitative Whole-Body Autoradiography (QWBA). QWBA relies, among other aspects, on the accurate segmentation or identification of key organs of interest in the animal cross section image - such as the brain, spine, heart, liver and others. Currently, organs are identified manually in such mouse cross section images - a process that is labor-intensive, time-consuming, and not robust, and one that increases the number of laboratory animals required. We present a deep learning based organ segmentation solution to this problem, which achieves automated organ segmentation with high precision (Dice coefficient in the 0.83-0.95 range, depending on the organ) for the key organs of interest.
We present a deep learning approach for analyzing DNase-seq datasets that shows promise for unraveling the biological underpinnings of transcription regulation mechanisms. Further understanding of these mechanisms can lead to important advances in the life sciences in general and in drug and biomarker discovery and cancer research in particular. Motivated by recent remarkable advances in the field of deep learning, we developed a platform, Deep Semi-Supervised DNase-seq Analytics (DSSDA). Primarily empowered by deep generative Convolutional Networks (ConvNets), its most notable aspect is the capability of semi-supervised learning, which is highly beneficial for common biological settings often plagued by an insufficient amount of labeled data. In addition, we investigated a k-mer based continuous vector space representation, attempting to further improve learning power by accounting for the nature of biological sequences, whose features exhibit locality-based relationships between neighboring nucleotides. DSSDA employs a modified Ladder Network for the underlying generative model architecture, and its performance is demonstrated on the cell type classification task using sequences from large-scale DNase-seq experiments. We report the performance of DSSDA in both the fully-supervised setting, in which DSSDA outperforms widely-known ConvNet models (94.6% classification accuracy), and the semi-supervised setting, in which, even with less than 10% of the labeled data, DSSDA performs comparably to other ConvNets using the full data set. Our results underscore the need for better deep learning methods to learn latent features and representations of challenging genomic sequence datasets.
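DSSDA's representation is a learned continuous embedding, but the k-mer featurization it builds on can be illustrated with simple overlapping k-mer counts (a minimal sketch; the function name and parameters are ours):

```python
from collections import Counter
from itertools import product

def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Count overlapping k-mers of seq and return a fixed-order count
    vector with one entry per possible k-mer over the alphabet."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]

vec = kmer_vector("ACGTACGT", k=2)   # 16-dimensional 2-mer count vector
```

Continuous embeddings (word2vec-style) replace these sparse counts with dense learned vectors, but the sliding-window extraction step is the same.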
Genome annotation is the process of labeling the DNA sequences of an organism with their biological features, and is one of the fundamental problems in Bioinformatics. Public annotation pipelines such as NCBI's integrate a variety of algorithms and homology searches on public and private databases. However, they build on information of varying consistency and quality, produced over the last two decades. We identified 12,415 errors in NCBI RNA gene annotations, demonstrating the need for improved annotation programs. We use a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units to demonstrate the potential of deep learning networks to annotate genome sequences, and evaluate different approaches on prokaryotic sequences from the NCBI database. In particular, we evaluate DNA $k$-mer embeddings and the application of RNNs to genome annotation. We show how to improve the performance of our deep networks by incorporating intermediate objectives and downstream algorithms to achieve better accuracy. Our method, called DeepAnnotator, achieves an F-score of ~94% and establishes a generalized computational approach for genome annotation using deep learning. Our results are very encouraging, as our method eliminates the requirement of hand-crafted features and motivates further research on the application of deep learning to full genome annotation. DeepAnnotator algorithms and models can be accessed on GitHub: https://github.com/ruhulsbu/DeepAnnotator.
In cancer genomics, due to fast somatic mutations (mainly random segment duplications and deletions), copy number profiles (CNPs), i.e., vectors containing the copy number of each gene, are used more often than the genomes themselves. On the other hand, algorithms with performance analysis for processing CNPs are lacking. In a recent CPM'16 paper, Shamir et al. studied the copy number transformation problem, which is to use the minimum number of duplications and deletions (on the CNPs) to convert one CNP to another, and gave a linear time algorithm. In this paper, we consider a slightly different problem called Minimum Copy Number Generation (MCNG): given a genome G and a specific CNP C, use the minimum number of duplications and deletions on G to obtain some genome H whose CNP is C. We show that the problem is NP-hard if G is generic (i.e., contains duplicated genes) and the duplications are tandem. On the other hand, when only tandem duplications are allowed, if G is exemplar (i.e., is a permutation) and all components of C are powers of two, then the problem can be solved in time linear in the length of the input (i.e., $|C|$) plus $O(|G| \log |G|)$ (the cost of sorting $|G|$ elements). This naturally extends to a practical heuristic algorithm for the problem (when G is exemplar and the components of C are arbitrary). We also show that two variations of the MCNG problem are at least as hard as Set Cover in terms of approximability and FPT tractability. For the general Minimum Copy Number Generation problem, i.e., when both (arbitrary) segment duplications and deletions are allowed, we also design a practical greedy algorithm, present some non-trivial cases, and discuss directions for future research.
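To make the objects concrete (a toy sketch, not the paper's algorithm; the helper names are ours): a genome is an ordered gene sequence, its CNP is the per-gene copy count, and a tandem duplication copies a segment in place next to itself.

```python
from collections import Counter

def copy_number_profile(genome):
    """Derive a copy number profile (per-gene copy counts) from a genome
    given as an ordered list of gene identifiers."""
    return dict(Counter(genome))

def tandem_duplicate(genome, start, end):
    """Return a new genome with the segment genome[start:end] duplicated
    in tandem (the copy is inserted immediately after the original)."""
    return genome[:end] + genome[start:end] + genome[end:]

g = ["a", "b", "c", "b"]          # a generic genome (gene "b" is repeated)
g2 = tandem_duplicate(g, 1, 3)    # duplicates the segment b,c
cnp = copy_number_profile(g2)
```

MCNG asks for the minimum number of such duplications (and deletions) applied to G so that the resulting genome's profile equals the target CNP C.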
Principal Component Analysis (PCA) of dense single nucleotide polymorphism (SNP) data has wide-ranging applications in population genetics, including the detection of chromosomal inversions. SNPs associated with each PC can be identified through single-SNP association tests performed between SNP genotypes and PC coordinates; this approach has several advantages over thresholding loading factors or sparse PCA methods.
However, insect vector SNP data often have a high proportion of unknown (uncalled) genotypes that cannot be reliably imputed and that prevent the direct use of association tests. Building on our previous work, we propose a novel method for adjusting the association tests to handle these unknown genotypes.
We demonstrate the utility of the method through two applications: detecting chromosomal inversions and characterizing differentiation processes captured by PCA. When applied to SNP data from the 2L and 2R chromosome arms of 34 karyotyped Anopheles gambiae and Anopheles coluzzii mosquitoes, our method clearly identifies the 2La, 2Rb, 2Rc, 2Rj, and 2Ru inversions. Using our method to identify SNPs associated with 2L-PC3, we observed one of the two insecticide-resistance variants in the Rdl gene; our results suggest that this PC captures differentiation driven by insecticide usage.
With the increased availability of tumor genomics data, it is possible to discern mutation profiles and differentiate cancer patient subtypes with the ultimate purpose of personalizing treatment strategies. Patient subtyping has been implemented using molecular profiles of gene expression data sets that suffer from systematic error. Somatic mutation data, available through tumor sample comparisons, offer more accurate mutation information than molecule-level profiles but suffer from data sparsity. To address this challenge, we developed a novel post-randomization-based perturbation clustering technique to identify the optimal number of subtypes using different somatic mutation cancer datasets. We compared the clustering results of our approach with those of standard clustering approaches in terms of the identified subtypes' predictive accuracy on survival time. Results from different cancer mutation datasets consistently demonstrate increased subtyping performance and signal future opportunities for using somatic mutations to identify cancer subtype-specific biomarkers.
Despite all efforts made in the last few decades in disease prevention and treatment, lung cancer remains the most common cancer after skin cancer. Lung cancer is also the leading cause of cancer death. One of the criteria used to assess both the severity of the disease and the efficacy of treatments in cancer is the survival rate. Survival rate is impacted by several parameters, including disease stage and patient profile. The aim of this research is to build a predictive model to assess the survivability of advanced stage non-small cell lung cancer (NSCLC) patients. The main feature of the proposed methodology is the leveraging of patients' treatment sequences in addition to clinical, demographic, and genomic patient information. A new algorithm that generates frequent patterns derived from patient treatment sequences is proposed. The algorithm aims at capturing the complex relationship that exists between treatment sequences and patient outcomes in order to produce better prediction models. The results of the experiments show that random forest models achieved the best prediction performance. The features that have the most influence on prediction performance are those derived from treatment sequences, indicating their potential to encapsulate the information expressed in the other features, such as the genomic features.
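The pattern-generation idea can be sketched as counting contiguous treatment sub-sequences that reach a minimum support across patients (a simplification of the paper's algorithm; the function name, parameters, and data are illustrative):

```python
from collections import Counter

def frequent_patterns(sequences, max_len=3, min_support=2):
    """Count contiguous sub-sequences (patterns) of patient treatment
    sequences; keep those occurring in at least min_support patients."""
    support = Counter()
    for seq in sequences:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        support.update(seen)          # count each pattern once per patient
    return {p: c for p, c in support.items() if c >= min_support}

patients = [
    ["chemo", "radiation", "surgery"],
    ["chemo", "radiation", "immunotherapy"],
    ["radiation", "surgery"],
]
patterns = frequent_patterns(patients)
```

Each frequent pattern then becomes a binary feature (present/absent per patient) that a downstream classifier such as a random forest can consume alongside clinical and genomic features.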
Biomedical open information extraction (BioOpenIE) is a novel paradigm for automatically extracting structured information from unstructured text with no or little supervision. It does not require any pre-specified relation types but aims to extract all the relation tuples from the corpus. A major challenge for open information extraction (OpenIE) is that it produces massive numbers of surface-form relation tuples that cannot be directly used for downstream applications. We propose a novel framework CPIE (Clause+Pattern-guided Information Extraction) that incorporates clause extraction and meta-pattern discovery to extract structured relation tuples with little supervision. Compared with previous OpenIE methods, CPIE produces massive but more structured output that can be directly used for downstream applications. We first detect short clauses from input sentences. Then we extract quality textual patterns and perform synonymous pattern grouping to identify relation types. Last, we obtain the corresponding relation tuples by matching each quality pattern in the text. Experiments show that CPIE achieves the highest precision in comparison with state-of-the-art OpenIE baselines, and also keeps the distinctiveness and simplicity of the extracted relation tuples. CPIE shows great potential in effectively dealing with real-world biomedical literature with complicated sentence structures and rich information.
Clinical trial protocols are complex documents that must be translated manually for trial execution and management. We have developed a system to automatically transform a schedule of activity (SOA) table from a PDF document into a machine interpretable form. Our system combines semantic, structural, and NLP approaches with a "human in the loop" for verification to determine which cells contain activity or temporal information, and then to understand details of what these cells represent. Using a training and test set of 20 protocols, we assess the accuracy of identifying specific types of SOA elements. This work is the first stage of a larger effort to use artificial intelligence techniques to extract procedural logic in clinical trial documents and to create a knowledge base of protocols for insights and comparison across studies.
The structure of a protein largely determines its functional properties. Hence, knowledge of a protein's 3D structure is an important aspect of determining solutions to fundamental biological problems. Structure prediction algorithms generally employ a clustering algorithm to select the optimal model for a target from a large number of predicted conformations (a.k.a. decoys). Despite significant advancement in clustering-based optimal decoy selection methods, these approaches often cannot deliver high performance in terms of the time taken to cluster a large number of protein structures, owing to the computational cost associated with pairwise structural superpositions. Here, we propose a superposition-free approach to protein decoy clustering, called clustQ, based on weighted internal distance comparisons. Experimental results suggest that the novel weighting scheme is helpful for both reproducing the decoy-native similarity score and estimating the pairwise-clustering-based predicted quality score in a computationally efficient manner. clustQ attains performance comparable to the state-of-the-art multi-model decoy quality estimation methods participating in the latest Critical Assessment of protein Structure Prediction (CASP) experiments, irrespective of target difficulty. Moreover, the clustQ predicted score offers a unique way to reliably estimate target difficulty without knowledge of the experimental structure. clustQ is freely available at http://watson.cse.eng.auburn.edu/clustQ/.
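The superposition-free idea can be sketched as comparing the internal (intra-structure) pairwise distances of two decoys directly, which avoids any rotational alignment. The agreement score below is a crude stand-in for clustQ's weighted scheme (the coordinates, tolerance, and function names are illustrative assumptions):

```python
import math

def internal_distances(coords):
    """All pairwise distances between residue coordinates; these are
    invariant to rotation/translation, so no superposition is needed."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

def distance_agreement(a, b, tol=1.0):
    """Fraction of corresponding internal distance pairs agreeing within
    tol Angstroms -- a crude stand-in for a weighted internal-distance score."""
    da, db = internal_distances(a), internal_distances(b)
    return sum(abs(x - y) <= tol for x, y in zip(da, db)) / len(da)

# Toy CA traces of two decoys for the same 3-residue target.
decoy1 = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0)]
decoy2 = [(0, 0, 0), (3.9, 0, 0), (7.5, 0.5, 0)]
score = distance_agreement(decoy1, decoy2)
```

Because each structure's distance set is computed once, comparing all decoy pairs avoids the per-pair superposition cost that dominates conventional clustering.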
The function of a protein depends on its three-dimensional structure. Current approaches based on homology for predicting a given protein's function do not work well at scale. In this work, we propose a representation of proteins that explicitly encodes secondary and tertiary structure into fixed-size images. In addition, we present a neural network architecture that exploits our data representation to perform protein function prediction. We validate the effectiveness of our encoding method and the strength of our neural network architecture through a 5-fold cross validation over roughly 63 thousand images, achieving an accuracy of 80% across 8 distinct classes. Our novel approach of encoding and classifying proteins is suitable for real-time processing, leading to high-throughput analysis.
Data clustering approaches are widely used in many domains, including molecular dynamics (MD) simulation. Modern applications of clustering for MD simulation data must be capable of assessing both natively folded and disordered proteins. We compare the performance of spectral clustering with a more recent subspace clustering approach, and with a newly proposed 'hybrid' clustering algorithm that seeks to combine the useful characteristics of both methods, on MD data from both protein classes. Results are analysed in terms of accuracy, stability, data density, and other properties. We conclude by identifying which combinations of algorithms, improvements, and data densities provide results that are either more accurate or more stable. We find that subspace clustering produces better results than standard spectral clustering, especially for disordered proteins and regardless of input data density or choice of affinity scaling. Additionally, our hybrid approach improves subspace results in most cases, and entropic affinity scaling leads to better performance of both spectral clustering and our hybrid approach.
Biomarker discovery aims to find a shortlist of high-profile biomarkers that can be further verified and utilized in downstream analysis. Many biomarkers exhibit structured multiclass behavior, where groups of interest may be clustered into a small number of patterns such that groups assigned the same pattern share a common governing distribution. While several algorithms have been proposed for multiclass problems, to the best of our knowledge, none can take such constraints on the group-pattern assignment, or structure, as input and output high-profile potential biomarkers along with the structure they satisfy. While post-hoc analyses may be used to infer the structure, ignoring such information prevents feature selection from fully taking advantage of the experimental data. Recent work proposes a Bayesian framework for feature selection that places priors on the feature-label distribution and the label-conditioned feature distribution. Here we extend this framework to structured multiclass problems, solve the proposed model for the case of independent features, evaluate it in several synthetic simulations, apply it to two cancer datasets, and perform enrichment analysis. Many of the highly ranked genes and pathways are suggested to be affected in the cancers under study. We also find potentially new biomarkers. Not only do we detect biomarkers, but we also make inferences about the underlying distributional connections across classes, which provide additional insight into cancer biology.
The state of the art in biomedical technologies has produced many genomic, epigenetic, transcriptomic, and proteomic data of varied types across different biological conditions. Historically, it has always been a challenge to produce new ways to integrate data of different types. Here, we leverage the node-conditional univariate exponential family distribution to capture the dependencies and interactions between different data types. The graph underlying our mixed graphical model contains both undirected and directed edges. In addition, it is widely believed that incorporating data across different experimental conditions can lead us to a more holistic view of the biological system and help to unravel the regulatory mechanisms behind complex diseases. We therefore integrate the data across related biological conditions through multiple graphical models. The performance of our approach is demonstrated through simulations and its application to cancer genomics.
Metabolic reprogramming is a hallmark of cancer. In cancer cells, transcription factors (TFs) govern metabolic reprogramming through abnormally increasing or decreasing the transcription rate of metabolic enzymes, which provides cancer cells growth advantages and concurrently leads to the altered metabolic phenotypes observed in many cancers. Consequently, targeting TFs that govern metabolic reprogramming can be highly effective for novel cancer therapeutics. In this work, we present TFmeta, a machine learning approach to uncover TFs that govern reprogramming of cancer metabolism. Our approach achieves state-of-the-art performance in reconstructing interactions between TFs and their target genes on public benchmark data sets. Leveraging TF binding profiles inferred from genome-wide ChIP-Seq experiments and 150 RNA-Seq samples from 75 paired cancerous (CA) and non-cancerous (NC) human lung tissues, our approach predicted 19 key TFs that may be the major regulators of the gene expression changes of metabolic enzymes of the central metabolic pathway glycolysis, which may underlie the dysregulation of glycolysis in non-small-cell lung cancer patients.
Advanced high-throughput technologies have produced vast amounts of biological data. Data integration is the key to obtaining the power needed to pinpoint the biological mechanisms and biomarkers of the underlying disease. Two critical drawbacks of computational approaches for data integration are that they do not account for study bias or for the noisy nature of molecular data. This leads to unreliable and inconsistent results, i.e., the results change drastically when the input is slightly perturbed or when additional datasets are added to the analysis. Here we propose a multi-cohort integrated approach, named MIA, for biomarker identification that is robust to noise and study bias. We deploy a leave-one-out strategy to avoid the disproportionate influence of a single cohort. We also utilize techniques from both p-value-based and effect-size-based meta-analyses to ensure that the identified genes are significantly impacted. We compare MIA against classical approaches (Fisher's, Stouffer's, maxP, minP, and the additive method) using 7 microarray and 4 RNA-Seq datasets. For each approach, we construct a disease signature using 3 datasets and then classify patients from the 8 remaining datasets. MIA outperforms all existing approaches, achieving the highest sensitivity and specificity in accurately distinguishing symptomatic patients from healthy controls.
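Of the classical baselines mentioned, Fisher's method is the simplest to sketch: for k independent p-values, the statistic $-2\sum \ln p_i$ follows a chi-square distribution with 2k degrees of freedom, whose survival function has a closed form for even degrees of freedom. A stdlib-only sketch (the function name is ours):

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's method: combine independent p-values via
    X = -2 * sum(ln p), chi-square distributed with 2k degrees of freedom.
    For even df, P(Chi2_{2k} > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!."""
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(k))

combined = fisher_combined_pvalue([0.01, 0.03, 0.20])
```

A gene's per-study p-values combine into a single evidence score; approaches like MIA go further by also weighting effect sizes and guarding against any single cohort dominating.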
Solving median tree problems is a classic approach for inferring species trees from a collection of discordant gene trees. Such problems are typically NP-hard and dealt with by local search heuristics. Unfortunately, such heuristics generally lack any provable correctness and precision. Algorithmic advances addressing this uncertainty have led to exact dynamic programming formulations suitable for solving a well-studied group of median tree problems for smaller phylogenetic analyses. However, these formulations allow computing only very few optimal species trees out of possibly many such trees, and phylogenetic studies often require the analysis of all optimal solutions through their consensus tree. Here, we describe a significant algorithmic modification of the dynamic programming formulations that computes the cluster counts of all optimal species trees, from which various types of consensus trees can be efficiently computed. Through experimental studies, we demonstrate that our parallel implementation of the modified formulation is more efficient than a previous implementation of the original formulation, and can greatly benefit phylogenetic analyses.
Gene duplication and loss are two evolutionary processes that occur across all three domains of life. These two processes result in different loci, across a set of related genomes, having different gene trees. Inferring the phylogeny of the genomes from data sets of such gene trees is a central task in phylogenomics. Furthermore, when the evolutionary history of the genomes includes short branches, deep coalescence (incomplete lineage sorting, ILS) could be at play, in addition to duplication and loss, further adding to the complexity of gene/genome relationships. Recently, researchers have developed methods to infer these evolutionary processes by simultaneously modeling gene duplication, loss, and incomplete lineage sorting with respect to a given (fixed) species tree. In this work, we focus on the task of inferring species trees, as well as locus and gene trees, from sequence data in the presence of all three processes. We developed a search heuristic for estimating the maximum a posteriori species/locus/gene tree triad, as well as their associated parameters, from the sequence data of independent gene families. We demonstrate the performance of our method on simulated data and a data set of 200 gene families from six yeast genomes. Our work enables new statistical phylogenomic analyses, particularly when hidden paralogy and incomplete lineage sorting could be simultaneously at play.
It is well-understood that most eukaryotic genes contain one or more protein domains and that the domain content of a gene can change over time. This change in domain content, through domain duplications, transfers, or losses, has important evolutionary and functional consequences. Recently, a powerful new reconciliation framework, called Domain-Gene-Species (DGS) reconciliation, was introduced to simultaneously model the evolution of a domain family inside one or more gene families and the evolution of those gene families inside a species tree. The underlying computational problem in DGS reconciliation is NP-hard and a heuristic algorithm is currently used to estimate optimal DGS reconciliations. However, this heuristic has several undesirable limitations. First, it offers no guarantee of optimality or near-optimality. Second, it can result in biologically unrealistic evolutionary scenarios. And third, it only computes a single DGS reconciliation even though there can be multiple optimal DGS reconciliations. In this work, we introduce the first exact algorithm for computing optimal DGS reconciliations that addresses all three limitations. Our algorithm is based on an integer linear programming formulation of the problem, which we solve iteratively by solving a series of linear programming relaxations. Our experimental results on over $3,400$ domain trees and over 7,000 gene trees from 12 fly species show that our new algorithm is highly scalable and that it leads to significant improvement in DGS reconciliation inference. An implementation of our exact algorithm is available freely from http://compbio.engr.uconn.edu/software/seadog/.
Perhaps the most important organizing principle in biology for bacteria is the tree of phyla. It represents the evolution of bacteria now living in virtually every environment. The availability of whole genome sequences has provided the opportunity to reconstruct a comprehensive view of the tree and to trace the shared ancestry among all bacteria that have been sequenced. However, most existing research has presented the tree of phyla without considering the ancestral phylum. The objective of this study is to find the ancestral phylum using a network science approach and exploiting the availability of a rich dataset of genomes. For the analysis, a network representing 210 organisms is created by clustering more than 700,000 protein sequences for 28 recognized phyla. A network of phyla is then extracted from the results, which is examined using a breadth-first search algorithm and centrality measures to create a rooted tree from which the likely ancestral phylum is identified.
Observing the recent progress in deep learning, the employment of AI to accelerate drug discovery and cut R&D costs has surged in the last few years. However, the success of deep learning is attributed to large-scale, clean, high-quality labeled data, which is generally unavailable in drug discovery practice. In this paper, we address this issue by proposing an end-to-end deep learning framework in a semi-supervised learning fashion. That is, the proposed deep learning approach can utilize both labeled and unlabeled data. While labeled data is of very limited availability, the amount of available unlabeled data is generally huge. The proposed framework, named seq3seq fingerprint, automatically learns a strong representation of each molecule in an unsupervised way from a huge training data pool containing a mixture of both unlabeled and labeled molecules. In the meantime, the representation is also adjusted to further help predictive tasks, e.g., acidity, alkalinity, or solubility classification. The entire framework is trained end-to-end and simultaneously learns the representation and inference results. Extensive experiments support the superiority of the proposed framework.
Drug-drug interactions (DDIs) may cause significant adverse effects. As prescribing multiple drugs becomes increasingly common, it is necessary to verify potential interactions among drugs that are used at the same time. Likely DDIs can be identified with higher confidence if supporting experimental evidence is provided. Such information is usually published in biomedical literature. While current retrieval and classification methods can identify publications related to DDIs, not all articles that discuss DDIs contain experimental evidence. A publication that does present evidence typically contains sentences conveying information about specific experimental methods and their results. A classifier that can readily identify such sentences can be useful for obtaining explicit and reliable information concerning DDIs. In this work, we develop two text classifiers to distinguish scientific sentence-fragments bearing experimental evidence from fragments that do not present such evidence. We focus on a corpus of text containing biomedical abstracts related to drug interactions. The classifiers are trained and tested on a manually curated set of sentence-fragments in these abstracts. Our experiments demonstrate a high level of performance (at least 89% precision and recall) suggesting the applicability of these classifiers toward improving retrieval of reliable information pertaining to drug interactions.
The biological aging process is a main cause of many age-related diseases. Therefore, exploring cellular-level changes due to aging, chemical exposure, and anti-aging compounds is of high interest in drug discovery and personalized drug research. In this paper, we propose a model to predict the effect of chemical compounds on the lifespan of Caenorhabditis elegans. We analyze data from the DrugAge database, which catalogs chemical compounds that affect the lifespan of model organisms, and use chemical descriptors and gene ontology terms as features. We propose a new feature selection scheme based on particle swarm optimization and correlation-based feature selection to select the most relevant features for the classification task. The experimental results indicate our approach achieves higher performance than existing methods. We discuss the benefits of our proposed feature selection scheme over other methodologies and compare results obtained with random forest against baseline support vector machine and artificial neural network classifiers.
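As an illustration of the correlation-based feature selection component described above, the following stdlib-only sketch computes the standard CFS merit score of a candidate feature subset (the score a swarm-based search would optimize); this is our illustration under the usual CFS definition, not the authors' implementation, and the PSO search itself is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def cfs_merit(features, labels):
    """CFS merit of a feature subset: high feature-class correlation,
    low feature-feature redundancy (Hall's k*r_cf / sqrt(k + k(k-1)r_ff))."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return (k * r_cf) / sqrt(k + k * (k - 1) * r_ff)
```

A search procedure (PSO or greedy forward selection) would simply evaluate `cfs_merit` on candidate subsets and keep the best-scoring one.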
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm with expected time complexity of $O(n \log^k n)$, while maintaining a practical space complexity of $O(kn)$, where n is the string length. When $k>0$, which is the hard case, our new proposal significantly improves on the $O(n^2)$ any-case time complexity of the prior best method for k-mismatch shortest unique substring finding. An experimental study shows that our new algorithm is practical to implement and delivers significant improvements in processing time over the prior best solution's implementation when k is small relative to n. For example, our method processes a 200KB sample DNA sequence with $k=1$ in just 0.18 seconds, compared to 174.37 seconds with the prior best solution. Further, significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, yielding further significant practical performance improvements. As an example, when using 8 cores, the parallel implementations both achieved processing times less than $1/4$ that of the serial implementation when processing a 10MB sample DNA sequence with $k=2$. In an age where instances with thousands of gigabytes of RAM are readily available through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users.
For example, the best prior solution may take years to process a 200MB DNA sample for any $k>0$, while this new proposal, using 24 cores, can process a sample of this size with $k=1$ in $206.376$ seconds with a peak memory usage of 46GB, which is both easily available and affordable on the Cloud for many users. We expect this new efficient and practical algorithm for k-mismatch shortest unique substring finding to prove useful to those applying the measure to long sequences in fields such as computational biology.
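To make the problem concrete, the following sketch is a cubic brute-force baseline (in the spirit of the quadratic-and-worse prior approaches the paper improves upon, not the new algorithm itself): it returns the shortest substring that has no other same-length occurrence within Hamming distance k. All names are ours.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def k_mismatch_sus(s, k):
    """Shortest substring of s with no other occurrence within Hamming
    distance k. Brute force for illustration only; the paper's algorithm
    achieves expected O(n log^k n) time instead."""
    n = len(s)
    for L in range(1, n + 1):                 # try lengths shortest-first
        for i in range(n - L + 1):
            win = s[i:i + L]
            if all(hamming(win, s[j:j + L]) > k
                   for j in range(n - L + 1) if j != i):
                return win
    return None
```

On "abcab" with k=0 the unique length-1 substring "c" is found; with k=1 it is no longer unique (it matches "a" and "b" within one mismatch), so the search must move to length 2.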
The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data, including de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive sampling based streaming algorithm due to Bar-Yossef et al. for approximating distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard, as the sample size is almost 85% smaller than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
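For context, the quantity being approximated is the exact k-mer frequency histogram, which is simple to compute but memory-hungry at scale; a minimal stdlib sketch (our illustration, not KmerEstimate's streaming algorithm):

```python
from collections import Counter

def kmer_histogram(reads, k):
    """Exact k-mer frequency histogram: maps a frequency f to the number of
    distinct k-mers occurring exactly f times. This exact computation needs
    memory proportional to the number of distinct k-mers, which is what
    streaming estimators like KmerEstimate avoid."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(Counter(counts.values()))
```

For reads "ACGT" and "ACGA" with k=3, the k-mer ACG occurs twice and CGT/CGA once each, giving histogram {2: 1, 1: 2}.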
Profile Hidden Markov Models (HMMs) are graphical models that can be used to produce finite length sequences from a distribution. In fact, although they were only introduced for bioinformatics 25 years ago (by Haussler et al., Hawaii International Conference on Systems Science 1993), they are arguably the most commonly used statistical model in bioinformatics, with multiple applications, including protein structure and function prediction, classification of novel proteins into existing protein families and superfamilies, metagenomics, and multiple sequence alignment. The standard use of profile HMMs in bioinformatics has two steps: first a profile HMM is built for a collection of molecular sequences (which may not be in a multiple sequence alignment), and then the profile HMM is used in some subsequent analysis of new molecular sequences. The construction of the profile is thus itself a statistical estimation problem, since any given set of sequences might potentially fit more than one model well. Hence a basic question about profile HMMs is whether they are statistically identifiable, meaning that no two distinct profile HMMs can produce the same distribution on finite-length sequences. Indeed, statistical identifiability is a fundamental aspect of any statistical model, and yet it is not known whether profile HMMs are statistically identifiable. In this paper, we report on preliminary results towards characterizing the statistical identifiability of profile HMMs in one of the standard forms used in bioinformatics.
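To illustrate how a profile HMM induces a distribution on finite-length sequences, here is a deliberately simplified generative sketch: one match state per profile position, with per-position deletion and geometric insertion probabilities (real profile HMMs use full transition matrices between match/insert/delete states). All parameter names are ours, for illustration only.

```python
import random

def sample_profile_hmm(match_emissions, p_delete=0.1, p_insert=0.1,
                       alphabet="ACGT", rng=None):
    """Sample one sequence from a toy profile HMM. match_emissions is a list
    of per-position emission distributions, e.g. [{'A': 0.9, 'C': 0.1}, ...].
    At each position we may insert background characters (geometric count),
    delete the position, or emit from its match distribution."""
    rng = rng or random.Random()
    seq = []
    for dist in match_emissions:
        while rng.random() < p_insert:       # geometric number of insertions
            seq.append(rng.choice(alphabet))
        if rng.random() < p_delete:          # skip (delete) this match state
            continue
        chars, weights = zip(*dist.items())
        seq.append(rng.choices(chars, weights=weights)[0])
    return "".join(seq)
```

The identifiability question asks whether two different parameterizations of such a model can nonetheless induce exactly the same distribution over the sequences this sampler produces.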
Longitudinal data are widely used in medicine, demography, sociology, and other areas. Incomplete observations in such data often confound the results of analysis, and a plethora of data imputation methods have been proposed to alleviate this problem. The Stochastic Process Model (SPM) represents a general framework for modeling the joint evolution of repeatedly measured variables and the time-to-event outcome typically observed in longitudinal studies of aging, health, and longevity, making it well suited to imputing missing observations in censored longitudinal data. We applied SPM to this imputation problem, using both Framingham Heart Study and Cardiovascular Health Study data as well as simulated datasets. We also present an R package, stpm, designed for this purpose.
Autism Spectrum Disorder (ASD) is a pervasive and lifelong neurodevelopmental disability for which early treatment has been shown to improve a person's symptoms and ability to function. One of the most significant obstacles to effective treatment of ASD is the challenge of early detection: due to the limited availability of screening and diagnostic instruments in some regions, many affected children remain undiagnosed or are diagnosed late. Recent studies have shown that characteristics of vocalizations could be used to build new ASD screening tools, but most prior efforts are based on recordings made in controlled settings and processed manually, limiting the practical value of such solutions. On the other hand, we are increasingly surrounded by smart devices that can capture an individual's vocalizations, including devices specifically targeted at child populations (e.g., the Amazon Echo Kids Edition). In this paper, we propose a practical and fully automatic ASD screening solution that can be implemented on such devices, capturing and analyzing a child's everyday vocalizations at home without the need for professional help. A 17-month experiment on 35 children is used to verify the effectiveness of the proposed approach, showing that we can obtain an unweighted F1-score of 0.87 for the classification of typically developing and ASD children.
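For reference, an unweighted F1-score such as the 0.87 reported above is typically the macro-average of per-class F1, so the minority class counts as much as the majority; a minimal stdlib sketch (our illustration, not the authors' evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Unweighted (macro-averaged) F1: compute F1 per class, then take a
    plain mean over classes, ignoring class sizes."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```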
Spinal muscular atrophy (SMA) is a common muscle disease that can lead to a high rate of infant mortality. It is important to be able to quickly and accurately diagnose SMA and to track disease progression throughout treatment. This study introduces a framework for deriving movement features from motion-tracking data and applies a regularized regression method to predict the gold-standard clinical measures for SMA, the CHOP INTEND Extremities Scores (CIES). Our results show that the CIES can be predicted with good accuracy using the derived motion features and Elastic Net regression: an RMSE of 8.5 points on the CIES was achieved in both cross-validation and prediction on the held-out set. A high ROC-AUC of 0.91 was achieved for discriminating SMA infants from controls at both the session and subject level. We conclude that motion-tracking devices could potentially serve as a low-cost yet effective means to assess and monitor infants with SMA.
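The ROC-AUC figure reported above has a convenient rank-based interpretation: it is the probability that a randomly chosen positive (SMA) score exceeds a randomly chosen negative (control) score, with ties counted half (the normalized Mann-Whitney U statistic). A stdlib sketch of this computation, ours rather than the authors':

```python
def roc_auc(scores_pos, scores_neg):
    """ROC-AUC as P(positive score > negative score), ties counted 0.5:
    equivalent to the Mann-Whitney U statistic divided by m*n."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```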
Given the vast population of smart-device users worldwide, mobile health (mHealth) technologies hold promise for generating a positive and wide influence on people's health: they can provide flexible, affordable, and portable health guidance to device users. Current online decision-making methods for mHealth assume that users are completely heterogeneous; they share no information among users and learn a separate policy for each user. However, the data for each user is far too limited in size to support such separate online learning, leading to unstable policies with high variance. Moreover, we observe that a user may be similar to some, but not all, users, and that connected users tend to have similar behaviors. In this paper, we propose a network cohesion constrained (actor-critic) reinforcement learning (RL) method for mHealth. The goal is to explore how to share information among similar users to better convert the limited user information into sharper learned policies. To the best of our knowledge, this is the first online actor-critic RL method for mHealth and the first network cohesion constrained (actor-critic) RL method in any application. The network cohesion is important for deriving effective policies. We propose a novel method to learn the network from the warm-start trajectory, which directly reflects the users' properties. The optimization of our model is difficult and very different from general supervised learning due to the indirect observation of values. As a contribution, we propose two algorithms for the proposed online RL methods. Beyond mHealth, the proposed methods can be easily applied or adapted to other health-related tasks. Extensive experimental results on the HeartSteps dataset demonstrate that, in a variety of parameter settings, the two proposed methods obtain clear improvements over the state-of-the-art methods.
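Network cohesion constraints of this kind are commonly expressed as a graph-Laplacian-style penalty that charges connected users for having dissimilar policy parameters; the following stdlib sketch shows that generic penalty (our illustration of the general idea; the paper's exact constraint and optimization may differ):

```python
def cohesion_penalty(policies, edges):
    """Generic network-cohesion regularizer: sum over connected user pairs
    (u, v) of the squared Euclidean distance between their policy parameter
    vectors. Minimizing an objective that includes this term pulls connected
    users toward similar policies, sharing statistical strength."""
    return sum(sum((a - b) ** 2 for a, b in zip(policies[u], policies[v]))
               for u, v in edges)
```

In training, this penalty (scaled by a regularization weight) would be added to each user's actor or critic loss, so a user with little data is anchored to its neighbors in the similarity graph.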
We consider the actor-critic contextual bandit for mobile health (mHealth) intervention. State-of-the-art decision-making algorithms generally ignore outliers in the data set. In this paper, we propose a novel robust contextual bandit method for mHealth. It achieves two conflicting goals: reducing the influence of outliers, while finding a solution similar to that of state-of-the-art contextual bandit methods on datasets without outliers. This performance relies on two techniques: (1) the capped-L2 norm, and (2) a reliable method to set the threshold hyper-parameter, inspired by one of the most fundamental techniques in statistics. Although the model is non-convex and non-differentiable, we propose an effective reweighted algorithm and provide solid theoretical analysis. We prove that the proposed algorithm sufficiently decreases the objective function value at each iteration and converges after a finite number of iterations. Extensive experimental results on two datasets demonstrate that our method achieves almost identical results to state-of-the-art contextual bandit methods on the dataset without outliers, and significantly outperforms those methods on a badly noised dataset with outliers across a variety of parameter settings.
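The intuition behind a capped-L2 loss with a reweighted algorithm can be shown on the simplest possible estimation problem: points whose residual exceeds the threshold have their loss capped, so on each reweighting pass they receive zero weight and stop influencing the estimate. This 1-D sketch is our toy illustration, not the paper's bandit algorithm, and the threshold here is passed in by hand rather than set by the paper's statistical rule.

```python
def capped_l2_mean(values, threshold, iters=20):
    """Iteratively reweighted location estimate under a capped-L2 loss:
    start from the ordinary mean, then repeatedly re-estimate using only
    the points whose residual is within the threshold (the others' loss
    is capped, i.e. weight zero)."""
    est = sum(values) / len(values)          # ordinary (outlier-sensitive) mean
    for _ in range(iters):
        inliers = [v for v in values if abs(v - est) <= threshold]
        if not inliers:
            break                            # cap excludes everything; stop
        est = sum(inliers) / len(inliers)
    return est
```

On [1, 2, 3, 100] the ordinary mean is 26.5, but the reweighted capped-L2 estimate settles on the mean of the inliers, 2.0, once the outlier 100 falls outside the threshold.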
Polycystic ovary syndrome (PCOS) is a common endocrine disorder that affects up to 20% of women; however, diagnosis is commonly unreliable and non-quantitative. Here we use supervised machine learning and measurements of 51 cytokines from a large cohort of patients to identify a low-dimensional set of potential biomarkers for the diagnosis of PCOS. Both whole blood and individual follicular fluid (FF) aspirates were collected from women during oocyte retrieval for intracytoplasmic sperm injection with in vitro fertilization (ICSI/IVF) and linked with the patients' PCOS status as diagnosed by the Rotterdam criteria (n = 69 PCOS, n = 222 non-PCOS). We trained a binary support vector machine (SVM) using a random subset of patient data to determine the cytokine profile associated with PCOS. Our resulting model includes 3 variables and is 76% accurate. This provides insight into the immunological basis of PCOS and may define a potential non-invasive, quantitative strategy for diagnosis.
Over the last decade, joint advances in next-generation sequencing technology and bioinformatics pipelines have dramatically improved our understanding of host-associated and environmental microbiota. Standard microbiome community analysis typically involves amplicon sequencing of the prokaryotic 16S rRNA gene. These sequences are then clustered into operational taxonomic units (OTUs) for downstream diversity analyses, but also to reduce the computational burden and allow rapid analysis of datasets. Taxonomy is then assigned to all reads of an OTU based on the assignment of a representative read. Although straightforward in principle, present methods often rely on heuristics while constructing (or "picking") OTUs to avoid computationally expensive algorithms, and ignore prior knowledge of microbial phylogeny to further reduce computational complexity. Here, we present HmmUFOtu, a novel tool for processing 16S rRNA sequences that addresses major limitations of current OTU picking and taxonomy assignment methods. HmmUFOtu relies on rapid per-read phylogenetic placement, followed by OTU picking and taxonomic assignment based on the phylogeny of known taxa. By benchmarking on simulated, mock community, and real datasets, we show that HmmUFOtu achieves high assignment accuracy, sensitivity, specificity, and precision, even at species-level resolution. Compared to standard pipelines, HmmUFOtu more accurately recapitulates community diversity and composition. HmmUFOtu can perform taxonomic assignment against a species-resolution reference tree with ~200,000 nodes for 1 million 16S sequencing reads within 6 hours on a modest Linux workstation with 16 processors and 32 GB RAM. HmmUFOtu is written in C++98 and freely available at https://github.com/Grice-Lab/HmmUFOtu/.
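To illustrate the heuristic OTU-picking step that HmmUFOtu replaces with phylogenetic placement, here is a greedy centroid-based clustering sketch for equal-length reads, using a mismatch radius as a simplified stand-in for the usual 97%-identity threshold (our toy illustration, not HmmUFOtu's algorithm):

```python
def greedy_otu_pick(reads, max_mismatches):
    """Greedy centroid-based OTU picking: each read joins the first OTU
    whose representative (centroid) is within the mismatch radius;
    otherwise it seeds a new OTU. Order-dependent, like the heuristic
    pickers used in standard pipelines."""
    otus = []  # list of (representative_read, member_reads)
    for read in reads:
        for rep, members in otus:
            if sum(a != b for a, b in zip(rep, read)) <= max_mismatches:
                members.append(read)
                break
        else:
            otus.append((read, [read]))
    return otus
```

The order dependence and the arbitrary radius are exactly the kinds of heuristics the abstract criticizes; placement onto a reference phylogeny avoids both.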
Cancer cells contain thousands of mutated genes, copy number alterations, and differentially expressed genes. The progression of cancer differs from patient to patient, so identifying the key proteins and pathways in an individual patient's molecular profile has become important for personalized medicine. In the first step of our proposed pipeline, gene mutations, gene expression profiles, copy number variations, and clinical data of lung adenocarcinoma (LUAD) patients are downloaded from TCGA. Significant genomic variations are determined using the R MADGiC and GAIA packages. Using the R DESeq2 package, the most active differentially expressed genes are determined for the patients (n=55) for whom adjacent normal tissue RNA-seq expression levels are available. The most active pathways are determined with the Cytoscape jActiveModules plugin based on expression levels. For significant genomic variations and gene expression levels, MDS plots and Kaplan-Meier survival analyses of the patients are performed. The most mutated genes in 565 LUAD samples were identified with the TCGAbiolinks package. We found that TP53, a known tumor suppressor gene, is mutated in 48% of the patients. Survival analysis for the 55 LUAD patients clustered using K-means clustering (k=2) was performed; the results show that the survival probabilities of the two clusters do not differ significantly. The goals of this study are to 1) computationally identify the most significant genes whose mutation and expression profiles correlate with patient survival time, 2) verify the significance of the results against those of an earlier study conducted on the TCGA LUAD dataset, and 3) provide an open-source automated pipeline.
Over the past two decades, the field of nanotechnology - aimed at designing, characterizing, and producing materials on a nanometer scale - has grown rapidly and has revolutionized many aspects of our lives. The incorporation of engineered nanoparticles (NPs) in various industries (e.g., electronics, manufacturing, construction), consumer products (e.g., cosmetics, food packaging), and biomedicine poses an increased risk of exposure in humans. Human exposure to carbonaceous nanomaterials (CNMs) can occur via treatment of various diseases, as well as through their presence in manufacturing, occupational, and environmental settings. As technology advances to incorporate CNMs in industry, the potential exposures associated with these particles also increase. CNMs have been found to be associated with substantial pulmonary toxicity, including inflammation, fibrosis, and/or granuloma formation. This study attempts to categorize the toxicity profiles of various carbon allotropes, in particular carbon black, different multi-walled carbon nanotubes, graphene-based materials, and their derivatives. Statistical and machine learning based approaches were used to identify groups of CNMs with similar pulmonary toxicity responses, drawing on a panel of proteins measured in bronchoalveolar lavage (BAL) fluid samples and on similar pathological outcomes in the lungs. The particles, grouped by their pulmonary toxicity profiles, were then used to select a small set of proteins that could potentially identify and discriminate between the biological responses associated with each group. Specifically, MDC/CCL22 and MIP-3/CCL19 were identified as common protein markers associated with both toxicologically distinct groups of CNMs. In addition, the persistent expression of other selected protein markers in BAL fluid from each group suggested their ability to predict toxicity in the lungs, i.e., fibrosis and microgranuloma formation.
Such approaches can have positive implications for further research in toxicity profiling.
Precision Medicine (PM) has attracted increasing attention in biomedical research. Extracting information from the biomedical literature about protein-protein interactions affected by mutations is a vital step towards PM, because it uncovers mechanisms leading to diseases. We investigate a feature-rich supervised method to address this relation extraction challenge. Our approach leverages a novel combination of features, as well as two auxiliary corpora, to achieve up to a 44% improvement in F1-score over the baseline method.
Modeling and simulation software now provides us with a view of the structure space navigated by peptides and proteins under physiological conditions. Such software, for example molecular dynamics packages, yields trajectories of consecutive structures accessed by a dynamic molecule, but does not readily expose the underlying organization of the structure space, nor summarize the equilibrium dynamics over the structural states present. In this paper we investigate the ability of Markov State Models to do so. While we make use of established software, we analyze different design decisions within it and measure their impact on the obtained results. We present our findings on optimal design decisions, revealing in the process the dynamics of the Met-enkephalin peptide.
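At the core of any Markov State Model is a row-normalized transition count matrix estimated from a discretized trajectory, whose stationary distribution summarizes the equilibrium populations of the structural states; a stdlib sketch of that estimator (our illustration of the standard construction, not the specific software analyzed in the paper):

```python
def msm_transition_matrix(traj, n_states, lag=1):
    """Row-normalized transition matrix of a discretized trajectory at a
    given lag time: the maximum-likelihood MSM estimator from transition
    counts (no reversibility constraint, for simplicity)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(traj, traj[lag:]):
        counts[a][b] += 1
    T = []
    for row in counts:
        s = sum(row)
        T.append([c / s if s else 0.0 for c in row])
    return T

def stationary(T, iters=200):
    """Stationary distribution by power iteration: equilibrium state
    populations implied by the model."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi
```

Design decisions such as the lag time and the state discretization (the ones examined in the paper) change `traj` and `lag` here, and hence the resulting model.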
Understanding the molecular mechanisms underlying early cancer development is still a challenge. To address this, we developed an interpretable, data-driven machine learning approach to identify gene biomarkers that predict the clinical outcomes of early cancer patients. As a demonstration, we applied this approach to large-scale pan-cancer datasets, including TCGA, to determine how effective it is at identifying developmental gene expression biomarkers across tumor stages for various cancer types. The results confirmed that artificial neural network prediction with embedded nonlinear feature selection outperforms other classifiers. Moreover, and more relevant to the goal of interpretable machine learning classifiers, we found that early cancer patient groups clustered by the selected biomarkers show significantly greater survival differences than groups defined by early TNM stages, suggesting that this method identifies novel early-cancer molecular biomarkers. Furthermore, using lung cancer as a case study, we leveraged the hierarchical architecture of the neural network to identify the developmental regulatory networks controlling the expression of early cancer biomarkers, providing mechanistic insight into the functional genomics driving the onset of cancer development. Finally, we report drugs targeting early cancer biomarkers, revealing potential genomic medicines affecting early cancer development.
Evaluating intelligent search supported by advanced natural language processing (NLP) technologies is labor-intensive, and the related tasks are tedious. This study introduced user relevance feedback procedures and relevance measures to evaluate our SPRIT-NLP semantic search system. The historical protocol archives of our organization were annotated with UMLS (Unified Medical Language System) concepts and indexed by Solr to test these evaluation settings. The outcome demonstrated that concept-based semantic search is very effective at retrieving many categories of clinical queries.
For many women, online health communities, such as BabyCenter.com, provide mediums to quell doubts and receive answers amidst the pressing uncertainties of pregnancy. Women contributing to such community forums often suffer from complications such as postpartum depression, and likely want their posts addressed in a timely and adequate manner. This work examined quantitative and qualitative factors that contribute to various levels of responsiveness to posts in BabyCenter.com postpartum depression online health communities. Our aim was to identify post characteristics conducive to higher levels of engagement from online health forum contributors. In this study, we analyzed characteristics of posts (length of the main text, time of day, presence of exclamation points or question marks in the title) to see whether there was a relationship between the number of community comments (as a measure of engagement) and varying levels of these characteristics. The number of comments was used as a measure of engagement because it estimates the extent to which community members were drawn to and felt compelled to interact with the post. For each of 100 randomly selected posts (from 3 groups related to postpartum depression and anxiety), we generated summary statistics and performed two-sample t-tests. For the length of the main post, a regression analysis was performed as well. In the end, we found no significant differences in engagement resulting from the three variables. For time of day, the average number of comments was 14.25 for AM posts, whereas the average for PM posts was 7.87 (p-value = 0.054, 95% CI: -0.11, 12.9; Figure 1). Length of the main post did not appear to predict level of engagement by online health community members (R2=0.0006, p-value=0.814, Figure 1).
The difference in number of comments for posts with more than 148 words (the median length) compared to posts with fewer than 148 words was also non-significant (p-value=0.58, 95% CI: -5.1, 2.8). Differences in engagement for posts with punctuation (an exclamation point or question mark) in the title (N=33) compared to those without (N=67) were non-significant as well (p-value=0.81, 95% CI: -4.0, 5.1) (Figure 1). The strengths of this pilot study lie in revealing characteristics that may appeal to users responding on online health community forums; it also sets the stage for future work investigating user behavior trends on social question-and-answer sites. The limitations include the small sample size, given our use of 100 randomly selected forum posts. Future work will assess larger data sets and examine more in-depth characteristics such as content and previous user behavior. For those participating in online health forum discussions, this research provides insight into factors that may foster a more reciprocal, communal environment for posting questions and comments. This process may lead to health benefits by providing better social support to posters managing postpartum depression.
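The two-sample comparisons above are typically done with Welch's t-test, which does not assume equal variances between groups (e.g., AM vs. PM posts); a stdlib sketch of the statistic and its Welch-Satterthwaite degrees of freedom (our illustration, not the study's analysis code):

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of
    freedom for comparing group means without assuming equal variances."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)   # sample variances
    se2 = va / na + vb / nb                           # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The p-value then comes from the t distribution with `df` degrees of freedom (via a statistics library or table).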
The coordination of genomic functions is a critical and complex process in biological systems, especially across phenotypes or organism states (e.g., time, disease, organism). Understanding how the interactions of various genomic functions relate to these states remains a challenge. To address this, we have developed a machine learning method, ManiNetCluster, which integrates and simultaneously clusters multiple gene networks to identify cross-phenotype functional gene modules, revealing genomic functional linkages. In particular, this method extends manifold learning to match local and nonlinear structures among networks, maximizing functional connectivity. For example, we show that ManiNetCluster aligns orthologous genes from cross-species gene expression datasets significantly better than linear state-of-the-art methods. As a demonstration, we applied our method to temporal gene co-expression networks of an algal day/night-cycling transcriptome. This demonstration i) confirmed the validity of our clustering method and ii) revealed the day-night linkages of photosynthetic functions, providing novel insight into temporal genomic functional coordination in bioproduction.
Understanding drug-target interactions at both the protein and pathway signaling network level in healthy vs. disease states is critical for the success of drug discovery and development. In the post-genomic era, quantitative chemical proteomics is emerging as a powerful tool to identify and validate novel druggable targets by means of (i) deconvolution of the molecular mechanism of action (MMoA); (ii) proteome selectivity assessment of bioactive molecules; and (iii) druggability assessment for proteins of therapeutic interest with unclear MMoA identified through diverse approaches. Internally, we utilize biochemical assays, transcriptomics, and chemoproteomics experiments to elucidate drug-target interactions from various angles. At its core, chemoproteomics has the unprecedented power to unbiasedly discover, and unambiguously quantify, hundreds to thousands of protein interactions in a disease-relevant biological system perturbed with a controlled chemical insult, such as a drug or investigational molecule. Yet effective means to systematically mine, integrate, and derive relevant information from such complex big data sets remain a scientific frontier today. To maximize the value of various chemical biology data and to fuel the hypothesis generation-testing cycle in Target Identification and Validation (TIDVal), we invested in a Chemical Biology Data Management System (CBDMS) as the infrastructure foundation to capture systems-biology perspectives of drug-proteome and transcriptome dynamics, on-target engagement, off-target effects, and polypharmacology. Herein we report the current progress of this endeavor, particularly in chemical proteomics data handling, analysis, and visualization, in the context of several exemplary chemical proteomics experiments to identify novel targets.
Chemoproteomics is a powerful mass spectrometry-based affinity chromatography approach for identifying proteome-wide small molecule-protein interactions. It aims for unbiased determination of drug targets in a complex cellular environment. Chemoproteomics has been one of the central methods of choice for small molecule mechanism of action (MOA) deconvolution of phenotypic screen hits, as well as for understanding selectivity and off-target biological activities. In order to understand the modulation of the human proteome with small molecules in a comprehensive and systematic manner, a chemically diverse probe set with drug-like characteristics has been selected and profiled against 8 relevant biosamples, including cells and human organ tissues, to delineate protein target binding spectra in an unbiased manner at a global scale. In this work, we will update progress-to-date on experimental design, optimization, and current findings from this unprecedentedly rich system-chemical biology dataset. We will use examples from this study to highlight the cheminformatics and bioinformatics solutions that we developed to address the unique challenges of chemical biology/chemical proteomics data. Insights from this chemoproteomics profiling effort will be discussed from the perspectives of: 1) compound selectivity in the context of diverse biological samples, beyond the industry standard practice of using an in vitro recombinant protein profiling panel or one or two model cell lines, 2) frequent targets and chemotype hitters, as well as 3) examples of novel potential targets. These efforts to develop a unique human chemoproteomic database, together with chemo-genomic and transcriptomic approaches, provide chemical biologists the means to prosecute novel target identification and subsequent validation studies in support of relevant disease areas.
The analysis of the relations among diseases and genetic aspects of individuals is based on the analysis of data produced by high-throughput experimental technologies, such as Single Nucleotide Polymorphism (SNP) genotyping data. We present a novel data analysis pipeline for SNP data, named Services4SNPs (S4S), that includes two previously developed data analysis tools, DMET-Miner and OSAnalyzer, which have been engineered and modified to be deployed as RESTful web services, named GenotypeAnalytics (GA) and OSAnalytics (OSA), respectively. S4S tries to overcome the limits of desktop bioinformatics software by moving complexity to the server side, allowing users to easily extract multiple associations between SNPs in DMET datasets and correlate the presence or absence of SNPs with the overall survival of subjects in DMET datasets annotated with clinical information.
The design of synthetic vaccine peptides and other constructs (e.g., for developing immunodiagnostics) is informed by B-cell epitope prediction for antipeptide paratopes, which crucially depends on physicochemically and biologically meaningful interpretation of pertinent experimental data on paratope-epitope binding, with negative data being particularly problematic as they may be due to artefacts of immunization and immunoassays. Yet, the problem posed by negative data remains to be comprehensively addressed in a manner that clearly defines their role in the further development of B-cell epitope prediction. Hence, published negative data were surveyed and analyzed herein to identify key issues impacting B-cell epitope prediction. Data were retrieved via searches using the Immune Epitope Database (IEDB) and review of underlying primary sources in the literature to identify said issues, which include (1) an inherent tendency toward false-negative data with the use of solid-phase immunoassays and/or monoclonal paratopes, (2) equivocal data (i.e., both positive and negative data obtained from similar experiments), and (3) failure of antipeptide paratopes to cross-react with antigens of covalent structure and/or conformation different from that of the peptide immunogens despite apparent identity between curated epitope sequences. Analysis of experimental details thus focused on negative data from fluid-phase (e.g., immunoprecipitation) assays for detection of polyclonal paratope-epitope binding. Underlying literature references were reviewed to confirm the identification of negative data included for analysis.
Furthermore, data from assays to detect cross-reaction of antipeptide antibody with protein antigen were included only if supported by positive data on either the corresponding reaction of the same antibody with peptide antigen or cross-reaction of said antibody with denatured protein antigen, to exclude the possibility that negative data on cross-reaction were due to absence of antipeptide paratopes in the first place (e.g., because of failed immunization due to insufficient immunogenicity and/or immune tolerance). Among currently available negative binding data on antipeptide antibodies, very few are on polyclonal responses yet also clearly attributable to conformational differences between peptide immunogens and native cognate proteins thereof. This dearth of negative data suitable for benchmarking B-cell epitope prediction conceivably could be addressed by generating positive data on binding of polyclonal antipeptide antibodies to cognate-protein sequences (e.g., in solid-phase immunoassays using unfolded protein antigen) to complement negative data on failure of the same antibodies to cross-react with native protein (e.g., in fluid-phase immunoassays, without artefactual covalent modification of antigens that tends to produce false-negative results). As regards cross-reactive binding of native cognate proteins by antipeptide antibodies (e.g., as mechanistic basis for novel vaccines and immunotherapeutics), negative data are most informative where attributable to conformational differences between peptide immunogens and target proteins. This is favored by careful peptide-immunogen design (e.g., avoiding covalent backbone and sidechain differences vis-a-vis target protein sequence) and positive data on antibody binding of the target protein sequence (e.g., in unfolded protein) paired with negative data on the same antibody using native protein antigen (e.g., from fluid- rather than solid-phase assays).
The amount of data available in public bioinformatics resources and the complexity of user interfaces they are served through often challenges appreciation and effective utilization of these valuable resources. While education, documentation and training activities mitigate this problem, there is still a need to develop user interfaces to serve simple day-to-day needs of scientists. To this end, we developed ProSetComp; a simple web-based platform to create and compare protein sets, following a traditional software development process; from requirement analysis to implementation. First, we interviewed and collected user scenarios from wet lab scientists with seniority, research interests and backgrounds. Reviewing the user scenarios, we identified one high impact need that drove the development of ProSetComp; ability to 1) create protein sets by searching databases, 2) compare these protein sets in different dimensions such as functional domains, pathways, molecular functions and biological processes, and 3) visualize results graphically. Next, we collected and integrated necessary data from several bioinformatics resources including UniProt, Reactome, Gene Ontology and PFAM in a local relational database. Finally, we designed user interfaces that facilitate the creation of protein sets by using form-based query generators and exploring the relationship between created protein sets using tabular and graphical representations. The current internal release of the platform contains ~120 million protein entries. The user interface supports >50 search criteria to create up-to four protein sets and comparison of these sets in four dimensions; protein domains, molecular functions, biological processes, and pathways. The commonality and differences between protein sets, along with tables, can be explored using novel user interface components such as Venn and UpSet diagrams. 
The first public release of ProSetComp (http://ceng.mu.edu.tr/labs/bioinfo/prosetcomp) is targeted for mid-August, 2018 and planned to be updated monthly thereafter. Upon public release, source code ProSetComp will become available through GitHub. The database content and user interface will be expanded as per community needs. The ProSetComp project is supported by The Scientific and Technological Research Council of Turkey (TUBITAK, Grant number: 216Z111).
Preterm birth affected about 10% of infants born in the U.S. in 2016. This project was a secondary analysis of data drawn from a preterm birth prediction study to assess whether psychological symptom clusters exist among pregnant women. A symptom cluster exists when two or more symptoms co-occur, are related to each other, and are stable. We found one psychological symptom pair that satisfied these conditions: anxiety and self-esteem. This finding has the potential to help guide symptom assessment among pregnant women.
In this paper, we propose new methods to rank candidate p53 inhibitors, considering both radioprotective function and cytotoxicity. We use features of compound structure, including fingerprints, together with machine learning and ranking methods. As a result, we present regression models of cytotoxicity and radioprotective function that determine the rankings.
Lung cancer is the leading cause of cancer death in many countries. Interstitial lung disease (ILD), though not itself a cancer, affects people more severely than many kinds of cancer. While ILD and lung cancer often occur concomitantly, the cause is still unclear. We intend to find the key factors that make patients suffer from both ILD and lung cancer rather than lung cancer alone.
The widely deployed and easy-to-use Linguistic Inquiry and Word Count (LIWC) tool is the gold standard for computerized text analysis in many medical applications such as patient sentiment analysis, depression detection, and ADHD detection. Compared to most other natural language processing (NLP) tasks, in the medical field it is often very difficult to obtain large-scale data sets, making effective automatic representation learning from complex text patterns (e.g., using a deep auto-encoder) challenging. LIWC can solve this problem by using a human-designed dictionary as a substitute for a machine learning model to convert text into a concise and effective vector representation. However, while LIWC's dictionary is large, some potentially informative words might still be neglected due to the knowledge constraints of the dictionary editors. This problem is particularly conspicuous when the analyzed text is not formal language (e.g., dialect, slang, or cyber words). To address this problem, we propose a new matching scheme that does not require an exact word match, but instead counts all words that are similar to a key in the LIWC dictionary. This scheme is implemented using WordNet, a large lexical database, and Word2Vec, a machine learning based word embedding technology. The output of the proposed method is in the exact same format as LIWC's output, thereby maintaining the usability. Similar to previous work, the proposed method can be viewed as a combination of human domain knowledge and machine learning for text representation encoding.
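The relaxed matching scheme described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the tiny embedding vectors, the 0.8 similarity threshold, and the "negemo" category are made-up toy data.

```python
# Toy sketch of similarity-based (rather than exact) LIWC-style counting:
# a token matches a dictionary key if its embedding is close enough.
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fuzzy_liwc_counts(tokens, categories, embeddings, threshold=0.8):
    """Count, per category, tokens whose vector is within `threshold`
    cosine similarity of any dictionary key (exact matches included)."""
    counts = {cat: 0 for cat in categories}
    for tok in tokens:
        for cat, keys in categories.items():
            if tok in keys:
                counts[cat] += 1
            elif tok in embeddings and any(
                k in embeddings and cosine(embeddings[tok], embeddings[k]) >= threshold
                for k in keys
            ):
                counts[cat] += 1
    return counts

# Toy data: "bummed" is slang absent from the dictionary but close to "sad".
emb = {"sad": [1.0, 0.1], "bummed": [0.95, 0.2], "happy": [-1.0, 0.3]}
cats = {"negemo": {"sad", "angry"}}
print(fuzzy_liwc_counts(["i", "feel", "bummed", "and", "sad"], cats, emb))  # → {'negemo': 2}
```

In practice the embeddings would come from a pretrained Word2Vec model, and the candidate keys could be prefiltered (e.g., via WordNet synsets) to keep the scan tractable.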
Kinase domain mutations in the epidermal growth factor receptor (EGFR) are common drivers of lung adenocarcinoma. The 1st generation EGFR tyrosine kinase inhibitors (TKIs) gefitinib and erlotinib, the 2nd generation TKI afatinib, and the 3rd generation TKIs osimertinib and rociletinib inhibit mutant EGFRs. While all of these TKIs are active against the TKI-sensitizing EGFR mutants, L858R and exon 19 deletion (Del) EGFR, only the 3rd generation TKIs are effective against EGFR T790M, the most common mechanism of acquired resistance to 1st and 2nd generation TKIs. Patients often have a good initial response to these drugs, but resistance inevitably develops, due either to additional EGFR mutations or to activation of parallel signaling pathways. To understand the mechanisms of resistance to the 3rd generation EGFR TKIs, we conducted a mass spectrometry-based phosphoproteomic analysis comparing rociletinib-resistant and rociletinib-sensitive lung cancer cells. Using iPTMnet, a PTM resource that integrates data from text mining of the scientific literature and other PTM databases, we found that the AKT and PKA kinases targeted many of the sites whose phosphorylation was up-regulated in resistant cells; these kinases may be part of signaling pathways that are aberrantly activated in these cells. Next, we used kinase-inhibitor target data (KinomeScan) and phosphoproteomic data (P100) from the NIH Library of Integrated Network-Based Cellular Signatures Program (LINCS; http://www.lincsproject.org/) to identify drugs that might overcome drug resistance. Our study demonstrated that PTM knowledge networks can be used in conjunction with phosphoproteomic data to identify aberrantly regulated kinase signaling pathways in drug resistant cells, and that LINCS data (KinomeScan and P100) can be used to identify candidate drugs for combination therapy to overcome resistance.
In our ongoing work, we are testing drugs identified by LINCS analysis in cell culture assays, extending the analysis to other TKIs, and automating our workflow for overlay of PTM knowledge maps, LINCS data, and cancer omics data.
Interactive information extraction (IE) systems supported by biomedical ontologies are intelligent natural language processing (NLP) tools to understand literature and clinical narratives and discover meaningful domain knowledge from unstructured text. This study developed integrated IE systems to detect treatment complications of blood cancer patients from Electronic Medical Records (EMR) in the Long-Term Follow-Up (LTFU) protocol following Hematopoietic Cell Transplantation (HCT). The performance of the proposed approach was very encouraging compared to the gold-standard datasets manually reviewed by domain experts. In addition, the NLP system identified a significant number of cases not caught by the experts.
While many factors influence the fatigue experienced by patients undergoing radiation therapy (RT), we hypothesize that expression of genes related to oxidative stress can be predictive of RT-related fatigue. In this work, we present a two-phase scheme which first selects a limited subset of genes deemed most predictive by a regularized elastic net, followed by a widely used classifier, the regularized random forest, to discriminate patients with high fatigue from those with low fatigue during RT. The model achieved 80% accuracy (0.80 AUC) in cross-validation. Initial results suggest that several genes, such as PRDX5, FHL2, and GPX4, are consistently selected by the proposed scheme, showing promise as potential predictors of RT-related fatigue, and may provide insight into its biologic underpinnings.
Chronic inflammation associated with inflammatory bowel disease (IBD) results in increased oxidative stress that damages the colonic microenvironment. A low level of serum bilirubin, an endogenous antioxidant, has been associated with increased risk for Crohn's disease (CD), but no study has tested this in ulcerative colitis (UC), the other common IBD. Bilirubin is metabolized in the liver exclusively by uridine glucuronosyltransferase 1A1 (UGT1A1). Genetic variants cause functional changes in UGT1A1 that result in hyperbilirubinemia, which can be toxic to tissues if untreated and produces a characteristic jaundiced appearance. Approximately 10% of the Caucasian population is homozygous for the microsatellite polymorphism UGT1A1*28, which results in increased total serum bilirubin levels due to reduced transcriptional efficiency of UGT1A1 and an overall 70% reduction in UGT1A1 enzymatic activity. The aim of this study was to examine whether bilirubin levels are associated with the risk for UC. Using Informatics for Integrating Biology and the Bedside (i2b2), a large case-control population was identified from a single tertiary care center, Penn State Hershey Medical Center (PSU). Similarly, a validation cohort was identified at Virginia Commonwealth University Medical Center. Logistic regression analysis was performed to determine the risk of developing UC with lower concentrations of serum bilirubin. From the PSU cohort, a subset of terminal ileum tissue was obtained at the time of surgical resection to analyze UGT1A1 gene expression (which encodes the enzyme responsible for bilirubin metabolism). Similar to CD patients, UC patients also demonstrated reduced levels of total serum bilirubin. Upon segregating serum bilirubin levels into quartiles, risk of UC increased with reduced concentrations of serum bilirubin. These results were confirmed in our validation cohort.
UGT1A1 gene expression was up-regulated in the terminal ileum of a subset of UC patients. Lower levels of the antioxidant bilirubin may reduce the capability of UC patients to remove reactive oxygen species, leading to increased intestinal injury. One potential explanation for these lower bilirubin levels may be up-regulation of UGT1A1 gene expression, which encodes the only enzyme involved in conjugating bilirubin. Therapeutics that reduce oxidative stress may be beneficial for these patients.
Understanding anther development is crucial in determining traits that are important for crop breeding. Maize is one of the most studied grass species with well-defined anther developmental stages. In this work, we use a network approach to build a maize anther interactome using anther specific RNAseq and small RNA (sRNA) data.
While various measures are available for computing sentence similarity, few studies have examined their performance in the biomedical domain. Motivated by BIOSSES, an earlier study of biomedical sentence similarity, we here explore the effectiveness of multiple similarity measures via sentence ranking in PubMed abstracts. Ranking sentences is a crucial component of text summarization and biocuration evidence attribution. Applied to the "natural language processing" and "computational biology" datasets, our experimental results show that the off-the-shelf measures for sentence similarity may not be effective for ranking sentences. Neither lexical nor semantic measures achieved an NDCG score above 0.60 for the top-ranked document. This necessitates the development of a large-scale benchmark set and more effective measures.
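For reference, the NDCG metric quoted above can be computed as in this small sketch; it is a generic implementation with invented toy relevance grades, not the evaluation code used in the study.

```python
# NDCG@k: discounted cumulative gain of a ranked list, normalized by the
# ideal (best possible) ordering of the same relevance grades.
import math

def dcg(gains):
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    """ranked_gains: relevance grades in the order the system ranked them."""
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / denom if denom else 0.0

# A system that puts a grade-1 sentence first when a grade-3 one exists
# scores poorly at the top rank:
print(round(ndcg_at_k([1, 3, 2, 0], k=1), 3))  # → 0.333
```

An NDCG@1 below 0.60, as reported above, means the top-ranked sentence typically carries well under the ideal relevance grade.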
We recently developed the Genomics Research Integration System (GRIS) to help NIAID investigators at the NIH leverage both phenotypic and genotypic patient data to identify causal variants for rare diseases. The project is a bioinformatics complement to an initiative to sequence exomes for all NIAID patients visiting the NIH Clinical Center. The system is designed to serve as a valuable resource for clinical genomic data annotated with standardized phenotypic terms from the Human Phenotype Ontology (Kohler et al., 2013). GRIS uses PhenoTips (Girdea et al., 2013) to capture clinical records and family pedigrees, which are linked to genomic records stored in seqr, a genetic analysis tool developed at the Broad Institute (seqr.broadinstitute.org), to enable causal variant identification. We have customized both programs in novel ways to meet NIH encryption requirements, to link patient records across programs in a controlled manner, and to provide "tiers" of access so that individual research groups can customize users' ability to edit their patient records and view personally identifiable information (PII). A challenge faced by shared clinical data repositories is to facilitate maximal collective research value of data through open sharing, while respecting the needs of researchers to adjust access to patient data in accordance with research goals and subject to clinical sharing guidelines. We devised a technical approach to meet the needs of sharing policies, formulated collectively by researchers and clinicians, to promote wider acceptance and usage of the system. Accordingly, we implemented a patient identifier mapping system in conjunction with automated notifications to enable transparent sharing.
Our approach may prove helpful to other hospital or clinical support systems seeking to respect the confidentiality of patient PII and early findings of individual researchers, while recognizing that data repositories are most primed for discovery (and can significantly increase return on investment) if they are open and accessible to a larger research community.
Among the various risk adjustment models for Medicare and Medicaid programs, the CMS-HCC model is a prospective model for Medicare Advantage (MA) plans which ensures a plan is paid according to the expected risk of the population it is responsible for. The risk score computed by the CMS-HCC model is called the Risk Adjustment Factor (RAF). RAF scores are prospective in nature: data from the previous year of service is used to predict the expected risk, and hence the prospective payment, in the current year. In this paper we discuss how early prediction of RAF can help realize two revenue opportunities, as detailed below. The first is the Accelerated Revenue Opportunity. As an example, consider the service year 2016 and the payment year 2017. Based on the RAF timeline, if some of the services rendered in the second half of 2016 had instead been rendered in the first half of 2016, they could have counted towards the subsequent payments starting from Jan 2017 rather than a late (lump-sum) payment in August 2017. The second, the Incremental Additional Revenue Opportunity, relates to payments that never get realized even with some delay. Our contribution in this work is two-fold: (1) We present a method to identify a candidate list of individuals for early inspection at the beginning of any given year. (2) We propose methods to evaluate a given candidate list in terms of its ability to realize the two revenue opportunities mentioned above. The proposed evaluation methods can be used more generally to evaluate other predicted candidate lists. The core of our solution is the identification of a candidate list of individuals to be inspected early in any given year. Our proposed method uses RAF scores from the previous two years, together with the monetary value of claims under certain condition categories in the past year, and applies machine learning to predict the top 20% of high RAF scores in the current year.
The resulting list is considered the candidate list for early RAF inspection at the beginning of the current year. Evaluation of Revenue Opportunities: We demonstrate our methodologies for evaluating revenue opportunities by considering a scenario in which 2014-2015 data from a large healthcare organization was used to predict the 2016 RAF scores early in 2016, and study opportunities that could have been realized using such early prediction. To evaluate the accelerated revenue opportunity, we looked at the lump-sum adjustments made in Aug 2017 for the individuals in the candidate list and summed the adjustments where positive. These late payments could have been received earlier if the diagnosis codes identified in the second half of 2016 had instead been identified in the first half. Our method for evaluating the incremental additional revenue opportunity is based on the assumption that many conditions in Medicare populations are chronic, so the 2017 RAF scores could have been realized earlier, in 2016, through early inspection. Specifically, we compared the 2017 and 2016 RAF scores for the individuals in the identified candidate list and determined the cases for whom the 2017 RAF score is sufficiently larger than the 2016 RAF score. We consider the delta in RAF scores sufficiently large if the ratio of 2017 RAF over 2016 RAF for an individual is larger than c = average(2017 RAF)/average(2016 RAF), where the averages are taken across the population. The delta between RAF 2017 and c×(RAF 2016) marks the missing revenue opportunity for an individual. In the case study mentioned above, our analysis indicated that full RAF potential was not captured for 41% (2,048) of the members. The unrealized 2017 payments for these members (based on the unrealized RAF potentials in 2016) are estimated at $12.69 million.
In addition to the missed revenue opportunity observed in the analysis, a significant amount of revenue, approximately $5 million, could have been accelerated to January 2017. This approach can be used in conjunction with existing analytics to introduce new sources of revenue. The case studies on RAF analytics describe the work in more detail. References: CMS Risk Adjustment, https://www.cms.gov/Medicare/Health-Plans/MedicareAdvtgSpecRateStats/Risk-Adjustors.html; Predictive Risk Adjustment Factor (RAF), http://www.basehealth.com/raf.html
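Our reading of the incremental-opportunity rule above can be sketched as follows; the function name and the toy RAF values are ours, not from the case study.

```python
# Sketch of the incremental additional revenue opportunity: a member's delta
# counts when RAF_2017 / RAF_2016 exceeds c = mean(RAF_2017) / mean(RAF_2016),
# and the unrealized amount is RAF_2017 - c * RAF_2016.

def incremental_opportunity(raf_2016, raf_2017):
    """Both arguments are lists of RAF scores aligned by individual."""
    c = (sum(raf_2017) / len(raf_2017)) / (sum(raf_2016) / len(raf_2016))
    deltas = []
    for r16, r17 in zip(raf_2016, raf_2017):
        if r17 / r16 > c:                 # sufficiently large increase
            deltas.append(r17 - c * r16)  # unrealized RAF potential
        else:
            deltas.append(0.0)
    return c, deltas

# Toy cohort of three members:
c, deltas = incremental_opportunity([1.0, 2.0, 1.0], [1.5, 2.0, 1.3])
```

On this toy cohort c works out to 1.2, so only the first and third members, whose year-over-year RAF ratio exceeds c, contribute a positive delta.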
Recent advances in DNA sequencing technologies have transformed the study of DNA sequence variation. Over the last decade, a number of functional impact predictors and annotation tools have been developed to aid in DNA variant analysis. While many annotation tools and pipelines have been built to annotate nuclear genome variants, only a few software predictors address the thousands of variants found in human mitochondrial DNA. Many prediction tools built for nuclear DNA have been retrofitted to annotate mitochondrial DNA, but because of the vast differences between the two, nuclear annotators fail to produce accurate predictions for mitochondrial mutations. Conventional tools such as SIFT and PolyPhen2 produce less than accurate pathogenicity scores for mitochondrial variants. More recently, tools such as APOGEE have addressed the need for specialized tools to annotate mtDNA exonic variants with high confidence. In addition, most annotation tools only annotate exonic mutations, yet variants in mitochondrial tRNA and rRNA are important and a common cause of mitochondrial disease. A few tools, such as MitoTIP, address the need to accurately predict the pathogenicity of tRNA variants, while no known tools annotate the pathogenicity of mitochondrial rRNA variants. We have constructed a comparative analysis of both standard and non-standard annotation tools and their ability to accurately predict the pathogenicity of mitochondrial mutations. We carefully curated a complete list of all potential non-synonymous exonic, tRNA, and rRNA mitochondrial mutations and ran the selected tools on each dataset. We analyzed the accuracy and precision of each tool against the consensus among the tools combined with pathogenicity predictions from MITOMAP disease associations.
Over the course of our testing, we confirmed that many of the prediction tools typically used for nuclear DNA performed poorly when tested on mitochondrial DNA. Newer annotation tools built specifically for mtDNA, such as APOGEE, had higher overall assessment scores. Based on our analysis, we are creating an online annotation tool specifically for mtDNA variants that integrates pathogenicity scores from our top-rated prediction tools.
Participants (n=44, age 4-15 yrs) with food allergy to multiple foods, proven by double-blind, placebo-controlled food challenge, were administered omalizumab (anti-IgE, n=40) or placebo (n=4) for 16 weeks together with oral immunotherapy (OIT) for 2-5 foods, starting 8 weeks after the beginning of omalizumab or placebo (clinical outcomes of this trial are reported in Andorf et al., 2018). To better understand the immunophenotypical changes leading to successful desensitization, we interrogated changes in immune cell subtypes in PBMCs before and after successful OIT using mass cytometry (CyTOF) on unstimulated as well as PMA/ionomycin-stimulated samples. The first step in this analysis was unsupervised clustering, across the markers within the CyTOF panel used for cell type identification (lineage markers), of a pooled dataset of all cells from the samples at the two time points. This was done with FlowSOM (Van Gassen et al., 2015), using self-organizing maps followed by hierarchical consensus meta-clustering. The immune cell subtype of each cluster was determined based on the expression levels of the lineage markers of the cells within that cluster. The median levels of various functional markers within each cluster were determined individually for each sample. Subsequently, we tested whether the median level of each functional marker in each cell type (cluster) differed significantly between baseline and post-OIT.
Further mechanistic experiments included epigenetics (pyrosequencing of bisulfite-treated genomic DNA purified from participants' PBMCs) and component-resolved diagnostics (ThermoFisher).
Our preliminary results indicated a significant decrease (FDR-adjusted P < 0.01) of CD28 and GPR15 levels in effector memory CD4+ T cells after successful OIT compared to baseline. A significant increase (FDR-adjusted P < 0.01) in IL-10 was detected in the Treg and gamma-delta T cell populations. Epigenetic data demonstrated hypermethylation of the -48 CpG site in the IL-4 promoter region post-OIT (FDR-adjusted P < 0.01). The IgG4/IgE ratio of antibodies to most of the whole foods in the participants' OIT, and to the corresponding storage proteins, showed a significant increase (FDR-adjusted P < 0.01) between baseline and post-OIT. Our data thus imply that T cell anergy induced through OIT might contribute to successful desensitization.
Most of today's genome sequencing technologies require that genomes be sequenced in fragments. Typically, these fragments are then aligned using a variety of different alignment programs. All alignment tools query against a reference database to determine the most accurate reassembly of the original DNA strand's nucleotide sequence. Although these programs can align in both nucleotide and protein space, each method comes with its own disadvantages. Protein aligners such as PALADIN consistently align a greater percentage of reads faster and provide greater insight into the functional capabilities of the aligned sequence. On the other hand, this method reduces the sensitivity of taxonomic classification due to the degeneracy of the genetic code. Our program, Renuc, is a PALADIN plugin that addresses this issue by taking protein alignment results against the UniProt database and identifying the most likely taxonomic origin for each nucleotide sequence associated with each detected protein. We have validated our approach and its implementation in Renuc by successfully retrieving the nucleotide sequence and corresponding taxonomic IDs for all of the aligned proteins in our test dataset, consisting of a whole Escherichia coli genome. Our program aligns over 99 percent of the nucleotide reads, with 97 percent of them remaining in the same protein cluster as the original protein alignment. However, this dataset is exceptionally well studied and documented in UniProt; future work should consider a dataset with fewer annotations in the database. Renuc quickly identifies and visualizes the alignment's taxonomic data in a user-friendly way. The integration of SQLite into the program significantly reduces the time required to retrieve information from the UniProt database. Currently, we seek to improve the retrieval of nucleotide sequences by creating a local cache of the NCBI RefSeq database, and to visualize taxonomy at greater resolution using RAxML.
Studies using similarity metrics have helped quantify relationships between patients; however, these studies leverage neither the patients' prior treatments nor the ordering of those treatments. Our proposal seeks to recommend the next treatment for a given patient by comparing the overall survival of similar patients who share a common treatment stem. Data were aggregated from the Flatiron® Advanced non-small-cell lung cancer (NSCLC) proprietary dataset, comprising 1,312 patients from 2008-2016. Our methodology pipeline comprised three main components: non-treatment-based similarity (NTS), treatment-based similarity (TS), and recommendation. For NTS, we divided all non-treatment features into two main categories (a genetic category and a clinical/demographic category), computed patient similarity using the Gower similarity metric for each category, and, following an approach similar to Gottlieb et al., created a single similarity measure using the geometric mean. A similarity threshold is used to select, for a reference patient p, a set of similar patients Simp with a similarity value (determined from the genetic and clinical/demographic features) above the threshold. TS is then used to filter from Simp the patients that do not share the same treatment class-level stem (prior treatments) as the reference patient. The objective is to consider only patients with similar previous treatment classes when determining the next treatment class for the reference patient. From this final subset of patients, we determine which patient survived the longest number of days following the treatment stem and recommend this patient's next treatment class to the reference patient. To evaluate this approach, we repeated this methodology across 10 random subsets of patients, where each subset was 10% of the entire dataset.
Each patient in each subset was treated as a reference patient and compared against the patients outside the random subset. We varied the length of the initial treatment stem and the NTS similarity threshold. We found that for stems shorter than two treatments and similarity thresholds above 0.6, only approximately 30% of patients received the same treatment as the longest-surviving patient in the subpopulation. Further work is needed to refine the proposed approach, including assigning different weights to the features used in the similarity computation and considering other outcome variables for recommending the next treatment (e.g., quality of life using the ECOG performance score).
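The NTS computation described above can be sketched in a few lines. This is an illustrative toy version, not the authors' code: the feature names, values, and numeric range below are hypothetical, and a real Gower implementation must also handle missing values and feature weighting.

```python
from math import sqrt

def gower_similarity(a, b, ranges):
    # Gower similarity over mixed features: categorical features score 1
    # on exact match; numeric features score 1 - |a - b| / range.
    scores = []
    for key, rng in ranges.items():
        if rng is None:                 # categorical feature
            scores.append(1.0 if a[key] == b[key] else 0.0)
        else:                           # numeric feature with known range
            scores.append(1.0 - abs(a[key] - b[key]) / rng)
    return sum(scores) / len(scores)

# Hypothetical patients with a genetic and a clinical/demographic block
p1 = {"EGFR": "mut", "age": 61, "stage": "III"}
p2 = {"EGFR": "mut", "age": 58, "stage": "III"}

genetic_sim = gower_similarity(p1, p2, {"EGFR": None})
clinical_sim = gower_similarity(p1, p2, {"age": 50.0, "stage": None})

# Single NTS score: geometric mean of the two category similarities
nts = sqrt(genetic_sim * clinical_sim)
```

Patients whose `nts` exceeds the chosen threshold would enter Simp for the reference patient.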
Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, particularly for estimating sequence similarity measures used to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence-alignment-based methods in phylogeny. In this work, we present a novel linear-time heuristic to approximate ACS_k; it is faster than computing the exact ACS_k while producing values closer to the exact ACS_k than previously published linear-time greedy heuristics.
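For intuition, the exact ACS (the k = 0 case, with no mismatches) can be computed with a naive quadratic scan. This sketch is for illustration only; both the exact ACS_k and the heuristics discussed here rely on suffix data structures to reach linear time.

```python
def acs(a, b):
    # Average common substring: for each position i of `a`, the length
    # of the longest substring starting at i that occurs somewhere in
    # `b`, averaged over all positions of `a`. Naive O(n^2 * m) version.
    total = 0
    for i in range(len(a)):
        length = 0
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)
```

Note that ACS is asymmetric; phylogenetic distance measures built on it symmetrize the two directions.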
iPTMnet is an integrated bioinformatics resource for post-translational modification (PTM) network discovery and analysis that integrates text-mining results from eFIP and RLIMS-P with curated databases and ontologies. In its present form, the data in iPTMnet are accessible only through the website. To facilitate the integration of iPTMnet into existing bioinformatics pipelines, we have built a RESTful API to query and retrieve iPTMnet data programmatically. In addition, we have built Python and R packages that make the API accessible to biologists.
We have developed a cloud-based (AWS and IBM SoftLayer) knowledge environment for scalable semantic mining of scientific literature and integrative PTM knowledge discovery in precision medicine, building upon our novel natural language processing (NLP) technologies and bioinformatics infrastructure. We provided semantic integration of full-scale PubMed mining results from disparate text-mining tools, along with kinase-substrate data from iPTMnet and PTM proteoforms and their relations from the Protein Ontology (PRO). We shared the digital objects of these applications in multiple interoperable formats and have registered them in bioCADDIE using CEDAR. We experimented with multiple system setups, choosing the operating system, programming language, web server, and database server that best fit each application. We evaluated the cost-effectiveness of cloud computing, paying only for what we use and readily experimenting with additional services. A web portal for accessing our cloud-based knowledge environment is available at https://proteininformationresource.org/cloud/.
Drug repurposing is the use of currently approved drugs to treat diseases separate and distinct from their originally approved indications. With the rise of precision medicine and quantitative systems pharmacology, there is heightened emphasis on applying genomic data to guide drug development. Here, we propose an algorithmic approach to selecting drug repurposing candidates using relevant drug profiles from DrugBank, a publicly available database of pharmacological agents, and genomic data from BioVU, a large-scale DNA repository linked to de-identified longitudinal electronic health record information at Vanderbilt University Medical Center. Specifically, we propose a method of candidate prioritization that integrates structured data from DrugBank, such as marketing start date, number of targets with known mechanism of action, target names, and drug class, with quality-control thresholds for the genomic data derived from the DNA samples housed within BioVU. Through the synergy of delineated "target-action pairs" with target genetics, pharmacodynamics, and pharmacokinetics, we identify a new method of repurposability screening and candidate prioritization that narrows a total of nearly 11,000 agents to a select, manageable subset of approximately 250 drugs with unique mechanisms of action, target selectivity, and real-world repurposing potential.
Boolean implication networks (Genet) were used to model gene co-expression networks in our previous research. In this study, they are constructed to model the co-occurrence of amplification/deletion events in DNA copy number variations (CNVs) at a genome-wide scale. The Boolean implication scheme extends the dichotomous nature of the variable under scrutiny so that it can take numerous discrete values corresponding to DNA CNVs, and the pairwise co-occurrence of CNVs is computed. The implication network was implemented in a software package (Genet-CNV) and run on 271 samples from patients with non-small cell lung cancer (NSCLC) [GSE31800].
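To illustrate the underlying idea in the simplest (binary) case, a Boolean implication such as "gene A amplified implies gene B amplified" can be tested by checking that the contradicting quadrant of the 2x2 contingency table is sparse. This is a simplified sketch under an assumed sparseness threshold; the published implication-network approach uses statistical tests on the quadrant counts and, as noted above, extends to multiple discrete CNV states.

```python
def implies(a, b, sparseness=0.5):
    # Test 'a high => b high' on two binary event vectors by checking
    # that the (a high, b low) quadrant of the 2x2 contingency table is
    # sparse relative to the count expected under independence.
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    expected = sum(a) * (n - sum(b)) / n
    return observed <= sparseness * expected
```

Running this test over all pairs of CNV events yields the directed edges of an implication network.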
Recognition of viral epitopes by the host immune system plays a pivotal role in controlling viral infection, while at the same time exerting selective pressure for escape mutations, which in turn leads many epitopes to evolve faster than the adjacent regions. Some epitope regions appear to be highly conserved despite the strong immune pressure, due to functional and/or structural constraints acting on them. Yet we still have relatively little understanding of the nature of the protein structural and functional constraints operating on epitope regions. Here, we identify coevolving epitope regions that are under high functional and/or structural constraints. We examined patterns of coevolution of protein segment pairs in the Integrase-Reverse Transcriptase, Integrase-Vpr, and Reverse Transcriptase-Vpr protein interaction pairs of the HIV-1 pre-integration complex. Our coevolutionary and structural analysis shows that protein regions with strong multiple coevolutionary constraints involving few regions are located in structurally conserved regions (i.e., those with well-defined secondary structures). Meanwhile, protein regions with strong multiple coevolutionary constraints involving many regions are clustered in the structurally flexible regions of the proteins (regions with high B-factor) and in regions with high conformational diversity (relative mobility in the collective dynamics). On the other hand, protein regions with weak coevolutionary signals are clustered in structurally disordered regions. Identifying viral epitope regions that harbor strong constraints against escape mutations is important in the development of vaccines that can induce immune responses against multiple conserved epitopes. This analysis also offers important insights into the molecular evolution of protein interactions.
Clustering biological samples allows us to define populations within groups (for example, of species or cells), which permits us to answer questions about the processes occurring in those groups. Distance calculations between DNA sequences have been used to build clusters of samples. However, distance calculations for genome-scale data are limited to a small number of samples due to the size of genomic data; for example, a human genome sequenced at 10X coverage is approximately 30 GB in size. Thus, to understand biological samples it is necessary to develop efficient, accurate methods to calculate distances among many genomes. This will allow us to see similarities and differences between DNA sequences, examine their mutational patterns, and better understand evolution. In this project, we calculated cosine distances among human genome samples based on k-mer frequencies. We used publicly available Illumina reads from human genome samples from five populations. We calculated k-mer frequencies for multiple values of k in each genome sample using Jellyfish, a tool for fast, memory-efficient counting of k-mers in DNA, and then computed cosine distances between the k-mer profiles based on the frequency of each k-mer. We used these distances to build dendrograms of samples and infer clustering. For k <= 12, distance calculations were fast, but the resulting distances did not capture the expected population structure (i.e., the known ancestry of samples). In contrast, population structure should be captured accurately for large k, but the distance calculations become computationally intractable. Thus, we need an efficient way to compute genomic distance using large k. We hypothesized that the majority of k-mers are infrequent and contribute little to the dot product in the cosine distance calculation; removing these k-mers from the calculation could therefore reduce computation time without materially changing the cosine distance.
To better understand the distribution of k-mer frequencies, we built histograms, normalizing frequencies by the level of genome coverage. We filtered out k-mers with frequencies below 10^0, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6, and 10^7, and recalculated the cosine distance for each set of filtered k-mers. We then examined the closest neighboring sample to each sample. Each sample's nearest neighbor remained the same after filtering out frequencies below 10^5. We rebuilt the dendrograms and confirmed that the clustering of samples was not affected by filtering out frequencies below 10^5, which we determined to be the optimal filter value. Calculating cosine distances on the filtered frequencies was 25 times faster than calculating them on unfiltered frequencies. Calculating the cosine distance between a pair of 12-mer human genome profiles takes 48 seconds using TensorFlow with a GPU-based framework; at that rate, distance calculations for 25 samples would take four hours. However, calculating the distance on a pair of 12-mer profiles that excludes frequencies below 10^5 takes only two seconds. In the future we will validate that the filtered data provide similar speed-ups for large k while still yielding accurate sample distances and clusters. We plan to make this approach available for projects that involve clustering of genome data samples.
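The core computation (k-mer counting, frequency filtering, and cosine distance) can be sketched as follows. This toy version counts k-mers in plain Python, standing in for Jellyfish, and drops k-mers below a count threshold before taking the dot product; the filtering rule shown is one plausible reading of the approach, not the authors' exact implementation.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k):
    # Count all k-mers in a sequence (a stand-in for Jellyfish output).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_distance(p, q, min_count=0):
    # Cosine distance between two k-mer profiles; k-mers whose count is
    # below min_count in both profiles are dropped before the dot product.
    keys = [m for m in set(p) | set(q)
            if p.get(m, 0) >= min_count or q.get(m, 0) >= min_count]
    dot = sum(p.get(m, 0) * q.get(m, 0) for m in keys)
    norm_p = sqrt(sum(p.get(m, 0) ** 2 for m in keys))
    norm_q = sqrt(sum(q.get(m, 0) ** 2 for m in keys))
    return 1.0 - dot / (norm_p * norm_q) if norm_p and norm_q else 1.0
```

Because rare k-mers dominate the profile's size but contribute little to the dot product, raising `min_count` shrinks the vectors far more than it changes the distance.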
RNAi therapeutics can be designed to silence almost any gene of interest and have demonstrated high levels of efficacy and acceptable safety profiles in pre-clinical and clinical development for cardio-metabolic, hepatic infectious, central nervous system, and rare diseases. Minimizing microRNA-like off-target activity while maintaining on-target silencing is a means to maximize the safety profile. One strategy to mitigate off-target activity is to incorporate thermally destabilizing residues such as glycol nucleic acid in the seed region of the antisense strand of a double-stranded RNA. Here we demonstrate the benefit of this strategy using Alnylam's ESC+ conjugate platform by performing RNA-Seq in dose response to measure both on-target and off-target effects. Diverse measures and visualizations of transcriptomic noise will be presented, as well as estimates of relative on-target to off-target effects as a function of dose. These results show that ESC+ conjugates are capable of simultaneously achieving high levels of on-target silencing while maintaining low levels of transcriptomic noise.
Antimicrobial peptides (AMPs) are being considered as a promising replacement for antibiotics. They act in the body's immune system. While their effect inside the body is reasonably well understood, correctly identifying AMPs from their sequence features remains a subject of active investigation. Here we optimize the use of reduced alphabets, which simplify the 20-letter amino acid alphabet to 2-4 letters, and of N-grams, short strings of amino acids, to find correlations between profiles of N-gram frequencies and antimicrobial activity. The calculations were carried out using Java programs written for this study and the WEKA machine learning software. Classification using machine learning methods was then conducted for AMP subclasses, including antibacterial, antifungal, and antiviral peptides. The results show that reduced alphabets with N-gram frequency analysis are a promising alternative for AMP classification and prediction. All AMP sequences were retrieved from different sources. The general AMP set consists of 7984 sequences, not necessarily of any specific class; we also used class-specific AMP sets (antibacterial, antiviral, and antifungal). A raw negative set of 20258 non-AMPs was built from sequence fragments drawn from annotated protein sequence databases. The classification of AMPs against non-AMPs was successful: models achieved a maximum accuracy of 87.71% using N-gram frequency analysis, alphabet reduction option 47, and an RF model with 10 trees under cross-validation. Classification of more specific classes of AMPs was conducted next. First, classification of ABPs against non-ABP AMPs achieved a maximum accuracy of 86.83% using N-gram frequency analysis, alphabet reduction option 47, and an RF model (84.35% with the bagging algorithm). Second, classification of AVPs against non-AVP AMPs achieved accuracies of 92.75% and 92.30% using N-gram frequency analysis with alphabet reduction options 47 and 29, respectively, and an RF model.
The experiments included many other successful trials. RF significantly outperformed each of the other six learning algorithms, and alphabet reduction 47 most often yielded the highest classification accuracies. This finding implies that a 4-cluster alphabet is optimal for N-gram frequency analysis and machine learning. Our results suggest that the classifiers produced possess strong predictive power and can be of significant use in various biological and medical applications, potentially saving tens or hundreds of thousands of lives.
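The featurization step described above can be sketched as follows. The 4-letter grouping below is a generic hypothetical reduction (hydrophobic/polar/acidic/basic) chosen for illustration; it is not necessarily the "option 47" clustering, whose exact groups are not given here.

```python
from collections import Counter

# Hypothetical 4-cluster reduction of the 20-letter amino acid alphabet
GROUPS = [("AVLIMFWPGC", "h"), ("STYNQ", "p"), ("DE", "a"), ("KRH", "b")]
REDUCE = {aa: letter for group, letter in GROUPS for aa in group}

def ngram_profile(seq, n=2):
    # Map the sequence onto the reduced alphabet, then return the
    # relative frequency of every N-gram; these profiles are the
    # feature vectors fed to the WEKA classifiers.
    reduced = "".join(REDUCE[aa] for aa in seq)
    grams = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(grams.values())
    return {g: count / total for g, count in grams.items()}
```

With a 4-letter alphabet there are only 4^n possible N-grams, so even short peptides yield dense, comparable feature vectors.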
This paper presents a new method for diagnosing schizophrenia using deep learning. The experiment used a secondary dataset supplied by the National Institutes of Health. We first analyze the dataset and identify schizophrenia using traditional machine learning methods such as logistic regression, support vector machines, and random forests. Finally, a deep neural network with three hidden layers is applied to the dataset. The results show that the neural network model yielded the highest accuracy, suggesting that deep learning may be a feasible method for diagnosing schizophrenia.
\subsubsection*{Background} Genome rearrangements are large-scale evolutionary events that shuffle genomic architectures. Since genome rearrangements are rare, the number of events between two genomes is used in phylogenomic studies to measure the evolutionary distance between them. Such measurement is often based on the maximum parsimony assumption, implying that the evolutionary distance can be estimated as the minimum number of rearrangements between genomes. The maximum parsimony assumption enables addressing the ancestral genome reconstruction problem, which asks for the reconstruction of ancestral genomes from the given extant genomes by minimizing the total distance between genomes along the branches of the phylogenetic tree. The basic case of this problem with just three given genomes is known as the genome median problem (GMP), which asks for a single ancestral genome (\emph{median genome}) at the minimum total distance from the given genomes. \emph{Whole genome duplication} (WGD) represents yet another type of dramatic evolutionary event, which simultaneously duplicates each chromosome of a genome. WGDs are known to have happened in the evolution of plants~\cite{guyot2004ancestral}. An analog of the GMP in the presence of a WGD is known as the guided genome halving problem (GGHP). This problem is posed for input genomes A and B, where all genes in B are present in a single copy (\emph{ordinary genome}), while all genes in A are present in two copies (\emph{duplicated genome}). The GGHP asks for an ordinary ancestral genome R that minimizes the total evolutionary distance between genomes A and $2R$ (the genome resulting from the WGD of R) and between B and R. \vspace{-0.5em} \subsubsection*{Methods} A major tool for the analysis of genome rearrangements is the breakpoint graph, which encodes gene adjacencies in different genomes by edges of different colors. A median genome corresponds to a certain optimal perfect matching in the breakpoint graph of the given genomes.
While the GMP is NP-hard~\cite{tannier2009multichromosomal}, one of the prominent exact and practical solutions to the GMP is based on decomposition of the breakpoint graph into \emph{adequate subgraphs}~\cite{xu2009fast}, i.e., induced subgraphs where any optimal matching can be extended to an optimal matching in the whole graph. To handle genomes with duplicated genes, one has to generalize the notion of the breakpoint graph to the contracted breakpoint graph of the given genomes A and B. In the present study, we extend the adequate subgraph approach to the GGHP. \vspace{-0.5em} \subsubsection*{Results} We extended the notion of adequate subgraphs to contracted breakpoint graphs and identified all simple adequate subgraphs of order $2$ and $4$ (shown in Fig.~\ref{fig:adequate_gghp}). This enables us to design an efficient divide-and-conquer algorithm for the GGHP. Our algorithm searches for adequate subgraphs in the given contracted breakpoint graph and combines optimal matchings in these subgraphs into an optimal matching (representing a solution to the GGHP) in the whole graph. \vspace{-0.5em} \subsubsection*{Conclusion} Our present study provides an exact, fast algorithm for the GGHP. In future research, we plan to extend the notion of adequate subgraphs to other ancestral reconstruction problems with duplicated genomes, such as the guided genome aliquoting problem. \vspace{-1em}
Transitioning to value-based care makes new demands on understanding and managing patient risk for a variety of adverse outcomes in multiple conditions. Optimizing use of finite healthcare resources then proves challenging, and would benefit from a data-driven approach. Modelling the "risk triangle" paradigm of disease management as a state diagram within the electronic health record helps bring clinical situational awareness and tailored decision support interventions to individual patients at the point-of-care, while automatically capturing new types of state duration and transition sequence data across the whole population. Such data can iteratively inform improving risk prediction models.
With biomolecular structure recognized as central to understanding mechanisms in the cell, dry laboratories have devoted significant effort to modeling and analyzing structure and dynamics. While significant advances have been made, particularly in the design of sophisticated energetic models and molecular representations, such efforts are experiencing diminishing returns. One of the culprits is the low exploration capability of Molecular Dynamics- and Monte Carlo-based exploration algorithms. The impasse has attracted AI researchers bringing complementary tools, such as randomized search and stochastic optimization. This tutorial introduces students and researchers to stochastic optimization treatments and methodologies for understanding and elucidating the role of biomolecular structure and dynamics in function. In addition, the tutorial allows attendees to connect structures, motions, and function via analysis tools that take an energy-landscape view of the relationship between biomolecular structure, dynamics, and function. The presentation is enhanced by open-source software that permits hands-on exercises, benefiting both students and senior researchers keen to make their own contributions.
Designing effective Clinical Decision Support (CDS) tools in an Electronic Health Record (EHR) can prove challenging due to complex real-world scenarios and newly discovered requirements. Deploying new CDS tools shares much in common with new product development, where "agile" principles and practices have consistently proven effective. Agile methods can thus prove helpful on CDS projects, including time-boxed "sprints" and lightweight requirements gathering with user stories. Modeling CDS behavior promotes an unambiguous shared understanding of desired behavior but risks analysis paralysis; an Agile Modeling approach can foster effective rapid-cycle CDS design and optimization. The agile practice of automated testing, for test-driven design and regression testing, can be applied to CDS development using open-source tools. Ongoing monitoring of CDS behavior once released to production can identify anomalies and prompt rapid-cycle redesign to further enhance CDS effectiveness. Tutorial participants will learn about these topics in interactive didactic sessions, with time to practice the techniques taught.
The past decade has seen a revolution in genomic technologies that has enabled a flood of genome-wide profiling of molecular elements in human genomes. This massive-scale "-omics" data provides researchers with an unprecedented opportunity to understand gene regulation, enabling new insights into the principles of life, the study of diseases, and the development of treatments and drugs. Computational challenges are the major bottleneck for comprehensive genome-wide analysis of gene regulation: such data sets are complex, structured, and growing at an unprecedented rate. Problems of this nature may be particularly well suited to deep learning techniques, which have recently shown impressive results across a variety of domains. This tutorial aims to provide an extensive literature review of state-of-the-art techniques in deep learning, to examine how deep learning is changing the analysis of gene regulation datasets, and to foresee the potential of deep learning to transform several areas of biology and medicine.
Reproducibility is essential for the verification and advancement of scientific research. To reproduce the results of computational analyses, it is often necessary to recreate not just the code but also the software and hardware environment. Software containers such as Docker, which distribute the entire computing environment, are rapidly gaining popularity in bioinformatics. Docker not only allows for the reproducible deployment of bioinformatics workflows but also facilitates mixing and matching components from different workflows that have complex and possibly conflicting software requirements. However, configuration and deployment of Docker, a command-line tool, can be exceedingly challenging for biomedical researchers with limited training in programming and technical skills. We developed a drag-and-drop GUI called the Biodepot-Workflow-Builder (Bwb) to allow users to assemble, replicate, modify, and execute Docker workflows. Bwb represents individual software modules as widgets that are dragged onto a canvas and connected together to form a graphical representation of an analytical pipeline. These widgets allow the user interface to interact with software containers, so that software tools written in other languages are compatible and can be used to build modular bioinformatics workflows. We will present a case study using Bwb to create and execute an RNA sequencing data workflow.
Accurately identifying disease-associated alleles from large sequencing experiments remains challenging. During this tutorial, participants will learn how to use a new variant annotation and filtering web app called Bystro (https://bystro.io/) to analyze sequencing experiments. Bystro is the first online, cloud-based application that makes variant annotation and filtering accessible to all researchers, even for the largest, terabyte-sized whole-genome experiments containing thousands of samples. Using its general-purpose, natural-language filtering engine, attendees will be shown how to perform quality-control measures and identify alleles of interest. They will then be guided in exporting those variants and using them in both a regression context, by performing rare-variant association tests in R, and a classification context, by training new machine learning models in Python's scikit-learn library.
This tutorial extensively covers the definitions, nuances, challenges, and requirements for the design of interpretable and explainable machine learning models and systems in healthcare. We discuss many settings in which interpretable machine learning models are needed in healthcare and how they should be deployed. Additionally, we explore the landscape of recent advances addressing the challenges of model interpretability in healthcare and describe how one would go about choosing the right interpretable machine learning algorithm for a given problem in healthcare.
A major aim of cancer genomics is to pinpoint which somatically mutated genes are involved in tumor initiation and progression. This is a difficult task, as numerous somatic mutations are typically observed in each cancer genome, only a subset of which are cancer-relevant, and very few genes are found to be somatically mutated across large numbers of individuals. In this talk, I will overview three methods my group has introduced for identifying cancer genes. First, I will present a framework for uncovering cancer genes, differential mutation analysis, that compares the mutational profiles of genes across cancer genomes with their natural germline variation across healthy individuals. Next, I will show how to leverage per-individual mutational profiles within the context of protein-protein interaction networks in order to identify small connected subnetworks of genes that, while not individually frequently mutated, comprise pathways that are altered across (i.e., "cover") a large fraction of individuals. Finally, I will demonstrate that cancer genes can be discovered by identifying genes whose interaction interfaces are enriched in somatic mutations. Overall, these methods recapitulate known cancer driver genes and discover novel, and sometimes rarely mutated, genes with likely roles in cancer.
Precision medicine offers the promise of improved diagnosis and of more effective, patient-specific therapies. Typically, such studies have been pursued using research cohorts. At Vanderbilt, we have linked de-identified electronic health records (EHRs) to a DNA repository, called BioVU, which has nearly 250,000 samples. Through BioVU and the Electronic Medical Records and Genomics (eMERGE) network, an NHGRI-funded network using EHRs for discovery, we have studied the genomic basis of disease and drug response using real-world clinical data. The EHR also enables the inverse experiment: starting with a genotype and discovering all the phenotypes with which it is associated, a phenome-wide association study. By looking for clusters of diseases and symptoms through phenotype risk scores, we find unrecognized genetic variants associated with common disease. The era of huge international cohorts such as the UK Biobank, the Million Veteran Program, and the newly started All of Us Research Program will make millions of individuals available with dense molecular and phenotypic data. All of Us launched May 6, 2018 and will engage one million diverse individuals across the US who will contribute data and also receive results back.
Motivation: The powerful idea of Connectivity Mapping proposes the creation of a library of drug-induced gene expression signatures. Such a resource can facilitate finding small molecules that mimic or reverse disease signatures, identifying drug targets, discovering the mechanisms of action of novel small molecules, elucidating off-target effect mechanisms, and directing cellular differentiation and reprogramming. A related concept is Gene Set Enrichment Analysis. Problem statement: In my presentation I will discuss how these two transformative ideas can be expanded in various creative ways to unify knowledge representation in systems biology. Approach: I will demonstrate how expanded Connectivity Mapping and Gene Set Enrichment Analysis, combined with machine learning, can enable imputing and illuminating new biological and pharmacological knowledge.
It is our great pleasure to have eleven highlight talks in the program of the 2018 ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB'18. These invited talks are based on articles that have been published in the last 12 months and represent some of the most interesting and exciting work in our field.
Predicting the fold of a protein from its sequence is a challenging and important problem in bioinformatics. In this work, we developed the first deep learning method that can directly classify a protein sequence of arbitrary length into the 1,195 known protein folds in the Structural Classification of Proteins (SCOP) database. The method uses one-dimensional convolutional neural networks (1D-CNNs) to automatically extract features from raw protein sequence information and compose them into high-level features to predict protein folds, achieving a classification accuracy of 73% on an independent test dataset. This novel machine learning approach differs from traditional alignment methods, which rely on the known folds of homologous proteins to recognize the fold of a target protein. It is also more accurate than HHSearch, a top protein profile-profile alignment method, in recognizing the folds of proteins that have little sequence similarity to proteins with known folds. Moreover, the deep learning method overcomes the shortcoming of previous machine learning methods, which could only classify proteins into dozens of pre-selected folds due to methodological limitations. The hidden features extracted by the deep learning method provide a new semantic representation of proteins that can be used in other protein analysis tasks such as protein clustering and comparison.
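The architectural trick that lets a network accept sequences of arbitrary length is global max pooling over the convolutional activations. The minimal sketch below shows that single operation on a one-hot encoded sequence with one hand-made filter; the actual model learns many such filters and stacks layers on top, and the motif "GKT" here is purely hypothetical.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    # Encode a protein sequence as a list of 20-dimensional one-hot rows.
    return [[1.0 if a == aa else 0.0 for a in AMINO] for aa in seq]

def conv1d_global_maxpool(x, kernel):
    # Slide one filter of width len(kernel) over the sequence, apply
    # ReLU, then take the maximum activation: the result is a single
    # number regardless of sequence length.
    width = len(kernel)
    acts = []
    for i in range(len(x) - width + 1):
        s = sum(kernel[j][a] * x[i + j][a]
                for j in range(width) for a in range(len(AMINO)))
        acts.append(max(0.0, s))  # ReLU
    return max(acts) if acts else 0.0

# A hand-made width-3 filter that fires on the (hypothetical) motif "GKT"
kernel = [[1.0 if AMINO[a] == aa else 0.0 for a in range(len(AMINO))]
          for aa in "GKT"]
score = conv1d_global_maxpool(one_hot("MAGKTLV"), kernel)
```

A bank of such pooled filter responses forms a fixed-length feature vector that a classifier can map to fold labels.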
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. A representative subset is a subset of sequences from the original data set that (1) minimizes the redundancy in the representative sequences, and (2) maximizes the representativeness of the subset; that is, every sequence in the full data set has at least one representative that is similar to it. The selected representative subset is then used in downstream analysis in place of the full data set. Previous methods for this task, such as CD-HIT, PISCES and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. These sequence selection methods are very widely used---for example, the CD-HIT papers have been cited a total of >3,000 times (Google Scholar)---and are a standard preprocessing step applied to data sets of protein sequences, cDNA sequences and microbial DNA. In this work, we propose a principled framework, Repset, for representative protein sequence subset selection using submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. Our approach involves defining a submodular objective function that quantifies the desirable properties of a given subset of sequences, and then applying a submodular optimization algorithm to choose a representative subset that maximizes this function. Framing this task as an optimization problem has two benefits. First, it allows us to leverage a large existing literature on submodular optimization. 
This led to the development of a method that is computationally efficient, empirically outperforms other methods, and, in contrast to all existing solutions to this problem, is backed by theoretical guarantees of its performance. In particular, Repset outperforms threshold-based methods on two measures: (1) representative subsets produced by Repset have lower redundancy, as measured by the pairwise similarity of sequences in the set, and (2) these subsets have greater structural diversity, as measured using the SCOPe library of protein domain structures. Second, the optimization-based framework gives the method great flexibility. The user can select one of a variety of objective functions to optimize according to their needs. For example, the user can minimize the redundancy of sequences in the subset, maximize the representativeness of the subset of the full set, or some combination of the two. The user can also choose to prefer some sequences over others, such as preferring long sequences over shorter ones. More broadly, this paper demonstrates the utility of submodular optimization for computational biology. Applying submodular optimization to a new problem has two simple steps: (1) devise a submodular objective function, and (2) apply a standard optimization algorithm to this objective. Therefore, we believe that the strategy we employ here will have analogous applications to hundreds of other problems in computational biology.
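As a concrete illustration of the submodular framing, the sketch below greedily maximizes a facility-location objective f(S) = sum over all sequences i of the maximum similarity between i and any chosen representative j in S, a standard submodular measure of representativeness for which the greedy algorithm carries a (1 - 1/e) approximation guarantee. The similarity matrix is a toy stand-in for sequence similarities; this is not the Repset implementation.

```python
def greedy_representatives(similarity, k):
    # Greedily pick k representatives maximizing the facility-location
    # objective f(S) = sum_i max_{j in S} similarity[i][j].
    n = len(similarity)
    chosen, best = [], [0.0] * n  # best[i]: coverage of i by current S
    for _ in range(k):
        gains = []
        for j in range(n):
            if j in chosen:
                gains.append(-1.0)
                continue
            # Marginal gain of adding j given current coverage
            gains.append(sum(max(0.0, similarity[i][j] - best[i])
                             for i in range(n)))
        j = max(range(n), key=gains.__getitem__)
        chosen.append(j)
        best = [max(best[i], similarity[i][j]) for i in range(n)]
    return chosen

# Toy similarity matrix: sequences 0 and 1 are near-duplicates,
# sequence 2 is distinct; the greedy picks one from each cluster.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
reps = greedy_representatives(sim, 2)
```

Swapping in a different submodular objective (e.g., penalizing within-set similarity to reduce redundancy) changes only the function being maximized, which is exactly the flexibility described above.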
Mathematical models of cellular processes can systematically predict the phenotypes of novel combinations of multi-gene mutations. Searching for informative predictions and prioritizing them for experimental validation is challenging since the number of possible combinations grows exponentially in the number of mutations. Moreover, keeping track of the crosses needed to make new mutants and planning sequences of experiments is unmanageable when the experimenter is deluged by hundreds of potentially informative predictions to test. We present CrossPlan, a novel methodology for systematically planning genetic crosses to make a set of target mutants from a set of source mutants. We base our approach on a generic experimental workflow used in performing genetic crosses in budding yeast. We prove that the CrossPlan problem is NP-complete. We develop an integer linear program (ILP) to maximize the number of target mutants that we can make under certain experimental constraints. We apply our method to a comprehensive mathematical model of the protein regulatory network controlling cell division in budding yeast. We also extend our solution to incorporate other experimental conditions, such as a delay factor that decides the availability of a mutant and genetic markers to confirm gene deletions. The experimental workflow that underlies our work is quite generic, and our ILP-based algorithm is easy to modify. Hence our framework should be relevant in plant and animal systems as well. This paper opens up a new area of research: how to automatically synthesize efficient experimental plans for making large numbers of mutants carrying perturbations in multiple genes. Moreover, the principles used in CrossPlan can be directly extended to other organisms where siRNA or CRISPR-based screens are effective.
Thus the growing community of biomedical scientists who are beginning to use CRISPR-based approaches to plan multiple, combinatorial gene perturbations will find our approach to be very relevant to their research.
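To make the optimization objective concrete, consider a toy stand-in: each mutant is a set of gene deletions, a cross merges the deletions of two available mutants, and the goal is to hit as many target genotypes as possible within a budget of crosses. The paper solves the real problem with an ILP; the exhaustive search and gene names below are only an illustration of the objective, not the method.

```python
# Toy illustration of the cross-planning objective (NOT the paper's ILP):
# choose at most `budget` single-generation crosses so as to maximize
# the number of target genotypes produced.
from itertools import combinations

def plan_crosses(sources, targets, budget):
    sources = [frozenset(s) for s in sources]
    targets = {frozenset(t) for t in targets}
    # Candidate single-generation crosses: unions of two distinct sources.
    candidates = {a | b for a, b in combinations(sources, 2)}
    best_plan, best_hit = [], set()
    for r in range(budget + 1):
        for plan in combinations(sorted(candidates, key=sorted), r):
            hit = targets & set(plan)
            if len(hit) > len(best_hit):
                best_plan, best_hit = list(plan), hit
    return best_plan, best_hit

# Hypothetical single-deletion sources and double-deletion targets.
SOURCES = [{"cln1"}, {"cln2"}, {"clb5"}]
TARGETS = [{"cln1", "cln2"}, {"cln1", "clb5"}, {"cln2", "clb5"}]
```

With a budget of two crosses only two of the three double mutants can be made; the ILP formulation scales this trade-off to hundreds of targets and multiple generations.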
The class I major histocompatibility complex (MHC) is capable of binding peptides derived from intracellular proteins and displaying them at the cell surface. The recognition of these peptide-MHC (pMHC) complexes by T-cells is the cornerstone of cellular immunity, enabling the elimination of infected or tumoral cells. T-cell-based immunotherapies against cancer, which leverage this mechanism, can greatly benefit from structural analyses of pMHC complexes. Several attempts have been made to use molecular docking for such analyses, but pMHC structure prediction remains too challenging for even state-of-the-art docking tools. To overcome these limitations, we describe the use of an incremental meta-docking approach for structural prediction of pMHC complexes. Previous methods applied in this context used specific constraints to reduce the complexity of this prediction problem, at the expense of generality. Our strategy makes no such assumption and can potentially be used to predict binding modes for any pMHC complex. Our method has been tested in a re-docking experiment, reproducing the binding modes of 25 pMHC complexes whose crystal structures are available. This study is a proof of concept that incremental docking strategies can lead to general geometry prediction of pMHC complexes, with potential applications in immunotherapy against cancer or infectious diseases.
Network alignment (NA) aims to find similar (conserved) regions between networks. Until recently, existing methods were limited to aligning static networks. However, real-world systems, including biological ones, are dynamic. Hence, we introduced the first ever dynamic NA method, DynaMAGNA++, which improved upon traditional static NA. However, DynaMAGNA++ does not necessarily scale well to larger networks in terms of alignment quality or runtime. So, more recently, we introduced another dynamic NA approach, DynaWAVE. DynaWAVE complements DynaMAGNA++: while DynaMAGNA++ is superior to DynaWAVE on smaller networks, DynaWAVE is superior to DynaMAGNA++ on larger networks. This justifies the need for both approaches.
Objective: The gold standard for diagnosing sleep disorders is polysomnography, which generates extensive data about biophysical changes occurring during sleep. We developed the National Sleep Research Resource (NSRR), a comprehensive system for sharing sleep data. The NSRR embodies elements of a data commons aimed at accelerating research to address critical questions about the impact of sleep disorders on important health outcomes. Approach: We used a metadata-guided approach, with a set of common sleep-specific terms enforcing uniform semantic interpretation of data elements across three main components: (1) annotated datasets; (2) user interfaces for accessing data; and (3) computational tools for the analysis of polysomnography recordings. We incorporated the process for managing dataset-specific data use agreements, evidence of Institutional Review Board review, and the corresponding access control in the NSRR web portal. The metadata-guided approach facilitates structural and semantic interoperability, ultimately leading to enhanced data reusability and scientific rigor. Results: The authors curated and deposited retrospective data from 10 large, NIH-funded sleep cohort studies, including several from the Trans-Omics for Precision Medicine (TOPMed) program, into the NSRR. The NSRR currently contains data on 26,808 subjects and 31,166 signal files in European Data Format. Since its launch in April 2014, the NSRR has served over 3000 registered users, who have downloaded over 130 terabytes of data. Conclusions: The NSRR offers a use case and an example for creating a full-fledged data commons. It provides a single point of access to analysis-ready physiological signals from polysomnography obtained from multiple sources, and a wide variety of clinical data to facilitate sleep research. The NIH Data Commons (or Commons) is an ambitious vision for a shared virtual space to allow digital objects to be stored and computed upon by the scientific community.
The Commons would allow investigators to find, manage, share, use and reuse data, software, metadata and workflows. It imagines an ecosystem that makes digital objects Findable, Accessible, Interoperable and Reusable (FAIR). Four components are considered integral parts of the Commons: a computing resource for accessing and processing of digital objects; a "digital object compliance model" that describes the properties of digital objects that enable them to be FAIR; datasets that adhere to the digital object compliance model; and software and services to facilitate access to and use of data. This paper describes the contributions of NSRR along several aspects of the Commons vision: metadata for sleep research digital objects; a collection of annotated sleep data sets; and interfaces and tools for accessing and analyzing such data. More importantly, the NSRR provides the design of a functional architecture for implementing a Sleep Data Commons. The NSRR also reveals complexities and challenges involved in making clinical sleep data conform to the FAIR principles. Future directions: Shared resources offered by emerging resources such as cloud instances provide promising platforms for the Data Commons. However, simply expanding storage or adding compute power may not allow us to cope with the rapidly expanding volume and increasing complexity of biomedical data. Concurrent efforts must be spent to address digital object organization challenges. To make our approach future-proof, we need to continue advancing research in data representation and interfaces for human-data interaction. A possible next phase of NSRR is the creation of a universal self-descriptive sequential data format. The idea is to break large, unstructured, sequential data files into minimal, semantically meaningful, fragments. Such fragments can be indexed, assembled, retrieved, rendered, or repackaged on-the-fly, for multitudes of application scenarios. 
Data points in such a fragment will be locally embedded with relevant metadata labels, governed by terminology and ontology. Potential benefits of such an approach may include precise levels of data access, increased analysis readiness with on-the-fly data conversion, multi-level data discovery and support for effective web-based visualization of contents in large sequential files.
High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read lengths. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.
The application of Natural Language Processing (NLP) methods and resources to clinical and biomedical text has received growing attention over the past years, but progress has been limited by difficulties in accessing shared tools and resources, partly caused by patient privacy and data confidentiality constraints. Efforts to increase the sharing and interoperability of the few existing resources are needed to facilitate the progress observed in the general NLP domain. Leveraging our work in corpus analysis and de-identification, we have created multiple synthetic data sets for two NLP tasks based on real clinical sentences. We are organizing a challenge workshop to promote community efforts towards advancement in clinical NLP. The challenge workshop will have two tasks: 1) Family History Information Extraction; and 2) Clinical Semantic Textual Similarity.
Plants are sessile organisms and are unable to relocate to favorable locations under extreme environmental conditions; hence, they have no choice but to acclimate, and eventually adapt, to severe conditions to ensure their survival. With climate change affecting the environment adversely, it is of utmost importance to make plants and crops robust enough to withstand harsh conditions and safeguard global food production. As traditional methods of bolstering plant defense against stressful conditions come to their biological limit, we require newer methods that allow us to strengthen the plant's internal defense mechanism. This motivated us to look into the genetic networks of plants. In this paper, we lay out a method to analyze genetic networks in plants that are activated under abiotic stress, specifically drought conditions. The method is based on the analysis of Bayesian networks and should ultimately help in finding genes in the genetic networks of the plant that play a key role in its defense response against drought. The WRKY transcription factor is well known for its role in plant defense against biotic stresses, but recent studies have shed light on its activity against abiotic stresses such as drought. Therefore, it is logical to study the various components of the WRKY gene network in order to maximize a plant's defense mechanism. The data used to learn the parameters of the Bayesian network consisted of both real-world and synthetic data. The synthetic data were generated using the dependencies in the Bayesian network model. The network parameters were learned using both Bayesian and frequentist approaches. The estimated parameters are then used to build a Bayesian decision network, in which nodes are selected one at a time for intervention and the utility (score) for the upregulation of a downstream abiotic stress response gene is computed. The node that maximizes this utility is recommended for biological intervention.
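The intervention-selection step can be made concrete with a tiny example: force one upstream regulator up at a time and keep the node whose intervention maximizes the expected utility, taken here as the probability that the downstream response gene is upregulated. The two-regulator network and all probabilities below are invented for illustration; they are not WRKY data.

```python
# Hedged sketch of utility-maximizing intervention selection in a
# Bayesian decision network. Network structure and CPT values are
# toy assumptions, not estimates from the paper.

def p_response_up(p_a_up, p_b_up):
    """Toy CPT: response is up with prob 0.9 if both regulators are up,
    0.6 if exactly one is up, and 0.1 if neither is up."""
    return (p_a_up * p_b_up * 0.9
            + (p_a_up * (1 - p_b_up) + (1 - p_a_up) * p_b_up) * 0.6
            + (1 - p_a_up) * (1 - p_b_up) * 0.1)

def best_intervention():
    prior = {"A": 0.5, "B": 0.7}  # B is already likely to be up
    utilities = {}
    # Forcing a node up sets its probability to 1; the other node
    # keeps its prior marginal.
    utilities["A"] = p_response_up(1.0, prior["B"])
    utilities["B"] = p_response_up(prior["A"], 1.0)
    return max(utilities, key=utilities.get), utilities
```

In this toy case intervening on the less certain regulator A yields the higher utility, which matches the intuition that forcing a gene that was probably up anyway buys little.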
The alignment of biological networks makes it possible to shed light on important biological information. Local network alignment (LNA) aims to discover unique sub-regions of similarity among networks, while global network alignment (GNA) aims to find large conserved regions by matching all the nodes of the input networks. Recent work has demonstrated that the two approaches are complementary, so it is possible to combine them to improve alignment performance. LNA algorithms require as input both the networks and supplementary information to start the process; global alignment, in particular, can be used to produce such information without any a priori knowledge about the networks. We recently explored this possibility. Here, we extend and refine our previous results by introducing SL-GLAlign (Simulated Annealing-Global Local Aligner), a novel methodology that uses topological information extracted from a global alignment to guide the construction of the local alignment. We tested SL-GLAlign on biological networks. Results available at https://sites.google.com/view/sl-glalign/home confirm the effectiveness of the SL-GLAlign methodology.
Any concept of a long-term, relatively permanent memory must include a mechanism for encoding episodic experiences into some appropriate format and later retrieving and recreating a reasonable approximation of the original "episode". Psychologists commonly call these processes "consolidation" and "recall". In this paper, we present two computable functions that closely imitate these processes. Of necessity, we must posit a recording format. It will consist of a system of cycles which support very high information content and are, furthermore, commonly found as protein polymer structures within the membranes of all our cells.
Identification of motifs (recurrent and statistically significant patterns) in biological networks is key to understanding the design principles and inferring the governing mechanisms of biological systems. This, however, is a computationally challenging task. The task is further complicated because biological interactions depend on limited resources; i.e., a reaction takes place only if the reactant molecule concentrations are above a certain threshold level. This biochemical property implies that network edges can participate in a limited number of motifs simultaneously. Existing motif counting methods ignore this problem. This simplification often leads to inaccurate motif counts (over- or under-estimates), and thus, wrong biological interpretations. In this paper, we develop a novel motif counting algorithm, Partially Overlapping MOtif Counting (POMOC), that considers capacity levels for all interactions in counting motifs. Our experiments on real and synthetic networks demonstrate that motif counts obtained with the POMOC method differ significantly from those of existing motif counting approaches, and that our method extends to large-scale biological networks in practical time. Our results also show that our method makes it possible to characterize the impact of different stress factors on the cell's network organization. In this regard, analysis of an S. cerevisiae transcriptional regulatory network using our method shows that oxidative stress is more disruptive to the organization and abundance of motifs in this network than mutations of individual genes. Our analysis also suggests that, by focusing on the edges that lead to variation in motif counts, our method can be used to find important genes and to reveal subtle topological and functional differences of biological networks under different cell states.
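The effect of edge capacities on motif counts can be illustrated on the simplest motif, the triangle: once an edge has participated in its allowed number of motif instances, further candidate motifs using it are discarded. The greedy single-pass scheme below is only a minimal stand-in for the idea, not the POMOC algorithm itself.

```python
# Minimal sketch of capacity-aware motif (triangle) counting: each edge
# may participate in at most `capacity` motif instances. Illustrative
# only; POMOC's actual assignment strategy differs.
from itertools import combinations

def count_triangles(edges, capacity=None):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    used = {frozenset(e): 0 for e in edges}  # per-edge usage counter
    count = 0
    for a, b, c in combinations(sorted(adj), 3):
        tri = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in used for e in tri):
            continue  # not a triangle
        if capacity is not None and any(used[e] >= capacity for e in tri):
            continue  # some edge has exhausted its capacity
        for e in tri:
            used[e] += 1
        count += 1
    return count

# Two triangles sharing the edge (1, 2):
EDGES = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4)]
```

Without capacities the toy graph contains two triangles; with a per-edge capacity of one, the shared edge (1, 2) can serve only one of them, so the capacity-aware count drops to one. This is exactly the over-estimation effect described above.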
Clinical studies often track dose-response curves of subjects over time. One can easily model the dose-response curve at each time point with the Hill equation, but such a model fails to capture the temporal evolution of the curves. Conversely, one can use the Gompertz equation to model the time-response curve at each dose level without capturing the evolution of these curves across doses. In this article, we propose a parametric model for dose-time responses that follows the Gompertz law in time and approximately follows the Hill equation across doses. We derive a recursion relation for dose-response curves over time that captures their temporal evolution. We then specify a regression model connecting the parameters controlling the dose-time responses with individual-level proteomic data. The resulting joint model allows us to predict the dose-response curves over time for new individuals. We illustrate the superior performance of our proposed model compared to the individual models using data from the HMS-LINCS database.
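One simple way to build a surface that is Gompertz in time and Hill in dose is to let the dose set the Gompertz asymptote through a Hill curve, so that every time slice is Gompertz and the long-time dose slice is exactly Hill. The coupling and all parameter values below are illustrative assumptions, not the model fitted in the paper.

```python
# Toy dose-time response surface: Gompertz growth in time at each fixed
# dose, Hill-type saturation across doses at long times. Parameters are
# invented for illustration.
import math

def hill(dose, emax, ec50, n):
    """Hill equation: response saturating in dose."""
    return emax * dose**n / (ec50**n + dose**n)

def gompertz(t, asymptote, b, k):
    """Gompertz law: response approaching `asymptote` over time."""
    return asymptote * math.exp(-b * math.exp(-k * t))

def dose_time_response(dose, t):
    # The dose sets the Gompertz asymptote via a Hill curve.
    asymptote = hill(dose, emax=1.0, ec50=10.0, n=2.0)
    return gompertz(t, asymptote, b=3.0, k=0.5)
```

At the EC50 dose the response climbs over time toward half the maximal effect, so the long-time dose slice recovers the Hill curve while each fixed-dose trajectory is Gompertz.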
Rapid increases in the availability of genomic data for diverse organisms have spurred the search for better mathematical and computational methods to investigate the underlying patterns that connect genotypic and phenotypic data. Large genomic datasets make it possible to search for higher-order epistatic interactions, but also highlight the need for new mathematical tools that can simultaneously represent sequences and phenotypes. We propose a multivariate tensor-based orthogonal polynomial approach to characterize nucleotides or amino acids in a DNA/RNA or protein sequence. Given phenotype data and corresponding sequences, we can construct orthogonal polynomials using sequence information and subsequently map phenotypes onto the space of the polynomials. This approach provides information about higher-order associations between different parts of a sequence, and allows us to identify both linear and nonlinear relationships between phenotype and genomic or proteomic sequence data. We use this method to assess the relationship between sequences and transcription activity levels in a large raw mammalian enhancer dataset downloaded from NCBI. We provide insights into the bioinformatics and computational pipeline necessary to curate and translate large-scale genomic data to extract and quantify complex genome-phenotype interactions.
The hybrid stochastic simulation algorithm, proposed by Haseltine and Rawlings, can significantly improve simulation efficiency for multiscale biochemical networks. However, the populations of some species may be driven negative under certain situations. This paper investigates the negativity problem of the hybrid method based on the second slow-reaction firing time. Our analysis and tests on several models demonstrate that the error caused by negative populations is usually negligible compared with the approximation errors of the method itself. However, for systems involving nonlinear reactions or highly sensitive species, the stability of the system can be affected, potentially leading to simulation failure. The proposed Zero-Reaction rule is recommended for its efficiency and simplicity.
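One plausible reading of such a Zero-Reaction rule can be shown in a plain Gillespie step: when the sampled reaction would drive a population negative, the firing is replaced by a "zero reaction", so time still advances but no state change is applied. This is a hedged sketch of the idea in a pure SSA setting, not the paper's hybrid deterministic-stochastic implementation.

```python
# Sketch of a Zero-Reaction rule inside one Gillespie (SSA) step.
# `reactions` is a list of (propensity_fn, stoichiometry_dict) pairs.
import random

def ssa_step(state, reactions, rng):
    props = [prop(state) for prop, _ in reactions]
    total = sum(props)
    if total == 0.0:
        return state, float("inf")  # nothing can fire
    tau = rng.expovariate(total)    # time to the next firing
    pick = rng.random() * total     # which reaction fires
    acc = 0.0
    for (prop, stoich), a in zip(reactions, props):
        acc += a
        if pick <= acc:
            proposed = dict(state)
            for species, delta in stoich.items():
                proposed[species] = proposed.get(species, 0) + delta
            if any(v < 0 for v in proposed.values()):
                # Zero-Reaction rule: time advances, state is unchanged.
                return state, tau
            return proposed, tau
    return state, tau
```

In a hybrid simulation the negative-population hazard arises because fast-reaction dynamics are approximated deterministically between slow firings; the rule above simply refuses the offending firing rather than clamping or rolling back.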
Network synthesis models in NAPAbench provide effective means to generate synthetic network families that can be used to rigorously assess the performance of network alignment algorithms. In recent years, the protein-protein interaction (PPI) databases have been significantly updated; hence, the network synthesis models in NAPAbench need to be updated so that they can create synthetic network families whose characteristics are close to those of real PPI networks. In this work, we present updated models based on an extensive analysis of real-world PPI networks and their key features.
Cell size is a key characteristic that significantly affects many aspects of cellular physiology. Specific control mechanisms operate during the cell cycle to maintain cell size within a range from one generation to another. Such control mechanisms introduce substantial variability into important properties of the cell cycle, such as the inter-division time. To quantitatively study the effect of this variability on progression through the cell cycle, detailed stochastic models are required. In this paper, a new hybrid stochastic model is proposed to study the effect of molecular noise and the size control mechanism on variability in the cell cycle of the budding yeast Saccharomyces cerevisiae. The proposed model provides an accurate, yet computationally efficient, approach for simulating an intricate system by integrating deterministic and stochastic simulation schemes.
Biological networks describe the mechanisms that govern cellular functions. Temporal networks show how these networks evolve over time. Studying the temporal progression of network topologies is of utmost importance since it uncovers how a network evolves and how it resists external stimuli and internal variations. Two temporal networks have co-evolving subnetworks if the topologies of these subnetworks remain similar to each other as the network topology evolves over a period of time. In this paper, we consider the problem of identifying co-evolving subnetworks in a pair of temporal networks, which aims to capture the evolution of molecules and their interactions over time. Although this problem shares some characteristics of the well-known network alignment problems, it differs from existing network alignment formulations as it seeks a mapping of the two network topologies that is invariant to the temporal evolution of the given networks. This is a computationally challenging problem as it requires capturing not only similar topologies between two networks but also their similar evolution patterns. We present an efficient algorithm, Tempo, for identifying coevolving subnetworks of two given temporal networks. We formally prove the correctness of our method. We experimentally demonstrate that Tempo scales efficiently with the size of the networks as well as the number of time points, and generates statistically significant alignments---even when the evolution rates of the given networks are high. Our results on a human aging dataset demonstrate that Tempo identifies novel genes contributing to the progression of Alzheimer's, Huntington's and Type II diabetes, while existing methods fail to do so.
Missing values frequently arise in modern biomedical studies for various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goal is to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values and then apply the clustering algorithm to the completed data. The performance of such methods, however, depends on knowledge of the missing-value mechanism, which is rarely fully available in practice. We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by marginalizing out the missing-value process from the feature distribution. In particular, we demonstrate the proposed framework for the multivariate Gaussian model with an arbitrary covariance structure. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values compared to various clustering approaches, including k-means, fuzzy c-means and hierarchical clustering, combined with an off-the-shelf Gibbs-sampling-based imputation method. Optimal clustering offers a robust and flexible framework for dealing with the missing-value problem, obviating the need for imputation-based pre-processing of the data. Its superior performance compared to various clustering methods in settings with different missing rates and small sample sizes demonstrates the optimal clusterer as a promising tool for dealing with missing data in biomedical applications.
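The key marginalization idea is easy to demonstrate for the Gaussian case: the likelihood of a point with missing entries is the Gaussian density of its observed entries under the corresponding sub-mean and sub-covariance, with no imputation at all. This is the textbook Gaussian marginal used as a minimal stand-in for the paper's full RLPP machinery; the cluster parameters below are toy values.

```python
# Marginal Gaussian log-likelihood over observed coordinates (NaN = missing),
# and likelihood-based cluster assignment without any imputation.
import numpy as np

def marginal_loglik(x, mean, cov):
    """Log-density of the observed coordinates of x under N(mean, cov)."""
    obs = ~np.isnan(x)
    xo, mo = x[obs], mean[obs]
    # Marginalizing a Gaussian just selects the observed block of Sigma.
    co = cov[np.ix_(obs, obs)]
    diff = xo - mo
    _, logdet = np.linalg.slogdet(co)
    quad = diff @ np.linalg.solve(co, diff)
    return -0.5 * (xo.size * np.log(2 * np.pi) + logdet + quad)

def assign_cluster(x, params):
    """Pick the cluster whose marginal likelihood of x is largest."""
    scores = [marginal_loglik(x, m, c) for m, c in params]
    return int(np.argmax(scores))

# Two toy clusters centered at the origin and at (5, 5).
PARAMS = [(np.zeros(2), np.eye(2)), (np.full(2, 5.0), np.eye(2))]
```

A point observed only in its first coordinate is still assigned sensibly, because the observed block of the covariance carries all the information the model has about that coordinate.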
Abstract Objective. As Genome-Wide Association Studies (GWAS) have been increasingly used with data from various populations, it has been observed that data from different populations reveal different sets of Single Nucleotide Polymorphisms (SNPs) that are associated with the same disease. Using Type II Diabetes (T2D) as a test case, we develop measures and methods to characterize the functional overlap of SNPs associated with the same disease across populations. Materials and methods. We introduce the notion of an Overlap Matrix as a general means of characterizing the functional overlap between different SNP sets at different genomic and functional granularities. Using SNP-to-gene mapping, functional annotation databases, and interaction networks, we assess the degree of functional overlap in T2D-associated loci identified across nine populations from Asian and European ethnic origins. Results. Our results show that more overlap is captured as more functional data is incorporated as we go through the pipeline, starting from SNPs and ending at network overlap analyses. We hypothesize that these observed differences in the functional mechanisms of T2D across populations can also explain the popularity of different prescription drugs in different populations. We show that this hypothesis is concordant with the literature on the functional mechanisms of prescription drugs. Discussion. Our results show that although there exist distinct T2D processes that are affected in different populations, network-based annotations can capture more functional overlap across populations, and that the functional similarity is more substantial between populations with similar ethnic origins. Conclusion. These results support the notion that ethnicity needs to be taken into account in making personalized treatment decisions for complex diseases.
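A minimal reading of the Overlap Matrix idea: given per-population sets of implicated entities (SNPs, genes, pathways, or network modules, depending on the granularity), build the matrix of pairwise set overlaps, here scored with the Jaccard index. The population labels and gene sets below are made up for illustration, not the paper's data.

```python
# Toy Overlap Matrix: pairwise Jaccard overlap between per-population
# sets of implicated genes. Granularity (SNP/gene/pathway/network) is
# chosen by what the sets contain.
def overlap_matrix(sets):
    names = sorted(sets)
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0
    return {(p, q): jaccard(sets[p], sets[q]) for p in names for q in names}

# Hypothetical implicated-gene sets for two populations.
GENES = {
    "pop_east_asian": {"KCNQ1", "CDKAL1", "TCF7L2"},
    "pop_european": {"TCF7L2", "PPARG", "CDKAL1", "FTO"},
}
```

Computing the matrix at successively coarser granularities (SNPs, then genes, then pathways, then network neighborhoods) is what lets the pipeline above capture overlap that is invisible at the SNP level.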
Single-cell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based on such data still manifests significant uncertainty. The goal of this paper is to develop optimal classification of single-cell trajectories accounting for potential model uncertainty. Partially-observed Boolean dynamical systems (POBDS) are used for modeling gene regulatory networks observed through noisy gene-expression data. We derive the exact optimal Bayesian classifier (OBC) for binary classification of single-cell trajectories. The application of the OBC becomes impractical for large GRNs, due to computational and memory requirements. To address this, we introduce a particle-based single-cell classification method that is highly scalable for large GRNs with much lower complexity than the optimal solution. The performance of the proposed particle-based method is demonstrated through numerical experiments using a POBDS model of the well-known T-cell large granular lymphocyte (T-LGL) leukemia network with noisy time-series gene-expression data.
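The state-estimation step that underlies such a particle-based method can be sketched with a bootstrap particle filter: Boolean network states are tracked by a cloud of particles, propagated through the network's update rule, reweighted by how well they explain a noisy observation, and resampled. The two-gene network, bit-flip noise model, and sizes below are toy assumptions, not the T-LGL model.

```python
# Bootstrap particle filter step for a Boolean network state (sketch).
import random

def step_network(state):
    # Toy 2-gene Boolean network: each gene copies the other.
    return (state[1], state[0])

def observe_weight(state, obs, flip_p=0.1):
    """Likelihood of `obs` given `state`: each bit flips independently."""
    w = 1.0
    for s, o in zip(state, obs):
        w *= (1 - flip_p) if s == o else flip_p
    return w

def particle_filter_step(particles, obs, rng, flip_p=0.1):
    moved = [step_network(p) for p in particles]
    weights = [observe_weight(p, obs, flip_p) for p in moved]
    # Multinomial resampling proportional to weight.
    return rng.choices(moved, weights=weights, k=len(moved))
```

After one step, particles consistent with the observation dominate the cloud; a trajectory classifier can then accumulate these per-step likelihoods per candidate network model, which is the scalable alternative to the exact OBC described above.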
The importance of the use of networks to model and analyse biological data and the interplay of bio-molecules is widely recognised. Consequently, many algorithms for the analysis and the comparison of networks (such as alignment algorithms) have been developed in the past. Recently, many different approaches tried to integrate into a single model the interplay of different molecules, such as genes, transcription factors and microRNAs. A possible formalism to model such scenario comes from node/edge coloured networks (or heterogeneous networks) implemented as node/ edge-coloured graphs. Consequently, the need for the introduction of alignment algorithms able to analyse heterogeneous networks arises. We here focus on the local comparison of heterogeneous networks that may be formulated as a network alignment problem. To the best of our knowledge, this problem has not been investigated in the past. We here propose HetNetAligner a novel algorithm that receives as input two heterogeneous networks (node-coloured graphs) and a similarity function among nodes of two networks. We first build a single alignment graph. Then we mine this graph extracting relevant subgraphs. We also implemented our algorithm, and we tested it on some selected heterogeneous biological networks. Preliminary results confirm that our method builds high-quality alignments. The website https://sites.google.com/view/heterogeneusnetworkalignment/home contains supplementary material and the code.
The inference of gene networks from large-scale human genomic data is challenging due to the difficulty in identifying correct regulators for each gene in a high-dimensional search space. We present a Bayesian approach integrating external data sources with knockdown data from human cell lines to infer gene regulatory networks. In particular, we assemble multiple data sources including gene expression data, genome-wide binding data, gene ontology, known pathways and use a supervised learning framework to compute prior probabilities of regulatory relationships. We show that our integrated method improves the accuracy of inferred gene networks. We apply our method to two different human cell lines, which illustrates the general scope of our method.
RNA-seq has become a popular technology for biomarker discovery. However, in many applications, such as single-cell sequencing, zero counts comprise a considerable portion of the data. Here we propose a new RNA-seq model that explicitly models zero counts, solve the previously proposed Optimal Bayesian Filter (OBF) feature selection framework under this model, and find the posterior probability of a feature having distributional differences across classes. As the posterior does not exist in closed form, we propose the Sequence Approximation OBF (SA-OBF), a closed-form approximation based on log transformations of the non-zero reads. We use the SA-OBF to study two breast cancer RNA-seq datasets.
Understanding sensitivity is an important step in studying system robustness and adaptability. In this work, we model and investigate intra-cellular networks via a discrete modeling approach, which assigns a set of discrete values and a deterministic update rule to each model element. The models can be analyzed formally or simulated in a stochastic manner. We propose a comprehensive framework to study sensitivity in these models. In the framework, we define element influence (activity) and sensitivity with respect to the state distribution of the modeled system. Previous sensitivity analysis approaches all assume a uniform state distribution, which is usually not true in biology. We perform both static and dynamic sensitivity analysis, the former assuming a uniform state distribution and the latter using a distribution estimated from stochastic simulation trajectories under a particular scenario. Additionally, we extend the element update functions to include weights according to these computed influences. Adding weights generates a weighted directed graph and therefore enables identifying key elements in the model and the dominant signaling pathways that determine the behavior of the overall model. Finally, we apply our sensitivity analysis framework to pathway extraction in the intra-cellular networks that control T cell differentiation.
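The static (uniform-distribution) case of element influence has a concrete classical form: the influence of an input on a Boolean update function is the fraction of input states in which flipping that input flips the output. The dynamic analysis described above replaces the uniform average with a weighting by a simulation-estimated state distribution; the sketch below covers only the static case.

```python
# Influence of one input of a Boolean update function under a uniform
# state distribution: fraction of states where flipping that input
# flips the output.
from itertools import product

def influence(update_fn, n_inputs, index):
    flips = 0
    states = list(product([0, 1], repeat=n_inputs))
    for s in states:
        flipped = list(s)
        flipped[index] ^= 1  # toggle the chosen input
        if update_fn(s) != update_fn(tuple(flipped)):
            flips += 1
    return flips / len(states)

AND = lambda s: s[0] & s[1]
OR = lambda s: s[0] | s[1]
XOR = lambda s: s[0] ^ s[1]
```

For a two-input AND, each input matters only when the other is 1, giving influence 0.5, while XOR gives every input influence 1; such per-input weights are what turn the model into the weighted directed graph mentioned above.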
Real biological and social data is increasingly being represented as graphs. Pattern-mining-based graph learning and analysis techniques report meaningful biological subnetworks that elucidate important interactions among entities. At the backbone of these algorithms is the enumeration of the pattern space. In this work, we propose a linear-space, linear-delay reverse-search-based algorithm for enumerating all connected induced subgraphs of an undirected graph. Building on this enumeration approach, we propose an algorithm for mining all maximal cohesive subgraphs that integrates vertex attributes with subgraph enumeration. To efficiently mine all maximal cohesive subgraphs, we propose two pruning techniques that remove futile search nodes from the enumeration tree. Experiments on synthetic and real graphs show the effectiveness of the proposed algorithm and the pruning techniques. On enumerating all connected induced subgraphs, our algorithm is several times faster than existing approaches. On dense graphs, the proposed approach is at least an order of magnitude faster than the best existing algorithm. Experiments on a protein-protein interaction network with cancer gene dysregulation profiles show that the reported cohesive subnetworks are biologically interesting.
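The enumeration task itself can be illustrated with the well-known ESU-style extension scheme, which yields every connected induced subgraph exactly once by anchoring each subgraph at its smallest vertex and only extending with "exclusive" neighbors. This is a compact sketch of the problem being solved, not the paper's linear-delay reverse-search algorithm.

```python
# Enumerate all connected induced subgraphs of an undirected graph,
# each exactly once, via an ESU-style extension scheme. `adj` maps
# each vertex to the set of its neighbors.
def connected_induced_subgraphs(adj):
    results = []

    def neighborhood(sub):
        return {u for v in sub for u in adj[v]} - sub

    def extend(sub, ext, root):
        results.append(frozenset(sub))
        while ext:
            w = ext.pop()
            # Exclusive neighbors of w: larger than the root, and not
            # already inside or adjacent to the current subgraph.
            excl = {u for u in adj[w]
                    if u > root and u not in sub
                    and u not in neighborhood(sub)}
            extend(sub | {w}, ext | excl, root)
            # w stays removed from ext, so no subgraph is emitted twice.

    for v in sorted(adj):
        extend({v}, {u for u in adj[v] if u > v}, v)
    return results
```

A triangle has seven connected induced subgraphs (three vertices, three edges, one triangle) and a three-vertex path has six, which makes a handy sanity check for any such enumerator.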
Recent studies have identified different microbiome profiles in healthy and sick individuals for a variety of diseases; this suggests that the microbiome profile can be used as a diagnostic tool in identifying the disease states of an individual. However, the high-dimensional nature of metagenomic data poses a significant challenge to existing machine learning models. In this paper, we propose MetaNN (i.e., classification of host phenotypes from Metagenomic data using Neural Networks), a neural network framework that utilizes a new data augmentation technique to mitigate the effects of overfitting. We show that MetaNN outperforms existing state-of-the-art models in terms of classification accuracy for both synthetic and real metagenomic data. These results pave the way towards developing personalized treatments for microbiome-related diseases.
Schizophrenia and autism are examples of polygenic diseases caused by a multitude of genetic variants. Recently, both diseases have been associated with disrupted neuron motility and migration patterns, suggesting that aberrant cell motility is a phenotype for these neurological diseases. Abnormal neuronal development is central to both schizophrenia and autism, which critically implicates these cell motility perturbations in the disease mechanisms. However, despite the genetic characterization of these diseases by large-scale genome-wide association studies, extracting causality for symptoms and pathophysiology from these data remains challenging due to the large number of genes implicated and the additive effect the mutations have on the cellular processes. We present a network-based machine learning approach to identify genes implicated in both a disease of interest (e.g., schizophrenia or autism) and a disease phenotype (e.g., aberrant cell motility). We use a brain-specific functional interaction network to identify which genes are most centrally implicated in a polygenic disease based on functional similarity. Our algorithm identifies genes that are near known disease genes and cell motility genes in the network. Top schizophrenia candidates include many Protein Phosphatase 1 subunits and Lysyl Oxidase, which are promising genes for follow-up experimental validation. Candidate genes predicted by our method suggest testable hypotheses about these genes' role in cell motility regulation, offering a framework for generating predictions for experimental validation.
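The idea of ranking genes by network proximity to known disease genes and cell motility genes is commonly realised as a random walk with restart; the sketch below is an illustrative stand-in for the authors' network-based algorithm, with all names and parameters hypothetical:

```python
import numpy as np

def network_propagation(A, seeds, restart=0.5, n_iter=100):
    """Random-walk-with-restart scores over a functional network.

    A: (n, n) symmetric adjacency matrix (every node assumed to have at
    least one edge); seeds: indices of known disease or motility genes.
    Returns a steady-state score per gene; genes near the seeds in the
    network score highest and become candidates for follow-up.
    """
    W = A / A.sum(axis=0, keepdims=True)      # column-stochastic walk matrix
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)        # restart distribution on seeds
    p = p0.copy()
    for _ in range(n_iter):
        p = (1 - restart) * W @ p + restart * p0
    return p

# toy 4-gene path network, seeded at gene 0: scores decay with distance
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = network_propagation(A, seeds=[0])
```

In practice two such score vectors (one seeded with disease genes, one with motility genes) could be combined to prioritize genes implicated in both.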
\section{Background} Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease that primarily affects motor neurons in both the brain and the spinal cord \cite{zarei2015comprehensive}. Several independent studies confirmed the deposition of TAR DNA-binding protein (TDP)-43 aggregates in the cytoplasm of the affected cells, suggesting a role for TDP-43 in ALS. However, the molecular mechanism of TDP-43 in ALS is not well established. It was only recently reported that TDP-43 contributes to pre-mRNA splicing by inhibiting cryptic exons \cite{ling2015tdp}. While this is a very interesting observation, it raises several intriguing questions about TDP-43-dependent splicing errors, such as preferential 5'/3' errors, enrichment of specific alternative splicing events, and intron retention. A systematic characterization and decoding of TDP-43 cryptic splicing is critical to a better understanding of the molecular pathogenesis of ALS. However, none of the existing computational approaches is designed specifically for cryptic splice characterization, which underscores the need for a robust, genome-wide scalable pipeline. \section{Results} In this study we applied CrypSplice \cite{tan2016extensive}, an in-house cryptic splice site detection and characterization method, to several publicly available TDP-43 datasets. Every junction is subjected to a beta-binomial test and characterized to aid molecular inferences. Exploring 18 TDP-43 knock-down samples across different tissues and cell lines, we found that genes targeted by cryptic splicing are enriched in cell cycle, autophagy, and protein folding. While this is in good agreement with previous studies, we uncovered a preferential enrichment of 5' splice site errors, indicating a U1 spliceosome-mediated mechanism. To infer a co-splicing network, the same cryptic splicing characterization was performed on a total of 236 samples covering 118 RNA-binding proteins (RBPs) \cite{yalamanchili2017data}.
A network of RBPs was constructed based on induced cryptic load similarity with respect to the TDP-43 cryptic signature, validated by TDP-43 binding (eCLIP-Seq). We found other reported ALS genes, such as FUS, HNRNPA1, and TAF15, enriched among the network neighbors of TDP-43. Novel (putative) ALS-causing RBPs are identified and prioritized using network propagation, guilt-by-association, and cryptic signature similarity. \section{Conclusion} Through a comprehensive CrypSplice analysis we uncovered a preferential enrichment of TDP-43-dependent 5' splice site errors. Network propagation and prioritization over the RBP cryptic network yielded a list of putative novel ALS-associated genes. Further follow-up through genetic screening could discover more ALS-causing genes and aid in decoding the underlying molecular mechanism.
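The per-junction beta-binomial test mentioned above can be sketched in a few stdlib-only lines; the upper-tail form and the background parameters are illustrative assumptions, not CrypSplice's exact implementation:

```python
from math import lgamma, exp

def betabinom_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial(n, a, b) distribution, via log-gammas."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_beta = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

def junction_pvalue(k, n, a, b):
    """Upper-tail p-value: probability of observing >= k cryptic-junction
    reads out of n junction-spanning reads under a background
    beta-binomial(a, b) model of the cryptic read fraction."""
    return sum(betabinom_pmf(i, n, a, b) for i in range(k, n + 1))

# toy junction: 8 cryptic reads out of 100, diffuse background (a=1, b=9)
p = junction_pvalue(8, 100, 1.0, 9.0)
```

Junctions with small p-values under the background model would then be flagged as significant cryptic splicing events and characterized further (5' vs. 3', event type, intron retention).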
The rapid accumulation of protein structures presents a unique set of challenges and opportunities in the analysis, comparison, modeling, and prediction of macromolecular structures and their interactions. This workshop brought together researchers with expertise in bioinformatics, structural computational biology, machine learning, optimization, and high performance computing, to disseminate new results, and discuss techniques and research problems.
Due to the central role that tertiary structure plays in determining protein function, resolving protein tertiary structures is an integral research thrust in both wet and dry laboratories. Dry laboratories have primarily focused on small- to medium-size proteins. However, proteins central to human biology and human health are often quite complex, containing multiple domains and consisting of thousands of amino acids. Such proteins are challenging for various reasons, including the inability to crystallize. We present a case study of structure determination for the Rift Valley fever virus L-protein, a large, multi-domain protein with currently no available tertiary structure. We employ this case study as an emerging paradigm and demonstrate how to leverage the rich and diverse landscape of bioinformatics tools for building tertiary structure models for multi-domain proteins with thousands of amino acids.
Significant efforts are devoted to resolving biologically-active structures in wet and dry laboratories. In particular, due to hardware and algorithmic innovations, computational methods can now obtain thousands of structures that populate the structure space of a protein of interest. With such advances, attention turns to organizing computed structures to extract the underlying organization of the structure space in service of highlighting biologically-active structural states. In this paper we report on the promise of leveraging community detection methods, designed originally to detect communities in social networks, to organize protein structure spaces probed in silico. We report on a principled comparison of such methods along several metrics and on proteins of diverse folds and lengths. More importantly, we present a rigorous evaluation in the context of decoy selection in template-free protein structure prediction. The presented results make the case that network-based community detection methods warrant further investigation to advance analysis of protein structure spaces for automated selection of biologically-active structures.
Cryo-electron microscopy (cryo-EM) is an emerging biophysical technique for structural determination of protein complexes. However, accurate detection of secondary structures is still challenging when cryo-EM density maps are at medium resolutions (5-10 Å). Most existing methods are image processing methods that do not fully utilize the available images in the cryo-EM database. In this paper, we present a deep learning approach to segment secondary structure elements, such as helices and beta-sheets, from medium-resolution density maps. The proposed 3D convolutional neural network is shown to detect secondary structure locations with an F1 score between 0.79 and 0.88 for six simulated test cases. The architecture was also applied to an experimentally-derived cryo-EM density map with good accuracy.
A novel method for particle picking in cryo-electron microscopy (cryo-EM) based on a convolutional neural network (CNN) is proposed. The key to successful 3D reconstruction lies in the ability to pick as many particles as possible before 2D class averaging. In most of the existing studies, particles are selected either manually or semi-automatically, which can be time-consuming and laborious. We aim to pick particles fully automatically to improve the picking efficiency without any human intervention. A new CNN model is designed, and two data preprocessing methods, image sharpening and histogram equalization, are employed to improve the model's performance. The experimental results show that the proposed method has a better recall score than existing algorithms. Moreover, the proposed model is validated and compared using various EM data. With the fully automatically picked particles, 2D class averaging can be processed efficiently to further select good-quality particles. Subsequently, 3D reconstruction can be performed.
Protein surface shape plays an essential role in various functions of proteins. In order to efficiently investigate protein function and evolutionary history, we introduce a global protein surface shape representation called EMNets. EMNets provides an effective and accurate way of representing protein surfaces and searching for similar ones, and thus contributes to biomedical research. The method uses a Convolutional Autoencoder (CAE) neural network to learn the geometric information of three-dimensional (3D) density maps in a data-driven manner. Our method effectively represents a 3D cryo-electron microscopy density map by a descriptor consisting of only 256 numeric variables, called the EMNets descriptor. Based on the EMNets descriptor, we are able to retrieve similar protein surfaces using a k-nearest-neighbor algorithm in real time. The search results for protein surfaces represented with the EMNets descriptor show high agreement with the existing Combinatorial Extension (CE) algorithm for sequence and structure similarity search. Overall, EMNets is a powerful tool for comparing 3D protein structures obtained by cryo-electron microscopy.
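Once each map is reduced to a 256-number descriptor, the real-time retrieval step is a k-nearest-neighbour search in descriptor space; a brute-force sketch follows (illustrative names and random data, not the EMNets codebase):

```python
import numpy as np

def knn_search(descriptors, query, k=5):
    """Return indices of the k nearest descriptors to the query.

    descriptors: (N, 256) array of database descriptors;
    query: (256,) descriptor of the query surface.
    Brute-force Euclidean search; for very large N a KD-tree or an
    approximate index would typically be used instead.
    """
    dists = np.linalg.norm(descriptors - query, axis=1)
    return np.argsort(dists)[:k]

# toy database of random descriptors (256 numbers per map, as in EMNets)
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 256))
hits = knn_search(db, db[42], k=3)  # the query itself should rank first
```

The compactness of the descriptor is what makes this search fast enough to run interactively.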
Protein-peptide binding interactions play an important role in cellular regulation and are functionally important in many diseases. If no prior knowledge of the location of a binding site is available, prediction may be needed as a starting point for further modeling or docking. Existing approaches either require the peptide sequence to be known in advance or offer unsatisfactory predictive performance. Here we propose P2Rank-Pept, a new machine learning based method for prediction of peptide-binding sites from protein structure. We show that our method significantly outperforms other evaluated methods, including the most recent structure-based prediction method SPRINT-Str published last year (AUC: $0.85 > 0.78$). P2Rank-Pept utilizes local structural and sequence information, including evolutionary conservation, and builds a prediction model based on a Random Forest classifier. The novelty of our approach lies in using points on the solvent accessible surface as the unit of classification (as opposed to the typical approach of focusing on amino acid residues), and in the application of the robust technique of Bayesian optimization to systematically optimize arbitrary parameters of the algorithm. Our results assert that the P2Rank software package is a viable framework for developing top-performing binding-site prediction methods for different types of binding partners.
Cryo-electron microscopy (cryo-EM) has become a major technique for protein structure determination. Many atomic structures have been derived from cryo-EM density maps at about 3 Å resolution. Side-chain conformations are well determined in density maps at super-resolutions of 1-2 Å. It is desirable to have a statistical method to detect anomalous side-chains without a super-resolution density map. In this study, we analyzed structures derived from X-ray density maps with resolutions better than 1.5 Å and those from cryo-EM density maps with 2-4 Å and 4-6 Å resolutions, respectively. We introduce a histogram-based outlier score (HBOS) for anomaly detection in protein models built from cryo-EM density maps. This method uses statistics derived from the X-ray dataset (<1.5 Å) as the reference and combines five features: the distal block distance, side-chain length, phi, psi, and first chi angle of the residues. Higher percentages of anomalies were observed in the cryo-EM models than in the super-resolution X-ray models. Lower percentages of anomalies were observed in cryo-EM models deposited after January 2017 than in those before 2017.
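A histogram-based outlier score is simple to sketch: per-feature histograms approximate a reference density, and the score sums negative log densities so that residues falling in sparse bins score high. For brevity this sketch builds the histograms from the input data itself, whereas the method described above would build them from the X-ray reference set; all names are illustrative:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Histogram-Based Outlier Score for each row of X.

    X: (n_samples, n_features) array, e.g. the five residue features
    named in the abstract (distal block distance, side-chain length,
    phi, psi, first chi angle).  Higher score = more anomalous.
    """
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        hist = np.maximum(hist, 1e-12)          # avoid log(0) in empty bins
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += -np.log(hist[idx])
    return scores

# toy example: 99 typical residues plus one with extreme feature values
X = np.vstack([np.zeros((99, 2)), [[10.0, 10.0]]])
s = hbos_scores(X)
```

Because the per-feature densities are independent, the score decomposes per feature, which helps explain which geometric property made a side-chain anomalous.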
The adaptive immune system is a defense system against repeated infection. In order to trigger the immune response, antigen peptides from the infecting agent must first be recognized by the Major Histocompatibility Complex (MHC) proteins. Identifying peptides that bind to MHC class II is thus a critical step in vaccine development. We hypothesize that comparing individual subsites of the peptide binding groove could predict the individual amino acids of possible antigens. This modularized approach to individual subsites could reduce the amount of training data needed for accurate classification while also reducing the computing times associated with molecular simulation and docking. To test this hypothesis, we evaluated the capability of two classification techniques and multiple modular representations of the MHC subsites to correctly classify the binding preference categories of P1 subsites of MHC class II structures. Our results show that the average accuracies are 0.87 for K-means and 0.95 for SVM across all feature vector configurations. Our results demonstrate that accurate predictions on individual binding subsites are possible, pointing to larger-scale applications predicting whole-peptide preferences.
Research in the field of HIV transmission has yet to provide a vaccine for this formidable virus. Though progress has been made to extend the lives of those chronically infected, a solution to the transmission of the disease remains elusive. Previous studies involving electrostatic surface charge analysis revealed the sensitivity of gp120 envelope (Env) protein function to changes in pH across levels consistent with those found in the human body. A prototype computational approach was developed and found to agree with these results. A refined process was developed capable of classifying Env sequences/structures through machine learning techniques. We extend this analytical procedure to encompass residue-level analysis and include minimization steps to ensure the integrity of the protein models. Additionally, the process has been enhanced with advanced data compression techniques to allow for more in-depth analysis of the systems. In this research we explore a new technique, termed electrostatic variance masking (EVM), that reveals what we hypothesize to be the mechanistic residues responsible for the pH sensitivity of the Env binding site. The data imply that a conserved set of core residues may be responsible for modulating the binding process under varying environmental conditions, mainly involving pH.
Cryo-electron microscopy (cryo-EM) is becoming the imaging method of choice for determining protein structures. Many atomic structures have been resolved based on an exponentially growing number of published three-dimensional (3D) high-resolution cryo-EM density maps. The resolution value claimed for a reconstructed 3D density map has been a topic of scientific debate for many years. The Fourier Shell Correlation (FSC) is the currently accepted cryo-EM resolution measure, but it can be subjective and has its own limitations. The FSC indicates the quality of the experimental maps but not the amount of geometric and volumetric feature detail present in the 3D map. In this study, we propose supervised deep learning methods to extract representative 3D features at high, medium, and low resolutions from simulated protein density maps and build classification models that objectively validate the resolutions of experimental 3D cryo-EM maps. Specifically, we build classification models based on dense artificial neural network (DNN) and 3D convolutional neural network (3D CNN) architectures. The trained models can classify a given 3D cryo-EM density map into one of three resolution levels: high, medium, or low. The DNN model achieved 92.73% accuracy and the 3D CNN model achieved 99.75% accuracy on simulated test maps. Applying the DNN and 3D CNN models to thirty experimental cryo-EM maps achieved an agreement of 60.0% and 56.7%, respectively, with the authors' published resolution values of the density maps. The results suggest that deep learning can be utilized to potentially improve the resolution validation process of experimental cryo-EM maps.
Molecular dynamics (MD) simulation is a powerful technique for sampling the conformational landscape of natively folded proteins (NFPs) and structurally dynamic intrinsically disordered proteins (IDPs). NFPs and IDPs can be viewed as nonlinear dynamical systems that exercise available degrees of freedom to explore their energetically-accessible conformation landscape. Dimensionality estimators have emerged as useful tools to characterize nonlinear dynamical systems in other domains, but their application to MD simulation has been limited due to thermal noise and a lack of ground-truth data. We develop a series of increasingly complex biopolymer models which exhibit a range of dynamics we seek to characterize in MD simulations (stochastic dynamics, helical structures, partially folded states, and correlated motions) and are of known dimensionality. We utilize the maximum-likelihood dimension (MLD) estimator to investigate the effects of thermal noise and noise-smoothing techniques on the estimates obtained from the polymer models and MD simulations of two NFPs and two IDPs. We find that under certain noise/smoothing conditions, the MLD over/under-estimates the true dimensionality of the models in a predictable manner, allowing us to relate differences between MLD estimates to differences between NFP and IDP motions for classification of biomolecular systems based on their dynamics.
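The maximum-likelihood dimension estimator referenced above is commonly implemented following Levina and Bickel, using distances to the k nearest neighbours of each sample; the sketch below averages the inverse local estimates (a common correction) and uses illustrative synthetic data, not the authors' MD trajectories:

```python
import numpy as np

def mle_dimension(X, k=10):
    """Maximum-likelihood intrinsic dimension estimate (Levina-Bickel).

    X: (n, D) array of samples (e.g. flattened MD conformations).
    For each point, 1/m_k(x) = (1/(k-1)) * sum_{j<k} log(T_k / T_j),
    where T_j is the distance to the j-th nearest neighbour; the global
    estimate is the inverse of the mean of these local inverses.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # exclude self-distances
    knn = np.sort(dists, axis=1)[:, :k]       # T_1..T_k for each point
    logs = np.log(knn[:, -1:] / knn[:, :-1])  # log(T_k / T_j), j < k
    inv_m = logs.mean(axis=1)                 # 1 / m_k(x) per point
    return 1.0 / inv_m.mean()

# data lying on a 2-D plane embedded in 10-D should score near 2
rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
dim = mle_dimension(Z, k=10)
```

Thermal noise in MD data inflates such estimates, which is why the noise-smoothing conditions studied above matter for interpreting the raw numbers.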
Even a single amino acid substitution in a protein can be the cause of a debilitating disease. Experimentally studying the effects of all possible multiple mutations in a protein is infeasible since it requires a combinatorial number of mutants to be engineered and assessed. Computational methods for studying the impact of single amino acid substitutions do not scale to the number of mutants that are possible with two amino acid substitutions. We present an approach for reducing the number of mutation samples needed to predict the impact of pairwise amino acid substitutions. We evaluate the effectiveness of our method by generating exhaustive mutations in silico for 8 proteins with 2 amino acid substitutions, analyzing the mutants via rigidity analysis, and comparing the predictions from a sample of the mutants to those from the exhaustive dataset. We show it is possible to approximate the effect of the two amino acid substitutions using as few as 25% of the exhaustive mutations, which is further improved by imposing a low-rank constraint.
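One common way to impose a low-rank constraint on a partially sampled pairwise-effect matrix is iterative SVD imputation; this sketch is an illustrative stand-in (with synthetic data and hypothetical names), not the authors' method:

```python
import numpy as np

def lowrank_complete(M, mask, rank=2, n_iter=200):
    """Fill the missing entries of M (where mask is False) with a
    rank-`rank` approximation, keeping observed entries fixed.

    Iterative hard-impute: alternately re-project the current estimate
    onto the top singular subspace and restore the observed entries.
    M: (p, q) matrix of measured pairwise mutant effects;
    mask: boolean array marking the sampled (observed) entries.
    """
    X = np.where(mask, M, M[mask].mean())     # initialise missing with mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(mask, M, low)            # keep observed entries fixed
    return X

# toy example: a rank-2 "pairwise effect" matrix observed at ~25% of entries
rng = np.random.default_rng(2)
true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))
mask = rng.random(true.shape) < 0.25
est = lowrank_complete(true, mask, rank=2)
```

When the true effect matrix is approximately low-rank, the completed entries approximate the unmeasured double mutants far better than a naive mean fill.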
It is our great pleasure to welcome you to the ACM-BCB 2018 ParBio workshop. We received 3 submissions from around the world covering a broad range of topics. We evaluated them for relevance, quality, and novelty, selecting all 3 as full papers. We took into account the coverage of the different areas related to ParBio as well as the potential audience, scheduling the presentations in a single day with minimal audience interest overlap. ParBio will take place in the morning and will include the following three presentations: (1) A Cooperative Vehicle Routing Algorithm for Logistic Management in Healthcare; (2) A Voice-Aware System for Vocal Wellness; and (3) Deep Learning Based Medical Diagnosis System Using Multiple Data Sources.
In order to reduce the cost of healthcare processes, optimization systems are used to optimize logistics in healthcare. Algorithms for solving the so-called Vehicle Routing Problem (VRP) are increasingly applied in healthcare systems requiring the movement of nursing/medical staff or patients. In this paper, we introduce a novel software platform that uses a cooperative vehicle routing algorithm and is able to reduce transportation costs in healthcare applications requiring the movement of nursing/medical staff. The COOP_VR platform adapts an existing VRP algorithm to the healthcare context and allows cooperation between two independent healthcare organizations (shippers) that manage their own vehicle fleets in a given geographic area. Preliminary simulation results show that both healthcare organizations can reduce costs by between 1% and 21% of their initial transportation costs.
Phonemic alterations can indicate possible pathologies with potentially serious repercussions from both a psychological and a social point of view. Nowadays, the level of population awareness about these issues is rather low, and mild dysfunctions are neglected. The main goal of this work is to show the design and implementation of an innovative voice-awareness system able to train and monitor the vocal apparatus for any individual. The system supports users in: (i) monitoring the voice to avoid overload of the phonatory apparatus; (ii) giving indications to prevent strain or fatigue; (iii) analyzing vocal signals to identify possible pathologies; and (iv) monitoring the voice during the rehabilitation phase in pre- and post-surgical treatments. The system has been implemented with IBM Watson services and is designed for a wide range of users.
Recently, many researchers have conducted data mining over medical data to uncover hidden patterns and use them to learn prediction models for clinical decision making and personalized medicine. While such healthcare learning models can achieve encouraging results, they seldom incorporate existing expert knowledge into their frameworks, and hence prediction accuracy for individual patients can still be improved. However, expert knowledge spans various websites and multiple databases with heterogeneous representations and hence is difficult to harness for improving learning models. In addition, patients' queries at medical consult websites are often ambiguous in their specified terms, and hence the returned responses may not contain the information they seek. To tackle these problems, we first design a knowledge extraction framework that can generate an aggregated dataset to characterize diseases by integrating heterogeneous medical data sources. Then, based on the integrated dataset, we propose an end-to-end deep learning based medical diagnosis system (DL-MDS) to provide disease diagnosis for authorized users. Evaluations on real-world data demonstrate that our proposed system achieves good performance on disease diagnosis for a diverse set of patients' queries.
Technological advancements have given us the ability to sequence genomes at great depth and, consequently, have generated exponential growth in data. The National Cancer Institute Cloud Resources (NCICR), formerly the NCI Cancer Genomics Cloud Pilots, were developed with the goal of democratizing NCI-generated cancer genomic data and facilitating analysis by co-localizing cloud computing and petabyte-scale data. Based on commercial cloud architectures, the Cloud Resources offer users the flexibility to utilize tools in the form of Docker containers, and tools can be joined to create workflows described by the Common Workflow Language (CWL) or the Workflow Description Language (WDL). The application of the Cloud Resources has been expanded from cancer genomics to include proteomics, imaging, and metagenomics, and will include analyses involving other types of data in the future. The cloud environment has proven to be a cost-effective, reproducible, reusable, interoperable, and user-friendly alternative to high-performance computing, with minimal overhead and setup requirements. These production-ready and highly scalable platforms represent a necessary step in a publicly available toolset meant to support open and Findable, Accessible, Interoperable, Reusable (FAIR) scientific research. Through this demonstration workshop, participants have the opportunity to (1) learn about the basic features of the NCI Cloud Resources: (a) the Broad Institute FireCloud, (b) the Institute for Systems Biology Cancer Genomics Cloud, and (c) the Seven Bridges Genomics Cancer Genomics Cloud; (2) create interoperable, containerized tools; and (3) run genomic analysis workflows on the Cloud Resources.