| Call for | Submission Deadline | Notification of Acceptance |
| --- | --- | --- |
| Papers | May 22 | June 22 |
| Workshops | April 15 | May 02 |
| Tutorials | May 01 | May 09 |
| Highlights | May 31 | June 20 |
| Posters | May 14 | May 27 |
Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences, as is done within the Bayesian phylogenetic inference framework. In this study, we propose EvoVGM, a deep variational Bayesian generative model that jointly approximates the true posterior of local evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69, K80, and GTR. We train the model via a low-variance stochastic estimator and a gradient-ascent algorithm. We analyze the consistency and effectiveness of EvoVGM on synthetic sequence alignments simulated under several evolutionary scenarios and with different sizes. Finally, we highlight the robustness of a fine-tuned EvoVGM model using a sequence alignment of gene S of coronaviruses.
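As a concrete illustration of the substitution models named above, JC69 admits a closed-form transition probability; a minimal sketch (the function name and the uniform rate parameter `mu` are illustrative conventions, not part of EvoVGM itself):

```python
import math

def jc69_transition_prob(i, j, t, mu=1.0):
    """JC69 probability that nucleotide i is observed as nucleotide j
    after evolutionary time t, under a single uniform substitution rate mu."""
    decay = math.exp(-4.0 * mu * t / 3.0)
    if i == j:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay
```

Each row of the implied 4x4 matrix sums to one, and t = 0 yields the identity, which is a quick sanity check on any substitution-model implementation.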
We consider the problem of identifying viral reads in human host genome data. We pose the problem as open-set classification, since reads can originate from unknown sources such as bacterial and fungal genomes. Sequence-matching methods have low sensitivity in recognizing viral reads when the viral family is highly diverged. Hidden Markov models have higher sensitivity but require domain-specific training and are difficult to repurpose for identifying different viral families. Supervised learning methods can be trained with little domain-specific knowledge but have reduced sensitivity in open-set scenarios. We present DeepViFi, a transformer-based pipeline for detecting viral reads in short-read whole-genome sequencing data. At 90% precision, DeepViFi achieves 90% recall, compared to 15% for other deep learning methods. DeepViFi provides a semi-supervised framework to learn representations of viral families without domain-specific knowledge and to rapidly and accurately identify target sequences in open-set settings.
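A common baseline for the open-set setting described above, not DeepViFi's actual mechanism (which the abstract does not detail), is to reject low-confidence predictions as "unknown"; a hypothetical sketch:

```python
def open_set_predict(class_probs, threshold=0.9):
    """Return the arg-max class index only when its probability clears the
    threshold; otherwise return None, i.e. 'unknown source' (open-set reject)."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best if class_probs[best] >= threshold else None
```

Reads from unmodeled sources (e.g., bacterial or fungal genomes) tend to produce flatter class distributions, which this rule maps to the reject option.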
Information about genetic variations in either individual genomes or viral populations provides insight into genetic signatures of diseases and suggests directions for medical and pharmaceutical research. State-of-the-art sequencing platforms generate massive amounts of reads, with length varying from one technology to another, that provide the data needed for the reconstruction of haplotypes and viral quasispecies. On the one hand, high-throughput platforms are capable of providing enormous amounts of highly accurate but relatively short reads; the inability to bridge long genetic distances renders reconstruction with such reads challenging. On the other hand, the latest generation of sequencing technologies is capable of generating much longer reads, but those reads suffer from sequencing errors at a rate higher than that of short reads. This motivates the search for reconstruction methods capable of leveraging both the high accuracy of short reads and the phase-resolving power of long reads. We present a deep learning framework that relies on convolutional auto-encoders with a clustering layer to reconstruct individual haplotypes or viral populations from hybrid data sources. First, an auto-encoder for haplotype assembly / viral population reconstruction from short reads is pre-trained separately from another one utilizing long reads for the same task. The pre-trained models are then retrained simultaneously to enable decision fusion. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework outperforms state-of-the-art techniques for haplotype assembly and viral quasispecies reconstruction, and achieves significantly higher accuracy on those tasks than methods utilizing only one type of reads. Code is available at https://github.com/WuLoli/HybSeq.
Transcriptomic data stored in the Gene Expression Omnibus (GEO) serves thousands of queries per day, but a lack of standardized machine-readable metadata causes many searches to return irrelevant hits, impeding convenient access to useful data in the GEO repository. Here, we describe ArcheGEO, a novel end-to-end framework that improves results from the GEO Browser by automatically determining the relevance of these results. Unlike existing tools, ArcheGEO reports on the irrelevant results and provides reasoning for their exclusion. Such reasoning can be leveraged to improve annotations of metadata.
Obtaining informative representations of gene expression is crucial for predicting various downstream regulatory tasks such as promoter prediction and transcription factor binding site prediction. Nevertheless, current supervised learning with insufficient labeled genomes limits the generalization capability of training a robust predictive model. Recently, researchers have modeled DNA sequences by self-supervised training and transferred the pre-trained genome representations to various downstream tasks. Instead of directly shifting masked language learning to DNA sequence learning, we incorporate prior knowledge into genome language modeling representations. We propose a novel Motif-oriented DNA (MoDNA) pre-training framework, which is designed to be self-supervised and can be fine-tuned for different downstream tasks. MoDNA effectively learns semantic-level genome representations from enormous amounts of unlabelled genome data and is more computationally efficient than previous methods. We pre-train MoDNA on human genome data and fine-tune it on downstream tasks. Extensive experimental results on promoter prediction and transcription factor binding site prediction demonstrate the state-of-the-art performance of MoDNA.
Health disparities, or inequalities between different patient demographics, are becoming a crucial issue in medical decision-making, especially in Electronic Health Record (EHR) predictive modeling. To ensure fairness with respect to sensitive attributes, conventional studies mainly adopt calibration or re-weighting methods to balance performance among different demographic groups. However, we argue that these methods have limitations. First, they usually require a trade-off between the model's performance and fairness. Second, many of them attribute the existence of unfairness entirely to the data collection process, which lacks substantial evidence. In this paper, we provide an empirical study exploring the possibility of using a deconfounder to address the disparity issue in healthcare. Our study can be summarized in two parts. The first part is a pilot study demonstrating the exacerbation of disparity when unobserved confounders exist. The second part proposes a novel framework, Parity Medical Deconfounder (PriMeD), to deal with the disparity issue in healthcare datasets. Inspired by the deconfounder theory, PriMeD adopts a Conditional Variational Autoencoder (CVAE) to learn latent factors (substitute confounders) from observational data, and extensive experiments are provided to show its effectiveness.
As critically ill patients frequently develop anemia or coagulopathy, transfusion of blood products is a frequent intervention in Intensive Care Units (ICUs). However, inappropriate transfusion decisions made by physicians are often associated with an increased risk of complications and higher hospital costs. In this work, we aim to develop a decision support tool that uses available patient information for transfusion decision-making on three common blood products (red blood cells, platelets, and fresh frozen plasma). To this end, we adopt an off-policy batch reinforcement learning (RL) algorithm, namely discretized Batch Constrained Q-learning, to determine the best action (transfusion or not) given observed patient trajectories. Simultaneously, we consider different state representation approaches and reward design mechanisms to evaluate their impacts on policy learning. Experiments are conducted on two real-world critical care datasets: MIMIC-III and UCSF. On the MIMIC-III dataset, the learned transfusion policies matched true hospital policies well under both accuracy and weighted importance sampling evaluations. Furthermore, a combination of transfer learning (TL) and RL on the data-scarce UCSF dataset can provide up to 17.02% improvement in terms of accuracy, and up to 18.94% and 21.63% improvement in jump-start and asymptotic performance in terms of weighted importance sampling, averaged over three transfusion tasks. Finally, simulations on transfusion decisions suggest that the transferred RL policy could reduce patients' estimated 28-day mortality rate by 2.74% and decreased acuity rate by 1.18% on the UCSF dataset. In short, RL with appropriate patient state encoding and reward designs shows promise for treatment recommendations on blood transfusion, and can further optimize real-time treatment strategies by improving patients' clinical outcomes.
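Discretized Batch Constrained Q-learning restricts the greedy action to those well supported by the behaviour (here, clinician) policy; a minimal sketch of that action-selection rule (the threshold value and names are illustrative, not taken from the paper):

```python
def bcq_select_action(q_values, behavior_probs, tau=0.3):
    """Discrete-BCQ action selection: keep only actions whose estimated
    behaviour-policy probability is at least tau times that of the most
    likely action, then act greedily w.r.t. Q among the survivors."""
    max_p = max(behavior_probs)
    allowed = [a for a, p in enumerate(behavior_probs) if p >= tau * max_p]
    return max(allowed, key=lambda a: q_values[a])
```

The constraint keeps the learned policy close to actions actually observed in the batch, which matters in offline clinical settings where off-support actions cannot be evaluated safely.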
Complex deep learning models achieve high predictive performance in various clinical prediction tasks, but their inherent complexity makes it challenging to explain model predictions to clinicians and healthcare providers. Existing research on the explainability of deep learning models in healthcare has two major limitations: using post-hoc explanations and using raw clinical variables as the units of explanation, both of which are often difficult for humans to interpret. In this work, we designed a self-explaining deep learning framework that uses expert-knowledge-driven clinical concepts, or intermediate features, as the units of explanation. The self-explaining nature of our proposed model comes from generating both explanations and predictions within the same architectural framework via joint training. We tested our proposed approach on a publicly available Electronic Health Records (EHR) dataset for predicting patient mortality in the ICU. To analyze the performance-interpretability trade-off, we compared our proposed model with a baseline having the same set-up but without the explanation components. Experimental results suggest that adding explainability components to a deep learning framework does not impact prediction performance, and the explanations generated by the model can provide insights to clinicians to understand the possible reasons behind patient mortality.
Clinical EHR data is naturally heterogeneous and contains abundant sub-phenotypes. Such diversity creates challenges for outcome prediction with machine learning models, since it leads to high intra-class variance. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the performance benefit of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality on real-world COVID-19 EHR data covering over 7,000 patients admitted to a large, urban health system. Our method achieves a better AUROC prediction score of 0.872, outperforming the alternative pre-training models and traditional machine learning methods. Additionally, our method performs much better when the training data size is small (345 training instances).
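The embedded k-nearest-neighbour positive sampling idea can be sketched as follows; this is a simplified reading of the strategy, with names and the Euclidean distance choice assumed by us rather than taken from the paper:

```python
import math

def knn_positives(embeddings, labels, anchor_idx, k=2):
    """Candidate positives for contrastive-style supervised pre-training:
    the k same-class samples nearest to the anchor in the current
    embedding space, which tames intra-class variance from sub-phenotypes."""
    same_class = [i for i, y in enumerate(labels)
                  if y == labels[anchor_idx] and i != anchor_idx]
    same_class.sort(key=lambda i: math.dist(embeddings[anchor_idx], embeddings[i]))
    return same_class[:k]
```

Sampling positives from the same-class neighbourhood, rather than from the whole class, avoids pulling together patients from distant sub-phenotypes.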
Question Answering (QA) over clinical notes has gained a lot of attention in the past few years. Existing machine reading comprehension approaches in the clinical domain can only handle questions about a single block of clinical text and fail to retrieve information about multiple patients and their clinical notes. To handle more complex questions, we aim to create a knowledge base from clinical notes that links different patients and clinical notes, and to perform knowledge base question answering (KBQA). Based on the expert annotations available in the n2c2 dataset, we first created the ClinicalKBQA dataset, which includes around 9K QA pairs and covers questions about seven medical topics using more than 300 question templates. Then, we investigated an attention-based aspect reasoning (AAR) method for KBQA and analyzed the impact of different aspects of answers (e.g., entity, type, path, and context) on prediction. The AAR method achieves better performance due to its well-designed encoder and attention mechanism. From our experiments, we find that two aspects, type and path, enable the model to identify answers satisfying general conditions, producing lower precision and higher recall. The other two aspects, entity and context, limit the answers by node-specific information, leading to higher precision and lower recall.
We present ADAGIO, a new method for network-based disease gene prioritization that balances network interconnection structure with an embedding measure of network similarity. We show ADAGIO performs better than previous methods for recovering known disease genes in a recent benchmark set encompassing disease-associated genes for 22 polygenic diseases. We find ADAGIO discovers some interesting new disease gene candidates in both Alzheimer's and Parkinson's diseases.
Code, ranked lists of disease genes, and supplementary figures and tables appear at https://github.com/merterden98/ADAGIO.
During normal protein synthesis, the ribosome shifts along the messenger RNA (mRNA) by exactly three nucleotides for each amino acid added to the protein being translated. However, in special cases, the sequence of the mRNA somehow induces the ribosome to slip, which shifts the "reading frame" in which the mRNA is translated, and gives rise to an otherwise unexpected protein. Such "programmed frameshifts" are well-known in viruses, including coronavirus, and a few cases of programmed frameshifting are also known in cellular genes. However, there is no good way, either experimental or informatic, to identify novel cases of programmed frameshifting. Thus it is possible that substantial numbers of cellular proteins generated by programmed frameshifting in human and other organisms remain unknown.
Here, we build on prior works observing that data from ribosome profiling can be analyzed for anomalies in mRNA reading frame periodicity to identify putative programmed frameshifts. We develop a statistical framework to identify all likely (even for very low frameshifting rates) frameshift positions in a genome. We also develop a frameshift simulator for ribosome profiling data to verify our algorithm. We show high sensitivity of prediction on the simulated data, retrieving 97.4% of the simulated frameshifts. Furthermore, our method found all three of the known yeast genes with programmed frameshifts. Our results suggest there could be a large number of un-annotated alternative proteins in the yeast genome, generated by programmed frameshifting. This motivates further study and parallel investigations in the human genome.
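The periodicity-anomaly idea can be illustrated with a toy detector that compares the dominant reading frame of footprint 5' ends on either side of a candidate site; this is our simplification for illustration, and the paper's statistical framework is considerably more involved (and sensitive to much lower frameshifting rates):

```python
from collections import Counter

def frame_change(read_starts, candidate, window=30):
    """True if the dominant reading frame (position mod 3) of ribosome
    footprints differs between the windows before and after `candidate`,
    the signature a programmed frameshift leaves in profiling data."""
    up = Counter(p % 3 for p in read_starts
                 if candidate - window <= p < candidate)
    down = Counter(p % 3 for p in read_starts
                   if candidate <= p < candidate + window)
    if not up or not down:
        return None  # not enough coverage to make a call
    return up.most_common(1)[0][0] != down.most_common(1)[0][0]
```

A real method must additionally model noise in the frame signal and correct for multiple testing across every candidate position in the genome.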
Boolean Networks (BNs) play a crucial role in modeling, analyzing, and controlling biological systems. One of the most important problems on BNs is to compute all the possible attractors of a BN. There are two popular types of BNs, Synchronous BNs (SBNs) and Asynchronous BNs (ABNs). Although ABNs are considered more suitable than SBNs for modeling real-world biological systems, their attractor computation is more challenging. Several methods have been proposed for computing attractors of ABNs, but none of them can robustly handle large and complex models. In this paper, we propose a novel method called mtsNFVS for exactly computing all the attractors of an ABN based on its minimal trap spaces, where a trap space is a subspace of the state space that no path can leave. The main advantage of mtsNFVS lies in opening the chance to reach easy cases for the attractor computation. We then evaluate mtsNFVS on a set of large and complex real-world models with crucial biological motivations, as well as on a set of randomly generated models. The experimental results show that mtsNFVS can easily handle large-scale models and that it completely outperforms the state-of-the-art method CABEAN as well as other recently notable methods.
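To make the notion of a trap space concrete, here is a hypothetical brute-force check that a partial assignment is indeed a trap space; it enumerates the whole subspace and so is feasible only for small networks, unlike the minimal-trap-space machinery mtsNFVS relies on:

```python
import itertools

def is_trap_space(update_fns, subspace):
    """update_fns: dict var -> function(state_dict) -> 0/1.
    subspace: dict fixing some variables to 0/1. The subspace is a trap
    space iff every fixed variable keeps its fixed value under its update
    function in every state of the subspace, so no transition (synchronous
    or asynchronous) can leave it."""
    free = [v for v in update_fns if v not in subspace]
    for bits in itertools.product((0, 1), repeat=len(free)):
        state = dict(subspace, **dict(zip(free, bits)))
        for v, fixed_value in subspace.items():
            if update_fns[v](state) != fixed_value:
                return False
    return True
```

For example, in the two-variable network x' = x OR y, y' = x AND y, the subspace {x = 1} is a trap space, while {y = 1} is not (y falls to 0 whenever x = 0).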
Various algorithmic and statistical approaches have been proposed to uncover functionally coherent network motifs consisting of sets of genes that may occur as compensatory pathways (called Between Pathway Modules, or BPMs) in a high-throughput S. cerevisiae genetic interaction network. We extend our previous Local-Cut/Genecentric method to also make use of a spectral clustering of the physical interaction network, and uncover some interesting new fault-tolerant modules.
Large-scale data often suffer from the curse of dimensionality and the constraints associated with it; therefore, dimensionality reduction methods are often applied before most machine learning pipelines. In this paper, we directly compare the performance of autoencoders as a dimensionality reduction technique (via the latent space) to other established methods: PCA, LASSO, and t-SNE. To do so, we use four distinct datasets that vary in the types of features, metadata, labels, and size, to robustly compare the different methods. We test prediction capability using both Support Vector Machines (SVM) and Random Forests (RF). Significantly, we conclude that autoencoders are a dimensionality reduction architecture equivalent to the previously established methods, and they often outperform those methods in both prediction accuracy and time performance when condensing large, sparse datasets.
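The close relationship between autoencoders and PCA that makes this comparison natural can be seen with a tiny tied-weight linear autoencoder, which gradient descent drives toward the data's first principal direction; this is a deliberately minimal, dependency-free sketch under our own assumptions (2-D data, 1-D latent, fixed initialization), not the architecture evaluated in the paper:

```python
def train_linear_autoencoder(data, lr=0.05, epochs=500):
    """1-D latent with tied weights: z = w.x, reconstruction = z*w.
    Minimizing squared reconstruction error drives w toward a unit
    vector along the first principal direction of the data."""
    w = [0.1, 0.2]  # small deterministic init; 2-D inputs assumed
    for _ in range(epochs):
        for x in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            err = [xi - z * wi for wi, xi in zip(w, x)]
            ew = sum(ei * wi for ei, wi in zip(err, w))
            # exact gradient of 0.5*|x - z*w|^2 with respect to each w_i
            w = [wi + lr * (xi * ew + z * ei)
                 for wi, xi, ei in zip(w, x, err)]
    return w

# toy data lying along the direction (1, 1)
data = [(a, a) for a in (-1.0, -0.5, 0.5, 1.0)]
w = train_linear_autoencoder(data)
```

With nonlinear activations and deeper encoders, the latent space can capture structure PCA cannot, which is what the paper's comparison probes.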
Building models for health prediction based on Electronic Health Records (EHR) has become an active research area. EHR patient journey data consists of each patient's time-ordered clinical events/visits. Most existing studies focus on modeling long-term dependencies between visits without explicitly taking short-term correlations between consecutive visits into account; in such models, irregular time intervals, incorporated as auxiliary information, are fed into health prediction models to capture latent progressive patterns of patient journeys. We present a novel deep neural network with four modules that account for the contributions of various variables to health prediction: i) the Stacked Attention module strengthens the deep semantics in clinical events within each patient journey and generates visit embeddings; ii) the Short-Term Temporal Attention module models short-term correlations between consecutive visit embeddings while capturing the impact of time intervals within those embeddings; iii) the Long-Term Temporal Attention module models long-term dependencies between visit embeddings while capturing the impact of time intervals within those embeddings; iv) finally, the Coupled Attention module adaptively aggregates the outputs of the Short-Term and Long-Term Temporal Attention modules to make health predictions. Experimental results on MIMIC-III demonstrate the superior predictive accuracy of our model compared to existing state-of-the-art methods, as well as the interpretability and robustness of the approach. Furthermore, we find that modeling short-term correlations contributes to the generation of local priors, leading to improved predictive modeling of patient journeys.
Vital signs (e.g., heart and respiratory rate) are indicative of health status. Efforts have been made to extract vital signs using radio frequency (RF) techniques (e.g., Wi-Fi, FMCW, UWB), which offer a non-contact solution for continuous and ubiquitous monitoring without users' cooperative efforts. While RF-based vital signs monitoring is user-friendly, its robustness faces two challenges. On the one hand, the RF signal is modulated by the periodic chest wall displacement due to heartbeat and breathing in a nonlinear manner. It is inherently hard to identify the fundamental heart and respiratory rates (HR and RR) in the presence of their higher-order harmonics and the intermodulation between HR and RR, especially when they have overlapping frequency bands. On the other hand, inadvertent body movements may disturb and distort the RF signal, overwhelming the vital signals and thus inhibiting the parameter estimation of the physiological movements (i.e., heartbeat and breathing). In this paper, we propose DeepVS, a deep learning approach that addresses the aforementioned challenges of non-linearity and inadvertent movements for robust RF-based vital signs sensing in a unified manner. DeepVS combines 1D CNN and attention models to exploit local features and temporal correlations. Moreover, it leverages a two-stream scheme to integrate features from both the time and frequency domains. Additionally, DeepVS unifies the estimation of HR and RR with a multi-head structure, which adds only limited overhead (<1%) to the existing model, compared to doubling the overhead by using two separate models for HR and RR. Our experiments demonstrate that DeepVS achieves 80th-percentile HR/RR errors of 7.4/4.9 beats/breaths per minute (bpm) on a challenging dataset, compared to 11.8/7.3 bpm for a non-learning solution. Besides, an ablation study was conducted to quantify the effectiveness of DeepVS's components.
Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.
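For a linear SEM with a known parent, extracting the error terms reduces to taking residuals of a least-squares fit, and the sample with the largest residual is the natural root-cause candidate; this is a toy single-edge version of that pipeline (the Shapley-value attribution step is omitted, and the names are ours):

```python
def extract_errors(parent, child):
    """Estimate the linear edge coefficient by least squares, then return
    the residuals, i.e. the recovered exogenous error terms of the child
    variable in the structural equation model."""
    n = len(parent)
    mp, mc = sum(parent) / n, sum(child) / n
    cov = sum((a - mp) * (b - mc) for a, b in zip(parent, child))
    var = sum((a - mp) ** 2 for a in parent)
    beta = cov / var
    return [(c - mc) - beta * (p - mp) for p, c in zip(parent, child)]
```

A patient-specific "shock" shows up as an outlying error term even when the group-level association between the variable and the diagnosis is weak.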
Timely identification of individuals with a high risk of imminent acute events in long-term care facilities can aid in reducing the frequency or severity of such events and lead to safer residential environments. Specifically, an interval-based classification of mobility behavior (i.e., the real-time pattern of walking and physical activities in older adults) has been used for early recognition and prevention of acute events such as falls, delirium, and urinary tract infections. It has also been shown that supplementing such temporal mobility behavior data with static cognitive condition information (such as test scores) can yield better prediction results. However, classifying such multi-modal (static+time-series) data is a challenging task as it requires simultaneously taking different similarity relationships into account. In this work, we present an unsupervised clustering technique for classifying this type of multi-modal data points via jointly optimizing separate objective functions associated with the static and time-series parts. We show that our customized deep learning pipeline achieves competitive or superior results compared to several recent clustering baselines when studied on a few generic tasks aiming at clustering time-series data using both static and time-series data. Following this, we show that our clustering model can be used to cluster movement patterns into clinically meaningful clusters that can effectively capture the risk of near future acute events.
CD4+ T-cell receptors recognize peptide-MHCII complexes displayed on the surface of antigen-presenting cells to induce an immune response. A fundamental problem in immunology is to characterize which peptides (i.e., epitopes) in an antigen induce such a response; this is the problem of computational epitope prediction. To be presented in the form of a peptide-MHCII complex, a peptide must satisfy two important criteria: it should be processed from an antigen so that it is available in the pool of peptides to which MHCII can bind, and it should have a sufficiently high binding affinity to MHCII molecules to form stable complexes. The latter phenomenon has been studied widely and used almost exclusively for epitope prediction. In prior work, we developed methods for modeling antigen processing and showed that it has significant predictive power for epitopes. In this paper, we propose an integer linear programming (ILP) approach to combine the contributions of antigen processing and peptide binding, providing a holistic and flexible framework for epitope prediction. We validate our results on data sets comprising antigens associated with tumors and pathogens and show consistent enrichment and improvement in accuracy over other methods.
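How processing and binding contributions might be combined can be illustrated with a brute-force stand-in for the ILP; the real formulation carries constraints we omit here, and the weighting knob `alpha` is our own assumption, not the paper's objective:

```python
import itertools

def select_epitopes(processing, binding, k, alpha=0.5):
    """Choose k candidate peptides maximizing a weighted sum of antigen
    processing and MHCII binding scores. Exhaustive search over subsets
    stands in for solving the integer linear program."""
    combined = [alpha * p + (1 - alpha) * b
                for p, b in zip(processing, binding)]
    best = max(itertools.combinations(range(len(combined)), k),
               key=lambda subset: sum(combined[i] for i in subset))
    return sorted(best)
```

An ILP solver reaches the same optimum on problem sizes where enumeration is hopeless, while also accommodating side constraints (e.g., overlap or coverage requirements).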
Our perception of a protein's function is closely tied to our understanding of the protein's three-dimensional (3D) structure and of how that structure is computationally predicted. Evaluating the quality of a predicted 3D structural model is crucial for protein structure prediction. In recent years, many research works have leveraged deep learning architectures, together with combinations of massive protein features, to evaluate a predicted model's quality. Recent works have shown that inter-residue distances and alignment-based coevolutionary information significantly improve the accuracy of protein structure prediction tasks. This work utilizes the structural constraints derived from multiple sequence alignments, powered by a deep graph convolutional neural network, to estimate the protein model accuracy (EMA). The method models a protein structure as a connected graph, in which each node encodes a residue's structural information and each edge represents the structural relationship between a pair of residues in the structure. We incorporate a new feature-embedding block into the deep graph learning that utilizes convolution and self-attention to leverage sequence alignment information for highly accurate protein quality estimation. We benchmark our method against other state-of-the-art quality assessment approaches on the CASP13 and CASP14 datasets. The results indicate the effectiveness of alignment-based features and attention-based graph learning for EMA problems and show an improvement of our method over previous works.
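The residue graph underlying such a model can be built directly from coordinates; a minimal sketch, where the 8 Å C-alpha cutoff is a common convention in the field rather than necessarily the threshold this paper uses:

```python
import math

def contact_graph(ca_coords, cutoff=8.0):
    """Adjacency list over residues: connect residues i and j whenever
    their C-alpha atoms lie within `cutoff` angstroms of each other."""
    n = len(ca_coords)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj
```

Node features (per-residue structural descriptors) and edge features (pairwise geometry) would then be attached to this graph before it is fed to the graph convolutional network.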
RNA G-quadruplexes (rG4s) are RNA secondary structures, which are formed by guanine-rich sequences and have important cellular functions. Thus, researchers would like to know where and when rG4s are formed throughout the transcriptome. Measuring rG4s experimentally is a long and laborious process, and hence researchers often rely on computational methods to predict the rG4 propensity of a given RNA sequence. However, existing computational methods for rG4 propensity prediction are sub-optimal, since they rely on specific sequence features and/or were trained on small datasets and without considering rG4 stability information. Here, we developed rG4detector, a convolutional neural network to predict the rG4 propensity of any given RNA sequence. We demonstrated that rG4detector outperforms existing methods over various transcriptomic datasets. In addition, we used rG4detector to detect potential rG4s in transcriptomic data, and showed that it improves detection performance compared to existing methods. Lastly, we interrogated rG4detector for the important features it learned and discovered known and novel molecular principles behind rG4 formation. We expect rG4detector to advance future rG4 research by accurate detection and propensity prediction of rG4s. The code, trained models, and processed datasets are publicly available via github.com/OrensteinLab/rG4detector.
World-renowned pediatric patient care in scoliosis, craniofacial, orthopedic, and other life-altering conditions is provided at the international Shriners Children's hospital system. The impact of scoliosis can be extreme with significant curvature of the spine that often progresses during childhood periods of growth and development. Gauging the impact of treatment is vital throughout the diagnostic and treatment process and is achieved using radiographic imaging and patient reported feedback surveys. Surgeons from multiple clinical centers have amassed a wealth of patient data from more than 1,000 scoliosis patients. However, these data are difficult to access due to data heterogeneity and poor interoperability between complex hospital systems. These barriers significantly decrease the value of these data to improve patient care. To solve these challenges, we create a generalizable multi-site and multi-modality cloud infrastructure for managing the clinical data of multiple diseases. First, we establish a standardized and secure research data repository using the Fast Health Interoperability Resources (FHIR) standard to harmonize multi-modal clinical data from different hospital sites. Additionally, we develop a SMART-on-FHIR application with a user-friendly graphical user interface (GUI) to enable non-technical users to access the harmonized clinical data. We demonstrate the generalizability of our solution by expanding it to also facilitate craniofacial microsomia and pediatric bone disease imaging research. Ultimately, we present a generalized framework for multi-site, multimodal data harmonization, which can efficiently organize and store data for clinical research to improve pediatric patient care.
In order to predict cell population behavior, it is important to understand the dynamic characteristics of individual cells. Individual induced pluripotent stem (iPS) cells in colonies have been difficult to track over long times, both because segmentation is challenging due to close proximity of cells and because cell morphology at the time of cell division does not change dramatically in phase contrast images; image features do not provide sufficient discrimination for 2D neural network models of label-free images. However, these cells do not move significantly during division, and they display a distinct temporal pattern of morphologies. As a result, we can detect cell division with images overlaid in time. Using a combination of a 3D neural network applied over time-lapse data to find regions of cell division activity, followed by a 2D neural network for images in these selected regions to find individual dividing cells, we developed a robust detector of iPS cell division. We created an initial 3D neural network to find 3D image regions in (x,y,t) in which identified cell divisions occurred, then used semi-supervised training with additional stacks of images to create a more refined 3D model. These regions were then inferenced with our 2D neural network to find the location and time immediately before cells divide when they contain two sets of chromatin, information needed to track the cells after division. False positives from the 3D inferenced results were identified and removed with the addition of the 2D model. We successfully identified 37 of the 38 cell division events in our manually annotated test image stack, and specified the time and (x,y) location of each cell just before division within an accuracy of 10 pixels.
Modern single-cell flow and mass cytometry technologies measure the expression of several proteins of the individual cells within a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which incurs a high computational cost to predict each biological sample's associated phenotype with machine learning models. Such a large set cardinality also limits the interpretability of machine learning models due to the difficulty in tracking how each individual cell influences the ultimate prediction. We propose using Kernel Mean Embedding to encode the cellular landscape of each profiled biological sample. Although our foremost goal is to make a more transparent model, we find that our method achieves comparable or better accuracies than the state-of-the-art gating-free methods through a simple linear classifier. As a result, our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model makes it easy to interpret classification results. Analysis further shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
Modern high-throughput single-cell immune profiling technologies, such as flow and mass cytometry and single-cell RNA sequencing, can readily measure the expression of a large number of protein or gene features across the millions of cells in a multi-patient cohort. While bioinformatics approaches can be used to link immune cell heterogeneity to external variables of interest, such as clinical outcome or experimental label, they often struggle to accommodate such a large number of profiled cells. To ease this computational burden, a limited number of cells are typically sketched or subsampled from each patient. However, existing sketching approaches fail to adequately subsample cells from rare cell-populations, or fail to preserve the true frequencies of particular immune cell-types. Here, we propose a novel sketching approach based on Kernel Herding that selects a limited subsample of all cells while preserving the underlying frequencies of immune cell-types. We tested our approach on three flow and mass cytometry datasets and one single-cell RNA sequencing dataset and demonstrated that the sketched cells (1) more accurately represent the overall cellular landscape and (2) facilitate increased performance in downstream analysis tasks, such as classifying patients according to their clinical outcome. An implementation of sketching with Kernel Herding is publicly available at https://github.com/vishalathreya/Set-Summarization.
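Kernel Herding greedily picks points whose empirical kernel mean tracks the full dataset's kernel mean, which is why the selected subsample preserves cell-type frequencies. A compact numpy sketch of that greedy rule (illustrative only; the implementation at the linked repository may differ in detail):

```python
import numpy as np

def kernel_herding(X, m, gamma=1.0):
    """Greedily pick m indices from X (n, d) whose RBF-kernel mean
    tracks the full dataset's kernel mean (a sketch of Kernel Herding)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)            # full RBF kernel matrix
    mu = K.mean(axis=1)                # kernel mean map evaluated at each point
    selected, running = [], np.zeros(len(X))
    for t in range(m):
        scores = mu - running / (t + 1)
        scores[selected] = -np.inf     # sample without replacement
        i = int(np.argmax(scores))
        selected.append(i)
        running += K[:, i]             # update the sketch's kernel mean
    return selected
```

Because the score penalizes points already well represented by the sketch, dense and sparse cell-populations are both visited roughly in proportion to their frequency.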
With recent advances in single-cell RNA (scRNA) sequencing technology, several methods have been proposed to infer cell-cell communication by analyzing ligand-receptor pairs. However, existing methods have limited ways of using what we call "prior knowledge", i.e., what is already known (albeit incompletely) about the signaling upstream of the ligand and downstream of the receptor. In this paper, we present a novel framework, called rCom, capable of inferring cell-cell interactions by considering portions of pathways associated with the upstream of the ligand and the downstream of the receptor under examination. The rCom framework integrates knowledge from multiple biological databases, including transcription factor-target databases, ligand-receptor databases, and publicly available curated signaling pathway databases. We combine algorithmic methods and heuristic rules to score how each putative ligand-receptor pair may match up between all possible cell subtype pairs. A permutation test is performed to rank the hypothesized cell-cell communication routes. We performed a case study using single-cell transcriptomic data from bone biology. Our literature survey suggests that rCom could be effective in discovering novel cell-cell communication relationships that have been only partially known in the field.
Modern single-cell technologies, such as Cytometry by Time of Flight (CyTOF), measure the simultaneous expression of multiple protein markers per cell and have enabled the characterization of the immune system at unparalleled depths across numerous clinical applications. Despite the success of a variety of developed bioinformatics techniques for automatically characterizing cells into particular immune cell-types, methods to encode variation across heterogeneous cellular landscapes and with respect to a clinical outcome of interest are still lacking. To summarize and unravel the immunological variation across multiple samples profiled with CyTOF, we developed CytoEMD, a fast and scalable metric-based method to encode a compact vector representation for each profiled sample. CytoEMD uses earth mover's distance (EMD) to quantify the differences between pairs of profiled samples, which can be further projected into a latent space for visualization and interpretation. We compared CytoEMD to gating-based and deep-learning-based set autoencoder methods and found that the CytoEMD approach 1) correctly captures between-patient variation, and 2) is more efficient and requires significantly fewer parameters. CytoEMD further promotes interpretability by providing insight into the cell-types driving variation between samples. CytoEMD is available as an open-source Python package at https://github.com/CompCy-lab/CytoEMD.
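The core quantity, pairwise earth mover's distance between profiled samples, can be sketched for a single marker with scipy (the actual CytoEMD package presumably handles multiple markers and the downstream latent-space projection; this helper is our own illustration):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_distance_matrix(samples):
    """Pairwise earth mover's distance between profiled samples,
    each given as a 1-D array of per-cell values for one marker."""
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = wasserstein_distance(samples[i], samples[j])
    return D
```

The resulting distance matrix can then be embedded into a low-dimensional space (e.g., via multidimensional scaling) to obtain one compact vector per sample.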
Cells are the building blocks of human tissues and organs, and the distributions of different cell-types change due to environmental or disease conditions and treatments. Single-cell RNA sequencing is used to study the heterogeneity of cells in biological samples. To date, computational approaches have aided in the discovery of dominant and rare cell-types and facilitated the construction of cell atlases. Integration of new data with existing reference atlases is an emerging computational problem, and this paper proposes to frame it as a multi-target prediction task, solvable using supervised machine learning. We systematically and rigorously test 63 different predictors on synthetic benchmarks with different properties. The best performing predictor has high Cohen's Kappa scores and low mean absolute errors in single-batch and multi-batch integration experiments.
We have developed a computational framework for constructing synthetic signal peptides from a base set of protein sequences. A large number of structured "building blocks", represented as m-step ordered pairs of amino acids, are extracted from the base sequences. Using a straightforward procedure, the building blocks enable the construction of a diverse set of synthetic signal peptides and targeting sequences that have the potential for industrial and therapeutic purposes. We have validated the proposed framework using several state-of-the-art sequence prediction platforms such as Signal-BLAST, SignalP-5.0, MULocDeep, and DeepMito. Experimental results show the computational framework can successfully generate synthetic signal peptides and targeting sequences and transform non-signaling sequences into synthetic signal peptides.
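The "building blocks" above are described as m-step ordered pairs of amino acids. Under the natural reading that an m-step pair links residues m positions apart, extraction reduces to a few lines (this interpretation is ours, not confirmed by the paper):

```python
def m_step_pairs(seq, m):
    """All ordered amino-acid pairs (seq[i], seq[i+m]) whose residues
    are exactly m positions apart in the sequence."""
    return [(seq[i], seq[i + m]) for i in range(len(seq) - m)]
```

Collecting such pairs for several values of m over a base set of sequences would yield the pool of building blocks from which candidate synthetic peptides are assembled.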
Drug repurposing aims to find new uses for existing drugs. One drug repurposing approach, called "Connectivity Mapping," links transcriptomic profiles of drugs to profiles characterizing disease states. However, experimentally evaluating the transcriptomic effects of drug exposure in particular cells is a costly process. Characterizing drug-cell combinations widely is further hindered because primary tissue samples may not be abundant, leading to many gaps in drug-cell databases. To best find drugs relevant for particular conditions, we may therefore want to impute the transcriptomic impact of a given drug on an unassayed cell type or types. This step deviates from classic data completion problems, however, because state-of-the-art imputation techniques do not consider the unique characteristics of the data: the missing values are not randomly distributed, and the genes are not independent entities, but rather interact with and affect the transcription rates of one another. Here, we address the first and one of the most fundamental parts of the connectivity map data imputation problem to enable drug repurposing. We develop a novel method, named FiT (Fiber-based Tensor Completion), to impute the transcription values for missing drug-cell line combinations in a highly sparse drug-cell line dataset accurately and efficiently, while exploiting the distribution of missing values as well as the interactions among genes. Our results demonstrate that even on a sparse dataset, where approximately 75% of the data is missing, FiT outperforms existing approaches and obtains more accurate results in a significantly shorter amount of time.
The challenge of navigating high-dimensional transcription datasets remains a persistent problem. It is further amplified for complex disorders, such as cancer, as these disorders are often multigenic traits with multiple subsets of genes collectively affecting the type, stage, and severity of the trait. We are often faced with a trade-off between reducing the dimensionality of our datasets and maintaining the integrity of our data. Almost exclusively, researchers apply dimensionality reduction techniques to shrink the feature space so that classifiers can work in more appropriately sized input spaces. As the number of dimensions is reduced, however, the ability to distinguish classes from one another is reduced as well. Thus, to accomplish both tasks simultaneously for very high-dimensional transcriptomes of complex multigenic traits, we propose a new supervised technique, Class Separation Transformation (CST), which significantly reduces the dimensionality of the input space into a one-dimensional transformed space that provides optimal separation between the differing classes. We compare our method with existing state-of-the-art methods using both real and synthetic datasets, demonstrating that CST is more accurate, robust, and scalable than existing methods. Code used in this paper is available at https://github.com/aisharjya/CST.
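The abstract does not specify how CST computes its one-dimensional transform. As a conceptual stand-in only, Fisher's linear discriminant also projects high-dimensional features onto a single axis chosen to separate two classes, which illustrates the flavor of a supervised 1-D projection:

```python
import numpy as np

def fisher_1d_projection(X, y):
    """Project X onto the 1-D direction maximizing Fisher's
    between-class vs. within-class separation (two classes).
    A stand-in illustration, not the CST algorithm itself."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    # w ∝ Sw^{-1} (mu1 - mu0); small ridge term keeps Sw invertible
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]),
                        X1.mean(axis=0) - X0.mean(axis=0))
    return X @ w
```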
Twitter users post tweets on many topics, emotions, and events. Technological advances and the ease of tweeting have quickened people's interaction with social networking sites. Engagement with tweets has driven product promotion at many corporations, and many studies have focused on understanding tweeting patterns for marketing, retweeting, getting noticed, and receiving feedback. The time of a tweet has been used for marketing strategies: domain-based tweet timestamp patterns have helped corporations schedule their tweets and attract more customers for their products. We collected 2.3 million depressive, anti-depressive, and COVID-19 tweets over one year. Our analysis of these tweets reveals detailed tweet patterns across times of day and days of the week. The depressive tweets follow a diurnal pattern, whereas the anti-depressive tweets follow a similar trend with intermediate aberrations. We also classified the tweet keywords into three different types based on the frequency and amplitude of their tweet patterns. Analyzing multi-domain tweets to discover time-series patterns related to human health will be helpful for the planning and execution of medical disaster preparedness and emergency teams.
Molecular biology prediction tasks suffer from limited labeled data, since labeling a target molecule normally demands a series of professional experiments. Self-training is a semi-supervised learning paradigm that utilizes both labeled and unlabeled data: it trains a teacher model on labeled data and uses it to generate pseudo labels for unlabeled data; the labeled and pseudo-labeled data are then combined to train a student model. However, the pseudo labels generated by the teacher model are not sufficiently accurate. Thus, we propose a robust self-training strategy that uses a robust loss function to handle such noisy labels; it is model- and task-agnostic and can be easily embedded into any prediction task. We conducted molecular biology prediction tasks to evaluate the performance of the proposed robust self-training strategy. The results demonstrate that the proposed method consistently boosts prediction performance, especially for molecular regression tasks, which gained a 41.5% average improvement.
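The teacher-student loop with a robust loss can be sketched for a toy linear regression setting. The key point is that the Huber gradient is bounded, so noisy pseudo-labels cannot dominate the student's updates. All function names and the choice of Huber loss here are our illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def huber_grad(residual, delta=1.0):
    """Gradient of the Huber loss: bounded, so outlier pseudo-labels
    contribute at most |delta| to each residual's pull."""
    return np.clip(residual, -delta, delta)

def robust_self_training(Xl, yl, Xu, lr=0.01, steps=500, delta=1.0):
    """Minimal sketch: fit a linear teacher on labeled data,
    pseudo-label the unlabeled pool, then train a student on the
    union with a robust (Huber) loss instead of squared error."""
    w_teacher = np.linalg.lstsq(Xl, yl, rcond=None)[0]
    y_pseudo = Xu @ w_teacher                 # possibly noisy pseudo labels
    X = np.vstack([Xl, Xu])
    y = np.concatenate([yl, y_pseudo])
    w = np.zeros(X.shape[1])                  # student, trained robustly
    for _ in range(steps):
        r = X @ w - y
        w -= lr * X.T @ huber_grad(r, delta) / len(y)
    return w
```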
Parkinson's disease (PD) is the second most prevalent neurodegenerative disease in the United States. The structural or functional connectivity between regions of interest (ROIs) in the brain and their changes captured in brain connectomes could be potential biomarkers for PD. To effectively model the complex non-linear characteristic connectomic patterns related to PD and exploit the long-range feature interactions between ROIs, we propose a connectome transformer model for PD patient classification and biomarker identification. The proposed connectome transformer learns the key connectomic patterns by leveraging the global scope of the attention mechanism guided by an additional skip-connection from the input connectome and the local level focus of the CNN techniques. Our proposed model significantly outperformed the benchmarking models in the classification task and was able to visualize key feature interactions between ROIs in the brain.
Supervised machine learning models are, by definition, data-sighted: they require access to all or most of the labeled training dataset. This paradigm presents two intertwined bottlenecks: the risk of exposing sensitive data samples to a third-party site with machine learning engineers, and the time-consuming, laborious, bias-prone nature of data annotation by personnel at the data-source site. In this paper we studied the impact of data adequacy as a source of bias in a data-blinded semi-supervised learning model for COVID-19 chest X-ray classification. Data-blindedness was implemented with a semi-supervised generative adversarial network that generates synthetic data based only on a few labeled data samples while concurrently learning to classify targets. We designed and developed a data-blind COVID-19 patient classifier that determines whether an individual is suffering from COVID-19 or another type of illness, with the ultimate goal of producing a system to assist in labeling large datasets. However, the availability of labels in the training data had an impact on model performance, and when a new disease spreads, as COVID-19 did in 2019, access to labeled data may be limited. Here, we studied how bias in the labeled sample distribution per class affected classification performance for three models: a Convolutional Neural Network based classifier (CNN), a semi-supervised GAN using the source data (SGAN), and our proposed data-blinded semi-supervised GAN (BSGAN). Data-blindness prevents machine learning engineers from directly accessing the source data during training, thereby ensuring data confidentiality. This was achieved by using synthetic data samples, generated by a separate generative model, to train the proposed model.
Our model achieved comparable performance, with a trade-off of 0.05 in AUC score between the privacy-aware model and a traditionally trained model, and it remained stable, maintaining the same learning performance as the data distribution changed.
Digital breast tomosynthesis, or 3D mammography, has advanced the field of breast imaging diagnosis. It has been rapidly replacing traditional full-field digital mammography because of its diagnostic superiority. However, automatic detection of breast cancer using digital breast tomosynthesis images has remained challenging, mainly due to their high resolution, high volume, and complexity. In this study, we developed a novel model for more precise detection of cancerous 3D mammograms. The proposed model first represents 3D mammograms as graphs, then employs a self-attention graph convolutional neural network to effectively and efficiently learn the features of the 3D mammograms, and finally identifies the cancerous 3D mammograms using the extracted features. We trained and evaluated the performance of the proposed model using public and private datasets, and compared its performance with those of multiple state-of-the-art CNN-based baseline models. The results show that the proposed model outperforms all the baseline models in terms of accuracy, precision, sensitivity, F1, and AUC.
We often find our minds drifting off a current task towards something else, a phenomenon known as mind wandering. Mind wandering can negatively impact performance on many tasks (e.g., learning). Thus, it is crucial to find a way to detect mind wandering. Using deep learning with electroencephalography (EEG) seems very promising: EEG systems offer high temporal precision and accessibility, and deep learning can automatically extract features from EEG signals. However, three key challenges hinder deep learning performance: the dynamic and distributed nature of mind wandering, small EEG datasets, and diverse EEG systems. Existing deep learning solutions do not perform well on small datasets and cannot use data from other EEG systems. We propose a novel deep learning model, TopographyNET, which 1) captures the dynamic and distributed properties through spatial and temporal processing via 2D topographic scalp maps and a recurrent neural network; 2) applies transfer learning to address the issue of small datasets using a pretrained image classification neural network on topographic scalp maps; and 3) represents data in a uniform format and thus enables the usage of EEG data from diverse systems. Compared to an existing solution, our approach achieves a much higher classification accuracy. In addition, we present the hyperparameter tuning process that helped us achieve this accuracy.
Interactions among molecules, also known as biological networks, are often modeled as binary graphs, where nodes represent the molecules and edges represent the interactions among them, such as signal transmission, gene regulation, and protein-protein interactions. Recurring subgraph patterns in these networks, called motifs, describe conserved biological functions. Although the traditional binary graph provides a simple model to study biological interactions, it lacks the expressive power to provide a holistic view of cell behavior, as the interaction topology alters and adapts under different stress conditions as well as genetic variations. A multilayer network model captures the complexity of cell functions for such systems. Unlike the classic binary network model, the multilayer network model provides an opportunity to identify conserved functions in the cell across varying conditions. In this paper, we introduce the problem of finding co-existing motifs in multilayer networks. These motifs describe the dual conservation of cell functions within a network layer (i.e., cell condition) as well as across different layers. We propose a new algorithm to solve the co-existing motif identification problem efficiently and accurately. Our experiments on both synthetic and real datasets demonstrate that our method identifies all co-existing motifs at near-100% accuracy for all networks we tested, while the competing method's accuracy varies greatly between 10% and 95%. Furthermore, our method runs at least an order of magnitude faster than state-of-the-art motif identification methods for binary network models.
COVID-19 unleashed a global pandemic that has resulted in human, economic, and social crises of unprecedented scale. While the efficacy of mobility restrictions in curbing contagion has been scientifically and empirically acknowledged, a deeper understanding of the human behavioral trends driving the mixed adoption of mobility restrictions will aid future policymaking. In this paper, we employ associative rule-mining and regression to pinpoint socioeconomic and demographic factors influencing the evolving mobility trends. We compare and contrast short-distance and long-distance trips by analyzing Chicago county-level and US state-level mobility. Our study yields rules that explain the changing propensity in trip length and the collective effect of population density, economic standing, COVID testing, and the number of infected cases on mobility decisions. Through regression and correlation analysis, we show the influence of ethnic and demographic factors and perception of infection on short- and long-distance trips. We find that the new mobility rules correspond to reduced long- and short-distance trip frequencies. We graphically demonstrate a marked decline in the proportion of long county-level trips but a minor change in the distribution of state-level trips. Our correlation study highlights that it is hard to characterize the effect of perception of infection spread on mobility decisions. We conclude the paper with a discussion of how our findings overlap with the existing literature on both during- and post-lockdown mobility trends.
With advancements in next-generation sequencing techniques, the whole protein sequence repertoire has increased to a great extent. In the meantime, deep learning techniques have promoted the development of computational methods to interpret large-scale proteomic data and facilitate functional studies of proteins. Inferring properties from protein amino acid sequences has been a long-standing problem in bioinformatics, and extensive studies have successfully applied natural language processing (NLP) techniques to the representation learning of protein sequences. In this paper, we fine-tuned and evaluated the deep sequence model UDSMProt on two protein prediction tasks: (1) predicting proteins with liquid-liquid phase separation propensity and (2) predicting synaptic proteins. Our results show that, without prior domain knowledge and based only on protein sequences, the fine-tuned language models achieved high classification accuracies and outperformed baseline models using compositional k-mer features on both tasks. Hence, it is promising to apply protein language models to such learning tasks, and the fine-tuned models can be used to predict protein candidates for biological studies.
Influenza is a communicable respiratory illness that can cause serious public health hazards. Flu surveillance in New Zealand tracks case counts from the country's district health boards (DHBs) to monitor the spread of influenza in different geographic locations. Many factors contribute to the spread of influenza across a geographic region, and it can be challenging to forecast cases in one region without taking into account case numbers in another. State-of-the-art forecasting models focus on accurately forecasting cases for a location using historical case counts for the same location, along with other data sources based on human behaviour, such as the movement of people across cities and geographic regions. This paper proposes a novel ensemble method called Geographic Ensembles of Observations using Randomised Ensembles of Autoregression Chains (GEO-Reach), a two-layer approach that exploits the interdependence of historical case counts between geographic regions in New Zealand. This work extends a previously published method by the authors called Randomized Ensembles of Auto-regression chains (Reach). The new approach is evaluated using influenza-like illness (ILI) case counts in 7 major regions in New Zealand from 2015--2019, comparing its performance with standard methods such as Dante, ARIMA, autoregression, and random forests. The results demonstrate that the proposed method performed better than the baseline methods on this multi-variate time series forecasting problem.
Data valuation in machine learning comprises computational methods for estimating the importance of individual training instances. It has been used to remove noise, uncover biases, and improve the accuracy of trained models. Current data valuation techniques do not scale to large datasets and do not work for regression tasks, where the objective is to predict a numerical outcome rather than a small number of nominal class labels. In this work, an evolutionary approach for qualitative and quantitative data valuation is presented. The proposed approach is tested on regression and classification benchmarks, and on several bioinformatics and health informatics datasets. In addition, models trained with the most valuable subsets of data are validated on independently acquired test sets, demonstrating the generalizability as well as the practical utility of the proposed approach.
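The evolutionary idea above can be illustrated with a toy sketch: evolve binary masks over the training instances, score each mask with a user-supplied fitness function (e.g., validation accuracy of a model trained on the masked subset), and read off each instance's value as how often it appears in the fittest masks. This is our own minimal illustration, not the paper's exact algorithm:

```python
import numpy as np

def evolutionary_valuation(n, fitness, pop=20, gens=30, seed=0):
    """Toy evolutionary data valuation over n training instances.
    `fitness` maps a boolean mask of length n to a score; the
    returned vector is each instance's selection frequency among
    the fittest masks."""
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(pop, n)).astype(bool)
    for _ in range(gens):
        scores = np.array([fitness(m) for m in masks])
        top = masks[np.argsort(scores)[-pop // 2:]]          # selection
        children = top ^ (rng.random(top.shape) < 1.0 / n)   # bit-flip mutation
        masks = np.vstack([top, children])
    scores = np.array([fitness(m) for m in masks])
    return masks[np.argsort(scores)[-pop // 2:]].mean(axis=0)
```

In practice the fitness call would train and validate a model on the selected subset; here any callable works, which is what makes the scheme applicable to regression as well as classification.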
Fast-growing global connectivity and urbanisation increase the risk of worldwide disease spread. The worldwide SARS-CoV-2 pandemic has strained healthcare systems, especially intensive care units. Predicting patients' need for intensive care is therefore a priority at the hospital admission stage for efficient resource allocation. Early in hospitalization, chest radiographs and clinical data are routinely collected for diagnosis. Hence, we propose a graph Markov neural network that embeds structured clinical data with computed radiography exam features (CGMNN) to predict intensive care unit demand for COVID-19 patients. The study utilized chest computed radiographs and clinical data from 1,342 patients in a public dataset. The proposed CGMNN outperforms baseline models with an accuracy of 0.82, a sensitivity of 0.82, a precision of 0.81, and an F1 score of 0.76.
Bio-marker identification for COVID-19 remains a vital research area to improve current and future pandemic responses. Innovative artificial intelligence and machine learning-based systems may leverage the large quantity and complexity of single-cell sequencing data to quickly identify disease with high sensitivity. In this study, we developed a novel approach to classify patient COVID-19 infection severity using single-cell sequencing data derived from patient BronchoAlveolar Lavage Fluid (BALF) samples. We also identified key genetic biomarkers associated with COVID-19 infection severity: feature importance scores from high-performing COVID-19 classifiers were used to identify a set of novel genetic biomarkers that are predictive of infection severity. Treatment development and pandemic reaction may be greatly improved using our novel big-data approach. Our implementation is available at https://github.com/aekanshgoel/COVID-19_scRNAseq.
As of May 15th, 2022, the novel coronavirus SARS-COV-2 has infected 517 million people and resulted in more than 6.2 million deaths around the world. About 40% to 87% of patients suffer from persistent symptoms weeks or months after their original infection. Despite remarkable progress in preventing and treating acute COVID-19 conditions, the clinical diagnosis of long-term COVID remains difficult. In this work, we use free-text clinical notes and natural language processing (NLP) techniques to explore long-term COVID effects. We first obtain free-text clinical notes from 719 outpatient encounters representing patients treated by physicians at Emory Clinic to detect patterns in patients with long-term COVID symptoms. We apply state-of-the-art NLP frameworks to automatically identify patients with long-term COVID effects, achieving 0.881 recall (sensitivity) score for note-level prediction. We further interpret the prediction outcomes and discuss potential phenotypes. Our work aims to provide a data-driven solution to identify patients who have developed persistent symptoms after acute COVID infection. With this work, clinicians may be able to identify patients who have long-term COVID symptoms to optimize treatment.
Heart failure (HF) is a leading cause of morbidity, mortality, and substantial health care costs. Prolonged conduction through the myocardium can occur with HF, and a device-driven approach, termed cardiac resynchronization therapy (CRT), can improve left ventricular (LV) myocardial conduction patterns. We used machine learning methods for classifying HF patients, namely Decision Trees and Artificial Neural Networks (ANNs), to develop predictive models of individual outcomes following CRT. Clinical, functional, and biomarker data were collected in HF patients before and following CRT. A prospective 6-month endpoint of a reduction in LV volume was defined as a CRT response. Using this approach on 764 subjects (368 responders, 396 non-responders), each with 53 parameters, we could classify HF patients based on their response to CRT with more than 72% success.
We also explored the utilization of machine learning techniques in predicting the magnitude of LV volume, 3 months after CRT placement. Using techniques such as linear regression and Artificial neural networks, we can predict the 3-month LV volume within a 17% median margin of error.
We have demonstrated that machine learning approaches can identify HF patients with a high probability of a positive CRT response. Developing these approaches into a clinical algorithm to assist in clinical decision-making regarding the use of CRT in HF patients would potentially improve outcomes and reduce health care costs.
In this article, we extend a general framework, Pathway-based Kernel Boosting (PKB), which incorporates clinical information and prior knowledge about pathways for the prediction of binary, continuous, and survival outcomes. We introduce appropriate loss functions and optimization procedures for the different outcome types. Our prediction algorithm incorporates pathway knowledge by constructing kernel function spaces from the pathways and using them as base learners in the boosting procedure. Through extensive simulations and case studies in drug response and cancer survival datasets, we demonstrate that PKB can substantially outperform other competing methods, better identify biological pathways related to drug response and patient survival, and provide novel insights into cancer pathogenesis and treatment response.
In medicine, survival analysis studies the time to events of interest, such as mortality. One major challenge is how to deal with multiple competing events (e.g., multiple disease diagnoses). In this work, we propose SurvTRACE, a transformer-based model that makes no assumption about the underlying survival distribution and is capable of handling competing events. We account for the implicit confounders in the observational multi-event setting, which cause selection bias as the predicted survival probability is influenced by irrelevant factors. To sufficiently utilize the survival data to train transformers from scratch, multiple auxiliary tasks are designed for multi-task learning. The model hence learns a strong shared representation from all these tasks, which in turn serves better survival analysis. We further demonstrate how to inspect covariate relevance and importance through the interpretable attention mechanisms of SurvTRACE, which holds great potential for enhancing clinical trial design and new treatment development. Experiments on METABRIC, SUPPORT, and SEER data with 470k patients validate the all-around superiority of our method. Software is available at https://github.com/RyanWangZf/SurvTRACE.
Genomic data have long been used for trait association and disease risk prediction. In recent years, many such prediction models have been built using machine learning (ML) algorithms. As of today, human genomic data and other biomedical data suffer from sampling biases in terms of ethnicity, as most of the data come from people of European ancestry. Smaller sample sizes for other population groups can cause suboptimal results in ML-based prediction models for those populations, and suboptimal predictions in precision medicine for a particular group can have serious consequences, limiting the model's applicability to real-world problems. As data collection for those populations is time-consuming and costly, we suggest deep learning-based models for in-silico data enhancement. Existing Generative Adversarial Network (GAN) models for genomic data, such as Population scale Genomic conditional-GAN (PG-cGAN), can generate realistic genomic data when trained on fairly unbiased data, but fail when trained on biased data, encountering severe mode collapse. Our proposed model, Offspring GAN, resolves the mode collapse issue even when trained on strongly biased genomic datasets. Our results demonstrate the ability of Offspring GAN to generate realistic and diverse label-aware data, which can augment limited real data to alleviate biases and disparities in genomic data. We also propose a privacy-preserving protocol using Offspring GAN to protect the privacy of genomic data.
Graph-based genome representations have proven to be a powerful tool in genomic analysis due to their ability to encode variations found in multiple haplotypes and to capture population genetic diversity. Such graphs also unavoidably contain paths which switch between haplotypes (i.e., recombinant paths) and thus do not fully match any of the constituent haplotypes. The number of such recombinant paths increases combinatorially with path length and causes inefficiencies and false positives when mapping reads. In this paper, we study the problem of finding reduced haplotype-aware genome graphs that incorporate only a selected subset of variants, yet contain paths corresponding to all α-long substrings of the input haplotypes (i.e., non-recombinant paths) with at most δ mismatches. Solving this problem optimally, i.e., minimizing the number of variants selected, is previously known to be NP-hard. Here, we first establish several inapproximability results regarding finding haplotype-aware reduced variation graphs of optimal size. We then present an integer linear programming (ILP) formulation for solving the problem, and experimentally demonstrate that it is a computationally feasible approach for real-world problems and provides far superior reduction compared to prior approaches.
We develop methods for constructing low-dimensional vector representations (embeddings) of large-scale genotyping data, capable of reducing genotypes of hundreds of thousands of SNPs to 100-dimensional embeddings that retain substantial predictive power for inferring medical phenotypes. We demonstrate that embedding-based models yield an average F-score of 0.605 on a test of ten phenotypes (including BMI prediction, genetic relatedness, and depression) versus 0.339 for baseline models. Genotype embeddings also hold promise for data sharing while preserving subject anonymity: we show that they retain substantial predictive power even after anonymization by adding Gaussian noise to each dimension.
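The anonymization step described above (adding Gaussian noise to each embedding dimension) can be sketched as follows; the noise scale `sigma`, the random seed, and the array shapes are illustrative assumptions, not values from the study.

```python
import numpy as np

def anonymize_embeddings(embeddings, sigma=0.1, seed=0):
    """Add isotropic Gaussian noise to every dimension of each embedding.

    A minimal sketch of the anonymization idea; `sigma` is a hypothetical
    noise scale, not a value reported in the study.
    """
    rng = np.random.default_rng(seed)
    return embeddings + rng.normal(0.0, sigma, size=embeddings.shape)

# Example: 3 subjects, 100-dimensional genotype embeddings
emb = np.zeros((3, 100))
noisy = anonymize_embeddings(emb, sigma=0.1)
print(noisy.shape)  # (3, 100)
```

In practice, `sigma` trades off anonymity against the predictive power retained by downstream phenotype models.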
Thanks in part to rapid advances in next-generation sequencing technologies, recent phylogenomic studies have demonstrated the pivotal role that non-tree-like evolution plays in many parts of the Tree of Life - the evolutionary history of all life on Earth. As such, the Tree of Life is not necessarily a tree at all, but is better described by more general graph structures such as phylogenetic networks. Another key ingredient in these advances is the computational methods needed for reconstructing phylogenetic networks from large-scale genomic sequence data. But virtually all of these methods either require multiple sequence alignments (MSAs) as input or utilize gene trees or other inputs that are computed using MSAs. All of the input MSAs and gene trees must be estimated from empirical data. The methods themselves do not directly account for upstream estimation error, and, apart from prior studies of phylogenetic tree reconstruction and anecdotal evidence, little is understood about the impact of estimated MSA and gene tree error on downstream species network reconstruction.
We therefore undertake a performance study to quantify the impact of MSA error and gene tree error on state-of-the-art phylogenetic network inference methods. Our study utilizes synthetic benchmarking data as well as genomic sequence data from mosquito and yeast. We find that upstream MSA and gene tree estimation error can have first-order effects on the accuracy of downstream network reconstruction and, to a lesser extent, its computational runtime. The effects become more pronounced on more challenging datasets with greater evolutionary divergence and more sampled taxa. Our findings highlight an important need for computational methods development: namely, scalable methods are needed to account for estimated MSA and gene tree error when reconstructing phylogenetic networks using unaligned biomolecular sequence data.
Biomedical Question Answering aims to obtain an answer to a given question from the biomedical domain. Because of its high requirement for biomedical domain knowledge, it is difficult for a model to learn such knowledge from limited training data. We propose a contextual embedding method that combines the open-domain QA model AoA Reader with the BioBERT model pre-trained on biomedical domain data. We adopt unsupervised pre-training on a large biomedical corpus and supervised fine-tuning on a biomedical question answering dataset. Additionally, we adopt an MLP-based model-weighting layer to automatically exploit the advantages of the two models in providing the correct answer. The public BioMRC dataset, constructed from the PubMed corpus, is used to evaluate our method. Experimental results show that our model outperforms the state-of-the-art system by a large margin.
Traditional taxonomy provides a hierarchical organization of bacteria and archaea across taxonomic ranks from kingdom to subspecies. More recently, bacterial taxonomy has been more robustly quantified using comparisons of sequenced genomes, as in the Genome Taxonomy Database (GTDB), resolving down to genera and species. Such taxonomies have proven useful in many contexts, yet lack the flexibility and resolution of a more fine-grained approach. We apply our Life Identification Number (LIN) approach as a common, quantitative framework to tie existing (and future) bacterial taxonomies together, increase the resolution of genome-based discrimination of taxa, and extend taxonomic identification below the species level in a principled way. We utilize our existing concept of a LINgroup as an organizational concept for microorganisms that are closely related by overall genomic similarity, to help resolve some of the confusion and unforeseen negative effects of nomenclature changes of microbes due to genome-based reclassification. Our experimental results demonstrate the value of LINs and LINgroups in mapping between taxonomies, translating between different nomenclatures, and integrating them into a single taxonomic framework.
Major recent advances in sequencing technologies have created new opportunities for studying the complex microbiome domain. However, microbial communities have many unknown roles and unclear impacts on their host environment. The increased availability of microbial omics data associated with heterogeneous metadata has the potential to revolutionize microbiome research. This study proposes a novel data-integration model and a practical pipeline to explore microbial community functions with the integration of omics data. Three case studies were employed to highlight the advanced abilities and applications of our graph database model. Furthermore, we show that a variety of information can be queried against our model and easily extracted using the proposed analysis pipeline. Our findings suggest that the proposed model is highly queryable and provides a critical analytical platform to extract useful knowledge from multi-omics data. We show that such knowledge extraction can lead to new discoveries, particularly when utilizing all available datasets.
Diabetes is a chronic disease caused by elevated blood sugar over a long period of time, and one in every ten Americans has diabetes. Neural networks have gained attention in large-scale genetic research because of their ability to model non-linear relationships. However, the data imbalance problem, caused by the disproportion between the number of disease samples and the number of healthy samples, decreases prediction accuracy. In this project, we tackle the data imbalance problem when predicting diabetes with genotype SNP data and phenotype data provided by UK Biobank. The dataset is highly skewed toward healthy samples, with a ratio of about 20 to 1. We build a phenotype neural network and a genotype neural network, and use two sampling techniques and a data augmentation method based on a generative adversarial network (GAN) to counter the data imbalance problem before feeding the data to the neural networks. We found that the phenotype neural network outperforms the genotype neural network and achieves 90% accuracy. We conclude that undersampling performs better than both oversampling and the GAN, and that phenotype data are better than genotype data for predicting diabetes. We have identified key phenotype and genotype features that contributed to the effectiveness of the prediction.
Antibiotic resistance is a global problem projected to kill 10 million people each year by 2050. The CDC lists Neisseria gonorrhoeae among the most urgent threats in this area, as there exists a severe lack of efficient resistance-detection techniques and only a handful of resistance-causing mutations have been identified thus far. Currently, testing for antibiotic resistance in N. gonorrhoeae samples depends on culturing a sample in a lab environment. Sensitivity and specificity may reach 85--95% and 100% respectively, but only under optimal conditions and for urogenital specimens.
In this study, eight machine learning models - multi-layer perceptron, support vector machine, random forest classifier, K-nearest neighbors, eXtreme gradient boosting, Gaussian Naive Bayes, stochastic gradient descent, and logistic regression - were trained on three datasets containing data on resistance against azithromycin, ciprofloxacin, and cefixime, three drugs of choice against N. gonorrhoeae. Each dataset had over 3000 samples with corresponding resistance values; each sample consisted of a binary series representing the presence or absence of certain unitigs within that sample's genome. This approach differs from the standard research in this field, which has almost exclusively used whole-genome sequences.
Once the models were trained, their accuracies, sensitivities, and specificities were compared and analyzed. Maximum balanced accuracies of 97.6%, 95.9%, and 100% were achieved on azithromycin, ciprofloxacin, and cefixime training data respectively, an improvement over previous work. As a point of comparison between the various models, performance on azithromycin resistance is shown in Fig 1. The balanced accuracy of GNB, at 68%, is too low to register on the scale.
Subsequently, Fisher's exact test was used to test for the existence of biomarkers, i.e., unitigs whose presence had a statistically significant correlation with antibiotic resistance. The feature importances of the top models from the first step were used to rank these genetic signatures, representing a novel method of unitig organization. Out of 584,362 unitigs, 191, 3304, and 1 were identified as statistically significant for azithromycin, ciprofloxacin, and cefixime respectively. The majority of these genetic regions encode proteins - some of which are likely novel discoveries - such as DsbA oxidoreductase, FtsJ methyltransferase, and pilin glycosyltransferase. These biomarkers present useful leads for the development of point-of-care tests for antibiotic resistance in N. gonorrhoeae, while the ML models can predict resistance through direct genotype sequencing of patient samples.
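The biomarker screen described above (Fisher's exact test on unitig presence/absence versus resistance) might look like the following sketch; the matrix layout, toy data, and significance threshold are illustrative assumptions, not the study's exact pipeline.

```python
import numpy as np
from scipy.stats import fisher_exact

def unitig_biomarkers(X, y, alpha=0.05):
    """Return indices of unitigs significantly associated with resistance.

    X: binary matrix (samples x unitigs), 1 = unitig present in genome.
    y: binary resistance labels (1 = resistant). Threshold alpha is a
    hypothetical choice; no multiple-testing correction is applied here.
    """
    significant = []
    for j in range(X.shape[1]):
        present = X[:, j] == 1
        # 2x2 contingency table: presence vs. resistance
        table = [
            [np.sum(present & (y == 1)), np.sum(present & (y == 0))],
            [np.sum(~present & (y == 1)), np.sum(~present & (y == 0))],
        ]
        _, p = fisher_exact(table)
        if p < alpha:
            significant.append(j)
    return significant

# Toy data: unitig 0 perfectly tracks resistance, unitig 1 is never present
X = np.zeros((20, 2), dtype=int)
y = np.array([1] * 10 + [0] * 10)
X[:10, 0] = 1
print(unitig_biomarkers(X, y))  # [0]
```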
Cancer is a broad term for diseases characterized by uncontrolled, abnormal cell growth. With 19.3 million new cases and 10 million cancer-related deaths per annum, it is the second-leading cause of death worldwide. As a method of cancer detection, tools known as microarrays --- which produce a transcriptome, i.e., a rapid and systematic profile of the expression of a large number of genes at once --- are often used to identify cancerous cells. However, prior research has relied on "black-box" algorithms, which are not appropriate for use in the life sciences.
In this study, a novel three-step framework was developed that combines the principles of biostatistics with transparent machine learning to create mathematical equations that predict cancer diagnoses using gene expression levels. First, an XGBoost model is trained on the training set, and the features with nonzero feature importances are carried into the next step, where only genes that show a statistically significant difference (α=0.05) between expression patterns in cancerous and non-cancerous samples are retained. Finally, a novel symbolic regression-based algorithm called the QLattice (short for 'Quantum Lattice') is trained on the remaining features for 10 epochs using the Akaike Information Criterion as its loss function.
Table 1: Performance and Identified Biomarkers by Cancer
To evaluate its performance, the framework was trained and tested on three datasets containing transcriptome profiles from cancerous and non-cancerous tissue for three different cancer types --- acute myeloid leukemia (AML), non-small cell lung cancer (NSCLC), and clear cell renal cell carcinoma (ccRCC). Table 1 shows the accuracy attained for each type as well as the biomarkers used in the mathematical expression (which together serve as a predictive gene signature); an asterisk indicates that the gene has not been associated with that cancer type in previous literature. Notably, only three or four genes' expression levels are used in each case, while prior work has tended to use hundreds.
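The statistical filtering in step two of the framework can be sketched as follows; Welch's t-test is used here as one plausible choice of significance test, and the toy data are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

def filter_significant_genes(X_cancer, X_normal, alpha=0.05):
    """Retain genes whose expression differs significantly between
    cancerous and non-cancerous samples (columns are genes, rows samples).
    Welch's t-test is an assumed choice, not necessarily the paper's."""
    _, p = ttest_ind(X_cancer, X_normal, axis=0, equal_var=False)
    return np.where(p < alpha)[0]

# Toy data: gene 0 is strongly shifted in "cancer", gene 1 is identical
base = np.arange(30, dtype=float)
X_cancer = np.column_stack([base + 10, base])
X_normal = np.column_stack([base, base])
print(filter_significant_genes(X_cancer, X_normal))  # [0]
```

Only the surviving gene indices would then be passed to the symbolic regression step.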
Determining population structure helps us understand connections among different populations and how they evolve over time. This knowledge is important for studies ranging from evolutionary biology to large-scale variant-trait association studies, such as Genome-Wide Association Studies (GWAS). Current approaches to determining population structure include model-based approaches, statistical approaches, and distance-based ancestry inference approaches. In this work, we outline an approach that identifies population structure from k-mer frequencies using principal component analysis (PCA). This approach can be classified as statistical; however, while prior work has employed PCA, here we analyze k-mer frequencies rather than multilocus genotype data (SNPs, microsatellites, or haplotypes). K-mer frequencies can be viewed as a summary statistic of a genome and have the advantage of being easily derived from a genome by counting the number of times each k-mer occurs in a sequence; no genetic assumptions must be met to generate k-mers. Current population-differentiation approaches, such as STRUCTURE, depend on several genetic assumptions and require a careful selection of ancestry-informative markers that can be used to identify populations.
In this work, we show that PCA is able to detect population structure from the k-mer counts of a genome alone. Applying PCA together with a clustering algorithm to k-mer profiles of genomes provides an easy way to detect the number of populations (clusters) present in a dataset. We describe the method and show that the results are comparable to those found by a model-based approach using genetic markers. We validate our method using 48 human genomes from populations identified by the 1000 Human Genomes Project. We also compared our results to those from mash, which determines relationships among individuals using the number of matched k-mers between sequences. We compare the outputs of the two approaches and discuss the sensitivity of population-structure identification for both methods. This study shows that PCA can detect population structure from k-mer frequencies and can separate samples of admixed and non-admixed origin, whereas mash proved highly sensitive to the choice of k-mer length and sketch size.
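The core of the approach (computing k-mer frequency profiles and projecting them with PCA) can be sketched as follows; the sequences, the value of k, and the two-component projection are toy assumptions for illustration.

```python
from collections import Counter
from itertools import product
import numpy as np

def kmer_profile(seq, k=3):
    """Frequency vector over all 4**k DNA k-mers: a summary statistic
    of a genome, derived purely by counting (no genetic assumptions)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return np.array([counts[m] / total for m in kmers])

def pca(X, n_components=2):
    """PCA via SVD of the mean-centered profile matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Two hypothetical "populations" with distinct sequence composition
pop_a = ["ACGTACGTACGT" * 5, "ACGTACGTACGA" * 5]
pop_b = ["GGGGCCCCGGGG" * 5, "GGGGCCCCGGGA" * 5]
X = np.vstack([kmer_profile(s) for s in pop_a + pop_b])
proj = pca(X)
print(proj.shape)  # (4, 2)
```

A clustering algorithm run on `proj` would then recover the two groups; the study pairs PCA with clustering in just this way.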
B-cell epitope prediction for antipeptide paratopes is key to developing novel vaccines and immunodiagnostics. This entails estimating free-energy changes for paratope binding to variable-length disordered peptidic sequences, as has been previously described for the Heuristic Affinity Prediction Tool for Immune Complexes (HAPTIC), which resolves said binding into processes of epitope compaction, collapse and contact by analogy to protein folding. However, HAPTIC analyzes antigen sequence data without excluding potentially problematic candidate epitopes (e.g., comprising inaccessible and/or conformationally rigid residues) while also neglecting the temperature dependence of polyproline II (PPII) helix propensity (for compaction), occurrence of epitope-backbone hydrogen bonding and impact of disulfide bond formation between epitope cysteine residues. The present work thus provides a more physically realistic revision of HAPTIC (HAPTIC2), the HAPTIC2-like Epitope Prediction Tool for Antigen with Disulfide (HEPTAD) and the HAPTIC2/HEPTAD Input Preprocessor (HIP), forming the HAPTIC2/HEPTAD User Toolkit (HUT). HIP facilitates tagging of residues (e.g., in hydrophobic blobs, ordered regions and glycosylation motifs) for exclusion from downstream analyses by HAPTIC2 and HEPTAD. HAPTIC2 enables temperature-dependent PPII helix propensity calculations while also regarding glycine and proline as polar residues that form hydrogen bonds with paratopes. HEPTAD analyzes antigen sequences that each contain two cysteine residues for which the impact of disulfide pairing is estimated as a correction to the free-energy penalty of compaction. All components of HUT (i.e., HIP, HAPTIC2 and HEPTAD) are freely accessible online (http://badong.freeshell.org/hut.htm).
Challenges in medicine are often faced as interdisciplinary endeavors. In such an interdisciplinary view, sonification of medical data provides an additional sensory dimension to highlight often hard-to-find information and details. Examples of sonification of medical data include Covid genome mapping, auditory representations of three-dimensional objects such as the brain, and enhancement of medical imagery through the use of sound. Here, we focus on time-evolution data of kidney filtering efficiency. We consider the estimated glomerular filtration rate (eGFR), the main indicator of kidney efficiency in diabetic kidney disease patients. We propose a technique to sonify eGFR trajectories with time, frequency, and timbre to distinguish amongst patients (Figure 1). Multiple pitch trajectories can be formally investigated with the tools of counterpoint (Figure 2) and computationally analyzed with sound-processing techniques. Patients who present similar patterns of eGFR behavior can be more easily spotted through musical similarities. We use the Fréchet distance, which evaluates the shape similarity between curves, to cluster patients with similar eGFR behavior. We then compare the information gathered through sonification and shape-based analysis: we find the mean curve in each trajectory cluster and compare it with the characteristics of the sonified curves. Clustering methods have also been applied to sound analysis, for example k-means clustering of sound data. The Fréchet-based clustering technique is a development of k-means that takes shape into account. Thus, we sketch a sound-based clustering approach for medical data as an additional tool to find patterns of behavior. This study can foster new research between computer science, medicine, and sound processing.
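The curve-similarity measure at the heart of the clustering step can be illustrated with the discrete Fréchet distance, a standard dynamic-programming approximation of the Fréchet distance for sampled curves; the eGFR trajectories below are hypothetical.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two sampled 1-D curves,
    via the classic dynamic-programming recurrence on a coupling table."""
    n, m = len(P), len(Q)
    ca = np.zeros((n, m))
    d = lambda i, j: abs(P[i] - Q[j])
    ca[0, 0] = d(0, 0)
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d(i, 0))
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d(0, j))
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d(i, j))
    return ca[n - 1, m - 1]

# Hypothetical eGFR trajectories (one value per clinic visit)
a = [90, 85, 80, 78]   # slow decline
b = [88, 84, 79, 77]   # similar shape to a
c = [90, 70, 50, 30]   # rapid decline
print(discrete_frechet(a, b) < discrete_frechet(a, c))  # True
```

A k-means-style loop over such pairwise distances would then group patients with similar eGFR shapes, as the abstract describes.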
Alzheimer's disease (AD) is a heterogeneous, multifactorial neurodegenerative disorder, where beta-amyloid (A), pathologic tau (T), neurodegeneration ([N]), and structural brain network (Net) are four major indicators of AD progression. Most current studies of AD rely on a single data modality and ignore complex biological interactions at the molecular level. In this study, we propose a novel multimodal spatiotemporal stratification network (MSSN) that is built upon the fusion of multiple data modalities and the combined power of systems biology and deep learning. Altogether, our stratification approach can (1) ameliorate limitations caused by insufficient longitudinal imaging data, (2) extract important spatiotemporal feature vectors from imaging data, (3) exploit subject-specific longitudinal prediction of a holistic biomarker set, and (4) generate symptom-related fine-grained subtype classification.
Segments of DNA that are inherited from a common ancestor are referred to as identical-by-descent (IBD). Because these segments are inherited, they not only allow us to study population characteristics and the sharing of rare variants, but also to understand the hidden familial relationships within populations. Over the past two decades, various IBD-finding algorithms have been developed using hidden Markov model (HMM), hashing-and-extension, and Burrows-Wheeler Transform (BWT) approaches.
In this study, we investigate the utility of pedigree information in enhancing the efficacy of IBD-finding methods for endogamous populations. With the increasing prevalence of computationally efficient sequencing technology and proper documentation of pedigree structures, we expect complete pedigree information to become readily available for more populations. While IBD segments have been used to reconstruct pedigrees, now that we have access to the pedigree it is natural to ask whether including pedigree information would substantially improve IBD-segment finding for the purpose of studying inheritance.
Our contributions center on two types of IBD-finding algorithms for reducing the number of false positives among detected IBD segments. Both methods analyze the familial relationships between cohorts of individuals who are initially hypothesized to share IBD segments. Our first algorithm is inspired by the k-nearest neighbors (KNN) algorithm: we perform outlier detection on the cohort of IBD-sharing individuals, with proximity determined by the kinship coefficient evaluated from the pairwise relationships between individuals in the cohort. Our second algorithm is inspired by the Bonsai algorithm and uses multiple hypothesis tests to evaluate whether an individual has much more IBD than is expected by chance. The Bonsai IBD detection algorithm first divides the pedigree into multiple cohorts of family members with no shared individuals, picks the two cohorts with the most shared IBD, and performs a hypothesis test between individuals in the first cohort and everyone in the second cohort. If the hypothesis test is rejected, we remove the individual from the cohort, recompute the common ancestor, and recurse on the remaining individuals and the new cohort. Essentially, we account for recombination rates on top of Bonsai's hypothesis-test computations. Our algorithms are evaluated against simulations of an endogamous Amish population to determine their efficacy in removing false-positive IBD segments.
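The first algorithm's idea (outlier screening over pairwise kinship coefficients) can be sketched as below; the kinship matrix, the mean-kinship criterion, and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def kinship_outliers(kinship, threshold=0.05):
    """Flag cohort members whose average kinship to the rest of the
    IBD-sharing cohort falls below a threshold -- a KNN-style outlier
    screen where proximity is the pairwise kinship coefficient.
    The threshold is a hypothetical choice."""
    n = kinship.shape[0]
    mean_kin = (kinship.sum(axis=1) - np.diag(kinship)) / (n - 1)
    return np.where(mean_kin < threshold)[0]

# Toy cohort: individuals 0-2 are close relatives, individual 3 is unrelated
K = np.array([[0.50, 0.25, 0.25, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.25, 0.25, 0.50, 0.00],
              [0.00, 0.00, 0.00, 0.50]])
print(kinship_outliers(K))  # [3]
```

An individual flagged this way would be treated as a likely false-positive member of the IBD-sharing cohort.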
Porcine reproductive and respiratory syndrome virus (PRRSV) is the most economically important swine pathogen in North America and is second globally only to African swine fever virus. PRRSV is a positive-sense, single-stranded RNA virus associated with reproductive disorders of sows and respiratory disease in pigs of all ages. Diagnostic tests are commonly used to monitor the presence of PRRSV in swine populations, including sequencing the open reading frame 5 (ORF5) gene to track the epidemiology of the virus and lateral introductions into a farm. PRRSView is a web portal created at the Iowa State University Veterinary Diagnostic Laboratory (ISU VDL) to host analytical and phylogenetic tools related to PRRSV ORF5 sequences, with the goal of assisting veterinarians and producers in evaluating the genetic diversity and the spatial and temporal aspects of ORF5 sequences maintained in the ISU VDL database. PRRSView works in conjunction with the broader Swine Disease Reporting System (SDRS) project to contextualize the ever-changing patterns of PRRSV diversity, and supports interactive tools for veterinarians to analyze their sequence data in comparison to other sequences detected throughout the United States. The PRRSView homepage provides a phylogenetic overview of the sequences generated by the ISU VDL within the previous month, indicating the strains currently detected in circulation. There are currently three ORF5 analytical tools available on PRRSView: a genetic sequence BLAST tool, a vaccine identity tool, and an RFLP tool. The ORF5 BLAST tool allows users to submit their ORF5 gene sequences and returns up to 10 closely related sequences from the ISU VDL database, with metadata that includes the state, genetic lineage, RFLP, and identity to the query sequence.
The vaccine identity tool allows users to quickly calculate the percent homology of their sequence(s) to five different PRRSV vaccines - Ingelvac PRRS ATP, Ingelvac PRRS MLV, Prime Pac PRRS, Fostera PRRS, and Prevacent PRRS - as well as the distance to the Lelystad strain. This tool also builds a neighbor-joining tree with a set of curated strains to estimate the genetic lineage of the sequence, which is rendered in the web browser for viewing. Additionally, this tool calculates the RFLP of the sequence, and the exact positions of the cut sites are shown when hovering over the RFLP value. The last analytical tool is the ORF5 RFLP tool, which quickly calculates the RFLP pattern of the input sequences. These analytical tools are designed to let veterinarians and researchers easily analyze their PRRSV ORF5 sequences against the expansive ISU VDL database, gaining valuable epidemiologic information and comparative data regarding the genetic lineages and related metadata of the PRRSV circulating in a production system, while lowering the barrier to entry.
Billions of neurons make up our brains, and the emergence of synchronous behavior among them is one of the most fundamental questions in neuroscience. In a system as complex as the human brain, synchronization of neuronal activity can be useful and necessary, as during sleep cycles and in the consolidation of memory, but can also be problematic and undesirable in disorders such as epilepsy and Parkinson's disease. The goal of this study is to shed light on a particular type of neuronal synchronization associated with epileptic seizures, which result from a central nervous system disorder characterized by abnormal brain activity. The approach consists of analyzing electroencephalogram (EEG) data containing information about the neuronal electrical activity of epileptic patients before, during, and after a seizure. The database includes EEG recordings of 14 patients obtained from the Unit of Neurology and Neurophysiology of the University of Siena, with electrical activity collected from 29 brain areas through electrodes placed on the scalp of the patients. The data are first preprocessed using filters to reduce the noise level, and the phase of the filtered signal is extracted using the Hilbert transform and the Phase Estimation by Means of Frequency (PEMF) methods. The phase of each of the 29 signals is then compared over time with each of the other 28 signals to verify whether the signals have their phases in synchrony. We compute the phase locking value (PLV) to quantify the level of synchronization between pairs of signals and obtain color maps for graphical visualization of the overall behavior of the brain's electrical activity (Fig. 1, top panel). The functional connectivity before, during, and after a seizure for one patient is depicted in Fig. 1, bottom panel; each line represents a functional connection where the PLV was greater than 0.95.
Our preliminary results show that, across patients, more channels are synchronized during the seizure than before or after it. Additionally, neurons in certain areas of the brain tend to be more synchronous than others during the epileptic seizure. The approach considered in this work can be extended beyond epilepsy, with potential applications to other neurological disorders such as schizophrenia and Parkinson's disease.
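The PLV computation described above can be sketched as follows, using the Hilbert transform to extract instantaneous phases; the test signals are synthetic stand-ins for filtered EEG channels.

```python
import numpy as np
from scipy.signal import hilbert

def phase_locking_value(x, y):
    """PLV between two signals: the magnitude of the time-averaged
    phase-difference phasor. Phases come from the Hilbert transform,
    one of the two phase-extraction methods used in the pipeline."""
    px = np.angle(hilbert(x))
    py = np.angle(hilbert(y))
    return np.abs(np.mean(np.exp(1j * (px - py))))

t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 10 * t)
y = np.sin(2 * np.pi * 10 * t + 0.5)  # same frequency, constant phase lag
z = np.sin(2 * np.pi * 17 * t)        # different frequency, drifting phase
print(phase_locking_value(x, y) > 0.9)  # True: phases locked
print(phase_locking_value(x, z) < 0.5)  # True: phases not locked
```

A PLV near 1 for a channel pair corresponds to one of the lines drawn in the connectivity plots (threshold 0.95 in the study).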
Vaccination is a common strategy for controlling viral spreading processes, such as epidemics and computer viruses. Vaccine supply is often limited, and thus devising optimal strategies for allocation and timing of administration can be of high value in fighting epidemic spread.
We account for arbitrary heterogeneous networks (populations) and consider the problem in multi-region systems. We prove a general property of the effective reproduction number: its reduction (under SIR models) is convex. Using this property, we analyze the effects of vaccination strategies on the acquisition of herd immunity and derive an efficient greedy algorithm that finds the allocation and administration timing of vaccines that minimizes the herd immunity threshold (HIT).
Diabetes mellitus is a metabolic disease characterized by abnormally high blood glucose levels. Today, about 15% of the population of Arkansas has been diagnosed with diabetes.
Nationally, diabetes affects about 8.7% of the population and is a leading cause of death in adults. An analysis of the impact of diabetes on Arkansas and the U.S. over the years could elucidate significant trends and factors that could aid in combating and reducing the economic effects of this devastating disease. Trends in the utilization of health care services were examined by studying the health care costs and hospital utilization of diabetes patients with and without complications, both at the state and national level, between 2006 and 2018. Higher levels of hospitalization were seen in the 18--44 and 45--64 age groups compared to the other age groups, both for diabetes patients with and without complications, in Arkansas and the U.S. Levels of hospital use were higher among females in Arkansas, while they were higher among males in the U.S. The prevalence of diabetes with complications shows an increasing trend in younger adults, while the prevalence of diabetes without complications shows an increase in children and adolescents over the last few years. An increase in hospital costs is seen overall for patients with diabetes.
We describe the experience of converting a CUDA implementation of a high-order epistasis detection algorithm to SYCL. Our goal is for this work to be useful to application and compiler developers, with a detailed description of migration paths between CUDA and SYCL. Evaluating the CUDA and SYCL applications on an NVIDIA V100 GPU, we find that loop unrolling needs to be applied manually to the SYCL kernel to obtain comparable performance. The performance of the SYCL group-reduce function, an alternative to CUDA warp-based reduction, depends on the problem and work-group sizes. The 64-bit popcount operation implemented with a tree of adders is slightly faster than the built-in popcount operation. When the number of OpenMP threads is four, the peak performance of the SYCL and CUDA applications is comparable.
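The tree-of-adders popcount mentioned above is the classic bit-parallel technique; a Python rendering of the 64-bit version is shown below (the CUDA/SYCL kernels would express the same masks and shifts in C++).

```python
def popcount64(x):
    """64-bit population count via the tree-of-adders bit trick:
    sum adjacent bits, then adjacent 2-bit fields, then nibbles,
    and finally accumulate the byte sums with one multiply."""
    x &= 0xFFFFFFFFFFFFFFFF
    x = x - ((x >> 1) & 0x5555555555555555)            # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)  # 4-bit
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F            # 8-bit sums
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

print(popcount64(0b1011))       # 3
print(popcount64(2**64 - 1))    # 64
```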
Sentiment analysis aims at extracting opinions and/or emotions, mainly from written text. The most popular problem in sentiment analysis is polarity detection, which falls into the broader class of Natural Language Processing (NLP) text-classification problems. To date, state-of-the-art approaches to text classification use neural language models built on popular architectures such as Transformers. However, these approaches are difficult to apply to low-resource languages and domains, for instance the Italian language or small clinical trials. Motivated by this, this paper presents VADER-IT, a lexicon-based algorithm for polarity prediction in written text, which is an adaptation to Italian of the popular VADER. Unlike VADER, our system also predicts a polarity class (i.e., positive, negative, or neutral). The system was tested on a dataset of 5495 healthcare-related reviews from QSalute (https://www.qsalute.it/), reaching a micro-averaged F1-score of 81% and a micro-averaged Jaccard score of 73%.
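A lexicon-based polarity classifier of the kind VADER-IT implements can be sketched as follows; the tiny Italian lexicon and the class thresholds are hypothetical stand-ins for the actual VADER-IT resources.

```python
def polarity(text, lexicon):
    """Toy lexicon-based polarity classifier: sum the valence of each
    known word and map the total score to a polarity class.
    The lexicon and thresholds are illustrative, not VADER-IT's."""
    score = sum(lexicon.get(w.strip(".,!?").lower(), 0.0)
                for w in text.split())
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"

# Hypothetical mini-lexicon of Italian valence scores
lex = {"ottimo": 2.0, "pessimo": -2.0, "buono": 1.5}
print(polarity("Ottimo servizio, personale buono", lex))  # positive
print(polarity("Servizio pessimo", lex))                  # negative
```

The real system additionally handles intensifiers, negations, and punctuation cues, which a sketch like this omits.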
The processing of medical images is gaining importance in enabling increasingly accurate diagnoses, essential for the identification and treatment of chronic diseases. We focus on image-processing techniques, such as segmentation, and report implementation experiences and tests in different programming languages. Our results concern the use and implementation of the K-means algorithm to analyze T1-weighted MRI images of 233 subjects. The dataset is an openly available one containing images of three different brain tumors (meningioma, glioma, and pituitary tumor). We report the results of implementing the K-means algorithm in two different programming languages, Java and Octave, measuring their respective performance.
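The K-means segmentation benchmarked in the study can be sketched on grayscale intensities as follows; the pixel values are toy data, and the percentile-based initialization is one simple deterministic choice (the paper's implementations are in Java and Octave).

```python
import numpy as np

def kmeans_1d(pixels, k=3, iters=20):
    """Plain k-means on grayscale intensities: alternate assigning each
    pixel to its nearest center and recomputing centers as cluster means.
    Centers start at evenly spaced percentiles for deterministic behavior."""
    centers = np.percentile(pixels, np.linspace(0, 100, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean()
    return labels, centers

# Toy "image": intensities from three well-separated tissue classes
pixels = np.array([10, 11, 12, 100, 101, 102, 200, 201, 202], dtype=float)
labels, centers = kmeans_1d(pixels)
print(labels)  # [0 0 0 1 1 1 2 2 2]
```

On an MRI slice, the same loop would run over the flattened intensity array, and the label image would delineate tissue and tumor regions.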
Scoliosis is a curvature of the spine often found in adolescents. The management of patients with scoliosis is commonly carried out through manual methods or web applications. Web applications require the doctor or patient to upload data after taking scoliosis measurements with a separate instrument.
More recently, applications that can be downloaded onto smartphones (so-called apps) have been integrated into the clinical practice of scoliosis management. These applications allow scoliosis measurements to be taken directly, without requiring the user to upload data, thanks to the smartphone's sensors. In this paper, we first define some qualitative criteria to evaluate such apps and then evaluate some relevant apps for scoliosis management. The evaluation criteria taken into consideration include Availability, Technology, Measurement, Functions, and Qualitative evaluation. Each criterion represents an aspect of the apps and serves to characterize them.
App-based scoliosis management offers several advantages on both the doctor's and the patient's side. For example, from the patient's point of view, the app may be useful to continuously and easily monitor scoliosis at home, while the doctor can monitor the evolution of scoliosis over time, reducing the number of visits required of the patient.
This paper provides an overview of scalable deep learning platforms and how they are used in medical contexts. An introduction highlights the key factors, and an overview of the medical context is then provided. Afterwards, the basic concepts of deep learning and of parallel and distributed computing are briefly recalled, and a specific deep learning library for medical applications is described. The last part of the paper focuses on a real use case of deep learning applied to medical data. The main contribution of this paper is thus a short survey of the main scalable deep learning platforms, with a first analysis of their features, and the description of a practical example.
In this era of Big Data and AI, expertise in multiple aspects of data, computing, and the domains of application is needed. This calls for teams of experts with different training and perspectives. Because data analysis can have serious ethical implications, it is important that these teams are well and deeply integrated. No-Boundary Thinking (NBT) teams can provide support for team formation and maintenance, thereby attending to the many dimensions of the ethics of data and analysis. In this NBT workshop session, we discuss the ethical concerns that arise from the use of data and AI, and the implications for team building, and we provide and brainstorm suggestions for ethical data-enabled science and AI.
Team building can be challenging even when participants are from the same discipline or sub-discipline, but it needs special attention when participants use different vocabularies and hold different cultural views on what constitutes viable problems and solutions. Essential to No-Boundary Thinking (NBT) teams is proper formulation of the problem to be solved, and a basic tenet is that the NBT team must come together with diverse perspectives to define the problem before solutions can be considered. Given that participants come with different views on problem formulation and solution, it is important to adopt a robust process for team formation and maintenance. This takes extra effort and time, but scholars studying teams of experts with diverse training have found that such teams are better positioned to solve even deep and difficult problems, especially once they have learned to work well with each other. At this workshop we will discuss principles that scholars who have worked in NBT teams have found effective. We will then engage with the workshop participants to discuss these principles and to brainstorm other approaches.
Genome rearrangement problems in computational biology have been modeled as combinatorial optimization problems related to the familiar problem of sorting, namely transforming arbitrary permutations into the identity permutation. When a permutation is viewed as a string of the integers from 1 through n, any substring of it that is also a substring of the identity permutation is called a strip. The objective in the combinatorial optimization problems arising from these applications is to obtain the identity permutation from an arbitrary permutation in the minimum number of applications of a particular chosen strip operation. Among the strip operations investigated thus far in the literature are strip moves, transpositions, reversals, and block interchanges. However, most of the existing research on sorting by strip operations has focused on obtaining hardness results or designing approximation algorithms, with little work carried out thus far on implementing the proposed approximation algorithms. In this paper, two new algorithms for sorting by strip swaps are presented. The first algorithm takes a greedy approach, selecting at each step a strip swap that reduces the number of strips the most and places the maximum number of strips in their correct positions. The second algorithm brings the closest consecutive pairs together at each step. Approximation ratios for these two algorithms are estimated experimentally.
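To illustrate, the strip primitive and the core idea of the first (greedy) algorithm can be sketched as follows; this is a simplified sketch that picks the swap minimizing the resulting number of strips, omitting the secondary tie-breaking on correctly placed strips, and is not the authors' implementation:

```python
def strips(perm):
    """Split a permutation of 1..n into its maximal strips: maximal
    substrings that are also substrings of the identity 1, 2, ..., n
    (i.e. maximal runs of consecutive increasing integers)."""
    out, cur = [], [perm[0]]
    for x in perm[1:]:
        if x == cur[-1] + 1:
            cur.append(x)          # extend the current strip
        else:
            out.append(cur)        # strip ends, start a new one
            cur = [x]
    out.append(cur)
    return out

def greedy_strip_swaps(perm):
    """Sort by strip swaps, greedily applying at each step the swap of
    two strips that leaves the fewest strips. Returns the number of
    swaps used (a sketch of the greedy idea, without tie-breaking)."""
    target = sorted(perm)
    ops = 0
    while perm != target:
        s = strips(perm)
        best = None
        for i in range(len(s)):
            for j in range(i + 1, len(s)):
                # Swap strips i and j, then flatten back to a permutation.
                t = s[:i] + [s[j]] + s[i + 1:j] + [s[i]] + s[j + 1:]
                cand = [x for blk in t for x in blk]
                key = len(strips(cand))
                if best is None or key < best[0]:
                    best = (key, cand)
        perm = best[1]
        ops += 1
    return ops
```

For example, `[3, 4, 1, 2, 5]` has strips `[3,4]`, `[1,2]`, `[5]`, and swapping the first two strips yields the identity in a single operation.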
Added information for authors
June 24, 2022
Added CNB-MAC workshop
May 20, 2022
Extended submission deadline
May 15, 2022
Updated Call for highlights
Apr 25, 2022
Updated Call for posters
Apr 12, 2022
Updated Call for tutorials
Mar 28, 2022
Added sponsorship benefits
Mar 14, 2022
Updated Call for workshops
Feb 28, 2022
Website launched with CFP
Jan 11, 2022