ACM-BCB 2020

Accepted Tutorials

Homomorphic encryption and its application to privacy-preserving genotype imputation

Organizers: Gamze Gursoy (Yale University), Michail Maniatakos (NYU Abu Dhabi), Eduardo Chielle (NYU Abu Dhabi), Oleg Mazonka (NYU Abu Dhabi)

Abstract: The genomic characterization of thousands of individuals promises to be immensely useful for medical research. Including more study individuals increases statistical power for making more and better discoveries; hence more individuals are likely to be sequenced in the future. As the number of genomes increases, more computational resources are needed to store and analyze such data. Therefore, federal institutions like NIH are increasingly outsourcing genomic computation to the cloud. As genomic information of individuals is sensitive and prone to re-identification for malicious purposes, outsourcing computation to the cloud poses serious privacy concerns. One way to overcome the privacy issues is to perform all the computations in the encrypted domain, which is possible with a form of encryption called homomorphic encryption (HE). In this tutorial, we will give a high-level introduction to homomorphic encryption. We will introduce our Encrypt-Everything-Everywhere (E3) framework with hands-on practice. E3 makes HE programming enjoyable by staying as close as possible to standard programming languages, allowing programmers to incorporate privacy into their programs without expertise in cryptography. Lastly, we will demonstrate the feasibility of HE in the genotype imputation problem with a few example solutions developed using statistical inference and machine learning frameworks.

URL: https://github.com/momalab/e3/wiki
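
To make the idea of computing in the encrypted domain concrete, here is a minimal sketch of an additively homomorphic scheme (textbook Paillier). It is not E3 and not the tutorial's method: the function names, the tiny primes, and the genotype dosage values are assumptions chosen for illustration, and the key size is far too small to be secure. It only shows that a server can add encrypted values without ever decrypting them.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier key generation. The default primes are illustrative only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # standard simple generator
    mu = pow(lam, -1, n)                               # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt an integer 0 <= m < n under the public key."""
    n, g = pub
    while True:
        r = secrets.randbelow(n - 1) + 1               # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    n2 = n * n
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    """Recover the plaintext from ciphertext c with the private key."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add(pub, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)


def scale(pub, c, k):
    """Homomorphic multiplication of the plaintext by a public constant k."""
    n, _ = pub
    return pow(c, k, n * n)


if __name__ == "__main__":
    pub, priv = keygen()
    dosages = [0, 1, 2, 1, 0, 2]                       # hypothetical genotype dosages
    encrypted = [encrypt(pub, d) for d in dosages]

    # The "cloud" sums the ciphertexts without seeing any dosage.
    total = encrypted[0]
    for c in encrypted[1:]:
        total = add(pub, total, c)

    print(decrypt(pub, priv, total))                   # the data owner decrypts the sum
```

Paillier supports only addition and multiplication by public constants; the fully homomorphic schemes discussed in the tutorial also allow multiplication of two encrypted values, which is what general computation requires.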

Phylogenetic Tree Reconciliation from Practice to Theory

Organizers: Ran Libeskind-Hadas (Department of Computer Science, Harvey Mudd College)

Abstract: Phylogenetic tree reconciliation is a fundamental tool for studying the complex interactions between genes and species, parasites and hosts, and species and their habitats. The reconciliation problem seeks to find the best supported evolutionary history of a pair of phylogenetic trees by identifying the events that link them. The advent of efficient computational methods for tree reconciliation has led to numerous fundamental discoveries. For example, phylogenetic tree reconciliation was a key tool in the discovery that over a quarter of extant gene families arose during the Archaean period and that interactions of species and their ecological niches are strongly conserved across the entire tree of life. Researchers have used tree reconciliation to study viruses including the coronavirus and HIV. This workshop will provide participants with hands-on experience using several reconciliation tools, will address the relative merits and limitations of existing tools, and will discuss recent algorithmic developments and areas for future research.

URL: https://sites.google.com/g.hmc.edu/acm-bcb2020tutorial/home

Privacy-Preserving Genomic Data Sharing

Organizers: Erman Ayday (Case Western Reserve University, USA, and Bilkent University, Turkey), Xiaoqian Jiang (University of Texas, Houston, USA)

Abstract: With the help of rapidly developing technology, DNA sequencing is becoming less expensive. Consequently, research in genomics has gained speed in paving the way to personalized (genomic) medicine, and geneticists need large collections of human genomes to further increase this speed. Furthermore, individuals are using their genomes to learn about their (genetic) predispositions to diseases, their ancestries, and even their compatibility with potential partners. On the other hand, genomic data carries much sensitive information about its owner. By analyzing the DNA of an individual, it is now possible to learn about their disease predispositions, ancestries, and physical attributes. The threat to genomic privacy is magnified by the fact that a person’s genome is correlated with their family members’ genomes. The Golden State Killer case is an example of how such linkage can bring shame on innocent family members. This tutorial will help bioinformaticians better understand the privacy challenges of genomic data sharing, a crucial requirement for facilitating the use of genomic data in research and treatment. We will discuss privacy challenges and privacy-preserving solutions when (i) an individual (data owner) shares their data with a data collector (a service provider or a public database); (ii) a data collector shares statistical information about its database; and (iii) two or more data owners or data collectors (e.g., hospitals) share their data (or databases) with each other. No prerequisite knowledge of security or cryptography is required for the attendees of this tutorial. We only require the attendees to have a basic background in genomics and statistics.

From Multi-omics Data to Knowledge Networks

Organizers: Rabie Saidi (European Bioinformatics Institute (EMBL-EBI), Cambridge, UK), Maryam Abdollahyan (Barts Cancer Institute (BCI), London, UK), Maria J Martin (European Bioinformatics Institute (EMBL-EBI), Cambridge, UK)

Abstract: Knowledge networks are powerful tools for investigating the relationships between biological entities (genes, proteins, diseases, etc.). Such networks have been extensively used for different applications including functional annotation, drug discovery and precision medicine. Using various data sources (the UniProt Knowledgebase, protein interactions and pathway data, phenotype data, disease ontologies and drug databases), we present a multi-omics data integration approach to building a knowledge network. This tutorial is a hands-on introduction to two emerging tools in the field of big data mining, namely the Apache Spark data processing framework and the Apache Zeppelin interactive data analytics framework. Participants will learn about techniques for data ingestion and transformation, data structures for representing knowledge networks and how to explore them.

URL: https://gitlab.ebi.ac.uk/data-science/acm-bcb-2020-tutorial

SC1: interactive web-based single cell RNA-seq analysis

Organizers: Marmar Moussa (Department of Immunology and Carole and Ray Neag Comprehensive Cancer Center, University of Connecticut School of Medicine, USA), Ion Mandoiu (Computer Science and Engineering, University of Connecticut, USA)

Abstract: Single cell RNA-seq (scRNA-seq) is critical for studying cellular function and phenotypic heterogeneity as well as development of tissues and tumors. Currently there are very few tools that allow researchers to analyze scRNA-seq data without requiring considerable coding expertise. Here, we present a web-based interactive scRNA-seq data analysis tool publicly accessible at https://sc1.engr.uconn.edu. The pipeline implements a novel method of selecting informative genes based on the term frequency-inverse document frequency (TF-IDF) transformation, and provides a broad range of methods for cell clustering, differential expression analysis, gene enrichment, visualization, and cell cycle analysis. In only a few steps, researchers can generate a full initial analysis with powerful insights into their single cell RNA-seq data.

Computational analysis of viral genomes and outbreaks

Organizers: Pavel Skums (Department of Computer Science, Georgia State University, USA), Alex Zelikovsky (Department of Computer Science, Georgia State University, USA)

Abstract: Next-Generation Sequencing (NGS) has facilitated the assessment of viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data requires sophisticated analysis dealing with millions of error-prone short reads per patient. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We will survey bioinformatics tools analyzing NGS data for (i) characterization of viral genetic diversity including SNV and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.

Fundamentals of alignment free sequence analysis: k-mer hashing

Organizers: Sven Rahmann (Genome Informatics, Institute of Human Genetics, University Medicine Essen, University of Duisburg-Essen, Essen, Germany), Jens Zentgraf (Computer Science XI / Algorithm Engineering, TU Dortmund University, Dortmund, Germany)

Abstract: In recent years, alignment-free sequence analysis methods have gained importance, due to their superior speed at equivalent results in comparison to traditional mapping- and alignment-based methods. Recently, methods have emerged that are able to index very large collections of sequenced DNA samples (e.g., any genome ever sequenced). The basis of each alignment-free method is a so-called k-mer dictionary (or key-value store) that associates a value (e.g., a transcript ID, chromosome number, species ID or counter) with each DNA substring of length k (from a genome or a sequenced sample). Almost always, such a dictionary is implemented via hashing. Ideally, considering that billions of k-mers have to be processed, such a hash table is both small and fast. It is both a science and an art to design fast and small hash tables for a given task.
This tutorial is addressed to bioinformaticians who have heard about or used alignment-free methods and would like to know more about the underlying hashing algorithms. The tutorial will also be interesting to algorithmically oriented scientists who have not followed the advances in hashing methods over the past few years. Following the tutorial will enable you to better understand the underlying methods (and their limitations) of many state-of-the-art sequence analysis tools in genomics, transcriptomics, metagenomics and pangenomics. It will also help you to design your own method efficiently when the need arises.
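
As a rough illustration of the k-mer dictionary described in the abstract, the sketch below counts k-mers using a 2-bit integer encoding as the hash key. It is an assumption-laden toy, not the organizers' method: the function names are hypothetical, Python's built-in `dict` stands in for the specialized small and fast hash tables the tutorial covers, and canonical (reverse-complement) k-mers are ignored.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(seq, k):
    """Yield the 2-bit integer code of every k-mer in seq.

    Windows containing a character other than A, C, G, T are skipped.
    """
    mask = (1 << (2 * k)) - 1   # keeps only the last k bases (2 bits each)
    code = 0
    valid = 0                   # number of consecutive valid bases seen so far
    for base in seq.upper():
        value = ENCODE.get(base)
        if value is None:       # e.g. 'N': restart the rolling window
            code = 0
            valid = 0
            continue
        code = ((code << 2) | value) & mask
        valid += 1
        if valid >= k:
            yield code


def count_kmers(seq, k):
    """Build a k-mer dictionary mapping integer k-mer codes to counters."""
    counts = {}
    for code in kmer_codes(seq, k):
        counts[code] = counts.get(code, 0) + 1
    return counts


def decode(code, k):
    """Turn a 2-bit integer code back into its DNA string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


if __name__ == "__main__":
    counts = count_kmers("ACGTACGT", 3)
    for code in sorted(counts):
        print(decode(code, 3), counts[code])
```

Encoding each base in 2 bits means a k-mer with k up to 32 fits in a single 64-bit integer, which is one reason integer keys are a common starting point for compact k-mer hash tables.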

Important Dates

Call for	Submission Deadline	Notification of Acceptance
Papers	June 12	July 15
Workshops	March 27	April 3
Tutorials	April 15	April 22
Highlights	July 19	July 29
Posters	July 22	July 29
Late-break poster	August 15	August 17
Camera-ready:	July 29