Our lab is compromised to bridging the worlds of Artificial Intelligence systems and cancer treatment methods. Take a further look into what we have been working over the past years.


Unveiling Zero Shot Prediction for Gene Attributes Through Interpretable AI

Ala Jararweh, Oladimeji Macaulay, David Arredondo, Olufunmilola Oyebamiji, Luis E Tafoya, Kushal Virupakshappa, Avinash Sahu

Representation learning has transformed the prediction of structures and functions of genes and proteins by employing sequence, expression, and network data. Yet, this approach taps into just a fraction of the knowledge accumulated over more than a century of genetic research. Here, we introduce GeneLLM, an interpretable transformer-based model that integrates textual information through contrastive learning to refine gene representations. While it has been posited that such knowledge representation could result in bias towards well-characterized genes, GeneLLM surprisingly shows high accuracy across eight gene-related benchmarks, not only matching but often outperforming task-specific models, with a 50% increase in accuracy over its closest solubility-specific competitor. It demonstrates robust zero-shot learning capabilities for unseen gene annotations. The model’s interpretability and our multimodal strategic approach to mitigating inherent data biases bolster its utility and reliability, particularly in biomedical applications where interpretability is paramount. Our findings affirm the complementary nature of unstructured text to structured databases in enhancing biomedical predictions, while conscientiously addressing interpretability and bias for AI deployment in healthcare.

A community challenge to predict clinical outcomes after immune checkpoint blockade in non-small cell lung cancer

Mike Mason, Óscar Lapuente-Santana, Anni S. Halkola, Wenyu Wang, Raghvendra Mall, Xu Xiao, Jacob Kaufman, Jingxin Fu, Jacob Pfeil, Jineta Banerjee, Verena Chung, Han Chang, Scott D. Chasalow, Hung Ying Lin, Rongrong Chai, Thomas Yu, Francesca Finotello, Tuomas Mirtti, Mikko I. Mäyränpää, Jie Bao, Emmy W. Verschuren, Eiman I. Ahmed, Michele Ceccarelli, Lance D. Miller, Gianni Monaco, Wouter R. L. Hendrickx, Shimaa Sherif, Lin Yang, Ming Tang, Shengqing Stan Gu, Wubing Zhang, Yi Zhang, Zexian Zeng, Avinash Das Sahu, Yang Liu, Wenxian Yang, Davide Bedognetti, Jing Tang, Federica Eduati, Teemu D. Laajala, William J. Geese, Justin Guinney, Joseph D. Szustakowski, Benjamin G. Vincent & David P. Carbone

Predictive biomarkers of immune checkpoint inhibitor (ICI) efficacy are currently lacking for non-small cell lung cancer (NSCLC). Here, we describe the results from the Anti–PD-1 Response Prediction DREAM Challenge, a crowdsourced initiative that enabled the assessment of predictive models by using data from two randomized controlled clinical trials (RCTs) of ICIs in first-line metastatic NSCLC.

Chromosome 7 to the rescue: overcoming chromosome 10 loss in gliomas

Nishanth Ulhas Nair, Alejandro A. Schäffer, E. Michael Gertz, Kuoyuan Cheng, Johanna Zerbib, Avinash Das Sahu, Gil Leor, Eldad D. Shulman, Kenneth D. Aldape, Uri Ben-David, Eytan Ruppin

The co-occurrence of chromosome 10 loss and chromosome 7 gain in gliomas is the most frequent loss-gain co-aneuploidy pair in human cancers, a phenomenon that has been investigated without resolution since the late 1980s. Expanding beyond previous gene-centric studies, we investigate the co-occurrence in a genome-wide manner taking an evolutionary perspective. First, by mining large tumor aneuploidy data, we predict that the more likely order is 10 loss followed by 7 gain. Second, by analyzing extensive genomic and transcriptomic data from both patients and cell lines, we find that this co-occurrence can be explained by functional rescue interactions that are highly enriched on 7, which can possibly compensate for any detrimental consequences arising from the loss of 10. Finally, by analyzing transcriptomic data from normal, non-cancerous, human brain tissues, we provide a plausible reason why this co-occurrence happens preferentially in cancers originating in certain regions of the brain.


Tutorial: integrative computational analysis of bulk RNA-sequencing data to characterize tumor immunity using RIMA

Lin Yang, Jin Wang, Jennifer Altreuter, Aashna Jhaveri, Cheryl J. Wong, Li Song, Jingxin Fu, Len Taing, Sudheshna Bodapati, Avinash Sahu, Collin Tokheim, Yi Zhang, Zexian Zeng, Gali Bai, Ming Tang, Xintao Qiu, Henry W. Long, Franziska Michor, Yang Liu & X. Shirley Liu

RNA-sequencing (RNA-seq) has become an increasingly cost-effective technique for molecular profiling and immune characterization of tumors. In the past decade, many computational tools have been developed to characterize tumor immunity from gene expression data. However, the analysis of large-scale RNA-seq data requires bioinformatics proficiency, large computational resources and cancer genomics and immunology knowledge. In this tutorial, we provide an overview of computational analysis of bulk RNA-seq data for immune characterization of tumors and introduce commonly used computational tools with relevance to cancer immunology and immunotherapy. These tools have diverse functions such as evaluation of expression signatures, estimation of immune infiltration, inference of the immune repertoire, prediction of immunotherapy response, neoantigen detection and microbiome quantification. We describe the RNA-seq IMmune Analysis (RIMA) pipeline integratingmany of these tools to streamline RNA-seq analysis. We also developed a comprehensive and user-friendly guide in the form of a GitBook with text and video demos to assist users in analyzing bulk RNA-seq data for immune characterization at both individual sample and cohort levels by using RIMA

Robust identification of perturbed cell types in single-cell RNA-seq data

Phillip B. Nicol, Danielle Paulson, Gege Qian, X. Shirley Liu,  Rafael Irizarry &  Avinash D. Sahu

Single-cell transcriptomics has emerged as a powerful tool for understanding how different cells contribute to disease progression by identifying cell types that change across diseases or conditions. However, detecting changing cell types is challenging due to individual-to-individual and cohort-to-cohort variability and naive approaches based on current computational tools lead to false positive findings. To address this, we propose a computational tool, scDist, based on a mixed-effects model that provides a statistically rigorous and computationally efficient approach for detecting transcriptomic differences. By accurately recapitulating known immune cell relationships and mitigating false positives induced by individual and cohort variation, we demonstrate that scDist outperforms current methods in both simulated and real datasets, even with limited sample sizes. Through the analysis of COVID-19 and immunotherapy datasets, scDist uncovers transcriptomic perturbations in dendritic cells, plasmacytoid dendritic cells, and FCER1G+NK cells, that provide new insights into disease mechanisms and treatment responses. As single-cell datasets continue to expand, our faster and statistically rigorous method offers a robust and versatile tool for a wide range of research and clinical applications, enabling the investigation of cellular perturbations with implications for human health and disease.

A landscape of response to drug combinations in non-small cell lung cancer

Nishanth Ulhas Nair, Patricia Greninger, Xiaohu Zhang, Adam A. Friedman, Arnaud Amzallag, Eliane Cortez, Avinash Das Sahu, Joo Sang Lee, Anahita Dastur, Regina K. Egan, Ellen Murchie, Michele Ceribelli, Giovanna S. Crowther, Erin Beck, Joseph McClanaghan, Carleen Klump-Thomas, Jessica L. Boisvert, Leah J. Damon, Kelli M. Wilson, Jeffrey Ho, Angela Tam, Crystal McKnight, Sam Michael, Zina Itkin, Mathew J. Garnett, Jeffrey A. Engelman, Daniel A. Haber, Craig J. Thomas, Eytan Ruppin & Cyril H. Benes

Combination of anti-cancer drugs is broadly seen as way to overcome the often-limited efficacy of single agents. The design and testing of combinations are however very challenging. Here we present a uniquely large dataset screening over 5000 targeted agent combinations across 81 non-small cell lung cancer cell lines. Our analysis reveals a profound heterogeneity of response across the tumor models. Notably, combinations very rarely result in a strong gain in efficacy over the range of response observable with single agents. Importantly, gain of activity over single agents is more often seen when co-targeting functionally proximal genes, offering a strategy for designing more efficient combinations. Because combinatorial effect is strongly context specific, tumor specificity should be achievable. The resource provided, together with an additional validation screen sheds light on major challenges and opportunities in building efficacious combinations against cancer and provides an opportunity for training computational models for synergy prediction.

Murad R. Mamedov, Shane Vedova, Jacob W. Freimer, Avinash Das Sahu, Amrita Ramesh, Maya M. Arce, Angelo D. Meringa, Mineto Ota, Peixin Amy Chen, Kristina Hanspers, Vinh Q. Nguyen, Kirsten A. Takeshima, Anne C. Rios, Jonathan K. Pritchard, Jürgen Kuball, Zsolt Sebestyen, Erin J. Adams & Alexander Marson

γδ T cells are potent anticancer effectors with the potential to target tumours broadly, independent of patient-specific neoantigens or human leukocyte antigen background1,2,3,4,5. γδ T cells can sense conserved cell stress signals prevalent in transformed cells2,3, although the mechanisms behind the targeting of stressed target cells remain poorly characterized. Vγ9Vδ2 T cells—the most abundant subset of human γδ T cells4—recognize a protein complex containing butyrophilin 2A1 (BTN2A1) and BTN3A1 (refs. 6,7,8), a widely expressed cell surface protein that is activated by phosphoantigens abundantly produced by tumour cells. Here we combined genome-wide CRISPR screens in target cancer cells to identify pathways that regulate γδ T cell killing and BTN3A cell surface expression. The screens showed previously unappreciated multilayered regulation of BTN3A abundance on the cell surface and triggering of γδ T cells through transcription, post-translational modifications and membrane trafficking. In addition, diverse genetic perturbations and inhibitors disrupting metabolic pathways in the cancer cells, particularly ATP-producing processes, were found to alter BTN3A levels. This induction of both BTN3A and BTN2A1 during metabolic crises is dependent on AMP-activated protein kinase (AMPK). Finally, small-molecule activation of AMPK in a cell line model and in patient-derived tumour organoids led to increased expression of the BTN2A1–BTN3A complex and increased Vγ9Vδ2 T cell receptor-mediated killing. This AMPK-dependent mechanism of metabolic stress-induced ligand upregulation deepens our understanding of γδ T cell stress surveillance and suggests new avenues available to enhance γδ T cell anticancer activity.

Avinash Sahu, Xiaoman Wang, Phillip Munson, Jan P.G. Klomp, Xiaoqing Wang, Shengqing Stan Gu, Ya Han, Gege Qian, Phillip Nicol, Zexian Zeng, Chenfei Wang, Collin Tokheim, Wubing Zhang, Jingxin Fu, Jin Wang, Nishanth Ulhas Nair, Joost A.P. Rens, Meriem Bourajjaj, Bas Jansen, Inge Leenders, Jaap Lemmers, Mark Musters, Sanne van Zanten, Laura van Zelst, Jenny Worthington, Jun S. Liu, Dejan Juric, Clifford A. Meyer, Arthur Oubrie, X. Shirley Liu, David E. Fisher & Keith T. Flaherty

Drugs that kill tumors through multiple mechanisms have the potential for broad clinical benefits. Here, we first developed an in silico multiomics approach (BipotentR) to find cancer cell–specific regulators that simultaneously modulate tumor immunity and another oncogenic pathway and then used it to identify 38 candidate immune–metabolic regulators. We show the tumor activities of these regulators stratify patients with melanoma by their response to anti–PD-1 using machine learning and deep neural approaches, which improve the predictive power of current biomarkers. The topmost identified regulator, ESRRA, is activated in immunotherapy-resistant tumors. Its inhibition killed tumors by suppressing energy metabolism and activating two immune mechanisms: (i) cytokine induction, causing proinflammatory macrophage polarization, and (ii) antigen-presentation stimulation, recruiting CD8+ T cells into tumors. We also demonstrate a wide utility of BipotentR by applying it to angiogenesis and growth suppressor evasion pathways. BipotentR (http://bipotentr.dfci.harvard.edu/) provides a resource for evaluating patient response and discovering drug targets that act simultaneously through multiple mechanisms.


Murad R Mamedov, Shane Vedova, Jacob W. Freimer, Avinash Das Sahu, Peixin Amy Chen, Amrita Ramesh, Kristina Hanspers, Vinh Q. Nguyen, Erin J Adams & Alexander Marson

γδ T cells are potent anti-cancer effectors with potential to target tumors broadly, independent of neoantigens or HLA-background. γδ T cells can sense conserved cell stress signals prevalent in transformed cells, although the mechanisms governing how γδ T cells sense and kill stressed target cells remain poorly characterized. Vγ9Vδ2 T cells – the most abundant subset of human γδ T cells – recognize a protein complex containing butyrophilin 3A1 (BTN3A1), a ubiquitously expressed cell surface protein that is activated by phosphoantigens abundantly produced by tumor cells. Here we performed genome-wide CRISPR screens in target cancer cells to identify pathways that regulate: (1) γδ T cell activity with a functional co-culture killing screen, and (2) BTN3A1 cell surface expression in a phenotypic screen. Multilayered regulation of BTN3A1 expression was uncovered: transcriptional regulation (IRF1, CTBP1, ZNF217, RUNX1), intracellular trafficking, sialylation, oxidative phosphorylation (OXPHOS), purine metabolism, and other metabolic pathways. Consistent with these results, we found upregulated BTN3A1 on cells undergoing an energy crisis due to glucose deprivation, glycolysis inhibition, or OXPHOS inhibition. Furthermore, we discovered that this BTN3A1 upregulation was induced by activation of the AMP-activated protein kinase (AMPK). Also, CRISPR screen-derived gene expression signatures correlated with higher survival in low-grade glioma patients whose tumors had high Vγ9Vδ2 T cell infiltration. Uncovering this AMPK-dependent mechanism of metabolic stress-induced ligand activation deepens our understanding of γδ T cell stress surveillance and suggests new avenues to enhance γδ T cell anti-cancer activity.


Despite the success of targeted therapies including immunotherapies in cancer treatments, tumor resistance to targeted therapies remains a fundamental challenge. Tumors can evolve resistance to a therapy that targets one gene by acquiring compensatory alterations in another gene, such compensatory interaction between two genes is referred to as synthetic rescue (SR) interactions. To identify SRs, here we describe an algorithm, INCISOR, that leverages tumor transcriptomics and clinical information from 10,000 patients as well as data from experimental screens. INCISOR can identify SRs that are common across several cancer-types in genome-wide fashion by sifting through half a billion possible gene–gene combinations and provide a framework to design therapies to tackle resistance.


Christopher Chen, Avinash Sahu, Kristin Price, Christopher Pinto, Read Allen, Xiaoman Wang, Annamaria Szabolcs, Lipika Goyal, David Ting, Shirley Liu & Dejan Juric

Alpelisib, a PI3Kα inhibitor, was recently approved by the FDA for the treatment of metastatic PIK3CA-mutated HR+ breast cancer in combination with fulvestrant. This therapeutic breakthrough came after years of disappointing efforts to develop solid tumor drugs targeting PIK3CA, one of the most commonly altered oncogenes in cancer, including 40% of HR+ breast and 30% of head and neck squamous cell cancer. The limited success of PI3Kinhibitors outside of breast cancer may be partly because cancers may harbor heterogeneous PIK3CA mutations and/or co-occurring mutations that confer varying degrees of sensitivity to PI3Kα inhibition.


Nishanth Ulhas Nair, Avinash Das, Vasiliki-Maria Rogkoti, Michiel Fokkelman, Richard Marcotte, Chiaro G. de Jong, Esmee Koedoot, Joo Sang Lee, Isaac Meilijson, Sridhar Hannenhalli, Benjamin G. Neel, Bob van de Water, Sylvia E. Le Dévédec & Eytan Ruppin

The efficacy of prospective cancer treatments is routinely estimated by in vitro cell-line proliferation screens. However, it is unclear whether tumor aggressiveness and patient survival are influenced more by the proliferative or the migratory properties of cancer cells. To address this question, we experimentally measured proliferation and migration phenotypes across more than 40 breast cancer cell-lines. Based on the latter, we built and validated individual predictors of breast cancer proliferation and migration levels from the cells’ transcriptomics. We then apply these predictors to estimate the proliferation and migration levels of more than 1000 TCGA breast cancer tumors. Reassuringly, both estimates increase with tumor’s aggressiveness, as qualified by its stage, grade, and subtype. However, predicted tumor migration levels are significantly more strongly associated with patient survival than the proliferation levels. We confirmed these findings by conducting siRNA knock-down experiments on the highly migratory MDA-MB-231 cell lines and deriving gene knock-down based proliferation and migration signatures. We show that cytoskeletal drugs might be more beneficial in patients with high predicted migration levels. Taken together, these results testify to the importance of migration levels in determining patient survival.

Assaf Magen, Avinash Das Sahu, Joo Sang Lee, Mahfuza Sharmin, Alexander Lugo, J. Silvio Gutkind, Alejandro A. Schäffer, Eytan Ruppin & Sridhar Hannenhalli

The phenotypic effect of perturbing a gene’s activity depends on the activity level of other genes, reflecting the notion that phenotypes are emergent properties of a network of functionally interacting genes. In the context of cancer, contemporary investigations have primarily focused on just one type of functional relationship between two genes—synthetic lethality (SL). Here, we define the more general concept of “survival-associated pairwise gene expression states” (SPAGEs) as gene pairs whose joint expression levels are associated with survival. We describe a data-driven approach called SPAGE-finder that when applied to The Cancer Genome Atlas (TCGA) data identified 71,946 SPAGEs spanning 12 distinct types, only a minority of which are SLs. The detected SPAGEs explain cancer driver genes’ tissue specificity and differences in patients’ response to drugs and stratify breast cancer tumors into refined subtypes. These results expand the scope of cancer SPAGEs and lay a conceptual basis for future studies of SPAGEs and their translational applications.

Avinash Das Sahu, Joo S Lee, Zhiyong Wang, Gao Zhang, Ramiro Iglesias-Bartolome, Tian Tian, Zhi Wei, Benchun Miao, Nishanth Ulhas Nair, Olga Ponomarova, Adam A Friedman, Arnaud Amzallag, Tabea Moll, Gyulnara Kasumova, Patricia Greninger, Regina K Egan, Leah J Damon, Dennie T Frederick, Livnat Jerby-Arnon, Allon Wagner, Kuoyuan Cheng, Seung Gu Park, Welles Robinson, Kevin Gardner, Genevieve Boland & Eytan Ruppin

Most patients with advanced cancer eventually acquire resistance to targeted therapies, spurring extensive efforts to identify molecular events mediating therapy resistance. Many of these events involve synthetic rescue (SR) interactions, where the reduction in cancer cell viability caused by targeted gene inactivation is rescued by an adaptive alteration of another gene (the rescuer). Here, we perform a genome-wide in silico prediction of SR rescuer genes by analyzing tumor transcriptomics and survival data of 10,000 TCGA cancer patients. Predicted SR interactions are validated in new experimental screens. We show that SR interactions can successfully predict cancer patients’ response and emerging resistance. Inhibiting predicted rescuer genes sensitizes resistant cancer cells to therapies synergistically, providing initial leads for developing combinatorial approaches to overcome resistance proactively. Finally, we show that the SR analysis of melanoma patients successfully identifies known mediators of resistance to immunotherapy and predicts novel rescuers.


Joo Sang Lee, Avinash Das, Livnat Jerby-Arnon, Rand Arafeh, Noam Auslander, Matthew Davidson, Lynn McGarry, Daniel James, Arnaud Amzallag, Seung Gu Park, Kuoyuan Cheng, Welles Robinson, Dikla Atias, Chani Stossel, Ella Buzhor, Gidi Stein, Joshua J. Waterfall, Paul S. Meltzer, Talia Golan, Sridhar Hannenhalli, Eyal Gottlieb, Cyril H. Benes, Yardena Samuels, Emma Shanks & Eytan Ruppin

While synthetic lethality (SL) holds promise in developing effective cancer therapies, SL candidates found via experimental screens often have limited translational value. Here we present a data-driven approach, ISLE (identification of clinically relevant synthetic lethality), that mines TCGA cohort to identify the most likely clinically relevant SL interactions (cSLi) from a given candidate set of lab-screened SLi. We first validate ISLE via a benchmark of large-scale drug response screens and by predicting drug efficacy in mouse xenograft models. We then experimentally test a select set of predicted cSLi via new screening experiments, validating their predicted context-specific sensitivity in hypoxic vs normoxic conditions and demonstrating cSLi’s utility in predicting synergistic drug combinations. We show that cSLi can successfully predict patients’ drug treatment response and provide patient stratification signatures. ISLE thus complements existing actionable mutation-based methods for precision cancer therapy, offering an opportunity to expand its scope to the whole genome.

Nishanth Ulhas Nair, Avinash Das, Uri Amit, Welles Robinson, Seung Gu Park, Mahashweta Basu, Alex Lugo, Jonathan Leor, Eytan Ruppin & Sridhar Hannenhalli

Idiopathic dilated cardiomyopathy (DCM) is a complex disorder with a genetic and an environmental component involving multiple genes, many of which are yet to be discovered. We integrate genetic, epigenetic, transcriptomic, phenotypic, and evolutionary features into a method – Hridaya, to infer putative functional genes underlying DCM in a genome-wide fashion, using 213 human heart genomes and transcriptomes. Many genes identified by Hridaya are experimentally shown to cause cardiac complications. We validate the top predicted genes, via five different genome-wide analyses: First, the predicted genes are associated with cardiovascular functions. Second, their knockdowns in mice induce cardiac abnormalities. Third, their inhibition by drugs cause cardiac side effects in human. Fourth, they tend to have differential exon usage between DCM and normal samples. Fifth, analyzing 213 individual genotypes, we show that regulatory polymorphisms of the predicted genes are associated with elevated risk of cardiomyopathy. The stratification of DCM patients based on cardiac expression of the functional genes reveals two subgroups differing in key cardiac phenotypes. Integrating predicted functional genes with cardiomyocyte drug treatment experiments reveals novel potential drug targets. We provide a list of investigational drugs that target the newly identified functional genes that may lead to cardiac side effects.


Mahashweta Basu, Mahfuza Sharmin, Avinash Das, Nishanth Ulhas Nair, Kun Wang, Joo Sang Lee, Yen-Pei Christy Chang, Eytan Ruppin & Sridhar Hannenhalli

Hypertension (HT) is a complex systemic disease involving transcriptional changes in multiple organs. Here we systematically investigate the pan-tissue transcriptional and genetic landscape of HT spanning dozens of tissues in hundreds of individuals. We find that in several tissues, previously identified HT-linked genes are dysregulated and the gene expression profile is predictive of HT. Importantly, many expression quantitative trait loci (eQTL) SNPs associated with the population variance of the dysregulated genes are linked with blood pressure in an independent genome-wide association study, suggesting that the functional effect of HT-associated SNPs may be mediated through tissue-specific transcriptional dysregulation. Analyses of pan-tissue transcriptional dysregulation profile, as well as eQTL SNPs underlying the dysregulated genes, reveals substantial heterogeneity among the HT patients, revealing two broad groupings – a Diffused group where several tissues exhibit HT-associated molecular alterations and a Localized group where such alterations are localized to very few tissues. These two patient subgroups differ in several clinical phenotypes including respiratory, cerebrovascular, diabetes, and heart disease. These findings suggest that the Diffused and Localized subgroups may be driven by different molecular mechanisms and have different genetic underpinning.

DNA methylation at the promoter of a gene is presumed to render it silent, yet a sizable fraction of genes with methylated proximal promoters exhibit elevated expression. Here, we show, through extensive analysis of the methylome and transcriptome in 34 tissues, that in many such cases, transcription is initiated by a distal upstream CpG island (CGI) located several kilobases away that functions as an alternative promoter. Specifically, such genes are expressed precisely when the neighboring CGI is unmethylated but remain silenced otherwise. Based on CAGE and Pol II localization data, we found strong evidence of transcription initiation at the upstream CGI and a lack thereof at the methylated proximal promoter itself. Consistent with their alternative promoter activity, CGI-initiated transcripts are associated with signals of stable elongation and splicing that extend into the gene body, as evidenced by tissue-specific RNA-seq and other DNA-encoded splice signals. Furthermore, based on both inter- and intra-species analyses, such CGIs were found to be under greater purifying selection relative to CGIs upstream of silenced genes. Overall, our study describes a hitherto unreported conserved mechanism of transcription of genes with methylated proximal promoters in a tissue-specific fashion. Importantly, this phenomenon explains the aberrant expression patterns of some cancer driver genes, potentially due to aberrant hypomethylation of distal CGIs, despite methylation at proximal promoters.


Avinash Das, Michael Morley, Christine S. Moravec, W. H. W. Tang, Hakon Hakonarson, MAGNet Consortium, Kenneth B. Margulies, Thomas P. Cappola, Shane Jensen & Sridhar Hannenhalli

The standard expression quantitative trait loci (eQTL) detects polymorphisms associated with gene expression without revealing causality. We introduce a coupled Bayesian regression approach—eQTeL, which leverages epigenetic data to estimate regulatory and gene interaction potential, and identifies combination of regulatory single-nucleotide polymorphisms (SNPs) that explain the gene expression variance. On human heart data, eQTeL not only explains a significantly greater proportion of expression variance but also predicts gene expression more accurately than other methods. Based on realistic simulated data, we demonstrate that eQTeL accurately detects causal regulatory SNPs, including those with small effect sizes. Using various functional data, we show that SNPs detected by eQTeL are enriched for allele-specific protein binding and histone modifications, which potentially disrupt binding of core cardiac transcription factors and are spatially proximal to their target. eQTeL SNPs capture a substantial proportion of genetic determinants of expression variance and we estimate that 58% of these SNPs are putatively causal.

Kun Wang, Avinash Das, Zheng-Mei Xiong, Kan Cao & Sridhar Hannenhalli

Hutchinson Gilford progeria syndrome (HGPS) is a rare genetic disease with symptoms of aging at a very early age. Its molecular basis is not entirely clear, although profound gene expression changes have been reported, and there are some known and other presumed overlaps with normal aging process. Identification of genes with agingor HGPS-associated expression changes is thus an important problem. However, standard regression approaches are currently unsuitable for this task due to limited sample sizes, thus motivating development of alternative approaches. Here, we report a novel iterative multiple regression approach that leverages co-expressed gene clusters to identify gene clusters whose expression co-varies with age and/or HGPS. We have applied our approach to novel RNA-seq profiles in fibroblast cell cultures at three different cellular ages, both from HGPS patients and normal samples. After establishing the robustness of our approach, we perform a comparative investigation of biological processes underlying normal aging and HGPS. Our results recapitulate previously known processes underlying aging as well as suggest numerous unique processes underlying aging and HGPS. The approach could also be useful in detecting phenotype-dependent co-expression gene clusters in other contexts with limited sample sizes.


Hutchinson Gilford progeria syndrome (HGPS) is a rare genetic disease with symptoms of aging manifested at a very early age. Molecular basis of HGPS is not entirely clear, although there are some known and other presumed overlaps with normal aging process. Comparative investigation of biological processes associated with HGPS and normal aging may reveal common and distinctive pathways underlying these two conditions.To investigate transcriptome changes through aging we have performed RNA-seq profiling in fibroblast cell cultures at three different cellular ages as measured by number of passages through culture growth, both from HGPS patients and matched normal samples. We then developed a novel iterative multiple regression approach that leverages co-expressed gene clusters to identify gene clusters whose expression changes significantly with age and/or disease state. We establish the robustness of our approach. Finally, we perform a comparative investigation of biological processes underlying normal aging and HGPS.Based on an iterative multiple regression approach applied to novel RNA-seq data in HGPS and aging our results recapitulate the previously known processes underlying aging while at the same time suggests numerous unique processes underlying aging and HGPS.


Avinash Das Sahu, Radhouane Aniba, Yen-Pei Christy Chang & Sridhar Hannenhalli

Mammalian gene regulation is often mediated by distal enhancer elements, in particular, for tissue specific and developmental genes. Computational identification of enhancers is difficult because they do not exhibit clear location preference relative to their target gene and also because they lack clearly distinguishing genomic features. This represents a major challenge in deciphering transcriptional regulation. Recent ChIP-seq based genome-wide investigation of epigenomic modifications have revealed that enhancers are often enriched for certain epigenomic marks. Here we utilize the epigenomic data in human heart tissue along with validated human heart enhancers to develop a Support Vector Machine (SVM) model of cardiac enhancers. Cross-validation classification accuracy of our model was 84% and 92% on positive and negative sets respectively with ROC AUC = 0.92. More importantly, while P300 binding has been used as gold standard for enhancers, our model can distinguish P300-bound validated enhancers from other P300-bound regions that failed to exhibit enhancer activity in transgenic mouse. While GWAS studies reveal polymorphic regions associated with certain phenotypes, they do not immediately provide causality. Next, we hypothesized that genomic regions containing a GWAS SNP associated with a cardiac phenotype might contain another SNP in a cardiac enhancer, which presumably mediates the phenotype. Starting with a comprehensive set of SNPs associated with cardiac phenotypes in GWAS studies, we scored other SNPs in LD with the GWAS SNP according to its probability of being an enhancer and choose one with best score in the LD as enhancer. We found that our predicted enhancers are enriched for known cardiac transcriptional regulator motifs and are likely to regulate the nearby gene. Importantly, these tendencies are more favorable for the predicted enhancers compared with an approach that uses P300 binding as a marker of enhancer activity.

The advent of high-throughput technologies such as ChIP-seq has made possible the study of histone modifications. A problem of particular interest is the identification of regions of the genome where different cell types from the same organism exhibit different patterns of histone enrichment. This problem turns out to be surprisingly difficult, even in simple pairwise comparisons, because of the significant level of noise in ChIP-seq data. In this paper we propose a two-stage statistical method, called ChIPnorm, to normalize ChIP-seq data, and to find differential regions in the genome, given two libraries of histone modifications of different cell types. We show that the ChIPnorm method removes most of the noise and bias in the data and outperforms other normalization methods. We correlate the histone marks with gene expression data and confirm that histone modifications H3K27me3 and H3K4me3 act as respectively a repressor and an activator of genes. Compared to what was previously reported in the literature, we find that a substantially higher fraction of bivalent marks in ES cells for H3K27me3 and H3K4me3 move into a K27-only state. We find that most of the promoter regions in protein-coding genes have differential histone-modification sites. The software for this work can be downloaded from http://lcbb.epfl.ch/software.html.