ASCETIC
R
packagevignettes/ASCETIC.rmd
ASCETIC.rmd
Cancer development is a stochastic process involving large populations of cells. Random genetic and epigenetic alterations commonly occurring in any cell can occasionally be beneficial to neoplastic ones, thus defining clones characterized by a functional selective advantage. During clonal evolution, certain clones can be positively selected for increased proliferation and survival ability, outgrowing competing cells and this can eventually lead to invasion and metastasis. Throughout such a multi-step stochastic process, cancer cells can acquire over time a set of biological capabilities that sometimes are referred to as hallmarks. Not all variants are involved in their acquisition, but only a relative small subset of them – i.e., the drivers –, while most mutations present in the cancer clones do not increase their fitness – i.e., the passengers.
In response to the constantly increasing availability of cancer omics data and the rapid advancement of data science and machine learning techniques, we introduce a novel framework named ASCETIC (Agony-baSed Cancer EvoluTion InferenCe, https://www.nature.com/articles/s41467-023-41670-3). ASCETIC can extract cancer’s evolutionary signatures, which represent recurring paths of driver mutation acquisition associated with distinct disease outcomes. ASCETIC can process sequencing data derived from different technologies, including bulk and single-cell sequencing. The approach goes beyond the traditional focus on individual genetic alterations and explores the broader landscape of genomic alterations and their intricate interactions, with the aim to enhance predictive accuracy and our understanding of cancer evolution’s impact on prognosis.
ASCETIC’s workflow involves the reconstruction of robust tumor evolution models for individual patients, followed by the integration of these models into a comprehensive cancer-specific evolution model. By leveraging prognostic data through regularized Cox regression, ASCETIC identifies significant evolutionary patterns or signatures associated with patient outcomes.
In summary, ASCETIC represents a powerful tool for uncovering consistent evolutionary patterns in cancer, offering the potential to contribute to a curated catalogue of evolutionary signatures akin to the widely used COSMIC Mutational Signatures database.
In its basic implementation, ASCETIC requires two main inputs: (i) a binary matrix where rows are patients (i.e., samples) and columns are mutations. Each cell of the matrix is 1 if the related mutation was observed in the sample; 0 otherwise. (2) Information of temporal patters across the considered driver genes. Such information can be determined either from classical NGS genomics data considering a single biopsy per patient, in terms of cancer cell fractions reported for each gene, or from data at different resolutions such as multi-region or single-cell data where multiple samples per patients are provided. In this latter case, ASCETIC takes as input a phylogenetic model per patient, represented as a mutational tree. Such trees where each node corresponds to a mutation, display the mutation history of a tumor.
The ASCETIC framework is built upon the observation that, in most cases, the accumulation of passenger mutations during cancer progression follows random dynamics. However, a small set of genes drive tumor evolution, and for these alterations, drift-driven evolution and selective pressures may lead to a consistent ordering across multiple patients. However, this ordering may not be unique and can be confounded by the presence of heterogeneous cancer subtypes within a tumor dataset. To address this challenge, ASCETIC decomposes the inference problem into three main tasks.
First, it combines information of temporal patters obtained for multiple patients in order to create an agony-derived ranking of the considered alterations, providing a partial temporal ordering among the drivers during cancer evolution. Agony is a measure of hierarchy within a directed graph. Given a directed graph and a ranking metric (e.g., in our case the time ordering of accumulation of driver alterations during tumor evolution), any arc from nodes that are higher in the hierarchy (e.g., alterations that occur in later stages of the tumor) to nodes that are lower in the hierarchy (e.g., alterations that occur at the initiation of the tumor) are not expected and they are said to be causing agony. The output of this phase is a partially ordered set (poset), which minimizes inconsistencies measured in terms of agony.
Then, the approach adopts a likelihood-based approach grounded in the theory of probabilistic causation to select the most significant relationships among driver genes in a Bayesian Network that depicts repeated evolutionary trajectories. Finally, ASCETIC leverages the inferred evolutionary steps to stratify patients based on survival data and selects the most relevant features from the model to cluster the samples into different risk groups or clusters. Survival analysis of the different risk groups is then performed via the standard Kaplan-Meier estimate. ASCETIC outputs a set of evolutionary signatures associated with the different risk groups, displaying the inferred risk for each of the selected evolutionary steps and their relative prevalence in the cluster.
In this vignette, we give an overview of the package by presenting some of its main functions.
The ASCETIC package can be installed from Bioconductor as follows.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ASCETIC")
Or it can be installed directly from GitHub using the R package devtools as follows.
library("devtools")
install_github("danro9685/ASCETIC", ref = 'master')
We provide within the package example datasets for both the single biopsy and the multi-regions/single-cell cases. We now load such example datasets.
library("ASCETIC")
data(datasetExampleSingleSamples)
data(ccfDatasetExampleSingleSamples)
data(datasetExamplePhylogenies)
data(modelsPhylogenies)
First, we outline the inference steps for the single biopsy NGS data scenario, where only one sample per patient is available and the temporal patters among genes are estimated considering cancer cell fractions data. To test this, we consider an example dataset comprising 10 hepatocellular carcinoma sequencing samples from Nguyen, Bastien, et al., 2022. We recall that such (reduced and partial) dataset was provided only for computational testing and should not be used for biological analysis/conclusions.
We now perform the basic ASCETIC inference.
set.seed(12345)
resExampleSingleSamples <- asceticCCF(
dataset = datasetExampleSingleSamples,
ccfDataset = ccfDatasetExampleSingleSamples,
regularization = "aic",
command = "hc",
restarts = 0 )
The output of this analysis is a a list of 4 elements: 1) dataset, input dataset; 2) ccfDataset, input ccf dataset; 3) poset, partially order set among mutations estimated by ASCETIC from the agony ranking and 4) inference, inferred ASCETIC evolutionary model for each selected regularization.
The poset represents an estimate of the agony ranking for this dataset, while the Bayesian Network obtained after the likelihood-fit step is the inference output of ASCETIC, which provides the inferred repeated evolutionary trajectories.
To improve the statistical stability of the framework expecially in the common situation of noisy data, ASCETIC implements a re-sampling algorithm which builds on the function just described. This procedure is more computational intensive compared to the basic one, but should be preferred on real data. It requires an additional dataset with variant allele frequencies information and can be executed as follows.
set.seed(12345)
data(vafDatasetExampleSingleSamples)
resExampleSingleSamplesResampling <- asceticCCFResampling(
dataset = datasetExampleSingleSamples,
ccfDataset = ccfDatasetExampleSingleSamples,
vafDataset = vafDatasetExampleSingleSamples,
nsampling = 5,
regularization = "aic",
command = "hc",
restarts = 0 )
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
We refer to the manual for a detailed description of each parameter.
We now describe the execution of the ASCETIC framework when high-resolution cancer genomics data are available, such as multi-regions data or single-cell data. In this case, phylogenetic reconstructions for each patient are available and are given as input to ASCETIC as a adjacency matrices representing mutational trees. We consider an other example dataset consisting of 10 myeloid malignancies samples sequenced with Tapestri from Miles, Linde A., et al., 2020. We recall that also in this case, such (reduced and partial) dataset was provided only for computational testing and should not be used for biological analysis/conclusions.
We now perform the basic ASCETIC inference for this second scenario.
set.seed(12345)
resExamplePhylogeniesDataset <- asceticPhylogenies(
dataset = datasetExamplePhylogenies,
models = modelsPhylogenies,
regularization = "aic",
command = "hc",
restarts = 0)
The output of this analysis is a a list of 4 elements: 1) dataset, input dataset; 2) models, input mutational trees; 3) poset, partially order set among mutations estimated by ASCETIC from the agony ranking and 4) inference, inferred ASCETIC evolutionary model for each selected regularization.
The resulting poset and inference elements have the same semantics as the previous example.
Also in this case we provide a method to improve the statistical stability of the framework expecially in the situation of noisy data. ASCETIC implements a bootstrap procedure which builds on the function just described. This procedure should be preferred on real data. It can be executed as follows.
set.seed(12345)
resExamplePhylogeniesDatasetBootstrap <- asceticPhylogeniesBootstrap(
dataset = datasetExamplePhylogenies,
models = modelsPhylogenies,
nsampling = 5,
regularization = "aic",
command = "hc",
restarts = 0 )
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
ASCETIC allows for estimating confidence in the inferred evolutionary trajectories through cross-validation. Specifically, the algorithm is applied multiple times on sampled data (80% sampling), and statistics are returned regarding the confidence of each evolutionary step.
Cross-validation can be executed for both bulk data providing single samples and for multi-region bulk data and single-cell data. The estimation can be executed as follows.
resExampleCCFAssessment <- asceticCCFAssessment(
inference = resExampleSingleSamplesResampling,
niterations = 5,
vafDataset = vafDatasetExampleSingleSamples,
nsampling = 5)
## Starting cross-validation...
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Cross-validation progress: 0.2
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Cross-validation progress: 0.4
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Cross-validation progress: 0.6
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Cross-validation progress: 0.8
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Cross-validation progress: 1
set.seed(12345)
resExamplePhylogeniesAssessment <- asceticPhylogeniesAssessment(
inference = resExamplePhylogeniesDatasetBootstrap,
niterations = 5,
nsampling = 5)
## Starting cross-validation...
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Warning in check.data(x, allow.missing = TRUE): variable 3 in the data has
## levels that are not observed in the data.
## Warning in check.data(x, allow.missing = TRUE): variable 15 in the data has
## levels that are not observed in the data.
## Cross-validation progress: 0.2
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Warning in check.data(x, allow.missing = TRUE): variable 2 in the data has
## levels that are not observed in the data.
## Cross-validation progress: 0.4
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Warning in check.data(x, allow.missing = TRUE): variable 3 in the data has
## levels that are not observed in the data.
## Warning in check.data(x, allow.missing = TRUE): variable 15 in the data has
## levels that are not observed in the data.
## Cross-validation progress: 0.6
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Warning in check.data(x, allow.missing = TRUE): variable 16 in the data has
## levels that are not observed in the data.
## Cross-validation progress: 0.8
## 0
## 0.2
## 0.4
## 0.6
## 0.8
## 1
## Warning in check.data(x, allow.missing = TRUE): variable 13 in the data has
## levels that are not observed in the data.
## Warning in check.data(x, allow.missing = TRUE): variable 14 in the data has
## levels that are not observed in the data.
## Cross-validation progress: 1
Finally, ASCETIC can perform the extraction of evolutionary signatures significantly associated to prognosis and use them to stratify patients. This can be done as follows.
set.seed(12345)
data(amlExample)
resExampleEvosigs <- evoSigs(survivalData = amlExample$survival_data,
evolutionarySteps = amlExample$evolutionary_steps)
print(names(resExampleEvosigs))
## [1] "survivalAnalysis" "evolutionarySteps" "clustersPrevalence"
print(resExampleEvosigs$evolutionarySteps)
## ASXL1.to.NRAS DNMT3A.to.NRAS DNMT3A.to.WT1 IDH1.to.PTPN11 IDH1.to.TET2
## 0.56617338 -0.26258331 1.07489799 -1.57548330 -1.50253436
## IDH2.to.TET2 JAK2.to.NRAS JAK2.to.RUNX1 NPM1.to.FLT3 Root.to.CBL
## -0.47637487 0.39346249 1.36329173 0.35692492 1.22736994
## Root.to.IDH1 Root.to.JAK2 Root.to.KIT Root.to.NPM1 Root.to.PHF6
## 0.51762715 -0.24124407 -0.76511252 -0.04687322 1.16513495
## Root.to.PPM1D Root.to.RAD21 Root.to.STAG2 Root.to.TP53 Root.to.U2AF1
## -1.42425908 -0.40917150 0.48307602 1.52174568 0.69844295
print(resExampleEvosigs$clustersPrevalence)
## ASXL1.to.NRAS CALR.to.ASXL1 DNMT3A.to.NRAS DNMT3A.to.WT1 IDH1.to.DNMT3A
## C1 0.00000000 0 0.11111111 0.00000000 0.00000000
## C2 0.00000000 0 0.00000000 0.00000000 0.00000000
## C3 0.04301075 0 0.02150538 0.08602151 0.08602151
## IDH1.to.PTPN11 IDH1.to.TET2 IDH2.to.TET2 JAK2.to.NRAS JAK2.to.RUNX1
## C1 0.01587302 0.04761905 0.03174603 0.000000000 0.00000000
## C2 0.00000000 0.00000000 0.00000000 0.004761905 0.00000000
## C3 0.00000000 0.00000000 0.00000000 0.010752688 0.05376344
## NPM1.to.FLT3 Root.to.BRAF Root.to.CALR Root.to.CBL Root.to.IDH1 Root.to.IDH2
## C1 0.04761905 0 0 0.00000000 0.03174603 0.04761905
## C2 0.22380952 0 0 0.00000000 0.00000000 0.13333333
## C3 0.01075269 0 0 0.01075269 0.16129032 0.01075269
## Root.to.JAK2 Root.to.KIT Root.to.NPM1 Root.to.PHF6 Root.to.PPM1D
## C1 0.111111111 0.07936508 0.49206349 0.00000000 0.03174603
## C2 0.004761905 0.00000000 0.20952381 0.00000000 0.00000000
## C3 0.043010753 0.00000000 0.04301075 0.04301075 0.00000000
## Root.to.RAD21 Root.to.SF3B1 Root.to.STAG2 Root.to.TP53 Root.to.U2AF1
## C1 0.1746032 0.00000000 0.00000000 0.0000000 0.0000000
## C2 0.0000000 0.07619048 0.09047619 0.0000000 0.0000000
## C3 0.0000000 0.01075269 0.03225806 0.3655914 0.2258065
## U2AF1.to.FLT3
## C1 0.00000000
## C2 0.00000000
## C3 0.03225806
The accumulation of alterations in driver genes can follow repeated routes in different cancer patients. Detecting such trajectories could be important in order to implement appropriate therapeutic responses. As a matter of fact, being able to stratify cancer patients based on their molecular evolution could enable the prediction of the future steps of the disease progression, potentially allowing the execution of optimal and personalized treatments that anticipate the next stages of the cancer’s evolution. In our example, ASCETIC could identify a set of evolutionary steps (see evolutionarySteps and clustersPrevalence variables) with a significant association to prognosis.
Specifically, the evolutionarySteps variable reports 20 evolutionary steps that, in this example dataset, show a significant association with overall survival. The steps with negative values are negatively associated with risk, meaning that patients with these mutations have a better prognosis. On the other hand, steps with positive values increase the risk, indicating that patients with these mutations have a worse prognosis.
Finally, the clustersPrevalence variable provides the three clusters discovered based on the detected evolutionary signatures, along with the relative prevalences of each evolutionary step in the clusters.
We refer to the manual for a detailed description of each parameter and to the ASCETIC manuscript for details on the method.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ASCETIC_1.0.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] shape_1.4.6.1 lattice_0.22-6 digest_0.6.37
## [7] magrittr_2.0.3 evaluate_1.0.1 grid_4.4.1
## [10] bookdown_0.41 iterators_1.0.14 fastmap_1.2.0
## [13] foreach_1.5.2 jsonlite_1.8.9 glmnet_4.1-8
## [16] Matrix_1.7-1 survival_3.7-0 BiocManager_1.30.25
## [19] fansi_1.0.6 scales_1.3.0 codetools_0.2-20
## [22] textshaping_0.4.0 jquerylib_0.1.4 cli_3.6.3
## [25] rlang_1.1.4 bnlearn_5.0.1 splines_4.4.1
## [28] munsell_0.5.1 cachem_1.1.0 yaml_2.3.10
## [31] tools_4.4.1 parallel_4.4.1 dplyr_1.1.4
## [34] colorspace_2.1-1 ggplot2_3.5.1 vctrs_0.6.5
## [37] R6_2.5.1 lifecycle_1.0.4 fs_1.6.5
## [40] htmlwidgets_1.6.4 ragg_1.3.3 pkgconfig_2.0.3
## [43] desc_1.4.3 pkgdown_2.1.1.9000 pillar_1.9.0
## [46] bslib_0.8.0 gtable_0.3.6 glue_1.8.0
## [49] Rcpp_1.0.13 systemfonts_1.1.0 xfun_0.48
## [52] tibble_3.2.1 tidyselect_1.2.1 knitr_1.48
## [55] htmltools_0.5.8.1 rmarkdown_2.28 compiler_4.4.1