Cancer is a genetic disease caused by somatic mutations in genes that control key biological functions such as cellular growth and division. Such mutations may arise through both cell-intrinsic and exogenous processes, which leave characteristic mutational patterns across the genome known as mutational signatures. The study of mutational signatures has become a standard component of modern genomics studies, since it can reveal which environmental and endogenous mutagenic processes are active in a tumor and may highlight markers of therapeutic response.

Computational analyses of mutational signatures fall mostly into two categories: (i) de novo signature extraction and (ii) signature exposure estimation. In the first case, the presence of mutational processes is assessed from the data: signatures are identified, extracted and finally assigned to samples. This task is typically performed by Non-Negative Matrix Factorization (NMF). While other approaches have been proposed, NMF-based methods are by far the most widely used. In the second case, a set of signatures is held fixed (see, e.g., the COSMIC mutational signatures catalogue) and assigned to samples by minimizing, e.g., the mean squared error between the observed and estimated mutational patterns of each sample.
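The exposure estimation task described above can be illustrated with a minimal base R sketch. It is not the RESOLVE implementation: the toy signatures, the exposures and the helper `estimate_exposures` are all hypothetical, and the solver is a plain multiplicative-update scheme for non-negative least squares, chosen only because it needs no extra packages.

```r
# Two made-up signatures over 6 categories (real SBS analyses use 96).
# Each row is a probability distribution over mutation categories.
signatures <- rbind(c(0.50, 0.30, 0.10, 0.05, 0.03, 0.02),
                    c(0.02, 0.03, 0.05, 0.10, 0.30, 0.50))

# Made-up exposures for 3 samples, used to simulate noise-free "observed" counts.
true_exposures <- matrix(c(100, 20,
                            10, 80,
                            50, 50), nrow = 3, byrow = TRUE)
counts <- true_exposures %*% signatures

# Hypothetical helper: holds the signatures fixed and estimates non-negative
# exposures by multiplicative updates, which do not increase the squared
# error between counts and exposures %*% signatures.
estimate_exposures <- function(counts, signatures, iterations = 1000, eps = 1e-12) {
  exposures <- matrix(1, nrow = nrow(counts), ncol = nrow(signatures))
  numerator <- counts %*% t(signatures)
  gram <- signatures %*% t(signatures)
  for (i in seq_len(iterations)) {
    exposures <- exposures * numerator / (exposures %*% gram + eps)
  }
  exposures
}

estimated <- estimate_exposures(counts, signatures)
reconstruction_error <- mean((counts - estimated %*% signatures)^2)
round(estimated, 2)
```

Because the signatures are never updated, the result depends entirely on which signatures are supplied, which is the sensitivity to the chosen signature set discussed in this vignette.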

However, the available computational tools for mutational signatures present several pitfalls. First, determining the number of signatures is a complex task that depends on heuristics. Second, several signatures have no clear etiology, raising the concern that they may be computational artifacts rather than the result of mutagenic processes. Last, approaches for signature assignment are strongly influenced by the set of signatures used for the analysis. To overcome these limitations, we developed RESOLVE (Robust EStimation Of mutationaL signatures Via rEgularization), a framework that allows the efficient extraction and assignment of mutational signatures.

The RESOLVE R package implements a new de novo signature extraction algorithm, which employs a regularized Non-Negative Matrix Factorization procedure. The method incorporates a background signature during the inference step and adopts elastic net regression to reduce the impact of overfitting. The optimal number of signatures is estimated by bi-cross-validation. Furthermore, the package provides a procedure for estimating the confidence of signature activities in samples.

As such, RESOLVE complements other Bioconductor packages, such as SparseSignatures, MutationalPatterns and musicatk, by implementing a novel approach for detecting mutational signatures.

In this vignette, we give an overview of the package by presenting some of its main functions.

Installing the RESOLVE R package

The RESOLVE package can be installed from Bioconductor as follows.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("RESOLVE")

Changelog

  • 1.0.0 - Package released in Bioconductor 3.16.
  • 1.2.0 - Major code refactoring.

Using the RESOLVE R package

We now present some of the main features of the package. Note that the package supports different types of mutational signatures: SBS (single base substitutions) and MNV (multi-nucleotide variant) (see Degasperi, Andrea, et al. “Substitution mutational signatures in whole-genome–sequenced cancers in the UK population.” Science 376.6591 (2022): abl9283), CX (chromosomal instability) (see Drews, Ruben M., et al. “A pan-cancer compendium of chromosomal instability.” Nature 606.7916 (2022): 976-983) and CN (copy number) signatures (see Steele, Christopher D., et al. “Signatures of copy number alterations in human cancer.” Nature 606.7916 (2022): 984-991). For the sake of this vignette, however, we present results only for the classical SBS signatures and refer to the manual for details on the other types.

First, we show how to load example data and import them into a count matrix to perform the signatures analysis.

library("RESOLVE")
data(ssm560_reduced)

These data are a reduced version of the 560 breast tumors provided by Nik-Zainal, Serena, et al. (2016), comprising only 3 patients. Note that these data are provided purely as an example; as they are a reduced and partial version of the original dataset, they should not be used to draw any biological conclusions.

We now import these data into a count matrix to perform the signature discovery. To do so, we also need to specify the reference genome to be considered as a BSgenome object. This can be done as follows. In this example we use hs37d5 as the reference genome, as provided by the BSgenome.Hsapiens.1000genomes.hs37d5 package.

library("BSgenome.Hsapiens.1000genomes.hs37d5")
imported_data = getSBSCounts(data = ssm560_reduced, reference = BSgenome.Hsapiens.1000genomes.hs37d5)
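To make the shape of an SBS count matrix more concrete, the sketch below enumerates the mutation categories in base R. It assumes the conventional 96-category SBS representation (6 pyrimidine-centred substitution types, each with 4 possible 5' and 4 possible 3' neighbouring bases). The label format `A[C>A]A` is a common convention and is an assumption here; the column names actually returned by `getSBSCounts` may differ.

```r
bases <- c("A", "C", "G", "T")
substitutions <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")

# expand.grid varies its first argument fastest, so the 3' base changes first,
# then the 5' base, then the substitution type.
grid <- expand.grid(right = bases, left = bases, substitution = substitutions,
                    stringsAsFactors = FALSE)

# Assumed label format: 5' base, [reference>alternate], 3' base.
sbs_categories <- paste0(grid$left, "[", grid$substitution, "]", grid$right)

length(sbs_categories)
head(sbs_categories, 4)
```

Under this convention, a count matrix for SBS signatures has one entry per sample and category, recording how many mutations of that category were observed in that sample.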

We now present one of the visualization features provided by the package, showing the counts for the first patient, PD10010a, in the following plot.

patientsSBSPlot(trinucleotides_counts = imported_data, samples = "PD10010a")

Visualization of the counts for patient PD10010a from the dataset published in Nik-Zainal, Serena, et al.

After the data are loaded, we can perform de novo signature extraction. To do so, we need to define a range for the number of signatures (variable K) to be considered. We now show how to perform the inference on the dataset from Nik-Zainal, Serena, et al. (2016), whose counts are provided within the package.

data(background)
data(patients)
set.seed(12345)
res_denovo = signaturesDecomposition(x = patients, 
                                     K = 1:15, 
                                     background_signature = background, 
                                     nmf_runs = 100, 
                                     num_processes = 50)

Note that this function can also be used to perform de novo extraction for the other supported types of mutational signatures, such as MNV, CX and CN.

Now that we have performed the de novo inference, we need to decide the optimal number of signatures to extract from our dataset. To do so, the package provides a procedure based on cross-validation.

set.seed(12345)
res_cv = signaturesCV(x = patients, 
                      beta = res_denovo$beta, 
                      cross_validation_repetitions = 100, 
                      num_processes = 50)

Note that the computations for this task can be very time consuming, especially when many cross-validation iterations are performed (see the manual) and a large set of parameter configurations is tested.
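The idea of choosing the number of signatures by cross-validation, and the reason it becomes expensive, can be illustrated with a small base R sketch. This is a deliberately simplified entry-wise hold-out scheme on simulated data, not the procedure implemented in `signaturesCV`: the helpers `masked_nmf` and `holdout_error`, the toy signatures and all parameter values are hypothetical, and no regularization or background signature is used.

```r
set.seed(42)

# Simulated counts: 20 samples x 12 categories generated from 2 toy signatures.
true_signatures <- rbind(c(rep(0.15, 6), rep(0.1 / 6, 6)),
                         c(rep(0.1 / 6, 6), rep(0.15, 6)))
true_exposures <- matrix(runif(20 * 2, min = 50, max = 500), nrow = 20)
counts <- matrix(rpois(20 * 12, lambda = true_exposures %*% true_signatures),
                 nrow = 20)

# Hypothetical helper: plain NMF by multiplicative updates that only uses the
# cells where mask == 1, so held-out cells do not influence the fit.
masked_nmf <- function(x, mask, k, iterations = 500, eps = 1e-9) {
  w <- matrix(runif(nrow(x) * k), nrow = nrow(x))
  h <- matrix(runif(k * ncol(x)), nrow = k)
  xm <- x * mask
  for (i in seq_len(iterations)) {
    w <- w * (xm %*% t(h)) / (((w %*% h) * mask) %*% t(h) + eps)
    h <- h * (t(w) %*% xm) / (t(w) %*% ((w %*% h) * mask) + eps)
  }
  w %*% h
}

# Hypothetical helper: hide a random fraction of cells, fit on the rest, and
# average the squared prediction error on the hidden cells over repetitions.
holdout_error <- function(x, k, holdout_fraction = 0.1, repetitions = 10) {
  errors <- numeric(repetitions)
  for (r in seq_len(repetitions)) {
    mask <- matrix(1, nrow(x), ncol(x))
    held_out <- sample(length(x), size = round(holdout_fraction * length(x)))
    mask[held_out] <- 0
    fitted <- masked_nmf(x, mask, k)
    errors[r] <- mean((x[held_out] - fitted[held_out])^2)
  }
  mean(errors)
}

cv_errors <- sapply(1:4, function(k) holdout_error(counts, k))
names(cv_errors) <- paste0("K=", 1:4)
cv_errors
```

Every combination of K and repetition requires a full factorization, so the cost grows with the product of the number of candidate values of K, the number of repetitions and the number of NMF iterations. In a toy setting like this one, the held-out error would be expected to drop sharply once K reaches the number of signatures used to simulate the data.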

We refer to the manual for a detailed description of each parameter and to the RESOLVE manuscript for details on the method.

Current R Session

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
##  [2] BSgenome_1.74.0                            
##  [3] rtracklayer_1.66.0                         
##  [4] BiocIO_1.16.0                              
##  [5] Biostrings_2.74.0                          
##  [6] XVector_0.46.0                             
##  [7] GenomicRanges_1.58.0                       
##  [8] GenomeInfoDb_1.42.0                        
##  [9] IRanges_2.40.0                             
## [10] S4Vectors_0.44.0                           
## [11] RESOLVE_1.8.0                              
## [12] Biobase_2.66.0                             
## [13] BiocGenerics_0.52.0                        
## [14] BiocStyle_2.34.0                           
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-9                gridExtra_2.3              
##   [3] rlang_1.1.4                 magrittr_2.0.3             
##   [5] gridBase_0.4-7              matrixStats_1.4.1          
##   [7] compiler_4.4.1              ggalluvial_0.12.5          
##   [9] systemfonts_1.1.0           vctrs_0.6.5                
##  [11] reshape2_1.4.4              stringr_1.5.1              
##  [13] pkgconfig_2.0.3             shape_1.4.6.1              
##  [15] crayon_1.5.3                fastmap_1.2.0              
##  [17] labeling_0.4.3              utf8_1.2.4                 
##  [19] Rsamtools_2.22.0            rmarkdown_2.28             
##  [21] pracma_2.4.4                UCSC.utils_1.2.0           
##  [23] ragg_1.3.3                  xfun_0.48                  
##  [25] glmnet_4.1-8                zlibbioc_1.52.0            
##  [27] cachem_1.1.0                jsonlite_1.8.9             
##  [29] highr_0.11                  SnowballC_0.7.1            
##  [31] DelayedArray_0.32.0         BiocParallel_1.40.0        
##  [33] parallel_4.4.1              cluster_2.1.6              
##  [35] R6_2.5.1                    stringi_1.8.4              
##  [37] bslib_0.8.0                 RColorBrewer_1.1-3         
##  [39] jquerylib_0.1.4             Rcpp_1.0.13                
##  [41] bookdown_0.41               SummarizedExperiment_1.36.0
##  [43] iterators_1.0.14            knitr_1.48                 
##  [45] Matrix_1.7-1                splines_4.4.1              
##  [47] nnls_1.6                    tidyselect_1.2.1           
##  [49] abind_1.4-8                 yaml_2.3.10                
##  [51] doParallel_1.0.17           codetools_0.2-20           
##  [53] curl_5.2.3                  lattice_0.22-6             
##  [55] tibble_3.2.1                plyr_1.8.9                 
##  [57] withr_3.0.2                 evaluate_1.0.1             
##  [59] desc_1.4.3                  survival_3.7-0             
##  [61] MutationalPatterns_3.16.0   pillar_1.9.0               
##  [63] lsa_0.73.3                  BiocManager_1.30.25        
##  [65] rngtools_1.5.2              MatrixGenerics_1.18.0      
##  [67] foreach_1.5.2               generics_0.1.3             
##  [69] RCurl_1.98-1.16             ggplot2_3.5.1              
##  [71] munsell_0.5.1               scales_1.3.0               
##  [73] NMF_0.28                    RhpcBLASctl_0.23-42        
##  [75] glue_1.8.0                  tools_4.4.1                
##  [77] data.table_1.16.2           GenomicAlignments_1.42.0   
##  [79] registry_0.5-1              fs_1.6.5                   
##  [81] XML_3.99-0.17               grid_4.4.1                 
##  [83] colorspace_2.1-1            GenomeInfoDbData_1.2.13    
##  [85] restfulr_0.0.15             cli_3.6.3                  
##  [87] textshaping_0.4.0           fansi_1.0.6                
##  [89] S4Arrays_1.6.0              dplyr_1.1.4                
##  [91] gtable_0.3.6                sass_0.4.9                 
##  [93] digest_0.6.37               SparseArray_1.6.0          
##  [95] farver_2.1.2                rjson_0.2.23               
##  [97] htmlwidgets_1.6.4           htmltools_0.5.8.1          
##  [99] pkgdown_2.1.1.9000          lifecycle_1.0.4            
## [101] httr_1.4.7