A Simple ModelBased Approach to Inferring and Visualizing Cancer Mutation Signatures
Somatic (noninherited) mutations are acquired throughout our lives in cells throughout our body. These mutations can be caused, for example, by DNA replication errors or exposure to environmental mutagens such as tobacco smoke. Some of these mutations can lead to cancer. Different cancers, and even different instances of the same cancer, can show different distinctive patterns of somatic mutations. These distinctive patterns have become known as “mutation signatures”. For example, C > A mutations are frequent in lung caners whereas C > T and CC > TT mutations are frequent in skin cancers. Each mutation signature may be associated with a specific kind of carcinogen, such as tobacco smoke or ultraviolet light. Identifying mutation signatures therefore has the potential to identify new carcinogens, and yield new insights into the mechanisms and causes of cancer, In this paper, we introduce new statistical tools for tackling this important problem. These tools provide more robust and interpretable mutation signatures compared to previous approaches, as we demonstrate by applying them to largescale cancer genomic data.
Published in the journal:
. PLoS Genet 11(12): e32767. doi:10.1371/journal.pgen.1005657
Category:
Research Article
doi: 10.1371/journal.pgen.1005657
Summary
Somatic (noninherited) mutations are acquired throughout our lives in cells throughout our body. These mutations can be caused, for example, by DNA replication errors or exposure to environmental mutagens such as tobacco smoke. Some of these mutations can lead to cancer. Different cancers, and even different instances of the same cancer, can show different distinctive patterns of somatic mutations. These distinctive patterns have become known as “mutation signatures”. For example, C > A mutations are frequent in lung caners whereas C > T and CC > TT mutations are frequent in skin cancers. Each mutation signature may be associated with a specific kind of carcinogen, such as tobacco smoke or ultraviolet light. Identifying mutation signatures therefore has the potential to identify new carcinogens, and yield new insights into the mechanisms and causes of cancer, In this paper, we introduce new statistical tools for tackling this important problem. These tools provide more robust and interpretable mutation signatures compared to previous approaches, as we demonstrate by applying them to largescale cancer genomic data.
Introduction
Cancer is a genomic disease. As we lead a life, DNA within our cells acquires random somatic mutations, mainly caused by DNA replication errors and exposures to mutagens such as chemical substances, radioactivities and inflammatory reactions. Although most mutations are harmless (called “passenger mutations”), a small portions of mutations at some specific sites in cancer genes (“driver mutations”) affect cell growth, causing autonomous proliferation, tissue invasion, and contributing to oncogenesis [1]. Cancer genome studies typically focus on identifying driver mutations, to help understand the mechanism of cancer development. However, passenger mutations can also yield important information, because they often show patterns (“mutation signatures”) which can provide insights into the forces that cause somatic mutations. For example, classical studies of mutation patterns revealed that C > A mutations are abundant in lung cancers in patients with smoking history, and these are caused by benzo(a)pyrene included in tobacco smoke [2]. Also, C > T and CC > TT mutations are abundant in ultravioletlightassociated skin cancers, and these are caused by pyrimidine dimers as a result of ultraviolet radiation [3].
The potential for classical studies to yield insights into somatic mutation processes was limited in several ways. Due to limited sequencing throughput, most classical studies focused on a few cancer genes, such as TP53, where high mutation frequencies could be expected. They then contrasted mutation pattern profiles among different cancer types, aggregating mutations across multiple individuals within the same cancer type to yield sufficient mutations for analysis. However, since many of the mutations in cancer genes are driver mutations causing cell proliferation, the resultant mutation profiles are a biased representation of the underlying mutation process. Furthermore, the paucity of mutation data made it effectively impossible to assess variation in mutation patterns among individuals.
Recent advances in highthroughput sequencing provide new opportunities to investigate samplebysample mutation signatures in an unbiased way using genomewide somatic mutation data. For example, a largescale study using 21 breast cancer samples identified an association of C > [AGT] mutations at TpC sites, which was later proved to be caused by APOBEC protein family [4–6], and a novel phenomenon called kataegis [7]. Moreover, a landmark study of 7,034 primary cancer samples, representing 30 different cancer classes, has provided the first largescale overview of mutation signatures across a large number of cancer types [8]. This has lead to great hopes that detection of novel mutation signatures and associated mutagens can lead to identification of novel mutagens and prevention of cancer.
To make the most of these opportunities requires the development of efficient and effective statistical methods for analyzing mutation signatures in vast amounts of somatic mutation data. Current statistical approaches [9, 10] are excellent starting points, and have helped generate the new insights noted above. However, we argue here that these existing methods have two important limitations, caused by the fact that they use an unconstrained model for each “mutation signature”. First, although using an unconstrained model might appear to be a good thing in terms of flexibility, in practice it can actually reduce flexibility, because the price of using an unconstrained model is that one must limit the domain of mutation signatures considered. For example, most recent analyses of mutation signatures consider only the immediate flanking 5′ and 3′ bases of each substitution to be part of the signature, even though it is known that more distal bases—and particularly the next flanking base on each side—can contain important contextual information [11]. These recent analyses take this approach because, in the unconstrained model, incorporating the more distal bases into the signature very substantially increases the number of parameters, making estimated mutation signatures unstable. Secondly, and just as important, the unconstrained model means that each signature is a probability distribution in a highdimensional parameter space, which can make signatures difficult to interpret.
In this paper, we present a novel probabilistic approach to mutation signature modelling that addresses these limitations. In brief, we first simplify the modelling of mutation signatures by decomposing them into separate “mutation features”. For example, the substitution type is one feature; flanking bases are each another feature. We then exploit this decomposition by using a probabilistic model for signatures that assumes independence across features. This approach substantially reduces the number of parameters associated with each signature, greatly facilitating the incorporation of additional relevant sequence context. For example, our approach can incorporate the two bases 3′ and 5′ of the substitution and transcription strand biases using only 18 parameters per signature, compared with 3,071 parameters per signature with current approaches. We demonstrate the benefits of this simplification in data analyses. These benefits include more stable estimation of signatures from smaller samples, refinement of the detail and resolution of many mutation signatures, and possibly identification of novel signatures.
Assuming independence among features in a signature may initially seem unnatural. However, its use here is analogous to “position weight matrix models” which have been highly successful for modelling transcription factor binding motifs. Indeed, an important second contribution of our paper is to provide intuitive visual representations for mutation signatures, analogous to the “sequence logos” used for visualizing binding motifs. Finally, we also highlight the close connection between mutation signature models and the “mixedmembership models”, also known as “admixture models” [12] or “Latent Dirichlet Allocation” models [13] that are widely used in population genetics and document clustering applications. These connections should be helpful for future elaboration of computational and statistical methods for cancer mutation signature detection.
Software implementing the proposed methods is available in an R package pmsignature (probabilistic mutation signature), at https://github.com/friend1ws/pmsignature. The core part of the estimation process is implemented in C++ by way of the Rcpp package [14], which enables handling millions of somatic mutations from thousands of cancer genomes using a standard desktop computer. In addition, a webbased application of our method is available at https://friend1ws.shinyapps.io/pmsignature_shiny/.
Result
New model for mutation signatures
The term “mutation signature” is used to describe a characteristic mutational pattern observed in cancer genomes. Such patterns are often related to carcinogens (e.g., frequent C > A mutations in lung cancers with smoking histories).
What constitutes a mutational pattern varies among papers. The simplest approach is to consider 6 possible mutation patterns, corresponding to 6 possible substitution patterns (C>A, C>G, C>T, T>A, T>C, T>G; the original base is often fixed to C or T to remove redundancy of taking complementary strands). However, in practice we know that DNA context of a substitution is often important, and so it is common to go the next level of complexity, and include the immediate 5′ and 3′ flanking bases in the mutation pattern. This results in 96 (6 × 4 × 4) patterns. Further incorporating the strand (plus or minus) of each substitution extends this to 192 patterns [8, 9].
Mathematically, mutation signatures have previously been characterized using an unconstrained distribution over mutation patterns [9, 10]. Thus, if the number of mutation patterns considered is M then each mutation signature is characterized by a probability vector of length M (which must sum to 1, so M − 1 parameters). A problem with this approach is that it requires a large number of parameters per mutation signature. As noted above, even accounting only for immediately flanking bases gives M = 96. Furthermore, M increases exponentially if we try to account for additional context: to take account of up to n bases 5′ and 3′ to the mutated site (henceforth referred to as the −n position and the +n position, respectively) gives M = 6 × 4^{2n}. Having a large number of parameters per signature causes two important problems: i) estimates of signature parameters can become statistically unstable; ii) signatures can become difficult to interpret.
The first contribution of this paper is to suggest a more parsimonious approach to modelling mutation signatures, with the benefit of producing both more stable estimates and more easily interpretable signatures. In brief, we substantially reduce the number of parameters per signature by breaking each mutation pattern into “features”, and assuming independence across mutation features. For example, consider the case where a mutation pattern is defined by the substitution and its two flanking bases. We break this into three features (substitution, 3′ base, 5′ base), and characterize each mutation signature by a probability distribution for each feature (which, by our independence assumption, are multiplied together to define a distribution on mutation patterns). Since the number of possible values for each feature is 6, 4, and 4 respectively this requires 5 + 3 + 3 = 11 parameters instead of 96 − 1 = 95 parameters. Furthermore, extending this model to account for ±n neighboring bases requires only 5 + 6n parameters instead of 6 × 4^{2n} − 1. For example, considering ±2 positions requires 17 parameters instead of 1,535. Finally, incorporating transcription strand as an additional feature adds just one parameter, instead of doubling the number of parameters.
Since the aim of a mutation signature is, in some ways, to capture dependencies among features, the independence assumption may seem counterintuitive. However, the idea is exactly analogous to the use of a “position weight matrix” (PWM) to represent motifs in sequence data. In this analogy, a motif is analogous to a mutation signature, and each location in the motif is analogous to a “feature”. Just as we use a probability vector for each feature, a PWM defines a probability vector for each location in a motif, and these probabilities at each location can be multiplied together to yield a probability distribution on sequences. Even though a PWM cannot capture complex higherorder dependencies, some of which likely do exist in practice, it has been a highly successful tool for motif analysis—likely because it can capture the most important characteristics of transcription factor binding sites (that some locations will show strong preference for a particular base, whereas others will not), and also because it can be represented in an easily interpretable way via sequencing logos [15]. For similar reasons—in addition to the empirical demonstrations we present later—we believe our mutation signature representation will prove useful for mutation signature analysis.
Fig 1 illustrates the way that our new representation of signatures can simply capture a previously identified signature [9, 16] and provides an easily interpretable visualization of the signature that is analogous to sequencing logos [15]. We particularly note how the key elements of this mutation signature are more immediately visually apparent than with visualizations of the full vector of probabilities used by existing approaches.
An overview of mathematical specification of mutation signatures and the generative model of somatic mutations
Suppose each somatic mutation has L mutation features, m = (m_{1}, m_{2}, ⋯, m_{L}), where each m_{l} can take M_{l} discrete values. Also, let M: = (M_{1}, ⋯, M_{L}). For example, when taking account of 6 substitution patterns and ±2 flanking sites, M = (6, 4, 4, 4, 4). See S1 Table for other examples.
Suppose we have observed mutations in I sampled cancer genomes, and let J_{i} denote the number of observed mutations in the ith cancer genome. Further, let x_{i, j} = (x_{i, j, 1}, ⋯, x_{i, j, L}), (i = 1, ⋯, I, j = 1, ⋯, J_{i}) denote the observed mutation feature vector for the jth mutation of ith cancer genome, where x_{i, j, l} ∈ {1, ⋯, M_{l}}.
Our model assumes that each mutation arose from one of K possible mutation signatures. Each cancer sample has its own characteristic proportion of mutations of each signature type (which might depend on lifestyle, genetic differences, etc.).g We let q_{i, k} denote the proportion of signature k in sample i, so q_{i} = (q_{i, 1}, q_{i, 2}, ⋯, q_{i, K})∈Δ^{K}, (i = 1, ⋯, I) where Δ S = { ( t 1 , ⋯ , t S ) : t s ≥ 0 ∀ s , ∑ s = 1 S t s = 1 } denotes the Sdimensional simplex of nonnegative vectors summing to 1. Further, each mutation signature is characterized by parameter vectors F_{k}: = (f_{k, 1}, …, f_{k, L}), where f_{k, l} is a probability vector for the lth feature in the kth signature. That is, f_{k, l} = (f_{k, l, 1}, …, f_{k, l, M}_{l}) ∈ Δ^{Ml}.
Our generative model for the observed mutations {x_{i, j}} in each cancer sample can now be described as a twostep process.

Generate z_{i, j} ∼ Multinomial(q_{m}), where z_{i, j} ∈ {1, ⋯, K} denotes the (unobserved) underlying mutation signature that caused the jth mutation in the ith sample.

For each l(= 1, ⋯, L), generate x_{i, j, l} ∼ Multinomial(f_{zi, j, l}). Thus,
This generative model is summarized in Fig 2. This model is essentially a “mixedmembership model”, also known as an “admixture model” [12] or “Latent Dirichlet Allocation” [13]. For example, the membership proportions for each sample are analogous to admixture proportions in an admixture model; the mutation signatures are analogous to populations, and the mutation signaturespecific parameters are analogous to populationspecific allele frequencies.
The key parameters in this model are the membership proportions for each sample, q_{i}, and the mutation signature parameters, F_{k}. We estimate these parameters by maximizing likelihood using an EM algorithm. A simulation study demonstrates that the estimation method can reproduce the mutation signature very accurately provided enough mutations and samples are available (see S1 Text). See Methods for more detailed models, parameter estimation, further discussion on relationships with mixed membership models, how to select K, etc.
The intrinsic composition of genome sequence, if unaccounted for, can undesirably influence estimated mutation signatures. For example, since the dinucleotide CpG is underrepresented in most genomic regions (other than promoters), a signature with substitutions from a C base can have weaker signals of G base at the +1 position. In previous work [10], this background problem was dealt by explicitly incorporating “mutation opportunity” coefficients into the model. Here, to reduce the influences of intrinsic sequence composition on our signature estimates, we introduce a special “background signature” { F 0 } ∈ Δ M 1 × ⋯ × M L, which is designed to capture biases in intrinsic genome sequence composition and is calculated from the composition of consecutive nucleotides of the human genome sequence (See Methods for the detail).
Robustness experiments using cancer genomes from urothelial carcinoma of the upper urinary tract
Here we compare our new “independent model” for mutation signatures, which assumes independence among mutation features, with the “full model”, which corresponds to existing approaches. We compare mutation signatures obtained by the two approaches and investigate the robustness of each approach by downsampling experiments.
The data consist of 14,717 somatic substitutions collected from a study of 26 urothelial carcinomas of the upper urinary tract (UCUT) [18]. The original study identified a novel mutation signature in these data: T > A substitutions at CpTpG sites with a strong transcription strand specificity caused by aristolochic acids (AA).
We consider a mutation pattern to consist of the substitution pattern, the ±2 flanking bases, and the transcription strand direction. Thus each signature is characterized by 18 parameters in our independent model, and by 3,071 parameters in the full model. After analyzing the data with various numbers of mutation signatures K, we selected K = 3 signatures for these analyses (see S2 Text).
The inferred APOBEC signature under the independent model shows a clear depletion of G base at the −2 position, which is consistent with the previous study [9] and results in the next subsection (Fig 3A and 3B). In contrast, for the full model, this tendency is rather mild (Figs 3C, 3D, and S1). The inferred AA mutation signature has no clear characteristics at the −2 position. These results suggest that our independent model has the potential to identify signatures in more detail and with less data than existing approaches based on the full model.
To investigate this further we performed downsampling experiments. Using the mutation signatures obtained using all 14,717 substitutions as a gold standard, we assessed performance of the proposed method on downsampled data consisting of r% of the original data, where r = (1%, 2.5%, 5%, 10%, 25%, 50%). To measure robustness we used the cosine similarity on the full dimensional vector space, which allows comparison between the full model and the independent model. We repeated each downsampling experiment 100 times for each model.
The results (Fig 3E and 3F) confirm that the results of the independent model are substantially more robust to reductions in data size than the full model. Indeed, mutation signatures inferred using the independent model with only 10% of the data remain highly similar to the signatures inferred from the full data; by comparison the full model shows a much larger dropoff in similarity, especially in the APOBEC signature where even using 50% of the data gives a substantial dropoff in similarity. Both methods found the AA signature easier to recover than the APOBEC signature. We believe that this is because the number of T > A substitutions at GpTpC sites are far more frequent in this dataset.
Application to somatic mutation data of 30 cancer types
To provide a more comprehensive practical illustration of our method, we applied it to somatic mutation data from 30 cancer types [8]. We applied the method to each cancer type separately to assess similarity of estimated signatures across cancer types. For each cancer type we selected the number of signatures K by fitting the model with increasing K and examining the loglikelihood, bootstrap errors, and correlation of membership parameters. The selected values of K are given in S2 Table. Also, we simply removed somatic mutations located in an intergenic region to include transcription strand biases as mutation features. Finally, we merged similar mutation signatures across different cancer types (when their Frobenius distances were < 0.6, where the Frobenius distance between two matrices (F_{1}, F_{2}) is Tr ( ( F 1  F 2 ) ( F 1  F 2 ) t ) (Tr means the trace of square matrices).
Figs 4 and 5 summarize the results. In total, we identified 27 mutation signatures. Many of these signatures show reassuring similarities with signatures identified in previous studies. However signatures from our independence model, because of its ability to effectively and parsimoniously deal with both ±2 flanking base context and strand bias, are often more refined, highlighting additional details or features not previously evident. By comparing the composition of nucleotides and cancer types exhibiting the signatures with results of previous studies, we were able to associate many of the detected signatures with known mutational processes. In addition, as we reviewed these signatures and compared them with previous work, we noticed connections that, while not directly related to our new model, appear novel and noteworthy. The remainder of this section provides a comprehensive discussion of these findings.
Signatures 1 and 8 (C > A at TpCpT and C > T at TpCpG, respectively) observed in colorectal and uterine cancers appear likely to be associated with deregulated activity of the errorprone polymerase Pol ϵ. In previous analyses of these data [8], the signature for Pol ϵ dysfunction was represented by a single signature (their “signature 10”, see S2 Fig). In contrast our new approach uses two signatures. Since these signatures are highly correlated, and appear connected by a single biological mechanism, we certainly do not argue that inferring them as a single signature is “wrong”. However, splitting them into two signatures does help highlight certain features. Specifically, signature 1 shows a transcription strand bias whereas signature 8 does not, and this is true for both colorectal and uterus cancers (S3C and S3D Fig). This strand bias may be connected with the enrichment of C >A at TpCpT mutations in leading strands of replication forks observed by [19]. Although replication strand bias is different from transcription strand bias, these two biases may be connected through the fact that replication origins prefer transcription start sites [20].
These signatures also illustrate the ability of our model to help highlight sequence context effects beyond the immediate flanking bases. Specifically, both signatures 1 and 8 show an elevated frequency of the T base at position −2, and signature 1 also shows slightly elevated frequency of the T base at position +2 (Figs 6B, 6C, S3C, and S3D). A previous study of Pol ϵ [19] found that a nonsense mutation R23X of TP53 is enriched in cancers with Pol ϵ defects. In fact, the pattern of this mutation is C > T at TpTpCpGpA, closely matching signature 8. This illustrates that the inclusion of ±2 bases into signatures may be helpful for identifying underlying mechanisms.
Signature 2 (C > A at [CT]pCpT) is observed solely in low grade gliomas, and appears related to, but slightly different from, the signature previously detected in the same cancer types (“signature 14”, [9]). Indeed, the corresponding signature in the previous study shows very complex patterns (C > A at NpCpT or C > T at GpTpN). Further investigation revealed that this signature is driven by a single sample with an extremely high mutation rate (see S4A and S4B Fig), and signature 2 disappeared when we removed this sample (S5 Fig). It may be that the complex lowgradeglioma specific signature detected in the previous study is driven by the same single sample. We suggest that these signatures should be treated with caution until validated in additional samples.
Signature 4 (C > A at CpCpG) observed in kidney clear cell carcinomas, lung adenocarcinomas and melanomas seems to correspond to the “signature R2” detected in the same cancer types (plus lung squamous carcinomas) in [9] (see their Supplementary Figures). Again our analysis highlights additional contextual information, with a strikingly elevated frequency of base C at the −2 position (S6A, S6B and S6C Fig). However, for each cancer type, only a few samples support this signature (see S6D, S6E and S6F Fig), and the corresponding signature could not be validated in the previous study: most somatic mutations corresponding to that signature could not be validated by resequencing or visual inspection of BAM files using genomic viewers. Again, further investigation yielded a potential explanation for this finding: this signature largely matches that of a putative artifact caused by oxidation of DNA during acoustic shearing [21], and we conclude that this signature, and the corresponding signature in previous work, are likely artefactual. Although not of direct biological interest, identifying artefactual signatures could be helpful in removing false positive mutations.
Signature 13 (T > [AGT] at TpCpN sites) was observed in 12 cancer types, and is surely related to the activity of the APOBEC family. The 12 inferred signatures were highly consistent among cancer types except for Bcell lymphoma (see S3A Fig), highlighting the robustness of our approach. Almost all of them show enrichment of A and T and depletion of G base at the −2 position (Figs 6A and S3A), consistent with the UCUT data above and previous analyses [9]. The estimated transcribed strand specificities varied among cancer types, suggesting that there is not consistent strandspecificity in APOBEC signatures (and the observed variation may be due to estimation errors). Signatures 15 and 16 may also be related to APOBEC, although the estimated signatures are sufficiently different from 13 that they were not merged into a single cluster by our specified clustering criteria.
Signatures 10, 11, 12, 19 and 21 provide further examples of our method refining previously observed signatures, highlighting strand biases and/or sequence context effects, particularly 2 bases upstream of the substitution. Signature 10 (C >T at [CT]pCpC) was observed in head and neck cancers and melanomas, and probably relates to ultraviolet light. Consistent strand specificities among the two cancer types (S3E Fig) matches previous results [9], but our analysis additionally highlights elevated abundance of T at the −2 position (Figs 6D and S3E). Signature 11 (C > T at GpCp[CG]) appears in smallcell lung cancers and stomach cancers, and seems to be the same as “signature 15” in the previous study, whose function remains unclear. Again our analysis highlights elevated abundance of G at the −2 position (Figs 6E and S3F). Signatures 12 (C > T at [CG]pCp[CT]), 19 (T > C at GpTpN) and 21 (T > [CG] at CpTpT) observed in pilocytic astrocytomas, stomach cancers and oesophagus cancers, respectively, agree well with those detected in the same cancer types in the previous study [9]. However our analysis again refines these signatures, highlighting a strand bias in all three, and sequence context effects at the −2 position in Signatures 12 and 21.
One signature, Signature 20, appears not to match any signatures in the previous analysis [9] and represents a potentially novel signature. This signature (T > C at [AC]pTpN) is observed in thyroid cancers, and shows a very strong strand specificity, which could be due to transcriptioncoupled nucleotide excision repairs. This signature may have been too weak for previous methods to detect, perhaps because the mutation ratio of thyroid cancer is low, possibly reflecting improved sensitivity of our more parsimonious model.
The remaining signatures largely recapitulate previous results. Signature 3 and 5 (C > A at NpCpN) observed in headandneck cancers and three types of lung cancers are probably associated with tobacco smoking. The estimated signature in each cancer type shows higher mutation prevalence on the template strand (S3B Fig), which is consistent with the previous study [2, 9]. Signature 6 (C > A at NpCp[AT]) observed in neuroblastomas matches the pattern detected in the same cancer type in the previous study. Signature 7 (C > T at NpCpG sites) was observed in 25 out of 30 cancer type, and arguably relates to deamination of 5methylcytosine. Signature 9 (C > T at NpCp[CT]) was observed in melanomas and glioblastomas, and is probably associated with a chemotherapy drug, temozolomide. Signature 18 (T > C at ApTp[AG]) observed in liver cancers has been shown to be more common in Asian cases than in other ancestries [16], though the source of this signature is still not clear. In this signature, we observe a very strong strand specificity as shown in [9, 16], suggesting a possible role for transcriptioncoupled nucleotide excision repairs.
Discussion
In this paper, we presented new methods for inferring and visualizing mutation signatures from multiple cancer samples. The new methods exploit simpler, more parsimonious, models for mutation signatures than existing methods. This improves stability of statistical estimation, and easily allows a wider range of contextual factors (e.g. more flanking bases) to be incorporated into the analysis. In addition, we provide a new intuitive way to visualize the inferred signatures.
We have also emphasized the connection between mutation signature detection, and the use of mixedmembership models in other fields, particularly admixture analysis and document clustering. This connection naturally raises the possibility of improving the proposed approach by learning from experiences in those other fields. For example, in admixture analysis, [22] found that the use of a correlated prior on allele frequencies improved sensitivity to detect population clusters; this suggests that it might be fruitful to consider a correlated prior distribution on signatures, to allow that some signatures—perhaps in different cancers—may be similar to one another (though not identical). More generally, introducing certain prior distributions or penalty terms, such as sparsitypromoting penalties [23, 24] and determinantal point process priors [25, 26] could improve both accuracy and interpretation. Further, as the scale of cancer genome data becomes large, more sophisticated computational approaches for estimating parameters may become necessary. We can potentially borrow a number of computational techniques such as those using EMalgorithm [27, 28], sequential quadratic programming [29], Gibbs sampling [12, 30] and variational methods [13, 31, 32]. Finally, to address the problem of determining the number of signatures, it may be fruitful to extend the framework to the Hierarchical Dirichlet processes [33].
Although we have focused on point substitution mutations in this paper, many other types of mutations occur in cancer genomes, including insertions, deletions, double nucleotides substitutions, structural variations and copy number alterations [34, 35]. Our framework could incorporate these additional mutation types, by summarizing them using appropriate mutation features. In some cases, choice of appropriate features may need investigation. For example, longer deletions could be represented by the length of deletion and the adjacent bases; for short deletions (a few bases) it may be fruitful to include the actual deleted bases as part of the features.
We have detected a number of mutation signatures having transcription strand biases, which are naturally considered to be associated with transcription activities. Therefore, to further understand the effect of transcription activities on mutational mechanisms, we can include gene expression or RNA polymerase II occupancies to mutation features, so that the relationships of strand biases and transcription activities will be clarified. Also, it may be interesting to devise a probabilistic model for mutation signatures somewhere between complete independence and nonindependence assumption, for example, using ideas analogous to those in [36] that uses a Markovian structure for transcription factor binding sites. This may help improve the modelling flexibility of mutation signatures while keeping the number of parameters moderate.
Although we believe our new methods already provide useful gains compared with existing approaches, the methods are perhaps even more important for their future potential to incorporate other contextual data, including epigenetic data, into mutation signature analysis. This is important, because local mutation densities are closely related to a number of genomic and epigenetic factors, such as GC content, repeat sequences, chromatin accessibility and modifications, and replication timing [37–40]. A recent study found that epigenetic information in the cell types of origin of the corresponding tumors is the most predictive [41] for local mutation densities. A growing range of epigenetic data from many cell types are now available, and it will be interesting to integrate these epigenetic factors into mutation signature analysis to help understand how these epigenetic factors influence DNA damage and repair mechanisms. Our work here provides a straightforward way to do this: epigenetic data can be simply added as features to the mutation signature. This has the potential to improve accuracy of signature detection (e.g. S7 Fig), and to produce novel biological insights. We believe that the value and impact of our work, and specifically our proposed approach to modelling mutation signatures via independent features, will grow as more and more features are incorporated into the analysis.
Methods
Parameter estimation
The parameters {f_{k, l}} and {q_{i}} must be estimated from the available mutation data {x_{i, j}}. Here we adopt a simple approach that uses an EMalgorithm to maximize the likelihood.
Let g_{i, m} denote the number of mutations in the ith sample that have mutation feature vector m. In the E step of the EM algorithm, we calculate values of auxiliary variables θ_{i, k, m} defined as
Then, in the Mstep, we update the parameters {f_{k, l}} and {q_{i, k}} as
We use the R package SQUAREM [42] to accelerate convergence of this EM algorithm (SQUAREM implements a general approach to accelerate the convergence of any fixedpoint iterative scheme such as an EM algorithm). To address potential problems with convergence to local minima, we apply the EM algorithm several times (10 times in this paper) using different initial points, and use the estimate with the largest loglikelihood. See S3 Text for the derivation of the above updating procedures.
Background signatures
Here, we describe how the background mutation signature is obtained in the case where mutation features are the substitution patterns, the ±2 flanking bases, and the transcription strand. Since the majority of the data used in this paper is exome sequencing data, and since we consider transcription strand as a mutation feature, we use the exonic regions of the human genome reference sequence to obtain the background mutation signature. First we calculate the frequencies of 5mers with transcription strands of the corresponding exon, where we take complement sequences and flip the strand for those whose central bases are A or G. Then, assuming alternated bases are equally likely from each central base C and T, the frequency of each mutation feature is derived directly from those of the 5mers and transcription strands. Finally, the probability of each mutation feature is derived by normalizing each frequency to sum to one.
Estimating standard errors
We use the nonparametric bootstrap [43] to calculate standard errors for parameter estimates. This involves resampling somatic mutations from the empirical distribution of the original data {x_{i, j}} for each cancer genome. For each of 100 such bootstrap samples, we refitted the model, using parameters obtained for the original data as initial points. We then used sample standard errors of the inferred mutational signatures as estimates of parameter standard errors.
Selecting the number of signatures
Determining an appropriate number of mutation signatures K is a challenging task. One approach is to utilize some statistical information criteria such as AIC [44] or BIC [45]. In the population structure problems, for example, the Bayesian deviance [12] and crossvalidation [46] have been suggested. One previous study on mutation signature problems [10] utilized BIC. On the other hand, the problem of using these statistical information criteria is that most of them are based on the likelihood, where slight deviations between the specified probabilistic models and the reality sometimes leads to additional (possibly spurious) mutation signatures being selected to compensate for those deviations.
In this paper, instead of utilizing a statistical information criteria, we adopt the following strategy:

After calculating the likelihood and standard errors of parameters for a range of K, the value of K is determined at the point where the likelihood is sufficiently high, and the standard errors are sufficiently low [9].

When, for k_{1}th and k_{2}th mutation signatures, we could detect strong correlations between the estimated membership parameters for each cancer genome ((q_{1, k1}, q_{2, k1}, ⋯, q_{I, k1}) and (q_{1, k2}, q_{2, k2}, ⋯, q_{I, k2})), and the two mutation signatures ({F_{k1}} and {F_{k2}}) show similar patterns, then this suggests that the method may have split one mutation signature into two. We choose K to be small enough that such pairs of mutation signatures do not occur.
These strategies are not claimed as optimal, but appeared to provide satisfactory results in our applications here. The development of automated and practical approaches for choosing K is a possible area for future development.
Existing methods as a special case
Previous approaches to mutation signature modelling in [8, 9] are a special case of our framework. Specifically, they correspond to combining all possible combinations of mutation features into a single “metafeature”, which takes M_{1} × M_{2} × … × M_{L} possible values. Thus, instead of having L features with M = (M_{1}, …, M_{L}), existing approaches have one feature with M = (M_{1} × … × M_{L}) (see S1 Table). The resulting model allows for arbitrary distributions on the M_{1} × … × M_{L} feature space, and we call the resulting model the “full model”. The full model can represent complicated dependencies in a single signature. For example, a situation where C > A is frequent at ApCpG sites and C > T is frequent at TpCpA sites could be represented with one signature. This may be desirable in some settings and not in others. However, when many mutation contextual factors are taken into account and the number of free parameters becomes huge, estimated results can be unstable and unreliable. Furthermore, there is a risk of overinterpreting the complex features of estimated signatures.
Relationship with mixedmembership models
Our model is closely related to mixedmembership models that have been adopted in other applications, such as document classification and population structure inference problems. In this subsection, we outline these relationships, slightly abusing notation to contrast the relationships.
In the topic model [13, 27], which are a form of mixedmembership models frequently used in document classification problems, each document is assumed to have K different “topics” in varying proportions (q_{i} ∈ Δ^{K}), where each topic is characterized by a word frequency (a multinomial distribution on a set of words W (f_{k} ∈ Δ^{W}). And each word is assumed to be generated by one of K multinomial distributions (topics). The detailed generative process of the jth word in the ith document x_{i, j} is:

Generate the underlying topic for the jth word: z_{i, j} ∼ Multinomial(q_{i}), where z_{i, j} ∈ {1, ⋯, K}.

Generate x_{i, j} ∼ Multinomial(f_{zi, j}), where x_{i, j} ∈ {1, ⋯, W}.
Actually our “full model” (L = 1) is essentially the same as a topic model.
On the other hand, in population structure inference problems [12, 47], each individual is assumed to be an admixture of K ancestries in varying proportions, where each ancestry is characterized by the allele frequency at each SNP locus. Each SNP genotype of an individual is assumed to be generated by the two step model: first, an ancestry (“population”) is chosen according to the admixture proportion for each individual, and then the SNP genotype is generated according to the allele frequency of the selected ancestry at that locus. The relationships among the mutation signature models, topic models and population structure models are summarized in Table 1.
As pointed out by [48], there is a close relationship between mixedmembership models and nonnegative matrix factorization, which has been successfully used in the previous studies for mutational signature problems [7–9]. In fact, the proposed method can be seen as nonnegative matrix factorization with additional restrictions. See S4 Text for details of the relationship between the proposed approach and nonnegative matrix factorization.
Supporting Information
Zdroje
1. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719–724. doi: 10.1038/nature07943 19360079
2. Pfeifer GP, Denissenko MF, Olivier M, Tretyakova N, Hecht SS, Hainaut P. Tobacco smoke carcinogens, DNA damage and p53 mutations in smokingassociated cancers. Oncogene. 2002 Oct;21(48):7435–7451. doi: 10.1038/sj.onc.1205803 12379884
3. Pfeifer GP, You YH, Besaratinia A. Mutations induced by ultraviolet light. Mutat Res. 2005 Apr;571(12):19–31. doi: 10.1016/j.mrfmmm.2004.06.057 15748635
4. Burns MB, Lackey L, Carpenter MA, Rathore A, Land AM, Leonard B, et al. APOBEC3B is an enzymatic source of mutation in breast cancer. Nature. 2013 Feb;494(7437):366–370. doi: 10.1038/nature11881 23389445
5. Burns MB, Temiz NA, Harris RS. Evidence for APOBEC3B mutagenesis in multiple human cancers. Nat Genet. 2013 Sep;45(9):977–983. doi: 10.1038/ng.2701 23852168
6. Roberts SA, Lawrence MS, Klimczak LJ, Grimm SA, Fargo D, Stojanov P, et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat Genet. 2013 Sep;45(9):970–976. doi: 10.1038/ng.2702 23852170
7. NikZainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012 May;149(5):979–993. doi: 10.1016/j.cell.2012.04.024 22608084
8. Alexandrov LB, NikZainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013 Aug;500(7463):415–421. doi: 10.1038/nature12477 23945592
9. Alexandrov LB, NikZainal S, Wedge DC, Campbell PJ, Stratton MR. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013 Jan;3(1):246–259. doi: 10.1016/j.celrep.2012.12.008 23318258
10. Fischer A, Illingworth CJ, Campbell PJ, Mustonen V. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome Biol. 2013 Apr;14(4):R39. doi: 10.1186/gb2013144r39 23628380
11. Krawczak M, Ball EV, Cooper DN. Neighboringnucleotide effects on the rates of germline singlebasepair substitution in human genes. Am J Hum Genet. 1998 Aug;63(2):474–488. doi: 10.1086/301965 9683596
12. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000 Jun;155(2):945–959. 10835412
13. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
14. Eddelbuettel D, François R, Allaire J, Chambers J, Bates D, Ushey K. Rcpp: Seamless R and C++ integration. Journal of Statistical Software. 2011;40(8):1–18. doi: 10.18637/jss.v040.i08
15. Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990 Oct;18(20):6097–6100. doi: 10.1093/nar/18.20.6097 2172928
16. Totoki Y, Tatsuno K, Covington KR, Ueda H, Creighton CJ, Kato M, et al. Transancestry mutational landscape of hepatocellular carcinoma genomes. Nat Genet. 2014 Dec;46(12):1267–1273. doi: 10.1038/ng.3126 25362482
17. Rrnyi A. On measures of entropy and information. In: Fourth Berkeley symposium on mathematical statistics and probability. vol. 1; 1961. p. 547–561.
18. Hoang ML, Chen CH, Sidorenko VS, He J, Dickman KG, Yun BH, et al. Mutational signature of aristolochic acid exposure as revealed by wholeexome sequencing. Sci Transl Med. 2013 Aug;5(197):197ra102. doi: 10.1126/scitranslmed.3006200 23926200
19. Shinbrot E, Henninger EE, Weinhold N, Covington KR, Goksenin AY, Schultz N, et al. Exonuclease mutations in DNA polymerase epsilon reveal replication strand specific mutation patterns and human origins of replication. Genome Res. 2014 Nov;24(11):1740–1750. doi: 10.1101/gr.174789.114 25228659
20. Dellino GI, Cittaro D, Piccioni R, Luzi L, Banfi S, Segalla S, et al. Genomewide mapping of human DNAreplication origins: levels of transcription at ORC1 sites regulate origin selection and replication timing. Genome Res. 2013 Jan;23(1):1–11. doi: 10.1101/gr.142331.112 23187890
21. Costello M, Pugh TJ, Fennell TJ, Stewart C, Lichtenstein L, Meldrim JC, et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013 Apr;41(6):e67. doi: 10.1093/nar/gks1443 23303777
22. Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003 Aug;164(4):1567–1587. 12930761
23. Hoyer PO. Nonnegative matrix factorization with sparseness constraints. Journal of Machine Learning Research. 2004;5:1457–1469.
24. Engelhardt BE, Stephens M. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genetics. 2010;6(9):e1001117. doi: 10.1371/journal.pgen.1001117 20862358
25. Kulesza A, Taskar B. Determinantal point processes for machine learning. arXiv preprint arXiv:12076083. 2012;.
26. Kwok JT, Adams RP. Priors for diversity in generative latent variable models. In: Advances in Neural Information Processing Systems; 2012. p. 2996–3004.
27. Hofmann T. Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM; 1999. p. 50–57.
28. Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: analytical and study design considerations. Genetic Epidemiology. 2005;28(4):289–301. doi: 10.1002/gepi.20064 15712363
29. Zhou H, Alexander D, Lange K. A quasiNewton acceleration for highdimensional optimization algorithms. Statistics and Computing. 2011;21(2):261–273. doi: 10.1007/s1122200991663 21359052
30. Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci USA. 2004 Apr;101 Suppl 1:5228–5235. doi: 10.1073/pnas.0307752101 14872004
31. Teh YW, Newman D, Welling M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems; 2006. p. 1353–1360.
32. Raj A, Stephens M, Pritchard JK. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics. 2014 Jun;197(2):573–589. doi: 10.1534/genetics.114.164350 24700103
33. Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical dirichlet processes. Journal of the American Statistical Association. 2006;101(476). doi: 10.1198/016214506000000302
34. Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through secondgeneration sequencing. Nature Reviews Genetics. 2010;11(10):685–696. doi: 10.1038/nrg2841 20847746
35. Helleday T, Eshtad S, NikZainal S. Mechanisms underlying mutational signatures in human cancers. Nature Reviews Genetics. 2014;15(9):585–598. doi: 10.1038/nrg3729 24981601
36. Zhao X, Huang H, Speed TP. Finding short DNA motifs using permuted Markov models. J Comput Biol. 2005;12(6):894–906. doi: 10.1089/cmb.2005.12.894 16108724
37. SchusterBockler B, Lehner B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature. 2012 Aug;488(7412):504–507. doi: 10.1038/nature11273 22820252
38. Hodgkinson A, Chen Y, EyreWalker A. The largescale distribution of somatic mutations in cancer genomes. Hum Mutat. 2012 Jan;33(1):136–143. doi: 10.1002/humu.21616 21953857
39. Liu L, De S, Michor F. DNA replication timing and higherorder nuclear organization determine singlenucleotide substitution patterns in cancer genomes. Nat Commun. 2013;4:1502. doi: 10.1038/ncomms2502 23422670
40. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancerassociated genes. Nature. 2013 Jul;499(7457):214–218. doi: 10.1038/nature12213 23770567
41. Polak P, Karli R, Koren A, Thurman R, Sandstrom R, Lawrence MS, et al. Celloforigin chromatin organization shapes the mutational landscape of cancer. Nature. 2015 Feb;518(7539):360–364. doi: 10.1038/nature14221 25693567
42. Varadhan R, Roland C. Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian Journal of Statistics. 2008;35(2):335–353. doi: 10.1111/j.14679469.2007.00585.x
43. Efron B, Tibshirani RJ. An introduction to the bootstrap. CRC press; 1994.
44. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19(6):716–723. doi: 10.1109/TAC.1974.1100705
45. Schwarz G, et al. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461–464. doi: 10.1214/aos/1176344136
46. Alexander DH, Lange K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics. 2011;12(1):246. doi: 10.1186/1471210512246 21682921
47. Alexander DH, Novembre J, Lange K. Fast modelbased estimation of ancestry in unrelated individuals. Genome Res. 2009 Sep;19(9):1655–1664. doi: 10.1101/gr.094052.109 19648217
48. Ding C, Li T, Peng W. On the equivalence between nonnegative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis. 2008;52(8):3913–3927. doi: 10.1016/j.csda.2008.01.011
Štítky
Genetika Reprodukční medicínaČlánek vyšel v časopise
PLOS Genetics
2015 Číslo 12
Nejčtenější v tomto čísle
 "Women Who Don't Give a Crap"
 A Point Mutation in Suppressor of Cytokine Signalling 2 () Increases the Susceptibility to Inflammation of the Mammary Gland while Associated with Higher Body Weight and Size and Higher Milk Production in a Sheep Model
 Data Sharing Policy: In Pursuit of Functional Utility
 Catching a (DoubleStrand) Break: The Rad51 and Dmc1 Strand Exchange Proteins Can Cooccupy Both Ends of a Meiotic DNA DoubleStrand Break