Inferring selection in cancer sequencing data using ABC and population based simulations
Author marcjwilliams1
12 Stars
Updated Last
1 Year Ago
Started In
August 2017


Build Status Build Status Coverage Status

It is recommended that you use mobster for this type of analysis moving forward. mobster is an R package that provides similar functionality with orders of magnitude increases in speed and has many other features. SubClonalSelection.jl will remain here but I am unlikely to actively develop the package or be able to provide much ongoing support.

A Julia package for inferring the strength of selection from cancer sequencing data. Package simultaneously estimates the number of subclones in the population and their relative fitnesses. This is done by generating synthetic data by simulating different population dynamics (see CancerSeqSim.jl) and fitting this to data using Approximate Bayesian Computation (see ApproxBayes.jl).

The package enables analysis as described in Quantification of subclonal selection in cancer from bulk sequencing data. See this paper for the technical background.

Getting Started

To download the package, once you're in a Julia session type the following command:

] add

Then once you are in a julia session the package can be loaded with

using SubClonalSelection

Below is an example of how to run an analysis, for some more examples with more details go here.

Input data

Running an analysis requires variant allele frequencies (VAFs) as measured in deep sequencing of cancer samples. Generating the synthetic data assumes that the cancer is diploid, therefore any mutations falling in non-diploid regions should be removed. This does unfortunately mean that in highly aneuploid tumours, there will not be enough mutations to perform an analysis. We would recommend a minimum of 100 mutations.

Running an analysis

The main function to perform an analysis is fitABCmodels. This takes as its first argument either a vector of Floats or a string pointing to a text file with a vector of floats, which will be read in automatically. The second argument is the name of the sample you wish to analyse which will be used to write the data and plots to a file. There are then a number of keyword arguments set to reasonable defaults. More details of these arguments and their defaults can be found by typing ?fitABCmodels in a Julia session.

There is some example data generated from the simulation found in the examples directory. For example the following command will run the inference on the oneclone.txt data set with 400 posterior samples and 5*10^5 iterations given sequencing depth of sample is 300X:

using Random
out = fitABCmodels("example/oneclone.txt",
  read_depth = 300,
  resultsdirectory = "example",
  Nmaxinf = 10^6,
  maxiterations = 5*10^5,
  save = true,
  nparticles = 400);

Running this is quite computationally expensive, so running on a cluster would be recommended in most cases.

Also included are a number of functions to summarize the output and plot the posterior. The output of the fitABCmodels function prints a summary of the posterior distribution as well as some details on the ABC convergence. For example for the above we can see a print out of the model fitting results:


Total number of simulations: 5.31e+05
Cumulative number of simulations = [400, 1501, 3347, 6755, 11325, 17583, 25144, 43005, 94935, 225243, 531167]
Acceptance ratio: 7.53e-04
Tolerance schedule = [5701.38, 3576.56, 2353.92, 1721.4, 1348.54, 1156.23, 1021.12, 877.64, 721.45, 575.07, 467.33]
Model probabilities:
        Model 1 (0 subclones): 0.003
        Model 2 (1 subclones): 0.819
        Model 3 (2 subclones): 0.178
Model 1 (0 subclones)
        Median (95% intervals):
        Parameter 1 - μ/β: 21.81 (20.75,24.80)
        Parameter 2 - Clonal Mutations: 352.81 (345.11,382.23)
        Parameter 3 - Cellularity: 0.68 (0.66,0.69)
Model 2 (1 subclones)
        Median (95% intervals):
        Parameter 1 - μ/β: 17.81 (14.30,23.13)
        Parameter 2 - Clonal Mutations: 308.29 (232.27,362.17)
        Parameter 3 - Fitness: 0.86 (0.40,2.27)
        Parameter 4 - Time (tumour doublings): 8.15 (5.06,12.50)
        Parameter 5 - Cellularity: 0.70 (0.68,0.74)
        Parameter 6 - Subclone Frequency: 0.62 (0.51,0.70)
        Parameter 7 - Subclone Mutations: 222.00 (151.71,295.00)
Model 3 (2 subclones)
        Median (95% intervals):
        Parameter 1 - μ/β: 14.88 (11.83,18.84)
        Parameter 2 - Clonal Mutations: 304.75 (252.96,366.35)
        Parameter 3 - Fitness - Subclone 1: 1.38 (0.41,23.07)
        Parameter 4 - Time (tumour doublings) - Subclone 1: 9.88 (4.10,13.98)
        Parameter 5 - Fitness - Subclone 2: 0.84 (0.14,3.05)
        Parameter 6 - Time (tumour doublings) - Subclone 2: 8.71 (2.54,14.05)
        Parameter 7 - Cellularity: 0.70 (0.67,0.73)
        Parameter 8 - Subclone 1 Frequency: 0.63 (0.53,0.72)
        Parameter 9 - Subclone 2 Frequency: 0.09 (0.05,0.51)
        Parameter 10 - Subclone 1 Mutations: 216.00 (82.09,280.98)
        Parameter 11 - Subclone 2 Mutations: 192.00 (72.94,301.17)

We can also plot the posterior distributions.

Plot the posterior model probabilities.



Plot the histogram for model with 1 subclone (second argument is the number of subclones).

plothistogram(out, 1)


Plot the posterior parameter distribution for model with 1 subclone.

plotparameterposterior(out, 1)


Note the ground truth of the parameters in this case are. And as can be seen we do a reasonably good job of recovering the parameters. All ground truth parameters are within the 95% credible intervals.

  • Mutation rate: 20.0
  • Number of clonal mutations: 300
  • Number of subclones: 1
  • Cellularity: 0.7
  • Tumour population size: 10^6
  • Subclone frequency: 0.58
  • Fitness advantage: 1.03
  • Mutations in subclone: 251
  • Time emerges (tumour doublings): 9.0
  • Read depth: 300X

Finally we can also save all plots and text files with posterior distributions to a directory, unless specified the default will be to create a file a directory called output in the current working directory. It is also possible to automatically save the output by adding a save=true keywords to fitABCmodels.

saveresults(out; resultsdirectory = "example")
saveallplots(out, resultsdirectory = "example")