Please note that until a peer reviewed publication describing the demografr package is published, the software should be regarded as experimental and potentially unstable.
You can now read our preprint on bioRxiv! All comments and feedback are welcome!
The goal of demografr is to simplify and streamline the development of simulation-based inference pipelines in population genetics and evolutionary biology, such as Approximate Bayesian Computation (ABC) or parameter grid inferences, and make them more reproducible. demografr also aims to make these inferences orders of magnitude faster and more efficient by leveraging the tree sequence as an internal data structure and computation engine.
Unlike traditional ABC and other simulation-based approaches, which generally involve custom-built pipelines and scripts for population genetic simulation and computation of summary statistics, demografr makes it possible to perform simulation, computation of summary statistics, and the inference itself within a single reproducible R script. By eliminating the need to write custom simulation code and scripting for integration of various population genetic tools for computing summary statistics, it lowers the barrier for new users and facilitates reproducibility for everyone regardless of their level of experience by eliminating many common sources of bugs.
demografr streamlines every step of a typical ABC pipeline by leveraging the slendr framework as a building block for simulation and data analysis, making it possible to write complete simulation-based workflows entirely in R. Specifically:
-
slendr's intuitive, interactive interface for definning population genetic models makes it easy to encode even complex demographic models with only a bare minimum of R knowledge needed.
-
demografr makes it possible to encode prior distributions of parameters using familiar R interface resembling standard probabilistic statements, and provides an automated function which simulates ABC replicates drawing parameters from priors in a trivial, one-step manner. Automated routines for exploring model parameter grids in settings other than ABC are also provided.
-
Because slendr embraces the tree sequence as its default internal data structure, most population genetic statistics can be computed directly on such tree sequences using R functions which are part of slendr's statistical library. A tree sequence is never saved to disk and no conversion between file formats is required, which significantly speeds up every workflow. That said, using custom simulation scripts as well as computation of summary statistics in third party software is also possible (while still leveraging the consistent demografr workflow approach and most of its standard simulation and inference functions).
-
demografr facilitates tight integration with the powerful R package abc by automatically feeding it simulation data for inference and diagnostics. At the moment, the abc package represents a core inference engine of demografr. Other inference engines and statistical approaches will be developed in the future.
Once the demografr package appears on CRAN, you will be able to install the latest released version with the following command:
install.packages("demografr")However, especially early after an initial release, it might be worth keeping an eye on the changelog for bugfixes and improvements, which you can always obtain by installing the development version of demografr with:
devtools::install_github("bodkan/demografr")Note that this requires an R package devtools, which you can install simply
by running install.packages("devtools").
Note: A much more detailed explanation of this toy example can be found in the following vignette. Or in the demografr manuscript available on bioRxiv. Please take the code below as a bare minimum demonstration of some basic functionality of the package.
Imagine that we sequenced genomes of individuals from populations "A", "B", "C", and "D".
Let's also assume that we know that the populations are phylogenetically
related in the following way, with an indicated gene-flow event at a certain
time in the past, but we don't know anything else (i.e., we have no idea about
the
After sequencing the genomes of individuals from these populations, we computed
the nucleotide diversity in each of them, their pairwise genetic
divergence, and
- Nucleotide diversity in each population:
observed_diversity <- read.table(system.file("examples/basics_diversity.tsv", package = "demografr"), header = TRUE)
observed_diversity
#> set diversity
#> 1 A 8.030512e-05
#> 2 B 3.288576e-05
#> 3 C 1.013804e-04
#> 4 D 8.910909e-05- Pairwise divergence between populations X and Y:
observed_divergence <- read.table(system.file("examples/basics_divergence.tsv", package = "demografr"), header = TRUE)
observed_divergence
#> x y divergence
#> 1 A B 0.0002378613
#> 2 A C 0.0002375761
#> 3 A D 0.0002379385
#> 4 B C 0.0001088217
#> 5 B D 0.0001157056
#> 6 C D 0.0001100633- Value of the following
$f_4$ -statistic:
observed_f4 <- read.table(system.file("examples/basics_f4.tsv", package = "demografr"), header = TRUE)
observed_f4
#> W X Y Z f4
#> 1 A B C D -3.262146e-06Note that the value of each given statistic is given as the last column of each data frame, with additional columns providing relevant "metadata". Keeping consistent naming throughout an inference pipeline is extremely important for demografr, and goes a long way towards reproducibility and towards avoiding sneaky bugs! For this reason, demografr often loudly complains if something looks inconsistent. Better safe than sorry!
This is how we would use demografr to estimate the
library(demografr)
library(slendr)
# running setup_env() first might be necessary to set up slendr's internal
# simulation environment
init_env()
# set up parallelization across all CPUs on the current machine
library(future)
plan(multisession, workers = availableCores())
#--------------------------------------------------------------------------------
# bind data frames with empirical summary statistics into a named list
observed <- list(
diversity = observed_diversity,
divergence = observed_divergence,
f4 = observed_f4
)
#--------------------------------------------------------------------------------
# define a model generating function using the slendr interface
# (each of the function parameters correspond to a parameter we want to infer)
model <- function(Ne_A, Ne_B, Ne_C, Ne_D, T_AB, T_BC, T_CD, gf_BC) {
A <- population("A", time = 1, N = Ne_A)
B <- population("B", time = T_AB, N = Ne_B, parent = A)
C <- population("C", time = T_BC, N = Ne_C, parent = B)
D <- population("D", time = T_CD, N = Ne_D, parent = C)
gf <- gene_flow(from = B, to = C, start = 9000, end = 9301, rate = gf_BC)
model <- compile_model(
populations = list(A, B, C, D), gene_flow = gf,
generation_time = 1, simulation_length = 10000,
direction = "forward", serialize = FALSE
)
samples <- schedule_sampling(
model, times = 10000,
list(A, 25), list(B, 25), list(C, 25), list(D, 25),
strict = TRUE
)
# when a specific sampling schedule is to be used, both model and samples
# must be returned by the function
return(list(model, samples))
}
#--------------------------------------------------------------------------------
# setup priors for model parameters
priors <- list(
Ne_A ~ runif(1000, 3000),
Ne_B ~ runif(100, 1500),
Ne_C ~ runif(5000, 10000),
Ne_D ~ runif(2000, 7000),
T_AB ~ runif(1, 4000),
T_BC ~ runif(3000, 9000),
T_CD ~ runif(5000, 10000),
gf_BC ~ runif(0, 0.3)
)
#--------------------------------------------------------------------------------
# define summary functions to be computed on simulated data (must be of the
# same format as the summary statistics computed on empirical data)
compute_diversity <- function(ts) {
samples <- ts_names(ts, split = "pop")
ts_diversity(ts, sample_sets = samples)
}
compute_divergence <- function(ts) {
samples <- ts_names(ts, split = "pop")
ts_divergence(ts, sample_sets = samples)
}
compute_f4 <- function(ts) {
samples <- ts_names(ts, split = "pop")
A <- samples["A"]; B <- samples["B"]
C <- samples["C"]; D <- samples["D"]
ts_f4(ts, A, B, C, D)
}
# the summary functions must be also bound to an R list named in the same
# way as the empirical summary statistics
functions <- list(
diversity = compute_diversity,
divergence = compute_divergence,
f4 = compute_f4
)
#--------------------------------------------------------------------------------
# validate the individual ABC components for correctness and consistency
validate_abc(model, priors, functions, observed,
sequence_length = 1e6, recombination_rate = 1e-8)
#--------------------------------------------------------------------------------
# run ABC simulations
data <- simulate_abc(
model, priors, functions, observed, iterations = 10000,
sequence_length = 50e6, recombination_rate = 1e-8, mutation_rate = 1e-8
)
#--------------------------------------------------------------------------------
# infer posterior distributions of parameters using the abc R package
abc <- run_abc(data, engine = "abc", tol = 0.01, method = "neuralnet")After we run this R script, we end up with an object called abc (see the
last row of the script). This
object contains the complete information about the results of our inference.
In particular, it carries the posterior samples for our parameters of interest
(
What can we do with this, to get an idea about the most likely parameters of the assumed evolutionary history of our populations of interest?
For instance, we can get a summary table of all parameter posteriors with the
function extract_summary():
extract_summary(abc)
#> Ne_A Ne_B Ne_C Ne_D T_AB T_BC
#> Min.: 1492.557 526.6231 6373.061 2254.591 859.5749 5131.473
#> Weighted 2.5 % Perc.: 1774.758 672.5853 7344.040 2895.795 1318.3159 5595.637
#> Weighted Median: 2032.148 848.1467 8553.144 3814.660 1934.0008 6136.188
#> Weighted Mean: 2021.722 838.5594 8678.066 3804.614 1954.3343 6112.305
#> Weighted Mode: 2054.933 861.3408 8428.777 3721.917 1933.3694 6230.341
#> Weighted 97.5 % Perc.: 2270.334 1003.8421 10162.320 4530.273 2522.5919 6611.315
#> Max.: 2438.657 1047.0935 12703.479 5764.139 2650.7248 6851.800
#> T_CD gf_BC
#> Min.: 6752.669 -0.04681798
#> Weighted 2.5 % Perc.: 7125.349 0.02469992
#> Weighted Median: 7835.810 0.09867650
#> Weighted Mean: 7824.532 0.10282631
#> Weighted Mode: 7851.617 0.09804401
#> Weighted 97.5 % Perc.: 8426.377 0.18467366
#> Max.: 8476.157 0.21801675We can also specify a subset of model parameters to select, or provide a regular expression for this subsetting:
extract_summary(abc, param = "gf_BC")
#> gf_BC
#> Min.: -0.04681798
#> Weighted 2.5 % Perc.: 0.02469992
#> Weighted Median: 0.09867650
#> Weighted Mean: 0.10282631
#> Weighted Mode: 0.09804401
#> Weighted 97.5 % Perc.: 0.18467366
#> Max.: 0.21801675Of course, we can also visualize the posterior distributions. Rather than
plotting many different distributions at once, let's first check out the
posterior distributions of inferred
plot_posterior(abc, param = "Ne")Similarly, we can take a look at the inferred posteriors of the split times:
plot_posterior(abc, param = "T")And, finally, the rate of gene flow:
plot_posterior(abc, param = "gf") + ggplot2::coord_cartesian(xlim = c(0, 1))Additionally, we have the full diagnostic functionality of the abc R package at our disposal:
plot(abc, param = "Ne_C")Many diagnostic and model selection functions implemented by abc are also supported by demografr. For more information, see this vignette.
demografr also provides a couple of functions designed to make development (and troubleshooting) a little easier.
For instance, assuming we have priors set up as above, we can visualize the
prior distribution(s) like this:
plot_prior(priors, "Ne")To make developing complete pipelines more efficient, demografr also provides
means to test and evaluate their individual components even further. For instance,
the function simulate_model() simulates data from a single simulation run:
ts <- simulate_model(model, priors, sequence_length = 1e6, recombination_rate = 1e-8, mutation_rate = 1e-8)
ts
#> ╔═══════════════════════════╗
#> ║TreeSequence ║
#> ╠═══════════════╤═══════════╣
#> ║Trees │ 1,670║
#> ╟───────────────┼───────────╢
#> ║Sequence Length│ 1,000,000║
#> ╟───────────────┼───────────╢
#> ║Time Units │generations║
#> ╟───────────────┼───────────╢
#> ║Sample Nodes │ 200║
#> ╟───────────────┼───────────╢
#> ║Total Size │ 467.6 KiB║
#> ╚═══════════════╧═══════════╝
#> ╔═══════════╤═════╤═════════╤════════════╗
#> ║Table │Rows │Size │Has Metadata║
#> ╠═══════════╪═════╪═════════╪════════════╣
#> ║Edges │7,284│227.6 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Individuals│ 100│ 2.8 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Migrations │ 0│ 8 Bytes│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Mutations │1,942│ 70.2 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Nodes │2,122│ 58.0 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Populations│ 4│331 Bytes│ Yes║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Provenances│ 2│ 3.3 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Sites │1,938│ 47.3 KiB│ No║
#> ╚═══════════╧═════╧═════════╧════════════╝With this one simulation data instance ts (in this case, a tree-sequence
object of which the above-mentioned function
simulate_abc() would produce millions, making troubleshooting challenging and
slow),
we can apply individual summary statistic functions using another helper
function summarise_data():
summarise_data(ts, functions)
#> $diversity
#> # A tibble: 4 × 2
#> set diversity
#> <chr> <dbl>
#> 1 A 0.0000792
#> 2 B 0.0000409
#> 3 C 0.000190
#> 4 D 0.000126
#>
#> $divergence
#> # A tibble: 6 × 3
#> x y divergence
#> <chr> <chr> <dbl>
#> 1 A B 0.000270
#> 2 A C 0.000255
#> 3 A D 0.000268
#> 4 B C 0.000206
#> 5 B D 0.000248
#> 6 C D 0.000211
#>
#> $f4
#> # A tibble: 1 × 5
#> W X Y Z f4
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 A B C D -0.0000152By comparing the format of this result to the observed data (given in the
list observed at the beginning of this page), we can make sure
that both simulated and observed summary statistics are mutually compatible,
and can be thus directly compared in later steps of the ABC workflow.
See the reference for a complete overview of the functionality of this package.





