Title: | Quantifies Effects of Geo/Eco Distance on Genetic Differentiation |
---|---|
Description: | Provides functions that allow users to quantify the relative contributions of geographic and ecological distances to empirical patterns of genetic differentiation on a landscape. Specifically, we use a custom Markov chain Monte Carlo (MCMC) algorithm, which is used to estimate the parameters of the inference model, as well as functions for performing MCMC diagnosis and assessing model adequacy. |
Authors: | Gideon Bradburd |
Maintainer: | Gideon Bradburd <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.6.1 |
Built: | 2025-02-24 04:17:36 UTC |
Source: | https://github.com/cran/BEDASSLE |
This method models the covariance in allele frequencies between populations on a landscape as a decreasing function of their pairwise geographic and ecological distance. Allele frequencies are modeled as a spatial Gaussian process with a parametric covariance function. The parameters of this covariance function, as well as the spatially smoothed allele frequencies, are estimated in a custom Markov chain Monte Carlo.
The two inference functions are MCMC
and MCMC_BB
, which call the
Markov chain Monte Carlo algorithms on the standard and overdispersion (Beta-Binomial)
models, respectively. To evaluate MCMC performance, there are a number of MCMC diagnosis
and visualization functions, which variously show the trace, plots, marginal and joint
marginal densities, and parameter acceptance rates. To evaluate model adequacy, there is
a posterior predictive sample function (posterior.predictive.sample
), and an
accompanying function to plot its output and visually assess the model's ability to
describe the user's data.
Gideon Bradburd
Maintainer: Gideon Bradburd <[email protected]>
Bradburd, G.S., Ralph, P.L., and Coop, G.M. Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 2013.
Internal BEDASSLE functions
These functions are called by other functions (mostly
MCMC
and MCMC_BB
), and will not
be called directly by the user.
This function calculates unbiased (based on Weir and Hill's
,
2002), between all populations/individuals included in the
counts
matrix, and
returns the results in a k
by k
matrix, where k = nrow(counts)
.
Loci for which either of the populations/individuals has missing data (i.e. - the sample
size is zero) are excluded.
calculate.all.pairwise.Fst(allele.counts, sample.sizes)
calculate.all.pairwise.Fst(allele.counts, sample.sizes)
allele.counts |
A matrix of allelic count data, for which |
sample.sizes |
A matrix of sample sizes, for which |
A matrix of pairwise, unbiased .
Gideon Bradburd
Weir,B.S. and W.G. Hill. 2002. Estimating F-statistics. Ann.Rev.Gen. 36:949-952.
#With the HGDP dataset data(HGDP.bedassle.data) #Calculate pairwise Fst between all population pairs hgdp.pairwise.Fst <- calculate.all.pairwise.Fst( HGDP.bedassle.data$allele.counts, HGDP.bedassle.data$sample.sizes ) #Plot pairwise Fst against geographic distance plot(HGDP.bedassle.data$GeoDistance, hgdp.pairwise.Fst, pch=19, col=HGDP.bedassle.data$EcoDistance+1, ylab="pairwise Fst", xlab="geographic distance", main="isolation by distance") legend(x="bottomright",pch=19,col=c(1,2), legend=c("same side of Himalayas", "opposite sides of Himalayas"))
#With the HGDP dataset data(HGDP.bedassle.data) #Calculate pairwise Fst between all population pairs hgdp.pairwise.Fst <- calculate.all.pairwise.Fst( HGDP.bedassle.data$allele.counts, HGDP.bedassle.data$sample.sizes ) #Plot pairwise Fst against geographic distance plot(HGDP.bedassle.data$GeoDistance, hgdp.pairwise.Fst, pch=19, col=HGDP.bedassle.data$EcoDistance+1, ylab="pairwise Fst", xlab="geographic distance", main="isolation by distance") legend(x="bottomright",pch=19,col=c(1,2), legend=c("same side of Himalayas", "opposite sides of Himalayas"))
This function calculates unbiased (based on Weir and Hill's
, 2002), between a pair of populations/individuals. Loci for which either
of the populations/individuals has missing data (i.e. - the sample size is zero) are
excluded.
calculate.pairwise.Fst(allele.counts, sample.sizes)
calculate.pairwise.Fst(allele.counts, sample.sizes)
allele.counts |
A matrix of allele counts of dimensions |
sample.sizes |
A matrix of sample sizes of dimensions |
Pairwise unbiased between a pair of populations/individuals
Gideon Bradburd
Weir,B.S. and W.G. Hill. 2002. Estimating F-statistics. Ann.Rev.Gen. 36:949-952.
#With the HGDP dataset data(HGDP.bedassle.data) #Draw 2 populations at random from the Eurasian HGDP dataset pop1 <- sample(nrow(HGDP.bedassle.data$allele.counts),1) pop2 <- sample(nrow(HGDP.bedassle.data$allele.counts),1) #Calculate unbiased Fst between them pairwise.Fst <- calculate.pairwise.Fst( HGDP.bedassle.data$allele.counts[c(pop1,pop2),], HGDP.bedassle.data$sample.sizes[c(pop1,pop2),] ) #Print that Fst to the console print(sprintf("Fst between the %s population and the %s population is %s", HGDP.bedassle.data$hgdp.metadata[pop1,1], HGDP.bedassle.data$hgdp.metadata[pop2,1], round(pairwise.Fst,3)) )
#With the HGDP dataset data(HGDP.bedassle.data) #Draw 2 populations at random from the Eurasian HGDP dataset pop1 <- sample(nrow(HGDP.bedassle.data$allele.counts),1) pop2 <- sample(nrow(HGDP.bedassle.data$allele.counts),1) #Calculate unbiased Fst between them pairwise.Fst <- calculate.pairwise.Fst( HGDP.bedassle.data$allele.counts[c(pop1,pop2),], HGDP.bedassle.data$sample.sizes[c(pop1,pop2),] ) #Print that Fst to the console print(sprintf("Fst between the %s population and the %s population is %s", HGDP.bedassle.data$hgdp.metadata[pop1,1], HGDP.bedassle.data$hgdp.metadata[pop2,1], round(pairwise.Fst,3)) )
This function parameterizes the decay in covariance of transformed allele frequencies between sampled populations/individuals over their pairwise geographic and ecological distance.
Covariance(a0, aD, aE, a2, GeoDist, EcoDist, delta)
Covariance(a0, aD, aE, a2, GeoDist, EcoDist, delta)
a0 |
This parameter controls the variance when pairwise distance is zero. It is the
variance of the population-specific transformed allelic deviate (theta) when pairwise
distances are zero (i.e. when |
aD |
This parameter gives the effect size of geographic distance ( |
aE |
This parameter gives the effect size(s) of ecological distance(s) ( |
a2 |
This parameter controls the shape of the decay in covariance with distance. |
GeoDist |
Pairwise geographic distance ( |
EcoDist |
Pairwise ecological distance(s) ( |
delta |
This gives the size of the "delta shift" on the off-diagonal elements of the parametric covariance matrix, used to ensure its positive-definiteness (even, for example, when there are separate populations sampled at the same geographic/ecological coordinates). This value must be large enough that the covariance matrix is positive-definite, but, if possible, should be smaller than the smallest off-diagonal distance elements, lest it have an undue impact on inference. If the user is concerned that the delta shift is too large relative to the pairwise distance elements in D and E, she should run subsequent analyses, varying the size of delta, to see if it has an impact on model inference. |
Gideon Bradburd
#With the HGDP dataset data(HGDP.bedassle.data) #Draw random values of the {alpha} parameters from their priors alpha0 <- rgamma(1,shape=1,rate=1) alphaD <- rexp(1,rate=1) alphaE <- matrix(rexp(1,rate=1),nrow=1,ncol=1) alpha2 <- runif(1,0.1,2) #Parameterize the covariance function using the HGDP dataset distances (Geo and Eco) example.covariance <- Covariance(a0 = alpha0,aD = alphaD,aE = alphaE,a2 = alpha2, GeoDist = HGDP.bedassle.data$GeoDistance, EcoDist = list(HGDP.bedassle.data$EcoDistance), delta = 0.001) #Plot the example covariance against geographic distance plot(HGDP.bedassle.data$GeoDistance, example.covariance, pch=19,col=HGDP.bedassle.data$EcoDistance+1, main="Covariance in allele frequencies across the Himalayas") legend(x="topright",pch=19,col=c(1,2), legend=c("same side of Himalayas", "opposite sides of Himalayas"))
#With the HGDP dataset data(HGDP.bedassle.data) #Draw random values of the {alpha} parameters from their priors alpha0 <- rgamma(1,shape=1,rate=1) alphaD <- rexp(1,rate=1) alphaE <- matrix(rexp(1,rate=1),nrow=1,ncol=1) alpha2 <- runif(1,0.1,2) #Parameterize the covariance function using the HGDP dataset distances (Geo and Eco) example.covariance <- Covariance(a0 = alpha0,aD = alphaD,aE = alphaE,a2 = alpha2, GeoDist = HGDP.bedassle.data$GeoDistance, EcoDist = list(HGDP.bedassle.data$EcoDistance), delta = 0.001) #Plot the example covariance against geographic distance plot(HGDP.bedassle.data$GeoDistance, example.covariance, pch=19,col=HGDP.bedassle.data$EcoDistance+1, main="Covariance in allele frequencies across the Himalayas") legend(x="topright",pch=19,col=c(1,2), legend=c("same side of Himalayas", "opposite sides of Himalayas"))
The allelic counts, sample sizes, geographic distances, ecological distances, and population metadata from the 38 human populations used in example BEDASSLE analyses, subsetted from the Human Genome Diversity Panel (HGDP) dataset.
data(HGDP.bedassle.data)
data(HGDP.bedassle.data)
The format is: List of 7
int [1:38, 1:1000] 12 16 5 17 4 14 20 5 34 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:38] "Adygei" "Basque" "Italian" "French" ...
.. ..$ : chr [1:1000] "rs13287637" "rs17792496" "rs1968588" ...
int [1:38, 1:1000] 34 48 24 56 30 50 56 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:38] "Adygei" "Basque" "Italian" "French" ...
.. ..$ : chr [1:1000] "rs13287637" "rs17792496" "rs1968588" ...
num [1:38, 1:38] 0 1.187 0.867 1.101 1.247 ...
num [1:38, 1:38] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:38] "1" "2" "3" "4" ...
.. ..$ : chr [1:38] "1" "2" "3" "4" ...
int 38
int 1000
'data.frame': 38 obs. of 3 variables:
chr [1:38] "Adygei" "Basque" "Italian" ...
chr [1:38] "44" "43" "46" "46" ...
chr [1:38] "39" "0" "10" "2" ...
A matrix of allelic count data, for which nrow =
the number of populations and ncol =
the number of bi-allelic loci
sampled. Each cell gives the number of times allele ‘1’ is observed in each
population. The choice of which allele is allele ‘1’ is arbitrary, but must
be consistent across all populations at a locus.
A matrix of sample sizes, for which nrow =
the number
of populations and ncol =
the number of bi-allelic loci sampled
(i.e. - the dimensions of sample.sizes
must match those of
counts
). Each cell gives the number of chromosomes successfully
genotyped at each locus in each population.
Pairwise geographic distance (). This may be
Euclidean, or, if the geographic scale of sampling merits it, great-circle
distance. In the case of this dataset, it is great-circle distance.
Pairwise ecological distance(s) (), which may
be continuous (e.g. - difference in elevation) or binary (same or opposite
side of some hypothesized barrier to gene flow). In this case, the
ecological distance is binary, representing whether a pair of populations
occurs on the same side, or on opposite sides, of the Himalayas.
The number of populations in the analysis.
This should be equal to nrow(
counts)
. In this dataset, there
are 38 populations sampled.
The number of loci in the analysis. This should be equal
to ncol(
counts)
. In this dataset, there are 1000 loci
sampled.
This data frame contains the metadata on the populations included in the analysis, including:
Population name
Latitude
Longitude
Conrad et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics 2008.
Li et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 2008.
Bradburd, G.S., Ralph, P.L., and Coop, G.M. Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 2013.
## see \command{MCMC}, \command{MCMC_BB}, \command{calculate.pariwise.Fst}, ## \command{calculate.all.pairwise.Fst}, and \command{Covariance} for usage.
## see \command{MCMC}, \command{MCMC_BB}, \command{calculate.pariwise.Fst}, ## \command{calculate.all.pairwise.Fst}, and \command{Covariance} for usage.
Creates a single MCMC output object that links together the output from 2 different runs. To be used when analyses are run serially on a single dataset, with subsequent runs initiated at the parameter values estimated in the last generation of the previous MCMC run.
link.up.posteriors(MCMC.output1, MCMC.output2, linked.up.output.file.name)
link.up.posteriors(MCMC.output1, MCMC.output2, linked.up.output.file.name)
MCMC.output1 |
The first (chronologically) MCMC output to be incorporated into the linked-up output object. |
MCMC.output2 |
The second (chronologically) MCMC output to be incorporated into the linked-up output object. |
linked.up.output.file.name |
The linked-up MCMC object file name. The suffix ".Robj" will be appended to the user-specified name. |
Acceptance rates are re-calculated to be consistent across the new, larger MCMC object. The function is also flexible with respect to the model parameterization (e.g. - it will recognize, for example, whether users have specified the standard or beta-binomial models, or whether users have specified one, or more than one, alphaE parameters).
Gideon Bradburd
This function creates an R object that contains the parameter values read from the
last generation of a previous MCMC run. This R object can then be used to initiate
a subsequent analysis, effectively creating a single long chain. [A single MCMC
object from both runs can be created using the function
link.up.posteriors
].
make.continuing.params(MCMC.output, file.name)
make.continuing.params(MCMC.output, file.name)
MCMC.output |
The standard MCMC output file generated from a BEDASSLE run. |
file.name |
The user-defined name assigned to the R object of parameters to be used in continuing analysis. |
Gideon Bradburd
This function initiates the Markov chain Monte Carlo (MCMC) for the binomial BEDASSLE model.
MCMC(counts, sample_sizes, D, E, k, loci, delta, aD_stp, aE_stp, a2_stp, thetas_stp, mu_stp, ngen, printfreq, savefreq, samplefreq, directory = NULL, prefix = "", continue = FALSE, continuing.params = NULL)
MCMC(counts, sample_sizes, D, E, k, loci, delta, aD_stp, aE_stp, a2_stp, thetas_stp, mu_stp, ngen, printfreq, savefreq, samplefreq, directory = NULL, prefix = "", continue = FALSE, continuing.params = NULL)
counts |
A matrix of allelic count data, for which |
sample_sizes |
A matrix of sample sizes, for which |
D |
Pairwise geographic distance ( |
E |
Pairwise ecological distance(s) ( |
k |
The number of populations in the analysis. This should be equal to
|
loci |
The number of loci in the analysis. This should be equal to
|
delta |
The size of the "delta shift" on the off-diagonal elements of the parametric
covariance matrix, used to ensure its positive-definiteness (even, for example,
when there are separate populations sampled at the same geographic/ecological
coordinates). This value must be large enough that the covariance matrix is
positive-definite, but, if possible, should be smaller than the smallest off-
diagonal distance elements, lest it have an undue impact on inference. If the
user is concerned that the delta shift is too large relative to the pairwise
distance elements in |
aD_stp |
The scale of the tuning parameter on aD (alphaD). The scale of the tuning
parameter is the standard deviation of the normal distribution from which small
perturbations are made to those parameters updated via a random-walk sampler.
A larger value of the scale of the tuning parameter will lead to, on average,
larger proposed moves and lower acceptance rates (for more on acceptance rates,
see |
aE_stp |
The scale of the tuning parameter on aE (alphaE). If there are multiple ecological distances included in the analysis, there will be multiple alphaE parameters (one for each matrix in the list of E). These may be updated all with the same scale of a tuning parameter, or they can each get their own, in which case aE_stp should be a vector of length equal to the number of ecological distance variables. |
a2_stp |
The scale of the tuning parameter on a2 (alpha_2). |
thetas_stp |
The scale of the tuning parameter on the theta parameters. |
mu_stp |
The scale of the tuning parameter on mu. |
ngen |
The number of generations over which to run the MCMC (one parameter is updated at random per generation, with mu, theta, and phi all counting, for the purposes of updates, as one parameter). |
printfreq |
The frequency with which MCMC progress is printed to the screen. If
|
savefreq |
The frequency with which the MCMC saves its output as an R object ( |
samplefreq |
The thinning of the MCMC chain ( |
directory |
If specified, this points to a directory into which output will be saved. |
prefix |
If specified, this prefix will be added to all output file names. |
continue |
If |
continuing.params |
The list of parameter values used to initiate the MCMC if |
This function saves an MCMC output object at intervals specified by savefreq
.
This object may be ported into R working memory using the base function
load
.
As with any MCMC method, it is very important here to perform MCMC diagnosis and
evaluate chain mixing. I have provided a number of MCMC diagnosis graphing functions
for user convenience in visually assessing MCMC output. These include
plot_all_trace
;plot_all_marginals
;
plot_all_joint_marginals
;
and plot_all_acceptance_rates
. To evaluate model adequacy, users should
use posterior.predictive.sample
and
plot_posterior_predictive_sample
. These MCMC diagnosis/model adequacy
functions all call the standard MCMC output R object that the BEDASSLE MCMC generates
as their principal argument.
If users wish to start another MCMC run from where the current run left off, they
should use make.continuing.params
, and initiate the new run with option
continue = TRUE
and the continuing.params
list from the previous run
specified.
Gideon Bradburd
Bradburd, G.S., Ralph, P.L., and Coop, G.M. Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 2013.
#With the HGDP dataset and mcmc operators data(HGDP.bedassle.data) data(mcmc.operators) #The value of delta may set off warnings, #so temporarily disable warnings. op <- options("warn") options(warn = -1) #Call the Markov chain Monte Carlo for the standard model ## Not run: MCMC( counts = HGDP.bedassle.data$allele.counts, sample_sizes = HGDP.bedassle.data$sample.sizes, D = HGDP.bedassle.data$GeoDistance, E = HGDP.bedassle.data$EcoDistance, k = HGDP.bedassle.data$number.of.populations, loci = HGDP.bedassle.data$number.of.loci, delta = mcmc.operators$delta, aD_stp = mcmc.operators$aD_stp, aE_stp = mcmc.operators$aE_stp, a2_stp = mcmc.operators$a2_stp, thetas_stp = mcmc.operators$thetas_stp, mu_stp = mcmc.operators$mu_stp, ngen = mcmc.operators$ngen, printfreq = mcmc.operators$printfreq, savefreq = mcmc.operators$savefreq, samplefreq = mcmc.operators$samplefreq, directory = NULL, prefix = mcmc.operators$prefix, continue = FALSE, continuing.params = NULL ) ## End(Not run) #Re-enable warnings options(op)
#With the HGDP dataset and mcmc operators data(HGDP.bedassle.data) data(mcmc.operators) #The value of delta may set off warnings, #so temporarily disable warnings. op <- options("warn") options(warn = -1) #Call the Markov chain Monte Carlo for the standard model ## Not run: MCMC( counts = HGDP.bedassle.data$allele.counts, sample_sizes = HGDP.bedassle.data$sample.sizes, D = HGDP.bedassle.data$GeoDistance, E = HGDP.bedassle.data$EcoDistance, k = HGDP.bedassle.data$number.of.populations, loci = HGDP.bedassle.data$number.of.loci, delta = mcmc.operators$delta, aD_stp = mcmc.operators$aD_stp, aE_stp = mcmc.operators$aE_stp, a2_stp = mcmc.operators$a2_stp, thetas_stp = mcmc.operators$thetas_stp, mu_stp = mcmc.operators$mu_stp, ngen = mcmc.operators$ngen, printfreq = mcmc.operators$printfreq, savefreq = mcmc.operators$savefreq, samplefreq = mcmc.operators$samplefreq, directory = NULL, prefix = mcmc.operators$prefix, continue = FALSE, continuing.params = NULL ) ## End(Not run) #Re-enable warnings options(op)
This function initiates the Markov chain Monte Carlo (MCMC) for the beta-binomial BEDASSLE model. The beta-binomial model allows populations to diverge from the model's expectations based on their location and their neighbors.
MCMC_BB(counts, sample_sizes, D, E, k, loci, delta, aD_stp, aE_stp, a2_stp, phi_stp, thetas_stp, mu_stp, ngen, printfreq, savefreq, samplefreq, directory = NULL, prefix = "", continue = FALSE, continuing.params = NULL)
MCMC_BB(counts, sample_sizes, D, E, k, loci, delta, aD_stp, aE_stp, a2_stp, phi_stp, thetas_stp, mu_stp, ngen, printfreq, savefreq, samplefreq, directory = NULL, prefix = "", continue = FALSE, continuing.params = NULL)
counts |
A matrix of allelic count data, for which |
sample_sizes |
A matrix of sample sizes, for which |
D |
Pairwise geographic distance ( |
E |
Pairwise ecological distance(s) ( |
k |
The number of populations in the analysis. This should be equal to
|
loci |
The number of loci in the analysis. This should be equal to
|
delta |
The size of the "delta shift" on the off-diagonal elements of the parametric
covariance matrix, used to ensure its positive-definiteness (even, for example,
when there are separate populations sampled at the same geographic/ecological
coordinates). This value must be large enough that the covariance matrix is
positive-definite, but, if possible, should be smaller than the smallest off-
diagonal distance elements, lest it have an undue impact on inference. If the
user is concerned that the delta shift is too large relative to the pairwise
distance elements in |
aD_stp |
The scale of the tuning parameter on aD (alphaD). The scale of the tuning
parameter is the standard deviation of the normal distribution from which small
perturbations are made to those parameters updated via a random-walk sampler.
A larger value of the scale of the tuning parameter will lead to, on average,
larger proposed moves and lower acceptance rates (for more on acceptance rates,
see |
aE_stp |
The scale of the tuning parameter on aE (alphaE). If there are multiple ecological distances included in the analysis, there will be multiple alphaE parameters (one for each matrix in the list of E). These may be updated all with the same scale of a tuning parameter, or they can each get their own, in which case aE_stp should be a vector of length equal to the number of ecological distance variables. |
a2_stp |
The scale of the tuning parameter on a2 (alpha_2). |
phi_stp |
The scale of the tuning parameter on the phi parameters. |
thetas_stp |
The scale of the tuning parameter on the theta parameters. |
mu_stp |
The scale of the tuning parameter on mu. |
ngen |
The number of generations over which to run the MCMC (one parameter is updated at random per generation, with mu, theta, and phi all counting, for the purposes of updates, as one parameter). |
printfreq |
The frequency with which MCMC progress is printed to the screen. If
|
savefreq |
The frequency with which the MCMC saves its output as an R object ( |
samplefreq |
The thinning of the MCMC chain ( |
directory |
If specified, this points to a directory into which output will be saved. |
prefix |
If specified, this prefix will be added to all output file names. |
continue |
If |
continuing.params |
The list of parameter values used to initiate the MCMC if |
This function saves an MCMC output object at intervals specified by savefreq
.
This object may be ported into R working memory using the base function
load
.
As with any MCMC method, it is very important here to perform MCMC diagnosis and
evaluate chain mixing. I have provided a number of MCMC diagnosis graphing functions
for user convenience in visually assessing MCMC output. These include
plot_all_trace
;plot_all_marginals
;
plot_all_joint_marginals
;
and plot_all_acceptance_rates
. To evaluate model adequacy, users should
use posterior.predictive.sample
and
plot_posterior_predictive_sample
. These MCMC diagnosis/model adequacy
functions all call the standard MCMC output R object that the BEDASSLE MCMC generates
as their principal argument.
If users wish to start another MCMC run from where the current run left off, they
should use make.continuing.params
, and initiate the new run with option
continue = TRUE
and the continuing.params
list from the previous
run specified.
Gideon Bradburd
Bradburd, G.S., Ralph, P.L., and Coop, G.M. Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 2013.
#With the HGDP dataset and mcmc operators data(HGDP.bedassle.data) data(mcmc.operators) #The beta-binomial likelihood function may generate "NaNs produced" warnings, #so temporarily disable warnings. op <- options("warn") options(warn = -1) #Call the Markov chain Monte Carlo for the overdispersion model ## Not run: MCMC_BB( counts = HGDP.bedassle.data$allele.counts, sample_sizes = HGDP.bedassle.data$sample.sizes, D = HGDP.bedassle.data$GeoDistance, E = HGDP.bedassle.data$EcoDistance, k = HGDP.bedassle.data$number.of.populations, loci = HGDP.bedassle.data$number.of.loci, delta = mcmc.operators$delta, aD_stp = mcmc.operators$aD_stp, aE_stp = mcmc.operators$aE_stp, a2_stp = mcmc.operators$a2_stp, phi_stp = mcmc.operators$phi_stp, thetas_stp = mcmc.operators$thetas_stp, mu_stp = mcmc.operators$mu_stp, ngen = mcmc.operators$ngen, printfreq = mcmc.operators$printfreq, savefreq = mcmc.operators$savefreq, samplefreq = mcmc.operators$samplefreq, directory = NULL, prefix = mcmc.operators$prefix, continue = FALSE, continuing.params = NULL ) ## End(Not run) #Re-enable warnings options(op)
#With the HGDP dataset and mcmc operators data(HGDP.bedassle.data) data(mcmc.operators) #The beta-binomial likelihood function may generate "NaNs produced" warnings, #so temporarily disable warnings. op <- options("warn") options(warn = -1) #Call the Markov chain Monte Carlo for the overdispersion model ## Not run: MCMC_BB( counts = HGDP.bedassle.data$allele.counts, sample_sizes = HGDP.bedassle.data$sample.sizes, D = HGDP.bedassle.data$GeoDistance, E = HGDP.bedassle.data$EcoDistance, k = HGDP.bedassle.data$number.of.populations, loci = HGDP.bedassle.data$number.of.loci, delta = mcmc.operators$delta, aD_stp = mcmc.operators$aD_stp, aE_stp = mcmc.operators$aE_stp, a2_stp = mcmc.operators$a2_stp, phi_stp = mcmc.operators$phi_stp, thetas_stp = mcmc.operators$thetas_stp, mu_stp = mcmc.operators$mu_stp, ngen = mcmc.operators$ngen, printfreq = mcmc.operators$printfreq, savefreq = mcmc.operators$savefreq, samplefreq = mcmc.operators$samplefreq, directory = NULL, prefix = mcmc.operators$prefix, continue = FALSE, continuing.params = NULL ) ## End(Not run) #Re-enable warnings options(op)
These parameters, which are passed to the MCMC
and MCMC_BB
functions, control the operation of the MCMC. They specify the number of
generations over which the MCMC runs; the scales of the tuning parameters (stp)
for all parameters updated via random-walk samplers; the save, print, and
sample frequency of the chain, and the output file names.
data(mcmc.operators)
data(mcmc.operators)
The format is: List of 12
num 0.001
num 0.0018
num 0.04
num 0.0035
num 30
num 0.07
num 0.17
num 100
num 2
num 100
num 5
chr "example_"
The size of the "delta shift" on the off-diagonal elements of the
parametric covariance matrix, used to ensure its positive-definiteness (even,
for example, when there are separate populations sampled at the same
geographic/ecological coordinates). This value must be large enough that the
covariance matrix is positive-definite, but, if possible, should be smaller
than the smallest off-diagonal distance elements, lest it have an undue
impact on inference. If the user is concerned that the delta shift is too
large relative to the pairwise distance elements in D
and E
,
she should run subsequent analyses, varying the size of delta, to see if it
has an impact on model inference.
The scale of the tuning parameter on aD (alphaD). The scale of the
tuning parameter is the standard deviation of the normal distribution from
which small perturbations are made to those parameters updated via a
random-walk sampler. A larger value of the scale of the tuning parameter will
lead to, on average, larger proposed moves and lower acceptance rates (for
more on acceptance rates, see plot_acceptance_rate
).
The scale of the tuning parameter on aE (alphaE). If there are multiple ecological distances included in the analysis, there will be multiple alphaE parameters (one for each matrix in the list of E). These may be updated all with the same scale of a tuning parameter, or they can each get their own, in which case aE_stp should be a vector of length equal to the number of ecological distance variables.
The scale of the tuning parameter on a2 (alpha_2).
The scale of the tuning parameter on the phi parameters.
The scale of the tuning parameter on the theta parameters.
The scale of the tuning parameter on mu.
The number of generations over which to run the MCMC (one parameter is updated at random per generation, with mu, theta, and phi all counting, for the purposes of updates, as one parameter).
The frequency with which MCMC progress is printed to the
screen. If printfreq =1000
, an update with the MCMC generation number
and the posterior probability at that generation will print to the screen
every 1000 generations.
The frequency with which the MCMC saves its output as an R
object (savefreq =50,000
means that MCMC output is saved every 50,000
generations). If ngen
is large, this saving process may be
computationally expensive, and so should not be performed too frequently.
However, users may wish to evalute MCMC performance while the chain is still
running, or may be forced to truncate runs early, and should therefore
specify a savefreq
that is less than ngen
. We recommend a
savefreq
of between 1/10th and 1/20th of ngen
.
The thinning of the MCMC chain (samplefreq = 1000
means
that the parameter values saved in the MCMC output are sampled once every
1000 generations). A higher samplefreq
will decrease parameter
autocorrelation time. However, there is still information in autocorrelated
draws from the joint posterior, so the samplefreq
should be viewed
merely as a computational convenience, to decrease the size of the MCMC
output objects.
If specified, this prefix will be added to all output file names.
## see \command{MCMC} and \command{MCMC_BB} for example usage.
## see \command{MCMC} and \command{MCMC_BB} for example usage.
Creates a plot showing the proportion of proposed moves to accepted moves over the duration of the MCMC analysis.
plot_acceptance_rate(accepted.moves, proposed.moves, param.name = deparse(substitute(accepted.moves)))
plot_acceptance_rate(accepted.moves, proposed.moves, param.name = deparse(substitute(accepted.moves)))
accepted.moves |
A vector giving the number of accepted random-walk moves at each sampled MCMC generation. |
proposed.moves |
A vector giving the number of proposed random-walk moves at each sampled MCMC generation. |
param.name |
The name of the parameter for which the trace plot is being displayed. |
For optimal mixing, between ~20 samplers should be accepted. If the acceptance rates fall outside that range, this function will automatically highlight that parameter as a potential instance of poor mixing. If the acceptance rates are too low, then for subsequent analyses the user should decrease the scale of the tuning parameter (or "std," as in, e.g., "aD_std"), and if acceptance rates are too high, the user should increase the scale of the tuning parameter. The scale of the tuning parameter is the standard deviation of the normal distribution from which the small random variable is drawn and added to the current parameter value to propose a move. If the acceptance rate has not plateaued by the end of an analysis, it is an indication that the chain may still be "going somewhere" in parameter space, and subsequent analyses should be performed.
Gideon Bradburd
Creates a series of plots showing the proportion of proposed moves to accepted moves over the duration of the MCMC analysis for each parameter updated via a random-walk sampler.
plot_all_acceptance_rates(MCMC.output)
plot_all_acceptance_rates(MCMC.output)
MCMC.output |
The standard MCMC output file generated from a BEDASSLE run. |
For optimal mixing, between ~20
samplers should be accepted. If the acceptance rates fall outside that range, this
function will automatically highlight that parameter as a potential instance of poor
mixing. If the acceptance rates are too low, then for subsequent analyses the user
should decrease the scale of the tuning parameter (or "std," as in, e.g.,
aD_std
), and if acceptance rates are too high, the user should increase
the scale of the tuning parameter. The scale of the tuning parameter is the standard
deviation of the normal distribution from which the small random variable is drawn
and added to the current parameter value to propose a move. If the acceptance rate
has not plateaued by the end of an analysis, it is an indication that the chain may
still be "going somewhere" in parameter space, and subsequent analyses should be
performed.
Gideon Bradburd
For each sampled MCMC generation, the values estimated for a pair of parameters are logged and plotted against one another. Points are color coded by when in the analysis they were sampled, so that users can visually assess mixing. A joint marginal plot is generated for all combinations of parameters, excluding the phi parameters estimated in the beta-binomial model.
plot_all_joint_marginals(MCMC.output, percent.burnin = 0, thinning = 1)
plot_all_joint_marginals(MCMC.output, percent.burnin = 0, thinning = 1)
MCMC.output |
The standard MCMC output file generated from a BEDASSLE run. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
Visualizations of the joint marginal distributions allow users to (1) assess how well the MCMC is mixing, and (2) potentially diagnose instances of non-identifiability in the model. Strong linear trends in the joint marginal, or visible "ridges" in the likelihood surface, may be indicative of parameter non-identifiability, in which multiple combinations of values of these two parameters provide equally reasonable fits to the data.
Gideon Bradburd
Plots the posterior marginal density of all parameters. Users may specify whether they want a histogram, a density, or both.
plot_all_marginals(MCMC.output, percent.burnin = 0, thinning = 1, population.names = NULL)
plot_all_marginals(MCMC.output, percent.burnin = 0, thinning = 1, population.names = NULL)
MCMC.output |
The standard MCMC output file generated from a BEDASSLE run. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
population.names |
A vector of length |
The marginal plot is another basic visual tool for MCMC diagnosis. Users should look for marginal plots that are "smooth as eggs" (indicating that the chain has been run long enough) and unimodal (indicating a single peak in the likelihood surface).
Gideon Bradburd
Plots the posterior marginal densities of all phi parameters. Users may specify
whether they want a histogram, a density, or both. For convenience, the
statistic is presented in place of the phi parameter, as this is the statistic users
care about.
is defined as
.
plot_all_phi_marginals(phi_mat, percent.burnin = 0, thinning = 1, population.names = NULL, pop.index= NULL, histogram = TRUE, density = TRUE)
plot_all_phi_marginals(phi_mat, percent.burnin = 0, thinning = 1, population.names = NULL, pop.index= NULL, histogram = TRUE, density = TRUE)
phi_mat |
The |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
population.names |
A vector of length |
pop.index |
A population index number generated to title a marginal plot if no
|
histogram |
A switch that controls whether or not the plot contains a histogram of the values
estimated for the parameter over the course of the MCMC. Default is |
density |
A switch that controls whether or not the plot shows the density of the values
estimated for the parameter over the course of the MCMC. Default is |
The marginal plot is another basic visual tool for MCMC diagnosis. Users should look for marginal plots that are "smooth as eggs" (indicating that the chain has been run long enough) and unimodal (indicating a single peak in the likelihood surface).
Gideon Bradburd
Plots all trace plots for the phi parameters in all populations. For convenience,
the statistic is presented in place of the phi parameter, as this is the
statistic users care about.
is defined as
.
plot_all_phi_trace(phi_mat, percent.burnin = 0, thinning = 1, population.names = NULL)
plot_all_phi_trace(phi_mat, percent.burnin = 0, thinning = 1, population.names = NULL)
phi_mat |
The |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
population.names |
A vector of length |
A trace plot is a basic visual tool for assessing MCMC mixing. If the chain is mixing well, the trace plot will resemble a "fuzzy caterpillar." If the trace plot has not plateaued, it is an indication that the chain has not converged on the stationary posterior distribution, and must be run longer. If the trace plot of a parameter exhibits high autocorrelation, the user may wish to either increase or decrease the scale of the tuning parameter on that parameter, to decrease or increase acceptance rates, respectively. If the chain appears to be bouncing between areas of "fuzzy caterpillar-dom," it may be an indication of a multi-modal likelihood surface.
Gideon Bradburd
This function plots the parameter value estimated in each sampled generation of the MCMC against the index of that sampled generation for each parameter in the model.
plot_all_trace(MCMC.output, percent.burnin = 0, thinning = 1, population.names = NULL)
plot_all_trace(MCMC.output, percent.burnin = 0, thinning = 1, population.names = NULL)
MCMC.output |
The standard MCMC output file generated from a BEDASSLE run. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
population.names |
A vector of length |
A trace plot is a basic visual tool for assessing MCMC mixing. If the chain is mixing well, the trace plot will resemble a "fuzzy caterpillar." If the trace plot has not plateaued, it is an indication that the chain has not converged on the stationary posterior distribution, and must be run longer. If the trace plot of a parameter exhibits high autocorrelation, the user may wish to either increase or decrease the scale of the tuning parameter on that parameter, to decrease or increase acceptance rates, respectively. If the chain appears to be bouncing between areas of "fuzzy caterpillar-dom," it may be an indication of a multi-modal likelihood surface.
Gideon Bradburd
For each sampled MCMC generation, the values estimated for a pair of parameters are logged and plotted against one another. Points are color coded by when in the analysis they were sampled, so that users can visually assess mixing.
plot_joint_marginal(parameter1, parameter2, percent.burnin = 0, thinning = 1, param.name1 = deparse(substitute(parameter1)), param.name2 = deparse(substitute(parameter2)))
plot_joint_marginal(parameter1, parameter2, percent.burnin = 0, thinning = 1, param.name1 = deparse(substitute(parameter1)), param.name2 = deparse(substitute(parameter2)))
parameter1 |
One of the two parameters for which the joint marginal is being plotted. |
parameter2 |
The other of the two parameters for which the joint marginal is being plotted. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
param.name1 |
The name of one of the two parameters for which the joint marginal is being plotted. |
param.name2 |
The name of the other of the two parameters for which the joint marginal is being plotted. |
Visualizations of the joint marginal distribution allow users to (1) assess how well the MCMC is mixing, and (2) potentially diagnose instances of non-identifiability in the model. Strong linear trends in the joint marginal, or visible "ridges" in the likelihood surface, may be indicative of parameter non-identifiability, in which multiple combinations of values of these two parameters provide equally reasonable fits to the data.
Gideon Bradburd
Plots the posterior marginal density of a parameter. Users may specify whether they want a histogram, a density, or both.
plot_marginal(parameter, percent.burnin = 0, thinning = 1, histogram = TRUE, density = TRUE, population.names = NULL, param.name = deparse(substitute(parameter)))
plot_marginal(parameter, percent.burnin = 0, thinning = 1, histogram = TRUE, density = TRUE, population.names = NULL, param.name = deparse(substitute(parameter)))
parameter |
The parameter for which the marginal plot is being generated. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
histogram |
A switch that controls whether or not the plot contains a histogram of the values
estimated for the parameter over the course of the MCMC. Default is |
density |
A switch that controls whether or not the plot shows the density of the values
estimated for the parameter over the course of the MCMC. Default is |
population.names |
A vector of length |
param.name |
The name of the parameter for which the trace plot is being displayed. |
The marginal plot is another basic visual tool for MCMC diagnosis. Users should look for marginal plots that are "smooth as eggs" (indicating that the chain has been run long enough) and unimodal (indicating a single peak in the likelihood surface).
Gideon Bradburd
Plots the posterior marginal density of a phi parameter. Users may specify whether
they want a histogram, a density, or both. For convenience, the
statistic is presented in place of the phi parameter, as this is the statistic users
care about.
is defined as
.
plot_phi_marginal(phi, percent.burnin = 0, thinning = 1, population.names = NULL, pop.index = NULL,histogram = TRUE, density = TRUE)
plot_phi_marginal(phi, percent.burnin = 0, thinning = 1, population.names = NULL, pop.index = NULL,histogram = TRUE, density = TRUE)
phi |
The vector of phi values estimated for a single population from an MCMC run. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
population.names |
The name of the population/individual for which the marginal density of the phi
parameter is being plotted. This will be used to title the marginal plot. If
|
pop.index |
A population index number generated to title a marginal plot if no
|
histogram |
A switch that controls whether or not the plot contains a histogram of the values
estimated for the parameter over the course of the MCMC. Default is |
density |
A switch that controls whether or not the plot shows the density of the values
estimated for the parameter over the course of the MCMC. Default is |
The marginal plot is another basic visual tool for MCMC diagnosis. Users should look for marginal plots that are "smooth as eggs" (indicating that the chain has been run long enough) and unimodal (indicating a single peak in the likelihood surface).
Gideon Bradburd
This function plots the phi parameter value estimated in each sampled generation of
the MCMC against the index of that sampled generation. For convenience, the
statistic is presented in place of the phi parameter, as this is the
statistic users care about.
is defined as
.
plot_phi_trace(phi, percent.burnin = 0, thinning = 1, population.names = NULL, pop.index = NULL)
plot_phi_trace(phi, percent.burnin = 0, thinning = 1, population.names = NULL, pop.index = NULL)
phi |
The vector of phi values estimated for a single population from an MCMC run. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
population.names |
The name of the population/individual for which the marginal density of the phi
parameter is being plotted. This will be used to title the marginal plot. If
|
pop.index |
A population index number generated to title a marginal plot if no
|
A trace plot is a basic visual tool for assessing MCMC mixing. If the chain is mixing well, the trace plot will resemble a "fuzzy caterpillar." If the trace plot has not plateaued, it is an indication that the chain has not converged on the stationary posterior distribution, and must be run longer. If the trace plot of a parameter exhibits high autocorrelation, the user may wish to either increase or decrease the scale of the tuning parameter on that parameter, to decrease or increase acceptance rates, respectively. If the chain appears to be bouncing between areas of "fuzzy caterpillar-dom," it may be an indication of a multi-modal likelihood surface.
Gideon Bradburd
This function plots the posterior predictive samples generated by
posterior.predictive.sample
around the observed data, so that users can
evaluate how well the model is able to describe their data.
plot_posterior_predictive_samples(posterior.predictive.sample.file, save.figure = NULL, figure.name = NULL)
plot_posterior_predictive_samples(posterior.predictive.sample.file, save.figure = NULL, figure.name = NULL)
posterior.predictive.sample.file |
The output file generated by the function |
save.figure |
If |
figure.name |
If |
This function plots posterior predictive unbiased pairwise around the
observed unbiased pairwise
values to determine how well the model is
able to describe the user's data. Users should examine these plots to make sure that
the model is picking up general trends (e.g. the slopes of isolation by geographic
distance and isolation by ecological distance), and also to identify specific
populations whose relationships with their neighbors are being poorly accommodated by
the model.
Gideon Bradburd
This function plots the parameter value estimated in each sampled generation of the MCMC against the index of that sampled generation.
plot_trace(parameter, percent.burnin = 0, thinning = 1, param.name = deparse(substitute(parameter)))
plot_trace(parameter, percent.burnin = 0, thinning = 1, param.name = deparse(substitute(parameter)))
parameter |
The parameter for which the trace plot is being generated. |
percent.burnin |
The percent of the sampled MCMC generations to be discarded as "burn-in." If the
MCMC is run for 1,000,000 generations, and sampled every 1,000 generations, there
will be 1,000 sampled generations. A |
thinning |
The multiple by which the sampled MCMC generations are thinned. A |
param.name |
The name of the parameter for which the trace plot is being displayed. |
A trace plot is a basic visual tool for assessing MCMC mixing. If the chain is mixing well, the trace plot will resemble a "fuzzy caterpillar." If the trace plot has not plateaued, it is an indication that the chain has not converged on the stationary posterior distribution, and must be run longer. If the trace plot of a parameter exhibits high autocorrelation, the user may wish to either increase or decrease the scale of the tuning parameter on that parameter, to decrease or increase acceptance rates, respectively. If the chain appears to be bouncing between areas of "fuzzy caterpillar-dom," it may be an indication of a multi-modal likelihood surface.
Gideon Bradburd
This function simulates data using the inference model parameterized from the joint
posterior of the MCMC and the observed independent variables
(). These posterior predictive samples can be compared to the
observed data to see how well the model is able to describe the observed data.
posterior.predictive.sample(MCMC.output, posterior.predictive.sample.size, output.file, prefix = "")
posterior.predictive.sample(MCMC.output, posterior.predictive.sample.size, output.file, prefix = "")
MCMC.output |
The standard MCMC output file generated from a BEDASSLE run. |
posterior.predictive.sample.size |
The number of posterior predictive datasets the user wishes to simulate. |
output.file |
The name that will be assigned to the R object containing the posterior predictive datasets. The suffix ".Robj" will be appended to the user-specified name. |
prefix |
If specified, this prefix will be added to the output file name. |
This function simulates datasets like those the user analyzed with BEDASSLE, using
the same independent variables (sample.sizes
, and
)
as in the user's dataset and plugging them into the inference model, which is
parameterized by randomly drawing parameter values from the joint posterior output of
the MCMC analysis. These posterior predictive simulated allelic count data are
summarized as unbiased pairwise
(using
calculate.all.pairwise.Fst
), which may then be compared to the
observed unbiased pairwise to determine how well the model is able to
describe the user's data. The output of
posterior.predictive.sample
can be
visualized using plot.posterior.predictive.sample
.
Gideon Bradburd