--- title: "How to format data for a conStruct analysis" author: "Gideon Bradburd" date: '`r format(Sys.Date(), "%B %d, %Y")`' output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{format-data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Format data This document describes the format of the data used in a `conStruct` analysis. For information on how to run a `conStruct` analysis after you've formatted your data, see the companion vignette on [how to run conStruct](run-conStruct.html). Throughout the document, I'll be referring to the example dataset included with the package: ```{r} library(conStruct) data(conStruct.data) ``` ## conStruct data There are 3 data objects you need to run a `conStruct` analysis: 1. [allele frequency data] 2. [geographic sampling coordinates] 3. [geographic distance matrix] In the sections below, I walk through the specific format required for each. ### Allele frequency data You must specify a matrix of allele frequency data for your samples. (Make sure the data are of class `matrix`, and that it's not a `data.frame`.) I assume that the data consist of bi-allelic SNPs. At each locus, you pick an allele to count across all samples (it doesn't matter whether it's randomly chosen or whether it's always the major or minor allele). The frequency of the counted allele at a locus in a sample is the number of times the counted allele is observed at a locus divided by the number of chromosomes genotyped in that sample. A sample can consist of a single individual or multiple individuals lumped together. So, a successfully genotyped diploid individual heterozygous at a particular locus would have an allele frequency of 0.5. If the sample is a population of 13 haploids, of which 12 have the counted allele at a given locus, the frequency in that sample at that locus would be 12/13. The matrix of allele frequencies should have one row per sample and one column per locus. Missing data should be denoted with the value `NA`. An small example allele frequency data matrix is shown below: | Sample | Locus1 | Locus2 | Locus3 | Locus4 | Locus5 | Locus6 | Locus7 | Locus8 | Locus9 | Locus10 | |:------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:| | Pop1 | 0 | 1 | NA | 0.8 | 0.7 | 0 | 0 | 0.6 | 0 | 1 | | Pop2 | 0 | 1 | 1 | 0.9 | 1 | 1 | 0.1 | 0.6 | 0 | 0.9 | | Pop3 | 0.2 | 0.75 | 0 | 1 | 1 | NA | 1 | 1 | 0.1 | 1 | | Pop4 | 0.1 | 0.9 | 1 | 1 | 0.8 | 1 | 0.2 | 0.7 | 0.1 | 0.3 | | Pop5 | 0 | 1 | 1 | 1 | 1 | 1 | 0.3 | 0.9 | 0.3 | NA | An full example allele frequency data matrix is included in the `conStruct.data` object included with the package. ```{r} # load the example data object data(conStruct.data) # look at the allele frequency data # for the first 5 populations and 10 loci conStruct.data$allele.frequencies[1:5,1:10] ``` ### Geographic sampling coordinates You must specify a matrix of geographic sampling coordinates, which will be used for plotting the results of the analysis. This should be a matrix with two columns that give the sample x-coordinates (longitude) and y-coordinates (latitude), respectively. The order of rows of the matrix should be the same as the order of the rows of the allele frequencies matrix. If you specify longitude and latitude, they should be in decimal degrees. A full example sampling coordinate data matrix is included in the `conStruct.data` object included with the package. ```{r} # load the example data object data(conStruct.data) # look at the geographic sampling coordinates # for the first 5 populations conStruct.data$coords[1:5,] ``` ### Geographic distance matrix If you choose to run the spatial model implemented in `conStruct`, you must specify a matrix of pairwise geographic distance between all samples. If the coordinates of the samples are real locations on Earth (as opposed to simulated coordinates), I recommend calculating pairwise great-circle distance between sampling coordinates (using, e.g., `geosphere::distm` or `geosphere::distGeo`). The order of the samples in the geographic distance matrix should match both that of the geographic coordinates and that of the allele frequency data matrix, and all three matrices should have the same number of rows. The geographic distance matrix you specify should be the full matrix (that is, not the upper- or lower-triangles), with values of `0` on the diagonal entries. A full example geographic distance matrix between all samples is in the `conStruct.data` object included with the package. ```{r} # load the example data object data(conStruct.data) # look at pariwise geographic distance # between the first 5 populations conStruct.data$geoDist[1:5,1:5] ``` # Other formats to conStruct For convenience, I've written a function to convert population genetic data in STRUCTURE format to that used in `conStruct`. ## STRUCTURE to conStruct The program STRUCTURE is one of the most widely used methods for model-based clustering in population genetics. Many existing programs, including plink (v1.9 and above) and PgdSpider, convert data from diverse formats (including .vcf files) into STRUCTURE format. In this section of the vignette, I walk through an example of converting a STRUCTURE format data file into a `conStruct` format data file. ### STRUCTURE data format More extensive documentation on STRUCTURE's data format can be found in [the STRUCTURE manual](https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/structure_doc.pdf). An example STRUCTURE-formatted dataset is shown below: | | | Loc1 | Loc1 | Loc2 | Loc2 | Loc3 | Loc3 | Loc4 | Loc4 | |:------:|:---:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:| | Ind1 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | -9 | -9 | | Ind2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | Ind3 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | | Ind4 | 2 | -9 | -9 | 1 | 2 | 1 | 1 | 1 | 1 | | Ind5 | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | Example STRUCTURE format dataset, with one row per individual and two columns per locus. The first column gives sample names, the second refers to the sample locations, and the last 8 columns give genotype data for four loci. The numbers in the genotype data refer to the allele present at that locus: A1 = `1`, A2 = `2`, missing = `-9`. To convert a STRUCTURE format file to `conStruct` format, you can use the function `structure2conStruct`, included in the `conStruct` package. Below, I give an example of the usage of this function, assuming that the file containing the STRUCTURE format data is called "myStructureData.str", and that it's on the "desktop" directory on the computer. I also assume that the data are formatted as in the table above, with the genotype data starting at the 3rd column of the data matrix, and missing data denoted with a value of -9. **Note that the STRUCTURE-format data must be a text file and there can be no lines of text before the data table begins. If your file is in an Excel spreadsheet, it can be converted to a text file using Save As > File Format = Tab delimited Text (.txt). If there are lines of text at the top of the document before the data matrix begins, they must be deleted or specified via the `start.samples` argument. In addition, your data can only contain bi-allelic data. If you have loci with more than two alleles, they should be not be included in the dataset. For more information on multi-allelic datasets, see the section on [Microsatellites](#microsatellites) below.** ```{r,eval=FALSE} conStruct.data <- structure2conStruct(infile = "~/Desktop/myStructureData.str", onerowperind = TRUE, start.loci = 3, start.samples = 1, missing.datum = -9, outfile = "~/Desktop/myConStructData") ``` An alternate STRUCTURE data format has two rows and one column per diploid genotype: | | | Loc1 | Loc2 | Loc3 | Loc4 | Loc5 | Loc6 | Loc7 | Loc8 | |:------:|:---:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:| | Ind1 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 0 | 1 | | Ind1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 0 | 1 | | Ind2 | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | | Ind2 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | | Ind3 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | | Ind3 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | | Ind4 | 2 | 2 | 0 | 2 | 2 | 2 | 1 | 0 | 2 | | Ind4 | 2 | 2 | 0 | 2 | 2 | 1 | 1 | 0 | 2 | Example STRUCTURE format dataset, with two rows per individual and one column per locus. The first column gives sample names, the second refers to the sample locations, and the last 8 columns give genotype data for 8 loci. The numbers in the genotype data refer to the allele present at that locus: A1 = `1`, A2 = `2`, missing = `0`. Data in this format can be converted to `conStruct` format using the command below: ```{r,eval=FALSE} conStruct.data <- structure2conStruct(infile = "~/Desktop/myStructureData.str", onerowperind = FALSE, start.loci = 3, start.samples = 1, missing.datum = 0, outfile = "~/Desktop/myConStructData") ``` Further documentation for this function is in its help page, which you can go to using the command `help(structure2conStruct)`. If you wish to group multiple individuals together into a single sample for analysis you can collapse rows of the `conStruct` format data file. For example, if you have 12 individuals from 4 locations (3 individuals from each location), and you wish to analyze the data treating populations at a sampling location as the unit of analysis, you can do something like the following: ```{r,eval=FALSE} pop.data.matrix <- matrix(NA,nrow=4,ncol=ncol(conStruct.data)) for(i in 1:nrow(pop.data.matrix)){ pop.data.matrix[i,] <- colMeans( conStruct.data[ which(pop.index==i),, drop=FALSE ],na.rm=TRUE ) } ``` where `pop.index` is a vector that gives the population of origin for each of the individuals sampled. In the example above, with 12 individuals sampled from 4 locations (3 from each), `pop.index` would be `c(1,1,1,2,2,2,3,3,3,4,4,4)`. ## Microsatellites This method is designed to run on large datasets consisting of bi-allelic SNPs. If you have a microsatellite dataset and you wish to run `conStruct`, the first consideration is whether you have sufficient data. You should have more loci than samples in your data matrix (i.e., your data matrix should have more columns than rows). If that's the case, the second consideration is how to format your microsat data so that you can run conStruct. There are two standard ways of "SNP-ifying" a microsat dataset. The first is to lump all microsatellite alleles present at a locus into two categories: "major" and "other". The "major" allele is the allele that occurs most frequently at a particular locus; all other alleles are put in the "other" bin. You then can create a dataset in which you only report the frequency of the major allele, effectively reducing the number of alleles per locus to 2. This method has the disadvantage of throwing out data, but acknowledges the simplex relationships between alleles at a locus (the sum of the frequencies of all alleles at a locus must be 1). The second approach, introduced by Cavalli-Sforza, is to split out each allele at a locus into a separate pseudo-locus consisting of only that allele. That is, if you had 4 alleles present in the genotyped sample at a particular locus, at frequencies {0.4,0.3,0.1,0.2}, you would split those out into 4 separate columns in your data matrix (pseudo-loci), with frequencies in the sampled population of {0.4,0.3,0.1,0.2}. This approach has the advantage of not throwing data away, but does not acknowledge the inter-allele dependence structure in frequencies, and therefore introduces some pseudoreplication into the dataset. This pseudoreplication may make you overconfident in your results, as the credible intervals on parameter estimates may be artificially narrow. I would recommend trying both approaches, and comparing the estimates of pairwise relatedness you get from each to those derived from the raw microsatellite data to see which best recovers the patterns of relatedness in the data. I also recommend running `conStruct` on datasets SNP-ified using both approaches, and comparing the results.