Chapter 2 Quick start

Natália Martínková and Stuart J. E. Baird

The package diemr incorporates the diagnostic index expectation maximisation algorithm used to estimate which genomic alleles belong to which side of a barrier to geneflow. To start using diemr, load the package or install it from CRAN if it is not yet available:

# Attempt to load a package, if the package was not available, install and load
if(!require("diemr", character.only = TRUE)){
    install.packages("diemr", dependencies = TRUE)
    library("diemr", character.only = TRUE)
}

Next, assemble paths to all files containing the data to be used by diemr. Here, we will use a tiny example dataset that is included in the package for illustrating the analysis workflow. A good practice is to check that all files contain data in the correct format for all individuals and markers. Additionally, the analysis will need a list with ploidies for all genomic compartments and individuals, and a vector with indices of samples that will be included in the analysis.

filepaths <- system.file("extdata", "data7x3.txt", package = "diemr")
# Analysing six individuals
samples <- 1:6
# Assuming diploid markers of all individuals
ploidies = rep(list(rep(2, nchar(readLines(filepaths[1])[1]) - 1)), length(filepaths))
CheckDiemFormat(files = filepaths, 
                ChosenInds = samples,
                ploidy = ploidies)
# File check passed: TRUE
# Ploidy check passed: TRUE

If the CheckDiemFormat() function fails, work through the error messages and fix the stored input files accordingly. The algorithm repeatedly accesses data from the harddisk, so seeing the passed file and variable check prior to the analysis is critical.

To estimate the marker polarities, their diagnostic indices and support, run the function diem() with default settings. Here, we have only one file with data, so paralelisation is unnecessary, and we set nCores = 1. The Windows operating system only allows for nCores = 1. Other operating systems can process multiple genomic compartments (e.g.  chromosomes) in parallel, the analysis of different genomic compartment files running on multiple processors.

res <- diem(files = filepaths, 
            ploidy = ploidies, 
            markerPolarity = list(c(FALSE, FALSE, TRUE)),
            ChosenInds = samples, 
            nCores = 1)

The result is a list, where the element res$DI contains a table with marker polarities, their diagnostic indices and support.

res$DI
#   newPolarity         DI   Support Marker
# 1       FALSE  -4.872256 15.930181      1
# 2        TRUE  -3.520647 18.633399      2
# 3        TRUE -13.274822  6.130625      3

The column newPolarity means that marker 1 should be imported for subsequent analyses as is, and markers 2 and 3 should be imported with 0 replaced with 2 and 2 replaced with 0 (hereafter ‘flipped’ 0↔︎2). The marker 3 has the lowest diagnostic index and low support, indicating that the genotypes scored at this marker are poorly related to the barrier to geneflow arbitrated by the data.

With the marker polarities optimised to detect a barrier to geneflow, a plot of the polarised genome will show how genomic regions cross the barrier. First, the genotypes need to be imported with optimal marker polarities. Second, individual hybrid indices need to be calculated from the polarised genotypes. And last, the data will be plotted as a raster image.

genotypes <- importPolarized(file = filepaths, 
                             changePolarity = res$markerPolarity, 
                             ChosenInds = samples)
                             
h <- apply(X = res$I4, 
           MARGIN = 1, 
           FUN = pHetErrOnStateCount)[1, ]
           
plotPolarized(genotypes = genotypes,
              HI = h[samples])

CAUTION: This is just a quick start to get you started! For real datasets you will use the diagnostic index (DI) to filter the full set of sites you have analysed with diem in order to plot only those markers relevant to any barrier detected in the analysis.