Title: | Genotype Quality Control with 'PLINK' |
---|---|
Description: | Genotyping arrays enable the direct measurement of an individuals genotype at thousands of markers. 'plinkQC' facilitates genotype quality control for genetic association studies as described by Anderson and colleagues (2010) <doi:10.1038/nprot.2010.116>. It makes 'PLINK' basic statistics (e.g. missing genotyping rates per individual, allele frequencies per genetic marker) and relationship functions accessible from 'R' and generates a per-individual and per-marker quality control report. Individuals and markers that fail the quality control can subsequently be removed to generate a new, clean dataset. Removal of individuals based on relationship status is optimised to retain as many individuals as possible in the study. |
Authors: | Hannah Meyer [aut, cre] |
Maintainer: | Hannah Meyer <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.4 |
Built: | 2025-01-27 04:41:06 UTC |
Source: | https://github.com/meyer-lab-cshl/plinkqc |
Runs and evaluates results of plink –pca on merged genotypes from
individuals to be QCed and individuals of reference population of known
genotypes. Currently, check ancestry only supports automatic selection of
individuals of European descent. It uses information from principal
components 1 and 2 returned by plink –pca to find the center of the European
reference samples (mean(PC1_europeanRef), mean(PC2_europeanRef). It then
computes the maximum Euclidean distance (maxDist) of the European reference
samples from this centre. All study samples whose Euclidean distance from the
centre falls outside the circle described by the radius r=europeanTh* maxDist
are considered non-European and their IDs are returned as failing the
ancestry check.
check_ancestry
creates a scatter plot of PC1 versus PC2 colour-coded
for samples of the reference populations and the study population.
check_ancestry( indir, name, qcdir = indir, prefixMergedDataset, europeanTh = 1.5, defaultRefSamples = c("HapMap", "1000Genomes"), refPopulation = c("CEU", "TSI"), refSamples = NULL, refColors = NULL, refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID", refSamplesPop = "Pop", refColorsColor = "Color", refColorsPop = "Pop", studyColor = "#2c7bb6", legend_labels_per_row = 6, run.check_ancestry = TRUE, interactive = FALSE, verbose = verbose, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, path2plink = NULL, showPlinkOutput = TRUE )
check_ancestry( indir, name, qcdir = indir, prefixMergedDataset, europeanTh = 1.5, defaultRefSamples = c("HapMap", "1000Genomes"), refPopulation = c("CEU", "TSI"), refSamples = NULL, refColors = NULL, refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID", refSamplesPop = "Pop", refColorsColor = "Color", refColorsPop = "Pop", studyColor = "#2c7bb6", legend_labels_per_row = 6, run.check_ancestry = TRUE, interactive = FALSE, verbose = verbose, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, path2plink = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] prefix of plink files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where prefixMergedDataset.eigenvec results as returned by plink –pca should be saved. Per default qcdir=indir. If run.check_ancestry is FALSE, it is assumed that plink –pca prefixMergedDataset has been run and qcdir/prefixMergedDataset.eigenvec exists.User needs writing permission to qcdir. |
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink –pca, resulting in prefixMergedDataset.eigenvec. |
europeanTh |
[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or 1000Genomes is specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refPopulation |
[vector] Vector with population identifiers of European reference population. Identifiers have to be corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. Either refSamples or refSamplesFile have to be specified. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile) default colors are used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor].If not provided and is.null(refColors) default colors for are used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population. |
legend_labels_per_row |
[integer] Number of population names per row in PCA plot. |
run.check_ancestry |
[logical] Should plink –pca be run to
determine principal components of merged dataset; if FALSE, it is assumed
that plink –pca has been run successfully and
qcdir/prefixMergedDataset.eigenvec is present;
|
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(p=p_ancestry, other_arguments) or pdf(outfile) print(p_ancestry) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_ancestry); all highlight_samples IIDs have to be present in the IIDs of the prefixMergedDataset.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Named [list] with i) fail_ancestry, containing a [data.frame] with FID and IID of non-European individuals and ii) p_ancestry, a ggplot2-object 'containing' a scatter plot of PC1 versus PC2 colour-coded for samples of the reference populations and the study population, which can be shown by print(p_ancestry).
## Not run: indir <- system.file("extdata", package="plinkQC") name <- "data" fail_ancestry <- check_ancestry(indir=indir, name=name, refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, run.check_ancestry=FALSE) # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_ancestry <- check_ancestry(indir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, highlight_samples = highlight_samples[,2], run.check_ancestry=FALSE, highlight_type = c("text", "shape")) ## End(Not run)
## Not run: indir <- system.file("extdata", package="plinkQC") name <- "data" fail_ancestry <- check_ancestry(indir=indir, name=name, refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, run.check_ancestry=FALSE) # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_ancestry <- check_ancestry(indir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, highlight_samples = highlight_samples[,2], run.check_ancestry=FALSE, highlight_type = c("text", "shape")) ## End(Not run)
Runs and evaluates results from plink –missing (missing genotype rates
per individual) and plink –het (heterozygosity rates per individual).
Non-systematic failures in genotyping and outlying heterozygosity (hz) rates
per individual are often proxies for DNA sample quality. Larger than expected
heterozygosity can indicate possible DNA contamination.
The mean heterozygosity in PLINK is computed as hz_mean = (N-O)/N, where
N: number of non-missing genotypes and O:observed number of homozygous
genotypes for a given individual.
Mean heterozygosity can differ between populations and SNP genotyping panels.
Within a population and genotyping panel, a reduced heterozygosity rate can
indicate inbreeding - these individuals will then likely be returned by
check_relatedness
as individuals that fail the relatedness
filters. check_het_and_miss
creates a scatter plot with the
individuals' missingness rates on x-axis and their heterozygosity rates on
the y-axis.
check_het_and_miss( indir, name, qcdir = indir, imissTh = 0.03, hetTh = 3, run.check_het_and_miss = TRUE, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, interactive = FALSE, verbose = FALSE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, path2plink = NULL, showPlinkOutput = TRUE )
check_het_and_miss( indir, name, qcdir = indir, imissTh = 0.03, hetTh = 3, run.check_het_and_miss = TRUE, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, interactive = FALSE, verbose = FALSE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, path2plink = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.het and name.imiss. |
qcdir |
[character] /path/to/directory where name.het as returned by plink –het and name.imiss as returned by plink –missing will be saved. Per default qcdir=indir. If run.check_het_and_miss is FALSE, it is assumed that plink –missing and plink –het have been run and qcdir/name.imiss and qcdir/name.het are present. User needs writing permission to qcdir. |
imissTh |
[double] Threshold for acceptable missing genotype rate per individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity per individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
run.check_het_and_miss |
[logical] Should plink –missing and plink
–het be run to determine genotype missingness and heterozygosity rates; if
FALSE, it is assumed that plink –missing and plink –het have been run and
qcdir/name.imiss and qcdir/name.het are present;
|
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_het_imiss); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_het_imiss) via ggplot2::ggsave(p=p_het_imiss , other_arguments) or pdf(outfile) print(p_het_imiss) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
check_het_and_miss
wraps around
run_check_missingness
,
run_check_heterozygosity
and
evaluate_check_het_and_miss
.
If run.check_het_and_miss is TRUE, run_check_heterozygosity
and
run_check_missingness
are executed ; otherwise it is assumed
that plink –missing and plink –het have been run externally and
qcdir/name.het and qcdir/name.imiss exist. check_het_and_miss
will fail with missing file error otherwise.
For details on the output data.frame fail_imiss and fail_het, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#imiss and https://www.cog-genomics.org/plink/1.9/formats#het
Named [list] with i) fail_imiss [data.frame] containing FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (Number of missing genotype call(s), not including obligatory missings), N_GENO (Number of potentially valid call(s)), F_MISS (Missing call rate) of individuals failing missing genotype check and ii) fail_het [data.frame] containing FID (Family ID), IID (Within-family ID), O(HOM) (Observed number of homozygotes), E(HOM) (Expected number of homozygotes), N(NM) (Number of non-missing autosomal genotypes), F (Method-of-moments F coefficient estimate) of individuals failing outlying heterozygosity check and iii) p_het_imiss, a ggplot2-object 'containing' a scatter plot with the samples' missingness rates on x-axis and their heterozygosity rates on the y-axis, which can be shown by print(p_het_imiss).
## Not run: indir <- system.file("extdata", package="plinkQC") name <- "data" path2plink <- "path/to/plink" # whole dataset fail_het_miss <- check_het_and_miss(indir=indir, name=name, interactive=FALSE,path2plink=path2plink) # subset of dataset with sample highlighting highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") fail_het_miss <- check_het_and_miss(indir=indir, name=name, interactive=FALSE,path2plink=path2plink, remove_individuals=remove_individuals_file, highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape")) ## End(Not run)
## Not run: indir <- system.file("extdata", package="plinkQC") name <- "data" path2plink <- "path/to/plink" # whole dataset fail_het_miss <- check_het_and_miss(indir=indir, name=name, interactive=FALSE,path2plink=path2plink) # subset of dataset with sample highlighting highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") fail_het_miss <- check_het_and_miss(indir=indir, name=name, interactive=FALSE,path2plink=path2plink, remove_individuals=remove_individuals_file, highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape")) ## End(Not run)
Runs and evaluates results from plink –hardy. It calculates the observed and
expected heterozygote frequencies for all variants in the individuals that
passed the perIndividualQC
and computes the deviation of the
frequencies from Hardy-Weinberg equilibrium (HWE) by HWE exact test. The
p-values of the HWE exact test are displayed as histograms (stratified by
all and low p-values), where the hweTh is used to depict the quality control
cut-off for SNPs.
check_hwe( indir, name, qcdir = indir, hweTh = 1e-05, interactive = FALSE, path2plink = NULL, verbose = FALSE, showPlinkOutput = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9 )
check_hwe( indir, name, qcdir = indir, hweTh = 1e-05, interactive = FALSE, path2plink = NULL, verbose = FALSE, showPlinkOutput = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9 )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
hweTh |
[double] Significance threshold for deviation from HWE. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_hwe) via ggplot2::ggsave(p=p_hwe, other_arguments) or pdf(outfile) print(p_hwe) dev.off(). |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
verbose |
[logical] If TRUE, progress info is printed to standard out and specifically, if TRUE, plink log will be displayed. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
check_hwe
uses plink –remove name.fail.IDs –hardy to
calculate the observed and expected heterozygote frequencies per SNP in the
individuals that passed the perIndividualQC
. It does so
without generating a new dataset but simply removes the IDs when calculating
the statistics.
For details on the output data.frame fail_hwe, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#hwe.
Named list with i) fail_hwe containing a [data.frame] with CHR (Chromosome code), SNP (Variant identifier), TEST (Type of test: one of 'ALL', 'AFF', 'UNAFF', 'ALL(QT)', 'ALL(NP)'), A1 (Allele 1; usually minor), A2 (Allele 2; usually major), GENO ('/'-separated genotype counts: A1 hom, het, A2 hom), O(HET) (Observed heterozygote frequency E(HET) (Expected heterozygote frequency), P (Hardy-Weinberg equilibrium exact test p-value) for all SNPs that failed the hweTh and ii) p_hwe, a ggplot2-object 'containing' the HWE p-value distribution histogram which can be shown by (print(p_hwe)).
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # run on all individuals and markers fail_hwe <- check_hwe(indir=indir, qcdir=qcdir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") extract_markers_file <- system.file("extdata", "extract_markers", package="plinkQC") fail_hwe <- check_hwe(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, remove_individuals=remove_individuals_file, extract_markers=extract_markers_file) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # run on all individuals and markers fail_hwe <- check_hwe(indir=indir, qcdir=qcdir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") extract_markers_file <- system.file("extdata", "extract_markers", package="plinkQC") fail_hwe <- check_hwe(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, remove_individuals=remove_individuals_file, extract_markers=extract_markers_file) ## End(Not run)
Runs and evaluates results from plink –freq. It calculates the minor allele
frequencies for all variants in the individuals that passed the
perIndividualQC
. The minor allele frequency distributions is
plotted as a histogram.
check_maf( indir, name, qcdir = indir, macTh = 20, mafTh = NULL, verbose = FALSE, interactive = FALSE, path2plink = NULL, showPlinkOutput = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9 )
check_maf( indir, name, qcdir = indir, macTh = 20, mafTh = NULL, verbose = FALSE, interactive = FALSE, path2plink = NULL, showPlinkOutput = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9 )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
macTh |
[double] Threshold for minor allele cut cut-off, if both mafTh and macTh are specified, macTh is used (macTh = mafTh\*2\*NrSamples). |
mafTh |
[double] Threshold for minor allele frequency cut-off. |
verbose |
[logical] If TRUE, progress info is printed to standard out and specifically, if TRUE, plink log will be displayed. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_hwe) via ggplot2::ggsave(p=p_maf, other_arguments) or pdf(outfile) print(p_maf) dev.off(). |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
check_maf
uses plink –remove name.fail.IDs –freq to calculate the
minor allele frequencies for all variants in the individuals that passed the
perIndividualQC
. It does so without generating a new dataset
but simply removes the IDs when calculating the statistics.
For details on the output data.frame fail_maf, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#frq.
Named list with i) fail_maf containing a [data.frame] with CHR (Chromosome code), SNP (Variant identifier), A1 (Allele 1; usually minor), A2 (Allele 2; usually major), MAF (Allele 1 frequency), NCHROBS (Number of allele observations) for all SNPs that failed the mafTh/macTh and ii) p_maf, a ggplot2-object 'containing' the MAF distribution histogram which can be shown by (print(p_maf)).
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # run on all individuals and markers fail_maf <- check_maf(indir=indir, qcdir=qcdir, name=name, macTh=15, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") exclude_markers_file <- system.file("extdata", "exclude_markers", package="plinkQC") fail_maf <- check_maf(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, keep_individuals=keep_individuals_file, exclude_markers=exclude_markers_file) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # run on all individuals and markers fail_maf <- check_maf(indir=indir, qcdir=qcdir, name=name, macTh=15, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") exclude_markers_file <- system.file("extdata", "exclude_markers", package="plinkQC") fail_maf <- check_maf(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, keep_individuals=keep_individuals_file, exclude_markers=exclude_markers_file) ## End(Not run)
Runs and evaluates results from plink –check-sex.
check_sex
returns IIDs for individuals whose SNPSEX != PEDSEX
(where the SNPSEX is determined by the heterozygosity rate across
X-chromosomal variants).
Mismatching SNPSEX and PEDSEX IDs can indicate plating errors, sample-mixup
or generally samples with poor genotyping. In the latter case, these IDs are
likely to fail other QC steps as well.
Optionally, an extra data.frame (externalSex) with sample IDs and sex can be
provided to double check if external and PEDSEX data (often processed at
different centers) match. If a mismatch between PEDSEX and SNPSEX was
detected, while SNPSEX == Sex, PEDSEX of these individuals can optionally be
updated (fixMixup=TRUE).
check_sex
depicts the X-chromosomal heterozygosity (SNPSEX) of
the individuals split by their (PEDSEX).
check_sex( indir, name, qcdir = indir, maleTh = 0.8, femaleTh = 0.2, run.check_sex = TRUE, externalSex = NULL, externalFemale = "F", externalMale = "M", externalSexSex = "Sex", externalSexID = "IID", fixMixup = FALSE, interactive = FALSE, verbose = FALSE, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, showPlinkOutput = TRUE )
check_sex( indir, name, qcdir = indir, maleTh = 0.8, femaleTh = 0.2, run.check_sex = TRUE, externalSex = NULL, externalFemale = "F", externalMale = "M", externalSexSex = "Sex", externalSexID = "IID", fixMixup = FALSE, interactive = FALSE, verbose = FALSE, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam and name.sexcheck. |
qcdir |
[character] /path/to/directory to save name.sexcheck as returned by plink –check-sex. Per default qcdir=indir. If run.check_sex is FALSE, it is assumed that plink –check-sex has been run and qcdir/name.sexcheck is present. User needs writing permission to qcdir. |
maleTh |
[double] Threshold of X-chromosomal heterozygosity rate for males. |
femaleTh |
[double] Threshold of X-chromosomal heterozygosity rate for females. |
run.check_sex |
[logical] Should plink –check-sex be run? if set to
FALSE, it is assumed that plink –check-sex has been run and
qcdir/name.sexcheck is present; |
externalSex |
[data.frame, optional] Dataframe with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match. |
externalFemale |
[integer/character] Identifier for 'female' in externalSex. |
externalMale |
[integer/character] Identifier for 'male' in externalSex. |
externalSexSex |
[character] Column identifier for column containing sex information in externalSex. |
externalSexID |
[character] Column identifier for column containing ID information in externalSex. |
fixMixup |
[logical] Should PEDSEX of individuals with mismatch between PEDSEX and Sex while Sex==SNPSEX automatically corrected: this will directly change the name.bim/.bed/.fam files! |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sexcheck) via ggplot2::ggsave(p=p_sexcheck, other_arguments) or pdf(outfile) print(p_sexcheck) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_sexcheck); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified. Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
check_sex
wraps around run_check_sex
and
evaluate_check_sex
. If run.check_sex is TRUE,
run_check_sex
is executed ; otherwise it is assumed that plink
–check-sex has been run externally and qcdir/name.sexcheck exists.
check_sex
will fail with missing file error otherwise.
For details on the output data.frame fail_sex, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#sexcheck.
Named list with i) fail_sex: [data.frame] with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals failing sex check, ii) mixup: dataframe with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals whose PEDSEX != Sex and Sex == SNPSEX and iii) p_sexcheck, a ggplot2-object 'containing' a scatter plot of the X-chromosomal heterozygosity (SNPSEX) of the sample split by their (PEDSEX), which can be shown by print(p_sexcheck).
## Not run: indir <- system.file("extdata", package="plinkQC") name <- "data" # whole dataset fail_sex <- check_sex(indir=indir, name=name, run.check_sex=FALSE, interactive=FALSE, verbose=FALSE) # subset of dataset with sample highlighting highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") fail_sex <- check_sex(indir=indir, name=name, interactive=FALSE, path2plink=path2plink, remove_individuals=remove_individuals_file, highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape")) ## End(Not run)
## Not run: indir <- system.file("extdata", package="plinkQC") name <- "data" # whole dataset fail_sex <- check_sex(indir=indir, name=name, run.check_sex=FALSE, interactive=FALSE, verbose=FALSE) # subset of dataset with sample highlighting highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") fail_sex <- check_sex(indir=indir, name=name, interactive=FALSE, path2plink=path2plink, remove_individuals=remove_individuals_file, highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape")) ## End(Not run)
Runs and evaluates results from plink –missing –freq. It calculate the
rates of missing genotype calls and frequency for all variants in the
individuals that passed the perIndividualQC
. The SNP
missingness rates (stratified by minor allele frequency) are depicted as
histograms.
check_snp_missingness( indir, name, qcdir = indir, lmissTh = 0.01, interactive = FALSE, path2plink = NULL, verbose = FALSE, showPlinkOutput = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9 )
check_snp_missingness( indir, name, qcdir = indir, lmissTh = 0.01, interactive = FALSE, path2plink = NULL, verbose = FALSE, showPlinkOutput = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9 )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
lmissTh |
[double] Threshold for acceptable variant missing rate across samples. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_lmiss) via ggplot2::ggsave(p=p_lmiss, other_arguments) or pdf(outfile) print(p_lmiss) dev.off(). |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
verbose |
[logical] If TRUE, progress info is printed to standard out and specifically, if TRUE, plink log will be displayed. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
check_snp_missingness
uses plink –remove name.fail.IDs –missing
–freq to calculate rates of missing genotype calls and frequency per SNP in
the individuals that passed the perIndividualQC
. It does so
without generating a new dataset but simply removes the IDs when calculating
the statistics.
For details on the output data.frame fail_missingness, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#lmiss.
Named list with i) fail_missingness containing a [data.frame] with CHR (Chromosome code), SNP (Variant identifier), CLST (Cluster identifier. Only present with –within/–family), N_MISS (Number of missing genotype call(s), not counting obligatory missings), N_CLST (Cluster size; does not include nonmales on Ychr; Only present with –within/–family), N_GENO (Number of potentially valid call(s)), F_MISS (Missing call rate) for all SNPs failing the lmissTh and ii) p_lmiss, a ggplot2-object 'containing' the SNP missingness histogram which can be shown by (print(p_lmiss)).
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # run on all individuals and markers fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") extract_markers_file <- system.file("extdata", "extract_markers", package="plinkQC") fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, keep_individuals=keep_individuals_file, extract_markers=extract_markers_file) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # run on all individuals and markers fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") extract_markers_file <- system.file("extdata", "extract_markers", package="plinkQC") fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, keep_individuals=keep_individuals_file, extract_markers=extract_markers_file) ## End(Not run)
checkFiltering checks that the file names with the individuals and markers to be filtered can be found. If so, it constructs the command for filtering
checkFiltering( keep_individuals = NULL, remove_individuals = NULL, extract_markers = NULL, exclude_markers = NULL )
checkFiltering( keep_individuals = NULL, remove_individuals = NULL, extract_markers = NULL, exclude_markers = NULL )
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
Vector containing args in sys::exec_wait format to enable filtering on individuals and/or markers.
checkPlink checks that the PLINK software (https://www.cog-genomics.org/plink/1.9/) can be found from system call.
checkPlink(path2plink = NULL)
checkPlink(path2plink = NULL)
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
Path to PLINK executable.
checkRemoveIDs checks that the file names with the individuals to be filtered can be found. It reads the corresponding files, combines the selected individuals into one data.frame and compares these to all individuals in the analysis.
checkRemoveIDs(prefix, remove_individuals = NULL, keep_individuals)
checkRemoveIDs(prefix, remove_individuals = NULL, keep_individuals)
prefix |
[character] Prefix of PLINK files, i.e. path/2/name.bed, path/2/name.bim and path/2/name.fam. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
data.frame containing family (FID) and individual (IID) IDs of individuals to be removed from analysis.
Individuals that fail per-individual QC and markers that fail per-marker QC are removed from indir/name.bim/.bed/.fam and a new, dataset with the remaining individuals and markers is created as qcdir/name.clean.bim/.bed/.fam.
cleanData( indir, name, qcdir = indir, filterSex = TRUE, filterHeterozygosity = TRUE, filterSampleMissingness = TRUE, filterAncestry = TRUE, filterRelated = TRUE, filterSNPMissingness = TRUE, lmissTh = 0.01, filterHWE = TRUE, hweTh = 1e-05, filterMAF = TRUE, macTh = 20, mafTh = NULL, path2plink = NULL, verbose = FALSE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
cleanData( indir, name, qcdir = indir, filterSex = TRUE, filterHeterozygosity = TRUE, filterSampleMissingness = TRUE, filterAncestry = TRUE, filterRelated = TRUE, filterSNPMissingness = TRUE, lmissTh = 0.01, filterHWE = TRUE, hweTh = 1e-05, filterMAF = TRUE, macTh = 20, mafTh = NULL, path2plink = NULL, verbose = FALSE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
filterSex |
[logical] Set to exclude samples that failed the sex
check (via |
filterHeterozygosity |
[logical] Set to exclude samples that failed
check for outlying heterozygosity rates (via
|
filterSampleMissingness |
[logical] Set to exclude samples that failed
check for excessive missing genotype rates (via
|
filterAncestry |
[logical] Set to exclude samples that failed ancestry
check (via |
filterRelated |
[logical] Set to exclude samples that failed relatedness
check (via |
filterSNPMissingness |
[logical] Set to exclude markers that have
excessive missing rates across samples (via
|
lmissTh |
[double] Threshold for acceptable variant missing rate across samples. |
filterHWE |
[logical] Set to exclude markers that fail HWE exact test
(via |
hweTh |
[double] Significance threshold for deviation from HWE. |
filterMAF |
[logical] Set to exclude markers that fail minor allele
frequency or minor allele count threshold (via |
macTh |
[double] Threshold for minor allele cut cut-off, if both mafTh and macTh are specified, macTh is used (macTh = mafTh\*2\*NrSamples). |
mafTh |
[double] Threshold for minor allele frequency cut-off. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
names [list] with i) passIDs, containing a [data.frame] with family [FID] and individual [IID] IDs of samples that pass the QC, ii) failIDs, containing a [data.frame] with family [FID] and individual [IID] IDs of samples that fail the QC.
package.dir <- find.package('plinkQC') indir <- file.path(package.dir, 'extdata') qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # Run qc on all samples and markers in the dataset ## Run individual QC checks fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, path2plink=path2plink) ## Run marker QC checks fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) ## Create new dataset of individuals and markers passing QC ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15, verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE, filterRelated=TRUE) # Run qc on subset of samples and markers in the dataset highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, dont.check_ancestry = TRUE, interactive=FALSE, verbose=FALSE, highlight_samples = highlight_samples[,2], highlight_type = "label", remove_individuals = remove_individuals_file, path2plink=path2plink) ## Run marker QC checks fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) ## Create new dataset of individuals and markers passing QC ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15, verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE, remove_individuals = remove_individuals_file) ## End(Not run)
package.dir <- find.package('plinkQC') indir <- file.path(package.dir, 'extdata') qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # Run qc on all samples and markers in the dataset ## Run individual QC checks fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, path2plink=path2plink) ## Run marker QC checks fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) ## Create new dataset of individuals and markers passing QC ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15, verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE, filterRelated=TRUE) # Run qc on subset of samples and markers in the dataset highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, dont.check_ancestry = TRUE, interactive=FALSE, verbose=FALSE, highlight_samples = highlight_samples[,2], highlight_type = "label", remove_individuals = remove_individuals_file, path2plink=path2plink) ## Run marker QC checks fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) ## Create new dataset of individuals and markers passing QC ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15, verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE, remove_individuals = remove_individuals_file) ## End(Not run)
Evaluates and depicts results of plink –pca (via
run_check_ancestry
or externally conducted pca) on merged
genotypes from individuals to be QCed and individuals of reference population
of known genotypes. Currently, check ancestry only supports automatic
selection of individuals of European descent. It uses information from
principal components 1 and 2 returned by plink –pca to find the center of
the European reference samples (mean(PC1_europeanRef), mean(PC2_europeanRef).
It computes the maximum Euclidean distance (maxDist) of the European
reference samples from this centre. All study samples whose Euclidean
distance from the centre falls outside the circle described by the radius
r=europeanTh* maxDist are considered non-European and their IDs are returned
as failing the ancestry check.
check_ancestry creates a scatter plot of PC1 versus PC2 colour-coded for
samples of the reference populations and the study population.
evaluate_check_ancestry( indir, name, prefixMergedDataset, qcdir = indir, europeanTh = 1.5, defaultRefSamples = c("HapMap", "1000Genomes"), refSamples = NULL, refColors = NULL, refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID", refSamplesPop = "Pop", refColorsColor = "Color", refColorsPop = "Pop", studyColor = "#2c7bb6", refPopulation = c("CEU", "TSI"), legend_labels_per_row = 6, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, interactive = FALSE, verbose = FALSE )
evaluate_check_ancestry( indir, name, prefixMergedDataset, qcdir = indir, europeanTh = 1.5, defaultRefSamples = c("HapMap", "1000Genomes"), refSamples = NULL, refColors = NULL, refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID", refSamplesPop = "Pop", refColorsColor = "Color", refColorsPop = "Pop", studyColor = "#2c7bb6", refPopulation = c("CEU", "TSI"), legend_labels_per_row = 6, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, interactive = FALSE, verbose = FALSE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink –pca, resulting in prefixMergedDataset.eigenvec. |
qcdir |
[character] /path/to/directory/with/QC/results containing prefixMergedDataset.eigenvec results as returned by plink –pca. Per default qcdir=indir. |
europeanTh |
[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or 1000Genomes is specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. If refSamples and refSamplesFile are not specified, defaultRefSamples will be used as reference. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If refColors and refColorsFile are not specified and refSamples and refSamplesFile are not specified, default colors will be determined from the defaultRefSamples option. If refColors and refColorsFile are not specified and but refSamples or refSamplesFile are given, ggplot default colors will be used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. If both refSamplesFile and refSamples are not NULL, defaultRefSamples information is used. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If refColors and refColorsFile are not specified and refSamples and refSamplesFile are not specified, default colors will be determined from the defaultRefSamples option. If refColors and refColorsFile are not specified and but refSamples or refSamplesFile are given, ggplot default colors will be used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population if plot is TRUE. |
refPopulation |
[vector] Vector with population identifiers of European reference population. Identifiers have to be corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
legend_labels_per_row |
[integer] Number of population names per row in PCA plot. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_ancestry); all highlight_samples IIDs have to be present in the IIDs of the prefixMergedDataset.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(p=p_ancestry, other_arguments) or pdf(outfile) print(p_ancestry) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
Both run_check_ancestry
and
evaluate_check_ancestry
can simply be invoked by
check_ancestry
.
1000 Genomes samples were downloaded from https://www.internationalgenome.org/category/sample/, HapMap Phase 3 samples were downloaded from https://www.broadinstitute.org/medical-and-population-genetics/hapmap-3.
Named [list] with i) fail_ancestry, containing a [data.frame] with FID and IID of non-European individuals and ii) p_ancestry, a ggplot2-object 'containing' a scatter plot of PC1 versus PC2 colour-coded for samples of the reference populations and the study population, which can be shown by print(p_ancestry) and iii) plot_data, a data.frame with the data visualised in p_ancestry (ii).
## Not run: qcdir <- system.file("extdata", package="plinkQC") name <- "data" # whole dataset fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE) # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, highlight_samples = highlight_samples[,2], highlight_type = c("text", "shape")) ## End(Not run)
## Not run: qcdir <- system.file("extdata", package="plinkQC") name <- "data" # whole dataset fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE) # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, highlight_samples = highlight_samples[,2], highlight_type = c("text", "shape")) ## End(Not run)
Evaluates and depicts results from plink –missing (missing genotype rates
per individual) and plink –het (heterozygosity rates per individual) via
run_check_heterozygosity
and
run_check_missingness
or externally conducted check.)
Non-systematic failures in genotyping and outlying heterozygosity (hz) rates
per individual are often proxies for DNA sample quality. Larger than expected
heterozygosity can indicate possible DNA contamination.
The mean heterozygosity in PLINK is computed as hz_mean = (N-O)/N, where
N: number of non-missing genotypes and O:observed number of homozygous
genotypes for a given individual.
Mean heterozygosity can differ between populations and SNP genotyping panels.
Within a population and genotyping panel, a reduced heterozygosity rate can
indicate inbreeding - these individuals will then be returned by
check_relatedness as individuals that fail the relatedness filters.
evaluate_check_het_and_miss
creates a scatter plot with the
individuals' missingness rates on x-axis and their heterozygosity rates on
the y-axis.
evaluate_check_het_and_miss( qcdir, name, imissTh = 0.03, hetTh = 3, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, highlight_legend = FALSE, interactive = FALSE )
evaluate_check_het_and_miss( qcdir, name, imissTh = 0.03, hetTh = 3, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, highlight_legend = FALSE, interactive = FALSE )
qcdir |
[character] path/to/directory/with/QC/results containing name.imiss and name.het results as returned by plink –missing and plink –het. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.het and name.imiss. |
imissTh |
[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity in any individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_het_imiss); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_het_imiss) via ggplot2::ggsave(p=p_het_imiss , other_arguments) or pdf(outfile) print(p_het_imiss) dev.off(). |
All, run_check_heterozygosity
,
run_check_missingness
and
evaluate_check_het_and_miss
can simply be invoked by
check_het_and_miss
.
For details on the output data.frame fail_imiss and fail_het, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#imiss and https://www.cog-genomics.org/plink/1.9/formats#het
named [list] with i) fail_imiss dataframe containing FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (Number of missing genotype call(s), not including obligatory missings), N_GENO ( Number of potentially valid call(s)), F_MISS (Missing call rate) of individuals failing missing genotype check and ii) fail_het dataframe containing FID (Family ID), IID (Within-family ID), O(HOM) (Observed number of homozygotes), E(HOM) (Expected number of homozygotes), N(NM) (Number of non-missing autosomal genotypes), F (Method-of-moments F coefficient estimate) of individuals failing outlying heterozygosity check; iii) p_het_imiss, a ggplot2-object 'containing' a scatter plot with the samples' missingness rates on x-axis and their heterozygosity rates on the y-axis, which can be shown by print(p_het_imiss) and iv) plot_data, a data.frame with the data visualised in p_het_imiss (iii).
qcdir <- system.file("extdata", package="plinkQC") name <- "data" ## Not run: fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name, interactive=FALSE) #' # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name, interactive=FALSE, highlight_samples = highlight_samples[,2], highlight_type = c("text", "color")) ## End(Not run)
qcdir <- system.file("extdata", package="plinkQC") name <- "data" ## Not run: fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name, interactive=FALSE) #' # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name, interactive=FALSE, highlight_samples = highlight_samples[,2], highlight_type = c("text", "color")) ## End(Not run)
Evaluates and depicts results from plink –check-sex (via
run_check_sex
or externally conducted sex check).
Takes file qcdir/name.sexcheck and returns IIDs for samples whose
SNPSEX != PEDSEX (where the SNPSEX is determined by the heterozygosity rate
across X-chromosomal variants).
Mismatching SNPSEX and PEDSEX IDs can indicate plating errors, sample-mixup
or generally samples with poor genotyping. In the latter case, these IDs are
likely to fail other QC steps as well.
Optionally, an extra data.frame (externalSex) with sample IDs and sex can be
provided to double check if external and PEDSEX data (often processed at
different centers) match. If a mismatch between PEDSEX and SNPSEX was
detected while SNPSEX == Sex, PEDSEX of these individuals can optionally be
updated (fixMixup=TRUE).
evaluate_check_sex
depicts the X-chromosomal heterozygosity (SNPSEX)
of the samples split by their (PEDSEX).
evaluate_check_sex( qcdir, name, maleTh = 0.8, femaleTh = 0.2, externalSex = NULL, fixMixup = FALSE, indir = qcdir, externalFemale = "F", externalMale = "M", externalSexSex = "Sex", externalSexID = "IID", verbose = FALSE, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE, interactive = FALSE )
evaluate_check_sex( qcdir, name, maleTh = 0.8, femaleTh = 0.2, externalSex = NULL, fixMixup = FALSE, indir = qcdir, externalFemale = "F", externalMale = "M", externalSexSex = "Sex", externalSexID = "IID", verbose = FALSE, label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE, interactive = FALSE )
qcdir |
[character] /path/to/directory containing name.sexcheck as returned by plink –check-sex. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam and name.sexcheck. |
maleTh |
[double] Threshold of X-chromosomal heterozygosity rate for males. |
femaleTh |
[double] Threshold of X-chromosomal heterozygosity rate for females. |
externalSex |
[data.frame, optional] with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match. |
fixMixup |
[logical] Should PEDSEX of individuals with mismatch between PEDSEX and Sex, with Sex==SNPSEX automatically corrected: this will directly change the name.bim/.bed/.fam files! |
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files; only required of fixMixup==TRUE. User needs writing permission to indir. |
externalFemale |
[integer/character] Identifier for 'female' in externalSex. |
externalMale |
[integer/character] Identifier for 'male' in externalSex. |
externalSexSex |
[character] Column identifier for column containing sex information in externalSex. |
externalSexID |
[character] Column identifier for column containing ID information in externalSex. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_sexcheck); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified. Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sexcheck) via ggplot2::ggsave(p=p_sexcheck, other_arguments) or pdf(outfile) print(p_sexcheck) dev.off(). |
Both run_check_sex
and evaluate_check_sex
can
simply be invoked by check_sex
.
For details on the output data.frame fail_sex, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#sexcheck.
named list with i) fail_sex: dataframe with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals failing sex check; ii) mixup: dataframe with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals whose PEDSEX != Sex and Sex == SNPSEX; iii) p_sexcheck, a ggplot2-object 'containing' a scatter plot of the X-chromosomal heterozygosity (SNPSEX) of the individuals split by their (PEDSEX), which can be shown by print(p_sexcheck) and iv) plot_data, a data.frame with the data visualised in p_sexcheck (iii).
qcdir <- system.file("extdata", package="plinkQC") name <- "data" path2plink <- '/path/to/plink' ## Not run: fail_sex <- evaluate_check_sex(qcdir=qcdir, name=name, interactive=FALSE, verbose=FALSE, path2plink=path2plink) # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_sex <- evaluate_check_sex(qcdir=qcdir, name=name, interactive=FALSE, verbose=FALSE, path2plink=path2plink, highlight_samples = highlight_samples[,2], highlight_type = c("label", "color"), highlight_color = "darkgreen") ## End(Not run)
qcdir <- system.file("extdata", package="plinkQC") name <- "data" path2plink <- '/path/to/plink' ## Not run: fail_sex <- evaluate_check_sex(qcdir=qcdir, name=name, interactive=FALSE, verbose=FALSE, path2plink=path2plink) # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_sex <- evaluate_check_sex(qcdir=qcdir, name=name, interactive=FALSE, verbose=FALSE, path2plink=path2plink, highlight_samples = highlight_samples[,2], highlight_type = c("label", "color"), highlight_color = "darkgreen") ## End(Not run)
overviewPerIndividualQC
depicts results of
perIndividualQC
as intersection plots
(via upset
) and returns dataframes indicating
which QC checks individuals failed or passed.
overviewPerIndividualQC(results_perIndividualQC, interactive = FALSE)
overviewPerIndividualQC(results_perIndividualQC, interactive = FALSE)
results_perIndividualQC |
[list] Output of |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_overview) via ggplot2::ggsave(p=p_overview, other_arguments) or pdf(outfile) print(p_overview) dev.off(). |
Named [list] with i) nr_fail_samples: total number of samples [integer] failing perIndividualQC, ii) fail_QC containing a [data.frame] with samples that failed QC steps (excluding ancestry) with IID, FID, all QC steps applied by perIndividualQC (max=4), with entries=0 if passing the QC and entries=1 if failing that particular QC and iii) fail_QC_and_ancestry containing a [data.frame] with samples that failed ancestry and QC checks with IID, FID, QC_fail and Ancestry_fail, with entries=0 if passing and entries=1 if failing that check, iii) p_overview, a ggplot2-object 'containing' a sub-paneled plot with the QC-plots.
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" ## Not run: fail_individuals <- perIndividualQC(qcdir=qcdir, indir=indir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE, do.run_check_sex=FALSE, do.run_check_ancestry=FALSE) overview <- overviewPerIndividualQC(fail_individuals) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" ## Not run: fail_individuals <- perIndividualQC(qcdir=qcdir, indir=indir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE, do.run_check_sex=FALSE, do.run_check_ancestry=FALSE) overview <- overviewPerIndividualQC(fail_individuals) ## End(Not run)
overviewPerMarkerQC
depicts results of perMarkerQC
as
an intersection plot (via upset
) and returns a
dataframe indicating which QC checks were failed or passed.
overviewPerMarkerQC(results_perMarkerQC, interactive = FALSE)
overviewPerMarkerQC(results_perMarkerQC, interactive = FALSE)
results_perMarkerQC |
[list] Output of |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_overview) via ggplot2::ggsave(p=p_overview, other_arguments) or pdf(outfile) print(p_overview) dev.off(). |
Named [list] with i) nr_fail_markers: total number of markers [integer] failing perMarkerQC, ii) fail_QC containing a [data.frame] with markers that failed QC steps: marker rsIDs in rows, columns are all QC steps applied by perMarkerQC (max=3), with entries=0 if passing the QC and entries=1 if failing that particular QC.
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. # All quality control checks ## Not run: fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) overview <- overviewPerMarkerQC(fail_markers) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. # All quality control checks ## Not run: fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) overview <- overviewPerMarkerQC(fail_markers) ## End(Not run)
perIndividualQC checks the samples in the plink dataset for their total missingness and heterozygosity rates, the concordance of their assigned sex to their SNP sex, their relatedness to other study individuals and their genetic ancestry.
perIndividualQC( indir, name, qcdir = indir, dont.check_sex = FALSE, do.run_check_sex = TRUE, do.evaluate_check_sex = TRUE, maleTh = 0.8, femaleTh = 0.2, externalSex = NULL, externalMale = "M", externalSexSex = "Sex", externalSexID = "IID", externalFemale = "F", fixMixup = FALSE, dont.check_het_and_miss = FALSE, do.run_check_het_and_miss = TRUE, do.evaluate_check_het_and_miss = TRUE, imissTh = 0.03, hetTh = 3, dont.check_relatedness = FALSE, do.run_check_relatedness = TRUE, do.evaluate_check_relatedness = TRUE, highIBDTh = 0.1875, mafThRelatedness = 0.1, filter_high_ldregion = TRUE, high_ldregion_file = NULL, genomebuild = "hg19", dont.check_ancestry = FALSE, do.run_check_ancestry = TRUE, do.evaluate_check_ancestry = TRUE, prefixMergedDataset, europeanTh = 1.5, defaultRefSamples = c("HapMap", "1000Genomes"), refSamples = NULL, refColors = NULL, refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID", refSamplesPop = "Pop", refColorsColor = "Color", refColorsPop = "Pop", studyColor = "#2c7bb6", label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, interactive = FALSE, verbose = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, subplot_label_size = 9, title_size = 9, path2plink = NULL, showPlinkOutput = TRUE )
perIndividualQC( indir, name, qcdir = indir, dont.check_sex = FALSE, do.run_check_sex = TRUE, do.evaluate_check_sex = TRUE, maleTh = 0.8, femaleTh = 0.2, externalSex = NULL, externalMale = "M", externalSexSex = "Sex", externalSexID = "IID", externalFemale = "F", fixMixup = FALSE, dont.check_het_and_miss = FALSE, do.run_check_het_and_miss = TRUE, do.evaluate_check_het_and_miss = TRUE, imissTh = 0.03, hetTh = 3, dont.check_relatedness = FALSE, do.run_check_relatedness = TRUE, do.evaluate_check_relatedness = TRUE, highIBDTh = 0.1875, mafThRelatedness = 0.1, filter_high_ldregion = TRUE, high_ldregion_file = NULL, genomebuild = "hg19", dont.check_ancestry = FALSE, do.run_check_ancestry = TRUE, do.evaluate_check_ancestry = TRUE, prefixMergedDataset, europeanTh = 1.5, defaultRefSamples = c("HapMap", "1000Genomes"), refSamples = NULL, refColors = NULL, refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID", refSamplesPop = "Pop", refColorsColor = "Color", refColorsPop = "Pop", studyColor = "#2c7bb6", label_fail = TRUE, highlight_samples = NULL, highlight_type = c("text", "label", "color", "shape"), highlight_text_size = 3, highlight_color = "#c51b8a", highlight_shape = 17, highlight_legend = FALSE, interactive = FALSE, verbose = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, subplot_label_size = 9, title_size = 9, path2plink = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be saved.
Per default, qcdir=indir. If do.evaluate_[analysis] is set to TRUE and
do.run_[analysis] is FALSE, |
dont.check_sex |
[logical] If TRUE, no sex check will be conducted; short for do.run_check_sex=FALSE and do.evaluate_check_sex=FALSE. Takes precedence over do.run_check_sex and do.evaluate_check_sex. |
do.run_check_sex |
[logical] If TRUE, run |
do.evaluate_check_sex |
[logical] If TRUE, run
|
maleTh |
[double] Threshold of X-chromosomal heterozygosity rate for males. |
femaleTh |
[double] Threshold of X-chromosomal heterozygosity rate for females. |
externalSex |
[data.frame, optional] Dataframe with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match. |
externalMale |
[integer/character] Identifier for 'male' in externalSex. |
externalSexSex |
[character] Column identifier for column containing sex information in externalSex. |
externalSexID |
[character] Column identifier for column containing ID information in externalSex. |
externalFemale |
[integer/character] Identifier for 'female' in externalSex. |
fixMixup |
[logical] Should PEDSEX of individuals with mismatch between PEDSEX and Sex while Sex==SNPSEX automatically corrected: this will directly change the name.bim/.bed/.fam files! |
dont.check_het_and_miss |
[logical] If TRUE, no heterozygosity and missingness check will be conducted; short for do.run_check_heterozygosity=FALSE, do.run_check_missingness=FALSE and do.evaluate_check_het_and_miss=FALSE. Takes precedence over do.run_check_heterozygosity, do.run_check_missingness and do.evaluate_check_het_and_miss. |
do.run_check_het_and_miss |
[logical] If TRUE, run
|
do.evaluate_check_het_and_miss |
[logical] If TRUE, run
|
imissTh |
[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity per individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
dont.check_relatedness |
[logical] If TRUE, no relatedness check will be conducted; short for do.run_check_relatedness=FALSE and do.evaluate_check_relatedness=FALSE. Takes precedence over do.run_check_relatedness and do.evaluate_check_relatedness. |
do.run_check_relatedness |
[logical] If TRUE, run
|
do.evaluate_check_relatedness |
[logical] If TRUE, run
|
highIBDTh |
[double] Threshold for acceptable proportion of IBD between pair of individuals. |
mafThRelatedness |
[double] Threshold of minor allele frequency filter for selecting variants for IBD estimation. |
filter_high_ldregion |
[logical] Should high LD regions be filtered
before IBD estimation; carried out per default with high LD regions for
hg19 provided as default via |
high_ldregion_file |
[character] Path to file with high LD regions used
for filtering before IBD estimation if |
genomebuild |
[character] Name of the genome build of the PLINK file annotations, ie mappings in the name.bim file. Will be used to remove high-LD regions based on the coordinates of the respective build. Options are hg18, hg19 and hg38. See @details. |
dont.check_ancestry |
[logical] If TRUE, no ancestry check will be conducted; short for do.run_check_ancestry=FALSE and do.evaluate_check_ancestry=FALSE. Takes precedence over do.run_check_ancestry and do.evaluate_check_ancestry. |
do.run_check_ancestry |
[logical] If TRUE, run
|
do.evaluate_check_ancestry |
[logical] If TRUE, run
|
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink –pca, resulting in prefixMergedDataset.eigenvec. |
europeanTh |
[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or 1000Genomes is specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. Either refSamples or refSamplesFile have to be specified. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile) default colors are used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor].If not provided and is.null(refColors) default colors for are used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_sexcheck); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified. Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sampleQC) via ggplot2::ggsave(p=p_sampleQC, other_arguments) or pdf(outfile) print(p_sampleQC) dev.off(). If TRUE, i) depicts the X-chromosomal heterozygosity (SNPSEX) of the samples split by their PEDSEX (if do.evaluate_check_sex is TRUE), ii) creates a scatter plot with samples' missingness rates on x-axis and their heterozygosity rates on the y-axis (if do.evaluate_check_het_and_miss is TRUE), iii) depicts all pair-wise IBD-estimates as histogram (if do.evaluate_check_relatedness is TRUE) and iv) creates a scatter plot of PC1 versus PC2 color-coded for samples of reference populations and study population (if do.check_ancestry is set to TRUE). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
subplot_label_size |
[integer] Size of the subplot labeling. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
perIndividualQC wraps around the individual QC functions
check_sex
, check_het_and_miss
,
check_relatedness
and check_ancestry
. For details
on the parameters and outputs, check these function documentations. For
detailed output for fail IIDs (instead of simple IID lists), run each
function individually.
Named [list] with i) fail_list, a named [list] with 1.
sample_missingness containing a [vector] with sample IIDs failing the
missingness threshold imissTh, 2. highIBD containing a [vector] with sample
IIDs failing the relatedness threshold highIBDTh, 3. outlying_heterozygosity
containing a [vector] with sample IIDs failing the heterozygosity threshold
hetTh, 4. mismatched_sex containing a [vector] with the sample IIDs failing
the sexcheck based on SNPSEX and femaleTh/maleTh and 5. ancestry containing
a vector with sample IIDs failing the ancestry check based on europeanTh and
ii) p_sampleQC, a ggplot2-object 'containing' a sub-paneled plot with the
QC-plots of check_sex
,
check_het_and_miss
,
check_relatedness
and check_ancestry
, which can
be shown by print(p_sampleQC).
List entries contain NULL if that specific check was not chosen.
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" # All quality control checks ## Not run: # whole dataset fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE, do.run_check_sex=FALSE, do.run_check_ancestry=FALSE) # Only check sex and missingness/heterozygosity fail_sex_het_miss <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, dont.check_ancestry=TRUE, dont.check_relatedness=TRUE, interactive=FALSE, verbose=FALSE) # subset of dataset with sample highlighting highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") individual_qc <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, do.run_check_ancestry=FALSE, do.evaluate_check_ancestry=TRUE, path2plink=path2plink, remove_individuals=remove_individuals_file, highlight_samples=highlight_samples[,2], highlight_type = c("text", "color"), highlight_color="goldenrod") ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" # All quality control checks ## Not run: # whole dataset fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE, do.run_check_sex=FALSE, do.run_check_ancestry=FALSE) # Only check sex and missingness/heterozygosity fail_sex_het_miss <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, dont.check_ancestry=TRUE, dont.check_relatedness=TRUE, interactive=FALSE, verbose=FALSE) # subset of dataset with sample highlighting highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") individual_qc <- perIndividualQC(indir=indir, qcdir=qcdir, name=name, refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""), refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""), prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE, do.run_check_ancestry=FALSE, do.evaluate_check_ancestry=TRUE, path2plink=path2plink, remove_individuals=remove_individuals_file, highlight_samples=highlight_samples[,2], highlight_type = c("text", "color"), highlight_color="goldenrod") ## End(Not run)
perMarkerQC checks the markers in the plink dataset for their missingness
rates across samples, their deviation from Hardy-Weinberg-Equilibrium (HWE)
and their minor allele frequencies (MAF). Per default, it assumes that IDs of
individuals that have failed perIndividualQC
have been written
to qcdir/name.fail.IDs and removes these individuals when computing
missingness rates, HWE p-values and MAF. If the qcdir/name.fail.IDs file does
not exist, a message is written to stdout but the analyses will continue for
all samples in the name.fam/name.bed/name.bim dataset.
Depicts i) SNP missingness rates (stratified by minor allele
frequency) as histograms, ii) p-values of HWE exact test (stratified by all
and low p-values) as histograms and iii) the minor allele frequency
distribution as a histogram.
perMarkerQC( indir, qcdir = indir, name, do.check_snp_missingness = TRUE, lmissTh = 0.01, do.check_hwe = TRUE, hweTh = 1e-05, do.check_maf = TRUE, macTh = 20, mafTh = NULL, interactive = FALSE, verbose = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, subplot_label_size = 9, path2plink = NULL, showPlinkOutput = TRUE )
perMarkerQC( indir, qcdir = indir, name, do.check_snp_missingness = TRUE, lmissTh = 0.01, do.check_hwe = TRUE, hweTh = 1e-05, do.check_maf = TRUE, macTh = 20, mafTh = NULL, interactive = FALSE, verbose = TRUE, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, legend_text_size = 5, legend_title_size = 7, axis_text_size = 5, axis_title_size = 7, title_size = 9, subplot_label_size = 9, path2plink = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
do.check_snp_missingness |
[logical] If TRUE, run
|
lmissTh |
[double] Threshold for acceptable variant missing rate across samples. |
do.check_hwe |
[logical] If TRUE, run |
hweTh |
[double] Significance threshold for deviation from HWE. |
do.check_maf |
[logical] If TRUE, run |
macTh |
[double] Threshold for minor allele cut cut-off, if both mafTh and macTh are specified, macTh is used (macTh = mafTh\*2\*NrSamples). |
mafTh |
[double] Threshold for minor allele frequency cut-off. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_marker) via ggplot2::ggsave(p=p_marker, other_arguments) or pdf(outfile) print(p_marker) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
subplot_label_size |
[integer] Size of the subplot labeling. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
perMarkerQC wraps around the marker QC functions
check_snp_missingness
, check_hwe
and
check_maf
. For details on the parameters and outputs, check
these function documentations.
Named [list] with i) fail_list, a named [list] with 1.
SNP_missingness, containing SNP IDs [vector] failing the missingness
threshold lmissTh, 2. hwe, containing SNP IDs [vector] failing the HWE exact
test threshold hweTh and 3. maf, containing SNPs Ids [vector] failing the MAF
threshold mafTh/MAC threshold macTh and ii) p_markerQC, a ggplot2-object
'containing' a sub-paneled plot with the QC-plots of
check_snp_missingness
, check_hwe
and
check_maf
, which can be shown by print(p_markerQC).
List entries contain NULL if that specific check was not chosen.
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. # All quality control checks ## Not run: # run on all markers and individuals fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") extract_markers_file <- system.file("extdata", "extract_markers", package="plinkQC") fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, keep_individuals=keep_individuals_file, extract_markers=extract_markers_file) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() name <- "data" path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. # All quality control checks ## Not run: # run on all markers and individuals fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink) # run on subset of individuals and markers keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") extract_markers_file <- system.file("extdata", "extract_markers", package="plinkQC") fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir, name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink, keep_individuals=keep_individuals_file, extract_markers=extract_markers_file) ## End(Not run)
Run plink –pca to calculate the principal components on merged genotypes of the study and reference dataset.
run_check_ancestry( indir, prefixMergedDataset, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
run_check_ancestry( indir, prefixMergedDataset, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files prefixMergedDataset.bim,prefixMergedDataset.fam and prefixMergedDataset.bed. |
prefixMergedDataset |
[character] Prefix of merged study and reference data files, i.e. prefixMergedDataset.bed, prefixMergedDataset.bim, prefixMergedDataset.fam. |
qcdir |
[character] /path/to/directory to save prefixMergedDataset.eigenvec as returned by plink –pca. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Both, run_check_ancestry
and its evaluation by
evaluate_check_ancestry
can simply be invoked by
check_ancestry
.
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() prefixMergedDataset <- 'data.HapMapIII' path2plink <- 'path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # ancestry check on all individuals in dataset run <- run_check_ancestry(indir=indir, qcdir=qcdir, prefixMergedDataset, path2plink=path2plink) # ancestry check on subset of dataset remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") run <- run_check_ancestry(indir=indir, qcdir=qcdir, name=name, remove_individuals=remove_individuals_file, path2plink=path2plink) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") qcdir <- tempdir() prefixMergedDataset <- 'data.HapMapIII' path2plink <- 'path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # ancestry check on all individuals in dataset run <- run_check_ancestry(indir=indir, qcdir=qcdir, prefixMergedDataset, path2plink=path2plink) # ancestry check on subset of dataset remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") run <- run_check_ancestry(indir=indir, qcdir=qcdir, name=name, remove_individuals=remove_individuals_file, path2plink=path2plink) ## End(Not run)
Run plink –het to calculate heterozygosity rates per individual.
run_check_heterozygosity( indir, name, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
run_check_heterozygosity( indir, name, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.het as returned by plink –het. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
All, run_check_heterozygosity
,
run_check_missingness
and their evaluation by
evaluate_check_het_and_miss
can simply be invoked by
check_het_and_miss
.
indir <- system.file("extdata", package="plinkQC") name <- 'data' qcdir <- tempdir() path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # heterozygosity check on all individuals in dataset run <- run_check_heterozygosity(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) #' # heterozygosity on subset of dataset remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") run <- run_check_heterozygosity(indir=indir, qcdir=qcdir, name=name, remove_individuals=remove_individuals_file,path2plink=path2plink) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") name <- 'data' qcdir <- tempdir() path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # heterozygosity check on all individuals in dataset run <- run_check_heterozygosity(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) #' # heterozygosity on subset of dataset remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") run <- run_check_heterozygosity(indir=indir, qcdir=qcdir, name=name, remove_individuals=remove_individuals_file,path2plink=path2plink) ## End(Not run)
Run plink –missing to calculate missing genotype rates per individual.
run_check_missingness( indir, name, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
run_check_missingness( indir, name, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.imiss as returned by plink –missing. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
All, run_check_heterozygosity
,
run_check_missingness
and their evaluation by
evaluate_check_het_and_miss
can simply be invoked by
check_het_and_miss
.
indir <- system.file("extdata", package="plinkQC") name <- 'data' qcdir <- tempdir() path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # missingness check on all individuals in dataset run <- run_check_missingness(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) # missingness on subset of dataset remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") run <- run_check_missingness(indir=indir, qcdir=qcdir, name=name, remove_individuals=remove_individuals_file, path2plink=path2plink) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") name <- 'data' qcdir <- tempdir() path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # missingness check on all individuals in dataset run <- run_check_missingness(indir=indir, qcdir=qcdir, name=name, path2plink=path2plink) # missingness on subset of dataset remove_individuals_file <- system.file("extdata", "remove_individuals", package="plinkQC") run <- run_check_missingness(indir=indir, qcdir=qcdir, name=name, remove_individuals=remove_individuals_file, path2plink=path2plink) ## End(Not run)
Run plink –sexcheck to calculate the heterozygosity rate across X-chromosomal variants.
run_check_sex( indir, name, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
run_check_sex( indir, name, qcdir = indir, verbose = FALSE, path2plink = NULL, keep_individuals = NULL, remove_individuals = NULL, exclude_markers = NULL, extract_markers = NULL, showPlinkOutput = TRUE )
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.sexcheck as returned by plink –check-sex. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Both run_check_sex
and its evaluation
evaluate_check_sex
can simply be invoked by
check_sex
.
indir <- system.file("extdata", package="plinkQC") name <- 'data' qcdir <- tempdir() path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # simple sexcheck on all individuals in dataset run <- run_check_sex(indir=indir, qcdir=qcdir, name=name) # sexcheck on subset of dataset keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") run <- run_check_sex(indir=indir, qcdir=qcdir, name=name, keep_individuals=keep_individuals_file, path2plink=path2plink) ## End(Not run)
indir <- system.file("extdata", package="plinkQC") name <- 'data' qcdir <- tempdir() path2plink <- '/path/to/plink' # the following code is not run on package build, as the path2plink on the # user system is not known. ## Not run: # simple sexcheck on all individuals in dataset run <- run_check_sex(indir=indir, qcdir=qcdir, name=name) # sexcheck on subset of dataset keep_individuals_file <- system.file("extdata", "keep_individuals", package="plinkQC") run <- run_check_sex(indir=indir, qcdir=qcdir, name=name, keep_individuals=keep_individuals_file, path2plink=path2plink) ## End(Not run)
Test all elements of a list if they are numeric, positive numbers, integers or proportions (range 0-1).
testNumerics(numbers, positives = NULL, integers = NULL, proportions = NULL)
testNumerics(numbers, positives = NULL, integers = NULL, proportions = NULL)
numbers |
[list] whose elements are tested for being numeric. |
positives |
[list] whose elements are tested for being positive numbers. |
integers |
[list] whose elements are tested for being integers. |
proportions |
[list] whose elements are tested for being proportions. between 0 and 1. |