Title: Analysis of Massive SNP Arrays
Description: Easy-to-use, efficient, flexible and scalable tools for analyzing massive SNP arrays. Privé et al. (2018) <doi:10.1093/bioinformatics/bty185>.
Authors: Florian Privé [aut, cre], Michael Blum [ths], Hugues Aschard [ths], Bjarni Jóhann Vilhjálmsson [ths]
Maintainer: Florian Privé <[email protected]>
License: GPL-3
Version: 1.12.16
Built: 2024-10-24 06:01:09 UTC
Source: https://github.com/privefl/bigsnpr
For a bigSNP:

snp_pruning(): LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2" in PLINK. This function is deprecated (see this article).

snp_clumping() (and bed_clumping()): LD clumping. If you do not provide any statistic to rank SNPs, it uses minor allele frequencies (MAFs), making clumping similar to pruning.

snp_indLRLDR(): Get SNP indices of long-range LD regions for the human genome.
bed_clumping(obj.bed, ind.row = rows_along(obj.bed), S = NULL, thr.r2 = 0.2,
             size = 100 / thr.r2, exclude = NULL, ncores = 1)

snp_clumping(G, infos.chr, ind.row = rows_along(G), S = NULL, thr.r2 = 0.2,
             size = 100 / thr.r2, infos.pos = NULL, is.size.in.bp = NULL,
             exclude = NULL, ncores = 1)

snp_pruning(G, infos.chr, ind.row = rows_along(G), size = 49,
            is.size.in.bp = FALSE, infos.pos = NULL, thr.r2 = 0.2,
            exclude = NULL, nploidy = 2, ncores = 1)

snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

S: A vector of column statistics which express the importance of each SNP (the more important the SNP, the greater the corresponding statistic should be).

thr.r2: Threshold over the squared correlation between two SNPs. Default is 0.2.

size: For one SNP, window size around this SNP to compute correlations. Default is 100 / thr.r2.

exclude: Vector of SNP indices to exclude anyway. For example, can be used to exclude long-range LD regions (see Price2008). Another use can be for thresholding with respect to p-values associated with S.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

G: A FBM.code256 (typically <bigSNP>$genotypes).

infos.chr: Vector of integers specifying each SNP's chromosome.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

is.size.in.bp: Deprecated.

nploidy: Number of trials, parameter of the binomial distribution. Default is 2.

LD.regions: A data.frame specifying the long-range LD regions. Default is LD.wiki34.
snp_clumping() (and bed_clumping()): SNP indices that are kept.

snp_indLRLDR(): SNP indices to be used as (part of) the 'exclude' parameter of snp_clumping().
Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. doi:10.1016/j.ajhg.2008.06.005
test <- snp_attachExtdata()
G <- test$genotypes

# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G,
                         infos.chr = test$map$chromosome,
                         infos.pos = test$map$physical.pos,
                         thr.r2 = 0.1)

# keep most of them -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
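To also exclude long-range LD regions during clumping, here is a minimal sketch building on the example above (assuming the positions match the genome build of LD.wiki34):

# indices of variants falling in the 34 long-range LD regions
ind.LRLDR <- snp_indLRLDR(infos.chr = test$map$chromosome,
                          infos.pos = test$map$physical.pos)
ind.keep2 <- snp_clumping(G,
                          infos.chr = test$map$chromosome,
                          infos.pos = test$map$physical.pos,
                          thr.r2 = 0.1,
                          exclude = ind.LRLDR)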
Counts the number of 0s, 1s, 2s and NAs by variant in the bed file.
bed_counts(obj.bed, ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed),
           byrow = FALSE, ncores = 1)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

byrow: Whether to count by individual rather than by variant? Default is FALSE.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

A matrix with 4 rows and length(ind.col) columns.
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

bed_counts(obj.bed, ind.col = 1:5)

bed_counts(obj.bed, ind.row = 1:5, byrow = TRUE)
Cross-product between a "bed" object and a vector. Missing values are replaced by 0 (after centering), as if they had been imputed using parameter center.
bed_cprodVec(obj.bed, y.row, ind.row = rows_along(obj.bed),
             ind.col = cols_along(obj.bed), center = rep(0, length(ind.col)),
             scale = rep(1, length(ind.col)), ncores = 1)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

y.row: A vector of same size as ind.row.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

center: Vector of same length as ind.col; values to subtract from each column. Default is no centering (zeros).

scale: Vector of same length as ind.col; values to divide each column by. Default is no scaling (ones).

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

A vector of cross-products, of length length(ind.col).
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

y.row <- rep(1, nrow(obj.bed))
str(bed_cprodVec(obj.bed, y.row))
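As a sanity check of the centering behavior, a minimal sketch, assuming bed_cprodVec() computes t((X - center) / scale) %*% y.row with missing entries treated as the center value:

# with a vector of ones and no centering/scaling, this gives column sums
# of the genotype matrix (allele counts, NAs counting as 0)
str(bed_cprodVec(obj.bed, rep(1, nrow(obj.bed))))

# centering by twice the allele frequencies gives columns summing to ~0
af <- bed_MAF(obj.bed)$af
str(bed_cprodVec(obj.bed, rep(1, nrow(obj.bed)), center = 2 * af))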
Allele frequencies of a bed object.
bed_MAF(obj.bed, ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed),
        ncores = 1)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
A data.frame with
$ac: allele counts,
$mac: minor allele counts,
$af: allele frequencies,
$maf: minor allele frequencies,
$N: numbers of non-missing values.
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

bed_MAF(obj.bed, ind.col = 1:5)
Product between a "bed" object and a vector. Missing values are replaced by 0 (after centering), as if they had been imputed using parameter center.
bed_prodVec(obj.bed, y.col, ind.row = rows_along(obj.bed),
            ind.col = cols_along(obj.bed), center = rep(0, length(ind.col)),
            scale = rep(1, length(ind.col)), ncores = 1)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

y.col: A vector of same size as ind.col.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

center: Vector of same length as ind.col; values to subtract from each column. Default is no centering (zeros).

scale: Vector of same length as ind.col; values to divide each column by. Default is no scaling (ones).

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

A vector of products, of length length(ind.row).
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

y.col <- rep(1, ncol(obj.bed))
str(bed_prodVec(obj.bed, y.col))
Computing and projecting PCA of reference dataset to a target dataset.
bed_projectPCA(obj.bed.ref, obj.bed.new, k = 10,
               ind.row.new = rows_along(obj.bed.new),
               ind.row.ref = rows_along(obj.bed.ref),
               ind.col.ref = cols_along(obj.bed.ref),
               strand_flip = TRUE, join_by_pos = TRUE, match.min.prop = 0.5,
               build.new = "hg19", build.ref = "hg19", liftOver = NULL,
               ..., verbose = TRUE, ncores = 1)
obj.bed.ref: Object of type bed, which is the mapping of the bed file of the reference data. Use obj.bed <- bed(bedfile) to get this object.

obj.bed.new: Object of type bed, which is the mapping of the bed file of the target data. Use obj.bed <- bed(bedfile) to get this object.

k: Number of principal components to compute and project. Default is 10.

ind.row.new: Rows to be used in the target data. Default uses them all.

ind.row.ref: Rows to be used in the reference data. Default uses them all.

ind.col.ref: Columns to be potentially used in the reference data. Default uses all the ones in common with the target data.

strand_flip: Whether to try to flip strand? Default is TRUE.

join_by_pos: Whether to join by chromosome and position (default), or instead by rsid.

match.min.prop: Minimum proportion of variants in the smallest data to be matched, otherwise stops with an error. Default is 0.5.

build.new: Genome build of the target data. Default is "hg19".

build.ref: Genome build of the reference data. Default is "hg19".

liftOver: Path to the liftOver executable. Binaries can be downloaded at https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver for Mac and at https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver for Linux.

...: Arguments passed on to bed_autoSVD().

verbose: Output some information on the iterations? Default is TRUE.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
A list of 3 elements:
$obj.svd.ref: big_SVD object computed from reference data.
$simple_proj: simple projection of new data into space of reference PCA.
$OADP_proj: Online Augmentation, Decomposition, and Procrustes (OADP) projection of new data into space of reference PCA.
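No example ships in this section; here is a minimal sketch (not run), assuming the 1000 Genomes reference from download_1000G() is suitable as obj.bed.ref, and with a hypothetical path for the target data:

# not run: large download
bedfile.ref <- download_1000G("tmp-data")
obj.bed.ref <- bed(bedfile.ref)
obj.bed.new <- bed("path/to/target.bed")  # hypothetical path

proj <- bed_projectPCA(obj.bed.ref, obj.bed.new, k = 10)
PC.new <- proj$OADP_proj  # projected scores for the target individuals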
Projecting PCA using individuals from one dataset to other individuals from the same dataset.
bed_projectSelfPCA(obj.svd, obj.bed, ind.row, ind.col = attr(obj.svd, "subset"),
                   ncores = 1)

snp_projectSelfPCA(obj.svd, G, ind.row, ind.col = attr(obj.svd, "subset"),
                   ncores = 1)
obj.svd: List of class "big_SVD", typically the result of bed_autoSVD() computed on the individuals used to define the PCA.

obj.bed: Object of type bed, which is the mapping of the bed file of the data containing both the individuals that were used to compute the PCA and the other individuals to be projected.

ind.row: Rows (individuals) to be projected.

ind.col: Columns that were used for computing PCA. If bed_autoSVD was used, then attr(obj.svd, "subset") stores these indices and is the default.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

G: The FBM.code256 that was used to compute obj.svd.
A list of 3 elements:
$obj.svd.ref: big_SVD object computed from reference data.
$simple_proj: simple projection of new data into space of reference PCA.
$OADP_proj: Online Augmentation, Decomposition, and Procrustes (OADP) projection of new data into space of reference PCA.
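A minimal sketch of the intended workflow, assuming a PCA computed with bed_autoSVD() on a subset of individuals, then projecting the remaining ones:

bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

ind.train <- sort(sample(nrow(obj.bed), 400))    # rows used for the PCA
ind.test <- setdiff(rows_along(obj.bed), ind.train)

obj.svd <- bed_autoSVD(obj.bed, ind.row = ind.train, k = 5)
proj <- bed_projectSelfPCA(obj.svd, obj.bed, ind.row = ind.test)
str(proj$OADP_proj)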
Partial SVD (or PCA) of a genotype matrix stored as a PLINK (.bed) file.
bed_randomSVD(obj.bed, fun.scaling = bed_scaleBinom,
              ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed),
              k = 10, tol = 1e-04, verbose = FALSE, ncores = 1)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

fun.scaling: A function with parameters X (or obj.bed), ind.row and ind.col, returning a data.frame with $center and $scale for the columns in ind.col. Default uses binomial scaling (bed_scaleBinom). You can also provide your own center and scale by using as_scaling_fun().

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

k: Number of singular vectors/values to compute. Default is 10.

tol: Precision parameter of svds. Default is 1e-04.

verbose: Should some progress be printed? Default is FALSE.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
A named list (an S3 class "big_SVD") of
d, the singular values,
u, the left singular vectors,
v, the right singular vectors,
niter, the number of iterations of the algorithm,
nops, the number of matrix-vector multiplications used,
center, the centering vector,
scale, the scaling vector.
Note that to obtain the Principal Components, you must use predict on the result. See examples.
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

str(bed_randomSVD(obj.bed))
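As noted above, the principal components come from predict(); a minimal sketch:

obj.svd <- bed_randomSVD(obj.bed)
PC <- predict(obj.svd)  # PC scores, i.e. U %*% diag(d)
dim(PC)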
Binomial(2, p) scaling where p is estimated.
bed_scaleBinom(obj.bed, ind.row = rows_along(obj.bed),
               ind.col = cols_along(obj.bed), ncores = 1)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
You will probably not use this function as is but as parameter fun.scaling of other functions (e.g. bed_autoSVD and bed_randomSVD).

A data frame with $center and $scale.
This scaling is widely used for SNP arrays. Patterson N, Price AL, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190.
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

str(bed_scaleBinom(obj.bed))

str(bed_randomSVD(obj.bed, bed_scaleBinom))
Compute tcrossprod(G) (i.e. G multiplied by its transpose) from a bed object, with possible filtering and scaling of G. For example, this can be used to compute GRMs.
bed_tcrossprodSelf(obj.bed, fun.scaling = bed_scaleBinom,
                   ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed),
                   block.size = block_size(length(ind.row)))
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

fun.scaling: A function with parameters X (or obj.bed), ind.row and ind.col, returning a data.frame with $center and $scale for the columns in ind.col. Default uses binomial scaling (bed_scaleBinom). You can also provide your own center and scale by using as_scaling_fun().

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

block.size: Maximum number of columns read at once. Default uses block_size.
A temporary FBM, with the following two attributes:
a numeric vector center of column centering,
a numeric vector scale of column scaling.
Large matrix computations are made block-wise and are not parallelized, so as not to have to reduce the size of these blocks. Instead, you can link R to MKL or OpenBLAS to accelerate these block matrix computations. You can control the number of cores used by these optimized matrix libraries with bigparallelr::set_blas_ncores().
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

K <- bed_tcrossprodSelf(obj.bed)
K[1:4, 1:6] / ncol(obj.bed)
A reference class for storing a pointer to a mapped version of a bed file.
bed(bedfile)
bedfile: Path to file with extension ".bed" to read. You need the corresponding ".bim" and ".fam" in the same directory.
A bed object has many fields:
$address: address of the external pointer containing the underlying C++ object, to be used internally as a XPtr<bed> in C++ code
$extptr: use $address instead
$bedfile: path to the bed file
$bimfile: path to the corresponding bim file
$famfile: path to the corresponding fam file
$prefix: path without extension
$nrow: number of samples in the bed file
$ncol: number of variants in the bed file
$map: data frame read from $bimfile
$fam: data frame read from $famfile
$.map: use $map instead
$.fam: use $fam instead
$light: get a lighter version of this object for parallel algorithms, so as not to have to transfer e.g. $.map.
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")
(obj.bed <- bed(bedfile))
An S3 class for representing information on massive SNP arrays.
A named list with at least 3 slots:

genotypes: A FBM.code256 which is a special Filebacked Big Matrix encoded with type raw (one byte unsigned integer), representing genotype calls and possibly imputed allele dosages. Rows are individuals and columns are SNPs.

fam: A data.frame containing some information on the individuals (read from a ".fam" file).

map: A data.frame giving some information on the variants (read from a ".bim" file).
Coefficient to convert to the liability scale. E.g. h2_liab = coef * h2_obs.
coef_to_liab(K_pop, K_gwas = 0.5)
K_pop: Prevalence in the population.

K_gwas: Prevalence in the GWAS. You should provide this if you used (n_case + n_control) as GWAS sample size; keep the default of 0.5 if you used the effective sample size 4 / (1 / n_case + 1 / n_control).
Scaling coefficient to convert e.g. heritability to the liability scale.
h2 <- 0.2
h2 * coef_to_liab(0.02)
Download 1000 genomes project (phase 3) data in PLINK bed/bim/fam format, including 2490 (mostly unrelated) individuals and ~1.7M SNPs in common with either HapMap3 or the UK Biobank.
download_1000G(dir, overwrite = FALSE, delete_zip = TRUE)
dir: The directory where to put the downloaded files.

overwrite: Whether to overwrite files when downloading and unzipping? Default is FALSE.

delete_zip: Whether to delete the zip after decompressing the files in it? Default is TRUE.
The path of the downloaded bed file.
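A minimal sketch of the download-and-map workflow (not run; this is a large download):

# not run
bedfile <- download_1000G("tmp-data")
obj.bed <- bed(bedfile)
c(obj.bed$nrow, obj.bed$ncol)  # ~2490 individuals and ~1.7M variants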
Download Beagle 4.1 from https://faculty.washington.edu/browning/beagle/beagle.html
download_beagle(dir = tempdir())
dir: The directory where to put the Beagle Java Archive. Default is a temporary directory.
The path of the downloaded Beagle Java Archive.
This function uses linear interpolation, whereas snp_asGeneticPos() uses nearest neighbors.
download_genetic_map(type = c("hg19_OMNI", "hg19_hapmap", "hg38_price"),
                     dir, ncores = 1)

snp_asGeneticPos2(infos.chr, infos.pos, genetic_map)
type: Which genetic map to download.

dir: Directory where to download and decompress files.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

infos.chr: Vector of integers specifying each SNP's chromosome.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

genetic_map: A data frame with 3 columns: chr, pos, and pos_cM.
The hg19 genetic maps are downloaded from https://github.com/joepickrell/1000-genomes-genetic-maps/ while the hg38 one is downloaded from https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/.
download_genetic_map(): A data frame with 3 columns: chr, pos, and pos_cM.

snp_asGeneticPos2(): The new vector of genetic positions.
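A minimal sketch chaining the two functions (not run: this downloads the maps), assuming the physical positions match the build of the chosen map:

# not run: downloads and reads the "hg19_OMNI" genetic maps
genetic_map <- download_genetic_map("hg19_OMNI", dir = "tmp-data")

test <- snp_attachExtdata()
POS2 <- snp_asGeneticPos2(test$map$chromosome, test$map$physical.pos,
                          genetic_map = genetic_map)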
Download PLINK 1.9 from https://www.cog-genomics.org/plink2.
Download PLINK 2.0 from https://www.cog-genomics.org/plink/2.0/.
download_plink(dir = tempdir(), overwrite = FALSE, verbose = TRUE)

download_plink2(dir = tempdir(), AVX2 = TRUE, ARM = FALSE, AMD = FALSE,
                overwrite = FALSE, verbose = TRUE)
dir: The directory where to put the PLINK executable. Default is a temporary directory.

overwrite: Whether to overwrite file? Default is FALSE.

verbose: Whether to output details of downloading. Default is TRUE.

AVX2: Whether to download the AVX2 version? This is only available for 64-bit architectures. Default is TRUE.

ARM: Whether to download an ARM version. Default is FALSE.

AMD: Whether to download an AMD version. Default is FALSE.
The path of the downloaded PLINK executable.
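A minimal sketch, e.g. to get executables for use with snp_beagleImpute() later:

plink <- download_plink()                # path to the PLINK 1.9 executable
plink2 <- download_plink2(AVX2 = FALSE)  # more portable PLINK 2.0 build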
34 long-range Linkage Disequilibrium (LD) regions for the human genome based on this wiki table.
LD.wiki34
A data frame with 34 rows (regions) and 4 variables:
Chr: region's chromosome
Start: starting position of the region (in bp)
Stop: stopping position of the region (in bp)
ID: some ID of the region.
Determine reference divergence while accounting for strand flips. This does not remove ambiguous alleles.
same_ref(ref1, alt1, ref2, alt2)
ref1: The reference alleles of the first dataset.

alt1: The alternative alleles of the first dataset.

ref2: The reference alleles of the second dataset.

alt2: The alternative alleles of the second dataset.
A logical vector indicating whether the reference alleles are the same. Missing values can result from missing values in the inputs or from ambiguous matching (e.g. matching A/C and A/G).
same_ref(ref1 = c("A", "C", "T", "G", NA),
         alt1 = c("C", "T", "C", "A", "A"),
         ref2 = c("A", "C", "A", "A", "C"),
         alt2 = c("C", "G", "G", "G", "A"))
Polygenic Risk Scores for a grid of clumping and thresholding parameters.
Stacking over many Polygenic Risk Scores, corresponding to a grid of many different parameters for clumping and thresholding.
snp_grid_clumping(G, infos.chr, infos.pos, lpS, ind.row = rows_along(G),
                  grid.thr.r2 = c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95),
                  grid.base.size = c(50, 100, 200, 500),
                  infos.imp = rep(1, ncol(G)), grid.thr.imp = 1,
                  groups = list(cols_along(G)), exclude = NULL, ncores = 1)

snp_grid_PRS(G, all_keep, betas, lpS, n_thr_lpS = 50,
             grid.lpS.thr = 0.9999 * seq_log(max(0.1, min(lpS, na.rm = TRUE)),
                                             max(lpS, na.rm = TRUE), n_thr_lpS),
             ind.row = rows_along(G), backingfile = tempfile(),
             type = c("float", "double"), ncores = 1)

snp_grid_stacking(multi_PRS, y.train, alphas = c(1, 0.01, 1e-04), ncores = 1, ...)
G: A FBM.code256 (typically <bigSNP>$genotypes).

infos.chr: Vector of integers specifying each SNP's chromosome.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

lpS: Numeric vector of -log10(p-values) associated with the variants.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

grid.thr.r2: Grid of thresholds over the squared correlation between two SNPs for clumping. Default is c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95).

grid.base.size: Grid for base window sizes. Sizes are then computed as base.size / thr.r2. Default is c(50, 100, 200, 500).

infos.imp: Vector of imputation scores. Default is all 1 (no filtering on imputation quality).

grid.thr.imp: Grid of thresholds over infos.imp. Default is 1.

groups: List of vectors of indices to define your own categories. This could be used e.g. to derive C+T scores using two different GWAS summary statistics, or to include other information such as functional annotations. Default just makes one group with all variants.

exclude: Vector of SNP indices to exclude anyway.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

all_keep: Output of snp_grid_clumping().

betas: Numeric vector of weights (effect sizes from GWAS) associated with each variant (column of G).

n_thr_lpS: Length for the default grid.lpS.thr. Default is 50.

grid.lpS.thr: Sequence of thresholds to apply on lpS. Default is a sequence of length n_thr_lpS, evenly spaced on a log scale.

backingfile: Prefix for backingfiles where to store scores of C+T. As we typically use a large grid, this can result in a large matrix so that we store it on disk. Default uses a temporary file.

type: Type of backingfile values. Either "float" (the default) or "double".

multi_PRS: Output of snp_grid_PRS().

y.train: Vector of phenotypes. If there are two levels (binary 0/1), it uses bigstatsr::big_spLogReg() for stacking; otherwise, bigstatsr::big_spLinReg().

alphas: Vector of values for grid-search. See bigstatsr::big_spLogReg(). Default is c(1, 0.01, 1e-04).

...: Other parameters to be passed to bigstatsr::big_spLogReg().
snp_grid_PRS(): An FBM (matrix on disk) that stores the C+T scores for all parameters of the grid (and for each chromosome separately). It also stores as attributes the input parameters all_keep, betas, lpS and grid.lpS.thr that are also needed in snp_grid_stacking().
Sequence, evenly spaced on a logarithmic scale
seq_log(from, to, length.out)
from, to: the starting and (maximal) end values of the sequence. Of length 1.

length.out: desired length of the sequence. A non-negative number, rounded up if fractional.
A sequence of length length.out, evenly spaced on a logarithmic scale between from and to.
seq_log(1, 1000, 4)
seq_log(1, 100, 5)
Estimation of ancestry proportions. Make sure to match summary statistics using snp_match() (and to reverse frequencies correspondingly).
snp_ancestry_summary(freq, info_freq_ref, projection, correction,
                     min_cor = 0.4, sum_to_one = TRUE)
freq: Vector of frequencies from which to estimate ancestry proportions.

info_freq_ref: A data frame (or matrix) with the set of frequencies to be used as reference (one population per column).

projection: Matrix of "loadings" for each variant/PC to be used to project allele frequencies.

correction: Coefficients to correct for shrinkage when projecting.

min_cor: Minimum correlation between observed and predicted frequencies. Default is 0.4. When correlation is lower, an error is returned. For individual genotypes, this should be larger than 0.6. For allele frequencies, this should be larger than 0.9.

sum_to_one: Whether to force ancestry coefficients to sum to 1? Default is TRUE.
Vector of coefficients representing the ancestry proportions. Also (as attributes) cor_each, the correlation between input frequencies and each column of reference frequencies, and cor_pred, the correlation between input and predicted frequencies.
## Not run:
# GWAS summary statistics for Epilepsy (supposedly in EUR+EAS+AFR)
gz <- runonce::download_file(
  "http://www.epigad.org/gwas_ilae2018_16loci/all_epilepsy_METAL.gz",
  dir = "tmp-data")
readLines(gz, n = 3)

library(dplyr)
sumstats <- bigreadr::fread2(
  gz,
  select = c("CHR", "BP", "Allele2", "Allele1", "Freq1"),
  col.names = c("chr", "pos", "a0", "a1", "freq")
) %>%
  mutate_at(3:4, toupper)
# It is a good idea to filter for similar per-variant N (when available..)

all_freq <- bigreadr::fread2(
  runonce::download_file("https://figshare.com/ndownloader/files/31620968",
                         dir = "tmp-data", fname = "ref_freqs.csv.gz"))
projection <- bigreadr::fread2(
  runonce::download_file("https://figshare.com/ndownloader/files/31620953",
                         dir = "tmp-data", fname = "projection.csv.gz"))

matched <- snp_match(
  mutate(sumstats, chr = as.integer(chr), beta = 1),
  all_freq[1:5],
  return_flip_and_rev = TRUE
) %>%
  mutate(freq = ifelse(`_REV_`, 1 - freq, freq))

res <- snp_ancestry_summary(
  freq = matched$freq,
  info_freq_ref = all_freq[matched$`_NUM_ID_`, -(1:5)],
  projection = projection[matched$`_NUM_ID_`, -(1:5)],
  correction = c(1, 1, 1, 1.008, 1.021, 1.034, 1.052, 1.074, 1.099, 1.123,
                 1.15, 1.195, 1.256, 1.321, 1.382, 1.443)
)

# Some ancestry groups are very close to each other, and should be merged
group <- colnames(all_freq)[-(1:5)]
group[group %in% c("Scandinavia", "United Kingdom", "Ireland")] <-
  "Europe (North West)"
group[group %in% c("Europe (South East)", "Europe (North East)")] <-
  "Europe (East)"
tapply(res, factor(group, unique(group)), sum)
## End(Not run)
Use genetic maps available at https://github.com/joepickrell/1000-genomes-genetic-maps/ to interpolate physical positions (in bp) to genetic positions (in cM).
snp_asGeneticPos(infos.chr, infos.pos, dir = tempdir(), ncores = 1,
                 rsid = NULL, type = c("OMNI", "hapmap"))
infos.chr: Vector of integers specifying each SNP's chromosome.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

dir: Directory where to download and decompress files. Default is tempdir().

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

rsid: If providing rsIDs, the matching is performed using those (instead of positions) and variants not matched are interpolated using spline interpolation of variants that have been matched.

type: Whether to use the genetic maps interpolated from "OMNI" (the default), or from "hapmap".
The new vector of genetic positions.
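A minimal sketch on the example data (not run: this downloads genetic maps into dir):

test <- snp_attachExtdata()
POS2 <- snp_asGeneticPos(test$map$chromosome, test$map$physical.pos,
                         dir = "tmp-data")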
Load a bigSNP from backing files into R.
snp_attach(rdsfile)
rdsfile: The path of the ".rds" file which stores the bigSNP object.
This is often just a call to readRDS. But it also checks if you have moved the two (".bk" and ".rds") backing files to another directory.

The bigSNP object.
(bedfile <- system.file("extdata", "example.bed", package = "bigsnpr"))

# Reading the bedfile and storing the data in temporary directory
rds <- snp_readBed(bedfile, backingfile = tempfile())

# Loading the data from backing files
test <- snp_attach(rds)
str(test)
dim(G <- test$genotypes)
G[1:8, 1:8]
Attach a "bigSNP" for examples and tests
snp_attachExtdata(bedfile = c("example.bed", "example-missing.bed"))
bedfile: Name of one example bed file. Either "example.bed" (the default) or "example-missing.bed".
The example "bigSNP", filebacked in the "/tmp/" directory.
Fast truncated SVD with initial pruning and that iteratively removes long-range LD regions. Some variants are removed due to the initial clumping, then more and more variants are removed at each iteration. You can access the indices of the remaining variants with attr(*, "subset"). If some of the variants removed are contiguous, the regions are reported in attr(*, "lrldr").
snp_autoSVD(G, infos.chr, infos.pos = NULL, ind.row = rows_along(G),
            ind.col = cols_along(G), fun.scaling = snp_scaleBinom(),
            thr.r2 = 0.2, size = 100/thr.r2, k = 10, roll.size = 50,
            int.min.size = 20, alpha.tukey = 0.05, min.mac = 10,
            min.maf = 0.02, max.iter = 5, is.size.in.bp = NULL,
            ncores = 1, verbose = TRUE)

bed_autoSVD(obj.bed, ind.row = rows_along(obj.bed),
            ind.col = cols_along(obj.bed), fun.scaling = bed_scaleBinom,
            thr.r2 = 0.2, size = 100/thr.r2, k = 10, roll.size = 50,
            int.min.size = 20, alpha.tukey = 0.05, min.mac = 10,
            min.maf = 0.02, max.iter = 5, ncores = 1, verbose = TRUE)
G: A FBM.code256 (typically <bigSNP>$genotypes).

infos.chr: Vector of integers specifying each SNP's chromosome.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

fun.scaling: A function with parameters X (or obj.bed), ind.row and ind.col, returning a data.frame with $center and $scale for the columns in ind.col. Default uses binomial scaling. You can also provide your own center and scale by using as_scaling_fun().

thr.r2: Threshold over the squared correlation between two variants. Default is 0.2.

size: For one SNP, window size around this SNP to compute correlations. Default is 100 / thr.r2.

k: Number of singular vectors/values to compute. Default is 10.

roll.size: Radius of rolling windows to smooth log-p-values. Default is 50.

int.min.size: Minimum number of consecutive outlier variants in order to be reported as a long-range LD region. Default is 20.

alpha.tukey: Significance level for outlier detection. Default is 0.05.

min.mac: Minimum minor allele count (MAC) for variants to be included. Default is 10.

min.maf: Minimum minor allele frequency (MAF) for variants to be included. Default is 0.02.

max.iter: Maximum number of iterations of outlier detection. Default is 5.

is.size.in.bp: Deprecated.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

verbose: Output some information on the iterations? Default is TRUE.

obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.
If you don't have any information about variants, you can try using
infos.chr = rep(1, ncol(G)),
size = ncol(G) (if variants are not sorted),
roll.size = 0 (if variants are not sorted).
A named list (an S3 class "big_SVD") of
d, the singular values,
u, the left singular vectors,
v, the right singular vectors,
niter, the number of iterations of the algorithm,
nops, the number of matrix-vector multiplications used,
center, the centering vector,
scale, the scaling vector.
Note that to obtain the Principal Components, you must use predict on the result. See examples.
ex <- snp_attachExtdata()
G <- ex$genotypes

obj.svd <- snp_autoSVD(G,
                       infos.chr = ex$map$chromosome,
                       infos.pos = ex$map$physical.pos)

str(obj.svd)
Imputation using Beagle version 4.
snp_beagleImpute(beagle.path, plink.path, bedfile.in, bedfile.out = NULL,
                 memory.max = 3, ncores = 1, extra.options = "",
                 plink.options = "", verbose = TRUE)
beagle.path: Path to the executable of Beagle v4+.

plink.path: Path to the executable of PLINK 1.9.

bedfile.in: Path to the input bedfile.

bedfile.out: Path to the output bedfile. Default is created by appending "_impute" to the prefix of bedfile.in (its path without the ".bed" extension).

memory.max: Max memory (in GB) to be used. It is internally rounded to be an integer. Default is 3.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

extra.options: Other options to be passed to Beagle as a string. More options can be found at Beagle's website.

plink.options: Other options to be passed to PLINK as a string. More options can be found at https://www.cog-genomics.org/plink2/filter.

verbose: Whether to show PLINK log? Default is TRUE.
Downloads and more information can be found at the Beagle and PLINK websites.
The path of the new bedfile.
Browning, Brian L., and Sharon R. Browning. "Genotype imputation with millions of reference samples." The American Journal of Human Genetics 98.1 (2016): 116-126.
See also download_plink and download_beagle.
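A minimal sketch wiring the downloads together (not run; Beagle requires Java, and the input bed file path here is hypothetical):

beagle <- download_beagle()
plink <- download_plink()
bedfile.imp <- snp_beagleImpute(beagle, plink,
                                bedfile.in = "path/to/data.bed")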
Get significant (Pearson) correlations between nearby SNPs of the same chromosome (p-values are computed using a two-sided t-test).
snp_cor(Gna, ind.row = rows_along(Gna), ind.col = cols_along(Gna), size = 500,
        alpha = 1, thr_r2 = 0, fill.diag = TRUE, infos.pos = NULL, ncores = 1)

bed_cor(obj.bed, ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed),
        size = 500, alpha = 1, thr_r2 = 0, fill.diag = TRUE, infos.pos = NULL,
        ncores = 1)
Gna: A FBM.code256 (typically <bigSNP>$genotypes); missing values are allowed.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

size: For one SNP, window size around this SNP to compute correlations. Default is 500.

alpha: Type-I error for testing correlations. Default is 1 (no test is applied).

thr_r2: Threshold to apply on squared correlations. Default is 0.

fill.diag: Whether to fill the diagonal with 1s (the default) or to keep it as 0s.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.
The (Pearson) correlation matrix. This is a sparse symmetric matrix.
test <- snp_attachExtdata()
G <- test$genotypes

corr <- snp_cor(G, ind.col = 1:1000)
corr[1:10, 1:10]

# Sparsity
length(corr@x) / length(corr)
Fast imputation algorithm based on local XGBoost models.
snp_fastImpute(Gna, infos.chr, alpha = 1e-04, size = 200, p.train = 0.8,
               n.cor = nrow(Gna), seed = NA, ncores = 1)
Gna: A FBM.code256 (typically <bigSNP>$genotypes); missing values are allowed.

infos.chr: Vector of integers specifying each SNP's chromosome.

alpha: Type-I error for testing correlations. Default is 1e-04.

size: Number of neighbor SNPs to be possibly included in the model imputing this particular SNP. Default is 200.

p.train: Proportion of non-missing genotypes that are used for training the imputation model while the rest is used to assess the accuracy of this imputation model. Default is 0.8.

n.cor: Number of rows that are used to estimate correlations. Default uses them all.

seed: An integer, for reproducibility. Default doesn't use seeds.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
An FBM with
the proportion of missing values by SNP (first row),
the estimated proportion of imputation errors by SNP (second row).
## Not run:
fake <- snp_attachExtdata("example-missing.bed")
G <- fake$genotypes
CHR <- fake$map$chromosome

infos <- snp_fastImpute(G, CHR)
infos[, 1:5]

# Still missing values
big_counts(G, ind.col = 1:10)
# You need to change the code of G
# To make this permanent, you need to save (modify) the file on disk
fake$genotypes$code256 <- CODE_IMPUTE_PRED
fake <- snp_save(fake)
big_counts(fake$genotypes, ind.col = 1:10)

# Plot for post-checking
## Here there is no SNP with more than 1% error (estimated)
pvals <- c(0.01, 0.005, 0.002, 0.001); colvals <- 2:5
df <- data.frame(pNA = infos[1, ], pError = infos[2, ])

# base R
plot(subset(df, pNA > 0.001), pch = 20)
idc <- lapply(seq_along(pvals), function(i) {
  curve(pvals[i] / x, from = 0, lwd = 2, col = colvals[i], add = TRUE)
})
legend("topright", legend = pvals, title = "p(NA & Error)",
       col = colvals, lty = 1, lwd = 2)

# ggplot2
library(ggplot2)
Reduce(function(p, i) {
  p + stat_function(fun = function(x) pvals[i] / x, color = colvals[i])
}, x = seq_along(pvals), init = ggplot(df, aes(pNA, pError))) +
  geom_point() +
  coord_cartesian(ylim = range(df$pError, na.rm = TRUE)) +
  theme_bigstatsr()
## End(Not run)
Fast imputation via mode, mean, sampling according to allele frequencies, or 0.
snp_fastImputeSimple(Gna, method = c("mode", "mean0", "mean2", "random"),
                     ncores = 1)
Gna: A FBM.code256 (typically <bigSNP>$genotypes); missing values are allowed.

method: Either "mode" (the default), "mean0" (rounded mean), "mean2" (mean rounded to 2 decimal places), or "random" (sampling genotypes according to allele frequencies).

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
A new FBM.code256 object (same file, but different code).
bigsnp <- snp_attachExtdata("example-missing.bed")
G <- bigsnp$genotypes
G[, 2]  # some missing values

G2 <- snp_fastImputeSimple(G)
G2[, 2]  # no missing values anymore
G[, 2]   # imputed, but still returning missing values
G$copy(code = CODE_IMPUTE_PRED)[, 2]  # need to decode imputed values
G$copy(code = c(0, 1, 2, rep(0, 253)))[, 2]  # "imputation" by 0
Fixation index (Fst), either per variant, or genome-wide
snp_fst(list_df_af, min_maf = 0, overall = FALSE)
list_df_af: List of data frames with $af (allele frequencies) and $N (numbers of non-missing values), one per population; typically outputs of bed_MAF().

min_maf: Minimum MAF threshold (for the average of populations) to be included in the final results. Default is 0 (no filtering).

overall: Whether to compute Fst genome-wide (TRUE) or per variant (FALSE, the default).
If overall, then one value, otherwise a value for each variant with missing values for the variants not passing min_maf. This should be equivalent to using '--fst --within' in PLINK.
Weir, B. S., & Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution, 1358-1370.
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

pop <- rep(1:3, c(143, 167, 207))
ind_pop <- split(seq_along(pop), pop)
list_df_af <- lapply(ind_pop, function(ind) bed_MAF(obj.bed, ind.row = ind))

snp_fst(list_df_af)
snp_fst(list_df_af[c(1, 2)], overall = TRUE)
snp_fst(list_df_af[c(1, 3)], overall = TRUE)
snp_fst(list_df_af[c(3, 2)], overall = TRUE)
Genomic Control
snp_gc(gwas)
gwas: A mhtest object with the p-values associated with each SNP, e.g. the output of big_univLinReg(), big_univLogReg() or snp_pcadapt().
A ggplot2 object. You can plot it using the print method. You can modify it as you wish by adding layers. You might want to read this chapter to get more familiar with the package ggplot2.
Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997-1004.
set.seed(9)
test <- snp_attachExtdata()
G <- test$genotypes
y <- rnorm(nrow(G))

gwas <- big_univLinReg(G, y)
snp_qq(gwas)
gwas_gc <- snp_gc(gwas)  # this modifies `attr(gwas_gc, "transfo")`
snp_qq(gwas_gc)

# The next plot should be prettier with a real dataset
snp_manhattan(gwas_gc,
              infos.chr = test$map$chromosome,
              infos.pos = test$map$physical.pos) +
  ggplot2::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red")

p <- snp_qq(gwas_gc) +
  ggplot2::aes(text = asPlotlyText(test$map)) +
  ggplot2::labs(subtitle = NULL, x = "Expected -log10(p)",
                y = "Observed -log10(p)")
## Not run: plotly::ggplotly(p, tooltip = "text")
Get information of individuals by matching from an external file.
snp_getSampleInfos(x, df.or.files, col.family.ID = 1, col.sample.ID = 2,
                   col.infos = -c(1, 2), pair.sep = "-_-", ...)
x: A bigSNP.

df.or.files: Either a data.frame with the information, or a character vector of file names where to read it (files are read with bigreadr::fread2()).

col.family.ID: Index of the column containing the family IDs to match with those of the study. Default is the first one.

col.sample.ID: Index of the column containing the sample IDs to match with those of the study. Default is the second one.

col.infos: Indices of the columns containing the information you want. Default is all but the first and the second columns.

pair.sep: Separator used for concatenation of family and sample IDs to make unique IDs for matching between the two datasets. Default is "-_-".

...: Any additional parameter to pass to bigreadr::fread2().
The requested information as a data.frame.
test <- snp_attachExtdata()
table(test$fam$family.ID)

# Get populations clusters from external files
files <- system.file("extdata", paste0("cluster", 1:3), package = "bigsnpr")
bigreadr::fread2(files[1])
bigreadr::fread2(files[1], header = FALSE)  # need header option here
infos <- snp_getSampleInfos(test, files, header = FALSE)
table(infos[[1]])
lassosum2
snp_lassosum2(corr, df_beta, delta = c(0.001, 0.01, 0.1, 1), nlambda = 30,
              lambda.min.ratio = 0.01, dfmax = 2e+05, maxiter = 1000,
              tol = 1e-05, ind.corr = cols_along(corr), ncores = 1)
corr: Sparse correlation matrix as an SFBM. If you have a standard sparse (symmetric) matrix, you can convert it with bigsparser::as_SFBM().

df_beta: A data frame with 3 columns: $beta (effect sizes), $beta_se (standard errors of these effect sizes), and $n_eff (effective GWAS sample size).

delta: Vector of shrinkage parameters to try (L2-regularization). Default is c(0.001, 0.01, 0.1, 1).

nlambda: Number of different lambdas to try (L1-regularization). Default is 30.

lambda.min.ratio: Ratio between last and first lambdas to try. Default is 0.01.

dfmax: Maximum number of non-zero effects in the model. Default is 2e+05.

maxiter: Maximum number of iterations before convergence. Default is 1000.

tol: Tolerance parameter for assessing convergence. Default is 1e-05.

ind.corr: Indices to "subset" corr. Default doesn't use any subsetting.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().
A matrix of effect sizes, one vector (column) for each row in attr(<res>, "grid_param"). Missing values are returned when strong divergence is detected.
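No example ships in this section; a minimal sketch on the toy data, building corr and df_beta the same way as in the snp_ldsc() example below (bigsparser::as_SFBM() converts the sparse matrix to an SFBM):

bigsnp <- snp_attachExtdata()
G <- bigsnp$genotypes
y <- bigsnp$fam$affection - 1

corr <- bigsparser::as_SFBM(snp_cor(G, ind.col = 1:1000))

gwas <- big_univLogReg(G, y, ind.col = 1:1000)
df_beta <- data.frame(beta = gwas$estim, beta_se = gwas$std.err,
                      n_eff = 4 / (1 / sum(y == 0) + 1 / sum(y == 1)))

beta_grid <- snp_lassosum2(corr, df_beta)
str(attr(beta_grid, "grid_param"))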
LD scores
snp_ld_scores(Gna, ind.row = rows_along(Gna), ind.col = cols_along(Gna),
              size = 500, infos.pos = NULL, ncores = 1)

bed_ld_scores(obj.bed, ind.row = rows_along(obj.bed),
              ind.col = cols_along(obj.bed), size = 500, infos.pos = NULL,
              ncores = 1)
Gna: A FBM.code256 (typically <bigSNP>$genotypes); missing values are allowed.

ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.

ind.col: An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.

size: For one SNP, window size around this SNP to compute correlations. Default is 500.

infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.
A vector of LD scores. For each variant, this is the sum of squared correlations with the neighboring variants (including itself).
test <- snp_attachExtdata()
G <- test$genotypes
(ld <- snp_ld_scores(G, ind.col = 1:1000))
LDpred2. Tutorial at https://privefl.github.io/bigsnpr/articles/LDpred2.html.
snp_ldpred2_inf(corr, df_beta, h2)

snp_ldpred2_grid(corr, df_beta, grid_param, burn_in = 50, num_iter = 100,
                 ncores = 1, return_sampling_betas = FALSE,
                 ind.corr = cols_along(corr))

snp_ldpred2_auto(corr, df_beta, h2_init, vec_p_init = 0.1, burn_in = 500,
                 num_iter = 200, sparse = FALSE, verbose = FALSE,
                 report_step = num_iter + 1L, allow_jump_sign = TRUE,
                 shrink_corr = 1, use_MLE = TRUE, p_bounds = c(1e-05, 1),
                 alpha_bounds = c(-1.5, 0.5), ind.corr = cols_along(corr),
                 ncores = 1)
corr: Sparse correlation matrix as an SFBM. If you have a standard sparse (symmetric) matrix, you can convert it with bigsparser::as_SFBM().

df_beta: A data frame with 3 columns: $beta (effect sizes), $beta_se (standard errors of these effect sizes), and $n_eff (effective GWAS sample size).

h2: Heritability estimate.

grid_param: A data frame with 3 columns as a grid of hyper-parameters: $p (proportion of causal variants), $h2 (heritability), and $sparse (whether to enable sparsity).

burn_in: Number of burn-in iterations.

num_iter: Number of iterations after burn-in.

ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

return_sampling_betas: Whether to return all sampling betas (after burn-in)? This is useful for assessing the uncertainty of the PRS at the individual level (see doi:10.1101/2020.11.30.403188). Default is FALSE.

ind.corr: Indices to "subset" corr. Default doesn't use any subsetting.

h2_init: Heritability estimate for initialization.

vec_p_init: Vector of initial values for p. Default is 0.1.

sparse: In LDpred2-auto, whether to also report a sparse solution by running LDpred2-grid with the estimates of p and h2 from LDpred2-auto, and sparsity enabled. Default is FALSE.

verbose: Whether to print "p // h2" estimates at each iteration. Disabled when parallelism is used.

report_step: Step to report sampling betas (after burn-in and before unscaling). Nothing is reported by default (report_step = num_iter + 1); if using a smaller step, one sampled vector of betas is reported every report_step iterations (in $sample_beta).

allow_jump_sign: Whether to allow effect sizes to change sign in consecutive iterations? Default is TRUE.

shrink_corr: Shrinkage multiplicative coefficient to apply to off-diagonal elements of the correlation matrix. Default is 1.

use_MLE: Whether to use maximum likelihood estimation (MLE) to estimate alpha and the variance component (since v1.11.4), or assume that alpha is -1 and estimate the variance of (scaled) effects as h2/(m*p), as it was done in earlier versions of LDpred2-auto (e.g. in v1.10.8). Default is TRUE.

p_bounds: Boundaries for the estimates of p (the polygenicity). Default is c(1e-05, 1).

alpha_bounds: Boundaries for the estimates of alpha. Default is c(-1.5, 0.5).
For reproducibility, set.seed() can be used to ensure that two runs of LDpred2 give the exact same results (since v1.10).
snp_ldpred2_inf: A vector of effects, assuming an infinitesimal model.
snp_ldpred2_grid: A matrix of effect sizes, one vector (column) for each row of grid_param. Missing values are returned when strong divergence is detected. If using return_sampling_betas, each column corresponds to one iteration instead (after burn-in).
snp_ldpred2_auto: A list (over vec_p_init) of lists with
$beta_est: vector of effect sizes (on the allele scale); note that missing values are returned when strong divergence is detected
$beta_est_sparse (only when sparse = TRUE): sparse vector of effect sizes
$postp_est: vector of posterior probabilities of being causal
$corr_est: the "imputed" correlations between variants and phenotypes, which can be used for post-QCing variants by comparing those to with(df_beta, beta / sqrt(n_eff * beta_se^2 + beta^2))
$sample_beta: sparse matrix of sampling betas (see parameter report_step), not on the allele scale, for which you need to multiply by with(df_beta, sqrt(n_eff * beta_se^2 + beta^2))
$path_p_est: full path of p estimates (including burn-in); useful to check convergence of the iterative algorithm
$path_h2_est: full path of h2 estimates (including burn-in); useful to check convergence of the iterative algorithm
$path_alpha_est: full path of alpha estimates (including burn-in); useful to check convergence of the iterative algorithm
$h2_est: estimate of the (SNP) heritability (also see coef_to_liab)
$p_est: estimate of p, the proportion of causal variants
$alpha_est: estimate of alpha, the parameter controlling the relationship between allele frequencies and expected effect sizes
$h2_init and $p_init: input parameters, for convenience
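This page has no runnable example, so here is a minimal sketch on the package's example data, reusing the summary statistics of the snp_ldsc2 example below; the 1000-variant subset, the single short auto chain, and the guard on the heritability estimate are illustrative choices for this toy dataset, not recommendations for real data.
bigsnp <- snp_attachExtdata()
G <- bigsnp$genotypes
y <- bigsnp$fam$affection - 1
# GWAS on the first 1000 variants, as in the snp_ldsc2 example below
gwas <- big_univLogReg(G, y, ind.col = 1:1000)
df_beta <- data.frame(beta = gwas$estim, beta_se = gwas$std.err,
                      n_eff = 4 / (1 / sum(y == 0) + 1 / sum(y == 1)))
# sparse correlation matrix, then the SFBM format expected by LDpred2
corr0 <- snp_cor(G, ind.col = 1:1000)
corr <- bigsparser::as_SFBM(corr0)
# heritability estimate from LD score regression, used for initialization
h2_est <- max(snp_ldsc2(corr0, df_beta)[["h2"]], 0.2)  # guard for this toy data
# infinitesimal model, and one (illustrative) LDpred2-auto chain
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est, vec_p_init = 0.1)
pred_auto <- big_prodVec(G, auto[[1]]$beta_est, ind.col = 1:1000)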
LD score regression
snp_ldsc( ld_score, ld_size, chi2, sample_size, blocks = 200, intercept = NULL, chi2_thr1 = 30, chi2_thr2 = Inf, ncores = 1 ) snp_ldsc2( corr, df_beta, blocks = NULL, intercept = 1, ncores = 1, ind.beta = cols_along(corr), chi2_thr1 = 30, chi2_thr2 = Inf )
ld_score | Vector of LD scores. |
ld_size | Number of variants used to compute these LD scores. |
chi2 | Vector of chi-squared statistics. |
sample_size | Sample size of GWAS corresponding to chi-squared statistics. Possibly a vector, or just a single value. |
blocks | Either a single number specifying the number of blocks (used in the jackknife to estimate standard errors), or a vector of integers specifying the block number of each chi2 value. Default is 200 for snp_ldsc, and NULL for snp_ldsc2 (then no standard errors are reported). |
intercept | You can constrain the intercept to some value (e.g. 1). Default is NULL in snp_ldsc (the intercept is estimated) and 1 in snp_ldsc2. |
chi2_thr1 | Threshold on chi2 in step 1. Default is 30. |
chi2_thr2 | Threshold on chi2 in step 2. Default is Inf (none). |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
corr | Sparse correlation matrix. Can also be an SFBM. |
df_beta | A data frame with 3 columns: "beta" (effect size estimates), "beta_se" (standard errors) and "n_eff" (effective GWAS sample size). |
ind.beta | Indices in corr corresponding to df_beta. Default is cols_along(corr), i.e. all variants. |
Vector of 4 values (or only the first 2 if blocks = NULL):
[["int"]]: LDSC regression intercept,
[["int_se"]]: SE of this intercept,
[["h2"]]: LDSC regression estimate of (SNP) heritability (also see coef_to_liab),
[["h2_se"]]: SE of this heritability estimate.
bigsnp <- snp_attachExtdata() G <- bigsnp$genotypes y <- bigsnp$fam$affection - 1 corr <- snp_cor(G, ind.col = 1:1000) gwas <- big_univLogReg(G, y, ind.col = 1:1000) df_beta <- data.frame(beta = gwas$estim, beta_se = gwas$std.err, n_eff = 4 / (1 / sum(y == 0) + 1 / sum(y == 1))) snp_ldsc2(corr, df_beta) snp_ldsc2(corr, df_beta, blocks = 20, intercept = NULL)
Split a correlation matrix into blocks that are as independent as possible. This finds the split into blocks that minimizes the sum of squared correlations between these blocks (i.e. everything outside the blocks). In case of equivalent splits, it then minimizes the sum of squared sizes of the blocks.
snp_ldsplit( corr, thr_r2, min_size, max_size, max_K = 500, max_r2 = 0.3, max_cost = ncol(corr)/200, pos_scaled = rep(0, ncol(corr)) )
corr | Sparse correlation matrix. Usually the output of snp_cor(). |
thr_r2 | Threshold under which squared correlations are ignored. This is useful to avoid counting noise, which should give clearer patterns of costs vs. number of blocks. It is therefore possible to have a splitting cost of 0. |
min_size | Minimum number of variants in each block. This is used to avoid having a disproportionate number of small blocks. |
max_size | Maximum number of variants in each block. This is used to avoid blocks that are too large, e.g. to limit the computational and memory requirements of applications that use these blocks. For some long-range LD regions, it may be needed to allow for large blocks. You can now provide a vector of values to try. |
max_K | Maximum number of blocks to consider. All optimal solutions for K from 1 to max_K are returned. Default is 500. |
max_r2 | Maximum squared correlation allowed for one pair of variants in two different blocks. This is used to make sure that strong correlations are not discarded, and also to speed up the algorithm. Default is 0.3. |
max_cost | Maximum cost reported. Default is ncol(corr) / 200. |
pos_scaled | Vector of positions, scaled such that the limits of a block are separated by a distance of at most 1. E.g. if the positions are in base pairs (bp) and you want a maximum distance of 10 Mbp, provide the vector of positions divided by 10e6. |
Either NULL when no block splitting satisfies the conditions, or a tibble with seven columns:
$max_size: Input parameter, useful when providing a vector of values to try.
$n_block: Number of blocks.
$cost: The sum of squared correlations outside the blocks.
$cost2: The sum of squared sizes of the blocks.
$perc_kept: Percentage of initial non-zero values kept within the blocks defined.
$all_last: Last index of each block.
$all_size: Sizes of the blocks.
$block_num: Resulting block numbers for each variant. This is not reported anymore, but can be computed with rep(seq_along(all_size), all_size).
## Not run: corr <- readRDS(url("https://www.dropbox.com/s/65u96jf7y32j2mj/spMat.rds?raw=1")) # adjust `THR_R2` depending on sample size used to compute corr # use e.g. 0.05 for small sample sizes, and 0.01 for large sample sizes THR_R2 <- 0.02 m <- ncol(corr) (SEQ <- round(seq_log(m / 30, m / 5, length.out = 10))) # replace `min_size` by e.g. 100 for larger data (res <- snp_ldsplit(corr, thr_r2 = THR_R2, min_size = 10, max_size = SEQ)) # add the variant block IDs corresponding to each split res$block_num <- lapply(res$all_size, function(.) rep(seq_along(.), .)) library(ggplot2) # trade-off cost / number of blocks qplot(n_block, cost, color = factor(max_size, SEQ), data = res) + theme_bw(14) + scale_y_log10() + theme(legend.position = "top") + labs(x = "Number of blocks", color = "Maximum block size", y = "Sum of squared correlations outside blocks") # trade-off cost / number of non-zero values qplot(perc_kept, cost, color = factor(max_size, SEQ), data = res) + theme_bw(14) + # scale_y_log10() + theme(legend.position = "top") + labs(x = "Percentage of non-zero values kept", color = "Maximum block size", y = "Sum of squared correlations outside blocks") # trade-off cost / sum of squared sizes qplot(cost2, cost, color = factor(max_size, SEQ), data = res) + theme_bw(14) + scale_y_log10() + geom_vline(xintercept = 0)+ theme(legend.position = "top") + labs(x = "Sum of squared blocks", color = "Maximum block size", y = "Sum of squared correlations outside blocks") ## Pick one solution and visualize blocks library(dplyr) all_ind <- res %>% arrange(cost2 * sqrt(5 + cost)) %>% print() %>% slice(1) %>% pull(all_last) ## Transform sparse representation into (i,j,x) triplets corrT <- as(corr, "dgTMatrix") upper <- (corrT@i <= corrT@j & corrT@x^2 >= THR_R2) df <- data.frame( i = corrT@i[upper] + 1L, j = corrT@j[upper] + 1L, r2 = corrT@x[upper]^2 ) df$y <- (df$j - df$i) / 2 ggplot(df) + geom_point(aes(i + y, y, alpha = r2)) + theme_minimal() + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), strip.background = element_blank(), strip.text.x = element_blank()) + scale_alpha_continuous(range = 0:1) + scale_x_continuous(expand = c(0.02, 0.02), minor_breaks = NULL, breaks = head(all_ind[[1]], -1) + 0.5) + facet_wrap(~ cut(i + y, 4), scales = "free", ncol = 1) + labs(x = "Position", y = NULL) ## End(Not run)
Minor Allele Frequency.
snp_MAF( G, ind.row = rows_along(G), ind.col = cols_along(G), nploidy = 2, ncores = 1 )
G | A FBM.code256 (typically the $genotypes of a bigSNP). |
ind.row | An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. |
ind.col | An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used. |
nploidy | Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
A vector of MAFs, corresponding to ind.col.
obj.bigsnp <- snp_attachExtdata() str(maf <- snp_MAF(obj.bigsnp$genotypes))
Creates a Manhattan plot.
snp_manhattan( gwas, infos.chr, infos.pos, colors = c("black", "grey60"), dist.sep.chrs = 1e+07, ind.highlight = integer(0), col.highlight = "red", labels = NULL, npoints = NULL, coeff = 1 )
gwas | A mhtest object with the p-values associated with each SNP, e.g. the output of big_univLinReg, big_univLogReg or snp_pcadapt. |
infos.chr | Vector of integers specifying each SNP's chromosome. |
infos.pos | Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP. |
colors | Colors used for each chromosome (they are recycled). Default is an alternation of black and gray. |
dist.sep.chrs | "Physical" distance that separates two chromosomes. Default is 10 Mbp. |
ind.highlight | Indices of SNPs you want to highlight (of interest). Default doesn't highlight any SNPs. |
col.highlight | Color used for highlighting SNPs. Default uses red. |
labels | Labels of the x axis. Default uses the chromosome numbers found in infos.chr. |
npoints | Number of points to keep (ranked by p-value) in order to get a lighter object (and plot). Default doesn't cut anything. If used, the resulting object will have an attribute giving the indices of the kept points. |
coeff | Relative size of text. Default is 1. |
If you don't have information on chromosome and position, you should simply use plot instead.
A ggplot2 object. You can plot it using the print method. You can modify it as you wish by adding layers. You might want to read this chapter to get more familiar with the package ggplot2.
set.seed(9) test <- snp_attachExtdata() G <- test$genotypes y <- rnorm(nrow(G)) gwas <- big_univLinReg(G, y) snp_qq(gwas) gwas_gc <- snp_gc(gwas) # this modifies `attr(gwas_gc, "transfo")` snp_qq(gwas_gc) # The next plot should be prettier with a real dataset snp_manhattan(gwas_gc, infos.chr = test$map$chromosome, infos.pos = test$map$physical.pos) + ggplot2::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red") p <- snp_qq(gwas_gc) + ggplot2::aes(text = asPlotlyText(test$map)) + ggplot2::labs(subtitle = NULL, x = "Expected -log10(p)", y = "Observed -log10(p)") ## Not run: plotly::ggplotly(p, tooltip = "text")
Match alleles between summary statistics and SNP information. Match by ("chr", "a0", "a1") and ("pos" or "rsid"), accounting for possible strand flips and reverse reference alleles (opposite effects).
snp_match( sumstats, info_snp, strand_flip = TRUE, join_by_pos = TRUE, remove_dups = TRUE, match.min.prop = 0.2, return_flip_and_rev = FALSE )
sumstats | A data frame with columns "chr", "pos", "a0", "a1" and "beta". |
info_snp | A data frame with columns "chr", "pos", "a0" and "a1". |
strand_flip | Whether to try to flip strand? Default is TRUE. If so, ambiguous alleles A/T and C/G are removed. |
join_by_pos | Whether to join by chromosome and position (default), or instead by rsid. |
remove_dups | Whether to remove duplicates (same physical position)? Default is TRUE. |
match.min.prop | Minimum proportion of variants in the smallest dataset to be matched, otherwise stops with an error. Default is 0.2 (20%). |
return_flip_and_rev | Whether to return internal boolean variables indicating which variants were flipped and/or reversed? Default is FALSE. |
A single data frame with matched variants. Values in column $beta are multiplied by -1 for variants with alleles reversed (i.e. swapped). The new column "_NUM_ID_.ss" gives the corresponding row indices of the input sumstats (first argument of this function), and "_NUM_ID_" those of the input info_snp (second argument).
sumstats <- data.frame( chr = 1, pos = c(86303, 86331, 162463, 752566, 755890, 758144), a0 = c("T", "G", "C", "A", "T", "G"), a1 = c("G", "A", "T", "G", "A", "A"), beta = c(-1.868, 0.250, -0.671, 2.112, 0.239, 1.272), p = c(0.860, 0.346, 0.900, 0.456, 0.776, 0.383) ) info_snp <- data.frame( id = c("rs2949417", "rs115209712", "rs143399298", "rs3094315", "rs3115858"), chr = 1, pos = c(86303, 86331, 162463, 752566, 755890), a0 = c("T", "A", "G", "A", "T"), a1 = c("G", "G", "A", "G", "A") ) snp_match(sumstats, info_snp) snp_match(sumstats, info_snp, strand_flip = FALSE)
Compute the MAX3 statistic, which tests for three genetic models (additive, recessive and dominant).
snp_MAX3(Gna, y01.train, ind.train = rows_along(Gna), val = c(0, 0.5, 1))
Gna | A FBM.code256 (typically the $genotypes of a bigSNP). |
y01.train | Vector of responses (only 0s and 1s), corresponding to ind.train. |
ind.train | An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices. |
val | Values of the genetic-model parameter over which the statistic is maximized. The default, c(0, 0.5, 1), corresponds to MAX3 (recessive, additive and dominant models); using only 0.5 corresponds to the Armitage trend test (see the example). |
P-values associated with returned scores are in fact the minimum of the p-values of each test separately. Thus, they are biased downward.
An object of classes mhtest and data.frame returning one score by SNP. See methods(class = "mhtest").
Zheng, G., Yang, Y., Zhu, X., & Elston, R. (2012). Robust Procedures. Analysis Of Genetic Association Studies, 151-206. doi:10.1007/978-1-4614-2245-7_6.
set.seed(1) # constructing a fake genotype big.matrix N <- 50; M <- 1200 fake <- snp_fake(N, M) G <- fake$genotypes G[] <- sample(as.raw(0:3), size = length(G), replace = TRUE) G[1:8, 1:10] # Specify case/control phenotypes fake$fam$affection <- rep(1:2, each = N / 2) # Get MAX3 statistics y01 <- fake$fam$affection - 1 str(test <- snp_MAX3(fake$genotypes, y01.train = y01)) # p-values are not well calibrated snp_qq(test) # genomic control is not of much help snp_qq(snp_gc(test)) # Armitage trend test (well calibrated because only one test) test2 <- snp_MAX3(fake$genotypes, y01.train = y01, val = 0.5) snp_qq(test2)
Modify the physical position information of a data frame when converting genome build using executable liftOver.
snp_modifyBuild( info_snp, liftOver, from = "hg18", to = "hg19", check_reverse = TRUE, local_chain = NULL, base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/" )
info_snp | A data frame with columns "chr" and "pos". |
liftOver | Path to the liftOver executable. Binaries can be downloaded at https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver for Mac and at https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver for Linux. |
from | Genome build to convert from. Default is "hg18". |
to | Genome build to convert to. Default is "hg19". |
check_reverse | Whether to discard positions for which we cannot go back to the initial values by doing 'from -> to -> from'. Default is TRUE. |
local_chain | Local chain file to use instead of downloading one. Default is NULL (downloads the chain file). |
base_url | From where to download the chain files. Default is "https://hgdownload.soe.ucsc.edu/goldenPath/". |
Input data frame info_snp with column "pos" in the new build.
Hinrichs, Angela S., et al. "The UCSC genome browser database: update 2006." Nucleic acids research 34.suppl_1 (2006): D590-D598.
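There is no example on this page; below is a minimal sketch in which the liftOver path is a placeholder (the executable must be downloaded beforehand, see the liftOver parameter) and the positions are arbitrary.
## Not run: 
liftOver <- "path/to/liftOver"  # hypothetical path to the executable
info_snp <- data.frame(chr = 1, pos = c(752566L, 755890L))
# convert positions from hg18 to hg19 (the defaults)
info_snp_hg19 <- snp_modifyBuild(info_snp, liftOver, from = "hg18", to = "hg19")
## End(Not run)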
Method to detect genetic markers involved in biological adaptation. This provides a statistical tool for outlier detection based on Principal Component Analysis. It corresponds to the statistic based on the Mahalanobis distance, as implemented in the package pcadapt.
snp_pcadapt( G, U.row, ind.row = rows_along(G), ind.col = cols_along(G), ncores = 1 ) bed_pcadapt( obj.bed, U.row, ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed), ncores = 1 )
G | A FBM.code256 (typically the $genotypes of a bigSNP). |
U.row | Left singular vectors (not scores, i.e. U, not UD) corresponding to ind.row, typically a subset of columns of the u component of big_SVD's output (see the example). |
ind.row | An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. |
ind.col | An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
obj.bed | Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object. |
An object of classes mhtest and data.frame returning one score by SNP. See methods(class = "mhtest").
Luu, K., Bazin, E., & Blum, M. G. (2017). pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular ecology resources, 17(1), 67-77.
snp_manhattan, snp_qq and snp_gc.
test <- snp_attachExtdata() G <- test$genotypes obj.svd <- big_SVD(G, fun.scaling = snp_scaleBinom(), k = 10) plot(obj.svd) # there seems to be 3 "significant" components pcadapt <- snp_pcadapt(G, obj.svd$u[, 1:3]) snp_qq(pcadapt)
Quality Control based on Identity-by-descent (IBD) computed by PLINK 1.9 using its method-of-moments.
snp_plinkIBDQC( plink.path, bedfile.in, bedfile.out = NULL, pi.hat = 0.08, ncores = 1, pruning.args = c(100, 0.2), do.blind.QC = TRUE, extra.options = "", verbose = TRUE )
plink.path | Path to the executable of PLINK 1.9. |
bedfile.in | Path to the input bedfile. |
bedfile.out | Path to the output bedfile. Default is created by appending "_norel" to the prefix of the input bedfile. |
pi.hat | PI_HAT value threshold for individuals (first by pairs) to be excluded. Default is 0.08. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
pruning.args | A vector of 2 pruning parameters, respectively the window size (in variant count) and the pairwise r^2 threshold (the step size is fixed to 1). Default is c(100, 0.2). |
do.blind.QC | Whether to do QC based on pi.hat without visual verification? Default is TRUE. If FALSE, a data frame of related pairs is returned so that you can do your own QC (see the example). |
extra.options | Other options to be passed to PLINK as a string (for the IBD part). More options can be found at https://www.cog-genomics.org/plink/1.9/ibd. |
verbose | Whether to show the PLINK log? Default is TRUE. |
The path of the new bedfile. If no sample is filtered, no new bed/bim/fam files are created, and the path of the input bedfile is returned.
Chang, Christopher C, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (1): 7. doi:10.1186/s13742-015-0047-8.
download_plink, snp_plinkQC, snp_plinkKINGQC
## Not run: bedfile <- system.file("extdata", "example.bed", package = "bigsnpr") plink <- download_plink() bedfile <- snp_plinkIBDQC(plink, bedfile, bedfile.out = tempfile(fileext = ".bed"), ncores = 2) df_rel <- snp_plinkIBDQC(plink, bedfile, do.blind.QC = FALSE, ncores = 2) str(df_rel) library(ggplot2) qplot(Z0, Z1, data = df_rel, col = RT) qplot(y = PI_HAT, data = df_rel) + geom_hline(yintercept = 0.2, color = "blue", linetype = 2) snp_plinkRmSamples(plink, bedfile, bedfile.out = tempfile(fileext = ".bed"), df.or.files = subset(df_rel, PI_HAT > 0.2)) ## End(Not run)
Quality Control based on KING-robust kinship estimator. More information can be found at https://www.cog-genomics.org/plink/2.0/distance#king_cutoff.
snp_plinkKINGQC( plink2.path, bedfile.in, bedfile.out = NULL, thr.king = 2^-3.5, make.bed = TRUE, ncores = 1, extra.options = "", verbose = TRUE )
plink2.path | Path to the executable of PLINK 2. |
bedfile.in | Path to the input bedfile. |
bedfile.out | Path to the output bedfile. Default is created by appending "_norel" to the prefix of the input bedfile. |
thr.king | Threshold on the KING-robust kinship coefficient. Note that KING kinship coefficients are scaled such that duplicate samples have kinship 0.5, not 1. First-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc. It is conventional to use a cutoff of ~0.354 (2^-1.5, the geometric mean of 0.5 and 0.25) to screen for monozygotic twins and duplicate samples, ~0.177 (2^-2.5) to also remove first-degree relations, and ~0.0884 (2^-3.5, the default) to also remove second-degree relations. |
make.bed | Whether to create new bed/bim/fam files (default). Otherwise, returns a table with coefficients of related pairs. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
extra.options | Other options to be passed to PLINK2 as a string. |
verbose | Whether to show the PLINK log? Default is TRUE. |
See parameter make.bed: either the path of the new bedfile (default), or a table with the coefficients of related pairs.
Manichaikul, Ani, Josyf C. Mychaleckyj, Stephen S. Rich, Kathy Daly, Michele Sale, and Wei-Min Chen. "Robust relationship inference in genome-wide association studies." Bioinformatics 26, no. 22 (2010): 2867-2873.
## Not run: bedfile <- system.file("extdata", "example.bed", package = "bigsnpr") plink2 <- download_plink2(AVX2 = FALSE) bedfile2 <- snp_plinkKINGQC(plink2, bedfile, bedfile.out = tempfile(fileext = ".bed"), ncores = 2) df_rel <- snp_plinkKINGQC(plink2, bedfile, make.bed = FALSE, ncores = 2) str(df_rel) ## End(Not run)
Quality Control (QC) and possible conversion to bed/bim/fam files using PLINK 1.9.
snp_plinkQC( plink.path, prefix.in, file.type = "--bfile", prefix.out = paste0(prefix.in, "_QC"), maf = 0.01, geno = 0.1, mind = 0.1, hwe = 1e-50, autosome.only = FALSE, extra.options = "", verbose = TRUE )
plink.path | Path to the executable of PLINK 1.9. |
prefix.in | Prefix (path without extension) of the dataset to be QCed. |
file.type | Type of the dataset to be QCed. Default is "--bfile" and corresponds to bed/bim/fam files. |
prefix.out | Prefix (path without extension) of the bed/bim/fam dataset to be created. Default is created by appending "_QC" to prefix.in. |
maf | Minimum Minor Allele Frequency (MAF) for a SNP to be kept. Default is 0.01. |
geno | Maximum proportion of missing values for a SNP to be kept. Default is 0.1. |
mind | Maximum proportion of missing values for a sample to be kept. Default is 0.1. |
hwe | Filters out all variants which have a Hardy-Weinberg equilibrium exact test p-value below the provided threshold. Default is 1e-50. |
autosome.only | Whether to exclude all unplaced and non-autosomal variants? Default is FALSE. |
extra.options | Other options to be passed to PLINK as a string. More options can be found at https://www.cog-genomics.org/plink2/filter. |
verbose | Whether to show the PLINK log? Default is TRUE. |
The path of the newly created bedfile.
Chang, Christopher C, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4 (1): 7. doi:10.1186/s13742-015-0047-8.
## Not run: bedfile <- system.file("extdata", "example.bed", package = "bigsnpr") prefix <- sub_bed(bedfile) plink <- download_plink() test <- snp_plinkQC(plink.path = plink, prefix.in = prefix, prefix.out = tempfile(), file.type = "--bfile", # the default (for ".bed") maf = 0.05, geno = 0.05, mind = 0.05, hwe = 1e-10, autosome.only = TRUE) test ## End(Not run)
Create new bed/bim/fam files by removing samples with PLINK.
snp_plinkRmSamples( plink.path, bedfile.in, bedfile.out, df.or.files, col.family.ID = 1, col.sample.ID = 2, ..., verbose = TRUE )
plink.path | Path to the executable of PLINK 1.9. |
bedfile.in | Path to the input bedfile. |
bedfile.out | Path to the output bedfile. |
df.or.files | Either a data frame, or file paths to be read, providing the family and sample IDs of the samples to remove (see the example of snp_plinkIBDQC). |
col.family.ID | Index of the column containing the family IDs to match with those of the study. Default is the first one. |
col.sample.ID | Index of the column containing the sample IDs to match with those of the study. Default is the second one. |
... | Any additional parameter to pass to the function used to read df.or.files when file paths are provided. |
verbose | Whether to show the PLINK log? Default is TRUE. |
The path of the new bedfile.
Compute a matrix product between BGEN files and a matrix. This removes the need to create an intermediate FBM object with snp_readBGEN() just to compute the product. Moreover, when using dosages, they are no longer rounded to two decimal places.
snp_prodBGEN( bgenfiles, beta, list_snp_id, ind_row = NULL, bgi_dir = dirname(bgenfiles), read_as = c("dosage", "random"), block_size = 1000, ncores = 1 )
bgenfiles | Character vector of paths to files with extension ".bgen". The corresponding ".bgen.bgi" index files must exist. |
beta | A matrix (or a vector), with rows corresponding to list_snp_id. |
list_snp_id | List of character vectors of SNP IDs to read, with one vector per BGEN file. Each SNP ID should be in the form "<chr>_<pos>_<a1>_<a2>" (e.g. "1_88169_C_T"). |
ind_row | An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. Don't use negative indices. You can access the sample IDs corresponding to the genotypes from the .sample file, and use e.g. match() to get the indices of the individuals you want. |
bgi_dir | Directory of index files. Default is the same as bgenfiles. |
read_as | How to read BGEN probabilities? Currently implemented: "dosage" (the default) and "random" (sampling hard genotype calls according to the probabilities). |
block_size | Maximum size of temporary blocks (in number of variants). Default is 1000. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
The product bgen_data[ind_row, 'list_snp_id'] %*% beta.
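There is no example on this page; here is a minimal sketch in which the BGEN paths and variant IDs are placeholders following the format described above.
## Not run: 
# hypothetical BGEN files and variants (one vector of IDs per file)
bgenfiles <- paste0("data/chr", 1:2, ".bgen")
list_snp_id <- list(c("1_752566_A_G", "1_755890_T_A"), "2_100000_C_T")
beta <- c(0.1, -0.2, 0.05)  # one weight per variant, in the same order
scores <- snp_prodBGEN(bgenfiles, beta, list_snp_id)
str(scores)
## End(Not run)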
Polygenic Risk Scores with possible clumping and thresholding.
snp_PRS( G, betas.keep, ind.test = rows_along(G), ind.keep = cols_along(G), same.keep = rep(TRUE, length(ind.keep)), lpS.keep = NULL, thr.list = 0 )
G | A FBM.code256 (typically the $genotypes of a bigSNP). |
betas.keep | Numeric vector of weights associated with each SNP corresponding to ind.keep. |
ind.test | The individuals on whom to project the scores. Default uses all. |
ind.keep | Column (SNP) indices to use (if using clumping, the output of snp_clumping). Default doesn't clump. |
same.keep | A logical vector associated with ind.keep, specifying whether the reference allele is the same for each SNP. Default is all TRUE. |
lpS.keep | Numeric vector of -log10(p-values) associated with ind.keep. Default doesn't use thresholding. |
thr.list | Threshold vector on lpS.keep at which SNPs are excluded if they are not significant enough. Default is 0. |
A matrix of scores, where rows correspond to ind.test and columns correspond to thr.list.
test <- snp_attachExtdata() G <- big_copy(test$genotypes, ind.col = 1:1000) CHR <- test$map$chromosome[1:1000] POS <- test$map$physical.pos[1:1000] y01 <- test$fam$affection - 1 # PCA -> covariables obj.svd <- snp_autoSVD(G, infos.chr = CHR, infos.pos = POS) # train and test set ind.train <- sort(sample(nrow(G), 400)) ind.test <- setdiff(rows_along(G), ind.train) # 117 # GWAS gwas.train <- big_univLogReg(G, y01.train = y01[ind.train], ind.train = ind.train, covar.train = obj.svd$u[ind.train, ]) # clumping ind.keep <- snp_clumping(G, infos.chr = CHR, ind.row = ind.train, S = abs(gwas.train$score)) # -log10(p-values) and thresholding summary(lpS.keep <- -predict(gwas.train)[ind.keep]) thrs <- seq(0, 4, by = 0.5) nb.pred <- sapply(thrs, function(thr) sum(lpS.keep > thr)) # PRS prs <- snp_PRS(G, betas.keep = gwas.train$estim[ind.keep], ind.test = ind.test, ind.keep = ind.keep, lpS.keep = lpS.keep, thr.list = thrs) # AUC as a function of the number of predictors aucs <- apply(prs, 2, AUC, target = y01[ind.test]) library(ggplot2) qplot(nb.pred, aucs) + geom_line() + scale_x_log10(breaks = nb.pred) + labs(x = "Number of predictors", y = "AUC") + theme_bigstatsr()
Creates a quantile-quantile plot from p-values from a GWAS study.
snp_qq(gwas, lambdaGC = TRUE, coeff = 1)
gwas | A mhtest object with the p-values associated with each SNP, e.g. the output of big_univLinReg, big_univLogReg or snp_pcadapt. |
lambdaGC | Add the Genomic Control coefficient as a subtitle to the plot? Default is TRUE. |
coeff | Relative size of text. Default is 1. |
A ggplot2 object. You can plot it using the print method. You can modify it as you wish by adding layers. You might want to read this chapter to get more familiar with the package ggplot2.
set.seed(9) test <- snp_attachExtdata() G <- test$genotypes y <- rnorm(nrow(G)) gwas <- big_univLinReg(G, y) snp_qq(gwas) gwas_gc <- snp_gc(gwas) # this modifies `attr(gwas_gc, "transfo")` snp_qq(gwas_gc) # The next plot should be prettier with a real dataset snp_manhattan(gwas_gc, infos.chr = test$map$chromosome, infos.pos = test$map$physical.pos) + ggplot2::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red") p <- snp_qq(gwas_gc) + ggplot2::aes(text = asPlotlyText(test$map)) + ggplot2::labs(subtitle = NULL, x = "Expected -log10(p)", y = "Observed -log10(p)") ## Not run: plotly::ggplotly(p, tooltip = "text")
Functions to read bed/bim/fam files into a bigSNP.
snp_readBed(bedfile, backingfile = sub_bed(bedfile)) snp_readBed2( bedfile, backingfile = sub_bed(bedfile), ind.row = rows_along(obj.bed), ind.col = cols_along(obj.bed), ncores = 1 )
bedfile | Path to the file with extension ".bed" to read. You need the corresponding ".bim" and ".fam" files in the same directory. |
backingfile | The path (without extension) for the backing files for the cache of the bigSNP object. Default takes the bedfile without the ".bed" extension. |
ind.row | An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. |
ind.col | An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
For more information on these formats, please visit the PLINK webpage. For other formats, please use PLINK to convert them to bed files, which require minimal space to store and are faster to read. For example, to convert from a VCF file, use the --vcf option. See snp_plinkQC.
The path to the RDS file that stores the bigSNP object. Note that this function creates one other file, which stores the values of the Filebacked Big Matrix.
You shouldn't read from PLINK files more than once. Instead, use
snp_attach to load the "bigSNP" object in any R session from backing files.
(bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")) # Reading the bedfile and storing the data in temporary directory rds <- snp_readBed(bedfile, backingfile = tempfile()) # Loading the data from backing files test <- snp_attach(rds) str(test) dim(G <- test$genotypes) G[1:8, 1:8]
Function to read the UK Biobank BGEN files into a bigSNP.
snp_readBGEN( bgenfiles, backingfile, list_snp_id, ind_row = NULL, bgi_dir = dirname(bgenfiles), read_as = c("dosage", "random"), ncores = 1 )
bgenfiles | Character vector of paths to files with extension ".bgen". The corresponding ".bgen.bgi" index files must exist. |
backingfile | The path (without extension) for the backing files (".bk" and ".rds") that are created by this function for storing the bigSNP object. |
list_snp_id | List of character vectors of SNP IDs to read, with one vector per BGEN file. Each SNP ID should be in the form "<chr>_<pos>_<a1>_<a2>" (e.g. "1_88169_C_T"). |
ind_row | An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. Don't use negative indices. You can access the sample IDs corresponding to the genotypes from the .sample file, and use e.g. match() to get the indices of the individuals you want. |
bgi_dir | Directory of index files. Default is the same as bgenfiles. |
read_as | How to read BGEN probabilities? Currently implemented: "dosage" (the default) and "random" (sampling hard genotype calls according to the probabilities). |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
For more information on this format, please visit the BGEN webpage.
This function is designed to read UK Biobank imputation files. This assumes that variants have been compressed with zlib, that there are only 2 possible alleles, and that each probability is stored on 8 bits. For example, if you use qctool to generate your own BGEN files, please make sure you are using options '-ofiletype bgen_v1.2 -bgen-bits 8 -assume-chromosome'.
If the format is not the expected one, this will result in an error or even a crash of your R session. Another common source of error is corrupted files; e.g. if using UK Biobank files, compare the result of tools::md5sum() with the ones at https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=998.
You can look at some example code from my papers on how to use this function:
https://github.com/privefl/paper-infer/blob/main/code/prepare-geno-simu.R
https://github.com/privefl/paper-misspec/blob/main/code/prepare-genotypes.R
The path to the RDS file <backingfile>.rds that stores the bigSNP object created by this function. Note that this function creates another file (.bk) which stores the values of the FBM ($genotypes). The rows correspond to the order of ind_row; the columns to the order of list_snp_id. The $map component of the bigSNP object stores some information on the variants (including allele frequencies and INFO scores computed from the imputation probabilities). However, it does not have a $fam component; you should use the individual IDs in the .sample file (filtered with ind_row) to add external information on the individuals.
You shouldn't read from BGEN files more than once. Instead, use
snp_attach to load the "bigSNP" object in any R session from backing files.
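There is no example on this page; here is a minimal sketch with placeholder paths (real UK Biobank file names will differ), following the SNP ID format described above.
## Not run: 
bgenfile <- "data/chr1.bgen"  # hypothetical path; "data/chr1.bgen.bgi" must exist
list_snp_id <- list(c("1_752566_A_G", "1_755890_T_A"))
rds <- snp_readBGEN(bgenfile, backingfile = tempfile(), list_snp_id = list_snp_id)
obj.bigsnp <- snp_attach(rds)
str(obj.bigsnp$map)  # includes allele frequencies and INFO scores
## End(Not run)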
Read variant info from one BGI file
snp_readBGI(bgifile, snp_id = NULL)
bgifile | Path to one file with extension ".bgi". |
snp_id | Character vector of SNP IDs. These should be in the form "<chr>_<pos>_<a1>_<a2>" (e.g. "1_88169_C_T"). Default is NULL (all variants in the BGI file are returned). |
A data frame containing variant information.
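As a short sketch (the path is a placeholder):
## Not run: 
info <- snp_readBGI("data/chr1.bgen.bgi")  # hypothetical index file
str(info)
## End(Not run)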
Save a bigSNP after having made some modifications to it. As bigSNP is an S3 class, you can add any slot you want to an object of this class, then use snp_save to save these modifications in the corresponding ".rds" backing file.
snp_save(x, version = NULL)
x |
A bigSNP. |
version | The workspace format version to use. |
The (saved) bigSNP.
set.seed(1) # Reading example test <- snp_attachExtdata() # I can add whatever I want to an S3 class test$map$`p-values` <- runif(nrow(test$map)) str(test$map) # Reading again rds <- test$genotypes$rds test2 <- snp_attach(rds) str(test2$map) # new slot wasn't saved # Save it snp_save(test) # Reading again test3 <- snp_attach(rds) str(test3$map) # it is saved now # The complicated code of this function snp_save
Binomial(n, p) scaling where n is fixed and p is estimated.
snp_scaleAlpha(alpha = -1) snp_scaleBinom(nploidy = 2)
alpha | Assumes that the average contribution (e.g. heritability) of a SNP of frequency p is proportional to (2 p (1 - p))^(1 + alpha). Default is -1. |
nploidy | Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome. |
You will probably not use this function as is, but as the fun.scaling parameter of other functions of the package bigstatsr.
A new function that returns a data.frame of two vectors, "center" and "scale", which are of the length of ind.col.
This scaling is widely used for SNP arrays. Patterson N, Price AL, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190.
set.seed(1) a <- matrix(0, 93, 170) p <- 0.2 a[] <- rbinom(length(a), 2, p) X <- add_code256(big_copy(a, type = "raw"), code = c(0, 1, 2, rep(NA, 253))) X.svd <- big_SVD(X, fun.scaling = snp_scaleBinom()) str(X.svd) plot(X.svd$center) abline(h = 2 * p, col = "red") plot(X.svd$scale) abline(h = sqrt(2 * p * (1 - p)), col = "red")
Simulate phenotypes using a linear model. When a prevalence is given, the liability threshold is used to convert liabilities to a binary outcome. The genetic and environmental liabilities are scaled such that the variance of the genetic liability is exactly equal to the requested heritability, and the variance of the total liability is equal to 1.
snp_simuPheno( G, h2, M, K = NULL, alpha = -1, ind.row = rows_along(G), ind.possible = cols_along(G), prob = NULL, effects.dist = c("gaussian", "laplace"), ncores = 1 )
G | A FBM.code256 (typically the $genotypes of a bigSNP). |
h2 | Heritability. |
M | Number of causal variants. |
K | Prevalence. Default is NULL, in which case a continuous trait is returned. |
alpha | Assumes that the average contribution (e.g. heritability) of a SNP of frequency p is proportional to (2 p (1 - p))^(1 + alpha). Default is -1. |
ind.row | An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. |
ind.possible | Indices of possible causal variants. Default is cols_along(G) (all variants). |
prob | Vector of probability weights for sampling causal indices. It can have 0s (discarded) and is automatically scaled to sum to 1. Default is NULL (uniform sampling). |
effects.dist | Distribution of effects. Either "gaussian" (the default) or "laplace". |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
A list with 4 elements:
$pheno: vector of phenotypes,
$set: indices of causal variants,
$effects: effect sizes (of scaled genotypes) corresponding to set,
$allelic_effects: effect sizes, but on the allele scale (0|1|2).
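There is no example on this page, so here is a minimal sketch on the package's example data; the h2, M and K values are arbitrary.
test <- snp_attachExtdata()
G <- test$genotypes
# continuous trait with 100 causal variants and h2 = 0.3
simu <- snp_simuPheno(G, h2 = 0.3, M = 100)
str(simu$set)     # indices of the causal variants
hist(simu$pheno)  # total liability has variance 1
# binary trait with 10% prevalence, via the liability threshold model
simu2 <- snp_simuPheno(G, h2 = 0.3, M = 100, K = 0.1)
table(simu2$pheno)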
A Split-Apply-Combine strategy to parallelize the evaluation of a function on each SNP, independently.
snp_split(infos.chr, FUN, combine, ncores = 1, ...)
infos.chr | Vector of integers specifying each SNP's chromosome. |
FUN | The function to be applied. It must take a FBM.code256 as first argument and ind.chr, another argument giving the indices of the SNPs of the current chromosome. |
combine | Function that is used by foreach::foreach to process the task results as they are generated. This can be specified as either a function or a non-empty character string naming the function. Specifying 'c' is useful for concatenating the results into a vector, for example. The values 'cbind' and 'rbind' can combine vectors into a matrix. The values '+' and '*' can be used to process numeric data. By default, the results are returned in a list. |
ncores | Number of cores used. Default doesn't use parallelism. You may use nb_cores(). |
... | Extra arguments to be passed to FUN. |
This function splits indices by chromosome, applies a given function to each part (chromosome), and finally combines the results.
The result of foreach::foreach.
# parallelize over chromosomes made easy # examples of functions from this package snp_pruning snp_clumping snp_fastImpute
Subset (copy) of a bigSNP, also stored on disk.
snp_subset( x, ind.row = rows_along(x$genotypes), ind.col = cols_along(x$genotypes), backingfile = NULL ) ## S3 method for class 'bigSNP' subset( x, ind.row = rows_along(x$fam), ind.col = rows_along(x$map), backingfile = NULL, ... )
x |
A bigSNP. |
ind.row |
Indices of the rows (individuals) to keep. Negative indices can be used to exclude row indices. Default: keep them all. |
ind.col |
Indices of the columns (SNPs) to keep. Negative indices can be used to exclude column indices. Default: keep them all. |
backingfile |
Prefix of the two new files created (".bk" and ".rds"). By default, it is automatically determined by appending "_sub" and a number to the prefix of the input bigSNP backing files. |
... |
Not used. |
The path to the RDS file that stores the bigSNP
object.
str(test <- snp_attachExtdata()) # keep only first 50 samples and SNPs rdsfile <- snp_subset(test, ind.row = 1:50, ind.col = 1:50) str(snp_attach(rdsfile)) # remove only first 50 samples and SNPs rdsfile2 <- snp_subset(test, ind.row = -(1:50), ind.col = -(1:50)) str(snp_attach(rdsfile2))
P-value thresholding and correction of summary statistics for winner's curse.
snp_thr_correct(beta, beta_se, lpS, thr_lpS)
beta | Vector of effect sizes. |
beta_se | Vector of standard errors corresponding to beta. Alternatively, you can provide lpS instead (see the example). |
lpS | Vector of -log10(p-values) associated with beta; can be provided instead of beta_se. |
thr_lpS | Threshold on lpS; effect sizes of variants that do not pass this threshold are set to 0. |
beta after p-value thresholding and shrinkage.
Zhong, H., & Prentice, R. L. (2008). Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics, 9(4), 621-634.
beta <- rnorm(1000) beta_se <- runif(1000, min = 0.3, max = 0.5) new_beta <- snp_thr_correct(beta, beta_se = beta_se, thr_lpS = 1) plot(beta / beta_se, new_beta / beta_se, pch = 20); abline(0, 1, col = "red") plot(beta, new_beta, pch = 20); abline(0, 1, col = "red") # Can provide -log10(p-values) instead of standard errors lpval <- -log10(pchisq((beta / beta_se)^2, df = 1, lower.tail = FALSE)) new_beta2 <- snp_thr_correct(beta, lpS = lpval, thr_lpS = 1) all.equal(new_beta2, new_beta)
Function to write bed/bim/fam files from a bigSNP. This will use the slot code rounded to write 0s, 1s, 2s or NAs.
snp_writeBed(x, bedfile, ind.row = rows_along(G), ind.col = cols_along(G))
x |
A bigSNP. |
bedfile |
Path to file with extension ".bed" to create. |
ind.row |
An optional vector of the row indices (individuals) that
are used. If not specified, all rows are used. |
ind.col |
An optional vector of the column indices (SNPs) that are used.
If not specified, all columns are used. |
The input bedfile path.
N <- 17 M <- 911 fake <- snp_fake(N, M) G <- fake$genotypes G[] <- sample(as.raw(0:3), size = length(G), replace = TRUE) # Write the object as a bed/bim/fam object tmp <- tempfile(fileext = ".bed") bed <- snp_writeBed(fake, tmp) # Read this new file for the first time rds <- snp_readBed(bed, backingfile = tempfile()) # Attach object in R session fake2 <- snp_attach(rds) # Same content all.equal(fake$genotypes[], fake2$genotypes[]) all.equal(fake$fam, fake2$fam) all.equal(fake$map, fake2$map) # Two different backingfiles fake$genotypes$backingfile fake2$genotypes$backingfile
Replace extension '.bed'
sub_bed(path, replacement = "", stop_if_not_ext = TRUE)
path |
String with extension '.bed'. |
replacement |
Replacement of '.bed'. Default replaces by nothing. Can be useful to replace e.g. by '.bim' or '.fam'. |
stop_if_not_ext | If TRUE (the default), throws an error when replacement is not an extension (i.e. does not start with a '.'), as in the last example below. |
String with extension '.bed' replaced by replacement.
path <- "toto.bed" sub_bed(path) sub_bed(path, ".bim") sub_bed(path, ".fam") sub_bed(path, "_QC", stop_if_not_ext = FALSE)
path <- "toto.bed" sub_bed(path) sub_bed(path, ".bim") sub_bed(path, ".fam") sub_bed(path, "_QC", stop_if_not_ext = FALSE)