Title: | Utility Functions for Large-scale Data |
---|---|
Description: | Utility functions for large-scale data. For now, package 'bigutilsr' mainly includes functions for outlier detection and unbiased PCA projection. |
Authors: | Florian Privé [aut, cre] |
Maintainer: | Florian Privé <[email protected]> |
License: | GPL-3 |
Version: | 0.3.9 |
Built: | 2024-10-22 15:22:08 UTC |
Source: | https://github.com/privefl/bigutilsr |
Transform a data frame into a matrix using one hot encoding.
as_model_matrix(df, intercept = FALSE)
as_model_matrix(df, intercept = FALSE)
df |
A data frame. |
intercept |
Whether to have a column with all |
A matrix.
mat <- as_model_matrix(iris) str(mat)
mat <- as_model_matrix(iris) str(mat)
Deprecated
covRob(data, ...)
covRob(data, ...)
data |
A matrix. |
... |
Not used. |
Computes a robust multivariate location and scatter estimate with a high breakdown point, using the pairwise algorithm proposed by Marona and Zamar (2002) which in turn is based on the pairwise robust estimator proposed by Gnanadesikan-Kettenring (1972).
covrob_ogk(U, niter = 2, beta = 0.9) dist_ogk(U, niter = 2, beta = 0.9)
covrob_ogk(U, niter = 2, beta = 0.9) dist_ogk(U, niter = 2, beta = 0.9)
U |
A matrix with no missing values and at least 2 columns. |
niter |
Number of number of iterations for the first step of the algorithm, usually 1 or 2 since iterations beyond the second do not lead to improvement. |
beta |
Coverage parameter for the final reweighted estimate.
Default is |
covrob_ogk()
: list of robust estimates, $cov
and $center
.
dist_ogk()
: vector of robust Mahalanobis (squared) distances.
Maronna, R.A. and Zamar, R.H. (2002) Robust estimates of location and dispersion of high-dimensional datasets; Technometrics 44(4), 307–317.
Yohai, R.A. and Zamar, R.H. (1998) High breakdown point estimates of regression by means of the minimization of efficient scale JASA 86, 403–413.
Gnanadesikan, R. and John R. Kettenring (1972) Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81–124.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 5) U <- svd$u dist <- dist_ogk(U) str(dist)
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 5) U <- svd$u dist <- dist_ogk(U) str(dist)
Compute the geometric median, i.e. the point that minimizes the sum of all
Euclidean distances to the observations (rows of U
).
geometric_median(U, tol = 1e-10, maxiter = 1000, by_grp = NULL)
geometric_median(U, tol = 1e-10, maxiter = 1000, by_grp = NULL)
U |
A matrix (e.g. PC scores). |
tol |
Convergence criterion. Default is |
maxiter |
Maximum number of iterations. Default is |
by_grp |
Possibly a vector for splitting rows of |
The geometric median of all rows of U
, a vector of the same size
as ncol(U)
. If providing by_grp
, then a matrix with rows being the
geometric median within each group.
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) pop <- rep(1:3, c(143, 167, 207)) svd <- svds(scale(X), k = 5) U <- sweep(svd$u, 2, svd$d, '*') plot(U, col = pop, pch = 20) med_all <- geometric_median(U) points(t(med_all), pch = 20, col = "blue", cex = 4) med_pop <- geometric_median(U, by_grp = pop) points(med_pop, pch = 20, col = "blue", cex = 2)
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) pop <- rep(1:3, c(143, 167, 207)) svd <- svds(scale(X), k = 5) U <- sweep(svd$u, 2, svd$d, '*') plot(U, col = pop, pch = 20) med_all <- geometric_median(U) points(t(med_all), pch = 20, col = "blue", cex = 4) med_pop <- geometric_median(U, by_grp = pop) points(med_pop, pch = 20, col = "blue", cex = 2)
Outlier detection based on departure from histogram. Suitable for compact values (need a space between main values and outliers).
hist_out(x, breaks = nclass.scottRob, pmax_out = 0.2, nboot = NULL)
hist_out(x, breaks = nclass.scottRob, pmax_out = 0.2, nboot = NULL)
x |
Numeric vector (with compact values). |
breaks |
Same parameter as for |
pmax_out |
Percentage at each side that can be considered outliers at
each step. Default is |
nboot |
Number of bootstrap replicates to estimate limits more robustly.
Default is |
A list with
x
: the initial vector, whose outliers have been removed,
lim
: lower and upper limits for outlier removal,
all_lim
: all bootstrap replicates for lim
(if nboot
not NULL
).
set.seed(1) x <- rnorm(1000) str(hist_out(x)) # Easy to separate x2 <- c(x, rnorm(50, mean = 7)) hist(x2, breaks = nclass.scottRob) str(hist_out(x2)) # More difficult to separate x3 <- c(x, rnorm(50, mean = 6)) hist(x3, breaks = nclass.scottRob) str(hist_out(x3)) str(hist_out(x3, nboot = 999))
set.seed(1) x <- rnorm(1000) str(hist_out(x)) # Easy to separate x2 <- c(x, rnorm(50, mean = 7)) hist(x2, breaks = nclass.scottRob) str(hist_out(x2)) # More difficult to separate x3 <- c(x, rnorm(50, mean = 6)) hist(x3, breaks = nclass.scottRob) str(hist_out(x3)) str(hist_out(x3, nboot = 999))
Find K nearest neighbours for multiple query points
knn_parallel(data, query = data, k, ..., ncores = bigparallelr::nb_cores())
knn_parallel(data, query = data, k, ..., ncores = bigparallelr::nb_cores())
data |
Mxd matrix of M target points with dimension d |
query |
Nxd matrix of N query points with dimension d (nb |
k |
an integer number of nearest neighbours to find |
... |
Arguments passed on to
|
ncores |
Number of cores to use. Default uses |
A list with elements nn.idx
(1-indexed indices) and
nn.dists
(distances), both of which are N x k matrices. See details
for the results obtained with1 invalid inputs.
## Not run: knn_parallel(matrix(1:4, 2), k = 2, ncores = 2)
## Not run: knn_parallel(matrix(1:4, 2), k = 2, ncores = 2)
LOF: Identifying Density-Based Local Outliers.
LOF( U, seq_k = c(4, 10, 30), combine = max, robMaha = FALSE, log = TRUE, ncores = 1 )
LOF( U, seq_k = c(4, 10, 30), combine = max, robMaha = FALSE, log = TRUE, ncores = 1 )
U |
A matrix, from which to detect outliers (rows). E.g. PC scores. |
seq_k |
Sequence of numbers of nearest neighbors to use.
If multiple |
combine |
How to combine results for multiple |
robMaha |
Whether to use a robust Mahalanobis distance instead of the
normal euclidean distance? Default is |
log |
Whether to return the logarithm of LOFs? Default is |
ncores |
Number of cores to use. Default is |
Breunig, Markus M., et al. "LOF: identifying density-based local outliers." ACM sigmod record. Vol. 29. No. 2. ACM, 2000.
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 10) llof <- LOF(svd$u) hist(llof, breaks = nclass.scottRob) tukey_mc_up(llof) llof_maha <- LOF(svd$u, robMaha = TRUE) hist(llof_maha, breaks = nclass.scottRob) tukey_mc_up(llof_maha) lof <- LOF(svd$u, log = FALSE) hist(lof, breaks = nclass.scottRob) str(hist_out(lof)) str(hist_out(lof, nboot = 100)) str(hist_out(lof, nboot = 100, breaks = "FD"))
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 10) llof <- LOF(svd$u) hist(llof, breaks = nclass.scottRob) tukey_mc_up(llof) llof_maha <- LOF(svd$u, robMaha = TRUE) hist(llof_maha, breaks = nclass.scottRob) tukey_mc_up(llof_maha) lof <- LOF(svd$u, log = FALSE) hist(lof, breaks = nclass.scottRob) str(hist_out(lof)) str(hist_out(lof, nboot = 100)) str(hist_out(lof, nboot = 100, breaks = "FD"))
Transform matrix to use Mahalanobis distance instead of Euclidean one.
maha_trans(U, estim = covrob_ogk(U))
maha_trans(U, estim = covrob_ogk(U))
U |
A matrix (e.g. PC scores). |
estim |
List of location and scatter estimates, |
U
, transformed.
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 5) U <- svd$u dist1 <- dist_ogk(U) U.maha <- maha_trans(U) dist2 <- rowSums(U.maha^2) all.equal(dist2, dist1)
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 5) U <- svd$u dist1 <- dist_ogk(U) U.maha <- maha_trans(U) dist2 <- rowSums(U.maha^2) all.equal(dist2, dist1)
Compute the Number of Classes for a Histogram
nclass.scottRob(x)
nclass.scottRob(x)
x |
a data vector. |
The suggested number of classes.
Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610. doi: 10.2307/2335182.
x <- rnorm(1000) hist(x, breaks = nclass.scott) hist(x, breaks = nclass.scottRob) x2 <- c(x, rnorm(50, mean = 50)) hist(x2, breaks = nclass.scott) hist(x2, breaks = nclass.scott, xlim = c(-5, 5)) hist(x2, breaks = nclass.scottRob, xlim = c(-5, 5))
x <- rnorm(1000) hist(x, breaks = nclass.scott) hist(x, breaks = nclass.scottRob) x2 <- c(x, rnorm(50, mean = 50)) hist(x2, breaks = nclass.scott) hist(x2, breaks = nclass.scott, xlim = c(-5, 5)) hist(x2, breaks = nclass.scottRob, xlim = c(-5, 5))
Estimate the number of distant spikes based on the histogram of eigenvalues.
pca_nspike(eigval, breaks = "FD", nboot = 100)
pca_nspike(eigval, breaks = "FD", nboot = 100)
eigval |
Eigenvalues (squared singular values). |
breaks |
Same parameter as for |
nboot |
Number of bootstrap replicates to estimate limits more robustly.
Default is |
The estimated number of distant spikes.
N <- 400; M <- 2000; K <- 8 U <- matrix(0, N, K); U[] <- rnorm(length(U)) V <- matrix(0, M, K); V[] <- rnorm(length(V)) # X = U V^T + E X <- tcrossprod(U, V) + 15 * rnorm(N * M) pca <- prcomp(X) eigval <- pca$sdev^2 plot(head(eigval, -1), log = "xy", pch = 20) pca_nspike(eigval)
N <- 400; M <- 2000; K <- 8 U <- matrix(0, N, K); U[] <- rnorm(length(U)) V <- matrix(0, M, K); V[] <- rnorm(length(V)) # X = U V^T + E X <- tcrossprod(U, V) + 15 * rnorm(N * M) pca <- prcomp(X) eigval <- pca$sdev^2 plot(head(eigval, -1), log = "xy", pch = 20) pca_nspike(eigval)
Online Augmentation, Decomposition, and Procrustes (OADP) projection of
PC loadings onto some study data X
.
pca_OADP_proj(X, loadings, sval) pca_OADP_proj2(XV, X_norm, sval)
pca_OADP_proj(X, loadings, sval) pca_OADP_proj2(XV, X_norm, sval)
X |
Data to get PC loadings into. |
loadings |
PC loadings of the reference PCA to project. |
sval |
Singular values of the reference PCA (sqrt of the eigen values).
Only the |
XV |
|
X_norm |
Vector of sums of squared rows (e.g. |
pca_OADP_proj()
: A list with the simple projection X %*% loadings
and the projection based on OADP.
pca_OADP_proj2()
: The projection based on OADP only
(a matrix of same size of XV
).
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) N <- 400; M <- ncol(X) ind <- sample(nrow(X), N) # Compute SVD using one part of samples svd <- svds(X[ind, ], k = 5) U <- sweep(svd$u, 2, svd$d, '*') col <- 2:3 plot(U[, col]) points(cbind(0, 0), pch = 8, col = "green", cex = 2) # Projecting other samples proj <- pca_OADP_proj(X = X[-ind, ], loadings = svd$v, sval = svd$d) points(proj$simple_proj[, col], col = "red", pch = 20) # shrunk towards 0 points(proj$OADP_proj[, col], col = "blue", pch = 20) # unshrunk
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) N <- 400; M <- ncol(X) ind <- sample(nrow(X), N) # Compute SVD using one part of samples svd <- svds(X[ind, ], k = 5) U <- sweep(svd$u, 2, svd$d, '*') col <- 2:3 plot(U[, col]) points(cbind(0, 0), pch = 8, col = "green", cex = 2) # Projecting other samples proj <- pca_OADP_proj(X = X[-ind, ], loadings = svd$v, sval = svd$d) points(proj$simple_proj[, col], col = "red", pch = 20) # shrunk towards 0 points(proj$OADP_proj[, col], col = "blue", pch = 20) # unshrunk
Predict method for class Procrustes
.
## S3 method for class 'Procrustes' predict(object, X, ...)
## S3 method for class 'Procrustes' predict(object, X, ...)
object |
Object of class |
X |
New matrix to transform. |
... |
Not used. |
X
, transformed.
Probabilistic set distance
prob_dist(U, kNN = 5, robMaha = FALSE, ncores = 1)
prob_dist(U, kNN = 5, robMaha = FALSE, ncores = 1)
U |
A matrix, from which to detect outliers (rows). E.g. PC scores. |
kNN |
Number of nearest neighbors to use. Default is |
robMaha |
Whether to use a robust Mahalanobis distance instead of the
normal euclidean distance? Default is |
ncores |
Number of cores to use. Default is |
Kriegel, Hans-Peter, et al. "LoOP: local outlier probabilities." Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 2009.
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 10) U <- svd$u test <- prob_dist(U) plof <- test$dist.self / test$dist.nn plof_ish <- test$dist.self / sqrt(test$dist.nn) plot(U[, 1:2], col = (plof_ish > tukey_mc_up(plof_ish)) + 1, pch = 20) plot(U[, 3:4], col = (plof_ish > tukey_mc_up(plof_ish)) + 1, pch = 20) plot(U[, 5:6], col = (plof_ish > tukey_mc_up(plof_ish)) + 1, pch = 20)
X <- readRDS(system.file("testdata", "three-pops.rds", package = "bigutilsr")) svd <- svds(scale(X), k = 10) U <- svd$u test <- prob_dist(U) plof <- test$dist.self / test$dist.nn plof_ish <- test$dist.self / sqrt(test$dist.nn) plot(U[, 1:2], col = (plof_ish > tukey_mc_up(plof_ish)) + 1, pch = 20) plot(U[, 3:4], col = (plof_ish > tukey_mc_up(plof_ish)) + 1, pch = 20) plot(U[, 5:6], col = (plof_ish > tukey_mc_up(plof_ish)) + 1, pch = 20)
Procrustes transform Y = pXR (after centering), where p is a scaling coefficient and R is a rotation matrix that minimize ||Y - pXR||_F.
procrustes(Y, X, n_iter_max = 1000, epsilon_min = 1e-07)
procrustes(Y, X, n_iter_max = 1000, epsilon_min = 1e-07)
Y |
Reference matrix. |
X |
Matrix to transform ( |
n_iter_max |
Maximum number of iterations. Default is |
epsilon_min |
Convergence criterion. Default is |
Object of class "procrustes", a list with the following elements:
$R
: the rotation matrix to apply to X
,
$rho
: the scaling coefficient to apply to X
,
$c
: the column centering to apply to the resulting matrix,
$diff
: the average difference between Y
and X
transformed.
You can use method predict()
to apply this transformation to other data.
A <- matrix(rnorm(200), ncol = 20) B <- matrix(rnorm(length(A)), nrow = nrow(A)) proc <- procrustes(B, A) str(proc) plot(B, predict(proc, A)); abline(0, 1, col = "red")
A <- matrix(rnorm(200), ncol = 20) B <- matrix(rnorm(length(A)), nrow = nrow(A)) proc <- procrustes(B, A) str(proc) plot(B, predict(proc, A)); abline(0, 1, col = "red")
Use the graphical lasso algorithm to regularize a square symmetric matrix (e.g. a covariance or correlation matrix) by assuming that its inverse has many zeros.
regul_glasso( mat, lambda, maxiter_outer = 200, maxiter_lasso = 200, tol = 1e-04, verbose = FALSE )
regul_glasso( mat, lambda, maxiter_outer = 200, maxiter_lasso = 200, tol = 1e-04, verbose = FALSE )
mat |
A square symmetric matrix. |
lambda |
Strength of regularization. It needs to be scaled with |
maxiter_outer |
Maximum number of iterations of the outer loop. Default is 200. |
maxiter_lasso |
Maximum number of iterations of each lasso solver. Default is 200. |
tol |
Tolerance for assessing convergence. Default is 1e-4 and it needs
to be scaled with |
verbose |
Whether to print iterations and differences. Default is FALSE. |
The regularized matrix, where the diagonal should be the same and
zeros should be kept as well. It also returns the lambda
used as an attribute.
(cov <- cov(iris[1:4])) lambda <- 1 / sqrt(nrow(iris)) (cov_regul <- regul_glasso(cov, lambda))
(cov <- cov(iris[1:4])) lambda <- 1 / sqrt(nrow(iris)) (cov_regul <- regul_glasso(cov, lambda))
Gaussian smoothing
rollmean(x, size)
rollmean(x, size)
x |
Numeric vector. |
size |
Radius of the smoothing (smaller than half of the length of |
Numeric vector of the same length as x
, smoothed.
(x <- rnorm(10)) rollmean(x, 3)
(x <- rnorm(10)) rollmean(x, 3)
Solve (A + lam I) x = b
solve_linear_system(A, b, add_to_diag = 0)
solve_linear_system(A, b, add_to_diag = 0)
A |
A symmetric square matrix. |
b |
A vector. |
add_to_diag |
One value to add to the diagonal of A (lam). Default is 0. |
The best solution x
of this linear system.
A <- matrix(rnorm(4), 2); A[1, 2] <- A[2, 1] # should be symmetric x <- rnorm(2) b <- A %*% x x2 <- drop(solve(A, b)) x3 <- solve_linear_system(A, b) rbind(x, x2, x3)
A <- matrix(rnorm(4), 2); A[1, 2] <- A[2, 1] # should be symmetric x <- rnorm(2) b <- A %*% x x2 <- drop(solve(A, b)) x3 <- solve_linear_system(A, b) rbind(x, x2, x3)
Outlier detection threshold (upper) based on Tukey's rule, corrected for skewness using the 'medcouple', and possibly corrected for multiple testing.
tukey_mc_up(x, coef = NULL, alpha = 0.05, a = -4, b = 3)
tukey_mc_up(x, coef = NULL, alpha = 0.05, a = -4, b = 3)
x |
Numeric vector. Should be somewhat normally distributed. |
coef |
number determining how far 'whiskers' extend out from the box.
If |
alpha |
See |
a , b
|
scaling factors multiplied by the medcouple
|
Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186–5201. doi:10.1016/j.csda.2007.11.008
hist(x <- c(rnorm(3, m = 6), rnorm(1e4, m = 0))) (q <- tukey_mc_up(x)) abline(v = q, col = "red") which(x > q)
hist(x <- c(rnorm(3, m = 6), rnorm(1e4, m = 0))) (q <- tukey_mc_up(x)) abline(v = q, col = "red") which(x > q)