Title: | Winsorize Data |
---|---|
Description: | Remove outliers by means of winsorization, i.e., shrinking outlying observations to the border of the main part of the data. This package started from the excellent robustHD package by Andreas Alfons in order to reduce the number of dependent packages being pulled in. We expect to update the code over time. |
Authors: | Andreas Alfons [aut] , Dirk Eddelbuettel [aut, cre] |
Maintainer: | Dirk Eddelbuettel <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.0.2.1 |
Built: | 2024-12-12 22:21:39 UTC |
Source: | https://github.com/eddelbuettel/winsorize |
Andreas Alfons wrote winsorize as part of his excellent robustHD package.
See the robustHD package for more.
Compute a robust correlation estimate based on winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.
```r
corHuber(x, y, type = c("bivariate", "adjusted", "univariate"),
         standardized = FALSE, centerFun = median, scaleFun = mad,
         const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, ...)
```
x | a numeric vector. |
---|---|
y | a numeric vector. |
type | a character string specifying the type of winsorization to be used. Possible values are "bivariate" for bivariate winsorization (the default), "adjusted" for adjusted univariate winsorization, or "univariate" for univariate winsorization. |
standardized | a logical indicating whether the data are already robustly standardized. |
centerFun | a function to compute a robust estimate for the center to be used for robust standardization (defaults to median). |
scaleFun | a function to compute a robust estimate for the scale to be used for robust standardization (defaults to mad). |
const | numeric; tuning constant to be used in univariate or adjusted univariate winsorization (defaults to 2). |
prob | numeric; probability for the quantile of the chi-squared distribution to be used in bivariate winsorization (defaults to 0.95). |
tol | a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since the points form almost a straight line, and the initial estimate is returned. |
... | additional arguments to be passed to robStandardize. |
The borders of the main part of the data are defined on the scale of the robustly standardized data. In univariate winsorization, the borders for each variable are given by ±const, thus a symmetric distribution is assumed. In adjusted univariate winsorization, the borders for the two diagonally opposing quadrants containing the minority of the data are shrunken by a factor that depends on the ratio between the number of observations in the major and minor quadrants. It is thus possible to better account for the bivariate structure of the data while maintaining fast computation. In bivariate winsorization, a bivariate normal distribution is assumed and the data are shrunken towards the boundary of a tolerance ellipse with coverage probability prob. The boundary of this ellipse is given by all points whose squared Mahalanobis distance equals the quantile of the chi-squared distribution given by prob. Furthermore, the initial correlation matrix required for the Mahalanobis distances is computed based on adjusted univariate winsorization.
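The univariate case described above can be illustrated with a minimal base R sketch. The helper `cor_winsorized_sketch` is a hypothetical name introduced here for illustration; it is not the package's implementation and covers only the plain univariate type, assuming median and MAD for the robust standardization.

```r
# Hypothetical helper (not part of the package): correlation after
# univariate winsorization of robustly standardized data.
cor_winsorized_sketch <- function(x, y, const = 2) {
  # robust standardization with median and MAD
  zx <- (x - median(x)) / mad(x)
  zy <- (y - median(y)) / mad(y)
  # shrink standardized values beyond +/- const to the border
  zx <- pmin(pmax(zx, -const), const)
  zy <- pmin(pmax(zy, -const), const)
  # ordinary Pearson correlation of the cleaned data
  cor(zx, zy)
}

set.seed(1234)  # for reproducibility
x <- rnorm(100)
y <- 0.6 * x + rnorm(100, sd = 0.8)
x[1] <- 50      # one gross outlier that pulls the
y[1] <- -50     # classical estimate towards -1

r_classical <- cor(x, y)
r_wins <- cor_winsorized_sketch(x, y)
r_classical
r_wins
```

Because the outlier is moved to the border rather than dropped, it still contributes to the estimate, but with bounded influence.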
The robust correlation estimate.
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299.
```r
## Not run: 
## generate data
library("mvtnorm")
set.seed(1234)  # for reproducibility
Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)
xy <- rmvnorm(100, sigma=Sigma)
x <- xy[, 1]
y <- xy[, 2]

## introduce outlier
x[1] <- x[1] * 10
y[1] <- y[1] * (-5)

## compute correlation
cor(x, y)
corHuber(x, y)

## End(Not run)
```
Standardize data with given functions for computing center and scale.
```r
standardize(x, centerFun = mean, scaleFun = sd)

robStandardize(x, centerFun = median, scaleFun = mad,
               fallback = FALSE, eps = .Machine$double.eps, ...)
```
x | a numeric vector, matrix or data frame to be standardized. |
---|---|
centerFun | a function to compute an estimate of the center of a variable (defaults to mean for standardize and median for robStandardize). |
scaleFun | a function to compute an estimate of the scale of a variable (defaults to sd for standardize and mad for robStandardize). |
fallback | a logical indicating whether standardization with mean and sd should be performed as a fallback mode for variables whose robust scale estimate is too small. |
eps | a small positive numeric value used to determine whether the robust scale estimate of a variable is too small (an effective zero). |
... | currently ignored. |
robStandardize is a wrapper function for robust standardization, hence the default is to use median and mad.
An object of the same type as the original data x containing the centered and scaled data. The center and scale estimates of the original data are returned as attributes "center" and "scale", respectively.
The implementation contains special cases for the typically used combinations mean/sd and median/mad in order to reduce computation time.
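To make the role of the "center" and "scale" attributes concrete, here is a minimal base R sketch for a numeric vector. The function `standardize_sketch` is a hypothetical stand-in written for this illustration; it does not reproduce the package's special cases, the fallback mode, or the matrix and data frame handling.

```r
# Hypothetical helper (not the package's code): center and scale a numeric
# vector and keep the estimates as attributes, as the Value section describes.
standardize_sketch <- function(x, centerFun = median, scaleFun = mad) {
  center <- centerFun(x)
  scale <- scaleFun(x)
  z <- (x - center) / scale
  attr(z, "center") <- center
  attr(z, "scale") <- scale
  z
}

set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier

z <- standardize_sketch(x)

# the stored estimates are enough to return to the original scale
x_back <- as.numeric(z) * attr(z, "scale") + attr(z, "center")
all.equal(x_back, x)
```

Keeping the estimates on the returned object means downstream code can backtransform cleaned data without recomputing center and scale.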
Andreas Alfons
```r
## generate data
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier

## standardize data
x
standardize(x)     # mean and sd
robStandardize(x)  # median and MAD
```
Clean data by means of winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.
```r
winsorize(x, ...)

## Default S3 method:
winsorize(x, standardized = FALSE, centerFun = median, scaleFun = mad,
          const = 2, return = c("data", "weights"), ...)

## S3 method for class 'matrix'
winsorize(x, standardized = FALSE, centerFun = median, scaleFun = mad,
          const = 2, prob = 0.95, tol = .Machine$double.eps^0.5,
          return = c("data", "weights"), ...)

## S3 method for class 'data.frame'
winsorize(x, ...)
```
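x | a numeric vector, matrix or data frame to be cleaned. |
---|---|
standardized | a logical indicating whether the data are already robustly standardized. |
centerFun | a function to compute a robust estimate for the center to be used for robust standardization (defaults to median). |
scaleFun | a function to compute a robust estimate for the scale to be used for robust standardization (defaults to mad). |
const | numeric; tuning constant to be used in univariate winsorization (defaults to 2). |
return | character string; if standardized is TRUE, this specifies the type of return value. Possible values are "data" for returning the cleaned data, or "weights" for returning data cleaning weights. |
prob | numeric; probability for the quantile of the chi-squared distribution to be used in multivariate winsorization (defaults to 0.95). |
tol | a small positive numeric value used to determine singularity issues in the computation of correlation estimates based on bivariate winsorization (see corHuber). |
... | for the generic function, additional arguments to be passed down to methods. For the "data.frame" method, additional arguments to be passed down to the "matrix" method. For the other methods, additional arguments to be passed down to robStandardize. |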
The borders of the main part of the data are defined on the scale of the robustly standardized data. In the univariate case, the borders are given by ±const, thus a symmetric distribution is assumed. In the multivariate case, a normal distribution is assumed and the data are shrunken towards the boundary of a tolerance ellipse with coverage probability prob. The boundary of this ellipse is given by all points whose squared Mahalanobis distance equals the quantile of the chi-squared distribution given by prob.
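The multivariate shrinkage step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `winsorize_ellipse_sketch` is a hypothetical helper, the data are taken to be already standardized with center zero, and a Spearman correlation matrix stands in for the initial correlation estimate that the package obtains by other means.

```r
# Hypothetical helper (not the package's implementation): shrink standardized
# observations towards the boundary of a tolerance ellipse.
winsorize_ellipse_sketch <- function(z, R, prob = 0.95) {
  # squared Mahalanobis distances from the origin w.r.t. correlation matrix R
  d2 <- mahalanobis(z, center = rep(0, ncol(z)), cov = R)
  # chi-squared quantile defining the boundary of the ellipse
  q <- qchisq(prob, df = ncol(z))
  # points inside the ellipse keep weight 1, points outside are pulled back
  w <- pmin(sqrt(q / d2), 1)
  # z * w multiplies row i of z by w[i] (the vector is recycled down columns)
  list(data = z * w, weights = w)
}

set.seed(1234)  # for reproducibility
z <- cbind(rnorm(50), rnorm(50))
z[1, ] <- c(8, -8)                # introduce outlier
R <- cor(z, method = "spearman")  # assumption: stand-in initial estimate

res <- winsorize_ellipse_sketch(z, R)
d2_clean <- mahalanobis(res$data, center = c(0, 0), cov = R)
max(d2_clean) <= qchisq(0.95, df = 2) + 1e-8
```

Scaling an observation by a weight w scales its squared Mahalanobis distance by w squared, which is why the square root of the quantile-to-distance ratio places outlying points exactly on the boundary.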
If standardized is TRUE and return is "weights", a set of data cleaning weights is returned. Multiplying each observation of the standardized data by the corresponding weight yields the cleaned standardized data.
Otherwise, an object of the same type as the original data x containing the cleaned data is returned.
Data cleaning weights are only meaningful for standardized data. In the general case, the data need to be standardized first, then the data cleaning weights can be computed and applied to the standardized data, after which the cleaned standardized data need to be backtransformed to the original scale.
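The sequence described in this note (standardize, compute weights, apply them, backtransform) can be written out in base R for the univariate case. This is a sketch under the assumption that a univariate weight equals const divided by the absolute standardized value, capped at 1; it uses no package functions and all variable names are illustrative.

```r
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier
const <- 2

# 1. robustly standardize the data
center <- median(x)
scale <- mad(x)
z <- (x - center) / scale

# 2. compute data cleaning weights on the standardized scale
#    (assumed form: 1 inside the borders, const / |z| outside)
w <- pmin(const / abs(z), 1)

# 3. apply the weights to the standardized data
z_clean <- z * w

# 4. backtransform the cleaned standardized data to the original scale
x_clean <- z_clean * scale + center

cbind(x, w, x_clean)
```

Applying the weights to the raw data instead of the standardized data would shrink observations towards zero rather than towards the robust center, which is why the note insists on standardizing first.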
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299.
```r
## generate data
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier

## winsorize data
x
winsorize(x)
```