Robustbase.jl

Documentation for Robustbase.jl

Univariate: Robust scales

The univariate module contains several univariate location and scale estimators. They are all subtypes of the abstract type RobustScale, which exposes the properties location and scale and is fitted with the fit! method. Each subtype is expected to implement a _calculate method that sets the attributes scale_ and location_.
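The interface described above can be pictured with a small standalone sketch. It does not use Robustbase.jl itself: `SketchScale` and `MadSketch` are hypothetical stand-ins, and the accessors are redefined locally, so this only illustrates the pattern of a `_calculate` method that fills `location_` and `scale_`.

```julia
using Statistics

# Hypothetical stand-in for the abstract type; not the package's definition.
abstract type SketchScale end

# A toy estimator: median as location, normalized MAD as scale.
mutable struct MadSketch <: SketchScale
    location_::Float64
    scale_::Float64
    MadSketch() = new(NaN, NaN)
end

# Each concrete estimator sets location_ and scale_ in _calculate.
function _calculate(est::MadSketch, x::AbstractVector{<:Real})
    est.location_ = median(x)
    est.scale_ = 1.4826 * median(abs.(x .- est.location_))
    return est
end

# Generic entry points shared by every estimator in this sketch.
fit!(est::SketchScale, x::AbstractVector{<:Real}) = _calculate(est, x)
location(est::SketchScale) = est.location_
scale(est::SketchScale) = est.scale_

est = fit!(MadSketch(), [1.0, 2.0, 3.0, 4.0, 100.0])
location(est)   # 3.0
scale(est)      # 1.4826
```

The outlying value 100.0 affects neither result, which is the behaviour the robust estimators below aim for.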

Tau scale

Robustbase.Tau (Type)
Tau <: RobustScale

Tau(; c1::Float64=4.5, c2::Float64=3.0, consistency_correction::Bool=true, can_handle_nan::Bool=false)

Robust Tau-Estimate of Scale

Computes the robust τ-estimate of univariate scale, as proposed by Maronna and Zamar (2002), improved by a consistency factor.

Keywords

`c1::Float64=4.5`, optional: constant for the weight function, defaults to 4.5
`c2::Float64=3.0`, optional: constant for the rho function, defaults to 3.0
`consistency_correction::Bool`, optional: boolean indicating if consistency for normality should be applied. Defaults to true.
`can_handle_nan::Bool`, optional: boolean indicating whether NaNs can be handled. Defaults to false.
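To make the roles of c1 and c2 concrete, here is a minimal sketch of the τ-scale as defined in Maronna and Zamar (2002): a weighted mean with weights controlled by c1, followed by a truncated sum of squares controlled by c2. The names `tau_weight`, `tau_rho` and `tau_sketch` are hypothetical and not part of Robustbase.jl, and the consistency correction is left out, so the numbers will not match `Tau()`.

```julia
using Statistics

# Weight function W_c(u) = (1 - (u/c)^2)^2 for |u| <= c, and 0 otherwise.
tau_weight(u, c) = abs(u) <= c ? (1 - (u / c)^2)^2 : 0.0

# Bounded rho function rho_c(u) = min(u^2, c^2).
tau_rho(u, c) = min(u^2, c^2)

function tau_sketch(x::AbstractVector{<:Real}; c1=4.5, c2=3.0)
    med = median(x)
    sigma0 = median(abs.(x .- med))            # raw MAD, no consistency constant
    w = tau_weight.((x .- med) ./ sigma0, c1)  # downweight points far from the median
    mu = sum(w .* x) / sum(w)                  # weighted mean as location
    sigma = sigma0 * sqrt(mean(tau_rho.((x .- mu) ./ sigma0, c2)))
    return (location=mu, scale=sigma)
end

r = tau_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
r.location   # approximately 3.0
r.scale      # approximately sqrt(2)
```

The sketch assumes the raw MAD is non-zero; data in which more than half of the values coincide would need separate handling.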

References

Robust Estimates of Location and Dispersion for High-Dimensional Datasets,
Ricardo A. Maronna and Ruben H. Zamar (2002)

Examples

julia> x = hbk[:,1];
julia> tau = Tau();
julia> fit!(tau, x);
julia> location(tau)
1.5554543865072337

julia> scale(tau)
2.012461881814477

Qn scale

Robustbase.Qn (Type)
Qn <: RobustScale

Qn(;location_func::Function=median, consistency_correction::Bool=true, finite_correction::Bool=true, can_handle_nan::Bool = false)

Robust Location-Free Scale Estimate More Efficient than MAD

Keywords

`location_func::Function=median`, optional: as the Qn estimator does not
    estimate location, a location function should be explicitly passed. Defaults to median.
`consistency_correction::Bool`, optional: boolean indicating if consistency for normality should be applied. Defaults to true.
`finite_correction::Bool`, optional: boolean indicating if finite sample correction should be applied. Defaults to true.
`can_handle_nan::Bool`, optional: boolean indicating whether NaNs can be handled. Defaults to false.
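For orientation, the quantity behind Qn is the k-th smallest pairwise distance |x_i - x_j| (i < j), with k = binomial(h, 2) and h = floor(n/2) + 1, multiplied by a consistency constant of about 2.2219. The sketch below computes it naively in O(n^2) time; the cited references describe a much faster algorithm. `qn_sketch` is a hypothetical name, the constant is taken from the literature rather than from the package source, and no finite sample correction is applied.

```julia
function qn_sketch(x::AbstractVector{<:Real}; constant=2.2219)
    n = length(x)
    h = div(n, 2) + 1
    k = div(h * (h - 1), 2)                    # binomial(h, 2)
    # All pairwise absolute differences with i < j.
    diffs = [abs(x[i] - x[j]) for i in 1:n-1 for j in i+1:n]
    return constant * sort(diffs)[k]           # k-th order statistic, rescaled
end

# With constant=1.0 the raw order statistic is returned:
# n = 5 gives h = 3 and k = 3, and the third smallest difference is 1.0.
qn_sketch([1.0, 2.0, 3.0, 4.0, 5.0]; constant=1.0)   # 1.0
```

No location estimate appears anywhere in this computation, which is why a separate location function has to be supplied.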

References

Christophe Croux and Peter J. Rousseeuw (1992). Time-efficient algorithms for two highly robust estimators of scale.

Donald B. Johnson and Tetsuo Mizoguchi (1978). Selecting the k^th element in X+Y and X1+...+Xm

Examples

julia> x = hbk[:,1];

julia> qn = Qn();
julia> fit!(qn, x);

julia> location(qn)         # the median
1.8

julia> scale(qn)
1.7427832460732984

Location

Robustbase.location (Function)
location(rs::RobustScale)

Univariate location

location(model::CovarianceEstimator)::Vector{Float64}

Location vector


Scale

Multivariate: Covariance

Various robust estimators of covariance matrices ("scatter matrices") have been proposed in the literature, with different properties. The covariance module implements several frequently used scatter estimators. They all use the new abstract type RobustCovariance, which builds on the CovarianceEstimator type in StatsBase.

Classical Location and Scatter Estimation: CovClassic

Robustbase.CovClassic (Type)
CovClassic <: CovarianceEstimator <: Any

CovClassic(;assume_centered=false)

Classical Location and Scatter Estimation

Compute the classical estimates of the location vector and covariance matrix of a data matrix.
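In plain terms these are the column means and the sample covariance matrix. A minimal sketch using only the Statistics standard library is shown below. `classic_sketch` is a hypothetical name, and the treatment of `assume_centered` (location fixed at zero, division by n instead of n - 1) is an assumption about the intended behaviour, not taken from the package source.

```julia
using Statistics

function classic_sketch(X::AbstractMatrix{<:Real}; assume_centered=false)
    n, p = size(X)
    # Location: column means, or the origin if the data is assumed centered.
    loc = assume_centered ? zeros(p) : vec(mean(X, dims=1))
    Z = X .- loc'                              # center each column
    denom = assume_centered ? n : n - 1        # assumption: no degree of freedom lost when centered
    return (location=loc, covariance=(Z' * Z) ./ denom)
end

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
est = classic_sketch(X)
est.location     # [3.0, 4.0]
est.covariance   # [4.0 4.0; 4.0 4.0]
```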

Examples

julia> cc=CovClassic();
julia> fit!(cc, hbk[:,1:3])
-> Method:  Classical Estimator. 

Estimate of location:
[3.20667, 5.59733, 7.23067]

Estimate of covariance:
3×3 Matrix{Float64}:
 13.3417  28.4692   41.244
 28.4692  67.883    94.6656
 41.244   94.6656  137.835

Covariance matrix: covariance

Correlation matrix: correlation

Robust Location and Scatter Estimation via MCD: CovMcd

Robustbase.CovMcd (Type)
CovMcd <: RobustCovariance <: CovarianceEstimator 

CovMcd(;assume_centered=false, alpha=nothing, n_initial_subsets=500, n_initial_c_steps=2,
    n_best_subsets=10, n_partitions=nothing, tolerance=1e-8, reweighting=true, verbosity=Logging.Warn)

Robust Location and Scatter Estimation via MCD

Compute the Minimum Covariance Determinant (MCD) estimator, a robust multivariate location and scale estimate with a high breakdown point, via the 'Fast MCD' algorithm proposed in Rousseeuw and Van Driessen (1999).

Keywords:

`alpha::Float64 | Int | Nothing`, optional:
    size of the h subset.
    If an integer between n/2 and n is passed, it is interpreted as an absolute value.
    If a float between 0.5 and 1 is passed, it is interpreted as a proportion
    of n (the training set size).
    If nothing, it is set to (n+p+1) / 2.
    Defaults to nothing.
`n_initial_subsets::Int`, optional: number of initial random subsets of size p+1
`n_initial_c_steps::Int`, optional: number of initial c steps to perform on all initial subsets
`n_best_subsets::Int`, optional: number of best subsets to keep and perform c steps on until convergence
`n_partitions::Int | Nothing`, optional: number of partitions to split the data into.
    This can speed up the algorithm for large datasets (n > 600 suggested in paper).
    If nothing, 5 partitions are used if n > 600, otherwise 1 partition is used.
`tolerance::Float64`, optional: minimum difference in determinant between two iterations to stop the C-step
`reweighting::Bool`, optional: whether to apply reweighting to the raw covariance estimate
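The three ways of specifying alpha can be summarized in a short helper. This is a sketch under assumptions: `subset_size` is a hypothetical name, and the rounding (integer division for the default, floor for a proportion) is a guess that happens to agree with the h=39 reported in the example for n = 75 observations and p = 3 variables; the package may round differently.

```julia
function subset_size(alpha, n::Int, p::Int)
    if alpha === nothing
        return div(n + p + 1, 2)               # default: (n+p+1) / 2, rounded down
    elseif alpha isa Integer
        n / 2 <= alpha <= n ||
            throw(ArgumentError("integer alpha must lie between n/2 and n"))
        return Int(alpha)                      # absolute subset size
    else
        0.5 <= alpha <= 1 ||
            throw(ArgumentError("float alpha must lie between 0.5 and 1"))
        return floor(Int, alpha * n)           # proportion of n; rounding is an assumption
    end
end

subset_size(nothing, 75, 3)   # 39
subset_size(60, 75, 3)        # 60
subset_size(0.75, 75, 3)      # 56
```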

References:

Rousseeuw and Van Driessen, A Fast Algorithm for the Minimum Covariance Determinant
Estimator, 1999, American Statistical Association and
the American Society for Quality, TECHNOMETRICS

Examples

julia> mcd=CovMcd();
julia> fit!(mcd, hbk[:,1:3])
-> Method:  Fast MCD Estimator: (alpha=nothing ==> h=39)

Robust estimate of location:
[1.55833, 1.80333, 1.66]

Robust estimate of covariance:
3×3 Matrix{Float64}:
 1.21312    0.0239154  0.165793
 0.0239154  1.22836    0.195735
 0.165793   0.195735   1.12535

julia> dd_plot(mcd);

Deterministic MCD estimator: DetMcd

Robustbase.DetMcd (Type)
DetMcd(; assume_centered=false,  alpha=nothing, n_maxcsteps=200, tolerance=1e-8,
    reweighting=true, verbosity=Logging.Warn)

Deterministic MCD estimator (DetMCD) based on the algorithm proposed in Hubert, Rousseeuw and Verdonck (2012)

Keywords:

`alpha::Float64 | Int | Nothing`, optional: size of the h subset.
    If an integer between n/2 and n is passed, it is interpreted as an absolute value.
    If a float between 0.5 and 1 is passed, it is interpreted as a proportion
    of n (the training set size).
    If nothing, it is set to (n+p+1) / 2.
    Defaults to nothing.
`n_maxcsteps::Int=200`, optional: Maximum number of C-step iterations
`tolerance::Float64`, optional: Minimum difference in determinant between two iterations to stop the C-step
`reweighting::Bool`, optional: Whether to apply reweighting to the raw covariance estimate
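The roles of n_maxcsteps and tolerance are easiest to see in a sketch of the concentration step (C-step) loop used by MCD-type algorithms: estimate location and scatter from the current subset, keep the h observations with the smallest Mahalanobis distances, and stop once the determinant no longer decreases by at least the tolerance. `c_step` and `concentrate` are hypothetical names, and the sketch leaves out the construction of initial subsets and the reweighting step.

```julia
using LinearAlgebra, Statistics

# One C-step: refit on subset H, then keep the points closest to that fit.
function c_step(X::AbstractMatrix{<:Real}, H::AbstractVector{<:Integer})
    mu = vec(mean(X[H, :], dims=1))
    S = cov(X[H, :])
    Z = X .- mu'
    d2 = vec(sum((Z / S) .* Z, dims=2))        # squared Mahalanobis distances
    return sort(sortperm(d2)[1:length(H)])
end

# Repeat C-steps until the determinant stops decreasing.
function concentrate(X, H; n_maxcsteps=200, tolerance=1e-8)
    det_old = det(cov(X[H, :]))
    for _ in 1:n_maxcsteps
        H = c_step(X, H)
        det_new = det(cov(X[H, :]))
        if det_old - det_new < tolerance
            return (H, det_new)
        end
        det_old = det_new
    end
    return (H, det_old)
end

# Ten points near a line plus two gross outliers (rows 11 and 12).
X = hcat(collect(1.0:12.0),
         [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0, 30.0, 40.0])
H, d = concentrate(X, collect(1:7))
```

A C-step never increases the determinant of the subset covariance (provided that covariance is non-singular), so the loop converges after a finite number of steps.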

References:

Hubert, Rousseeuw and Verdonck, A deterministic algorithm for robust location
and scatter, 2012, Journal of Computational and Graphical Statistics

Examples

julia> mcd=DetMcd();
julia> fit!(mcd, hbk[:,1:3])
-> Method:  Deterministic MCD: (alpha=nothing ==> h=39)

Robust estimate of location:
[1.5377, 1.78033, 1.68689]

Robust estimate of covariance:
3×3 Matrix{Float64}:
 1.2209     0.0547372  0.126544
 0.0547372  1.2427     0.151783
 0.126544   0.151783   1.15414

julia> dd_plot(mcd);

Orthogonalized Gnanadesikan-Kettenring estimator: CovOgk

Robustbase.CovOgk (Type)
CovOgk(;store_precision::Bool=true, assume_centered::Bool=false, location_estimator::Function=median,
    scale_estimator::Function=MAD_scale, n_iterations::Int=2, reweighting::Bool=false,
    reweighting_beta::Float64=0.9)

Implementation of the Orthogonalized Gnanadesikan-Kettenring estimator for location and dispersion proposed in Maronna, R. A., & Zamar, R. H. (2002).

Keywords:

`store_precision::Bool=true`, optional: whether to store the precision matrix
`assume_centered::Bool=false`, optional: whether the data is already centered
`location_estimator::Function=median`, optional: function to estimate the
    location of the data, should accept an array-like input as first value and a named
    argument axis
`scale_estimator::Function=MAD_scale`, optional: function to estimate the scale
    of the data, should accept an array-like input as first value and a named argument
    axis
`n_iterations::Int=2`, optional: number of iterations for the orthogonalization step
`reweighting::Bool=false`, optional: whether to apply reweighting at the end
    (i.e. calculating regular location and covariance after filtering outliers based on
    Mahalanobis distance using OGK estimates)
`reweighting_beta::Float64=0.9`, optional: quantile of the chi2 distribution to use as cutoff for
    reweighting
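The building block of the estimator is the Gnanadesikan-Kettenring identity, which turns any univariate scale estimator σ into a pairwise covariance estimate: cov(x, y) = (σ(x + y)^2 - σ(x - y)^2) / 4. The sketch below shows only this identity; the orthogonalization iterations and the reweighting are not included. `mad_scale` and `gk_cov` are hypothetical names defined here for illustration and are not the package's `MAD_scale`.

```julia
using Statistics

# Normalized median absolute deviation as a robust univariate scale.
mad_scale(x) = 1.4826 * median(abs.(x .- median(x)))

# Pairwise covariance from a univariate scale estimator.
function gk_cov(x, y; scale_estimator=mad_scale)
    return (scale_estimator(x .+ y)^2 - scale_estimator(x .- y)^2) / 4
end

x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [2.0, 1.0, 4.0, 3.0, 0.0]

gk_cov(x, y)                        # robust pairwise covariance
gk_cov(x, y; scale_estimator=std)   # equals the classical cov(x, y)
```

Plugging in the standard deviation recovers the classical covariance exactly, which is a convenient sanity check for the identity; a robust scale makes the result resistant to outliers in either variable.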

References:

Maronna, R. A., & Zamar, R. H. (2002).
Robust Estimates of Location and Dispersion for High-Dimensional Datasets.
Technometrics, 44(4), 307–317. http://www.jstor.org/stable/1271538

Examples

julia> ogk=CovOgk();
julia> fit!(ogk, hbk[:,1:3])
-> Method:  Orthogonalized Gnanadesikan-Kettenring Estimator

Robust estimate of location:
[1.56005, 2.22345, 2.12035]

Robust estimate of covariance:
3×3 Matrix{Float64}:
 3.3575    0.587449  0.699388
 0.587449  2.09268   0.285757
 0.699388  0.285757  2.77527

julia> dd_plot(ogk);

Data sets

Robustbase includes several datasets that are often used in the robustness literature. These datasets serve as standard examples and benchmarks, allowing users to easily test robust algorithms. They are also available in the R-packages robustbase and rrcov.

Hawkins & Bradu & Kass data

Robustbase.DataSets.hbk (Constant)

Hawkins & Bradu & Kass data

Components

  • x1::Float64: first independent variable.
  • x2::Float64: second independent variable.
  • x3::Float64: third independent variable.
  • y::Float64: dependent (response) variable.

Reference

Hawkins, D.M., Bradu, D., and Kass, G.V. (1984) Location of several outliers in multiple regression data using elemental sets. Technometrics 26, 197–208.


Animals data

Robustbase.DataSets.animals (Constant)

Animals data

Components

  • names::AbstractString: names of animals.
  • body::Float64: body weight in kg.
  • brain::Float64: brain weight in g.

References

 Venables, W. N. and Ripley, B. D. (1999) _Modern Applied
 Statistics with S-PLUS._ Third Edition. Springer.

 P. J. Rousseeuw and A. M. Leroy (1987) _Robust Regression and
 Outlier Detection._ Wiley, p. 57.

Stack Loss data

Robustbase.DataSets.stackloss (Constant)

Stack loss data

Components

  • airflow::Float64: flow of cooling air (independent variable).
  • watertemp::Float64: cooling water inlet temperature (independent variable).
  • acidcond::Float64: concentration of acid (independent variable).
  • stackloss::Float64: stack loss (dependent variable).

Outliers

Observations 1, 3, 4, and 21 are outliers.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_.  Wadsworth & Brooks/Cole.

Dodge, Y. (1996) The guinea pig of multiple regression. In: _Robust Statistics, Data Analysis, and Computer Intensive Methods;
In Honor of Peter Huber's 60th Birthday_, 1996, _Lecture Notes in Statistics_ *109*, Springer-Verlag, New York.

Modified Wood Gravity data

Robustbase.DataSets.wood (Constant)

Modified Wood Gravity Data

Components

  • x1::Float64: Random values.
  • x2::Float64: Random values.
  • x3::Float64: Random values.
  • x4::Float64: Random values.
  • x5::Float64: Random values.
  • y::Float64: Random values (dependent variable).

References

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p.243, table 8.
