Robustbase.jl
Documentation for Robustbase.jl
Univariate: Robust scales
The univariate module contains several univariate location and scale estimators. They all implement the abstract base class RobustScale, which has the properties location and scale and can be fitted using the fit! method. Each class is expected to implement a _calculate method where the attributes scale_ and location_ are set.
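The pattern described above can be sketched in a few lines. This is only an illustration of the interface shape, not the package's source: `SketchScale` and `MedianMad` are hypothetical names, and the median/MAD pair is chosen just because it is short.

```julia
using Statistics

# Hypothetical names throughout: this is not Robustbase.jl source code.
abstract type SketchScale end

mutable struct MedianMad <: SketchScale
    location_::Float64
    scale_::Float64
    MedianMad() = new(NaN, NaN)  # unfitted state
end

# Each concrete estimator supplies its own _calculate, which sets both fields.
function _calculate(est::MedianMad, x::AbstractVector{<:Real})
    est.location_ = median(x)
    est.scale_ = median(abs.(x .- est.location_))  # raw MAD, no consistency factor
    return est
end

# Shared entry point and accessors are defined once on the abstract type.
fit!(est::SketchScale, x::AbstractVector{<:Real}) = _calculate(est, x)
location(est::SketchScale) = est.location_
scale(est::SketchScale) = est.scale_

est = fit!(MedianMad(), [1.0, 2.0, 3.0, 4.0, 100.0])
println((location(est), scale(est)))
```

Because `fit!`, `location` and `scale` dispatch on the abstract type, a new estimator only needs a struct and a `_calculate` method.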
Tau scale
Robustbase.Tau — Type
Tau <: RobustScale
Tau(; c1::Float64=4.5, c2::Float64=3.0, consistency_correction::Bool=true, can_handle_nan::Bool=false)
Robust Tau-Estimate of Scale
Computes the robust τ-estimate of univariate scale proposed by Maronna and Zamar (2002), improved by a consistency factor.
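As a rough guide to what the estimator computes, the following sketch follows the τ-estimate as it is usually stated from Maronna and Zamar (2002): a weighted mean with weights from a biweight-type function (constant `c1`), followed by a scale built from a truncated-square rho function (constant `c2`). The function names are hypothetical, the consistency correction is deliberately left out, and the package's actual implementation may differ in details.

```julia
using Statistics

# Weight function W_c(u) = (1 - (u/c)^2)^2 for |u| <= c, zero otherwise.
weight(u, c) = abs(u) <= c ? (1 - (u / c)^2)^2 : 0.0
# Rho function rho_c(u) = min(u^2, c^2).
rho(u, c) = min(u^2, c^2)

# Hypothetical helper: uncorrected tau location and scale of a vector.
function tau_sketch(x::AbstractVector{<:Real}; c1=4.5, c2=3.0)
    mu0 = median(x)
    sigma0 = median(abs.(x .- mu0))            # raw MAD as initial scale
    w = weight.((x .- mu0) ./ sigma0, c1)
    mu = sum(w .* x) / sum(w)                  # weighted location
    sigma2 = sigma0^2 / length(x) * sum(rho.((x .- mu) ./ sigma0, c2))
    return mu, sqrt(sigma2)
end

x = [1.0, 2.0, 3.0, 4.0, 100.0]
mu, sigma = tau_sketch(x)
println((mu, sigma))
```

The observation at 100 receives weight zero, so it does not pull the location estimate, and its contribution to the scale is capped by `c2`.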
Keywords
`c1::Float64=4.5`, optional: constant for the weight function, defaults to 4.5
`c2::Float64=3.0`, optional: constant for the rho function, defaults to 3.0
`consistency_correction::Bool`, optional: boolean indicating if consistency for normality should be applied. Defaults to true.
`can_handle_nan::Bool`, optional: boolean indicating whether NaNs can be handled, defaults to false.
References
Robust Estimates of Location and Dispersion for High-Dimensional Datasets,
Ricardo A Maronna and Ruben H Zamar (2002)
Examples
julia> x = hbk[:,1];
julia> tau = Tau();
julia> fit!(tau, x);
julia> location(tau)
1.5554543865072337
julia> scale(tau)
2.012461881814477
Qn scale
Robustbase.Qn — Type
Qn <: RobustScale
Qn(;location_func::Function=median, consistency_correction::Bool=true, finite_correction::Bool=true, can_handle_nan::Bool = false)
Robust Location-Free Scale Estimate More Efficient than MAD
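For intuition, Qn can be written as a scaled order statistic of the pairwise absolute differences. The sketch below is a naive O(n²) version under that definition, not the time-efficient algorithm from the references; `qn_naive` is a hypothetical name, the default constant (approximately 2.2219, for consistency at the normal distribution) is stated from memory of the literature, and no finite-sample correction is applied.

```julia
# Naive Qn: the k-th smallest of |x_i - x_j| for i < j, times a constant,
# with h = floor(n/2) + 1 and k = h(h-1)/2.
function qn_naive(x::AbstractVector{<:Real}; constant::Float64=2.2219)
    n = length(x)
    h = n ÷ 2 + 1
    k = h * (h - 1) ÷ 2
    diffs = [abs(x[i] - x[j]) for i in 1:n-1 for j in i+1:n]
    return constant * sort(diffs)[k]
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
println(qn_naive(x; constant=1.0))
```

No location estimate appears anywhere in the computation, which is why a separate location function has to be supplied to the estimator.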
Keywords
`location_func::Function=median`, optional: as the Qn estimator does not
estimate location, a location function should be explicitly passed. Defaults to `median`.
`consistency_correction::Bool`, optional: boolean indicating if consistency for normality should be applied. Defaults to true.
`finite_correction::Bool`, optional: boolean indicating if finite sample correction should be applied. Defaults to true.
References
Christophe Croux and Peter J. Rousseeuw (1992). Time-efficient algorithms for two highly robust estimators of scale.
Donald B. Johnson and Tetsuo Mizoguchi (1978). Selecting the k^th element in X+Y and X1+...+Xm
Examples
julia> x = hbk[:,1];
julia> qn = Qn();
julia> fit!(qn, x);
julia> location(qn) # the median
1.8
julia> scale(qn)
1.7427832460732984
Location
Robustbase.location — Function
location(rs::RobustScale)
Univariate location
location(model::CovarianceEstimator)::Vector{Float64}
Location vector
Scale
Robustbase.scale — Function
scale(rs::RobustScale)
Univariate scale
Multivariate: Covariance
Various robust estimators of covariance matrices ("scatter matrices") have been proposed in the literature, with different properties. The covariance module implements several frequently used scatter estimators. They all use the new base class RobustCovariance which builds on the CovarianceEstimator class in StatsBase.
Classical Location and Scatter Estimation: CovClassic
Robustbase.CovClassic — Type
CovClassic <: CovarianceEstimator <: Any
CovClassic(;assume_centered=false)
Classical Location and Scatter Estimation
Compute the classical estimates of the location vector and covariance matrix of a data matrix.
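The classical estimates are the column means and the sample covariance matrix. The sketch below uses only the `Statistics` standard library and assumes observations in rows; `classic_sketch` is a hypothetical helper, and it does not model the `assume_centered` option.

```julia
using Statistics

# Hypothetical helper: column means and sample covariance (denominator n - 1).
function classic_sketch(X::AbstractMatrix{<:Real})
    n = size(X, 1)
    loc = vec(mean(X, dims=1))
    centered = X .- loc'                 # subtract the mean of each column
    covm = (centered' * centered) / (n - 1)
    return loc, covm
end

X = [1.0 2.0; 3.0 6.0; 5.0 10.0]
loc, covm = classic_sketch(X)
println(loc)
println(covm)
```

The result agrees with `Statistics.cov(X)`; both estimates are sensitive to outliers, which is what the robust estimators in this module address.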
Examples
julia> cc=CovClassic();
julia> fit!(cc, hbk[:,1:3])
-> Method: Classical Estimator.
Estimate of location:
[3.20667, 5.59733, 7.23067]
Estimate of covariance:
3×3 Matrix{Float64}:
13.3417 28.4692 41.244
28.4692 67.883 94.6656
41.244 94.6656 137.835
Covariance matrix: covariance
Robustbase.covariance — Function
covariance(model::CovarianceEstimator)::Matrix{Float64}
Covariance matrix
Correlation matrix: correlation
Robustbase.correlation — Function
correlation(model::CovarianceEstimator)::Matrix{Float64}
Correlation matrix
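A correlation matrix is obtained from a covariance matrix by dividing each entry by the product of the corresponding standard deviations. A minimal sketch of that relationship, with `cov_to_cor` as a hypothetical helper rather than a package function:

```julia
using LinearAlgebra

# Hypothetical helper: R[i, j] = S[i, j] / (sqrt(S[i, i]) * sqrt(S[j, j])).
function cov_to_cor(S::AbstractMatrix{<:Real})
    d = sqrt.(diag(S))
    return S ./ (d * d')
end

S = [4.0 2.0; 2.0 9.0]
R = cov_to_cor(S)
println(R)
```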
Robust Location and Scatter Estimation via MCD: CovMcd
Robustbase.CovMcd — Type
CovMcd <: RobustCovariance <: CovarianceEstimator
CovMcd(;assume_centered=false, alpha=nothing, n_initial_subsets=500, n_initial_c_steps=2,
n_best_subsets=10, n_partitions=nothing, tolerance=1e-8, reweighting=true, verbosity=Logging.Warn)
Robust Location and Scatter Estimation via MCD
Compute the Minimum Covariance Determinant (MCD) estimator, a robust multivariate location and scale estimate with a high breakdown point, via the 'Fast MCD' algorithm proposed in Rousseeuw and Van Driessen (1999).
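The building block of Fast MCD is the concentration step (C-step): given a subset of h observations, compute its mean and covariance, then keep the h observations with the smallest Mahalanobis distances. Repeating this never increases the covariance determinant. The sketch below illustrates a single C-step with standard-library calls only; `default_h` and `c_step` are hypothetical names, the toy data are made up for illustration, and the default subset size is assumed to be the integer part of (n+p+1)/2, which gives 39 for n = 75 and p = 3.

```julia
using Statistics, LinearAlgebra

# Assumed default subset size: integer part of (n + p + 1) / 2.
default_h(n, p) = (n + p + 1) ÷ 2

# One C-step: returns the indices of the h points closest to the current fit.
function c_step(X::AbstractMatrix{<:Real}, subset::AbstractVector{<:Integer})
    h = length(subset)
    mu = vec(mean(X[subset, :], dims=1))
    S = cov(X[subset, :])
    centered = X .- mu'
    d2 = vec(sum((centered / S) .* centered, dims=2))  # squared Mahalanobis distances
    return sort(sortperm(d2)[1:h])
end

# Toy data: eight points near a line plus two gross outliers (rows 9 and 10).
X = [0.0 0.1; 1.0 0.9; 2.0 2.2; 3.0 2.9; 4.0 4.1;
     5.0 5.2; 6.0 5.8; 7.0 7.1; 20.0 -5.0; 25.0 -8.0]
H1 = [1, 2, 3, 8, 9, 10]        # deliberately contaminated starting subset
H2 = c_step(X, H1)
println(det(cov(X[H1, :])), " -> ", det(cov(X[H2, :])))
```

The full algorithm runs such steps from many starting subsets and keeps the best candidates; this sketch shows only the inner update.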
Keywords:
`alpha::Float64 | Int | Nothing`, optional:
size of the h subset.
If an integer between n/2 and n is passed, it is interpreted as an absolute value.
If a float between 0.5 and 1 is passed, it is interpreted as a proportion
of n (the training set size).
If `nothing`, it is set to (n+p+1) / 2.
Defaults to `nothing`.
`n_initial_subsets::Int`, optional: number of initial random subsets of size p+1
`n_initial_c_steps::Int`, optional: number of initial c steps to perform on all initial subsets
`n_best_subsets::Int`, optional: number of best subsets to keep and perform c steps on until convergence
`n_partitions::Int`, optional: Number of partitions to split the data into.
This can speed up the algorithm for large datasets (n > 600 suggested in the paper).
If `nothing`, 5 partitions are used if n > 600, otherwise 1 partition is used.
`tolerance::Float64`, optional: Minimum difference in determinant between two iterations to stop the C-step
`reweighting::Bool`, optional: Whether to apply reweighting to the raw covariance estimate
References:
Rousseeuw and Van Driessen, A Fast Algorithm for the Minimum Covariance Determinant
Estimator, 1999, American Statistical Association and
the American Society for Quality, TECHNOMETRICS
Examples
julia> mcd=CovMcd();
julia> fit!(mcd, hbk[:,1:3])
-> Method: Fast MCD Estimator: (alpha=nothing ==> h=39)
Robust estimate of location:
[1.55833, 1.80333, 1.66]
Robust estimate of covariance:
3×3 Matrix{Float64}:
1.21312 0.0239154 0.165793
0.0239154 1.22836 0.195735
0.165793 0.195735 1.12535
julia> dd_plot(mcd)
Deterministic MCD estimator: DetMcd
Robustbase.DetMcd — Type
DetMcd(; assume_centered=false, alpha=nothing, n_maxcsteps=200, tolerance=1e-8,
reweighting=true, verbosity=Logging.Warn)
Deterministic MCD estimator (DetMCD) based on the algorithm proposed in Hubert, Rousseeuw and Verdonck (2012)
Keywords:
`alpha::Float64 | Int | Nothing`, optional: size of the h subset.
If an integer between n/2 and n is passed, it is interpreted as an absolute value.
If a float between 0.5 and 1 is passed, it is interpreted as a proportion
of n (the training set size).
If `nothing`, it is set to (n+p+1) / 2.
Defaults to `nothing`.
`n_maxcsteps::Int=200`, optional: Maximum number of C-step iterations
`tolerance::Float64`, optional: Minimum difference in determinant between two iterations to stop the C-step
`reweighting::Bool`, optional: Whether to apply reweighting to the raw covariance estimate
References:
Hubert, Rousseeuw and Verdonck, A deterministic algorithm for robust location
and scatter, 2012, Journal of Computational and Graphical Statistics
Examples
julia> mcd=DetMcd();
julia> fit!(mcd, hbk[:,1:3])
-> Method: Deterministic MCD: (alpha=nothing ==> h=39)
Robust estimate of location:
[1.5377, 1.78033, 1.68689]
Robust estimate of covariance:
3×3 Matrix{Float64}:
1.2209 0.0547372 0.126544
0.0547372 1.2427 0.151783
0.126544 0.151783 1.15414
julia> dd_plot(mcd);
Orthogonalized Gnanadesikan-Kettenring estimator: CovOgk
Robustbase.CovOgk — Type
CovOgk(;store_precision::Bool=true, assume_centered::Bool=false, location_estimator::Function=median,
scale_estimator::Function=MAD_scale, n_iterations::Int=2, reweighting::Bool=false,
reweighting_beta::Float64=0.9)
Implementation of the Orthogonalized Gnanadesikan-Kettenring estimator for location and dispersion proposed in Maronna, R. A., & Zamar, R. H. (2002)
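The estimator starts from the Gnanadesikan-Kettenring identity, which turns any univariate scale estimate σ into a pairwise covariance estimate: cov(x, y) = (σ(x + y)² − σ(x − y)²) / 4. The sketch below shows only this building block, not the orthogonalization or reweighting steps; `gk_cov` and `mad_sketch` are hypothetical names, and `mad_sketch` is a plain normalized MAD that is not claimed to match the package's `MAD_scale`.

```julia
using Statistics

# Hypothetical robust scale: MAD scaled for consistency at the normal distribution.
mad_sketch(x) = 1.4826 * median(abs.(x .- median(x)))

# Gnanadesikan-Kettenring pairwise covariance for a given scale function.
gk_cov(x, y; scale_fn=mad_sketch) = (scale_fn(x .+ y)^2 - scale_fn(x .- y)^2) / 4

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 40.0]   # last value is an outlier

println(gk_cov(x, y; scale_fn=std))   # equals the classical covariance
println(gk_cov(x, y))                 # robust version, driven by the MAD
```

With the standard deviation as the scale function the identity reproduces the classical covariance exactly; plugging in a robust scale gives a robust pairwise estimate, which the OGK procedure then orthogonalizes to obtain a positive definite matrix.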
Keywords:
`store_precision::Bool`, optional: whether to store the precision matrix
`assume_centered::Bool`, optional: whether the data is already centered
`location_estimator::Function`, optional: function to estimate the location of the data, should accept an array-like input as first value and a named argument axis
`scale_estimator::Function`, optional: function to estimate the scale of the data, should accept an array-like input as first value and a named argument axis
`n_iterations::Int`, optional: number of iterations for the orthogonalization step
`reweighting::Bool`, optional: whether to apply reweighting at the end (i.e. calculating regular location and covariance after filtering outliers based on Mahalanobis distance using OGK estimates)
`reweighting_beta::Float64`, optional: quantile of the chi2 distribution to use as cutoff for reweighting
References:
Maronna, R. A., & Zamar, R. H. (2002).
Robust Estimates of Location and Dispersion for High-Dimensional Datasets.
Technometrics, 44(4), 307–317. http://www.jstor.org/stable/1271538
Examples
julia> ogk=CovOgk();
julia> fit!(ogk, hbk[:,1:3])
-> Method: Orthogonalized Gnanadesikan-Kettenring Estimator
Robust estimate of location:
[1.56005, 2.22345, 2.12035]
Robust estimate of covariance:
3×3 Matrix{Float64}:
3.3575 0.587449 0.699388
0.587449 2.09268 0.285757
0.699388 0.285757 2.77527
julia> dd_plot(ogk);
Data sets
Robustbase includes several datasets that are often used in the robustness literature. These datasets serve as standard examples and benchmarks, allowing users to easily test robust algorithms. They are also available in the R-packages robustbase and rrcov.
Hawkins & Bradu & Kass data
Robustbase.DataSets.hbk — Constant
Hawkins & Bradu & Kass data
Components
x1::Float64: first independent variable.
x2::Float64: second independent variable.
x3::Float64: third independent variable.
y::Float64: dependent (response) variable.
Reference
Hawkins, D.M., Bradu, D., and Kass, G.V. (1984) Location of several outliers in multiple regression data using elemental sets. Technometrics 26, 197–208.
Animals data
Robustbase.DataSets.animals — Constant
Animals data
Components
names::AbstractString: names of animals.
body::Float64: body weight in kg.
brain::Float64: brain weight in g.
References
Venables, W. N. and Ripley, B. D. (1999) _Modern Applied
Statistics with S-PLUS._ Third Edition. Springer.
P. J. Rousseeuw and A. M. Leroy (1987) _Robust Regression and
Outlier Detection._ Wiley, p. 57.
Stack Loss data
Robustbase.DataSets.stackloss — Constant
Stack loss data
Components
airflow::Float64: flow of cooling air (independent variable).
watertemp::Float64: cooling water inlet temperature (independent variable).
acidcond::Float64: concentration of acid (independent variable).
stackloss::Float64: stack loss (dependent variable).
Outliers
Observations 1, 3, 4, and 21 are outliers.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_. Wadsworth & Brooks/Cole.
Dodge, Y. (1996) The guinea pig of multiple regression. In: _Robust Statistics, Data Analysis, and Computer Intensive Methods;
In Honor of Peter Huber's 60th Birthday_, 1996, _Lecture Notes in Statistics_ *109*, Springer-Verlag, New York.
Modified Wood Gravity data
Robustbase.DataSets.wood — Constant
Modified Wood Gravity Data
Components
x1::Float64: Random values.
x2::Float64: Random values.
x3::Float64: Random values.
x4::Float64: Random values.
x5::Float64: Random values.
y::Float64: Random values (independent variable).
References
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p.243, table 8.