---
title: "Vignette: Basic Robust Estimators"
author: "Beat Hulliger and Tobias Schoch"
output:
html_document:
css: "fluent.css"
highlight: tango
vignette: >
%\VignetteIndexEntry{Basic Robust Estimators}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "",
prompt = TRUE,
dpi = 36,
fig.align = "center"
)
```
```{css, echo = FALSE}
.my-sidebar-orange {
padding-left: 1.5rem;
padding-top: 0.5rem;
padding-bottom: 0.25rem;
margin-top: 1.25rem;
margin-bottom: 1.25rem;
border: 1px solid #eee;
border-left-width: 0.75rem;
border-radius: .25rem;
border-left-color: #ce5b00;
}
.my-sidebar-blue {
padding-left: 1.5rem;
padding-top: 0.5rem;
padding-bottom: 0.25rem;
margin-top: 1.25rem;
margin-bottom: 1.25rem;
border: 1px solid #eee;
border-left-width: 0.75rem;
border-radius: .25rem;
border-left-color: #1f618d;
}
```
## Outline
In this vignette, we discuss how to use robust estimators of location (and
scale). The estimators are organized by (1) estimating method, (2) population
characteristic, and (3) type of implemented function.
#### Estimating methods
- Trimming (Chap. 2)
- Winsorization (Chap. 3)
- Weight reduction (Chap. 4)
- M-estimation (Chap. 5)
#### Population characteristics
- mean
- total
- (standard deviation $\rightarrow$ see utility functions, Chap. 6)
#### Type of implementation
- bare-bone methods (only estimate)
- survey methods (estimate, standard error, variance, etc.)
Bare-bone methods are stripped-down versions of the survey methods in terms of
functionality and informativeness. These functions may serve users and package
developers as building blocks. In particular, bare-bone functions cannot
compute variances. The survey methods are much more capable and depend---for
variance estimation---on the R package `survey` ([Lumley](#biblio), 2010, 2021).
## Preparations
First, we load the namespace of the `robsurvey` package and attach it to the
search path.
```{r}
library("robsurvey", quietly = TRUE)
```
The argument `quietly = TRUE` suppresses the start-up message in the call of
`library("robsurvey")`.
## 1 LOS (Length-of-stay) Hospital Data
The `losdata` are a simple random sample without replacement of n = 70 patients
from a (fictitious) hospital population of N = 2 479 patients in inpatient
treatment; see [Note 2](#notes). Next, we attach the data to the search
path.
```{r}
data("losdata")
attach(losdata)
```
The first 3 rows of the data are:
```{r}
head(losdata, 3)
```
where
- `los` length-of-stay in hospital (days)
- `weight` sampling weight
- `fpc` population size (finite population correction)
We consider estimating the average length-of-stay in hospital (LOS, in days).
### 1.1 Survey design object
In order to use the **survey methods** (not bare-bone methods), we must
**attach** the `survey` package ([Lumley](#biblio), 2010, 2021) to the search
path
```{r, eval = FALSE}
library("survey")
```
```{r, echo = FALSE}
suppressPackageStartupMessages(library(survey))
```
and specify a survey or sampling design object
```{r, eval = FALSE}
dn <- svydesign(ids = ~1, fpc = ~fpc, weights = ~weight, data = losdata,
calibrate.formula = ~1)
```
```{r, echo = FALSE}
dn <- if (packageVersion("survey") >= "4.2") {
svydesign(ids = ~1, fpc = ~fpc, weights = ~weight, data = losdata,
calibrate.formula = ~1)
} else {
svydesign(ids = ~1, fpc = ~fpc, weights = ~weight, data = losdata)
}
```
**Note.** Since **version 4.2**, the **survey** package allows the definition
of pre-calibrated weights (see argument `calibrate.formula` of the function
`svydesign()`). This vignette uses this functionality (in some places). If you
have installed an earlier version of the `survey` package, this vignette will
automatically fall back to calling `svydesign()` without the calibration
specification. See vignette [Pre-calibrated
weights](https://CRAN.R-project.org/package=survey/vignettes/precalibrated.pdf)
of the `survey` package to learn more.
```{r, echo = FALSE, results = "asis"}
survey_version <- packageVersion("survey")
if (survey_version < "4.2") {
cat(paste0(''))
}
```
### 1.2 Exploring the data
The distribution of the variable los is skewed to the right (see the boxplots
below), and we see a couple of rather heavy outliers. On the logarithmic scale,
the distribution is slightly skewed to the right. The outliers need not be
errors. Following [Chambers](#biblio) (1986), we distinguish representative
outliers from non-representative outliers.
The outliers visible in the boxplot refer to a few individuals who stayed for a
long time in inpatient care. Moreover, we assume that these outliers represent
patients in the population that are similar in value (i.e., representative
outliers).
```{r, echo = FALSE, out.width = "50%", fig.asp = 0.3}
layout(matrix(1:2, ncol = 2))
par(mar = c(4, 1, 1, 1))
svyboxplot(los ~ 1, dn, all.outliers = TRUE, xlab = "los", horizontal = TRUE)
svyboxplot(log(los) ~ 1, dn, all.outliers = TRUE, xlab = "log(los)",
horizontal = TRUE)
```
The outliers tend to inflate the variance of the weighted mean. Although the
outliers are not some kind of error, it is beneficial to use estimators other
than the weighted mean to estimate the population average of los. We are
interested in robust estimators which---although biased as estimators of the
population mean---often have a smaller mean square error than the weighted
mean and are therefore more efficient.
## 2 Trimming
### 2.1 Bare-bone methods
The following estimation methods are available.
- `weighted_mean_trimmed()`
- `weighted_total_trimmed()`
In place of the weighted mean, we consider the 5% one-sided trimmed weighted
population mean of los. The lower end of the distribution is not trimmed (lower
bound: `LB = 0`). The 5% largest observations are trimmed (upper bound: `UB =
0.95`).
```{r}
weighted_mean_trimmed(los, weight, LB = 0, UB = 0.95)
```
We obtain an estimate of (roughly) 9.3 days. Note that the return value is a
scalar.
If a bare-bone method is called with argument `info = TRUE`, the function
returns a list, the names of which are shown below.
```{r}
m <- weighted_mean_trimmed(los, weight, LB = 0, UB = 0.95, info = TRUE)
names(m)
```
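For instance, the point estimate can be extracted from this list (the entry
names are listed in [Note 1](#notes)).
```{r}
m$estimate
```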
### 2.2 Survey methods
The survey methods are:
- `svymean_trimmed()`
- `svytotal_trimmed()`
As before, we are interested in computing the 5% one-sided trimmed weighted
population mean of los. In contrast to `weighted_mean_trimmed()`, the method
`svymean_trimmed()` computes the standard error of the estimate using the
functionality of the `survey` package.
```{r}
m <- svymean_trimmed(~los, dn, LB = 0, UB = 0.95)
m
```
The estimated location, variance, and standard error of the estimator can be
extracted from object `m` with the following commands.
```{r}
coef(m)
vcov(m)
SE(m)
```
The `summary()` method summarizes the most important facts about the estimate.
The summary is particularly useful for *M*-estimators (see below) but less so
for other estimators.
```{r}
summary(m)
```
Additional utility functions are `residuals()`, `fitted()`, and `robweights()`.
These functions are mainly relevant for *M*-estimators.
## 3 Winsorization
### 3.1 Bare-bone methods
The bare-bone methods are:
- `weighted_mean_winsorized()`
- `weighted_total_winsorized()`
- `weighted_mean_k_winsorized()`
- `weighted_total_k_winsorized()`
We compute the 5% one-sided winsorized weighted population mean of los. The
lower end of the distribution is not winsorized (lower bound: `LB = 0`). The 5%
largest observations are winsorized (upper bound: `UB = 0.95`).
```{r}
weighted_mean_winsorized(los, weight, LB = 0, UB = 0.95)
```
The one-sided winsorized estimators can also be specified in absolute terms by
winsorizing a fixed number $k = 1, 2, \ldots$ of the largest observations. This
estimator is called the one-sided *k*-winsorized mean (and total) and is
computed as follows.
```{r}
weighted_mean_k_winsorized(los, weight, k = 1)
```
### 3.2 Survey methods
The survey methods are:
- `svymean_winsorized()`
- `svytotal_winsorized()`
- `svymean_k_winsorized()`
- `svytotal_k_winsorized()`
The utility functions `coef()`, `vcov()`, `SE()`, `summary()`, `residuals()`,
`fitted()`, and `robweights()` are available.
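For instance, the winsorized counterpart of the trimming example from
Section 2.2 can be computed as sketched below; the arguments are assumed to
mirror those of `svymean_trimmed()`.
```{r, eval = FALSE}
# 5% one-sided winsorized weighted mean of los (with standard error)
m <- svymean_winsorized(~los, dn, LB = 0, UB = 0.95)
summary(m)
```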
## 4 Weight Reduction Methods
Winsorization and trimming act directly on the values of the variable of
interest. Other estimation methods reduce the sampling weights of potential
outliers instead. A hybrid of winsorization and weight reduction for the
treatment of influential observations has been proposed by [Dalén](#biblio)
(1987). An observation $y_i$ is called influential if its expanded value,
$w_iy_i$, is exceedingly large. Let $c>0$ denote a winsorization or censoring
cutoff value. Dalén's estimators `"Z2"` and `"Z3"` of the population $y$-total
are given by $\sum_{i \in s} [w_i y_i]_{\circ}^c$, where $\circ$ is a
placeholder for `"Z2"` or `"Z3"` and
$$
\begin{align*}
[w_i y_i]_{Z2}^c =
\begin{cases}
w_i y_i & \text{if} \quad w_i y_i \leq c, \\
c & \text{otherwise},
\end{cases}
&\qquad \text{and} \qquad
[w_i y_i]_{Z3}^c =
\begin{cases}
w_i y_i & \text{if} \quad w_i y_i \leq c, \\
c + (y_i - c/w_i) & \text{otherwise}.
\end{cases}
\end{align*}
$$
Estimator `"Z2"` censors the terms $w_iy_i$ at $c$. In estimator `"Z3"`,
observations $y_i$ such that $w_iy_i > c$ contribute to the estimated total
only with $c$ plus the excess over the cutoff, $(w_iy_i - c)$. Note that the
excess over the threshold has a weight of 1.0 ([Lee](#biblio), 1995). An
estimator of the population $y$-mean obtains by dividing the estimator of the
estimated $y$-total by the (estimated) population size.
From a practical point of view, the choice of the constant $c$ in Dalén's
estimators is rather tricky because $c$ cannot be derived from a large order
statistic alone, say $y_{(k)}$, $k < n$ (as in trimming). Instead, the
corresponding weight $w_{(k)}$ must also be taken into account.
### 4.1 Bare-bone methods
The bare-bone methods are:
- `weighted_mean_dalen()`
- `weighted_total_dalen()`
The estimators `"Z2"` and `"Z3"` can be specified by the argument `type`; by
default, `type = "Z2"`. The censoring threshold $c$ is implemented as argument
`censoring`.
```{r}
weighted_mean_dalen(los, weight, censoring = 1500)
```
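As a sketch of how $c$ might be chosen in practice, one could censor roughly
the largest 5% of the expanded values $w_iy_i$ by taking a high quantile of
these values as the cutoff (a heuristic for illustration only, not a
recommendation of the package).
```{r, eval = FALSE}
# heuristic cutoff: 95% quantile of the expanded values w_i * y_i
c_cut <- quantile(weight * los, probs = 0.95)
weighted_mean_dalen(los, weight, censoring = c_cut)
```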
### 4.2 Survey methods
The survey methods are:
- `svymean_dalen()`
- `svytotal_dalen()`
The utility functions `coef()`, `vcov()`, `SE()`, `summary()`, `residuals()`,
`fitted()`, and `robweights()` are available.
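A minimal sketch, assuming that `svymean_dalen()` takes the design object and
the `censoring` argument analogously to its bare-bone counterpart:
```{r, eval = FALSE}
m <- svymean_dalen(~los, dn, censoring = 1500)
SE(m)
```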
## 5 *M*-Estimation
### 5.1 Bare-bone methods
The bare-bone methods are:
- `weighted_mean_huber()`
- `weighted_total_huber()`
- `weighted_mean_tukey()`
- `weighted_total_tukey()`
The estimators with postfix `_huber` and `_tukey` are based on, respectively,
the Huber and Tukey (biweight) $\psi$-function.
Since the `losdata` are a simple random sample, an *M*-estimator of
`type = "rwm"` is the method of choice. Here, we compute the Huber-type robust
weighted *M*-estimator of the mean with robustness tuning constant $k = 8$.
```{r}
weighted_mean_huber(los, weight, type = "rwm", k = 8)
```
The function `huber2()` is an implementation of the weighted Huber proposal 2
estimator. It is only available as a bare-bone method; see [Note 3](#notes).
```{r}
huber2(los, weight, k = 8)
```
### 5.2 Survey methods
The survey methods are:
- `svymean_huber()`
- `svytotal_huber()`
- `svymean_tukey()`
- `svytotal_tukey()`
The Huber *M*-estimator of the mean (and its standard error) can be computed
with
```{r}
m <- svymean_huber(~los, dn, type = "rwm", k = 8)
m
```
The `summary()` method summarizes the most important facts about the
*M*-estimate
```{r}
summary(m)
```
The estimated scale (weighted MAD) can be extracted with the `scale()`
function. Additional utility functions are `coef()`, `vcov()`, `SE()`,
`residuals()`, `fitted()`, and `robweights()`. The following figure shows a
plot of the robustness weights against the residuals. We see that large
residuals are downweighted.
```{r, eval = FALSE}
plot(residuals(m), robweights(m))
```
```{r, echo = FALSE, out.width = "50%"}
par(mar = c(4, 4, 1, 0))
plot(residuals(m), robweights(m))
```
### 5.3 Adaptive estimation
An adaptive *M*-estimator of the total (or mean) is defined by letting the data
choose the tuning constant $k$. Let $\widehat{T}$ denote the weighted total, and
let $\widehat{T}_k$ be the Huber *M*-estimator of the weighted total with
robustness tuning constant $k$. Under quite general regularity conditions, the
estimated mean square error (MSE) of $\widehat{T}_k$ can be approximated by
(see e.g., [Gwet and Rivest](#biblio), 1992; [Hulliger](#biblio), 1995)
$$\widehat{\mathrm{mse}}\big(\widehat{T}_{k}\big) \approx
\mathrm{var}\big(\widehat{T}_{k}\big) +\big(\widehat{T} -
\widehat{T}_{k}\big)^2.$$
The minimum estimated risk (MER) estimator ([Hulliger](#biblio), 1995) selects
$k$ such that $\widehat{\mathrm{mse}}\big(\widehat{T}_{k}\big)$ is minimal
(among all candidate estimators). Now, suppose that we have been working on the
*M*-estimator with $k=8$.
```{r}
m <- svymean_huber(~los, dn, type = "rwm", k = 8)
```
Next, we compute the MER, starting from the current *M*-estimate, i.e., object
`m`.
```{r}
mer(m)
```
Hence, the MER is `r -round(as.numeric(100 * mer(m)$robust$rel_mse), 0)`% more
efficient than the classical estimator as an estimator of the population total.
## 6 Utility Functions
### 6.1 Weighted quantile and median
The weighted quantile (and median) can be computed by
```{r}
weighted_quantile(los, weight, probs = c(0.1, 0.9))
weighted_median(los, weight)
```
When all weights are equal, `weighted_quantile()` computes the same quantiles
as `stats::quantile()` with argument `type = 2`.
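The following sketch illustrates this equivalence on the `losdata` (equal
weights are assumed for the check).
```{r, eval = FALSE}
# with equal weights, weighted_quantile() matches quantile(..., type = 2)
all.equal(unname(weighted_quantile(los, rep(1, length(los)), probs = 0.25)),
          unname(quantile(los, probs = 0.25, type = 2)))
```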
### 6.2 Weighted MAD: median absolute deviation
The normalized weighted median absolute deviation about the weighted median
can be computed with
```{r}
weighted_mad(los, weight)
```
By default, the normalization constant to make the weighted MAD an unbiased
estimator of scale at the Gaussian core model is `constant = 1.482602`. This
constant can be changed if necessary.
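For instance, the un-normalized weighted MAD obtains by setting the constant
to one (using the `constant` argument introduced above).
```{r, eval = FALSE}
# weighted MAD without the Gaussian consistency correction
weighted_mad(los, weight, constant = 1)
```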
### 6.3 Weighted IQR: interquartile range
The normalized weighted interquartile range can be computed with
```{r}
weighted_IQR(los, weight)
```
By default, the normalization constant to make the weighted IQR an unbiased
estimator of scale at the Gaussian core model is `constant = 0.7413`. This
constant can be changed if necessary.
---
## Notes {#notes}
1 All bare-bone methods can be called with the argument `info =
TRUE` to return a list with the following entries: characteristic (e.g., mean),
estimator (e.g., trimmed estimator), estimate (numerical value), variance (by
default: `NA`), robust (list of arguments that specify robustness), residuals
(numerical vector), model (list of data used for estimation), design (by
default: `NA`), call.
2 We have constructed the `losdata` as a showcase; the LOS
measurements, however, are real data that we have taken from the $201$
observations in [Ruffieux](#biblio) et al. (2000). Our `losdata` are a sample
of size $n = 70$ from the $201$ original observations.
3 The function `huber2()` is similar to `MASS::hubers()`
([Venables and Ripley](#biblio), 2002). It differs from the implementation in
`MASS` in that it allows for weights and is initialized by the (normalized)
weighted interquartile range (IQR), not the median absolute deviation (MAD).
## Bibliographical notes
The paper of [Chambers](#biblio) (1986) is the landmark paper about outliers in
finite population sampling. [Lee](#biblio) (1995) and [Beaumont and
Rivest](#biblio) (2009) are a good starting point to learn about robustness in
finite population sampling.
Trimming and winsorization are discussed in [Lee](#biblio) (1995) and [Beaumont
and Rivest](#biblio) (2009). The variance estimators of the weighted trimmed
and winsorized estimators are straightforward adaptations of the classical
estimators; see [Huber and Ronchetti](#biblio) (2009, Chap. 3.3) or
[Serfling](#biblio) (1980, Chap. 8). A rigorous treatment in the context of
finite population sampling can be found in [Shao](#biblio) (1994).
[Rao](#biblio) (1971) was among the first to propose weight reduction. Consider
a sample of size $n$, and suppose that the $i$th observation is an outlier. He
suggested reducing the outlier's sampling weight $w_i$ to one and
redistributing the weight difference $w_i - 1$ among the remaining
observations. As a result, observation $i$ no longer represents other, similar
values in the population. Dalén's estimator offers a more general notion of
weight reduction; see [Dalén](#biblio) (1987) and also [Chen et al.](#biblio)
(2017).
In the context of finite population sampling, M-estimators were first studied
by [Chambers](#biblio) (1986). He investigated robust methods in the model- or
prediction-based framework of [Royall and Cumberland](#biblio) (1981).
Model-assisted estimators were introduced (for ratio estimation) by [Gwet and
Rivest](#biblio) (1992) and studied by [Lee](#biblio) (1995), and
[Hulliger](#biblio) (1995, 1999, 2006). A recent comprehensive treatment can be
found in [Beaumont and Rivest](#biblio) (2009).
## References {#biblio}
BEAUMONT, J.-F. AND RIVEST, L.-P. (2009). Dealing with outliers in survey data.
In: *Sample Surveys: Theory, Methods and Inference* ed. by Pfeffermann, D. and
Rao, C. R. Volume 29A of Handbook of Statistics, Amsterdam: Elsevier, Chap. 11,
247–280. [DOI:
10.1016/S0169-7161(08)00011-4](https://doi.org/10.1016/S0169-7161(08)00011-4)
CHAMBERS, R. (1986). Outlier Robust Finite Population Estimation. *Journal of
the American Statistical Association* **81**, 1063–1069, [DOI:
10.1080/01621459.1986.10478374](https://doi.org/10.1080/01621459.1986.10478374).
CHEN, Q., ELLIOTT, M. R., HAZIZA, D., YANG, Y., GHOSH, M., LITTLE, R. J. A.,
SEDRANSK, J. AND THOMPSON, M. (2017). Approaches to Improving Survey-Weighted
Estimates. *Statistical Science* **32**, 227–248, [DOI:
10.1214/17-STS609](https://doi.org/10.1214/17-STS609).
DALÉN, J. (1987). *Practical Estimators of a Population Total Which Reduce the
Impact of Large Observations*. Research report, Statistics Sweden, Stockholm.
GWET, J.-P. AND RIVEST, L.-P. (1992). Outlier Resistant Alternatives to the
Ratio Estimator. *Journal of the American Statistical Association* **87**,
1174–1182, [DOI:
10.1080/01621459.1992.10476275](https://doi.org/10.1080/01621459.1992.10476275).
HUBER, P. J. AND RONCHETTI, E. (2009). *Robust Statistics*, New York: John
Wiley & Sons, 2nd edition, [DOI:
10.1002/9780470434697](https://doi.org/10.1002/9780470434697).
HULLIGER, B. (2006). Horvitz–Thompson Estimators, Robustified. In:
*Encyclopedia of Statistical Sciences* ed. by Kotz, S. Volume 5, Hoboken (NJ):
John Wiley & Sons, 2nd edition, [DOI:
10.1002/0471667196.ess1066.pub2](https://doi.org/10.1002/0471667196.ess1066.pub2).
HULLIGER, B. (1999). Simple and robust estimators for sampling. In:
*Proceedings of the Survey Research Methods Section*, American Statistical
Association, 54–63.
HULLIGER, B. (1995). Outlier Robust Horvitz–Thompson Estimators. *Survey
Methodology* **21**, 79–87.
LEE, H. (1995). Outliers in business surveys. In: *Business survey methods* ed.
by Cox, B. G., Binder, D. A., Chinnappa, B. N., Christianson, A., Colledge, M.
J. and Kott, P. S. New York: John Wiley & Sons, Chap. 26, 503–526, [DOI:
10.1002/9781118150504.ch26](https://doi.org/10.1002/9781118150504.ch26).
LUMLEY, T. (2021). survey: analysis of complex survey samples. R package
version 4.0, URL https://CRAN.R-project.org/package=survey.
LUMLEY, T. (2010). *Complex Surveys: A Guide to Analysis Using R*, Hoboken
(NJ): John Wiley & Sons.
RAO, J. N. K. (1971). Some Aspects of Statistical Inference in Problems of
Sampling from Finite Populations. In: *Foundations of Statistical Inference*
ed. by Godambe, V. P. and Sprott, D. A. Toronto: Holt, Rinehart, and Winston,
171–202.
ROYALL, R. M. AND CUMBERLAND, W. G. (1981). An Empirical Study of the Ratio
Estimator and Estimators of its Variance. *Journal of the American Statistical
Association* **76**, 66–82, [DOI:
10.2307/2287043](https://doi.org/10.2307/2287043).
RUFFIEUX, C., PACCAUD, F. AND MARAZZI, A. (2000). Comparing rules for
truncating hospital length of stay. *Casemix Quarterly* **2**, 3–11.
SERFLING, R. J. (1980). *Approximation theorems of mathematical statistics*,
New York: John Wiley & Sons, [DOI:
10.1002/9780470316481](https://doi.org/10.1002/9780470316481).
SHAO, J. (1994). L-Statistics in complex survey problems. *The Annals of
Statistics* **22**, 946–967, [DOI:
10.1214/aos/1176325505](https://doi.org/10.1214/aos/1176325505).
VENABLES, W. N. AND RIPLEY, B. D. (2002). *Modern Applied Statistics with S*,
New York: Springer-Verlag, 4th edition, [DOI:
10.1007/978-0-387-21706-2](https://doi.org/10.1007/978-0-387-21706-2).