CSI 779 / STAT 759
Comments on Wilcox (1997) Chapter 3
(last updated 2/25/03)

In Chapter 2 Wilcox introduced some "measures of location and scale" for a distribution (i.e., a population). While the names of these measures ("trimmed mean", "Winsorized mean", "mean absolute deviation", and so on) and the functionals that define them may look familiar, definitions for such population measures are somewhat unusual. Most people define them for samples (using the same functionals on the ECDF) rather than for the population. The main thrust of Chapter 3 is to estimate these measures and the variances of the estimators.

The topics of Chapter 3 are important. We will generally use slightly different words to describe the problem, however. We will use the standard, precise words of the field of statistics. We will speak of measures of the population as "parameters". We will speak of measures of the sample as "statistics". Statistics are observable random variables. Parameters are unobservables; they may or may not be random variables. The word "statistic" will also be used to refer to a realization of a statistic. --- Okay, so this is not so precise because the word is overloaded. We have lots of overloaded words in precise discourse, however, and either the context or qualifying words resolve the meaning.

At the beginning of Section 3.1, Wilcox uses the phrase A "estimates" B. It is not clear what this means --- actually, it is clear what he means, but we should adopt a more precise approach to this problem. We will not precisely define "estimate" as a verb; we will use the phrase "A estimates B" only to mean that we are using A as an estimate (or estimator) of B.

If we want to estimate B, we can use anything we want. Some statistical properties of one estimator may be "better" than those properties of another estimator. Other statistical properties of the second estimator may be "better" than those other properties for the first estimator.
So let's be intelligent and precise:

1. something estimates something else if that is the way we use it.
2. because anything can be used as an estimate, we want to know which estimator (which way of forming an estimate) is "better".
3. we must define "better" before we can proceed. (We know some measures of goodness of estimators, so let's just use what we know: unbiasedness, minimum MSE, etc.)
4. how do we even get started? (In Stat 101, or in Wilcox's book, we are given some estimators. Where did they come from?) We develop estimators based on heuristics: ECDF, moments, MLE, LS, etc. (see my handout). Then we check out the goodness properties from the previous step.

(The difference between "estimate" and "estimator" is that an "estimator" is a function of a random variable and, hence, is itself a random variable; an "estimate" is a realization of that random variable.)

We can take all of the definitions of "robust measures" from Chapter 2, and immediately we have definitions of the location estimators we actually use: the robust estimators, such as the trimmed mean, the Winsorized mean, the quantiles (especially the median), and M estimators of location (and two variations, a one-step M estimator and a W estimator). Wilcox discusses these in Chapter 3, but he does not use sample versions of the scale "measures" he defined in Chapter 2 (mean deviation from the mean, mean absolute deviation, etc.). Rather, he discusses some ad hoc estimators of the variances or standard deviations of the robust estimators (in Sections 3.1 through 3.7), and then (in Section 3.8) discusses two additional estimators of the population variance, the biweight midvariance and the bend midvariance.

In the beginning of Section 3.1, Wilcox briefly mentions the problem of choice of the cutoff proportion for trimmed or Winsorized means. He points out that if the proportion is too small, the estimator can have a very large variance.
Of course, on the other hand, if the proportion is too large, the efficiency of the estimator will be poor (though this would never be as great a problem as that resulting from a proportion too small). He then mentions use of the sample itself to decide on the proportion. This kind of approach is called an adaptive procedure. Hogg (1974) suggested and studied adaptive robust procedures. There has not been a lot of work with adaptive procedures because it is so difficult to work out their properties.

The choice of the proportion is an instance of a common problem in statistics: the choice of a "tuning parameter". (Another place where this problem arises is in kernel-based methods, where the tuning parameter is the window width.) A common approach to this problem is to use cross-validation to determine an "optimum" value of the tuning parameter. Alam and Mitra (1987) used the cross-validation error to determine the tuning parameters in adaptive robust regression.

While optimization methods play an important role in statistical inference, they must be used with some care. Optimization finds the "best" under a given set of assumptions or with a given dataset; hence, the optimization can magnify the effects of the assumptions and/or the characteristics of the particular random sample.

Leger and Romano (1990) suggested use of bootstrap techniques to determine the value of the tuning parameter for trimmed means. Turkheimer, Pettigrew, Sokoloff, and Schmidt (1999) used bootstrap and permutation methods to try to identify a minimum variance robust estimator. Adaptive procedures remain an area that is largely unexplored.

Probability statements play a major role in statistical inference. These statements form the basis for confidence intervals and tests of hypotheses. Making a statement of a probability in general depends on knowing the distribution of a random variable (the statistic).
Even if we know the distribution of the underlying population random variable, it is very rare for us to know the distribution of the statistic. (Many of these distributions are famous, and you know them well -- Student's t, F, chi-squared, etc. -- but what about the distribution of the Winsorized mean from a sample from a simple normal distribution? It's not easy. And, of course, if the underlying distribution is not normal, it's even more difficult.)

There are two approaches: asymptotic approximations (asymptotic inference), and simulation (computational inference). The asymptotic approximations are often based on central-limit-type properties, which depend only on knowing something that is consistent for the standard deviation (there are also other assumptions).

The main point of Sections 3.1 through 3.7 concerns the variance of the robust estimators of location. These variances will be used in later chapters to set confidence intervals or to test hypotheses using asymptotic normal approximations.

To derive the approximation for the variance of the trimmed mean given in equation (3.4), Wilcox uses an asymptotic relationship that equates the sample trimmed mean with the population trimmed mean plus the sum (scaled by 1/n) of the influence function for the trimmed mean evaluated at all sample points. (*** Review the definition of the influence function, pp. 15,16 *** Note that of the measures discussed in Chapter 2, the influence function --- or influence curve; there's no meaningful difference --- is most useful for the population, rather than for the sample, although there is a finite sample version, sometimes called the sensitivity curve.)

Now, the question is how to estimate (3.4), the sum of squares of the trimmed mean influence curve evaluated at each of the sample points. In Chapter 2, Wilcox showed that the influence curve of the trimmed mean depends on the **population** Winsorized mean.
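For reference, the relationship just described can be written out as follows. This is only a sketch in commonly used notation, not a transcription of Wilcox's equations: the symbols \gamma (trimming proportion), x_\gamma and x_{1-\gamma} (population quantiles), \mu_t (population trimmed mean), \mu_w (population Winsorized mean), and \sigma_w^2 (population Winsorized variance) are chosen here for illustration, and an i.i.d. sample X_1, ..., X_n is assumed.

```latex
\bar{X}_t \;\approx\; \mu_t + \frac{1}{n}\sum_{i=1}^{n} IF_t(X_i),
\qquad
IF_t(x) \;=\; \frac{1}{1-2\gamma}
\begin{cases}
x_\gamma - \mu_w,     & x < x_\gamma, \\
x - \mu_w,            & x_\gamma \le x \le x_{1-\gamma}, \\
x_{1-\gamma} - \mu_w, & x > x_{1-\gamma},
\end{cases}

% E[IF_t(X)] = 0, so for an i.i.d. sample
\mathrm{Var}(\bar{X}_t) \;\approx\; \frac{1}{n}\, E\!\left[ IF_t(X)^2 \right]
\;=\; \frac{\sigma_w^2}{n\,(1-2\gamma)^2}.
```

Under these assumptions the numerator of the influence function is just the Winsorized observation minus its mean, so estimating the variance of the trimmed mean reduces to estimating \sigma_w^2.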
A good estimator of the standard deviation can then be written in terms of the sample variance of the Winsorized values, W_i, and is given at the top of p. 37. (Note that Wilcox uses \bar{X}_w and \bar{W} to mean the same thing.) This approach does not depend on the assumption of normality of the distribution. It does use asymptotic approximations, however, and its use in setting confidence intervals or testing hypotheses will use asymptotic normality assumptions.

An alternate approach to estimating the variance of the trimmed mean or Winsorized mean is to use the moments of normal order statistics to develop expressions for the variances. This approach is taken by Caperaa and Rivest (1995) and by David and Balakrishnan (1996), and used for a sample with outliers by Balakrishnan and Kannan (2003).

The finite sample breakdown point is a useful concept. The larger issues of data analysis should be kept in mind when considering the breakdown point of a statistical procedure.

Estimation of a quantile by the sample order statistic that is closest in rank is straightforward. The estimator can also be expressed as the population quantile plus the sum (scaled by 1/n) of the influence function for the quantile evaluated at all sample points. This leads to the approximation at the top of page 40. This quantity depends on the density at a point -- so the probability density must be estimated. The density can be estimated from the sample in various ways. For the median, i.e., q=.5, the estimator (3.11) is valid. (**It is not valid for other quantiles.**) Another estimator is given by Maritz and Jarrett (1978). (**The expression given at the bottom of page 41 for the beta density is not standard.**)

Use of a single sample order statistic to estimate a quantile is not very good. Harrell and Davis (1982) suggested a better estimator for a given quantile that is based on a beta-weighted linear combination of all the order statistics.
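Two of the estimators described above can be sketched in code: the standard error of the trimmed mean based on the Winsorized sample variance, and the Harrell-Davis quantile estimator. This is a minimal illustration and not Wilcox's own code. The function names, the default trimming proportion of 0.2, and the midpoint-rule integration of the beta density (used here only to avoid a dependency on a special-function library) are assumptions made for the example. The Harrell-Davis weights are taken to be the probabilities that a beta((n+1)q, (n+1)(1-q)) variable falls in ((i-1)/n, i/n), and the sketch assumes both beta parameters are at least 1.

```python
import math


def trimmed_mean_se(x, gamma=0.2):
    """Return (trimmed mean, estimated standard error).

    The standard error is s_w / ((1 - 2*gamma) * sqrt(n)), where s_w^2 is
    the sample variance of the Winsorized values. gamma=0.2 is an
    illustrative default, not a recommendation from the notes.
    """
    xs = sorted(x)
    n = len(xs)
    g = math.floor(gamma * n)  # number of points trimmed from each tail
    tmean = sum(xs[g:n - g]) / (n - 2 * g)
    # Winsorize: pull each tail value in to the nearest retained value.
    w = [min(max(xi, xs[g]), xs[n - g - 1]) for xi in xs]
    wbar = sum(w) / n
    s2w = sum((wi - wbar) ** 2 for wi in w) / (n - 1)
    return tmean, math.sqrt(s2w) / ((1 - 2 * gamma) * math.sqrt(n))


def hd_weights(n, q, steps=200):
    """Harrell-Davis weights for the q-th quantile of a sample of size n.

    Weight i is the beta(a, b) probability of ((i-1)/n, i/n), with
    a = (n+1)q and b = (n+1)(1-q), approximated by the midpoint rule.
    """
    a, b = (n + 1) * q, (n + 1) * (1 - q)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def density(u):
        return math.exp(log_norm + (a - 1) * math.log(u)
                        + (b - 1) * math.log(1 - u))

    h = 1.0 / (n * steps)  # width of one integration sub-interval
    weights = []
    for i in range(n):
        lo = i / n
        weights.append(h * sum(density(lo + (k + 0.5) * h)
                               for k in range(steps)))
    total = sum(weights)
    return [w / total for w in weights]  # renormalize away quadrature error


def harrell_davis(x, q=0.5):
    """Beta-weighted linear combination of all the order statistics."""
    xs = sorted(x)
    return sum(w * xi for w, xi in zip(hd_weights(len(xs), q), xs))
```

For q = .5 the weights are symmetric, so on a symmetric sample the Harrell-Davis estimate coincides with the sample median; for skewed samples it generally differs, because every order statistic receives some weight.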
(Wilcox's weights are correct, though his definition of the beta is not the standard one.) The variance of the Harrell-Davis estimator is very difficult to work out. Wilcox suggests use of the bootstrap. He warns of some pitfalls of the method. It should be used with care!

M estimators of location are discussed in Section 3.4. Refer back to Chapter 2, and Table 2.1, p. 22, for discussion of M estimators (though in a "population context"). These are adaptive estimators in the sense that a measure of scale is required. A common approach is to use the MAD adjusted by its large sample expectation in a normal distribution: z_.75 \sigma. There is also a tuning parameter, K. The M estimator can be computed with Newton iterations.

The M estimator can be expressed as the M measure of location for the population plus the sum (scaled by 1/n) of the influence function for the M estimator evaluated at all sample points. (Just as for the other estimators above.) This influence function is much more complicated than those for the other estimators. Wilcox presents an expression at the bottom of page 53 that involves the density function. Again, this could be estimated. The standard deviation of the M estimator can also be estimated using the bootstrap.

A variation of the M estimator is the "one-step estimator", p. 57. A computational technique is iteratively reweighted least squares.

Comparisons of estimators have been widely carried out in Monte Carlo studies. More can always be done!
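The M estimator computation and the bootstrap standard error mentioned above can be sketched as follows. This is a hedged illustration under stated assumptions, not Wilcox's implementation: it assumes Huber's psi function, a tuning constant K = 1.28, the MAD divided by z_.75 (about 0.6745) as the scale estimate, and Newton iterations started at the sample median. The names madn, huber_m, and bootstrap_se are hypothetical, chosen for this example.

```python
import math
import random
import statistics


def madn(x):
    """MAD divided by z_.75 (about 0.6745), so it estimates sigma under normality."""
    med = statistics.median(x)
    return statistics.median([abs(xi - med) for xi in x]) / 0.6745


def huber_m(x, K=1.28, max_iter=50, tol=1e-10):
    """Huber M estimate of location by Newton iterations from the median.

    K=1.28 is an assumed tuning constant. With max_iter=1 this performs a
    single Newton step, i.e. a one-step version of the estimator.
    """
    s = madn(x)
    mu = statistics.median(x)
    if s == 0:
        return mu  # at least half the sample is tied at the median
    for _ in range(max_iter):
        u = [(xi - mu) / s for xi in x]
        psi_sum = sum(max(-K, min(K, ui)) for ui in u)  # sum of psi(u_i)
        dpsi_sum = sum(1 for ui in u if abs(ui) <= K)   # sum of psi'(u_i)
        if dpsi_sum == 0:
            break
        step = s * psi_sum / dpsi_sum
        mu += step
        if abs(step) < tol:
            break
    return mu


def bootstrap_se(x, stat, n_boot=200, seed=0):
    """Bootstrap estimate of the standard deviation of stat(sample)."""
    rng = random.Random(seed)
    n = len(x)
    reps = [stat(rng.choices(x, k=n)) for _ in range(n_boot)]
    mean_rep = sum(reps) / n_boot
    return math.sqrt(sum((r - mean_rep) ** 2 for r in reps) / (n_boot - 1))
```

As a usage example, huber_m([1, 2, 3, 4, 100]) stays at the median of the sample, whereas the ordinary mean is pulled far toward the outlier; bootstrap_se(sample, huber_m) then gives a rough standard error. The same bootstrap_se function can be applied to any other estimator, with the same caution about the bootstrap that the notes give above.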