
Simple methods overcome problems that least-squares encounters with outliers

by Mark Sullivan

Engineers who want to estimate a signal parameter from a dataset can choose from a variety of methods. Perhaps the most commonly used method, known as least squares, finds the value that minimizes the squared magnitude of a suitably defined error. Not only is this approach easy to implement, it tends to produce good estimates for Gaussian signals. Further, many real-world signals fit this model reasonably well because of a statistical "law" known as the central limit theorem, which can be paraphrased as follows: sums of independent random quantities tend toward a Gaussian distribution as the number of components in the sum increases.

Another "law" you're probably familiar with is Murphy's, in particular the version that says "bad measurements happen." In running an experiment, you'll often encounter errant datapoints, more formally called outliers, that undermine the Gaussian model. Indeed, a dataset with only one or two corrupt samples can result in poor parameter estimates from a least-squares method. This column describes methods of estimating parameters that are much less sensitive to outliers.

Trim the mean

To see how the first method works, assume you're trying to estimate the mean of a noisy signal. Engineers generally solve this problem by computing an average across all samples. Given a dataset $x_0, x_1, \ldots, x_{N-1}$, the sample average

$$\bar{x} = \frac{1}{N} \sum_{n=0}^{N-1} x_n$$

minimizes the squared error

$$E(m) = \sum_{n=0}^{N-1} (x_n - m)^2$$

over all candidate values $m$, so it is a least-squares parameter estimate. This estimate approaches the mean of the process as the number of samples grows large.

To determine how well an average works as an estimator of the mean, you can work with a random-number generator. For instance, the author used a Gaussian random-number generator to produce sets of 100 samples with a mean of zero and an RMS noise amplitude of unity. In this example, the true mean equals zero, and the unit-amplitude Gaussian noise represents ordinary measurement noise.

A simple way to simulate the effect of outliers is to multiply randomly selected samples by 10, which has the effect of introducing a small number of samples with ten times the amount of ordinary measurement noise. The probability of any sample being chosen is the contamination level ε, so on average 100ε samples from any dataset are multiplied by ten.
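The experiment described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's original program: the function names, the trial count, and the seed are assumptions chosen for the example.

```python
import math
import random

def contaminated_samples(n, eps, rng, scale=10.0):
    """Draw n zero-mean, unit-RMS Gaussian samples; each one is multiplied
    by `scale` with probability eps to simulate an outlier."""
    samples = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        if rng.random() < eps:
            x *= scale
        samples.append(x)
    return samples

def rms_of_sample_average(eps, n=100, trials=1000, seed=1):
    """RMS deviation of the sample average from the true mean (zero),
    estimated over `trials` independent datasets (trial count is an assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = contaminated_samples(n, eps, rng)
        avg = sum(data) / n
        total += avg * avg
    return math.sqrt(total / trials)

if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.05, 0.10):
        print(f"eps = {eps:.2f}  RMS deviation = {rms_of_sample_average(eps):.4f}")
```

With no contamination the RMS deviation should come out near 0.1, since averaging 100 unit-variance samples reduces the noise by a factor of ten. It grows as ε increases.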

Fig 1 -- Sorting a dataset by amplitude and deleting M points from the top and bottom can reduce noise. In these curves, M = 5 discards 10% of the samples, while M = 25 removes half.

Fig 1 shows what happens to the RMS deviation of the sample average as a function of ε. As you can see, it takes only a few wild datapoints to substantially increase the measurement noise of the sample average.

In this case it is simple to identify harmful outliers and excise them from the data before computing the average. One easy way to do so is to sort the samples by amplitude and then remove M samples from the beginning and the end of the sorted list before computing the average, thereby removing the samples that deviate most from the mean. Such a trimmed mean is much less sensitive to outliers, as you can see in the curves in Fig 1 for M = 5 and M = 25. High values of M provide greater protection from outliers at high levels of contamination, but take care not to make M too high; otherwise you throw out good datapoints, which reduces the estimate's reliability. Examine the data and determine the likelihood of outliers before deciding how many points to exclude.
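A trimmed mean of this kind might look like the following sketch. The function name `trimmed_mean` and the sample values are made up for illustration.

```python
def trimmed_mean(samples, m):
    """Average after discarding the m smallest and m largest samples."""
    if m < 0 or 2 * m >= len(samples):
        raise ValueError("m must satisfy 0 <= 2*m < len(samples)")
    ordered = sorted(samples)
    kept = ordered[m:len(ordered) - m]
    return sum(kept) / len(kept)

if __name__ == "__main__":
    # Hypothetical data: eight ordinary readings plus two wild ones.
    data = [0.3, -1.2, 0.8, 45.0, -0.5, 0.1, -37.0, 0.6, -0.4, 0.2]
    print("plain average:", sum(data) / len(data))
    print("trimmed, M=1 :", trimmed_mean(data, 1))
```

Setting M = 0 reproduces the ordinary average, so the same routine covers both the least-squares and the trimmed estimate.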

Time to leave one out

While the previous method works well for many applications, in some cases the parameter being estimated is affected by other kinds of outliers you can't detect by looking only at amplitude. Suppose, for example, you have a slowly varying signal and want to estimate the correlation coefficient.

Fig 2 -- In some waveforms, noise is characterized by a sudden change in value rather than high amplitude alone.

Consider the signal in Fig 2. The author generated it by filtering Gaussian noise with a first-order recursive filter to produce the slow variation and then replacing randomly selected samples with unfiltered Gaussian noise to simulate outliers. In this case, an outlier is characterized by a rapid change in value instead of a high amplitude.
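A signal of this type could be generated roughly as follows. The article does not give the filter details, so the recursion `y[n] = a*y[n-1] + gain*w[n]`, the unit-variance scaling, and the function name are assumptions made for this sketch.

```python
import math
import random

def slow_signal_with_outliers(n, a, eps, rng):
    """Filter Gaussian noise with a first-order recursive filter, then replace
    each sample with unfiltered Gaussian noise with probability eps."""
    gain = math.sqrt(1.0 - a * a)   # assumed scaling: keeps the output at unit variance
    y = rng.gauss(0.0, 1.0)         # start the filter in its steady state
    clean = []
    for _ in range(n):
        y = a * y + gain * rng.gauss(0.0, 1.0)
        clean.append(y)
    # Outliers are substituted after filtering, so they do not leak into later samples.
    return [rng.gauss(0.0, 1.0) if rng.random() < eps else v for v in clean]

if __name__ == "__main__":
    rng = random.Random(0)
    signal = slow_signal_with_outliers(100, 0.99, 0.02, rng)
    print(signal[:5])
```

With this scaling the filter coefficient `a` is also the lag-one correlation coefficient of the uncontaminated signal.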

This type of measurement error affects the correlation coefficient, which you can estimate with the equation

$$\hat{\rho} = \frac{\sum_{n=1}^{N-1} x_n x_{n-1}}{\sum_{n=1}^{N-1} x_{n-1}^2}$$

Table 1 shows the statistics of this estimator in a series of simulation experiments using filtered Gaussian noise that has a true correlation coefficient of 0.99.

| Contamination level | Conventional: mean | Conventional: RMS deviation | Leaving-one-out: mean | Leaving-one-out: RMS deviation |
|---|---|---|---|---|
| 0.00 | 0.977560 | 0.024713 | 0.986178 | 0.021458 |
| 0.01 | 0.947205 | 0.068583 | 0.977060 | 0.034906 |
| 0.02 | 0.918374 | 0.087921 | 0.962757 | 0.048406 |


Table 1 -- When input data is corrupted by outliers, the leaving-one-out method yields mean values closer to the true value of 0.99 and exhibits less RMS deviation than the conventional correlation coefficient estimator.

As the level of contamination (the relative number of outliers) increases, you should be aware of two effects: first, the average error, or bias, increases; second, the RMS measurement noise increases as well. The estimator's sensitivity to errors in just a few samples is a result of using a least-squares method. Given a dataset $x_0, x_1, \ldots, x_{N-1}$, the correlation coefficient estimate is the value of $\rho$ that minimizes

$$\sum_{n=1}^{N-1} (x_n - \rho\, x_{n-1})^2$$

One way to improve the estimator is to use a "leaving one out" method. Here you compute a correlation estimate but delete one sample from the measurement set. When you run the computation several times, each time leaving out a different datapoint, you'll find that almost all of the correlation coefficients are reasonably close in value except for those that left out one of the outliers. You can conclude that the most divergent estimate likely results from deleting a "bad" measurement and therefore should be selected as the final result. To deal with multiple outliers, you can repeat the process on the new dataset with the first outlier removed.

Just one iteration of this procedure improves the correlation coefficient estimate. The results in Table 1 are based on producing an estimate using the leaving-one-out method and selecting the estimate with the highest correlation. In this case the estimator's performance is superior to a least-squares estimator even when no outliers are present -- but this situation is an unusual exception to the more general case where the performance of the leaving-one-out method is worse than least squares at ε = 0.
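One iteration of the leaving-one-out procedure might be sketched as below. Several details are assumptions rather than the author's stated method: the lag-one least-squares form of the estimator, the choice to drop every product that touches the omitted sample (instead of closing the gap), and the function names.

```python
def lag_one_correlation(x, skip=None):
    """Least-squares lag-one correlation estimate.
    Pairs that involve the sample at index `skip` are left out of both sums."""
    num = 0.0
    den = 0.0
    for n in range(1, len(x)):
        if skip is not None and skip in (n, n - 1):
            continue
        num += x[n] * x[n - 1]
        den += x[n - 1] * x[n - 1]
    return num / den

def leave_one_out_correlation(x):
    """Return (estimate, index): the highest leave-one-out estimate
    and the index of the sample that was omitted to obtain it."""
    estimates = [lag_one_correlation(x, skip=k) for k in range(len(x))]
    best = max(range(len(x)), key=lambda k: estimates[k])
    return estimates[best], best

if __name__ == "__main__":
    # Hypothetical slowly varying data with one abrupt jump at index 4.
    x = [1.0, 1.1, 1.2, 1.3, -5.0, 1.5, 1.6, 1.7]
    print("conventional   :", lag_one_correlation(x))
    print("leaving one out:", leave_one_out_correlation(x))
```

To handle several outliers, the sample at the returned index can be removed and the procedure repeated on the shortened dataset. This direct version costs on the order of N² operations per pass, which is acceptable for datasets of a few hundred samples.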

These two examples serve to illustrate some principles of estimating signal parameters when data is contaminated by a few large measurement errors. The reference listed below is a good source for theory and some practice on this topic. Real-world signals often exhibit these kinds of problems, and if the resulting loss of parameter accuracy is troublesome, then give serious consideration to estimators other than least-squares.

Reference

Huber, Peter J, Robust Statistics, 1981, John Wiley & Sons (New York, NY), ISBN 0-471-41805-6.

Mark Sullivan (dalek@radix.net) is Chief Scientist at SkyBitz Inc (Herndon, VA), a developer of tracking and communications services based on the GLS (Global Locating Systems) technology it invented. Mark received a PhD in Information Technology from George Mason Univ.

This article originally appeared in Personal Engineering & Instrumentation News, October 1996, pgs 69-71. Reprinted with permission of PEC Inc; all rights reserved.

Copyright © 2003 ChipCenter-QuestLink