Main Content

Data Smoothing and Outlier Detection

Data smoothing refers to techniques for eliminating unwanted noise or behaviors in data, while outlier detection identifies data points that are significantly different from the rest of the data.

Moving Window Methods

Moving window methods are ways to process data in smaller batches at a time, typically in order to statistically represent a neighborhood of points in the data. The moving average is a common data smoothing technique that slides a window along the data, computing the mean of the points inside of each window. This can help to eliminate insignificant variations from one data point to the next.

For example, consider wind speed measurements taken every minute for about 3 hours. Use the movmean function with a window size of 5 minutes to smooth out high-speed wind gusts.

load windData.mat
mins = 1:length(speed);
window = 5;
meanspeed = movmean(speed,window);
plot(mins,speed,mins,meanspeed)
axis tight
legend("Measured Wind Speed","Average Wind Speed over 5 min Window")
xlabel("Time")
ylabel("Speed")

Similarly, you can compute the median wind speed over a sliding window using the movmedian function.

medianspeed = movmedian(speed,window);
plot(mins,speed,mins,medianspeed)
axis tight
legend("Measured Wind Speed","Median Wind Speed over 5 min Window")
xlabel("Time")
ylabel("Speed")

Not all data is suitable for smoothing with a moving window method. For example, create a sinusoidal signal with injected random noise.

t = 1:0.2:15;
A = sin(2*pi*t) + cos(2*pi*0.5*t);
Anoise = A + 0.5*rand(1,length(t));
plot(t,A,t,Anoise)
axis tight
legend("Original Data","Noisy Data")

Use a moving mean with a window size of 3 to smooth the noisy data.

window = 3;
Amean = movmean(Anoise,window);
plot(t,A,t,Amean)
axis tight
legend("Original Data","Moving Mean - Window Size 3")

The moving mean achieves the general shape of the data, but doesn't capture the valleys (local minima) very accurately. Since the valley points are surrounded by two larger neighbors in each window, the mean is not a very good approximation to those points. If you make the window size larger, the mean eliminates the shorter peaks altogether. For this type of data, you might consider alternative smoothing techniques.

Amean = movmean(Anoise,5);
plot(t,A,t,Amean)
axis tight
legend("Original Data","Moving Mean - Window Size 5")

Common Smoothing Methods

The smoothdata function provides several smoothing options such as the Savitzky-Golay method, which is a popular smoothing technique used in signal processing. By default, smoothdata chooses a best-guess window size for the method depending on the data.

Use the Savitzky-Golay method to smooth the noisy signal Anoise, and output the window size that it uses. This method provides a better valley approximation compared to movmean.

[Asgolay,window] = smoothdata(Anoise,"sgolay");
plot(t,A,t,Asgolay)
axis tight
legend("Original Data","Savitzky-Golay","location","best")

window
window = 3

The robust Lowess method is another smoothing method that is particularly helpful when outliers are present in the data in addition to noise. Inject an outlier into the noisy data, and use robust Lowess to smooth the data, which eliminates the outlier.

Anoise(36) = 20;
Arlowess = smoothdata(Anoise,"rlowess",5);
plot(t,Anoise,t,Arlowess)
axis tight
legend("Noisy Data","Robust Lowess")

Detecting Outliers

Outliers in data can significantly skew data processing results and other computed quantities. For example, if you try to smooth data containing outliers with a moving median, you can get misleading peaks or valleys.

Amedian = smoothdata(Anoise,"movmedian");
plot(t,Anoise,t,Amedian)
axis tight
legend("Noisy Data","Moving Median")

The isoutlier function returns a logical 1 when an outlier is detected. Verify the index and value of the outlier in Anoise.

TF = isoutlier(Anoise);
ind = find(TF)
ind = 36
Aoutlier = Anoise(ind)
Aoutlier = 20

You can replace outliers in your data by using the filloutliers function and specifying a fill method. For example, fill the outlier in Anoise with the value of its neighbor immediately to the right.

Afill = filloutliers(Anoise,"next");
plot(t,Anoise,t,Afill,"o-")
axis tight
legend("Noisy Data with Outlier","Noisy Data with Filled Outlier")

Alternatively, you can remove outliers from your data by using the rmoutliers function. For example, remove the outlier in Anoise.

Aremove = rmoutliers(Anoise);
plot(t,Anoise,t(~TF),Aremove,"o-")
axis tight
legend("Noisy Data with Outlier","Noisy Data with Outlier Removed")

Nonuniform Data

Not all data consists of equally spaced points, which can affect methods for data processing. Create a datetime vector that contains irregular sampling times for the data in Airreg. The time vector represents samples taken every minute for the first 30 minutes, then hourly over two days.

t0 = datetime(2014,1,1,1,1,1);
timeminutes = sort(t0 + minutes(1:30));
timehours = t0 + hours(1:48);
time = [timeminutes timehours];
Airreg = rand(1,length(time));
plot(time,Airreg)
axis tight

By default, smoothdata smooths with respect to equally spaced integers, in this case, 1,2,...,78. Since integer time stamps do not coordinate with the sampling of the points in Airreg, the first half hour of data still appears noisy after smoothing.

Adefault = smoothdata(Airreg,"movmean",3);
plot(time,Airreg,time,Adefault)
axis tight
legend("Original Data","Smoothed Data with Default Sample Points")

Many data processing functions in MATLAB®, including smoothdata, movmean, and filloutliers, allow you to provide sample points, ensuring that data is processed relative to its sampling units and frequencies. To remove the high-frequency variation in the first half hour of data in Airreg, use the SamplePoints name-value argument with the time stamps in time.

Asamplepoints = smoothdata(Airreg,"movmean", ...
    hours(3),"SamplePoints",time);
plot(time,Airreg,time,Asamplepoints)
axis tight
legend("Original Data","Smoothed Data with Sample Points")

See Also

Functions

Live Editor Tasks

Apps

Related Topics