Identify and fill missing time gaps with NaN
Good day all,
I have a dataset consisting of times (serial date numbers) and measurements at 5-minute intervals over 7 years. I am trying to make a temporally continuous dataset, as the data has numerous random gaps over the measurement period that I want to fill with NaN. Example of the data:
735425.190972222 269.516000000000
735425.194444445 269.585000000000
735425.197916667 269.452000000000
735425.201388889 269.691000000000
735425.204861111 270.098000000000
735425.208333333 270.099000000000
735425.211805556 270.506000000000
…
My first step is to create a continuous column vector for the entire timeframe, from the start time to the end time at 5-minute intervals, and then add a second column of NaN.
735425.190972222 NaN
735425.194444445 NaN
735425.197916667 NaN
735425.201388889 NaN
735425.204861111 NaN
735425.208333333 NaN
735425.211805556 NaN
…
Then I use the intersect function to find where the times coincide. But every 4-9 time intervals, the times are reported as different even though they look identical. This pattern continues throughout the entire dataset wherever times should be the same.
735425.190972222 269.516000000000
735425.194444445 269.585000000000
735425.197916667 NaN
735425.201388889 269.691000000000
735425.204861111 270.098000000000
735425.208333333 270.099000000000
735425.211805556 NaN
…
The code I am using for the intersect looks like this:
% Continuous time vector with NaN placeholders
time_cont = (time.ts:time.dt:time.te)';
filled_data = [time_cont NaN(numel(time_cont),1)];
% Match times and copy the measurements across
[~, ia, ib] = intersect(filled_data(:,1), original_data(:,1));
filled_data(ia,2) = original_data(ib,2);
Any ideas why this happens or what I am doing wrong? Any other suggestions on how to identify and fill the gaps would also be appreciated. Thanks in advance for the help!
1 Comment
Rik
on 17 Jul 2017
Try converting to a class that is less susceptible to rounding errors. Floats can be subtly different, so that even A==A could seemingly return false (that specific example is very unlikely, if not impossible, but intersect may well feel the effect of such tiny differences).
I would suggest converting that first column to a datetime array. The ismember function may also give better results; you could check that as well.
Accepted Answer
Jan
on 17 Jul 2017
This is caused by the limited precision of floating-point values. See:
any(0:0.1:1 == 0.3) % FALSE!
A value like 735425.197916667 may already be rounded. So either compare the values with a tolerance using ismembertol, or use an integer representation for the dates:
time_cont = round((time.ts:time.dt:time.te).' * 86400);
Now you compare rounded whole seconds. For intervals of 5 minutes it would be:
time_cont = round((time.ts:time.dt:time.te).' * 288);
Then your intersect method would work again.
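For readers outside MATLAB, Jan's integer-key idea can be sketched in Python/NumPy as well; the times and values below are made up for illustration, and 288 is simply the number of 5-minute intervals per day:

```python
import numpy as np

day = 735425.0        # serial date number for the day (MATLAB datenum convention)
step = 5 / (24 * 60)  # 5 minutes as a fraction of a day

# Observed times with a gap (k = 2 missing) plus a tiny float jitter
obs_k = np.array([0, 1, 3, 4, 5])
obs_times = day + obs_k * step + 1e-12
obs_vals = np.array([269.516, 269.585, 269.691, 270.098, 270.099])

# Full, continuous 5-minute grid
full_times = day + np.arange(6) * step

# Map both to integer counts of 5-minute intervals (288 per day),
# so the comparison is exact despite the float jitter
obs_keys = np.round(obs_times * 288).astype(np.int64)
full_keys = np.round(full_times * 288).astype(np.int64)

# Same pattern as MATLAB's intersect with two index outputs
_, ia, ib = np.intersect1d(full_keys, obs_keys, return_indices=True)
filled = np.full(full_times.shape, np.nan)
filled[ia] = obs_vals[ib]
```

filled ends up with the measurement at every matched grid point and NaN at the gap, which is exactly the structure the original intersect approach was after.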
2 Comments
dpb
on 17 Jul 2017
This works to smooth your way around the issue after the fact, yes.
However, the general solution with datenum and/or datetime, as shown below, is not to use fractional portions of a day with colon or linspace, but to call the time-value generation function you intend to use with integer values of the desired granularity.
These routines roll over correctly, including days of month and leap years, and then generate self-consistent results that work reliably for comparisons.
More Answers (2)
dpb
on 17 Jul 2017
The "why" is floating-point round-off, from having built the time-series vector using colon with a dt computed as a fractional floating-point value.
Build the time vector by incrementing the minutes field in datenum by an integer number of minutes instead; then the internal rounding of the values is done consistently.
[Y,M,D,H,MN,S] = datevec(time.ts); % initial date values
time_cont = datenum(Y,M,D,H,MN+[0:5*365*24*60/12]',0); % 5 yr at 5 minute intervals
The above is for 365-day years; use the difference between the start and end times to compute the actual number needed, depending on the number of leap years in the particular 5-yr span.
You might also look at the built-in timeseries data type, which has some features for fixed-interval sampled series.
2 Comments
dpb
on 24 Jul 2017
Edited: dpb
on 24 Jul 2017
..."increment is only 1 minute..."
That's what happens when one just writes "air code" by typing into the Answer window... you're correct in adding the 5 increment parameter to the colon expression.
What you've done (or the equivalent, which can be written many different ways) is correct when using datenum. But unless you're stuck with an old release that doesn't yet include it, I'd suggest switching over to the newer datetime object; it has many enhancements over datenum.
If you make that switch, then when you have both start and end times, use them directly to compute the end point for the colon expansion (it took me a long time to catch on to this, and I still tend to forget and write an explicit upper limit, because old datenum didn't have built-in functions like minutes).
>> ts=datetime([2011 8 28 8 25 0]);
>> te=datetime([2013 7 10 13 5 0]);
>> dt=datetime(2011, 8, 28, 8, 25+[0:5:minutes(te-ts)].', 0);
>> whos dt
Name Size Bytes Class Attributes
dt 196473x1 3143625 datetime
>> dt(1), dt(end)
ans =
28-Aug-2011 08:25:00
ans =
10-Jul-2013 13:05:00
>>
You can use colon with datetimes too, but I'm not positive about the syntax for fractional hours, and I haven't tested whether a floating-point delta-t rounds consistently internally there. Well, I guess it wouldn't take much to try here, now, would it?
Oh, while playing around, the right syntax came to me, and it simplifies the whole process!
>> dtv=[ts:minutes(5):te].';
>> all(dtv==dt)
ans =
1
>>
This is internally self-consistent, too. Great advancement! Learnt me something new, too... :)
The latter alone is reason enough to switch.
Peter Perkins
on 24 Jul 2017
There's a lot going on here, much of it due, as Walter says, to the way that datenums represent time: counting days since a long time ago. Their limit of useful precision for contemporary dates is something around 10 microseconds.
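That precision figure is easy to verify: the spacing between adjacent doubles near a contemporary datenum, converted to seconds, is on the order of 10 microseconds. A quick Python sketch (math.ulp needs Python 3.9+), using the datenum value from this thread:

```python
import math

dn = 735425.0                   # a contemporary MATLAB datenum (days since year 0)
ulp_days = math.ulp(dn)         # gap to the next representable double, in days
ulp_seconds = ulp_days * 86400  # about 1e-5 s, i.e. on the order of 10 microseconds
```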
Warren, I can't tell where your original datenums came from. As typed, they did not come from a call to datenum, or at least not from this one:
>> format long
>> dn1 = [735425.190972222; 735425.194444445; 735425.197916667; 735425.201388889; 735425.204861111; 735425.208333333; 735425.211805556]
dn1 =
1.0e+05 *
7.354251909722220
7.354251944444449
7.354251979166670
7.354252013888890
7.354252048611110
7.354252083333330
7.354252118055560
>> dn2 = datenum({'10-Jul-2013 04:35:00'; '10-Jul-2013 04:40:00'; '10-Jul-2013 04:45:00'; '10-Jul-2013 04:50:00'; '10-Jul-2013 04:55:00'; '10-Jul-2013 05:00:00'; '10-Jul-2013 05:05:00'})
dn2 =
1.0e+05 *
7.354251909722223
7.354251944444445
7.354251979166667
7.354252013888889
7.354252048611111
7.354252083333334
7.354252118055555
Those two things look the same out to 15 digits (format longg) but not to 16 (format long). Maybe you copied from a command window set to format longg and pasted into Answers; except that the second one does not display the same even under format longg, so perhaps what you posted came from somewhere else.
Anyhow, datetime uses a much better internal representation that avoids that level of round-off. And in fact, the conversion from datenum to datetime tries to account for the round-off inherent in datenums:
>> d2 = datetime('10-Jul-2013 04:35:00','Format','dd-MMM-yyyy HH:mm:ss.SSSSSSSSS') + minutes(0:5:30)'
d2 =
7×1 datetime array
10-Jul-2013 04:35:00.000000000
10-Jul-2013 04:40:00.000000000
10-Jul-2013 04:45:00.000000000
10-Jul-2013 04:50:00.000000000
10-Jul-2013 04:55:00.000000000
10-Jul-2013 05:00:00.000000000
10-Jul-2013 05:05:00.000000000
>> dn3 = datenum(d2)
dn3 =
1.0e+05 *
7.354251909722223
7.354251944444445
7.354251979166667
7.354252013888889
7.354252048611111
7.354252083333334
7.354252118055555
You can see that the datetime has hit the 5 minute points exactly, and in fact dn2->d2->dn3 is an exact round trip. But the datenums you've posted are far enough away from what they "ought" to be for "datenums at 5min intervals" that datetime's conversion just leaves them alone:
>> d1 = datetime(dn1,'ConvertFrom','datenum','Format','dd-MMM-yyyy HH:mm:ss.SSSSSSSSS')
d1 =
7×1 datetime array
10-Jul-2013 04:34:59.999982177
10-Jul-2013 04:40:00.000044677
10-Jul-2013 04:45:00.000026855
10-Jul-2013 04:50:00.000009033
10-Jul-2013 04:54:59.999990966
10-Jul-2013 04:59:59.999973144
10-Jul-2013 05:05:00.000035644
Given that, your times are off from the correct values by a hair. If you have access to R2016b or later, it's really easy to fix that using a timetable.
>> tt = timetable(d1, x); % the original data as a timetable (x = the measurements)
>> tt = retime(tt,datetime('10-Jul-2013 04:35:00')+minutes(0:5:30),'nearest')
tt =
7×1 timetable
Time x
____________________ _______
10-Jul-2013 04:35:00 269.516
10-Jul-2013 04:40:00 269.585
10-Jul-2013 04:45:00 269.452
10-Jul-2013 04:50:00 269.691
10-Jul-2013 04:55:00 270.098
10-Jul-2013 05:00:00 270.099
10-Jul-2013 05:05:00 270.506
You can also use retime to fill in any actual gaps in your 5-minute data. I'm not sure whether the gaps you refer to are just this round-off issue, or whether you have timestamps that are completely missing.
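For comparison, the same regularize-and-fill pattern outside MATLAB: in Python/pandas one builds the full 5-minute grid and reindexes onto it, so missing stamps become NaN (timestamps and values below are illustrative):

```python
import numpy as np
import pandas as pd

# Irregular 5-minute data with one missing stamp (04:45 is absent)
idx = pd.to_datetime(["2013-07-10 04:35", "2013-07-10 04:40", "2013-07-10 04:50"])
s = pd.Series([269.516, 269.585, 269.691], index=idx)

# Full 5-minute grid spanning the data; reindex inserts NaN at the gap
grid = pd.date_range(idx[0], idx[-1], freq="5min")
s_full = s.reindex(grid)
```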
1 Comment
dpb
on 24 Jul 2017
Edited: dpb
on 27 Jul 2017
The problem is that Warren (like any unsuspecting user would/will) used a fractional portion of a day in a colon expression to build his time series. In general that will not match datenum's internal calculations, owing to how the internal rounding is done. It's easy enough to demonstrate the effect:
>> dn0=datenum(2013,7,10,4,35,0); % start date: "exact" datenum
>> dn1=datenum(2020,7,10,4,35,0); % ditto end
>> dn=[dn0:5/(24*60):dn1]; % naive 5-minute increments as a fraction of a day
>> dnD=datenum(2013,7,10,4,35+[0:5:(dn1-dn0)*24*60],0); % roll over minutes internally
>> all(dn==dnD) % are they the same? No, not entirely anyway...
ans =
0
>> sum(dnD~=dn)/numel(dn) % how many miss of total?
ans =
0.2222
>>
Also interesting to check on magnitudes...
>> max(abs(dn-dnD))
ans =
1.1642e-10
>> eps(dn(1))
ans =
1.1642e-10
>>
which shows that the difference is indeed in the last bit of the floating-point value, i.e. in how the rounding differs between the two ways the values are computed.
From his initial description (and my hard won experience having done the same thing) I was/am sure that's where his problems arose.
I just checked and the new, improved colon with datetime objects is smart enough such that the rounding is handled the same way between the two ways of writing the expression.
The upshot is, as you say, "switch to datetime". But if one is going to use datenum, then to avoid rounding errors, always build vectors of sub-day granularity by adding integer increments of the proper unit; floating-point approximations don't work reliably for subsequent comparisons.
I don't know just how datetime stores its values, as it's not documented, although I noticed in some of the routines that use it that it is stored as a complex double, so one presumes some division of higher-level increments versus lower ones between the real and imaginary parts. The guts aren't exposed, so one can't poke around without mex or some other deep probing, which I've not done. In some earlier releases (I'm at R2014b) there were similar issues between colon expressions and linspace: not all elements of two vectors with the same start/end values and number of elements would match, so a comparison between them would fail for those elements.
ADDENDUM
On checking, I stand corrected: the colon/linspace mismatch still exists through R2014b at least. I had thought I checked this earlier and found it no longer so, but it still is:
>> l=linspace(0,13000,200);
>> c=[0:13000/199:13000];
>> sum(l~=c)
ans =
66
>> max(abs(l-c))
ans =
1.8190e-12
>> eps(l(end))
ans =
1.8190e-12
>> version
ans =
8.4.0.150421 (R2014b)
>>
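That colon-vs-linspace disagreement is an instance of a generic effect: values built by repeated addition of a non-representable step drift from values computed directly from the endpoints. A tiny Python illustration (not MATLAB-specific):

```python
# 0.1 is not exactly representable in binary floating point, so repeated
# addition accumulates rounding error, while a single multiply rounds once.
step = 0.1
acc = 0.0
for _ in range(10):
    acc += step      # accumulate ten steps of 0.1

direct = 10 * step   # one rounding; lands on 1.0 exactly
```

acc and direct disagree in the last bit even though both "should" be 1.0, which is exactly why element-wise comparison of two differently constructed time vectors fails.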