Identify and fill missing time gaps with NaN
Good day all,
I have a dataset consisting of times (serial date numbers) and measurements at 5-minute intervals over 7 years. I am trying to make a temporally continuous dataset, as the data has numerous random gaps over the measurement period that I want to fill with NaN. Example of the data:
735425.190972222 269.516000000000
735425.194444445 269.585000000000
735425.197916667 269.452000000000
735425.201388889 269.691000000000
735425.204861111 270.098000000000
735425.208333333 270.099000000000
735425.211805556 270.506000000000
…
My first step is to create a continuous column vector for the entire timeframe, from the start time to the end time at 5-minute intervals, and then add a second column of NaN.
735425.190972222 NaN
735425.194444445 NaN
735425.197916667 NaN
735425.201388889 NaN
735425.204861111 NaN
735425.208333333 NaN
735425.211805556 NaN
…
Then I use the intersect function to find where the times coincide. But every 4-9 time intervals, the times are reported as different even though they look identical. This pattern continues throughout the entire dataset wherever times should be the same.
735425.190972222 269.516000000000
735425.194444445 269.585000000000
735425.197916667 NaN
735425.201388889 269.691000000000
735425.204861111 270.098000000000
735425.208333333 270.099000000000
735425.211805556 NaN
…
The code I am using for the intersect looks like this:
% Continuous time vector with NaN placeholders
time_cont = (time.ts:time.dt:time.te)';
filled_data = [time_cont NaN(numel(time_cont),1)];
% Match times and copy the measurements across
[~, ia, ib] = intersect(filled_data(:,1), original_data(:,1));
filled_data(ia,2) = original_data(ib,2);
Any ideas why this happens or what I am doing wrong? Any other suggestions on how to identify and fill the gaps would also be appreciated. Thanks in advance for the help!
1 Comment
Rik
on 17 Jul 2017
Try converting to a class that is less susceptible to rounding errors. Floats can be subtly different, so that even A==A could seemingly return false (that specific example is very unlikely, if not impossible, but intersect may well feel the effect of such tiny differences).
I would suggest converting that first column to a datetime array. The ismember function may also give better results; you could check that as well.
Accepted Answer
Jan
on 17 Jul 2017
This is caused by the limited precision of floating-point values. See:
any(0:0.1:1 == 0.3) % FALSE!
A value like 735425.197916667 may already be rounded. So either compare the values with a tolerance using ismembertol, or use an integer representation for the dates:
time_cont = round((time.ts:time.dt:time.te).' * 86400);
Now you compare rounded whole seconds. For intervals of 5 minutes it would be:
time_cont = round((time.ts:time.dt:time.te).' * 288);
Then your intersect method would work again.
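For readers outside MATLAB, Jan's integer-key idea can be sketched in Python/NumPy as well; the times and values below are made up for illustration, and 288 is simply the number of 5-minute intervals per day:

```python
import numpy as np

day = 735425.0        # serial date number for the day (MATLAB datenum convention)
step = 5 / (24 * 60)  # 5 minutes as a fraction of a day

# Observed times with a gap (k = 2 missing) plus a tiny float jitter
obs_k = np.array([0, 1, 3, 4, 5])
obs_times = day + obs_k * step + 1e-12
obs_vals = np.array([269.516, 269.585, 269.691, 270.098, 270.099])

# Full, continuous 5-minute grid
full_times = day + np.arange(6) * step

# Map both to integer counts of 5-minute intervals (288 per day),
# so the comparison is exact despite the float jitter
obs_keys = np.round(obs_times * 288).astype(np.int64)
full_keys = np.round(full_times * 288).astype(np.int64)

# Same pattern as MATLAB's intersect with two index outputs
_, ia, ib = np.intersect1d(full_keys, obs_keys, return_indices=True)
filled = np.full(full_times.shape, np.nan)
filled[ia] = obs_vals[ib]
```

filled ends up with the measurement at every matched grid point and NaN at the gap, which is exactly the structure the original intersect approach was after.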
2 Comments
dpb
on 17 Jul 2017
This works to smooth your way around the issue after the fact, yes.
However, the general solution with datenum and/or datetime, as shown below, is not to use fractional portions of a day with colon or linspace, but to call the time-value generation function you intend to use with integer values of the desired granularity.
These routines roll over correctly, including days of month and leap years, and then generate self-consistent results that work reliably for comparisons.
More Answers (2)
dpb
on 17 Jul 2017
The "why" is floating-point round-off, from having built the time-series vector using colon with a dt computed as a fractional floating-point value.
Build the time vector by incrementing the minutes field in datenum by an integer number of minutes instead; then the internal rounding of the values is done consistently.
[Y,M,D,H,MN,S] = datevec(time.ts); % initial date values
time_cont = datenum(Y,M,D,H,MN+[0:5*365*24*60/12]',0); % 5 yr at 5 minute intervals
The above is for 365-day years; use the difference between the start and end times to compute the actual number needed, depending on the number of leap years in the particular 5-yr span.
You might also look at the built-in timeseries data type, which has some features for fixed-interval sampled series.
2 Comments
dpb
on 24 Jul 2017
Edited: dpb
on 24 Jul 2017
..."increment is only 1 minute..."
That's what happens when one just writes "air code" by typing into the Answer window... you're correct in adding the 5 increment parameter to the colon expression.
What you've done (or the equivalent, which can be written many different ways) is correct when using datenum. But unless you're stuck with an old release that doesn't yet include it, I'd suggest switching over to the newer datetime object; it has many enhancements over datenum.
If you make that switch, then when you have both start and end times, use them directly to compute the end point for the colon expansion (it took me a long time to catch on to this, and I still tend to forget and write an explicit upper limit, because old datenum didn't have built-in functions like minutes).
>> ts=datetime([2011 8 28 8 25 0]);
>> te=datetime([2013 7 10 13 5 0]);
>> dt=datetime(2011, 8, 28, 8, 25+[0:5:minutes(te-ts)].', 0);
>> whos dt
Name Size Bytes Class Attributes
dt 196473x1 3143625 datetime
>> dt(1), dt(end)
ans =
28-Aug-2011 08:25:00
ans =
10-Jul-2013 13:05:00
>>
You can use colon with datetimes too, but I'm not positive about the syntax for fractional hours, and I haven't tested whether a floating-point delta-t rounds consistently internally there. Well, I guess it wouldn't take much to try here, now, would it?
Oh, while playing around, the right syntax came to me, and it simplifies the whole process!
>> dtv=[ts:minutes(5):te].';
>> all(dtv==dt)
ans =
1
>>
This is internally self-consistent, too. Great advancement! Learnt me something new, too... :)
The latter alone is reason enough to switch.
Peter Perkins
on 24 Jul 2017
There's a lot going on here, much of it due, as Walter says, to the way that datenums represent time: counting days since a long time ago. Their limit of useful precision for contemporary dates is something around 10 microseconds.
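That precision figure is easy to verify: the spacing between adjacent doubles near a contemporary datenum, converted to seconds, is on the order of 10 microseconds. A quick Python sketch (math.ulp needs Python 3.9+), using the datenum value from this thread:

```python
import math

dn = 735425.0                   # a contemporary MATLAB datenum (days since year 0)
ulp_days = math.ulp(dn)         # gap to the next representable double, in days
ulp_seconds = ulp_days * 86400  # about 1e-5 s, i.e. on the order of 10 microseconds
```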
Warren, I can't tell where your original datenums came from. As typed, they did not come from a call to datenum, or at least not from this one:
>> format long
>> dn1 = [735425.190972222; 735425.194444445; 735425.197916667; 735425.201388889; 735425.204861111; 735425.208333333; 735425.211805556]
dn1 =
1.0e+05 *
7.354251909722220
7.354251944444449
7.354251979166670
7.354252013888890
7.354252048611110
7.354252083333330
7.354252118055560
>> dn2 = datenum({'10-Jul-2013 04:35:00'; '10-Jul-2013 04:40:00'; '10-Jul-2013 04:45:00'; '10-Jul-2013 04:50:00'; '10-Jul-2013 04:55:00'; '10-Jul-2013 05:00:00'; '10-Jul-2013 05:05:00'})
dn2 =
1.0e+05 *
7.354251909722223
7.354251944444445
7.354251979166667
7.354252013888889
7.354252048611111
7.354252083333334
7.354252118055555
Those two things look the same out to 15 digits (format longg) but not to 16 (format long). Maybe you copied from a command window set to format longg and pasted into Answers; except that the second one does not display the same even under format longg, so perhaps what you posted came from somewhere else.
Anyhow, datetime uses a much better internal representation that avoids that level of round-off. And in fact, the conversion from datenum to datetime tries to account for the round-off inherent in datenums:
>> d2 = datetime('10-Jul-2013 04:35:00','Format','dd-MMM-yyyy HH:mm:ss.SSSSSSSSS') + minutes(0:5:30)'
d2 =
7×1 datetime array
10-Jul-2013 04:35:00.000000000
10-Jul-2013 04:40:00.000000000
10-Jul-2013 04:45:00.000000000
10-Jul-2013 04:50:00.000000000
10-Jul-2013 04:55:00.000000000
10-Jul-2013 05:00:00.000000000
10-Jul-2013 05:05:00.000000000
>> dn3 = datenum(d2)
dn3 =
1.0e+05 *
7.354251909722223
7.354251944444445
7.354251979166667
7.354252013888889
7.354252048611111
7.354252083333334
7.354252118055555
You can see that the datetime has hit the 5 minute points exactly, and in fact dn2->d2->dn3 is an exact round trip. But the datenums you've posted are far enough away from what they "ought" to be for "datenums at 5min intervals" that datetime's conversion just leaves them alone:
>> d1 = datetime(dn1,'ConvertFrom','datenum','Format','dd-MMM-yyyy HH:mm:ss.SSSSSSSSS')
d1 =
7×1 datetime array
10-Jul-2013 04:34:59.999982177
10-Jul-2013 04:40:00.000044677
10-Jul-2013 04:45:00.000026855
10-Jul-2013 04:50:00.000009033
10-Jul-2013 04:54:59.999990966
10-Jul-2013 04:59:59.999973144
10-Jul-2013 05:05:00.000035644
Given that, your times are off from the correct values by a hair. If you have access to R2016b or later, it's really easy to fix that using a timetable.
>> tt = timetable(d1, x); % the original data as a timetable (x = the measurements)
>> tt = retime(tt,datetime('10-Jul-2013 04:35:00')+minutes(0:5:30),'nearest')
tt =
7×1 timetable
Time x
____________________ _______
10-Jul-2013 04:35:00 269.516
10-Jul-2013 04:40:00 269.585
10-Jul-2013 04:45:00 269.452
10-Jul-2013 04:50:00 269.691
10-Jul-2013 04:55:00 270.098
10-Jul-2013 05:00:00 270.099
10-Jul-2013 05:05:00 270.506
You can also use retime to fill in any actual gaps in your 5-minute data. I'm not sure whether the gaps you refer to are just this round-off issue, or whether you have timestamps that are completely missing.
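For comparison, the same regularize-and-fill pattern outside MATLAB: in Python/pandas one builds the full 5-minute grid and reindexes onto it, so missing stamps become NaN (timestamps and values below are illustrative):

```python
import numpy as np
import pandas as pd

# Irregular 5-minute data with one missing stamp (04:45 is absent)
idx = pd.to_datetime(["2013-07-10 04:35", "2013-07-10 04:40", "2013-07-10 04:50"])
s = pd.Series([269.516, 269.585, 269.691], index=idx)

# Full 5-minute grid spanning the data; reindex inserts NaN at the gap
grid = pd.date_range(idx[0], idx[-1], freq="5min")
s_full = s.reindex(grid)
```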
1 Comment
dpb
on 24 Jul 2017
Edited: dpb
on 27 Jul 2017
The problem is that Warren (like any unsuspecting user would/will) used a fractional portion of a day in a colon expression to build his time series. In general that will not match datenum's internal calculations, owing to how the internal rounding is done. It's easy enough to demonstrate the effect:
>> dn0=datenum(2013,7,10,4,35,0); % start date: "exact" datenum
>> dn1=datenum(2020,7,10,4,35,0); % ditto end
>> dn=[dn0:5/(24*60):dn1]; % naive 5-minute increments as a fraction of a day
>> dnD=datenum(2013,7,10,4,35+[0:5:(dn1-dn0)*24*60],0); % roll over minutes internally
>> all(dn==dnD) % are they the same? No, not entirely anyway...
ans =
0
>> sum(dnD~=dn)/numel(dn) % how many miss of total?
ans =
0.2222
>>
Also interesting to check on magnitudes...
>> max(abs(dn-dnD))
ans =
1.1642e-10
>> eps(dn(1))
ans =
1.1642e-10
>>
which shows that the difference is indeed in the last bit of the floating-point value, i.e. in how the rounding differs between the two ways the values are computed.
From his initial description (and my hard won experience having done the same thing) I was/am sure that's where his problems arose.
I just checked and the new, improved colon with datetime objects is smart enough such that the rounding is handled the same way between the two ways of writing the expression.
The upshot is, as you say, "switch to datetime". But if one is going to use datenum, then to avoid rounding errors, always build vectors of sub-day granularity by adding integer increments of the proper unit; floating-point approximations don't work reliably for subsequent comparisons.
I don't know just how datetime stores its values, as it's not documented, although I noticed in some of the routines that use it that it is stored as a complex double, so one presumes some division of higher-level increments versus lower ones between the real and imaginary parts. The guts aren't exposed, so one can't poke around without mex or some other deep probing, which I've not done. In some earlier releases (I'm at R2014b) there were similar issues between colon expressions and linspace: not all elements of two vectors with the same start/end values and number of elements would match, so a comparison between them would fail for those elements.
ADDENDUM
On checking, I stand corrected: the colon/linspace mismatch still exists through R2014b at least. I had thought I checked this earlier and found it no longer so, but it still is:
>> l=linspace(0,13000,200);
>> c=[0:13000/199:13000];
>> sum(l~=c)
ans =
66
>> max(abs(l-c))
ans =
1.8190e-12
>> eps(l(end))
ans =
1.8190e-12
>> version
ans =
8.4.0.150421 (R2014b)
>>
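That colon-vs-linspace disagreement is an instance of a generic effect: values built by repeated addition of a non-representable step drift from values computed directly from the endpoints. A tiny Python illustration (not MATLAB-specific):

```python
# 0.1 is not exactly representable in binary floating point, so repeated
# addition accumulates rounding error, while a single multiply rounds once.
step = 0.1
acc = 0.0
for _ in range(10):
    acc += step      # accumulate ten steps of 0.1

direct = 10 * step   # one rounding; lands on 1.0 exactly
```

acc and direct disagree in the last bit even though both "should" be 1.0, which is exactly why element-wise comparison of two differently constructed time vectors fails.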