How to identify same pattern with phase difference among rows?
Ältere Kommentare anzeigen
Hi all!
I have constructed a matrix (254864x6) with the following logic: Create all possible combinations of six integers (representing time intervals) between 5 and 20 that sum up to 60. This actually represents the time interval between two consecutive discrete events (6 such events in total) that have a repetition period T=1hour.
I know that this matrix has "double rows" in the sense that the same pattern repeats (if the observation period is e.g. 6h) but with some phase difference. For example, the rows:
5 5 20 20 5 5
and
5 5 5 5 20 20
or
10 10 15 15 5 5
and
15 15 5 5 10 10
My question is, how can I identify these rows, in a robust way, so I can save some memory and calculation time (by discarding them) in what I am doing further with this script (which uses this matrix as an input to some other process)?
Thanks,
Iro
5 Kommentare
Walter Roberson
am 15 Mai 2015
Is a "phase shift" a case where one vector is a circular shift of the other?
Walter Roberson
am 18 Mai 2015
Define a canonical form for each vector by arranging it so that the largest value is as far left as possible. If the vector has multiple copies of the maximum, then the one to move to the left is the one that is followed by the biggest value (this implies that if there two copies of the maximum in a row, they will move to the beginning.) It's almost 2:30 AM here and I can't quite think of an efficient way of doing this. The process is akin to shifting the order of digits in a number to obtain the largest possible integer.
Once you have that canonical number, subtract 4 from each digit, so that the range of values goes from 1 to 16. Now use
idx = sub2idx([16 16 16 16 16 16], TheVector)
Keep an Encountered logical() vector.
If idx > length(Encountered) then the combination has not been encountered before. Set Encountered(idx) = true to mark that you have seen it now. This will extend the array, writing false between the previous end and the new location.
Otherwise if Encountered(idx) is true then you have seen this pattern before, so discard it.
Otherwise, you have not encountered the pattern before. Set Encountered(idx) = true to mark that you have seen it now.
The vector that will lead to the largest index will, I think, be [20, 5, 5, 5, 6, 19], which would be element #14745616 of the Encountered vector. That's not a negligible amount of storage. It's about 10 times what you would use for your full 254864 x 6 matrix. It mostly has the advantage of being fast once the canonicalization is done.
I do not know at the moment how you arrived at your figure of 254864 x 6, as there are 16777216 possible entries for 6 numbers each in the range 5 to 20. I just ran a test and 701246 of the possibilities add up to 60. (Better remember to use uint8() for your data storage.) The Encounters vector I proposed would be about 5% occupied. But it's a linear search time. You could keep a canonicalized version of your present entries, which would save you from having to do circshifts on every existing entry to find matches, but even if you kept that sorted, searching it would be a 2*log2 process of the table size, repeated for each search. My calculation is that binary search would make the run about 35 times slower than linear. ismember() implements binary search on proper arguments, but with the number of searches you would need to do, you would want to drop the error checking and call the C routine ismembc2() directly.
Walter Roberson
am 18 Mai 2015
By the way, the reason for going for the largest digits first is that the amount of memory required is determined by the final dimension, so you want the final dimension to be less than the maximum.
Antworten (0)
Kategorien
Mehr zu Logical finden Sie in Hilfe-Center und File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!