Fast r-contiguous matching

Question

BigBang am 23 Nov. 2015

0
Verknüpfen

Direkter Link zu dieser Frage

https://de.mathworks.com/matlabcentral/answers/257051-fast-r-contiguous-matching

Kommentiert: BigBang am 8 Dez. 2015

Hi,

I want to compare these two strings by r-contiguous matching rule. So in this example if we set r as 6 then it will return true for the first example and false for the second example.

Example 1:

A='101010111111000000'
B='01101101101010001011'
return true (since it they both have 6 contiguous match '101010')

Example 2:

A='1010101010101010101'
B='11111111000000101'
return false (since they both have only 4 contiguous match '0101')

What is the fastest way in MATLAB since my data is huge? Thanks.

0 Kommentare
-2 ältere Kommentare anzeigen-2 ältere Kommentare ausblenden

Melden Sie sich an, um zu kommentieren.

Melden Sie sich an, um diese Frage zu beantworten.

Answer 1

arich82 am 23 Nov. 2015

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/257051-fast-r-contiguous-matching#answer_200941

In MATLAB Online öffnen

How big is 'huge'? Assuming I've understood your requirements, your accepted solution seems terribly inefficient, memory-wise, and doesn't scale particularly well.

An alternative approach might be to interpret each r-consecutive digits as a binary integer representation, use a sliding filter for a binary to decimal conversion, then check to see if any integers in one set match the integers in the other.

I think the following captures the edge cases correctly, but you might want to test it further.

% build dummy data
rng(0);
A = rand(1.0e4, 1) > 0.5; A = num2str(A); A(A == ' ') = [];
B = rand(1.2e4, 1) > 0.5; B = num2str(B); B(B == ' ') = [];
r = 6;
% stackoverflow solution
tic;
matches = bsxfun(@eq,A,B.');
intv = (0:r-1)*(size(matches,1)+1)+1;
idx = find(matches);
idx = idx(idx <= max(idx) - max(intv));
out = any(all(matches(bsxfun(@plus,idx,intv)),2));
toc;
% proposed binary conversion solution
tic;
bin = 2.^(0:r - 1);
A2 = filter(bin, 1, A == '1');
B2 = filter(bin, 1, B == '1');
bool = any(ismember(A2(r:end), B2(r:end))); % need to trim first r-1 entries
toc; 
% check
out == bool

output (note that I wrapped these in a function to ensure JIT acceleration in version 2014a):

Elapsed time is 5.936476 seconds.
Elapsed time is 0.002765 seconds.
ans =
       1

For sizes in the range of 1.0e5, my computer runs out of RAM using the accepted method and starts page swapping, which is impressive since I have 256GB of RAM (I got tired of waiting for a result); for the proposed solution, it only takes 0.022 seconds.

Note that if numel(A) is small (e.g. < 30), the accepted solution may indeed be faster.

The second approach could probably be made faster still, however, if you weren't starting from strings (e.g. if instead you used logical arrays for A and B).

Let me know if you see an error in the logic of the proposed solution, or if it doesn't work for your use case. (And if you find it useful, you could still upvote it for posterity.)