What is the complexity of ismember function?

5 Ansichten (letzte 30 Tage)
Anat Kira
Anat Kira am 5 Okt. 2018
Bearbeitet: Bruno Luong am 12 Okt. 2018
I want to know what is the complexity of the ismember function. Assuming that my vectors A and B are from size a and b respectively. both of the vectors have integer numbers that arranged in ascending order (number can be missing) and no number can repeats more than once. I am using this syntax: [lia,locb]=ismember(A,B).
thank you!

Akzeptierte Antwort

Walter Roberson
Walter Roberson am 5 Okt. 2018
The ismember code first checks the array with issorted(B), which is O(length(B)) . If necessary it sorts it, which does not apply in your described case. Then it calls an internal binary search routine passing in A and sorted B; average complexity of that is probably length(A) * log2(length(B)) .
It is not specified whether the internal routine does any optimization such as starting from the position of the previous result if the next value to be searched is greater than the immediately preceding value to be looked for.
  3 Kommentare
Stephen23
Stephen23 am 5 Okt. 2018
"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted?"
Walter Roberson
Walter Roberson am 5 Okt. 2018
In the situation where the inputs are both sorted, then there is probably a lower complexity algorithm, but it would not be easy to calculate what the complexity would be.
For example, you could do a full binary search of B to find the first element of A. Once you know that, that provides a lower bound for the next search, but not an upper bound. If you do your next binary search from there to the end of B, then each time you are throwing away information about the upper end of B. So you could alternate searching from the two ends, which would narrow down first the lower bound and then the upper bound on pairs of searches.
Now, if you were to implement that in MATLAB, the overhead of the coding would be high compared to a C implementation. It might take tens of millions of input elements for the reduced order complexity to overcome the implementation inefficiencies if you implemented in MATLAB. Algorithmic complexity is not the only thing you need to consider in practice.

Melden Sie sich an, um zu kommentieren.

Weitere Antworten (3)

Bruno Luong
Bruno Luong am 5 Okt. 2018
Bearbeitet: Bruno Luong am 5 Okt. 2018
"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"
You could do with complexity of (a+b) just by scan alternatively A(i) and B(j) such that B(j-1) <= A(i) < B(j) at all steps and derive the result from that.
Of course (a+b) is the best possible complexity one can achieve.
Now you can do in MATLAB for for loop but as we know MATLAB for loop is far from the speed for C for-loop. So the best is to program a MEX file. Such algorithm is closely related to merge sorted array problem (see this FEX). This problem is one step of the method of sorting by divide and conquer method.
  3 Kommentare
Bruno Luong
Bruno Luong am 5 Okt. 2018
Bearbeitet: Bruno Luong am 5 Okt. 2018
Theoretically that is right for this degenerated case.
In practice such function - if programmed generically as a library as all MATLAB stock function - must assert for A and B to be sorted, and that verification step alone requires already (a+b).
Walter Roberson
Walter Roberson am 5 Okt. 2018
Stephen Cobeldick pointed to a discussion of the internal functions that do not check first if the values are sorted.
Sorted is being given as a pre-condition, and that potentially changes the algorithms.

Melden Sie sich an, um zu kommentieren.


Bruno Luong
Bruno Luong am 5 Okt. 2018
Bearbeitet: Bruno Luong am 12 Okt. 2018
"Is there a better function in matlab (with lower complexity) that i can use that assuming the input is sorted"
You could do with complexity of (a+b) just by scan alternatively A(i) and B(j) such that B(j-1) <= A(i) < B(j) at all steps and derive the result from that.
Of course (a+b) is the best possible complexity one can achieve.
MATLAB does not have such algo in a ready-to-be-use function. Now you can do in MATLAB for for loop but as we know MATLAB for loop is far from the speed for C for-loop. So the best is to program a MEX file. Such algorithm is closely related to merge sorted array problem (see this FEX). This problem is one step of the method of sorting by divide and conquer method.
Note: (a+b) is already the time spend to construct A and B, admittedly outside the ismembersorted algo itself, .
Just for you Walter, who seems to like to scratch the details, I'll modify the algorithm slightly by chopping the non irrelevant head and tail parts of B by two binary searches.
[~,i] = histc(A([1,end]),[B,Inf]); % O(log(b))
i1 = max(i(1),1);
i2 = i(2);
then followed by the (linear-time) alternating merge scan on A and B(i1:i2).
The complexity of both steps combined is O(log(b)) + O(a+i2-i1);
It's now has the lowest complexity not only in average sense but for all the cases.
You might say actually the construction of [B;Inf] is O(b), but I can tweak the chopping without explicitly constructing [B;Inf], thus for simplicity of the explanation I can assume the construction of [B;Inf] is actually free.
Implementation in MEX
/**************************************************************************
* MATLAB mex function ismembers.c
* Calling syntax:
* >> TF = ismembers(A,B)
*
* Like MATLAB stock function ISMEMBER but assuming A and B are numerical
* vectors and sorted (not checked)
*
* Complex array and string arrays not supported
*
* Author Bruno Luong <brunoluong@yahoo.com>
*************************************************************************/
#include "mex.h"
#include "matrix.h"
/* Define correct type depending on platform
* You might have to modify here depending on your compiler */
#if defined(_MSC_VER) || defined(__BORLANDC__)
typedef __int64 int64;
typedef __int32 int32;
typedef __int16 int16;
typedef __int8 int08;
typedef unsigned __int64 uint64;
typedef unsigned __int32 uint32;
typedef unsigned __int16 uint16;
typedef unsigned __int8 uint08;
#else /* LINUX + LCC, CAUTION: not tested by the author */
typedef long long int int64;
typedef long int int32;
typedef short int16;
typedef char int08;
typedef unsigned long long int uint64;
typedef unsigned long int uint32;
typedef unsigned short uint16;
typedef unsigned char uint08;
#endif
// SCAN engine Template
#define SCAN(type) { \
Ae = (void*)((type*)A + na); \
Be = (void*)((type*)B + nb); \
while (1) \
{ \
if ( *((type*)A) < *((type*)B) ) { \
TF++; \
A = (void*)((type*)A+1); \
if (A>=Ae) break; \
} \
else if ( *((type*)A) == *((type*)B) ) { \
*TF++ = 1; \
A = (void*)((type*)A+1); \
if (A>=Ae) break; \
} \
else { \
B = (void*)((type*)B+1); \
if (B>=Be) break; \
}; \
}; \
}
// SCAN_LOC engine Template
#define SCAN_LOC(type) { \
Ae = (void*)((type*)A + na); \
Be = (void*)((type*)B + nb); \
while (1) \
{ \
if ( *((type*)A) < *((type*)B) ) { \
TF++; \
A = (void*)((type*)A+1); \
loc++; \
if (A>=Ae) break; \
} \
else if ( *((type*)A) == *((type*)B) ) { \
*TF++ = 1; \
A = (void*)((type*)A+1); \
*loc++ = k; \
if (A>=Ae) break; \
} \
else { \
B = (void*)((type*)B+1); \
k++; \
if (B>=Be) break; \
}; \
}; \
}
/* Gateway routine ismembers */
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[]) {
const mxArray *A_IN, *B_IN;
mxClassID ClassID;
mwIndex m, n, na, nb, nc;
void *A, *B, *Ae, *Be;
double *loc;
uint32 k;
uint08 *TF;
if (nrhs!=2)
mexErrMsgTxt("ismembers: Two input arguments are required.");
/* Check data type of input argument */
B_IN = prhs[1];
m = (mwIndex)mxGetM(B_IN);
n = (mwIndex)mxGetN(B_IN);
if ((m > 1) && (n > 1))
mexErrMsgTxt("ismembers: Second input argument must be a vector.");
/* Get the total number elements of B */
nb = m*n;
/* Check data type of input argument */
A_IN = prhs[0];
m = (mwIndex)mxGetM(A_IN);
n = (mwIndex)mxGetN(A_IN);
if ((m > 1) && (n > 1))
mexErrMsgTxt("ismembers: First input argument must be a vector.");
/* Get the total number elements of A */
na = m*n;
plhs[0] = mxCreateLogicalMatrix(m, n);
TF = mxGetLogicals(plhs[0]);
ClassID = mxGetClassID(A_IN);
if (ClassID != mxGetClassID(B_IN))
mexErrMsgTxt("Two inputs must have the same classes.");
A = mxGetDoubles(A_IN);
B = mxGetDoubles(B_IN);
if (nlhs<=1) {
switch (ClassID) {
case mxDOUBLE_CLASS:
SCAN(double);
break;
case mxSINGLE_CLASS:
SCAN(float);
break;
case mxINT64_CLASS:
SCAN(int64);
break;
case mxUINT64_CLASS:
SCAN(uint64);
break;
case mxINT32_CLASS:
SCAN(int32);
break;
case mxUINT32_CLASS:
SCAN(uint32);
break;
case mxCHAR_CLASS:
SCAN(uint16);
break;
case mxINT16_CLASS:
SCAN(int16);
break;
case mxUINT16_CLASS:
SCAN(uint16);
break;
case mxLOGICAL_CLASS:
SCAN(uint08);
break;
case mxINT8_CLASS:
SCAN(int08);
break;
case mxUINT8_CLASS:
SCAN(uint08);
break;
default:
mexErrMsgTxt("ismembers: Class not supported.");
} /* switch */
}
else
{
plhs[1] = mxCreateNumericMatrix(m, n, mxDOUBLE_CLASS, mxREAL);
loc = mxGetDoubles(plhs[1]);
k = 1; /* MATLAB indexing starts with 1 */
switch (ClassID) {
case mxDOUBLE_CLASS:
SCAN_LOC(double);
break;
case mxSINGLE_CLASS:
SCAN_LOC(float);
break;
case mxINT64_CLASS:
SCAN_LOC(int64);
break;
case mxUINT64_CLASS:
SCAN_LOC(uint64);
break;
case mxINT32_CLASS:
SCAN_LOC(int32);
break;
case mxUINT32_CLASS:
SCAN_LOC(uint32);
break;
case mxCHAR_CLASS:
SCAN_LOC(uint16);
break;
case mxINT16_CLASS:
SCAN_LOC(int16);
break;
case mxUINT16_CLASS:
SCAN_LOC(uint16);
break;
case mxLOGICAL_CLASS:
SCAN_LOC(uint08);
break;
case mxINT8_CLASS:
SCAN_LOC(int08);
break;
case mxUINT8_CLASS:
SCAN_LOC(uint08);
break;
default:
mexErrMsgTxt("ismembers: Class not supported.");
} /* switch */
}
return;
} /* ismembers */
  8 Kommentare
Anat Kira
Anat Kira am 11 Okt. 2018
My first question was theoretical and the second was practical, but all the answers were very helpful for me, so thanks to everyone who answered.
Bruno Luong
Bruno Luong am 12 Okt. 2018
Bearbeitet: Bruno Luong am 12 Okt. 2018
Here is a test results on arrays of 10000 elements for both A and B, results might vary for different input size and density.
ismember time = 146.440 [ms]
ismembc/ismembc2 time = 156.300 [ms]
ismembers time = 121.969 [ms]
NOTE: ISMEMBC2 returns that index of the last occurrence in B so it's not 100% compatible with ISMEMBER which returns the first occurrence (unless using 'legacy' argument).
Test script is
ntest = 100;
t = zeros(3,ntest);
for k=1:ntest
A = sort(ceil(20000*rand(1,1e4)));
B = sort(ceil(20000*rand(1,1e4)));
tic
[TF1,loc1] = ismember(A,B);
t(1,k) = toc;
tic
TF2 = ismembc(A,B);
loc2 = ismembc2(A,B);
t(2,k) = toc;
tic
[TF3,loc3] = ismembers(A,B);
t(3,k) = toc;
end
t = mean(t,2)*1e6;
fprintf('ismember time = %1.3f [ms]\n', t(1));
fprintf('ismembc/ismembc2 time = %1.3f [ms]\n', t(2));
fprintf('ismembers time = %1.3f [ms]\n', t(3));

Melden Sie sich an, um zu kommentieren.


Bruno Luong
Bruno Luong am 7 Okt. 2018
Bearbeitet: Bruno Luong am 7 Okt. 2018
I run this code (R2018b) by changing the length of B is ISMEMBER(A,B), length A fix. It shows clearly the timing is NOT proportional to log(b) but rather to b.
For ISMEMBERC and ISMEMBERC2 is not proportional to b but something lower, certainly log(b).
The complexity of large array A/B of ISMEMBER is (a + b)*log(b)
a = 10;
b = 2.^(1:22);
t = zeros(size(b));
for n=1:length(b)
A = rand(1,10);
B = rand(1,b(n));
tic
ismember(A,B);
t(n) = toc;
end
close all
subplot(1,2,1);
semilogx(b,t);
xlabel('length(B)');
ylabel('time of ISMEMBER(...,B)')
subplot(1,2,2);
loglog(b,t);
xlabel('length(B)');
ylabel('time of ISMEMBER(...,B)')
  2 Kommentare
Bruno Luong
Bruno Luong am 7 Okt. 2018
For comparison, same curves for ISMEMBC and ISMEMBC2, the engine of ISMEMBER without argument checking:
Walter Roberson
Walter Roberson am 11 Okt. 2018
Original question stated "Assuming that my vectors A and B are from size a and b respectively. both of the vectors have integer numbers that arranged in ascending order"
The code
A = rand(1,10);
B = rand(1,b(n));
ismember(A,B);
violates the "arranged in ascending order" pre-condition that is key to the question. The question is specifically about what happens given that pre-condition. As I stated in my Answer, ismember() first checks for B sorted, which is an O(b) process where b = length(B). When B is not sorted, then it needs to sort it; as quicksort is used, that is O(b*log(b)) . Once B is known sorted then ismemberc or (more recently) builtin('_ismemberhelper') is called to do a binary search. With A not being sorted, we can surmise that the overall search process after the sorting check would be a * log2(b) where a = length(A) . In the sorted case this leads to an overall complexity for ismember of b + a*log2(b) .
Except in the degenerate case of b = 1 or 2, this will be worse than a + b, indicating that given those pre-conditions, you can do better than calling ismember() itself -- that at worst you could use the a+b algorithm Bruno indicated earlier.
An algorithm such as the a+b algorithm Bruno indicated might perhaps be the most practical algorithm over a fair range of array sizes. More complicated algorithms can be designed that have a lower theoretical complexity cost, but the overheads of setting them up are not free, and probably have worse cache-line effects, so you might potentially have to go to pretty large matrix sizes before a theoretically better algorithm would do better: "Quality of Implementation" would be important for practical purposes.
On the other hand, consider if a is much smaller than b. Suppose, for example, that a = 100 and b = 2^30 . Then O(a+b) would be approximately 1073741924 comparisons. But using the internal assumed-sorted ismemberc or builtin('_ismemberhelper') would be O(a*log2(b)) = O(100 * log2(2^30)) = O(100 * 30) or about 3000. Thus, an O(a+b) algorithm might be considerably worse when the sizes are rather different from each other.
Are there theoretically better algorithms under the sorted precondition without assuming that the sizes are so unequal? It looks like there may be. https://stackoverflow.com/questions/12993655/find-common-elements-in-2-sorted-arrays describes a "blended" algorithm that it is claimed has better complexity, and references a paper on merging that proves a bound.

Melden Sie sich an, um zu kommentieren.

Kategorien

Mehr zu Write C Functions Callable from MATLAB (MEX Files) finden Sie in Help Center und File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by