Matlab comparison of two large matricies

7 views (last 30 days)
I am trying to retrieve the index of exact matches (row-specific) between two large matricies. I have a n x 61 matrix A containing values from 0 to 9 and another n x 61 matrix B , whereas each row here contains values from 0 to 9 but mostly NaN (only 2 to 8 columns in each row of matrix B contain actual numbers). Matrix A can be expected to have between 1.5 million and 3 million rows, whereas matrix B has around 0.2 to 0.5 million rows. Here is an example of the setup:
% create matrix a with random data
dataSample = [0 9];
numRows = 1000000;
numCols = 61;
A = randi(dataSample,numRows,numCols);
% create matrix B with random data
numRows = 100000;
numCols = 61;
numColsUse = 2:8;
dataRange = 0:9;
B = NaN(numRows,numCols);
for i = 1:size(B,1)
% randomly selet number of columns to fill
numColsFill = datasample(numColsUse,1);
% randomly select column index from available columns
colIdx = datasample([1:numCols],numColsFill);
% randomly select values from 0 to 9
numFill = datasample([0:9],numColsFill);
% insert numbers at respective column in matrix B
B(i,colIdx) = numFill;
end
I want to compare every single row of matrix A with the entire matrix B and find exact matches, where the numbers of matrix B match the numbers of matrix A at their respective positions (columns) - hence the NaN in matrix B are to be ignored.
I can achieve the desired result using cellfun, where I slice matrix A in several subsets and then use a custom function to compare the rows of the subset with each row in matrix B, like so:
% put all rows of matrix B in single cell
cellB = {B};
% take subset of matrix A and convert to cell array
subA = A(1000:5000,:);
subA = num2cell(subA,2);
% prepare cellB to meet cellfun conditions
cellB = repmat(cellB, [size(subA,1) 1]);
% apply cellfun to retrieve index of each exact match
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
Function findMatch looks as follows:
function [ idx ] = LTableEval( cellB, subA )
idxCheckLT = lt(cellB, repmat(subA, [size(cellB,1) 1]));
idxCheckGT = gt(cellB, repmat(subA, [size(cellB,1) 1]));
idxCheck = idxCheckLT + idxCheckGT;
idxSum = sum(idxCheck,2);
idx = find(idxSum == 0);
end
This approach works, but it appears to be very inefficient, especially RAM-wise, as the cellfun requires all inputs to have the same size and hence a multiplication of the same data sets. Any ideas on how to tackle this problem in a more efficient way? Many thanks!

Accepted Answer

Guillaume
Guillaume on 28 May 2017
This is how I'd do it:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
Bcols = find(~isnan(B(Brow, :)));
matches{Brow} = find(all(A(:, Bcols) == B(Brow, Bcols), 2)); %requires R2016b or later
end
It's certainly a lot more efficient than any of the solutions you already have.
Note: in R2015b or earlier replace the relevant line by:
matches{Brow} = find(all(bsxfun(@eq, A(:, Bcols), B(Brow, Bcols)), 2));
  4 Comments
Benvaulter
Benvaulter on 29 May 2017
It works like a charm - absolutely brillinat - thank you so much!

Sign in to comment.

More Answers (1)

Matthew Eicholtz
Matthew Eicholtz on 26 May 2017
A couple comments:
1. Did you mean to convert to the cell array in this manner?
subA = num2cell(subA);
If you want to look at each row as its own cell, I think you need:
subA = num2cell(subA,2);
2. I am not sure how much more efficient this solution will be, but you can replace
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
with
idxContainer = cellfun(@(x,y) find(all(isnan(x)|x==y,2)), cellB, subA, 'UniformOutput', false);
Let me know if this helps at all.
  1 Comment
Benvaulter
Benvaulter on 28 May 2017
Hi Matthew,
thanks for your suggestions. First of all, you are correct, I has to be
subA = num2cell(subA,2);
as I want to conduct the operation on each row. Your alternative suggestion regarding the cellfunction returns the following error for me:
Error using ==
Matrix dimensions must agree.
Error in @(x,y)find(all(isnan(x)|x==y,2))
I have found other solutions suggesting matrix substraction, which makes sense to me. However, I am facing two challenges:
1. How to maximize performance when comparing each single row (i.e. what you suggested)
2. How to best setup the overall code (i.e. using cellfun vs. for-loops, vs. other ideas) to maximize speed overall (with these amounts of data, I also believe RAM might be a factor in what I am able to do.
Someone suggested the following solution:
for br = 1:size(B,1)
abs_diff = abs(repmat(B(br,:),[size(A,1) 1]) - A);
abs_diff(isnan(abs_diff)) = 0;
match = abs_diff == 0;
ind = find(sum(match,2)==size(match,2));
matches{br} = [repmat(br,[length(ind) 1]) ind];
end
matches = cell2mat(matches');
which is already a step into the right correction and which does not use "slicing the dataset" into smaller parts. However, each iteration of the loop here is still over a second long and I am wondering whether there are still better ways to do this. Any help is much appreciated - thanks!

Sign in to comment.

Categories

Find more on Creating and Concatenating Matrices in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by