MATLAB comparison of two large matrices
    Benvaulter on 26 May 2017
    Commented: Benvaulter on 29 May 2017
I am trying to retrieve the indices of exact, row-specific matches between two large matrices. I have an n x 61 matrix A containing values from 0 to 9 and an m x 61 matrix B in which each row also contains values from 0 to 9 but is mostly NaN (only 2 to 8 columns in each row of B contain actual numbers). Matrix A can be expected to have between 1.5 million and 3 million rows, whereas matrix B has around 0.2 to 0.5 million rows. Here is an example of the setup:
% create matrix A with random data
dataSample = [0 9];
numRows = 1000000;
numCols = 61;
A = randi(dataSample,numRows,numCols);
% create matrix B with random data
numRows = 100000;
numCols = 61;
numColsUse = 2:8;
dataRange = 0:9;
B = NaN(numRows,numCols);
for i = 1:size(B,1)
      % randomly select the number of columns to fill
      numColsFill = datasample(numColsUse,1);
      % randomly select distinct column indices from the available columns
      colIdx = datasample(1:numCols,numColsFill,'Replace',false);
      % randomly select values from 0 to 9
      numFill = datasample(dataRange,numColsFill);
      % insert numbers at respective column in matrix B
      B(i,colIdx) = numFill;
end
I want to compare every single row of matrix A with the entire matrix B and find exact matches, where the numbers of matrix B match the numbers of matrix A at their respective positions (columns) - hence the NaN in matrix B are to be ignored.
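To make the matching rule concrete, here is a minimal sketch on made-up toy data (the small matrices and the name isMatch are only for illustration, not part of my real data). It builds a logical table in which entry (a,b) is true when row b of B agrees with row a of A in every column where B is not NaN:

```matlab
% Hypothetical toy data: NaN in B means "ignore this column".
A = [1 2 3 4;
     1 5 3 0;
     7 2 9 4];
B = [1   NaN 3   NaN;
     NaN 2   NaN 4];

% isMatch(a,b) is true when row b of B matches row a of A.
isMatch = false(size(A, 1), size(B, 1));
for a = 1:size(A, 1)
    for b = 1:size(B, 1)
        cols = ~isnan(B(b, :));                      % columns that B actually specifies
        isMatch(a, b) = isequal(A(a, cols), B(b, cols));
    end
end
```

This double loop is only meant to pin down the definition; it is far too slow for the real matrix sizes.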
I can achieve the desired result using cellfun: I slice matrix A into several subsets and then use a custom function to compare each row of the subset with every row of matrix B, like so:
% put all rows of matrix B in single cell
cellB = {B};
% take subset of matrix A and convert to cell array
subA = A(1000:5000,:);
subA = num2cell(subA,2);
% prepare cellB to meet cellfun conditions
cellB = repmat(cellB, [size(subA,1) 1]);
% apply cellfun to retrieve index of each exact match
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
The function findMatch looks as follows:
function idx = findMatch(cellB, subA)
      % comparisons with NaN are always false, so NaN entries in cellB never count as mismatches
      idxCheckLT = lt(cellB, repmat(subA, [size(cellB,1) 1]));
      idxCheckGT = gt(cellB, repmat(subA, [size(cellB,1) 1]));
      idxCheck = idxCheckLT + idxCheckGT;
      idxSum = sum(idxCheck,2);
      idx = find(idxSum == 0);
end
This approach works, but it appears to be very inefficient, especially in terms of RAM: cellfun requires all inputs to have the same size, so the same data has to be replicated. Any ideas on how to tackle this problem more efficiently? Many thanks!
Accepted Answer
  Guillaume on 28 May 2017
        This is how I'd do it:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
    Bcols = find(~isnan(B(Brow, :)));
    matches{Brow} = find(all(A(:, Bcols) == B(Brow, Bcols), 2));  %requires R2016b or later
end
It's certainly a lot more efficient than any of the solutions you already have.
Note: in R2016a or earlier, replace the relevant line with:
matches{Brow} = find(all(bsxfun(@eq, A(:, Bcols), B(Brow, Bcols)), 2));
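As a quick sanity check, here is a minimal sketch of the same loop on made-up toy data (the small A and B are hypothetical, chosen only so the result can be verified by hand). It uses the bsxfun form so that it runs on any release:

```matlab
% Hypothetical toy data (not from the question): NaN in B means "ignore this column".
A = [1 2 3 4;
     1 5 3 0;
     7 2 9 4];
B = [1   NaN 3   NaN;
     NaN 2   NaN 4];

matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
    Bcols = find(~isnan(B(Brow, :)));   % columns this row of B specifies
    % compare only those columns of A; bsxfun keeps this runnable before R2016b
    matches{Brow} = find(all(bsxfun(@eq, A(:, Bcols), B(Brow, Bcols)), 2));
end
```

Each cell of matches lists the rows of A that match the corresponding row of B. Since a row of B has only 2 to 8 non-NaN columns, the temporary comparison array is at most size(A,1) x 8 rather than size(A,1) x 61.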
4 Comments
  Guillaume on 29 May 2017
Try this:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
    Bcols = find(~isnan(B(Brow, :)));
    matchedrows = find(all(A(:, Bcols) == B(Brow, Bcols), 2));
    matches{Brow} = [matchedrows, repmat(Brow, size(matchedrows))];
end
matches = cell2mat(matches);
finalScores = accumarray(matches(:, 1), matches(:, 2), [size(A, 1), 1], @(Brows) mean(Bscores(Brows)), nan);
However, if most rows of A match at least some rows of B, it may be more efficient to use a single loop over the rows of A. (I regard accumarray as another loop.)
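A rough sketch of what that single loop over the rows of A could look like, assuming Bscores is a column vector with one score per row of B (the toy values below are made up for illustration):

```matlab
% Hypothetical toy data; Bscores is assumed to hold one score per row of B.
A = [1 2 3 4;
     1 5 3 0;
     7 2 9 4;
     0 0 0 0];
B = [1   NaN 3   NaN;
     NaN 2   NaN 4];
Bscores = [10; 20];

nanMask = isnan(B);                  % computed once, reused for every row of A
finalScores = nan(size(A, 1), 1);    % rows of A without any match stay NaN
for Arow = 1:size(A, 1)
    % rows of B whose non-NaN entries all agree with this row of A
    hit = all(nanMask | bsxfun(@eq, B, A(Arow, :)), 2);
    if any(hit)
        finalScores(Arow) = mean(Bscores(hit));
    end
end
```

Each iteration compares one row of A against all of B, so whether this beats the loop over B depends on the relative row counts and on how many rows of A actually find a match; it would need to be timed on the real data.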
More Answers (1)
  Matthew Eicholtz on 26 May 2017
        A couple comments:
1. Did you mean to convert to the cell array in this manner?
subA = num2cell(subA);
If you want to look at each row as its own cell, I think you need:
subA = num2cell(subA,2);
2. I am not sure how much more efficient this solution will be, but you can replace
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
with
idxContainer = cellfun(@(x,y) find(all(isnan(x)|x==y,2)), cellB, subA, 'UniformOutput', false);
Let me know if this helps at all.
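For what it's worth, here is a small check on made-up toy data that the predicate isnan(x)|x==y selects the same rows as the lt/gt approach from the question. The variable names and values are only illustrative. Note that x==y with a matrix x and a row vector y relies on implicit expansion (R2016b or later), so the sketch uses bsxfun to stay runnable on older releases:

```matlab
% Hypothetical toy data: x plays the role of B, y is one row of A.
x = [1   NaN 3   NaN;
     NaN 2   NaN 4;
     9   NaN NaN NaN];
y = [1 2 3 4];

% Approach from the question: a row matches when no entry is strictly less or greater.
yRep = repmat(y, [size(x, 1) 1]);
idxOld = find(sum(lt(x, yRep) + gt(x, yRep), 2) == 0);

% Suggested predicate: every entry is either NaN or equal to the entry in y.
idxNew = find(all(isnan(x) | bsxfun(@eq, x, y), 2));
```

Both give the same indices here, and the second form avoids building the repmat copy of y.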