How to use map and reduce efficiently?
6 Ansichten (letzte 30 Tage)
Ältere Kommentare anzeigen
Daniel Pinto
am 8 Aug. 2019
Bearbeitet: Daniel Pinto
am 9 Aug. 2019
I have the following table, which has over 40 million rows and 5 columns:
The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID - some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.
I wish to do 2 things:
1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.
2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.
3) divide A by B.
This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.
ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);
where the functions are defined as
function gvkeyMapFun2(data, ~, intermKVStore)
% gets quarter variables
vQuarter = num2str(data.rdate); % char format
% gets cusip in char format
vNCUSIP = cell2mat(data.cusip);
% creates quarter-ncusip identifer
IDnum = strcat(vQuarter,vNCUSIP);
IDnum = cellstr(IDnum);
% finds unique NCUSIPS-quarter
[intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
% gets variables of intersst
dataOwnership = cellfun(@(x) str2double(x),data.shares);
dataTotalShares = data.shrout2;
for ii = 1:numel(intermKeys)
totalOwnership = sum(dataOwnership(idx==ii));
totalShares = mode(dataTotalShares(idx==ii));
totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ];
add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
end
end
and
function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
databasereducedFinal = array2table([]);
while hasnext(intermValIter)
databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
end
add(outKVStore, 'output', databasereducedFinal);
end
I then run
output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});
to try and get the table because "output" looks like this:
I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.
thank you.
5 Kommentare
Akzeptierte Antwort
Guillaume
am 8 Aug. 2019
Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},
t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));
I was getting an error message
What was the error message. I would suspect that the a posteriori str2double conversion would really slow things down and really it shouldn't be necessary.
5 Kommentare
Guillaume
am 9 Aug. 2019
Bearbeitet: Guillaume
am 9 Aug. 2019
Your mapreduce algorithm is problematic as it assumes that you get all the values (sum and mode) associated with a key (pair of rdate, cusip) in just one map pass. If the data in your table is not sorted by the key, it certainly won't be the case and even if it is, it may be that a key straddles two calls to map.
E.g. let's assume your data is sorted and in one of the call to gvkeyMapFun3, you get:
rdata cusip shares shrout2
---------------------------------------------
... ... ... ...
... ... ... ...
'J' 'KK' 123 88
'X' 'YY' 111 1
'X' 'YY' 123 1
'X' 'YY' 245 2
'X' 'YY' 785 2
'X' 'YY' 951 1
On the next call to gvkeyMapFun3, you get:
rdata cusip shares shrout2
---------------------------------------------
'X' 'YY' 632 2
'X' 'YY' 421 3
'X' 'YY' 248 2
'X' 'YY' 892 3
'X' 'YY' 230 3
'Z' 'NN' 673 45
... ... ... ...
As you can see the mode of your key pair {'X', 'YY'} is 2 which you never got with your current algorithm.
Coping with that significantly complicate the mode calculation, as you can't calculate the mode in the mapping function. Instead you have to keep around the histogram of all the values of shrout2 and calculate the mode yourself in the reduce function, once you've combined all the histograms for the key.
Here is something that should work properly:
function gvkeyMapFun3(data, ~, intermKVStore)
keys = strcat(data.rdate, '-', data.cusip); %may be a better way to build your keys. Will be easier to separate afterwards
[intermKeys, ~, keyid] = unique(keys, 'stable'); %I don't see the point of 'stable'. Probably makes no difference
sharesum = accumarray(keyid, data.shares); %sum up the shares that match the same keyid
shrouthist = accumarray(keyid, data.shrout2, [], @tempmode); %compute histogram of shrout2 for each keyid
values = num2cell([num2cell(sharesum), shrouthist], 2); %for each key store a 1x2 cell array, 1st element is the share sum, 2nd element is a 2 column matrix [values, counts]
addmulti(intermKVStore, intermKeys, values);
end
%can be local in gvKeyMapFun3.m
function modehist = tempmode(shrout2)
%compute the histogram of shrout2, receives values that match just one key
%returns a scalar cell array containing a Nx2 matrix. First column is unique values, 2nd column is their histogram
[shroutval, ~, id] = unique(shrout2);
shroutcount = accumarray(id, 1);
modehist = {[shroutval, shroutcount]};
end
function gvkeyReduceFun3(intermKey, intermValIter, outKVStore)
sharesum = 0;
shrouthist = zeros(0, 2);
while hasnext(intermValIter)
values = getnext(intermValIter);
sharesum = sharesum + values{1};
shrouthist = [shrouthist; values{2}]; %#ok<AGROW>
end
%now that we have the full histogram of shrout2, we can calculate the mode
[allshrout, ~, id] = unique(shrouthist(:, 1));
count = accumarray(id, shrouthist(:, 2));
[~, moderow] = max(count);
shroutmode = allshrout(moderow);
result = sharesum / 1000 / shroutmode; %Not sure where the 1000 comes from. Was never mentioned before
add(outKVStore, intermKey, result);
end
Weitere Antworten (0)
Siehe auch
Kategorien
Mehr zu Tall Arrays finden Sie in Help Center und File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!