speed up renamecats/categorical multiple columns
2 Ansichten (letzte 30 Tage)
Ältere Kommentare anzeigen
I have a huge csv file of about 16GB which over 9k columns. Each column is initially filled with some codes (either integer or string), and I have a code book with code and meaning for each column. What I'm trying to do is to translate the table and finally have a table that has readable texts instead of codes.
I can use either categorical or renamecats to "translate" them, but the issue is that it takes substentially long time to loop through these columns. I'm thinking if there is a way to speed this up.
See below an example
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
["123", "234", "345", "456", "567"]', ...
'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
["apple", "banana", "cat", "dog", "elephont"]', ...
'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
["East", "West", "North", "South", "Middle"]', ...
'VariableNames', {'Code', 'Meaning'});
Vars = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
dictionary.(Vars{iC}).Meaning);
end
Is that possible to avoid this loop, or any suggestions to speed this up (considering that I have over 500k rows and 9k columns).
Thank you!
0 Kommentare
Antworten (1)
Campion Loong
am 9 Okt. 2020
Hi Peng,
It seems you have the Dictionary code book to boot, and you already know which sets of code go wtih which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).
In this case, why not create the table with categorical to begin with:
tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"], dictionary.A.Code, dictionary.A.Meaning),...
categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
'VariableNames', {'A', 'B'});
There is no loop, faster and much more readable.
3 Kommentare
Campion Loong
am 9 Okt. 2020
If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.
If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:
Siehe auch
Kategorien
Mehr zu Spreadsheets finden Sie in Help Center und File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!