How can I find numeric values (stored as strings) in table of strings and convert them to double?

71 Ansichten (letzte 30 Tage)
Hello,
I have some experience with Matlab and I've found a lot of solutions online so far, but I can't find anything that would solve my current problem. As there are some threads that are similar but not quite IT, I will try to be as specific as possible.
Background:
  • We have a really huge tool that processes our entire data. It reads a lot of input data (via readtable from Excel-Files), combines all the data, compares, calculates, appends etc. etc. In the end, there is a single table (a literal Matlab table) and we write it to an Excel-File with writetable.
  • We experienced over the course of building this tool that it's easier to force the VariavleType to be string. So basically this is our readtable Code:
opts = detectImportOptions(xls,'Sheet',sheet);
opts = setvartype(opts, opts.SelectedVariableNames, 'string'); % force VariableType to be string (both ext and numbers)
data = readtable(xls,opts,'ReadVariableNames',false,'basic',true);
  • It makes comparisons easier, but of course, if we want to calculate something, we have to use str2double and string again:
data.tq = string(str2double(data.x) + str2double(data.y));
Problem:
So this table, we create, is a table full of strings, where also the numbers are stored as strings. With Matlab2018 writetable worked perfectly or let's say "forgivingly". After writetable, we open the Excel-File via ActiveX and do the formatting directly in Excel based on CustomProperties where we, e.g., defined the number of decimals. Even though, we have a table with numbers stored as strings, we've never had a problem to format them in Excel. They, somehow, were stored as numbers in Excel:
Now I'm testing our tool for Matlab2022 and I guess, Matlab has become more rigorous over time. So when we use writetable now, numbers stored as string in the Matlab table will be also be stored as string in Excel. Which is frustrating because now, the formatting via ActiveX doesn't work anymore .
I tried various things:
  1. PreserveFormat in writetable, but apparently it means preserving the Excel Format
  2. Converting the table to an array and/or cell, trying to convert strings to numbers where there are numbers. But converting a table into an array or cell will lead to numeric values stored as char - same problem.
  3. I tried this: How to extract the numerial values out of readtable output? but cellfun(@isnumeric,table2cell(data)) doesn't work because of issue in 2. Or maybe it does with some modifcations that I don't know yet.
  4. In one case, this workaround works. (But only because, in this case, I could set a CustomProperty that allows me to define the variables that are numbers and unfortunately, this is not always the case. So not a universal solution.)
idxDecNum = find(strcmp('num',data.Properties.CustomProperties.VarOutType)); % Find columns where the CustomProperty VarOutType is defined as 'num'
data = convertvars(out,[idxDecNum],'double'); % convert only those found to double
for i = 1:size(idxDecNum,2) % convert each column to cell and store them in table
data.(data.Properties.VariableNames{idxDecNum(i)}) = num2cell(data.(data.Properties.VariableNames{idxDecNum(i)}));
end
My question(s):
  • Is there a possibility to find numeric values that are stored as string/char within a cell array and to convert only them? To be clear, the columns either contain purely numeric values or purely text. There is no mix of numbers and text within a column. But, of course, entries might be empty. So it would not be possible to check the first row for numeric and text values as the respective value might just be empty in the first row and contain data in the second or any following row.
  • Or is there an ActiveX command that converts all numbers stored as text to numbers? (However, I'd rather fix this problem in Matlab.)
Any kind of ideas are appreciated! But no, I really would like to avoid changing the readtable-command. First, I want to try to find a solution for the end.
Thank you, Laura
  3 Kommentare
Stephen23
Stephen23 am 23 Nov. 2022
So you painted yourself into a corner by relying on some undocumented and very dubious behavior that was later fixed, all to avoid handling numeric data as numeric data? Data design that leads to complicated, inefficient (all of those type conversions, storing numeric data as text) and fragile (proof is in the question) code is perhaps a sign that the data design should be revised.
Although not the advice you requested, perhaps you should revise your overall approach and treat numeric data as ... numeric.
Laura V.
Laura V. am 23 Nov. 2022
@dpb @Stephen23 Thank you for your opinion! :) Voss's solution worked and I will use it for now. However, you are both absolutely right! We should definitely take the time to revise our code completely! :)

Melden Sie sich an, um zu kommentieren.

Akzeptierte Antwort

Voss
Voss am 23 Nov. 2022
"Is there a possibility to find numeric values that are stored as string/char within a cell array and to convert only them?"
Yes.
Example:
% a cell array containing mixed types:
C = {1 '2' "green" []; 'forest' "" 5 "6"; '7' 'tree' 9 NaN}
C = 3×4 cell array
{[ 1]} {'2' } {["green"]} {0×0 double} {'forest'} {["" ]} {[ 5]} {["6" ]} {'7' } {'tree'} {[ 9]} {[ NaN]}
% find the chars and strings:
is_text = cellfun(@(x)ischar(x) || isstring(x),C);
% convert the char/string elements to numeric.
% use cellfun(@str2double,_) rather than str2double alone, in order to
% handle mixed chars and strings (e.g., str2double({'4' "6"}) returns [4 NaN]):
C_numeric = NaN(size(C));
C_numeric(is_text) = cellfun(@str2double,C(is_text));
% take the non-NaN elements of C_numeric (i.e., those that were in fact
% numbers stored as chars/strings) and put them where they belong in C:
is_numeric_text = ~isnan(C_numeric);
C(is_numeric_text) = num2cell(C_numeric(is_numeric_text))
C = 3×4 cell array
{[ 1]} {[ 2]} {["green"]} {0×0 double} {'forest'} {["" ]} {[ 5]} {[ 6]} {[ 7]} {'tree'} {[ 9]} {[ NaN]}
A very similar (simpler) alternative:
% a cell array containing mixed types:
C = {1 '2' "green" []; 'forest' "" 5 "6"; '7' 'tree' 9 NaN};
% apply str2double on all cells (returns NaN for those already numeric):
C_numeric = cellfun(@str2double,C);
% take the non-NaN elements of C_numeric (i.e., those that were in fact
% numbers stored as chars/strings) and put them where they belong in C:
is_numeric_text = ~isnan(C_numeric);
C(is_numeric_text) = num2cell(C_numeric(is_numeric_text))
C = 3×4 cell array
{[ 1]} {[ 2]} {["green"]} {0×0 double} {'forest'} {["" ]} {[ 5]} {[ 6]} {[ 7]} {'tree'} {[ 9]} {[ NaN]}

Weitere Antworten (1)

dpb
dpb am 23 Nov. 2022
Bearbeitet: dpb am 23 Nov. 2022
Well, against better judgement, if are adamant about not fixing the problem at its core, then given the previously stated condition that the table variables are all either string or numeric (and a MATLAB table can't mix data types in a variable anyway, other than using a cell array), then I'd do something more like--
isNumberColumn=varfun(@(v)all(isfinite(str2double(v))),tYourTable,'OutputFormat','uniform');
tYourTable=convertvars(tYourTable,tYourTable.Properties.VariableNames(isNumberColumn),'double');
The above will attempt to convert all columns first, but only those that are determined to be able to be will be put into the table...of course, you could have determined this when reading the file by using detectImportOptions that would have found out which were numeric and handled it right off the bat.
With a well-formed input file, using detectImportOptions is probably not required; the builtin scanning inside readtable and friends will almost certainly get it right unless the input file format isn't kosher.
It's certainly not clear what issue you may have run into originally to have chosen this route; it would still be helpful to see the input file format; it just doesn't seem like one would choose to do the force-to-string thing unless there's an issue in the file format.
  3 Kommentare
Stephen23
Stephen23 am 24 Nov. 2022
Bearbeitet: Stephen23 am 24 Nov. 2022
@Laura V. this is a well known issue and very common, even in commercial SW developement.
There is often a point where the effort required to retain and maintain some code is more than the cost of rewriting it in a better way. Even big apps and online services have to do this sometimes:
Whether you expected it or not at the start of the project, you are now co-authors and developers of code: it is worth acknowledging that, because there is a large body of knowledge and experience that you can benefit from, including the importance of versioning, backups, code helper tools, branching, (backwards/forwards) compatibility, and countless other things that countless developers have learned through trial, error, pain and lots of tears. Learning from the mistakes of others is faster than making all mistakes yourself :)
No one here can tell you if/when code refactoring would be the best for your project... but I recommend that you keep an open mind and stay well-informed :)
dpb
dpb am 24 Nov. 2022
If you're not careful, that 10% will keep growing to counteract the proliferation of tools and datasets...@Stephen23's points are well taken to heart.
One suggestion I'd make going forward in the interim to help would be to build a library of import options objects for a given product/version and save the library of those. Then you can have your main code detect the version/product and use the prebuilt options object for that particular file on import. Ideally, your product design built in a product ID/release that you can read; if not that's a second immediate thing to add going forward.

Melden Sie sich an, um zu kommentieren.

Produkte


Version

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by