How to find strings in a very large array of data?

11 views (last 30 days)
Steven
Steven on 20 Nov 2019
Edited: per isakson on 23 Nov 2019
Hi
I have a CSV file containing a large number of numbers and a few random strings like 'zgdf'. I need to find those strings and set them to zero. I cannot use 'csvread' (because of the strings), so I use 'textscan' to read the file.
I then convert the data to numbers using str2double. MATLAB turns the string values into NaN, which is fine for me, but it takes a long time, especially because this has to be done for many similar files.
Is there a faster way to do this?
This is how I read the data (the original file has two columns and a large number of rows):
fileID = fopen(filename);
C = textscan(fileID,'%s %s','Delimiter',',');
fclose(fileID);
for i = 1:length(C{1})
    D(i) = str2double(C{1}{i});
end
Thanks
  10 comments
Adam Danz
Adam Danz on 21 Nov 2019
Edited: Adam Danz on 21 Nov 2019
Knowing your MATLAB release is usually helpful, which is why it's included as an optional field when you're posting a question in this forum.
I've confirmed that the loop version of str2double() is indeed faster than applying it directly to the cell array. Sometimes loops are faster.
See method 3 in my answer, which applies your sscanf idea and avoids the error you described.
See method 4 for a FEX function that works like str2double() but is much faster.
Method 5 is very fast but requires R2019a.
Lastly, whenever you build a variable within a loop, always pre-allocate it. Not pre-allocating the variable will definitely slow down your code.
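For illustration, a minimal sketch of that difference (the variable names follow the code in the question):
% Growing the result inside the loop forces MATLAB to resize it on every iteration:
D = [];
for i = 1:numel(C{1})
    D(i) = str2double(C{1}{i}); %#ok<SAGROW>
end
% Pre-allocating once avoids the repeated resizing:
D = zeros(size(C{1}));
for i = 1:numel(C{1})
    D(i) = str2double(C{1}{i});
end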
Ridwan Alam
Ridwan Alam on 21 Nov 2019
Edited: Ridwan Alam on 21 Nov 2019
@Steven
I have updated my answer with the syntax for textscan with the "TreatAsEmpty" option. It returns NaN in place of those known noisy characters. Adding the 'EmptyValue',0 option returns 0 instead of NaN.
Not sure how much of a speedup that will give, though :(


Accepted Answer

Adam Danz
Adam Danz on 20 Nov 2019
Edited: Adam Danz on 21 Nov 2019
[This answer has been reorganized following the discussion in the comment section under the question]
Method 1
fid = fopen('myCSVfile.csv');
C = textscan(fid,'%s %s','Delimiter',',');
fclose(fid);
A = str2double(C{1}); % Faster than doing the same thing in a loop.
[Update] The loop method below is actually faster:
A = zeros(size(C{1})); % <--- always pre-allocate!
for i = 1:numel(C{1})
    A(i) = str2double(C{1}{i});
end
Method 2
Try this modification of the script produced by the Import Data tool. Rather than importing your data and then converting it with str2double(), this imports the data as numeric and replaces non-numeric elements with NaN. I think it should be faster than your approach, though I doubt it is much faster (or maybe it is not faster at all).
The only two variables you'll need to change to adapt it to your data are
  • file (the file name or, preferably, the full path to your file)
  • the NumVariables value (the number of columns of data)
%% Setup the Import Options and import the data
file = "C:\Users\name\Documents\MATLAB\myCSVfile.csv"; % Full path to your file (or just file name)
opts = delimitedTextImportOptions("NumVariables", 2); % Number of columns of data
opts.VariableTypes(:) = {'double'}; % read all data as double (NaN for strings)
opts.Delimiter = ",";
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
Data = readtable(file, opts); % Read in as table
Data = Data{:,:}; % Convert to matrix
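If the resulting NaN values then need to be set to zero, as the question asks, a short follow-up along these lines would do it:
Data(isnan(Data)) = 0; % replace the NaN placeholders with zeros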
Method 3
D = zeros(size(C{1})); % <--- pre-allocate!
for j = 1:length(C{1})
    s = sscanf(C{1}{j},'%f');
    if ~isempty(s)
        D(j) = s;
    end
end
This is 4.5x faster than method 1.
Method 4
This FEX function is designed to overcome the slow speed of str2double()
Method 5
A very fast solution is to read the data in using readmatrix(), which automatically converts non-numeric elements to NaN, but it requires R2019a.
file = 'myCSVfile.csv';
D = readmatrix(file); %that's it, just 2 lines
  3 comments
Steven
Steven on 21 Nov 2019
Edited: Steven on 21 Nov 2019
Thanks Adam,
I tried this on R2018b and Method 2 was much faster! Thanks.
On my PC, this is how long each method took for a given file:
Method 1: 5.8 s
Method 2: 0.6 s
Method 3: 3.1 s
I couldn't check method 5 though.
Great experience!
Thanks guys
Adam Danz
Adam Danz on 21 Nov 2019
Thanks for the feedback!


More Answers (2)

Ridwan Alam
Ridwan Alam on 20 Nov 2019
Edited: Ridwan Alam on 21 Nov 2019
Given that the list of noise strings is {'a', 'b', 'ee'}:
C = cell2mat(textscan(fileID,'%f %f','Delimiter',',','TreatAsEmpty',{'a','b','ee'},'EmptyValue',0));
Try this!!
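For completeness, a minimal end-to-end sketch of this approach (the file name is illustrative):
fileID = fopen('myCSVfile.csv');
C = cell2mat(textscan(fileID, '%f %f', 'Delimiter', ',', ...
    'TreatAsEmpty', {'a','b','ee'}, 'EmptyValue', 0));
fclose(fileID);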
%% Old Answer
Updated using Method 1 from Adam:
C = textscan(fileID,'%s %s','Delimiter',',');
C = [str2double(C{1}) str2double(C{2})];
C(isnan(C)) = 0;
  9 comments
Steven
Steven on 21 Nov 2019
Thank you Ridwan.
Ridwan Alam
Ridwan Alam on 21 Nov 2019
Sure, Steven. Please vote up if you liked the conversation. Thanks!



per isakson
per isakson on 21 Nov 2019
Edited: per isakson on 23 Nov 2019
"random strings like 'zgdf'" If that means letters of the US alphabet, this code is rather fast.
%%
chr = fileread('cssm.txt');
chr = regexprep( chr, '[A-Za-z]+', '0.0' );
cac = textscan( chr, '%f%f', 'Delimiter',',', 'CollectOutput',true );
num = cac{1};
Result:
>> num(1:10,:)
ans =
    0.81472    0.15761
          0    0.97059
    0.12699    0.95717
    0.91338    0.48538
    0.63236    0.80028
    0.09754    0.14189
     0.2785          0
    0.54688    0.91574
          0    0.79221
    0.96489    0.95949
Where cssm.txt contains
0.81472, 0.15761
abc , 0.97059
0.12699, 0.95717
0.91338, 0.48538
0.63236, 0.80028
0.09754, 0.14189
0.27850, def
0.54688, 0.91574
zgdf , 0.79221
0.96489, 0.95949
et cetera
In response to comments
See the caveat in the first line of my answer.
I have not found a regular expression for "not a legal number", and if one exists it might not be that fast.
It's straightforward to add a few extra characters (many becomes impractical), e.g. '^â', and to make sure that the string is followed by a comma or the end of the line.
>> chr = regexprep( '12.3, abc, g^â, 1.0e5, def ', '(?m)[A-Za-zâ^]+(?=\x20*\r?(,|$))', '0.0' )
chr =
'12.3, 0.0, 0.0, 1.0e5, 0.0 '
>>
Look ahead, e.g. '(?=\x20*\r?(,|$))', is reasonably fast, but look behind sometimes ruins the performance.
The above regex fails for 'def1', '1deg' and '10a'
fileread in combination with CRLF as the newline sequence poses a problem when using regular expressions: the anchor $ doesn't recognise CRLF as a newline. (Please tell me if I missed something.) The best way to avoid this problem is to replace fileread by a function that uses
[fid, msg] = fopen( filespec, 'rt' );
chr = fread( fid, inf, '*char' );
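For example, a minimal sketch of such a replacement (the function name readCRLF is illustrative):
function chr = readCRLF( filespec )
    % Open in text mode ('rt') so CRLF is converted to LF while reading,
    % which lets the regexp anchor $ behave as expected.
    [fid, msg] = fopen( filespec, 'rt' );
    assert( fid ~= -1, msg )
    chr = reshape( fread( fid, inf, '*char' ), 1, [] ); % row vector of characters
    fclose( fid );
end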
  5 comments
Steven
Steven on 21 Nov 2019
Edited: Steven on 21 Nov 2019
Thanks Per.
Sometimes, characters include something like "g^â".
per isakson
per isakson on 22 Nov 2019
Edited: per isakson on 22 Nov 2019
I added a response to my answer.

