How to fix my attempt to vectorize counts of strings and regexpPatterns in a text file?

Question

Jude on 27 Dec 2023

Direct link to this question:

https://de.mathworks.com/matlabcentral/answers/2064526-how-to-fix-my-attempt-to-vectorize-counts-of-strings-and-regexppatterns-in-a-text-file

Commented: Jude on 28 Dec 2023

Accepted Answer: Stephen23

REVISED:

Hello Folks,

I am having difficulty vectorizing the count of lines in a data file, File_1_rev1.txt, that contain search terms, where each term can be either a literal string or a regular expression pattern. The attached file is small for the purpose of this example. The actual file I want to parse is typically 2 TB in size, so I want to perform the counts as efficiently as possible.

Objective:

Minimize the processing time for counting the lines in File_1_rev1.txt that contain occurrences of strings or regexpPatterns, and output the counts in a table.

Desired output:

Code Issue:

The output I get from the code provided below is incorrect. How do I define the variable C correctly, so that lines containing regular expression patterns are counted and I get the desired output shown above?

clear
clc
SearchTerms = {...
                'Term_1', 'Blanket';...
                'Term_2', 'blah';...
                'Term_3', 'of';...
                'Term_4', '(dat|not)\d{1}';...
                'Term_5', '(dat|not)\d{23}'...
              };
Term_IDs = SearchTerms(:,1);       % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2);  % string/regexpPattern to count
Num_SearchTerms = height(SearchTerms);
fid = fopen('File_1_rev1.txt');
Text = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Lines = Text{1,1};
C = categorical(Lines, Term_Patterns, Term_IDs);
[TermCounts,Categories] = histcounts(C);
Result = cell2table(cell(0,Num_SearchTerms), 'VariableNames', Term_IDs');
Result = [Result; num2cell(TermCounts)]
Result = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______

      0         0         0         0         0   
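
A likely reason for the all-zero counts, offered as a sketch rather than a diagnosis of the full file: categorical(Lines, valueset, catnames) assigns a category only when an element equals a valueset entry exactly. It does not treat the valueset entries as substrings or regular expressions. The three sample lines below are made up for illustration:

```matlab
% Illustrative lines (not taken from the real file).
Lines = ["Blanket Blanket Blanket"; "blah"; "dat3 field blah"];

% valueset entries are compared for exact equality with each element.
C = categorical(Lines, ["Blanket", "blah"], ["Term_1", "Term_2"]);

% Only the line that is exactly "blah" receives a category;
% the other two lines become <undefined>.
counts = countcats(C);
nUndefined = nnz(isundefined(C));
```

Because whole lines rarely equal a search term, almost every line ends up undefined and the histogram over the categories stays at zero.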

1 Comment

Jude on 28 Dec 2023

I have also tried the following, without success:

1. The trouble with the line below is getting regexpPattern to work:

C = categorical(Lines, Term_Patterns,Term_IDs,"Ordinal",true);

2. The line below looked workable, but I am having trouble with the implementation:

C = discretize(Lines, contains(Lines, regexpPattern(Term_Patterns)), 'categorical', Term_IDs')

3. I am currently looking into using the dictionary function to convert Lines into a line-by-line representation of Term_IDs where applicable, and then following up with the categorical and histcounts functions to get the counts.
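
One thing to keep in mind with a one-label-per-line representation (as in attempts 1 and 3): a single line can match several search terms, so a single categorical value per line cannot record all of its matches. A minimal sketch of an alternative, assuming the regexpPattern function (R2020b or later) is available, is a logical matrix with one row per line and one column per term. The sample lines here are made up for illustration:

```matlab
% Illustrative lines and a subset of the search terms.
Lines = ["Blanket Blanket Blanket"; "a test of your"; ...
         "dat3 field blah"; "blah Blah"];
Term_Patterns = {'Blanket'; 'blah'; 'of'; '(dat|not)\d{1}'};

P = regexpPattern(Term_Patterns);           % one pattern object per term
M = false(numel(Lines), numel(P));          % rows: lines, columns: terms
for k = 1:numel(P)
    M(:,k) = contains(Lines, P(k));         % true where line matches term k
end

counts       = sum(M, 1);                   % matching lines per term
termsPerLine = sum(M, 2);                   % terms matched by each line
```

Here "dat3 field blah" matches both the 'blah' term and the '(dat|not)\d{1}' term, which is exactly the case a single category per line would lose.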

Answer 1

Stephen23 on 28 Dec 2023

Direct link to this answer:

https://de.mathworks.com/matlabcentral/answers/2064526-how-to-fix-my-attempt-to-vectorize-counts-of-strings-and-regexppatterns-in-a-text-file#answer_1379551

Edited: Stephen23 on 28 Dec 2023

File_1_rev1.txt

SearchTerms = {...
    'Term_1', 'Blanket';...
    'Term_2', 'blah';...
    'Term_3', 'of';...
    'Term_4', '(dat|not)\d{1}';...
    'Term_5', '(dat|not)\d{23}'...
    };
Term_IDs      = SearchTerms(:,1);  % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2);  % string/regexpPattern to count
L = readlines('File_1_rev1.txt')
L = 5729×1 string array
    "Blanket Blanket Blanket"
    ""
    "This"
    "is a test"
    "a test Of your"
    "testing system"
    "this text does"
    "not mean anything."
    "! Do not5 mind spe$cial charac7er5~"
    "not mean anything."
    ""
    ""
    "this text does"
    "testing system"
    "a test of your"
    "is a test"
    "This"
    "55 !! && Test"
    "dat3 field blah"
    "blah Blah"
    "case sensitive or not"
    "might want to create counts"
    "for each maybe not.  This"
    "is the end oF an example,"
    "instead of having actual"
    "data with millions of lines"
    "of text. "
    ""
    ""
    "This"
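P = regexpPattern(Term_Patterns);  % convert each search term to a pattern object
F = @(p)nnz(contains(L,p));        % number of lines that contain pattern p
V = arrayfun(F,P)                  % one count per search term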
V = 5×1
     1
   424
   848
   424
     0
T = unstack(table(V,Term_IDs),'V','Term_IDs')
T = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______

      1        424       848       424        0   

2 Comments

Dyuman Joshi on 28 Dec 2023

+1 for readlines()

Jude on 28 Dec 2023

@Stephen23, thank you for sharing your solution with me. I like this vectorized approach.
