REVISED:
Hello Folks,
I am having difficulty vectorizing the counting of occurrences of lines in a data file, File_1_rev1.txt, containing search terms that can either be strings or regular expression patterns. The attached file is small in size for the purpose of this example. The actual file I want to parse is typically 2TB in size so I want to perform counts as efficiently as possible.
Objective:
Minimize the processing time for counting lines in FIle_1_rev1.txt containing occurrences of strings or regexpPatterns and output count results in a table.
Desired output:
Code Issue:
Output I get for the code provide below is incorrect. How do I define variable <C> correctly to count lines containing regular expression patterns so that I get the desired output, shown above?
'Term_4', '(dat|not)\d{1}';...
'Term_5', '(dat|not)\d{23}'...
Term_IDs = SearchTerms(:,1);
Term_Patterns = SearchTerms(:,2);
Num_SearchTerms = height(SearchTerms);
fid = fopen('File_1_rev1.txt');
Text = textscan(fid, '%s', 'Delimiter', '\n');
C = categorical(Lines, Term_Patterns, Term_IDs);
[TermCounts,Categories] = histcounts(C);
Result = cell2table(cell(0,Num_SearchTerms), 'VariableNames', Term_IDs');
Result = [Result; num2cell(TermCounts)]
Result =
Term_1 Term_2 Term_3 Term_4 Term_5
______ ______ ______ ______ ______
0 0 0 0 0