Alternative for regex to find line that characters are repeated on consecutively.

Question

N/A on 15 Jul 2021

Direct link to this question:

https://de.mathworks.com/matlabcentral/answers/879103-alternative-for-regex-to-find-line-that-characters-are-repeated-on-consecutively

Commented: Rik on 19 May 2023

Currently I have a very large block of data that looks like this (with very many entries):

>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1
MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQE
LYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLD
AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF
>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1
MDPEVSLLLLCPLGGLSQEQVAVELSPAHDRRPLPGGDKAITAIWETRQQAQPWIFDAPK
FRLHSATLVSSSPEPQLLLHLGLTSYRDFLGTNWSSSASWLRQQGAADWGDKQAYLADPL
GVGAALVTADDFLVFLRRSQQVAEAPGLVDV

I am trying to write a script that finds runs of ten or more consecutive E/D characters, like the one at the end of the first block of data above. I also want to know which lines of the large text file those runs were found on. Basically, I am asking for an alternative to regex, as I have not found any way to make a pattern to do so on regex. If anyone has any good suggestions, I would appreciate them. This is part of the code I was using before.

inp = {''};
form = '[de]{10,}';
calc = regexp(inp,form,'match');
idx = cellfun(@(c)any(cellfun(@numel,c)>=10),calc); % >= so that runs of exactly ten also count
find(idx)
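For reference, the run-length check described above can also be done without regular expressions. The following is only a minimal sketch: the cell array `lines` is hypothetical sample data standing in for the lines of the real file, and `minRun` and `hitLines` are names invented for the illustration.

```matlab
% Hypothetical sample lines; in practice these would come from the file.
lines = {'>sp header one'; 'SEVEEEEDDEDEDEEEFF'; 'GVGAALVTADDFLV'};
minRun = 10;      % shortest run of D/E characters that counts as a hit
hitLines = [];    % line numbers that contain a long enough run

for k = 1:numel(lines)
    isDE = ismember(lines{k}, 'DE');     % true where the character is D or E
    d = diff([0, isDE, 0]);              % +1 marks a run start, -1 the position after a run end
    runLengths = find(d == -1) - find(d == 1);
    if any(runLengths >= minRun)
        hitLines(end+1) = k;             % record the line number
    end
end
```

Padding the logical vector with a zero on each side guarantees that every run has both a start and an end marker, so the two `find` calls always return vectors of equal length.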

5 Comments

Stephen23 on 15 Jul 2021

Edited: Stephen23 on 15 Jul 2021

"...I have not found any way to make a pattern to do so on regex"

What is the specific problem with that regular expression? It works for me, when adjusted for the characters you want:

inp = {'>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1 MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQELYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLDAIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF';...
'>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1 MDPEVSLLLLCPLGGLSQEQVAVELSPAHDRRPLPGGDKAITAIWETRQQAQPWIFDAPKFRLHSATLVSSSPEPQLLLHLGLTSYRDFLGTNWSSSASWLRQQGAADWGDKQAYLADPLGVGAALVTADDFLVFLRRSQQVAEAPGLVDV'};
rgx = '[DE]{10,}';
tmp = regexp(inp,rgx,'once');
idx = ~cellfun(@isempty,tmp)
idx = 2×1 logical array
   1
   0

Did you notice that your regular expression matches the lowercase characters 'd' and 'e', although the data you want to match consists of the uppercase characters 'D' and 'E'? Did you attempt to match the correct character case, or to use REGEXPI?
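The effect of character case can be checked directly. This is a small sketch on two shortened sample lines (taken from the ends of the sequences above, not the full records); the variable names are chosen only for the illustration.

```matlab
% Two shortened sample lines, uppercase as in the real data.
inp = {'AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF'; ...
       'GVGAALVTADDFLVFLRRSQQVAEAPGLVDV'};

% Lowercase class with case-sensitive REGEXP: nothing matches uppercase data.
idxSensitive = ~cellfun(@isempty, regexp(inp, '[de]{10,}', 'once'));

% Same lowercase class with REGEXPI, which ignores case.
idxInsensitive = ~cellfun(@isempty, regexpi(inp, '[de]{10,}', 'once'));

% REGEXP with the 'ignorecase' option behaves like REGEXPI.
idxOption = ~cellfun(@isempty, regexp(inp, '[de]{10,}', 'once', 'ignorecase'));
```

Only the first sample line contains a run of ten or more D/E characters, so the two case-insensitive variants flag that line and the case-sensitive call flags none.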

Stephen23 on 15 Jul 2021

You can loop over blocks of the lines using TEXTSCAN: did you try that?
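A rough sketch of what such a block-wise loop might look like is given here. It writes a tiny hypothetical file first so that it is self-contained; `blockSize`, `hits` and `offset` are illustrative names, and the line numbering assumes that the file contains no blank lines (how TEXTSCAN treats those should be checked on the real data).

```matlab
% Write a small hypothetical file so the sketch can run on its own.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', '>sp header one', 'MSVRTLPLLF', 'SEVEEEEDDEDEDEEEFF', ...
    '>sp header two', 'GVGAALVTADDFLV');
fclose(fid);

blockSize = 2;    % lines per block (assumed; much larger for a real file)
hits = [];        % line numbers containing a run of 10 or more D/E
offset = 0;       % number of lines read in earlier blocks

fid = fopen(fname, 'r');
while ~feof(fid)
    % Read up to blockSize whole lines, keeping spaces inside each line.
    C = textscan(fid, '%s', blockSize, 'Delimiter', '\n', 'Whitespace', '');
    blk = C{1};
    if isempty(blk), break; end
    found = ~cellfun(@isempty, regexp(blk, '[DE]{10,}', 'once'));
    hits = [hits; offset + find(found)];   % convert block index to file line number
    offset = offset + numel(blk);
end
fclose(fid);
delete(fname);
```

Because only one block is held in memory at a time, the number of strings processed per REGEXP call stays bounded by `blockSize`.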

Rik on 19 May 2023

I recovered the removed content from the Google cache (something which anyone can do). Editing away your question is very rude. Someone spent time reading your question, understanding your issue, figuring out the solution, and writing an answer.

You chose to publish the contents of your question. You can't retract that now.


Answer 1

Walter Roberson on 15 Jul 2021

Direct link to this answer:

https://de.mathworks.com/matlabcentral/answers/879103-alternative-for-regex-to-find-line-that-characters-are-repeated-on-consecutively#answer_747433

%alternative without splitting
S = fileread(filename);
matches = regexp(S, '^.*[DE]{10}.*$', 'match', 'dotexceptnewline', 'lineanchors');
matches

This is the entire code other than setting the file name. It does not split the file, so your objection to 20000 strings is avoided. It produces the lines directly without any post-processing cellfun. It is quite efficient. It has been tested.

It was also already posted in your earlier question; the only difference is DE vs de.
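If the line numbers are wanted rather than the text of the matching lines, one possible extension is to take the start index of each run and count the newlines that precede it. This is a sketch under assumptions: `S` is built here from hypothetical sample lines instead of `fileread(filename)`, and `nlBefore` and `lineNums` are names made up for the example.

```matlab
% Hypothetical stand-in for S = fileread(filename).
S = sprintf('%s\n', '>sp header one', 'MSVRTLPLLF', 'SEVEEEEDDEDEDEEEFF', ...
    '>sp header two', 'GVGAALVTADDFLV');

starts = regexp(S, '[DE]{10,}');       % start index of every run of 10 or more D/E
nlBefore = cumsum(S == char(10));      % newlines seen up to each character position
lineNums = unique(nlBefore(starts) + 1);   % line number of each run, without repeats
```

Like the code above, this does not split the file into separate strings; the only extra storage is one numeric vector of the same length as `S`.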
