Alternative to regex for finding lines that contain consecutively repeated characters
Currently I have a very large block of data (many entries) that looks like this:
>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1
MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQE
LYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLD
AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF
>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1
MDPEVSLLLLCPLGGLSQEQVAVELSPAHDRRPLPGGDKAITAIWETRQQAQPWIFDAPK
FRLHSATLVSSSPEPQLLLHLGLTSYRDFLGTNWSSSASWLRQQGAADWGDKQAYLADPL
GVGAALVTADDFLVFLRRSQQVAEAPGLVDV
I am trying to write a script that finds runs of ten or more consecutive E/D characters, as in the first block of data above. I am asking for an alternative to regex because I have not found a way to write a regex pattern that does this. I want to know which lines of the large text file contain such runs. If anyone has a good suggestion for an alternative to regex, I would appreciate it. This is part of the code I was using before:
inp = {''};
form = '[de]{10,}';
calc = regexp(inp,form,'match');
idx = cellfun(@(c)any(cellfun(@numel,c)>10),calc);
find(idx)
5 Comments
Rik
on 19 May 2023
I recovered the removed content from the Google cache (something which anyone can do). Editing away your question is very rude. Someone spent time reading your question, understanding your issue, figuring out the solution, and writing an answer.
You chose to publish the contents of your question. You can't retract that now.
Accepted Answer
Walter Roberson
on 15 Jul 2021
%alternative without splitting
S = fileread(filename);
matches = regexp(S, '^.*[DE]{10}.*$', 'match', 'dotexceptnewline', 'lineanchors');
matches
This is the entire code other than setting the file name. It does not split the file, so your objection to 20000 strings is avoided. It produces the lines directly without any post-processing cellfun. It is quite efficient. It has been tested.
It was also already posted in your earlier question, the only difference being DE instead of de.
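Since the question also asks which lines the runs were found on, here is a minimal sketch of one way to get line numbers without splitting the file. It is not part of the answer above. It assumes lines end in a newline character (char(10)), and the sample text S is made up for illustration; in practice S would come from fileread(filename) as above.

```matlab
% Hypothetical sample standing in for S = fileread(filename);
S = sprintf('AAA\nEEEEDDEDEDEEE\nBBB\nDDDDDDDDDD\n');

% Start index of every run of ten or more consecutive D/E characters.
% A run cannot span a line break, because a newline is neither D nor E.
starts = regexp(S, '[DE]{10,}', 'start');

% Running count of newline characters up to each position in S.
% The line number of a position is that count plus one.
newlineCount = cumsum(S == char(10));

% unique() collapses several runs on the same line into one entry.
lineNumbers = unique(newlineCount(starts)) + 1
% lineNumbers = [2 4] for the sample above
```

This keeps the whole file as a single character vector, so no cell array of individual lines is created. Files with Windows line endings (CR+LF) should give the same numbers, because only the LF characters are counted.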