How to capture tokens using regular expressions?

Question

Patrick Mboma on 16 Sep. 2015

https://de.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions

Commented: Cedric on 19 Sep. 2015

Dear all, I would like to capture two parts of a sequence of strings. I would like to call the first part "main" and the second part "digits". The expressions in the strings have a distinct pattern in that they either have ONE underscore or parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parenthesis (digits). As an example, the typical exercise will be of the form

 expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
 out=regexp(expression,pattern,'name')

The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be

main='abcd' and digits='1'.

What I am missing is the right "pattern". Any suggestions?

5 Comments (3 older comments not shown)

Cedric on 17 Sep. 2015

Edited: Cedric on 17 Sep. 2015

Dear Patrick,

In summary, for extracting and validating digits and a decimal point, I would write a pattern like

'(.*?)[\(_]([\d\.]*)'

which explicitly requires the second part to consist of zero or more (*) elements of the set ([]) of digits (\d) or the decimal point (\.). If, instead, I wanted to leave validation to STR2DOUBLE, I would extract whatever is inside the parentheses or after the underscore:

'(.*?)[\(_]([^\)]*)'

which translates into zero or more (*) characters that are not in the set ([^]) containing the literal closing parenthesis. Another way is given by Benjamin, who adds an optional closing parenthesis.
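To make the difference between the two patterns concrete, here is a minimal sketch that runs both on the strings from the question. The named tokens main and digits are added here to match the output the question asks for; they are not part of the patterns as written above, and the variable names are only illustrative.

```matlab
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};

% Strict variant: the second token may only contain digits and decimal points.
strictPattern = '(?<main>.*?)[\(_](?<digits>[\d\.]*)';
% Permissive variant: the second token is anything up to a closing parenthesis.
loosePattern  = '(?<main>.*?)[\(_](?<digits>[^\)]*)';

strictOut = regexp(expression, strictPattern, 'names', 'once');
looseOut  = regexp(expression, loosePattern,  'names', 'once');

% Each cell holds a scalar struct with the fields main and digits.
first  = strictOut{1};    % fields for 'abcd_1'
second = looseOut{2};     % fields for 'ghsa(22)'

% Validation is left to str2double, which returns NaN for non-numeric text.
values = cellfun(@(s) str2double(s.digits), looseOut);
```

With the permissive variant, a malformed entry such as 'abcd_x1' would still match and simply show up as NaN in values.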

I also asked about how these strings are defined initially, because the context is important. If you are dealing with a reasonable number of cells, performing pattern matching on a cell array will be efficient enough. If, on the contrary, you have e.g. a 1GB file of entries to process, you may be much more efficient working on it "manually". To illustrate, say the file contains

 name1_45 
 name2(45)
 name2b_32
 name2c(84)
 ..

then you could load it as a char array, replace all '_', '(', ')', new lines, and carriage returns with white spaces, and extract names and contents in one shot with SSCANF or TEXTSCAN:

 % - Dummy file content.
 content = sprintf( 'name1_45\nname2(45)\nname2b_32\nname2c(84)\n' ) ;
 % - Flag elements to replace.
 doReplace = content == '_' | content == '(' | content == ')' | content == 10 ;
 % - Replace with space.
 content(doReplace) = ' ' ;
 % - Parse.
 parsed = textscan( content, '%s %f' ) ;

(10 is the ASCII code of the new line \n; the code should also handle 13 for the carriage return. It may be possible to make it even more efficient using BSXFUN.) With that we get

 >> parsed
 parsed = 
    {4x1 cell}    [4x1 double]
 >> parsed{1}
 ans = 
    'name1'
    'name2'
    'name2b'
    'name2c'
 >> parsed{2}
 ans =
    45
    45
    32
    84
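The note about carriage returns could be handled, for instance, with ISMEMBER, which flags all separator characters in one call. This is a sketch of one option, not a benchmarked replacement for the code above; the sample content with Windows-style line endings is made up for illustration.

```matlab
% - Dummy file content with Windows-style line endings (\r\n).
content = sprintf('name1_45\r\nname2(45)\r\nname2b_32\r\nname2c(84)\r\n');
% - Flag separators: underscore, parentheses, LF (10) and CR (13).
doReplace = ismember(content, ['_()', char(10), char(13)]);
% - Replace with spaces so that textscan sees whitespace-separated fields.
content(doReplace) = ' ';
% - Parse: names go to parsed{1} (cell array), values to parsed{2} (double).
parsed = textscan(content, '%s %f');
```

Note that this only works as long as the names themselves contain no underscores or parentheses.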

Patrick Mboma on 19 Sep. 2015

Thanks a lot Cedric!!!

Cedric on 19 Sep. 2015

My pleasure!

Answer 1

Benjamin Kraus on 16 Sep. 2015

https://de.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192653

expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');

The pattern breaks down like this:

(?<main>[a-zA-Z]+) - A token named "main" containing only letters.
(?:[_\(]) - A non-capturing group containing either an underscore or "(".
(?<digits>[0-9]+) - A token named "digits" containing only digits.
\)? - An optional ")" character at the end.

The 'once' option means the pattern is matched only once per input string. I think in this case you can leave it out.
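For reference, here is a short sketch of how the output can be inspected. With a cell array input, regexp returns a cell array of the same size, and with 'once' each cell holds a scalar struct, so the results can be concatenated into a struct array. The variable names s, mains and digitValues are illustrative, and the pattern is written with the closing parenthesis escaped.

```matlab
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
% Letters, then '_' or '(', then digits, then an optional ')'.
pattern = '(?<main>[a-zA-Z]+)[_\(](?<digits>[0-9]+)\)?';
out = regexp(expression, pattern, 'names', 'once');

% Concatenate the scalar structs into a 1x4 struct array.
s = [out{:}];
mains       = {s.main};                 % cell array of char vectors
digitValues = str2double({s.digits});   % numeric row vector
```

The concatenation assumes every string matches; a string without a match would contribute an empty struct and shorten the result.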

1 Comment

Patrick Mboma on 17 Sep. 2015

Dear Benjamin,

Thanks for your input. Your solution would work but would probably need to be refined, in the sense that the first part, main, may also include some digits. For instance,

whatever345whatever_100

would also be something I would like to capture. It is the second part that would only include digits.

A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
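The rule described here maps fairly directly onto a pattern: a negated character class for main, one separator, then the digits and an optional closing parenthesis. The following is one possible sketch, not necessarily the pattern the thread settled on. The anchors ^ and $ are an added assumption, namely that each string holds exactly one such expression and nothing else.

```matlab
expression = {'abcd_1','ghsa(22)','whatever345whatever_100'};
% main:   one or more characters that are neither '_' nor '('
% digits: one or more digits, optionally followed by ')'
pattern = '^(?<main>[^_\(]+)[_\(](?<digits>\d+)\)?$';
out = regexp(expression, pattern, 'names', 'once');

% The third string has digits inside main as well as after the underscore.
last = out{3};
```

Because main is defined by what it excludes rather than by a list of letters, digits inside the first part are captured without any special handling.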

Answer 2

Kirby Fears on 16 Sep. 2015

https://de.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192648

This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.

ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
% Split each string at '_', '(' or ')'.
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
% The first piece is the main part, the second piece holds the digits.
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;

1 Comment

Patrick Mboma on 17 Sep. 2015

Dear Kirby,

There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.

In my current solution for instance, I first use regular expressions to transform all the inputs into the same format

whatever_45

then I look for the underscore, etc. But this entails several lines of code.

Thanks for your input!
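The code behind the two-step approach mentioned in this comment is not shown in the thread. Purely as an illustration of what such a normalization might look like, here is a hypothetical sketch: regexprep first rewrites name(digits) as name_digits, and the result is then split at the underscore. All variable names are made up for this example.

```matlab
ex = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
% Step 1: rewrite name(digits) as name_digits; other strings are unchanged.
normalized = regexprep(ex, '\((\d+)\)$', '_$1');
% Step 2: split each normalized string at the (single) underscore.
parts = cellfun(@(s) strsplit(s, '_'), normalized, 'UniformOutput', false);
mainPart  = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
digitPart = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```

Compared with a single named-token pattern, this takes more lines, which is the drawback the comment points out.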
