Extract numbers from a cell array of strings

Question

chlor thanks on 6 Jul 2016

Commented: Guillaume on 7 Jul 2016

I have the following cell array:

s = 
'HI_B2_ *TTT4009*_D452_07052016.xlsx'
'HI_H2G_ *TTT4002*_D259_070516.xlsx'
'HI_B2C_ *4008*_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ *TTT4004*.xlsx'
'HI__ *TTT4003*_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_ *4006*_148_07052016.xlsx'

I would like to extract all the bold numbers into a two-column cell array, like so:

ExM=
'4009', 'HI_B2_ TTT4009_D452_07052016.xlsx'
'4002', 'HI_H2G_ TTT4002_D259_070516.xlsx'
'4008', 'HI_B2C_ 4008_D1482_070516.xlsx'
'4004','HI_A1C_468_070516_ TTT4004.xlsx'
'4003','HI__ TTT4003_862_07052016_G1C.xlsx'
'4006','HI_KA6_ 4006_148_07052016.xlsx'

so that each row holds an extracted number and the corresponding file name. Note that all the extracted numbers begin with "400", and many of them also come right after the letters "TTT".

I tried

regexpi(s, '[\w\s,]*400[\w\s,]*[_;]+','match')

But it did not work correctly and I am also not sure how to make the matrix of two columns without empty strings.

I will appreciate any input or help material to learn from. Thank you very much!!

1 Comment

Guillaume on 6 Jul 2016

Note that within a character group (delimited by []) you can't use character classes such as \w and \s. The way to write [\w\s,] would be

(?:\w|\s|,)

or just expand the character classes:

[a-zA-Z0-9_ \f\n\r\t\v,]
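As a quick check that the two spellings above describe the same set of characters, here is a minimal sketch. The test string `str` and the variable names `m1` and `m2` are made up for illustration and do not come from the thread.

```matlab
% Hypothetical test string, chosen only for illustration.
str = 'ab c,d;e';

% Alternation of the individual classes inside a non-capturing group.
m1 = regexp(str, '(?:\w|\s|,)+', 'match', 'once');

% The same characters written out explicitly in a single character group.
m2 = regexp(str, '[a-zA-Z0-9_ \f\n\r\t\v,]+', 'match', 'once');

% Both patterns stop at the semicolon, which is in neither set.
disp(isequal(m1, m2))
```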

Answer 1

Azzi Abdelmalek on 6 Jul 2016

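s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_H2G_ TTT4002_D259_070516.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'
     'HI_A1C_468_070516_ TTT4004.xlsx'
     'HI__ TTT4003_862_07052016_G1C.xlsx'
     'HI_B2C_ 008_D1487_070516.xlsx'
     'HI_KA6_4006_148_07052016.xlsx'};
% Capture the digits starting with 400; each cell of a holds a token cell, or is empty if there is no match
a = regexp(s, '.*(400\d*).*', 'tokens', 'once')
% Logical index of the file names that matched
idx = ~cellfun(@isempty, a)
% Unwrap the tokens into a column and pair them with the matching file names
out = [[a{:}]' s(idx)]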

3 Comments

chlor thanks on 7 Jul 2016

Edited: chlor thanks on 7 Jul 2016

The stars were only there because I wanted to make the numbers bold, but somehow the formatting turned into stars. So I tried without the stars, but it still gives me this:

s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
>> hi = regexp(s, '(?:TTT)?)400\d', 'match', 'once')
hi = 
    ''
    ''
    ''
    ''
    ''
    ''
    ''
    ''

Still, thank you for the awesome explanation! I have been trying to learn on my own for so many days but it is very hard to find examples which are best to learn from, a lot of times I get stuck over and over again. Your notes help me a lot, thank you so so so much Guillaume!

Guillaume on 7 Jul 2016

Unfortunately, MATLAB's regexp engine does not throw an error when the regular expression is invalid. Your expression has unbalanced parentheses (an extra closing parenthesis after the first ?), so it is not valid.

'(?:TTT)?400\d' would have worked. This returns the TTT portion in the match if it is present, which is probably not what you want.

'(?<=(?:TTT)?)400\d' would also work. The TTT would not be returned in the match. It is just a requirement that the match be preceded by TTT. However, since that requirement is optional (sic!) because of the last ?, it actually serves no purpose and may just as well be omitted. (In my regex, the star was part of the requirement and was not optional).
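To make the difference between including TTT in the match and merely requiring it concrete, here is a small sketch on two of the file names from the question. It uses a mandatory look-behind, `(?<=TTT)`, rather than the optional one discussed above, so the requirement has a visible effect. The variable names `names`, `withPrefix`, and `afterTTT` are made up for illustration.

```matlab
% Two file names from the question: one with TTT, one without.
names = {'HI_B2_ TTT4009_D452_07052016.xlsx'
         'HI_B2C_ 4008_D1482_070516.xlsx'};

% Optional non-capturing group: TTT becomes part of the match when present.
withPrefix = regexp(names, '(?:TTT)?400\d', 'match', 'once');

% Mandatory look-behind: TTT must precede the number but is not returned.
% Names without TTT yield an empty match.
afterTTT = regexp(names, '(?<=TTT)400\d', 'match', 'once');

disp(withPrefix)
disp(afterTTT)
```

With the optional group, the first name gives 'TTT4009' and the second '4008'. With the mandatory look-behind, the first gives '4009' and the second gives no match.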

So, '400\d' is probably what you need then.
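Putting this together with the original question, one way to build the two-column result without empty rows might look like the sketch below. It assumes that "400" never appears elsewhere in a file name (for example inside the date part), because 'once' returns only the first occurrence. The variable names `num`, `keep`, and `ExM` are chosen for illustration.

```matlab
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_H2G_ TTT4002_D259_070516.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'
     'HI_A1C_468_070516_ TTT4004.xlsx'
     'HI__ TTT4003_862_07052016_G1C.xlsx'
     'HI_B2C_ 008_D1487_070516.xlsx'
     'HI_KA6_4006_148_07052016.xlsx'};

% First occurrence of 400 followed by one digit; empty where nothing matches.
num = regexp(s, '400\d', 'match', 'once');

% Keep only the rows that produced a match.
keep = ~cellfun(@isempty, num);

% Two-column cell array: extracted number, corresponding file name.
ExM = [num(keep), s(keep)];
disp(ExM)
```

Because 'match' with 'once' returns a plain character vector per file name, no unwrapping of nested token cells is needed before concatenating the two columns.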

If you want to know whether or not the 400\d is actually preceded by TTT:

regexp(s, '(TTT)?(400\d)', 'tokens', 'once')

and if there's a match, see if the first token is empty (no TTT) or not.
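A minimal sketch of that check on three of the file names from the question follows. The variable names `names`, `tok`, `matched`, and `hasTTT` are made up for illustration. The sketch also tests the number of tokens, so it does not rely on how an unmatched optional token is reported.

```matlab
names = {'HI_B2_ TTT4009_D452_07052016.xlsx'
         'HI_B2C_ 4008_D1482_070516.xlsx'
         'HI_B2C_ 008_D1482_070516.xlsx'};

% One token cell per file name; empty where the pattern does not match.
tok = regexp(names, '(TTT)?(400\d)', 'tokens', 'once');

% Which names contain a 400x number at all.
matched = ~cellfun(@isempty, tok);

% For the matched names, record whether the TTT token is non-empty.
hasTTT = false(size(names));
for k = find(matched)'
    hasTTT(k) = numel(tok{k}) == 2 && ~isempty(tok{k}{1});
end

disp([matched, hasTTT])
```

For these names, the first has a number preceded by TTT, the second has a number without TTT, and the third has no match.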
