Extract numbers from a cell array of strings

chlor thanks on 6 Jul 2016
Commented: Guillaume on 7 Jul 2016
I have the following cell array:
s =
'HI_B2_ *TTT4009*_D452_07052016.xlsx'
'HI_H2G_ *TTT4002*_D259_070516.xlsx'
'HI_B2C_ *4008*_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ *TTT4004*.xlsx'
'HI__ *TTT4003*_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_ *4006*_148_07052016.xlsx'
I would like to extract all the bold numbers into a two-column matrix like so:
ExM=
'4009', 'HI_B2_ TTT4009_D452_07052016.xlsx'
'4002', 'HI_H2G_ TTT4002_D259_070516.xlsx'
'4008', 'HI_B2C_ 4008_D1482_070516.xlsx'
'4004','HI_A1C_468_070516_ TTT4004.xlsx'
'4003','HI__ TTT4003_862_07052016_G1C.xlsx'
'4006','HI_KA6_ 4006_148_07052016.xlsx'
that holds the extracted numbers and the corresponding file names. Note that every extracted number begins with "400", and many of them also follow the letters "TTT".
I tried
regexpi(s, '[\w\s,]*400[\w\s,]*[_;]+','match')
but it did not work correctly, and I am also not sure how to build the two-column matrix without empty strings.
I would appreciate any input or learning material. Thank you very much!
  1 Comment
Guillaume on 6 Jul 2016
Note that within a character group (delimited by []) you can't use character classes such as \w and \s. The way to write [\w\s,] would be
(?:\w|\s|,)
or just expand the character classes:
[a-zA-Z0-9_ \f\n\r\t\v,]

Accepted Answer

Azzi Abdelmalek on 6 Jul 2016
s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
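a = regexp(s, '.*(400\d*).*', 'tokens', 'once')  % token = "400" plus any digits after it; empty where nothing matches
idx = ~cellfun(@isempty, a)                       % true for the file names that contain such a number
out = [[a{:}]' s(idx)]                            % two columns: extracted number, matching file name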
  3 Comments
chlor thanks on 7 Jul 2016
Edited: chlor thanks on 7 Jul 2016
The stars were only there because I wanted the numbers to appear bold, but somehow the formatting turned into stars. So I tried again without the stars, but it still gives me this:
s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
>> hi = regexp(s, '(?:TTT)?)400\d', 'match', 'once')
hi =
''
''
''
''
''
''
''
''
Still, thank you for the awesome explanation! I have been trying to learn on my own for many days, but it is very hard to find good examples to learn from, and I keep getting stuck. Your notes help me a lot, thank you so much Guillaume!
Guillaume on 7 Jul 2016
MATLAB's regexp engine (unfortunately) does not throw an error when the regular expression is invalid. Your expression has unbalanced parentheses, so it is not valid.
'(?:TTT)?400\d' would have worked. This would return the TTT portion in the match if it is present, which is probably not what you want.
'(?<=(?:TTT)?)400\d' would also work. The TTT would not be returned in the match. It is just a requirement that the match be preceded by TTT. However, since that requirement is optional (sic!) because of the last ?, it actually serves no purpose and may just as well be omitted. (In my regex, the star was part of the requirement and was not optional).
So, '400\d' is probably what you need then.
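As a rough sketch of how the plain '400\d' pattern could produce the two-column result asked for in the question: the variable names num, keep and ExM below are illustrative choices, and the file list is copied from the accepted answer. It assumes that no file name contains "400" followed by a digit anywhere other than the wanted number (true for this list, but a date such as 04002016 would break it).

```matlab
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_H2G_ TTT4002_D259_070516.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'
     'HI_A1C_468_070516_ TTT4004.xlsx'
     'HI__ TTT4003_862_07052016_G1C.xlsx'
     'HI_B2C_ 008_D1487_070516.xlsx'
     'HI_KA6_4006_148_07052016.xlsx'};

% First occurrence of "400" followed by one digit; '' where there is none
num = regexp(s, '400\d', 'match', 'once');

% Keep only the rows that produced a match, so no empty strings remain
keep = ~cellfun(@isempty, num);

% Two-column cell array: extracted number, corresponding file name
ExM = [num(keep), s(keep)];
```

Because regexp returns one entry per input string, the same logical index can filter both the matches and the file names, which keeps the two columns aligned.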
If you want to know whether or not the 400\d is actually preceded by TTT:
regexp(s, '(TTT)?(400\d)', 'tokens', 'once')
and if there's a match, see if the first token is empty (no TTT) or not.
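The token check described above might look like the following sketch. The names tok, hasMatch and hasTTT are made up for illustration, and the three file names are a subset of the list in the question. The test on numel is a defensive assumption, in case an unmatched optional token is not returned as an empty placeholder.

```matlab
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'};

% Token 1 is the optional TTT prefix, token 2 is the number
tok = regexp(s, '(TTT)?(400\d)', 'tokens', 'once');

% Rows with no match at all come back empty
hasMatch = ~cellfun(@isempty, tok);

% For the matching rows, a non-empty first token means TTT was present
hasTTT = false(size(s));
hasTTT(hasMatch) = cellfun(@(t) numel(t) == 2 && ~isempty(t{1}), tok(hasMatch));
```

Here hasMatch separates the file names that contain a number at all, and hasTTT marks the subset whose number is directly preceded by TTT.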

More Answers (0)
