Extract numbers from a cell array of strings

I have the following cell array:
s =
'HI_B2_ *TTT4009*_D452_07052016.xlsx'
'HI_H2G_ *TTT4002*_D259_070516.xlsx'
'HI_B2C_ *4008*_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ *TTT4004*.xlsx'
'HI__ *TTT4003*_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_ *4006*_148_07052016.xlsx'
I would like to extract all the bold numbers into a two-column matrix, like so:
ExM=
'4009', 'HI_B2_ TTT4009_D452_07052016.xlsx'
'4002', 'HI_H2G_ TTT4002_D259_070516.xlsx'
'4008', 'HI_B2C_ 4008_D1482_070516.xlsx'
'4004','HI_A1C_468_070516_ TTT4004.xlsx'
'4003','HI__ TTT4003_862_07052016_G1C.xlsx'
'4006','HI_KA6_ 4006_148_07052016.xlsx'
that holds the extracted numbers and the corresponding file names. Note that every extracted number begins with "400", and many of them also follow the letters "TTT".
I tried
regexpi(s, '[\w\s,]*400[\w\s,]*[_;]+','match')
But it did not work correctly and I am also not sure how to make the matrix of two columns without empty strings.
I would appreciate any input or help material to learn from. Thank you very much!
  1 Comment
Guillaume on 6 Jul 2016
Note that within a character group (delimited by []) you can't use character classes such as \w and \s. The way to write [\w\s,] would be
(?:\w|\s|,)
or just expand the character classes:
[a-zA-Z0-9_ \f\n\r\t\v,]


Accepted Answer

Azzi Abdelmalek on 6 Jul 2016
s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
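% Capture the run of digits that starts with 400; names without a match yield an empty cell
a = regexp(s, '.*(400\d*).*', 'tokens', 'once')
% Logical index of the file names that produced a match
idx = ~cellfun(@isempty, a)
% Two-column cell array: extracted number, corresponding file name
out = [[a{:}]' s(idx)]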
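
As a possible variation on the code above, here is a minimal sketch that uses the 'match' option instead of 'tokens', assuming the number is always exactly 400 followed by one digit. The variable names num, keep and ExM are only illustrative (ExM is borrowed from the question).

```matlab
% Sample file names, copied from the answer above
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_H2G_ TTT4002_D259_070516.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'
     'HI_A1C_468_070516_ TTT4004.xlsx'
     'HI__ TTT4003_862_07052016_G1C.xlsx'
     'HI_B2C_ 008_D1487_070516.xlsx'
     'HI_KA6_4006_148_07052016.xlsx'};

% With 'match' and 'once', each cell holds the matched text,
% or an empty char array when the name contains no 400x number
num = regexp(s, '400\d', 'match', 'once');

% Keep only the rows that produced a match (assumption: one number per name)
keep = ~cellfun(@isempty, num);

% Two-column cell array: extracted number, corresponding file name
ExM = [num(keep), s(keep)];
```

Because 'match' returns the text directly rather than a nested cell of tokens, the [a{:}]' unpacking step is not needed here.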
  3 Comments
Guillaume on 7 Jul 2016
Unfortunately, MATLAB's regexp engine does not throw an error when a regular expression is invalid. Your expression has unbalanced parentheses, so it is not valid.
'(?:TTT)?400\d' would have worked. It returns the TTT portion in the match when it is present, which is probably not what you want.
'(?<=(?:TTT)?)400\d' would also work, and the TTT would not be returned in the match. The lookbehind only requires that the match be preceded by TTT. However, since that requirement is optional (sic!) because of the last ?, it serves no purpose and may just as well be omitted. (In my regex, the star was part of the requirement and was not optional.)
So '400\d' is probably what you need.
If you want to know whether or not the 400\d is actually preceded by TTT:
regexp(s, '(TTT)?(400\d)', 'tokens', 'once')
and if there's a match, see if the first token is empty (no TTT) or not.
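
The check described above could be sketched as follows. This is only an illustration: the three sample names are taken from the question, and the variable names tok, matched and hasTTT are hypothetical.

```matlab
% Three sample names: one with TTT, one without, one with no 400x number
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'};

% Two tokens per match: the optional TTT prefix and the 400x number.
% Names without a match give an empty cell.
tok = regexp(s, '(TTT)?(400\d)', 'tokens', 'once');

% Which names contain a 400x number at all
matched = ~cellfun(@isempty, tok);

% For the matched names, the first token is empty when TTT is absent
hasTTT = false(size(s));
hasTTT(matched) = cellfun(@(t) ~isempty(t{1}), tok(matched));
```

Here hasTTT stays false for names that did not match at all, so matched and hasTTT together distinguish the three cases.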
