Regex: How can I perform positive lookbehind for a specific sequence of characters?

EDIT: Changed 'Negative lookbehind' to 'Positive lookbehind'
Hi,
I am attempting to separate the first name from a list of names, using regex. The format of the names is as follows:
<last name>, <title>. <first name> <middle names> (<other name>)
Where <middle names> and (<other name>) are optional.
I'm new to regex, and currently finding it hard to intuit. It seems to me that I need a positive lookbehind to capture the word preceded by a '.' followed by a whitespace character in order to capture the first names, but it's not working how I'd like! See code below:
load titanic.mat
% Attempt #1 (Matches words preceded by '.' characters OR whitespace characters -
% I need it to match '.' followed by a whitespace... how???
name_first = regexp(train.Name, '(?<=[\.\s])([A-Z][a-z]+)', 'match')
% Attempt #2 (Captures unwanted '. ' before first names)
name_first2 = regexp(train.Name, '\.\s([A-Z][a-z]+)', 'match')
% Attempt #3 (Attempt to capture 3rd word, doesn't work)
name_first3 = regexp(train.Name, '(\w.*\w){3}', 'match')
Alternative solutions are great, but ideally I'd like to understand WHY my current code doesn't work (specifically attempt #1), and how I might be able to make it work using a positive lookbehind to look behind for a specific sequence of characters (i.e. return a word preceded by 'abc').
Thanks in advance for your help.
  4 Comments
Walter Roberson on 14 Sep 2021
Edited: Walter Roberson on 14 Sep 2021
% I need it to match '.' followed by a whitespace... how???
Using
name_first = regexp(train.Name, '(?<=\.\s)([A-Z][a-z]+)', 'match')
But consider making it \s+ instead of \s .
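To illustrate why attempt #1 behaves differently, here is a minimal sketch on a few made-up names (the `names` cell array is a hypothetical stand-in for `train.Name`, which is not shown in the thread). A character class such as `[\.\s]` matches one character that is either a period or whitespace, whereas `\.\s` inside the lookbehind requires the two-character sequence. The sketch also shows, as an alternative to lookbehind, how attempt #2 could return only the name by asking for 'tokens' instead of 'match'.

```matlab
% Hypothetical sample data standing in for train.Name
names = {'Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'};

% Character class: the lookbehind is satisfied by EITHER '.' OR whitespace,
% so every capitalised word that follows a space matches (title included).
byClass = regexp(names, '(?<=[\.\s])[A-Z][a-z]+', 'match');

% Sequence: the lookbehind requires '.' immediately followed by whitespace,
% so only the word after the title matches.
bySeq = regexp(names, '(?<=\.\s)[A-Z][a-z]+', 'match', 'once');

% Alternative without lookbehind: match '. ' but return only the
% parenthesised group by requesting 'tokens' rather than 'match'.
byTok = regexp(names, '\.\s+([A-Z][a-z]+)', 'tokens', 'once');

% The same idea for an arbitrary literal sequence such as 'abc'
afterAbc = regexp('abcFoo xyzBar', '(?<=abc)\w+', 'match');

disp(byClass{1})   % words matched by the character-class version
disp(bySeq)        % first names from the sequence version
disp(byTok{1})     % token captured by the no-lookbehind version
disp(afterAbc)     % word preceded by 'abc'
```

With 'once', `regexp` returns a single match per input rather than a cell array of all matches, which is convenient when exactly one first name per row is wanted.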
Also, are you sure you do not need to handle names with apostrophes, like O'Rorke? Are you sure you do not need to handle names with dashes, like Fitz-Williams? Are you sure you do not need to handle surnames with spaces, such as van Horton? Which, incidentally, is also an example of a name that starts with lower-case.
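One possible way to loosen the pattern for such names is sketched below. The sample names are invented for illustration, and the character class is only an assumption about which characters should count as part of a name; real data may need more (accented letters, for example).

```matlab
% Hypothetical names with an apostrophe, a hyphen, and a lower-case,
% two-word surname (not taken from the actual data set)
names = {'O''Rorke, Mr. Jean-Paul', 'van Horton, Mrs. D''Arcy Anne'};

% First name: a letter of either case after '. ', followed by letters,
% apostrophes, or hyphens. Inside a MATLAB char vector, '' is a literal '.
% The '-' is placed last in the class so it is treated literally.
first = regexp(names, '(?<=\.\s)[A-Za-z][A-Za-z''-]*', 'match', 'once');

% Surname: everything before the first comma, which keeps spaces,
% apostrophes, hyphens, and lower-case prefixes intact.
last = regexp(names, '^[^,]+', 'match', 'once');

disp(first)
disp(last)
```

Taking the surname as "everything up to the first comma" avoids having to enumerate the characters a surname may contain, at the cost of assuming that the comma always separates surname from title.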
Adam Brann on 14 Sep 2021
Thanks for your answer, exactly what I needed. I mistakenly thought the characters to be 'looked behind for' needed to be inside square brackets.
Excellent points regarding the 'unusual' names, I'll go away and have a think about how I might write a regexp to capture those cases. Many thanks for your help.


Answers (0)

Version

R2021a
