Web scraping with regular expressions: getting rid of HTML tags
Hi all,
I am writing some web-scraping code and am therefore using regular expressions. I need to isolate the words in a string; HTML tags should of course not be included. HTML tags are words enclosed in < > (e.g. <br>). Unfortunately, my code does not work and I am wondering why. Here is an example:
regexp('qu <qa>','(?!<)\w*(?!>)','match')
My expected result is 'qu', but instead I get 'qu' and 'q'. The code works with the string 'qu q'. What can I do to solve this issue?
thanks
Regards,
Pietro
Accepted Answer
Guillaume
on 3 Jun 2017
The first part of your expression, (?!<), is a negative look-ahead. You want a negative look-behind instead. Add a < before the !:
regexp('qu <qa>', '(?<!<)\w*(?!>)', 'match')
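To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names (str, withLookAhead, withLookBehind) are made up for illustration and are not part of the original post.

```matlab
str = 'qu <qa>';

% Original pattern: (?!<) is a negative look-ahead. It only checks that the
% NEXT character is not '<', so a match may still start right after '<'.
withLookAhead = regexp(str, '(?!<)\w*(?!>)', 'match');

% Corrected pattern: (?<!<) is a negative look-behind. It checks that the
% PREVIOUS character is not '<', so a match cannot start right after '<'.
withLookBehind = regexp(str, '(?<!<)\w*(?!>)', 'match');

disp(withLookAhead)    % the question reports 'qu' and 'q' for this pattern
disp(withLookBehind)   % only 'qu' is expected here
```

Note that the look-behind only inspects the single character before the match, so it is worth testing the pattern against tags with longer names before relying on it.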
3 Comments
Guillaume
on 3 Jun 2017
It's a lot more difficult to tell a regular expression not to match something than it is to tell it to match something. Therefore, I'd do it in two passes.
1. remove the tags:
notags = regexprep(yourstring, '<[^>]*>', '')
2. match whatever it is you want to match
matches = regexp(notags, '\w+', 'match')
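The two passes above can be put together as a short script. This is only a sketch: the input string html is an invented example (the string from the question with a second tag appended), and words is an illustrative variable name.

```matlab
% Hypothetical input: the question's string extended with a second tag.
html = 'qu <qa> some <br>text';

% Pass 1: delete everything from '<' up to the next '>'.
notags = regexprep(html, '<[^>]*>', '');

% Pass 2: collect the remaining runs of word characters.
words = regexp(notags, '\w+', 'match');

disp(words)   % 'qu'    'some'    'text'
```

One thing to watch: replacing tags with an empty string joins words that are separated only by a tag, so 'a<br>b' becomes 'ab'. If that matters for your data, replacing each tag with a space (' ') instead of '' keeps such words apart.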