How do I read the text between href tags and return the results in a cell array?
6 Ansichten (letzte 30 Tage)
Ältere Kommentare anzeigen
StuartG
am 13 Jun. 2016
Kommentiert: Ana Alonso
am 17 Dez. 2019
Currently, I have an html webpage saved in a text format. Below is an example of the portion of the text I am interested in:
<a href='/some/1056-text-stuff'>
I want to search the text document for every case the "<a href='\some\ " pattern appears and extract the text between the tokens, i.e.
/some/1056-text-stuff
Matlab has regexp, match and tags but I am struggling to pick out the string cleanly. Ideally, I would like to search the document and return a cell array of strings which lists all of the matches. Here is my current code:
str= fileread('C:\Users\Me\Documents\MATLAB\trial.txt'); %read in text file
urls = regexp(str, 'href=(\S+)(\s*)$', 'tokens', 'lineAnchors'); %find urls
0 Kommentare
Akzeptierte Antwort
Julian
am 17 Jun. 2016
You can try something like
>> RE='<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';
>> list=regexp(html, RE, 'names')
I can recommend this tool https://www.regexbuddy.com/
2 Kommentare
Ana Alonso
am 17 Dez. 2019
Hi there,
What do the (?<target>.*?) and (?<text>.*?) expressions correspond to?
I've never worked with html before and I'm just trying to scrape urls from the html code.
Thanks!
Weitere Antworten (0)
Siehe auch
Kategorien
Mehr zu Data Import and Export finden Sie in Help Center und File Exchange
Produkte
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!