How to select specific urls in a webpage with regexp?

pietro on 8 Jun 2017
Commented: pietro on 1 Jul 2017
Hi all,
I'm doing some web scraping on this website. I need to extract the tractor links, which appear in many lines similar to the following one:
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>
So each link is followed by a string matching '\d* hp'. Here is the code I use to detect them:
url='http://www.tractordata.com/farm-tractors/tractor-brands/johndeere/johndeere-tractors.html';
html=urlread(url);
hyperlinks = regexp(html,'(?<=<tr><td.*>)<a.*?/a>(?=.*{8,50}\d* hp</td>)','match');
This code works reasonably well, but I can't get rid of the first result, which is wrong:
<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a>
As you can see, the match starts at a link above the one that should be selected. How can I fix this? Thanks.
1 Comment
Michael Dombrowski on 29 Jun 2017
When I run your code, I get no results in hyperlinks. Have you thought of adding "farm-tractors" to your regex? It would resolve your issue, and it would work fine as long as all the tractor links go to the farm-tractors directory.
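A minimal sketch of that suggestion, assuming the two sample lines quoted in the question stand in for the downloaded page. The pattern below is only an illustration, not the original poster's code: it requires "/farm-tractors/" inside the href, so the spacer.gif anchor can never start a match.

```matlab
% Two sample lines from the question stand in for the real page (assumption).
html = ['<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>' char(10) ...
        '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% [^"]* cannot run past the closing quote of the href, so the
% "/farm-tractors/" part must be found inside that one attribute.
pat = '<a href="[^"]*/farm-tractors/[^"]*"[^>]*>.*?</a>';
hyperlinks = regexp(html, pat, 'match')';   % transposed for easy viewing
```

On the real page this would presumably also pick up any non-tractor links that live under farm-tractors, so the hp check may still be needed.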


Accepted Answer

Guillaume on 29 Jun 2017
Edited: Guillaume on 29 Jun 2017
Note: avoid greedy .*, particularly in complex expressions; it's bound to cause you problems. Negated character classes often work better. For example, instead of <td.*>, use <td[^>]*>.
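To see the difference, here is a small sketch on a made-up one-line row (the string and the variable names are only for illustration, they are not taken from the real page):

```matlab
% Hypothetical sample row, not copied from the website.
row = '<tr><td class="x"><a href="a.html">20A</a></td><td>21 hp</td></tr>';

% Greedy: .* runs to the end of the string, then backs up to the LAST '>'.
greedy  = regexp(row, '<td.*>', 'match', 'once');

% Negated class: [^>]* cannot cross a '>', so the match ends at the FIRST '>'.
negated = regexp(row, '<td[^>]*>', 'match', 'once');
```

Here greedy holds everything from the first <td to the final </tr>, while negated is just <td class="x">.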
As per Michael's comment, your posted regex does not work. But even with the simplified regex:
hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match')' %transposed for easy viewing in command window
you can see that there is a problem. Unfortunately for you, the problem is the webpage itself, which is not valid HTML. The spacer.gif <a hyperlink (on line 131 of the source HTML) is never closed, so your regex captures everything up to the next /a>, which belongs to the next <tr><td>.
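The effect can be reproduced on a tiny fragment. This is only a sketch: the two rows below are a shortened, made-up imitation of the broken markup, not the actual page source.

```matlab
% Hypothetical fragment: the first <a ...> is never closed, as on the real page.
html = ['<tr><td><a href="spacer.gif" alt=""></td></tr>' char(10) ...
        '<tr><td><a href="20a.html">20A</a></td><td>21 hp</td></tr>'];

% The lazy .*? stops at the first /a> it meets, which sits in the NEXT row,
% so the single match swallows the spacer row and the tractor link together.
loose = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match');
```

loose comes back as one cell that starts at the spacer.gif anchor and ends after 20A</a>, which is the same kind of wrong first result shown in the question.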
Unfortunately that makes your life rather difficult. Try:
hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)', 'match')' %transposed for easy viewing in command window
And, if you can, report to the website owner that their page is missing a closing tag.
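If what you ultimately want is the URL and the model name rather than the whole <a> element, the same idea can be extended with named tokens. This is a sketch under assumptions: the fragment below is a stand-in for the page, and the token names url and model are my own choice.

```matlab
% Hypothetical fragment: one unclosed spacer anchor, then one tractor row from the question.
html = ['<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>' char(10) ...
        '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% Same structure as the regex above, with named tokens for the href and the link text.
% [^<]* between <a ...> and </a> rejects the unclosed spacer anchor.
pat = ['(?<=<tr><td[^>]*>)<a\s+href="(?<url>[^"]*)"[^>]*>(?<model>[^<]*)</a>' ...
       '(?=</td><td[^>]*>\d+ hp</td>)'];
links = regexp(html, pat, 'names');   % struct array with fields url and model
```

With this fragment, links has a single element whose url is the 20A page and whose model is 20A. On the full page you would get one element per tractor row, assuming every row follows the same layout.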
