Regular expressions help with HTML source code

Question

0 votes

I'm looking to parse some HTML source code to pull information from the Wall Street Journal. I need the prices of the following commodities: the 4 domestic crude oil spot prices, copper, aluminum, cotton, and cocoa.

This is the URL: http://online.wsj.com/mdc/public/page/2_3023-cashprices.html

I'm having some trouble with getting regexp to work the way I want it to.

What regular expression would you use to pull out the middle (bold) price listed? If the value is n.a., it's okay if it just returns 'n.a.' or its equivalent.

I tried a variety of methods and I couldn't get it to work.

Could someone show an example of the string he or she would use for extracting the price?

Thanks!


Answer 1

Cedric on 12 Mar 2013

Edited: Cedric on 12 Mar 2013

0 votes

Did you see my answer to your previous question? Tokens work well in such situations:

 >> buffer = urlread('http://online.wsj.com/mdc/public/page/2_3023-cashprices.html');
 >> item    = 'West Texas Intermediate, Cushing' ;
 >> pattern = [item, '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'] ;   % first bold cell after the item name
 >> tokens  = regexp(buffer, pattern, 'names') ;
 tokens = 
    prefix: ''
     price: '92.06'
 >> item    = 'London fixing, spot price' ;
 >> pattern = [item, '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'] ;   % same pattern, different item
 >> tokens  = regexp(buffer, pattern, 'names') ;
 tokens = 
    prefix: '&#163;'          % HTML code for the pound sign; the forum renders it.
     price: '19.4273'

Cheers,

Cedric

Note that a '.' is returned as the price for n.a. entries: the lazy prefix takes 'n.a' and the price token matches only the final period.

EDIT 1: corrected pattern thanks to Walter's comment about pound-signs.

EDIT 2: updated with named tokens so we get the prefix (e.g. pound-sign).
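
To pull several commodities in one pass, a named-token pattern of this kind can be applied in a loop. What follows is only a sketch under stated assumptions: `buffer` is a made-up stand-in for the page source (the real WSJ markup may well differ), the item name 'Cocoa, Ivory Coast' is hypothetical, and escaping the item names with `regexptranslate` is an extra precaution rather than part of the answer above.

```matlab
% Hypothetical stand-in for the page source; the real markup may differ.
buffer = ['<tr><td>West Texas Intermediate, Cushing</td>', ...
          '<td>91.50</td><td><b>92.06</b></td><td>93.00</td></tr>', ...
          '<tr><td>London fixing, spot price</td>', ...
          '<td>&#163;19.40</td><td><b>&#163;19.4273</b></td><td>&#163;19.50</td></tr>', ...
          '<tr><td>Cocoa, Ivory Coast</td>', ...
          '<td>n.a.</td><td><b>n.a.</b></td><td>n.a.</td></tr>'];

items  = {'West Texas Intermediate, Cushing', ...
          'London fixing, spot price', ...
          'Cocoa, Ivory Coast'};        % last name is illustrative only
prices = nan(size(items));

for k = 1:numel(items)
    % Escape regex metacharacters in the item name, then capture the
    % first bold cell that follows it.
    pattern = [regexptranslate('escape', items{k}), ...
               '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'];
    tokens  = regexp(buffer, pattern, 'names', 'once');
    if ~isempty(tokens) && isfield(tokens, 'price')
        % str2double gives NaN for non-numeric text such as '.',
        % which is what the pattern captures for n.a. entries.
        prices(k) = str2double(tokens.price);
    end
end
```

With this sample input, the n.a. row ends up as NaN in `prices`, which is usually easier to work with downstream than the string '.'.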


Cedric on 12 Mar 2013

Ah thank you Walter, I had not realized that there could be these signs!

Cedric on 12 Mar 2013

Updated so the prefix is extracted (e.g. pound-sign).


Answer 2

Walter Roberson on 11 Mar 2013

0 votes

'^<b>.*?\d+(\.\d+)?<\\b>$'

This should allow for the currency symbol, and for the possibility that the decimal point and following digits are not there. The only real "trick" here is the use of .*? to request the minimum expansion of the repeated . (which matches any one character), whereas .* by itself is "greedy" and would match as many characters as possible.
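
The greedy versus lazy difference is easy to see on a small string. This is a minimal sketch; `str` is a made-up fragment with two bold cells, not text taken from the actual page.

```matlab
% Hypothetical fragment with two bold cells next to each other.
str = '<b>92.06</b><b>19.4273</b>';

% Greedy: .* runs to the end of the text and backtracks to the LAST </b>.
greedy = regexp(str, '<b>(.*)</b>', 'tokens', 'once');

% Lazy: .*? expands one character at a time and stops at the FIRST </b>.
lazy   = regexp(str, '<b>(.*?)</b>', 'tokens', 'once');

disp(greedy{1})   % 92.06</b><b>19.4273
disp(lazy{1})     % 92.06
```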


Joseph Williams on 12 Mar 2013

That doesn't work, in part because the end tags in the page source are written with a '/' instead of a '\'. Even after fixing that, it still returns an empty answer. Do you have any other suggestions?

Best, J. Williams
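
One plausible reason for the empty result, which cannot be checked against the live page here, is the anchoring: by default MATLAB's `^` and `$` match only at the start and end of the whole input text, not at each line. The sketch below illustrates this on a made-up three-line `buffer`; the `[^<]*?` in the last pattern is a suggested tightening, not something from the answer above.

```matlab
% Hypothetical three-line fragment standing in for the page source.
buffer = sprintf('<td>Copper</td>\n<b>3.5405</b>\n<td>next</td>');

% Pattern with '/' fixed: ^ and $ anchor to the whole text by default,
% so nothing matches inside a longer buffer.
anchored = regexp(buffer, '^<b>.*?\d+(\.\d+)?</b>$', 'match');

% Same pattern with the 'lineanchors' option: ^ and $ now match per line.
perLine  = regexp(buffer, '^<b>.*?\d+(\.\d+)?</b>$', 'match', 'lineanchors');

% Anchors dropped, which also works when several tags share one line.
% [^<]*? keeps the optional currency prefix from running into another tag.
unanchored = regexp(buffer, '<b>[^<]*?\d+(\.\d+)?</b>', 'match');
```

On this sample, `anchored` is empty while `perLine` and `unanchored` both return the single match '<b>3.5405</b>'.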
