How to extract data from a table format HTML?

Question

0 votes

Hi,

I want to access an HTML page and extract some information. However, when I use webread and then htmlTree, part of the HTML data is missing and I don't know why.

Example:

Using this url

url = http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152

I would like to get the rows or columns containing the SMILES and InChI fields. However, when I use the code below, this information does not show up. I have tried different selectors, and I don't know whether the data is dynamically generated.

url = "http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152";
html = webread(url);                      % download the page source
tree = htmlTree(html);                    % parse it into an HTML tree
selector = "td";
subtrees = findElement(tree, selector);   % all table cells
str = extractHTMLText(subtrees);          % text content of each cell
table_data = str(1:end);

Thank you,

Alan

Answer 1

Jonas on 2 Feb 2023

1 vote

Without digging deeper into the HTML, we can use a plain text search:

d=webread('http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152',weboptions('Timeout',15));
SMILESfirstTry=extractBetween(d,'<th class="inf">SMILES</th>','</td>','Boundaries','exclusive');
SMILESsecondTry=extractAfter(SMILESfirstTry{1},'<td colspan="4">')
SMILESsecondTry = 'c1c(ccc(c1)/C=C/C(=O)O)O'

The same could be done for the other fields.

Similarly, using a bit more HTML parsing:
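A minimal sketch of how the text search might be extended to several fields in one loop. The sample string `d` is a hypothetical stand-in for the page source returned by webread, and it assumes the other fields use the same `<th class="inf">` / `<td colspan="4">` markup as SMILES, which should be checked against the real page. The names `fields`, `values`, `header`, and `chunk` are made up for illustration.

```matlab
% Hypothetical sample mirroring the markup used above;
% in practice d would be the text returned by webread.
d = ['<tr><th class="inf">InChIKey</th><td colspan="4">NGSWKAQJJWESNS-ZZXKWVIFSA-N</td></tr>' ...
     '<tr><th class="inf">SMILES</th><td colspan="4">c1c(ccc(c1)/C=C/C(=O)O)O</td></tr>'];
d = string(d);                          % work with a string scalar throughout

fields = ["InChIKey" "SMILES"];         % field names to look up
values = strings(size(fields));         % stays "" for fields that are not found

for k = 1:numel(fields)
    header = "<th class=""inf"">" + fields(k) + "</th>";
    chunk = extractBetween(d, header, "</td>");   % text between header cell and end of value cell
    if ~isempty(chunk)
        % Drop the opening tag of the value cell, keep its content
        values(k) = extractAfter(chunk(1), "<td colspan=""4"">");
    end
end

disp(values)
```

Fields that do not appear in the page are left as empty strings, which makes a missing entry easy to spot.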

tree = htmlTree(d);
selector = "tr";
subtrees = findElement(tree,selector);
str = extractHTMLText(subtrees);
searchTags = {'InChIKey' 'InChICode' 'SMILES'};
location = contains(str,searchTags);
rawEntries = str(location)
rawEntries = 3×1 string array
    "InChIKey  NGSWKAQJJWESNS-ZZXKWVIFSA-N"
    "InChICode  InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"
    "SMILES  c1c(ccc(c1)/C=C/C(=O)O)O"
extractAfter(rawEntries,'  ')
ans = 3×1 string array
    "NGSWKAQJJWESNS-ZZXKWVIFSA-N"
    "InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"
    "c1c(ccc(c1)/C=C/C(=O)O)O"
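If the values are needed by name rather than by position, the name/value rows could be turned into a struct. This is only a sketch: it starts from the `rawEntries` output shown above (typed in by hand here so it runs without the website), and it assumes every row contains exactly one double-space separator. The names `parts` and `result` are hypothetical.

```matlab
% rawEntries as printed above; in practice it comes from str(location)
rawEntries = ["InChIKey  NGSWKAQJJWESNS-ZZXKWVIFSA-N"; ...
              "InChICode  InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"; ...
              "SMILES  c1c(ccc(c1)/C=C/C(=O)O)O"];

% Split each row at the double space: column 1 = field name, column 2 = value
parts = split(rawEntries, "  ");

% Build a struct whose field names are the entries of column 1
result = cell2struct(cellstr(parts(:,2)), cellstr(parts(:,1)), 1);

disp(result.SMILES)
```

The values are stored as character vectors, so `result.InChIKey` can be passed directly to functions that expect char input.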

2 comments

Alan Cesar Pilon Miro on 3 Feb 2023

Hi Jonas,

Thank you! The first method worked very well.

Just to mention: I had some difficulties with the second approach; I could not find the objects.

Jonas on 6 Feb 2023

Thanks for your reply. Make sure that the data returned from webread is not empty: the website seems to be quite slow, and sometimes the returned data is empty. Increasing the timeout limit further may help here.
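One way to guard against an empty response is to retry with a growing timeout and only continue once the expected text is present. This is a rough sketch, not a tested recipe for this site: the retry count, the timeout steps, the pause length, and the check for the word 'SMILES' are all assumptions chosen for illustration.

```matlab
url = 'http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152';
maxTries = 3;     % assumed number of attempts
d = '';           % stays empty if every attempt fails

for attempt = 1:maxTries
    try
        % Allow more time on each attempt: 15 s, 30 s, 45 s
        d = webread(url, weboptions('Timeout', 15*attempt));
    catch err
        warning('Attempt %d failed: %s', attempt, err.message);
    end
    % Stop as soon as the page source contains the field we are after
    if ~isempty(d) && contains(d, 'SMILES')
        break
    end
    pause(2);     % short wait before trying again
end

if isempty(d)
    disp('No data received; htmlTree and the text search would find nothing.');
end
```

Checking `d` before calling htmlTree or extractBetween separates a download problem from a selector problem.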
