How to extract data from a table format HTML?
Alan Cesar Pilon Miro
on 1 Feb 2023
Commented: Jonas
on 6 Feb 2023
Hi,
I want to read an HTML page and extract some information from it. However, when I use webread followed by htmlTree, part of the HTML data is missing and I don't know why.
Example:
Using this URL
I would like to get the contents of the rows or columns for the SMILES and InChI fields. However, with the code below I can't see this information. I have tried different selectors, but I don't know whether the data is dynamically generated.
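html = webread(url);                 % download the page source as text
tree = htmlTree(html);               % parse it into an HTML tree
selector = "td";                     % select every table cell
subtrees = findElement(tree, selector);
str = extractHTMLText(subtrees);     % plain text of each selected cell
table_data = str(1:end);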
Thank you,
Alan
Accepted Answer
Jonas
on 2 Feb 2023
Without digging deeper into the HTML, we can use a plain text search:
d=webread('http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152',weboptions('Timeout',15));
SMILESfirstTry=extractBetween(d,'<th class="inf">SMILES</th>','</td>','Boundaries','exclusive');
SMILESsecondTry=extractAfter(SMILESfirstTry{1},'<td colspan="4">')
The same could be done for the other tags.
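A minimal sketch of how the text search might be repeated for several fields. The HTML string here is a made-up sample that only imitates the row layout used above (it is not the real page), and the variable names `fields` and `values` are hypothetical. It uses `regexprep` instead of a fixed `<td colspan="4">` string so that the opening cell tag may carry other attributes.

```matlab
% Made-up sample imitating the row layout; on the real page, d would be
% the text returned by webread.
d = ['<tr><th class="inf">InChIKey</th><td colspan="4">AAAA-BBBB-C</td></tr>' ...
     '<tr><th class="inf">SMILES</th><td colspan="4">CCO</td></tr>'];

fields = {'InChIKey', 'SMILES'};      % hypothetical list of header labels
values = cell(size(fields));
for k = 1:numel(fields)
    header = ['<th class="inf">' fields{k} '</th>'];
    % Text between the header cell and the next closing </td>
    raw = extractBetween(d, header, '</td>', 'Boundaries', 'exclusive');
    if isempty(raw)
        values{k} = '';               % field not found in the page text
    else
        % Strip the opening <td ...> tag in front of the value
        values{k} = regexprep(raw{1}, '^\s*<td[^>]*>', '');
    end
end
```

An empty entry in `values` then signals that the header was not found, which also covers the case of an empty download.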
Similarly, using a bit more of the HTML functionality:
tree = htmlTree(d);
selector= "tr";
subtrees= findElement(tree,selector);
str = extractHTMLText(subtrees);
searchTags={'InChIKey' 'InChICode' 'SMILES'};
location=contains(str,searchTags);
rawEntries=str(location)
extractAfter(rawEntries,' ')
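As a possible variation on the row-based approach, the label and value cells of each row can be read separately instead of splitting the combined row text at the first space. This is only a sketch: the HTML is a made-up sample rather than the real page, it assumes each row of interest holds one `th` and at least one `td`, and the names `labels`, `values` and `T` are hypothetical. It needs the Text Analytics Toolbox, like the code above.

```matlab
% Made-up sample; on the real page, html would be the text from webread.
html = "<table>" + ...
    "<tr><th class=""inf"">SMILES</th><td colspan=""4"">CCO</td></tr>" + ...
    "<tr><th class=""inf"">Formula</th><td colspan=""4"">C2H6O</td></tr>" + ...
    "</table>";

tree = htmlTree(html);
rows = findElement(tree, "tr");

labels = strings(0, 1);
values = strings(0, 1);
for k = 1:numel(rows)
    th = findElement(rows(k), "th");   % header cell of this row
    td = findElement(rows(k), "td");   % data cell(s) of this row
    if isempty(th) || isempty(td)
        continue                       % skip rows without a label/value pair
    end
    labels(end+1, 1) = strtrim(extractHTMLText(th(1)));
    values(end+1, 1) = strtrim(extractHTMLText(td(1)));
end

% Keep only the fields of interest, as in the answer above
searchTags = ["InChIKey" "InChICode" "SMILES"];
keep = contains(labels, searchTags);
T = table(labels(keep), values(keep), 'VariableNames', {'Field', 'Value'});
```

Keeping label and value in separate columns avoids depending on how extractHTMLText joins the cells of a row.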
2 Comments
Jonas
on 6 Feb 2023
Thanks for your reply. Make sure that the data returned from webread is not empty: the website seems to be quite slow, and sometimes the returned data is empty. Increasing the timeout limit further may help here.
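One way to guard against an empty result is to retry the download a few times with a longer timeout. The following is a rough sketch, not a tested recipe for this site: the 30 second timeout, the three attempts and the 2 second pause are assumed values, and `maxTries` is a hypothetical name.

```matlab
url = 'http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152';
opts = weboptions('Timeout', 30);   % assumed value, longer than the 15 s used above
maxTries = 3;                       % assumed number of attempts

d = '';
for attempt = 1:maxTries
    try
        d = webread(url, opts);
    catch err
        % Timeouts and connection errors end up here; report and retry
        warning('Attempt %d failed: %s', attempt, err.message);
    end
    if ~isempty(d)
        break                       % got some content, stop retrying
    end
    pause(2);                       % short wait before the next attempt
end

if isempty(d)
    warning('No data returned after %d attempts.', maxTries);
end
```

Checking `isempty(d)` before parsing keeps the later extractBetween or htmlTree calls from silently returning nothing.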