Extract data from "txt" and "htm" files

Question

Florent Rouxelin on 11 Jul 2016


Direct link to this question

https://de.mathworks.com/matlabcentral/answers/294779-extract-data-from-txt-and-htm-files

Commented: Guillaume on 11 Jul 2016

Hi,

I am trying to extract a "NAME" and its corresponding "Dollar range" from "txt" and "htm" files (thousands of them) and record them in an Excel or CSV file.

(1) For the txt files, the titles of the columns that I am trying to record are: "NAME" "DOLLAR RANGE OF FUND SHARES"

(2) For the htm files, the titles of the columns that I am trying to record are: "NAME" "Fund" "Dollar Range of Equity Securities in the Fund Beneficially Owned"

Each file has a different number of pages, usually over 50, and has some recurring structure but is not completely standardized. These tables may appear more than once within one txt or htm file, and I want to record all of them.

I enclose 3 files: 1 txt file and the info I would like to record from it. I also include the information to record from the html file (the full file is too large to be included). Please let me know how to proceed. Thanks a lot!

1 Comment

Walter Roberson on 11 Jul 2016

Edited: Walter Roberson on 11 Jul 2016

I would probably write a perl script (others might prefer python) to at least get rid of the extraneous data, as it is much easier in perl to write range-style statements, shown here schematically rather than as literal perl syntax, of the form

/pattern1/, /pattern2/ {some action}

that would mean that the action should start to be taken when pattern1 is matched and should be repeated for every line until you find pattern2 (which should be included). You can also write things in perl such as

/pattern1/, /pattern2/-1 {some action}

which would not include the line matching pattern2 in having the action taken.

For example,

/DOLLAR RANGE OF FUND SHARES/, /^\s*$/-1 {print $_;}

would output from a line matching "DOLLAR RANGE OF FUND SHARES" up to but excluding the first line that was empty or included only white-space.

Reducing the data down to its essentials can make it much easier to write MATLAB code to process the data.
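
For readers who would rather stay in MATLAB, the same start/stop idea can be sketched roughly as follows. This is only an illustration, not the perl approach itself: the sample text, the variable names (sampletext, textlines, section) and the assumptions of '\n' line endings and of a blank line after the table are all made up for the example.

```matlab
% Hypothetical sample standing in for one filing (assumes plain '\n' line endings).
nl = char(10);
sampletext = strjoin({ ...
    'Some preamble', ...
    'NAME                DOLLAR RANGE OF FUND SHARES', ...
    'John Doe            $1-$10,000', ...
    'Jane Roe            None', ...
    '', ...
    'Unrelated trailing text'}, nl);

textlines = regexp(sampletext, '\n', 'split');   % one cell per line

% First line that contains the header text.
ishdr    = ~cellfun(@isempty, regexp(textlines, 'DOLLAR RANGE OF FUND SHARES', 'once'));
startidx = find(ishdr, 1);

% Lines that are empty or contain only white-space.
blankline = cellfun(@isempty, regexp(textlines, '\S', 'once'));
stopidx   = startidx + find(blankline(startidx+1:end), 1);

% Header line up to, but excluding, the first blank line after it.
section = textlines(startidx:stopidx-1);
fprintf('%s\n', section{:});
```

Working line by line like this trades the compactness of a single regular expression for logic that is easier to step through in the debugger.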


Answer 1

Guillaume on 11 Jul 2016


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/294779-extract-data-from-txt-and-htm-files#answer_228295


For the text file, the following would work to extract the required section:

wholetext = fileread('champlain-485apos.txt');
% 'once' option, assuming that the section appears only once in the file
wantedtext = regexp(wholetext, '^NAME\s+DOLLAR RANGE OF FUND SHARES\(1\)\s*$.+?^$', 'match', 'once', 'lineanchors')

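This asks regexp to match the header on a single line (the ^ and $ enclosing it): NAME, one or more spaces (the \s+), DOLLAR RANGE OF FUND SHARES(1) (parentheses need to be escaped in regexes), then optional trailing spaces (the \s*). The header is followed by any characters (the .+), as few as possible (the ?), up to the first empty line (the ^$). The 'lineanchors' option is what lets ^ and $ match at the start and end of each line rather than of the whole text.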

I don't know what would be needed for the htm file since I'd need the raw file, but it would be something similar.

Regular expressions are probably the best way to locate the sections that you're looking for.

Once you've got the portion of text you want, it's easy to split apart with either textscan or another regular expression.
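
As a rough illustration of that splitting step, here is a minimal sketch using regexp only. The sample section, the dashed separator row and the assumption that the two columns are separated by at least two spaces (and that names never contain a double space) are hypothetical; real filings may need a different rule.

```matlab
% Hypothetical extracted section; real files may use different spacing.
nl = char(10);
wantedtext = strjoin({ ...
    'NAME                DOLLAR RANGE OF FUND SHARES(1)', ...
    '----                ------------------------------', ...
    'John Doe            $1-$10,000', ...
    'Jane Roe            None'}, nl);

rows = strtrim(regexp(wantedtext, '\n', 'split'));   % one trimmed cell per line

% Keep lines containing something other than white-space or dashes,
% which drops empty lines and the '----' separator row.
keep = ~cellfun(@isempty, regexp(rows, '[^\s-]', 'once'));
rows = rows(keep);
rows = rows(2:end);                                   % drop the header line

% Split each row on runs of two or more white-space characters (assumed column gap).
parts  = regexp(rows, '\s{2,}', 'split');
names  = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
ranges = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```

If the columns are fixed-width instead, slicing each row by character position may be more robust than splitting on white-space.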

2 Comments

Florent Rouxelin on 11 Jul 2016

Edited: Guillaume on 11 Jul 2016

How would you modify this piece of code if the table appears more than once? How would you copy the results into the next empty row in Excel?

The HTML file can be found there actually: https://www.sec.gov/Archives/edgar/data/854126/000119312508244442/d485bpos.htm

Thanks for your help!

Guillaume on 11 Jul 2016

"How would you modify this piece of code if the table appears more than once?" Simply remove the 'once' option. The output will then be a cell array where each cell corresponds to one table.
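
A small sketch of what that looks like, run on a made-up text in which the header appears twice. The sample text and the variable name tables are assumptions for the illustration, and '\n' line endings are assumed; files with Windows line endings may need the pattern adjusted.

```matlab
% Hypothetical text containing the same table header twice.
nl = char(10);
header = 'NAME                DOLLAR RANGE OF FUND SHARES(1)';
wholetext = strjoin({ ...
    'Intro text', ...
    header, ...
    'John Doe            None', ...
    '', ...
    'Other text', ...
    header, ...
    'Jane Roe            Over $100,000', ...
    '', ...
    'End'}, nl);

% Same pattern as in the answer, without 'once': every match is returned.
tables = regexp(wholetext, ...
    '^NAME\s+DOLLAR RANGE OF FUND SHARES\(1\)\s*$.+?^$', ...
    'match', 'lineanchors');

% tables is a cell array of char vectors, one per table found.
for k = 1:numel(tables)
    fprintf('--- table %d ---\n%s', k, tables{k});
end
```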

"How would you copy the results into the next empty row in Excel?" There's a lot that goes into answering that question. It sounds to me like you first want to parse each table. I would start by removing the rows with "-----" (with a regexp or strrep), then parse each column (with textscan and regexp). As for appending to empty rows in Excel, that is probably the wrong approach. You would probably be better off building the whole Excel sheet in MATLAB, then writing it all at once as a new sheet or new file with xlswrite (or writetable). Otherwise, you have to automate Excel from MATLAB, which can be a big task if you've never done it before. Or maybe there's already something on the File Exchange.
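
To make the "build everything in MATLAB, then write once" idea concrete, here is a minimal sketch using table and writetable. The struct parsed, its field names, the file names and the output path are all hypothetical stand-ins for whatever the parsing step produces; a CSV file is written here, but an .xlsx file name should work with writetable in the same way.

```matlab
% Hypothetical per-file results, as they might come out of the parsing step.
parsed(1).file   = 'fileA.txt';
parsed(1).names  = {'John Doe'; 'Jane Roe'};
parsed(1).ranges = {'$1-$10,000'; 'None'};
parsed(2).file   = 'fileB.txt';
parsed(2).names  = {'Ann Poe'};
parsed(2).ranges = {'Over $100,000'};

% One small table per file, each row tagged with its source file.
pieces = cell(numel(parsed), 1);
for k = 1:numel(parsed)
    n = numel(parsed(k).names);
    pieces{k} = table(repmat({parsed(k).file}, n, 1), ...
        parsed(k).names, parsed(k).ranges, ...
        'VariableNames', {'File', 'Name', 'DollarRange'});
end
results = vertcat(pieces{:});    % stack all per-file tables

% Write everything in one call; quoting protects values that contain commas.
outfile = fullfile(tempdir, 'dollar_ranges.csv');
writetable(results, outfile, 'QuoteStrings', true);
```

Collecting the per-file tables in a cell array and concatenating once at the end avoids growing a table inside the loop, which matters when there are thousands of files.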

"The HTML file can be found there actually" Can you obtain the same information in a more useful format? htm/html is probably the worst. While you can probably use some regular expressions to locate your table, you're still going to end up writing an HTML parser. At the very least, you're going to have to recognise the <TR></TR> and <TD></TD> tags and examine their contents.
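
As a very rough sketch of that tag recognition, the following pulls the cell contents out of each table row with regexp. The HTML snippet is invented for the example; the real SEC file will also contain attributes, nested formatting tags, entities such as &nbsp; and possibly <TH> cells, none of which this handles beyond stripping nested tags.

```matlab
% Hypothetical, heavily simplified HTML table.
html = ['<TABLE>', ...
    '<TR><TD>Name</TD><TD>Fund</TD><TD>Dollar Range</TD></TR>', ...
    '<TR><TD><B>John Doe</B></TD><TD>Fund A</TD><TD> None </TD></TR>', ...
    '</TABLE>'];

% Contents of every <TR>...</TR> pair (tag attributes allowed, case ignored).
rowhtml = regexp(html, '<TR[^>]*>(.*?)</TR>', 'tokens', 'ignorecase');

cellsperrow = cell(numel(rowhtml), 1);
for r = 1:numel(rowhtml)
    % Contents of every <TD>...</TD> pair within this row.
    td = regexp(rowhtml{r}{1}, '<TD[^>]*>(.*?)</TD>', 'tokens', 'ignorecase');
    td = [td{:}];                          % flatten to a cell array of char
    td = regexprep(td, '<[^>]+>', '');     % strip any nested tags
    cellsperrow{r} = strtrim(td);          % trim surrounding white-space
end
```

Each element of cellsperrow is then a cell array with one entry per column, which can be filtered on the header text to find the rows of interest.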
