XML parsing vs regexp

Sebastian Holmqvist on 30 Jul 2012
I'm trying to extract values from two elements in a fairly large XML document. I'm torn between doing it "the right way" and doing it "the fast way", i.e. parsing vs. regexp.
elem_num = 1e4;
%% Create sample XML string
xml_str = cell(1, elem_num+2);
xml_str(1) = {'<root>'};             % opening root tag (name assumed)
for i=1:elem_num
    xml_str(i+1) = {'<elem><aa>abc</aa><ab>def</ab></elem>'};
end
xml_str(elem_num+2) = {'</root>'};   % closing root tag (name assumed)
xml_str = cell2mat(xml_str);
%% Convert string to stream and parse
stream = java.io.StringBufferInputStream(xml_str);
factory = javaMethod('newInstance', ...
    'javax.xml.parsers.DocumentBuilderFactory');
builder = factory.newDocumentBuilder;
document = builder.parse(stream);
%% Parse DOM properly
aa_list = document.getElementsByTagName('aa');
aa_num = aa_list.getLength;
aa = cell(1, aa_num);
for i=1:aa_num
    % Convert the returned java.lang.String to a MATLAB char array
    aa{i} = char(aa_list.item(i-1).getTextContent);
end
ab_list = document.getElementsByTagName('ab');
ab_num = ab_list.getLength;
ab = cell(1, ab_num);
for i=1:ab_num
    ab{i} = char(ab_list.item(i-1).getTextContent);
end
%% Use regexp
aa_regexp = regexp(xml_str, '(abc)', 'tokens');
ab_regexp = regexp(xml_str, '(def)', 'tokens');
As you can see in my code, parsing might be the correct way of handling xml, but takes ages to compute compared to regexp.
% XML Parsing: Elapsed time is 3.222058 seconds.
% Regexp: Elapsed time is 0.050301 seconds.
Any tips on how to speed this up? E.g., another parser or a better way of doing it?
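For comparison: the patterns '(abc)' and '(def)' only find those literal strings, so they say little about extracting unknown values. A minimal sketch of a pattern that captures whatever sits between the tags, assuming the elements carry no attributes, are never nested inside themselves, and contain no CDATA sections or character references. The root tag name and the variable names ending in _tok and _vals are placeholders chosen for this example.

```matlab
% Build a sample document; the <root> tag name is an assumption.
elem_num = 1e4;
body = repmat('<elem><aa>abc</aa><ab>def</ab></elem>', 1, elem_num);
xml_str = ['<root>' body '</root>'];

tic;
% Lazy quantifier (.*?) captures the shortest run between each tag pair.
aa_tok = regexp(xml_str, '<aa>(.*?)</aa>', 'tokens');
ab_tok = regexp(xml_str, '<ab>(.*?)</ab>', 'tokens');
% Each token is a 1x1 cell; flatten into 1xN cell arrays of char vectors.
aa_vals = [aa_tok{:}];
ab_vals = [ab_tok{:}];
toc
```

This keeps the speed of regexp, but it is only as reliable as the assumptions listed above; any input that breaks them needs a real parser.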

Walter Roberson on 30 Jul 2012
Often, when HTML or XML are analyzed in terms of extended regular expressions, the implementations are vulnerable to alternative representations of the closing quote on strings, failing to detect a close quote that HTML or XML say is there. The earlier problem was with "double byte character sets", so people learned to deal with that. But then people were caught off-guard with Unicode Code Point representations of the double-quote, such as via a \u or \ux escape sequence.
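A small illustration of the general point, using character references and CDATA instead of quote handling. The sample document is invented for this example; the three aa elements all carry the text abc as far as XML is concerned, but only one of them is spelled that way in the raw string.

```matlab
% Three equivalent spellings of the same element content (invented sample).
xml_str = ['<root><aa>abc</aa><aa>&#97;bc</aa>' ...
           '<aa><![CDATA[abc]]></aa></root>'];

% Literal pattern in the style of the question: sees only the first form.
hits = regexp(xml_str, '<aa>(abc)</aa>', 'tokens');

% DOM parse through the JAXP classes available in MATLAB's JVM.
factory = javax.xml.parsers.DocumentBuilderFactory.newInstance();
builder = factory.newDocumentBuilder();
source = org.xml.sax.InputSource(java.io.StringReader(xml_str));
document = builder.parse(source);

% Collect the decoded text of every <aa> element.
nodes = document.getElementsByTagName('aa');
vals = cell(1, nodes.getLength());
for k = 1:nodes.getLength()
    vals{k} = char(nodes.item(k-1).getTextContent());
end

fprintf('regexp matches: %d, parsed values equal to abc: %d\n', ...
    numel(hits), nnz(strcmp(vals, 'abc')));
```

The regexp finds one match, while the parser decodes all three elements to abc. Whether this matters depends on how much control you have over the XML being read.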
Walter Roberson on 30 Jul 2012
Sorry, I have never used the parser.

