Fast, StAX-like XML Parser for BEAM Languages
Unlike SAX or DOM parsing of XML, which force the user to handle everything at once, this parser lets the user consume events from a stream whenever it suits them. Simply call next_event on the stream.
This means that the user can parse multiple streams from the same process at the same time.
It works like an iterator on any set or list-like type, but returns XML events instead.
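A minimal sketch of that iterator-style loop. It assumes what the larger example further down shows: yaccety_sax:next_event/1 returns {Event, NewState}, events are maps with a type key, and the final event has type endDocument. The module name ys_fold and both functions are hypothetical helpers, not part of yaccety_sax; the step function is passed in as an argument so the loop can be tried on any stream-like state.

```erlang
-module(ys_fold).
-export([fold_events/4, count_by_type/2]).

%% Pull events one at a time until endDocument, folding Fun over them.
%% Next is the step function, e.g. fun yaccety_sax:next_event/1
%% (assumed to return {Event, NewState}).
fold_events(Next, Fun, Acc, State) ->
    {Event, State1} = Next(State),
    Acc1 = Fun(Event, Acc),
    case Event of
        #{type := endDocument} -> Acc1;
        _ -> fold_events(Next, Fun, Acc1, State1)
    end.

%% Count how many events of each type the stream produced.
count_by_type(Next, State) ->
    Count = fun(#{type := Type}, Counts) ->
        maps:update_with(Type, fun(N) -> N + 1 end, 1, Counts)
    end,
    fold_events(Next, Count, #{}, State).
```

With the real parser this would presumably be called as ys_fold:count_by_type(fun yaccety_sax:next_event/1, yaccety_sax:stream(Bin)). Because each stream state is just a value, several such folds can be interleaved in one process.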
yaccety_sax is a namespace-aware, non-validating XML 1.0 parser.
- Accepts only UTF-8 encoding directly.
- It does check DTDs for well-formedness.
- It can read and parse external DTDs (when given a callback function to retrieve the DTD).
- Returns the parsed DTD as an event to the user for custom validation if needed.
- Comments, processing instructions, and whitespace text nodes are all optionally ignorable.
- All parsed events report text as UTF-8 binaries (no character lists).
Still to do:
- Handling of common security risks known from parsing DTDs
- Content skipping - skipping entire content of an element
- Event writing - Serializing event streams back into XML
- YACC-like syntax for setting up complex, selective parsers.
- Validation from DTD
- XML 1.1 ??
- Others? add an issue to the repo :-)
Chances are that when parsing XML from some REST API, you won't need a lot of the features yaccety has.
This is what yaccety_sax_simple is for.
It works mostly in the same way as the full version, except for:
- Smaller, tuple events
- Comments are ignored
- Whitespace text nodes are ignored
- Processing-instructions are not allowed
- DTDs are not allowed
- The entire UTF-8 XML is passed to yaccety_sax_simple:string/1 without a continuation function
    kinda_equal(Filename1, Filename2) ->
        % UTF-16 file with external DTD and full of whitespace nodes
        {Cont, Init} = ys_utils:trancoding_file_continuation(Filename1),
        LhState = yaccety_sax:stream(Init, [
            {whitespace, false},
            {comments, false},
            {proc_inst, false},
            {continuation, {Cont, <<>>}},
            {base, filename:dirname(Filename1)},
            {external, fun ys_utils:external_file_reader/2}
        ]),
        % Start Document event
        {_, LhState1} = yaccety_sax:next_event(LhState),
        % DTD event
        {_, LhState2} = yaccety_sax:next_event(LhState1),

        % UTF-8 file with no DTD or whitespace nodes
        % Could have streamed this file as well...
        {ok, Bin2} = file:read_file(Filename2),
        RhState = yaccety_sax:stream(Bin2),
        % Start Document event
        {_, RhState1} = yaccety_sax:next_event(RhState),

        % Now both streams are in a comparable state, so diff them
        diff_loop(LhState2, RhState1).

    diff_loop(LhState, RhState) ->
        {LhEvent, LhState1} = yaccety_sax:next_event(LhState),
        {RhEvent, RhState1} = yaccety_sax:next_event(RhState),
        #{type := EventType} = LhEvent,
        % Some function that checks equality, maybe ignoring
        % namespaces or prefixes or something.
        case equal_enough(LhEvent, RhEvent) of
            true when EventType =:= endDocument -> true;
            true -> diff_loop(LhState1, RhState1);
            false -> false
        end.
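The equal_enough/2 helper in the example above is left open. One possible sketch follows, under loud assumptions: events are maps (as the #{type := EventType} match suggests), and the keys returned by ignored_keys/0 are hypothetical placeholders, not confirmed yaccety_sax event fields. Swap in whatever fields your comparison should ignore. The module name ys_diff_util is also made up for illustration.

```erlang
-module(ys_diff_util).
-export([equal_enough/2]).

%% Hypothetical key names, NOT taken from the yaccety_sax event format.
%% Adjust to the fields that should not influence the comparison.
ignored_keys() -> [prefix, namespaces].

%% Two map events are "equal enough" when they match exactly
%% after the ignored keys have been dropped from both.
equal_enough(Lh, Rh) when is_map(Lh), is_map(Rh) ->
    Keys = ignored_keys(),
    maps:without(Keys, Lh) =:= maps:without(Keys, Rh);
%% Anything that is not a map falls back to strict equality.
equal_enough(Lh, Rh) ->
    Lh =:= Rh.
```

This only drops top-level keys. If prefixes are nested inside other event fields, the comparison would need to normalize those values as well.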
Just-for-fun parsing a 5.2 GB Wiki abstract dump with a callback that throws away all events:
- There are 113,593,892 elements in the file.
- yaccety_sax takes around 5 minutes on my machine.
- xmerl_sax_parser with default settings is still running...
- xmerl_sax_parser with a larger buffer in the continuation function takes around 12 minutes.
Another big difference is that the xmerl process held onto about 42 MB by the end of parsing. yaccety never went above 109 KB.
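For anyone who wants to try a similar measurement, here is a rough sketch of a harness. It is not the code used for the numbers above: it pulls events in a loop instead of using a callback, and it reports memory once at the end instead of tracking a peak. The module ys_bench and its functions are hypothetical; the step function is passed in and is assumed to behave like yaccety_sax:next_event/1, returning {Event, NewState} with a final endDocument event.

```erlang
-module(ys_bench).
-export([drain/2, measure/2]).

%% Pull and discard events until endDocument; returns the event count.
drain(Next, State) ->
    drain(Next, State, 0).

drain(Next, State, N) ->
    case Next(State) of
        {#{type := endDocument}, _} -> N + 1;
        {_Event, State1} -> drain(Next, State1, N + 1)
    end.

%% Time the drain and report the calling process's memory afterwards.
measure(Next, State) ->
    {Micros, Count} = timer:tc(fun() -> drain(Next, State) end),
    {memory, Bytes} = erlang:process_info(self(), memory),
    #{events => Count, seconds => Micros / 1000000, memory_bytes => Bytes}.
```

A call would look something like ys_bench:measure(fun yaccety_sax:next_event/1, State), where State comes from yaccety_sax:stream/2 with a file continuation.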
I didn't attempt to use xmerl_scan on the 5.2 GB file. I'm not sure it's a good idea to try.
I'm sure there are other parsers out there that stream-parse large data. It would be cool to see how all of them react.
Anyone who has seen The Benny Hill Show knows the song that inspired the name of the repo: Yakety Sax.