The other day I needed a small xml parser to convert an xml document
into a different format. First I tried xml-light. This is a simple
parser all written in ocaml that stores the parser xml document in an
ocaml data structure. This data structure can be user to access
various fields of the xml document. It does not offer a dom-like
interface, but actually I consider this a feature. Unfortunately
xml-light is terribly slow. To parse 30K-plus lines of xml it takes
far too long to be considered for my application.
The next logic choice was to try Expat, that is a event-based parser
and it is extremely fast. Since using an event based parser can be a
bit cumbersome (and I already had written of bit of code using
xml-light), I decided to write a small wrapper around expat to provide
a xml-light interface to it.
The code is pretty simple and the main idea is taken from the cduce
xml loader.
First we provide a small data structure to hold the xml document as we
examine it. Nothing deep here. Notice that we use Start
and
String
as we descend the tree and Element
we we unwind the stack.
type stack =
| Element of (Xml.xml * stack)
| Start of (string * (string * string) list * stack)
| String of (string * t * stack)
| Empty
and t =
| PCData
| CData
Then we need to provide expat handlers to store xml fragments on the
stack as we go down. Note that we have an handler for cdata, but not
an handler for pcdata as it is the default.
let pcdata buff = (Xml.PCData buff)
let cdata buff = (Xml.CData buff)
let rec create_elt acc = function
| String (s,CData, st) -> create_elt (L.push (cdata s) acc) st
| String (s,PCData, st) -> create_elt (L.push (pcdata s) acc) st
| Element (x,st) -> create_elt (L.push x acc) st
| Start (tag,attrlst,st) -> stack := Element(Xml.Element(tag,attrlst,acc),st)
| Empty -> assert false
let start_cdata_handler () = txt.cdata <- true ;;
let start_element_handler tag attrlst =
if not (only_ws txt.buffer txt.pos) then begin
let str = String.sub txt.buffer 0 txt.pos in
if txt.cdata then
stack := String (str, CData, !stack)
else
stack := String (str, PCData, !stack)
end
;
txt.pos <- 0;
txt.cdata <- false;
stack := Start (tag,attrlst,!stack)
;;
let end_element_handler _ =
let acc =
if only_ws txt.buffer txt.pos then L.empty
else
let str = String.sub txt.buffer 0 txt.pos in
if txt.cdata then L.one (cdata str)
else L.one (pcdata str)
in
txt.pos <- 0;
txt.cdata <- false;
create_elt acc !stack
;;
let character_data_handler = add_string txt ;;
At the end we just register all handlers with the expat parser and we
return the root of the xml document.
let parse_string str =
let p = Expat.parser_create None in
Expat.set_start_element_handler p start_element_handler ;
Expat.set_end_element_handler p end_element_handler ;
Expat.set_start_cdata_handler p start_cdata_handler ;
Expat.set_character_data_handler p character_data_handler ;
ignore (Expat.set_param_entity_parsing p Expat.ALWAYS);
Expat.parse p str;
Expat.final p;
match !stack with
|Element (x,Empty) -> (stack := Empty; x)
| _ -> assert false
I’ve copied the xml-light methods and to access the document in a
different file. I’ve also made everything lazy to save a bit of
computing time if it is only necessary to access a part of a huge xml document.
The complete code can be found here: git clone
https://www.mancoosi.org/~abate/repos/xmlparser.git
UPDATE
The other that I was made aware that this parser has a serious bug
when used on a 32 bit machine. The problem is that the maximal string
size on a 32bit machine is equal to Sys.max_string_length that is
roughly 16Mb . If we read and parse a bit document document at once
with IO.read_all
, we immediately get an exception. The
solution is to parse the document incrementally using the new function
parser_ch
below that get a channel instead of a string and run
the expat parser incrementally :
let parser_aux f =
let p = Expat.parser_create None in
Expat.set_start_element_handler p start_element_handler ;
Expat.set_end_element_handler p end_element_handler ;
Expat.set_start_cdata_handler p start_cdata_handler ;
Expat.set_character_data_handler p character_data_handler ;
ignore (Expat.set_param_entity_parsing p Expat.ALWAYS);
f p;
match !stack with
|Element (x,Empty) -> (stack := Empty; x)
| _ -> assert false
let parse_str str =
let f p =
Expat.parse p str;
Expat.final p;
in
parser_aux f
let parse_ch ch =
let f p =
try while true do
Expat.parse p (IO.nread ch 10240)
done with IO.No_more_input -> Expat.final p ;
in
parser_aux f