Date Tags ocaml

The other day I needed a small xml parser to convert an xml document into a different format. First I tried xml-light. This is a simple parser all written in ocaml that stores the parser xml document in an ocaml data structure. This data structure can be user to access various fields of the xml document. It does not offer a dom-like interface, but actually I consider this a feature. Unfortunately xml-light is terribly slow. To parse 30K-plus lines of xml it takes far too long to be considered for my application.

The next logic choice was to try Expat, that is a event-based parser and it is extremely fast. Since using an event based parser can be a bit cumbersome (and I already had written of bit of code using xml-light), I decided to write a small wrapper around expat to provide a xml-light interface to it.

The code is pretty simple and the main idea is taken from the cduce xml loader.

First we provide a small data structure to hold the xml document as we examine it. Nothing deep here. Notice that we use Start and String as we descend the tree and Element we we unwind the stack.

type stack =
  | Element of (Xml.xml * stack)
  | Start of (string * (string * string) list * stack)
  | String of (string * t * stack)
  | Empty
and t =
  | PCData
  | CData

Then we need to provide expat handlers to store xml fragments on the stack as we go down. Note that we have an handler for cdata, but not an handler for pcdata as it is the default.

let pcdata buff = (Xml.PCData buff)
let cdata buff = (Xml.CData buff)

let rec create_elt acc = function
  | String (s,CData, st) -> create_elt (L.push (cdata s) acc) st
  | String (s,PCData, st) -> create_elt (L.push (pcdata s) acc) st
  | Element (x,st) -> create_elt (L.push x acc) st
  | Start (tag,attrlst,st) -> stack := Element(Xml.Element(tag,attrlst,acc),st)
  | Empty -> assert false

let start_cdata_handler () = txt.cdata <- true ;;

let start_element_handler tag attrlst =
  if not (only_ws txt.buffer txt.pos) then begin
    let str = String.sub txt.buffer 0 txt.pos in
    if txt.cdata then
      stack := String (str, CData, !stack)
    else
      stack := String (str, PCData, !stack)
  end
  ;
  txt.pos <- 0;
  txt.cdata <- false;
  stack := Start (tag,attrlst,!stack)
;;

let end_element_handler _ =
  let acc =
    if only_ws txt.buffer txt.pos then L.empty
    else
      let str = String.sub txt.buffer 0 txt.pos in
      if txt.cdata then L.one (cdata str)
      else L.one (pcdata str)
  in
  txt.pos <- 0;
  txt.cdata <- false;
  create_elt acc !stack
;;

let character_data_handler = add_string txt ;;

At the end we just register all handlers with the expat parser and we return the root of the xml document.

let parse_string str =
  let p = Expat.parser_create None in
  Expat.set_start_element_handler p start_element_handler ;
  Expat.set_end_element_handler p end_element_handler ;
  Expat.set_start_cdata_handler p start_cdata_handler ;
  Expat.set_character_data_handler p character_data_handler ;
  ignore (Expat.set_param_entity_parsing p Expat.ALWAYS);
  Expat.parse p str;
  Expat.final p;
  match !stack with
  |Element (x,Empty) -> (stack := Empty; x)
  | _ -> assert false

I’ve copied the xml-light methods and to access the document in a different file. I’ve also made everything lazy to save a bit of computing time if it is only necessary to access a part of a huge xml document.

The complete code can be found here: git clone https://www.mancoosi.org/~abate/repos/xmlparser.git

UPDATE

The other that I was made aware that this parser has a serious bug when used on a 32 bit machine. The problem is that the maximal string size on a 32bit machine is equal to Sys.max_string_length that is roughly 16Mb . If we read and parse a bit document document at once with IO.read_all , we immediately get an exception. The solution is to parse the document incrementally using the new function parser_ch below that get a channel instead of a string and run the expat parser incrementally :

let parser_aux f =
  let p = Expat.parser_create None in
  Expat.set_start_element_handler p start_element_handler ;
  Expat.set_end_element_handler p end_element_handler ;

  Expat.set_start_cdata_handler p start_cdata_handler ;

  Expat.set_character_data_handler p character_data_handler ;
  ignore (Expat.set_param_entity_parsing p Expat.ALWAYS);
  f p;
  match !stack with
  |Element (x,Empty) -> (stack := Empty; x)
  | _ -> assert false

let parse_str str =
  let f p =
    Expat.parse p str;
    Expat.final p;
  in
  parser_aux f

let parse_ch ch =
  let f p =
    try while true do
      Expat.parse p (IO.nread ch 10240)
    done with IO.No_more_input -> Expat.final p ;
  in
  parser_aux f