How to add a custom lexer to camlp4

Adding a custom parser in the old camlp4 (now camlp5) was relatively easy. The new camlp4 is quite different. The problem was discussed in two recent threads on the OCaml mailing list, here and here.

The main point is to provide a new Lexer module whose signature is compatible with that of the Camlp4 lexer.

Three camlp4 modules have to be defined, namely Loc, Token and Error. The signature needed to redefine a camlp4 lexer is:

open Camlp4.Sig

type token =
  | KWD of string
  | CHAR of char
  | EOI

exception Error of int * int * string

module Loc   : Loc with type t = int * int
module Token : Token with module Loc = Loc and type t = token
module Error : Error

val mk : unit -> (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t)
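
To give a feel for what sits behind such a signature, here is a minimal stand-alone sketch. It does not use Camlp4 at all: it only reuses the token type, the (start, end) integer locations and the Error exception from the signature above. The function name tokenize, the set of operator characters and the use of a plain list instead of a Stream are assumptions made for illustration; this is not the lexer from the linked code.

```ocaml
(* Stand-alone sketch: same token and location shapes as the signature
   above, but no dependency on Camlp4. *)
type token =
  | KWD of string
  | CHAR of char
  | EOI

exception Error of int * int * string

(* Assumed convention: a location is (start offset, end offset). *)
type loc = int * int

(* Assumption: these characters act as regular-expression operators. *)
let is_operator c = String.contains "*+?|()" c

(* Turn a string into a list of located tokens, ending with EOI. *)
let tokenize (s : string) : (token * loc) list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev ((EOI, (n, n)) :: acc)
    else
      match s.[i] with
      | '\\' ->
        (* A backslash escapes the next character. *)
        if i + 1 >= n then raise (Error (i, i + 1, "dangling backslash"))
        else go (i + 2) ((CHAR s.[i + 1], (i, i + 2)) :: acc)
      | c when is_operator c ->
        go (i + 1) ((KWD (String.make 1 c), (i, i + 1)) :: acc)
      | c -> go (i + 1) ((CHAR c, (i, i + 1)) :: acc)
  in
  go 0 []
```

A real camlp4 lexer would produce a stream of (Token.t * Loc.t) pairs from a char stream, as in the type of mk; the list is used here only to keep the sketch independent of the Stream module.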

I still don't understand enough about the camlp4 internals to comment on them. I've reused the cduce lexer as a starting point and added a small lexer for regular expressions.

The complete code (lexer + parser) is here.


Comments

A few comments on the above method

Thanks for the code. Here are a few remarks I hope will be useful for others replacing Camlp4's lexer by their own.

I find it easier to use Camlp4's own Loc module rather than redefining it entirely:

module Loc = Camlp4.Struct.Loc
module Lexer = My_lexer.Make(Loc)
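
The pattern behind these two lines, a lexer written as a functor over the location module, can be sketched without Camlp4 as follows. The LOC signature below is a hypothetical, much-reduced stand-in for Camlp4.Sig.Loc, Simple_loc is a hypothetical stand-in for Camlp4.Struct.Loc, and describe is a made-up helper that only shows the functor body using the location module it was given.

```ocaml
(* Hypothetical, much-reduced stand-in for a location signature. *)
module type LOC = sig
  type t
  val ghost : t
  val to_string : t -> string
end

(* Stand-in location module; in the comment above this role is played
   by Camlp4.Struct.Loc. *)
module Simple_loc : LOC with type t = int * int = struct
  type t = int * int
  let ghost = (-1, -1)
  let to_string (a, b) = Printf.sprintf "%d-%d" a b
end

(* The lexer is a functor: it never builds its own location type, it
   just re-exports the one it receives. *)
module Make (Loc : LOC) = struct
  module Loc = Loc

  type token = IDENT of string | EOI

  (* Render a located token, using the supplied Loc for the position. *)
  let describe (tok, loc) =
    let name = match tok with IDENT s -> "IDENT " ^ s | EOI -> "EOI" in
    name ^ " at " ^ Loc.to_string loc
end

module Lexer = Make (Simple_loc)
```

Swapping the location implementation then only means changing the functor argument; the lexer body stays untouched.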

I found the definition of Token to be a tricky one because the "token" type should be visible to Token, Lexer, and the parser. I got rid of the Token functor and defined Token inside the lexer itself, which looks as follows:
module Make (Loc : Camlp4.Sig.Loc) = struct
  module Loc = Loc
  type token = KEYWORD of string | SYMBOL of string | IDENT of string | INT | ...

  module Token = struct
    module Filter = struct
      ...
      let keyword_conversion tok is_kwd =
        match tok with
        | (SYMBOL s | IDENT s) when is_kwd s -> KEYWORD s
        | _ -> tok
      ...
    end
  end
end

Note the declaration of keyword_conversion. This function is to be called by "filter":
let filter x =
  let f tok loc =
    let tok' = keyword_conversion tok x.is_kwd in
    (tok', loc)
  in
  ...

This allows camlp4 to translate symbols or identifiers to keywords, so you can write:
[ "if"; expr; "then"; expr; "else"; expr ]
instead of:
[ `KWD "if"; expr; `KWD "then"; expr; `KWD "else"; expr ]
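
To make the keyword conversion concrete, here is a small stand-alone sketch that runs without Camlp4. The token constructors and their payloads, the filter_state record and the list-based filter are assumptions chosen for illustration; in the real lexer the filter works on a token stream and the is_kwd predicate is supplied by camlp4 from the keywords used in the grammar.

```ocaml
(* Assumed token type: only the constructors needed for the example. *)
type token =
  | KEYWORD of string
  | SYMBOL of string
  | IDENT of string
  | INT of int

(* Promote a symbol or identifier to a keyword when is_kwd accepts it. *)
let keyword_conversion tok is_kwd =
  match tok with
  | (SYMBOL s | IDENT s) when is_kwd s -> KEYWORD s
  | _ -> tok

(* Hypothetical filter state: just the keyword predicate. *)
type filter_state = { is_kwd : string -> bool }

(* List-based stand-in for the stream filter: convert each token and
   keep its location unchanged. *)
let filter x toks =
  List.map (fun (tok, loc) -> (keyword_conversion tok x.is_kwd, loc)) toks
```

With a predicate that accepts "if", "then" and "else", an IDENT "if" coming out of the lexer reaches the parser as KEYWORD "if", while IDENT "x" passes through unchanged.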