The theory of finite automata and regular expressions is the essential ingredient of standard lexical-analysis tools, typically called lexers.
Briefly, a program processed by a compiler goes through the following phases: lexical analysis (the stream of characters is turned into a stream of tokens), syntactic analysis/parsing (the stream of tokens is turned into a syntax tree), semantic analysis (e.g. type checking), and finally code generation and optimization.
For the time being, we will only concern ourselves with the lexical analysis portion. We will take a glimpse at the syntactic analysis/parsing portion when we cover context-free grammars.
A lexer expects as input a file in a very specific form, which differs from language to language but in general shares some common features. Here is what a sample file might look like for ocamllex (this file resides in ocaml/parsing/lexer.mll):
{
open Lexing
exception Eof
type tokens = INT of int
| FLOAT of float
| LPAREN
| RPAREN
| PLUS
| MINUS
| TIMES
| DIVIDE
let print_token tk =
match tk with
INT i -> "INT " ^ string_of_int i
| FLOAT f -> "FLOAT " ^ string_of_float f
| LPAREN -> "LPAREN"
| RPAREN -> "RPAREN"
| PLUS -> "PLUS"
| MINUS -> "MINUS"
| TIMES -> "TIMES"
| DIVIDE -> "DIVIDE"
}
let digit = ['0'-'9']
let int = '-'? digit+
let frac = '.' digit*
let exp = ['e' 'E'] ['-' '+']? digit+
let float = digit* frac? exp?
let white = [' ' '\t']+
let newline = '\r' | '\n' | "\r\n"
rule read = parse
| white { read lexbuf }
| newline { read lexbuf }
| int as i { INT (int_of_string i) }
| float as f { FLOAT (float_of_string f) }
| '(' { LPAREN }
| ')' { RPAREN }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIVIDE }
| eof { raise Eof }
These files typically look like OCaml files, but with some specific and different syntax rules.
In this case the first section of the file, enclosed in braces, is called the preamble. We can set up some initial instructions here. For us these are opening the Lexing module so we have easy access to its values (i.e. we can sometimes write foo instead of Lexing.foo), and declaring the type of the tokens we want to produce (we can of course call the type anything we want). I have also included a function that produces a string representation of a token, for printing purposes. Both this function and the type declaration could have gone into a separate ml or mli file instead, and then been loaded in.
Following the preamble is a series of let statements that provide shortcuts for regular expressions. The right-hand side of these let statements is not normal OCaml code, but a regular expression following the description in section 12.2.4 of the manual. You can use these bindings in subsequent let statements.
Following that is a rule, specifying a function, in this case called read, that receives the input stream as its argument. That function pattern-matches the input against the various regular expressions; in braces we describe what should happen in each case (the cases should all return the same type, in our case tokens). Two kinds of cases deserve note: the white and newline cases do not return a token but instead call read recursively, effectively skipping the matched whitespace, while the eof case raises the Eof exception to signal the end of the input.
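As a rough illustration of how the rule behaves, here is a hand-written plain-OCaml sketch of the same tokenizer over a string (floats are omitted for brevity, and the read/tokenize functions here are made up for the example; this is not the code ocamllex generates):

```ocaml
(* Hand-written sketch of the read rule, working directly on a string. *)
exception Eof

type tokens =
  | INT of int
  | LPAREN
  | RPAREN
  | PLUS
  | MINUS
  | TIMES
  | DIVIDE

let is_digit c = c >= '0' && c <= '9'

(* Read one token starting at index i, returning it with the next index.
   As in the ocamllex rule, whitespace does not produce a token: we just
   recurse on the rest of the input. *)
let rec read (s : string) (i : int) : tokens * int =
  if i >= String.length s then raise Eof
  else
    match s.[i] with
    | ' ' | '\t' | '\r' | '\n' -> read s (i + 1) (* like { read lexbuf } *)
    | '(' -> (LPAREN, i + 1)
    | ')' -> (RPAREN, i + 1)
    | '+' -> (PLUS, i + 1)
    | '-' -> (MINUS, i + 1)
    | '*' -> (TIMES, i + 1)
    | '/' -> (DIVIDE, i + 1)
    | c when is_digit c ->
        (* consume the maximal run of digits, like the int pattern *)
        let j = ref (i + 1) in
        while !j < String.length s && is_digit s.[!j] do incr j done;
        (INT (int_of_string (String.sub s i (!j - i))), !j)
    | c -> failwith (Printf.sprintf "unexpected character %c" c)

(* Collect all tokens until Eof. *)
let tokenize (s : string) : tokens list =
  let rec go i acc =
    match read s i with
    | tk, j -> go j (tk :: acc)
    | exception Eof -> List.rev acc
  in
  go 0 []
```

For example, tokenize "(1 + 23) * 4" yields [LPAREN; INT 1; PLUS; INT 23; RPAREN; TIMES; INT 4].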
We “compile” this set of instructions:
$ ocamllex lexer.mll
16 states, 354 transitions, table size 1512 bytes
And we get a file lexer.ml that tends not to be very readable, but contains instructions for carrying out the produced DFA. As you can see, ocamllex also reports some information about that DFA.
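To get a feel for what such a DFA does, here is a toy hand-written automaton for the int shortcut '-'? digit+ (an illustration only; ocamllex encodes its states and transitions as compact tables, not as an OCaml variant type):

```ocaml
(* A toy DFA for the pattern '-'? digit+. *)
type state = Start | Sign | Digits | Reject

(* One transition of the automaton. *)
let step (st : state) (c : char) : state =
  match st, c with
  | Start, '-' -> Sign
  | (Start | Sign | Digits), '0' .. '9' -> Digits
  | _ -> Reject

(* Run the DFA over the whole string; Digits is the only accepting state. *)
let accepts (s : string) : bool =
  let n = String.length s in
  let rec go st i = if i = n then st else go (step st s.[i]) (i + 1) in
  go Start 0 = Digits
```

Here accepts "42" and accepts "-7" are true, while accepts "-", accepts "3a", and accepts "" are false, since a bare sign, a trailing letter, or an empty string never reach the accepting state.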
You can see a short execution of this lexer by looking into the driver1.ml
file, and compiling and running it.