Thursday, November 7, 2013

Antimirov Derivatives for Regular Expressions

Brzozowski derivatives are one of the shibboleths of functional programming: if you ask someone about implementing regular expressions, and you get back an answer involving derivatives of regular expressions, then you have almost surely identified a functional programmer.

The reason that functional programmers like derivatives so much is that they offer an elegantly algebraic and inductive approach to string matching. The derivative function $\delta_c$ takes a character $c$ and a regular expression $r$, and returns a new regular expression $r'$, which accepts a string $s$ if and only if $r$ accepts $c\cdot s$. This function can be defined as follows: $$ \begin{array}{lcl} \delta_c(c) & = & \epsilon \\ \delta_c(c') & = & \bot \\ \delta_c(\epsilon) & = & \bot \\ \delta_c(r_1 \cdot r_2) & = & \left\{ \begin{array}{ll} \delta_c(r_1) \cdot r_2 \vee \delta_c(r_2) & \mbox{if } r_1 \mbox{ nullable} \\ \delta_c(r_1) \cdot r_2 & \mbox{otherwise} \end{array} \right. \\ \delta_c(\bot) & = & \bot \\ \delta_c(r_1 \vee r_2) & = & \delta_c(r_1) \vee \delta_c(r_2) \\ \delta_c(r\ast) & = & \delta_c(r)\cdot r\ast \end{array} $$

Now we can turn the 1-character derivative into a word derivative as follows: $$ \begin{array}{lcl} \delta_\epsilon(r) & = & r \\ \delta_{c\cdot w}(r) & = & \delta_w(\delta_c(r)) \end{array} $$

Again, this is a simple inductive definition. So to match a string we just call the 1-character derivative for each character, and then check to see if the final regular expression is nullable. However, there's a catch! As you can see from the definition of $\delta_c$, there is no guarantee that the size of the derivative is bounded. So matching a long string can potentially lead to the construction of very large derivatives. This is especially unfortunate, since it is possible (using DFAs) to match strings using regular expressions in constant space.

In his 1962 paper, Brzozowski showed that the number of derivatives of a regular expression is finite, if you quotient by the associativity, commutativity, and idempotence axioms for for union $r_1 \vee r_2$. That is, suppose we consider regular expressions modulo an equivalence relation $\equiv$ such that: $$ \begin{array}{lcl} r \vee r & \equiv & r \\ r \vee \bot & \equiv & r \\ r \vee r' & \equiv & r' \vee r \\ r_1 \vee (r_2 \vee r_3) & \equiv & (r_1 \vee r_2) \vee r_3 \\ \end{array} $$

Then his theorem guarantees that the set of derivatives is finite. However, computing with derivatives up to equivalence is rather painful. Even computing equality and ordering is tricky, since union can occur anywhere in a regular expression, and without that it's difficult to implement higher-level data structures such as sets of regular expressions.

So for the most part, derivatives have remained a minor piece of functional programming folklore: they are cute, but a good implementation of derivatives is not really simple enough to displace the usual Dragon Book automata constructions.

However, in 1995 Valentin Antimirov introduced the notion of a partial derivative of a regular expression. The idea is that if you have a regular expression $r$ and a character $c$, then a partial derivative is a regular expression $r'$ such that if $r'$ accepts a word $s$, then $r$ accepts $c \cdot s$. Unlike a derivative, this is only a if-then relationship, and not an if-and-only-if relationship.

Then, rather than taking regular expressions modulo the ACUI equations, we can construct sets of partial derivatives, which collectively accept the same strings as the Brzozowski derivative. The idea is that we can use the algebraic properties of sets to model the effect of the ACUI equations. Below, I give the partial derivative function: $$ \begin{array}{lcl} \alpha_c(c) & = & \setof{\epsilon} \\ \alpha_c(c') & = & \emptyset \\ \alpha_c(\epsilon) & = & \emptyset \\ \alpha_c(r_1 \cdot r_2) & = & \left\{ \begin{array}{ll} \comprehend{r \cdot r_2}{ r \in \alpha_c(r_1)} \cup \alpha_c(r_2) & \mbox{if } r_1 \mbox{ nullable} \\ \comprehend{r \cdot r_2}{ r \in \alpha_c(r_1)} & \mbox{otherwise} \end{array} \right. \\ \alpha_c(\bot) & = & \emptyset \\ \alpha_c(r_1 \vee r_2) & = & \alpha_c(r_1) \cup \alpha_c(r_2) \\ \alpha_c(r\ast) & = & \comprehend{ r' \cdot r\ast }{ r' \in \alpha_c(r) } \end{array} $$

Partial derivatives can be lifted to words just as ordinary derivatives can be, and it is relatively easy to prove that the set of partial word derivatives of a regular expression is finite. We can show this even without taking a quotient, and so partial derivatives lend themselves even more neatly to an efficient implementation than Brzozowski derivatives do.

I'll illustrate this point by using Antimirov derivatives to construct a DFA-based regular expression matcher in ~50 lines of code. You can find the Ocaml source code here.

First, let's define a datatype for regular expressions.

type re = C of char | Nil | Seq of re * re 
        | Bot | Alt of re * re | Star of re

So C is the constructor for single-character strings, Nil and Seq correspond to $\epsilon$ and $r\cdot r'$, and Bot and Alt correspond to $\bot$ and $r \vee r'$.

Next, we'll define the nullability function, which returns true if the regular expression accepts the empty string, and false if it doesn't. It's the obvious recursive definition.

let rec null = function
  | C _ | Bot -> false
  | Nil | Star _ -> true
  | Alt(r1, r2) -> null r1 || null r2
  | Seq(r1, r2) -> null r1 && null r2
Now come some scaffolding code. The R module is the type of finite sets of regular expressions, the M module are finite maps keyed by sets of regexps, and I is the type of finite sets of ints. The rmap function maps a function over a set of regexps, building a new regexp. (I don't know why this is not in the standard Set signature.)
module R = Set.Make(struct type t = re let compare = compare end)
let rmap f rs = R.fold (fun r -> R.add (f r)) rs R.empty

module M = Map.Make(R)
module I = Set.Make(struct type t = int let compare = compare end)
The aderiv function implements the Antimirov derivative function $\alpha_c$ described above. It's basically just a direct transcription of the mathematical definition into Ocaml. The deriv function applies the derivative to a whole set of regular expressions and takes the union.
let rec aderiv c = function
  | C c' when c = c' -> R.singleton Nil 
  | C _ | Nil | Bot -> R.empty
  | Alt(r, r') -> R.union (aderiv c r) (aderiv c r')
  | Seq(r1, r2) -> R.union (rmap (fun r1' -> Seq(r1', r2)) (aderiv c r1))
                           (if null r1 then aderiv c r2 else R.empty)
  | Star r -> rmap (fun r' -> Seq(r', Star r)) (aderiv c r)

let deriv c rs = R.fold (fun r acc -> R.union (aderiv c r) acc) rs R.empty

Since the set of partial derivatives is finite, this means that the powerset of this set is also finite, and so by treating sets of partial derivatives as states, we can construct a deterministic finite-state automaton for matching regular expressions. Let's define an Ocaml type for DFAs:

type dfa = {size : int; fail : int; 
            trans : (int * char * int) list; final : int list}

Here, size is the number of states, and we'll use integers in the range [0,size) to label the states. We'll use fail to label the sink state for non-matching strings, and take trans to be the list of transitions. The final field is a list of accepting states for the DFA.

Now, we'll need a little more scaffolding. The enum function is a "functional for-loop" looping from i to max. We use this to write charfold, which lets us fold over all of the ASCII characters.

let rec enum f v i max = if i < max then enum f (f i v) (i+1) max else v
let charfold f init  = enum (fun i -> f (Char.chr i)) init 0 256

We use this to define the dfa function, which constructs a DFA from a regular expression:

let find rs (n,m) = try M.find rs m, (n,m) with _ -> n, (n+1, M.add rs n m)

let dfa r =
  let rec loop s v t f rs =
    let (x, s) = find rs s in
    if I.mem x v then (s, v, t, f)
    else charfold (fun c (s, v, t, f) ->
                     let rs' = deriv c rs in
                     let (y, s) = find rs' s in
                     loop s v ((x,c,y) :: t) f rs')
           (s, I.add x v, t, if R.exists null rs then x :: f else f) in
  let (s, v, t, f) = loop (0, M.empty) I.empty [] [] (R.singleton r) in
  let (fail, (n, m)) = find R.empty s in 
  { size = n; fail = fail; trans = t; final = f }

The find function takes a set of regular expression and returns a numeric index for it. To do this, it uses a state (n, m), where n is a counter for a gensym, and m is the map we use to map sets of regular expressions to their indices.

The main work happens in the loop function. The s parameter is the state parameter for find, and v is the visited set storing the set of states we have previously visited. The t parameter is the list of transitions built to date, and f are the final states generated so far.

The rs parameter is the current set of regular expressions. We first look up its index x, and if we have visited it, we return. Otherwise, we add the current state to the visited set (and to the final set if any of its elements are nullable), and iterate over the ASCII characters. For each character c, we can take the derivative of rs, and find its index y. Then, we can add the transition (x, c, y) to t, and loop on the derivative. Essentially, this does a depth-first search of the spanning tree.

We then kick things off with empty states and the singleton set of the argument regexp r, and build a DFA from the return value. Note that the failure state is simply the index corresponding to the empty set of partial derivatives. The initial state will always be 0, since the first state we find will be the singleton set r.

Since we have labelled states by integers from 0 to size, we can easily build a table-based matcher from our dfa type.

type table = { m : int array array; accept : bool array; error : int }

let table d = 
  { error = d.fail;
    accept = Array.init d.size (fun i -> List.mem i d.final);
    m = (let a = Array.init d.size (fun _ -> Array.make 256 0) in
         List.iter (fun (x, c, y) -> a.(x).(Char.code c) <- y) d.trans; a) }

Here, the table type has a field m for the transition table, an array of booleans indicating whether the state accepts or not, and the error state. We build the table in the obvious way in the table function, by building an array of array of integers and initializing it using the list of transitions.

Then, we can give the matching function. The matches' function takes a table t, a string s, an index i, and a current state x. It will just trundle along updating its state, as long as we have not yet reached the end of the string (or hit the error state), and then it will report whether it ends in an accepting state or not. The re_match function just calls matches' at index 0, in state 0.

let rec matches' t s i x =
  if i < String.length s && x != t.error 
  then matches' t s (i+1) t.m.(x).(Char.code s.[i])
  else t.accept.(x)

let re_match t s = matches' t s 0 0

And that's it! Obviously more work is needed to make this a real regexp implementation, but the heart of the story is here. It's also possible to improve this implementation a bit further by adding a DFA minimization step, but that's a tale for another day. (DFA minimization is a very pretty little coinductive algorithm, and so makes a nice companion to the inductive story here.)

6 comments:

  1. Thanks for the article, I have read about and implemented Brzozowski derivatives but have not seen Antimirov's work.

    Do you have any comments on "Regular expression derivatives re-examined"? They implement the reduction relation for Brzozowski as "smart constructors" which does not seem to take that much more code than your presentation. It just seems that their approach requires some care in defining the comparison function. Are there other drawbacks, besides proving that the set is finite being harder?

    Also - do you have any pointers on how to implement a recognizer rather than just a matcher? With parser combinator-like interface. I tried an approach based on threading "evidence transforms" through Brzozowski normalizazer and derivative functions, which seems quite cumbersome but can work - was wondering you would have any literature pointers for this. Thanks!

    ReplyDelete
  2. My experience has been that programming the smart constructors ends up taking more code than it really seems like it should. It's possible I've done it in an inelegant way, though.

    By recognizer, do you mean things like group captures?

    ReplyDelete
  3. Yes, ideally to fit this interface - https://github.com/t0yv0/ocaml-regex-algebraic/blob/master/regex.mli

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Rather than working up to equivalence wrt the ACUI equations, I think it is quite simple to compute a canonical regular expression for each equivalence class. Instead of having a binary ∨, you use an n-ary ∨ and keep the invariant that ∨-nodes do not have ∨ children. Then to canonicalize a regular expression, you first canonicalize its subexpressions, then if it is a ∨-node, you sort it by some arbitrary metric and remove duplicates and ⊥'s.

    ReplyDelete