Monday, August 6, 2012

One of the most practically useful features of functional programming languages is pattern matching, and one of the most important quality-of-implementation features for an implementation is the exhaustiveness checking for pattern matches. In this post, I will give the simplest algorithm for coverage checking that I know.

I previously gave one in my paper Focusing on Pattern Matching. I was never completely satisfied with that algorithm, since it was harder to understand than I liked. It worked by giving a judgment to prove that no match would fail, rather than directly proving that all matches covered. The negation made thinking about extensions (such as pattern matching for dependent types) harder than it should be, and it also made it harder to get a match compilation algorithm.

So in this blog post, I'll give a better algorithm. It's super-simple, and is about as naive as it should be -- there are only six rules in total. It's not totally industrial strength, but (a) it's certainly good enough to use in a prototype compiler, and (b) it's easy enough to beef up.


First, I'll give the syntax. $$\begin{array}{llcl} \mbox{Types} & A & ::= & 1 \bnfalt A \times B \bnfalt A + B \bnfalt A \to B \bnfalt o \\ \mbox{Patterns} & p & ::= & () \bnfalt (p, p) \bnfalt \inl(p) \bnfalt \inr(p) \bnfalt \_ \bnfalt x \\ \mbox{Pattern Lists} & ps & ::= & \cdot \bnfalt p, ps \\ \mbox{Branch Lists} & \Pi & ::= & \cdot \bnfalt ps \To e; \Pi \\ \mbox{Contexts} & \Gamma & ::= & \cdot \bnfalt u:A, \Gamma \end{array}$$

The types are sums and products, plus a type $o$ to indicate functions and other non-matchable types. The patterns are just what you'd expect -- $()$ for the unit pattenrn, $(p, p')$ for pairs, and $\inl(p)$ and $\inr(p)$ for the left and right branches of sums. (Of course, there is no pattern for the empty type!) We also have wildcard $\_$ and variable $x$ patterns.

Since our algorithm works inductively, we will destructure patterns into their components, and we will need to keep track of lists of patterns. A pattern list (with metavariable $ps$) is just the obvious thing --- a comma-separated list of patterns.

We also need branch lists -- a pattern match clause consists of a list of of patterns and expressions. So we use branch lists to say what body is associated with each pattern list (a pattern list, since we are going to destructure the pattern given for each arm).

We will also write $\overrightarrow{ps \To e}$ to indicate branch lists. If I write a list like $\overrightarrow{(p_1, p_2), ps \To e}$, then I mean that every element of the branch list starts with a pair pattern, and similarly for the other constructors.

This algorithm keeps track of the type of the value being destructured, so we introduce a typed context $\Gamma$ to keep track of the types of the expressions being destructured.

Without further delay, here's the coverage judgment $\cover{\Pi}{\Gamma}$. You can read this as algorithm bottom-up, in the usual logic-programming style. $$\array{ \rule{ \cover{\overrightarrow{ps' \To e'}}{\Gamma} } { \cover{\overrightarrow{\_, ps' \To e'}}{u:A, \Gamma} } \\[1em] \rule{ \cover{ \overrightarrow{ps' \To e'},\;\; (\_, ps \To e), \overrightarrow{ps'' \To e''} }{u:A, \Gamma} } { \cover{ \overrightarrow{ps' \To e'},\;\; (x, ps \To e), \overrightarrow{ps'' \To e''} }{u:A, \Gamma} } \\[1em] \rule{ \cover{ \overrightarrow{ps \To e},\;\; \overrightarrow{ps' \To e'} }{\Gamma}} { \cover{ \overrightarrow{(), ps \To e},\;\; \overrightarrow{\_, ps' \To e'} }{u:1, \Gamma}} \\[1em] \rule{ \cover{ \overrightarrow{p_1, p_2, ps \To e},\;\; \overrightarrow{\_, \_, ps' \To e'} }{a:A, b:B, \Gamma}} { \cover{ \overrightarrow{(p_1,p_2), ps \To e},\;\; \overrightarrow{\_, ps' \To e'} }{u:A \times B, \Gamma}} \\[1em] \rule{ } { \cover{ \overrightarrow{\_, ps'' \To e''} }{u:0, \Gamma}} \\[1em] \rule{ \cover{ \overrightarrow{p, ps \To e},\;\; \overrightarrow{\_, ps'' \To e''}}{a : A, \Gamma} \\ \cover{ \overrightarrow{p', ps' \To e'},\;\; \overrightarrow{\_, ps'' \To e''}}{b : B, \Gamma}} { \cover{ \overrightarrow{\inl(p), ps \To e},\;\; \overrightarrow{\inr(p'), ps' \To e'},\;\; \overrightarrow{\_, ps'' \To e''} } {u:A + B, \Gamma}} }$$

The first two rules handle wildcards and variables. The first rule says that if the lead pattern of every pattern list in the branch set is a wildcard, we can just drop it and keep going. The second rule says that you can turn any lead variable pattern into a wildcard pattern.

The third rule handles the unit pattern. It is nice and easy --- it says that if every lead pattern is either a unit or wildcard pattern, we can drop it and keep going.

The fourth rule handles tuples, and is almost as easy. It says that if we have a lead pattern $(p_1, p_2), ps \To e$, we can rewrite it to $p_1, p_2, ps \To e$. We also have to remember to double any lead wildcard pattern $\_, ps' \To e'$ into $\_, \_, ps' \To e'$, since we are destructuring a product into two subpatterns.

The fifth rule handles the void type. It just succeeds, since there are no patterns of empty type!

The sixth rule handles sums. It says that if we are scrutinizing a value of type $A + B$, then we can separate the branch list into those branches handling the left case $\inl(p), ps \To e$ and those branches handling the right case $\inr(p'), ps' \To e'$, and then check then at $A$ and $B$ respectively. Here, we have to remember to send wildcards $\_, ps'' \To e''$ to both sides, since a wildcard can match both the left- and right-cases.

And that's it!

The open problem in proof theory is a bit of slight of hand in my notation. You understand exactly what I mean when I take a meta-pattern like $\overrightarrow{(p_1, p_2), ps \To e}$, and turn it into $\overrightarrow{p_1, p_2, ps \To e}$. However, formalizing this is trickier than it looks, and it's not known how to give a proof-theoretic characterization of this kind of pattern (according to Rob Simmons, who I asked early this year about it).

As a bonus, I'll also show how coverage checking can also give you a compilation algorithm for pattern matching. The judgment $\elaborate{\Pi}{\Gamma}{t}$ tells you how to turn a list of branches $\Pi$ into a nested sequence of tuple destructurings and case-statements. $$\array{ \rule{ \elaborate{\overrightarrow{ps' \To e'}}{\Gamma}{t} } { \elaborate{\overrightarrow{\_, ps' \To e'}}{u:A, \Gamma}{t} } \\[1em] \rule{ \elaborate{\overrightarrow{ps' \To e'}}{\Gamma}{t} } { \elaborate{\overrightarrow{\_, ps' \To e'}}{u:A, \Gamma}{t} } \\[1em] \rule{ \elaborate{ \overrightarrow{ps' \To e'},\;\; (\_, ps \To \letv{x}{u}{e}), \overrightarrow{ps'' \To e''} }{\Gamma}{t} } { \elaborate{ \overrightarrow{ps' \To e'},\;\; (x, ps \To e), \overrightarrow{ps'' \To e''} }{\Gamma}{t} } \\[1em] \rule{ \elaborate{ \overrightarrow{ps \To e},\;\; \overrightarrow{ps' \To e'} }{\Gamma}{t} } { \elaborate{ \overrightarrow{(), ps \To e},\;\; \overrightarrow{\_, ps' \To e'} }{u:1, \Gamma}{\letv{()}{u}{t}} } \\[1em] \rule{ \elaborate{ \overrightarrow{p_1, p_2, ps \To e},\;\; \overrightarrow{\_, \_, ps' \To e'} }{a:A, b:B, \Gamma}{t} } { \elaborate{ \overrightarrow{(p_1,p_2), ps \To e},\;\; \overrightarrow{\_, ps' \To e'} }{u:A \times B, \Gamma}{\letv{(a,b)}{u}{t}} } \\[1em] \rule{ } { \elaborate{ \overrightarrow{\_, ps'' \To e''} }{u:0, \Gamma}{\abort(u)} } \\[1em] \rule{ \elaborate{ \overrightarrow{p, ps \To e},\;\; \overrightarrow{\_, ps'' \To e''}}{a : A, \Gamma}{t_1} \\ \elaborate{ \overrightarrow{p', ps' \To e'},\;\; \overrightarrow{\_, ps'' \To e''}}{b : B, \Gamma}{t_2} } { \elaborate{ \overrightarrow{\inl(p), ps \To e},\;\; \overrightarrow{\inr(p'), ps' \To e'},\;\; \overrightarrow{\_, ps'' \To e''} } {u:A + B, \Gamma} {\case{u}{a}{t_1}{b}{t_2}} } }$$

Note that it is exactly the same judgment as before --- coverage checking computes exactly the information we need, and so code generation comes along for the ride!