Monday, May 11, 2009

Implementing regular expression matching using Partial derivative (Part 4: The partial derivative approach)

In the previous post, we have presented a regex pattern matching algorithm that uses derivative. The idea is to extend the derivative operation to handle regex patterns. With this extension, we can perform regex pattern matching by "pushing" all the labels in the input word into the pattern via derivative, and collect the results by examining the resulted pattern (derivative).

This approach is neat but hard to be put to practical use. Here are the reasons.

Issue #1) The derivative operation for regex patterns always builds new patterns. The "structural size" of the derivative pattern is always larger than the original one. This prevents us from getting a compilation scheme.

Issue #2) The derivative based algorithm requires backtracking.
To illustrate, let us consider the following example,


Let p be the pattern (x :: (A|(AB)), y :: (A|C) ),
> p = ( PPair ( PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B'))) ) (PVar 2 S.empty (Choice (L 'A') (L 'C')))

Let w be the input AA
> w = S.pack "AA"

Matching w against p
> firstmatch p w -- (1)


To compute (1), we need to compute the derivatives (dPat (dPat p 'A') 'A')


(dPat (dPat p 'A') 'A')
--> (dPat (PPair ( PVar 1 "A" (Choice Empty (L 'B'))) (PVar 2 "" (Choice (L 'A') (L 'C'))) 'A')
--> (PChoice
(PPair ( PVar 1 "AA") (Choice Phi Phi)) (PVar 2 "" (Choice (L 'A') (L 'C'))) 'A')
(PPair (PPair ( PVar 1 "A" (Choice Empty (L 'B'))) (PVar 2 "A" (Choice Empty Phi))) -- (2)
)


Since we are searching for the firstmatch, we search (2) from left to right for the first successful match. It turns out that the first alternative of (2) leads to a matching failure. Therefore, we have to backtrack, and look at the second alternative, in which we find the match, [(1,"A"), (2,"A")].

We all know that backtracking is costly. The situation could be worse in the context of regex pattern matching, where we not only need to test for successful match but also keep track of bindings.

To address the issue #1, we need to take a different approach which is based on partial derivatives.

So what is the partial derivative? How does it differ from derivative?

Given a regex r, the partial derivatives of r with respect to some letter l are a set of regex which are the possible results of removing l from r.

Both derivative and partial derivative both describe the "states" after removing some leading label l from a regex. The derivate operation yields a single regex. On the other hand, the partial derivative operation yields a set of regexs. We can think of derivatives as states in a DFA because they deterministic, i.e. from one input regex and a letter, we can only get one derivative.
On the contrary, we can think of partial derivatives as states in an NFA because they are non-deterministic, i.e. from one input regex and a letter, we get a set of partial derivatives.

Note that as we pointed out earlier in this section, the set of all possible derivatives of a regex is infinite.
On the other hand, the set of all possible partial derivatives of a given regex is finite. This is one of the important results in Antimirov's work.

We recast Antimirov's definition of partial derivative operations as follows,

> partDeriv :: RE -> Char -> [RE]
> partDeriv Phi l = []
> partDeriv Empty l = []
> partDeriv (L l') l
> | l == l' = [Empty]
> | otherwise = []
> partDeriv (Choice r1 r2) l = nub ((partDeriv r1 l) ++ (partDeriv r2 l))
> partDeriv (Seq r1 r2) l
> | isEmpty r1 =
> let s1 = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> s2 = partDeriv r2 l
> in nub (s1 ++ s2)
> | otherwise = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> partDeriv (Star r) l = [ (Seq r' (Star r)) | r' <- partDeriv r l ]


It is not hard to realize that by making use of the partial derivative operation we can define a derivative operation which yields a "minimal" regex.

Exercise: Implement the derivative operation by making use of the partial derivative operation.

Furthermore, we are able to use the partial derivative operation to solve the word problem.

Exercise: Implement an algorithm that solves the word problem using the partial derivative operations.


Like what we did to the derivative operation, we can extend the partial derivative operation to operate on regex patterns.


> pdPat :: Pat -> Letter -> [Pat]
> pdPat (PVar x w r) l =
> let pd = partDeriv r l
> in if null pd then []
> else [PVar x (w `S.append` (S.pack [l])) (resToRE pd)]
> pdPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then ([ PPair p1' p2 | p1' <- pdPat p1 l] ++
> [ PPair (mkEmpPat p1) p2' | p2' <- pdPat p2 l])
> else [ PPair p1' p2 | p1' <- pdPat p1 l ]
> pdPat (PChoice p1 p2) l =
> ((pdPat p1 l) ++ (pdPat p2 l))

Summig up a list of regular expressions with choice operation.

> resToRE :: [RE] -> RE
> resToRE (r:res) = foldl Choice r res
> resToRE [] = Phi



And the partial derivative pattern matching algorithm follows naturally,


> allmatch :: Pat -> Word -> [Env]
> allmatch p w = concat (map collect (allmatch' [p] w))
> where
> allmatch' :: [Pat] -> Word -> [Pat]
> allmatch' ps w =
> case S.uncons w of
> Nothing -> ps
> Just (l,w') -> let ps' = (concat [ pdPat p l | p <- ps ])
> in allmatch' ps' w'

> firstmatch :: Pat -> Word -> Maybe Env
> firstmatch p w =
> case allmatch p w of
> [] -> Nothing
> (env:_) -> Just env



Recall the previous example


Kenny's example

> p4 = PPair (PPair p_x p_y) p_z
> where p_x = PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B')))
> p_y = PVar 2 S.empty (Choice (Seq (L 'B') (Seq (L 'A') (L 'A'))) (L 'A'))
> p_z = PVar 3 S.empty (Choice (Seq (L 'A') (L 'C')) (L 'C'))

> input = S.pack "ABAAC"




*Main> firstmatch p4 input
Just [(1,"AB"),(2,"A"),(3,"AC")]


Note that there is a slight difference between the result obtained from the derivative based and the one above. The key observation here is that the two algorithms differ when the regex pattern has a left associated nested pairs. In case that the pair patterns are nested to the right. The two algorithms behaves the same, which can be proven easily. In realworld regex pattern matching like Perl or grep, we do not have nesting in pattern sequences.

One key observation is that the set of all partial derivative patterns are finite if we drop the "cumulative" bindings. This gives us an opportunity to turn the above algorithm into a compilation scheme.

And we yet need to address the issue of backtracking (#2).
(To be continued)

Thursday, May 7, 2009

Implementing regular expression matching using Partial derivative (Part 3: The derivative approach)

The lesson we learned from the word problem is that we solve the word matching problem by reducing

match w r

to

match w' (deriv r l)
where
(l:w') = w



Suppose regex pattern matching can be defined in terms of a function

> allmatch :: Pat -> Word -> [Env]

where the type Pat represents all possible regex patterns.

> data Pat = PVar Int Word RE
> | PPair Pat Pat
> | PChoice Pat Pat
> | PEmpty Pat -- used internally to indicate that mkEmpty function has been applied.
> deriving Show

In the above, we define four kinds of regex patterns, the variable pattern, the pair parttern, the choice pattern,
and the empty pattern. Note that in the variable pattern we use integer to represent pattern variable.
Within the variable pattern, we use a placeholder of type Word to record
the word being matched by this variable pattern. The empty pattern is used internally, we will provide the explanation shortly.


Env denotes the matching results which are mappings between pattern variables and the words.

> type Env = [(Int,Word)]


Applying the same idea, we should be able to solve the regex pattern matching problem by reducing


allmatch p w

to

allmatch derivative_of_p_with_respect_to_l w'
where
(l:w') = w


To implement this, we first need to define the derivatives operation for regex patterns.


> dPat :: Pat -> Letter -> Pat


The derivative operation for variable pattern is pretty straigh-forward.


> dPat (PVar x w r) l =
> let d = deriv r l
> in PVar x (w `S.append` (S.pack [l])) d


In the above, we "reduce" the pattern annotation regex by computing its derivative. In addition, in the resulting derivative we record the information that we have "consumed" the letter l.

The derivative operation for choice pattern is simple, we omit the explanation.

> dPat (PChoice p1 p2) l = PChoice (dPat p1 l) (dPat p2 l)


The derivative operation for pair patterns is more interesting.
Let strip be a function that strips away variable bindings from a pattern.

> strip :: Pat -> RE
> strip (PVar _ w r) = r
> strip (PPair p1 p2) = Seq (strip p1) (strip p2)
> strip (PChoice p1 p2) = Choice (strip p1) (strip p2)
> strip (PEmpty p) = strip p

The "partial" solution we can immediately think of is as follows,

> dPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (dPat p1 l) p2)
> (PPair _ (dPat p2 l))
> else PPair (dPat p1 l) p2

When the sub pattern p1 accepts empty word, the resulting derivative must be a choice pattern. In the first alternative, we use p1 to consume l and keep p2 unchanged; while in the second alternative, we skip p1 and use p2 to consume l. How can we skip p1? In other words, what should _ be replaced with?
Because p1 accepts empty words, we could further match incoming letters just with p2. But we cannot discard p1, simply because we might have recorded some part of the input in p1. On the other hand, we cannot keep p1 as it is. We somehow need to indicate that the matching using p1 is finished.

The idea is to replace _ with a variation of p1 where we make all regular expressions empty whichever is possible.


> dPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (dPat p1 l) p2)
> (PPair (mkEmpPat p1) (dPat p2 l))
> else PPair (dPat p1 l) p2

> mkEmpPat :: Pat -> Pat
> mkEmpPat (PVar x w r)
> | isEmpty r = PVar x w Empty
> | otherwise = PVar x w Phi
> mkEmpPat (PPair p1 p2) = PPair (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PChoice p1 p2) = PChoice (mkEmpPat p1) (mkEmpPat p2)


There is a space for optimization here. Since we have made p1 empty, we should give it a mark, so that in the subsequent
derivative operation, we could immediately skip it. Putting all pieces together, the derivative operation for regex pattern is as follows,


> dPat :: Pat -> Letter -> Pat
> dPat (PVar x w r) l =
> let d = deriv r l
> in PVar x (w `S.append` (S.pack [l])) d
> dPat (PChoice p1 p2) l = PChoice (dPat p1 l) (dPat p2 l)
> dPat (PPair (PEmpty p1) p2) l =
> (PPair (PEmpty p1) (dPat p2 l))
> dPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (dPat p1 l) p2)
> (PPair (PEmpty (mkEmpPat p1)) (dPat p2 l))
> else PPair (dPat p1 l) p2

> mkEmpPat :: Pat -> Pat
> mkEmpPat (PVar x w r)
> | isEmpty r = PVar x w Empty
> | otherwise = PVar x w Phi
> mkEmpPat (PPair p1 p2) = PPair (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PChoice p1 p2) = PChoice (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PEmpty p) = PEmpty p


Recall that the main idea is to perform regex pattern matching by reducing it into derivative forms, how about the base case, i.e. when the input word is empty or fully consumed. We need to collect the bindings from the pattern.


> collect :: Pat -> [Env]
> collect (PVar x w r) | isEmpty r = [[(x,w)]]
> | otherwise = []
> collect (PPair p1 p2) = combine (collect p1) (collect p2)
> collect (PChoice p1 p2) = (collect p1) ++ (collect p2)
> collect (PEmpty p) = collect p

> combine :: [Env] -> [Env] -> [Env]
> combine envs1 envs2 =
> [ env1 ++ env2 | env1 <- envs1, env2 <- envs2 ]


Now we can define our derivative based pattern matching algorithm.


> allmatch :: Pat -> Word -> [Env]
> allmatch p w =
> let p' = S.foldl (\s l -> dPat s l) p w
> in collect p'

> firstmatch :: Pat -> Word -> Maybe Env
> firstmatch p w =
> case allmatch p w of
> [] -> Nothing
> (env:_) -> Just env



Some example,

Kenny's example p4 = ( (x :: A|(AB) , y :: (BAA) | A ), z :: (AC)| C )

> p4 = PPair (PPair p_x p_y) p_z
> where p_x = PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B')))
> p_y = PVar 2 S.empty (Choice (Seq (L 'B') (Seq (L 'A') (L 'A'))) (L 'A'))
> p_z = PVar 3 S.empty (Choice (Seq (L 'A') (L 'C')) (L 'C'))

> input = S.pack "ABAAC"




The sample execution,

*Main> firstmatch p4 input
Just [(1,"A"),(2,"BAA"),(3,"C")]


The correctness of this implementation can be found in my thesis here.
The correctness is verified under the POSIX matching policy, which is mentioned in Vansumeran's paper.


Burak Emir has demonstrated the same algorithm works in Scala, See here and here.

Related Works

S. Vansummeren. Type inference for unique pattern matching. ACM TOPLAS, 28(3):389–428, May 2006.

Tuesday, May 5, 2009

Implementing regular expression matching using Partial derivative (Part 2: The word problem and derivatives)

How do we solve the regex matching problem? To start let us consider a simpler problem - the word problem. The word problem is a classic problem in which we test whether a regex accepts a word.

One well-known approach is to solve the word problem using Brzozowski's derivative operation. The derivative operation is defined as follows

d(l,r) = { w | (l,w) in r ]


We describe the algorithm in the following using Haskell.



> import qualified Data.ByteString.Char8 as S

> data RE = Phi -- empty lang
> | Empty -- empty word
> | L Char -- literal
> | Choice RE RE -- r1 + r2
> | Seq RE RE -- (r1,r2)
> | Star RE -- r*


A word is a byte string.

> type Word = S.ByteString

> type Letter = Char



We follow Brzozowski's approach to define the derivative operations.


> deriv :: RE -> Letter -> RE
> deriv Phi l = Phi
> deriv Empty l = Phi
> deriv (L l') l
> | l == l' = Empty
> | otherwise = Phi
> deriv (Choice r1 r2) l = Choice (deriv r1 l) (deriv r2 l)
> deriv (Seq r1 r2) l
> | isEmpty r1 =
> let c1 = (Seq (deriv r1 l) r2)
> c2 = deriv r2 l
> in Choice c1 c2
> | otherwise = Seq (deriv r1 l) r2
> deriv (Star r) l = Seq (deriv r l) (Star r)



To establish the base case of the algorithm,
we need a function that checks whether a regexp accepts the empty word.



-- check whether regular expressions are empty

> class IsEmpty a where
> isEmpty :: a -> Bool

> instance IsEmpty RE where
> isEmpty Phi = False
> isEmpty Empty = True
> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
> isEmpty (Star r) = True
> isEmpty (L _) = False



Then the word matching algorithm can be implemented as follows

> match :: Word -> RE -> Bool
> match w r =
> case S.uncons w of
> Nothing -> isEmpty r -- empty
> Just (l,w') -> w' `match` (deriv r l)


The gist of the word matching algorithm is that a non empty word (l,w') matches a regular expr r if we take away l from the word, the remainder w' matches with the derivative (deriv r l).

A sample execution:


*Main> match (S.pack "AA") (Seq (Star (L 'A')) (Star (L 'A')))
True

Wednesday, April 29, 2009

Implementing regular expression matching using Partial derivative (Part 1: Intro)

Regular expression pattern is a common programming tool which allows us to extract a fragment from a series of items (such as characters, symbols etc) if the fragment matches the pattern that we specify. The process that computes the match result is sometime called regular expression pattern matching. Regular expression pattern matching is commonly found in many main stream programming languages such as, Perl, Python and Java. The "traditional" way of implementing regular expression pattern matching is to construct a DFA (deterministic Finite Automata) that represent the regular expression pattern. Execute the DFA over the input will yield the result.



In this article, I propose an alternative implementation of regular expression pattern matching using Partial Derivatives.
In a similar idea, we can also use partial derivative to perform inequality/equality test among regexs. See Martin's post here.


What is the regular expression pattern matching problem?


Let's consider an example. Given a regular expression pattern, say


(x : A*, y : A*)


which matches a sequence of alphabets A and binding zero or more As to the pattern variable x and sending the rest (of As) to the other pattern variable y.


Suppose we have an input sequence


AA


(or we will call it "a word" subsequently) which matches with the above pattern and producing
one of the following possible bindings.


{ x -> AA, y -> () }
{ x -> A, y -> A }
{ x -> (), y -> AA }


The first solution is called the longest matched, because the first sub pattern consumes the longest possible prefix of the word and the remaining goes to the second pattern. On the other hand, the third solution is referred to the shortest matched, because the first sub pattern consumes the shortest possible prefix. In most of this article we will talk about the longest match.