Saturday, June 2, 2012

Extending Glushkov NFA with sub matching over Strings

We consider implementing Glushkov matching automata using Haskell. We first look at the the word matching problem. Let w be a list of characters (AKA string), r be a regular expression, we say w `matches` r iff w in L(r).

We define regular expression using Haskell data type as follows,
data RE where
  Phi :: RE                      -- empty language
  Empty :: RE                    -- empty word
  L :: Char -> Int -> RE                -- single letter with a position number 
  Choice :: RE  -> RE  -> RE     -- r1 + r2
  Seq :: RE  -> RE  -> RE        -- (r1,r2)
  Star :: RE  -> RE              -- r*
 deriving Eq

Phi denotes the empty language, the empty set. Empty denote the empty sequence, L _ _ construct a letter from a character. Note that we also annotate the letter with its position by an integer which is needed in the Glushkov NFA construction. We will come to this in details pretty soon. Choice _ _ implements the choice operator in regular expression. Seq _ _ implements the concatenation. Star _ implements the Kleene's star.

The input string is a ByteString.
type Word = S.ByteString

Step 1 Annotating the regular expression

According to the Glushkov construction, we need to annotate all the letters in the regular expression with the positional information. E.g. by annotating (a|b)a* we have (a1|b2)a3*, or in Haskell syntax, (Seq (Choice (L 'a' 1) (L 'b' 2)) (Star (L 'a' 3))). Note for convenient, in the following sections, we use this syntax in place where Haskell syntax is expected.

We implement the annotation operation via a State monad.
newtype State s a = State { runState :: (s -> (a,s)) } 
 
instance Monad (State s) where
   -- return :: a -> State s a
   return a        = State (\s -> (a,s))
   -- (>>=) :: State s a -> (a -> State s b) -> State s b
   (State x) >>= f = State (\s -> let (a,s') = x s 
                                      stb = f a
                                  in (runState stb) s')

run :: s -> State s a -> (a,s)
run s sta = (runState sta) s
We use the data type Env to keep tracks of the latest position number.
data Env = Env { cnt :: Int
               } deriving Show

initEnv :: Env 
initEnv = Env 0 
The nextCounter operation increment the counter by 1.
nextCounter :: State Env Int
nextCounter = State (\env -> let c = (cnt env) + 1
                                 env' = c `seq` env{cnt=c}
                             in env' `seq` (c, env'))


With the above operations defined the annotation process can be implemented straight-forward as follows,
rAnnotate :: RE -> RE
rAnnotate r = case run initEnv (rAnn r) of
             { (r', _ ) -> r' }  

rAnn :: RE -> State Env RE
rAnn Phi = return Phi
rAnn Empty = return Empty
rAnn (Choice r1 r2) = 
  do { r1' <- rAnn r1
     ; r2' <- rAnn r2
     ; return (Choice r1' r2') }
rAnn (Seq r1 r2) = 
  do { r1' <- rAnn r1
     ; r2' <- rAnn r2
     ; return (Seq r1' r2') }
rAnn (Star r) = 
  do { r' <- rAnn r                
     ; return (Star r') }
rAnn (L c _) = 
  do { i <- nextCounter       
     ; return (L c i) }



Now we turn to the key operations which are used in the Glushkov NFA construction. We implement the first, last and follow operations as follows,
rFirst :: RE -> [(Char, Int)] 
rFirst Phi = []
rFirst Empty = []
rFirst (L c i) = [(c,i)]
rFirst (Star r) = rFirst r
rFirst (Choice r1 r2) = (rFirst r1) ++ (rFirst r2)
rFirst (Seq r1 r2) = if isEmpty r1 then (rFirst r1) ++ (rFirst r2) else rFirst r1

rLast :: RE -> [(Char, Int)] 
rLast Phi = []
rLast Empty = []
rLast (L c i) = [(c,i)]
rLast (Star r) = rLast r
rLast (Choice r1 r2) = (rLast r1) ++ (rLast r2)
rLast (Seq r1 r2) = if isEmpty r2 then (rLast r1) ++ (rLast r2) else rLast r2

rFollow :: RE -> [(Int, Char, Int)]
rFollow Phi = []
rFollow Empty = []
rFollow (L _ _) = []
rFollow (Choice r1 r2) = (rFollow r1) ++ (rFollow r2)
rFollow (Seq r1 r2) = (rFollow r1) ++ (rFollow r2) ++ [ (l,c,f) | (_,l) <- rLast r1, (c,f) <- rFirst r2 ]
rFollow (Star r) = (rFollow r) ++ [ (l,c,f) | (_,l) <- rLast r, (c,f) <- rFirst r ]


Let r be a regular expression,

  • the rFirst function extracts the positions of all the possible "leading" letters from r.
  • the rLast function extracts the positions of all the possible "trailing" letters from r.
  • the rFollow function builds all the possible transitions from r structurally by leveraging on the rFirst and rLast operations.

For example, rFirst (a1|b2)a3* yields [1,2]. rLast (a1|b2)a3* yields [1,2,3]. rFollow (a1|b2)a3* yields [(1,'a',3), (2,'b',3), (3,'a',3)].

Step 2 Assembling the Lego piece by piece

We are ready to construct the NFA. We define the NFA using the following data type.
data NFA a l = NFA { states :: [a]
                   , initStates   :: [a]
                   , finalStates  :: [a]
                   , delta  :: [(a,l,a)] 
                   } deriving Show
To construct the Glushkov NFA, we introduce the following
rGlushkov :: RE -> NFA Int Char
rGlushkov r = let r' = rAnnotate r
              in NFA{ states = 0:(rPos r')
                    , initStates = [0]
                    , finalStates = if isEmpty r then 0:(map snd (rLast r')) else (map snd (rLast r'))
                    , delta = [ (0,c,f) | (c,f) <- rFirst r' ] ++ (rFollow r') }

rPos :: RE -> [Int]
rPos Phi = []
rPos Empty = []
rPos (L _ i) = [i]
rPos (Choice r1 r2) = (rPos r1) ++ (rPos r2) 
rPos (Seq r1 r2) = (rPos r1) ++ (rPos r2)
rPos (Star r) = rPos r
where rPos returns all the positions in r which constitute all the states in the Glushkov NFA besides the initial state 0. The final states are all the positions in rLast r. If r possess the empty string, state 0 is in the set of final states too. The delta are constructed from the rFollow operation.

Step 3 Running the NFA

Let's have a test drive of our newly constructed NFA. To do that, we need the following,
table :: Eq a => [(a,b)] -> [(a,[b])]
table ps = intern [] ps
  where intern t [] = t
        intern t ((k,v):ps) = case lookup k t of 
                              { Nothing -> intern ((k,[v]):t) ps
                              ; Just vs -> intern (update k (vs++[v]) t) ps }
        update k v t = let t' = filter (\(k',_) -> not (k == k')) t
                       in (k,v):t'


runNFA :: (Eq a, Eq l) => NFA a l -> [l] -> Bool
runNFA nfa w = 
   let xs = runIntern (initStates nfa) (table (map (\(f,s,t) -> ((f,s),t)) (delta nfa))) w
   in any (\x -> x `elem` finalStates nfa) xs
   where runIntern :: (Eq a, Eq l) => [a] -> [((a,l),[a])] -> [l] -> [a]
         runIntern currs _ [] = currs
         runIntern currs dels (l:ls) = 
           let nexts = concatMap (\a -> case lookup (a,l) dels of
                                     { Nothing -> []                        
                                     ; Just b  -> b }) currs
           in nexts `seq` runIntern nexts dels ls
where table function group all the transitions by starting position and the letter. The runNFA function takes an NFA and a list of letters and returns a boolean value to indicate whether the match is successful.
runGlushkov :: RE -> String -> Bool
runGlushkov r w = 
   let r' = rAnnotate r   
       nfa = rGlushkov r'          
   in runNFA nfa w 
runGlushkov allows us to pass in the regular expression directly into the matching procedure. e.g. runGlushkov (a1|b2)a3* "baa" yields True.

Step 4 Towards Submatching Automata

To support sub matching, we extend our language from regular expression to regular expression patterns where v :: r denotes a regular expression is marked by a sub matching group label v.

We consider the pattern data type,
data Pat where
 PVar :: Int -> RE -> Pat 
 PPair :: Pat -> Pat -> Pat  
 PChoice :: Pat -> Pat -> Pat 
  deriving Show
Note that we use integer to represent variables. For instance, the pattern ((1 :: a1b2|a3)(2 :: b4a5a6|a7)(3 :: a8c9|c10)) is represented as
p4 = PPair (PPair p_x p_y) p_z
   where p_x = PVar 1 (Choice (L 'a' 1) (Seq (L 'a' 2) (L 'a' 3)))
         p_y = PVar 2 (Choice (Seq (L 'b' 4) (Seq (L 'a' 5) (L 'a' 6))) (L 'a' 7))
         p_z = PVar 3 (Choice (Seq (L 'a' 8) (L 'c' 9)) (L 'c' 10))
And the annotation operation can be extended to regular expression pattern.
pAnnotate :: Pat -> Pat
pAnnotate p = case run initEnv (pAnn p) of { (p', _ ) -> p' }

pAnn :: Pat -> State Env Pat
pAnn (PVar v r) = do { r' <- rAnn r
                       ; return (PVar v r') }
pAnn (PPair p1 p2) = do { p1' <- pAnn p1
                        ; p2' <- pAnn p2 
                        ; return (PPair p1' p2') }
pAnn (PChoice p1 p2) = do { p1' <- pAnn p1
                          ; p2' <- pAnn p2
                          ; return (PChoice p1' p2') }
Extending the first, last and follow operations requires more attention.
pFirst :: Pat -> [(Char, Int, Int)]
pFirst (PVar v r) = [ (c,i,v) | (c,i) <- rFirst r ]
pFirst (PPair p1 p2) | isEmpty (strip p1) = pFirst p1 ++ pFirst p2
                     | otherwise          = pFirst p1
pFirst (PChoice p1 p2) = pFirst p1 ++ pFirst p2

pLast :: Pat -> [(Char, Int, Int)]
pLast (PVar v r) = [ (c,i,v) | (c,i) <- rLast r ]
pLast (PPair p1 p2) | isEmpty (strip p2) = pLast p1 ++ pLast p2
                    | otherwise          = pLast p2
pLast (PChoice p1 p2) = pLast p1 ++ pLast p2

pFollow :: Pat -> [(Int, Char, Int, Int)]
pFollow (PVar v r) = [ (p, c, q, v) | (p, c, q) <- rFollow r ]
pFollow (PPair p1 p2) = (pFollow p1) ++ (pFollow p2) 
                        ++ [ (l,c,f,v) |  (_,l,_) <- pLast p1, (c,f,v) <- pFirst p2 ]
pFollow (PChoice p1 p2) = (pFollow p1) ++ (pFollow p2)
Note that besides the letter and the position, the pFirst, pLast and pFollow functions return the pattern variable whose sub match result is going to be updated. We also need to extend the NFA to capture the updated pattern variables in the transition.
data NFA2 a l = NFA2 { states2 :: [a]
                   , initStates2   :: [a]
                   , finalStates2  :: [a]
                   , delta2  :: [(a,l,a,Int)] 
                   } deriving Show
The construction of the Glushkov NFA from the regular expression pattern can be defined as follows
pGlushkov :: Pat -> NFA2 Int Char
pGlushkov p = let p' = pAnnotate p
              in NFA2{ states2 = 0:(pPos p')
                    , initStates2 = [0]
                    , finalStates2 = if isEmpty (strip p) then 0:(map snd2 (pLast p')) else (map snd2 (pLast p'))
                    , delta2 = [ (0, c, f, v) | (c,f,v) <- pFirst p']  ++ (pFollow p') 
                    }

pPos :: Pat -> [Int]
pPos (PVar v r) = rPos r
pPos (PPair p1 p2) = pPos p1 ++ pPos p2
pPos (PChoice p1 p2) = pPos p1 ++ pPos p2

snd2 :: (a,b,c) -> b
snd2 (_,x,_) = x
Next let's consider how to we make use of the Glushkov NFA to perform regular expression sub matching.
runNFA2 :: (Eq a, Eq l) => NFA2 a l -> [l] -> [[Int]]
runNFA2 nfa w = 
   let xvs = runIntern (zip (initStates2 nfa) (repeat [])) 
              (table (map (\(f,s,t,v) -> ((f,s),(t,v))) (delta2 nfa))) w
   in map snd (filter (\(x,v) -> x `elem` finalStates2 nfa) xvs)
   where runIntern :: (Eq a, Eq l) => [(a,[Int])] -> [((a,l),[(a,Int)])] -> [l] -> [(a,[Int])]
         runIntern currs _ [] = currs
         runIntern currs dels (l:ls) = 
           let nexts = concatMap (\(a,vs) -> case lookup (a,l) dels of
                                     { Nothing -> []                        
                                     ; Just bvs -> map (\(b,v) -> (b, (vs++[v]))) bvs }) currs
           in nexts `seq` runIntern nexts dels ls
The runNFA2 function takes an NFA, a list of letters and return a list of all possible match results. Each match result is denoted by a sequence of variable names. The sequence has the same length as the input list of letters. It keeps track of the correspondent bound variables for every letter in the input list.
patMatchGlushkov :: Pat -> String -> [(Int,String)]
patMatchGlushkov p w = 
   let p' = pAnnotate p
       nfa = pGlushkov p'         
       matches = runNFA2 nfa w
   in case matches of 
      { [] -> [] -- no match
      ; (m:_) -> IM.toList (collect m w IM.empty) }
   where collect :: [Int] -> String -> IM.IntMap String -> IM.IntMap String
         collect [] _ m = m
         collect (i:is) (c:cs) m = 
                 case IM.lookup i m of
                     { Just r ->  collect is cs (IM.update (\_ -> Just (r++[c])) i m )
                     ; Nothing -> collect is cs (IM.insert i [c] m) }


For instance, patMatchGlushkov ((1 :: a1b2|a3)(2 :: b4a5a6|a7)(3 :: a8c9|c10)) "abaac" yields  [(1,"ab"), (2, "a"), (3,"ac")].

The Full Source Code

> {-# LANGUAGE GADTs, BangPatterns #-} 

--------------------------------------------------------------------------------
-- Regular Expression Pattern Matching via Glushkov automata (Position based)
--------------------------------------------------------------------------------

> module Main where

> import Monad
> import List 
> import Data.Bits
> import Char (ord)
> import GHC.IO
> import Data.Int
> import qualified Data.IntMap as IM
> import qualified Data.ByteString.Char8 as S

> import System.IO.Unsafe

------------------------------
-- regular expressions

> data RE where
>   Phi :: RE                      -- empty language
>   Empty :: RE                    -- empty word
>   L :: Char -> Int -> RE                -- single letter with a position number 
>   Choice :: RE  -> RE  -> RE     -- r1 + r2
>   Seq :: RE  -> RE  -> RE        -- (r1,r2)
>   Star :: RE  -> RE              -- r*
>  deriving Eq

A word is a byte string.

> type Word = S.ByteString

Pretty printing of regular expressions

> instance Show RE where
>     show Phi = "{}"
>     show Empty = "<>"
>     show (L c n) = show c ++ show n
>     show (Choice r1 r2) = ("(" ++ show r1 ++ "|" ++ show r2 ++ ")")
>     show (Seq r1 r2) = ("<" ++ show r1 ++ "," ++ show r2 ++ ">")
>     show (Star r) = (show r ++ "*")


> class IsEmpty a where
>     isEmpty :: a -> Bool

> instance IsEmpty RE where
>   isEmpty Phi = False
>   isEmpty Empty = True
>   isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
>   isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
>   isEmpty (Star r) = True
>   isEmpty (L _ _) = False


annotate add position info to the regex

> newtype State s a = State { runState :: (s -> (a,s)) } 
 
> instance Monad (State s) where
>    -- return :: a -> State s a
>    return a        = State (\s -> (a,s))
>    -- (>>=) :: State s a -> (a -> State s b) -> State s b
>    (State x) >>= f = State (\s -> let (a,s') = x s 
>                                       stb = f a
>                                   in (runState stb) s')

> run :: s -> State s a -> (a,s)
> run s sta = (runState sta) s


> data Env = Env { cnt :: Int
>                } deriving Show

> initEnv :: Env 
> initEnv = Env 0 

> nextCounter :: State Env Int
> nextCounter = State (\env -> let c = (cnt env) + 1
>                                  env' = c `seq` env{cnt=c}
>                              in env' `seq` (c, env'))


annotate a regex with position index

> rAnnotate :: RE -> RE
> rAnnotate r = case run initEnv (rAnn r) of
>              { (r', _ ) -> r' }  

> rAnn :: RE -> State Env RE
> rAnn Phi = return Phi
> rAnn Empty = return Empty
> rAnn (Choice r1 r2) = 
>   do { r1' <- rAnn r1
>      ; r2' <- rAnn r2
>      ; return (Choice r1' r2') }
> rAnn (Seq r1 r2) = 
>   do { r1' <- rAnn r1
>      ; r2' <- rAnn r2
>      ; return (Seq r1' r2') }
> rAnn (Star r) = 
>   do { r' <- rAnn r                
>      ; return (Star r') }
> rAnn (L c _) = 
>   do { i <- nextCounter       
>      ; return (L c i) }


> rFirst :: RE -> [(Char, Int)] 
> rFirst Phi = []
> rFirst Empty = []
> rFirst (L c i) = [(c,i)]
> rFirst (Star r) = rFirst r
> rFirst (Choice r1 r2) = (rFirst r1) ++ (rFirst r2)
> rFirst (Seq r1 r2) = if isEmpty r1 then (rFirst r1) ++ (rFirst r2) else rFirst r1

> rLast :: RE -> [(Char, Int)] 
> rLast Phi = []
> rLast Empty = []
> rLast (L c i) = [(c,i)]
> rLast (Star r) = rLast r
> rLast (Choice r1 r2) = (rLast r1) ++ (rLast r2)
> rLast (Seq r1 r2) = if isEmpty r2 then (rLast r1) ++ (rLast r2) else rLast r2

> rFollow :: RE -> [(Int, Char, Int)]
> rFollow Phi = []
> rFollow Empty = []
> rFollow (L _ _) = []
> rFollow (Choice r1 r2) = (rFollow r1) ++ (rFollow r2)
> rFollow (Seq r1 r2) = (rFollow r1) ++ (rFollow r2) ++ [ (l,c,f) | (_,l) <- rLast r1, (c,f) <- rFirst r2 ]
> rFollow (Star r) = (rFollow r) ++ [ (l,c,f) | (_,l) <- rLast r, (c,f) <- rFirst r ]

> rPos :: RE -> [Int]
> rPos Phi = []
> rPos Empty = []
> rPos (L _ i) = [i]
> rPos (Choice r1 r2) = (rPos r1) ++ (rPos r2) 
> rPos (Seq r1 r2) = (rPos r1) ++ (rPos r2)
> rPos (Star r) = rPos r

> data NFA a l = NFA { states :: [a]
>                    , initStates   :: [a]
>                    , finalStates  :: [a]
>                    , delta  :: [(a,l,a)] 
>                    } deriving Show


> table :: Eq a => [(a,b)] -> [(a,[b])]
> table ps = intern [] ps
>   where intern t [] = t
>         intern t ((k,v):ps) = case lookup k t of 
>                               { Nothing -> intern ((k,[v]):t) ps
>                               ; Just vs -> intern (update k (vs++[v]) t) ps }
>         update k v t = let t' = filter (\(k',_) -> not (k == k')) t
>                        in (k,v):t'


> runNFA :: (Eq a, Eq l) => NFA a l -> [l] -> Bool
> runNFA nfa w = 
>    let xs = runIntern (initStates nfa) (table (map (\(f,s,t) -> ((f,s),t)) (delta nfa))) w
>    in any (\x -> x `elem` finalStates nfa) xs
>    where runIntern :: (Eq a, Eq l) => [a] -> [((a,l),[a])] -> [l] -> [a]
>          runIntern currs _ [] = currs
>          runIntern currs dels (l:ls) = 
>            let nexts = concatMap (\a -> case lookup (a,l) dels of
>                                      { Nothing -> []                        
>                                      ; Just b  -> b }) currs
>            in nexts `seq` runIntern nexts dels ls


> rGlushkov :: RE -> NFA Int Char
> rGlushkov r = let r' = rAnnotate r
>               in NFA{ states = 0:(rPos r')
>                     , initStates = [0]
>                     , finalStates = if isEmpty r then 0:(map snd (rLast r')) else (map snd (rLast r'))
>                     , delta = [ (0,c,f) | (c,f) <- rFirst r' ] ++ (rFollow r') }



> runGlushkov :: RE -> String -> Bool
> runGlushkov r w = 
>    let r' = rAnnotate r   
>        nfa = rGlushkov r'          
>    in runNFA nfa w 


label with default position 0

> l0 c = L c 0

r1 = ((a|b)*,c)

> r1 = Seq (Star (Choice (l0 'a') (l0 'b'))) (l0 'c')



> data Pat where
>  PVar :: Int -> RE -> Pat 
>  PPair :: Pat -> Pat -> Pat  
>  PChoice :: Pat -> Pat -> Pat 
>   deriving Show

> strip :: Pat -> RE 
> strip (PVar _ r) = r
> strip (PPair p1 p2) = Seq (strip p1) (strip p2)
> strip (PChoice p1 p2) = Choice (strip p1) (strip p2)


extending annotate operation for patterns

> pAnnotate :: Pat -> Pat
> pAnnotate p = case run initEnv (pAnn p) of { (p', _ ) -> p' }

> pAnn :: Pat -> State Env Pat
> pAnn (PVar v r) = do { r' <- rAnn r
>                        ; return (PVar v r') }
> pAnn (PPair p1 p2) = do { p1' <- pAnn p1
>                         ; p2' <- pAnn p2 
>                         ; return (PPair p1' p2') }
> pAnn (PChoice p1 p2) = do { p1' <- pAnn p1
>                           ; p2' <- pAnn p2
>                           ; return (PChoice p1' p2') }

extending first, last and follow operations for patterns

the result of pFirst and pLast are  list of tripple, (the letter, the position, and the pattern variable updated)

> pFirst :: Pat -> [(Char, Int, Int)]
> pFirst (PVar v r) = [ (c,i,v) | (c,i) <- rFirst r ]
> pFirst (PPair p1 p2) | isEmpty (strip p1) = pFirst p1 ++ pFirst p2
>                      | otherwise          = pFirst p1
> pFirst (PChoice p1 p2) = pFirst p1 ++ pFirst p2

> pLast :: Pat -> [(Char, Int, Int)]
> pLast (PVar v r) = [ (c,i,v) | (c,i) <- rLast r ]
> pLast (PPair p1 p2) | isEmpty (strip p2) = pLast p1 ++ pLast p2
>                     | otherwise          = pLast p2
> pLast (PChoice p1 p2) = pLast p1 ++ pLast p2

we also introduce the pattern variable updated into the result of the follow operation

> pFollow :: Pat -> [(Int, Char, Int, Int)]
> pFollow (PVar v r) = [ (p, c, q, v) | (p, c, q) <- rFollow r ]
> pFollow (PPair p1 p2) = (pFollow p1) ++ (pFollow p2) 
>                         ++ [ (l,c,f,v) |  (_,l,_) <- pLast p1, (c,f,v) <- pFirst p2 ]
> pFollow (PChoice p1 p2) = (pFollow p1) ++ (pFollow p2)


> pPos :: Pat -> [Int]
> pPos (PVar v r) = rPos r
> pPos (PPair p1 p2) = pPos p1 ++ pPos p2
> pPos (PChoice p1 p2) = pPos p1 ++ pPos p2


> snd2 :: (a,b,c) -> b
> snd2 (_,x,_) = x

we need a different nfa because now the delta should keep track of which pattern variable is updated

> data NFA2 a l = NFA2 { states2 :: [a]
>                    , initStates2   :: [a]
>                    , finalStates2  :: [a]
>                    , delta2  :: [(a,l,a,Int)] 
>                    } deriving Show

> -- return a list of variable bindings
> runNFA2 :: (Eq a, Eq l) => NFA2 a l -> [l] -> [[Int]]
> runNFA2 nfa w = 
>    let xvs = runIntern (zip (initStates2 nfa) (repeat [])) 
>               (table (map (\(f,s,t,v) -> ((f,s),(t,v))) (delta2 nfa))) w
>    in map snd (filter (\(x,v) -> x `elem` finalStates2 nfa) xvs)
>    where runIntern :: (Eq a, Eq l) => [(a,[Int])] -> [((a,l),[(a,Int)])] -> [l] -> [(a,[Int])]
>          runIntern currs _ [] = currs
>          runIntern currs dels (l:ls) = 
>            let nexts = concatMap (\(a,vs) -> case lookup (a,l) dels of
>                                      { Nothing -> []                        
>                                      ; Just bvs -> map (\(b,v) -> (b, (vs++[v]))) bvs }) currs
>            in nexts `seq` runIntern nexts dels ls
> 

> pGlushkov :: Pat -> NFA2 Int Char
> pGlushkov p = let p' = pAnnotate p
>               in NFA2{ states2 = 0:(pPos p')
>                     , initStates2 = [0]
>                     , finalStates2 = if isEmpty (strip p) then 0:(map snd2 (pLast p')) else (map snd2 (pLast p'))
>                     , delta2 = [ (0, c, f, v) | (c,f,v) <- pFirst p']  ++ (pFollow p') 
>                     }



> patMatchGlushkov :: Pat -> String -> [(Int,String)]
> patMatchGlushkov p w = 
>    let p' = pAnnotate p
>        nfa = pGlushkov p'         
>        matches = runNFA2 nfa w
>    in case matches of 
>       { [] -> [] -- no match
>       ; (m:_) -> IM.toList (collect m w IM.empty) }
>    where collect :: [Int] -> String -> IM.IntMap String -> IM.IntMap String
>          collect [] _ m = m
>          collect (i:is) (c:cs) m = 
>                  case IM.lookup i m of
>                      { Just r ->  collect is cs (IM.update (\_ -> Just (r++[c])) i m )
>                      ; Nothing -> collect is cs (IM.insert i [c] m) }



> p4 = PPair (PPair p_x p_y) p_z
>    where p_x = PVar 1 (Choice (l0 'A') (Seq (l0 'A') (l0 'B')))
>          p_y = PVar 2 (Choice (Seq (l0 'B') (Seq (l0 'A') (l0 'A'))) (l0 'A'))
>          p_z = PVar 3 (Choice (Seq (l0 'A') (l0 'C')) (l0 'C'))

> s4 = "ABAAC"

Ungreedy match can be easily adopted in Glushkov 

e.g. consider p = ( x :: a1 * ?, y :: a2 * ) where 1 and 2 are position tags.

first p = [ ( 'a', 1, x) ,  ('a', 2, y) ]

last p  = [ ('a', 1, x), ('a', 2, y) ]

follow p = (follow (x :: a1*?)) ++ (follow  (y :: a2*)) ++ 
        [ (p1,c2,p2,v2) | (c1,p1,v1) <- last (x :: a1*?), (c2,p2,v2) <- first (y :: a2*) ]
         = [ (p, c, p',x) |  (p, c, p') <- follow (a1*?) ] ++ 
           [ (p, c, p',y) |  (p, c, p') <- follow (a2*) ] ++ 
        [ (p1,c2,p2,v2) | (c1,p1,v1) <- last (x :: a1*?), (c2,p2,v2) <- first (y :: a2*) ]
         = [ (1,'a',1,x) ] ++ [ (2,'a',2,y) ] ++ [ (p1,c2,p2,v2) | (c1,p1,v1) <- [('a',1,x)], (c2,p2,v2) <- [('a',2,y)] ]   -- (1)
         = [ (1,'a',1,x) ] ++ [ (2,'a',2,y) ] ++ [ (1,'a',2',y) ] 
Note that for (1) we have all the transitions. Assume during the matching, the transitions are 'tried' in the order
of left to right. Hence (1,'a',1,x) is always tried before (1,'a',2',y), which leads to a greedy matching.

On the other hand, if we swap [ (1,'a',1,x) ]  with  [ (1,'a',2',y) ]  we will have non-greedy matching

hence for a non-greedy match, if a1 is non-greedy
follow (q1,q2) =  [ (p1,c2,p2,v2) | (c1,p1,v1) <- last q1, (c2,p2,v2) <- first q2 ] 
               ++ follow q1 ++ follow q2
       


The end



No comments: