CMSC 451 Lecture 15, CFG Simplification Algorithm

Lecture 15 CFG simplification algorithm

  The goal here is to take an arbitrary Context Free Grammar
  G = (V, T, P, S) and perform transformations on the grammar that
  preserve the language generated by the grammar but reach a
  specific format for the productions.

  Overview: Step 1a) Eliminate useless variables that cannot derive a terminal string
            Step 1b) Eliminate useless variables that cannot be reached from S
            Step 2)  Eliminate epsilon productions
            Step 3)  Eliminate unit productions
            Step 4)  Convert productions to Chomsky Normal Form
            Step 5)  Convert productions to Greibach Normal Form

            The CYK parser takes Chomsky Normal Form as input
            The CFG to NPDA construction takes Greibach Normal Form as input

  Details:  one step at a time

  1a) Eliminate useless variables that cannot derive a terminal string
      See 1st Ed. book p88, Lemma 4.1, figure 4.7
          2nd Ed. section 7.1
      Basically: Build the set NEWV starting from productions of the form
      V -> w  where V is a variable and w is one or more terminals.
      Insert each such V into the set NEWV.
      Then iterate over the productions repeatedly, now treating any
      variable in w as a terminal if it is in NEWV, until NEWV stops
      growing. Thus NEWV is the set of all variables that can be
      reduced to a string of terminals.

      Now, all productions containing a variable not in NEWV
      can be thrown away. Thus T is unchanged, S is unchanged,
      V=NEWV and P stays the same or becomes smaller.
      The new grammar G=(V,T,P,S) generates the same language.

  1b) Eliminate useless variables that cannot be reached from S
      See 1st Ed. book p89, Lemma 4.2, 2nd Ed. book 7.1.
      Set V'={S}, T'=phi, mark all productions as unused.
      Iterate repeatedly through all productions until there is no
      change in V' or T'. For any production A -> w, with A in V',
      insert the terminals from w into the set T', insert
      the variables from w into the set V', and mark the
      production as used.

      Now, delete all productions from P that are marked unused.
      V=V', T=T', S is unchanged.
      The new grammar G=(V,T,P,S) generates the same language.


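  2)  Eliminate epsilon productions.
      See 1st Ed. book p90, Theorem 4.3, 2nd Ed. book 7.1
      This is complex. First find the nullable variables, those that
      can derive the null string, epsilon. Then, for every production
      containing nullable variables, add the productions obtained by
      omitting those variables in all combinations, and delete the
      epsilon productions themselves.

      The new grammar G=(V,T,P,S) generates the same language except
      that the new language does not contain epsilon. If the original
      language contains epsilon, that fact must be recorded separately.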


  3)  Eliminate unit productions.
      See 1st Ed. book p91, Theorem 4.4, 2nd Ed. 7.1
      Iterate through the productions finding A -> B type "unit productions",
      where A and B are both single variables.
      Delete each such production from P.
      Make a copy of all productions  B -> gamma, replacing B with A.
      Be careful of  A -> B,  B -> C, C -> D type cases:
      A needs copies of B -> gamma, C -> gamma and D -> gamma.

      Delete duplicate productions. (sort and remove adjacent duplicates)
      The new grammar G=(V,T,P,S) generates the same language.


  Briefly, some pseudo code for the above steps.

  Step 1a) The set V' = phi
           loop through the productions, P, to find:
             A -> w  where w is all terminals
                     union V' with A
           n := 0
           while n /= |V'|
             n := |V'|
             loop through productions to find:
               A -> alpha where alpha is only terminals and variables in V'
                    union V' with A
           end while
           Eliminate := V - V'
           loop through productions
             delete any production containing a variable in Eliminate
           V := V'
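
  The Step 1a loop can be sketched in Python as follows. This is a
  minimal sketch, not the cykp code: the (head, body) tuple used for a
  production, the function name remove_nongenerating and the example
  grammar are all assumptions made for illustration. The sketch folds
  the initial pass into the while loop and treats an empty body
  (epsilon) as "all terminals".

```python
from typing import List, Set, Tuple

# Assumed representation: (head, body); an empty body () stands for epsilon.
Production = Tuple[str, Tuple[str, ...]]


def remove_nongenerating(
    variables: Set[str], productions: List[Production]
) -> Tuple[Set[str], List[Production]]:
    """Step 1a: keep only variables that can derive a string of terminals."""
    generating: Set[str] = set()      # plays the role of V'
    n = -1
    while n != len(generating):       # repeat until V' stops growing
        n = len(generating)
        for head, body in productions:
            # Each symbol must be a terminal or a variable already in V'.
            if all(s not in variables or s in generating for s in body):
                generating.add(head)
    eliminate = variables - generating
    kept = [
        (h, b) for h, b in productions
        if h not in eliminate and not any(s in eliminate for s in b)
    ]
    return generating, kept


# Hypothetical grammar: S -> A B | a,  A -> a,  B -> B b
V = {"S", "A", "B"}
P = [("S", ("A", "B")), ("S", ("a",)), ("A", ("a",)), ("B", ("B", "b"))]
new_V, new_P = remove_nongenerating(V, P)
print(sorted(new_V))  # ['A', 'S']
print(new_P)          # [('S', ('a',)), ('A', ('a',))]
```

  B can never be reduced to terminals, so B and both productions that
  mention it are dropped.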
           
  Step 1b) The set V' = {S}
           The set T' = phi
           n := 0
           while n /= |V'| + |T'|
             n := |V'| + |T'|
             loop through productions to find:
               A -> alpha  where A in V'
                           union V' with variables in alpha
                           union T' with terminals in alpha
           end while
           loop through productions
             delete any production containing anything outside V', T' and epsilon
           V := V'
           T := T'
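
  A Python sketch of the Step 1b loop, under the same caveats: the
  (head, body) tuple representation, the name remove_unreachable and
  the example grammar are assumptions for illustration only.

```python
from typing import List, Set, Tuple

# Assumed representation: (head, body); an empty body () stands for epsilon.
Production = Tuple[str, Tuple[str, ...]]


def remove_unreachable(
    variables: Set[str], start: str, productions: List[Production]
) -> Tuple[Set[str], Set[str], List[Production]]:
    """Step 1b: keep only symbols and productions reachable from start."""
    reach_v: Set[str] = {start}       # V'
    reach_t: Set[str] = set()         # T'
    n = -1
    while n != len(reach_v) + len(reach_t):
        n = len(reach_v) + len(reach_t)
        for head, body in productions:
            if head in reach_v:
                for sym in body:
                    if sym in variables:
                        reach_v.add(sym)
                    else:
                        reach_t.add(sym)
    # If the head is reachable, every symbol in the body was added above,
    # so filtering on the head alone is enough.
    kept = [(h, b) for h, b in productions if h in reach_v]
    return reach_v, reach_t, kept


# Hypothetical grammar: S -> a A,  A -> b,  C -> c S
V = {"S", "A", "C"}
P = [("S", ("a", "A")), ("A", ("b",)), ("C", ("c", "S"))]
new_V, new_T, new_P = remove_unreachable(V, "S", P)
print(sorted(new_V))  # ['A', 'S']
print(sorted(new_T))  # ['a', 'b']
print(new_P)          # [('S', ('a', 'A')), ('A', ('b',))]
```

  C is never reached from S, so C, the terminal c and the production
  C -> c S all disappear.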
           
  Step 2)  The set N = phi
           n := -1
           while n /= |N|
             n := |N|
             loop through productions to find:
               A -> epsilon
                            union N with A

               A -> alpha   where no terminals in alpha and
                            all variables in alpha are in N
                            union N with A
           end while
           if S in N set null string accepted
           loop through productions
             A -> alpha   where at least one variable in alpha in N
                          keep A -> alpha and
                          generate rules A -> alpha'  where alpha'
                          is all combinations of eliminating the
                          variables in N, skipping an empty alpha'
           delete all productions A -> epsilon
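
  A Python sketch of Step 2. The (head, body) tuple representation,
  the name eliminate_epsilon and the example grammar are assumptions
  for illustration. The sketch keeps every production while computing
  N and only drops empty bodies when the new rules are generated, so
  that a rule such as B -> A A still contributes B -> A.

```python
from itertools import product
from typing import List, Set, Tuple

# Assumed representation: (head, body); an empty body () stands for epsilon.
Production = Tuple[str, Tuple[str, ...]]


def eliminate_epsilon(
    start: str, productions: List[Production]
) -> Tuple[Set[str], bool, List[Production]]:
    """Step 2: return (N, null string accepted?, epsilon-free productions)."""
    nullable: Set[str] = set()        # the set N
    n = -1
    while n != len(nullable):
        n = len(nullable)
        for head, body in productions:
            # True for an empty body and for bodies made only of N members.
            if all(sym in nullable for sym in body):
                nullable.add(head)
    new_prods: List[Production] = []
    for head, body in productions:
        # A nullable symbol may be kept or dropped; anything else is kept.
        choices = [((s,), ()) if s in nullable else ((s,),) for s in body]
        for combo in product(*choices):
            new_body = tuple(s for part in combo for s in part)
            if new_body and (head, new_body) not in new_prods:
                new_prods.append((head, new_body))
    return nullable, start in nullable, new_prods


# Hypothetical grammar: S -> A b B,  A -> a | epsilon,  B -> A A
P = [("S", ("A", "b", "B")), ("A", ("a",)), ("A", ()), ("B", ("A", "A"))]
nullable, accepts_empty, new_P = eliminate_epsilon("S", P)
print(sorted(nullable))   # ['A', 'B']
print(accepts_empty)      # False
for head, body in new_P:
    print(head, "->", " ".join(body))
```

  Note that the result contains the unit production B -> A, which is
  one reason unit productions are eliminated after epsilon productions.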
                          
  Step 3) P' := all non unit productions ( not A -> B )
          U  := all unit productions
          loop through productions in U, |U| times, to find:
            A -> A   
                      ignore this
                      
            A -> B
                      loop through productions in P'
                      copy/substitute  B -> gamma to A -> gamma in P'
          P := P'
          eliminate duplicate productions (e.g. sort and check i+1 against i)
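
  A Python sketch of Step 3. The (head, body) tuple representation,
  the name eliminate_unit and the example grammar are assumptions for
  illustration. A set stands in for the sort-and-compare duplicate
  removal; the final sorted() only fixes the output order.

```python
from typing import List, Set, Tuple

# Assumed representation: (head, body); an empty body () stands for epsilon.
Production = Tuple[str, Tuple[str, ...]]


def eliminate_unit(
    variables: Set[str], productions: List[Production]
) -> List[Production]:
    """Step 3: replace unit productions A -> B by copies of B's rules."""
    def is_unit(body: Tuple[str, ...]) -> bool:
        return len(body) == 1 and body[0] in variables

    p_new = {(h, b) for h, b in productions if not is_unit(b)}   # P'
    units = [(h, b[0]) for h, b in productions if is_unit(b)]    # U
    # |U| passes are enough for chains such as A -> B, B -> C, C -> D.
    for _ in range(len(units)):
        for a, b in units:
            if a == b:
                continue              # A -> A adds nothing, ignore it
            for head, body in list(p_new):
                if head == b:
                    p_new.add((a, body))   # copy B -> gamma as A -> gamma
    return sorted(p_new)


# Hypothetical grammar with the chain A -> B, B -> C, C -> D
V = {"A", "B", "C", "D"}
P = [("A", ("B",)), ("B", ("C",)), ("C", ("D",)),
     ("A", ("a",)), ("B", ("b",)), ("C", ("c",)), ("D", ("d",))]
result = eliminate_unit(V, P)
print([body for head, body in result if head == "A"])
# [('a',), ('b',), ('c',), ('d',)]
```

  After three passes A has picked up the rules of B, C and D, which is
  the chain case the notes warn about.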

See link to "Turing machines and parsers."
The CYKP, CYK parser, has the above steps coded in C++. With
"verbose 3" in the grammar file, most of the simplification is printed.

Of possible interest is the test case g_elim.g,
input data to   cykp,   and its output g_elim.out
