More theory: the Context Free Grammar, CFG, see below. Roughly, what was a
state in a DFA becomes a variable in a grammar; a grammar has no states.

A grammar has the usual representation G = (V, T, P, S) with variables V,
terminal symbols T, set of productions P and the start symbol S from V.
A production is a single variable, an arrow "->", and then a string of
variables and terminals (or epsilon). "|" is often used to combine several
productions with the same left side, rather than writing them on separate
lines, e.g.

  S -> aab
  T -> aac | aad | epsilon

A derivation tree is constructed with:

  1) each tree vertex is a variable or terminal or epsilon
  2) the root vertex is S
  3) interior vertices are from V, leaf vertices are from T or epsilon
  4) an interior vertex A has children, in order, left to right,
     X1, X2, ..., Xk  when there is a production in P of the form
     A -> X1 X2 ... Xk
  5) a leaf can be epsilon only when there is a production A -> epsilon
     and the leaf's parent can have only this child.

Watch out! A grammar may have an unbounded number of derivation trees.
It just depends on which production is expanded at each vertex.

For any valid derivation tree, reading the leaves from left to right gives
one string in the language defined by the grammar. There may be many
derivation trees for a single string in the language.

If the grammar is a CFG then a leftmost derivation tree exists for every
string in the corresponding CFL. There may be more than one leftmost
derivation tree for some string. See the example below and the ((()())())
example in the previous lecture. Likewise, a rightmost derivation tree
exists for every string in the corresponding CFL, and there may be more
than one rightmost derivation tree for some string.

The grammar is called "ambiguous" if the leftmost (equivalently, the
rightmost) derivation tree is not unique for some string in the language
defined by the grammar. The leftmost and rightmost derivations of a string
are usually distinct but might be the same.

Given a grammar and a string in the language represented by the grammar,
a leftmost derivation tree is constructed bottom up by finding a production
in the grammar that has the leftmost character of the string (possibly more
than one may have to be tried) and building the tree towards the root.
Then work on the second character of the string. After much trial and
error, you should get a derivation tree with root S. We will get to the
CYK algorithm, which does this parsing systematically, in a few lectures.

Examples:

Construct a grammar for L = { x 0^n y 1^n z | n > 0 }

Recognize that 0^n y 1^n is a base language, say B

  B -> y | 0B1        (the base y, the recursion 0B1)

Then the language is completed by

  S -> xBz

using the prefix, the base language and the suffix. (Note that x, y and z
could be any strings not involving n.)

  G = ( V, T, P, S ) where
  V = { B, S }
  T = { x, y, z, 0, 1 }
  S = S
  P =  S -> xBz
       B -> y | 0B1

Now construct an arbitrary derivation

       *
  S   =>   x00y11z
       G

A derivation always starts with the start variable, S. The "=>", "*" and
"G" stand for "derivation", "any number of steps", and "over the grammar G"
respectively. The intermediate terms, called sentential forms, may contain
variable and terminal symbols. Any variable, say B, can be replaced by the
right side of any production of the form B -> <right side>. A leftmost
derivation always replaces the leftmost variable in the sentential form.
(In general there are many possible replacements; the process is
nondeterministic.)

One possible derivation using the grammar above is

  S => xBz => x0B1z => x00B11z => x00y11z

The derivation must stop when the sentential form has only terminal
symbols. (No more substitutions are possible.)
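The leftmost replacement rule is easy to simulate mechanically. Below is a
minimal Python sketch, not part of the lecture; the dictionary PRODUCTIONS
and the helper leftmost_derive are made-up names. It replays the derivation
above by always expanding the leftmost variable with a chosen production
(uppercase letters are variables, everything else is a terminal):

  # Minimal sketch: leftmost derivation for P = { S -> xBz, B -> y | 0B1 }.
  PRODUCTIONS = {
      "S": ["xBz"],
      "B": ["y", "0B1"],
  }

  def leftmost_derive(choices, start="S"):
      """Apply one production per step, always to the leftmost variable."""
      sentential = start
      forms = [sentential]
      for c in choices:
          # find the leftmost variable in the current sentential form
          pos = next(i for i, ch in enumerate(sentential) if ch.isupper())
          var = sentential[pos]
          sentential = sentential[:pos] + PRODUCTIONS[var][c] + sentential[pos + 1:]
          forms.append(sentential)
      return forms

  # Reproduces  S => xBz => x0B1z => x00B11z => x00y11z
  print(" => ".join(leftmost_derive([0, 1, 1, 0])))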
The final string is in the language of the grammar. But this is a very
poor way to generate all strings in the grammar!

A "derivation tree", sometimes called a "parse tree", uses the rules above:
start with the start symbol, expand the tree by creating branches using any
right side of a start symbol production, etc.

                 S
               / | \
              /  |  \
             /   |   \
            /    |    \
           x     B     z
               / | \
              /  |  \
             0   B   1
               / | \
              0  B  1
                 |
                 y

The derivation ends with all leaves terminal symbols. Reading the leaves
left to right gives  x 0 0 y 1 1 z , a string in the language generated by
the grammar.

More examples of grammars are:

G(L) for L = { x a^n y b^k z | k > n > 0 }

  Note that there must be more b's than a's, thus
  B -> aybb | aBb | Bb

  G = ( V, T, P, S ) where
  V = { B, S }
  T = { a, b, x, y, z }
  S = S
  P =  S -> xBz
       B -> aybb | aBb | Bb

  Incremental changes for "n > k > 0":    B -> aayb | aBb | aB
  Incremental changes for "n >= k >= 0":  B -> y | aBb | aB

Independent exponents do not cause a problem when they are nested,
equivalent to nesting parenthesis.

G(L) for L = { a^i b^j c^j d^i e^k f^k | i>=0, j>=0, k>=0 }

      a^i b^j c^j d^i e^k f^k
       |   |   |   |   |   |
       |   +---+   |   +---+
       +-----------+

  G = ( V, T, P, S )
  V = { I, J, K, S }
  T = { a, b, c, d, e, f }
  S = S
  P =  S -> IK
       I -> J | aId
       J -> epsilon | bJc
       K -> epsilon | eKf

G(L) for L = { a^i b^j c^k | any unbounded relation such as i=j=k>0 or
0<i<k<j } cannot be a context free grammar. Try it. This will be seen
intuitively with the push down automata and proved with the pumping lemma
for context free languages.

How is a leftmost derivation tree found for some string? By a process that
looks at the string left to right and runs the productions backwards.
Here is an example; time starts at the top and moves down.

Given G = ( V, T, P, S )
  V = { S, E, I }
  T = { a, b, c, +, * }
  S = S
  P =  I -> a | b | c
       E -> I | E+E | E*E
       S -> E
  (a subset of the grammar from the book)

Given a string  a + b * c

  a:  I
      E
      S          derived but not used
  b:  I
      E
      [E + E]
      E
      S          derived but not used
  c:  I
      E
      [E * E]
      E
      S          done!  Have S and no more input.

Left derivation tree: just turn upside down and delete the unused parts.

              S
              |
              E
            / | \
           /  |  \
          E   *   E
        / | \     |
       E  +  E    I
       |     |    |
       I     I    c
       |     |
       a     b

Check: Read the leaves left to right, they must be the initial string, all
in T. Interior nodes must be variables, all in V. Every vertical connection
must be traceable to a production.
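Until the CYK algorithm is covered, brute force is enough to experiment
with small grammars. The Python sketch below is not part of the lecture;
PRODUCTIONS and leftmost_derivations are made-up names, and it assumes
every production is length nondecreasing so the search can be cut off.
It enumerates all leftmost derivations of a string. Run on "a+b*c" with
the expression grammar above, it prints two leftmost derivations, one for
each derivation tree, showing that this grammar is ambiguous.

  # Minimal sketch: brute-force search for leftmost derivations.
  PRODUCTIONS = {
      "S": ["E"],
      "E": ["I", "E+E", "E*E"],
      "I": ["a", "b", "c"],
  }

  def leftmost_derivations(target, form="S", steps=()):
      """Yield every sequence of leftmost steps that derives target."""
      if form == target:
          yield steps
          return
      if len(form) > len(target):   # productions never shrink, so give up
          return
      pos = next((i for i, ch in enumerate(form) if ch in PRODUCTIONS), None)
      if pos is None:               # all terminals but not the target string
          return
      var = form[pos]
      for rhs in PRODUCTIONS[var]:
          new_form = form[:pos] + rhs + form[pos + 1:]
          yield from leftmost_derivations(target, new_form, steps + (new_form,))

  for d in leftmost_derivations("a+b*c"):
      print("S => " + " => ".join(d))
  # Two leftmost derivations are printed, so the grammar is ambiguous.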