[LAC] Context Free Grammars
We will now introduce context-free grammars (CFGs) as a formalism to define languages. CFGs are more general than regular expressions, i.e. there are more languages definable by CFGs.
Key idea: Rules, called productions, that describe how symbols called non-terminals(or variables, or syntactic categories) can be replaced by non-terminals and terminals until only terminals left.
so what is CFG?
A context-free grammar G = (N, T, P, S) is given by
- A finite set N of nonterminal symbols or nonterminals
- A finite set T of terminal symbols or terminals
- N∩T = ∅, i.e., the sets N and T are disjoint
- S ∈ N: the distinguished start symbol
- A finite set P ⊆ N ×(N ∪T )∗ of productions. A production (A, α), where A ∈ N and α ∈ (N ∪ T )∗ is a sequence of nonterminal and terminal symbols, is written as A → α in the following.(P is what we called production before)
Non-terminal are also referred to as variables, and consequently the set of non-terminals is sometimes is sometimes denoted by V.
The terminals are the alphabet of the language defined by a context-free grammar , and for that reason the set of terminals is sometimes denoted by Σ.
The directly derives relation
if we have a “*” (star) on the transition arrow, this means there a several potential transition:
language generated by a grammar
Derivation tree
A tree is a derivation tree for a CFG G = (N,T,P,S) iff:
1. Every node has a label from N ∪T ∪{ε}.
2. The label of the root node is S.
3. Labels of interior nodes belong to N
4. If a node n has label A and nodes n1,n2, …, nk are children of n, from left to right, with labels X1, X2, …, Xk, respectively, then A->X1X2…Xk is a production in P.
5. If a node n has label ε, then n is a leaf and the only child of its parent.
The string of leaf labels read from left to right, eliding any ε, constitute the yield of the tree.
The yield of a parse tree
If we look at the leaves of any parse tree and concatenate them from the left, we get a string, called the yield of the tree, which is always a string that is derived from the root variable.
Ambiguity
We say that a grammar G is ambiguous if there is a word w ∈ L(G) for which there is more than one parse tree, or, equivalently, more than one leftmost derivation or more than one rightmost derivation. This is usually a bad thing because it entails that there is more than one way to interpret a word (i.e. it leads to semantical ambiguity).
Exercises
Answer: S -> 0S1 | 01
The set of all strings over alphabet a,b not of the form ww for any w.
No strings of odd length can be of the form ww . We use the terminal
symbols A and B to generate all odd-length strings where the center
characters are a and b, respectively.
S -> AB|BA|A|B
A -> aAa|aAb|bAa|bAb|a
B -> aBa|aBb|bBa|bBb|b
To understand why AB and BA are guaranteed to generate strings not of the form ww,
understand that A builds an odd-length string, say of length 2k + 1, where k characters
precede an a which preceds k more characters.
Similarly, B generates a string, say of length 2m + 1,
where m a’s and b’s precede one b preceds m more a’s and b’s. If we
concatenate one to another, we get a string of length 2k + 2m + 2:an even
length string-where the middle a of the portion generated by A is
k + m + 1 characters away from the central b generated by B.
That means that the first k + m + 1 characters cannot
be fully replicated in the final k + m + 1 characters.
[This answer taken from Jerry Cain at Stanford]
评论
发表评论