[LAC] Regular Expressions

Regular Expressions


Overview

Automata(DFA and NFA) describe languages in a somewhat indirect way: not always obvious what the defined language is. But Regular Expression offer a different, more direct way to describe language.
Regular expressions offer something that automata do not: a declarative way to express the strings we want to accept. Thus, regular expressions serve as the input language for many systems that process strings:

  • Search commands such as the UNIX grep
  • Lexical-analyzer generators(Compilers)

The operator of regular expressions

Regular expressions denote languages. For a simple example, the regular expression 01*+10* denotes the language consisting of all strings that are either a single 0 followed by any number of 1’s or a single 1 followed by any number of 0’s.
Before describing the regular expression notation, we need to learn the three operations on languages that the operators of regular expressions represent. These operations are:

  • The union of two languages
  • The concatenation of languages L and M is that taking any string in L and concatenating it with any string in M. For example, if L ={001, 10, 111} and M ={ε, 001}. Then LM is {001, 10, 111, 001001, 10001, 111001}
  • The closure(or star, or Kleene closure) of a language L is denoted L* and represents the set of those strings that can be formed by taking any number of strings from L, possibly with repetitions and concatenating all of them. For instance, if L = {0,1}, then L* is all strings of 0’s and 1’s
  • For any language L, L^0 = {ε}
  • For a special example φ* = {ε}. Note that φ^0 = {ε}, because we can’t select any strings from the empty set. In fact, φ is one of only two languages whose closure is not infinite.

What are regular expressions?

Given an alphabet Σ (e.g. Σ = {a, b, c, … , z}), the syntax (i.e., form) of regular expressions over Σ is defined inductively as follows:
1. ∅ is a regular expression.
2. ε is a regular expression.
3. For each x ∈ Σ, x is a regular expression.
4. If E and F are regular expressions then E + F is a regular expression denoting the union of L(E) and L(F). That is, L(E+F) = L(E) U L(F).
5. If E and F are regular expressions then EF (i.e. just one after the other) is a regular expression denoting the concatenation of L(E) and L(F). That is, L(EF) = L(E)L(F).
6. If E is a regular expression then E* is a regular expression denoting the closure of L(E). That is, L(E*) = (L(E))*.
7. IF E is a regular expression then (E) is a regular expression denoting the same language as E. Formally; L((E)) = L(E).
enter image description here

Example:

Let us write a regular expression for the set of strings that consist of alternating 0’s and 1’s.
First, it is easy to think about the closure of 01 which is (01)*

Note that star takes precedence over dot, so we need a parentheses.

But it can just represent 01010101…01, what if that is 10101010 or start with 0 and end with 0?
There are four possibilities, so we can figure that

(01)* + (10)* + 0(10)* + 1(01)*

Notice that we use the + operator to take the union of the four languages that together gives us all the strings with alternating 0’s and 1’s.

However, we have a much more succinct representation:

(ε+1)(01)*(ε+0)

Precedence of regular expression operators

Like other algebras, the regular expression operators have an assumed order of “precedence”.

  • The star operator is of the highest precedence. That is, it applies only to the smallest sequence of symbols to its left that is a well-formed regular expression.
  • Next in precedence comes the concatenation or “dot” operator.
  • Finally, all unions (+ operator) are grouped with their operands.

Exercise

Write regular expressions for the following languages:
1. The set of strings over alphabet {a, b, c} containing at least one a and at least one b.
(A + B + C)* (A (A + B + C)*B + B (A + B + C)*A) (A + B + C)*
2. The set of strings of 0’s and 1’s whose tenth symbol from the right end is 1.
(0 + 1)* 1 (0+1)^9
3. The set of strings of 0’s and 1’s with at most one pair of consecutive 1’s.
(0 + 10)* (11+ε) (0+10)*
4. The set of all strings of 0’s and 1’s such that every pair of adjacent 0’s appears before any pair of adjacent 1’s.

The trick is to start by writing an expression for the set of strings that have no two adjacent 1’s. Here is one such expression: (10+0)*(ε+1)
To see why this expression works, the first part consists of all strings in which every 1 is followed by a 0. To that, we have only to add the possibility that there is a 1 at the end, which will not be followed by a 0. That is the job of (ε+1).
Now, we can rethink the question as asking for strings that have a prefix with no adjacent 1’s followed by a suffix with no adjacent 0’s. The former is the expression we developed, and the latter is the same expression, with 0 and 1 interchanged. Thus, a solution to this problem is (10+0)(ε+1)(01+1)(ε+0). Note that the ε+1 term in the middle is actually unnecessary, as a 1 matching that factor can be obtained from the (01+1)* factor instead.

评论

此博客中的热门博文

[MLE] Linear Classification

[AIM] MetaHeuristics

[CS231] Neural Networks