Section 1.3
Practice problems (page 85): 1.18, 1.19, 1.28
Optional: 1.21
Regular expressions are an expressive way to describe languages, especially languages that contain repetitions of patterns. Regular expressions are built out of primitive components.
A regular expression \(R\) has one of the following forms:
- The symbol \(a\) where \(a\) is a letter from the alphabet \(\Sigma\), representing a language that matches only the string on that one symbol.
- The symbol \(\epsilon\), representing a language that matches just the empty string.
- \(\emptyset\), representing a language that matches nothing.
- \(R_1 \cup R_2\), or also \(R_1 | R_2\), where \(R_1\), \(R_2\) are already-formed regular expressions. Represents a language containing the strings of the language of \(R_1\) as well as those of the language of \(R_2\).
- \(R_1 \circ R_2\), or also \(R_1R_2\), where \(R_1\), \(R_2\) are already-formed regular expressions. Represents a language containing the concatenations of strings of the languages represented by \(R_1\) and \(R_2\).
- \(R^*\), where \(R\) is a regular expression. Represents the Kleene star of the language represented by \(R\).
This is perhaps the first example of a recursive or inductive definition, where a new item of a certain type can be built from previously-constructed items. We of course have already seen this idea in OCAML’s type system.
For example, we can denote the language that contains any string on the letters \(a\), \(b\), \(c\), but that starts with either \(a\) or \(b\), by:
\[(a|b)(a|b|c)^*\]
Regular expressions are available on almost any programming language, and they are used when scanning through text. The rules for what a token/valid-language-element constitutes on most programming languages can be expressed via regular expressions. Regular expressions used in practice usually involve a number of conveniences, including:
The remarkable fact that we will build towards is that regular expressions have the same expressive power as DFAs/NFAs:
A language is regular, i.e. recognized by a DFA/NFA, if and only if it can be expressed via a regular expression.
There are two directions to this, and one is easy:
Given a regular expression describing a language, we can construct a corresponding NFA recognizing that same language. So any language represented by a regular expression is a regular language.
This is easy to see:
Exercise: Use this idea to build the NFA for the regular expression (a|b)^*aba(a|b)^*
. Make sure to follow the general instructions for concatenation, without taking shortcuts.
The converse is also true but harder to show:
For every regular language there is a regular expression describing it.
We will skip the proof of this, but those interested can find the details on pages 69 through 76. The key step in the proof is the introduction of the concept of generalized nondeterministic finite automata, where transitions are “tagged” by a regular expression, and you follow that transition by consuming a part of the input taht matches that expression. An initial such automaton is created trivially from an NFA that recognizes that language, and that NFA is then slowly contracted one state at a time, at the same time making the transition regular expressions more complex. The final result consists effectively of a start state and an accept state, connected via a regular expression that then describes the language.