斯坦福cs231(编译原理)の 2 Lexical Analysis
Introduction
1 |
|
Token Class
In English: Noun, Verb, Adjective
In Programming language: Identifier, Keywords, '(', ')'
Token Class correspond to set of strings
[token class]: [set of strings]
Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: 'else' or 'if'
whitespace: a non-empty sequence of blanks, newlines, and tabs
Lexical analyzer的主要功能:
Classify program substrings according to role(Token Class)
Communicate tokens to the parser
1 |
|
An implementation must do two things:
- Recognize substrings corresponding to tokens (切割成一个个子串(专业名词:词素))
- Identify the token class of each lexeme (为这些子串分配Token class)
最终形成的就是以下输出:
1 |
|
Regular Language
正则语言用于规定编程语言的词法结构
词法结构本质上就是一组token classes
所以我们必须要定义这样一些规则:什么样的字符串属于某一类token class(这样的规则就是正则表达式)
比如:
\(single \quad character: \{c\}\)
\(Epsilon: \{''\}\)
\(Union: A+B = \{a|a \in A\} \cup \{b|b \in B\}\)

基于字符集sigma的正则表达式集合:
这个描述正则表达式的带竖线的语法叫做文法(grammar)
同一个集合可以由不同的方法表示出来。
如:1* = 1* + 1,因为1*= {'', 1, 11, 111}, 而1* + 1(注意这里的+不是连接的意思,而是并集),{'', 1, 11, 111} U {1} = {'', 1, 11, 111},不变

Conclusion:
正则表达式指定了正则语言
正则表达式5个组成部分:
Two base cases: empty and 1-character strings
Three compound expressions: union concatenatation and iteration
形式语言:
形式语言与自动机——形式语言 - 知乎 (zhihu.com)
(163条消息) 自然语言和形式语言 (包含各种术语的区别)_Hencoff的博客-CSDN博客
定义:let \(\Sigma\) be a set of characters (an alphabet)
A language over \(\Sigma\) is a set of strings of characters drawn from \(\Sigma\)
简而言之,所以形式语言其实就是字符集上的任何一组字符串,正则表达式是形式语言的一种
如:alphabet:英文字符
lang: 英语句子
上面的例子不是,因为英文单词的定义是没有规则的
如:alphabet:ascii
language: C program
映射函数:L maps syntax to semantics
\(L(e) = M\)
\(e => re(regular \quad expression)\)
\(M => set \quad of \quad strings\)
实际上之前的
\(A+B = A \cup B 就是这样一套map函数,其本质上就是:L(A+B) = L(A) \cup \ L(B)\)
Why use a meaning function ?
- 划分了syntax和semantics
- 可以把notation作为一个独立的问题
- 因为表达式(文法)和meanings并不是一对一的关系(可能是多对一,但不会是一对多,这样会出现二义性)
Lexical specification
Keyword: 'if' or 'else' or 'then'
Integar: digit = '0' + '1' + ... + '9', digit digit* = digit+
identifier: letter = [a-zA-Z], letter(letter+digit)*
Whitespace: (' ' + '' + ')+

正则表达式描述了许多常规语言,如电话号码,邮件等等
regular languages are a language specification: 正则语言描述了一些符合特定规则的语言
Given a string s and a rexp R, is \(s \in L(R)?\)
为了回答上面这个问题,我们来整理下:
一些基本的正则表达式
At least one: $ A+ AA*$
Union: $ A|B A + B$
Option: $ A? A + $
Range: $ 'a' + 'b' + ... + 'z' $
Exclude range: $ complement of [a-z] $
具体步骤:
- Write a rexp for lexemes of each token class:
Number = digit+
Keyword = 'if' + 'else' + ...
...
- Contruct R, matching all lexemes for all tokens
R = keyword + idenfier + Number ... = R1 + R2 + ... Rn (所有正则取并集)
Let input be x1...xn
For $1 <= i <= n check x1...xi L(R) $
if success, then we know that
\(x1...xi \in L(Rj) \quad for \quad some \quad j\)
- Remove x1...xi from input and go to (3)
How much input is need ?
就像人类读取一样,贪婪的,以==和=为例,双等号出现的时候使用两个等号作为一个匹配整体
\(x_1...xi \in L(R)\)
\(x_1...xj \in L(R)\)
\(i != j\)
select which t = max(i, j)
Which token is used ?
\(x_1...xi \in L(R_i)\)
\(x_1...xi \in L(R_j)\)
可以按照优先级,哪个最先出现使用哪个
what if no rule matches ?
Error = [all strings not in the lexical specification] as last in priority
如果所有的token class都不满足,最后由Error兜底
Conclusion
- Regular expressions are a concise notation for strings patterns
- Use in lexical analysis requires small extensions
- To resolve ambiguities [matches as long as possible]
- To handle errors [highest priority match]
- Good algorithm known
- Require only single pass over the input (仅仅一次遍历就可以确定每个词素的所属token class)
- Few operations per character (table lookup)
Finite automata
Rexp = specifications (正则表达式作为词法分析的规范语言)
A finite automation consists of of
An input alphabet \(\Sigma\)
A set of states S
A start state n
A set of accepting states F 包含于S
A set of transitions state -> (some input) state
Transition (状态转移)
Is read(读作): In State s1 on input a goto state s2 (在状态s1输入字符a将会进入s2状态)
if end of input and in accepting state => accept
Otherwise => reject
在非接受态输入终止了
在某一个状态无法进行状态转移



For some reason we're about to read x2 when we make an epsilon move the machine change state, but the input pointer stays in exactly the same place. So the new configuration of the machine would be that we're in state b. But our input pointer is still waiting to read x2. So you can think of epsilon move is a kind of free move for the machine. It can move to a different state without consuming any input.
Just to be clear here, the machne does not have to make the epsilon move. It's a choice. So we can decide whether to make the epsilon move or not. Now, epsilon move was the first time we're mentioned the possibility that a finite automaton might have a choice in what moves it makes
上面的叫做有选择的自动机
有选择的自动机(NFA)和没有选择的自动机(DFA)
确定性自动机:
不存在选择,如epsilon move(因为epsilon move实际上本身就是一种选择)
one transition per input per state
A DFA takes only one path through the state graph:
NFA:
- Can have multiple transitions for one input in a given state
- Can have \(\epsilon-moves\)
- An NFA can choose:
NFA只要有一条path可以到达accept state,那么就说这个input属于这个nfa所代表的语言
Conclusion
- NFAs and DFAs recognize the same setg of language: regular languages
- DFAs are faster to execute: there are no choices to consider
- BFAs are , in general, smaller
epsilon move
Rexp 2 NFA
1 |
|
For each kind of rexp, define an equalvalent NFA: NFA for rexp M
For \(\epsilon\)
For \(a\)
如何联合?
we add an epsilon transition to the start state of B. What that says is that first few recognize some portion of the input that belongs to the language of a (我们已经识别了前面一个正则表达式,希望不消耗掉任何一个字符跳转到另一个正则表达式). And when we get to what we've been the final state of a, we can jump to the start state of b without consuming any input and try to read the rest of the string as part of the string in the language of B.
For \(AB\)
For \(A+B\)
For \(A^*\)

NFA 2 DFA
\(\epsilon-closure\)

An NFA may be in many states at any time (NFA 在同一刻可能有多个状态,比如下面的起始点有ABCDHI)
How many different states ?
NFA有N个不同的状态,子集种类最大有\(2^N\)(包含空集) \[ N \quad states \\ |S| \le N \\ 2 ^{N} - 1 \quad non-empty \quad subsets \]
$$ NFA: \ states: S \ start: s S \ final: F S \ a(X) = {y | x X_n x y}
\ \ \ \
DFA: \ states: subsets of S \ start: -closure(s) \ final: {X | X F } \ X Y if Y = -closure(a(X)) $$
1 |
|


DFA can be impl by a 2D table T
- One dimension is states
- Other dimension is input symbol
For every transition $ s_i s_k $ define \(T[i, a] = k\)
1 |
|
0 | 1 | |
---|---|---|
S | T | U |
T | T | U |
U | T | U |
如果觉得states太多,表太大,可以考虑共享内存和链表的方式存储表结构,对于N个states的NFA来说最多需要\(2^N - 1\) 对应的DFA states

这个时候states(行数)为NFA的states数量
Conclusion
- NFA -> DFA conversion is key
- Tools trade between speed and space
- DFAs: faster, less compact
- NFAs: slower, concise
Conclusion
怎么实现
自动机的应用(本质上是保存状态,进行状态转移)