斯坦福cs231(编译原理)の 2 Lexical Analysis

Introduction

1
2
3
4
5
6
7
8
编写代码时:
if (i == j)
z = 0;
else
z = 1;

实际喂给leximal analyzer的:
\tif (i == i)\n\t\tz=0;\n\telse\n\t\tz=1;

Token Class

  • In English: Noun, Verb, Adjective

  • In Programming language: Identifier, Keywords, '(', ')'

Token Class correspond to set of strings

  • [token class]: [set of strings]

  • Identifier: strings of letters or digits, starting with a letter

  • Integer: a non-empty string of digits

  • Keyword: 'else' or 'if'

  • whitespace: a non-empty sequence of blanks, newlines, and tabs

Lexical analyzer的主要功能:

  • Classify program substrings according to role(Token Class)

  • Communicate tokens to the parser

1
2
3
4
string => LA => Parser
token: <class, string>

foo = 42 => <ID, 'foo'>, <OP, '='>, <INT, "42">

An implementation must do two things:

  1. Recognize substrings corresponding to tokens (切割成一个个子串(专业名词:词素))
  2. Identify the token class of each lexeme (为这些子串分配Token class)

最终形成的就是以下输出:

1
2
tokens: [token1, token2]
token: <token class, lexeme>

Regular Language

正则语言用于规定编程语言的词法结构

词法结构本质上就是一组token classes

所以我们必须要定义这样一些规则:什么样的字符串属于某一类token class(这样的规则就是正则表达式)

比如:

\(single \quad character: \{c\}\)

\(Epsilon: \{''\}\)

\(Union: A+B = \{a|a \in A\} \cup \{b|b \in B\}\)

image-20230628232115530

基于字符集sigma的正则表达式集合:

image-20230628232258693这个描述正则表达式的带竖线的语法叫做文法(grammar)

同一个集合可以由不同的方法表示出来。

如:1* = 1* + 1,因为1*= {'', 1, 11, 111}, 而1* + 1(注意这里的+不是连接的意思,而是并集),{'', 1, 11, 111} U {1} = {'', 1, 11, 111},不变

image-20230628233030413

Conclusion:

正则表达式指定了正则语言

正则表达式5个组成部分:

  • Two base cases: empty and 1-character strings

  • Three compound expressions: union concatenatation and iteration

形式语言:

形式语言与自动机——形式语言 - 知乎 (zhihu.com)

(163条消息) 自然语言和形式语言 (包含各种术语的区别)_Hencoff的博客-CSDN博客

定义:let \(\Sigma\) be a set of characters (an alphabet)

A language over \(\Sigma\) is a set of strings of characters drawn from \(\Sigma\)

简而言之,所以形式语言其实就是字符集上的任何一组字符串,正则表达式是形式语言的一种

如:alphabet:英文字符

lang: 英语句子

上面的例子不是,因为英文单词的定义是没有规则的

如:alphabet:ascii

language: C program

映射函数:L maps syntax to semantics

\(L(e) = M\)

\(e => re(regular \quad expression)\)

\(M => set \quad of \quad strings\)

实际上之前的

\(A+B = A \cup B 就是这样一套map函数,其本质上就是:L(A+B) = L(A) \cup \ L(B)\)

Why use a meaning function ?

  • 划分了syntax和semantics
  • 可以把notation作为一个独立的问题
  • 因为表达式(文法)和meanings并不是一对一的关系(可能是多对一,但不会是一对多,这样会出现二义性)

Lexical specification

Keyword: 'if' or 'else' or 'then'

Integar: digit = '0' + '1' + ... + '9', digit digit* = digit+

identifier: letter = [a-zA-Z], letter(letter+digit)*

Whitespace: (' ' + '' + ')+

image-20230629233242362

正则表达式描述了许多常规语言,如电话号码,邮件等等

regular languages are a language specification: 正则语言描述了一些符合特定规则的语言

Given a string s and a rexp R, is \(s \in L(R)?\)

为了回答上面这个问题,我们来整理下:

一些基本的正则表达式

At least one: $ A+ AA*$

Union: $ A|B A + B$

Option: $ A? A + $

Range: $ 'a' + 'b' + ... + 'z' $

Exclude range: $ complement of [a-z] $

具体步骤:

  1. Write a rexp for lexemes of each token class:

​ Number = digit+

​ Keyword = 'if' + 'else' + ...

​ ...

  1. Contruct R, matching all lexemes for all tokens

​ R = keyword + idenfier + Number ... = R1 + R2 + ... Rn (所有正则取并集)

  1. Let input be x1...xn

    For $1 <= i <= n check x1...xi L(R) $

  2. if success, then we know that

\(x1...xi \in L(Rj) \quad for \quad some \quad j\)

  1. Remove x1...xi from input and go to (3)

How much input is need ?

就像人类读取一样,贪婪的,以==和=为例,双等号出现的时候使用两个等号作为一个匹配整体

\(x_1...xi \in L(R)\)

\(x_1...xj \in L(R)\)

\(i != j\)

select which t = max(i, j)

Which token is used ?

\(x_1...xi \in L(R_i)\)

\(x_1...xi \in L(R_j)\)

可以按照优先级,哪个最先出现使用哪个

what if no rule matches ?

Error = [all strings not in the lexical specification] as last in priority

如果所有的token class都不满足,最后由Error兜底

Conclusion

  • Regular expressions are a concise notation for strings patterns
  • Use in lexical analysis requires small extensions
    • To resolve ambiguities [matches as long as possible]
    • To handle errors [highest priority match]
  • Good algorithm known
    • Require only single pass over the input (仅仅一次遍历就可以确定每个词素的所属token class)
    • Few operations per character (table lookup)

Finite automata

Rexp = specifications (正则表达式作为词法分析的规范语言)

A finite automation consists of of

  • An input alphabet \(\Sigma\)

  • A set of states S

  • A start state n

  • A set of accepting states F 包含于S

  • A set of transitions state -> (some input) state

  • Transition (状态转移)

image-20230702223309277

  • Is read(读作): In State s1 on input a goto state s2 (在状态s1输入字符a将会进入s2状态)

  • if end of input and in accepting state => accept

  • Otherwise => reject

    • 在非接受态输入终止了

    • 在某一个状态无法进行状态转移

image-20230630233026614
image-20230630233102794
image-20230630233131470

For some reason we're about to read x2 when we make an epsilon move the machine change state, but the input pointer stays in exactly the same place. So the new configuration of the machine would be that we're in state b. But our input pointer is still waiting to read x2. So you can think of epsilon move is a kind of free move for the machine. It can move to a different state without consuming any input.

Just to be clear here, the machne does not have to make the epsilon move. It's a choice. So we can decide whether to make the epsilon move or not. Now, epsilon move was the first time we're mentioned the possibility that a finite automaton might have a choice in what moves it makes

上面的叫做有选择的自动机

有选择的自动机(NFA)和没有选择的自动机(DFA)

确定性自动机:

  • 不存在选择,如epsilon move(因为epsilon move实际上本身就是一种选择)

  • one transition per input per state

  • A DFA takes only one path through the state graph:

image-20230702223139657

NFA:

  • Can have multiple transitions for one input in a given state
  • Can have \(\epsilon-moves\)
  • An NFA can choose:

image-20230702223124224

NFA只要有一条path可以到达accept state,那么就说这个input属于这个nfa所代表的语言

Conclusion

  • NFAs and DFAs recognize the same setg of language: regular languages
  • DFAs are faster to execute: there are no choices to consider
  • BFAs are , in general, smaller

epsilon move

Rexp 2 NFA

1
Lexical specification => Regular expression => NFA => DFA => Table-driven implementation of DFA

For each kind of rexp, define an equalvalent NFA: NFA for rexp M

image-20230702221029097

For \(\epsilon\)

image-20230702221052123

For \(a\)

image-20230702221104220

如何联合?

we add an epsilon transition to the start state of B. What that says is that first few recognize some portion of the input that belongs to the language of a (我们已经识别了前面一个正则表达式,希望不消耗掉任何一个字符跳转到另一个正则表达式). And when we get to what we've been the final state of a, we can jump to the start state of b without consuming any input and try to read the rest of the string as part of the string in the language of B.

For \(AB\)

image-20230702221203335

For \(A+B\)

image-20230702221219541

For \(A^*\)

image-20230702221643107

image-20230702161855178

NFA 2 DFA

\(\epsilon-closure\)

image-20230702231414583

An NFA may be in many states at any time (NFA 在同一刻可能有多个状态,比如下面的起始点有ABCDHI)

How many different states ?

NFA有N个不同的状态,子集种类最大有\(2^N\)(包含空集) \[ N \quad states \\ |S| \le N \\ 2 ^{N} - 1 \quad non-empty \quad subsets \]

$$ NFA: \ states: S \ start: s S \ final: F S \ a(X) = {y | x X_n x y}

\ \ \ \

DFA: \ states: subsets of S \ start: -closure(s) \ final: {X | X F } \ X Y if Y = -closure(a(X)) $$

1
2
3
4
5
6
7
NFA:
states: S
start: s in S
final: F 包含于S
a(X) = {y | x in X x ->(a) y}


image-20230702225328912
image-20230702225314437

DFA can be impl by a 2D table T

  • One dimension is states
  • Other dimension is input symbol

For every transition $ s_i s_k $ define \(T[i, a] = k\)

1
2
3
4
5
i = 0 控制字符的指针
state = 0 控制状态
while (input[i]) {
state = A[state, input[i++]];
}

image-20230702232859459

0 1
S T U
T T U
U T U

如果觉得states太多,表太大,可以考虑共享内存和链表的方式存储表结构,对于N个states的NFA来说最多需要\(2^N - 1\) 对应的DFA states

image-20230702232401012

这个时候states(行数)为NFA的states数量

Conclusion

  • NFA -> DFA conversion is key
  • Tools trade between speed and space
    • DFAs: faster, less compact
    • NFAs: slower, concise

Conclusion

怎么实现

自动机的应用(本质上是保存状态,进行状态转移)


斯坦福cs231(编译原理)の 2 Lexical Analysis
https://mingmingjiang1.github.io/emocoder/2023/06/10/编译原理/编译原理二/
作者
迷途知返
发布于
2023年6月10日
许可协议