斯坦福cs231(编译原理)の 3 Parsing Analysis 1 (Introduction & LL Analysis)

Introduction

Regular languages

  • The weakest formal languages widely used
  • Many applications

正则表达式的缺陷:

Parser:

  • Input: sequence of tokens from lexer
  • Output: parse tree pf the program
1
2
3
4
5
6
7
8
Cool:
if x = y then 1 else 2 fi
Parser input:
IF ID = ID THEN INT ELSE INT FI
Parser output:
IF-THEN-ELSE
= INT INT
ID ID
Phase Intput Output
Lexer String of characters String of tokens
Parser String of tokens Parse tree

上面两步有的编译器是分开做的,有的编译器是放在一起做的

上下无关文法

由词法分析器得到的tokens并不全是有用的,比如标点符号,所以编译器必须识别哪些是有效token,哪些是无效的,我们需要一种描述规则来描述何为有效token,以及一种识别有效token的方法

程序语言都是有着nested(递归)的结构,如:

1
2
EXPR = if EXPR then EXPR else EXPR fi
while EXPR loop EXPR pool

上下无关文法据说一种描述这种递归结构的natural notation

CFG 由下面几部分构成:

  • a set of terminals: \(T\)
  • a set of nn-terminals: \(N\)
  • a start symbol: \(S (S \in N)\)
  • a set of productions:$ X -> Y_1, ... Y_N (X N, Y_i N T {})$
  1. Begin with a string with only the start symbol S
  2. Replace any non-terminal X in the string by the right-hand side of some production
  3. Repeat (2) until there are non-terminals

let G be a contexto-free grammar with start symbol S. Then the language L(G) of is: \[ \{ a_1, ... a_n \quad | \quad \forall_i \quad a_i \in T \and S \mathop{\rightarrow}^* a_1, ..., a_n \} \]

  • Terminals are so-called because there are no-reulses for replacing them 终结符是不变的
  • Once generated, terminals are permanent,终结符是不变的
  • Terminals ought to be tokens of the language (终结符一般是语言的token,比如关键字,标识符)

🌰:

1
2
3
4
5
6
7
8
9
10
E -> E + E
| E * E
| (E)
| id

对应的语言:
id
id + id
id + id * id
(id + id) * id

整个CFG是一个很大的步骤,需要:

  • Membership in a alnguage is yes or no, also parse tree of the input
  • Must handle errors gracefully
  • Need an implementation of CFG's (eg: bison)

Derivations(推导)

A derivations s a sequence f production: \[ S -> .. -> ,,, -> ... -> .. \] 推导过程可以以树的形式画出来:

  • start symbol is the tree's root
  • For a production \(X->Y_1Y_2..Y_n\) add children \(Y_1Y_2...Y_n\) o node X

考虑下面语法: \[ E → E + E | E * E | (E) | id \] 以及字符串:id * id + id

image-20230617134603437

A parse tree has

  • Terminals at the leaves
  • Non-terminals at the interior nodes

对叶子节点的in-order遍历就是原始输出,parse tree表示了token之间的各种关系

如下:

叶子节点从左到右:就是原始输入:id * id + id

中间节点和左右兄弟节点的关系也很明确,*的左右就是id

上述推导式的产生是left-most,即在每一步优先推导最左边的符号(即推导方向是自左向右),但是这样可能会出现没有结束条件而一直无限推导下去。还有一种与之类似的right-most 推导

1
2
3
4
5
6
E
-> E + E
-> E + id
-> E * E + id
-> E * id + id
-> id * id + id

Note that r-most and l-most derivations have the same parse tree

A derivations defines a parse tree, but one parse tree may have many derivations

一个Parse tree 可能有很多种推导方式可以得到,但是最左推导和最右推导是最重要的两种方式

Conclusion:

  • 不仅对某个字符串是否属于L(G)(语法G所产生的语言)感兴趣,也需要对应的解析树
  • 一个推导式定义了一个解析树(或者解析树的一部分),同一个Parse tree 可能有很多种推导方式可以得到

Ambiguity(语法的二义性)

解析优先级

Grammar: E -> E + E | E * E | (E) | id

String stream: id * id + id

不同的推导得到不同的parse tree:

1
2
3
4
5
6
7
8
9
		E
E + E
E * E id
id id

E
E * E
id E + E
id id

A grammar is ambiguous if it has more than one parse tree for some string, 换句话说,对于某些字符串,可以由超过一种的做作或最右推导方式

How to handle ambiguous

Method 1:

重写语法:其实本质上是强制了解析时候的优先级

1
2
E -> E' + E/E'
E' -> id * E' | id | (E) * E'| (E)

Method 2:

不重写,但是在解析的时候,使用优先级或者判断哪个是最优的

结合性

Parsing Error Handle(parser过程的错误处理)

Error kind Example Detected by
Lexical 使用了未知符号,比如else写成了eles Lexer
Syntax 编写的程序存在结构错误,比如while后面应该有一对{}, 但是只写了一个{ parser
Semantic int x; x(3);声明x为普通变量,但是却把x当成函数使用 Type Checker
Correctness 自己的代码虽然通过了编译器,但是运行结果不符合预期 User/Tester,一般需要自己debug

Error handler should:

  • Report errors accurately and clearly
  • Recover from an error quickly
  • Not slow down compilation of valid code

常见的三种不同错误处理模式:

  • Panic Mode (使用特殊的终止符(error)吃掉错误,继续执行)
    • When an error is detected
      • Discard tokens until one with clear role is found
      • Continue from here

​ 🌰:

1
2
(1 + + 2) + 2
skip ahead to next interger and then continue

Bison: use the special terminal error to descide how mucg input to skip

1
E -> int | E + E | (E) | error int | (error)
  • Error productions (添加可能的错误的推导式): specify known common mistakes in the grammar

    Eg: write 5 x instead of 5 * x

    Add production E -> .... | E E

    Disadvantage: complicates the grammar

  • Automatic local or global correction

    • Find a correct "nearby" program
      • try token insertions and deletions (编辑距离)
      • exhaustive search
    • Disadvantages:
      • hard to impl
      • slows down parsing of correct programs
      • "nearby" is not necessarily "the intended" program

Past:

  • Slow recompilation cycle (even once a day)
  • rind as many errors in one cycle as possible

Present:

  • Quick recompilation cycle
  • users tend to correct one error/cycle
  • Complex error recovery is less compelling

Another Clean Parse Tree Form —— AST

Parser跟踪(trace)一个token序列的推导过程,并由此产生Parse tree

🌰:

Grammar: E -> int | (E) | E + E

After lexical analysis: [Int(2), '+', '(', Int(1), '+', 'Int(5), ')']

Parse Tree:

1
2
3
4
5
	E
E + E
int(2) (E)
E + E
int(1) int(5)

AST:

1
2
3
[Plus, leftO, rightO]
Int(2) [Plus, leftO, rightO]
Int(1) Int(5)

被称之为AST,因为它从具体语法中抽象出来,煎炒了具体语法的细节;而Parse Tree展示了具体的推导规则和相关结构,对于编译器来水说,有很多不必要的内容

递归下降解析

解析树一般按照如下方式构建:

  • 自上至下
  • 从左到右
1
2
3
4
		1
t2. 3. t9
4. 7
t5. t6. t8

Tokens: [t2, t5, t6, t8, t9]

看一个详细的递归下降解析的🌰:

考虑语法Grammar:

E -> T | T | T + E

T -> int | int * T | (E)

输入:\((int_5)\)

步骤:

从起始菲终结符开始,依次尝试关于E的推导式

1
2
3
4
E -> T -> int (mismatch: int does not match (, backtrack) =>
T -> [int * T] (mismatch: int does not match (, backtrack) =>
T -> ( E ) (match: 继续下一个输入字符int =>
T -> ( E -> int ) (match: 继续下一个输入字符) =>

递归下降解析算法的一般定义如下:

首先定义几个关于是否符合某个匹配的函数:

  • 是否为终结符:

    1
    bool term(TOKEN tok) {return *next++ == tok}
  • 是否为lfs为S第n个推导式:

    1
    bool Sn() {}
  • 是否为lfs为S的推导式

    1
    bool S() {}

🌰:

  • E -> T: bool E1() { return T() }

  • E -> T + E: bool E2() { return T() && term(PLUS) && E() }

  • 对于整个E来说:

    1
    2
    3
    4
    bool E() {
    TOKEN *save = next;
    return (next = save, E1()) || (next = save, E2());
    }

同样的对于上面的非终结符T,有:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
bool T1() {
return term(INT);
}

bool T2() {
return term(INT) && term(TIMES) && T();
}

bool T3() {
return term(OPEN) && E() && term(CLOSE);
}

bool T() {
TOKEN *save = next;
return (next = save, T1()) || (next = save, T2() || (next = save, T3()));
}

整个步骤:

  • 初始化next为第一个token
  • 调用E()

Conslusion:

递归下降过程中,不断地递归合回溯所产生的函数调用结构其实就是解析树的体现,所以在递归下降的同时构建AST。

上面的递归下降算法的限制:

对于E -> T | T | T + E,一旦递归到了T,并且从继续向下递归,一旦不符合了,只会在当前层级尝试其他的推导式,但是不会基于T的同层级去尝试T和T+E

所以上面的普通的递归下降只适用于一部分语法,对于有些不符合的语法,可以采用左因子(left-factor)进行改写:

左递归

考虑产生式:S -> S a

1
2
bool S1() { return S() && term()}
bool S() { return S1()}

如果使用递归下降,会产生如下的调用链条:S => S1 => S => S1 => ...

这就是左递归语法

以下是左递归语法更一般的定义: \[ S \mathop{\rightarrow}^+ S\alpha (+表示至少有一次推导) \] 再看这个🌰: \[ S \rightarrow S\alpha | \beta \] 将会产生以\(\beta\) 开头的且\(\beta\) 后面紧跟任意数量(>=0)\(\alpha\)的字符串。

既然已经知道了会产生什么样的字符串,可以通过把文法改写成右递归避免递归下降产生的左递归问题: \[ S \rightarrow \beta S' \\ S' \rightarrow \alpha S' | \epsilon \] Conclusion:

一般的递归下降是从左到右解析,由于该算法的性质,遇到做递归文法会导致无穷递归;可以通过把左递归文法改成特殊(这里的右递归文法也仅仅适用于部分语法)的右递归文法避免这个问题,因为这个时候非终结符在最右边,不会有右边字符存在一直饥饿(一直访问不到)的现象。

更一般的,可以改写成如下特殊的右递归文法: \[ S' \rightarrow S\alpha_1 | ... | S\alpha_n | \beta_1|...|\beta_m \]

\[ S \rightarrow \beta_1 S'| ... | \beta_m S' \\ S' \rightarrow \alpha_1 S' | ... | \alpha_nS' | \epsilon \]

但是有些语法,如: \[ S \rightarrow A\alpha | \beta \\ A \rightarrow S \beta \] 这个语法也是左递归语法,写成如下形式更好理解: \[ S \mathop{\rightarrow}^+ S \beta \alpha \] 可以通过其他方式消除左递归。

继续top-down

预测

Predicive Parsers like recursive-descent but parser can "predict" which production to use:

  • By looking at the next few tokens
  • No backtracking

Predicive Parsers accept LL(k) grammars (1st L: left 2 right, 2nd L: left-most derivation, k: k tokens look ahead)

在递归下降算法里:

  • 每一步,有很多产生式可以使用,如:E -> T | T | T + E,对于E,有3种选择
  • 需要通过回溯撤销bad choices

LL(1):

  • 通过改写成合适的文法(一般是左因子分解,左因子分解可以理解为公共左因子提取,这里的因子值终结符)每一步,仅有一步选择:

Hint: 其实就是根据当前预测字符决定选择使用哪个推导式,而这个预测字符和推导式的关系需要用一张表来记录

继续考虑语法Grammar:

E -> T | T | T + E

T -> int | int * T | (E)

难以产生预测字符:

  • 对于T,有两个int开头的推导式,所以即使当前有预测字符int,也无法选择哪一个是最优的推导
  • 对于E,更不容易看出预测字符是什么

需要左因子语法改写:

E -> TX

X -> +E |

T -> intY | (E)

Y -> *T |

根据新语法可以计算得到LL(1) parse table: 其中表头为下一个输入的token,每一行是非终结符,单元格里的内容就是当前使用哪个推导式的rfs,这里暂时没有讲如何构造这张表的

int * + $
E TX TX
X +E \(\epsilon\) \(\epsilon\)
T int Y (E)
Y *T \(\epsilon\) \(\epsilon\) \(\epsilon\)

比如:[Y, +] entry,表示当前非终结符Y,遇到了当前输入token +

那么Y就可以按照\(Y \rightarrow \epsilon\) 推导,

[E, *] entry,表示当前非终结符X,遇到了当前输入token *,没有合适的推导式可以使用

这里我们再提一个额外的知识点:

我们不想采用递归的方式去做解析,而是利用栈解析:

  • 非终结符仍然是扩展替换的
  • 终结符也仍然会输入进行比较匹配
  • 栈顶=最左边的待处理的非终结符或非终结符
  • Reject on reaching error state
  • Accepy on end of input & empty stack

形式化定义如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
initialize stack = <S $> and next
repeat
case stack of
<X, rest>: if T[X, *next] = Y1...Yn
then stack <- <Y1...Yn rest>;
else error(); // 不存在推导动作则直接报错
<t, rest>: if t == *next++
then stack <- <rest>;
else error();
until stack == <>

初始化的时候栈顶是其实非终结符S,
后续,如果当前栈顶是非终结符X,且根据预测表执行的推导动作是X -> Y1...Yn,则pop X from stack and push Y1...Yn into stack
否则:如果栈顶是终结符且和当前输入相同(说明栈顶的这个元素直接可以使用其本身进行推导),则pop t from stack

看一个🌰:

image-20230619230522565

first

follow

ll1-parse-table

构建预测表的步骤:

For each production \(A \rightarrow \alpha\) in G do:

  • For each terminal \(t \in First(\alpha)\) do
    • \(T[A, t] = \alpha\)
  • If \(\epsilon \in First(\alpha)\) for each t $t Follow(A) $ do
    • \(T[A, t] = \alpha\)
  • If \(\epsilon \in First(\alpha)\) and t $$ Follow(A) $ do
    • \(T[A, t] = \alpha\)

First() = { $, ) }

Follow(X) = { +, $, ) }

所以上面的语法形成的预测表如下

( ) + * int $
E TX TX
T (E) intY
X +E \(\epsilon\)
Y \(\epsilon\) \(\epsilon\) *T \(\epsilon\)

上述表格每个表格内的条目最多只有一条,当唯恐的时候,表示解析遇到错误,下面看一个表格内不止一个条目的例子:

语法:\(S -> Sa | b\)

First(S) = { b }

Follow(S) = { $, a }

a b $
$ b / Sa

正如上面的例子,如果一个表的内容是多个,则该语法不是LL(1),当燃判断是否为LL(1)还有其他方法:

  • 非左因子语法不是LL(1)
  • 左递归不是LL(1)
  • 二义性也不是
  • 其他的,如LL(k), k > 1也不是

网页知识点:

LL和LR:

概念梳理:

首先说明,Context-free grammar与无二义性文法不是一个层级的概念。CFG的意思是:我们用产生式设计的一组文法,对于每一个推导,其中的NT可以任意地被产生式右部替换而合法(这并不限制对于一个文本,只能推理出一棵树)。也就是每个NT之下的产生式是等价的,比如对于Verb->吃/睡/飞,在具体解析时,不论前面的主语/后面的宾语是什么,都合法。二义性是在CFG之下的概念。

LL and LR Parsing Demystified (reverberate.org)

LL表示从左到右扫描输入并执行最左推导,LL和LR相比哪个适用范围更广呢?

为简单讨论,我们只讨论LL(1)和LR(1),1代表向前查看的字符数量(预测1个字符),1表示任何时刻我们提前查看当前输入字符的下一个字符,再根据这个提前查看的字符决定使用哪一个规约行为。

  • In LL(1) we see the first symbol of the input and see the production to apply. So, if there is two productions with the same ‘first’ symbol as in the input parser gets a conflict and fails.
  • 在LL(1)里,我们自左到右进行规约,查看当前字符和采用对应的推导式进行规约。所以,如果如果有两个推导式有着相同的fiest symbol,就会产生冲突
  • 在 LR(1) 中,我们看到从左边开始的输入,直到我们得到一个handle。在此之后,我们再看到一个前瞻符号并确定解析器操作.即,解析器比 LL(1) 中有更多的信息来决定其操作,这使得它比 LL(1) 更强大。更强大的手段是,任何可以被 LL(1) 解析的语法也可以被 LR(1) 解析。
  • 这种情况在更一般的场景下也成立,对于任何k,LL(k)语法集是LR(k)语法集的适当子集

这里我们所说的解析器的能力指的是其可以解析的语法的范围,并不是说它能够生成的语言。所以什么是LL(1)语言呢,它们是不是都可以被LL(1)生成呢?

实际上LL(k)的能力和k成正比,k越大,LL(k)的能力越强,LL(k)是LL(k+1)的子集

那LR(k)?

\(First(\alpha)\) 是从\(\alpha\)推导的handle的起始终结符的集合

\(Follow(A)\)(这里大写表示A是一个非终结符),是紧跟在A后面的终结符的集合

当且仅当一个语法G满足如下条件,G才被称之为LL(1):

  1. G not unambiluous
  2. G not left-recursive
  3. If there is a production \(A→\alpha | \beta\), then
    1. \(First(\alpha) \cup First(\beta) = empty\) 否则解析器不知道使用哪个推导式
    2. if \(First(\alpha) \cup First(\beta) = empty\) contains \(\epsilon\), \(First(\beta)\)不应该包含\(\epsilon\), 否则解析器不知道使用哪个推导式
    3. if \(First(\alpha) \cup First(\beta) = empty\), \(First(\beta) \cup First(\alpha) = empty\), 否则解析器不知道使用哪个推导式

下面是LR部分:

item:文法的一个产生式G加上其右部某一位置的一个点,这个点表示了分析过程中的状态。

产生式A->XYZ 产生的四个项:

A->·XYZ A->X·YZ A->XY·Z A->XYZ· 以第二个项为例,其表示已经接收了一个可以由X推导的串,如果希望能归约,那么接下来要识别一个能够由YZ推导的串。

Dcfl is a superset of regular. But dcfl with prefix property is not.

Viable Prefixes and Handle in LR Parsing

Bottom-up Parsing

Consider the grammar

  • \(S -> XX\)
  • $X -> aX | b $

Now, consider a string in \(L(S)\) say aabb. We can parse it as follows by left most derivation – replacing the left most non-terminal in each step, or right most derivation – replacing the rightmost non-terminal in each step.

考虑一个符合该文法的字符串aabb,我们可以按照最左推导解析它(每一步替换最左边的非终结符)或者也可以按照最右推导解析它(每一步替换最右边的非终结符)

image-20230619000054295

上面的最右推导被用在bottom-up parsing

  • Any string derivable from the start symbol is a sentential form — it becomes a sentence if it contains only terminals
  • A sentential form that occurs in the leftmost derivation of some sentence is called left-sentential form
  • A sentential form that occurs in the rightmost derivation of some sentence is called right-sentential form

再次考虑字符串aabb,我们可以按照如下方法解析:

从左至右扫描输入,如果存在子串匹配任何推导式的右侧(RHS, right hand of string), 用该推导式的左侧替换该字符串

具体步骤如下:

  1. ‘a’ “abb” – 不存在RHS匹配'a'
  2. ‘aa’ “bb” – 不存在RHS匹配'a' 或者 'aa'
  3. ‘aab’ “b” – RHS of \(X→b\) matches ‘b’ (first b from left) and so we write as

Consider the string aabb again. We will see a method to parse this:

  1. Scan the input from left to right and see if any substring matches the RHS of any production. If so, replace that substring by the LHS of the production.

So, for “aabb” we do as follows

  1. ‘a’ “abb” – No RHS matches ‘a’
  2. ‘aa’ “bb” – No RHS matches ‘a’ or ‘aa’
  3. ‘aab’ “b” – RHS of \(X→b\) matches ‘b’ (first b from left) and so we write as aaXb
  4. ‘aaX’ “b” – RHS of \(X→aX\) matches “aX” and so we write as
  5. aXb – Again RHS of \(X→aX\) matches “aX” and we get
  6. Xb – RHS of \(X→b\) matches “b” and we get
  7. XX – RHS of \(S→XX\)matches XX and we get
  8. S – the start symbol.

Now what we did here is nothing but a bottom-up parsing. Bottom-up because we started from the string and not from the grammar. Here, we applied a sequence of reductions, which are as follows: \[ aabb → aaXb → aXb → Xb → XX → S \] If we go up and see the Rightmost derivation of the string “aabb”, what we got is the same but in REVERSE order. i.e., our bottom up parsing is doing reverse of the RIGHTMOST derivation(仔细观察上述的替换过程和最开始的最右推导的顺序是相反的). So, we can call it an \(LR\) parser –$ L$ for scanning the input from Left side and \(R\) for doing a Rightmost derivation.

In our parsing we substituted the RHS of a production at each step. This substituted “substring” is called a HANDLE and are shown in BOLD below(在推导过程中,我们在每一步替换了子串,这些子串称为Handle,在下面被加粗了). \[ aa\bold bb → a\bold a \bold Xb → \bold a \bold Xb → X\bold b → \bold X \bold X → S \] Formally a handle is defined as (Greek letters used to denote a string of terminals and non-terminals)

"A handle of a right sentential form ‘γ (γ=αδβ) is a production E→δ and a position in γ where δ can be found and substituted by E to get the previous step in the right most derivation of γ — previous and not “next” because we are doing rightmost derivation in REVERSE. Handle can be given as a production or just the RHS of a production.

The handle is not necessarily starting from the left most position as clear from the above example (从上面的例子可以看到handle不一定开始于RHS的最左边). There is importance to the input string which occurs to the left of the handle (所谓的可行前缀就是当前handle的所有前缀). For example for the handles of “aabb”, we can have the following set of prefixes

aabb {a, aa, aab}
aaXb {a, aa, aaX}
aXb {a, aX}
Xb {X, Xb}
XX {X, XX}

These set of prefixes are called Viable Prefixes (这些前缀集合被称为可行前缀). Formally

" Viable prefixes are the prefixes of right sentential forms that do not extend beyond the end of its handle.

i.e., a viable prefix either has no handle or just one possible handle on the extreme RIGHT which can be reduced.

We will see later that viable prefixes can also be defined as the set of prefixes of the right-sentential form that can appear on the stack of a shift-reduce parser. Also, the set of all viable prefixes of the right sentential forms of a grammar is a REGULAR LANGUAGE. i.e., viable prefixes can be recognized by using a FINITE AUTOMATA. Using this FINITE AUTOMATA and a stack we get the power of a Push Down Automata and that is how we can parse context-free languages.

在后面,可行前缀也可以被定义为出现在shift-reduce解析器的最右推导式的前缀集合。当然,所有的可行前缀是正则语言,所以,可行前缀可以被有限自动机识别。使用有限自动机和stack就可以Push Down Automata,这本质上就是在解析上下无关文法。

参考文献:https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/handouts/100%20Bottom-Up%20Parsing.pdf


斯坦福cs231(编译原理)の 3 Parsing Analysis 1 (Introduction & LL Analysis)
https://mingmingjiang1.github.io/emocoder/2023/06/10/编译原理/编译原理三/
作者
迷途知返
发布于
2023年6月10日
许可协议