斯坦福cs231(编译原理)の 5 Semantic Analysis

Type Checking Rules and How to Implement ?

Cool type can be implemented in a single traversal over the AST

  • Type environment is passed down the tree (From parent to child),环境自上而下是在不断扩增的
  • Types are passed up the tree from (From child to parent),类型是自底向上推导和检查的

🌰: \[ \frac{O,M,C\vdash e_1:Int \quad O,M,C\vdash e_2:Int }{O,M,C\vdash e_1 + e_2:Int } \]

Environment(Object, Method, Class), 环境包括当前对象,当前方法及当前类

1
2
3
4
5
6
TypeCheck(Envirment, e1 + e2) = {
T1 = TypeCheck(Envirment, e1)
T2 = TypeCheck(Envirment, e2)
check T1 == T2 == Int
return Int
}

🌰: \[ \begin{equation} \begin{aligned} \frac{ O,M,C\vdash e_0:T_0 \quad O[T/x],M,C\vdash e_1:T_1 \quad T_0 <= T}{O,M,C\vdash e_1 + e_2:Int } \end{aligned} \end{equation} \]

1
2
3
4
5
6
7
8
TypeCheck(Envirment, let x: T <- e0 in e1) = {
T0 = TypeCheck(Envirment, e0)
T1 = TypeCheck(Envirment.add({x: T}), e1)
check SubType(T0, T)
return T1
}

自上而下,environment不断扩增,自底向上检查类型

Introduction to Semantic Analysis

  • Lexical analysis

    • Detects inputs with illegal tokens
  • Parsing

    • Detects inputs with ill-formed parse trees
  • Semantic analysis

    • Last. Front end phase
    • Catched all remaining errors

Why do semantic analysis ?

  • Parsing can't catch some errors
  • Some langaguage constructs not context-free

(2 封私信 / 25 条消息) 应该如何理解「上下文无关文法」? - 知乎 (zhihu.com)

Semantic analysis will do ?

  • All identifiers are declared
  • Type checking
  • Inheritance relationships checking
  • Classes defined ony once
  • Methods in a class defined only once
  • Reserved identifiers are not misused
  • ...

静态作用域

Matching static declarations with uses

  • Important static analyisis step in most language
  • including COOL

🌰:

1
2
3
let y: String <- "abc" in y + 3 // 在后面的语句里y为string类型

let y: Int in x + 3 // 没有看到x的定义,将会报错

The scope of an identifier is the portion of a program in which that identifier is accessible

计算机领域里变量作用域指的是程序中变量的起作用(可以get)的一段范围

不同作用域里可存在同名变量,同一个作用域不可出现不同名变量

一个变量的作用域范围是有限的

许多语言都是静态作用域:

作用域仅仅依赖程序文本,根据变量在程序文本的位置决定变量的作用域,没有任何运行时决定的行为,eg: Cool

动态作用域:

Lisp: Lisp has changed to mostly static scoping

Scope depends on execution of the program

🌰:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
let x: Int <- 0 in
{
x; =>>>>>> 1
let x: Int <- 1 in
x; =>>>> 2
x; =>>>>>> 1
}

上面1x2x是不同的作用域里x

动态作用域里的变量总是指向离它最近的变量

g(y) = let a <- 4 in f(3);
f(x) = a =>>>>> a = 4

Cool identifier bindings are introduced by:

  • Class declarations (introduce class names)
  • Method declarations (introduce method names)
  • Let declarations (introduce object id's)
  • Formal declarations (introduce object id's)
  • Attribute declarations (introduce object id's)
  • Case declarations (introduce object id's)

Cool 的标识符并不都是允许嵌套的

比如类定义不允许嵌套,类名使用前必须定义

静态作用域实现(符号表)

Much of semantic analysis can be expressed as a recursive descent of on an AST

  • Before: Process an AST node n
  • Recurse: Process the children of n
  • After: Finish processing the AST node n

When performing semantic analysis on a portion of the AST, we need to know which identifiers are defined

当在AST的某部分上执行语义分析时,需要知道哪些标识符已经被定义了。

1
2
3
4
5
6
7
let x: Int <- 0 in e

let x(sym)

init = 0.(sym) e(sym + x)

x is defined in subtree e

Recall: let x: Int <- 0 in e

Idea:

  • Before processing e, add definition of x to current definitions, overriding any other definition of x
  • Recurse
  • After processing e, remove definition of x and restore old definition of x

a symbol table is a data structure that tracks the current bindings of identifiers

简单的实现可以使用stack

  • Add_symbol(x)
  • Find_symbol(x) search stack, starting from top fro x, return x or NULL if none found
  • Remove() pop from ths stack
  • exit_scope() exit current scope
  • Check_scope(x) true if x defined in current scope

在COOL里使用stack即可对let声明的变量进行处理

每次用let声明一个变量,即push入stack,而且也允许嵌套,每次退出当前let范围,即pop出变量;

但是对于函数确不好使用,比如函数参数,一次有多个参数,并且有重名的参数,所以得一次push多个变量,但是多个变量是在同一个层级里,而stack的解决方案默认一个变量一个层级

class names can be used before being defined in COOL

solution:two passes to tranversal on AST

  • Pass1: Gather all class names
  • Pass2: Do the checking

类型

什么是类型,不同的语言对于类型定义不同,但通常意义上,一致认为被定义如下:

  • 一些值的集合(数据)
  • 可操作这些值的一些操作(算法/函数/方法)

必须确保在转化为机器语言之前做好类型检查哦,因为机器语言执行指令不会做任何检查。语言类型系统知名里在哪些类型上的哪些操作是合法的。语言的类型检查目的就是确保在正确的类型上执行正确的操作。

类型检查时期:

  • 编译时期执行类型检查:C, COOL, Java
  • 运行时类型检查:程序执行的时候执行类型检查, Python, Lisp, Perl
  • 无类型检查:machine code

静态类型检查 vs 动态类型检查

  • 静态检查可以在编译时期捕获错误
  • 避免运行时错误

动态检查:

  • 静态类型检查是有局限性的,有些语句在静态类型不好检查出来,比如继承的时候,子类调用父类的方法

现代编程语言很多结合了两者,如:

C,java可以使用unsafe逃脱静态检查

People retrofit static typing to dynamically typed languages for debugging or optimization

类型检查:验证类型的合法性

类型推断:自动推断和填充缺失的类型信息

之前已经见过一些formal notions,比如:

  • Regular expressions
  • Context-free grammars

类型检查也有一套标准的逻辑推导规则:

推导规则具有如下形式:

If Hypothesis is true, then Conclusion is true

类型检查就是通过这样的推理:

If E1 and E2 have certain types, then E3 has a certain type

1
2
3
^ => and
=> if-then
x: T => x has type 'T'

if e1 has type int and e2 type int, then e1 + e2 has type int =======>

e1 has type ^ e2 has type int => e1 + e2 has type int =======>

(E1: int ^ e2 : int) => e1 + e2: int

the statement (E1: int ^ e2 : int) => e1 + e2: int is a special case of hypothesis1 ^ ... ^ hypothesisn => Conclusion

一般情况下,上面的推导规则写出如下符号形式: \[ \frac{\vdash hypothesis1 \vdash hypothesis2 ... \vdash hypothesisn}{\vdash Conslusion} \] \(\vdash\) means ''it is provable that ..."

看几个简单的推导规则: \[ \frac{i \quad is \quad an \quad interger \quad literal}{\vdash i: int} \]

\[ \frac{e1: int \quad e2: int}{\vdash e1 + e2: int} \]

类型检查可以很好的基于AST来做:

如果已经证明了子表达式的类型,那么父节点的类型也就很好地得到,所以类型检查和推断是自底向上做的

类型环境

在类型推导/检查中,有的时候一个变量的类型信息需要更多的信息才能获得该变量的类型,所以在我们的推导规则里需要加入更多的类型信息——类型环境,类型环境给予了自由变量更多的相关类型信息

类型环境:a fucntion from object identifiers to Types (即标识符到类型的映射)

自由变量:如果一个表达式里存在一个变量没有被定义,那么该变量就是自由变量

1
2
let y: int <- 0 in x + y
这里let限定下,x是未定义的,而y是int类型且初始化为0,在进行类型检查的时候,我们只有拥有了前面关于x的信息,才能知道x的类型

let O be function from Object idenfiers to types, notion as follow: \[ O\vdash e: T \] 读作:在环境O下,可以证明e是T类型

单点修改: \[ \begin{equation} \begin{aligned} \frac{ O[T/x]\vdash e_1:int \quad }{O\vdash e_1 + e_2:Int } \end{aligned} \end{equation} \]

1
2
3
表示
O[T/x](x) = int
O[T/x](y) = y (x != y)

实际上类型环境O可以由之前的Symbol table实现,因为

  • 类型环境给予当前作用域内部标识符类型(和作用域类型,只不过作用域是记录标识符)
  • 类型环境也是自上向下不断扩增和单点修改
  • 但是类型检查和推断是自底向上的

Subtypings

subtypings 主要用于继承类型1判断 \[ \begin{equation} \begin{aligned} \frac{ O\vdash e_0:T_0 \quad O[T_0/x]\vdash e_1:T_1 \quad }{O\vdash let \quad x: T_0 = e_0 \quad in \quad e_1: T_1 } \end{aligned} \end{equation} \] 在类上定义关系:<=

  • X <= Y
  • X <= Y if X inherits from Y
  • X <= Z if X <= Y and Y <= Z

\[ \begin{equation} \begin{aligned} \frac{ O\vdash e_0:T_0 \quad O[T_0/x]\vdash e_1:T_1 \quad T_0 <= T }{O\vdash let \quad x: T_0 = e_0 \quad in \quad e_1: T_1 } \end{aligned} \end{equation} \]

属性初始化: \[ \begin{equation} \begin{aligned} \frac{ O_c(x) = T_0 \quad O_C\vdash e_1:T_1 \quad T_1 <= T0 }{O\vdash x: T_0 = e_1 } \end{aligned} \end{equation} \]

考虑 if e0 then e1 else e2 fi

表达式的结果要么e1要么e2,对应的类型要么是e1的类型要么是e2的类型

那么整个if-else表达式返回什么类型?

最好的方法是返回e1和e2的最小上界类型(least upper type)

least upper bound: lub(X, Y),表示X和Y的最小上界类型,

if X <= Z ^ Y <= Z: Z is an upper bound

If X <= Z' ^ Y <= Z', Z <= Z': Z is an upper bound

COOL里,两个类型的最小上界是它们继承树里最近公共祖先

\[ \begin{equation} \begin{aligned} \frac{ O\vdash e_0:Bool \quad O\vdash e_1:T_1 \quad O\vdash e_1:T_2 \quad }{O\vdash if \quad e_0 \quad then \quad e_1 \quad else \quad e_2 \quad fi \quad fi: lub(T_1, T_2) } \end{aligned} \end{equation} \]

Typing Methods

\[ dispatch: \begin{equation} \begin{aligned} \frac{ O\vdash e_0:T_0 \quad O\vdash e_1:T_1 \quad ... \quad O\vdash e_n:T_n \quad }{O\vdash e_0.f(e_1, ... e_n): ? } \end{aligned} \end{equation} \]

不知道任何关于f的信息,如何知道f的返回值类型?

In cool, method and object identifiers live in different name spaces

  • A method foo and an object foo can coexisted in the same scope

In the type rules, this is reflected by a seperate mapping M for method signatures \[ M(C, f) = (T_1, ... T_n, T_{n+1}) means \quad in \quad Class \quad C \quad there \quad is \quad \\ a \quad method \\ f: \quad f(x_1: T_1, ... x_n: T_n): T_{n+1} \]

dispatch \[ dispatch: \begin{equation} \begin{aligned} \frac{ O\vdash e_0:T_0 \quad O\vdash e_1:T_1 \quad ... \quad O\vdash e_n:T_n \quad T_i <= T_{i'} 1 <= i <= n \quad M(T_0, f) = (T_{1'}, ... T_{n'}, T_{n+1}) }{O\vdash e_0.f(e_1, ... e_n): T_{n+1} } \end{aligned} \end{equation} \]

静态dispatch \[ static dispatch: \begin{equation} \begin{aligned} \frac{ O\vdash e_0:T_0 \quad O\vdash e_1:T_1 \quad ... \quad O\vdash e_n:T_n \quad T_0 <= T \quad T_i <= T_{i'} 1 <= i <= n \quad M(T_0, f) = (T_{1'}, ... T_{n'}, T_{n+1}) }{O\vdash e_0@T.f(e_1, ... e_n): T_{n+1} } \end{aligned} \end{equation} \] 最终结合方法环境,变量环境和类, \[ \begin{equation} \begin{aligned} \frac{ O, M, C\vdash e_1:int \quad O, M, C\vdash e_2:int \quad }{O, M, C\vdash e1 + e2: int } \end{aligned} \end{equation} \] 上面的规则仅适用于COOL

General themes:

  • Type rules are defined on the structure of expressions
  • Types of varaibles are modeled by an environment

Static Type vs Dynamic Type

静态类型检查:

可以在编译时期发现一些常见的错误,但是也有一些不是错误的代码被识别出来,报告错误,这个时候需要动态类型检查,但是实际上为了让有些不是错误的代码能够通过静态类型检查,需要一些特殊的方法:

1
2
3
4
5
6
7
8
class A {}
class B interits A {}
class Main {
x: A <- new A; // 静态检查可以通过,左右类型一致
...
x <- new V; // 静态检查通不过,左右类型不一致
...
}

理想情况下,对于所有的表达式:dynamic_type(E) <= static_type(E)

对于任意表达式E,dynamic_type(E) <= static_type(E),表示表达式E的静态类型检查得到的类型是对应表达式动态检查得到得类型的父类

All operations that can be used on an object if type C can also be used on an object of type C' <= C

  • Such as fetching the value of attribute

  • Or invoking a method on the object

  • Subclass only add attributes or methods

  • ```

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19





    ## Self Type

    考虑如下代码:

    ```java
    class Count {
    i : int <- 0;
    inc () : Count {
    {
    i <- i + 1;
    self;
    }
    };
    };
  • 类Count包含一个计数器。

  • inc方法适用于任何子类。

  • 考虑Count的子类Stock

1
2
3
4
5
6
7
8
class Stock inherits Count { 
name : String; -- name of item
};

class Main {
Stock a <- (new Stock).inc();
... a.name ...
};

上述用法在已有的类型系统会报错,因为inc返回的类型是Count,但但是赋值的对象是Stock对象

  • (new Stock).inc()具有动态类型Stock;

  • 因此如下语句是合法的;

    Stock a <- (new Stock).inc();

  • 但这不是很好的类型系统,因为(newStock).inc()的静态类型为Count;

  • 类型检查器“丢失”类型信息

    • 这使得继承inc毫无用处 ;
    • 因此,我们必须为每个子类重新定义inc,并使用专门的返回类型 。

解决上述问题需要引入SELF_TYPE

我们将使用self type扩展类型系统:

  • inc返回self;
  • 因此,返回值与“self”具有相同的类型;
  • 可以是Count或Count的任何子类型!

引入关键字SELF_TYPE以用于此类函数的返回值,SELF_TYPE允许在继承inc时更改inc的返回类型

修改inc的声明:

1
inc(): SELF_TYPE {...}

类型检查系统现在可以证明:

1
2
O, M, C |- (new Count).inc(): Count
O, M, C |- (new Stock).inc(): Stock

注意SELF_TYPE不是动态类型

  • 它也不是类名;
  • 它是静态类型;
  • 帮助类型检查器更好地跟踪类型;
  • 使类型检查器接受更正确的程序

简而言之,拥有SELF_TYPE可以提高类型系统的表达能力。

What can be the dynamic type of the object returned by inc ?(inc将返回什么样的动态类型)

  • Answer: whatever could be the Count or any subclass of Count of 'self'
1
2
3
4
5
class A inherits Count {};

class B inherits Count {};

class C inherits Count {};

In genreral, if SELF_TYPE appears textually in the class C as the declared type of E then

1
dynamic_type(E) <= C

Note: The meaning of SELF_TYPE depends on where it appears , \(SELF\_TYPE_c\) refers to an occurence of SELF_TYPE in the body of C

\[ SELF\_TYPE_c <= C \]

In type checking it is always safe to replace \(SELF\_TYPE_c\) by C \[ SELF\_TYPE_c <= SELF\_TYPE_c \] in cool, we never compare SELF_TYPEs coming from different classes \[ SELF\_TYPE_c <= T \quad if \quad C <= T \]

  • \(SELF\_TYPE_c\) can be any subtype of c
  • This includes C itself
  • Thus this is the moe flexible rule we can allow

let T and T' be any types but SELF_TYPE

  1. \(lub(SELF\_TYPE_c, SELF\_TYPE_c) = SELF\_TYPE_c\)
  2. \(lub(SELF\_TYPE_c, T) = lub(C, T)\)
  3. \(lub(T, SELF\_TYPE_c) = lub(C, T)\)
  4. \(lub(T, T') defined as before\)

哪里使用SELF_TYPE

self type checking

错误恢复

与解析一样,从类型错误中恢复也很重要

目标:

  • What type is assigned to an expresson with no legitimate type ?
  • This type will influence the typing of the enclosing expression

方法1:

assign type Object to ill-typed expressions

1
2
3
4
let y: Int <- x + 2 in y + 2
error: x is undefind: => x is Object
error: + applied to Object => x + 2 is Object
error: bad assimend => ... (Int and Object is incompatible)

方法2:

introduce No_type for use with ill-typed expressions

  • DefineNo_type <= C for all types C
  • Every operation os defined for No_type
    • With a. No_type result
1
2
let y: Int <- x + 2 in y + 2
x is No_type => x + 2 is No_type => no error becuse No_type is subtype of any type

总结:

实际编译器也会使用类似No_type的东西

然而,引入No_type会带来跟多复杂性:类的继承结构将不再是树状,因为每个类都有一个子类

cool里使用了第一种方案


斯坦福cs231(编译原理)の 5 Semantic Analysis
https://mingmingjiang1.github.io/emocoder/2023/06/04/编译原理/编译原理五/
作者
迷途知返
发布于
2023年6月4日
许可协议