Compiler for a simple straight-line language.
- gcc used to compile the compiler modules.
- criterionlib used to run tests.
Once requirements are met, runningmake test
should run all tests.
- id
- num
,
;
+
-
*
/
(
)
=
- program
- Stm
- Exp
- ExpList
- BinOp
program -> Stm $
Stm -> Stm ; Stm
Stm -> id = Exp
Stm -> print ( ExpList )
Exp -> id
Exp -> num
Exp -> Exp BinOp Exp
Exp -> ( Stm, Exp )
ExpList -> Exp, ExpList
ExpList -> Exp
BinOp -> +
BinOp -> -
BinOp -> *
BinOp -> /
All modules relevant to lexical analysis are found in the scanner directory.
Lexical analysis is performed with a scanner produced by lex.
Tokens of the language are defined in header file tokens.h.
mylang.lex is the configuration file of lex.
Running make lexer
produces the scanner and running src/scanner/lexer test.mylang
prints the sequence of scanned tokens from file test.mylang.
All modules relevant to syntax analysis are found in the parser directory.
The syntax analyser is an SLR parser, which is a slightly tweaked LR(0) parser.
A production rule combined with a dot denoting the position of the parser in the right hand side (rhs) of the production rule, is called an LR(0) item. e.g. Stm -> Stm • ; Stm
is an item meaning that the parser has parsed the first Stm
symbol and is now right after it.
Briefly, the parser is a Deterministic Finite Automaton (DFA) where each state is a set of items and edges are grammar symbols (both terminals and non-terminals).
LR(0) parsing is performed by consulting what is called a parse table and by keeping one stack for the states and one for the grammar symbols. The parse table is a table where each column corresponds to a grammar symbol, each row corresponds to a state and each cell contains an action. Available actions are: SHIFT, GOTO, REDUCE, ACCEPT and REJECT. SHIFT and GOTO actions transfer the automaton to another state (the difference between them being that SHIFT does so by popping a terminal symbol from the stack whereas GOTO pops a non-terminal symbol from the stack), REDUCE applies a production rule, i.e. it pops the rhs of a production rule from the stack and pushes the lhs, ACCEPT is the action denoting successfull parsing and REJECT is the action related to a syntax error.
In order to calculate the set of states and edges of the DFA the following algorithm (states.c) is used (tiger book, p.61):
Let T be the set of states, E the set of edges
Initialize T to {closure({first_item})} where first_item = program -> • Stm $
Initialize E to the empty set {}
for each state SN in T:
for each item i in SN:
if dot is in the end of i's rhs: continue
let X be the symbol after the dot in i's rhs
J = goto(SN, X)
if J is empty: continue
add {J} to T
add {SN->J, X} to E
closure
adds more items to a set of items when there is a dot to the left of a non-terminal symbol. goto
moves the dot past the symbol X in all items.
Here are the algorithms used for closure
and goto
(operations.c):
closure(I):
for each item i in I:
if dot is in the end of i's rhs: continue
let X be the symbol after the dot in i's rhs
add X -> •γ to I for each item X -> •γ
add Y -> •γ to I for each item Y -> •γ where X -> Y
return I
goto(I, X):
Initialize J to the empty set {}
for each item i in I:
if dot is in the end of i's rhs or dot is not before X: continue
add A -> α X • β to J where i = A -> α • X β
return closure(J)
Edges between states represent SHIFT and GOTO actions; REDUCE actions are calculated as below:
Let R be the set of REDUCE actions, T the set of states
Initialize R to the empty set {}
for each state SN in T:
for each item i in SN:
if the dot is not last in i's rhs: continue
let p be the production rule of item i
add (REDUCE p, SN) in R
where (REDUCE p, SN)
means reduce by production rule p
when in SN
and rhs of p is the next symbol.
Once states, edges and actions are all calculated the parse table is ready to be populated. However we might get what is called a conflict in the parse table, meaning that we might have a cell containing more than one action. Depending on the actions of the cell the conflict is called shift/shift, shift/reduce or reduce/reduce.
To eliminate most conflicts of the LR(0) parsing table, SLR parsing is used (SLR standing for Simple LR). The only difference of SLR parsing from LR(0) is the algorithm by which REDUCE actions are calculated (reduces
in automaton.c):
Let R be the set of REDUCE actions, T the set of states
Initialize R to the empty set {}
for each state SN in T:
for each item i in SN:
if the dot is not last in i's rhs: continue
let p be the production rule of item i
S = follow(A) where A is the lhs of p
for each token X in S:
add (REDUCE p, SN, X) to R
where (REDUCE p, SN, X)
means reduce by production rule p
when in SN
and X
is the next symbol.
folow(A)
, for non-terminal A
, is the set of terminals that can come after A
in some production rule (MIT OCW notes, p.39).
Here is the algorithm used for follow
(follow
in automaton.c):
follow(X)
Initialize S to the empty set {}
for each production p:
for each symbol s in p's rhs:
if s != X: continue
if s is last rhs symbol:
let A be p's lhs
if A == X: continue
S2 = follow(A)
else S2 = first(a) where a the symbol after s
S = S union S2
return S
first(X)
is set of terminals that can begin strings derived from X
(tiger book, p.49).
Here is the algorithm used for first
(first
in automaton.c):
first(X)
Intialize S to the empty set {}
if X is terminal: return {X}
for each production p:
if p's lhs == X:
let s be the first symbol of p's rhs
if s is terminal:
add s to S
else if p's rhs has only one symbol:
S2 = first(s)
S = S union S2
return S