The parser has to check the query string (which arrives as
plain ASCII text) for valid syntax. If the syntax is correct a
parse tree is built up and handed back otherwise an error is
returned. For the implementation the well known Unix
tools lex and yacc
are used.
The lexer is defined in the file
scan.l and is responsible
for recognizing identifiers,
the SQL keywords etc. For
every keyword or identifier that is found, a token
is generated and handed to the parser.
The parser is defined in the file gram.y and consists of a
set of grammar rules and actions
that are executed
whenever a rule is fired. The code of the actions (which
is actually C-code) is used to build up the parse tree.
The file scan.l is transformed to
the C-source file scan.c
using the program lex
and gram.y is transformed to
gram.c using yacc.
After these transformations have taken
place a normal C-compiler can be used to create the
parser. Never make any changes to the generated C-files as they will
be overwritten the next time lex
or yacc is called.
Note: The mentioned transformations and compilations are normally done
automatically using the makefiles
shipped with the PostgreSQL
source distribution.
A detailed description of yacc or
the grammar rules given in gram.y would be
beyond the scope of this paper. There are many books and
documents dealing with lex and
yacc. You should be familiar with
yacc before you start to study the
grammar given in gram.y otherwise you won't
understand what happens there.
For a better understanding of the data structures used in
PostgreSQL
for the processing of a query we use an example to illustrate the
changes made to these data structures in every stage.
This example contains the following simple query that will be used in
various descriptions and figures throughout the following
sections. The query assumes that the tables given in
The Supplier Database
have already been defined.
Example 2-1. A Simple Select
select s.sname, se.pno
from supplier s, sells se
where s.sno > 2 and s.sno = se.sno;
Figure \ref{parsetree} shows the parse tree built by the
grammar rules and actions given in gram.y for the query
given in Example 2-1
(without the operator tree for
the where clause which is shown in figure \ref{where_clause}
because there was not enough space to show both data structures in one
figure).
The top node of the tree is a SelectStmt node. For every entry
appearing in the from clause of the SQL query a RangeVar
node is created holding the name of the alias and a pointer to a
RelExpr node holding the name of the relation. All
RangeVar nodes are collected in a list which is attached to the field
fromClause of the SelectStmt node.
For every entry appearing in the select list of the SQL query a
ResTarget node is created holding a pointer to an Attr
node. The Attr node holds the relation name of the entry and
a pointer to a Value node holding the name of the
attribute.
All ResTarget nodes are collected to a list which is
connected to the field targetList of the SelectStmt node.
Figure \ref{where_clause} shows the operator tree built for the
where clause of the SQL query given in
Example 2-1
which is attached to the field
qual of the SelectStmt node. The top node of the
operator tree is an A_Expr node representing an AND
operation. This node has two successors called lexpr and
rexpr pointing to two subtrees. The subtree attached to
lexpr represents the qualification s.sno > 2 and the one
attached to rexpr represents s.sno = se.sno. For every
attribute an Attr node is created holding the name of the
relation and a pointer to a Value node holding the name of the
attribute. For the constant term appearing in the query a
Const node is created holding the value.
The transformation process takes the tree handed back by
the parser as input and steps recursively through it. If
a SelectStmt node is found, it is transformed
to a Query
node that will be the top most node of the new data structure. Figure
\ref{transformed} shows the transformed data structure (the part
for the transformed where clause is given in figure
\ref{transformed_where} because there was not enough space to show all
parts in one figure).
Now a check is made, if the relation names in the
FROM clause are known to the system. For every relation name
that is present in the system catalogs a RTE node is
created containing the relation name, the alias name and
the relation id. From now on the relation ids are used to
refer to the relations given in the query. All RTE nodes
are collected in the range table entry list that is connected
to the field rtable of the Query node. If a name of a
relation that is not known to the system is detected in the query an
error will be returned and the query processing will be aborted.
Next it is checked if the attribute names used are
contained in the relations given in the query. For every
attribute} that is found a TLE node is created holding a pointer
to a Resdom node (which holds the name of the column) and a
pointer to a VAR node. There are two important numbers in the
VAR node. The field varno gives the position of the
relation containing the current attribute} in the range
table entry list created above. The field varattno gives the
position of the attribute within the relation. If the name
of an attribute cannot be found an error will be returned and
the query processing will be aborted.