\newcommand{\Env}{\mathsf{Env}} \newcommand{\PP}{\mathsf{PP}} \newcommand{\states}{\mathsf{states}} \newcommand{\state}{\mathsf{state}} \renewcommand{\b}[1]{\mathbf{#1}} \newcommand{\semantics}[1]{\llb #1 \rrb} \newcommand{\Stmt}{\mathsf{Stmt}} \newcommand{\Vars}{\mathsf{Vars}} \chapter{Background} \label{chap:litreview} In this chapter we review the relevant literature pertaining to static analysis. We briefly cover the semantics of programs generally before looking more closely at the framework of abstract interpretation for program analysis. The chapter concludes with an explanation of the particular method of abstract interpretation over intervals and zones which is implemented in this thesis. \section{Program Semantics} In order to perform any static analysis of a program it is necessary to have some notion of the \emph{semantics}, or meaning, of a program. The semantics of a program can be captured in several ways, but for our purposes we consider \emph{denotational}, or \emph{mathematical}, \emph{semantics} \cite{SCOTT70B}. Denotational semantics views a program as a (partial) function built up by the composition of many smaller functions. By expressing our expressions, assignments, conditionals and loops as smaller functions we can compose them to determine the semantics of the whole program. The operator $\semantics{\cdot}$ represents the semantics of its argument as a mathematical function. In imperative languages programs are not a series of functional transformations, however, each statement or expression may have side-effects, so it is necessary to introduce some notion of ``state'' into our semantics. By defining $\semantics{\cdot}: \states \to \states$ for all statements $x$ we provide a mechanism for statements to perform side effects. Elements of $\states$ each represent a possible state for the program to be in. For example: for an integral assignment statement $I := E$, assigning the value of an expression $E$ to a variable $I$, we have the following denotation: \begin{align*} \semantics{I := E} = \lambda \rho . (\rho \oplus \{I \mapsto \semantics{E}(\rho)\}) \end{align*} where $\rho \in \states$ and $\semantics{E}: \states \to \Z$. Our $\states$ in this case are functions $\rho: \Vars \to \D$ from variables to values. Denotational semantics provide a formal abstract framework for us to analyse programs mathematically, but they do not provide a meaningful way to perform automated analysis of programs. As they perfectly capture the semantics of a program they do not avoid the consequences of Rice's Theorem \cite{Rice} or the Halting Problem \cite{Halting} which preclude us from performing precise analysis of programs in general. \section{Abstract Interpretation} \label{sec:abstract-interpretation} In order to work around the limitations on precise static analysis in general, Cousot et al.~ developed a framework for performing sound program analysis called ``abstract interpretation'' \cite{CousotCousot77-1}. In this framework we consider a program's semantics under an abstraction. The original semantics we call ``concrete semantics'' and the abstracted semantics we call ``abstract semantics''. The abstract semantics consist of replacing $\D$ and $\state$ from the denotational semantics with ``abstracted'' structures. To provide us with some intuition of these concepts, consider the following example: \begin{example} Let us assume that our program's state is represented by $\rho: \Vars \to \Z$, with values coming from $\Z$. One possible abstraction would be to instead consider a variable's \emph{sign}. That is, to consider $x$ as satisfying one of the following (the symbols in parentheses are used to refer to these values from now): \begin{align*} x &\in \Z & (\pm) \\ x &\ge 0 & (+) \\ x &\le 0 &(-) \\ x &= 0 &(0) \end{align*} From this abstraction we can then determine something about normal arithmetic operations in our program. We know, for instance, that $(+) \times (-) = (-)$, or that $0 \times (\pm) = 0$. We also, then, know that $x / 0$ is certainly an error. \label{example:abstract-interpretation} \end{example} Abstraction sacrifices precision of analysis for decidability in general. By abstracting our domain in a way which is ``sound'', which safely over-approximates our concrete semantics, we ensure that any analysis we perform on the abstraction is also valid for the concrete semantics. Unfortunately in order to make this abstraction we sacrifice precision, so it is possible that the analysis will identify errors which are not errors in the concrete semantics. \subsection{General Framework} In order to transform between the concrete and abstract domains we define two functions: $\alpha$ and $\gamma$. The function $\alpha: 2^\states \to \states_\alpha$ transforms a concrete state into an abstract state. The function $\gamma: \states_\alpha \to 2^\states$ transforms an abstract state into a set of concrete states. We require these two functions to satisfy the properties of a Galois connection: \begin{align*} a \le \gamma(b) &\iff \alpha(a) \le b & \forall a \subseteq 2^\states, b \in \states_\alpha. \end{align*} \noindent As a consequence of this, the result of abstracting $a \in \states$ must at least contain $a$. In particular we are looking for the \emph{best} abstract transformers, such that $\alpha(a)$ is as tight as possible for a state $a$. Abstract interpretation follows on from the idea of denotational semantics, and it is possible to determine the abstract semantics of an operation based on its concrete semantics. We use $\semantics{\cdot}^\#$ to represent the abstract semantics of a statement, and define the transformation as $\semantics{s}^\# = \alpha \circ \semantics{s} \circ \gamma$. We can now formalise the notion of a Control Flow Graph (henceforth CFG) and define its abstraction. We consider a CFG to be a tuple $G = (N, E, s, st)$. A finite set of \emph{control-points} $N$, a finite set $E \subseteq N \times N$ of \emph{edges} between these control points (ie. if $(u,v) \in E$ then an edge exists in the CFG from $u$ to $v$). A mapping $s: N \times N \to \Stmt$ taking each edge in the CFG to a \emph{statement} $\in \Stmt$. $st \in N$ denotes the special \emph{start control-point}. We denote our abstract semantics of a node in our CFG by $V^\#[v], \forall v \in N$ such that the following hold: \begin{align} V^\# [st] &\ge \alpha(\Z^n) \label{eqn:start-node} \\ V^\# [v] &\ge \llb s_{(u,v)} \rrb ^\# (V^\# [u]) & \forall (u,v) \in E \label{eqn:confluence} \end{align} In the first equation, the possible abstract states of the start node must always be the completely unconstrained abstract domain. Any value is possible for any variable on the first node of the CFG. The second equation requires that the possible abstract states of a node, $v$, contains either the same amount of information or less information than each of the predecessors of $v$ in the CFG. \begin{figure} \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=2cm,main node/.style={circle,fill=blue!20,draw},every loop/.style={min distance=1.5cm}] \node[main node] (V) {$v$}; \node[main node] (U) [above left of=V] {$u$}; \node[main node] (W) [above right of=V] {$w$}; \path[every node/.style={fill=none}] (U) edge node[left]{$s_{(u,v)}$} (V) (W) edge node[right]{$s_{(w,v)}$} (V); \end{tikzpicture} \caption{A simple CFG fragment.} \label{cfg:confluence} \end{figure} So, if we have the CFG fragment shown in Figure \ref{cfg:confluence} then we know that $V^\#[v]$ must `contain', or be larger than, each of the edges coming in to $v$. In this case there are two incoming edges to $v$: $(u,v)$ and $(w,v)$. The edge $(u,v)$ leads to the state $\llb s_{(u,v)} \rrb^\# (V^\#[u])$. It is clear that $V^\#[v]$ must be either the same size or `larger' than this value, as it must either equal this constraint or it must be a looser constraint. This gives us the inequality \eqref{eqn:confluence}: $V^\#[v] \ge \llb s_{(u,v)} \rrb ^\# (V^\#[u])$. Similarly by considering the edge $(w,v)$ and following the same argument we find that $V^\# [v] \ge \llb s_{(w,v)} \rrb ^\# (V^\# [w])$. This analysis thus far has avoided defining any particular abstract domain, but in practice we must choose an abstract domain before performing our analysis. Much has been written on the topic of abstract domains, ranging from the relatively simple, such as signs and intervals \cite{CousotCousot77-1} to the more complex zones \cite{mine:padoII}, octagons \cite{mine:hosc06} and more general convex polyhedra \cite{DBLP:conf/vmcai/SankaranarayananSM05}. For this thesis we are only concerned with \emph{intervals} and \emph{zones}. \subsection{Intervals} The abstract domain of \emph{intervals} was introduced by Cousot et al. \cite{CousotCousot77-1} along with abstract interpretation. In this domain we constrain each variable $x$ to lie within a range $[a, b] \in \CZ^2$. This domain is finer than signs, so it allows more information about the program to be determined. Another equivalent formulation of intervals is that we are finding bounds such that \begin{align*} x &\le b \text{ and,} \\ -x &\le -a. \end{align*} %\noindent We are finding \emph{upper bounds} for $x$ and $-x$ at each %location in the program. \subsection{Zones} \emph{Zones} are an abstract domain introduced by Min\'e \cite{mine:padoII} which generalise intervals to include also bounding the differences between variables. Min\'e's work on zones builds off work done in the model checking of timed automata \cite{yovine-zones}. For each variable $x$ zones constrain the variable and its negation in the same way as intervals, \begin{align*} x &\le b \text{ and,} \\ -x &\le a. \end{align*} Zones additionally provide us with bounds on \emph{differences} between variables. So for each pair of variables $(x,y)$, zones provide us with a bound \begin{align*} x - y & \le c. \end{align*} \section{Min/max problem} In \cite{Gawlitza:2007:PFC:1762174.1762203} a conversion from a program's CFG to a system of equations involving only monotonic, expansive operators is presented. This conversion results in an equation system which can be solved in order to perform abstract interpretation over zones. \subsection{Partial-ordering Formulation} \begin{example} \label{example:partial-ordered-construction} As an example, let us consider the following program: \begin{figure}[here] \begin{minipage}{5cm} \begin{lstlisting} x = 1 while x <= 9: x = x + 1 \end{lstlisting} \end{minipage} \hfill \begin{minipage}{0.45\textwidth} \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=2cm,main node/.style={circle,fill=blue!20,draw},every loop/.style={min distance=1.5cm}] \node[main node] (A) {$A$}; \node[main node] (B) [below of=A] {$B$}; \node[main node] (C) [below of=B] {$C$}; \path[every node/.style={fill=none}] (A) edge node[right]{$x = 1$} (B) (B) edge [loop right] node[right]{$x \le 9: x = x + 1$} (B) (B) edge node[right]{$x > 9$} (C); \end{tikzpicture} \end{minipage} \caption{A simple program with its corresponding CFG} \label{fig:partial-ordered-construction} \end{figure} We now attempt to derive a system of inequalities from this for the \emph{least upper bound} of a variable $x$ at each program point. That is, we want to find a value $ub(x)$ such that, for all possible executions of the program $x \le ub(x)$. We wish to find a value for $ub(x)$ at each program point in our original program. We denote these values by $ub_A(x)$, $ub_B(x)$ and $ub_C(x)$. For $A$ the value of $x$ is entirely unknown, so no useful bound can be given. The possible values of $x$ are bounded only by the top value in $\CZ$, which is $\infty$. This gives us the following inequality for node $A$: \begin{align*} ub(x_A) &\ge \infty \end{align*} When we reach $B$ we have two incoming edges. One comes from $B$ itself, the other from $A$. The edge from $A$ to $B$ is very simple. It has no constraint and sets the value of $x$ to $1$. It corresponds to an equation of the form $ub(x_B) \ge 1$. However, there is also the possibility that $ub(x_A)$ will be unreachable, in which case $ub(x_B)$ will also be unreachable. This makes the inequality \begin{align*} ub(x_B) \ge \left\{\begin{array}{lc} 1 & \text{if } ub(u_A) > -\infty \\ -\infty & \text{otherwise} \end{array}\right. \end{align*} The edge from $B$ to itself is somewhat more complex. To begin with it changes the value of $x$ depending on the previous value of $x$. It also is executed conditionally (when $x \le 99$), which affects the bound on $x$. Because the condition limits the growth of $x$ the corresponding equation for this edge is $ub(x_B) \ge \min(x, 99) + 1$. This leaves us with two simultaneous constraints on the value of $ub(x_B)$. \begin{align*} ub(x_B) &\ge 1 \\ ub(x_B) &\ge \min(ub(x_B), 9) + 1 \end{align*} The upper bound on the value of $x$ for node $C$ is the same as the upper bound on the value of $x$ in $B$, because the edge from $B$ to $C$ does not change the value of $x$, nor does it provide any more details about the upper bound of $x$. (It does change the lower bound of $x$, but we have not considered the lower bound in this analysis.) This gives us a very simple equation system for $C$. \begin{align*} ub(x_C) &\ge ub(x_B) \end{align*} So, the upper bounds for $x$ can be found at every program point in our original program by finding the smallest values of $ub(x_A), ub(x_B), ub(x_C)$ subject to these constraints: \begin{align*} ub(x_A) &\ge \infty \\ ub(x_B) &\ge \left\{\begin{array}{lc} 1 & \text{if } ub(u_A) > -\infty \\ -\infty & \text{otherwise} \end{array}\right. \\ ub(x_B) &\ge \min(ub(x_B), 9) + 1 \\ ub(x_C) &\ge ub(x_B) \end{align*} The smallest values for $ub(x_A), ub(x_B), ub(x_C)$ which satisfy these constraints are: \begin{align*} ub(x_A) &= \infty \\ ub(x_B) &= 10 \\ ub(x_C) &= 10 \end{align*} So, at program point $A$: $x \le \infty$; at $B$: $x \le 10$ and at $C$: $x \le 10$. \end{example} \subsection{Lattice construction} \label{subsec:lattice-construction} The partial-ordering formulation considered above is a helpful presentation of the problem in terms of constraints, but an equivalent equation system can be constructed to aid us in locating the \emph{least} solution to the constraint system. For every partial-ordering on a set there exists an equivalent lattice, defined in terms of a meet operator ($\vee$) and a join operator ($\land$), which are defined by: \begin{align*} x \vee y = z \iff z \ge x, z \ge y, (w \ge x, w \ge y \implies w \ge z, \forall w) \\ x \land y = z \iff z \le x, z \le y, (w \le x, w \le y \implies w \le z, \forall w) %x \le y ~&\iff~ x \vee y = y \\ %x \le y ~&\iff~ x \land y = x \end{align*} In the case of Example \ref{example:partial-ordered-construction}, with $x, y \in \CZ$, the meet and join operators correspond with $\max$ and $\min$, respectively. This allows us to simplify the system of inequalities in Example \ref{example:partial-ordered-construction} to the following: \begin{align*} ub(x_A) &= \infty \\ ub(x_B) &= \max(1, \min(ub(x_B), 99) + 1) \\ ub(x_C) &= ub(x_B) \end{align*} The two constraints on the upper bound of $x_B$ can be combined into one constraint by taking the maximum of them. This is because $x \ge \max(a, b) ~\implies~ x \ge a \text{ and } x \ge b$. The Knaster-Tarski Fixpoint Theorem \cite{Knaster-Tarski} permits us to replace our inequalities with equalities, thereby forming a system of equations which can be solved for the least solution to the earlier constraint system. \begin{example} \label{example:both-bounds} In order to determine a lower bound for $x$ at each program point it is sufficient to determine an \emph{upper} bound for $-x$. We again consider the program fragment in Figure \ref{fig:partial-ordered-construction}. We now try to derive an equation system for the \emph{upper} and \emph{lower} bounds of $x$. In order to capture the state of the variables at each program point we consider the abstract state as the two-dimensional vector $(ub(x), -lb(x))$, that is the upper bound of $x$, $ub(x)$, and the \emph{negated} lower bound of $x$, $-lb(x)$. The least fixpoint of our system of equations over $\CZ^2$ then gives us bounds for $ub(x)$ and $lb(x)$. To begin with, let us once again consider the node $A$. At the point of $A$ there is no information as to the bounds of $x$, so our constraint must contain the entirety of $\CZ$. This corresponds to $ub(x) = \infty, lb(x) = -\infty$, which gives us the following inequality: \begin{align*} \left(\begin{array}{c} ub(x_A) \\ -lb(x_A) \end{array}\right) &\ge \left(\begin{array}{c} \infty \\ \infty \end{array}\right) \end{align*} As before, the node $B$ is the first of particular interest to us. With two incoming edges we must have two constraints for this node. The first constraint comes from the edge $A \to B$. This edge does not do much of interest, it ignores the preceding value of $x$ and sets $x$ to $1$. This is captured in the following inequality: \begin{align*} \left(\begin{array}{c} ub(x_B) \\ -lb(x_B) \end{array}\right) &\ge \left(\begin{array}{c} 1 \\ -1 \end{array}\right) \end{align*} The constraint from the edge $B \to B$ is somewhat more complicated. The transformation $x = x + 1$ is represented by the matrix/vector operations of multiplication and addition. Multiplication by the identity matrix indicates that $x$ itself is present in the right-hand side of the transformation, while the addition of the vector \(\left(\begin{array}{c} 1 \\ -1 \end{array}\right) \) captures the effect of the addition on each of the bounds (increasing each of the bounds by one). The $x \le 9$ conditional, which is present on the edge, is captured by the restriction on the growth of the upper bound. \(\min(ub(x_B), 9)\) ensures the upper-bound's growth is limited by the value $9$. It should be noted that there is no bound placed on the lower-bound as the program only bounds $x$'s growth from above on this edge. \begin{align*} \left(\begin{array}{c} ub(x_B) \\ -lb(x_B) \end{array}\right) &\ge \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right] \left(\begin{array}{c} \min(ub(x_B), 99) \\ -lb(x_B) \end{array}\right) + \left(\begin{array}{c} 1 \\ -1 \end{array}\right) \end{align*} The edge $B \to C$ is quite simple as it provides another restriction on the value of $x$ without any further transformation. This restriction is placed on the lower-bound of $x$ only (leaving the upper-bound unchanged) and is captured by the following inequality: \begin{align*} \left(\begin{array}{c} ub(x_C) \\ -lb(x_C) \end{array}\right) &\ge \left(\begin{array}{c} ub(x_B) \\ \min(-lb(x_B), 99) \end{array}\right) \end{align*} This gives us the following inequalities for this program: \begin{align*} \left(\begin{array}{c} ub(x_A) \\ -lb(x_A) \end{array}\right) &\ge \left(\begin{array}{c} \infty \\ \infty \end{array}\right) \\ \left(\begin{array}{c} ub(x_B) \\ -lb(x_B) \end{array}\right) &\ge \left(\begin{array}{c} 1 \\ -1 \end{array}\right) \\ \left(\begin{array}{c} ub(x_B) \\ -lb(x_B) \end{array}\right) &\ge \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right] \left(\begin{array}{c} \min(ub(x_B), 99) \\ -lb(x_B) \end{array}\right) + \left(\begin{array}{c} 1 \\ -1 \end{array}\right) \\ \left(\begin{array}{c} ub(x_C) \\ -lb(x_C) \end{array}\right) &\ge \left(\begin{array}{c} ub(x_B) \\ \min(-lb(x_B), 99) \end{array}\right) \end{align*} By combining these expressions as discussed in Section \ref{subsec:lattice-construction} we can obtain the following system of equations: \begin{align*} \left(\begin{array}{c} ub(x_A) \\ -lb(x_A) \end{array}\right) &= \left(\begin{array}{c} \infty \\ \infty \end{array}\right) \\ \left(\begin{array}{c} ub(x_B) \\ -lb(x_B) \end{array}\right) &= \max\left( \left(\begin{array}{c} 1 \\ -1 \end{array}\right), \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right] \left(\begin{array}{c} \min(ub(x_B), 99) \\ -lb(x_B) \end{array}\right) + \left(\begin{array}{c} 1 \\ -1 \end{array}\right) \right) \\ \left(\begin{array}{c} ub(x_C) \\ -lb(x_C) \end{array}\right) &= \left(\begin{array}{c} ub(x_B) \\ \min(-lb(x_B), 99) \end{array}\right) \end{align*} \end{example} \section{Min-cost flow} The matrix formulation is sufficient to calculate bounds on individual program variables, as in the interval domain, but in doing so there is a significant loss of precision. In order to reclaim some of this precision we use a relational domain, such as zones. We use a ``template constraint matrix'' to specify interesting values for our analysis. A template constraint matrix is an $m \times n$ matrix, where $n$ is the number of program variables and $m$ is the number of interesting abstract values. \begin{example} If we have two program variables $x_1$ and $x_2$ then we can analyse for values $x_1$, $-x_1$, $x_2$, $-x_2$, $(x_1 - x_2)$ and $(x_2 - x_1)$. If we are only interested in the values of $x_1$, $x_2$ and $(x_2 - x_1)$ then we set our template constraint matrix to be \begin{align*} T &= \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{array}\right] \end{align*} \end{example} This template constraint matrix also specifies the topology of a graph. By considering each row as an edge, and each column as a node, we can produce a directed graph $G$, such that for each row an edge exists between the column containing a $1$ and the column containing a $-1$ in that row. In order to guarantee that all rows exist in $G$ it is also necessary to add an extra ``dummy node'' such that any unspecified edges are adjacent to the dummy node. \begin{example} For the template constraint matrix \begin{align*} T &= \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{array}\right] \end{align*} \noindent we construct a directed graph \begin{center} \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=2.5cm] \node (x1)[draw,circle] {$x1$}; \node (x2)[draw,circle,below of=x1] {$x2$}; \node (dummy)[draw,circle,below right of=x1] {dummy}; \path (x1) edge[->,bend left] node {} (dummy) (x2) edge[->,bend left] node {} (x1) (x2) edge[->,bend left] node {} (dummy); \end{tikzpicture}. \end{center} \end{example} Once we have determined the topology of this graph, we then perform abstract interpretation by assigning the nodes ``supplies'' and the edges ``costs''. The supplies correspond to a statement's semantics, capturing the combination of variables which are the result of a given assignment, while the costs consider the incoming values of variables into the statement. We use the min-cost flow problem on this graph as an operator in our equation systems. We fix the topology and the costs in the graph and then consider it as a function of its edge costs, which are the abstract values entering this node. The validity of this approach is considered by Gawlitza et al.~\cite{EasyChair:117}, where it is shown that the min-cost flow problem, when considered as a function of its edge weights is monotonic and quasi-expansive, thereby making it a valid operator in our equation systems.