
Neural networks represented as DAGs of parameterized computational programs
Neural networks can be viewed as algorithms with adjustable parameters, whose optimization is carried out through machine learning. This approach opens the door to a broader analysis, from both a mathematical and a computational perspective. In this article, we will explore how these algorithms are represented using directed acyclic graphs (DAGs), following the ideas presented in Blondel et al. (2024), The Elements of Differentiable Programming.
The representation of neural networks as DAGs has several advantages:
- Visualization: DAGs allow for a more intuitive visualization of network structures.
- Optimization: They facilitate the use of graph-based optimization techniques.
- Automatic differentiation: They ease the implementation of algorithms such as backpropagation.
- Parallelization: DAGs allow for parallel execution of programs on specialized hardware like GPUs and TPUs.
Additionally, we will delve into how DAGs improve both the design and understanding of complex neural networks, and their relationship with advanced deep learning techniques, such as residual and recurrent networks.
Parameterized computational programs
Computation chains
To begin, let's consider simple computational programs defined by a sequence of functions $f_1, \dots, f_K$ that are applied sequentially to an input $x$. We will call these programs computation chains and represent them as the composition $f = f_K \circ f_{K-1} \circ \dots \circ f_1$.
A practical example of a computation chain is image processing, where an image can undergo a series of transformations such as cropping, rotation, and normalization. In the context of neural networks, these transformations are usually parameterized, and their parameters are adjusted through an optimization process during training.
Figure 1. Representation of a computation chain as a sequence of function compositions. Each intermediate node symbolizes a function. The initial arrow indicates the input, and the final one, the output. The internal arrows illustrate the dependencies between functions, connecting previous outputs or the initial input with the functions.
Formally, a computation chain can be written as:

$$
s_0 := x, \qquad s_k := f_k(s_{k-1}) \;\; \text{for } k = 1, \dots, K, \qquad f(x) := s_K, \tag{1}
$$

where $s_0 = x$ is the input, $s_k$ for $k = 1, \dots, K-1$ are the intermediate states of the program, and $s_K$ is the output. It is crucial that the domain (input space) of $f_k$ is compatible with the image (output space) of $f_{k-1}$ for the composition to make sense. That is, $f_k : \mathcal{S}_{k-1} \to \mathcal{S}_k$ for $k = 1, \dots, K$, where $\mathcal{S}_k$ denotes the space in which $s_k$ lives. Equivalently to equation (1), the complete computation chain can be written in a single expression:

$$
f(x) = (f_K \circ f_{K-1} \circ \dots \circ f_1)(x). \tag{2}
$$
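To make this concrete, here is a minimal Python sketch (our own illustration, not from the original text) of a computation chain: a `chain` helper that composes an arbitrary list of functions into $f_K \circ \dots \circ f_1$.

```python
from functools import reduce

def chain(*fs):
    """Compose functions f1, ..., fK into x -> fK(...f2(f1(x))...)."""
    def f(x):
        return reduce(lambda s, fk: fk(s), fs, x)
    return f

# Example: a three-step chain on a scalar input.
f1 = lambda s: 2.0 * s        # scale
f2 = lambda s: s + 1.0        # shift
f3 = lambda s: max(s, 0.0)    # clip negatives (a ReLU-like step)

f = chain(f1, f2, f3)
print(f(-3.0))  # f3(f2(f1(-3.0))) = max(-5.0, 0.0) = 0.0
```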
As we will see next, a computation chain can be represented as a directed acyclic graph (DAG), where the nodes of the graph represent individual functions and the edges represent the dependencies and flow of information between the functions.
Directed acyclic graphs
In a generic computational program, intermediate functions may depend on the outputs of other functions. These dependencies can be represented by a directed acyclic graph (DAG). A directed graph $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E \subseteq V \times V$. An edge $(i, j) \in E$ is an ordered pair of nodes and can be written $i \to j$ to indicate that $j$ depends on $i$. To represent inputs and outputs, it is convenient to also use incoming half-edges $\to j$ and outgoing half-edges $i \to$.
In a graph $G = (V, E)$, the parents of a node $j$ are those nodes $i$ such that $i \to j$, denoted as:

$$
\mathrm{pa}(j) := \{ i \in V : (i, j) \in E \}.
$$
Similarly, the children of a node $i$ are the nodes $j$ such that $i \to j$, represented by:

$$
\mathrm{ch}(i) := \{ j \in V : (i, j) \in E \}.
$$
Nodes without parents are called roots and nodes without children are known as leaves.
A path from $i$ to $j$ in a graph $G$ is a sequence of nodes $(v_1, \dots, v_m)$ such that $v_1 = i$, $v_m = j$, and $(v_k, v_{k+1}) \in E$ for all $k = 1, \dots, m-1$. A path is simple if it contains no repeated nodes. A graph is acyclic if it contains no cycles, that is, there are no simple paths from a node back to itself. A directed acyclic graph (DAG) is a directed graph that lacks cycles.
The edges of a DAG induce a partial order among the nodes, denoted $i \preceq j$ if there exists a path from $i$ to $j$. This partial order is a reflexive, transitive, and antisymmetric relation. It is called "partial" because not all pairs of nodes are related to each other. However, it is always possible to complete it into a total order, called a topological order: any total order $\leq$ compatible with the partial order, i.e., such that $i \leq j$ whenever there is a path from $i$ to $j$.
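As an illustration, the following sketch computes a topological order of a small DAG using Kahn's algorithm; the function name `topological_order` and the edge-list representation are our own choices, not notation from the text.

```python
from collections import deque

def topological_order(num_nodes, edges):
    """Return a topological order of a DAG given as a list of (i, j) edges,
    meaning that j depends on i. Raises ValueError if the graph has a cycle."""
    children = {v: [] for v in range(num_nodes)}
    in_degree = {v: 0 for v in range(num_nodes)}
    for i, j in edges:
        children[i].append(j)
        in_degree[j] += 1

    # Start from the roots (nodes without parents).
    queue = deque(v for v in range(num_nodes) if in_degree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            in_degree[c] -= 1
            if in_degree[c] == 0:
                queue.append(c)

    if len(order) != num_nodes:
        raise ValueError("the graph contains a cycle")
    return order

# Example: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
print(topological_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # e.g. [0, 1, 2, 3]
```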
Computational programs as DAGs
A computational program can be thought of as a function in mathematical terms, which means that the program should return the same values for the same inputs and should not have side effects. Additionally, we assume that the program halts after a finite number of steps. Given this, a program can be decomposed into a finite set of functions and intermediate variables, where the dependencies between these can be represented by a directed acyclic graph (DAG).
Under these conditions, we can make the following assumptions about a computational program:
- There exists an input node and an output node
- Each intermediate function produces a single value
With a finite number of nodes, say $V = \{0, 1, \dots, K\}$, node $0$ is the root, corresponding to the program's input, and node $K$ is a leaf, corresponding to the program's output. According to these assumptions, except for $s_0$, each variable $s_k$ is in bijection with a function $f_k$. Therefore, node $0$ represents the input $s_0 = x$, while nodes $1, \dots, K$ represent the functions $f_1, \dots, f_K$ and their outputs $s_1, \dots, s_K$. That is, each node $k \geq 1$ represents both the function $f_k$ and the output $s_k$.
The edges of the DAG represent the dependencies between nodes. The parents of a node $k$, denoted $\mathrm{pa}(k) = \{i_1, \dots, i_{p_k}\}$, indicate the variables $s_{i_1}, \dots, s_{i_{p_k}}$ that the function $f_k$ needs to calculate its output $s_k$. In other words, the parents indicate which outputs, and therefore which functions, are necessary to compute $s_k = f_k(s_{i_1}, \dots, s_{i_{p_k}})$.
Program Execution
To execute a program, we need to ensure that the intermediate functions are evaluated in the correct order. Therefore, we assume that the nodes $0, 1, \dots, K$ are numbered in a topological order (if the graph contained a cycle, no such order would exist and the program could not be executed). We can then execute the program to evaluate the output $s_K = f(x)$.
Note that we can view $f_k$ as a single-input function of $s_{\mathrm{pa}(k)}$, which is a $p_k$-tuple of elements, or as a multi-input function of $s_{i_1}, \dots, s_{i_{p_k}}$. The two viewpoints are essentially equivalent. The process of executing a program is described in Algorithm 1.
Algorithm 1 (program execution). Given an input $x$ and functions $f_1, \dots, f_K$ with nodes in topological order:
1. Set $s_0 := x$.
2. For $k = 1, \dots, K$: find the parent nodes $\{i_1, \dots, i_{p_k}\} := \mathrm{pa}(k)$ and compute $s_k := f_k(s_{i_1}, \dots, s_{i_{p_k}})$.
3. Return the output $f(x) := s_K$.
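The following Python sketch mirrors Algorithm 1 under the assumptions above (nodes $0, \dots, K$ in topological order, one function and one parent list per node); the helper name `execute_program` and the dictionary-based representation are illustrative choices of ours.

```python
def execute_program(x, functions, parents):
    """Execute a DAG program (a sketch of Algorithm 1).

    functions[k] is the function f_k associated with node k (k = 1, ..., K),
    parents[k] is the list of parent node indices of node k, and the nodes
    are assumed to be numbered in topological order with node 0 as the input.
    """
    K = len(functions)
    s = [None] * (K + 1)
    s[0] = x                                  # s_0 := x
    for k in range(1, K + 1):
        inputs = [s[i] for i in parents[k]]   # gather s_i for i in pa(k)
        s[k] = functions[k](*inputs)          # s_k := f_k(s_{i1}, ..., s_{ip})
    return s[K]                               # output f(x) := s_K

# Example DAG: s1 = x + 1, s2 = 2 * x, s3 = s1 * s2.
functions = {1: lambda x: x + 1.0,
             2: lambda x: 2.0 * x,
             3: lambda a, b: a * b}
parents = {1: [0], 2: [0], 3: [1, 2]}
print(execute_program(3.0, functions, parents))  # (3 + 1) * (2 * 3) = 24.0
```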
Handling Multiple Program Inputs or Outputs
When a program has multiple inputs $x_1, \dots, x_m$, we can always group them into a single input $s_0 = x := (x_1, \dots, x_m)$, since subsequent functions can always filter out the elements of $x$ they need. Similarly, if an intermediate function has multiple outputs, we can always group them into a single output $s_k$, since subsequent functions can always filter out the elements of $s_k$ they need.
Figure 2. Two possible representations of a program. Left: functions are represented as nodes and dependencies as edges. Right: functions and their outputs are represented as two disjoint sets of nodes (a bipartite graph). In both cases, the input is represented by a root node and the output by a leaf node.
Alternative Program Representation: Bipartite Graphs
In our formalism, since a function $f_k$ always has a single output $s_k$, a node can be considered to represent both the function $f_k$ and the output $s_k$. However, it is also possible to consider two disjoint sets of nodes: one for functions and another for outputs, i.e., a bipartite graph. This formalism is similar to factor graphs (Frey et al., 1997; Loeliger, 2004) used in probabilistic modeling, but with directed edges. An advantage of this formalism is that it allows functions to have multiple outputs. For simplicity, we will focus on the representation with a single set of nodes.
Arithmetic Circuits
Arithmetic circuits are one of the simplest examples of computational graphs, originating from computational complexity theory. Formally, an arithmetic circuit over a field $\mathbb{F}$, such as the real numbers $\mathbb{R}$ or the complex numbers $\mathbb{C}$, is a directed acyclic graph (DAG) whose root nodes are elements of the field and whose intermediate functions are arithmetic operations such as addition or multiplication. The latter are usually called arithmetic gates. Contrary to the general case of computational graphs, which have a single root (the input node), an arithmetic circuit can have multiple root nodes. Root nodes can be variables or constants and must belong to the field $\mathbb{F}$. Arithmetic circuits can be used to compute polynomials, and there can be multiple arithmetic circuits representing a given polynomial. To compare two arithmetic circuits that represent the same polynomial, an intuitive notion of complexity is the circuit size, defined below.
The size of an arithmetic circuit is the number of edges in the directed acyclic graph (DAG) representing the circuit. The size of a polynomial $f$ is the size of the smallest arithmetic circuit that represents $f$.
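As a toy illustration (our own, not from the text), the following sketch encodes the polynomial $(x + y)^2$ as a circuit with one addition gate and one multiplication gate, evaluates it, and counts its size as the number of edges; note how reusing the addition node keeps the circuit small.

```python
# Arithmetic circuit for the polynomial p(x, y) = (x + y) * (x + y),
# built from one addition gate and one multiplication gate.
# Each node lists its operation and the nodes it reads from.
circuit = {
    "x":   {"op": "input", "args": []},
    "y":   {"op": "input", "args": []},
    "add": {"op": "+",     "args": ["x", "y"]},
    "mul": {"op": "*",     "args": ["add", "add"]},   # the addition node is reused
}

def evaluate(circuit, output, values):
    """Evaluate the circuit at the given input values (a small recursive sketch)."""
    node = circuit[output]
    if node["op"] == "input":
        return values[output]
    a, b = (evaluate(circuit, arg, values) for arg in node["args"])
    return a + b if node["op"] == "+" else a * b

# The size of the circuit is its number of edges.
size = sum(len(node["args"]) for node in circuit.values())
print(evaluate(circuit, "mul", {"x": 2.0, "y": 3.0}))  # (2 + 3)^2 = 25.0
print(size)  # 4 edges: x->add, y->add, add->mul, add->mul
```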
For more information on arithmetic circuits, refer to the book by Chen et al. 2011.
Feedforward Networks
A feedforward network is a type of computation chain in which parameterized functions are applied sequentially to an input $x$. In this case, each function $f_k$ depends on a parameter vector $w_k \in \mathcal{W}_k$, where $\mathcal{W}_k$ is the parameter space of function $f_k$. Therefore,

$$
s_0 := x, \qquad s_k := f_k(s_{k-1}, w_k) \;\; \text{for } k = 1, \dots, K, \qquad f(x, w) := s_K,
$$

for an input $x \in \mathcal{X}$ and trainable parameters $w := (w_1, \dots, w_K) \in \mathcal{W} := \mathcal{W}_1 \times \dots \times \mathcal{W}_K$. Each function $f_k$ is a layer of the network and each $s_k$ is an intermediate representation of the input $x$. The dimension of $s_k$ is known as the layer dimension (or number of hidden units) of layer $k$. A feedforward network can therefore be defined as a function $f : \mathcal{X} \times \mathcal{W} \to \mathcal{Y}$ that maps an input $x$ and parameters $w$ to an output $f(x, w) = s_K$.
Given a parameterized program of this type, we can learn the parameters $w$ by adjusting them according to a training dataset. For example, given a dataset of pairs $(x_i, y_i)$, $i = 1, \dots, N$, we can minimize the loss function

$$
L(w) := \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i, w), y_i\big),
$$

where $\ell$ measures the discrepancy between the prediction $f(x_i, w)$ and the target $y_i$. The minimization of the loss function can be done using an optimization algorithm such as gradient descent.
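As a minimal illustration of this training procedure, the sketch below fits a one-layer (affine) model to a toy dataset by gradient descent on the average squared loss; the gradient is written out by hand for this simple case rather than obtained by automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of pairs (x_i, y_i) with y approximately equal to 3x - 1.
X = np.linspace(-1.0, 1.0, 50)
Y = 3.0 * X - 1.0 + 0.05 * rng.normal(size=X.shape)

def f(x, w):
    """A one-layer parameterized program: the affine model f(x, w) = w[0] * x + w[1]."""
    return w[0] * x + w[1]

def loss(w):
    """Average squared loss L(w) over the dataset."""
    return np.mean((f(X, w) - Y) ** 2)

def grad_loss(w):
    """Gradient of L(w), written out by hand for this simple model."""
    residual = f(X, w) - Y
    return np.array([np.mean(2.0 * residual * X), np.mean(2.0 * residual)])

# Gradient descent: w <- w - eta * grad L(w).
w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * grad_loss(w)

print(w)        # approximately (3, -1)
print(loss(w))  # approximately the noise level
```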
Multilayer Perceptrons
Combination of Affine Layers and Activation Functions
In the previous section, we did not specify how to parameterize feedforward networks. A typical parameterization is the multilayer perceptron (MLP), which uses fully connected layers (also known as dense layers) of the form

$$
s_k = f_k(s_{k-1}, w_k) := a_k(W_k s_{k-1} + b_k),
$$

where $w_k := (W_k, b_k)$ are the parameters of layer $k$, $W_k$ is a weight matrix, and $b_k$ is a bias vector. Moreover, we can decompose the layer into two functions: the function $s_{k-1} \mapsto W_k s_{k-1} + b_k$ is an affine layer, and the parameter-free nonlinear function $a_k$ is called an activation function.
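Here is a small NumPy sketch of this decomposition: a dense layer as an affine map followed by an activation, and an MLP as a stack of such layers (the identity activation on the last layer is just a common convention assumed here, not a requirement).

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def dense_layer(s, W, b, activation=relu):
    """Fully connected layer: affine map W s + b followed by an activation."""
    return activation(W @ s + b)

def mlp(x, params):
    """Stack dense layers; params is a list of (W_k, b_k) pairs.
    The last layer uses the identity activation (a common convention)."""
    s = x
    for k, (W, b) in enumerate(params):
        is_last = (k == len(params) - 1)
        s = dense_layer(s, W, b, activation=(lambda u: u) if is_last else relu)
    return s

rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 3)), np.zeros(8)),   # layer 1: R^3 -> R^8
          (rng.normal(size=(2, 8)), np.zeros(2))]   # layer 2: R^8 -> R^2
print(mlp(rng.normal(size=3), params))
```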
More generally, we can replace the matrix-vector product $W_k s_{k-1}$ with any linear function of $s_{k-1}$. For example, convolutional layers use the convolution of the input $s_{k-1}$ with some filters $W_k$.
Sometimes it is necessary to deal with networks with multiple inputs. For example, suppose we want to design a function $g(x, z)$, where $x$ and $z$ are two inputs. A simple way to do this is to use the concatenation $[x; z]$ as input to a network $f$, i.e., $g(x, z) := f([x; z], w)$. Alternatively, instead of concatenating the inputs directly, they can be concatenated after each has passed through one or more hidden layers.
Relationship with Generalized Linear Models
When the depth of the network is $K = 1$ (i.e., a single layer) and the activation function $a_1$ is the identity, the output of an MLP is an affine (linear) function of the input $x$:

$$
f(x, w) = W_1 x + b_1.
$$

Composed with a suitable output activation $a_1$, this is called a generalized linear model (GLM). Generalized linear models are therefore a special case of MLPs. In particular, when $a_1$ is the logistic (sigmoid) function, we have a logistic regression model. More generally, for depth $K$, the output of an MLP is:

$$
f(x, w) = a_K(W_K s_{K-1} + b_K).
$$

This can be seen as a GLM over the intermediate representation $s_{K-1}$ of the input $x$. This is the main attraction of MLPs: the ability to learn intermediate representations that are useful for the task of interest. We will see that MLPs can also be used as subcomponents in other architectures.
Activation Functions
As mentioned earlier, feedforward networks use activation functions for each layer. In this section, we will describe some of the most common scalar-to-scalar or vector-to-scalar activation functions. We will also present some probability functions that can be used as activation functions.
Nonlinear Scalar-to-Scalar Activation Functions
The ReLU (Rectified Linear Unit) is a nonlinear activation function defined as the non-negative part of its argument. That is, the ReLU is the identity for non-negative values and zero for negative values:

$$
\mathrm{relu}(u) := \max(u, 0) = \begin{cases} u & \text{if } u \geq 0 \\ 0 & \text{otherwise.} \end{cases}
$$

It is a piecewise linear function with a kink (a point of non-differentiability) at $u = 0$. A multilayer perceptron with ReLU activations is called a rectifier neural network. The layers take the form:

$$
s_k = \mathrm{relu}(W_k s_{k-1} + b_k),
$$

where the ReLU is applied element-wise. The ReLU can be replaced with a smoothed version, known as softplus:

$$
\mathrm{softplus}(u) := \log(1 + e^u).
$$
Unlike ReLU, softplus is a smooth, differentiable, and strictly positive function.
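A quick NumPy sketch of both activations, purely for illustration:

```python
import numpy as np

def relu(u):
    """Non-negative part of u: identity for u >= 0 and zero otherwise."""
    return np.maximum(u, 0.0)

def softplus(u):
    """Smooth approximation of the ReLU: log(1 + exp(u))."""
    return np.log1p(np.exp(u))

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(u))      # [0.  0.  0.  0.5 2. ]
print(softplus(u))  # strictly positive, and close to relu(u) away from 0
```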
Nonlinear Vector-to-Scalar Activation Functions
It is often useful to reduce vectors to scalars. These scalar values can be seen as scores or probabilities, representing the importance of an input vector. A common way of doing this is to take the maximum value, also known as max-pooling. Given a vector $u \in \mathbb{R}^D$, the max-pooling function is:

$$
\mathrm{maxpool}(u) := \max_{j \in \{1, \dots, D\}} u_j.
$$

As a smooth approximation of max-pooling, the log-sum-exp function can be used, which behaves like a smoothed version of the maximum:

$$
\mathrm{logsumexp}(u) := \log \sum_{j=1}^{D} e^{u_j}.
$$

The log-sum-exp function can be seen as a generalization of the softplus function to vectors. In fact, for all $u \in \mathbb{R}$:

$$
\mathrm{logsumexp}((u, 0)) = \log(e^u + e^0) = \mathrm{softplus}(u).
$$

A numerically stable implementation of the log-sum-exp function is based on the identity

$$
\mathrm{logsumexp}(u) = \mathrm{logsumexp}(u - c \mathbf{1}) + c,
$$

where $\mathbf{1}$ is a vector of ones and $c := \max_{j \in \{1, \dots, D\}} u_j$.
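In code, the stabilized version simply subtracts the maximum before exponentiating; a minimal sketch:

```python
import numpy as np

def logsumexp(u):
    """Numerically stable log-sum-exp: subtract the maximum before exponentiating."""
    c = np.max(u)
    return c + np.log(np.sum(np.exp(u - c)))

u = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(u))))  # the naive formula overflows and returns inf
print(logsumexp(u))               # stable result, approximately 1002.41
print(np.max(u))                  # logsumexp(u) is slightly above the maximum
```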
Scalar-to-Scalar Probability Mappings
Sometimes we want to map a real number to a number between 0 and 1, which can represent the probability of an event. For this purpose, sigmoids are frequently used. A sigmoid is a function whose curve is characterized by an "S" shape. These functions are used to compress real values into an interval. An example is the binary step function, also known as the Heaviside step function:

$$
\mathrm{step}(u) := \begin{cases} 1 & \text{if } u \geq 0 \\ 0 & \text{otherwise.} \end{cases}
$$

It is a mapping from $\mathbb{R}$ to $\{0, 1\}$. Unfortunately, the binary step function is discontinuous at $u = 0$. Moreover, because the function is constant at all other points, it has zero derivative there, which makes it difficult to use as part of a neural network trained with backpropagation.

A better sigmoid is the logistic function, which maps $\mathbb{R}$ to $(0, 1)$ and is defined as:

$$
\sigma(u) := \frac{1}{1 + e^{-u}} = \frac{e^u}{1 + e^u}.
$$

It maps $-\infty$ to $0$ and $+\infty$ to $1$, and satisfies $\sigma(0) = 1/2$. Therefore, its output can be interpreted as a probability. The logistic function can be viewed as a smoothed version of the step function. Furthermore, the logistic function can be obtained as the derivative of the softplus function, that is, for all $u \in \mathbb{R}$:

$$
\sigma(u) = \mathrm{softplus}'(u).
$$

Two important properties of the logistic function are that, for all $u \in \mathbb{R}$:

$$
\sigma(-u) = 1 - \sigma(u)
$$

and

$$
\sigma'(u) = \sigma(u)\,(1 - \sigma(u)) = \sigma(u)\,\sigma(-u).
$$
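The following sketch implements the logistic function (exponentiating only non-positive numbers for numerical stability, an implementation detail not discussed above) and checks both properties numerically:

```python
import numpy as np

def sigmoid(u):
    """Logistic function; exponentiates only non-positive numbers to avoid overflow."""
    e = np.exp(-np.abs(u))
    return np.where(u >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

u = np.linspace(-5.0, 5.0, 11)
p = sigmoid(u)
print(np.allclose(sigmoid(-u), 1.0 - p))               # sigma(-u) = 1 - sigma(u)

# Check sigma'(u) = sigma(u) * (1 - sigma(u)) against a finite-difference derivative.
eps = 1e-6
numeric = (sigmoid(u + eps) - sigmoid(u - eps)) / (2 * eps)
print(np.allclose(numeric, p * (1.0 - p), atol=1e-6))  # True
```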
Vector-to-Vector Probability Mappings
Sometimes we want to map a vector to a vector of probabilities. This is a mapping from $\mathbb{R}^D$ to the probability simplex, defined by:

$$
\triangle^D := \Big\{ p \in \mathbb{R}^D : \; p_j \geq 0 \;\; \forall j, \;\; \sum_{j=1}^{D} p_j = 1 \Big\}.
$$
Two examples of vector-to-vector probability mappings are the argmax function and the softargmax function. The argmax function maps a vector $u \in \mathbb{R}^D$ to a probability vector that is zero in all entries except for the entry with the maximum value, which is set to one. In effect,

$$
\mathrm{argmax}(u) := \phi\Big(\arg\max_{j \in \{1, \dots, D\}} u_j\Big),
$$

where $\phi(j)$ denotes the one-hot encoding of an integer $j \in \{1, \dots, D\}$, that is, the vector whose entries are $[\phi(j)]_i = 1$ if $i = j$ and zero otherwise, i.e.,

$$
\phi(j) := (0, \dots, 0, \underbrace{1}_{\text{position } j}, 0, \dots, 0) \in \{0, 1\}^D.
$$

This mapping places all the probability mass on the entry with the maximum value; in case of a tie, it selects the first entry with the maximum value. Unfortunately, the argmax function is not differentiable, which makes it difficult to use with backpropagation.
A smoothed version of the argmax is the softargmax, defined as:

$$
\mathrm{softargmax}(u) := \frac{\exp(u)}{\sum_{j=1}^{D} \exp(u_j)}.
$$

This operator is commonly known in the literature as the softmax, but this naming is misleading: the operator actually defines a smoothed version of the argmax. The output of the softargmax belongs to the relative interior of the probability simplex $\triangle^D$, which means it can never reach the boundary of the simplex. If we denote $p := \mathrm{softargmax}(u)$, then $p_j > 0$ for all $j \in \{1, \dots, D\}$. The softargmax is the gradient of the log-sum-exp function, that is, for all $u \in \mathbb{R}^D$:

$$
\mathrm{softargmax}(u) = \nabla\, \mathrm{logsumexp}(u).
$$

The softargmax can be seen as a generalization of the logistic function to vectors; in effect:

$$
\mathrm{softargmax}((u, 0)) = (\sigma(u), 1 - \sigma(u)).
$$
The softargmax satisfies the property that, for all $u \in \mathbb{R}^D$ and for all $c \in \mathbb{R}$,

$$
\mathrm{softargmax}(u + c\mathbf{1}) = \mathrm{softargmax}(u).
$$

This means that the softargmax has $D - 1$ effective degrees of freedom and is not invertible. However, due to the above property, without loss of generality we can impose that $\sum_{j=1}^{D} u_j = 0$ (if this is not the case, we can simply replace $u$ with $u - \bar{u}\mathbf{1}$, where $\bar{u}$ is the mean of $u$). With this restriction, together with the fact that

$$
\log p_j = u_j - \mathrm{logsumexp}(u),
$$

we obtain

$$
u_j = \log p_j - \frac{1}{D} \sum_{j'=1}^{D} \log p_{j'}.
$$

Therefore, the softargmax is invertible in the sense that we can recover $u$ from $p = \mathrm{softargmax}(u)$.
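A short sketch verifying these properties numerically: shift invariance, and recovery of $u$ from $p$ under the zero-mean convention (the stabilized implementation below relies on the same shift invariance):

```python
import numpy as np

def softargmax(u):
    """Often called 'softmax' in the literature: exp(u_j) / sum_j' exp(u_j')."""
    e = np.exp(u - np.max(u))   # shifting by the max is allowed by shift invariance
    return e / np.sum(e)

u = np.array([0.5, -1.0, 2.0, 0.0])
p = softargmax(u)
print(p, np.sum(p))                          # strictly positive entries summing to 1
print(np.allclose(softargmax(u + 7.0), p))   # invariant to adding a constant

# Invert the softargmax under the zero-mean convention on u.
u0 = u - np.mean(u)                          # impose that the entries of u sum to 0
u_rec = np.log(p) - np.mean(np.log(p))       # log p_j - mean_j'(log p_j')
print(np.allclose(u_rec, u0))                # True: u is recovered from p
```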
Residual Neural Networks
As mentioned earlier, feedforward networks are a type of computation chain in which the intermediate functions are parameterized functions applied sequentially to an input. Indeed, consider a feedforward network with $K + 1$ layers $f_1, \dots, f_{K+1}$. Whenever $f_{K+1}$ is the identity function, the set of functions that this network can represent is the same as that of the $K$-layer network $f_1, \dots, f_K$. Therefore, we can consider a $K$-layer network as a special case of a $(K+1)$-layer network. In other words, depth, in theory, should not harm the expressive power of feedforward networks. Unfortunately, the assumption that a layer can easily learn the identity function is not realistic in practice. This means that deeper networks can sometimes be more difficult to train than shallower ones, causing accuracy to saturate or degrade as a function of depth.
The key idea of residual neural networks (ResNets) (He et al., 2016) is to introduce residual blocks between layers, such that the output of one layer becomes the input to a later layer. Formally, a residual block is a function of the form

$$
f(s, w) := s + h(s, w).
$$

The function $h$ is the residual function, as it models the difference between the output and the input, i.e., $h(s, w) = f(s, w) - s$. It can be seen as a correction that is added to the input $s$ to obtain the output $f(s, w)$. The addition of $s$ is known as a skip connection or shortcut. As long as it is possible to adjust $w$ so that $h(s, w) = 0$, the function $f(\cdot, w)$ becomes the identity function. For example, if we use:

$$
h(s, w) := W_2\, \mathrm{relu}(W_1 s + b_1) + b_2,
$$

where $w := (W_1, b_1, W_2, b_2)$ are the parameters of the block, then it suffices to set $W_2 = 0$ and $b_2 = 0$ so that $h(s, w) = 0$. Residual blocks are known to help remedy the vanishing gradient problem.
In many articles and software packages, an additional activation $a$ is included and the residual block is instead defined as:

$$
f(s, w) := a\big(s + h(s, w)\big),
$$

where $a$ is usually the ReLU function. Whether to include this additional activation or not is essentially a modeling decision. In practice, residual blocks can also include additional operations such as batch normalization and convolutional layers.
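A minimal sketch of a residual block with a two-layer residual function $h$, assuming ReLU activations; setting $W_2$ and $b_2$ to zero indeed recovers the identity:

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def residual_block(s, W1, b1, W2, b2, post_activation=None):
    """Residual block s + h(s, w), with h a small two-layer network.
    If post_activation is given, it is applied after the skip connection."""
    h = W2 @ relu(W1 @ s + b1) + b2            # residual (correction) term h(s, w)
    out = s + h                                # skip connection
    return out if post_activation is None else post_activation(out)

rng = np.random.default_rng(0)
d = 4
s = rng.normal(size=d)
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)

# With W2 = 0 and b2 = 0, the residual term vanishes and the block is the identity.
print(np.allclose(residual_block(s, W1, b1, np.zeros((d, d)), np.zeros(d)), s))  # True
```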
Recurrent Neural Networks
Recurrent neural networks (RNNs) (Rumelhart et al., 1986) are a type of neural network that operate on sequences of vectors, either as input, output, or both. Their actual parameterization depends on the configuration, but the central idea is to maintain a state vector that is updated step by step through a recursive function that uses shared parameters across steps. We distinguish between the following configurations:
- Vector to sequence (one to many): $f : \mathbb{R}^D \times \mathcal{W} \to \mathbb{R}^{L \times M}$
- Sequence to vector (many to one): $f : \mathbb{R}^{L \times D} \times \mathcal{W} \to \mathbb{R}^{M}$
- Sequence to sequence (many to many, aligned): $f : \mathbb{R}^{L \times D} \times \mathcal{W} \to \mathbb{R}^{L \times M}$
- Sequence to sequence (many to many, unaligned): $f : \mathbb{R}^{L \times D} \times \mathcal{W} \to \mathbb{R}^{L' \times M}$
where $L$ is the length of the input sequence and $L'$ is the length of the output sequence. Note that we use the same parameter space $\mathcal{W}$ for all configurations, but this is of course not necessary. Throughout this post, we will use the notation $x_{1:L} := (x_1, \dots, x_L)$ to denote a sequence of $L$ vectors.
Vector to Sequence
In this configuration, a decoder function $f : \mathbb{R}^D \times \mathcal{W} \to \mathbb{R}^{L \times M}$ maps a vector $x \in \mathbb{R}^D$ and a parameter vector $w \in \mathcal{W}$ to a sequence of vectors $p_{1:L} := f(x, w)$.
Figure 3. One to many (decoding). A vector $x$ is mapped to a sequence of vectors $p_{1:L}$.
This is useful for image captioning, where a sentence (a sequence of words) is generated from an image (a vector of pixels). Formally, we can define $f$ through the recursion:

$$
\begin{aligned}
s_t &:= g(x, s_{t-1}, w_g), \quad t = 1, \dots, L \\
p_t &:= h(s_t, w_h), \quad t = 1, \dots, L,
\end{aligned}
$$

where $w := (w_g, w_h)$ are the network parameters and $s_0$ is the initial state. The goal of $g$ is to update the state $s_t$ based on the input $x$ and the previous state $s_{t-1}$, while the goal of $h$ is to map the state $s_t$ to an output vector $p_t$. It's important to note that the parameters of $g$ and $h$ are shared across time steps. Typically, $g$ and $h$ are parameterized using single hidden layer MLPs. Note that $g$ has multiple inputs.
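A minimal NumPy sketch of this configuration, assuming a simple $\tanh$ state update for $g$ and an affine map for $h$ (these particular choices are ours, not prescribed by the text):

```python
import numpy as np

def decode(x, params, length):
    """Vector-to-sequence RNN: map a single vector x to a sequence p_1, ..., p_L."""
    Wx, Ws, b, Wh, bh = params
    s = np.zeros(Ws.shape[0])                  # initial state s_0
    outputs = []
    for _ in range(length):
        s = np.tanh(Wx @ x + Ws @ s + b)       # s_t = g(x, s_{t-1}, w_g)
        outputs.append(Wh @ s + bh)            # p_t = h(s_t, w_h), here an affine map
    return outputs

rng = np.random.default_rng(0)
D, S, M = 3, 5, 2                              # input, state, and output dimensions
params = (rng.normal(size=(S, D)), rng.normal(size=(S, S)), np.zeros(S),
          rng.normal(size=(M, S)), np.zeros(M))
print(decode(rng.normal(size=D), params, length=4))   # a sequence of 4 output vectors
```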
Sequence to Vector
In this configuration, an encoder function $f : \mathbb{R}^{L \times D} \times \mathcal{W} \to \mathbb{R}^{M}$ is defined that maps a sequence of vectors $x_{1:L}$ and a parameter vector $w \in \mathcal{W}$ to a vector $p := f(x_{1:L}, w)$.
Figure 4. Many to one (encoding). A sequence of vectors $x_{1:L}$ is mapped to a vector $p$.
This is useful for sequence classification, where the entire sequence is classified into a single category, for example in sentiment analysis. Formally, we can define $f$ through the recursion:

$$
\begin{aligned}
s_t &:= g(x_t, s_{t-1}, w_g), \quad t = 1, \dots, L \\
p &:= \mathrm{pooling}(s_{1:L}),
\end{aligned}
$$

where $w := w_g$ are the network parameters, $s_0$ is the initial state, and $g$ now updates the encoder state based on the sequence element $x_t$ rather than on previous predictions. The pooling function typically has no parameters and is used to reduce the sequence of states $s_{1:L}$ to a single vector. Common examples of pooling functions are the mean, the sum, and max-pooling.
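A matching sketch of the encoder, again assuming a $\tanh$ state update and, here, mean pooling over the states:

```python
import numpy as np

def encode(xs, params):
    """Sequence-to-vector RNN: update a state over the sequence, then pool the states."""
    Wx, Ws, b = params
    s = np.zeros(Ws.shape[0])                  # initial state s_0
    states = []
    for x_t in xs:
        s = np.tanh(Wx @ x_t + Ws @ s + b)     # s_t = g(x_t, s_{t-1}, w_g)
        states.append(s)
    return np.mean(states, axis=0)             # pooling: here, the mean of the states

rng = np.random.default_rng(0)
D, S = 3, 5                                    # input and state dimensions
params = (rng.normal(size=(S, D)), rng.normal(size=(S, S)), np.zeros(S))
xs = [rng.normal(size=D) for _ in range(6)]    # an input sequence of length L = 6
print(encode(xs, params))                      # a single vector in R^S
```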
Sequence to Sequence (Aligned)
In this configuration, a function $f : \mathbb{R}^{L \times D} \times \mathcal{W} \to \mathbb{R}^{L \times M}$ is defined that maps a sequence of vectors $x_{1:L}$ and a parameter vector $w \in \mathcal{W}$ to a sequence of vectors $p_{1:L}$. The lengths of the input sequence and the output sequence are equal.
Figure 5. Many to many (aligned). A sequence of vectors $x_{1:L}$ is mapped to a sequence of vectors $p_{1:L}$.
An example application is POS tagging, where a tag is assigned to each word in a sentence. Formally, this can be defined as $p_{1:L} := f(x_{1:L}, w)$ through the recursion:

$$
\begin{aligned}
s_t &:= g(x_t, s_{t-1}, w_g), \quad t = 1, \dots, L \\
p_t &:= h(s_t, w_h), \quad t = 1, \dots, L,
\end{aligned}
$$

where $w := (w_g, w_h)$ are the network parameters, $s_0$ is the initial state, and $g$ and $h$ are functions similar to those defined in the one-to-many configuration. The function $g$ updates the state $s_t$ based on the current input $x_t$ and the previous state $s_{t-1}$, while the function $h$ maps the state $s_t$ to an output vector $p_t$.
Sequence to Sequence (Unaligned)
In this configuration, a function $f : \mathbb{R}^{L \times D} \times \mathcal{W} \to \mathbb{R}^{L' \times M}$ is defined that maps a sequence of vectors $x_{1:L}$ and a parameter vector $w \in \mathcal{W}$ to a sequence of vectors $p_{1:L'}$.
Figure 6. Many to many (unaligned). A sequence of vectors $x_{1:L}$ is mapped to a sequence of vectors $p_{1:L'}$.
The lengths of the input sequence and the output sequence are not necessarily equal. An example application is machine translation, where a sentence in one language is translated into a sentence in another language. Formally, this can be defined as $p_{1:L'} := f(x_{1:L}, w)$ through the composition

$$
c := f_{\mathrm{enc}}(x_{1:L}, w_{\mathrm{enc}}), \qquad p_{1:L'} := f_{\mathrm{dec}}(c, w_{\mathrm{dec}}),
$$

where $w := (w_{\mathrm{enc}}, w_{\mathrm{dec}})$ are the network parameters, $f_{\mathrm{enc}}$ is an encoder function that maps the input sequence $x_{1:L}$ to a context vector $c$, and $f_{\mathrm{dec}}$ is a decoder function that maps the context vector to the output sequence $p_{1:L'}$. The encoding and decoding functions can be implemented as the sequence-to-vector and vector-to-sequence RNNs described above. The detailed steps would then be:

$$
\begin{aligned}
s_t &:= g_{\mathrm{enc}}(x_t, s_{t-1}, w_{\mathrm{enc}}), \quad t = 1, \dots, L \\
c &:= \mathrm{pooling}(s_{1:L}) \\
z_t &:= g_{\mathrm{dec}}(c, z_{t-1}, w_g), \quad t = 1, \dots, L' \\
p_t &:= h(z_t, w_h), \quad t = 1, \dots, L',
\end{aligned}
$$

with $w_{\mathrm{dec}} := (w_g, w_h)$.
This architecture is aptly called an encoder-decoder architecture. Note that we denote the length of the target sequence as $L'$ instead of $L$ to avoid confusion. However, in practice, the length $L'$ may depend on the input and is often not known in advance. To address this issue, the vocabulary (of size $M$ in our notation) is typically augmented with an "end of sequence" (EOS) token, so that at inference time we know when to stop generating the output sequence. A disadvantage of this architecture is that all information about the input sequence is contained in the context vector $c$, which can therefore become a bottleneck.
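Putting the two previous configurations together, here is a compact and purely illustrative encoder-decoder sketch; the helper names `rnn_step`, `encoder`, and `decoder`, the mean pooling, and the argmax-based EOS stopping rule are all our own simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)
D, S, M, EOS = 3, 5, 4, 0                    # dimensions; index 0 plays the role of an EOS token

def rnn_step(inp, s_prev, Wi, Ws, b):
    return np.tanh(Wi @ inp + Ws @ s_prev + b)

def encoder(xs, w):
    """Sequence -> context vector c, via a recurrent state and mean pooling."""
    Wi, Ws, b = w
    s, states = np.zeros(S), []
    for x_t in xs:
        s = rnn_step(x_t, s, Wi, Ws, b)
        states.append(s)
    return np.mean(states, axis=0)

def decoder(c, w, max_len=10):
    """Context vector -> output sequence; stops early if the EOS index is predicted."""
    Wi, Ws, b, Wh, bh = w
    s, outputs = np.zeros(S), []
    for _ in range(max_len):
        s = rnn_step(c, s, Wi, Ws, b)
        scores = Wh @ s + bh                 # unnormalized scores over a vocabulary of size M
        outputs.append(scores)
        if int(np.argmax(scores)) == EOS:    # end-of-sequence token: stop generating
            break
    return outputs

w_enc = (rng.normal(size=(S, D)), rng.normal(size=(S, S)), np.zeros(S))
w_dec = (rng.normal(size=(S, S)), rng.normal(size=(S, S)), np.zeros(S),
         rng.normal(size=(M, S)), np.zeros(M))
xs = [rng.normal(size=D) for _ in range(5)]
print(len(decoder(encoder(xs, w_enc), w_dec)))   # length L' of the generated sequence
```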
Conclusions
Neural networks, conceived as parameterized programs, provide a powerful and flexible framework for addressing a wide range of problems in machine learning and data processing. Throughout this article, we have explored various architectures and configurations, from feedforward networks to recurrent neural networks and residual networks. The representation of these architectures as directed acyclic graphs (DAGs) not only facilitates their analysis and optimization but also allows for the efficient application of techniques such as automatic differentiation. The different configurations discussed demonstrate the versatility of these models for tasks ranging from image captioning to machine translation.
This approach of viewing neural networks as parameterized programs not only allows for designing more efficient and effective models but also opens pathways for improving the interpretability and explainability of these models, an area of growing importance in artificial intelligence. Although we have covered several types of neural networks, it's important to note that the field is constantly evolving, with new architectures and techniques being developed continuously. In summary, this perspective provides a solid foundation for the future development of more advanced, interpretable, and adaptable machine learning models for a wide range of real-world applications.
Finally, if there are any errors, omissions, or inaccuracies in this article, please don't hesitate to contact us through the following Discord channel: Math & Code.