Neural networks represented as DAGs of parameterized computational programs

Neural networks can be viewed as algorithms with adjustable parameters, whose optimization is carried out through machine learning. This viewpoint opens the door to a broader analysis, both from a mathematical and a computational perspective. In this article, we explore how these algorithms are represented using directed acyclic graphs (DAGs), following the ideas presented in Blondel et al. (2024), Elements of Differentiable Programming.

The representation of neural networks as DAGs has several advantages:

  1. Visualization: DAGs allow for a more intuitive visualization of network structures.
  2. Optimization: They facilitate the use of graph-based optimization techniques.
  3. Automatic differentiation: They ease the implementation of algorithms such as backpropagation.
  4. Parallelization: DAGs allow for parallel execution of programs on specialized hardware like GPUs and TPUs.

Additionally, we will delve into how DAGs improve both the design and understanding of complex neural networks, and their relationship with advanced deep learning techniques, such as residual and recurrent networks.

Parameterized computational programs

Computation chains

To begin, let's consider simple computational programs defined by a sequence of functions $f_1, \ldots, f_n$ that are applied sequentially to an input $s_0 \in S_0$. We will call these programs computation chains and represent them as $f = f_n \circ \cdots \circ f_1$.

A practical example of a computation chain is image processing, where an image can undergo a series of transformations such as cropping, rotation, and normalization. In the context of neural networks, these transformations are usually parameterized, and their parameters are adjusted through an optimization process during training.


Figure 1. Representation of a computation chain as a sequence of function compositions. Each intermediate node symbolizes a function. The initial arrow indicates the input, and the final one, the output. The internal arrows illustrate the dependencies between functions, connecting previous outputs or the initial input with the functions.

Formally, a computation chain can be written as:

\begin{equation} \begin{aligned} s_0 &\in S_{0}, \\ s_{1} &= f_{1}(s_{0}) \in S_{1}, \\ s_{2} &= f_{2}(s_{1}) \in S_{2}, \\ &\vdots \\ s_{n} &= f_{n}(s_{n-1}) = f(s_0) \in S_{n}.\\ \end{aligned} \end{equation}

Here $s_0$ is the input, $s_k \in S_k$ for $k = 1, \ldots, n-1$ are the intermediate states of the program, and $s_n \in S_n$ is the output. It is crucial that the domain (input space) of $f_k$ is compatible with the image (output space) of $f_{k-1}$ for the composition to make sense; that is, $f_k: S_{k-1} \to S_k$ for $k = 1, \ldots, n$. Equivalently to equation (1), the complete computation chain can be written as a single expression:

\begin{equation} s_n = f(s_0) = f_n \circ \cdots \circ f_1(s_0). \end{equation}
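
As a concrete illustration, here is a minimal Python sketch of a computation chain, assuming three arbitrary transformations $f_1, f_2, f_3$ that are not from the article:

```python
from functools import reduce

def chain(*fs):
    """Compose functions f1, ..., fn into f = fn ∘ ... ∘ f1."""
    return lambda s0: reduce(lambda s, f: f(s), fs, s0)

# Hypothetical intermediate functions f1, f2, f3.
f1 = lambda s: s + 1.0
f2 = lambda s: 2.0 * s
f3 = lambda s: s ** 2

f = chain(f1, f2, f3)   # f = f3 ∘ f2 ∘ f1
print(f(3.0))           # f3(f2(f1(3.0))) = ((3 + 1) * 2) ** 2 = 64.0
```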

As we will see next, a computation chain can be represented as a directed acyclic graph (DAG), where the nodes of the graph represent individual functions and the edges represent the dependencies and flow of information between the functions.

Directed acyclic graphs

In a generic computational program, intermediate functions may depend on the outputs of other functions. These dependencies can be represented by a directed acyclic graph (DAG). A directed graph $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E \subseteq V \times V$. An edge $(u, v) \in E$ is an ordered pair of nodes $u, v \in V$, and can be written $u \to v$ to indicate that $v$ depends on $u$. To represent inputs and outputs, it is convenient to use incoming half-edges $\to j$ and outgoing half-edges $i \to$.

In a graph $G = (V, E)$, the parents of a node $v \in V$ are those nodes $u \in V$ such that $(u, v) \in E$, denoted as:

\begin{equation} \mathcal{P}(v) = \{u\in V: (u, v)\in E\}. \end{equation}

Similarly, the children of a node $v \in V$ are the nodes $w \in V$ such that $(v, w) \in E$, represented by:

\begin{equation} \mathcal{C}(v) = \{w\in V: (v, w)\in E\}. \end{equation}

Nodes without parents are called roots and nodes without children are known as leaves.

A path from $i$ to $j$ in a graph $G = (V, E)$ is a sequence of nodes $v_1, \ldots, v_k$ such that $i \to v_1 \to \cdots \to v_k \to j$. A path is simple if it contains no repeated nodes. A graph is acyclic if it contains no cycles, that is, there are no simple paths from a node to itself. A directed acyclic graph (DAG) is a directed graph without cycles.

The edges of a DAG induce a partial order among the nodes, denoted $i \preceq j$ if there exists a path from $i$ to $j$. This partial order is a reflexive, transitive, and antisymmetric relation. It is called "partial" because not all pairs of nodes are comparable. However, it can always be extended to a total order, called a topological order: any total order in which $i$ comes before $j$ whenever there is a path from $i$ to $j$.
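
To make these notions concrete, here is a small Python sketch, assuming a hypothetical DAG given as a list of edges, that builds the parent and child sets and computes one topological order via depth-first search:

```python
from collections import defaultdict

# Hypothetical DAG: edge (u, v) means v depends on u.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

parents, children = defaultdict(set), defaultdict(set)
for u, v in edges:
    parents[v].add(u)
    children[u].add(v)

def topological_order(nodes, children):
    """Return the nodes so that every node appears after all of its parents."""
    order, visited = [], set()
    def visit(v):
        if v in visited:
            return
        visited.add(v)
        for w in children[v]:
            visit(w)
        order.append(v)          # post-order: v is added after its descendants
    for v in nodes:
        visit(v)
    return order[::-1]           # reverse post-order is a topological order

print(topological_order([0, 1, 2, 3], children))  # e.g. [0, 2, 1, 3]
```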

Computational programs as DAGs

A computational program can be thought of as a function in mathematical terms, which means that the program should return the same values for the same inputs and should not have side effects. Additionally, we assume that the program halts after a finite number of steps. Given this, a program can be decomposed into a finite set of functions and intermediate variables, where the dependencies between these can be represented by a directed acyclic graph (DAG).

Under these conditions, we can make the following assumptions about a computational program:

  1. There exists an input node $s_0$ and an output node $s_n$.
  2. Each intermediate function $f_k$ produces a single value $s_k \in S_k$.

We number the nodes $V = \{0, 1, \ldots, n\}$, where node $0$ is the root, corresponding to the program's input, and node $n$ is a leaf, corresponding to the program's output. Under these assumptions, except for $s_0$, each variable $s_k$ is in bijection with a function $f_k$. Therefore, node $0$ represents the input $s_0$, while nodes $1, \ldots, n$ represent both the functions $f_1, \ldots, f_n$ and the outputs $s_1, \ldots, s_n$. That is, each node $k$ represents both the function $f_k$ and its output $s_k$.

The edges of the DAG represent the dependencies between nodes. The parents of a node $k$, denoted $\{i_1, \ldots, i_{p_k}\} := \mathcal{P}(k)$ with $p_k := |\mathcal{P}(k)|$, indicate the variables $s_{\mathcal{P}(k)} := (s_{i_1}, \ldots, s_{i_{p_k}})$ that the function $f_k$ needs to compute its output $s_k$. In other words, the parents $\{i_1, \ldots, i_{p_k}\}$ indicate which functions $f_{i_1}, \ldots, f_{i_{p_k}}$ must be evaluated before $f_k$.

Algorithm 1. Program execution

Functions: $f_1, \ldots, f_n$ in topological order

Input: $s_0 \in S_0$

1. For $k = 1, \ldots, n$ do:
2. Find the parent nodes $\{i_1, \ldots, i_{p_k}\} := \mathcal{P}(k)$
3. Compute $s_k := f_k\big(s_{\mathcal{P}(k)}\big) := f_k(s_{i_1}, \ldots, s_{i_{p_k}})$

End: $f(s_0) = s_n$

Program Execution

To execute a program, we need to ensure that the intermediate functions are evaluated in the correct order. Therefore, we assume that the nodes $V = \{0, 1, \ldots, n\}$ are sorted in a topological order (otherwise, some function would be evaluated before its inputs are available). The program is then executed by computing, for each $k \in [n]$ in this order,

\begin{equation} s_k := f_k\big(s_{\mathcal{P}(k)}\big)\in S_k. \end{equation}

Note that we can view $f_k$ as a single-input function of $s_{\mathcal{P}(k)}$, a tuple of $p_k$ elements, or as a multi-input function of $s_{i_1}, \ldots, s_{i_{p_k}}$. The two viewpoints are essentially equivalent. The process of executing a program is described in Algorithm 1.
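
The following Python sketch mirrors Algorithm 1 under the assumptions above: nodes $0, \ldots, n$ are already in topological order, node $0$ holds the input, and each function receives the values of its parents. The functions and graph below are hypothetical:

```python
def execute_program(funcs, parents, s0):
    """Execute a DAG program as in Algorithm 1.

    funcs:   dict {k: f_k} for k = 1, ..., n, assumed topologically ordered
    parents: dict {k: list of parent node indices of k}
    s0:      program input, stored at node 0
    """
    s = {0: s0}
    for k in sorted(funcs):                      # topological order 1, ..., n
        s[k] = funcs[k](*(s[i] for i in parents[k]))
    return s[max(funcs)]                         # output s_n

# Hypothetical program: s1 = s0**2, s2 = s0 + 1, s3 = s1 * s2.
funcs = {1: lambda a: a ** 2,
         2: lambda a: a + 1.0,
         3: lambda a, b: a * b}
parents = {1: [0], 2: [0], 3: [1, 2]}
print(execute_program(funcs, parents, 3.0))      # (3**2) * (3+1) = 36.0
```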

Handling Multiple Program Inputs or Outputs

When a program has multiple inputs, we can always group them into a single input $s_0 \in S_0$ as $s_0 = (s_{0,1}, \ldots, s_{0,m_0})$ with $S_0 = S_{0,1} \times \cdots \times S_{0,m_0}$, since subsequent functions can always select the components of $s_0$ they need. Similarly, if an intermediate function $f_k$ has multiple outputs, we can group them into a single output $s_k \in S_k$ as $s_k = (s_{k,1}, \ldots, s_{k,m_k})$ with $S_k = S_{k,1} \times \cdots \times S_{k,m_k}$, since subsequent functions can always select the components of $s_k$ they need.


Figure 2. Two possible representations of a program. Left: functions are represented as nodes and dependencies as edges. Right: functions and outputs are represented as two disjoint sets of nodes (a bipartite graph). In both cases, the input is represented by a root node and the output by a leaf node.

Alternative Program Representation: Bipartite Graphs

In our formalism, since a function $f_k$ always has a single output $s_k$, a node $k$ can be considered to represent both the function $f_k$ and the output $s_k$. However, it is also possible to use two disjoint sets of nodes: one for functions and another for outputs, i.e., a bipartite graph. This formalism is similar to the factor graphs (Frey et al., 1997; Loeliger, 2004) used in probabilistic modeling, but with directed edges. An advantage of this formalism is that it allows functions to have multiple outputs. For simplicity, we will focus on the single-node-set representation.

Arithmetic Circuits

Arithmetic circuits are one of the simplest examples of computational graphs and originate from computational complexity theory. Formally, an arithmetic circuit over a field $\mathbb{F}$, such as the real numbers $\mathbb{R}$ or the complex numbers $\mathbb{C}$, is a directed acyclic graph (DAG) whose root nodes are elements of $\mathbb{F}$ and whose intermediate functions are arithmetic operations such as addition or multiplication, usually called arithmetic gates. Contrary to the computational graphs described above, which have a single root, an arithmetic circuit may have multiple root nodes; these can be variables or constants and must belong to the field $\mathbb{F}$. Arithmetic circuits can be used to compute polynomials, and a given polynomial can be represented by multiple arithmetic circuits. To compare two circuits that represent the same polynomial, an intuitive notion of complexity is the circuit size, as defined below.

Definition 1. Circuit and polynomial size

The size $S(C)$ of an arithmetic circuit $C$ is the number of edges in the directed acyclic graph (DAG) representing the circuit. The polynomial size $S(f)$ of a polynomial $f$ is the size of the smallest arithmetic circuit $C$ that represents $f$.

For more information on arithmetic circuits, refer to the book by Chen et al. 2011.

Feedforward Networks

A feedforward network is a type of computation chain with parameterized functions $f_1, \ldots, f_n$ that are applied sequentially to an input $s_0 \in S_0$. Each function $f_k$ depends on a parameter vector $w_k \in \mathcal{W}_k$, where $\mathcal{W}_k$ is the parameter space of $f_k$. Therefore,

\begin{equation} \begin{aligned} x &= s_0 \in S_{0}, \\ s_{1} &= f_{1}(s_{0}; w_1) \in S_{1}, \\ s_{2} &= f_{2}(s_{1}; w_2) \in S_{2}, \\ &\vdots \\ s_{n} &= f_{n}(s_{n-1}; w_n) = f(s_0; w) \in S_{n}.\\ \end{aligned} \end{equation}

for an input $s_0 \in S_0$ and trainable parameters $w = (w_1, \ldots, w_n) \in \mathcal{W} = \mathcal{W}_1 \times \cdots \times \mathcal{W}_n$. Each function $f_k$ is a layer of the network and each $s_k \in S_k$ is an intermediate representation of the input $s_0$. The dimension of $S_k$ is known as the dimension (or number of hidden units) of layer $k$. A feedforward network can thus be defined as a function $f: S_0 \times \mathcal{W} \to S_n$ that maps an input $s_0$ and parameters $w$ to an output $s_n$.

Given a parameterized program of this type, we can learn the parameters $w$ by adjusting them to fit a training dataset. For example, given a dataset of pairs $(x_i, y_i)$, we can minimize the loss function $L(w) = \sum_{i=1}^{m} \ell(f(x_i; w), y_i)$, where $\ell$ measures the discrepancy between the prediction and the target. The minimization can be carried out with an optimization algorithm such as gradient descent.
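
As an illustration, here is a minimal NumPy sketch (not the article's code) of a two-layer feedforward network together with a squared-error loss; the layer sizes, the tanh activation, and the random data are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(dims):
    """One (W_k, b_k) pair per layer, for layer widths dims = [d0, d1, ..., dn]."""
    return [(rng.normal(scale=0.1, size=(dout, din)), np.zeros(dout))
            for din, dout in zip(dims[:-1], dims[1:])]

def feedforward(x, params):
    """s_k = a_k(W_k s_{k-1} + b_k), with tanh activations and a linear last layer."""
    s = x
    for k, (W, b) in enumerate(params):
        z = W @ s + b
        s = z if k == len(params) - 1 else np.tanh(z)
    return s

def loss(params, xs, ys):
    """L(w) = sum_i ||f(x_i; w) - y_i||^2."""
    return sum(float(np.sum((feedforward(x, params) - y) ** 2))
               for x, y in zip(xs, ys))

params = init_params([3, 16, 1])                 # 3 inputs, 16 hidden units, 1 output
xs, ys = rng.normal(size=(8, 3)), rng.normal(size=(8, 1))
print(loss(params, xs, ys))
```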

Multilayer Perceptrons

Combination of Affine Layers and Activation Functions

In the previous section, we did not specify how to parameterize feedforward networks. A typical parameterization is the multilayer perceptron (MLP), which uses fully connected layers (also known as dense layers) of the form

\begin{equation} s_k = f_k(s_{k-1}; w_k) = a_k(W_k s_{k-1} + b_k), \end{equation}

where $w_k = (W_k, b_k)$ are the parameters of layer $k$, $W_k$ is a weight matrix, and $b_k$ is a bias vector. We can decompose the layer into two functions: the function $s \mapsto W_k s + b_k$ is an affine layer, and the parameter-free nonlinear function $s \mapsto a_k(s)$ is called an activation function.

More generally, we can replace the matrix-vector product $W_k s_{k-1}$ with any linear function of $s_{k-1}$. For example, convolutional layers convolve the input $s_{k-1}$ with a set of filters $W_k$.

Remark 1. Handling Multiple Inputs

Sometimes it is necessary to deal with networks with multiple inputs. For example, suppose we want to design a function $g(x_1, x_2; w_g)$, where $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$ are two inputs. A simple way to do this is to use the concatenation $x = x_1 \oplus x_2 \in \mathcal{X}_1 \oplus \mathcal{X}_2$ as the input to a network $f(x; w_f)$. Alternatively, instead of concatenating the raw inputs, they can be concatenated after passing through one or more hidden layers.

Relationship with Generalized Linear Models

When the depth of the network is $n = 1$ (i.e., a single layer), the output of an MLP is an affine transformation of the input $x$ followed by an activation:

\begin{equation} s_1 = a_1(W_1 x + b_1). \end{equation}

This is called a generalized linear model (GLM): a linear model of the input composed with an output activation $a_1$ (when $a_1$ is the identity, it reduces to an affine model). Generalized linear models are thus a special case of MLPs. In particular, when $a_1$ is the $\operatorname{softargmax}$ function, we obtain a logistic regression model. More generally, for depth $n > 1$, the output of an MLP is:

\begin{equation} s_n = a_n(W_n s_{n-1} + b_n). \end{equation}

This can be seen as a GLM on top of the intermediate representation $s_{n-1}$ of the input $s_0$. This is the main attraction of MLPs: the ability to learn intermediate representations that are useful for the task at hand. We will see that MLPs can also be used as subcomponents in other architectures.

Activation Functions

As mentioned earlier, feedforward networks use activation functions $a_k$ in each layer. In this section, we describe some of the most common scalar-to-scalar and vector-to-scalar activation functions, as well as some probability mappings that can be used as activation functions.

Nonlinear Scalar-to-Scalar Activation Functions

The ReLU (Rectified Linear Unit) function is a nonlinear activation function defined as the non-negative part of its argument. That is, the ReLU function is the identity function for non-negative values and zero for negative values:

\begin{equation} \text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x \geq 0, \\ 0 & \text{if } x < 0. \end{cases} \end{equation}

It is a piecewise linear function with a kink (a point of non-differentiability) at $x = 0$. A multilayer perceptron with ReLU activations is called a rectifier neural network. The layers take the form:

\begin{equation} s_k = \text{ReLU}(W_k s_{k-1} + b_k), \end{equation}

where the ReLU is applied element-wise. The ReLU function can be replaced with a smoothed version, known as softplus:

\begin{equation} \text{softplus}(x) = \log(1 + \exp(x)). \end{equation}

Unlike ReLU, softplus is a smooth, differentiable, and strictly positive function.
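
A quick NumPy sketch of both activations; using np.logaddexp for the softplus is just one numerically stable way to evaluate it:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softplus(x):
    """softplus(x) = log(1 + exp(x)); logaddexp(0, x) avoids overflow for large x."""
    return np.logaddexp(0.0, x)

x = np.array([-3.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(softplus(x))   # ~[0.049 0.693 3.049]
```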

Nonlinear Vector-to-Scalar Activation Functions

It is often useful to reduce a vector to a scalar. This scalar can be seen as a score or a probability, representing the importance of the input vector. A common way to do this is to take the maximum value, an operation also known as max-pooling. Given a vector $x \in \mathbb{R}^d$, the max-pooling function is:

\begin{equation} \text{maxpooling}(x) = \max_{j \in [\,d\,]} x_j. \end{equation}

As a smooth approximation of max-pooling, the log-sum-exp function can be used:

\begin{equation} \text{logsumexp}(x) := \text{softmax}(x) := \log\left(\sum_{j=1}^{d} \exp(x_j)\right). \end{equation}

The log-sum-exp function can be seen as a generalization of the softplus function to vectors. In fact, for all $x \in \mathbb{R}$:

\begin{equation} \operatorname{logsumexp}((x, 0)) = \operatorname{softplus}(x). \end{equation}

A numerically stable implementation of the log-sum-exp function relies on the identity:

\begin{equation} \text{logsumexp}(x) = \text{logsumexp}(x - c\mathbf{1}) + c, \end{equation}

where $\mathbf{1}$ is the all-ones vector and $c = \max_{j\in [\,d\,]} x_j$.
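
A minimal NumPy sketch of this stable evaluation, subtracting $c = \max_j x_j$ before exponentiating so that no term overflows:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum_j exp(x_j)): shift by c = max_j x_j."""
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))                      # ~1002.41, no overflow
print(np.log(np.sum(np.exp(x))))         # naive version overflows to inf
```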

Scalar-to-Scalar Probability Mappings

Sometimes we want to map a real number to a number between 0 and 1, which can represent the probability of an event. For this purpose, sigmoids are frequently used. A sigmoid is a function whose curve is characterized by an "S" shape. These functions are used to compress real values into an interval. An example is the binary step function, also known as the Heaviside step function:

\begin{equation} \text{step}(x) = \begin{cases} 1 & \text{if } x \geq 0, \\ 0 & \text{if } x < 0. \end{cases} \end{equation}

It is a mapping from $\mathbb{R}$ to $\{0, 1\}$. Unfortunately, the binary step function is discontinuous at $x = 0$. Moreover, because the function is constant at all other points, it has zero derivative there, which makes it difficult to use as part of a neural network trained with backpropagation.

A better sigmoid is the logistic function, which maps $\mathbb{R}$ to $(0, 1)$ and is defined as:

\begin{equation} \begin{aligned} \text{logistic}(x) &= \frac{1}{1 + \exp(-x)} \\ &= \frac{\exp(x)}{1 + \exp(x)} \\ &= \frac{1}{2} + \frac{1}{2}\tanh\left(\frac{x}{2}\right). \end{aligned} \end{equation}

It maps $(-\infty, 0)$ to $(0, 0.5)$ and $[0, \infty)$ to $[0.5, 1)$, and satisfies $\text{logistic}(0) = 0.5$. Therefore, its output can be interpreted as a probability. The logistic function can be viewed as a smoothed version of the step function $\text{step}(x)$. Furthermore, the logistic function can be obtained as the derivative of the $\text{softplus}$ function; that is, for all $x \in \mathbb{R}$:

\begin{equation} \text{logistic}(x) = \frac{d}{dx}\text{softplus}(x). \end{equation}

Two important properties of the logistic function are that, for all $x \in \mathbb{R}$:

\begin{equation} \text{logistic}(-x) = 1 - \text{logistic}(x), \end{equation}

and

\begin{equation} \begin{aligned} \frac{d}{dx}\text{logistic}(x) &= \text{logistic}(x)(1 - \text{logistic}(x)) \\ &= \text{logistic}(x)\text{logistic}(-x). \end{aligned} \end{equation}

Vector-to-Vector Probability Mappings

Sometimes we want to map a vector to a vector of probabilities. This is a mapping from $\mathbb{R}^d$ to the probability simplex, defined by:

\begin{equation} \Delta^{d} = \left\{p\in \mathbb{R}^d: \forall j \in [\,d\,],\; p_j \geq 0\; \text{ and }\; \sum_{j=1}^{d} p_j = 1\right\}. \end{equation}

Two examples of vector-to-vector probability mappings are the argmax function and the softargmax function. The $\text{argmax}$ function maps a vector to a probability vector that is zero in all entries except the entry with the maximum value, which is set to one. In effect,

\begin{equation} \text{argmax}(x) = \phi(\arg\max_{j\in [\,d\,]} x_j), \end{equation}

where $\phi(j)$ denotes the one-hot encoding of an integer $j \in [\,d\,]$, that is, $\phi(j)_i = 1$ if $i = j$ and zero otherwise, i.e.,

\begin{equation} \phi(j) = (0, \ldots, 0, 1, 0, \ldots, 0) = e_j \in \{0, 1\}^d. \end{equation}

This mapping places all the probability mass on the entry with the maximum value; in case of a tie, it selects the first such entry. Unfortunately, the $\text{argmax}$ function is not differentiable, which makes it difficult to use with backpropagation.

A smoothed version of the $\text{argmax}$ function is the $\text{softargmax}$ function, defined as:

\begin{equation} \text{softargmax}(x) = \frac{\exp(x)}{\sum_{j=1}^{d} \exp(x_j)}. \end{equation}

This operator is commonly known in the literature as the softmax function, but this name is misleading: the operator actually defines a smoothed version of the $\text{argmax}$. The output of $\text{softargmax}$ belongs to the relative interior of the probability simplex $\Delta^{d}$, meaning it never reaches the boundary of the simplex: if we denote $p = \text{softargmax}(x)$, then $p_j \in (0, 1)$, i.e., $0 < p_j < 1$, for all $j \in [\,d\,]$. The $\text{softargmax}$ function is the gradient of the log-sum-exp function; that is, for all $x \in \mathbb{R}^d$:

\begin{equation} \text{softargmax}(x) = \nabla \text{logsumexp}(x). \end{equation}

The $\text{softargmax}$ function can be seen as a generalization of the logistic function to vectors; in effect, for all $x \in \mathbb{R}$:

\begin{equation} \text{softargmax}((x, 0)) = (\text{logistic}(x), \text{logistic}(-x)). \end{equation}
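
A minimal sketch of a numerically stable softargmax in NumPy, together with quick numerical checks of the two properties above (the gradient of log-sum-exp via finite differences, and the relation with the logistic function):

```python
import numpy as np

def softargmax(x):
    """Stable softargmax: exp(x - c) / sum(exp(x - c)) with c = max(x)."""
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def logsumexp(x):
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

logistic = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -1.0, 2.0])

# softargmax(x) = grad logsumexp(x), checked by central finite differences.
eps = 1e-6
grad = np.array([(logsumexp(x + eps * e) - logsumexp(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(np.allclose(softargmax(x), grad))                      # True

# softargmax((x, 0)) = (logistic(x), logistic(-x)).
print(np.allclose(softargmax(np.array([0.5, 0.0])),
                  [logistic(0.5), logistic(-0.5)]))          # True
```
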
Remark 2. Degrees of Freedom and Inverse of the Softargmax Function

The $\text{softargmax}$ function satisfies the property that, for all $x \in \mathbb{R}^d$ and all $c \in \mathbb{R}$,

\begin{equation} p = \text{softargmax}(x + c\mathbf{1}) = \text{softargmax}(x). \end{equation}

This means that the $\text{softargmax}$ function has $d-1$ degrees of freedom and is not invertible. However, thanks to the above property, we can impose without loss of generality that $x^\top \mathbf{1} = \sum_{j=1}^{d} x_j = 0$ (if this is not the case, we can simply set $x_i \leftarrow x_i - \bar{x}$, where $\bar{x} = \frac{1}{d}\sum_{j=1}^{d} x_j$ is the mean of $x$). With this restriction, together with the fact that

\begin{equation} \log(p_i) = x_i - \log\sum_{j=1}^{d} \exp(x_j), \end{equation}

we obtain

\begin{equation} \sum_{j=1}^{d} \log(p_j) = - d \log \sum_{j=1}^{d} \exp(x_j). \end{equation}

Therefore, the $\text{softargmax}$ function is invertible in the sense that we can recover $x$ from $p$ via

\begin{equation} x_i = [\text{softargmax}^{-1}(p)]_i = \log(p_i) - \frac{1}{d}\sum_{j=1}^{d} \log(p_j). \end{equation}
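
A quick numerical check of this inverse formula, assuming a zero-mean logit vector as required by the remark:

```python
import numpy as np

def softargmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def softargmax_inv(p):
    """Recover the zero-mean logits: x_i = log p_i - mean_j log p_j."""
    logp = np.log(p)
    return logp - np.mean(logp)

x = np.array([1.0, -0.5, 0.25, -0.75])   # arbitrary logits
x = x - np.mean(x)                        # impose sum_j x_j = 0
p = softargmax(x)
print(np.allclose(softargmax_inv(p), x))  # True
```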

Residual Neural Networks

As mentioned earlier, feedforward networks are a type of computation chain in which parameterized functions are applied sequentially to an input. Consider a feedforward network with $n+1$ layers $f_1, \ldots, f_{n+1}$. Whenever $f_{n+1}$ can be the identity function, the set of functions this network can represent contains that of the $n$-layer network $f_1, \ldots, f_n$; in other words, depth should not, in theory, harm the expressive power of feedforward networks. Unfortunately, in practice it is not easy for a parameterized layer to learn the identity function. As a result, deeper networks can sometimes be harder to train than shallower ones, causing accuracy to saturate or even degrade with depth.

The key idea of residual neural networks (ResNets) (He et al., 2016) is to introduce residual blocks, in which the input of a block is added back to its output through a skip connection. Formally, a residual block is a function of the form

\begin{equation} s_{k} = f_k(s_{k-1}; w_k):= s_{k-1} + h_k(s_{k-1}; w_k). \end{equation}

The function $h_k$ is called the residual function, since it models the difference between the output and the input, $h_k(s_{k-1}; w_k) = s_k - s_{k-1}$; it can be seen as a correction added to the input $s_{k-1}$ to obtain the output $s_k$. The addition of $s_{k-1}$ is known as a skip connection or shortcut. As long as it is possible to choose $w_k$ so that $h_k(s_{k-1}; w_k) = 0$, the function $f_k$ can become the identity. For example, if we use:

\begin{equation} h_k(s_{k-1}; w_k) = C_k a_k(W_k s_{k-1} + b_k) + d_k, \end{equation}

where $w_k = (W_k, b_k, C_k, d_k)$ are the parameters of layer $k$, then it suffices to set $C_k = 0$ and $d_k = 0$ to obtain $h_k(s_{k-1}; w_k) = 0$. Residual blocks are also known to alleviate the vanishing gradient problem.

In many articles and software packages, an additional activation is included, and the residual block is instead defined as:

\begin{equation} s_{k} = f_k(s_{k-1}; w_k):= a_k(s_{k-1} + h_k(s_{k-1}; w_k)), \end{equation}

where aka_k is usually the ReLU function. Whether to include this additional activation or not is essentially a modeling decision. In practice, residual blocks can also include additional operations such as batch normalization and convolutional layers.
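
A minimal NumPy sketch of a residual block of the form above (a single affine map inside the residual function plus a skip connection); the dimensions and the tanh activation are arbitrary choices:

```python
import numpy as np

def residual_block(s, W, b, C, d, activation=np.tanh):
    """f(s) = s + h(s), with h(s) = C a(W s + b) + d."""
    return s + (C @ activation(W @ s + b) + d)

dim, hidden = 4, 8
rng = np.random.default_rng(0)
W, b = rng.normal(size=(hidden, dim)), np.zeros(hidden)
C, d = rng.normal(size=(dim, hidden)), np.zeros(dim)

s = rng.normal(size=dim)
print(residual_block(s, W, b, C, d))

# With C = 0 and d = 0, the block reduces to the identity.
print(np.allclose(residual_block(s, W, b, np.zeros_like(C), d), s))  # True
```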

Recurrent Neural Networks

Recurrent neural networks (RNNs) (Rumelhart et al., 1986) are a type of neural network that operate on sequences of vectors, either as input, output, or both. Their actual parameterization depends on the configuration, but the central idea is to maintain a state vector that is updated step by step through a recursive function that uses shared parameters across steps. We distinguish between the following configurations:

  1. Vector to sequence (one to many): \begin{equation} f^d:\mathbb{R}^d \times \mathbb{R}^{p} \to \mathbb{R}^{l\times m}. \end{equation}
  2. Sequence to vector (many to one): \begin{equation} f^e:\mathbb{R}^{l\times d} \times \mathbb{R}^{p} \to \mathbb{R}^{m}. \end{equation}
  3. Sequence to sequence (many to many, aligned): \begin{equation} f^a:\mathbb{R}^{l\times d} \times \mathbb{R}^{p} \to \mathbb{R}^{l\times m}. \end{equation}
  4. Sequence to sequence (many to many, unaligned): \begin{equation} f^u:\mathbb{R}^{l\times d} \times \mathbb{R}^{p} \to \mathbb{R}^{l'\times m}. \end{equation}

where $l$ is the length of the input sequence and $l'$ is the length of the output sequence. Note that we use the same number of parameters $p$ for each configuration, but this is of course not necessary. Throughout this post, we use the notation $p_{1:l} := (p_1, \ldots, p_l)$ to denote a sequence of $l$ vectors.

Vector to Sequence

In this configuration, a decoder function $p_{1:l} = f^d(x; w)$ maps a vector $x \in \mathbb{R}^d$ and a parameter vector $w \in \mathbb{R}^p$ to a sequence of vectors $p_{1:l} \in \mathbb{R}^{l\times m}$.


Figure 3. One to many (decoding). A vector $x \in \mathbb{R}^d$ is mapped to a sequence of vectors $p_{1:l} \in \mathbb{R}^{l\times m}$.

This is useful for image captioning, where a sentence (a sequence of words) is generated from an image (a vector of pixels). Formally, we can define $p_{1:l} = f^d(x; w)$ through the recursion:

\begin{equation} \begin{aligned} s_i &:= g(x, s_{i-1}; w_g), \quad i \in [\,l\,], \\ p_i &:= h(s_i; w_h), \quad i \in [\,l\,], \end{aligned} \end{equation}

where $w := (w_g, w_h, s_0)$ are the network parameters and $s_0$ is the initial state. The role of $g$ is to update the state $s_i$ based on the input $x$ and the previous state $s_{i-1}$, while the role of $h$ is to map the state $s_i$ to an output vector $p_i$. It is important to note that the parameters of $g$ and $h$ are shared across time steps. Typically, $g$ and $h$ are parameterized as single-hidden-layer MLPs. Note that $g$ has multiple inputs.
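
A minimal NumPy sketch of this one-to-many recursion, with $g$ and $h$ taken as simple single-layer maps (a tanh state update and a linear readout); all shapes and parameter choices are illustrative assumptions:

```python
import numpy as np

def decode(x, params, length):
    """p_{1:l} = f^d(x; w): s_i = g(x, s_{i-1}), p_i = h(s_i), with shared weights."""
    Wx, Ws, bs, Wh, bh, s = params           # s is the initial state s_0
    outputs = []
    for _ in range(length):
        s = np.tanh(Wx @ x + Ws @ s + bs)    # g: update state from x and s_{i-1}
        outputs.append(Wh @ s + bh)          # h: map state s_i to output p_i
    return np.stack(outputs)                 # shape (l, m)

d, k, m, l = 5, 8, 3, 4                      # input, state, output dims; sequence length
rng = np.random.default_rng(0)
params = (rng.normal(size=(k, d)), rng.normal(size=(k, k)), np.zeros(k),
          rng.normal(size=(m, k)), np.zeros(m), np.zeros(k))
print(decode(rng.normal(size=d), params, l).shape)   # (4, 3)
```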

Sequence to Vector

In this configuration, an encoder function $p = f^e(x_{1:l}; w)$ maps a sequence of vectors $x_{1:l} \in \mathbb{R}^{l\times d}$ and a parameter vector $w \in \mathbb{R}^p$ to a single vector $p \in \mathbb{R}^m$.


Figure 4. Many to one (encoding). A sequence of vectors $x_{1:l} \in \mathbb{R}^{l\times d}$ is mapped to a vector $p \in \mathbb{R}^m$.

This is useful for sequence classification, where the entire sequence is assigned a single category, for example in sentiment analysis. Formally, we can define $p = f^e(x_{1:l}; w)$ through the recursion:

\begin{equation} \begin{aligned} s_i &:= g(x_i, s_{i-1}; w_g), \quad i \in [\,l\,], \\ p &:= \text{pooling}(s_{1:l}), \end{aligned} \end{equation}

where $w := (w_g, s_0)$ are the network parameters, $s_0$ is the initial state, and $g$ now updates the encoder state from $x_i$ and does not take previous predictions as input. The pooling function typically has no parameters and reduces the sequence of states to a single vector; common choices are the mean, the sum, and max-pooling.
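
A matching many-to-one sketch, reusing the same kind of state update and using mean pooling over the states (again, purely illustrative choices):

```python
import numpy as np

def encode(xs, params):
    """p = f^e(x_{1:l}; w): s_i = g(x_i, s_{i-1}), then pool the states."""
    Wx, Ws, bs, s = params                   # s is the initial state s_0
    states = []
    for x in xs:                             # xs has shape (l, d)
        s = np.tanh(Wx @ x + Ws @ s + bs)    # g: update state from x_i and s_{i-1}
        states.append(s)
    return np.mean(states, axis=0)           # pooling: here, the mean of s_{1:l}

d, k, l = 5, 8, 6
rng = np.random.default_rng(0)
params = (rng.normal(size=(k, d)), rng.normal(size=(k, k)), np.zeros(k), np.zeros(k))
print(encode(rng.normal(size=(l, d)), params).shape)   # (8,)
```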

Sequence to Sequence (Aligned)

In this configuration, an alignment function $p_{1:l} = f^a(x_{1:l}; w)$ maps a sequence of vectors $x_{1:l} \in \mathbb{R}^{l\times d}$ and a parameter vector $w \in \mathbb{R}^p$ to a sequence of vectors $p_{1:l} \in \mathbb{R}^{l\times m}$. The input and output sequences have the same length.


Figure 5. Many to many (aligned). A sequence of vectors $x_{1:l} \in \mathbb{R}^{l\times d}$ is mapped to a sequence of vectors $p_{1:l} \in \mathbb{R}^{l\times m}$.

An example application is POS tagging, where a tag is assigned to each word in a sentence. Formally, this can be defined as $p_{1:l} = f^a(x_{1:l}; w)$ through the recursion:

\begin{equation} \begin{aligned} s_i &:= g(x_i, s_{i-1}; w_g), \quad i \in [\,l\,], \\ p_i &:= h(s_i; w_h), \quad i \in [\,l\,], \end{aligned} \end{equation}

where $w := (w_g, w_h, s_0)$ are the network parameters, $s_0$ is the initial state, and $g$ and $h$ are functions analogous to those in the one-to-many configuration: $g$ updates the state $s_i$ from the input $x_i$ and the previous state $s_{i-1}$, while $h$ maps the state $s_i$ to an output vector $p_i$.

Sequence to Sequence (Unaligned)

In this configuration, a function $p_{1:l'} = f^u(x_{1:l}; w)$ maps a sequence of vectors $x_{1:l} \in \mathbb{R}^{l\times d}$ and a parameter vector $w \in \mathbb{R}^p$ to a sequence of vectors $p_{1:l'} \in \mathbb{R}^{l'\times m}$.


Figure 6. Many to many (unaligned). A sequence of vectors $x_{1:l} \in \mathbb{R}^{l\times d}$ is mapped to a sequence of vectors $p_{1:l'} \in \mathbb{R}^{l'\times m}$.

The input and output sequences need not have the same length. An example application is machine translation, where a sentence in one language is translated into a sentence in another language. Formally, this can be defined as $p_{1:l'} = f^u(x_{1:l}; w)$ through the recursion:

\begin{equation} \begin{aligned} c &:= f^e(x_{1:l}; w_e), \\ p_{1:l'} &:= f^d(c; w_d), \end{aligned} \end{equation}

where $w := (w_e, w_d)$ are the network parameters, $f^e$ is an encoder function that maps the input sequence to a context vector $c$, and $f^d$ is a decoder function that maps the context vector to the output sequence. Both $f^e$ and $f^d$ can be implemented as RNNs. In detail, the steps are:

\begin{equation} \begin{aligned} s_i &:= g(x_i, s_{i-1}; w_g), \quad i \in [\,l\,], \\ c &:= \text{pooling}(s_{1:l}), \\ z_i &:= g(c, p_{i-1}, z_{i-1}; w_g), \quad i \in [\,l'\,], \\ p_i &:= h(z_i; w_h), \quad i \in [\,l'\,]. \end{aligned} \end{equation}

This architecture is aptly called an encoder-decoder architecture. Note that we denote the length of the target sequence by $l'$ instead of $l$ to avoid confusion; in practice, this length may depend on the input and is often not known in advance. To address this issue, the vocabulary (of size $d$ in our notation) is typically augmented with an "end of sequence" (EOS) token, so that at inference time we know when to stop generating the output sequence. A disadvantage of this architecture is that all information about the input sequence must pass through the context vector $c$, which can become a bottleneck.
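
Putting the two previous sketches together, here is a minimal encoder-decoder (sequence-to-sequence, unaligned) sketch in NumPy. For simplicity, the decoder state update uses only the context vector and the previous state (omitting the $p_{i-1}$ feedback that appears in the equations above), and a fixed output length is assumed rather than an EOS token:

```python
import numpy as np

def seq2seq(xs, enc, dec, out_len):
    """c = pooling(encoder states); then decode out_len steps from c."""
    Wx, Ws, bs, s = enc
    states = []
    for x in xs:                                  # encoder: s_i = g(x_i, s_{i-1})
        s = np.tanh(Wx @ x + Ws @ s + bs)
        states.append(s)
    c = np.mean(states, axis=0)                   # context vector (pooling)

    Wc, Wz, bz, Wh, bh, z = dec
    outputs = []
    for _ in range(out_len):                      # decoder: z_i from c and z_{i-1}
        z = np.tanh(Wc @ c + Wz @ z + bz)
        outputs.append(Wh @ z + bh)               # p_i = h(z_i)
    return np.stack(outputs)                      # shape (out_len, m)

d, k, m, l, l_out = 5, 8, 3, 6, 4
rng = np.random.default_rng(0)
enc = (rng.normal(size=(k, d)), rng.normal(size=(k, k)), np.zeros(k), np.zeros(k))
dec = (rng.normal(size=(k, k)), rng.normal(size=(k, k)), np.zeros(k),
       rng.normal(size=(m, k)), np.zeros(m), np.zeros(k))
print(seq2seq(rng.normal(size=(l, d)), enc, dec, l_out).shape)   # (4, 3)
```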

Conclusions

Neural networks, conceived as parameterized programs, provide a powerful and flexible framework for addressing a wide range of problems in machine learning and data processing. Throughout this article, we have explored various architectures and configurations, from feedforward networks to recurrent neural networks and residual networks. The representation of these architectures as directed acyclic graphs (DAGs) not only facilitates their analysis and optimization but also allows for the efficient application of techniques such as automatic differentiation. The different configurations discussed demonstrate the versatility of these models for tasks ranging from image captioning to machine translation.

This approach of viewing neural networks as parameterized programs not only allows for designing more efficient and effective models but also opens pathways for improving the interpretability and explainability of these models, an area of growing importance in artificial intelligence. Although we have covered several types of neural networks, it's important to note that the field is constantly evolving, with new architectures and techniques being developed continuously. In summary, this perspective provides a solid foundation for the future development of more advanced, interpretable, and adaptable machine learning models for a wide range of real-world applications.

Finally, if there are any errors, omissions, or inaccuracies in this article, please don't hesitate to contact us through the following Discord channel: Math & Code.

References