Deep learning systems have achieved impressive results in language understanding, protein structure prediction, and autonomous vehicles. While it has long been known that artificial neural networks with sufficiently many hidden neurons can approximate any continuous function, we still struggle to explain what large neural networks converge to, and why they generalize so well in practice, since training them is a non-convex optimization problem that is NP-hard in general.

However, there is some good news: infinite neural networks (i.e. networks whose hidden layers have infinitely many neurons) are much easier to understand than finite ones.

At initialization, an infinitely wide neural network is equivalent to a Gaussian process, and can therefore be described by a kernel (as is the case in Support Vector Machines or Bayesian inference) determined by the network architecture.

For infinitely wide networks, this kernel stays fixed throughout training, which makes it possible to study the training of the neural network analytically; for finitely wide networks, the kernel changes as the network is trained.
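This Gaussian behavior at initialization is easy to see numerically. The following sketch (a toy example of my own, not code from the papers) samples many random one-hidden-layer ReLU networks and evaluates each at a fixed input; as the width grows, the outputs look more and more like draws from a zero-mean Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_nets = 4096, 2000
x = np.array([1.0, -0.5, 0.3])   # a fixed input

outputs = []
for _ in range(n_nets):
    # Standard-Gaussian weights scaled by 1/sqrt(fan_in), as in the
    # NTK parameterization used in these papers.
    W1 = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
    w2 = rng.standard_normal(width) / np.sqrt(width)
    outputs.append(w2 @ np.maximum(W1 @ x, 0.0))   # scalar network output
outputs = np.array(outputs)

# For a wide network, the output distribution over random initializations
# is approximately a zero-mean Gaussian whose variance is determined by
# the architecture and the input.
print(round(float(outputs.mean()), 3), round(float(outputs.std()), 3))
```

A histogram of `outputs` would look Gaussian; repeating the experiment with a small width (say 2) shows a visibly non-Gaussian distribution.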

But given that many well-performing networks are not wide enough for the changes in their kernels to be negligible, the mystery of why neural networks work remains.

Let’s attempt (I say attempt since this is a complex subject) a deeper dive by reading two key papers that provide good insights into today’s state of the art in theoretical deep learning. I selected these two papers from a list of around ten that have been published over the last two years.

*First finding: Gradient descent training for an infinite neural network can be described by kernel methods*

“Full batch gradient descent of infinite networks in parameter space corresponds to kernel gradient descent in function space with respect to a new kernel, the Neural Tangent Kernel (NTK)”

“*At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function fθ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial.*

*We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function fθ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping.*”
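The NTK itself is concrete to compute for a finite network: it is the Gram matrix of the parameter gradients of the network outputs. Here is a minimal sketch (a toy one-hidden-layer ReLU network of my own, not code from the paper) of the empirical NTK on two inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
d, width = 3, 256

# Random parameters of a one-hidden-layer ReLU network f(x) = w2 . relu(W1 x)
W1 = rng.standard_normal((width, d)) / np.sqrt(d)
w2 = rng.standard_normal(width) / np.sqrt(width)

def grad_f(x):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    h = np.maximum(W1 @ x, 0.0)            # hidden activations
    mask = (W1 @ x > 0).astype(float)      # ReLU derivative
    dW1 = np.outer(w2 * mask, x)           # d f / d W1
    dw2 = h                                # d f / d w2
    return np.concatenate([dW1.ravel(), dw2])

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])
g1, g2 = grad_f(x1), grad_f(x2)

# Empirical NTK: Theta(x, x') = <grad f(x), grad f(x')>.
# It is symmetric and positive semi-definite by construction.
ntk = np.array([[g1 @ g1, g1 @ g2],
                [g2 @ g1, g2 @ g2]])
print(ntk)
```

At finite width this matrix is random and changes during training; the theorem says that in the infinite-width limit it converges to a deterministic kernel that stays constant.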

A few slides and two short videos from Arthur Jacot, a PhD student at EPFL:

*Second finding: Infinite neural networks evolve as linear models under gradient descent*

*“The neural network can be effectively replaced by its first-order Taylor expansion with respect to its parameters at initialization. The dynamics of gradient descent become analytically tractable through a linear model.”*
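Here is a minimal numerical illustration of that claim (a toy two-parameter-group model of my own, not the paper’s networks): for a small parameter step, the network output is well approximated by its first-order Taylor expansion around initialization.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
theta0 = rng.standard_normal(2 * d)   # toy "network": f(theta) = a . tanh(b * x)
x = np.array([0.5, -1.0, 2.0])

def f(theta):
    a, b = theta[:d], theta[d:]
    return a @ np.tanh(b * x)

def grad(theta):
    """Gradient of f w.r.t. theta = (a, b)."""
    a, b = theta[:d], theta[d:]
    t = np.tanh(b * x)
    return np.concatenate([t, a * (1 - t ** 2) * x])

# A small parameter step, as taken by one gradient-descent update.
step = 1e-3 * rng.standard_normal(theta0.size)

exact = f(theta0 + step)
linear = f(theta0) + grad(theta0) @ step   # first-order Taylor expansion

# The linearization error is second order in the step size.
print(abs(exact - linear))
```

The paper’s claim is that for wide networks the total parameter movement during training stays small enough that this linearization remains accurate for the whole training trajectory, not just one step.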

*Third finding: The predictions of an infinite neural network throughout gradient descent training can be described by a Gaussian Process*

*“For squared loss, the exact learning dynamics admit a closed-form solution that allows us to characterize the evolution of the predictive distribution in terms of a Gaussian Process.”*

*“In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.”*
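The closed-form mean dynamics can be sketched in a few lines. Assuming a fixed kernel K (as in the infinite-width limit), zero initial output, and full-batch gradient flow with squared loss, the train-set predictions follow f_t = (I − e^{−ηKt}) y; the Gram matrix below is made up for illustration:

```python
import numpy as np

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # toy NTK Gram matrix (positive definite)
y = np.array([1.0, -1.0])        # training targets
eta = 0.1                        # learning rate

# Eigendecomposition gives the matrix exponential for a symmetric K;
# each kernel eigencomponent of y is learned at a rate proportional
# to its eigenvalue, so the largest components converge fastest
# (the paper's motivation for early stopping).
vals, vecs = np.linalg.eigh(K)

def mean_prediction(t):
    decay = vecs @ np.diag(np.exp(-eta * vals * t)) @ vecs.T
    return (np.eye(2) - decay) @ y

print(mean_prediction(0.0))    # starts at [0, 0]
print(mean_prediction(500.0))  # converges to the targets [1, -1]
```

The full result in the paper also tracks the predictive covariance, which is what makes the output at any time t a Gaussian process rather than just a mean function.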

A poster from the paper, and another video from Arthur (who did not contribute to that paper but explains the Gaussian process very well):

Google Brain has developed a new library called Neural Tangents to study infinite neural networks, and published these two pictures, which show the output distribution of a finite network and the Gaussian distribution of an infinite network:

Theoretical machine learning has become a fast-moving research area, in particular because it can help build better neural networks of finite width. This blog post is therefore probably already outdated, and there are likely good new papers to read on the subject today.

Note: The picture above is Composition IX from Wassily Kandinsky at the Pompidou Center in Paris.

*Copyright © 2005-2021 by Serge-Paul Carrasco. All rights reserved.* *Contact Us: asvinsider at gmail dot com*

Categories: Algorithms, Deep Learning