Theoretical Deep Learning

(T) Research in deep learning is moving rapidly in every direction: reinforcement and imitation learning for robots, computer vision for self-driving cars, graph neural networks for biology, and language models for everything. Until we reach a point where we can predict when and why deep learning models work, we will not be able to design truly intelligent artificial systems. Over the last few years, theoretical deep learning has emerged as a new field that aims to explain the behaviors of certain types of deep learning models. It plays a role similar to that of theoretical computer science within computer science, or theoretical physics within physics, and it is expanding as more mathematicians join the field.

One particular area of theoretical deep learning that has received a lot of attention is the behavior of deep learning systems in the infinite-width limit, in particular because of the work of Arthur Jacot on the Neural Tangent Kernel (NTK):

  • “Training a randomly initialized deep learning model of infinite width with infinitesimal gradient descent steps is equivalent to kernel regression with a fixed kernel, which Jacot called the neural tangent kernel (NTK)”
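For a finite network, the kernel in the quote above can be estimated empirically as the Gram matrix of the gradients of the network output with respect to its parameters. The sketch below is only an illustration under my own assumptions (a toy one-hidden-layer ReLU network with NTK scaling, and the helper names `grad_f` and `empirical_ntk` are made up for this example); it is not the construction from the original paper.

```python
# Minimal sketch: empirical NTK of a toy one-hidden-layer ReLU network,
# f(x) = (1/sqrt(N)) * a . relu(W x), with random Gaussian initialization.
import numpy as np

rng = np.random.default_rng(0)

def init_params(d, N):
    """Random init; the 1/sqrt(N) output scaling appears in f, not here."""
    W = rng.standard_normal((N, d))
    a = rng.standard_normal(N)
    return W, a

def grad_f(x, W, a):
    """Gradient of the scalar output f(x) w.r.t. all parameters, flattened."""
    N = a.shape[0]
    pre = W @ x                                  # pre-activations, shape (N,)
    h = np.maximum(pre, 0.0)                     # ReLU activations
    df_da = h / np.sqrt(N)                       # gradient w.r.t. a, shape (N,)
    df_dW = ((a * (pre > 0)) / np.sqrt(N))[:, None] * x[None, :]  # shape (N, d)
    return np.concatenate([df_dW.ravel(), df_da])

def empirical_ntk(X, W, a):
    """K[i, j] = <grad f(x_i), grad f(x_j)> over the n input points."""
    G = np.stack([grad_f(x, W, a) for x in X])
    return G @ G.T

d, N, n = 3, 2000, 5
W, a = init_params(d, N)
X = rng.standard_normal((n, d))
K = empirical_ntk(X, W, a)
```

K is a symmetric positive semi-definite n-by-n matrix; the theory says that as the width N goes to infinity (at random initialization, with this scaling), it converges to a deterministic kernel that stays fixed during training.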

What follows generalizes that initial work and organizes the present research in theoretical deep learning into a few themes.

Neural Network Regimes:

Let’s define a neural network with the following parameters:

  • n = Sample size of the data
  • N = Number of neurons
  • d = Dimensions (number of inputs to the network)
  • k = Number of SGD steps

As those parameters change, the network can enter different regimes, i.e., different behaviors. Three types of regimes have been studied over the last few years. To describe them, I am using materials from a lecture by Professor Andrea Montanari of Stanford University:

  • Regime 1 – small networks = few neurons; the number of data points is equal to or larger than the dimension, and the number of SGD steps is much larger than the number of data points
  • Regime 2 – over-parametrized regime = over-parametrized network, very few SGD steps
    • Key concept: neural tangent kernel
  • Regime 3 – mean field regime = many neurons but few SGD steps; each data point is visited essentially once during SGD
    • Key concept: mean field approach
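To give a feel for the over-parametrized regime, here is a minimal sketch of its "lazy" behavior: after a few gradient descent steps, a wider network's weights have moved less, relative to their initialization, than a narrower network's. Everything here (the toy one-hidden-layer architecture, the sizes, and the function name `relative_weight_motion`) is my own illustrative assumption, not from the lecture.

```python
# Sketch: in the over-parametrized (lazy) regime, relative weight motion
# shrinks as width N grows, roughly like 1/sqrt(N) under NTK scaling.
import numpy as np

rng = np.random.default_rng(1)

def relative_weight_motion(N, d=3, n=20, k=10, lr=0.1):
    """Run k full-batch GD steps on a least-squares loss for
    f(x) = (1/sqrt(N)) * a . relu(W x); return ||W_k - W_0|| / ||W_0||."""
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    W = rng.standard_normal((N, d))
    a = rng.standard_normal(N)
    W0 = W.copy()
    for _ in range(k):
        pre = X @ W.T                       # pre-activations, shape (n, N)
        h = np.maximum(pre, 0.0)            # ReLU activations
        f = h @ a / np.sqrt(N)              # network outputs, shape (n,)
        r = f - y                           # residuals
        # Gradient of (1/2n) * sum_i r_i^2 with respect to W
        gW = ((r[:, None] * (pre > 0) * a[None, :]).T @ X) / (n * np.sqrt(N))
        W -= lr * gW
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

motions = {N: relative_weight_motion(N) for N in (50, 500, 5000)}
print(motions)  # relative motion is expected to shrink as N grows
```

Because the weights barely move, the network behaves like its linearization around initialization, which is exactly why kernel (NTK) descriptions become accurate in this regime.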

Regime 1:

Computational Gaps in Learning:

Regime 2:

Neural Tangent Kernel (NTK):

Gaussian Processes:

Lazy Training:

Double Descent Curve:

Over-Parametrized Regime:

Reverse Engineering the Neural Tangent Kernel:

Regime 3:

Mean Field Theory Applied to Neural Networks:


Note: The picture above is Amélie-les-Bains in the French Pyrénées-Orientales.

Copyright © 2005-2022 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com