Mean-Field Analysis of Neural Networks

(T) I recently attended another lecture, part of the Stanford statistics seminars, given by Professor Tengyu Ma of Stanford University, titled “Beyond NTK: A Mean-Field Analysis of Neural Networks with Polynomial Width, Samples, and Time”.

Following are my takeaway notes from the lecture:

Background:
Training neural networks requires optimizing non-convex losses, which is often feasible in practice but still not well understood theoretically, and is an NP-hard problem in the worst case.

The neural tangent kernel (NTK) approach analyzes nonconvex optimization under certain hyperparameter settings, e.g., when the initialization scale is large and the learning rate is small.
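
To make the NTK picture concrete, here is a small sketch of mine (not from the lecture) that computes the empirical neural tangent kernel of a two-layer ReLU network at random initialization; in the NTK regime, training behaves approximately like kernel regression with this fixed kernel. The width, weight scaling, and ReLU activation below are illustrative assumptions.

```python
import numpy as np

def empirical_ntk(X, m=2048, seed=0):
    """Empirical NTK Gram matrix of a width-m two-layer ReLU net at random init."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(m, d))           # hidden weights w_j ~ N(0, I)
    a = rng.choice([-1.0, 1.0], size=m)   # output weights a_j in {-1, +1}
    pre = X @ W.T                         # (n, m) pre-activations w_j . x_i
    act = np.maximum(pre, 0.0)            # ReLU(w_j . x_i)
    grad_a = act / np.sqrt(m)             # gradient of f(x_i) w.r.t. a_j
    # gradient of f(x_i) w.r.t. w_j is (a_j / sqrt(m)) * 1[w_j . x_i > 0] * x_i
    gate = (pre > 0) * a / np.sqrt(m)     # (n, m)
    return grad_a @ grad_a.T + (X @ X.T) * (gate @ gate.T)

X = np.random.default_rng(1).normal(size=(5, 10))
print(empirical_ntk(X))   # 5x5 kernel Gram matrix; stays nearly fixed in the NTK regime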

The mean-field approach views the collection of weight vectors as a (discrete) distribution over R^d (where d is the input dimension) and approximates its evolution by that of an infinite-width neural network, in which the weight-vector distribution is continuous and evolves according to a partial differential equation.
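
As an entirely illustrative sketch of this viewpoint (my own, not the paper's setup): with mean-field scaling, the network output is an average over the empirical distribution of the hidden-weight vectors, and as the width grows this average converges to an expectation over a continuous weight distribution; the PDE describes how that distribution evolves during training. The activation and input below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
x = rng.normal(size=d)
act = lambda z: np.maximum(z, 0.0)   # any activation works for the illustration

def mean_field_output(x, m):
    """Output of a two-layer net with mean-field (1/m) scaling and random weights."""
    W = rng.normal(size=(m, d))      # m samples ("particles") from the weight distribution
    return act(W @ x).mean()         # empirical expectation over the weight measure

for m in (10, 100, 10_000, 1_000_000):
    print(m, mean_field_output(x, m))  # stabilizes toward E_w[act(w . x)] as m grows
```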

Findings:
The paper provides a mean-field analysis of projected gradient flow on two-layer neural networks with polynomial width and quartic activations. Under a simple data distribution, it shows that the network converges in polynomial time, with polynomially many samples, to a non-trivial error. Notably, the resulting sample complexity is superior to that of the NTK approach.
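
The sketch below is my own toy discretization of projected gradient flow, not the paper's construction: a two-layer network with a quartic activation and mean-field scaling, trained by gradient descent with each neuron projected back onto the unit sphere after every step. The data, target, constraint set, and hyperparameters are illustrative assumptions and may differ from the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 256, 1000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # inputs on the unit sphere (assumption)
y = (X[:, 0] * X[:, 1]) ** 2                       # illustrative degree-4 target

W = rng.normal(size=(m, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # initialize neurons on the sphere

quartic = lambda z: z ** 4
dquartic = lambda z: 4.0 * z ** 3

eta = 0.1
for step in range(500):
    Z = X @ W.T                                    # (n, m) pre-activations
    pred = quartic(Z).mean(axis=1)                 # mean-field (1/m) output scaling
    resid = pred - y
    # gradient of 0.5 * mean(resid^2) with respect to each neuron W[j]
    grad = ((resid[:, None] * dquartic(Z)).T @ X) / (n * m)
    W -= eta * grad                                # gradient step (discretized flow)
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # projection back onto the sphere
print("loss:", 0.5 * np.mean((quartic(X @ W.T).mean(axis=1) - y) ** 2))
```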

Abstract:
“Despite recent theoretical progress on the non-convex optimization of two-layer neural networks, it is still an open question whether gradient descent on neural networks without unnatural modifications can achieve better sample complexity than kernel methods. This paper provides a clean mean-field analysis of projected gradient flow on polynomial-width two-layer neural networks. Different from prior works, our analysis does not require unnatural modifications of the optimization algorithm. We prove that with sample size n = O(d^{3.1}) where d is the dimension of the inputs, the network trained with projected gradient flow converges in poly(d) time to a non-trivial error that is not achievable by kernel methods using n ≪ d^4 samples, hence demonstrating a clear separation between unmodified gradient descent and NTK. As a corollary, we show that projected gradient descent with a positive learning rate and a polynomial number of iterations converges to low error with the same sample complexity.”

Reference:
Theoretical Deep Learning

Note: The picture above has a certain je ne sais quoi.

Copyright © 2005-2023 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.