The Power and Limits of Deep Learning


(T) I recently listened to a lecture by Yann LeCun on “The Power and Limits of Deep Learning”, organized by the ACM. Following are a plain-English summary of the state of the art for deep learning based on the lecture, a detailed technical summary of the key points presented in the lecture, and the video recording of the lecture.

Summary of the state of the art for deep learning

The essence of intelligence
A child learns after about nine months that objects are supposed to fall, and an orangutan knows that objects are not supposed to disappear. Both the child and the orangutan have learned those things by themselves. Emerging self-supervised machine learning systems should do the same: learn a task with few interactions. The essence of intelligence is to make decisions by predicting what to do and when: if a child sees an object falling, he or she will move. Self-driving cars that leverage self-supervised machine learning should do the same: they should choose where to go by predicting how other cars and pedestrians will move, and as a result, compute the car’s steering and acceleration automatically.

Taxonomy of machine learning systems
Traditional machine learning started with supervised learning, where a system learns how to make predictions from training data called labeled data, and a data scientist tweaks the parameters of the system so that it learns to make the right predictions. In some cases, unsupervised learning systems can learn to discover patterns in the data; the data scientist still tweaks the parameters of the system so that it learns to discover those patterns, but without giving it labeled data. Many new approaches have been proposed besides supervised and unsupervised learning, in particular: reinforcement learning, where the system improves its predictions by learning from a reward when its predictions are correct; weakly supervised learning, where there is a lot of labeled data but the quality of that data is poor; and semi-supervised learning, where there is only a small amount of labeled data.

The recent evolution of machine learning systems
The recent explosion of machine learning systems has been fueled by natural language processing (NLP) leveraging deep learning. While traditional machine learning systems rely on ETL (extract, transform, load) programs that transform the initial raw data into meaningful data (called the features) that the machine learning system (called the model) can use, deep learning systems can find features without ETLs and can learn not only from structured data, i.e. data that can be organized in a table, but also from unstructured data such as audio, images, and video. The recent evolution of those systems has been toward systems that are deeper and deeper, meaning that they have more layers of processing units. However, most of the deep learning systems deployed today are good at “perception” but not at “reasoning”.

Self-supervised learning (SSL) systems
Recent leading-edge NLP systems implement SSL: given the beginning of a text, they can continue it, and as such can generate fake news. SSL for vision systems is more difficult to implement: given multiple images, it is hard for the system to propose the missing images, or the part of an image that is missing. New NLP systems leverage attention mechanisms, which, like human attention, help the system focus on a specific context, e.g. features that are weighted more heavily. Generative adversarial models, which pit two systems against each other (one produces fake images while the other classifies real images from fakes, competing toward a Nash equilibrium), could potentially be used to model uncertainty.
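
To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention (my own toy illustration, not code from the lecture): a query is scored against a set of keys, a softmax turns the scores into weights, and the output is a weighted sum of the values, so the relevant items are weighted more heavily.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    # Scaled dot-product attention: score the query against every key,
    # turn the scores into weights, and blend the values accordingly.
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
values = rng.normal(size=(5, 8))
query = keys[2]                                      # query matches item 2

out, weights = attention(query, keys, values)        # item 2 gets the top weight
```

Because the query matches the third key, the softmax puts the largest weight on the third value; the other items still contribute a little, which is what keeps the whole operation differentiable.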

Technical Detailed Summary

Supervised Learning

  • Works, but requires training data that needs to be labeled
  • Parameters of the algorithms need to be tweaked in order for the system to provide the right predictions
  • Supervised learning goes back to the Perceptron and Adaline
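
Since the lecture traces supervised learning back to the Perceptron, here is a toy NumPy sketch (my own illustration) of the classic perceptron learning rule, trained on the linearly separable AND function:

```python
import numpy as np

# Rosenblatt perceptron on the AND function. Labels are +/-1; the
# update w += lr * y * x fires only on misclassified examples.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X = np.hstack([X, np.ones((4, 1))])          # append a bias input
y = np.array([-1, -1, -1, 1], dtype=float)   # AND targets

w = np.zeros(3)
for epoch in range(20):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:               # misclassified (or on boundary)
            w += 0.1 * yi * xi
            errors += 1
    if errors == 0:                          # converged: data is separable
        break
```

On separable data like this, the perceptron convergence theorem guarantees the loop terminates with all four points classified correctly.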

Convolutional Networks (ConvNets)

  • Two parts: feature extractor and classifier
  • Hierarchical feature extractor from low-level features (shapes) to high-level features (objects)
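
As a toy illustration of the feature-extractor part (my own sketch, not from the lecture), the NumPy code below applies a hand-built vertical-edge kernel — a low-level feature detector — followed by max pooling:

```python
import numpy as np

def conv2d(img, kernel):
    # "Valid" 2-D cross-correlation: slide the kernel over the image,
    # taking a weighted sum at each location -> one feature map.
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def max_pool2x2(fmap):
    # Pooling: keep the strongest response in each 2x2 neighborhood.
    H, W = fmap.shape
    return fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# An image that is dark on the left and bright on the right...
img = np.zeros((6, 6))
img[:, 3:] = 1.0
# ...and a hand-built vertical-edge detector (a low-level feature).
edge_kernel = np.array([[-1., 1.], [-1., 1.]])

fmap = conv2d(img, edge_kernel)   # fires only along the edge
pooled = max_pool2x2(fmap)        # response survives pooling
```

In a trained ConvNet the kernels are learned rather than hand-built, and stacking such convolution/pooling stages is what turns low-level edge detectors into high-level object detectors.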

Reinforcement Learning (RL)

  • Works great for games (Atari, ELF OpenGo, StarCraft)
  • Pure RL requires too many trials for learning – learning is slow: AlphaStar required the equivalent of 200 years of real-time play
  • RL works in the virtual world but not in the real world, for instance for self-driving cars
  • You cannot make decisions in the real world faster than real time
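
The sample inefficiency of pure RL can be felt even on a toy problem. The sketch below (my own illustration, not from the lecture) runs tabular Q-learning on a five-state corridor: the only learning signal is a scalar reward at the goal, so even this trivial task consumes hundreds of episodes and steps.

```python
import numpy as np

# Five-state corridor: start at state 0, reward at state 4;
# actions are 0 = move left, 1 = move right.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.2

steps_total = 0
for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s2 == 4 else 0.0          # scalar reward, only at the goal
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        steps_total += 1
```

After training, the learned values prefer "go right" next to the goal, but getting there took far more environment steps than any real-world system could afford in real time.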


Deep Learning

  • (Deep) multi-layer neural nets
  • Each unit computes a weighted sum of its inputs, which is processed by a non-linear (activation) function such as ReLU
  • Tweaking the parameters – gradient descent: it is like walking in the mountains in a fog and following the direction of steepest descent to reach the village in the valley
  • Gradients are computed by backpropagation
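
The three bullets above can be written out in a few lines of NumPy (a toy sketch, not code from the lecture): a two-layer net with ReLU units, gradients computed by explicit backpropagation, and parameters tweaked by gradient descent on the XOR problem.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])           # XOR targets

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # layer 1 (8 hidden units)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # layer 2
lr = 0.05
losses = []

for step in range(3000):
    # Forward pass: each unit computes a weighted sum, then ReLU.
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)                      # ReLU activation
    out = h @ W2 + b2
    losses.append(((out - y) ** 2).mean())

    # Backward pass: gradients via backpropagation (chain rule).
    d_out = 2 * (out - y) / len(X)
    dW2 = h.T @ d_out; db2 = d_out.sum(0)
    d_z1 = (d_out @ W2.T) * (z1 > 0)             # ReLU gradient gate
    dW1 = X.T @ d_z1; db1 = d_z1.sum(0)

    # Gradient descent: walk downhill in the fog.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

The loss shrinks steadily: each step follows the locally steepest descent direction computed by backpropagation.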


Origins of ConvNets

  • Inspired by the visual cortex’s mode of operation [Hubel & Wiesel 1962] – simple cells detect local features, and complex cells “pool” the outputs of simple cells within a retinotopic neighborhood – and by the Neocognitron [Fukushima 1982]
  • New developments: Face and pedestrian detection and semantic segmentation

Deep learning revolution

  • Starting in 2010, driven by the speech recognition community
  • Better and better performance, a dramatic increase in the number of layers
    [Krizhevsky et al. 2012] AlexNet (2012): deep ConvNets for object recognition on GPUs (8 to 20 layers – 10 million to 1 billion parameters – 1 to 10 billion connections)
  • The error rate on image classification keeps decreasing
  • Depth inflation – ResNet [He et al. 2015], DenseNet [Huang et al. 2017]
  • Each of the few billion photos uploaded to Facebook every day goes through a handful – 6 – of ConvNets within 2 seconds
  • Mask R-CNN: instance segmentation; a two-stage detection system that identifies regions of interest and sends them to other networks for identification
  • RetinaNet: feature pyramid network
  • But all of those systems are good for perception not for reasoning

Learning to reason

  • Introducing a differentiable associative memory (“self-attention networks”) to maintain a number of facts: a “memory network” is a neural net with an attached network for memory, essentially a soft RAM
  • Widely used in NLP: Google’s Transformer Networks, OpenAI’s GPT-2
  • Dynamic convolutions, where every unit is itself a small neural network, work well for translation
  • Facebook’s work on visual reasoning: finding an object in a visual representation
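
A differentiable associative memory can be sketched in a few lines (my own toy illustration; real memory networks and Transformers use learned, multi-head versions of this): facts are stored as key/value vector pairs, and a read is a softmax-weighted blend of the stored values — a soft RAM lookup.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SoftMemory:
    # Differentiable associative memory: a soft RAM where reading is a
    # softmax-weighted blend of stored values instead of a hard lookup.
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def write(self, key, value):
        # Store a fact as a (key, value) pair of vectors.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query, sharpness=10.0):
        # Soft addressing: similarity scores -> softmax weights -> blend.
        weights = softmax(sharpness * (self.keys @ query))
        return weights @ self.values

mem = SoftMemory(dim=3)
mem.write(np.array([1., 0., 0.]), np.array([1., 2., 3.]))
mem.write(np.array([0., 1., 0.]), np.array([4., 5., 6.]))

recalled = mem.read(np.array([1., 0., 0.]))   # close to the first value
```

A query that sits between two keys returns a blend of both values, and it is exactly this softness that lets gradients flow through the memory access.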

How do humans and animals learn (so quickly)?

  • Largely by observation: kids know at around nine months that objects fall, and animals know that objects do not disappear
  • Prediction is the essence of intelligence

Self-supervised networks

  • Training very large networks to understand the world through prediction
    SSL ~ unsupervised learning: no labels are given by humans; the labels are derived by the system itself from the input data
  • RL (Reinforcement Learning): a few bits per sample – predicting a scalar reward
  • SL (Supervised Learning): 10 to 10,000 bits per sample – predicting a category or a few numbers for each input
  • SSL (Self-supervised Learning): millions of bits per sample – predicting future frames in videos
  • SSL = filling in the blanks – works well for NLP, not so well for image recognition
    SSL works well with discrete data
  • However, the world is not always predictable (it is stochastic) => multiple futures are possible
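
The “filling in the blanks” idea on discrete data can be shown with a deliberately tiny sketch (mine, not from the lecture): the supervision comes entirely from the text itself, since any word can be masked and used as the label.

```python
from collections import Counter, defaultdict

# A tiny corpus; the "labels" are just the words themselves.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat lay on the mat .").split()

# Count (left neighbor, right neighbor) -> middle word.
table = defaultdict(Counter)
for left, mid, right in zip(corpus, corpus[1:], corpus[2:]):
    table[(left, right)][mid] += 1     # the label comes from the data

def fill_blank(left, right):
    # Predict the most frequent filler seen between this word pair.
    candidates = table.get((left, right))
    return candidates.most_common(1)[0][0] if candidates else None

guess = fill_blank("dog", "on")        # fills "the dog ___ on the rug"
```

No human labeled anything here, yet the model learns to fill blanks — which is why SSL scales: every token of text is a free training label. With continuous data like video frames there is no finite vocabulary to count over, which is one way to see why SSL is harder for vision.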

Generative adversarial training

  • The challenge: predicting under uncertainty
  • GANs are a possible solution for predicting under uncertainty: a generator predicts the future, and two discriminators learn from the actual and the predicted futures
  • Self-supervised adversarial learning for video prediction => predicting instance segmentation maps [Luc, Couprie, LeCun, Verbeek ECCV 2018]
  • Energy-based Self Supervised Learning
  • Energy function: takes low values on the data manifold and higher values everywhere else
  • Possible futures: low energy
  • Implausible futures: high energy
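
A minimal picture of the energy idea (my own sketch, not from the lecture): take the “data manifold” to be the unit circle in 2-D and define the energy of a point as its squared distance to the circle. Points on the data get low energy, implausible points get high energy, and gradient descent on the energy pulls an implausible point back toward the data.

```python
import numpy as np

def energy(x):
    # Squared distance to the "data manifold" (here: the unit circle).
    return (np.linalg.norm(x) - 1.0) ** 2

on_manifold = np.array([np.cos(0.3), np.sin(0.3)])  # plausible: low energy
off_manifold = np.array([3.0, 4.0])                 # implausible: high energy

# Gradient descent on the energy pulls the implausible
# point back onto the manifold.
x = off_manifold.copy()
for _ in range(100):
    r = np.linalg.norm(x)
    grad = 2.0 * (r - 1.0) * x / r    # d/dx of (||x|| - 1)^2
    x = x - 0.1 * grad
```

In energy-based SSL the energy function itself is learned from data rather than hand-written, but the shape of the idea is the same: low values on the data manifold, higher values everywhere else.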

Latent Variable Models

  • Regularized auto-encoders
  • Latent variables: when building a generative model, you do not want it to merely replicate the image you put in; you want to randomly sample from the latent space, or generate variations on an input image, from a continuous latent space
  • Input => encoder => latent representation => decoder => reconstruction ~ target => residual error
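
That pipeline can be sketched with the simplest possible auto-encoder (my own illustration, not from the lecture): a linear encoder/decoder obtained from PCA, applied to data that genuinely lives on a 2-D latent manifold, so the residual error is essentially zero and new samples can be drawn from the continuous latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that truly lives on a 2-D latent manifold inside 5-D.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent_true @ mixing

# Optimal *linear* auto-encoder via PCA (SVD): the encoder projects to
# the latent space, the decoder maps a latent code back to data space.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
encoder = Vt[:2].T                    # data -> latent
decoder = Vt[:2]                      # latent -> data

latent = Xc @ encoder                 # encode
reconstruction = latent @ decoder     # decode
residual = ((Xc - reconstruction) ** 2).mean()  # ~0: the manifold is 2-D

# Sampling the continuous latent space generates a new point in data space.
new_sample = rng.normal(size=2) @ decoder + X.mean(axis=0)
```

Regularized auto-encoders replace the linear maps with deep networks and constrain the latent space so that such random samples land on plausible data.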

Self-Supervised Forward Models

  • Learning a control task with few interactions: planning requires prediction
  • To plan ahead, simulate the world
  • RL model: agent -> objective
  • SSL model: agent = world simulator -> actor -> critic
  • Use forward models to learn to drive
  • The forward model is trained to predict how every car moves relative to the central car (the steering and acceleration are then computed)
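
Planning by prediction can be sketched with a hypothetical 1-D “car” (position and velocity; my own toy, not the lecture’s model): a forward model predicts the next state, and the planner simulates each candidate acceleration with that model and picks the one whose predicted trajectory ends closest to the goal.

```python
import numpy as np

def forward(state, action, dt=0.1):
    # Forward model of a toy 1-D "car": predicts the next
    # (position, velocity) given an acceleration command.
    pos, vel = state
    return np.array([pos + vel * dt, vel + action * dt])

def plan(state, goal, horizon=20):
    # Planning by prediction: simulate each candidate (constant)
    # acceleration with the forward model and keep the one whose
    # predicted trajectory ends closest to the goal.
    best_action, best_cost = 0.0, np.inf
    for action in (-1.0, 0.0, 1.0):
        s = state.copy()
        for _ in range(horizon):
            s = forward(s, action)      # simulate the world ahead
        cost = abs(s[0] - goal)         # predicted final distance
        if cost < best_cost:
            best_action, best_cost = action, cost
    return best_action

chosen = plan(np.array([0.0, 0.0]), goal=5.0)   # accelerates toward the goal
```

Note that the planner never touches the real environment while deciding: all trials happen inside the model, which is exactly what makes this approach viable where real-time, real-world trials are too expensive. In a learned system the forward model itself would be trained by self-supervised prediction.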

Recording of the lecture



Note: The picture above is “Paysage de Tahiti” from Paul Gauguin.

Copyright © 2005-2019 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com