Machine Learning Platforms and Data Pipelines at Google, Facebook and Uber


When you are a small start-up of fewer than 20 employees, you most likely have a single machine learning platform for all your analytics and services. That is often not the case at large companies, which have multiple product teams with different requirements. However, the trend at many companies is now to consolidate the full machine learning workflow (managing the data; training, evaluating, and deploying the models; making predictions; and monitoring those predictions) into a platform that operates as a centralized service for the whole enterprise. This is MLaaS (Machine Learning as a Service). It provides a way to build reliable, uniform, and reproducible pipelines across many product organizations.
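To make the workflow stages above concrete, here is a minimal Python sketch, not any company's actual platform. Every name in it (manage_data, train, evaluate, deploy, predict_and_monitor, run_pipeline, the dictionary used as a model registry) is hypothetical, and the "model" is a one-feature least-squares line chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple

RawRow = Tuple[Optional[float], Optional[float]]  # (feature, label), may be missing
Row = Tuple[float, float]


@dataclass
class Model:
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


def manage_data(raw: List[RawRow]) -> List[Row]:
    """Data management stage: drop rows with a missing feature or label."""
    return [(x, y) for x, y in raw if x is not None and y is not None]


def train(data: List[Row]) -> Model:
    """Training stage: fit a 1-D least-squares line (assumes features are not all equal)."""
    mx = mean(x for x, _ in data)
    my = mean(y for _, y in data)
    var = sum((x - mx) ** 2 for x, _ in data)
    slope = sum((x - mx) * (y - my) for x, y in data) / var
    return Model(slope, my - slope * mx)


def evaluate(model: Model, data: List[Row]) -> float:
    """Evaluation stage: mean squared error on the given data."""
    return mean((model.predict(x) - y) ** 2 for x, y in data)


def deploy(model: Model, registry: Dict[str, Model], name: str) -> None:
    """Deployment stage: publish the model under a name in a shared registry."""
    registry[name] = model


def predict_and_monitor(registry: Dict[str, Model], name: str, x: float,
                        log: List[Tuple[float, float]]) -> float:
    """Prediction stage, with each (input, prediction) pair logged for monitoring."""
    y = registry[name].predict(x)
    log.append((x, y))
    return y


def run_pipeline(raw: List[RawRow], registry: Dict[str, Model],
                 name: str, max_mse: float) -> bool:
    """Run the stages in order; deploy only if the error is acceptable."""
    data = manage_data(raw)
    model = train(data)
    if evaluate(model, data) <= max_mse:
        deploy(model, registry, name)
        return True
    return False
```

A centralized platform would run the same sequence for every team, with shared storage, scheduling, and monitoring in place of the in-memory dictionary and list used here.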

Below is a description of the key ML platforms and data pipelines that are being developed and deployed at Google, Facebook, and Uber.

Google ML Platform – TensorFlow Extended (TFX)

Last year, Google introduced its TensorFlow-based production-scale machine learning platform, called TFX. TFX has been deployed internally at Google. It is intended to provide the full ML pipeline, from data ingestion to serving models in production and everything in between. All components of the platform will be open sourced (see the road map below). So not only can you build your models with TensorFlow as Google does, but you will also be able to run them in a production environment like Google!

TFX was introduced in a paper presented at the KDD conference, with the following abstract:

“Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components—a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. 

We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks while providing platform stability that minimizes disruptions.

We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.”
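As a rough illustration of the data and model validation steps the abstract mentions, the following Python sketch checks incoming rows against a simple schema and gates a freshly trained model on its evaluation metric. It does not use the TFX API: the schema format and the names validate_data and should_push are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: feature name -> (minimum, maximum) allowed value.
Schema = Dict[str, Tuple[float, float]]


def validate_data(rows: List[Dict[str, float]], schema: Schema) -> List[str]:
    """Return a list of anomalies: missing features and out-of-range values."""
    anomalies = []
    for i, row in enumerate(rows):
        for name, (lo, hi) in schema.items():
            if name not in row:
                anomalies.append(f"row {i}: missing {name}")
            elif not lo <= row[name] <= hi:
                anomalies.append(f"row {i}: {name}={row[name]} outside [{lo}, {hi}]")
    return anomalies


def should_push(candidate_metric: float, serving_metric: float,
                tolerance: float = 0.0) -> bool:
    """Push the fresh model only if it is not worse than the serving model
    by more than `tolerance` (higher metric is assumed to be better)."""
    return candidate_metric >= serving_metric - tolerance
```

When models are refreshed continuously, checks of this kind would run on every new batch of data and every new candidate model, so that a bad batch or a regressed model is caught before it reaches production.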

Following is a short video about TFX:


Last week, at its TensorFlow Developer Summit, Google announced a new TFX component, TensorFlow Model Analysis (TFMA), and its plan to release TFX:

Google will open source the various TFX components in three phases. The Phase 1 components are already available on the TensorFlow GitHub. The components for Phase 2 and Phase 3 will be available by the end of this year.



Facebook ML Platform – Datacenter Software and Hardware Infrastructure

Facebook’s ML platform runs on Facebook-designed hardware from its Open Compute Project (Bryce Canyon, Big Basin, Tioga Pass, and Twin Lakes), leverages FBLearner Flow, Facebook’s workflow pipeline for training models and generating predictions, and uses Caffe2 (for production), PyTorch (for research), and ONNX (to exchange models).



In February, Facebook published a paper, “Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective”, that gives more insight into its ML platform. The paper was presented at HPCA 2018 with the following abstract:

“Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook’s machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.”

Yangqing Jia, Director of AI Infrastructure at Facebook and one of the authors of the Facebook paper, recently gave a talk at the Stanford Scaled Machine Learning conference about Facebook’s data center ML infrastructure:

Uber ML Platform – Michelangelo

Last year, Uber introduced its ML platform, named Michelangelo, in a blog article, “Meet Michelangelo: Uber’s Machine Learning Platform”. Michelangelo has two data pipelines: one for batch processing and one for real-time processing. Features computed by the batch pipeline can be made available to online models. Michelangelo is built with the following stack:

  • Prediction models: linear and logistic models, decision trees, unsupervised models, time series models, and deep neural networks
  • Platform stacks: Hadoop, Spark, Samza, and Cassandra
  • Machine learning stacks: Spark MLlib, XGBoost, and TensorFlow
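The idea that features produced by batch processing can be served to online models can be sketched as follows. This is an assumption-laden toy, not Michelangelo's code: FeatureStore is an in-memory stand-in for a shared key-value store, and batch_job, online_features, and the feature name avg_value are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


class FeatureStore:
    """In-memory stand-in for a key-value store shared by both pipelines."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], float] = {}

    def put(self, entity_id: str, feature: str, value: float) -> None:
        self._rows[(entity_id, feature)] = value

    def get(self, entity_id: str, feature: str, default: float = 0.0) -> float:
        return self._rows.get((entity_id, feature), default)


def batch_job(history: Iterable[Tuple[str, float]], store: FeatureStore) -> None:
    """Batch pipeline: aggregate historical values per entity and publish them."""
    grouped = defaultdict(list)
    for entity_id, value in history:
        grouped[entity_id].append(value)
    for entity_id, values in grouped.items():
        store.put(entity_id, "avg_value", mean(values))


def online_features(store: FeatureStore, entity_id: str,
                    request_value: float) -> Dict[str, float]:
    """Real-time path: combine a precomputed batch feature with request-time data."""
    return {
        "avg_value": store.get(entity_id, "avg_value"),
        "request_value": request_value,
    }
```

The design point is that the online path never recomputes the aggregate; it only performs a key lookup, which keeps prediction latency low while reusing the work done in batch.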


Uber’s blog article describes how Michelangelo is used for UberEats. Uber has also published another blog article that describes ML applications for a variety of Uber use cases: “Engineering More Reliable Transportation with Machine Learning and AI at Uber”.

Michelangelo is explained in detail in the following video by Li Erran Li of Uber, an IEEE and ACM Fellow:

Note: The picture above is some yummy patisserie.

Copyright © 2005-2018 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.