Someone from LinkedIn emailed me an invitation to a virtual meet-up, “AI/ML: in Product Management: Applications and best practices for leveraging AI/ML in products”, organized by the “LinkedIn Women in Product” group. Inspired by the discussions at the meet-up, and since I too am a machine learning product manager for a living, I decided to share what I have learned over the last few years in the trenches. So here is my laundry list on how to be a successful machine learning product manager.
- Being a successful machine learning product manager is hard!
I have had many product management roles in my career. I have been a product manager in networking, security, mobile, cloud infrastructure… But out of all my roles, my present and previous positions as a machine learning product manager have been the most challenging ones. Read on to understand why.
- Everything that you learned in other product management roles applies to this role as well
As in any other product management role, you must be a restless student of markets and technologies. You need to act as a product leader who can start a new product or grow an existing product line. You need to shine both on the “product” side, as the product expert for your company, and on the “management” side, as the one who takes an idea from a whiteboard to a real product in the marketplace. As in any other product management role, you are the one who helps maximize the contributions of the various stakeholders through the development, production, QA, and launch phases of the product life cycle.
- And, so every machine learning product must start with a new business case
As is the case for any product, you must first come up with a good use case and develop a reasonable business case. Look for an inflection point in your organization, a moment when something is changing. In machine learning, there are two areas in which to look for a possible inflection point.
You might discover that you have a lot of data about your customers, your products, your infrastructure, or your processes. And, if you start to learn from that data, you might get better at what you are doing. For instance, you might uncover patterns in your customers' behavior or find ways to make your enterprise assets more productive.
Or you might discover that if you add machine learning to an existing experience, interface, or process, you can add significant value to your top or bottom line. You might be able to generate new revenue streams by making your customer experiences more personalized, or to save costs by automating some internal processes.
- Learn to develop your product “instincts” and build your product “muscles” for machine learning
We are all, even without knowing it, using machine learning apps: search, online shopping, Siri, typing on a mobile keyboard, and soon, summoning your Tesla to come and pick you up…
Some of those applications are good; some are not always so good. Learn to recognize which applications are powered by machine learning, and what seems to be working and what does not. Build your intuition and your skills. Put yourself in the shoes of the user.
- The full life cycle of a machine learning product
Any machine learning product has four building blocks: the machine learning application, the features and models that provide predictions to the application, the machine learning pipeline that trains the models and deploys them in production for inference, and the raw data that is used to generate the features.
A successful machine learning product manager needs to master those four building blocks. Let’s dive into each of those building blocks.
- The machine learning application
Defining the machine learning application is no different from defining any other Web, mobile, device, infrastructure, or industrial application. It is all about use cases, user experiences, KPIs, and business metrics… The only specific component is that this application, which can be either online or offline, needs to interface with the machine learning pipeline to receive the model predictions.
- The models and the features
Defining the models requires a clear understanding of the application's requirements: do you want to find similarities between items, detect inappropriate content on your network, or segment your customers?
The product manager must have a good grasp of the machine learning tasks involved, such as scoring, ranking, classifying, clustering…, and of the types of learning that the models must perform (see below for more on learning).
Ideally, models and features should be reusable and shareable across data science teams to shorten the time it takes to launch new models in production. You want to build a library of features and models that can be modified and updated quickly for new use cases.
- The raw data for your features
You never have enough data. And even worse, you never have the right data. Work with your data science teams to identify the data that your models require. And work with your data producers to make all that data available in a data lake, a data warehouse, or a streaming data feed, so that it can feed your ETLs.
If you have sparse data, work with your data scientists on algorithms that cope with it. For a recommender system, for instance, that means factorization techniques such as matrix factorization or factorization machines. Otherwise, try to find new data sources, possibly outside your organization.
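To make the matrix factorization idea concrete, here is a minimal sketch in plain Python. It assumes a tiny, made-up set of (user, item, rating) triples; the function names `factorize`, `predict`, and `rmse` and all hyperparameters are illustrative and do not come from any library. Each user and each item gets a small vector of latent factors learned by stochastic gradient descent, so a rating that was never observed can be estimated from a dot product.

```python
import random

def factorize(ratings, n_users, n_items, k=2, epochs=3000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q from sparse ratings with SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Estimated rating of item i by user u."""
    return sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    """Root mean squared error on a list of observed ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Made-up sparse data: 11 of the 16 possible (user, item) pairs are observed.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 1), (3, 2, 4), (3, 3, 5),
]
P, Q = factorize(ratings, n_users=4, n_items=4)
train_rmse = rmse(P, Q, ratings)
unseen_estimate = predict(P, Q, 1, 1)  # user 1 never rated item 1
```

A production recommender would hold out data for evaluation and use a tested library rather than this loop; the point is only that a handful of observed ratings is enough to produce an estimate for the unobserved ones.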
- Your pipeline
The machine learning pipeline is the infrastructure where models are trained and deployed for inference in a production environment. You generally have three data pipelines: one for training models, one for serving models for inference, and one for experimenting with models.
Training is all about delivering a high-throughput infrastructure, because model training requires a lot of computing resources that must scale quickly.
Inference is all about delivering a low-latency infrastructure, because model inference must be very fast and scale to hundreds of thousands or millions of predictions an hour or a day, and possibly hundreds of millions or billions of predictions a month.
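It helps to translate those volumes into an average request rate, since that is the number a serving infrastructure is sized against. The following is a rough back-of-the-envelope sketch; the helper name `average_qps` and the 30-day month are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_qps(predictions, period_days):
    """Average predictions per second over a period of the given length."""
    return predictions / (period_days * SECONDS_PER_DAY)

# One billion predictions over an assumed 30-day month.
monthly_rate = average_qps(1_000_000_000, 30)   # about 386 per second
# One million predictions in a single day.
daily_rate = average_qps(1_000_000, 1)          # about 11.6 per second
```

These are averages; real traffic has peaks, so the infrastructure has to be sized for a multiple of these numbers.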
There are two ways to build a pipeline. If you have limited resources and limited engineering talent, you can leverage the infrastructure of a cloud provider, such as Google Cloud's AI Platform or Amazon SageMaker. But be careful if you are using such an environment! If you have data lakes and online applications, it might be difficult to integrate them with those cloud environments.
If you have more resources and the engineering talent, you are probably better off developing and owning your end-to-end machine learning pipeline. But be careful if you do so: it is expensive and technically quite challenging. All the components of a machine learning pipeline are available as open-source software. Use Jupyter and RStudio for your notebooks, Kubeflow for scheduling and monitoring your machine learning jobs, Spark for model training, Flink for streaming data, Seldon for serving models, and Kubernetes for running everything.
- Launching your machine learning application
When your models are working and your machine learning application gets its first prediction, you might think that you are ready to launch. Unfortunately, your launch will never end. Your data distribution will change. The data quality of your features might change. Your model predictions might improve or stagnate. Your pipeline will have outages. You will sometimes see the customer and business value of your model, but not all the time. This is why machine learning is hard.
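One way to notice that a data distribution has changed is to compare a feature's recent values against the values seen at training time. The sketch below uses the Population Stability Index (PSI), a common drift score chosen here purely as an illustration; the function name `psi`, the bin count, and the synthetic data are all assumptions.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clipped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids taking the logarithm of zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(1000)]   # feature values at training time
recent = [v + 3 for v in baseline]          # same feature, shifted upward

same_score = psi(baseline, baseline)   # identical samples score zero
drift_score = psi(baseline, recent)    # a shifted sample scores much higher
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold that matters is the one your data science team validates for your own features.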
- Machine learning is all about data transformation
You start with the data. You end with data. But between the beginning and the end, your pipeline has been processing a lot of data with only one end goal: performing data transformation. Machine learning is all about data transformation. Raw data is transformed into features. Features feed models. And, models are trained to provide predictions. In all those stages, data is transformed.
As a result, it is very important to have a data lineage infrastructure that consistently tracks the data, their versions, the time context, and the metrics and metadata associated with every data transformation that occurs in the data pipeline.
To that end, most machine learning pipelines have the concept of a feature store and a model store, where the data lineage of the features and the models can be found.
As a product manager, you need to understand data lineage in order to assess the state and the success rates of those transformations.
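As a rough illustration of what a lineage entry might hold, the sketch below records one transformation from raw rows to a feature. The `LineageRecord` class, its fields, the `fingerprint` helper, and the sample data are all hypothetical; real feature stores and model stores define their own schemas.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

def fingerprint(data):
    """Short, stable hash of JSON-serializable data, used as a version identifier."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

@dataclass
class LineageRecord:
    step: str                  # name of the transformation
    input_fingerprint: str     # version of the data going in
    output_fingerprint: str    # version of the data coming out
    code_version: str          # hypothetical label for the transformation code
    metrics: dict = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def total_spend(rows):
    """Example transformation: raw purchase rows to a per-customer feature."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

raw = [
    {"customer": "a", "amount": 20.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 5.0},
]
features = total_spend(raw)
record = LineageRecord(
    step="total_spend",
    input_fingerprint=fingerprint(raw),
    output_fingerprint=fingerprint(features),
    code_version="v1",
    metrics={"rows_in": len(raw), "rows_out": len(features)},
)
```

With records like this, a question such as "which raw data and which code produced the feature this model was trained on?" becomes a lookup rather than an investigation.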
- Machine learning is all about experimentation
In inference, you might have two models: a champion model and a challenger model. The challenger model might have new features that you want to experiment with.
In training, you might have different algorithms and different hyperparameters that you want to experiment with.
As a product manager, you must always be ready to let your data science teams experiment. To that end, always have two pipelines: one for staging and another one for production.
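A champion/challenger experiment needs a rule for deciding which model serves which request. One simple approach, sketched below under the assumption that every request carries a stable user identifier, is to hash that identifier into a bucket so that the same user always sees the same model. The function name `assign_model` and the 10% challenger share are illustrative.

```python
import hashlib

def assign_model(user_id: str, challenger_share: float = 0.1) -> str:
    """Deterministically route a user to the champion or the challenger model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "challenger" if bucket < challenger_share else "champion"

# Simulate routing for 10,000 hypothetical users.
assignments = [assign_model(f"user-{n}") for n in range(10_000)]
challenger_fraction = assignments.count("challenger") / len(assignments)
```

Because the assignment depends only on the identifier, it needs no shared state between servers, and the challenger's share of traffic can be raised gradually as confidence in it grows.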
- Do not be afraid to launch deep learning models in production
Deep learning models are hard to train, hard to debug, and hard to understand. But the more data you have, the better they tend to do compared to linear models.
And if you have a use case for unstructured data such as audio, image, or video, your models will most likely have to be deep learning models.
So far, most of the advanced applications of machine learning are based on deep learning: recommendations with sparse data, Siri, predictive keyboards, face recognition, fake news detection, autonomous vehicles…
- The different types of learning
Five to ten years ago, machine learning was pretty simple. We had three classes of systems: supervised learning for regression and classification, unsupervised learning for clustering, and deep learning, mostly based on convolutional neural networks and sequence models.
Today, we have at least:
- Weakly supervised learning
- Semi-supervised learning
- Self-supervised learning
- Active learning
- Federated learning
- Reinforcement learning
- Deep reinforcement learning
- Transfer learning
- Multi-task learning
And today, state-of-the-art machine learning applications might combine multiple types of technologies into one master application, such as OpenAI’s GPT-2, a language model for reading comprehension, summarization, translation, and question answering, or Covariant’s robots, which quickly learn new tasks in a warehouse.
- Build a data warehouse and use it for your machine learning pipeline
If you have a lot of historical data about customers, products, customer interactions with products, Web applications, mobile applications, mobile devices, IoT devices, industrial assets, industrial robots, home robots, or autonomous vehicles, that is a lot of data, and you might consider leveraging a data warehouse.
There are multiple data warehouse options, such as Snowflake on AWS or Google BigQuery on Google Cloud.
All those data warehouses are queried with SQL. You can run a lot of ETLs on them and extract valuable data to feed another chain of ETLs in your machine learning pipeline.
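To give a feel for what such an ETL looks like, here is a small sketch that aggregates raw purchase events into per-customer features with SQL. It uses Python's built-in `sqlite3` module as a stand-in for a real warehouse such as Snowflake or BigQuery, whose SQL dialects and client libraries differ; the table names, column names, and data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw data: one row per purchase event.
conn.execute("CREATE TABLE events (customer_id TEXT, product_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("c1", "p1", 20.0), ("c1", "p2", 30.0), ("c2", "p1", 5.0)],
)

# ETL step: turn raw events into one row of features per customer.
conn.execute(
    """
    CREATE TABLE customer_features AS
    SELECT customer_id,
           COUNT(*)                   AS n_purchases,
           SUM(amount)                AS total_spend,
           COUNT(DISTINCT product_id) AS n_products
    FROM events
    GROUP BY customer_id
    """
)

features = conn.execute(
    "SELECT customer_id, n_purchases, total_spend, n_products "
    "FROM customer_features ORDER BY customer_id"
).fetchall()
conn.close()
```

The resulting `customer_features` table is the kind of output that a downstream ETL in the machine learning pipeline would pick up and turn into model inputs.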
- Fairness and bias in your data
This is an important and growing issue in particular for consumer-based machine learning applications. Model predictions need to be fair. And, be careful about human biases in your training data sets.
- Keep your costs under control
Deploying machine learning applications is expensive. Basically, storage is almost free, but data processing is quite expensive.
If you have 50 ETLs and a few models, generating those ETLs and training those models on a cloud provider (AWS, Google…) is probably going to cost anywhere between $50,000 and over $100,000 per month, depending on the complexity of your ETLs, models, and data pipelines.
So if you have hundreds of ETLs and models, you are quickly going to reach a few million dollars a year to operate a machine learning pipeline. Keep that in mind when you are working on an ROI.
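The jump from a monthly bill to a few million dollars a year can be checked with simple arithmetic. The sketch below assumes that cost scales linearly with the number of ETLs, which is a simplification, and uses 300 ETLs as an arbitrary example of "hundreds"; the function name `yearly_cost` is illustrative.

```python
def yearly_cost(n_etls, monthly_cost_for_50):
    """Extrapolate a yearly cost from the monthly cost of 50 ETLs, assuming linear scaling."""
    cost_per_etl_per_month = monthly_cost_for_50 / 50
    return n_etls * cost_per_etl_per_month * 12

low_estimate = yearly_cost(300, 50_000)     # 3.6 million dollars a year
high_estimate = yearly_cost(300, 100_000)   # 7.2 million dollars a year
```

Real bills rarely scale this neatly, since shared infrastructure and volume discounts pull costs down while heavier models push them up, but the order of magnitude is what matters for an ROI discussion.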
- Learn how to analyze and visualize data in a notebook
As a product manager, you must at least learn how to analyze and visualize data in a notebook. Download the individual edition of Anaconda and learn how to use Jupyter notebooks. Learn the basics of Python programming. Take Jessica McKellar's class for an introduction to Python. Get familiar with the basic libraries for data analysis: Pandas, Seaborn, Matplotlib, and Plotly. And use them.
- Bonus point: learn how to develop simple models
Buy the book “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron. And read it.
Learn how to develop some simple models (regression, classification, clustering) with Scikit-learn. Get familiar with the TensorFlow and Keras APIs. Build those models in Jupyter notebooks or in Google Codelabs.
And, if you ever need to develop statistical models, learn R and how to use RStudio.
- Keep learning
Machine learning is changing at the speed of light. That’s one of the reasons why it is exciting but at the same time hard to be in that field. So keep learning, but not from the New York Times. Go to good conferences and meet-ups. Read the Google AI blog and the publications of OpenAI. Follow the best researchers from the best universities and research labs on Twitter.
Note: The picture above is “La boîte de Pandore”, a painting by René Magritte.
Copyright © 2005-2020 by Serge-Paul Carrasco. All rights reserved.