Statistical Modeling on a Machine Learning Platform

Statistical modeling and machine learning are sometimes considered the “same side of the same coin,” since both fields leverage statistical methods to learn from data. But they are also sometimes considered the “two sides of the same coin,” since their approaches differ.

Simply put:

  • Statistical modeling formalizes the relationships between the variables in the data in the form of mathematical equations
  • Machine learning uses algorithms that can learn from data without relying on rules-based programming
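
As a minimal illustration of this distinction in R (using the built-in cars dataset and the rpart package purely for illustration):

  # Statistical modeling: an explicit equation, dist = b0 + b1 * speed + error
  fit <- lm(dist ~ speed, data = cars)
  coef(fit)     # estimated coefficients b0 and b1
  summary(fit)  # statistical inference: standard errors, p-values

  # Machine learning: a decision tree learns its structure from the data,
  # with no equation specified up front
  library(rpart)
  tree <- rpart(dist ~ speed, data = cars)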

According to Professor Andrew Ng, a Computer Science professor at Stanford University, “machine learning is the science of getting computers to act without being explicitly programmed.”

But Professor Robert Tibshirani, a Professor of statistics also at Stanford University, calls machine learning “glorified statistics”!

Statistical models have traditionally been developed in the R language, while machine learning models are developed in Python or Scala.

R is written primarily in R but also in C and Fortran (Yes, I wrote Fortran!). By default, R is a single-threaded process that holds all its data in RAM. According to an article on the RStudio blog:

  • “R has become the world’s largest repository of statistical knowledge with reference implementations for thousands, if not tens of thousands, of algorithms that have been vetted by experts. The documentation for many R packages includes links to the primary literature on the subject
  • If novel machine learning tools for modeling are first written and supported in Python, many new methods in statistics are first written in R
  • R has a very low barrier to entry for doing exploratory analysis, and converting that work into a great report, dashboard, or API
  • R with RStudio is often considered the best place to do exploratory data analysis”

The list of all R packages can be found at R CRAN Packages. As already mentioned, CRAN has a large number of statistical methods that cannot always be found in Python and/or Scala libraries, such as:

  • Gaussian processes
  • Bayesian methods
  • Survival analysis (see the sketch after this list)
  • Feature engineering
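
For example, survival analysis is well served by the CRAN survival package; a minimal sketch using its bundled lung dataset:

  library(survival)

  # Kaplan-Meier survival curves, stratified by sex
  fit <- survfit(Surv(time, status) ~ sex, data = lung)
  summary(fit, times = c(90, 180, 365))  # survival estimates at 3, 6, and 12 months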

Besides their differences in languages, statistical models and machine learning models also differ in how they are deployed on a large-scale production platform:

  • Inference for statistical models is typically offline
  • Inference for machine learning models is typically online

Generally, machine learning models deployed in production are trained on Apache Spark clusters and deployed as Kubernetes microservices for inference. Statistical models are generally developed in RStudio notebooks and deployed in production on RStudio Server.

In many organizations, the expertise, the scale, and the cost required for a machine learning platform do not justify having two platforms: one for statistical learning and another for machine learning.

Fortunately, there are two solutions for running R scripts on Apache Spark: Sparklyr and SparkR. And I really prefer Sparklyr to SparkR.

Sparklyr:

  • Supports all CRAN R packages
  • Uses dplyr to filter and aggregate Spark datasets and streams before bringing them into R (see the sketch after this list)
  • sparklyr::spark_apply() runs any R code across all the nodes of a Spark cluster
  • Supports all Spark MLlib libraries
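
Here is a minimal sketch of that workflow with Sparklyr, assuming a local Spark installation (the iris dataset and the derived column are just for illustration):

  library(sparklyr)
  library(dplyr)

  sc <- spark_connect(master = "local")

  # copy a local R data frame to the cluster as a Spark DataFrame
  # (sparklyr replaces the dots in iris's column names with underscores)
  iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

  # dplyr verbs are translated to Spark SQL and executed on the cluster
  iris_tbl %>%
    group_by(Species) %>%
    summarise(mean_petal = mean(Petal_Length, na.rm = TRUE)) %>%
    collect()

  # spark_apply() ships arbitrary R code to every partition of the data
  spark_apply(iris_tbl, function(df) {
    df$Petal_Ratio <- df$Petal_Length / df$Petal_Width
    df
  })

  # Spark MLlib through sparklyr
  model <- ml_linear_regression(iris_tbl, Petal_Length ~ Petal_Width)

  spark_disconnect(sc)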

SparkR:

  • SparkR implements a distributed data frame similar to R data frames (with dplyr-like operations) but for large datasets (see the sketch after this list)
  • SparkR also supports Spark MLlib
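
A minimal SparkR sketch of the same idea, assuming a local Spark session (the example follows the SparkR programming guide):

  library(SparkR)
  sparkR.session(master = "local")

  # create a distributed SparkDataFrame from a local R data frame
  df <- as.DataFrame(faithful)

  # filter and aggregate with SparkR's own verbs (not dplyr)
  over70 <- filter(df, df$waiting > 70)
  head(summarize(groupBy(over70, over70$waiting), count = n(over70$waiting)))

  sparkR.session.stop()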

SparkR has not been widely adopted by the R community, but that is not the case for Sparklyr.

As a CRAN package that is agnostic to Spark releases, easy to install, and able to run across all the distributed nodes of a Spark cluster, Sparklyr has been a winner for the R community.

So while one can debate whether statistical modeling and machine learning are the same side or two sides of the same coin, there is no need to have RStudio notebooks and RStudio Server to run statistical models while having, at the same time, Jupyter notebooks to develop machine learning models in Python and Scala and train them on Apache Spark with large datasets.

You can develop statistical models in R in Jupyter notebooks, and machine learning models in Python and Scala also in Jupyter notebooks.

And you can deploy those statistical models in production for inference on a platform that runs Apache Spark with Sparklyr, and use the same platform to train machine learning models, which can later be deployed to Kubernetes microservices for online inference.

And the same platform can be used for both statistical learning and machine learning, so that you can offer one platform for all your data science needs.

Note: The picture above is from Whimsical Wonderland of Lights exhibits at the Golden Gate Park.

Copyright © 2005-2021 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com

Categories: Back-End, Machine Learning