Designing a Spark Data Pipeline

(T) One of my MLOps engineers asked me about my experience designing Spark data pipelines for processing large volumes of data. The following is a summary of my conversation with her:

Airflow vs Argo on Kubernetes:

Most companies in the San Francisco Bay Area use Airflow with Kubernetes (K8s). A few companies use Argo with K8s. I have used both, and they are pretty much at feature parity. Companies that have chosen Argo with K8s are, most of the time, using Kubeflow, which offers more functions than Argo's DAG scheduler and is widely used at Google.

Airflow or Argo vs Spark:

It is not one versus the other: you need both together. Spark requires a job scheduler, whether Airflow, Argo, or Kubeflow.

The sweet spot for Spark is really when you have a large number of ETLs that require a lot of data ingestion, either from data lakes or from a data warehouse such as Snowflake. One of my data pipelines had about 250 to 300 batch ETL jobs running every day on Spark, and a few streaming ETLs running on Flink.
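To make the division of labor concrete, here is a minimal sketch in plain Python of what the scheduler side does: it orders the ETLs by their dependencies and hands each one to Spark through spark-submit. The job names, the etl/ script paths, and the local[*] master are made up for illustration. A real deployment would express the same graph as an Airflow DAG or an Argo workflow rather than using graphlib directly.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL jobs, each mapped to the jobs it depends on.
ETL_DEPENDENCIES = {
    "ingest_events": set(),
    "ingest_accounts": set(),
    "join_events_accounts": {"ingest_events", "ingest_accounts"},
    "daily_aggregates": {"join_events_accounts"},
}


def spark_submit_command(job_name, script_path, master="local[*]"):
    """Build the spark-submit argument list for one batch ETL job."""
    return [
        "spark-submit",
        "--master", master,
        "--name", job_name,
        script_path,
    ]


def plan_runs(dependencies):
    """Return (job, command) pairs in an order that respects dependencies."""
    order = TopologicalSorter(dependencies).static_order()
    # The etl/<job>.py layout is an assumption made for this sketch.
    return [(job, spark_submit_command(job, f"etl/{job}.py")) for job in order]


if __name__ == "__main__":
    for job, command in plan_runs(ETL_DEPENDENCIES):
        print(job, "->", " ".join(command))
```

The sketch only plans the runs. Launching them, retrying failures, and enforcing the schedule is exactly what the scheduler is there for, which is why Spark alone is not enough.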

Designing a Spark data pipeline:

This is a huge effort…

Onboarding ETLs and models to Spark:

It is not that you deploy Kubeflow or Argo running with Spark, and you are done. You have a lot of services to build, and ETL and model configurations to set up, in order to onboard the ETLs and the models to Spark and monitor them 24×7. This effort could be a full-time job for over ten data engineers, depending on the volume of data.

Plus, if you are designing your pipeline with Spark, you will likely need to design a feature store.
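As an illustration of the per-ETL configuration work, here is a rough sketch of what one onboarding record could look like. The EtlJobConfig class, its fields, and the allowed source names are hypothetical and not taken from any particular platform. The only real names are the Spark properties spark.app.name, spark.executor.instances, and spark.executor.memory.

```python
from dataclasses import dataclass

# Assumed set of ingestion sources for this sketch.
ALLOWED_SOURCES = {"data_lake", "snowflake"}


@dataclass(frozen=True)
class EtlJobConfig:
    """Hypothetical onboarding record for one batch ETL."""
    name: str
    source: str              # where the ETL ingests from
    cron: str                # schedule handed to the job scheduler
    executor_instances: int
    executor_memory: str     # Spark size string such as "4g"
    max_retries: int = 2

    def validate(self):
        """Return a list of problems; an empty list means the config is usable."""
        problems = []
        if not self.name:
            problems.append("name is empty")
        if self.source not in ALLOWED_SOURCES:
            problems.append(f"unknown source: {self.source}")
        if len(self.cron.split()) != 5:
            problems.append("cron must have five fields")
        if self.executor_instances < 1:
            problems.append("executor_instances must be >= 1")
        if self.max_retries < 0:
            problems.append("max_retries must be >= 0")
        return problems

    def spark_conf(self):
        """Translate the record into Spark configuration properties."""
        return {
            "spark.app.name": self.name,
            "spark.executor.instances": str(self.executor_instances),
            "spark.executor.memory": self.executor_memory,
        }


if __name__ == "__main__":
    config = EtlJobConfig("daily_orders", "snowflake", "0 2 * * *", 4, "4g")
    print(config.validate())
    print(config.spark_conf())
```

Multiply a record like this by a few hundred ETLs, add the services that validate, deploy, and monitor them, and you get a sense of where the engineering time goes.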

Spark over AWS EMR, Spark over AWS K8s (EKS), Spark over Databricks Runtime:

You definitely want to avoid Spark over EMR. This is a nightmare. You have to deal with Spark (driver/executor), YARN/Hadoop, EMR, and EC2 instances. Good luck troubleshooting! And the AWS S3 write protocols have a few issues with Spark.

Between Spark over EKS and Spark over Databricks Runtime, if you have the $$$, I would go with the Databricks Runtime for the following three reasons:

  • It works well and is very resilient
  • It offers all the libraries that a data scientist can dream of. Do not forget that when you run Spark only, you are stuck with MLlib for the machine learning libraries
  • It enables you to run some small clusters, perfect for scikit-learn models. Note that Spark does not natively support scikit-learn

Troubleshooting Spark:

This is a huge effort…

That can also easily be a full-time job for a few engineers every day. As a rule of thumb, 30% of my daily jobs failed on Spark. The best that I have been able to reach is a 10% failure rate. So good luck with troubleshooting and monitoring. You need to put a lot of monitoring tools in place, such as Prometheus, Grafana, and many others.
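To put numbers on the failure rates above, here is a small sketch that computes the daily failure rate from a map of job statuses and renders the counts in the Prometheus text exposition format. The metric name etl_jobs_last_run, the status values, and the 10% alert threshold are assumptions for illustration. In practice you would expose such metrics through a Prometheus client library and chart them in Grafana rather than format the lines by hand.

```python
from collections import Counter


def failure_rate(statuses):
    """Fraction of jobs whose last run failed (0.0 when there are no jobs)."""
    if not statuses:
        return 0.0
    failed = sum(1 for status in statuses.values() if status == "failed")
    return failed / len(statuses)


def needs_attention(rate, threshold=0.10):
    """True when the failure rate is above the assumed alert threshold."""
    return rate > threshold


def prometheus_lines(statuses):
    """Render per-status job counts in the Prometheus text exposition format."""
    counts = Counter(statuses.values())
    lines = [
        "# HELP etl_jobs_last_run Number of ETL jobs by last-run status.",
        "# TYPE etl_jobs_last_run gauge",
    ]
    for status in sorted(counts):
        lines.append(f'etl_jobs_last_run{{status="{status}"}} {counts[status]}')
    return lines


if __name__ == "__main__":
    # Made-up day: 300 jobs, of which the first 90 failed (30%).
    day = {f"job_{i}": ("failed" if i < 90 else "succeeded") for i in range(300)}
    rate = failure_rate(day)
    print(f"failure rate: {rate:.0%}, needs attention: {needs_attention(rate)}")
    print("\n".join(prometheus_lines(day)))
```

With 300 daily jobs, a 30% failure rate means 90 failed runs to investigate every day, and even at 10% you still have 30.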

Note: The picture above is the triathlon de Saint-Gregoire.

Copyright © 2005-2022 by Serge-Paul Carrasco. All rights reserved.