Data Engineering @ Facebook

[Photo: IMG_6376]

(T) While in most start-ups data analysis (analyzing the data), data science (designing the algorithms), and data engineering (developing the data infrastructure) are handled by the same person, at Facebook those tasks are handled by three different groups. This week I attended two presentations about data engineering at Facebook, at a meet-up on the Facebook campus in Menlo Park. The first presentation was about data stores for data analysis, and the second was about writing data processing pipelines. Following are my notes from the presentations, with additional links:

Data stores for data analysis

Facebook has three data stores for data analysis that do not serve Facebook users: ODS, Scuba, and Scrapes.

ODS uses HBase to store 2.5 billion time series of counters, collected every minute, that feed alerts and dashboards. ODS is used for monitoring and trend analysis of system metrics (CPU, memory, I/O, network), application metrics (Web, DB, caches), and Facebook application metrics (usage, revenue).
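
To make the idea concrete, here is a toy, single-process sketch of a per-minute counter store; ODS itself runs on HBase across many machines, and every name below is invented for illustration:

```python
from collections import defaultdict
import time

class TimeSeriesStore:
    """Toy stand-in for an ODS-style store: counters bucketed per minute."""

    def __init__(self):
        # (entity, metric) -> {minute bucket -> accumulated value}
        self._series = defaultdict(lambda: defaultdict(float))

    def record(self, entity, metric, value, ts=None):
        """Add a sample; samples landing in the same minute are summed."""
        minute = int((ts if ts is not None else time.time()) // 60)
        self._series[(entity, metric)][minute] += value

    def read(self, entity, metric, start, end):
        """Return the stored buckets between two minute indices."""
        buckets = self._series[(entity, metric)]
        return {m: v for m, v in sorted(buckets.items()) if start <= m <= end}

# e.g. record CPU load for one host; an alerting job would read it back
store = TimeSeriesStore()
store.record("web042", "cpu", 0.75)
```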

Scuba aggregates data from all Facebook servers and stores it in memory, across thousands of tables totaling over 100 TB. Scuba is used for real-time analysis and troubleshooting of Facebook products and services, such as code regression analysis, bug reports, ads revenue monitoring, and performance debugging, among others.
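
Conceptually, a Scuba query is a filter plus an aggregation over in-memory event tables. Here is a minimal single-process sketch (Scuba itself shards its tables across thousands of machines; the event fields below are made up):

```python
from collections import Counter

# A handful of fake events standing in for one in-memory Scuba-style table.
events = [
    {"product": "ads", "latency_ms": 120, "error": False},
    {"product": "ads", "latency_ms": 340, "error": True},
    {"product": "feed", "latency_ms": 80, "error": False},
]

def group_count(rows, key, predicate=lambda row: True):
    """Count rows per value of `key`, keeping only rows matching `predicate`."""
    return Counter(row[key] for row in rows if predicate(row))

# A troubleshooting-style query: how many errors per product?
print(group_count(events, "product", lambda row: row["error"]))
# -> Counter({'ads': 1})
```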

Scrapes, built on Hive, is the warehouse for offline, large-scale data analysis across Facebook, with 300 PB of data distributed over many data centers around the globe. It is used for everything from ad hoc analysis to deep business intelligence on Facebook applications.

Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
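
As an illustration, an ad hoc Presto query can be issued from Python with a client library such as PyHive; the host, table, and column names below are placeholders:

```python
from pyhive import presto  # pip install pyhive

# Connect to a Presto coordinator (hostname and port are placeholders).
connection = presto.connect(host="presto.example.com", port=8080)
cursor = connection.cursor()

# An ad hoc aggregation over a hypothetical warehouse table.
cursor.execute("""
    SELECT ds, COUNT(*) AS daily_events
    FROM warehouse.events
    GROUP BY ds
    ORDER BY ds DESC
    LIMIT 7
""")
for row in cursor.fetchall():
    print(row)
```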

For more on those data stores at Facebook, see a blog post from Janet Wiener, a Facebook engineer: Facebook’s Top Open Data Problems.

Developing data pipelines

Facebook has developed a data pipeline framework in Python called Dataswarm.

Dataswarm provides a library of operations, such as executing queries, moving data, and running scripts, that data analysts use to define dependency graphs of tasks to be executed. The framework automates the rest: distributed execution, scheduling, and dependency management.
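
Since Dataswarm is not public, the following is only a toy sketch of the core idea from the talk: tasks declare their dependencies, and the framework works out a valid execution order. All names are invented, and the real system additionally distributes and schedules the work:

```python
class Task:
    """A unit of work plus the tasks it depends on."""

    def __init__(self, name, action, depends_on=()):
        self.name = name
        self.action = action            # callable that does the real work
        self.depends_on = list(depends_on)

def run_pipeline(tasks):
    """Execute tasks in dependency order (a simple topological sort)."""
    done, pending = set(), list(tasks)
    while pending:
        ready = [t for t in pending if all(d.name in done for d in t.depends_on)]
        if not ready:
            raise RuntimeError("cycle detected in task graph")
        for task in ready:
            task.action()
            done.add(task.name)
            pending.remove(task)

extract = Task("extract", lambda: print("run extraction query"))
transform = Task("transform", lambda: print("aggregate results"), depends_on=[extract])
load = Task("load", lambda: print("move results to MySQL"), depends_on=[transform])

run_pipeline([load, extract, transform])  # runs extract, transform, load in order
```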

Dataswarm processes data across a variety of back ends: Hive, Presto, MySQL, and Apache Giraph, as well as Python libraries such as NumPy, Pandas, and scikit-learn.

For more on data pipelines, see a talk from Mike Starr, a Facebook engineer.

Note that Dataswarm is not open source, but Airbnb has open-sourced a similar tool called Airflow (see "Airflow: a workflow management platform").
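
For comparison, a minimal Airflow DAG expresses the same idea of a dependency graph of tasks; the task contents below are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in newer releases

# A DAG that runs once a day.
dag = DAG(
    "daily_stats",
    start_date=datetime(2015, 6, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract >> load  # load runs only after extract succeeds
```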

Note: The picture above is the entrance of building 15 at the Facebook campus in Menlo Park.

Copyright © 2005-2016 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.

Categories: Machine Learning