# Reducing the Requirements on Computation for Big Data by Leveraging Statistical Inference (T) One of the San Francisco machine learning meet-ups invited Professor Michael Jordan from UC Berkeley for a talk at Yelp this week. Professor Jordan is an active researcher in machine learning and has recently focused his research on how statistical methods can be used to solve some of the computational challenges posed by Big Data.

Following is an overview of the problems and solutions that he is researching:

#### The Big Data phenomenon

It is not the scale of the data but the granularity of the data that needs to be addressed – example: Facebook’s level of information about each user.

#### The Problem

The design of computing machines is about managing resources: less space (smaller computer), less energy (longer battery), less time (efficient algorithms)
Data should be considered as a resource as space, energy and time
But more data is not necessarily better than less data for two reasons:
– Query complexity grows faster than the number of data points
– More data means less likely to have an algorithm running in an acceptable time frame.

How to address the problem?

Blending the complexity of the computation (more data more computation) with stronger statistical inference (more data better statistics).

#### The goal

Exploiting the increasing inferential strength of data at scale to keep computational complexity at bay.

#### Research themes Distributed platforms and parallel algorithms

“Large computational problems are often usefully addressed via some notion of “divide-and-conquer.” That is, the large problem is divided into subproblems that are hopefully simpler than the original problem, these subproblems are solved (sometimes again with a divide-and-conquer strategy) and the solutions are pieced together to solve the original problem. In the statistical setting, one natural subdivision strategy involves breaking the data into subsets. The estimator of interest is applied to the subsets and the results are combined. The challenge in the statistical setting is that the analysis of subsets of data may present different statistical properties than the overall dataset. For example, confidence intervals based on subsets of data will generally be wider than confidence intervals based on the original data; thus, care must be taken that the overall divide-and-conquer procedure yields a correctly calibrated interval.”

#### Algorithmic weakening for statistical inference

“We do not consider a single algorithm for solving an inference problem, but instead consider a hierarchy of algorithms that are ordered by computational complexity. As data accrue, we want to back off to cheaper algorithms that run more quickly and deliver a result that would be viewed as being of poorer quality from a classical algorithmic point of view. We hope to do this in a way such that the increasing statistical strength of the data compensate for the poor algorithmic quality, so that in fact the overall quality of inference increases as data accrue, even if we impose a computational budget. The challenge is to do this in a theoretically sound way.”