Mining of Massive Datasets


(T) I was delighted to take the Mining of Massive Datasets MOOC on Coursera this quarter, taught by Professors Jure Leskovec, Anand Rajaraman and Jeff Ullman of Stanford University. The course covers large data sets of many types and structures, across a wide range of applications. The key topics of the class are:

  • Map-reduce algorithm

  • Finding similar items

  • Search engines

  • Frequent-item set mining

  • Clustering algorithms

  • Web advertising and recommendation systems

  • Mining of social network graphs

  • Extracting the properties of a large dataset

  • Machine-learning algorithms for very large datasets

Of all those topics, there were two that I particularly enjoyed and that I recommend studying, because I have not seen this material covered elsewhere:

1. Fingerprint recognition (Professor Jeff Ullman)

This topic is part of finding similar items. The first technique for speeding up the search for similar items in a large collection is “minhashing”, which compresses large sets into short signatures while still making it possible to infer the similarity of the original sets from the compressed versions. The second technique is “locality-sensitive hashing”, which restricts the search to the pairs that are most likely to be similar. Fingerprint matching involves a specific type of locality-sensitive function.
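To make the two techniques concrete, here is a minimal Python sketch of minhashing followed by banded locality-sensitive hashing. It is an illustration under my own assumptions, not the course's code: the function names, the 3-character shingles, the CRC32 element ids and the random linear hash functions are all hypothetical choices, and it is not the fingerprint-specific scheme from the lectures.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # large prime used as the modulus of the hash functions

def shingles(text, k=3):
    """Turn a string into its set of k-character shingles (assumed representation)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def make_hash_funcs(n, seed=0):
    """Draw n random linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(items, hash_funcs):
    """Compress a non-empty set into a signature: one minimum per hash function."""
    # CRC32 gives a stable integer id per element (str hash() varies per process).
    ids = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_funcs]

def estimate_similarity(sig1, sig2):
    """Fraction of agreeing positions, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def lsh_candidate_pairs(signatures, bands, rows):
    """Split each signature into bands; items sharing any band become candidates."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

if __name__ == "__main__":
    docs = {
        "d1": "mining of massive datasets",
        "d2": "mining of massive data sets",
        "d3": "a completely unrelated sentence",
    }
    funcs = make_hash_funcs(100)
    sigs = {name: minhash_signature(shingles(t), funcs) for name, t in docs.items()}
    print("exact d1/d2:", jaccard(shingles(docs["d1"]), shingles(docs["d2"])))
    print("estimated d1/d2:", estimate_similarity(sigs["d1"], sigs["d2"]))
    print("candidates:", lsh_candidate_pairs(sigs, bands=20, rows=5))
```

More hash functions give a tighter similarity estimate at the cost of longer signatures. The choice of bands and rows sets the similarity level above which a pair is likely to become a candidate, so only those candidates need an exact comparison.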

2. Mining social network graphs (Professor Jure Leskovec)

The first challenge in mining social graphs is to discover the overlapping communities among the members of the graph. Other challenges of interest include:

  • Similarities among nodes (SimRank)

  • Measuring the connectedness of a community

  • Measuring the neighborhood sizes of the nodes

  • Computing the transitive closure (e.g. finding the reachability between two given nodes).
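
For a graph that fits in memory, the last two items can be sketched with a plain breadth-first search. The Python below is only a small illustration under my own assumptions: the adjacency-dictionary representation and the function names are hypothetical, and this is a straightforward in-memory approach that would not scale to massive graphs.

```python
from collections import deque

def reachable_from(graph, source):
    """Return the set of nodes reachable from source (source included)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_reachable(graph, source, target):
    """True if there is a directed path from source to target."""
    return target in reachable_from(graph, source)

def neighborhood_sizes(graph, source, max_hops):
    """Return a list whose entry h is the number of nodes within h hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return [sum(1 for d in dist.values() if d <= h) for h in range(max_hops + 1)]

if __name__ == "__main__":
    # Hypothetical directed graph: node -> list of successors.
    graph = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
    print(reachable_from(graph, "a"))
    print(is_reachable(graph, "a", "d"))
    print(neighborhood_sizes(graph, "a", 2))
```

Running `reachable_from` from every node yields the full transitive closure, which is exactly what becomes too expensive on very large graphs.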


The book based on the class: Mining of Massive Datasets.

Note: The picture above is “Nymphe à la Coquille” by Antoine Coysevox, property of the Château de Versailles.

Copyright © 2005-2014 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.
