Wednesday, March 2, 2011

Mining of Massive Datasets

Anand Rajaraman and Jeff Ullman have written a book, "Mining of Massive Datasets", which can be downloaded for free.
It focuses on mining very large amounts of data.

You can find materials from past offerings of CS345A, the Stanford course the book grew out of, at:
http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.


The principal topics covered are:
1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.
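
To give a flavor of topic 2, here is a minimal sketch of minhashing in Python. It is not taken from the book: the function names (shingles, make_hash_funcs, minhash_signature, estimate_similarity), the 3-character shingle size, and the choice of 200 hash functions are all assumptions made for illustration. The idea is that the fraction of positions where two minhash signatures agree approximates the Jaccard similarity of the underlying sets.

```python
import random
import zlib

def shingles(text, k=3):
    """Return the set of k-character shingles of text (k=3 is an arbitrary choice)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def make_hash_funcs(n, seed=0):
    """Build n hypothetical hash functions of the form (a*x + b) mod p."""
    rng = random.Random(seed)
    prime = 2_147_483_647  # 2^31 - 1, a Mersenne prime
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime)
            for _ in range(n)]

def minhash_signature(items, hash_funcs):
    """Signature = minimum hash value of the set under each hash function.

    Assumes items is non-empty. crc32 is used instead of hash() so that
    results do not change between Python processes.
    """
    codes = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * c + b) % p for c in codes) for a, b, p in hash_funcs]

def estimate_similarity(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

if __name__ == "__main__":
    doc1 = shingles("mining of massive datasets is a book about data mining")
    doc2 = shingles("mining of massive datasets is a course about web mining")
    funcs = make_hash_funcs(200)
    sig1 = minhash_signature(doc1, funcs)
    sig2 = minhash_signature(doc2, funcs)
    print("exact Jaccard:   ", jaccard(doc1, doc2))
    print("minhash estimate:", estimate_similarity(sig1, sig2))
```

The signatures are much smaller than the shingle sets themselves, which is what makes the comparison practical at scale; locality-sensitive hashing then goes a step further and avoids comparing every pair of signatures.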
