FUN in S&T: Data Mining

Wednesday, March 2, 2011

- The proceedings of the Learning to Rank Challenge have been published in JMLR Workshop and Conference Proceedings.

- The datasets are available through the Yahoo! Webscope program. They do not require approval and are open to anyone.

Mining of Massive Datasets

Anand Rajaraman and Jeff Ullman wrote a book, "Mining of Massive Datasets", that can be downloaded for free.
It focuses on mining very large amounts of data.

Materials from past offerings of CS345A, the Stanford course from which the book evolved, are at:
http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.

The principal topics covered are:
1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.
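
To give a feel for topic 2, here is a minimal sketch of minhashing in Python. It is not taken from the book or the course materials. The function names, the salted BLAKE2 hash standing in for random permutations, and the made-up "shingle-N" items are all assumptions chosen for illustration. The idea it shows: the fraction of positions on which two minhash signatures agree approximates the Jaccard similarity of the underlying sets.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(item, seed):
    """64-bit hash of a string. Each seed stands in for one random
    permutation of the items (an assumption of this sketch)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """For each seed, keep the smallest hash value over a non-empty set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical data: two "documents" sharing 60 of 100 distinct shingles.
doc_a = {f"shingle-{i}" for i in range(0, 80)}
doc_b = {f"shingle-{i}" for i in range(20, 100)}

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

exact = jaccard(doc_a, doc_b)
approx = estimate_jaccard(sig_a, sig_b)
print(f"exact={exact:.2f} estimate={approx:.2f}")
```

The signatures have a fixed length (256 numbers here) no matter how large the sets are, which is what makes the technique attractive for very large collections. Locality-sensitive hashing then buckets these signatures so that only likely-similar pairs need to be compared.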
