Thursday, March 31, 2011

Building Decision Trees in Python - O'Reilly Media


Decision Trees & Data Mining | Tutorial & Resources


Machine Learning, etc: Graphical Models class notes

  • Jeff Bilmes' Graphical Models course , gives detailed introduction on Lauritzen's results, proof of Hammersley-Clifford results (look in "old scribes" section for notes)
  • Kevin Murphy's Graphical Models course , good description of I-maps
  • Sam Roweis' Graphical Models course , good introduction on exponential families
  • Stephen Wainwright's Graphical Models class, exponential families, variational derivation of inference
  • Michael Jordan's Graphical Models course, his book seems to be the most popular for this type of class
  • Lauritzen's Graphical Models and Inference class , also his Statistical Inference class , useful facts on exponential families, information on log-linear models, decomposability
  • Donna Precup's class , some stuff on structure learning

Machine Learning, etc: Naive Bayes vs. Logistic Regression


Wednesday, March 30, 2011

MapReduce and Natural Language Processing

Permalink: http://www.52nlp.cn/mapreduce与自然语言处理

I have not been using MapReduce for long and am still very much a beginner, so I am hardly qualified to write about "MapReduce and natural language processing" here. But over the past couple of days I read the IBM developerWorks article "用 MapReduce 解决与云计算相关的 Big Data 问题" (on solving cloud-related Big Data problems with MapReduce), and I found it valuable in two ways. First, intentionally or not, it gives the reader a set of MapReduce references that are not only useful but also well organized. Second, although it never discusses natural language processing directly, its closing "next steps" section lists NLP-related resources for readers interested in applying MapReduce to text processing, and those resources are quite valuable. So, combining that article with my own limited experience, I will try to offer a starting point on "MapReduce (or parallel algorithms) and natural language processing", in the hope that others will improve on it.
MapReduce is a parallel programming paradigm defined by Google, proposed in 2004 by two Google researchers, Jeffrey Dean and Sanjay Ghemawat, both of whom are now Google Fellows. Their original paper is required reading for anyone learning MapReduce:

'The article "MapReduce: Simplified Data Processing on Large Clusters," published by Google engineers, clearly explains how MapReduce works. As a result of this article, many open source MapReduce implementations have appeared since 2004.'

The abstract and HTML slides for the paper are also available on Google Labs:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

The inventors of MapReduce, Jeffrey Dean and Sanjay Ghemawat, revised the paper in 2008 and published it in Communications of the ACM; interested readers can find it in the ACM Portal: MapReduce: simplified data processing on large clusters.
In addition, Google Code University hosts a fairly formal MapReduce tutorial, Introduction to Parallel Programming and MapReduce. It is somewhat more theoretical, but well worth reading.
As for open source implementations of MapReduce, the best known is Hadoop, the earliest MapReduce implementation to appear after Google's own system, written in Java. If you are interested in learning Hadoop, then besides the official Hadoop documentation there is the book Hadoop: The Definitive Guide, which every resource I have seen recommends and which the IBM article also lists in its references; it really is good. An electronic copy can also be found via Google; readers can help themselves.
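The model described in the abstract above (a map function emitting intermediate key/value pairs, and a reduce function merging the values that share a key) can be sketched in a few lines of plain, single-machine Python. This is the word-count example from the original paper, shown only to make the contract concrete; it is not a distributed implementation:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: document name (unused here), value: document contents
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: every count emitted for that word
    yield key, sum(values)

def mapreduce(inputs, map_fn, reduce_fn):
    # "Shuffle" phase: sort intermediate pairs so equal keys are
    # adjacent, then group them and hand each group to the reducer.
    intermediate = [pair for k, v in inputs for pair in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))
    return dict(
        result
        for key, group in groupby(intermediate, key=itemgetter(0))
        for result in reduce_fn(key, (v for _, v in group))
    )

counts = mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn)
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```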
Next comes "MapReduce and natural language processing". The IBM article lists the following NLP references:
· Natural Language Processing with Python, a very accessible introduction to NLP.
· There is also an article on NLP itself.
· The PDF document how to do large-scale NLP with NLTK and Dumbo.
· In "Charming Python: Get started with the Natural Language Toolkit" (可爱的 Python: 自然语言工具包入门), David Mertz explains how to do computational linguistics in Python.
I will add one more book: Data-Intensive Text Processing with MapReduce. Its author, Jimmy Lin, is an associate professor at the University of Maryland who has published a number of NLP papers that use MapReduce via Hadoop. Readers still short of papers should hurry: "parallel algorithms and natural language processing" is a good angle of attack (just kidding!). Jimmy Lin also provides an electronic version of the book on that page.
Finally, here are the other learning resources listed by IBM; they are valuable, and this also serves as a backup for myself:
· The developerWorks article on using MapReduce and load balancing in the cloud provides more information about using MapReduce in the cloud and discusses load balancing.
· The tutorial how to install a distribution for Hadoop on a single Linux node explains how to install a Hadoop distribution on a single Linux node.
· "A Comparison of Approaches to Large-Scale Data Analysis" compares the basic control flow of MapReduce and parallel SQL DBMSs in detail.
· The developerWorks series "Distributed data processing with Hadoop" helps you start developing applications, going from single-node to multi-node support.
· Learn more about the topics covered in this article:
Erlang overview
MapReduce and parallel programming
Amazon Elastic MapReduce
Hive and Amazon Elastic MapReduce
Writing parallel applications (for thread monkeys)
10 minutes to parallel MapReduce in Python
Implementing MapReduce in multiprocessing
Disco, an alternative to Hadoop
Hadoop: The Definitive Guide: MapReduce for the Cloud
Business intelligence: Crunch data with Hadoop (MapReduce)
· In the developerWorks cloud developer resources, find the knowledge and experience that application and service developers need to build cloud deployment projects, and share your own.

Note: original article; when reposting, please credit the source "我爱自然语言处理" (52nlp): http://www.52nlp.cn



--


Parallel Machine Learning for Hadoop/Mapreduce – A Python Example


Dumbo

Home - GitHub
Dumbo is a project that allows you to easily write and run Hadoop programs in Python (it's named after Disney's flying circus elephant, since the logo of Hadoop is an elephant and Python was named after the BBC series "Monty Python's Flying Circus"). More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.



Documentation

Writing An Hadoop MapReduce Program In Python @ Michael G. Noll

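Michael Noll's article above builds a word counter for Hadoop Streaming, where the mapper and reducer are separate scripts that read tab-separated lines on stdin and write them to stdout, with Hadoop sorting the mapper output by key in between. A sketch in that style, written as plain functions over line iterables so the two stages can also be exercised locally (the exact wiring into stdin/stdout scripts is left out):

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: emit one "word\t1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so equal words
    # arrive on adjacent lines and can be summed with groupby.
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort (shuffle) -> reduce:
mapped = sorted(mapper(["to be or not", "to be"]))
result = list(reducer(mapped))
# result == ["be\t2", "not\t1", "or\t1", "to\t2"]
```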

Sunday, March 27, 2011

clustering - Which machine learning library to use - Stack Overflow

http://stackoverflow.com/questions/2915341/which-machine-learning-library-to-use

Some Notes on SVMs from Stack Overflow

This is a very good illustration of how an SVM works.
The original is from math.stackexchange.com.

The first answer:

As mokus explained, practical support vector machines use a non-linear kernel function to map data into a feature space where they are linearly separable:

(Image: an SVM mapping one feature space into another.)

Lots of different functions are used for various kinds of data. Note that an extra dimension is added by the transformation.

(Illustration from Chris Thornton, U. Sussex.)


The second answer links to a YouTube video illustrating linearly inseparable points that become separable by a plane when mapped to a higher dimension, and to "Non-linear classification" at http://en.wikipedia.org/wiki/Kernel_trick, which explains the technique more generally.
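A tiny numeric illustration of that idea (made-up data, not taken from either answer): 1-D points labeled by whether they fall inside [-1, 1] cannot be split by any single threshold on x, since the positive points sit between the negative ones, but mapping each point to (x, x²) adds a dimension in which the line x² = 1 separates the classes:

```python
# 1-D points: class +1 inside [-1, 1], class -1 outside.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1 if -1.0 <= x <= 1.0 else -1 for x in points]

def feature_map(x):
    # The transformation adds an extra dimension, as noted above.
    return (x, x * x)

# In (x, x^2) space, the horizontal line x2 = 1 splits the classes:
separable = all(
    (feature_map(x)[1] < 1.0) == (y == 1) for x, y in zip(points, labels)
)
```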

And for SVMs, a very good explanation of the hyperplane equation w*x + b = 0 comes from this question:
language agnostic - Support vector machines - separating hyperplane question - Stack Overflow

It is the equation of a (hyper)plane using a point and normal vector.
Think of the plane as the set of points P such that the vector from P0 to P is perpendicular to the normal.
Check out these pages for explanation:
http://mathworld.wolfram.com/Plane.html
http://en.wikipedia.org/wiki/Plane_%28geometry%29#Definition_with_a_point_and_a_normal_vector
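The point-and-normal description above also yields the standard distance computation: since w·P = -b for any point P on the plane, the signed distance from a point x0 to the hyperplane w·x + b = 0 is (w·x0 + b) / ||w||. A small sketch:

```python
import math

def signed_distance(w, b, x0):
    # Project (x0 - P) onto the unit normal w/||w||; because w.P = -b
    # for any point P on the plane, this reduces to (w.x0 + b) / ||w||.
    dot = sum(wi * xi for wi, xi in zip(w, x0))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm

# Plane x + y = 1 in 2-D: w = (1, 1), b = -1.
d = signed_distance((1.0, 1.0), -1.0, (1.0, 1.0))
# d == (1 + 1 - 1) / sqrt(2) = 1/sqrt(2)
```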


--


libsvm compute distance to hyperplane python

Some collected resources on how to compute the distance to the hyperplane in LibSVM from Python:


https://github.com/bwallace/curious_snake

From Gábor Melis' blog:

Active Learning for cl-libsvm


Another source:
http://agbs.kyb.tuebingen.mpg.de/km/bb/showthread.php?tid=1022

RE: Distance to hyperplane (/computing w) from Python?
(24-11-2008 05:27 PM)wallace Wrote:  One question though; I've noticed the si->obj value will at times be negative, and so therefore |w|^2 is as well. This seems rather odd to me. Am I perhaps doing something wrong? Thanks again.

si->obj is

    0.5 alpha^T Q alpha - sum_i alpha_i
  = -(primal obj)
  = -(w^T w / 2 + C sum_i xi_i)
 <= 0

What you should use is that

    alpha^T Q alpha = w^T w

In Solve of svm.cpp, we have

    g = Q alpha - e

available. By calculating alpha^T (g + e) you get w^T w.

Hope this helps.
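The identity at the heart of that reply, alpha^T Q alpha = w^T w (for a linear kernel, where Q_ij = y_i y_j x_i·x_j and w = sum_i alpha_i y_i x_i), holds by construction and is easy to check numerically on a made-up toy problem:

```python
# Toy 2-point, 2-D check that alpha^T Q alpha == w^T w.
xs = [(1.0, 2.0), (3.0, -1.0)]
ys = [1.0, -1.0]
alphas = [0.5, 0.25]  # arbitrary non-negative multipliers

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Q_ij = y_i * y_j * <x_i, x_j>
Q = [[ys[i] * ys[j] * dot(xs[i], xs[j]) for j in range(2)] for i in range(2)]

# w = sum_i alpha_i * y_i * x_i
w = [sum(alphas[i] * ys[i] * xs[i][k] for i in range(2)) for k in range(2)]

aQa = sum(alphas[i] * Q[i][j] * alphas[j] for i in range(2) for j in range(2))
wTw = dot(w, w)
# aQa == wTw, so alpha^T (g + e) with g = Q alpha - e recovers w^T w
```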



--


Discovery Systems Laboratory website


Saturday, March 26, 2011

Au Naturale - Alex Bowe


What are the best blogs about data? - Quora

Below is a snapshot:

* OkTrends http://blog.okcupid.com/
* SNA Projects Blog (LinkedIn) http://sna-projects.com/blog/
* Data Evolution (Dataspora) http://dataspora.com
* FiveThirtyEight http://www.fivethirtyeight.com/
* Brendan O'Connor's Blog http://anyall.org
* Alex Smola http://blog.smola.org
* Undirected Grad (Jurgen Van Gael) http://undirectedgrad.blogspot.com/
* Pete Warden http://petewarden.typepad.com
* Jeffrey Heer http://hci.stanford.edu/jheer/
* Measuring Measures (Bradford Cross) http://measuringmeasures.com
* Info Vegan (Clay Johnson) http://www.infovegan.com
* High scalability http://highscalability.com
* Hilary Mason http://www.hilarymason.com
* Sunlight Labs http://sunlightlabs.com
* Datawrangling http://www.datawrangling.com
* Flowing Data http://flowingdata.com
* Geeking With Greg http://glinden.blogspot.com
* Business|bytes|genes|molecules http://mndoci.com
* Data Mining: Text Mining, Visualization, and Social Media http://datamining.typepad.com
* Machine Learning (Theory) http://hunch.net
* Code - Open Blog - NYTimes http://open.blogs.nytimes.com
* LingPipe Blog http://lingpipe-blog.com/
* Peter Norvig http://norvig.com/
* Infochimps http://blog.infochimps.org
* Joseph Turian http://metaoptimize.com/blog/
* Aaron Koblin http://www.aaronkoblin.com/work....
* igvita.com http://www.igvita.com
* SimpleGeo blog http://blog.simplegeo.com/
* Chris Diehl http://www.cpdiehl.org/blog.html
* Juice Analytics http://www.juiceanalytics.com
* Dolores Labs Blog http://blog.doloreslabs.com
* UMBC eBiquity Blog http://ebiquity.umbc.edu/blogger/
* Random Etc. (Tom Carden) http://www.tom-carden.co.uk
* Michael G. Noll's Blog http://www.michael-noll.com
* John D. Cook http://www.johndcook.com/blog/
* Daniel Lemire http://www.daniel-lemire.com/blog/
* Cerebral Mastication (J.D. Long) http://www.cerebralmastication.com/
* Eager Eyes http://eagereyes.org/
* Ben Lorica (O'Reilly Radar) http://radar.oreilly.com/ben/
* Marginally Interesting (Mikio L. Braun) http://blog.mikiobraun.de/
* Neoformix http://www.neoformix.com/
* Zero Intelligence Agents (Drew Conway) http://www.drewconway.com/zia/
* Bitquill http://www.bitquill.net
* Michael Nielsen http://michaelnielsen.org/blog/
* Ben Fry http://benfry.com/writing/
* Machine Learning Etc (Yaroslav Bulatov) http://yaroslavvb.blogspot.com
* Chris Harrison http://www.chrisharrison.net
* Guardian Data Blog (Simon Rogers): http://www.guardian.co.uk/news/d...
* Surprise & Coincidence (Ted Dunning) http://tdunning.blogspot.com
* Simple Complexity http://simplecomplexity.net
* Semantic Void (anand kishore) http://www.semanticvoid.com/blog/
* matpalm http://matpalm.com/
* Paco Nathan http://ceteri.blogspot.com/
* Lee Byron http://leebyron.com
* manAmplified (Chris K Wensel) http://www.manamplified.org
* Dumbotics http://dumbotics.com
* Road To Failure (Bradford Stephens) http://www.roadtofailure.com/
* Bio and Geo Informatics http://hackmap.blogspot.com/
* kiwitobes http://blog.kiwitobes.com/
* Yet Another Machine Learning Blog http://yamlb.wordpress.com/
* Cloudera Blog http://www.cloudera.com/blog/
* Inductio Ex Machina (Mark Reid) http://mark.reid.name/iem/
* Statistical Modeling, Causal Inference, and Social Science http://www.stat.columbia.edu/~ge...
* atbrox http://atbrox.com
* Bitwiese http://www.bitwiese.de/
* SquareCog http://squarecog.wordpress.com/
* Chris Riccomini http://www.riccomini.name/
* Sematext Blog http://blog.sematext.com/
* Data Miner's Blog http://blog.data-miners.com/
* Saaien tist (Jan Aerts) http://saaientist.blogspot.com/
* Byte Mining (@datajunkie) http://www.bytemining.com/
* Ka-Ping Yee http://zesty.ca/
* Jeffrey Veen http://www.veen.com/jeff/index.html