Sunday, January 23, 2011

Data visualization tools for Linux

A quick look at six open source graphics utilities

M. Tim Jones (mtj@mtjones.com), Senior Principal Software Engineer, Emulex Corp.

Summary: Applications for graphical visualization of data on Linux® are varied, from simple 2-D plots to 3-D surfaces, scientific graphics programming, and graphical simulation. Luckily, there are many open source possibilities, including gnuplot, GNU Octave, Scilab, MayaVi, Maxima, OpenDX, and others. Each has its advantages and disadvantages and targets different applications. Explore a variety of open source graphical visualization tools to better decide which is best for your application. [This article has been updated to include coverage of OpenDX - Ed.]


A short list of visualization tools

In this article, I provide a survey of a number of popular Linux data visualization tools and include some insight into their other capabilities. For example, does the tool provide a language for numerical computation? Is the tool interactive or does it operate solely in batch mode? Can you use the tool for image or digital signal processing? Does the tool provide language bindings to support integration into user applications (such as Python, Tcl, Java programming languages, and so on)? I also demonstrate the tools' graphical capabilities. Finally, I identify the strengths of each tool to help you decide which is best for your computational task or data visualization.

The open source tools that I explore in this article are (with their associated licenses):

  • Gnuplot (Gnuplot copyright, non-GPL)
  • GNU Octave (GPL)
  • Scilab (Scilab)
  • MayaVi (BSD)
  • Maxima (GPL)
  • OpenDX (IBM Public License)

Gnuplot

Gnuplot is a great visualization tool that has been around since 1986. It's hard to read a thesis or dissertation without running into a gnuplot graph. Although gnuplot is command-line driven, it has grown from its humble beginnings to support a number of non-interactive applications, including its use as a plotting engine for GNU Octave.

Gnuplot is portable, operating on UNIX®, Microsoft® Windows®, Mac OS® X, and many other platforms. It supports a range of output formats, from PostScript to the more recent PNG.

Gnuplot can operate in batch mode, providing a script of commands to generate a plot, and also in interactive mode, which allows you to try out its features to see the effect they have on your plot.

A standard math library is also available with gnuplot that corresponds to the UNIX math library. Function arguments support integer, real, and complex types. You can configure the math library for radians or degrees (the default is radians).

For plotting, gnuplot can generate 2-D plots with the plot command and 3-D plots (as 2-D projections) with the splot command. With the plot command, gnuplot can operate in rectangular or polar coordinates. The splot command is Cartesian by default but can also support spherical and cylindrical coordinates. You can also apply contours to plots (as shown in Figure 1, below). A new style for plots, pm3d, supports drawing palette-mapped 3-D and 4-D data as maps and surfaces.

Here's a short gnuplot example that illustrates 3-D function plotting with contours and hidden line removal. Listing 1 shows the gnuplot commands that are used, and Figure 1 shows the graphical result.


Listing 1. Simple gnuplot function plot
set samples 25
set isosamples 26
set title "Test 3D gnuplot"
set contour base
set hidden3d offset 1
splot [-12:12.01] [-12:12.01] sin(sqrt(x**2+y**2))/sqrt(x**2+y**2)

Listing 1 illustrates the simplicity of gnuplot's command set. The sampling rate and density of the plot are determined by samples and isosamples, and a title is provided for the graph with the title parameter. The base contour is enabled along with hidden line removal, and the sinc plot is created with the splot command using the internally available math library functions. The result is Figure 1.
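If you want to see exactly what gnuplot is evaluating, the surface from Listing 1 can be sketched in plain Python (a sketch for illustration only, not part of gnuplot's toolchain); the grid spacing mirrors the isosamples setting, and the r == 0 guard supplies the limit value 1 to avoid a division by zero:

```python
import math

def sinc2d(x, y):
    """The surface from Listing 1: sin(r)/r with r = sqrt(x**2 + y**2)."""
    r = math.hypot(x, y)
    return 1.0 if r == 0.0 else math.sin(r) / r

# Sample the same [-12, 12.01] range on a 26x26 grid, as 'set isosamples 26' would.
n = 26
lo, hi = -12.0, 12.01
grid = [[sinc2d(lo + (hi - lo) * i / (n - 1), lo + (hi - lo) * j / (n - 1))
         for j in range(n)]
        for i in range(n)]

# The surface peaks near the origin and decays in rings away from it.
peak = max(max(row) for row in grid)
```

Because no grid point lands exactly on the origin, the sampled peak falls just short of the true maximum of 1, which is why a denser sampling rate gives a visibly sharper central spike in the plot.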

Figure 1. A simple plot from gnuplot


In addition to creating function plots, gnuplot is also great for plotting data contained in files. Consider the x/y data pairs shown in Listing 2 (an abbreviated version of the file). The data pairs shown in the file represent the x and y coordinates in a two-dimensional space.


Listing 2. Sample data file for gnuplot (data.dat)
56 48
59 29
85 20
93 16
...
56 48

If you want to plot this data in a two-dimensional space, as well as connect each data point with a line, you can use the gnuplot script shown in Listing 3.


Listing 3. Gnuplot script to plot the data from Listing 2
set title "Sample data plot"
plot 'data.dat' using 1:2 t 'data points', \
"data.dat" using 1:2 t "lines" with lines

The result is shown in Figure 2. Note that gnuplot automatically scales the axes, but you're given control over this if you need to position the plot.

Figure 2. A simple plot from gnuplot using a data file


Gnuplot is a great visualization tool that is well known and available as a standard part of many GNU/Linux distributions. However, if you want basic data visualization and numerical computation, then GNU Octave might be what you're looking for.


GNU Octave

GNU Octave is a high-level language, designed primarily for numerical computation, and is a compelling alternative to the commercial Matlab application from The MathWorks. Rather than the simple command set offered by gnuplot, Octave offers a rich language for mathematical programming. You can even write your applications in C or C++ and then interface to Octave.

Octave was originally written around 1992 as companion software for a textbook in chemical reactor design. The authors wanted to help students with reactor design problems, not debugging Fortran programs. The result was a useful language and interactive environment for solving numerical problems.

Octave can operate in a scripted mode, interactively, or through C and C++ language bindings. Octave itself has a rich language that looks similar to C and has a very large math library, including specialized functions for signal and image processing, audio processing, and control theory.

Because Octave uses gnuplot as its backend, anything you can plot with gnuplot you can plot with Octave. Octave does have a richer language for computation, which has its obvious advantages, but you'll still be limited by gnuplot.

In the following example, from the Octave-Forge Web site (SimpleExamples), I plot the Lorenz strange attractor. Listing 4 shows the interactive dialog for Octave on the Windows platform with Cygwin. This example demonstrates the use of lsode, an ordinary differential equation solver.


Listing 4. Visualizing the Lorenz strange attractor with Octave
GNU Octave, version 2.1.50
Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003 John W. Eaton.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or
FITNESS FOR A PARTICULAR PURPOSE. For details, type `warranty'.

Please contribute if you find this software useful.
For more information, visit http://www.octave.org/help-wanted.html

Report bugs to .

>> function y = lorenz( x, t )
y = [10 * (x(2) - x(1));
x(1) * (28 - x(3));
x(1) * x(2) - 8/3 * x(3)];
endfunction
>> x = lsode("lorenz", [3;15;1], (0:0.01:25)');
>> gset parametric
>> gsplot x
>>

The plot shown in Figure 3 is the output from the Octave code shown in Listing 4.
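For comparison, here's a rough Python equivalent of the Octave session (a sketch only): the right-hand side mirrors the lorenz function from Listing 4, but a crude fixed-step Euler integrator stands in for lsode, so the trajectory only approximates what lsode computes.

```python
def lorenz(state):
    """Right-hand side of the ODE, as defined in Listing 4."""
    x, y, z = state
    return (10.0 * (y - x),
            x * (28.0 - z),
            x * y - (8.0 / 3.0) * z)

def euler(f, state, dt, steps):
    """Crude fixed-step Euler integration; lsode is far more accurate."""
    path = [state]
    for _ in range(steps):
        d = f(state)
        state = tuple(s + dt * ds for s, ds in zip(state, d))
        path.append(state)
    return path

# Same initial condition and step size as the Octave call:
# lsode("lorenz", [3;15;1], (0:0.01:25)')
trajectory = euler(lorenz, (3.0, 15.0, 1.0), dt=0.01, steps=2500)
```

Plotting the x/y/z columns of trajectory against each other recovers the familiar butterfly shape; the chaotic details differ from the lsode result because Euler integration accumulates error quickly on this system.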


Figure 3. A Lorenz plot with Octave

GNU Octave (in concert with gnuplot) can emit multiple plots on a single page with the multiplot feature. Using this feature, you define how many plots to create and then define the particular plot using the subwindow command. After the subwindow is defined, you generate your plot normally and then step to the next subwindow (as shown in Listing 5).


Listing 5. Generating multiplots in Octave
>> multiplot(2,2)
>> subwindow(1,1)
>> t=0:0.1:6.0
>> plot(t, cos(t))
>> subwindow(1,2)
>> plot(t, sin(t))
>> subwindow(2,1)
>> plot(t, tan(t))
>> subwindow(2,2)
>> plot(t, tanh(t))

The resulting multiplot page is shown in Figure 4. This is a great feature for collecting related plots together to compare or contrast.
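The four series from Listing 5 are easy to check outside Octave; this Python sketch (illustration only) builds the same sample grid, t = 0:0.1:6.0, and evaluates each function per subwindow. Note that tan(t) has asymptotes near pi/2 and 3*pi/2 inside this range, which is why autoscaling makes that panel look spiky.

```python
import math

# The same sample grid as Listing 5: t = 0:0.1:6.0 (61 points).
t = [round(0.1 * i, 10) for i in range(61)]

# One series per subwindow in the 2x2 multiplot.
series = {
    "cos":  [math.cos(v) for v in t],
    "sin":  [math.sin(v) for v in t],
    "tan":  [math.tan(v) for v in t],   # spikes near pi/2 and 3*pi/2
    "tanh": [math.tanh(v) for v in t],
}
```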


Figure 4. A multiplot with GNU Octave

You can think about Octave as a high-level language with gnuplot as the backend for visualization. It provides a rich math library and is a great free replacement for Matlab. It's also extensible, with packages developed by users for speech processing, optimization, symbolic computation, and more. Octave is included in some GNU/Linux distributions, such as Debian, and can also be used on Windows with Cygwin and on Mac OS X. See the Resources section for more information on Octave.


Scilab

Scilab is similar to GNU Octave in that it enables numerical computation and visualization. Scilab is an interpreter and a high-level language for engineering and scientific applications that is in use around the world.

Scilab originated in 1994 and was developed by researchers at Institut national de recherche en informatique et en automatique (INRIA) and École Nationale des Ponts et Chaussées (ENPC) in France. Since 2003, Scilab has been maintained by the Scilab Consortium.

Scilab includes a large library of math functions and is extensible for programs written in high-level languages, such as C and Fortran. It also includes the ability to overload data types and operations. It includes an integrated high-level language, but has some differences from C.

A number of toolboxes are available for Scilab that provide 2-D and 3-D graphics and animation, optimization, statistics, graphs and networks, signal processing, a hybrid dynamic systems modeler and simulator, and many other community contributions.

You can use Scilab on most UNIX systems, as well as the more recent Windows operating systems. Like GNU Octave, Scilab is well documented. Because it is a European project, you can also find documentation and articles in a number of languages other than English.

When Scilab is started, a window is displayed that allows you to interact with the interpreter (see Figure 5).


Figure 5. Interacting with Scilab

In this example, I create a vector (t) with values ranging from 0 to 2π (with a step size of 0.2). I then generate a 3-D plot (using z=f(x,y), the surface height at each point xi,yi). Figure 6 shows the resulting plot.
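The evaluation pattern can be sketched in Python (illustration only; the actual Scilab commands appear in the Figure 5 screenshot, so the surface function f below is a stand-in, not the one from the session):

```python
import math

# t = 0:0.2:2*pi, as in the Scilab session (32 samples).
step = 0.2
t = []
v = 0.0
while v <= 2 * math.pi:
    t.append(v)
    v += step

# z(i,j) = f(t(i), t(j)): evaluate the surface height at each grid point.
# f itself is a hypothetical stand-in -- the real function is in the screenshot.
def f(x, y):
    return math.sin(x) * math.cos(y)

z = [[f(x, y) for y in t] for x in t]
```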


Figure 6. The resulting Scilab plot from the commands in Figure 5

Scilab includes a large number of libraries and functions that can generate plots with a minimum of complexity. Take the example of generating a simple three-dimensional histogram plot:

-->hist3d(5*rand(5,5));

First, rand(5,5) builds a 5×5 matrix of random values (which I scale to a maximum of 5). This matrix is passed to the hist3d function. The result is the histogram plot shown in Figure 7.
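The data preparation is simple enough to mirror in Python (a sketch with hypothetical names, not Scilab's API): Scilab's rand(5,5) draws uniform values in [0,1), so scaling by 5 maps them to [0,5), and hist3d then renders each matrix entry as the height of one 3-D bar.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Scilab's rand(5,5) draws uniform values in [0,1); scaling by 5 maps them to [0,5).
matrix = [[5 * random.random() for _ in range(5)] for _ in range(5)]

# hist3d renders each matrix entry as the height of one 3-D bar,
# so the plot in Figure 7 shows 5 x 5 = 25 bars with heights in [0, 5).
heights = [v for row in matrix for v in row]
```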


Figure 7. Generating a random three-dimensional histogram plot

Scilab and Octave are similar. Both have a large base of community participation. Scilab is written in Fortran 77, whereas Octave is written in C++. Octave uses gnuplot for its visualization; Scilab provides its own. If you're familiar with Matlab, then Octave is a good choice because it strives for compatibility. Scilab includes many math functions and is very good for signal processing. If you're still not sure which one to use, try them both. They're both great tools, and you may find yourself using each of them for different tasks.


MayaVi

MayaVi, which means magician in Sanskrit, is a data visualization tool that binds Python with the powerful Visualization Toolkit (VTK) for graphical display. MayaVi also provides a graphical user interface (GUI) developed with the Tkinter module, Python's interface to the Tk toolkit.

MayaVi was originally developed as a visualization tool for Computational Fluid Dynamics (CFD). After its usefulness in other domains was realized, it was redesigned as a general scientific data visualizer.

The power behind MayaVi is VTK. VTK is an open source system for data visualization and image processing that is widely used in the scientific community. VTK packs an amazing set of capabilities with scripting interfaces for Tcl/Tk, the Java programming language, and Python, in addition to C++ libraries. VTK is portable to a number of operating systems, including UNIX, Windows, and Mac OS X.

The MayaVi shell around VTK can be imported as a Python module from other Python programs and scripted through the Python interpreter. The Tkinter GUI provided by MayaVi allows the configuration and application of filters, as well as manipulation of the lighting effects on the visualization.

Figure 8 is an example visualization using MayaVi on the Windows platform.


Figure 8. 3-D visualization with MayaVi (CT heart scan data)

MayaVi is an interesting example of extending the VTK in the Python scripting language.


Maxima

Maxima is a full symbolic and numerical computation program in the vein of Octave and Scilab. The initial development of Maxima began in the late 1960s at Massachusetts Institute of Technology (MIT), and it continues to be maintained today. The original version (a computer algebra system) was called DOE Macsyma and led the way for later development of more commonly known applications such as Mathematica.

Maxima provides a nice set of capabilities that you'd expect (such as differential and integral calculus, solving linear systems and nonlinear sets of equations) along with symbolic computing abilities. You can write programs in Maxima using traditional loops and conditionals. You'll also find a hint of Lisp in Maxima (from functions such as quoting, map and apply). Maxima is written in Lisp, and you can execute Lisp code within a Maxima session.

Maxima has a nice online help system that is hypertext based. For example, if you want to know how a particular Maxima function works, you can simply type example(desolve); and it provides a number of example usages.

Maxima also has some interesting features such as rules and patterns. These rules and patterns are used by the simplifier to simplify expressions. Rules can also be used for commutative and noncommutative algebra.

Maxima is much like Octave and Scilab in that an interpreter is available to interact with the user, and the results are provided directly in the same window or popped up in another. In Figure 9, I request a plot of a simple 3-D graph.


Figure 9. Interacting with Maxima

The resulting plot is shown in Figure 10.


Figure 10. The resulting Maxima plot from the commands in Figure 9


Open Data Explorer (OpenDX)

An overview of visualization tools wouldn't be complete without a short introduction to Open Data Explorer (OpenDX). OpenDX is an open source version of IBM's powerful Visualization Data Explorer. This tool was first released in 1991 as the Visualization Data Explorer and is now available as open source, both for data visualization and for building flexible data visualization applications.

OpenDX has a number of unique features, but its architecture is worth mentioning. OpenDX uses a client/server model, where the client and server applications can reside on separate hosts. This allows the server to run on a system designed for high-powered number crunching (such as a shared memory multi-processor) with clients running separately on lesser hosts designed more for graphical rendering. OpenDX even allows a problem to be divided amongst a number of servers to be crunched simultaneously (even heterogeneous servers).

OpenDX supports a visual data-flow programming model that allows the visualization program to be defined graphically (see Figure 11). Each of the tabs defines a "page" (similar to a function). The data is processed by the transformations shown: for example, the middle "Collect" module collects input objects into a group and then passes them on (in this case, to the "image" module, which displays the image, and the "AutoCamera" module, which specifies how to view the image).


Figure 11. Visual Programming with OpenDX

OpenDX even includes a module builder that can help you build custom modules.

Figure 12 shows a sample image that was produced with OpenDX (this one is taken from the Physical Oceanography tutorial for OpenDX from Dalhousie University). The data represents land topology as well as water depths (bathymetry).


Figure 12. Data Visualization with OpenDX

OpenDX is by far the most flexible and powerful data visualizer that I've explored here, but it's also the most complicated. Luckily, numerous tutorials (and books) have been written to bring you up to speed, and are provided in the Resources section.


Going further

I've just introduced a few of the open source GNU/Linux visualization tools in this article. Other useful tools include Gri, PGPLOT, SciGraphica, plotutils, NCAR Graphics, and ImLib3D. All are open source, allowing you to see how they work and modify them if you wish. Also, if you're looking for a great graphical simulation environment, check out Open Dynamics Engine (ODE) coupled with OpenGL.

Your needs determine which tool is best for you. If you want a powerful visualization system with a huge variety of visualization algorithms, then MayaVi is the one for you. For numerical computation with visualization, GNU Octave and Scilab fit the bill. If you need symbolic computation capabilities, Maxima is a useful alternative. Last, but not least, if basic plotting is what you need, gnuplot works nicely.


Resources

Get products and technologies

  • The gnuplot home page is the place for gnuplot software downloads and documentation. You can also find a demo gallery to help you figure out what's possible with gnuplot and how to tailor these recipes for your application.

  • GPlot is a Perl wrapper for Gnuplot. It's written by Terry Gliedt and may help you if you find Gnuplot to be complicated or unfriendly. GPlot loses some of the flexibility of Gnuplot, but extends many of the most common options in a much simpler way.

  • GNU Octave is a high-level language for numerical computation that uses gnuplot as its graphical engine. It's a great alternative to the commercial Matlab software. Its Web site contains downloads and access to a wide range of documentation.

  • You can download the MayaVi Data Visualizer at SourceForge.net. You can also find documentation here, as well as a list of features that MayaVi provides for VTK.

  • The Visualization Toolkit (VTK) is a powerful open source software system for 3-D computer graphics, image processing, and visualization. You'll find software, documentation, and lots of helpful links for using VTK on this site.

  • Scilab is a free scientific software package for numerical computation and graphical visualization. At this site, you'll find the latest version of Scilab, as well as documentation and other user information (such as how to contribute to the project).

  • Maxima is another alternative to Maple and Mathematica, in addition to the open source alternatives, Octave and Scilab. It has a distinguished lineage and supports not only numerical capabilities, but also symbolic computation with inline Lisp programming.

  • The Open Data Explorer is an open source version of IBM's powerful data visualization and application development package that's a must for hardcore scientific visualizations.

  • The NCAR Graphics home page provides a stable UNIX package for drawing contours, maps, surfaces, weather maps, x-y plots, and many others.

  • Gri is a high-level language for scientific graphics programming. You can use it to construct x-y graphs, contour plots, and image graphs with fine control over graphing attributes.

  • SciGraphica is great for data analysis and technical graphics.

  • The ImLib3D library is an open source package for 3-D volumetric image processing that strives for simplicity.

  • ODE is an open physics engine that's perfect for physical systems modeling. Combine this with Open/GL and you have a perfect environment for graphical simulation.

  • The ROOT system is a newer object-oriented data analysis framework. ROOT is a fully featured framework with more than 310 classes covering architecture and analysis behaviors.


Saturday, January 22, 2011


C++

Make sure you know your basic data structures and algorithms. You're more likely to be asked about that stuff than something higher up the food chain. Those are usually saved for the in-person interview.

Put another way: be solid with the fundamentals and solid with your C++ syntax. Also, knowledge of common libraries like STL and Boost couldn't hurt...but be sure you know what those libraries give you! In the end, phone screens are there to cull out people who can't do the basics. Prove you can, and you should move on to the next step. Good luck!

Here are some links to interview questions to check out:

Now, for completion's sake, some books:

Git

Google Tech Talk: Linus Torvalds on git
http://www.youtube.com/watch?v=4XpnKHJAok8

The Git Wiki's comparison page
http://git.or.cz/gitwiki/GitSvnComparsion

3 Reasons to Switch to Git from Subversion
http://markmcb.com/2008/10/18/3-reasons-to-switch-to-git-from-subversion/

Hal Daume III: NIPS 2010 Retrospective

From: http://nlpers.blogspot.com/2011/01/nips-2010-retrospective.html#links

Happy New Year and I know I've been silent but I've been busy.  But no teaching this semester (YAY!) so maybe you'll see more posts.

At any rate, I'm really late to the table, but here are my comments about this past year's NIPS.  Before we get to that, I hope that everyone knows that this coming NIPS will be in Granada, and then for (at least) the next five years will be in Tahoe.  Now that I'm not in ski-land, it's nice to have a yearly ski vacation ... erm I mean scientific conference.

But since this was the last year of NIPS in Vancouver, I thought I'd share a conversation that occurred this year at NIPS, with participants anonymized.  (I hope everyone knows to take this in good humor: I'm perfectly happy to poke fun at people from the States, too...).  The context is that one person in a large group, which was going to find lunch, had a cell phone with a data plan that worked in Canada:

A: Wow, that map is really taking a long time to load.
B: I know.  It's probably some socialized Canadian WiFi service.
C: No, it's probably just slow because every third bit has to be a Canadian bit?
D: No no, it's because every bit has to be sent in both English and French!
Okay it's not that funny, but it was funny at the time.  (And really "B" is as much a joke about the US as it was about Canada :P.)

But I'm sure you are here to hear about papers, not stupid Canada jokes.  So here's my take.

The tutorial on optimization by Stephen Wright was awesome.  I hope this shows up on videolectures soon. (Update: it has!) I will make it required reading / watching for students.  There's just too much great stuff in it to go in to, but how about this: momentum is the same as CG!  Really?!?!  There's tons of stuff that I want to look more deeply into, such as robust mirror descent, some work by Candes about SVD when we don't care about near-zero SVs, regularized stochastic gradient (Xiao) and sparse eigenvector work.  Lots of awesome stuff.  My favorite part of NIPS.

Some papers I saw that I really liked:

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): Formalizes boosting in a multiclass setting.  The crux is a clever generalization of the "weak learning" notion from binary.  The idea is that a weak binary classifier is one that has a small advantage over random guessing (which, in the binary case, gives 50/50).  Generalize this and it works.

Structured sparsity-inducing norms through submodular functions (Francis Bach): I need to read this.  This was one of those talks where I understood the first half and then got lost.  But the idea is that you can go back-and-forth between submodular functions and sparsity-inducing norms.

Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher): The title says it all!  It's an alternative construction to the Polya urn scheme and also to the stick-breaking scheme.

A Reduction from Apprenticeship Learning to Classification (Umar Syed, Robert Schapire): Right up my alley, some surprising results about apprenticeship learning (aka Hal's version of structured prediction) and classification.  Similar to a recent paper by Stephane Ross and Drew Bagnell on Efficient Reductions for Imitation Learning.

Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): When you have complex combinatorial spaces (think traveling salesman), how can you construct generic variational inference algorithms?

Implicit Differentiation by Perturbation (Justin Domke): This is a great example of a paper that I never would have read, looked at, seen, visited the poster of, known about etc., were it not for serendipity at conferences (basically Justin was the only person at his poster when I showed up early for the session, so I got to see this poster).  The idea is if you have a graphical model, and some loss function L(.) which is defined over the marginals mu(theta), where theta are the parameters of the model, and you want to optimize L(mu(theta)) as a function of theta.  Without making any serious assumptions about the form of L, you can actually do gradient descent, where each gradient computation costs two runs of belief propagation.  I think this is amazing.
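To get a feel for the "differentiation by perturbation" idea, here is a generic central-difference gradient sketch (my own illustration of the flavor, not the paper's construction, which gets each gradient from two runs of belief propagation rather than two function evaluations per coordinate):

```python
def num_gradient(f, theta, eps=1e-5):
    """Central-difference gradient of f at theta: two f evaluations per coordinate."""
    grad = []
    for i in range(len(theta)):
        up = list(theta); up[i] += eps
        dn = list(theta); dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

# Toy loss standing in for L(mu(theta)); in the paper's setting, evaluating the
# loss at a perturbed theta means re-running belief propagation for the marginals.
loss = lambda th: th[0] ** 2 + 3 * th[1]
g = num_gradient(loss, [2.0, 1.0])
```

The appeal of the paper's result is that the cost stays at two propagation runs per gradient regardless of the number of parameters, unlike the naive per-coordinate scheme above.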

Probabilistic Deterministic Infinite Automata (David Pfau, Nicholas Bartlett, Frank Wood): Another one where the title says it all.  DP-style construction of infinite automata.

Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): The idea here is to define a regression function over a graph.  It should be regularized in a sensible way.  Very LASSO-esque model, as you might expect given the author list :).

Other papers I saw that I liked but not enough to write mini summaries of:

Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy)
Tree-Structured Stick Breaking for Hierarchical Data (Ryan Adams, Zoubin Ghahramani, Michael Jordan)
Categories and Functional Units: An Infinite Hierarchical Model for Brain Activations (Danial Lashkari, Ramesh Sridharan, Polina Golland)
Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)
Joint Analysis of Time-Evolving Binary Matrices and Associated Documents (Eric Wang, Dehong Liu, Jorge Silva, David Dunson, Lawrence Carin)
Learning Efficient Markov Networks (Vibhav Gogate, William Webb, Pedro Domingos)
Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)

Two students who work with me (though one isn't actually mine :P), who went to NIPS also shared their favorite papers.  The first is a list from Avishek Saha:

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire)

Repeated Games against Budgeted Adversaries (Jacob Abernethy, Manfred Warmuth)

Non-Stochastic Bandit Slate Problems (Satyen Kale, Lev Reyzin, Robert Schapire)

Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)

Learning Bounds for Importance Weighting (Corinna Cortes, Yishay Mansour, Mehryar Mohri)

Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)

The second list is from Piyush Rai, who apparently aimed for recall (though not with a lack of precision) :P:

Online Learning: Random Averages, Combinatorial Parameters, and Learnability (Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari): defines several complexity measures for online learning akin to what we have for the batch setting (e.g., Rademacher averages, covering numbers etc).

Online Learning in The Manifold of Low-Rank Matrices (Uri Shalit, Daphna Weinshall, Gal Chechik): nice general framework applicable in a number of online learning settings. could also be used for online multitask learning.

Fast global convergence rates of gradient methods for high-dimensional statistical recovery (Alekh Agarwal, Sahand Negahban, Martin Wainwright): shows that the properties of sparse estimation problems that lead to statistical efficiency also lead to computational efficiency which explains the faster practical convergence of gradient methods than what the theory guarantees.

Copula Processes (Andrew Wilson, Zoubin Ghahramani): how do you determine the relationship between random variables which could have different marginal distributions (say one has gamma and the other has gaussian distribution)? copula process gives an answer to this.

Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): usually undirected graph structure learning involves a set of random variables y drawn from a distribution p(y). but what if y depends on another variable x? this paper is about learning the graph structure of the distribution p(y|x=x).

Structured sparsity-inducing norms through submodular functions (Francis Bach): standard sparse recovery uses l1 norm as a convex proxy for the l0 norm (which constrains the number of nonzero coefficients to be small). this paper proposes several more general set functions and their corresponding convex proxies, and links them to known norms.

Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum): an interesting paper -- what if in an online learning setting you could abstain from making a prediction on some of the training examples and just say "i don't know"? on others, you may or may not make the correct prediction. lies somewhere in the middle of always predicting right or wrong (i.e., standard mistake driven online learning) versus the recent work on only predicting correctly or otherwise saying "i don't know".

Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): cool paper. applicable to lots of settings.

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): we know that boosting in binary case requires "slightly better than random" weak learners. this paper characterizes conditions on the weak learners for the multi-class case, and also gives a boosting algorithm.

Multitask Learning without Label Correspondences (Novi Quadrianto, Alexander Smola, Tiberio Caetano, S.V.N. Vishwanathan, James Petterson): usually mtl assumes that the output space is the same for all the tasks but in many cases this may not be true. for instance, we may have two related prediction problems on two datasets but the output spaces for both may be different and may have some complex (e.g., hierarchical, and potentially time varying) output spaces. the paper uses a mutual information criteria to learn the correspondence between the output spaces.

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty (Yi Zhang, Jeff Schneider): presents a general multitask learning framework and many recently proposed mtl models turn out to be special cases. models both feature covariance and task covariance matrices.

Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions (Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal): the title says it all. :) multiple kernel learning is usually applied in the classification setting, but since the proposed method works with a wide variety of loss functions, one can possibly also use it for unsupervised learning problems (e.g., spectral clustering, kernel pca, etc.).

Getting lost in space: Large sample analysis of the resistance distance (Ulrike von Luxburg, Agnes Radl, Matthias Hein): large sample analysis of the commute distance: shows the rather surprising result that the commute distance between two vertices becomes meaningless when the graph is "large" and the nodes represent high dimensional variables. the paper proposes a correction and calls it the "amplified commute distance".

A Bayesian Approach to Concept Drift (Stephen Bach, Mark Maloof): gives a bayesian approach for segmenting a sequence of observations such that each "block" of observations has the same underlying concept.

MAP Estimation for Graphical Models by Likelihood Maximization (Akshat Kumar, Shlomo Zilberstein): they show that you can think of an mrf as a mixture of bayes nets, and then the map problem on the mrf corresponds to solving a form of the maximum likelihood problem on the bayes net. em can be used to solve this pretty fast. they say that you can use this method with the max-product lp algorithms to yield even better solutions, with quicker convergence.

Energy Disaggregation via Discriminative Sparse Coding (J. Zico Kolter, Siddharth Batra, Andrew Ng): about how sparse coding could be used to save energy. :)

Semi-Supervised Learning with Adversarially Missing Label Information (Umar Syed, Ben Taskar): standard ssl assumes that labels for the unlabeled data are missing at random but in many practical settings this isn't actually true. this paper gives an algorithm to deal with the case when the labels could be adversarially missing.

Multi-View Active Learning in the Non-Realizable Case (Wei Wang, Zhi-Hua Zhou): shows that (under certain assumptions) exponential improvements in the sample complexity of active learning are still possible if you have a multiview learning setting.

Self-Paced Learning for Latent Variable Models (M. Pawan Kumar, Benjamin Packer, Daphne Koller): an interesting paper, somewhat similar in spirit to curriculum learning. basically, the paper suggests that in learning a latent variable model, it helps if you provide the algorithm easy examples first.

More data means less inference: A pseudo-max approach to structured learning (David Sontag, Ofer Meshi, Tommi Jaakkola, Amir Globerson): a pseudo-max approach to structured learning: this is somewhat along the lines of the paper on svm's inverse dependence on training size from icml a couple of years back. :)

Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning (Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman): selecting the most uncertain example in a pool based active learning can be expensive if the number of candidate examples is very large. this paper suggests some hashing tricks to expedite the search.

Active Instance Sampling via Matrix Partition (Yuhong Guo): frames batch mode active learning as a matrix partitioning problem and proposes a local optimization technique for it.

A Discriminative Latent Model of Image Region and Object Tag Correspondence (Yang Wang, Greg Mori): it's kind of doing correspondence lda on image+captions but they additionally infer the correspondences between tags and objects in the images, and show that this gives improvements over corr-lda.

Factorized Latent Spaces with Structured Sparsity (Yangqing Jia, Mathieu Salzmann, Trevor Darrell): a multiview learning algorithm that uses sparse coding to learn shared as well as private features of different views of the data.

Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy): extends lda to the case when you have access to features for each word in the vocabulary.

MiloEngineeringProblem

http://milo.com/images/MiloEngineeringProblem.pdf

Friday, January 21, 2011

japerk / nltk-extras / overview – Bitbucket

japerk/nltk-trainer - GitHub

Train NLTK objects with zero code

Python NLTK Demos

Natural Language Processing World: Announcing Python NLTK Demos

Announcing Python NLTK Demos

Below is a post from the StreamHacker.com blog presenting a demo of some features of the NLTK tool.

If you want to see what NLTK can do, but don't want to go thru the effort of installation and learning how to use it, then check out my Python NLTK demos.
It currently demonstrates the following functionality:
If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.

Other Natural Language Processing Demos

Here's a list of similar resources on the web:

Use Weka in RCAC

When checking out the home directory, there is a hidden file:
$ ls -al

Find the .modulesbeginenv file:

-bash-3.2$ less .modulesbeginenv

MODULE_VERSION_STACK=3.1.6
HOSTNAME=coates-fe01.rcac.purdue.edu
TERM=vt100
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=128.46.225.12 2393 22
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
SSH_TTY=/dev/pts/10
USER=wdi
RCAC_SCRATCH=/scratch/scratch96/w/wdi
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
SSH_AUTH_SOCK=/tmp/ssh-cVtAA14437/agent.14437
RCAC_SCRATCH_NEW=/scratch/lustreA/w/wdi
MODULE_VERSION=3.1.6
MAIL=/var/spool/mail/wdi
PATH=/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/pbs/bin
INPUTRC=/etc/inputrc
PWD=/home/ba01/u117/wdi
LANG=C
MODULEPATH=/apps/rhel5/modules/versions:/apps/rhel5/modules$MODULE_VERSION/modulefiles:/opt/modules/localmodules:/opt/modules/modulefiles:
LOADEDMODULES=
SHLVL=1
HOME=/home/ba01/u117/wdi
LOGNAME=wdi
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=128.46.225.12 2393 128.211.136.11 22
MODULESHOME=/apps/rhel5/modules
LESSOPEN=|/usr/bin/lesspipe.sh %s
DISPLAY=coates-fe01.rcac.purdue.edu:36.0
G_BROKEN_FILENAMES=1
module=() {  eval `/apps/rhel5/modules$MODULE_VERSION/bin/modulecmd bash $*`}
_=/apps/rhel5/modules3.1.6/bin/modulecmd


There is a README file in /apps/rhel5/weka-3-6-0 that may be helpful.


Briefly, from your home directory you can use the command:

java -classpath $CLASSPATH:weka.jar weka.gui.Main

to start the Weka GUI.

LCCC - NIPS10 Learning on Cores, Clusters and Clouds

  • Date and Time: 7:30am - 6:30pm Saturday, December 11, 2010

Schedule

Time Event Speaker
7:30 - 8:00 Opening remarks and overview of the field
Video
John Langford
8:00 - 9:00 Keynote: Averaging algorithms and distributed optimization John N. Tsitsiklis
9:00 - 9:20 Coffee Break and Poster Session
9:20 - 9:45 Optimal Distributed Online Prediction Using Mini-Batches Lin Xiao
9:45 - 10:10 MapReduce/Bigtable for Distributed Optimization Slav Petrov
10:10 - 10:30 Mini Talks Part I
10:30 - 15:30 Poster Session and Ski Break
14:00 - 15:30 Unofficial Tutorial on Vowpal Wabbit
Video
Langford et al.
15:30 - 16:30 Keynote: Machine Learning in the Cloud with GraphLab Carlos Guestrin
16:30 - 16:55 Distributed MAP Inference for Undirected Graphical Models Sameer Singh
16:55 - 17:15 Coffee Break and Poster Session
17:15 - 17:40 Gradient Boosted Decision Trees on Hadoop Jerry Ye
17:40 - 18:00 Mini Talks Part II
18:00 - 18:30 Panel discussion and summary
18:30 Last chance to look at posters

Keynote Speakers


Overview

In the current era of web-scale datasets, high throughput biology and astrophysics, and multilanguage machine translation, modern datasets no longer fit on a single computer and traditional machine learning algorithms often have prohibitively long running times. Parallelized and distributed machine learning is no longer a luxury; it has become a necessity. Moreover, industry leaders have already declared that clouds are the future of computing, and new computing platforms such as Microsoft's Azure and Amazon's EC2 are bringing distributed computing to the masses. The machine learning community has been slow to react to these important trends in computing, and it is time for us to step up to the challenge.

While some parallel and distributed machine learning algorithms already exist, many relevant issues are yet to be addressed. Distributed learning algorithms should be robust to node failures and network latencies, and they should be able to exploit the power of asynchronous updates. Some of these issues have been tackled in other fields where distributed computation is more mature, such as convex optimization and numerical linear algebra, and we can learn from their successes and their failures.

The goals of our workshop are:

  • To draw the attention of machine learning researchers to this rich and emerging area of problems and to establish a community of researchers that are interested in distributed learning.
  • To define a number of common problems for distributed learning (online/batch, synchronous/asynchronous, cloud/cluster/multicore) and to encourage future research that is comparable and compatible
  • To expose the learning community to relevant work in fields such as distributed optimization and distributed linear algebra.
  • To identify research problems that are unique to distributed learning.

Organizers


Program Committee


  • Ron Bekkerman - LinkedIn
  • Misha Bilenko - Microsoft
  • Ran Gilad-Bachrach - Microsoft
  • Guy Lebanon - Georgia Tech
  • Ilan Lobel - NYU
  • Gideon Mann - Google
  • Ryan McDonald - Google
  • Ohad Shamir - Microsoft
  • Alex Smola - Yahoo!
  • S V N Vishwanathan - Purdue
  • Martin Wainwright - UC Berkeley
  • Lin Xiao - Microsoft

Know which version of Python Ubuntu installs

This is how we find out which version of Python Ubuntu has installed:

$ apt-cache policy python
Code:

python:
  Installed: 2.6.5-0ubuntu1
  Candidate: 2.6.5-0ubuntu1
  Version table:
 *** 2.6.5-0ubuntu1 0
        500 http://hk.archive.ubuntu.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
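To get the same information from inside the interpreter itself (handy when several Pythons are installed), the sys module reports the version of whatever Python is actually running -- a quick sketch:

```python
import sys

# Version of the interpreter running this script, which may
# differ from what apt-cache reports as the installed package.
print(sys.version.split()[0])       # e.g. '2.6.5'
print(tuple(sys.version_info[:3]))
```

Note that apt-cache tells you about the packaged Python; sys tells you about the one on your PATH, and the two can disagree if you have a locally built interpreter.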

BOOK: "Machine Learning: An Algorithmic Perspective".

From : http://www-ist.massey.ac.nz/smarsland/MLbook.html

I've written a textbook entitled "Machine Learning: An Algorithmic Perspective". It will be published by CRC Press, part of the Taylor and Francis group, on 2nd April 2009. The book is aimed at computer science and engineering undergraduates studying machine learning and artificial intelligence.
There are lots of Python code examples in the book, and the code is available here. Where special datasets are used they are provided with the code, and there are links to additional datasets at the bottom of the page.

Option 1: Eclipse zip file of all code
Option 2: Choose what you want from here:
Many of the datasets used in the book are available from the UCI Machine Learning Repository.

Choosing a Python Web Framework I – bottling it

Choosing a Python Web Framework I – bottling it

Kyran Dale

Tue, 03 Aug 2010 12:27:47 +0000

A few years ago, when choosing a Python Web-framework with which to build Showmedo, life was a little simpler. The big CMS frameworks like Zope and its relatively user-friendly little bro Plone were reasonably well established and there were a few up and coming lighter frameworks, which promised to take one closer to the Python. Standouts here were Turbogears and Django.

I tried using Plone, even bought the book, but found it very unwieldy. Like so many frameworks it was perfectly happy until one wanted to do something outside its workflow plan. Then things got icky, really icky. There was also far too much 'magic' going on, and far too little connection with the underlying Python. After this experience Turbogears was a breath of fresh air. I could build a web-site in Python and leverage all the efficiency and elegance of the language up close and personal. What could be sweeter? I enjoyed the experience so much I didn't give Django a fair crack of the whip. Although back then it was in beta and had been postponing a 1.0 release for a very long time.

Well, time moves on, and if there has been anything as intense as a framework war, it's fair to conclude that Django has taken the spoils. Which seems fair. It's a fantastically managed project, the documentation seems top notch (very important point), and the community is huge, enthusiastic and growing. For a more flexible experience Pylons is threatening to steal some of Turbogears' thunder, offering a more modular, 'best of breed' alternative to Django. One of the big advantages here is the possibility of using the superb and acknowledged king of Python ORM database libraries, SQLAlchemy.

Anyway, today I come to praise something at the other end of the spectrum and newish to the field, namely Bottle.py. An entire web-framework packaged in a single Python module, which seemed crazy when first I heard of it. Bottle bills itself as a micro-framework, fast, simple and lightweight. Having had a chance to play around with it, I can testify to this. It makes a superb development server, among other things. After previous battles with Zope/Plone it's pretty incredible to get a server up and serving pages in a few lines of Python, with one imported module. Here's the 'Hello World' example:

from bottle import route, run

@route('/:name')
def index(name='World'):
    return 'Hello %s!' % name

run(host='localhost', port=8080)

So, if you fancy getting away from the complexity of the bigger frameworks and getting back to basics, Bottle.py is a great option and an amazing achievement.
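To see just how thin Bottle's layer is, here is roughly the same hello-world written against raw WSGI with only the standard library -- a sketch of my own (in modern Python syntax), not Bottle code:

```python
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # Mimic Bottle's dynamic '/:name' route by reading the URL path.
    name = environ.get('PATH_INFO', '/').lstrip('/') or 'World'
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('Hello %s!' % name).encode('utf-8')]

# To actually serve, uncomment:
# make_server('localhost', 8080, app).serve_forever()
```

Bottle essentially wraps this boilerplate (routing, response headers, string handling) behind a decorator, which is why the whole framework fits in one module.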

In the meantime, a little selection of Python web-framework videos from the Showmedo vaults:

Eric Florenzano's humungous series, Django from the Ground Up.

Jiang Xin's Pylons series.

Kevin Dangor's Ultimate DVD Turbogears set. Note: Turbogears 2 has seen some impressive changes to the framework, but much of Kevin's presentation is still applicable.

We don't yet have a bottle.py screencast, but there will be one soon. In the meantime, the closest to low-level web-appery we have is John Montgomery's introduction to Python Web Programming CGI. Note: this is a club series but the linked introductory video is free and gives an overview.

Thursday, January 20, 2011

Django on Dreamhost: Virtual Python Install

Django on Dreamhost: Virtual Python Install

About VMware --- May try sometime

What is VMware Player?
VMware Player is software that enables users to easily create and run
virtual machines on a Windows or Linux PC. VMware Player now creates
virtual machines in addition to running virtual machines created by
VMware Workstation, VMware Fusion, VMware Server, or VMware ESX and
supports Microsoft virtual machines and Symantec LiveState Recovery
disk formats.
What does it cost?

Installing PPS Internet TV on Ubuntu

From: http://vv15.com/2010/11/ubuntu-pps/

PPS Internet TV now has a Linux version -- good news for anyone who likes watching Internet TV and runs Ubuntu. Let's install it!

First, the official page: http://www.pps.tv/about/6/364.html
Then install following the instructions on that page.

Software version: 0.1.1678
* Requirements: Ubuntu 8.04+, x86 Linux PCs only.
* Size: 1.5MB (deb package)

Installation notes:

Dependencies:
Install the following before installing the Linux version of PPS:
* Qt libraries, version 4.4.0 or later
* libFuse, version 2.7.2 or later
* MPlayer, version 1.0rc2 or later
* MPlayer video codecs: MPlayer Essential Codec Pack (http://www.mplayerhq.hu/MPlayer/releases/codecs/essential-20071007.tar.bz2)

Installing via apt-get is recommended:
Step 1: install the dependencies:
sudo apt-get install libqt4-core libqt4-dbus libqt4-gui libqt4-network libqt4-webkit libqt4-xml libfuse2 mplayer
(Beginners don't need to understand each of these packages; just run the command in a terminal.)
Step 2: install PPS:
sudo dpkg -i ppstream_1.0.0-1_i386.deb


After installation video playback works fine. If there is no sound, select alsa as the audio device in the PPS settings and restart PPS. If you check Ubuntu's system settings you will find the audio type is indeed alsa, so this step just brings the application setting in line with the system setting.

Install Python Package on ECN-computer

Install a Python package on a Windows system (e.g. an ECN computer) where we do not have write permission.

Since we do not have permission to write into the system path, such as C:\Python25, we cannot install packages in the usual location C:\Python25\Lib\site-packages.


We finally found out how to install the library ourselves:

For windows-ECN:

a- run cmd
b- At the prompt: N:>   {this is the Windows shell prompt}
N:>C:\Python25\python.exe "C:\TEMP\PythonLib\MDP-3.0\setup.py" install --prefix=C:\TEMP\PythonLib

Notes:
C:\Python25\python.exe   ==> System Python path
"C:\TEMP\PythonLib\MDP-3.0\setup.py"   ==> path of which of the setup.py is going to run
--prefix=C:\TEMP\PythonLib   ===> this is where we are going to install the lib
                by this, a directory <C:\TEMP\PythonLib\Lib\site-packages> will be created
                when using it, in the Python Shell:
                >>>sys.path.append('C:\TEMP\PythonLib\Lib\site-packages')

This idea is coming from:
http://wiki.pylonshq.com/display/pylonscookbook/Using+a+Virtualenv+Sandbox

Also, note that we encountered an error message saying that it could not find the file "mdp\__init__.py".
We then checked the code and made the following adjustment in setup.py:

    # mdp_init = open(os.path.join(os.getcwd(), 'mdp', '__init__.py'))
    mdp_init = open(os.path.join("C:/TEMP/PythonLib/MDP-3.0/", 'mdp', '__init__.py'))

This is simple since we know exactly where our file is.
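The sys.path trick above generalizes: any directory can be made importable at run time. A minimal sketch (the path below is the hypothetical prefix from this post -- substitute your own):

```python
import sys

# Hypothetical user-level install prefix from the steps above.
custom_site = r'C:\TEMP\PythonLib\Lib\site-packages'

# Append only once, so repeated runs don't pile up duplicates.
if custom_site not in sys.path:
    sys.path.append(custom_site)

print(custom_site in sys.path)   # True
```

You can also put the same path in a PYTHONPATH environment variable to avoid editing every script.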
 

mldata :: Welcome

mldata :: Welcome

From the mldata team:

Dear Machine Learners,

we are proud to announce mldata, the machine learning data set repository at http://mldata.org.

mldata is a community website aimed at exchanging data sets. Compared to existing sites, the emphasis lies on community. That means that anyone can upload data, comment on existing data sets, contribute
solutions to existing data sets, discuss topics in the forum, and in general easily interact with other users.

mldata is organized into four main types of objects:

* Data - just raw data
* Task - learning tasks defined on data sets
* Method - a machine learning method, can be applied to a Task
* Challenge - a set of Tasks defining a challenge

In principle, any kind of data can be uploaded, but mldata can parse some data formats like ARFF, CSV, and the format used by libsvm and other SVM solvers. For such data sets, more functionality is available, like automatic conversion to other formats.

Other features include automatic evaluation of solutions for tasks using one of a large number of already available performance measures, but of course we're glad to add any user contributed performance
measure.

So have a look, and let us know what you think on the mldata forum!


Mikio Braun - on behalf of the mldata team.

mldata is sponsored by the Pascal2 Network of Excellence.

Process Over Content: Simple tree using a Python dictionary

Simple tree using a Python dictionary

What I already have: table with rows containing (NodeId, ParentId, Title) values.

What I need: a simple tree structure mapping Nodes to their parents.

Solution:

# simple tree builder.

# (node, parent, title)
els = (
(1, 0, 'a'),
(2, 1, 'b'),
(3, 1, 'c'),
(4, 0, 'd'),
(5, 4, 'e'),
(6, 5, 'f'),
(7, 4, 'g')
)

class Node:
    def __init__(self, n, s):
        self.id = n
        self.title = s
        self.children = []

treeMap = {}
Root = Node(0, "Root")
treeMap[Root.id] = Root
for element in els:
    nodeId, parentId, title = element
    if not nodeId in treeMap:
        treeMap[nodeId] = Node(nodeId, title)
    else:
        treeMap[nodeId].id = nodeId
        treeMap[nodeId].title = title

    if not parentId in treeMap:
        treeMap[parentId] = Node(0, '')
    treeMap[parentId].children.append(treeMap[nodeId])

def print_map(node, lvl=0):
    for n in node.children:
        print ' ' * lvl + n.title
        if len(n.children) > 0:
            print_map(n, lvl+1)

print_map(Root)

"""
Output:
a
b
c
d
e
f
g
"""
This works well for my purposes because I want to be able to go back and refer to an element by id without having to write the code to traverse the tree and find it. Keeping a copy of a tree structure and keeping a reference to each element in a dictionary, I get the best of both worlds. Deletions are difficult since I'm not keeping track of parent nodes. But, since the use case is a menu on an ASP.NET page that rebuilds on every load, I'm not storing a copy of the tree in memory long-term (i.e., deletions and insertions are irrelevant).
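The same dictionary-backed pattern can be sketched a little more compactly with collections.defaultdict, which hides the explicit placeholder bookkeeping (this is my own variant, not code from the original post):

```python
from collections import defaultdict

class Node(object):
    def __init__(self, node_id, title):
        self.id = node_id
        self.title = title
        self.children = []

def build_tree(rows):
    """rows: iterable of (node_id, parent_id, title) tuples."""
    # Missing keys (parents referenced before their own row) get a
    # blank placeholder Node, filled in when their row arrives.
    nodes = defaultdict(lambda: Node(None, ''))
    for node_id, parent_id, title in rows:
        node = nodes[node_id]
        node.id, node.title = node_id, title
        nodes[parent_id].children.append(node)
    nodes[0].id, nodes[0].title = 0, 'Root'
    return nodes

els = ((1, 0, 'a'), (2, 1, 'b'), (3, 1, 'c'),
       (4, 0, 'd'), (5, 4, 'e'), (6, 5, 'f'), (7, 4, 'g'))
tree = build_tree(els)
print([c.title for c in tree[0].children])   # ['a', 'd']
print([c.title for c in tree[4].children])   # ['e', 'g']
```

As in the original, every node stays reachable by id through the dictionary, so the O(1) lookup property is preserved.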

Unfortunately this is just a sketch to wrap my brain around the idea. In real life I had to implement it in VB.NET. (see http://pastebin.com/f8e9c672)