FUN in S&T: January 2011

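Translator's note: This article describes a broadly applicable method for extracting the genuinely useful body text from different kinds of HTML files. It works much like the "剪影" (clipping) feature recently launched by CSDN: it strips irrelevant content from headers, footers and sidebars, which is very practical. The method is simple, effective and surprising; after reading it you may well exclaim that you never thought of doing it this way. The writing is concise and easy to follow. Although it uses an artificial neural network, FANN wraps the algorithm well enough that readers do not need to understand ANNs. All examples are written in Python, which keeps them readable, and the article has a popular-science feel. It is worth a read.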

You’ve finally got your hands on the diverse collection of HTML documents you needed. But the content you’re interested in is hidden amidst adverts, layout tables, formatting markup and various other links. Even worse, there’s visible text in the menus, headers and footers that you want to filter out. If you don’t want to write a complex scraping program for each type of HTML file, there is a solution.

This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…

Do you want to find out how statistics and machine learning can save you time and effort mining text?

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:

1. Parse the HTML code and keep track of the number of bytes processed.
2. Store the text output on a per-line or per-paragraph basis.
3. Associate with each text line the number of bytes of HTML required to describe it.
4. Compute the text density of each line by calculating the ratio of text to bytes.
5. Then decide if the line is part of the content by using a neural network.

You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning, not to mention that it’s easier to implement!
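The fixed-threshold check described above can be sketched in a few lines. This is only an illustration in Python 3, not the article's code: the `(text, html_bytes)` input pairs and the helper name `filter_by_density` are assumptions made for the example.

```python
def filter_by_density(lines, threshold=0.5):
    """Keep lines whose text-to-HTML-bytes ratio exceeds the threshold.

    `lines` is a list of (text, html_bytes) pairs; both this input shape
    and the function name are illustrative assumptions.
    """
    kept = []
    for text, html_bytes in lines:
        # Guard against empty entries so we never divide by zero.
        density = len(text) / html_bytes if html_bytes else 0.0
        if density > threshold:
            kept.append(text)
    return kept


sample = [
    ("Home | About | Contact", 400),                   # menu: little text, lots of markup
    ("A long paragraph of article text. " * 3, 120),   # body text: dense
]
result = filter_by_density(sample)
```

Only the second entry survives here, because its text accounts for most of the bytes needed to encode it.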

Let’s take it from the top…

Converting the HTML to Text

What you need is the core of a text-mode browser, which is already set up to read files with HTML markup and display raw text. By reusing existing code, you won’t have to spend too much time handling invalid XML documents, which are very common, as you’ll quickly realise.

As a quick example, we’ll be using Python along with a few built-in Python 2 modules: htmllib for the parsing and formatter for outputting formatted text. This is what the top-level function looks like:

# Python 2 standard-library modules used throughout this example.
import htmllib
import formatter
import StringIO

def extract_text(html):
    # Derive from formatter.AbstractWriter to store paragraphs.
    writer = LineWriter()
    # Default formatter sends commands to our writer.
    fmt = formatter.AbstractFormatter(writer)
    # Derive from htmllib.HTMLParser to track parsed bytes.
    parser = TrackingParser(writer, fmt)
    # Give the parser the raw HTML data.
    parser.feed(html)
    parser.close()
    # Filter the paragraphs stored and output them.
    return writer.output()

The TrackingParser itself overrides the callback functions for parsing start and end tags, as they are given the current parse index in the buffer. You don’t have access to that normally, unless you start diving into frames in the call stack, which isn’t the best approach! Here’s what the class looks like:

class TrackingParser(htmllib.HTMLParser):
    """Try to keep accurate pointer of parsing location."""

    def __init__(self, writer, *args):
        htmllib.HTMLParser.__init__(self, *args)
        self.writer = writer

    def parse_starttag(self, i):
        index = htmllib.HTMLParser.parse_starttag(self, i)
        self.writer.index = index
        return index

    def parse_endtag(self, i):
        self.writer.index = i
        return htmllib.HTMLParser.parse_endtag(self, i)

The LineWriter class does the bulk of the work when called by the default formatter. If you have any improvements or changes to make, most likely they’ll go here. This is where we’ll put our machine learning code later. But you can keep the implementation rather simple and still get good results. Here’s the simplest possible code:

class Paragraph:
    def __init__(self):
        self.text = ''
        self.bytes = 0
        self.density = 0.0

class LineWriter(formatter.AbstractWriter):
    def __init__(self, *args):
        # Parse position reported by TrackingParser; starts at the top of the buffer.
        self.index = 0
        self.last_index = 0
        self.lines = [Paragraph()]
        formatter.AbstractWriter.__init__(self)

    def send_flowing_data(self, data):
        # Work out the length of this text chunk.
        t = len(data)
        # We've parsed more text, so increment index.
        self.index += t
        # Calculate the number of bytes since last time.
        b = self.index - self.last_index
        self.last_index = self.index
        # Accumulate this information in current line.
        l = self.lines[-1]
        l.text += data
        l.bytes += b

    def send_paragraph(self, blankline):
        """Create a new paragraph if necessary."""
        if self.lines[-1].text == '':
            return
        self.lines[-1].text += '\n' * (blankline+1)
        self.lines[-1].bytes += 2 * (blankline+1)
        self.lines.append(Paragraph())

    def send_literal_data(self, data):
        self.send_flowing_data(data)

    def send_line_break(self):
        self.send_paragraph(0)

This code doesn’t do any outputting yet; it just gathers the data. We now have a bunch of paragraphs in an array, we know their length, and we know roughly how many bytes of HTML were necessary to create them. Let’s see what emerges from our statistics.

Examining the Data

Luckily, there are some patterns in the data. In the raw output below, you’ll notice there are definite spikes in the number of HTML bytes required to encode lines of text, notably around the title, both sidebars, headers and footers.

While the number of HTML bytes spikes in places, it remains below average for quite a few lines. On these lines, the text output is rather high. Calculating the density of text to HTML bytes gives us a better understanding of this relationship.

The patterns are more obvious in this density value, so it gives us something concrete to work with.

Filtering the Lines

The simplest way we can filter lines now is by comparing the density to a fixed threshold, such as 50% or the average density. Finishing the LineWriter class:

    # These two methods complete the LineWriter class above.
    def compute_density(self):
        """Calculate the density for each line, and the average."""
        total = 0.0
        for l in self.lines:
            # The trailing empty paragraph has no bytes; avoid dividing by zero.
            if l.bytes > 0:
                l.density = len(l.text) / float(l.bytes)
            total += l.density
        # Store for optional use by the neural network.
        self.average = total / float(len(self.lines))

    def output(self):
        """Return a string with the useless lines filtered out."""
        self.compute_density()
        output = StringIO.StringIO()
        for l in self.lines:
            # Check density against threshold.
            # Custom filter extensions go here.
            if l.density > 0.5:
                output.write(l.text)
        return output.getvalue()

This rough filter typically gets most of the lines right. Header, footer and sidebar text is usually stripped as long as it’s not too long. However, if there are long copyright notices, comments, or descriptions of other stories, then those are output too. Also, if there are short lines around inline graphics or adverts within the text, these are not output.

To fix this, we need a more complex filtering heuristic. But instead of spending days working out the logic manually, we’ll just grab loads of information about each line and use machine learning to find patterns for us.

Supervised Machine Learning

Here’s an example of an interface for tagging lines of text as content or not:

The idea of supervised learning is to provide examples for an algorithm to learn from. In our case, we give it a set of documents that were tagged by humans, so we know which lines must be output and which must be filtered out. For this we’ll use a simple neural network known as the perceptron. It takes floating point inputs, filters the information through weighted connections between “neurons”, and outputs another floating point number. Roughly speaking, the number of neurons and layers affects the ability to approximate functions precisely; we’ll use both single-layer perceptrons (SLP) and multi-layer perceptrons (MLP) for prototyping.
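To make the "weighted connections" idea concrete, here is a minimal Python 3 sketch of what a single-layer perceptron computes for one line. It is an illustration only: the function name, the sigmoid activation, and the weights and bias are invented for the example, whereas a real network would learn its weights from the tagged documents.

```python
import math


def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs, squashed by a sigmoid into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical inputs: density, HTML bytes and text length of one line.
features = [0.8, 30.0, 24.0]
# Hypothetical weights and bias; training would normally find these.
score = perceptron_output(features, [2.0, -0.01, 0.01], -1.0)
is_content = score > 0.5
```

A multi-layer perceptron chains several such units, feeding the outputs of one layer into the next.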

To get the neural network to learn, we need to gather some data. This is where the earlier LineWriter.output() function comes in handy; it gives us a central point to process all the lines at once and make a global decision about which lines to output. Starting with intuition and experimenting a bit, we discover that the following data is useful for deciding how to filter a line:

Density of the current line.
Number of HTML bytes of the line.
Length of output text for this line.
These three values for the previous line,
… and the same for the next line.
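The nine inputs listed above could be assembled as follows. This is a Python 3 sketch under stated assumptions: lines are represented as `(text, html_bytes)` pairs, the helper names are hypothetical, and padding a missing neighbour with zeros is a choice made for the example rather than something the article specifies.

```python
def line_features(lines, i):
    """Return [density, html_bytes, text_length] for line i, or zeros if out of range."""
    if 0 <= i < len(lines):
        text, html_bytes = lines[i]
        density = len(text) / html_bytes if html_bytes else 0.0
        return [density, float(html_bytes), float(len(text))]
    # Assumption: a missing previous/next line contributes all zeros.
    return [0.0, 0.0, 0.0]


def feature_vector(lines, i):
    """Features of the current line, then the previous line, then the next line."""
    return (line_features(lines, i)
            + line_features(lines, i - 1)
            + line_features(lines, i + 1))


lines = [("menu", 200), ("Real article text here.", 30), ("footer", 150)]
vec = feature_vector(lines, 1)
```

Each tagged line then becomes one training pattern: these nine numbers as input and the human label (content or not) as the expected output.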

For the implementation, we’ll be using Python to interface with FANN, the Fast Artificial Neural Network Library. The essence of the learning code goes like this:

from pyfann import fann, libfann

# This creates a new single-layer perceptron with 1 output and 3 inputs.
obj = libfann.fann_create_standard_array(2, (3, 1))
ann = fann.fann_class(obj)

# Load the data we described above.
patterns = fann.read_train_from_file('training.txt')
ann.train_on_data(patterns, 1000, 1, 0.0)

# Then test it with different data.
# validation_data is a list of (inputs, expected output) pairs prepared separately.
for datin, datout in validation_data:
    result = ann.run(datin)
    print 'Got:', result, ' Expected:', datout

Trying out different data and different network structures is a rather mechanical process. Don’t use too many neurons, or you may fit the set of documents you have too closely (overfitting); conversely, use enough of them to solve the problem well. Here are the results, varying the number of lines used (1L-3L) and the number of attributes per line (1A-3A):

The interesting thing to note is that 0.5 is already a pretty good guess at a fixed threshold (see the first set of columns). The learning algorithm cannot find a much better solution when comparing the density alone (1 Attribute, in the second set of columns). With 3 Attributes, the next SLP does better overall, though it produces more false negatives. Using multiple lines also increases the performance of the single-layer perceptron (fourth set of columns). And finally, using a more complex neural network structure works best overall, making 80% fewer errors in filtering the lines.

Note that you can tweak how the error is calculated if you want to punish false positives more than false negatives.

Conclusion

Extracting text from arbitrary HTML files doesn’t necessarily require scraping the file with custom code. You can use statistics to get pretty amazing results, and machine learning to get even better ones. By tweaking the threshold, you can avoid the worst false positives that pollute your text output. But it’s not so bad in practice; where the neural network makes mistakes, even humans have trouble classifying those lines as “content” or not.

Now all you have to figure out is what to do with that clean text content!

Sunday, January 30, 2011

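[Linux] Installing a standalone Hadoop 0.20.1 Single-Node Cluster (Pseudo-Distributed) @ Ubuntu 9.04

From:http://changyy.pixnet.net/blog/post/25245658

I did not expect to be installing Hadoop again so soon! My earlier experience was not really a from-scratch install: the whole environment had been built by someone else, and I only upgraded Hadoop from 0.18 to 0.20 within it. This time I want to test the HBase REST interface, so I am setting up a fresh instance for testing, and Single Node mode is all I need. Combined with my recent experience installing Ubuntu, perhaps this just about counts as building from nothing?!

There are some very good articles online, and I mainly followed the reference material, but it covers version 0.20.0. In my testing I also ran into an Ubuntu 9.04 problem, so a few steps had to change; fortunately, after some time I got it working. I recommend reading the reference material first; this post is purely a note to myself.

Environment

# uname -a
Linux changyy-desktop 2.6.28-15-generic #52-Ubuntu SMP Wed Sep 9 10:49:34 UTC 2009 i686 GNU/Linux

Install openssh-server
- # sudo apt-get install openssh-server
- I started from Ubuntu 7.04 Desktop and upgraded to 9.04. The Desktop edition needs openssh-server installed before it accepts SSH logins; the Server edition probably includes it by default. Test with ssh localhost.
Install Java
- # sudo apt-get install sun-java6-jdk
Create and configure the account Hadoop will use
- # sudo addgroup hadoop
- # sudo adduser --ingroup hadoop hadoop
Set up password-less login for the hadoop account
- # su - hadoop
  - Skip this step if you are already the hadoop user.
- # ssh-keygen -t rsa -P ''
- # cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- Test the connection as the hadoop user
  - # ssh localhost
  - The login should complete without asking for a password. If it does not, see the articles mentioned above.
Install Hadoop 0.20.1
- Locations
- Following the reference article, install under /usr/local
  - # cd /usr/local
  - # sudo wget http://apache.stu.edu.tw/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
  - # sudo tar -xvf hadoop-0.20.1.tar.gz
  - # sudo chown -R hadoop:hadoop hadoop-0.20.1
  - # sudo ln -s hadoop-0.20.1/ hadoop
- I use a symbolic link so that switching versions later is easy. If that is unfamiliar, use sudo mv hadoop-0.20.1 hadoop in place of the last step.
- Delete the downloaded file
  - # sudo rm -rf hadoop-0.20.1.tar.gz
Configure Hadoop 0.20.1
- Switch to the hadoop user first; skip this step if you already are
  - # su - hadoop
- Set the environment variables
  - # vim /usr/local/hadoop/conf/hadoop-env.sh
    - Java location
      - export JAVA_HOME=/usr/lib/jvm/java-6-sun
    - Disable IPv6
      - HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
- Set the data location, ports and so on
  - The data will live in /home/hadoop/db. Create it beforehand, and remember that the directory must be owned by the hadoop account.
  - # vim /usr/local/hadoop/conf/core-site.xml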
    - <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hadoop/db/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
      </property>
      
      <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
      <description>The name of the default file system. A URI whose
      scheme and authority determine the FileSystem implementation. The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class. The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
      </property>
      
      <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
      </property>
  - # vim /usr/local/hadoop/conf/mapred-site.xml
    - <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
      <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
      </property>
  - Do not take the shortcut of putting everything into core-site.xml. I did exactly that, and the default services would not start; the tasktracker and jobtracker logs kept showing these messages:
    - ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Not a host:port pair: local
    - FATAL org.apache.hadoop.mapred.JobTracker: java.lang.RuntimeException: Not a host:port pair: local
- Format the storage
  - # /usr/local/hadoop/bin/hadoop namenode -format
- Start Hadoop
  - # /usr/local/hadoop/bin/start-all.sh
- Stop Hadoop
  - # /usr/local/hadoop/bin/stop-all.sh
After starting Hadoop, check its status with the following commands
- # jps
- # netstat -plten | grep java
- # /usr/local/hadoop/bin/hadoop dfsadmin -report

The parts highlighted in red in the original post are where these steps differ from the reference material. As for disabling IPv6 in the kernel, on Ubuntu 9.04 that seems to require rebuilding the kernel, so after testing I settled on the Hadoop configuration file instead. For more information, see:

[#HADOOP-3437] mapred.job.tracker default value/docs appear out of sync with code - ASF JIRA

Testing Hadoop (remember to run as the hadoop user and to start the Hadoop framework first)

Create a directory
- # /usr/local/hadoop/bin/hadoop dfs -mkdir input
List it
- # /usr/local/hadoop/bin/hadoop dfs -ls
Upload a test file and list again
- # /usr/local/hadoop/bin/hadoop dfs -put /usr/local/hadoop/conf/core-site.xml input/
- # /usr/local/hadoop/bin/hadoop dfs -lsr
Run the example and view the result
- # /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-0.20.1-examples.jar wordcount input output
- # /usr/local/hadoop/bin/hadoop dfs -cat output/*

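Understanding Python's Naming Mechanism

This article was first published on 恋花蝶's blog (http://blog.csdn.net/lanphaday). You are welcome to repost it, provided this notice is kept and it is not used for commercial purposes. Thank you.

Introduction

I warmly invite you to guess the output of the following program: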

class A(object):
    def __init__(self):
        self.__private()
        self.public()

    def __private(self):
        print 'A.__private()'

    def public(self):
        print 'A.public()'

class B(A):
    def __private(self):
        print 'B.__private()'

    def public(self):
        print 'B.public()'

b = B()

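First look

The correct answer is:

A.__private()

B.public()

If you guessed correctly, you can skip this post. If you did not, or if you still have doubts, this post is written for you.

Everything starts with the question of why "A.__private()" is printed. To explain that clearly, we need to understand how Python handles names.

According to the Python manual, a name (identifier) is one of Python's atomic elements. Once a name is bound to an object, the name refers to that object, just as in human society, doesn't it? A name bound inside a code block is a local variable of that block; a name bound at module level is a global variable. Everyone probably has a good idea of what a module is, but a code block may be more puzzling, so here is an explanation:

A code block is a piece of Python program text that is executed as a unit; a module, a function body and a class definition are all code blocks. Beyond that, each command typed interactively is a code block, a script file is a code block, and a script given on the command line is a code block.

Next comes visibility, for which we introduce the notion of scope. Scope is the visibility of a name within a code block. If a local variable is defined in a code block, its scope includes that block. If the definition occurs in a function block, the scope extends to every block nested inside that function, unless a nested block defines another variable of the same name. But the scope of a variable defined in a class is limited to the class block and does not extend into the code blocks of its methods.

The puzzle

Following the theory above, we can divide the code into three code blocks: the definition of class A, the definition of class B, and the definition of the variable b. From the class definitions we know that the code gives class A three member variables (Python functions are objects too, so calling member methods member variables is acceptable) and class B two. This can be verified with the following code: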

>>> print '\n'.join(dir(A))

_A__private

__init__

public

>>> print '\n'.join(dir(B))

_A__private

_B__private

__init__

public

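Hmm, why does class A have an attribute named _A__private? And __private has disappeared! To explain this we need to talk about Python's private name mangling.

Digging deeper

Anyone who knows Python knows that it treats names beginning with two or more underscores, and not ending with two or more underscores, as private. Private names are rewritten into a longer (and effectively public) form before code is generated. The rule is: the class name is inserted in front of the name, and a single underscore is added in front of that. This is what is called private name mangling. For example, the identifier __private in class A is converted to _A__private, which is why _A__private appeared and __private disappeared in the previous section.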
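The mangling rule is easy to check directly. The following is a small Python 3 sketch (the article's own listings use Python 2 syntax); the method name `call_private` and the variable names are introduced only for this illustration.

```python
class A:
    def __private(self):
        return "A.__private()"

    def call_private(self):
        # Inside the class body this is compiled as self._A__private().
        return self.__private()


a = A()
# The attribute is stored under the mangled name only.
mangled_present = hasattr(A, "_A__private")
unmangled_present = hasattr(A, "__private")
# Both routes reach the same method.
via_method = a.call_private()
via_mangled_name = a._A__private()
```

Note that the string passed to `hasattr` is not mangled, because mangling applies only to identifiers written inside a class body.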

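Two side notes:

First, mangling makes identifiers longer. According to the manual for the Python versions discussed here, a mangled name longer than 255 characters may be truncated, so watch out for name clashes caused by that.

Second, when a class name consists entirely of underscores, Python does not mangle at all. For example: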

>>> class ____(object):
        def __init__(self):
            self.__method()
        def __method(self):
            print '____.__method()'

>>> print '\n'.join(dir(____))
__class__
__delattr__
__dict__
__doc__
__getattribute__
__hash__
__init__
__method # not mangled
__module__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__str__
__weakref__
>>> obj = ____()
____.__method()
>>> obj.__method() # can be called from outside
____.__method()

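Now let us go back and see why "A.__private()" is printed!

The truth

Clever readers have probably guessed the answer by now. If you have not, here is a hint: the truth is much like macro preprocessing in C.

Because class A defines a private member function (variable), private name mangling is applied before code generation (recall the sentence in the previous section saying that private names are converted before code is generated). After mangling, the code of class A becomes: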

class A(object):
    def __init__(self):
        self._A__private() # this line changed
        self.public()

    def _A__private(self): # this line changed too
        print 'A.__private()'

    def public(self):
        print 'A.public()'

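Doesn't that look a bit like macro expansion in C?

Because class B does not override __init__, the method that runs is still A.__init__, which executes self._A__private() and therefore prints "A.__private()".

The following two snippets make the point more convincing and should help understanding: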

>>> class C(A):
        def __init__(self): # overrides __init__, so self._A__private is no longer called
            self.__private() # this binds to _C__private
            self.public()
        def __private(self):
            print 'C.__private()'
        def public(self):
            print 'C.public()'

>>> c = C()
C.__private()
C.public()

############################

>>> class A(object):
        def __init__(self):
            self._A__private() # this name is never defined explicitly; mangling of __private below supplies it
            self.public()
        def __private(self):
            print 'A.__private()'
        def public(self):
            print 'A.public()'

>>> a = A()
A.__private()
A.public()

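[Summary] Linux scripting and administration commands that often come up in written tests and interviews

From: http://hi.baidu.com/zengjianyuan/blog/item/426209f41f027ce57709d71e.html

awk # Processes fields, which sed and grep cannot do.
awk -F , 'NR==1,NR==2 {print $1 $2}' file # For lines 1 through 2, print the first and second fields of each line, using ',' as the field separator.
print can be replaced with printf, but printf does not append a newline. NR: number of records read so far; NF: number of fields in the current record.
awk '/main/' file or awk '/sun/{print}' filename # Show the lines that match the pattern (main, sun). ps -ef | grep "httpd" | awk '{print $2}' # Find a process and show its process ID.
sed # Substitute (s), delete (d), insert (i) or append (a), change (c), print a range: sed -n '2,6p' file
#echo "a b c a" | sed 's/a/d/g' # Note the difference with and without g: without it, only the first match on each line is replaced.
#sed -n '2,3p' file # Show only lines 2 and 3. Try it without -n: by default sed prints every input line, and -n suppresses that.
#sed '/main/ d' file # Delete the lines containing main. sed '1,3 d' file deletes the first three lines.
#sed '1i shit' file, sed '1a shit' file # The first inserts the text before line 1; the second appends it after line 1.
#sed -i "s/name/zjy/g" `grep name -rl input` # Replace name with zjy in every file under the input directory.
#http://hi.baidu.com/leowang715/blog/item/3c66f04fabc45dc1d0c86a31.html
tr # Squeeze blank lines: tr -s '\n' < file; lower case to upper case: tr 'a-z' 'A-Z' < file; delete the character 'a': tr -d 'a' < file
grep #grep -r "shit" ./ # Search recursively for "shit" under the current directory.
sort #sort -t: +1 -2 b (-r reverses the order, -u outputs unique lines). -t: uses ':' as the separator; +1 -2 selects the key field, counting from 0 (obsolete syntax; the current equivalent is -k 2,2).
find
# find path -name filename
# find path -type x
-type x matches files of type x, where x is one of:
b block device
c character device
d directory
p named pipe (FIFO)
f regular file
l symbolic link
s socket
-xtype x is the same as -type unless the file is a symbolic link, in which case (with the default -P behaviour) it tests the type of the file the link points to.
# find ./code -type f -exec ls -il {} \;
# The -exec option is followed by the command or script to run, then a pair of braces { }, a space, a backslash, and finally a semicolon.
# With -exec ... \; find runs the command once for every matched file, which is slow when there are many files. Passing every match to a single command at once, for example through backticks, can exceed the system limit on command-line length and fail with "Argument list too long". This is where xargs is useful, especially together with find.
find passes the matched files to xargs, which runs the command on one batch at a time instead of all at once: it handles the first batch, then the next, and so on.
#find ./code -type f -print | xargs file
wc # wc -lcw filename; l counts lines, c counts bytes, w counts words.
uniq # Works on adjacent lines; -d shows only duplicated lines; -u shows only lines that are not repeated. Because it compares adjacent lines only, it is usually combined with sort.
cut # Extracts data from a text file or text stream.
cut -f 1-2 -d: filename # Numbering starts at 1.
-d: uses ':' as the delimiter; the default is tab.
-b, -c, -f: byte, character, field.
1-2 is a range. N: only item N; N-: from item N to the end of the line; N-M: items N through M (inclusive); -M: from the first item through item M.
cat/tac # tac prints lines in the reverse order of cat.
tee # cmd1 | tee file1...N | cmd2: the output of cmd1 goes to tee, which writes it to file1...N and also passes it on as the input of cmd2.
tail/head # head -3 file, tail -3 file: show the first three lines and the last three lines.
eval # eval cmd[;cmd;cmd]: executes its arguments as a command.
expr # expr args, for example a=`expr $b + 1` (the spaces around + are required)
let # let expression-list, for example let "a=b+c"
xargs # Passes its input to the command that follows xargs, as that command's arguments.
Regular expressions
colrm # Removes selected columns from the input: colrm [start column [end column]]. colrm 2 5 < filename
rev # Reverses the characters of each line.
basename/dirname # Strip the directory part of a path (and optionally a suffix) / strip the last component of a path: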
[sword@localhost ~]$ temp=/home/sword
[sword@localhost ~]$ basename $temp
sword
[sword@localhost ~]$ dirname $temp
/home
[sword@localhost ~]$ basename /home/sword.c .c
sword
ls
join
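du # (disk usage) Shows the size of directories or files.
mail # mail -s "subject" user@example.com < filename
tar/gzip # Archiving and compression. For a while I thought they were the same thing.
# tar cvf ***.tar files-to-archive; tar xvf ***.tar restores the files. c (create) makes an archive; x extracts one.
tar xvzf ***.tgz -C /tmp # Decompress and extract into the /tmp directory.
date
useradd/userdel/usermod
groupadd/groupdel
echo
sleep
crontab/crond
fg %n # Brings job n to the foreground. APUE p. 223
bg
stty tostop # Stops background jobs from writing to the controlling terminal. APUE p. 224
file # Determines whether a file is a binary, C/C++ source, plain text, and so on.
lpr, lpq, lprm # Print a file, list the print queue, remove print jobs.
whereis # Checks whether a command exists on the system; if it has documentation, that is reported as well.
which # When several versions of a command exist, tells you which one the shell runs when you type the command name.
history/fc # Show or manipulate the command history list.
Variable declarations:
declare/typeset # Declare and initialise variables, set their attributes, and query them.
local -a array_name # local can only declare variables inside a function.
readonly -a array_name # readonly with -a declares a read-only array variable.
declare -a array_name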
[sword@localhost ~]$ moves=("shit" "fuck" [20]="mother fucking")
[sword@localhost ~]$ echo ${moves[0]}
shit
[sword@localhost ~]$ echo ${moves[1]}
fuck
[sword@localhost ~]$ echo ${moves[2]}
[sword@localhost ~]$ echo ${moves[20]}
mother fucking
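ln # ln -s path_of_source_file path_of_link
ps # Shows process status: ps aux | grep "XXX"
kill # kill -l lists all signal numbers with their names.
pstree # Draws the tree of running processes, showing the parent/child relationships.
ifconfig # Shows or changes network interface addresses; apply changes with service network restart
hostname # Shows the host name.
whoami # Which account am I?
uname # Shows system information; uname -r shows the kernel version.
source # Runs a file in the current shell, so that a freshly edited configuration file takes effect.
read # Reads from standard input.
shift # shift [N]: moves the command-line arguments N places to the left, one place by default.
set # set $(date): sets the positional parameters ($1, $2, ...) to the words of the command output.
here document # bash here documents redirect a command's standard input to data embedded in the script itself; mainly used to display menus.
trap # trap ['command list'] [signal list]: bash signal handling.
exec # 1. Replaces the current process with the given command or program. 2. Opens or closes file descriptors; combined with redirection operators, it allows reading and writing files.
bash -xv debug_file # Debug a script.
top # Shows running processes in real time.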

>>> import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

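[python] Using profile to help optimize program performance

This article was first published on 恋花蝶's blog, http://blog.csdn.net/lanphaday. You are welcome to repost it, but please keep the original text complete and retain this notice.

God said: "If you choose a scripting language, do not think about performance." I strongly agree: what we want from scripting is development speed, good extensibility and maintainability. Unfortunately, in the end our programs sometimes run too slowly for our customers to tolerate, and then we have no choice but to optimize the code.

Programs run slowly for many reasons, such as too much inefficient code (for example, heavy use of the "." operator), but the real cause is often one or two inconspicuous, poorly designed pieces of code, such as a custom type conversion applied to every element of a sequence. Program performance follows the 80/20 rule: 20% of the code accounts for 80% of the running time (in practice the ratio is far more extreme, and a few dozen lines often account for more than 95% of the running time), so finding the bottleneck by experience alone is hard. This is when we need a tool: profile! A project of mine recently hit performance problems in some key places close to its deadline. Fortunately the code was fairly modular, so profiling the relevant modules independently solved most of the problems. That experience made me decide to write an article about profile and share what I learned.

Getting to know profile

profile is part of the Python standard library. It measures the running time of every function in a program and offers a variety of reports. Using profile to analyse a program is simple. Suppose we have the following program: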

def foo():
    sum = 0
    for i in range(100):
        sum += i
    return sum

if __name__ == "__main__":
    foo()

To analyse this program with profile, simply change the if block to:

if __name__ == "__main__":
    import profile
    profile.run("foo()")

All we did was import the profile module and call profile.run with the program's entry function as its argument. The program now prints:

5 function calls in 0.143 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)

1 0.000 0.000 0.000 0.000 :0(range)

1 0.143 0.143 0.143 0.143 :0(setprofile)

1 0.000 0.000 0.000 0.000 <string>:1(?)

1 0.000 0.000 0.000 0.000 prof1.py:1(foo)

1 0.000 0.000 0.143 0.143 profile:0(foo())

0 0.000 0.000 profile:0(profiler)

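The table above shows the function calls made in prof1.py. It makes clear that the foo() call accounts for 100% of the cumulative running time (the profile:0(foo()) row), so foo() is the well-deserved hot spot of this program.

Besides this approach, you can invoke the profile module directly through the Python interpreter to profile a .py program, for example by entering this on the command line:

python -m profile prof1.py

The output is the same as when the script is modified to call profile.run().

The profile results are divided into the columns ncalls, tottime, percall, cumtime, percall and filename:lineno(function):

ncalls	number of times the function was called
tottime	total time spent in the function, excluding time spent in the functions it calls
percall	average time per call, equal to tottime/ncalls
cumtime	total time spent in the function, including time spent in the functions it calls
percall	average time per call, equal to cumtime/ncalls
filename:lineno(function)	file name, line number and name of the function

Normally profile prints straight to the console, sorted by standard name by default. That gets in our way: sometimes we want to save the output to a file and view the results in different forms. profile supports some of this directly. We can pass profile.run() a second argument, the name of the file in which to save the output; likewise, on the command line we can add one more argument (the -o option) to save the profile output.

Custom reports with pstats

profile takes care of one of those needs. The other, viewing the output in several forms, is handled by the Stats class. It lives in the pstats module, and its constructor accepts one argument: the name of a profile output file. Stats provides sorting, output control and similar functions for profile results. For example, we can change the earlier program to: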

# … omitted

if __name__ == "__main__":
    import profile
    profile.run("foo()", "prof.txt")
    import pstats
    p = pstats.Stats("prof.txt")
    p.sort_stats("time").print_stats()

With pstats in place, the profile output is sorted by the time spent in each function:

Sun Jan 14 00:03:12 2007 prof.txt

5 function calls in 0.002 CPU seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

1 0.002 0.002 0.002 0.002 :0(setprofile)

1 0.000 0.000 0.002 0.002 profile:0(foo())

1 0.000 0.000 0.000 0.000 G:\prof1.py:1(foo)

1 0.000 0.000 0.000 0.000 <string>:1(?)

1 0.000 0.000 0.000 0.000 :0(range)

0 0.000 0.000 profile:0(profiler)

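Stats has several methods, and combining them produces different profile reports, which makes it very powerful. Here is a brief introduction:

strip_dirs()	removes the path information in front of file names
add(filename,[…])	adds further profile output files to the Stats instance
dump_stats(filename)	saves the statistics of the Stats instance to a file
sort_stats(key,[…])	the most important method; sorts the profile output
reverse_order()	reverses the order of the data in the Stats instance
print_stats([restriction,…])	prints the Stats report to stdout
print_callers([restriction,…])	prints information about the functions that called the selected functions
print_callees([restriction,…])	prints information about the functions that the selected functions called

The most important methods are sort_stats and print_stats; with these two we can browse almost all the information in a suitable form, so they deserve a closer look.

sort_stats() accepts one or more string arguments, such as "time" or "name", naming the column to sort by. This is very useful: sorting by time shows the functions that consume the most time themselves, and sorting by cumulative shows the functions with the largest total time, so optimization can be targeted and far more effective. sort_stats accepts the following keys:

'ncalls'	number of calls
'cumulative'	cumulative time of the function
'file'	file name
'module'	file name
'pcalls'	primitive call count (excludes recursive calls)
'line'	line number
'name'	function name
'nfl'	name/file/line
'stdname'	standard name
'time'	internal time (excluding time in called functions)

The other important method is print_stats, which prints the report according to the most recent sort_stats call. print_stats takes several optional arguments that filter the output; each can be an integer (a number of lines), a fraction between 0 and 1 (a proportion of lines) or a Perl-style regular expression. Regular expressions are covered elsewhere, so here are just three examples:

print_stats(.1, "foo:")

This takes the first 10% of the entries in stats and then prints those among them that contain the string "foo:".

print_stats("foo:", .1)

This takes the entries in stats that contain the string "foo:" and prints the first 10% of them.

print_stats(10)

This prints the first 10 entries in stats.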
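As a concrete illustration of sorting and restricting a report, here is a short Python 3 sketch. It is not the article's code: it uses cProfile (introduced later in this article) with an in-memory buffer instead of a file, and the names `profiler`, `buffer` and `report` are chosen for the example.

```python
import cProfile
import io
import pstats


def foo():
    return sum(range(100))


# Profile a single call to foo().
profiler = cProfile.Profile()
profiler.runcall(foo)

# Send the report to a string buffer instead of stdout.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Sort by internal time, then keep only entries matching the regex "foo".
stats.sort_stats("time").print_stats("foo")
report = buffer.getvalue()
```

The string restriction is matched against the `filename:lineno(function)` column, so entries such as built-in calls are filtered out of the report.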

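In fact, when profile prints its results, it effectively calls the Stats methods like this:

p.strip_dirs().sort_stats(-1).print_stats()

The argument -1 to sort_stats is kept for compatibility with older versions. sort_stats accepts one of -1, 0, 1 and 2, which correspond to "stdname", "calls", "time" and "cumulative". If you pass a number, however, pstats sorts by that first argument only and ignores any others.

hotshot: a better profile

Because of the way profile works (for instance, its timer is only accurate to milliseconds), its measurements are seriously distorted in quite a few cases. hotshot is implemented mostly in C, so its timing code disturbs the profiled program far less than profile does, and it can also collect running time per line. Its shortcoming is that it does not support multithreaded programs; more precisely, its timing core has a bug related to critical sections. Worse, hotshot is no longer maintained and may be removed from the standard library in a future Python version. For programs without threads, though, hotshot is still a very good profiler.

hotshot has a Profile class whose constructor looks like this:

class Profile( logfile[, lineevents[, linetimings]])

logfile is the name of the file that stores the profiling results. lineevents says whether to record the running time of every source line; it defaults to 0, meaning that function execution time is the granularity. linetimings says whether to record timing information and defaults to 1. Here is another example: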

# … omitted

if __name__ == "__main__":
    import hotshot
    import hotshot.stats
    prof = hotshot.Profile("hs_prof.txt", 1)
    prof.runcall(foo)
    prof.close()
    p = hotshot.stats.load("hs_prof.txt")
    p.print_stats()

Output:

1 function calls in 0.003 CPU seconds

Random listing order was used

ncalls tottime percall cumtime percall filename:lineno(function)

1 0.003 0.003 0.003 0.003 i:\prof1.py:1(foo)

0 0.000 0.000 profile:0(profiler)

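We can see that hotshot produces far less noise than profile, which helps when analysing the data to find hot spots. However, although I used prof = hotshot.Profile("hs_prof.txt", 1) in the code above, I found that setting lineevents=1 made no difference compared with omitting the lineevents argument. I would welcome an explanation.

hotshot lets us measure a program more flexibly, because hotshot.Profile provides the following methods:

run(cmd)	executes a piece of script; same as the run() function of the profile module
runcall(func, *args, **keywords)	calls a function and collects its running information
runctx(cmd, globals, locals)	executes a piece of script in the given environment and collects its running information

With these methods it is easy to set up test stubs, without hand-writing lots of driver modules as profile requires. hotshot.Profile offers other useful methods as well; see the manual for details.

Notes on Python 2.5

Because hotshot cannot be used with threads and its only advantage is speed, Python 2.5 announced that the hotshot module would no longer be maintained and might be removed in a later version. Its replacement is cProfile; as with cPickle and similar modules, cProfile is faster than profile. cProfile has the same interface as profile, so existing projects can adopt it simply by replacing profile with cProfile wherever it is used.

pstats also changed subtly in Python 2.5. The constructor of pstats.Stats gained a default parameter and became:

class Stats( filename[, stream=sys.stdout[, ...]])

This does us no harm: the stream parameter gives us a way to save the profile report to a file, which is exactly what we need.

To sum up, if you are using Python 2.5, I recommend cProfile.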
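The following Python 3 sketch shows cProfile saving raw results to a file and pstats writing the report through the stream parameter. It is an illustration under assumptions: the temporary directory, the file name `prof.out` and the use of `runctx` (so that the statement can find `foo` regardless of how the snippet is run) are choices made for the example.

```python
import cProfile
import io
import os
import pstats
import tempfile


def foo():
    total = 0
    for i in range(100):
        total += i
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prof.out")
    # Run the statement under the profiler and dump raw stats to a file.
    cProfile.runctx("foo()", globals(), locals(), filename=path)
    # Load the file and write the report into a buffer via stream=.
    buffer = io.StringIO()
    stats = pstats.Stats(path, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(5)
    report = buffer.getvalue()
```

Replacing the buffer with an open text file would save the formatted report to disk instead.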

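A small and handy Swiss Army knife: timeit

If one day, on a whim, we want to know how long it takes to append an element to a list or to raise an exception, using profile would be like killing a chicken with an ox cleaver. The better choice is the timeit module.

timeit has a friendly programming interface and an equally friendly command-line interface. Let us look at the programming interface first. The timeit module contains a class, Timer, whose constructor looks like this:

class Timer( [stmt='pass' [, setup='pass' [, timer=<timer function>]]])

stmt is a code fragment, given as a string, whose running time is measured; setup prepares the environment in which stmt runs; timer lets the user supply a timing function with the desired precision.

timeit.Timer has three methods, introduced briefly below:

timeit( [number=1000000])

timeit() executes the setup statement from the Timer constructor once, then executes stmt number times, and returns the total time taken.

repeat( [repeat=3 [, number=1000000]])

repeat() calls timeit() repeat times, passing number each time, and returns a list with the time taken by each call.

print_exc( [file=None])

print_exc() is used in place of the standard traceback, because print_exc() shows the source line in which the error occurred. For example: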

>>> t = timeit.Timer("t = foo()\nprint t")  # <- the code being timed
>>> t.timeit()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in -toplevel-
    t.timeit()
  File "E:\Python23\lib\timeit.py", line 158, in timeit
    return self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
    foo()  # <- this is what the standard traceback shows
NameError: global name 'foo' is not defined
>>> try:
        t.timeit()
    except:
        t.print_exc()

Traceback (most recent call last):
  File "<pyshell#17>", line 2, in ?
  File "E:\Python23\lib\timeit.py", line 158, in timeit
    return self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
    t = foo()  # <- this is what print_exc() shows, which makes the error easy to locate
NameError: global name 'foo' is not defined

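Besides the programming interface, timeit can be used from the command line, which is very convenient:

python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-v] [-h] [statement ...]

The options are:

-n N/--number=N

how many times to execute statement

-r N/--repeat=N

how many times to repeat the call to timeit(); the default is 3

-s S/--setup=S

statement that sets up the environment for statement; the default is "pass"

-t/--time

use time.time() as the timing function (the default on all platforms except Windows)

-c/--clock

use time.clock() as the timing function (the default on Windows)

-v/--verbose

print the timing values with more precision

-h/--help

print a short usage message

Small and handy as it is, timeit has plenty of potential waiting to be explored, so I will not give further examples here.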
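For readers who would like one anyway, here is a minimal Python 3 sketch of the programming interface, timing a list append as mentioned at the start of this section. The statement, the setup string and the variable names are illustrative choices.

```python
import timeit

# Time appending one element to a list; setup runs once per timeit() call.
timer = timeit.Timer(stmt="items.append(1)", setup="items = []")

# timeit() returns one float: the total seconds for `number` executions.
total = timer.timeit(number=1000)

# repeat() returns a list with one such total per repetition.
runs = timer.repeat(repeat=3, number=1000)
best = min(runs)
```

Taking the minimum of the repeated runs is a common way to reduce the influence of other activity on the machine.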

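Afterword

I originally meant to write only about using profile and my own experience with it, but after reading the relevant manuals carefully I found so much useful material that I ended up rambling on at this length, and my own experience will have to wait for another time. While writing this introduction I remembered a Python implementation of the A* algorithm that I wrote some time ago and never optimized, so I plan to use it as the example in a future article that has one practical example running all the way through.

This article was written with reference to the manuals for Python 2.3, 2.4 and 2.5. Most of its content applies to Python 2.3 and later; cProfile requires version 2.5.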

Monday, January 31, 2011
