Wednesday, January 26, 2011

NLTK Default Tagger CoNLL2000 Tag Coverage


Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.


NLTK Default Tagger Performance on CoNLL2000


The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.
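
You can reproduce a rough approximation of these metrics directly in NLTK. The sketch below is not the analyze_tagger_coverage.py script itself; it assumes the Precision and Recall columns compare the sets of unique words given each tag (one way matching Found and Actual counts can coexist with low precision and recall), and the exact numbers will depend on which default tagger model your NLTK version loads.

# Rough sketch, not the nltk-trainer script: tally per-tag token counts from
# nltk.pos_tag() (the default tagger) against the conll2000 reference tags,
# and compare the sets of unique words per tag for precision and recall.
from collections import defaultdict
import nltk
from nltk.corpus import conll2000
from nltk.metrics import precision, recall

found = defaultdict(int)         # tag -> tokens the default tagger gave this tag
actual = defaultdict(int)        # tag -> tokens carrying this tag in conll2000
found_words = defaultdict(set)   # tag -> unique words the default tagger gave this tag
actual_words = defaultdict(set)  # tag -> unique words carrying this tag in conll2000

for tagged_sent in conll2000.tagged_sents():   # slow: the corpus is ~270k tokens
    words = [word for word, tag in tagged_sent]
    for (word, ref_tag), (_, test_tag) in zip(tagged_sent, nltk.pos_tag(words)):
        actual[ref_tag] += 1
        actual_words[ref_tag].add(word)
        found[test_tag] += 1
        found_words[test_tag].add(word)

for tag in sorted(set(found) | set(actual)):
    # precision(reference, test) and recall(reference, test) operate on sets
    # and return None when the relevant set is empty, as in the table below
    print(tag, found[tag], actual[tag],
          precision(actual_words[tag], found_words[tag]),
          recall(actual_words[tag], found_words[tag]))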

Tag      Found    Actual   Precision   Recall
#            46       47   1           1
$          2122     2134   1           0.6
''         1811     1809   1           1
(             0      351   None        0
)             0      358   None        0
,         13160    13160   1           1
-LRB-       351        0   0           None
-NONE-       59        0   0           None
-RRB-       358        0   0           None
.         10800    10802   1           1
:          1288     1285   0.7143      1
CC         6589     6586   0.6875      0.7333
CD        10325    10233   0.972       0.9919
DT        22301    22355   0.7826      1
EX          229      254   1           1
FW            1       42   1           0.0455
IN        27798    27835   0.7315      0.7899
JJ        15370    16049   0.7372      0.7303
JJR        1114     1055   0.5412      0.575
JJS         611      451   0.6912      0.7966
LS           13        0   0           None
MD         2616     2637   0.7143      0.75
NN        38023    36789   0.7345      0.8441
NNP       24967    24690   0.8752      0.9421
NNPS        589      550   0.4553      0.3684
NNS       17068    16653   0.8572      0.9527
PDT          24       65   0.6667      1
POS        2224     2203   0.6667      1
PRP        4620     4634   0.8438      0.7941
PRP$       2292     2302   0.6364      1
RB         7681     7961   0.8076      0.8582
RBR         288      392   0.5         0.3684
RBS          90      240   0.5         0.1667
RP          634       95   0.1176      1
SYM           0        6   None        0
TO         6257     6259   1           0.75
UH            2       17   1           0.1111
VB         6681     7286   0.9042      0.8313
VBD        8501     8424   0.7521      0.8605
VBG        3730     4000   0.8493      0.8603
VBN        5763     5867   0.8164      0.8721
VBP        3232     3407   0.6754      0.6638
VBZ        5224     5561   0.7273      0.6906
WDT        1156     1157   0.6         0.5
WP          637      639   1           1
WP$          38       39   1           1
WRB         566      571   0.9         0.75
``         1855     1854   0.6667      1

Unknown Words in CoNLL2000


The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words joined by a “-”. You might think this could be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.
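
To see which words fall through to -NONE- yourself, a minimal sketch like the following works; whether any words come back as -NONE- at all depends on the default tagger model your NLTK version loads.

# Minimal sketch: collect the unique conll2000 words that the default tagger
# labels -NONE-, i.e. the words it cannot assign a real part-of-speech tag.
import nltk
from nltk.corpus import conll2000

unknown = set()
for sent in conll2000.sents():
    for word, tag in nltk.pos_tag(sent):
        if tag == '-NONE-':
            unknown.add(word)

print(len(unknown))
print(sorted(unknown)[:10])   # e.g. hyphenated names and acronyms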


Missing Symbols and Rare Tags


The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the rarer tags, such as FW, LS, RBS, and UH. These failures highlight the need to train a part-of-speech tagger (or any NLP model) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, in similar proportions. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.
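
The simplest fix is to train a tagger on conll2000 itself, so training and testing share the same tag set. Below is a minimal sketch using a unigram/bigram backoff chain; the 90/10 split, the NN default tag, and the tagger classes are arbitrary illustration choices, and nltk-trainer's train_tagger.py script automates this kind of training.

# Minimal sketch: train a simple backoff tagger on conll2000 so that training
# and evaluation use the same tag set. The split and tagger choices here are
# arbitrary; nltk-trainer's train_tagger.py offers many more options.
import nltk
from nltk.corpus import conll2000

tagged_sents = list(conll2000.tagged_sents())
cutoff = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]

# Back off from bigram context to unigram frequencies to a default NN guess,
# so every word receives some tag drawn from conll2000's own tag set.
default = nltk.DefaultTagger('NN')
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

# evaluate() reports accuracy on the held-out sentences
# (newer NLTK versions call this accuracy())
print(bigram.evaluate(test_sents))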