Wednesday, January 26, 2011

NLTK Default Tagger CoNLL2000 Tag Coverage


Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.


NLTK Default Tagger Performance on CoNLL2000


The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.
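
You can reproduce a rough approximation of these metrics directly in NLTK. The sketch below is not the analyze_tagger_coverage.py script itself; it assumes the Precision and Recall columns compare the sets of unique words given each tag (one way matching Found and Actual counts can coexist with low precision and recall), and the exact numbers will depend on which default tagger model your NLTK version loads.

# Rough sketch, not the nltk-trainer script: tally per-tag token counts from
# nltk.pos_tag() (the default tagger) against the conll2000 reference tags,
# and compare the sets of unique words per tag for precision and recall.
from collections import defaultdict
import nltk
from nltk.corpus import conll2000
from nltk.metrics import precision, recall

found = defaultdict(int)         # tag -> tokens the default tagger gave this tag
actual = defaultdict(int)        # tag -> tokens carrying this tag in conll2000
found_words = defaultdict(set)   # tag -> unique words the default tagger gave this tag
actual_words = defaultdict(set)  # tag -> unique words carrying this tag in conll2000

for tagged_sent in conll2000.tagged_sents():   # slow: the corpus is ~270k tokens
    words = [word for word, tag in tagged_sent]
    for (word, ref_tag), (_, test_tag) in zip(tagged_sent, nltk.pos_tag(words)):
        actual[ref_tag] += 1
        actual_words[ref_tag].add(word)
        found[test_tag] += 1
        found_words[test_tag].add(word)

for tag in sorted(set(found) | set(actual)):
    # precision(reference, test) and recall(reference, test) operate on sets
    # and return None when the relevant set is empty, as in the table below
    print(tag, found[tag], actual[tag],
          precision(actual_words[tag], found_words[tag]),
          recall(actual_words[tag], found_words[tag]))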

Tag      Found    Actual   Precision   Recall
#            46       47   1           1
$          2122     2134   1           0.6
''         1811     1809   1           1
(             0      351   None        0
)             0      358   None        0
,         13160    13160   1           1
-LRB-       351        0   0           None
-NONE-       59        0   0           None
-RRB-       358        0   0           None
.         10800    10802   1           1
:          1288     1285   0.7143      1
CC         6589     6586   0.6875      0.7333
CD        10325    10233   0.972       0.9919
DT        22301    22355   0.7826      1
EX          229      254   1           1
FW            1       42   1           0.0455
IN        27798    27835   0.7315      0.7899
JJ        15370    16049   0.7372      0.7303
JJR        1114     1055   0.5412      0.575
JJS         611      451   0.6912      0.7966
LS           13        0   0           None
MD         2616     2637   0.7143      0.75
NN        38023    36789   0.7345      0.8441
NNP       24967    24690   0.8752      0.9421
NNPS        589      550   0.4553      0.3684
NNS       17068    16653   0.8572      0.9527
PDT          24       65   0.6667      1
POS        2224     2203   0.6667      1
PRP        4620     4634   0.8438      0.7941
PRP$       2292     2302   0.6364      1
RB         7681     7961   0.8076      0.8582
RBR         288      392   0.5         0.3684
RBS          90      240   0.5         0.1667
RP          634       95   0.1176      1
SYM           0        6   None        0
TO         6257     6259   1           0.75
UH            2       17   1           0.1111
VB         6681     7286   0.9042      0.8313
VBD        8501     8424   0.7521      0.8605
VBG        3730     4000   0.8493      0.8603
VBN        5763     5867   0.8164      0.8721
VBP        3232     3407   0.6754      0.6638
VBZ        5224     5561   0.7273      0.6906
WDT        1156     1157   0.6         0.5
WP          637      639   1           1
WP$          38       39   1           1
WRB         566      571   0.9         0.75
``         1855     1854   0.6667      1

Unknown Words in CoNLL2000


The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words joined by a “-”. You might think this could be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.
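
To see which words fall through to -NONE- yourself, a minimal sketch like the following works; whether any words come back as -NONE- at all depends on the default tagger model your NLTK version loads.

# Minimal sketch: collect the unique conll2000 words that the default tagger
# labels -NONE-, i.e. the words it cannot assign a real part-of-speech tag.
import nltk
from nltk.corpus import conll2000

unknown = set()
for sent in conll2000.sents():
    for word, tag in nltk.pos_tag(sent):
        if tag == '-NONE-':
            unknown.add(word)

print(len(unknown))
print(sorted(unknown)[:10])   # e.g. hyphenated names and acronyms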


Missing Symbols and Rare Tags


The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the rarer tags, such as FW, LS, RBS, and UH. These failures highlight the need to train a part-of-speech tagger (or any NLP model) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, in similar proportions. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.
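
The simplest fix is to train a tagger on conll2000 itself, so training and testing share the same tag set. Below is a minimal sketch using a unigram/bigram backoff chain; the 90/10 split, the NN default tag, and the tagger classes are arbitrary illustration choices, and nltk-trainer's train_tagger.py script automates this kind of training.

# Minimal sketch: train a simple backoff tagger on conll2000 so that training
# and evaluation use the same tag set. The split and tagger choices here are
# arbitrary; nltk-trainer's train_tagger.py offers many more options.
import nltk
from nltk.corpus import conll2000

tagged_sents = list(conll2000.tagged_sents())
cutoff = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]

# Back off from bigram context to unigram frequencies to a default NN guess,
# so every word receives some tag drawn from conll2000's own tag set.
default = nltk.DefaultTagger('NN')
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

# evaluate() reports accuracy on the held-out sentences
# (newer NLTK versions call this accuracy())
print(bigram.evaluate(test_sents))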