Monday, January 24, 2011

NLTK corpora

From: http://blog.ynada.com/tag/nltk
 
[*] alpino.............. Alpino Dutch Treebank
[*] nombank.1.0......... NomBank Corpus 1.0
[*] abc................. Australian Broadcasting Commission 2006
[*] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[*] conll2000........... CONLL 2000 Chunking Corpus
[*] chat80.............. Chat-80 Data Files
[*] brown............... Brown Corpus
[*] brown_tei........... Brown Corpus (TEI XML Version)
[*] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
[*] cess_cat............ CESS-CAT Treebank
[*] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[*] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
[*] city_database....... City Database
[*] indian.............. Indian Language POS-Tagged Corpus
[*] shakespeare......... Shakespeare XML Corpus Sample
[*] dependency_treebank. Dependency Parsed Treebank
[*] inaugural........... C-Span Inaugural Address Corpus
[*] ieer................ NIST IE-ER DATA SAMPLE
[*] gutenberg........... Project Gutenberg Selections
[*] gazetteers.......... Gazeteer Lists
[*] names............... Names Corpus, Version 1.3 (1994-03-29)
[*] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
[*] movie_reviews....... Sentiment Polarity Dataset Version 2.0
[*] cess_esp............ CESS-ESP Treebank
[*] genesis............. Genesis Corpus
[*] kimmo............... PC-KIMMO Data Files
[*] floresta............ Portuguese Treebank
[*] qc.................. Experimental Data for Question Classification
[*] nps_chat............ NPS Chat
[*] paradigms........... Paradigm Corpus
[*] pil................. The Patient Information Leaflet (PIL) Corpus
[*] stopwords........... Stopwords Corpus
[*] propbank............ Proposition Bank Corpus 1.0
[ ] pe08................ Cross-Framework and Cross-Domain Parser Evaluation Shared Task
[*] state_union......... C-Span State of the Union Address Corpus
[*] sinica_treebank..... Sinica Treebank Corpus Sample
[*] ppattach............ Prepositional Phrase Attachment Corpus
[*] senseval............ SENSEVAL 2 Corpus: Sense Tagged Text
[*] problem_reports..... Problem Report Corpus
[*] reuters............. The Reuters-21578 benchmark corpus, ApteMod version
[*] swadesh............. Swadesh Wordlists
[*] rte................. PASCAL RTE Challenges 1, 2, and 3
[*] udhr................ Universal Declaration of Human Rights Corpus
[*] treebank............ Penn Treebank Sample
[*] unicode_samples..... Unicode Samples
[*] verbnet............. VerbNet Lexicon, Version 2.1
[*] wordnet_ic.......... WordNet-InfoContent
[*] book_grammars....... Grammars from NLTK Book
[*] words............... Word Lists
[*] punkt............... Punkt Tokenizer Models
[*] wordnet............. WordNet
[*] large_grammars...... Large context-free grammars for parser comparison
[*] ycoe................ York-Toronto-Helsinki Parsed Corpus of Old English Prose
[*] spanish_grammars.... Grammars for Spanish
[*] rslp................ RSLP Stemmer (Removedor de Sufixos da Lingua Portuguesa)
[*] tagsets............. Help on Tagsets
[*] sample_grammars..... Sample Grammars
[*] timit............... TIMIT Corpus Sample
[*] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
[*] toolbox............. Toolbox Sample Files
[*] basque_grammars..... Grammars for Basque
[*] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
[*] webtext............. Web Text Corpus
[*] switchboard......... Switchboard Corpus Sample
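
A listing like the one above can be checked mechanically for packages that are not installed yet. The following is only a minimal sketch, assuming that `[*]` marks an installed package and `[ ]` one that is not installed; the function name `parse_listing` and the three-line `sample` are made up for illustration and are not part of NLTK.

```python
def parse_listing(text):
    """Split a downloader-style listing into (installed, missing) lists.

    Each entry is an (identifier, description) pair. Assumption: "[*]"
    means installed and "[ ]" means not installed.
    """
    installed, missing = [], []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("[*]") or line.startswith("[ ]")):
            continue  # skip blank lines and wrapped continuation lines
        # The first whitespace-delimited token is the identifier plus dot leaders.
        token, _, description = line[3:].strip().partition(" ")
        # Drop trailing leaders (plain dots or ellipsis characters),
        # keeping dots inside identifiers such as "nombank.1.0".
        entry = (token.rstrip(".…"), description.strip())
        (installed if line[1] == "*" else missing).append(entry)
    return installed, missing


# Hypothetical three-line sample in the same layout as the listing.
sample = """\
[*] brown............... Brown Corpus
[*] nombank.1.0......... NomBank Corpus 1.0
[ ] pe08................ Cross-Framework and Cross-Domain Parser Evaluation Shared Task
"""

installed, missing = parse_listing(sample)
print([pkg for pkg, _ in missing])  # → ['pe08']
```

An identifier reported as missing could then be passed to `nltk.download()`, for example `nltk.download("pe08")`, which needs NLTK installed and network access.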
