
Monday, January 24, 2011

Regular expressions for tokenizing

Some simple ones:

hyphen     = r'(\w+\-\s?\w+)'

Allows for a space after the hyphen

apostrophe = r'(\w+\'\w+)'

numbers    = r'((\$|#)?\d+(\.)?\d+%?)'

Needs to handle large numbers with commas

punct      = r'([^\w\s]+)'

wordr      = r'(\w+)'

A nice Python trick:

r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])

Makes one string with a "|" between each substring

Now run it:

sentence = "That art-deco poster costs $23.40."
nltk.regexp_tokenize(sentence, r)
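
The pieces above can be assembled and run end to end. A minimal self-contained version using `re` directly (NLTK's `regexp_tokenize` wraps the same machinery; the `url` pattern that the slide also joins in is omitted here):

```python
import re

# Patterns from the slide above
hyphen     = r'(\w+\-\s?\w+)'   # allows a space after the hyphen
apostrophe = r"(\w+'\w+)"
numbers    = r'((\$|#)?\d+(\.)?\d+%?)'
wordr      = r'(\w+)'
punct      = r'([^\w\s]+)'

# One big alternation; order matters, since the regex engine tries
# alternatives left to right (numbers must come before wordr).
r = "|".join([hyphen, apostrophe, numbers, wordr, punct])

sentence = "That art-deco poster costs $23.40."
tokens = [m.group() for m in re.finditer(r, sentence)]
print(tokens)  # ['That', 'art-deco', 'poster', 'costs', '$23.40', '.']
```

Note that "art-deco" survives as one token and "$23.40" is kept together, while the sentence-final period falls through to the punctuation pattern.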

Google Groups discussion on collecting a tweet corpus from Twitter

From: http://blog.ynada.com/tag/nltk
 
mzap, 2/11/10
I am a linguist at the University of Sydney currently studying the language of microblogging. I would like to build a 100-million-word corpus of tweets, and I am trying to determine the best way of collecting it. Does Twitter make data available directly, or is the only method scraping tweets via the API? (I am not a programmer myself, although I do have access to a programmer who can use the API.) If I were to use the API, would rate limiting mean that it takes ages to reach 100 million tweets?


Michael...@ivey, 2/11/10
Take a look at the Streaming API: http://apiwiki.twitter.com/Streaming-API-Documentation
It's very easy to make a simple collection client to pull the statuses/sample stream and gather a decent sample of all the tweets. Tell your programmer to hop on the list and ask any questions that come up...we're (usually) a pretty helpful bunch.
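
The Streaming API endpoints quoted above are long gone, but the shape of such a collection client is simple: read line-delimited JSON, keep the statuses, skip keep-alive blank lines and non-status notices. A sketch (the `"text"` key and the deletion-notice shape are assumptions based on the old API format):

```python
import json

def collect_statuses(line_iter, limit=None):
    """Collect tweet texts from a line-delimited JSON stream.

    `line_iter` is any iterable of JSON lines, e.g. a streaming HTTP
    response body read line by line.
    """
    texts = []
    for line in line_iter:
        line = line.strip()
        if not line:            # streams send blank keep-alive lines
            continue
        status = json.loads(line)
        if "text" in status:    # skip deletion notices and the like
            texts.append(status["text"])
        if limit and len(texts) >= limit:
            break
    return texts

# Simulated chunk of a sample stream
sample = [
    '{"text": "hello world"}',
    '',
    '{"delete": {"id": 1}}',
    '{"text": "nltk is fun"}',
]
print(collect_statuses(sample))  # ['hello world', 'nltk is fun']
```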

Rolando Espinoza La fuente, 2/11/10
With the sample stream I got an average of roughly 10 tweets/sec and roughly 11 words/tweet, but bear in mind that you get the tweets in multiple languages.
Rolando Espinoza La fuente
www.rolandoespinoza.info


Re: [twitter-dev] Building a 100 million word Twitter corpus
If you're just collecting tweets to build a corpus, it's pretty easy to do with the Streaming API. I've got Perl scripts that can do that, either with Streaming or Search. With Streaming there's no "rate limit" - just connect to the "Sample" stream and collect tweets until you have a big enough corpus.
I don't have a good idea how long it will take you to get 100 million words, but it should be easy to figure out how long it will take to get 100 million tweets - just see how many tweets per hour "sample" is sending.
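
Using the rough figures quoted earlier in the thread (about 10 tweets/sec and 11 words/tweet on the sample stream), the arithmetic is straightforward:

```python
# Back-of-the-envelope estimate from the figures quoted above
tweets_per_sec  = 10
words_per_tweet = 11
target_words    = 100_000_000

seconds = target_words / (tweets_per_sec * words_per_tweet)
days = seconds / 86400   # seconds per day
print(round(days, 1))    # roughly 10.5 days of continuous streaming
```

So at those rates the 100-million-word target is a matter of days, not months, even before filtering by language.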

Yeah, it's pretty easy to collect tweets - I just tested some of my code on a small sample from the Streaming "sample" pipe. It's huge!
Speaking of Twitter "natural language processing", you might be interested in my tweet-text translation efforts. I'm going to be posting some more details in a day or so, but this routine might be of some interest to you:
lexical_regex_utilities.pl at master from znmeb's
Twitter-API-Perl-Utilities - GitHub http://meb.tw/b4AHK9

And a test driver (requires JSON input, which is sort of the "native" language of the Twitter APIs):
test_pg_text.pl at master from znmeb's Twitter-API-Perl-Utilities -
GitHub http://meb.tw/bAmt8q

License is same as Perl - Artistic. I need to put that in the repository. ;-)
M. Edward (Ed) Borasky borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős

NLTK modules and their functions

From: http://blog.ynada.com/tag/nltk
Accessing corpora: nltk.corpus
String processing: nltk.tokenize, nltk.stem
Collocation discovery: nltk.collocations
Part-of-speech tagging: nltk.tag
Classification: nltk.classify, nltk.cluster
Chunking: nltk.chunk
Parsing: nltk.parse
Semantic interpretation: nltk.sem, nltk.inference
Evaluation metrics: nltk.metrics
Probability and estimation: nltk.probability
Applications: nltk.app, nltk.chat

NLTK corpora

From: http://blog.ynada.com/tag/nltk
 
[*] alpino………….. Alpino Dutch Treebank
[*] nombank.1.0……… NomBank Corpus 1.0
[*] abc…………….. Australian Broadcasting Commission 2006
[*] maxent_ne_chunker… ACE Named Entity Chunker (Maximum entropy)
[*] conll2000……….. CONLL 2000 Chunking Corpus
[*] chat80………….. Chat-80 Data Files
[*] brown…………… Brown Corpus
[*] brown_tei……….. Brown Corpus (TEI XML Version)
[*] cmudict…………. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] biocreative_ppi….. BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
[*] cess_cat………… CESS-CAT Treebank
[*] conll2002……….. CONLL 2002 Named Entity Recognition Corpus
[*] conll2007……….. Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
[*] city_database……. City Database
[*] indian………….. Indian Language POS-Tagged Corpus
[*] shakespeare……… Shakespeare XML Corpus Sample
[*] dependency_treebank. Dependency Parsed Treebank
[*] inaugural……….. C-Span Inaugural Address Corpus
[*] ieer……………. NIST IE-ER DATA SAMPLE
[*] gutenberg……….. Project Gutenberg Selections
[*] gazetteers………. Gazeteer Lists
[*] names…………… Names Corpus, Version 1.3 (1994-03-29)
[*] mac_morpho………. MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
[*] movie_reviews……. Sentiment Polarity Dataset Version 2.0
[*] cess_esp………… CESS-ESP Treebank
[*] genesis…………. Genesis Corpus
[*] kimmo…………… PC-KIMMO Data Files
[*] floresta………… Portuguese Treebank
[*] qc……………… Experimental Data for Question Classification
[*] nps_chat………… NPS Chat
[*] paradigms……….. Paradigm Corpus
[*] pil…………….. The Patient Information Leaflet (PIL) Corpus
[*] stopwords……….. Stopwords Corpus
[*] propbank………… Proposition Bank Corpus 1.0
[ ] pe08……………. Cross-Framework and Cross-Domain Parser Evaluation Shared Task
[*] state_union……… C-Span State of the Union Address Corpus
[*] sinica_treebank….. Sinica Treebank Corpus Sample
[*] ppattach………… Prepositional Phrase Attachment Corpus
[*] senseval………… SENSEVAL 2 Corpus: Sense Tagged Text
[*] problem_reports….. Problem Report Corpus
[*] reuters…………. The Reuters-21578 benchmark corpus, ApteMod version
[*] swadesh…………. Swadesh Wordlists
[*] rte…………….. PASCAL RTE Challenges 1, 2, and 3
[*] udhr……………. Universal Declaration of Human Rights Corpus
[*] treebank………… Penn Treebank Sample
[*] unicode_samples….. Unicode Samples
[*] verbnet…………. VerbNet Lexicon, Version 2.1
[*] wordnet_ic………. WordNet-InfoContent
[*] book_grammars……. Grammars from NLTK Book
[*] words…………… Word Lists
[*] punkt…………… Punkt Tokenizer Models
[*] wordnet…………. WordNet
[*] large_grammars…… Large context-free grammars for parser comparison
[*] ycoe……………. York-Toronto-Helsinki Parsed Corpus of Old English Prose
[*] spanish_grammars…. Grammars for Spanish
[*] rslp……………. RSLP Stemmer (Removedor de Sufixos da Lingua Portuguesa)
[*] tagsets…………. Help on Tagsets
[*] sample_grammars….. Sample Grammars
[*] timit…………… TIMIT Corpus Sample
[*] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
[*] toolbox…………. Toolbox Sample Files
[*] basque_grammars….. Grammars for Basque
[*] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
[*] webtext…………. Web Text Corpus
[*] switchboard……… Switchboard Corpus Sample

NLTK corpus functions

From: http://blog.ynada.com/tag/nltk

fileids() The files of the corpus
fileids([categories]) The files of the corpus corresponding to these categories
categories() The categories of the corpus
categories([fileids]) The categories of the corpus corresponding to these files
raw() The raw content of the corpus
raw(fileids=[f1,f2,f3]) The raw content of the specified files
raw(categories=[c1,c2]) The raw content of the specified categories
words() The words of the whole corpus
words(fileids=[f1,f2,f3]) The words of the specified fileids
words(categories=[c1,c2]) The words of the specified categories
sents() The sentences of the whole corpus
sents(fileids=[f1,f2,f3]) The sentences of the specified fileids
sents(categories=[c1,c2]) The sentences of the specified categories
abspath(fileid) The location of the given file on disk
encoding(fileid) The encoding of the file (if known)
open(fileid) Open a stream for reading the given corpus file
root() The path to the root of the locally installed corpus
readme() The contents of the README file of the corpus
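
To make the table concrete, here is how those accessors compose, sketched against a toy reader with the same method names (an illustration only, not NLTK's actual implementation — the real readers stream from disk rather than holding text in a dict):

```python
class ToyCorpusReader:
    """Minimal stand-in mirroring the NLTK corpus reader interface above."""

    def __init__(self, files):
        self._files = files                 # {fileid: raw text}

    def fileids(self):
        return sorted(self._files)

    def raw(self, fileids=None):
        ids = fileids or self.fileids()
        return "".join(self._files[f] for f in ids)

    def words(self, fileids=None):
        # Real readers use a proper tokenizer; whitespace split suffices here
        return self.raw(fileids).split()

    def sents(self, fileids=None):
        return [s.split() for s in self.raw(fileids).split(". ") if s]

reader = ToyCorpusReader({"a.txt": "Hello world. Goodbye", "b.txt": " again"})
print(reader.fileids())          # ['a.txt', 'b.txt']
print(reader.words(["a.txt"]))   # ['Hello', 'world.', 'Goodbye']
```

The pattern to note is that every accessor takes an optional `fileids` restriction and defaults to the whole corpus, exactly as in the table.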