Monday, January 24, 2011

Google Groups discussion on collecting a corpus from Twitter

From: http://blog.ynada.com/tag/nltk
 
mzap 2/11/10
I am a linguist at the University of Sydney currently studying the language of microblogging. I would like to build a 100 million word
corpus of tweets, and I am trying to determine the best way of collecting such a corpus. Does Twitter make data available directly, or is
scraping tweets through the API the only method? (I am not a programmer myself, although I do have access to a programmer who is able to use the API.)
If I were to use the API, would rate limiting mean that it is going to take ages to reach 100 million tweets?


Michael...@ivey) 2/11/10
Take a look at the Streaming API: http://apiwiki.twitter.com/Streaming-API-Documentation
It's very easy to make a simple collection client to pull the statuses/sample stream and gather a decent sample of all the tweets. Tell your programmer to hop on the list and ask any questions that come up...we're (usually) a pretty helpful bunch.

Rolando Espinoza La fuente 2/11/10
With the sample stream I got an average of roughly 10 tweets/sec and roughly 11 words/tweet, but take into account that you get tweets in multiple languages.
Rolando Espinoza La fuente
www.rolandoespinoza.info
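
For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. This is only a sketch: the two rates are the approximate averages quoted in this thread, and the function name days_to_collect is made up for the illustration.

```python
# Back-of-the-envelope estimate of collection time from the sample stream.
# Both rates are the rough averages reported in this thread, not measurements.
TWEETS_PER_SEC = 10    # approximate sample-stream rate
WORDS_PER_TWEET = 11   # approximate average tweet length in words


def days_to_collect(target, per_second):
    """Return the number of days needed to collect `target` items at `per_second`."""
    seconds = target / per_second
    return seconds / 86400  # 86,400 seconds in a day


words_days = days_to_collect(100_000_000, TWEETS_PER_SEC * WORDS_PER_TWEET)
tweets_days = days_to_collect(100_000_000, TWEETS_PER_SEC)

print(f"100 million words:  about {words_days:.1f} days")   # about 10.5 days
print(f"100 million tweets: about {tweets_days:.1f} days")  # about 115.7 days
```

These numbers count tweets in every language, so a single-language corpus would take longer by whatever share of the stream that language makes up.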


Re: [twitter-dev] Building a 100 million word Twitter corpus
If you're just collecting tweets to build a corpus, it's pretty easy to do with the Streaming API. I've got Perl scripts that can do that, either with Streaming or Search. With Streaming there's no "rate limit" - just connect to the "Sample" stream and collect tweets until you have a big enough corpus.
I don't have a good idea how long it will take you to get 100 million words, but it should be easy to figure out how long it will take to get 100 million tweets - just see how many tweets per hour "sample" is sending.
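
As a rough illustration of "collect tweets until you have a big enough corpus", here is a minimal Python sketch. It is not the Perl code mentioned above. It assumes the stream has already been opened elsewhere and arrives as one JSON object per line with the tweet body in a "text" field; the function name collect_corpus and the whitespace word count are choices made for this example only.

```python
import json


def collect_corpus(lines, target_words):
    """Gather tweet texts from an iterable of JSON lines until target_words is reached.

    Assumption: each non-blank line is one JSON object, and tweets carry
    their body in a "text" field. Anything else is skipped.
    """
    texts = []
    word_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            status = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(status, dict):
            continue
        text = status.get("text")
        if not text:
            continue  # skip messages that carry no tweet text
        texts.append(text)
        word_count += len(text.split())  # crude whitespace tokenisation
        if word_count >= target_words:
            break  # corpus is big enough, stop reading
    return texts, word_count


# Stand-in for the stream: a handful of hand-written lines.
sample = [
    '{"text": "just setting up my corpus"}',
    '',
    '{"delete": {"status": {"id": 1}}}',
    'not json',
    '{"text": "hello world"}',
    '{"text": "never reached"}',
]
texts, n = collect_corpus(sample, target_words=7)
print(n, texts)
```

Because the function takes any iterable of lines, the same loop works on a saved file of tweets or on a live connection, and the target can be raised to 100 million words without other changes.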

Yeah, it's pretty easy to collect tweets - I just tested some of my code on a small sample from the Streaming "sample" pipe. It's huge!
Speaking of Twitter "natural language processing", you might be interested in my tweet-text translation efforts. I'm going to be posting some more details in a day or so, but this routine might be of some interest to you:
lexical_regex_utilities.pl at master from znmeb's Twitter-API-Perl-Utilities - GitHub http://meb.tw/b4AHK9

And a test driver (requires JSON input, which is sort of the "native" language of the Twitter APIs):
test_pg_text.pl at master from znmeb's Twitter-API-Perl-Utilities - GitHub http://meb.tw/bAmt8q

License is same as Perl - Artistic. I need to put that in the repository. ;-)
M. Edward (Ed) Borasky borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős
