Monday, January 24, 2011

Regular expressions for tokenizing

Some simple ones:

hyphen  = r'(\w+\-\s?\w+)'

Allows for a space after the hyphen
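A quick check of the hyphen pattern on its own (a minimal sketch using `re.findall`, which returns the captured group for each match):

```python
import re

# Hyphenated-word pattern from above; the optional \s? also matches
# a space after the hyphen (as in line-broken text).
hyphen = r'(\w+\-\s?\w+)'

print(re.findall(hyphen, "That art-deco poster"))  # → ['art-deco']
print(re.findall(hyphen, "a well- known fact"))    # → ['well- known']
```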

apostrophe  = r'(\w+\'\w+)'

numbers = r'((\$|#)?\d+(\.)?\d+%?)'

Needs to handle large numbers with commas
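One way to handle comma-grouped numbers is to allow repeated `,ddd` groups (a sketch of an extension, not the pattern from the notes):

```python
import re

# Extended number pattern: optional $/# prefix, comma-grouped digits,
# optional decimal part, optional trailing %. An assumption, not the
# original pattern above.
numbers = r'((\$|#)?\d+(,\d{3})*(\.\d+)?%?)'

for s in ["$23.40", "1,234,567", "99%", "#42"]:
    print(re.match(numbers, s).group(1))
```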

punct      = r'([^\w\s]+)'

wordr      = r'(\w+)'

A nice python trick:

r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])

Joins the substrings into one pattern string, with a "|" between each. Order matters: alternatives are tried left to right, so the more specific patterns go first.

Now run it:

sentence = "That art-deco poster costs $23.40."
nltk.regexp_tokenize(sentence, r)
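Putting it all together as a self-contained sketch: `re.finditer` stands in for `nltk.regexp_tokenize` (which matches the pattern the same way), and since the notes don't show the `url` pattern, a crude placeholder is assumed here:

```python
import re

# Patterns from the notes; url is a placeholder assumption.
url        = r'(https?://\S+)'
hyphen     = r'(\w+\-\s?\w+)'
apostrophe = r'(\w+\'\w+)'
numbers    = r'((\$|#)?\d+(\.)?\d+%?)'
punct      = r'([^\w\s]+)'
wordr      = r'(\w+)'

# Specific patterns first, so e.g. "art-deco" wins over plain \w+.
r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])

sentence = "That art-deco poster costs $23.40."
# Take the full match of each alternative, mirroring regexp_tokenize.
tokens = [m.group(0) for m in re.finditer(r, sentence)]
print(tokens)  # → ['That', 'art-deco', 'poster', 'costs', '$23.40', '.']
```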
