Monday, January 24, 2011

Regular expressions for tokenizing

Some simple ones:

hyphen  = r'(\w+\-\s?\w+)'

Allows for a space after the hyphen
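A quick check of the hyphen pattern on its own (a minimal sketch using `re.findall`, which returns the captured group for each match):

```python
import re

# Hyphenated-word pattern from above; the optional \s? also matches
# a space after the hyphen (as in line-broken text).
hyphen = r'(\w+\-\s?\w+)'

print(re.findall(hyphen, "That art-deco poster"))  # → ['art-deco']
print(re.findall(hyphen, "a well- known fact"))    # → ['well- known']
```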

apostrophe  = r'(\w+\'\w+)'

numbers = r'((\$|#)?\d+(\.)?\d+%?)'

Needs to handle large numbers with commas
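One way to handle comma-grouped numbers is to allow repeated `,ddd` groups (a sketch of an extension, not the pattern from the notes):

```python
import re

# Extended number pattern: optional $/# prefix, comma-grouped digits,
# optional decimal part, optional trailing %. An assumption, not the
# original pattern above.
numbers = r'((\$|#)?\d+(,\d{3})*(\.\d+)?%?)'

for s in ["$23.40", "1,234,567", "99%", "#42"]:
    print(re.match(numbers, s).group(1))
```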

punct      = r'([^\w\s]+)'

wordr      = r'(\w+)'

A nice python trick:

r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])

Joins the substrings into one pattern string, with a "|" between each. Order matters: alternatives are tried left to right, so the more specific patterns go first.

Now run it:

sentence = "That art-deco poster costs $23.40."
nltk.regexp_tokenize(sentence, r)
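Putting it all together as a self-contained sketch: `re.finditer` stands in for `nltk.regexp_tokenize` (which matches the pattern the same way), and since the notes don't show the `url` pattern, a crude placeholder is assumed here:

```python
import re

# Patterns from the notes; url is a placeholder assumption.
url        = r'(https?://\S+)'
hyphen     = r'(\w+\-\s?\w+)'
apostrophe = r'(\w+\'\w+)'
numbers    = r'((\$|#)?\d+(\.)?\d+%?)'
punct      = r'([^\w\s]+)'
wordr      = r'(\w+)'

# Specific patterns first, so e.g. "art-deco" wins over plain \w+.
r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])

sentence = "That art-deco poster costs $23.40."
# Take the full match of each alternative, mirroring regexp_tokenize.
tokens = [m.group(0) for m in re.finditer(r, sentence)]
print(tokens)  # → ['That', 'art-deco', 'poster', 'costs', '$23.40', '.']
```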
