Some simple ones:
hyphen = r'(\w+\-\s?\w+)'
Allows for a space after the hyphen
apostrophe = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
Still needs to handle large numbers with commas
punct = r'([^\w\s]+)'
wordr = r'(\w+)'
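The note on `numbers` says commas are not handled yet. One possible extension is sketched below, using only the standard `re` module. The name `numbers_commas` and the sample sentence are made up for illustration, and the groups are written as non-capturing `(?:...)` so that `re.findall` returns whole matches. Unlike the original pattern, which needs at least two digits, this one also accepts a single digit.

```python
import re

# Hypothetical variant of `numbers`: optional $ or #, then either
# comma-grouped digits (1,234,567) or plain digits, then an optional
# decimal part and an optional trailing %.
numbers_commas = r'(?:\$|#)?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?'

samples = "It sold for $1,234,567.89, up 5% from 980 and #2 overall."
found = re.findall(numbers_commas, samples)
print(found)  # ['$1,234,567.89', '5%', '980', '#2']
```

The comma-grouped alternative comes first so that "1,234,567" is tried as a unit before the plain `\d+` fallback.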
A nice Python trick:
r = "|".join([url, hyphen, apostrophe, numbers, wordr, punct])
Makes one string with a "|" between each pair of substrings, so the combined regex matches any one of the patterns
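The order of the list matters, because a regex alternation takes the first alternative that matches at a given position. A small sketch with two of the patterns, written without the outer capturing parentheses so that `re.findall` returns plain strings (the names `hyphen_first` and `word_first` are just for this illustration):

```python
import re

hyphen = r'\w+-\s?\w+'
wordr = r'\w+'

# Same two patterns, joined in different orders.
hyphen_first = "|".join([hyphen, wordr])
word_first = "|".join([wordr, hyphen])

print(hyphen_first)                                  # \w+-\s?\w+|\w+
print(re.findall(hyphen_first, "art-deco poster"))   # ['art-deco', 'poster']
print(re.findall(word_first, "art-deco poster"))     # ['art', 'deco', 'poster']
```

This is why the more specific patterns sit ahead of `wordr` and `punct` in the joined list.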
Now run it:
sentence = "That art-deco poster costs $23.40."
nltk.regexp_tokenize(sentence, r)
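To see what the combined pattern does without installing NLTK, here is a rough stand-in using `re.findall`. It makes two assumptions: the `url` pattern is left out because it is not defined in this part of the post, and the capturing parentheses are rewritten as non-capturing `(?:...)`. With capturing groups, `re.findall` returns tuples of groups rather than whole tokens, and depending on the NLTK version `regexp_tokenize` may behave the same way, so the non-capturing form is the safer one to try if the call above gives odd output.

```python
import re

# Non-capturing versions of the patterns above (url omitted).
hyphen = r'(?:\w+-\s?\w+)'
apostrophe = r"(?:\w+'\w+)"
numbers = r'(?:(?:\$|#)?\d+\.?\d+%?)'
wordr = r'(?:\w+)'
punct = r'(?:[^\w\s]+)'

r = "|".join([hyphen, apostrophe, numbers, wordr, punct])

sentence = "That art-deco poster costs $23.40."
tokens = re.findall(r, sentence)
print(tokens)  # ['That', 'art-deco', 'poster', 'costs', '$23.40', '.']
```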