python – How can I split a text into sentences?

The Natural Language Toolkit (nltk.org) has what you need. According to a group posting, this does it:

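import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
# (the Punkt model has to be available in your NLTK data directory).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open('test.txt') as fp:
    data = fp.read()
# Print the sentences separated by a divider line.
print('\n-----\n'.join(tokenizer.tokenize(data)))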

(I haven't tried it!)

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

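# -*- coding: utf-8 -*-
import re

# Building blocks for the patterns used below.
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    # Pad the text so patterns that expect surrounding spaces match at the edges.
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # Replace periods that do not end a sentence with the placeholder <prd>.
    text = re.sub(prefixes, r"\1<prd>", text)
    text = re.sub(websites, r"<prd>\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", r" \1<prd> ", text)
    # An acronym followed by a typical sentence starter does end a sentence.
    text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
    text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", r" \1<prd>", text)
    # Move the terminator outside a closing quote so the quote stays with its sentence.
    if "”" in text: text = text.replace(".”", "”.")
    if '"' in text: text = text.replace('."', '".')
    if "!" in text: text = text.replace('!"', '"!')
    if "?" in text: text = text.replace('?"', '"?')
    # Every remaining terminator is a sentence boundary.
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    # Restore the protected periods, then split on the boundary marker.
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences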
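
If the long list of substitutions is hard to follow, the underlying idea is just three steps: hide the periods that do not end a sentence behind a placeholder, mark every terminator that is left as a boundary, then restore the hidden periods and split. Here is a minimal sketch of only that idea. The name split_simple and its short default abbreviation list are made up for illustration, and it covers far fewer cases than the function above.

```python
import re

def split_simple(text, abbreviations=("Mr", "Mrs", "Dr", "Inc")):
    # Hypothetical helper: same masking idea as the full function, far fewer rules.
    # 1. Mask the period after each known abbreviation so it cannot end a sentence.
    pattern = r"\b(" + "|".join(map(re.escape, abbreviations)) + r")\."
    masked = re.sub(pattern, r"\1<prd>", text)
    # 2. Every terminator that is still visible marks a sentence boundary.
    masked = re.sub(r"([.!?])", r"\1<stop>", masked)
    # 3. Restore the masked periods, then split on the boundary marker.
    parts = masked.replace("<prd>", ".").split("<stop>")
    return [p.strip() for p in parts if p.strip()]

print(split_simple("Dr. Adams works at Nike Inc. in Oregon. He starts at 9!"))
```

Anything not in the abbreviation list (acronyms, initials, web addresses) would still be split incorrectly here, which is why the full version needs the extra rules.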

Instead of using a regex to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052
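
One practical caveat: sent_tokenize relies on the Punkt model, which is downloaded separately from the nltk package (via nltk.download; the exact resource name depends on your NLTK version). When the model is missing, the call raises a LookupError. Below is a rough sketch of a wrapper that falls back to a naive regex split in that case. The name safe_sent_tokenize is hypothetical, and the fallback is much cruder than Punkt: it will split after "Dr." for instance.

```python
import re

def safe_sent_tokenize(text):
    # Hypothetical wrapper: prefer NLTK, fall back to a naive regex split.
    try:
        from nltk import tokenize
        return tokenize.sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: the Punkt model has not been downloaded yet.
        # Fallback: split on whitespace that follows '.', '!' or '?'.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(safe_sent_tokenize("Hello world. How are you?"))
```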
