string – n-grams in python, four, five, six grams?
Other users have given great native Python answers, but here's the nltk
approach (just in case the OP gets penalized for reinventing what already exists in the nltk
library).
There is an ngrams function in nltk that people seldom use. That's not because n-grams are hard to read, but because training a model on n-grams where n > 3 results in severe data sparsity.
from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
# ngrams() yields one tuple of n consecutive tokens per window
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print(grams)
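To make the sparsity point concrete, here is a minimal pure-Python sketch (no nltk needed) that counts how many distinct n-grams appear, and how many appear only once, as n grows. The helper name `count_ngrams` and the toy corpus are made up for illustration; on real data the numbers will differ, but the trend is usually the same: the larger n gets, the more n-grams occur exactly once.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count each n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus, chosen only for illustration.
tokens = "the cat sat on the mat and the cat sat on the hat".split()

for n in range(1, 7):
    counts = count_ngrams(tokens, n)
    # n-grams seen exactly once give a model almost nothing to learn from
    singletons = sum(1 for c in counts.values() if c == 1)
    print(n, len(counts), singletons)
```

Each printed row is n, the number of distinct n-grams, and the number of those seen only once.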
I'm surprised that this hasn't shown up yet:
In [34]: sentence = 'I really like python, its pretty awesome.'.split()
In [35]: N = 4
In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]
In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', 'its']
['like', 'python,', 'its', 'pretty']
['python,', 'its', 'pretty', 'awesome.']
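The slicing idea above can also be wrapped in a small reusable function. This is only a sketch of one common variant, built on zip over shifted copies of the token list; the name `ngrams_zip` is hypothetical and not part of any library.

```python
def ngrams_zip(tokens, n):
    """Return the n-grams of a token list as a list of tuples."""
    # Build n copies of the list, each shifted one token further along.
    # zip stops at the shortest copy, so incomplete windows at the end are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = 'I really like python, its pretty awesome.'.split()
for gram in ngrams_zip(sentence, 4):
    print(gram)
```

Unlike the list comprehension, this returns tuples rather than lists, which makes the n-grams usable as dictionary keys.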
Using only nltk tools
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n):
    # Tokenize the text, then join each n-gram tuple into one string
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]
Example output
get_ngrams('This is the simplest text i could think of', 3)
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']
To keep each n-gram as a sequence of tokens instead of a joined string, just remove the ' '.join call.