string – n-grams in python, four, five, six grams?

Other users have given great native Python answers. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

nltk has an ngrams function (in nltk.util) that people seldom use. That's not because n-grams are hard to read, but because training a model based on n-grams where n > 3 results in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)
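To make the sparsity remark above concrete, here is a minimal sketch using only the standard library. The helper names (`ngram_counts`, `singleton_fraction`) and the toy corpus are made up for illustration and are not part of nltk. It counts n-grams and reports what share of the distinct n-grams occur exactly once; that share tends to grow with n.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count each n-gram (as a tuple) in a list of tokens."""
    # zip over n staggered views of the list; zip stops at the shortest one
    return Counter(zip(*(tokens[i:] for i in range(n))))

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = ngram_counts(tokens, n)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy corpus, invented for this example
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in (1, 2, 3, 6):
    # n, number of distinct n-grams, share of them seen only once
    print(n, len(ngram_counts(tokens, n)), singleton_fraction(tokens, n))
```

On a real corpus the effect is the same in kind: the longer the n-gram, the more of them a model sees only once, which is why counts for n > 3 are rarely reliable without a lot of data.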

I'm surprised that this hasn't shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
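Since the question asks about four-, five- and six-grams, the slicing idea above can be wrapped in a small function and called once per n. This is only a sketch; the name `word_ngrams` is hypothetical, and tokenization is a plain whitespace split, so punctuation stays attached to words.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list."""
    # range() is empty when n > len(tokens), so the result is [] in that case
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sentence = "I really like python, it's pretty awesome.".split()

for n in (4, 5, 6):
    grams = word_ngrams(sentence, n)
    # n, how many n-grams were found, and the first one
    print(n, len(grams), grams[0])
```

A sentence of k tokens yields k - n + 1 n-grams, so short inputs run out quickly as n grows.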

Using only nltk tools

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n):
    # Tokenize the text, then join each n-gram tuple into a single string
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

Example output

get_ngrams('This is the simplest text i could think of', 3)

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep each n-gram as a sequence of tokens (a tuple) instead of a joined string, just remove the ' '.join call.
