How to create a trie in Python

How to create a trie in Python

Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome — or at least space inefficient. But since youre just getting started, I think thats the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:

>>> _end = _end_
>>> 
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
... 
>>> make_trie(foo, bar, baz, barz)
{b: {a: {r: {_end_: _end_, z: {_end_: _end_}}, 
             z: {_end_: _end_}}}, 
 f: {o: {o: {_end_: _end_}}}}

If youre not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (Its like a version of get that also updates the dictionary.)

Next, a function to test whether the word is in the trie:

>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
... 
>>> in_trie(make_trie(foo, bar, baz, barz), baz)
True
>>> in_trie(make_trie(foo, bar, baz, barz), barz)
True
>>> in_trie(make_trie(foo, bar, baz, barz), barzz)
False
>>> in_trie(make_trie(foo, bar, baz, barz), bart)
False
>>> in_trie(make_trie(foo, bar, baz, barz), ba)
False

Ill leave insertion and removal to you as an exercise.

Of course, Unwinds suggestion wouldnt be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters — 27 if we include _end. Also, theres nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.

Finally, Ill add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.

Have a look at this:

https://github.com/kmike/marisa-trie

Static memory-efficient Trie structures for Python (2.x and 3.x).

String data in a MARISA-trie may take up to 50x-100x less memory than
in a standard Python dict; the raw lookup speed is comparable; trie
also provides fast advanced methods like prefix search.

Based on marisa-trie C++ library.

Heres a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/

At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.

I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.

What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.

There are also a couple of pure-python implementations, though unless youre on a restricted platform youd want to use the C++ backed implementation above for best performance:

How to create a trie in Python

Here is a list of python packages that implement Trie:

  • marisa-trie – a C++ based implementation.
  • python-trie – a simple pure python implementation.
  • PyTrie – a more advanced pure python implementation.
  • pygtrie – a pure python implementation by Google.
  • datrie – a double array trie implementation based on libdatrie.

Leave a Reply

Your email address will not be published. Required fields are marked *