Note: this is Python 2.7 compatible, not Python 3 (for now)
A trie is a search tree that optimizes word dictionary traversal by organizing words in a tree, character by character.
Given a dictionary containing "hell", "hello", and "help", the following tree represents the words in our dictionary:
h
|
e
|
l
/ \
/ \
p l
| / \
_ _ o
You can follow any path down the tree, keeping track of the characters, until you hit an underscore. At this point, you have a valid word. This data structure is particularly useful for efficient autocorrection or finding the shortest/longest word that starts with a given substring.
In Python, a trie can be represented with nested dictionaries like so:
{'h': {'e' : {'l' : {'l': {
'o': {
' ': {}
},
' ': {}
},
'p': {
' ': {}
},
}
}
}where, instead of an underscore, we indicate the end of valid word with a space and empty dictionary.
This library extends common dictionary indexing to work with tries, for example, trie['hel'] returns a subtrie:
trie['hel']
>>> {'p': {' ': {}},
'l': {' ': {},
'o': {' ': {}}}}and
'hello' in trie
>>> Truetests whether a key is contained in the trie. It also adds features like iterating over the words in the trie, etc.
To start using Dictrie, clone this repository:
git clone https://github.com/sufyanAbbasi/dictrieor download the dictrie.py file here: https://github.com/sufyanAbbasi/dictrie/dictrie.py
Move this file to your working directory (sorry, no pip yet!) and run the following:
from dictrie import Dictrie
if __name__ == "__main__":
#initialize a trie with an existing word list
trie = Dictrie(['hell', 'help', 'hello'])
#add some more words to the list
trie.build_list(['hellish', 'hellcat'])
#or
for word in ['hellish', 'hellcat']:
trie[' '] = word
#access a subtrie:
trie['hel']
#test if a key exists in the trie:
'hel' in trie
#test if a word exists in the trie:
trie.is_word('hello')
#iterate over all the words in a trie
for word in trie
print word
#iterate over the words that start with a given string:
for word in trie.get_words('hell'):
print word
Initialize a bare trie by:
trie = Dictrie()Or by supplying any number of iterables (list, set, etc.) of words:
trie = Dictrie(['hell', 'help', 'hello'], {'hellish', 'hellcat'})Or by supplying a trie with the dict keyword:
trie = Dictrie(dict={'h':{e:{y:{' ': {}}}}})The following two methods adds words to the dictionary: As a function:
trie.build_trie(['hell', 'help', 'hello'])Through iteration:
for word in ['hell', 'help', 'hello']:
trie[' '] = wordHere, the key is ignored in this form and each word is automatically placed in the trie.
The Dictrie class extends dictionaries to allow indexing by word substrings. For example:
trie['hel']produces a subtrie of the words that start with the key:
>>> {'p': {' ': {}}, 'l': {' ': {}, 'o': {' ': {}}}}The Dictrie class supports the in keyword, which checks if the sequence of characters exists in the trie:
'hel' in trie
>>> TrueTo test if a valid word exists in the trie, use the is_word(<string>) method:
trie.is_word('hel')
>>> False
trie.is_word('hello')
>>> Truefor word in trie:
print wordprints every word in the trie, from shortest to longest:
>>> hell
help
hello
trie.get_words(<string>) returns a generator that iterates over the words that begin with in alphabetical order.
for word in trie.get_words('hell'):
print wordprints every word that begins with hell from shortest to longest:
>>> hell
hello
github.com/dwy/english-words is a fantastic github repo with over 450,000 english words. Download the text file, words_alpha.txt, and place it in your working directory, and run:
with open('words_alpha.txt') as fp:
for word in fp:
trie[' '] = word.strip()which builds a trie containing all of the available words. Then run:
for word in trie.get_words('trie'):
print wordto list all words in the dictionary which starts with trie in size order:
>>> tried
trier
tries
triene
triens
triers
triedly
...
trieciously
triennially
trierarchal
trierarchic
trienniality
trierarchies
triethylamine
triethanolamine
triethylstibine
- Limit key type to strings
- Figure out how to better deal with iteration on a subtrie
- Iterate on the subtrie itself or the words in the list?
- Test for robustness
Luciano Romalho's, Fluent Python, is an amazing resource for taking your Python skill to the next level. I would highly recommend picking it up!