How to implement autocomplete on a massive dataset

As I pointed out in How to implement incremental search on a list you should use structures like a Trie or Patricia trie for searching patterns in large texts.

And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.

When I insert some new text into the Trie, I just insert it, then remove the first character, insert again, remove the second character, insert again … and so on until the whole text is consumed. Then you can discover every substring of every inserted text by just one search from the root. That resulting structure is called a Suffix Tree and there are a lot of optimizations available.

And it is really incredible fast. To find all texts that contain a given sequence of n characters you have to inspect at most n nodes and perform a search on the list of children for every node. Depending on the implementation (array, list, binary tree, skip list) of the child node collection, you might be able to identify the required child node with as few as 5 search steps assuming case insensitive latin letters only. Interpolation sort might be helpful for large alphabets and nodes with many children as those usually found near the root.

Leave a Comment Cancel reply