Needle in a Haystack: A Nifty Large-Scale Text Search Algorithm Tutorial

View all articles

When coming across the term “text search”, one usually thinks of a large body of text, which is indexed in a way that makes it possible to quickly look up one or more search terms when they are entered by a user. This is a classic problem for computer scientists, to which many solutions exist.

But how about a reverse scenario? What if what’s available for indexing beforehand is a group of search phrases, and only at runtime is a large body of text presented for searching? These questions are what this trie data structure tutorial seeks to address.

text search algorithm tutorial using tries

Applications

A real world application for this scenario is matching a number of medical theses against a list of medical conditions and finding out which theses discuss which conditions. Another example is traversing a large collection of judicial precedents and extracting the laws they reference.

Direct Approach

The most basic approach is to loop through the search phrases, and search through the text each phrase, one by one. This approach does not scale well. Searching for a string inside another has the complexity O(n). Repeating that for m search phrases leads to the awful O(m * n).

The (likely only) upside of a direct approach that it is simple to implement, as apparent in the following C# snippet:

String[] search_phrases = File.ReadAllLines ("terms.txt");
String text_body = File.ReadAllText("body.txt");

int count = 0;
foreach (String phrase in search_phrases)
    if (text_body.IndexOf (phrase) >= 0)
        ++count;

Running this code on my development machine [1] against a test sample [2], I got a runtime of 1 hour 14 minutes – far beyond the time you need to grab a cup of coffee, get up and stretch, or any other excuse developers use to skip work.

A Better Approach - The Trie

The previous scenario can be enhanced in a couple of ways. For example, the search process can be partitioned and parallelized on multiple processors/cores. But the reduction in runtime achieved by this approach (total runtime of 20 minutes assuming perfect division over 4 processors/cores) does not justify the added complexity to coding/debugging.

The best possible solution would be one that traverses the text body only once. This requires search phrases to be indexed in a structure that can be transversed linearly, in parallel with the text body, in one pass, achieving a final complexity of O(n).

A data structure that is especially well-suited for just this scenario is the trie. This versatile data structure is usually overlooked and not as famous as other tree-related structures when it comes to search problems.

Toptal’s previous tutorial on tries provides an excellent introduction to how they are structured and used. In short, a trie is a special tree, capable of storing a sequence of values in such a way that tracing the path from the root to any node yields a valid subset of that sequence.

So, if we can combine all the search phrases into one trie, where each node contains a word, we will have the phrases laid out in a structure where simply tracing from the root downwards, via any path, yields a valid search phrase.

The advantage of a trie is that it significantly cuts search time. To make it easier to grasp for the purposes of this trie tutorial, let’s imagine a binary tree. Traversing a binary tree has the complexity of O(log2n), since each node branches into two, cutting the remaining traversal in half. As such, a ternary tree has the traversal complexity of O(log3n). In a trie, however, the number of child nodes is dictated by the sequence it is representing, and in the case of readable/meaningful text, the number of children is usually high.

Text Search Algorithm

As a simple example, let’s assume the following search phrases:

  • “same family”
  • “different family”
  • “separate existence”
  • “members of the league”

Remember that we know our search phrases beforehand. So, we start by building an index, in the form of a trie:

trie index

Later, the user of our software presents it with a file containing the following text:

The European languages are members of the same family. Their separate existence is a myth.

The rest is quite simple. Our algorithm will have two indicators (pointers, if you like), one starting at the root, or “start” node in our trie structure, and the other at the first word in the text body. The two indicators move along together, word by word. The text indicator simply moves forward, while the trie indicator traverses the trie depth-wise, following a trail of matching words.

The trie indicator returns to start in two cases: When it reaches the end of a branch, which means a search phrase has been found, or when it encounters a non-matching word, in which case no match has been found.

One exception to the movement of the text indicator is when a partial match is found, i.e. after a series of matches a non-match is encountered before the branch ends. In this case the text indicator is not moved forward, since the last word could be the beginning of a new branch.

Let’s apply this algorithm to our trie data structure example and see how it goes:

Step Trie Indicator Text Indicator Match? Trie Action Text Action
0 start The - Move to start Move to next
1 start European - Move to start Move to next
2 start languages - Move to start Move to next
3 start are - Move to start Move to next
4 start members members Move to members Move to next
5 members of of Move to of Move to next
6 of the the Move to the Move to next
7 the same - Move to start -
8 start same same Move to same Move to next
9 same family family Move to start Move to next
10 start their - Move to start Move to next
11 start separate separate Move to separate Move to next
12 separate existence existence Move to start Move to next
13 start is - Move to start Move to next
14 start a - Move to start Move to next
15 start myth - Move to start Move to next


As we can see, the system successfully finds the two matching phrases, “same family” and “separate existence”.

Real World Example

For a recent project, I was presented with the following problem: a client has a large number of articles and PhD theses relating to her field of work, and has generated her own list of phrases representing specific titles and rules relating to the same field of work.

Her dilemma was this: given her list of phrases, how does she link the articles/theses to those phrases? The end goal is to be able to randomly pick a group of phrases and immediately have a list of articles/theses that mention those particular phrases ready for grabbing.

As discussed previously, there are two parts to solving this problem: Indexing the phrases into a trie, and the actual search. The following sections provide a simple implementation in C#. Please note that file handling, encoding issues, text cleanup and similar problems are not handled in these snippets, since they are out of the scope of this article.

Indexing

The indexing operation simply traverses phrases one by one and inserts them into the trie, one word per node/level. Nodes are represented with the following class:

class Node
{
    int PhraseId = -1;
    Dictionary<String, Node> Children = new Dictionary<String, Node>();

    public Node() { }
    public Node(int id)
    {
        PhraseId = id;
    }
}

Each phrase is represented by an ID, which can be as simple as an incremental number, and passed to the following indexing function (variable root is the actual root of the trie):

void addPhrase(ref Node root, String phrase, int phraseId)
{
    // a pointer to traverse the trie without damaging
    // the original reference
    Node node = root;

    // break phrase into words
    String[] words = phrase.Split ();

    // start traversal at root
    for (int i = 0; i < words.Length; ++i)
    {
        // if the current word does not exist as a child
        // to current node, add it
        if (node.Children.ContainsKey(words[i]) == false)
            node.Children.Add(words[i], new Node());

        // move traversal pointer to current word
        node = node.Children[words[i]];

        // if current word is the last one, mark it with
        // phrase Id
        if (i == words.Length - 1)
            node.PhraseId = phraseId;
    }
}

Searching

The search process is a direct implementation of the trie algorithm discussed in the tutorial above:

void findPhrases(ref Node root, String textBody)
{
    // a pointer to traverse the trie without damaging
    // the original reference
    Node node = root;

    // a list of found ids
    List<int> foundPhrases = new List<int>();

    // break text body into words
    String[] words = textBody.Split ();

    // starting traversal at trie root and first
    // word in text body
    for (int i = 0; i < words.Length;)
    {
        // if current node has current word as a child
        // move both node and words pointer forward
        if (node.Children.ContainsKey(words[i]))
        {
            // move trie pointer forward
            node = node.Children[words[i]];

            // move words pointer forward
            ++i;
        }
        else
        {
            // current node does not have current
            // word in its children

            // if there is a phrase Id, then the previous
            // sequence of words matched a phrase, add Id to
            // found list
            if (node.PhraseId != -1)
                foundPhrases.Add(node.PhraseId);


            if (node == root)
            {
                // if trie pointer is already at root, increment
                // words pointer
                ++i;
            }
            else
            {
                // if not, leave words pointer at current word
                // and return trie pointer to root
                node = root;
            }
                
        }
    }

    // one case remains, word pointer as reached the end
    // and the loop is over but the trie pointer is pointing to
    // a phrase Id
    if (node.PhraseId != -1)
        foundPhrases.Add(node.PhraseId);
}

Performance

The code presented here is extracted from the actual project and has been simplified for the purpose of this document. Running this code again on the same machine [1] and against the same test sample [2] resulted in a runtime of 2.5 seconds for building the trie and 0.3 seconds for the search. So much for break time, eh?

Variations

It’s important to acknowledge that the algorithm as described in this trie tutorial can fail in certain edge cases, and therefore is designed with predefined search terms already in mind.

For example, if the beginning of one search term is identical to some part of another search term, as in:

  • to share and enjoy with friends”
  • “I have two tickets to share with someone”

and the text body contains a phrase that causes the trie pointer to start down the wrong path, such as:

I have two tickets to share and enjoy with friends.

then the algorithm will fail to match any term, because the trie indicator will not return to the start node until the text indicator has already passed the beginning of the matching term in the text body.

It is important to consider whether this sort of edge case is a possibility for your application before implementing the algorithm. If so, the algorithm can be modified with additional trie indicators to track all of the matches at any given time, instead of just one match at a time.

Conclusion

Text search is a deep field in computer science; a field rich with problems and solutions alike. The kind of data I had to deal with (23MB of text is a ton of books in real life) might seem like a rare occurrence or a specialized problem, but developers who work with linguistics research, archiving, or any other type of data manipulation, come across much larger amounts of data on a regular basis.

As is evident in the trie data structure tutorial above, it is of great importance to carefully choose the correct algorithm for the problem at hand. In this particular case, the trie approach cut the runtime by a staggering 99.93%, from over an hour to less than 3 seconds.

By no means is this the only effective approach out there, but it is simple enough, and it works. I hope you have found this algorithm interesting, and wish you the best of luck in your coding endeavors.


[1] The machine used for this test has the following specs:

  • Intel i7 4700HQ
  • 16GB RAM

Testing was done on Windows 8.1 using .NET 4.5.1 and also Kubuntu 14.04 using the latest version of mono and the results were very similar.

[2] The test sample consists of 280K search phrases with a total size of 23.5MB, and a text body of 1.5MB.

About the author

Ahmed Al-Amir, Australia
member since March 16, 2014
Ahmed is an entrepreneur with a vivid imagination and 9 years of experience developing high performance applications. He is an expert with data storage/manipulation and high precision industrial applications. He is self-motivated and can work alone or as a part of a team. [click to continue...]
Hiring? Meet the Top 10 Freelance Algorithm Developers for Hire in December 2016

Comments

Pooyan Khosravi
Why not use regular expressions?
Guest
Wouldn't it be better to have a trie for each document? I mean, you would need much more space, but at least you could make it much more faster.
Alaa Murad
Interesting read, but I would go for something that is big data platform, you may want to look for Hadoop & MapReduce also Spark might also better because it's more real-time. That is from Java world, but I'm sure .NET people may have something similar.
Francisco Yllera
if you want to save space, due to the nature of the trie's big space complexity, you may want to use a DAWG http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton or a radix tree http://en.wikipedia.org/wiki/Radix_tree
John Wooten
While these search strategies work and are excellent for some purposes, we have developed a patented technology that searches through volumes of text for "Items of Value" which are specified in a concept network. It is a different approach wherein there is no indexing, and no matching of phrases etc. Instead, we use something akin to the way we think parts of the brain works, like "hey, this word is a word that I might be interested in. Send out a notification that I have been found". Other parts of the notification network, get the message and say "Oh, i already heard from another word that was near this one and with these two, I'll let the network know that we have two interesting words within 5 words of each other". This continues upward until we get one element of the network saying, "I have a 'hit'" on a concept I was interested in. Here are the pieces that are involved and the reason that I was selected. This method is very fast, very scalable, and depends only upon the size of the concept network. It is currently in use by the Department of Energy for doing automatic classification and declassification of documents and emails. Just pointing out that there are ways to solve the problem that don't involve indexing or "searching". Check out http://dmehelp.com for a location where it is used to handle real questions about Durable Medical Equipment.
Ahmad Abd-Elghany
Well done Mr. fancy pants :D, I didn't get the whole thing due to my algorithms ignorance, but the code was clean. well done.
Bpendragon
I actually need this exact functionality for a system I'm working on, and probably would never have thought of this. Thank you for the insight.
Bpendragon
I see a couple reasons not to. The first is space complexity, implemented properly in the form of [Ss]ame|[Dd]ifferent.. it might have similar results, implemented improperly you start running towards the O(m*n) discussed above. The other is extensibility, in this example you can dynamically create your tries. Something a touch more difficult with regex. Though for all we know it might optimize to essentially regex.
Pooyan Khosravi
Regular expression _are_ represented by normal strings. Like building a Trie whenever your match criteria changes, you can compile a new regular expression. Regular expressions usually have O(1) space complexity. Maximum space complexity is longest possible match. For this example, maximum space requirement is longest path in Trie. Keep in mind that Regular expressions solve significantly wider range of problems without increasing space complexity. Since regular expressions are "described", it's hard to "implement" them in a wrong fashion. Tries are just tip of an iceberg compared to optimizations that engines like PCRE are capable of. Consider [PCREJIT](http://man7.org/linux/man-pages/man3/pcrejit.3.html) as an example.
Charles Cook
I can vouch for regular expressions working well for searching large amounts of text for keywords/phrases. In particular I created a CLR SQL UDF to query against a library of books for a paragraph which uses keyword/phrase - works very well.
Safi
Interesting read, thanks! Did you try measuring using regular expressions with alternation (|)? Wouldn't that have similar time/space complexity with less implementaton effort requirements?
Gene Getman
Question: when the trie indicator is at the start node before matching with anything (steps 0-4, 8, 10-11, 13-15 in the example above) .. which terms is it looking for a match to the text indicator? Is it just the first level of the trie ("same", "different", "separate", "members"), or is it checking for matches down every branch? I am asking because it seems the is a major downside in either case: If the algorithm is checking each word to just the fist level of the trie, there is the possibility of not matching to words that may be at deeper levels, even though they are not thematically less important to the search. If the algorithm, is checking each word to each search term down every branch, they is it not computationally different or more efficient than the brute force direct approach.
comments powered by Disqus
Subscribe
The #1 Blog for Engineers
Get the latest content first.
No spam. Just great engineering and design posts.
The #1 Blog for Engineers
Get the latest content first.
Thank you for subscribing!
You can edit your subscription preferences here.
Trending articles
Relevant technologies
About the author
Ahmed Al-Amir
C/C++ Developer
Ahmed is an entrepreneur with a vivid imagination and 9 years of experience developing high performance applications. He is an expert with data storage/manipulation and high precision industrial applications. He is self-motivated and can work alone or as a part of a team.