String Search

R-way Trie:

Deletion:

Walk all the way down the trie, and untag the bottom node that is tagged "isWord".
If this node doesn't have any children, delete and go back up
Delete all the nodes on way back up, until a node has tag "isWord" is encountered.

Challenge: use less memory

Ternary Search Trie

Store characters and values in nodes (not keys).
Each node has 3 children: smaller (left), equal (middle), larger (right).

Follow links corresponding to each character in the key.

If less, take left link; if greater take right link.
If equal, take the middle link and move to the next key character.

More space efficient than R-way trie: more nodes (one more for each character), but many fewer null links.

Even better - TST with R^2 branching at root

Hybrid of R-way trie and TST

Do R^2-way branching at root.
Each of R^2 root nodes points to a TST

TST comparisons bottom line

Faster than hashing (especially for search misses, since hash computation is based on the whole string, plus possible collisions).
More flexible than red-black BSTs.

Note: when radix is small, R-way trie is better.

Application

1) Ordered iteration - to iterate through all keys in sorted order:

Do in order traversal of trie; add keys encountered to a queue.
Maintain sequence of characters on path from root to node.

2) Prefix matches in an R-way trie. Collect all keys in the subtree corresponding to a prefix. (can do ordered iteration)

3) Longest prefix. Find longest key in symbol table that is a prefix of query string. e.g. router chooses IP address in routing table that is longest prefix match.

search for query string.
Keep track of longest key encountered. (e.g. last nodes with "isWord" tag)

Patricia trie

Remove one-way branching.
Each node represents a sequence of characters. (compressed singles children)

Applications: database search, P2P network search, IP routing table (find longest prefix match), storing and querying XML documents.

Suffix tree

Patricia trie of suffixes of a string.
Linear-time construction

Applications:

Linear-time longest repeated substring, longset common substring, longset palindromic substring, substring search, tandem repeats...
Computational biology databases

For now, know how to construct it in O(n^2)

Traverse backwards, similar to patricia trie

TODO: Linear construction using Ukkonen's algorithm: https://youtu.be/aPRqocoBsFQ

based on the O(n^3) algorithm
online algorithm - can be updated with new characters

Exercise Questions

1) Prefix free codes. In data compression, a set of binary strings is if no string is a prefix of another. For example, {01,10,0010,1111} is prefix free, but {01,10,0010,10100} is not because 10 is a prefix of 10100. Design an efficient algorithm to determine if a set of binary strings is prefix-free. The running time of your algorithm should be proportional the number of bits in all of the binary stings.

Insert the binary strings into a 2-way trie.

_Remark: _it's also possible to solve this problem using radix sorting or a ternary search trie.

2) Boggle. Boggle is a word game played on an 4-by-4 grid of tiles, where each tile contains one letter in the alphabet. The goal is to find all words in the dictionary that can be made by following a path of adjacent tiles (with no tile repeated), where two tiles are adjacent if they are horizontal, vertical, or diagonal neighbors

R-way trie.

3) Suffix trees. Learn about and implement , the ultimate string searching data structure.

Search