String Search
R-way Trie:
Deletion:
- Walk all the way down the trie, and untag the bottom node that is tagged "isWord".
- If this node doesn't have any children, delete and go back up
- Delete all the nodes on way back up, until a node has tag "isWord" is encountered.
Challenge: use less memory
Ternary Search Trie
- Store characters and values in nodes (not keys).
- Each node has 3 children: smaller (left), equal (middle), larger (right).
Follow links corresponding to each character in the key.
- If less, take left link; if greater take right link.
- If equal, take the middle link and move to the next key character.
More space efficient than R-way trie: more nodes (one more for each character), but many fewer null links.
Even better - TST with R^2 branching at root
Hybrid of R-way trie and TST
- Do R^2-way branching at root.
- Each of R^2 root nodes points to a TST
TST comparisons bottom line
- Faster than hashing (especially for search misses, since hash computation is based on the whole string, plus possible collisions).
- More flexible than red-black BSTs.
Note: when radix is small, R-way trie is better.
Application
1) Ordered iteration - to iterate through all keys in sorted order:
- Do in order traversal of trie; add keys encountered to a queue.
- Maintain sequence of characters on path from root to node.
2) Prefix matches in an R-way trie. Collect all keys in the subtree corresponding to a prefix. (can do ordered iteration)
3) Longest prefix. Find longest key in symbol table that is a prefix of query string. e.g. router chooses IP address in routing table that is longest prefix match.
- search for query string.
- Keep track of longest key encountered. (e.g. last nodes with "isWord" tag)
Patricia trie
- Remove one-way branching.
- Each node represents a sequence of characters. (compressed singles children)
Applications: database search, P2P network search, IP routing table (find longest prefix match), storing and querying XML documents.
Suffix tree
- Patricia trie of suffixes of a string.
- Linear-time construction
Applications:
- Linear-time longest repeated substring, longset common substring, longset palindromic substring, substring search, tandem repeats...
- Computational biology databases
For now, know how to construct it in O(n^2)
- Traverse backwards, similar to patricia trie
TODO: Linear construction using Ukkonen's algorithm: https://youtu.be/aPRqocoBsFQ
- based on the O(n^3) algorithm
- online algorithm - can be updated with new characters
Exercise Questions
1) Prefix free codes. In data compression, a set of binary strings is if no string is a prefix of another. For example, {01,10,0010,1111} is prefix free, but {01,10,0010,10100} is not because 10 is a prefix of 10100. Design an efficient algorithm to determine if a set of binary strings is prefix-free. The running time of your algorithm should be proportional the number of bits in all of the binary stings.
Insert the binary strings into a 2-way trie.
_Remark: _it's also possible to solve this problem using radix sorting or a ternary search trie.
2) Boggle. Boggle is a word game played on an 4-by-4 grid of tiles, where each tile contains one letter in the alphabet. The goal is to find all words in the dictionary that can be made by following a path of adjacent tiles (with no tile repeated), where two tiles are adjacent if they are horizontal, vertical, or diagonal neighbors
R-way trie.
3) Suffix trees. Learn about and implement , the ultimate string searching data structure.