[ACE] String Matching 2

Useful definitions

  • Prefix: a string w is a prefix of a string x, denoted w ⊏ x
  • e.g., ab ⊏ abcca
  • Suffix: a string w is a suffix of a string x, denoted w ⊐ x
  • e.g., cca ⊐ abcca
  • the empty string ε is both a prefix and a suffix of every string

Automata-based matching

The brute force approach used by NaiveStringMatch is simple,
but not very efficient, when a match fails, it simply increments the shift and tries again, however in some cases, the next shift can’t possibly be valid, given what we know about the pattern and the text character that failed to match.

  • One way of encoding the structure of the pattern is by using an automaton that scans T for an occurrence of P
  • The automaton examines each character in the text exactly once, giving a worst case matching time of O(n)
  • However, automata-based matching algorithms require a pre-processing step to build the automaton from the pattern

Suffix function for a pattern

Before, we get into this automata-based matching method, we should know another important concept:
enter image description here

It is hard to understand at first, so i want to explain it again:
P is our pattern, so when we scan from the first letter to one letter, if there is matching(no matter how long), the matching part must be the prefix of P and suffix of text(which is x in this case).
So this Σ(x) means the largest matching for P on the input x.
For example, if our pattern is ab,
Σ(ccaca) = 1, because only prefix of pattern a can match suffix of ccac a(the bold a)
Σ(ccab) = 2;

how transition function works

enter image description here
The illustration of automata-based matching is as above.
We can see if the next input symbol is not what we want for next, it might not go back to the original(initial) state but go to some state in the middle.
This is how we avoid the “pointless” shift!
But in the NaiveStringMatching algorithm, if the next input is wrong, we just go back to check the initial state which is waste of time.
e.g., in state 5, if the automaton reads a ‘c’ (a successful match) it moves to state 6; however if, as in the example, it reads T[6] = b, it returns to state 4

Pseudo-code

There are two parts of the code, one is AutomatonStringMatching, one is to ComputeTransitionFunction(pre-processing step)
enter image description here
enter image description here

The first one is simple, we just traversal all the text and check the transition function, if the result of transition function is the accepting state(means we find the pattern), it returns.
However, the second part of code, which is to create this transition function is hard to understand.

But if you try it step by step, you will know it.
To briefly explain it:

  • The first(outer) for is to traversal all the symbol in the pattern. Because we want to draw the transition diagram, the states is every symbol in P. We draw from the first symbol and check what will happened when there is a alphabet coming in and record it.
  • The for each loop is to check every alphabet.
  • K means where do you start searching the next state.
  • The while loop must be the difficult one. Pq means the current states, and Pq a means the next states. If Pk ==Pq a means if Pk is your next state, we stop decrement k and record the transition function delta (q,a) = k. if not, it means, we haven’t found the next state, we decrement k.

The code of automata-based matching is here

Complexity

  • running time of ComputeTransitionFunction is O(m^3 |Σ|)
  • by using information about P, we can compute transition function in O(m|Σ|)
  • matching time of AutomatonStringMatch is O(n), giving an overall running time of O(n+ m|Σ|)

if |Σ| is small and patterns and texts are large, e.g., matching DNA
sequences, then an automata approach can be worthwhile
if |Σ| is large generating an automaton may not be worthwhile (unless we are matching the same pattern against many texts)

评论

此博客中的热门博文

[MLE] Linear Classification

[AIM] MetaHeuristics

[CS231] Neural Networks