
I have been stuck for some time on which is the fastest string search algorithm. I have heard many opinions, but in the end I'm not sure.

I have heard some people saying that the fastest algorithm is Boyer-Moore and some saying that Knuth-Morris-Pratt is actually faster.

I have looked up the complexity of both, and they look mostly the same: O(n+m). I have found that in the worst case Boyer-Moore has O(nm) complexity, compared to Knuth-Morris-Pratt's O(m+2n), where n = length of the text and m = length of the pattern.

As far as I know, Boyer-Moore has a linear worst-case time if I use the Galil rule.

My question: overall, which is actually the fastest string search algorithm? (This question includes all possible string search algorithms, not just Boyer-Moore and Knuth-Morris-Pratt.)

Edit: In response to this answer

What I'm exactly looking for is:

Given a text T and a pattern P I have to find all the appearances of P in T.

Also, the lengths of P and T are in the range [1, 2 000 000], and the program has to run in under 0.15 sec.

I know that KMP and Rabin-Karp are enough to get a 100% score on the problem, but I for one wanted to try to implement Boyer-Moore. Which would be best for this type of pattern search?
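For concreteness, by "KMP" I mean the textbook search below: find the starting index of every occurrence of P in T in O(n+m). (This is a sketch; the function name and layout are my own.)

```cpp
#include <cassert>
#include <string>
#include <vector>

// Textbook Knuth-Morris-Pratt: returns the starting index of every
// occurrence of pattern p in text t, in O(|p| + |t|) time.
std::vector<int> kmp_search(const std::string& p, const std::string& t) {
    const int m = p.size();
    std::vector<int> hits;
    if (m == 0) return hits;
    std::vector<int> fail(m, 0);                 // failure (prefix) function
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && p[i] != p[k]) k = fail[k - 1];
        if (p[i] == p[k]) ++k;
        fail[i] = k;
    }
    for (int i = 0, k = 0; i < (int)t.size(); ++i) {
        while (k > 0 && t[i] != p[k]) k = fail[k - 1];
        if (t[i] == p[k]) ++k;
        if (k == m) {                            // full match ends at i
            hits.push_back(i - m + 1);
            k = fail[k - 1];                     // keep looking for overlapping matches
        }
    }
    return hits;
}
```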

  • When you tested these out in your language of choice, what did you find?
    – Walter
    Commented Jan 15, 2013 at 20:54
  • On some tests Boyer-Moore was better, on others KMP was better, but I'm not sure I have the "best" implementation of them. As for the language of choice, it is in the tags: C++ (not sure if you saw that, since you wrote "language of choice"). P.S. I am also not sure if I tested on the best tests. Commented Jan 15, 2013 at 20:57
  • stackoverflow.com/q/3183582 Commented Jan 15, 2013 at 23:21
  • Knuth-Morris-Pratt which has O(m+2*n) ... You mean O(m+n).
    – Jules
    Commented Feb 16, 2017 at 10:19
  • Pick one with a decent algorithmic complexity and then micro-tune the crap out of it with a profiler in hand -- always worked for me. :-D
    – user204677
    Commented Dec 9, 2017 at 23:07

3 Answers


It depends on the kind of search you want to perform. Each of the algorithms performs particularly well for certain types of search, but you have not stated the context of your searches.

Here are some typical thoughts on search types:

  • Boyer-Moore: works by pre-analyzing the pattern and comparing from right-to-left. If a mismatch occurs, the initial analysis is used to determine how far the pattern can be shifted w.r.t. the text being searched. This works particularly well for long search patterns. In particular, it can be sub-linear, as you do not need to read every single character of your text.

  • Knuth-Morris-Pratt: also pre-analyzes the pattern, but tries to re-use whatever was already matched in the initial part of the pattern to avoid having to rematch that. This can work quite well if your alphabet is small (e.g. DNA bases), as you get a higher chance that your search patterns contain reusable subpatterns.

  • Aho-Corasick: needs a lot of preprocessing, but does so for a number of patterns at once. If you know you will be looking for the same search patterns over and over again, then this is much better than the others, because you need to analyse the patterns only once, not once per search.
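To illustrate the Boyer-Moore idea of shifting past text without reading every character, here is a sketch of the Horspool simplification, which keeps only the bad-character rule (the function name is my own; a full Boyer-Moore adds the good-suffix rule on top of this):

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Boyer-Moore-Horspool: compare right-to-left; on a mismatch, shift the
// pattern by the precomputed skip for the text character that was aligned
// with the pattern's last position.
std::vector<int> horspool_search(const std::string& p, const std::string& t) {
    const int m = p.size(), n = t.size();
    std::vector<int> hits;
    if (m == 0 || n < m) return hits;
    std::array<int, 256> shift;
    shift.fill(m);                               // default: skip the whole pattern
    for (int i = 0; i + 1 < m; ++i)
        shift[(unsigned char)p[i]] = m - 1 - i;  // rightmost occurrence wins
    int pos = 0;
    while (pos + m <= n) {
        int k = m - 1;
        while (k >= 0 && t[pos + k] == p[k]) --k;  // compare right-to-left
        if (k < 0) hits.push_back(pos);
        pos += shift[(unsigned char)t[pos + m - 1]];
    }
    return hits;
}
```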

Hence, as usual in CS, there is no definite answer to the overall best. It is rather a matter of choosing the right tool for the job at hand.

Another note on your worst-case reasoning: Do consider the kinds of searches required to create that worst-case and think thoroughly about whether these are really relevant in your case. For example, the O(mn) worst-case complexity of the Boyer-Moore algorithm stems from a search pattern and a text that each use only one character (like finding aaa in aaaaaaaaaaaaaaaaaaaaa) - do you really need to be fast for searches like that?

  • I have the whole English alphabet or so to use, and I updated the question; sorry for not starting with this at the beginning. Commented Jan 16, 2013 at 16:28
  • And yes, I do need to be fast even for searches like that. Commented Jan 16, 2013 at 19:15
  • Can you please elaborate on the Z-algorithm and Manacher's algorithm also?
    – AnV
    Commented Sep 25, 2020 at 16:51

Though I am slightly late to answer this question, I think the Z-algorithm is much faster than its counterparts. Its worst-case complexity is O(m+n), and it requires no preprocessing of the pattern/text. It is also very easy to code compared to the other algorithms.

It works in the following manner.

For example, take the string S = 'abaaba'. We are to find the z(i) values for i = 0 to len(S)-1. Before going into the explanation, let me lay down some definitions first.

z(i) = no. of characters of the prefix of S that matches the prefix of s(i).

s(i) = ith suffix of S.

The following are the s(i) values for S = 'abaaba'.

s(0) = 'abaaba' = S
s(1) = 'baaba'
s(2) = 'aaba'
s(3) = 'aba'
s(4) = 'ba'
s(5) = 'a'

The z values are respectively

z(0) = 6 = length(S)
z(1) = 0
z(2) = 1
z(3) = 3
z(4) = 0
z(5) = 1
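These values can be computed in linear time. A sketch of the standard Z-algorithm in C++, using the convention z(0) = len(S) as above (the function name is mine):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Linear-time Z-algorithm: z[i] = length of the longest common prefix of
// s and the suffix of s starting at i. By convention here, z[0] = |s|.
std::vector<int> z_function(const std::string& s) {
    const int n = s.size();
    std::vector<int> z(n, 0);
    if (n == 0) return z;
    z[0] = n;
    int l = 0, r = 0;                 // [l, r) = rightmost matched segment so far
    for (int i = 1; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);   // reuse earlier work
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}
```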

For a detailed understanding of the algorithm, refer to the following links.

http://codeforces.com/blog/entry/3107

https://www.youtube.com/watch?v=MFK0WYeVEag

Now, it takes O(n) to find all the z values, without any pre-processing overhead. One might wonder: how can you use this logic to match a pattern in a given string?

Let's see with an example. Pattern(P) : aba, Text(T) : aacbabcabaad.

Put this in the form P$T. ($ - any character that does not appear in either pattern or text. I'll come to the importance of $ in a little while.)

P$T = aba$aacbabcabaad

We know len(P) = 3.

All z values of P$T are

z(0) = 16 = len(P$T)
z(1) = 0
z(2) = 1
z(3) = 0
z(4) = 1
z(5) = 1
z(6) = 0
z(7) = 0
z(8) = 2
z(9) = 0
z(10) = 0
z(11) = 3
z(12) = 0
z(13) = 1
z(14) = 1
z(15) = 0

Now find the i for which z(i) = len(P): Ans = 11. So our pattern is present at Ans - len(P) - 1 = 7. The -1 accounts for the $ character.

Now, why is $ (or any such special character) important? Consider P = 'aaa' and T = 'aaaaaaa'. Without the special character, all z(i) will have incremental values. One can still find the positions of the pattern in the text with the following formulae:

Condition: z(i) >= len(P); Position: Ans - len(P). But the condition in this case becomes a little tricky and confusing. I personally prefer to use the special-character technique.
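Putting the whole P$T technique together, a C++ sketch (assuming '$' appears in neither P nor T; the Z-function is repeated here so the snippet compiles on its own, and the names are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Linear-time Z-function (z[0] = |s| by the convention used above).
std::vector<int> z_function(const std::string& s) {
    const int n = s.size();
    std::vector<int> z(n, 0);
    if (n == 0) return z;
    z[0] = n;
    int l = 0, r = 0;
    for (int i = 1; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Find all occurrences of p in t via the P$T concatenation trick.
// Assumes '$' occurs in neither p nor t.
std::vector<int> z_search(const std::string& p, const std::string& t) {
    const std::string s = p + '$' + t;
    const std::vector<int> z = z_function(s);
    const int m = p.size();
    std::vector<int> hits;
    for (int i = m + 1; i < (int)s.size(); ++i)
        if (z[i] == m) hits.push_back(i - m - 1);   // index within t; -1 for '$'
    return hits;
}
```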

  • 2
    Could you explain it yourself here? Having links to external sites can be used to elaborate, but the core of an answer should be in the answer itself rather than having to follow a link to another site.
    – user40980
    Commented Jun 14, 2014 at 13:40
  • The z-algorithm is basically the same as kmp. I doubt it is much faster. Commented Oct 23, 2015 at 9:32
  • 4
    I agree with @ThomasAhle. Computing z is preprocessing. It's a good explanation, though. I put up an O(n) way to convert from KMP pre-processing to Z pre-processing, due to this answer. Here
    – leewz
    Commented Oct 24, 2015 at 7:05

Use content-addressable memory (CAM), implemented in software in the form of virtual addressing (pointing letters to letters).

It's arguably overkill compared to an average string matching algorithm.

CAM can match a huge number of patterns simultaneously, up to about 128-letter patterns (if they are ASCII; if they are Unicode, only 64). It takes one lookup per letter of the string you want to match, and one random read from memory per letter up to the maximum pattern length. So if you were analyzing a 100,000-letter string against up to 90,000,000 patterns simultaneously (which would take about 128 GiB to store a set of patterns that large), it would take 12,800,000 random reads from RAM, so it would happen in about 1 ms.

Here's how the virtual addressing works.

If I start off with 256 starting addresses, which represent the first letter, these letters point to 256 of the next letters. If a pattern is nonexistent, you don't store it.

So if I keep linking letters to letters, it's like having 128 slices of virtual addressing pointing to virtual addressing.

That will work, but to get to 900,000,000 patterns matching simultaneously there's one last trick to add, and it takes advantage of the fact that you start off with a lot of reuse of these letter buffers, but later on the patterns scatter out. If you list the contents, instead of allocating all 256 characters, then it slows down very little, and you get a 100-times capacity increase, because eventually only 1 letter is used in every letter pointer buffer (which I dubbed 'escape').

If you want to get a nearest-neighbour string match, then you have many of these running in parallel and you collect the results in a hierarchy, so you spread your error out unbiased. If you try to nearest-neighbour with just one, then you're biased towards the start of the tree.
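For what it's worth, the "letters pointing to letters" structure described above is essentially a trie. A minimal C++ sketch of that idea, storing only the children that actually occur (the "escape" observation), might look like this (all names are mine):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Each node maps a character to the next node; a map-based node naturally
// stores only the children that occur, rather than all 256 slots.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> next;
    bool terminal = false;   // a stored pattern ends here
};

struct Trie {
    TrieNode root;

    void insert(const std::string& p) {
        TrieNode* cur = &root;
        for (char c : p) {
            auto& child = cur->next[c];
            if (!child) child = std::make_unique<TrieNode>();
            cur = child.get();
        }
        cur->terminal = true;
    }

    // Report every stored pattern that begins at position pos in t:
    // one map lookup per letter, walking at most max-pattern-length steps.
    std::vector<std::string> matches_at(const std::string& t, size_t pos) const {
        std::vector<std::string> out;
        const TrieNode* cur = &root;
        std::string prefix;
        for (size_t i = pos; i < t.size(); ++i) {
            auto it = cur->next.find(t[i]);
            if (it == cur->next.end()) break;
            cur = it->second.get();
            prefix += t[i];
            if (cur->terminal) out.push_back(prefix);
        }
        return out;
    }
};
```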

  • 4
@MagnusRobertCarlWoot given that you have the same gravatar as roucer81, it is either an astronomical coincidence of hash code collision or you have the same email address. If you are the same individual behind both accounts, you should use the "contact us" form to merge them so that you get proper credit for the reputation gained through upvotes on this answer.
    – user40980
    Commented Oct 20, 2015 at 16:40
