2

I have many substrings(2-5 words each) which I would like to search in some text of about 40-50 words length. What is the most efficient way to flag matching substrings.

Currently I am simply using:

for substring in substrings:
   if substring in fullText:
      return True

substrings - the list of strings to be searched

fullText - text to be searched on.

Worst case for this solution is all substrings are searched on fullText repeatedly.

0

1 Answer 1

5

Create a regular expression from your list (something like "word1|word2|word3") and use the regular expression functions available for your language. It will hopefully create a data structure optimized for matching, maybe a finite state machine or something equivalent.

4
  • 1
    There's no "hopefully". Regular expressions are invariably converted to finite state machines. Commented Apr 15, 2018 at 8:46
  • I've seen incredibly naive and stupid solutions to problems where known good algorithms exist, so I'm extra careful with words, but of course you are correct. Commented Apr 15, 2018 at 9:38
  • For python on the test cases I ran, regular expression match worked slower compared to simple string search. Also wanted to mention, I am trying to search sentences within a sentence.
    – skadoosh
    Commented Apr 17, 2018 at 11:50
  • 1
    Compiling the regular expression consumes some time, so your results will depend on the number of patterns and the number of files. A simple string search might be faster for a single file, especially when you have an early match. To really evaluate the relative performance, measure realistic test cases. If your initial nested loop is fastest, you obviously don't need to optimize. Commented Apr 17, 2018 at 12:29

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.