4
\$\begingroup\$

I am writing an algorithm to count the number of times a substring repeats itself. The string is between 1-200 characters ranging from letters a-z. There should be no left overs at the end of the pattern either and it should be split into the smallest possible combination.

answer("abcabcabcabc"):output = 4
answer("abccbaabccba"): output = 2
answer("abcabcd"): output = 1

My code:

import re
def answer(s):
    length = len(s)
    x=[]
    reg = ""
    for i in range(1, length+1):
        if length % i == 0:
            x.append(i)
    repeat = len(x)*10
    for _ in range(0, repeat):
        a = length/max(x)
        print(length)
        print(max(x))
        print(a)
        for _ in range(0, a):
            reg = reg + "."
        exp = re.findall(reg, s)
        print exp
        if all(check==exp[0] for check in exp):
            print len(exp)
            return len(exp)
        elif all(check!=exp[0] for check in exp):
            x.remove(max(x))

This is Python 2.7 code and I don't have to use regex. It just seemed like the easiest way.

Is there a better/faster/more optimal way to do this?

NOTE: it breaks if the string size is too big.

EDIT: Fixed indentation

\$\endgroup\$
7
  • \$\begingroup\$ You list this as a programming-challenge, could you please state the site of this programming challenge? \$\endgroup\$
    – holroy
    Commented Apr 20, 2017 at 23:29
  • \$\begingroup\$ It's a level 1 foobar question. I passed the test, I'm just curious if there is a better way. \$\endgroup\$ Commented Apr 20, 2017 at 23:46
  • \$\begingroup\$ Indentation seems off. Please double check. \$\endgroup\$
    – vnp
    Commented Apr 21, 2017 at 3:21
  • \$\begingroup\$ @vnp where? It works for me \$\endgroup\$ Commented Apr 21, 2017 at 3:47
  • \$\begingroup\$ @JakeSteele for i in range(1, length+1): is indented just as the next line if length % i == 0:. \$\endgroup\$
    – vnp
    Commented Apr 21, 2017 at 3:50

4 Answers 4

5
\$\begingroup\$

I see that there already is an accepted answer, before I got around to writing this answer. That is a little discouraging for me as a reviewer, but I've written a code alternative and have some thoughts regarding your code.

Code review

  • Document your code – To guess what your code is doing by simply reading it is not very easy. It could very well have a few comments on what it tries to achieve.
  • Better variable names – In general try to use variable names describing the content. Names like x, a, exp (which I would read as expression?), and so on, are not very informative.
  • Regex are generally expensive – In general regex, whilst being a fantastic too, are somewhat overused. Do note that it takes time to build the patterns and match the patterns, especially compared to simple string matches.

Some new(?) constructs

I'm not sure if you know these already, but there are a few new constructs I would like to show you:

  • String repeat – Strings can be multiplied (aka duplicated) using the multiplication operator.

    text = "abc"
    print text * 3     # Will output "abcabcabc"
    
  • divmod with multiple outputs – divmod(a, b) will divide a by b and return the divisor and the rest. This can be stored directly into a tuple like in the following:

    div, rest = divmod(13, 5)   # Giving div == 2, rest = 3
    
  • Slice and dice strings – As of Python 2.3 one can slice strings. The most common variants indicates how many characters you want from start like text[:3] or if you want the rest of the text like text[3:]. This can be very useful...
  • A slightly fancier print varant – Using .format in combination with print can produce nicer output rather easily:

    print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], div, rest)
    

    This would output on the same line, something like:

    i: 4, sub: abcd, div: 2, mod: 3
    
  • else-block after for?! – This will make sense later on, but if a for loop completes normally, it'll not enter the optional else:-block. In other words, if you break out of a for loop Python won't enter the else block. This can be used to verify that the for loop actually found/did something, and provide an alternative if it didn't.

    for i in range(5):
        print i
       if i == 3:
          break
    else:
       print "Nothing larger than 3..."
    

Code alternative

I haven't commented too much on your chosen algorithm, as I find it a little confusing, so I thought what is he trying to achieve and what is an alternative approach. I then came up with these demands for the code:

  • You're looking for the maximum repeated substring completely filling out the original string. This means:
    • The candidate substring length, must then divide the original string length without any leftover (or rest characters)
    • The candidate substring can't be more than half the length of the original string, as it then can't be duplicated
    • The first (and shortest) candidate substring will always give the most repeats, if it matches the other criteria

So one way to write this out is like this:

def repeatedSubstring(source):
  """Look for the shortest substring which when repeated equals
     the source string, without any left over characters.
  """
  length = len(source)
  print '\nOriginal text: {}. Length: {}'.format(source, length)

  # Check candidate strings
  for i in range(1, length/2+1):
    repeat_count, leftovers = divmod(length, i)
    # print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], repeat_count, leftovers)
    # print 'repeated: {}'.format(source[:i]*repeat_count)

    # Check for no leftovers characters, and equality when repeated 
    if (leftovers == 0) and (source == source[:i]*repeat_count):
      print 'Found repeated substring: {}, count: {}'.format(source[:i], repeat_count)
      break
  else:
    print "Couldn't find any repeated substring"

repeatedSubstring("abcdeabcde")
repeatedSubstring("abcdeabcdef")
repeatedSubstring("aaaaa")

I've commented out some debug print statements, and left it a little more verbose than the original code.

If you want, you can easily replace the remaining print statements with return div and return 1. A stripped down version would then look like:

def repeatedSubstringCount(source):
  """Look for the shortest substring which when repeated equals
     the source string, without any left over characters.
     Return the maximum repeat count, 1 if none found.
  """
  length = len(source)

  # Check candidate strings
  for i in range(1, length/2+1):
    repeat_count, leftovers = divmod(length, i)

    # Check for no leftovers characters, and equality when repeated 
    if (leftovers == 0) and (source == source[:i]*repeat_count):
      return repeat_count

  return 1

I still left a few comments in there, so that it possible to have some idea on what is happening. I've also changed the name to "repeatedSubstringCount" to indicate both what the function does, but also what it returns.

\$\endgroup\$
2
\$\begingroup\$

Edit: @holroy's answer is faster than this one

This would be my approached on this task:

from itertools import groupby

def all_equal(iterable):
    "Returns True if all the elements are equal to each other"
    "https://docs.python.org/2.7/library/itertools.html#recipes"
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

def divisors(n): # based on vnp's answer
    x = []
    y = []
    i = 1
    while i*i < n:
        if n % i == 0:
            x.append(i)
            y.append(n/i)
        i += 1
    if i*i == n:
        x.append(i)
    x.extend(list(reversed(y)))
    return x

def answer(s):
    n = len(s)
    for divisor in divisors(n):
        # cut the string in divisor-length substring and check if all substrings are equal
        if all_equal(s[i:i+divisor] for i in range(0, n, divisor)):
            return n / divisor

Time measurements:

In [52]: %timeit answer1("abcdefghijklmnopqrstuvwxyz"*10) # Your solution
1000 loops, best of 3: 437 µs per loop

In [53]: %timeit answer("abcdefghijklmnopqrstuvwxyz"*10) # My solution
10000 loops, best of 3: 36.6 µs per loop

In [19]: %timeit repeatedSubstringCount("abcdefghijklmnopqrstuvwxyz"*10) # @holroy's solution
100000 loops, best of 3: 12.2 µs per loop

In [59]: %timeit answer1("abcabcd")
100000 loops, best of 3: 20.4 µs per loop

In [60]: %timeit answer("abcabcd")
100000 loops, best of 3: 8.96 µs per loop

In [21]: %timeit repeatedSubstringCount("abcabcd")
1000000 loops, best of 3: 2.03 µs per loop
\$\endgroup\$
1
  • \$\begingroup\$ Thanks for commenting on the speed issue... I didn't think about that at all when implementing it, I just went with my basic instinct on how I would code it and make it readable. It's always nice when that is fast as well! \$\endgroup\$
    – holroy
    Commented Apr 21, 2017 at 14:12
1
\$\begingroup\$

The loop

    for i in range(1, length+1):
        if length % i == 0:
            x.append(i)

builds a list of the divisors of length. It has a very well defined purpose, and I recommend to factor it out into a function

    def divisors(n)

It could also be optimized. There is no need to encompass the entire range(1, length+1). Notice that the divisors come in pairs: if i divides n, then n/i also divides n:

        i = 1
        while i*i < n:
            if n % i == 0:
                x.append(i)
                x.append(n/i)
                i += 1
            if i*i == n:
                x.append(i)
        return x

This version has the \$O(\sqrt{n})\$ time complexity vs \$O(n)\$ of the original. The resulting list is not sorted, but it is easily amendable:

    x = []
    y = []
    while i*i < n:
            if n % i == 0:
                x.append(i)
                y.append(n/i)
                i += 1
            if i*i == n:
                x.append(i)
        x.append(list(reversed(y)))
    return x

The line

    repeat = len(x)*10

truly stumbles me. Where does 10 come from?

The

    for _ in range(0, len(x)):

seems to work just fine.


The loop above computes max(x) at each iteration, and therefore exhibits a quadratic complexity over the len(x). Since x is sorted, you should just iterate from the end (or reverse x to begin with).


The elif is unjustified. If the code reaches this clause, it is already known that the condition is true - otherwise the function would already return. I recommend

    if all(check==exp[0] for check in exp):
        print len(exp)
        return len(exp)
    x.remove(max(x))

All that said, I am not sure I understand the core logic (or the problem statement, since you said you passed the test). Can you explain the results:

>>> print answer("aba")
1
>>> print answer("aaba")
1
>>> print answer("aabaa")
None
\$\endgroup\$
2
  • \$\begingroup\$ Aren't the first two cases you asked about just without any repeating substring, so it is one string which is repeated exactly once? The last one should also be 1 in that case, though. \$\endgroup\$
    – Graipher
    Commented Apr 21, 2017 at 10:26
  • \$\begingroup\$ Whenever I ran it for a larger string with close to 200 characters it would break. I have no idea why the 10 makes a difference, but it didn't work without making the range bigger. \$\endgroup\$ Commented Apr 21, 2017 at 13:02
1
\$\begingroup\$

There is a more elegant way to do that.

def answer(s):
    return len(s) // (s+s).find(s, 1)

This works because the first alignment of a string on itself doubled is exactly at the length of the repeated pattern.

abcabcabcabcabcabcabcabc
   abcabcabcabc
   ^ first alignment at index 3 (len(s)/3 = 4)
abccbaabccbaabccbaabccba
      abccbaabccba
      ^ first alignment at index 6 (len(s)/6 = 2)
abcabcdabcabcd
       abcabcd
       ^ first alignment at index 7 (len(s)/7 = 1)
\$\endgroup\$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.