7

Part of our ERP system is a sub-system for running background jobs. We track a variety of meta-data about our jobs in a table including timestamps for submitted, started, and end times.

I'm creating a report showing the performance of our job system, detailed by day. One KPI is maximum # of jobs running at once. The algorithm I'm currently using is:

dim cnt as integer    'number of overlapping jobs
dim max as integer    'maximum number of jobs running at one time

for each Job where 
         Job.SubmitTs = today

  'bJob is a 2nd instance of Job
  for each bJob where
           bJob.SubmitTs =  today and
           bJob.StartTs  <= Job.EndTs and
           bJob.EndTs    >= Job.EndTs
    cnt = cCnt + 1
  next

  if cnt > max then max = cnt
next

The problem with this algorithm is that it is very slow due to all the looping. I was wondering if there is a faster way to implement this?

Edit I cannot use SQL queries to access the data.

8
  • 3
    If you post a schema of that table, we might be able to suggest a query that can get what you need. Commented May 18, 2012 at 15:50
  • 1
    I can't use a SQL query to get the data as the language doesn't support it.
    – briddums
    Commented May 18, 2012 at 16:19
  • This was asked for and solved multiple times in SQL and outside of it. stackoverflow.com/questions/117962/… stackoverflow.com/questions/2037618/… stackoverflow.com/questions/1025688/… stackoverflow.com/questions/325933/…
    – Job
    Commented May 18, 2012 at 16:20
  • @Job 3 of your links point to SQL solutions which I cannot use. The 4th is dealing with comparing 2 date ranges. I am looking for a more efficient solution for comparing 1.2 million ranges
    – briddums
    Commented May 18, 2012 at 16:32
  • 1
    @maple_shaft For codereview.SE there should be real code involved; this is just pseudocode as far as I get it, wrong at that (although it resembles VB). To the OP: if you can provide at least a language name, I can give you a better code snippet...
    – K.Steff
    Commented May 18, 2012 at 16:45

2 Answers 2

14
  1. Include all the start and end points (in time) of the Jobs in an array (this creates 2*N elements (1 for start 1 for end))
  2. sort the array ordering by the timestamp of the event,
  3. then iterate over the (2*N) elements as follows:

    for each element X 
    do 
      if(X.type == start)
        counter++
      else
        counter--
      ans=max(ans, counter);
    end
    

Complexity: O(n.log(n)) for initial build of the sorted structure + O(n) for the iteration through its elements.

Edit: It has been suggested by Giorgio that an array is used, which is a better option. (The suggestion originally was to use a red/black tree, but no task removing capability seems to be needed, so maintaining order is trivial).

4
  • 2
    +1: Cool idea! Wouldn't be possible to just sort all the start and end points in an array (this also takes O(n log(n))) and then scan the array as you have described?
    – Giorgio
    Commented May 18, 2012 at 16:56
  • @Giorgio Actually yes, the OP never says tasks are to be removed, so actually a dynamic array is a better fit. Seems I unconsciously introduced further complexity :)
    – K.Steff
    Commented May 18, 2012 at 16:58
  • Brilliant! Collapsing it into a single array made it super-fast.
    – briddums
    Commented May 18, 2012 at 17:34
  • having the array contents already sorted seems like a prerequisite for the algorithm to work at all
    – matt b
    Commented Jul 9, 2013 at 14:31
1

SQL knows how to sort stuff, and I'd bet your engine uses the most efficient algorithm it can (i.e. O(n log(n))), while still working if the result set doesn't fit in RAM (in contrast with K.Steff's answer which will fail if the array doesn't fit in RAM). In python :

import sqlite3
connection = sqlite3.connect("database.db")
cursor = connection.cursor()
cursor.execute("SELECT isStart from (" +
               "  SELECT startTime AS time, 1 AS isStart FROM myTable " +
               "  UNION ALL " +
               "  SELECT endTime as time, -1 AS isStart FROM myTable " +
               "  ORDER BY time ASC, isStart ASC" +
               ")")
maxOverlap = 0
currentOverlap = 0
for (isStart,) in cursor:
    currentOverlap += isStart
    maxOverlap = max(maxOverlap, currentOverlap)
print maxOverlap

Note the isStart ASC order is necessary so the intervals [10,15] and [15,23] are not considered overlapping.

For further improvement, it is probably the case that the columns startTime and endTime have an index, so the SQL engine should efficiently merge the two already sorted tables, resulting in an O(n) operation (the O(log(n)) factor then happens at insertion time for each row, when it is added to the index, but you already have it anyway). If your SQL engine isn't smart enough to do an efficient merge, do it yourself on the fly (still in python) :

import sqlite3
connection = sqlite3.connect("database.db")
cursorStart = connection.cursor()
cursorStart.execute("SELECT startTime FROM myTable ORDER BY startTime ASC")
cursorEnd = connection.cursor()
cursorEnd.execute("SELECT endTime FROM myTable ORDER BY endTime ASC")
maxOverlap = 0
currentOverlap = 0
currentStart = cursorStart.fetchone()
currentEnd = cursorEnd.fetchone()
while currentStart is not None and currentEnd is not None:
    if currentStart < currentEnd:
        currentOverlap += 1
        currentStart = cursorStart.fetchone()
    else:
        currentOverlap -= 1
        currentEnd = cursorEnd.fetchone()
    maxOverlap = max(maxOverlap, currentOverlap)
print maxOverlap

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.