1

Sorry if the title is confusing, but here's an example to illustrate what I'm trying to accomplish.

I would like to create a 'checksum' or ID grouping of similar number sequences, so that when you compare one sequence to another, it can see if they are similar or not. 'Similar' sequences here are sequences that are shifted to the right or left, but maintain their pattern. This is a similar sequence:

[ 1, 4, 5, 6 ]
[ 4, 5, 6, 1 ]
[ 5, 6, 1, 4 ]
[ 6, 1, 4, 5 ]

As you can see in each set of numbers the sequence is shifted to the left. So in this example:

sumof([ 1, 4, 5, 6 ]) == sumof([ 4, 5, 6, 1 ]);

But

sumof([ 1, 4, 4, 6 ]) != sumof([ 4, 5, 6, 1 ]);

Another way to put it is:

$checksum == sumof([ 1, 4, 5, 6 ]) == sumof([ 4, 5, 6, 1 ]) == sumof([ 5, 6, 1, 4 ]) == sumof([ 6, 1, 4, 5 ])

I would like the function sumof() to create an alphanumeric value of the sum/checksum of the sequence (preferably an integer, but can contain characters too). Like hashes, I want this value to be robust enough to prevent any collisions with non-matching sequences.

I hope I've described this well enough. If you need any clarification please ask in the comments.

5
  • Lookup for calculating hash values. Commented May 17, 2018 at 22:35
  • The hashing method itself isn't important, I can use md5() for that. I'm more interested in creating a value that groups these similar sequences so I can compare them. Commented May 17, 2018 at 22:38
  • You would need to define rules for 'similar', otherwise use AI and big data and teach a system the vague rules.
    – mattnz
    Commented May 18, 2018 at 2:51
  • The only requirements are that #1 the sets have the same number of elements and #2 the sequence pattern is the same. Commented May 18, 2018 at 2:53

1 Answer 1

2

Derive a canonical (virtual) form of the sequence.

/**
  * Do a lexical comparison and yield the start index of the
  * lexically first cyclic sequence.
  */
int minCycleStartIndex(int[] seq) {
   int n = seq.length;
   int minI = 0;
   for (int i = 1; i < n; ++i) {
       for (int j = 0; j < n; ++j) {
           int c = seq[(minI + j) % n] - seq[(i + j) % n];
           if (c != 0) {
               if (c > 0) {
                   minI = i;
               }
               break;
           }
       }
   }
   return minI;
}

It might very well be that an other order gives a better result, for instance in every loop step comparing the sums upto the index, and in decreasing level, so that data mountains with neighbouring high values come first.

Then implement a progressive checksum on the cyclic sequence:

int cyclicChecksum(int[] seq, int startI) {
   int checksum = 0;
   for (int i = 0; i < n; ++i) {
       int x = seq[(startI + i) % n];
       x /= 4; // Support similarity by lowering the resolution.
       //checksum = (checksum << 2) + x;
       checksum = (checksum << 4) | (x & 0xF);
   }
   return checksum;
}

int minI = minCycleStartIndex(seq);
int crc = cyclicChecksum(seq, minI);

The checksum above is the problematic part. By masking bits out, the same CRC may also contain wildly different cycles.

Checksums 0x12345678 and 0x12A45678 could be similar, have just 1 different value. Because of the 4 bit shift, a hex comparison would be easy.

There are numerous variations imaginable, and probably even needed for your scenario. Yet there are some interesting aspects mentioned here.

By the way associative summing would loose the ordering info, but would fit too, for instance the sum of all elements divided by say 4.

1
  • 1
    Finding a canonical sequence is a great solution. And once we have that sequence, we can use any hashing or CRC algorithm, including cryptographic hashes. Note that your checksum will “forget” the first elements on long sequences. But binning 4 values together as a simple similarity metric before hashing is an interesting adaption.
    – amon
    Commented May 18, 2018 at 13:19

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.