Name-Based Sorting Algorithms For Maximized Common Letters

Question

I am attempting to write a program for a competition (we're allowed to consult Stack Exchange, as long as I'm not given physical code) that takes in a list of 5000 people's names (distributed fairly evenly from A-Z, but are real names - not random character strings, and each name is unique) and sorts this list of people into "rooms" of exactly 4 people each, with each person being placed in exactly one room only (cannot reuse names, cannot omit names).

Each "solution" consists of 1250 of such rooms (I store it in a 2d array and am using java), and the solution's "score" is defined to be the sum of each room's score, where the room score is the number of common letters held in each of the 4 peoples' names. A given letter is only counted more than once if it appears exactly that many times in each person's name.

I began with algorithms that simply placed the names into random rooms just to test the waters and see what the low-end of scores might look like (around 650). I then tried doing a simple alphabetical sort of the list containing the names (using Collections.sort()), which considerably improved the score (around 3800), but still had quite a bit of room for improvement.

My latest algorithm has been a slightly modified version of simulated annealing, where I begin with the alphabetically sorted list and continuously switch two people from two rooms (by picking two random room numbers and two random indices in the rooms). Inside each outer level iteration, I update the annealing temperature and conduct an inner loop search of 1000 iterations (at the temperature of the outer loop) where I get 1000 random swaps, pick the best one (no temperature-probability selection in the inner loop), and use that in the outer loop, accepting it as the new solution if its score is higher and accepting it with some probability if it is lower.

This SA approach has gotten me a score in the low 6000's, and runs quite quickly, but it seems to have asymptoted. Are there any other algorithms or optimizations I should consider? Should I move away from a stochastic approach and look for construction approaches that make use of the alphabetical nature of the words themselves? Thanks!

It feels like a trie might address this problem in some way. — Robert Harvey, Commented Jan 13, 2016 at 18:07
I don't think it's a 'sorting' algorithm, for there's no total or partial order in the 'number of common letters' relation. It looks more like clustering. — 9000, Commented Jan 13, 2016 at 21:46

Corbin March · Accepted Answer · 2016-01-15 06:21:31Z

A heuristic worth considering:

Pick a random name and add it to a group.
Scan the remaining names for the highest point match and add it to the group.
Scan the remaining names for the highest total point match against the two grouped names and add it to the group.
Scan the remaining names for the highest total point match against the three grouped names and add it to the group.
Repeat to complete the remaining groups.

The result would likely miss some excellent matches because its members would be matched with others prematurely. Still, it might be a good starting state for stochastic methods.

A further enrichment might be to pre-calculate "distances" between all names. That's 12,500,000 measurements -- a lot, but do-able. Sort the measurements and start the process above with the best match.

A final thought is k-means clustering. We could think of the names as points in 26 dimension Euclidean space. That's one dimension per character. Each character is one "unit" along that dimension. "Aaron" is two units along the A axis, zero units along B, zero along C, etc.. 1 along R... Therefore, he shares a point with "Roana", is pretty close to "Rona", but is no where near "Ziggy". (This isn't a perfect way of measuring. For instance, "Ziggy" is really no further or closer to "Aaron" than "Tim", but our Euclidean layout suggests it is. I'll have to think a bit more on it...)

True k-means is infeasible with a complexity of O(n^dk+1logn) [d == dimensions (26), k == cluster count (1250)] Ouch. Still, conceptualizing as points in space seems useful. There are all sorts of tricks you can try with nearest neighbor search.

Stack Exchange Network

Name-Based Sorting Algorithms For Maximized Common Letters

1 Answer 1

Hot Network Questions

Name-Based Sorting Algorithms For Maximized Common Letters

1 Answer 1

Related

Hot Network Questions