3

The hash table data structure can be easily spread across multiple machines with a simple algorithm to distribute the keys:

machine_to_query = item_key % machine_count

When you want to read and write key value pairs, you use the key to work out which machine stores the data, then you talk to that machine. If you want a count of the total number of items, you'd need to request the count from each server and add them up.

What algorithms exist for efficiently managing data structures where the data is partitioned across multiple machines? Distributed algorithms, not parallel algorithms.

How might something like a sorted array work in a distributed fashion? Efficiently.

2
  • Academic interest, or real-world project? Either way, some further details/requirments probably needed for any good answers to be written.
    – sdg
    Commented Jan 25, 2013 at 17:30
  • Academic interest
    – sungiant
    Commented Jan 26, 2013 at 20:30

2 Answers 2

1

I don't know about published books that have this kind of thing, but there are some real world examples you could look at. Scala has a http://www.scala-lang.org/api/current/index.html#scala.collection.parallel.immutable.package">Parallel Immutable Collections package. They have a few hash-backed things, but also a vector (implemented as a shallow tree - http://xuwei-k.github.com/scala-library-sxr/scala-library-2.10.0-M1/scala/collection/parallel/immutable/ParVector.scala.html">source code available) and a sequence.

I think collections are being rewritten in Java 8 as part of http://openjdk.java.net/projects/lambda/">Project Lambda so you could look into that too. I expected source code to be available somewhere, but I can't find it after a brief search. I think one key element (which I think you assume in your question) is that having a collection do its own concurrency management is a big win. Instead of iterating over collections externally where every user has to manage the concurrency, the collection does some kind of map() or reduce() operation where it is passed a function that operates on each item or filters items and the collection manages its concurrency internally.

I think most of these use a divide-and-conquer approach, sending the divisions to various processors. You might Google http://en.wikipedia.org/wiki/Amdahl%27s_law">Ahmdal's Law as a starting point because it governs the maximum possible performance gain from running any algorithm over multiple processors. Also map-reduce and big-data.

1
  • Thanks for the reply, however, i'm specifically interested in distributed systems, not parallel systems, perhaps I wasn't clear enough in my question, I've updated the title of the topic now. ;-D
    – sungiant
    Commented Jan 28, 2013 at 12:29
1

Here's a link to a similar question on SO:

https://stackoverflow.com/questions/3927537/how-do-you-implement-sorting-and-paging-on-distributed-data

Also the last article from T-79.4001 Seminar on Theoretical Computer Science looks useful:

http://www.tcs.hut.fi/Studies/T-79.4001/2007SPR/

Some books on the subject:

  • Distributed Algorithms - Kaufmann
  • Design and Analysis of Distributed Algorithms - Wiley

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.