Algorithms for data structures in distributed system

Question

The hash table data structure can be easily spread across multiple machines with a simple algorithm to distribute the keys:

machine_to_query = item_key % machine_count

When you want to read and write key value pairs, you use the key to work out which machine stores the data, then you talk to that machine. If you want a count of the total number of items, you'd need to request the count from each server and add them up.

What algorithms exist for efficiently managing data structures where the data is partitioned across multiple machines? Distributed algorithms, not parallel algorithms.

How might something like a sorted array work in a distributed fashion? Efficiently.

Academic interest, or real-world project? Either way, some further details/requirments probably needed for any good answers to be written. — sdg, Commented Jan 25, 2013 at 17:30

GlenPeterson · Accepted Answer · 2013-01-25 19:10:27Z

I don't know about published books that have this kind of thing, but there are some real world examples you could look at. Scala has a http://www.scala-lang.org/api/current/index.html#scala.collection.parallel.immutable.package">Parallel Immutable Collections package. They have a few hash-backed things, but also a vector (implemented as a shallow tree - http://xuwei-k.github.com/scala-library-sxr/scala-library-2.10.0-M1/scala/collection/parallel/immutable/ParVector.scala.html">source code available) and a sequence.

I think collections are being rewritten in Java 8 as part of http://openjdk.java.net/projects/lambda/">Project Lambda so you could look into that too. I expected source code to be available somewhere, but I can't find it after a brief search. I think one key element (which I think you assume in your question) is that having a collection do its own concurrency management is a big win. Instead of iterating over collections externally where every user has to manage the concurrency, the collection does some kind of map() or reduce() operation where it is passed a function that operates on each item or filters items and the collection manages its concurrency internally.

I think most of these use a divide-and-conquer approach, sending the divisions to various processors. You might Google http://en.wikipedia.org/wiki/Amdahl%27s_law">Ahmdal's Law as a starting point because it governs the maximum possible performance gain from running any algorithm over multiple processors. Also map-reduce and big-data.

Thanks for the reply, however, i'm specifically interested in distributed systems, not parallel systems, perhaps I wasn't clear enough in my question, I've updated the title of the topic now. ;-D — sungiant, Commented Jan 28, 2013 at 12:29

Community · Accepted Answer · 2017-05-23 12:40:15Z

1

Here's a link to a similar question on SO:

https://stackoverflow.com/questions/3927537/how-do-you-implement-sorting-and-paging-on-distributed-data

Also the last article from T-79.4001 Seminar on Theoretical Computer Science looks useful:

http://www.tcs.hut.fi/Studies/T-79.4001/2007SPR/

Some books on the subject:

Distributed Algorithms - Kaufmann
Design and Analysis of Distributed Algorithms - Wiley

edited May 23, 2017 at 12:40

CommunityBot

1

answered Jan 28, 2013 at 12:41

sungiant

1416 bronze badges

Add a comment |

Stack Exchange Network

Algorithms for data structures in distributed system

2 Answers 2

Hot Network Questions

Algorithms for data structures in distributed system

2 Answers 2

Related

Hot Network Questions