You might think it's better to have one GPU call per level of the tree, but I think it's simpler to put it all in one call, do the steps in a loop, and win your cores back through the number of simultaneous whole lookups in flight at once.
Here's a setup for doing a key store as a tree on the GPU.
Use a 32-bit uint texture, with this format:
[NEXT FREE POSITION]
[NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]
[NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]
[NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]
...
The next free position points at the end, so another node can be tacked on there.
The parallel pointer points to the leg beside the one you're up to, and the serial pointer points deeper into the tree. If the parallel pointer is ever null, you've run out of siblings, you have no match, and you abort. When you reach the last level, you put the leaf contents in and complete successfully.
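To pin the indexing down, here's a tiny sketch of how the offsets inside one node line up (the names are mine, purely illustrative; the read code below just indexes with +1 and +2 directly):

//offsets within one node, relative to the node's base index (illustrative names only)
#define NODE_MATCH    0   //the 32bit chunk of the key this node matches
#define NODE_PARALLEL 1   //index of the next sibling leg; 0 means no more siblings
#define NODE_SERIAL   2   //index of the first node one level deeper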
You hurt your GPU quite a bit by hopping around the texture randomly, but there isn't much you can do about it; the work is spatially divided across the parallel lookups, and you only pay log-many serial levels.
Here's an example of the reading code, done in DirectCompute.
//it's only really short.

//resource declarations (assumed; wire these up however suits your setup):
struct BufType { uint f; };
StructuredBuffer<BufType> buffer0 : register(t0);     //the keys to look up, ZCKEY32 words each
StructuredBuffer<BufType> buffer1 : register(t1);     //kickstart array for the first link
StructuredBuffer<BufType> buffer2 : register(t2);     //the tree itself
RWStructuredBuffer<BufType> BufferOut : register(u0); //two routing words out per key
//RES and ZCKEY32 come from your own setup; values here are only so the kernel compiles.
#define RES 1024     //threads per row of the dispatch
#define ZCKEY32 8    //key length in 32bit words

[numthreads(32, 32, 1)] //put your 200k parallel accesses here. :)
void cs_zc_read_read(uint3 DTid : SV_DispatchThreadID)
{
    //get the first link pointer with a kickstart array, because the start reuses
    //too much and it makes too many parallel nodes.
    uint link=buffer1[buffer0[(DTid.x+DTid.y*RES)*ZCKEY32+0].f].f;
    uint i=0;
    while(i<ZCKEY32)
    {
        uint match32=buffer0[(DTid.x+DTid.y*RES)*ZCKEY32+i].f;
        uint match232;
        uint lnk[2];
        match232=buffer2[link].f;
        lnk[0]=buffer2[link+2].f;
        lnk[1]=buffer2[link+1].f;
        if(match32==match232)
        {
            link=lnk[0]; //this steps in series.
            i++;
            if(i==ZCKEY32)
            {
                //success, it's an exact match; link now sits on the leaf.
                //get two routings.
                BufferOut[(DTid.x+DTid.y*RES)*2+0].f=buffer2[link+1].f;
                BufferOut[(DTid.x+DTid.y*RES)*2+1].f=buffer2[link+2].f;
            }
        }
        else
        {
            link=lnk[1]; //this steps in parallel
            if(link==0){i=100000;} //abort, it's a no match
        }
    }
}
Don't worry about parallelizing the write, because it's a pain. =p
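If you want a picture of the write anyway, here's a minimal single-thread sketch of tacking one new node onto the tree at the next free position. It's only my assumption of how that step might look, not the full insert path, and the resource names (treeRW, insertArgs) and the kernel are hypothetical:

//hypothetical single-thread write sketch; treeRW shares buffer2's layout but is writable,
//and insertArgs is a two-word scratch buffer: [0]=parent node index, [1]=key word to store.
RWStructuredBuffer<uint> treeRW : register(u1);
StructuredBuffer<uint> insertArgs : register(t3);

[numthreads(1, 1, 1)] //serial on purpose, per the note above
void cs_zc_write_one(uint3 DTid : SV_DispatchThreadID)
{
    uint freePos = treeRW[0];        //NEXT FREE POSITION sits in slot 0
    uint parent  = insertArgs[0];
    uint keyWord = insertArgs[1];
    treeRW[freePos+0] = keyWord;     //node matching string
    treeRW[freePos+1] = 0;           //parallel pointer, null = last sibling on this level
    treeRW[freePos+2] = 0;           //serial pointer, gets filled when the next level goes in
    treeRW[parent+1]  = freePos;     //hook the new node in as the parent's next sibling
    treeRW[0]         = freePos+3;   //bump the free position past the new node
}

A full insert just repeats that down the serial pointers, one level per key word, and drops the two routing words in at the leaf.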
But if you want it to kick ass more, you can collect multivalues using ternary codes, which gives you more flexibility, though you have to multiply the cost by the number you're collecting in parallel. That doesn't matter if it's just more iteration in the single call, because once you've saturated your cores, more parallel threads don't gain you any more speed.