You might think it's better to have one GPU call per level of the tree, but I think it's simpler to put it all in one call, do the steps in a loop, and win your cores back through the number of simultaneous whole lookups in flight at once.
Here's a setup for doing a key store as a tree on the GPU.
Use a 32-bit uint texture, with this format:
[NEXT FREE POSITION]
[NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]
[NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]
[NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]
...
The next free position points at the end, so another node can be tacked on there.
The parallel pointer points to the leg beside the one you're up to, and the serial pointer points deeper into the tree. If the parallel pointer is ever null, you've run out of siblings, you have no match, and you abort. When you reach the last level, you put the leaf contents in and complete successfully.
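To pin the indexing down, here's a tiny sketch of how the offsets inside one node line up (the names are mine, purely illustrative; the read code below just indexes with +1 and +2 directly):

//offsets within one node, relative to the node's base index (illustrative names only)
#define NODE_MATCH    0   //the 32bit chunk of the key this node matches
#define NODE_PARALLEL 1   //index of the next sibling leg; 0 means no more siblings
#define NODE_SERIAL   2   //index of the first node one level deeper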
You hurt your GPU quite a bit by hopping around the texture randomly, but there isn't much you can do about it; the work is spatially divided across the parallel lookups, and you only pay log-many serial levels.
Here's an example of the reading code, done in DirectCompute.
//it's only really short.

//resource declarations (assumed; wire these up however suits your setup):
struct BufType { uint f; };
StructuredBuffer<BufType> buffer0 : register(t0);     //the keys to look up, ZCKEY32 words each
StructuredBuffer<BufType> buffer1 : register(t1);     //kickstart array for the first link
StructuredBuffer<BufType> buffer2 : register(t2);     //the tree itself
RWStructuredBuffer<BufType> BufferOut : register(u0); //two routing words out per key
//RES and ZCKEY32 come from your own setup; values here are only so the kernel compiles.
#define RES 1024     //threads per row of the dispatch
#define ZCKEY32 8    //key length in 32bit words

[numthreads(32, 32, 1)] //put your 200k parallel accesses here. :)
void cs_zc_read_read(uint3 DTid : SV_DispatchThreadID)
{
    //get the first link pointer with a kickstart array, because the start reuses
    //too much and it makes too many parallel nodes.
    uint link=buffer1[buffer0[(DTid.x+DTid.y*RES)*ZCKEY32+0].f].f;
    uint i=0;
    while(i<ZCKEY32)
    {
        uint match32=buffer0[(DTid.x+DTid.y*RES)*ZCKEY32+i].f;
        uint match232;
        uint lnk[2];
        match232=buffer2[link].f;
        lnk[0]=buffer2[link+2].f;
        lnk[1]=buffer2[link+1].f;
        if(match32==match232)
        {
            link=lnk[0]; //this steps in series.
            i++;
            if(i==ZCKEY32)
            {
                //success, it's an exact match; link now sits on the leaf.
                //get two routings.
                BufferOut[(DTid.x+DTid.y*RES)*2+0].f=buffer2[link+1].f;
                BufferOut[(DTid.x+DTid.y*RES)*2+1].f=buffer2[link+2].f;
            }
        }
        else
        {
            link=lnk[1]; //this steps in parallel
            if(link==0){i=100000;} //abort, it's a no match
        }
    }
}
Don't worry about parallelizing the write, because it's a pain. =p
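If you want a picture of the write anyway, here's a minimal single-thread sketch of tacking one new node onto the tree at the next free position. It's only my assumption of how that step might look, not the full insert path, and the resource names (treeRW, insertArgs) and the kernel are hypothetical:

//hypothetical single-thread write sketch; treeRW shares buffer2's layout but is writable,
//and insertArgs is a two-word scratch buffer: [0]=parent node index, [1]=key word to store.
RWStructuredBuffer<uint> treeRW : register(u1);
StructuredBuffer<uint> insertArgs : register(t3);

[numthreads(1, 1, 1)] //serial on purpose, per the note above
void cs_zc_write_one(uint3 DTid : SV_DispatchThreadID)
{
    uint freePos = treeRW[0];        //NEXT FREE POSITION sits in slot 0
    uint parent  = insertArgs[0];
    uint keyWord = insertArgs[1];
    treeRW[freePos+0] = keyWord;     //node matching string
    treeRW[freePos+1] = 0;           //parallel pointer, null = last sibling on this level
    treeRW[freePos+2] = 0;           //serial pointer, gets filled when the next level goes in
    treeRW[parent+1]  = freePos;     //hook the new node in as the parent's next sibling
    treeRW[0]         = freePos+3;   //bump the free position past the new node
}

A full insert just repeats that down the serial pointers, one level per key word, and drops the two routing words in at the leaf.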
But if you want it to kick ass more, you can collect multivalues using ternary codes, which gives you more flexibility, though you have to multiply the cost by the number you're collecting in parallel. That doesn't matter if it's just more iteration in the single call, because once you've saturated your cores, more parallel threads don't gain you any more speed.