Here is a (quite-naive) solution in Numba using multiple threads:
```python
import numba as nb
import numpy as np

@nb.njit(
    [
        # Eagerly compiled for common types
        # Please add your type if it is missing
        '(int32[:,:], int32[:])',
        '(int64[:,:], int32[:])',
        '(float32[:,:], int32[:])',
        '(float64[:,:], int32[:])',
    ],
    parallel=True,
    cache=True
)
def nb_kernel(res, sizes):
    n = np.prod(sizes)
    m = sizes.size
    chunk_size = 1024
    assert n > 0 and m > 0
    for i in range(m):
        assert sizes[i] > 0
    # Compute blocks of `chunk_size` rows.
    # Multiple threads compute separate blocks.
    for block in nb.prange((n + chunk_size - 1) // chunk_size):
        start = block * chunk_size
        end = min(start + chunk_size, n)
        # Compute the first row of the block directly
        # (mixed-radix decomposition of `start`)
        jump = 1
        for j in range(m-1, -1, -1):
            res[start, j] = (start // jump) % sizes[j]
            jump *= sizes[j]
        # Compute the next rows of the block incrementally,
        # like a ripple-carry counter
        for i in range(start+1, end):
            inc = 1
            for j in range(m-1, -1, -1):
                val = res[i-1, j] + inc
                if val >= sizes[j]:
                    val = 0
                    inc = 1
                else:
                    inc = 0
                res[i, j] = val

def nb_compute(sizes, dtype):
    res = np.empty((np.prod(sizes), len(sizes)), dtype=dtype)
    nb_kernel(res, np.array(sizes, dtype=np.int32))
    return res
```
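For completeness, here is how these functions can be called (the values are the ones used in the benchmark below):

```python
sizes = (101, 53, 71)

# Convenience wrapper: allocates the output and fills it
out = nb_compute(sizes, np.int32)

# Faster: fill a preallocated array directly with the kernel
res = np.empty((np.prod(sizes), len(sizes)), dtype=np.int32)
nb_kernel(res, np.array(sizes, dtype=np.int32))
```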
Benchmark
On my machine (i5-9600KF CPU), on Windows, here are the results with `sizes=(101,53,71)` and `dtype=np.int32`:
```
np.vstack(...):             626.1 ms
np.ascontiguousarray(...):    3.5 ms
np.stack(...):                2.6 ms
nb_compute(...):              1.1 ms   <----
nb_kernel(...):               0.5 ms   <----
```
With a fixed length of `sizes` known at compile time:
```
nb_compute(...):              0.8 ms   <----
nb_kernel(...):               0.2 ms   <----
```
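These timings can be reproduced with a simple `timeit` harness along the following lines (a sketch, not the exact harness used for the numbers above; call the kernel once first so JIT compilation is not measured):

```python
import timeit
import numpy as np

sizes = (101, 53, 71)
sizes_arr = np.array(sizes, dtype=np.int32)
res = np.empty((np.prod(sizes), len(sizes)), dtype=np.int32)

nb_kernel(res, sizes_arr)  # Warm-up call (excludes compilation from the timing)

n_runs = 100
t_kernel = timeit.timeit(lambda: nb_kernel(res, sizes_arr), number=n_runs)
t_compute = timeit.timeit(lambda: nb_compute(sizes, np.int32), number=n_runs)
print(f"nb_kernel:  {t_kernel / n_runs * 1e3:.2f} ms")
print(f"nb_compute: {t_compute / n_runs * 1e3:.2f} ms")
```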
Analysis and optimisations
We can see that calling `nb_kernel` directly on a preallocated array is significantly faster. Indeed, filling an array for the first time causes a lot of memory page faults, which are inherently slow. Doing that in parallel is better (but it does not scale well on Windows).
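For example, a buffer-reuse pattern like the following (my illustration, not part of the benchmark) pays the page-fault cost only once:

```python
sizes_arr = np.array((101, 53, 71), dtype=np.int32)
buf = np.empty((np.prod(sizes_arr), sizes_arr.size), dtype=np.int32)

for _ in range(10):
    # After the first call, the pages of `buf` are already mapped,
    # so subsequent fills avoid the page-fault overhead.
    nb_kernel(buf, sizes_arr)
    # ... consume `buf` before the next iteration overwrites it ...
```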
If you already do this in each thread of a parallel code, `nb_kernel` will not be significantly faster. Indeed, most of the speed-up of Numba comes from the use of multiple threads. Consequently, in this case, we need to optimise the Numba kernel itself. One major optimisation is to specialise the function for a specific length of `sizes` (then known at compile time). Indeed, the code is more than twice as fast if we replace `m` by 3 (so it only supports `len(sizes) == 3`). I expect most use cases to have a very small `len(sizes)`, so you can specialise the function for the cases 2, 3, 4 and 5 and write a pure-Python function calling the right specialisation (see the sketch below). This optimisation also makes the parallel code faster.
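Here is a minimal sketch of what such a specialisation could look like (the names `nb_kernel_3` and `nb_kernel_dispatch` are mine, only the `m == 3` case is specialised, and only the int32 signature is compiled eagerly):

```python
@nb.njit('(int32[:,:], int32[:])', parallel=True, cache=True)
def nb_kernel_3(res, sizes):
    # Same algorithm as nb_kernel, but with m hard-coded to 3 so the
    # compiler can fully unroll the inner loops over columns.
    n = np.int64(sizes[0]) * np.int64(sizes[1]) * np.int64(sizes[2])
    chunk_size = 1024
    for block in nb.prange((n + chunk_size - 1) // chunk_size):
        start = block * chunk_size
        end = min(start + chunk_size, n)
        # First row of the block, computed directly
        res[start, 2] = start % sizes[2]
        res[start, 1] = (start // sizes[2]) % sizes[1]
        res[start, 0] = (start // (sizes[2] * sizes[1])) % sizes[0]
        # Next rows, computed incrementally (ripple-carry counter)
        for i in range(start+1, end):
            c2 = res[i-1, 2] + 1
            c1 = res[i-1, 1]
            c0 = res[i-1, 0]
            if c2 >= sizes[2]:
                c2 = 0
                c1 += 1
                if c1 >= sizes[1]:
                    c1 = 0
                    c0 += 1
                    if c0 >= sizes[0]:
                        c0 = 0
            res[i, 2] = c2
            res[i, 1] = c1
            res[i, 0] = c0

def nb_kernel_dispatch(res, sizes):
    # Pure-Python dispatcher picking the right specialisation
    if sizes.size == 3:
        nb_kernel_3(res, sizes)
    else:
        nb_kernel(res, sizes)  # Generic fallback
```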
For better performance, you should avoid filling big arrays in the first place, because DRAM is slow. This is especially true for temporary arrays (arrays that are filled once and then never reused) due to page faults. I think the above code is optimal for output arrays not fitting in the last-level cache (LLC) of your CPU.
For output arrays fitting in the LLC, there are faster implementations than the one above, for example using a native SIMD-friendly language (but this is pretty complex to implement).
Additional notes
Note that a good reference for the highest possible speed is `np.ones` of the same shape/dtype, since it does nothing but allocate and fill the output. For example, on Windows, with `sizes=(101,53,71)` and `dtype=np.int32`, your first line takes 2.5 ms while `np.ones` takes 0.7 ms. This means you cannot expect a speed-up better than 3.5 times. This is assuming a perfect/great Numpy code writing directly to the output array (i.e. no large temporary arrays). If there is just one temporary array, the maximum speed-up falls to 1.7.
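This lower bound is easy to measure (illustrative snippet; a fresh array is allocated on each iteration on purpose, so page faults are included):

```python
import timeit
import numpy as np

sizes = (101, 53, 71)

# Lower bound: just allocating and filling an array of the output's shape/dtype
t = timeit.timeit(lambda: np.ones((np.prod(sizes), len(sizes)), dtype=np.int32),
                  number=100)
print(f"np.ones baseline: {t / 100 * 1e3:.2f} ms")
```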
Also note that `idx.T` does not actually transpose the array: it just changes the strides, so the resulting array becomes non-contiguous. This is called a lazy transposition. Thus, `np.ascontiguousarray` is then the expensive part, mainly because the transposition is actually performed during this function call. On my machine, `np.ascontiguousarray` + `.T` takes 5.7 ms, a `.copy()` takes 4.2 ms, computing `idx` takes 2.1 ms and computing `idx.T` takes 2.1 ms too. In fact, the `.T` alone takes 0.0002 ms, experimentally showing it does not move/copy anything in DRAM (this is the time to copy only 3 KiB in DRAM).
Finally, copying into a preallocated array (`out2[:] = out[:]`) takes 0.5 ms. A `.copy()`, which creates a new array, is 4 times slower because of page faults: page faults take 75% of the time! Thus, if you want fast Numpy code, you should not create a new array (nor create temporary ones); this is the expensive part. But this is difficult to do in practice.