
When I need to generate an N-dimensional index array in C-order, I’ve tried a few different NumPy approaches.

The fastest for larger arrays but less readable:

np.stack(np.meshgrid(*[np.arange(i, dtype=dtype) for i in sizes], indexing="ij"), axis=-1).reshape(-1, len(sizes))

More readable with good performance:

np.ascontiguousarray(np.indices(sizes, dtype=dtype).reshape(len(sizes), -1).T)

Here I'm not sure if the ascontiguousarray copy is actually necessary, or if there's a better way to make sure the result is in C-order without forcing a copy.

Most readable, but by far the slowest:

np.vstack([*np.ndindex(sizes)], dtype=dtype)

The iterator conversion is quite slow for larger arrays.

Is there a more straightforward and readable built-in way to achieve this in NumPy that matches the performance of np.meshgrid or np.indices? If not, can either the meshgrid or indices approach be optimized to avoid unnecessary memory copies (like ascontiguousarray) while still ensuring the result is C-contiguous?

Example:

sizes = (3, 1, 2)
idx = np.ascontiguousarray(np.indices(sizes).reshape(len(sizes), -1).T)
print(idx)
print(f"C_CONTIGUOUS: {idx.flags['C_CONTIGUOUS']}")
# [[0 0 0]
#  [0 0 1]
#  [1 0 0]
#  [1 0 1]
#  [2 0 0]
#  [2 0 1]]
# C_CONTIGUOUS: True
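
For reference, one direction I have been considering but have not benchmarked: filling each coordinate column of a preallocated C-contiguous array directly via broadcasting, so no ascontiguousarray copy is needed at the end (sketch only):

import math
import numpy as np

sizes = (3, 1, 2)
dtype = np.int64

# Preallocate the C-contiguous result and view it as an (*sizes, ndim) block
out = np.empty((math.prod(sizes), len(sizes)), dtype=dtype)
view = out.reshape(*sizes, len(sizes))
for axis, size in enumerate(sizes):
    shape = [1] * len(sizes)
    shape[axis] = size
    # Broadcast the coordinates of this axis into the corresponding column
    view[..., axis] = np.arange(size, dtype=dtype).reshape(shape)

print(out.flags['C_CONTIGUOUS'])  # True, without an extra copy at the end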
  • I would use Numba for such a thing (or possibly Cython). Numpy internal iterators are rather slow when they iterate over a small number of items along a given axis (e.g. when the last axis has size 3). Numpy transpositions are not very efficient either, though transpositions are generally expensive anyway. Large temporary arrays are also expensive due to page faults (especially on Windows). Commented Apr 6 at 12:17
  • Besides, even writing the output array can be limited by page faults (at least on Windows) and sequential execution. An optimistic lower bound is the time of np.ones of the same shape/dtype. For example, on Windows, with sizes=(101,53,71) and dtype=np.int32, your first line takes 2.5 ms while np.ones takes 0.7 ms. This means you cannot expect a speed-up better than about 3.5 times. This assumes perfect/great Numpy code writing directly into the output array (i.e. no large temporary arrays). With just one temporary array, the maximum speed-up falls to about 1.7... Commented Apr 6 at 12:28
  • Thanks, @JérômeRichard. The performance difference between idx = np.indices(sizes).reshape(len(sizes), -1) and its transposed version idx.T appears to be relatively small. Copying the data with ascontiguousarray seems to have a more noticeable impact for larger arrays. Do you see any clever way to return a C-contiguous result from that expression without explicitly copying? I also have a custom Numba implementation that's probably more memory-efficient, but it doesn't outperform meshgrid or np.indices since NumPy 2.x
    – Olibarer
    Commented Apr 6 at 13:44
  • idx.T does not actually transpose the array. It just changes the strides, so the resulting array becomes non-contiguous. This is called a lazy transposition. Thus, np.ascontiguousarray is the expensive part, mainly due to the actual transposition done during that call. On my machine, the full expression with np.ascontiguousarray and .T takes 5.7 ms, a version with .copy() takes 4.2 ms, computing idx takes 2.1 ms, and computing idx followed by .T also takes 2.1 ms. In fact, the .T itself takes 0.0002 ms, experimentally showing it does not move/copy anything in DRAM (that is roughly the time to copy only 3 KiB in DRAM). Commented Apr 6 at 14:36
  • A real memory copy into an existing array (out2[:] = out[:]) takes 0.5 ms. That is about 4 times faster than .copy(), which creates a new array and so is slow because of page faults. Page faults take 75% of the time! Thus, if you want fast Numpy code you should not create a new array (nor create temporary ones); that is the expensive part. But this is difficult to do in practice (a quick check of the lazy transposition is sketched below these comments). Commented Apr 6 at 14:40
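
A quick way to confirm the lazy transposition described in the comments above (a sketch; exact timings vary by machine, so only the contiguity and memory-sharing checks are shown):

import numpy as np

sizes = (101, 53, 71)
idx = np.indices(sizes, dtype=np.int32).reshape(len(sizes), -1)
idx_t = idx.T

# .T only swaps the strides: same memory, but no longer C-contiguous
print(np.shares_memory(idx, idx_t))    # True
print(idx_t.flags['C_CONTIGUOUS'])     # False

# The actual data movement happens here, in the contiguous copy
idx_c = np.ascontiguousarray(idx_t)
print(np.shares_memory(idx_t, idx_c))  # False
print(idx_c.flags['C_CONTIGUOUS'])     # True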

1 Answer


Here is a (quite naive) solution in Numba using multiple threads:

import numba as nb
import numpy as np

@nb.njit(
    [
        # Eagerly compiled for common types
        # Please add your type if it is missing
        '(int32[:,:], int32[:])',
        '(int64[:,:], int32[:])',
        '(float32[:,:], int32[:])',
        '(float64[:,:], int32[:])',
    ],
    parallel=True,
    cache=True
)
def nb_kernel(res, sizes):
    n = np.prod(sizes)
    m = sizes.size
    chunk_size = 1024
    assert n > 0 and m > 0
    for i in range(m):
        assert sizes[i] > 0
    # Compute blocks of chunk_size rows.
    # Multiple threads compute separate blocks.
    for block in nb.prange((n + chunk_size - 1) // chunk_size):
        start = block * chunk_size
        end = min(start + chunk_size, n)
        # Compute the first row of the block
        jump = 1
        for j in range(m-1, -1, -1):
            res[start, j] = (start // jump) % sizes[j]
            jump *= sizes[j]
        # The next rows of the block incrementally
        for i in range(start+1, end):
            inc = 1
            for j in range(m-1, -1, -1):
                val = res[i-1, j] + inc
                if val >= sizes[j]:
                    val = 0
                    inc = 1
                else:
                    inc = 0
                res[i, j] = val

def nb_compute(sizes, dtype):
    res = np.empty((np.prod(sizes), len(sizes)), dtype=dtype)
    nb_kernel(res, np.array(sizes, dtype=np.int32))
    return res
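
Usage, with the sizes from the question (assuming the definitions above):

sizes = (3, 1, 2)
idx = nb_compute(sizes, np.int32)
print(idx)
print(idx.flags['C_CONTIGUOUS'])  # True: np.empty allocates the result C-contiguous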

Benchmark

On my machine (i5-9600KF CPU), on Windows, here are results with sizes=(101,53,71) and dtype=np.int32:

np.vstack(...):              626.1 ms
np.ascontiguousarray(...):     3.5 ms
np.stack(...):                 2.6 ms

nb_compute(...):               1.1 ms   <----
nb_kernel(...):                0.5 ms   <----

With a fixed length of `sizes` known at compile time:
    nb_compute(...):           0.8 ms   <----
    nb_kernel(...):            0.2 ms   <----

Analysis and optimisations

We can see that calling nb_kernel directly on a preallocated array is significantly faster. Indeed, when we fill an array for the first time, it causes a lot of memory page faults, which are inherently slow. Doing that in parallel helps (but it does not scale well on Windows).
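
For example, if the output buffer can be allocated once and then reused, only the first call pays for the page faults (a sketch, reusing the definitions above):

sizes = (101, 53, 71)
sizes_arr = np.array(sizes, dtype=np.int32)

# Allocate (and fault in) the output once, then let nb_kernel refill it in place
res = np.empty((np.prod(sizes), len(sizes)), dtype=np.int32)
for _ in range(10):
    nb_kernel(res, sizes_arr)
    # ... consume res here before the next iteration overwrites it ...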

If you already call this from each thread of a parallel code, nb_kernel will not make things significantly faster, because most of the speed-up of Numba here comes from the use of multiple threads. In that case, we need to optimise the Numba kernel itself. One major optimisation is to specialise the function for a specific length of sizes (then known at compile time): the code is more than twice as fast if we replace m by 3 (so it only supports len(sizes) == 3). I expect most use cases to have a very small len(sizes), so you can specialise the function for the cases 2, 3, 4 and 5 and write a pure-Python function that calls the right specialisation. This optimisation also makes the parallel code faster.
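
A sketch of such a specialisation for len(sizes) == 3, with a small pure-Python dispatcher (illustrative only, not benchmarked here; the chunking mirrors the generic kernel above):

@nb.njit('(int32[:,:], int32[:])', parallel=True, cache=True)
def nb_kernel_3(res, sizes):
    # Same algorithm as nb_kernel, but the number of columns is fixed to 3
    # at compile time so the compiler can fully unroll the inner loops.
    n = sizes[0] * sizes[1] * sizes[2]
    chunk_size = 1024
    for block in nb.prange((n + chunk_size - 1) // chunk_size):
        start = block * chunk_size
        end = min(start + chunk_size, n)
        # First row of the block, decoded from the flat index
        res[start, 2] = start % sizes[2]
        res[start, 1] = (start // sizes[2]) % sizes[1]
        res[start, 0] = start // (sizes[2] * sizes[1])
        # Following rows computed incrementally with a carry
        for i in range(start + 1, end):
            a, b, c = res[i-1, 0], res[i-1, 1], res[i-1, 2] + 1
            if c >= sizes[2]:
                c = 0
                b += 1
                if b >= sizes[1]:
                    b = 0
                    a += 1
            res[i, 0] = a
            res[i, 1] = b
            res[i, 2] = c

def nb_compute_specialised(sizes):
    # Hypothetical dispatcher: use the specialised kernel when it applies
    res = np.empty((np.prod(sizes), len(sizes)), dtype=np.int32)
    sizes_arr = np.array(sizes, dtype=np.int32)
    if len(sizes) == 3:
        nb_kernel_3(res, sizes_arr)
    else:
        nb_kernel(res, sizes_arr)
    return res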

For better performance, you should avoid filling big arrays due to DRAM being slow. This is especially true for temporary arrays (arrays that are filled once and then never reused) due to page faults. I think the above code is optimal for output arrays not fitting in the last-level cache (LLC) of your CPU.

For output arrays fitting in the LLC, there are faster implementations than the one above, for example using a native SIMD-friendly language (but it is pretty complex to implement).

  • Thank you for your answer. Your solution was well thought out and instructive. The parallel algorithm you proposed works and taught me a lot. On my older laptop (i7-5500U, Linux), the performance was similar, but I assume users with more cores will benefit even more. I appreciate the work you’ve done.
    – Olibarer
    Commented Apr 6 at 21:30
