Filtering (reducing) a NumPy Array

Summary

A loop-based approach with a single pass and final copying, accelerated with Numba, offers the best overall trade-off between speed, memory efficiency and flexibility.
If the condition function is sufficiently fast to evaluate, the two-pass approach (filter2_nb()) may be faster, and it is more memory efficient regardless.
Also, for sufficiently large inputs, resizing instead of copying (filter_resize_xnb()) leads to faster execution.

If the data type (and the condition function) is known ahead of time and can be compiled, the Cython-accelerated versions seem to be faster.
A similar hard-coding of the condition would likely lead to a comparable speed-up with Numba acceleration as well.

When it comes to NumPy-only approaches, boolean masking and integer indexing are of comparable speed; which one comes out faster depends largely on the filtering factor, i.e. the fraction of values that passes the filtering condition.

The np.fromiter() approach is much slower (it would be off the charts in the plots below), but it does not produce large temporary objects.

Note that the following tests are meant to give some insight into the different approaches and should be taken with a grain of salt.
The most relevant assumptions are that the condition is broadcastable and that it is fast to compute.
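
For concreteness, the condition assumed throughout is a simple threshold check; a minimal sketch (the threshold matches the K used in the code below):
k = 1048576 // 2


def cond(x):
    # evaluates element-wise on a scalar (as np.fromiter() requires) and
    # broadcasts over a whole array (as masking / integer indexing require)
    return x < k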


Definitions

  1. Using a generator:
def filter_fromiter(arr, cond):
    return np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype)
  2. Using boolean masking:
def filter_mask(arr, cond):
    return arr[cond(arr)]
  3. Using integer indexing:
def filter_idx(arr, cond):
    return arr[np.nonzero(cond(arr))]

4a. Using explicit looping, with single pass and final copying/resizing

  • Cython-accelerated with copying (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np


cdef long NUM = 1048576      # number of elements in the benchmark input
cdef long MAX_VAL = 1048576  # maximum value of the input elements
cdef long K = 1048576 // 2   # filtering threshold


cdef int cond_cy(long x, long k=K):
    return x < k


cdef size_t _filter_cy(long[:] arr, long[:] result, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            result[j] = arr[i]
            j += 1
    return j


def filter_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    return result[:new_size].copy()
  • Cython-accelerated with resizing (pre-compiled condition)
def filter_resize_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    result.resize(new_size)
    return result
  • Numba-accelerated with copying
import numba as nb


K = 1048576 // 2  # filtering threshold (matches the Cython K above)


@nb.njit
def cond_nb(x, k=K):
    return x < k


@nb.njit
def filter_nb(arr, cond_nb):
    result = np.empty_like(arr)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result[:j].copy()
  • Numba-accelerated with resizing
@nb.njit
def _filter_out_nb(arr, out, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            out[j] = arr[i]
            j += 1
    return j


def filter_resize_xnb(arr, cond_nb):
    result = np.empty_like(arr)
    j = _filter_out_nb(arr, result, cond_nb)
    result.resize(j, refcheck=False)  # unsupported in NoPython mode
    return result
  • Numba-accelerated with a generator and np.fromiter()
@nb.njit
def filter_gen_nb(arr, cond_nb):
    for i in range(arr.size):
        if cond_nb(arr[i]):
            yield arr[i]


def filter_gen_xnb(arr, cond_nb):
    return np.fromiter(filter_gen_nb(arr, cond_nb), dtype=arr.dtype)

4b. Using explicit looping with two passes: one to determine the size of the result and one to actually perform the computation

  • Cython-accelerated (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np


# NOTE: cond_cy() and _filter_cy() are assumed to be available here; since each
# %%cython cell compiles to a separate module, in practice this cell would need
# to be merged with the one above (or share declarations via a .pxd file).


cdef size_t _filtered_size_cy(long[:] arr, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            j += 1
    return j


def filter2_cy(arr):
    cdef size_t new_size = _filtered_size_cy(arr, arr.size)
    result = np.empty(new_size, dtype=arr.dtype)
    new_size = _filter_cy(arr, result, arr.size)
    return result
  • Numba-accelerated
@nb.njit
def filter2_nb(arr, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            j += 1
    result = np.empty(j, dtype=arr.dtype)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result
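
Assuming the above definitions are in scope, a quick sanity check (a minimal sketch, with arbitrary input parameters) that the approaches agree:
import numpy as np


arr = np.random.randint(0, 1048576, 1048576)
k = 1048576 // 2
cond = lambda x: x < k

ref = filter_mask(arr, cond)
assert np.array_equal(ref, filter_idx(arr, cond))
assert np.array_equal(ref, filter_fromiter(arr, cond))
assert np.array_equal(ref, filter_nb(arr, cond_nb))
assert np.array_equal(ref, filter2_nb(arr, cond_nb))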

Timing Benchmarks

(The generator-based filter_fromiter() method is much slower than the others, by approximately 2 orders of magnitude.
Similar (and perhaps slightly worse) performance can be expected from a list comprehension.
This would be true for any explicit looping with non-accelerated code.)

The timings depend on both the input array size and the fraction of items passing the filter.
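
As a rough sketch (with hypothetical input parameters; the full benchmark sweeps over both dimensions), such timings can be collected along these lines:
import timeit

import numpy as np


arr = np.random.randint(0, 1048576, 1048576)
cond = lambda x: x < 1048576 // 2

for func in (filter_mask, filter_idx):
    t = timeit.timeit(lambda: func(arr, cond), number=10) / 10
    print(f'{func.__name__}: {t:.6f} s')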

As a function of input size

The first graph addresses the timings as a function of the input size (for ~50% filtering factor — i.e. 50% of the elements appear in the result):

[Figure bm_size: timings as a function of the input size]

In general, explicit looping with one form of acceleration leads to the fastest execution, with slight variations depending on input size.

Within NumPy, the integer indexing approaches are basically on par with boolean masking.

The benefits of using np.fromiter() (no pre-allocation) can be reaped by writing a Numba-accelerated generator, which would come out slower than the other approaches (within an order of magnitude), but much faster than pure-Python looping.

As a function of filling

The second graph addresses the timings as a function of items passing through the filter (for a fixed input size of ~1 million elements):

[Figure bm_filling: timings as a function of the filling (fraction of items passing the filter)]

The first observation is that all methods are slowest when approaching ~50% filling; with less (or more) filling they get faster, and they are fastest toward no filling at all (i.e. the highest percentage of filtered-out values and the lowest percentage of passing values, as indicated on the x-axis of the graph).

Again, explicit looping with some means of acceleration leads to the fastest execution.

Within NumPy, the integer indexing and boolean masking approaches are again basically the same.

(Full code available here)


Memory Considerations

The generator-based filter_fromiter() method requires only minimal temporary storage, independent of the size of the input.
Memory-wise this is the most efficient method.
This approach can be effectively sped up with a Numba-accelerated generator.

Of similar memory efficiency are the Cython / Numba two-pass methods, because the size of the output is determined during the first pass.
The caveat here is that computing the condition has to be fast for these methods to be fast.

On the memory side, the single-pass solutions for both Cython and Numba require a temporary array the size of the input.
Hence, they are not very memory-efficient compared to the two-pass or generator-based approaches.
Their asymptotic temporary memory footprint is nonetheless similar to that of masking, although the constant term is typically larger.

The boolean masking solution requires a temporary array the size of the input but of type bool, which in NumPy is 1 byte per element, i.e. ~8 times smaller than the default 8-byte numeric dtype on a typical 64-bit system.

The integer indexing solution has the same requirement in its first step, where a boolean mask is produced (inside the np.nonzero() call); the mask is then converted to a series of ints (typically int64 on a 64-bit system) in the second step (the output of np.nonzero()).
This second step therefore has variable memory requirements, depending on the number of filtered elements.
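
A minimal sketch of these footprints (assuming an 8-byte integer dtype; the numbers are illustrative):
import numpy as np


arr = np.arange(1048576, dtype=np.int64)
mask = arr < 1048576 // 2  # bool mask: 1 byte per element
idx = np.nonzero(mask)[0]  # int64 indices: 8 bytes per passing element
print(arr.nbytes, mask.nbytes, idx.nbytes)
# 8388608 1048576 4194304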


Remarks

  • both boolean masking and integer indexing require some form of conditioning that is capable of producing a boolean mask (or, alternatively, a list of indices); in the above implementation, the condition is broadcastable
  • the generator and the Numba-accelerated methods are also the most flexible when it comes to specifying a different filtering condition
  • the Numba-accelerated methods require the condition to be Numba-compatible to access the Numba acceleration in NoPython mode
  • the Cython solution requires specifying the data types for it to be fast, plus extra effort for dispatching over multiple types, and further work (not explored here) to reach the same level of flexibility as the other methods
  • for both Numba and Cython, the filtering condition can be hard-coded, leading to marginal but appreciable speed differences
  • the single-pass solutions require additional code to handle the unused (but otherwise initially allocated) memory.
  • the NumPy methods do NOT return a view of the input, but a copy, as a result of advanced indexing:
arr = np.arange(100)
k = 50
print('`arr[arr > k]` is a copy: ', arr[arr > k].base is None)
# `arr[arr > k]` is a copy:  True
print('`arr[np.where(arr > k)]` is a copy: ', arr[np.where(arr > k)].base is None)
# `arr[np.where(arr > k)]` is a copy:  True
print('`arr[:k]` is a copy: ', arr[:k].base is None)
# `arr[:k]` is a copy:  False

(EDITED: various improvements based on @ShadowRanger, @PaulPanzer, @max9111 and @DavidW comments.)
