Author: Jeremy Tan

Before reading this article, you should have basic knowledge of Python and its iterable types. If not, do read the introduction to Python included among these learning resources.

NumPy's Purpose

Data scientists often work with highly structured data, like time series or plots of pixel values. On these data sets, they routinely need to perform operations uniformly and efficiently. Without external packages, Python can express and handle some of these operations reliably. For huge sets, however – like the training and test data provided to machine learning algorithms – Python becomes complicated and sluggish.

NumPy fills in the gaps, simplifying (and speeding up) the code required to process and visualise data sets large and small. So many other science-oriented Python libraries depend on NumPy (pandas, matplotlib, etc.) that NumPy describes itself as "the fundamental package for scientific computing in Python".

NumPy's Benefits and Drawbacks

NumPy's main benefit is its abstraction of arrays. In a traditional imperative style, handling multi-dimensional arrays requires multiple for loops or list comprehensions; both approaches lead to tedious, error-prone code, and do not take advantage of vector instructions in modern processors.

For example, to find the average of each column in a two-dimensional array (list of lists):

def find_averages(arr):
    n_cols = len(arr[0])
    n_rows = len(arr)
    # one average per column: sum down the rows, then divide by the row count
    return [sum(arr[j][i] for j in range(n_rows)) / n_rows for i in range(n_cols)]

This is quite a lot of work for a simple task. The equivalent code in NumPy is a single call:

import numpy as np
np.mean(arr, axis=0) # axis=0 averages down each column

If you frequently deal with data sets more complicated than one-dimensional lists, NumPy is well suited to your work.
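
As a quick check that the loop-based and NumPy approaches agree, the sketch below computes column averages both ways on a small made-up data set. The helper name column_averages and the sample values are illustrative only; they are not part of NumPy.

```python
import numpy as np

def column_averages(rows):
    """Average each column of a list of equal-length lists (pure Python)."""
    n_rows = len(rows)
    # zip(*rows) regroups the values column by column
    return [sum(column) / n_rows for column in zip(*rows)]

# Illustrative data: 3 rows, 2 columns
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

pure = column_averages(data)         # plain list of floats
vectorised = np.mean(data, axis=0)   # NumPy array; axis=0 averages down each column

print(pure)
print(vectorised)
```

Both results hold the same two averages; the NumPy version simply returns them as an array rather than a list.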

Transcribing array-based mathematics into NumPy is also made simpler by the many array functions and operators in its library. Not only is the code clearer – a Python virtue – but it can reduce development and debugging time.

# A and B are 2-by-2 matrices, c is a scalar
A = np.array([[1, 5], [-1, 3]])
B = np.array([[0, 2], [2, -3]])
c = 2

c * A # scalar multiplication
A @ B # matrix multiplication
A > B # compare A and B elementwise
B.T # matrix transpose
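
One point that often trips up newcomers is that * between two arrays multiplies matching entries, whereas @ performs true matrix multiplication. A minimal sketch using the same A and B as above:

```python
import numpy as np

A = np.array([[1, 5], [-1, 3]])
B = np.array([[0, 2], [2, -3]])

elementwise = A * B  # multiplies entries in matching positions
matrix = A @ B       # rows of A combined with columns of B

print(elementwise)
print(matrix)
```

The two results differ in every entry here, so mixing up the operators rarely goes unnoticed for long.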

The downside to these benefits is a large memory footprint: NumPy's performance depends on compiled C and Fortran code, so it may not be suitable for embedded systems. All elements of a NumPy array must also share one type, either a primitive C type or something similar to one, for the rest of NumPy to work well. In particular, Python integers that are too large for a fixed-width type are stored as generic Python objects, negating the speed improvements, while strings have to be stored as fixed-width character arrays.

>>> np.array([1.5, "a"])
array(['1.5', 'a'], dtype='<U32') # not a numeric type!
>>> np.array([10 ** 20, 1])
array([100000000000000000000, 1], dtype=object)
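
To see which element type NumPy has chosen, inspect the dtype attribute. The following sketch contrasts a fixed-width integer array with the object fallback; the variable names are illustrative.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int64)  # fixed-width 64-bit integers
big = np.array([10 ** 20, 1])                # 10**20 does not fit in 64 bits

print(small.dtype, small.itemsize)  # itemsize is the number of bytes per element
print(big.dtype)                    # falls back to generic Python objects

# Arithmetic on the object array is still correct, but each element is
# handled as an ordinary Python integer rather than by compiled code.
doubled = big * 2
print(doubled)
```

Checking dtype early is a cheap way to catch an accidental fallback before it slows down a larger computation.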

Another limitation occurs when resizing arrays. In most cases, enlarging an array creates a new copy, as in C, rather than simply making space for the new cells as a Python list does. Code that works with data at scale therefore has to manage memory deliberately.

>>> A = np.array([[5, 0], [-1, -3]])
>>> B = np.append(A, [[-3], [1]], axis=1) # creates a new object
>>> B[0,0] += 1
>>> A
array([[ 5,  0],
       [-1, -3]]) # not modified
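
One common way to limit this copying, sketched here under the assumption that the final size is known in advance, is to allocate the array once and fill it in place instead of appending repeatedly:

```python
import numpy as np

n = 5

# Growing with np.append: every call allocates a new array and copies the old data
grown = np.array([], dtype=np.int64)
for i in range(n):
    grown = np.append(grown, i * i)

# Preallocating: a single allocation, then in-place writes
filled = np.empty(n, dtype=np.int64)
for i in range(n):
    filled[i] = i * i

print(grown)
print(filled)
```

Both arrays end up with the same contents; the difference lies only in how many intermediate copies were made along the way.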

Internals of NumPy

Although NumPy overall is an intricate library, you should be able to easily grasp its two main features: C-like arrays and the functions that work on them.

The Array Type

NumPy is built around its array type, ndarray (the name hints at its N-dimensional nature). Unlike Python lists, the elements of an ndarray are stored contiguously in memory, reducing interpreter overhead and enabling vectorised operations.
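
The memory layout can be inspected directly. The sketch below, which fixes the element type to 64-bit integers so the numbers are predictable, shows the attributes that describe how an array's data is laid out:

```python
import numpy as np

m = np.array([[8, 1, 6], [3, 5, 7], [4, 9, 2]], dtype=np.int64)

print(type(m))                   # the underlying type is numpy.ndarray
print(m.flags['C_CONTIGUOUS'])   # whether the data forms one C-ordered block
print(m.strides)                 # bytes to step to the next element along each axis
print(m.nbytes)                  # total bytes of element data
```

With 8-byte elements and rows of three, moving to the next row skips 24 bytes and moving to the next column skips 8.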

An array object can be initialised from iterables or nested lists through np.array (assuming NumPy has been imported as np, as is conventional; the rest of this article assumes the same):

a = np.array([7, 3, 9, 3, 1, 3, 3, 7]) # 1D array
b = np.array([[8, 1, 6], [3, 5, 7], [4, 9, 2]]) # 2D array

By default, these arrays are indexed in C order (last index changes fastest), as can be illustrated by invoking the reshape array method:

>>> np.array(range(27)).reshape((3,3,3))
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
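
For contrast, reshape also accepts an order argument. The sketch below places the default C order next to Fortran order (first index changes fastest) on a small array:

```python
import numpy as np

flat = np.arange(6)

c_order = flat.reshape((2, 3))             # default: last index changes fastest
f_order = flat.reshape((2, 3), order='F')  # Fortran order: first index changes fastest

print(c_order)
print(f_order)
```

In C order the values fill each row in turn; in Fortran order they fill each column in turn.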

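Otherwise, indexing is mostly as in standard Python, with the convenience that the extra square brackets can be omitted for multi-axis (array coordinate) indexing, so a[0][1][2] becomes a[0,1,2]. The comma form also lets each axis take its own slice, as in a[0,1:,2], which chained brackets cannot express directly. NumPy's documentation on indexing gives a more detailed explanation.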
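
A short sketch may help here, because chained brackets and comma-separated indices only coincide when every index is an integer. Once a slice is involved, each chained bracket applies to the first remaining axis, while the comma form addresses each axis separately. The array name cube is illustrative.

```python
import numpy as np

cube = np.arange(27).reshape((3, 3, 3))

# Integer indices: both forms select the same element
chained = cube[0][1][2]
combined = cube[0, 1, 2]

# With a slice, the comma form addresses each axis in turn:
# block 0, rows 1 onward, column 2
per_axis = cube[0, 1:, 2]

# Chained brackets re-index the result of the previous step instead:
# cube[0][1:] keeps rows 1 and 2, and the following [1] picks the
# second of those rows in full
chained_slice = cube[0][1:][1]

print(chained, combined)
print(per_axis)
print(chained_slice)
```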

Library Functions

In addition to its array type, NumPy provides a large library of functions for manipulating arrays. These include linear algebra routines such as the solver used in the example below:

# Example: solve a system of linear equations with NumPy
>>> A = np.array([[9, -9, -10], [8, 5, -4], [0, 7, -9]])
>>> b = np.array([-3, 5, -7])
>>> np.linalg.solve(A, b)
array([0.95958854, 0.22997796, 0.95664952])

Many of these functions work as well on one-dimensional arrays as they do on 11-dimensional ones, because they are written against the general array type rather than a fixed number of dimensions. Many also accept standard Python nested lists, automatically converting them to NumPy's array type before proceeding. Some functions take extra arguments that refine which parts of an array they work on or which values they return, allowing finer control beyond indexing and slicing. The axis argument in np.mean above is an example.
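
To illustrate both points, the sketch below passes a plain nested list to np.sum and varies the axis and keepdims arguments. The sample values are made up.

```python
import numpy as np

table = [[1, 2, 3], [4, 5, 6]]  # a plain nested list, converted automatically

total = np.sum(table)                         # sum over every element
per_column = np.sum(table, axis=0)            # collapse the rows
per_row = np.sum(table, axis=1)               # collapse the columns
kept = np.sum(table, axis=1, keepdims=True)   # same sums, but shape (2, 1)

print(total)
print(per_column)
print(per_row)
print(kept.shape)
```

Keeping the collapsed axis with keepdims=True is handy when the result has to be combined with the original array afterwards.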

For even faster programs, it is possible to interact with external C libraries through NumPy's application programming interface, but this is not needed for general use. The primary reason for exposing such functionality is the continuing widespread use of C libraries in mathematics and physics.

Tutorials and Resources

Where can you get started with learning NumPy, then? The NumPy User Guide contains several tutorials for programmers of all abilities.

In practice, NumPy is not often used alone, but rather as part of what NumPy's documentation calls a scientific Python distribution, which also includes several NumPy dependents. The most widely used such distribution is Anaconda.

Some of these dependents are listed below; their tutorials themselves use NumPy to drive their high-level routines. Reading them may help you understand both NumPy and the array model underlying it.

  • matplotlib: produces plots of all kinds. Input to the plotting functions is typically in the form of NumPy arrays.
  • SciPy: essentially a more elaborate NumPy, with computationally intensive routines for tasks such as optimisation and image processing.
  • scikit-learn: like SciPy, but for machine learning.
  • pandas: uses NumPy in its DataFrame object.
  • Numba: relies heavily on the contiguity offered by NumPy objects to enable massively parallel computation.