Author: Jeremy Tan
Before reading this article, you should have basic knowledge of Python and its iterable types. If not, do read the introduction to Python included among these learning resources.
Data scientists often work with highly structured data, like time series or plots of pixel values. On these data sets, they routinely need to perform operations uniformly and efficiently. Without external packages, Python can express and handle some of these operations reliably. For huge sets, however – like the training and test data provided to machine learning algorithms – Python becomes complicated and sluggish.
NumPy fills in the gaps, simplifying (and speeding up) the code required to process and visualise data sets large and small. So many other science-oriented Python libraries depend on NumPy (pandas, matplotlib, etc.) that NumPy describes itself as "the fundamental package for scientific computing in Python".
NumPy's main benefit is its abstraction of arrays. In a traditional imperative style, handling multi-dimensional arrays requires multiple for loops or list comprehensions; both approaches lead to tedious, error-prone code, and do not take advantage of vector instructions in modern processors.
For example, to find the average of each column in a two-dimensional array (list of lists):
def find_averages(arr):
    l = len(arr[0])  # number of columns
    m = len(arr)     # number of rows
    # one average per column, collected in a list
    return [sum(arr[j][i] for j in range(m)) / m for i in range(l)]
This is quite a lot of work for a simple task. The equivalent code in NumPy is a single call:
import numpy as np
np.mean(arr, axis=0)  # axis=0 averages down each column
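As a rough check that the two approaches agree, the sketch below compares a pure-Python column average with np.mean on a small sample. The helper name column_averages and the sample data are assumptions made for illustration only.

```python
import numpy as np


def column_averages(rows):
    """Average each column of a non-empty list of equal-length rows."""
    n_rows = len(rows)
    # zip(*rows) regroups the values column by column
    return [sum(col) / n_rows for col in zip(*rows)]


data = [[1, 2, 3], [4, 5, 6]]        # hypothetical sample data
pure_python = column_averages(data)  # [2.5, 3.5, 4.5]
with_numpy = np.mean(data, axis=0)   # axis=0 averages down each column

print(pure_python)
print(with_numpy)
```

Both versions give the same numbers; the NumPy call simply leaves the looping to compiled code.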
If you frequently deal with data sets more complicated than one-dimensional lists, NumPy is well suited to your work.
Transcribing array-based mathematics into NumPy is also made simpler by the many array functions and operators in its library. Not only is the code clearer – a Python virtue – but it can reduce development and debugging time.
# A and B are 2-by-2 matrices, c is a scalar
A = np.array([[1, 5], [-1, 3]])
B = np.array([[0, 2], [2, -3]])
c = 2
c * A # scalar multiplication
A @ B # matrix multiplication
A > B # compare A and B elementwise
B.T # matrix transpose
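For reference, here is a small sketch that evaluates the same four expressions and records their results in comments. The variable names on the left-hand side are assumptions added for illustration.

```python
import numpy as np

A = np.array([[1, 5], [-1, 3]])
B = np.array([[0, 2], [2, -3]])
c = 2

scaled = c * A    # [[ 2, 10], [-2,  6]]
product = A @ B   # [[10, -13], [ 6, -11]]
greater = A > B   # [[ True,  True], [False,  True]]
transposed = B.T  # B is symmetric, so its transpose equals B itself

for result in (scaled, product, greater, transposed):
    print(result)
```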
The cost of these benefits is a large footprint: NumPy's performance depends on compiled C and Fortran code, so it may not be suitable for embedded systems. All elements of a NumPy array must also share a single type, ideally a primitive C type or something similar to one, for the rest of NumPy to work well. In particular, Python integers too large for a fixed-width integer type are stored as generic Python objects, negating the speed improvements, while strings are stored as fixed-width character arrays.
>>> np.array([1.5, "a"])
array(['1.5', 'a'], dtype='<U32') # not a numeric type!
>>> np.array([10 ** 20, 1])
array([100000000000000000000, 1], dtype=object)
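One way to catch these cases early is to inspect an array's dtype right after creating it. The sketch below is illustrative only; the variable names are assumptions, and converting to float64 is just one possible workaround that trades exactness for a numeric type.

```python
import numpy as np

small = np.array([1, 2, 3])
mixed = np.array([1.5, "a"])
huge = np.array([10 ** 20, 1])

print(small.dtype.kind)  # 'i': a fixed-width signed integer type
print(mixed.dtype.kind)  # 'U': Unicode strings, not numbers
print(huge.dtype)        # object: generic Python objects

# Requesting float64 keeps the array numeric, at the cost of integer precision
approx = np.array([10 ** 20, 1], dtype=np.float64)
print(approx.dtype)      # float64
```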
Another limitation occurs when resizing arrays. In most cases, enlarging an array creates a new copy, as in C, rather than simply making space for the new cells as a Python list does. Code that works with data at scale therefore has to plan its memory use, for example by allocating arrays at their final size up front.
>>> A = np.array([[5, 0], [-1, -3]])
>>> B = np.append(A, [[-3], [1]], axis=1) # creates a new object
>>> B[0,0] += 1
>>> A
array([[ 5,  0],
       [-1, -3]]) # not modified
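Because each enlargement copies the whole array, a common way to limit the cost is to allocate the array once at its final size and fill it in place. The following is a minimal sketch under that assumption; the size n and the squared values are hypothetical.

```python
import numpy as np

n = 5  # hypothetical number of values to store

# Growing with np.append: every call allocates a new array and copies the old one
grown = np.array([], dtype=np.int64)
for i in range(n):
    grown = np.append(grown, i * i)

# Preallocating: a single allocation, then values are written in place
filled = np.empty(n, dtype=np.int64)
for i in range(n):
    filled[i] = i * i

print(grown)
print(filled)
```

Both loops produce the same values, but the first performs n allocations and copies while the second performs one allocation.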
Although NumPy overall is an intricate library, you should be able to easily grasp its two main features: C-like arrays and the functions that work on them.
NumPy is built around its ndarray type (short for N-dimensional array, hinting at its multidimensional nature), whose instances are usually created with the np.array function. Unlike Python lists, the elements of an array are stored contiguously in memory as raw values of a single type, reducing interpreter overhead and enabling vectorised operations.
An array object can be initialised from sequences (such as lists, tuples and ranges) or nested lists through np.array (assuming NumPy has been imported as np, as is usually done in production code, and which will be assumed in the rest of this article):
a = np.array([7, 3, 9, 3, 1, 3, 3, 7]) # 1D array
b = np.array([[8, 1, 6], [3, 5, 7], [4, 9, 2]]) # 2D array
By default, these arrays are laid out in C order (the last index changes fastest), as can be illustrated by invoking the reshape array method:
>>> np.array(range(27)).reshape((3,3,3))
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
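To see what C order means in contrast with the alternative, the sketch below reshapes the same six values in both C order and Fortran order (first index changes fastest) using the order argument of reshape. The variable names are assumptions for illustration.

```python
import numpy as np

flat = np.arange(6)

# Default C order: values fill each row before moving to the next row
c_order = flat.reshape((2, 3))             # rows: [0, 1, 2] and [3, 4, 5]

# Fortran order: values fill each column before moving to the next column
f_order = flat.reshape((2, 3), order="F")  # rows: [0, 2, 4] and [1, 3, 5]

print(c_order)
print(f_order)
```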
Otherwise, indexing is mostly as in standard Python, with the convenience that extra square brackets can be omitted for multi-axis (array coordinate) indexing, so a[0][1][2] becomes a[0,1,2]. Slices can be mixed into the same brackets, one per axis, as in a[0,1:,2].
See this page for a more detailed explanation.
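A short sketch of multi-axis indexing on a 3-by-3-by-3 array follows. It also shows that a slice inside the brackets applies to its own axis, so a chained form needs an explicit placeholder slice to select the same elements. The array a here is an assumption made for illustration.

```python
import numpy as np

a = np.arange(27).reshape((3, 3, 3))

# Integer indices: chained brackets and one multi-axis index agree
assert a[0][1][2] == a[0, 1, 2] == 5

# Slices: each entry inside the brackets applies to its own axis
block = a[0, 1:, 2]        # plane 0, rows 1 onward, column 2 -> [5, 8]

# The chained equivalent must keep the row axis with ":" before picking column 2
chained = a[0][1:][:, 2]   # also [5, 8]

print(block, chained)
```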
In addition to its array type, NumPy provides a large library for manipulating arrays. These include:
# Example: solve a system of linear equations with NumPy
>>> A = np.array([[9, -9, -10], [8, 5, -4], [0, 7, -9]])
>>> b = np.array([-3, 5, -7])
>>> np.linalg.solve(A, b)
array([0.95958854, 0.22997796, 0.95664952])
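A quick way to sanity-check a computed solution is to substitute it back into the system. The sketch below does this with np.allclose, which compares floating-point values within a tolerance; the names x and residual are assumptions for illustration.

```python
import numpy as np

A = np.array([[9, -9, -10], [8, 5, -4], [0, 7, -9]])
b = np.array([-3, 5, -7])

x = np.linalg.solve(A, b)

# Substituting x back into the system should reproduce b up to rounding error
residual = A @ x - b
print(residual)
print(np.allclose(A @ x, b))
```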
Many of these functions work as well on 11-dimensional arrays as they do on one-dimensional ones, because they are written against the general array type and run as compiled code. Many can take in standard Python nested lists, automatically converting them to NumPy's array type before proceeding. Some functions also accept extra arguments that refine the parts of arrays they work on or the values they return, allowing finer control beyond indexing and slicing. The axis argument in np.mean above is an example.
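As an illustration of both points, the sketch below passes a plain nested list to np.mean and varies the axis argument; keepdims is another such optional argument, which keeps the reduced axis with length one. The sample table is hypothetical.

```python
import numpy as np

table = [[1, 2, 3], [4, 5, 6]]  # plain nested list, converted automatically

overall = np.mean(table)             # 3.5, the mean of all six values
per_column = np.mean(table, axis=0)  # [2.5, 3.5, 4.5]
per_row = np.mean(table, axis=1)     # [2.0, 5.0]

# keepdims=True keeps the averaged axis, giving shape (2, 1) instead of (2,)
per_row_kept = np.mean(table, axis=1, keepdims=True)

print(overall, per_column, per_row, per_row_kept.shape)
```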
For even faster programs, it is possible to interact with external C libraries through NumPy's application programming interface, but this is not needed for general use. The primary reason for exposing this functionality is the continuing widespread use of C libraries in mathematics and physics.
Where can you get started with learning NumPy, then? The NumPy User Guide contains several tutorials for programmers of all abilities, including the following:
In practice, NumPy is not often used alone, but rather as part of what NumPy's documentation calls a scientific Python distribution, which also includes several NumPy dependents. The most widely used such distribution is Anaconda.
Some of these dependents are listed below, with links to tutorials that themselves use NumPy to drive their high-level routines. Reading them may illuminate your journey in understanding NumPy and the underlying array language.
pandas, built around its DataFrame object.