Author: Jeremy Tan

Before reading this article, you should have basic knowledge of Python and its iterable types. If not, do read the introduction to Python included among these learning resources.

Data scientists often work with highly structured data, like time series or plots of pixel values. On these data sets, they routinely need to perform operations uniformly and efficiently. Without external packages, Python can express and handle some of these operations reliably. For huge sets, however – like the training and test data provided to machine learning algorithms – Python becomes complicated and sluggish.

NumPy fills in the gaps, simplifying (and speeding up) the code required to process and visualise data sets large and small. So many other science-oriented Python libraries depend on NumPy (pandas, matplotlib, etc.) that NumPy describes itself as "the fundamental package for scientific computing in Python".

NumPy's main benefit is its abstraction of arrays. In a traditional imperative style, handling multi-dimensional arrays requires multiple for loops or list comprehensions; both approaches lead to tedious, error-prone code, and do not take advantage of vector instructions in modern processors.

For example, to find the average of each column in a two-dimensional array (list of lists):

```
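def find_averages(arr):
    """Return the average of each column of a list of lists."""
    n_cols = len(arr[0])
    n_rows = len(arr)
    # One average per column: sum down the rows, then divide by the row count.
    return [sum(arr[j][i] for j in range(n_rows)) / n_rows for i in range(n_cols)]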
```

This is quite a lot of work for a simple task. The equivalent code in NumPy is a single call:

```
import numpy as np
np.mean(arr, 0)
```

If you frequently deal with data sets more complicated than one-dimensional lists, NumPy is well suited to your work.

NumPy's many array functions and operators also make it simpler to transcribe array-based mathematics into code. Not only is the resulting code clearer – a Python virtue – but it can also reduce development and debugging time.

```
# A and B are 2-by-2 matrices, c is a scalar
A = np.array([[1, 5], [-1, 3]])
B = np.array([[0, 2], [2, -3]])
c = 2
c * A # scalar multiplication
A @ B # matrix multiplication
A > B # compare A and B elementwise
B.T # matrix transpose
```

The downside to these benefits is a large memory footprint: NumPy's performance depends on compiled C and Fortran code,
so it may not be suitable for embedded systems. All elements of a NumPy array must also be of *the same* type,
a primitive C type or something similar to one, for the rest of NumPy to work well; in particular Python integers
are treated as generic Python objects if they are too large, negating speed improvements, while strings have to be
interpreted as character arrays.

```
>>> np.array([1.5, "a"])
array(['1.5', 'a'], dtype='<U32') # not a numeric type!
>>> np.array([10 ** 20, 1])
array([100000000000000000000, 1], dtype=object)
```
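To check which element type NumPy has settled on, you can inspect an array's `dtype` attribute. The following is a minimal sketch with illustrative variable names; it requests `np.int64` explicitly in one case, because the default integer type can differ between platforms.

```python
import numpy as np

# Mixing integers and floats promotes every element to a float type.
mixed = np.array([1, 2.5, 3])

# Requesting a dtype explicitly avoids platform-dependent defaults.
small = np.array([1, 2, 3], dtype=np.int64)

# An integer too large for a fixed-width type falls back to Python objects.
huge = np.array([10 ** 20, 1])

print(mixed.dtype)  # float64
print(small.dtype)  # int64
print(huge.dtype)   # object
```

An `object` dtype is usually a sign that the fast compiled code paths will not be used for that array.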

Another limitation occurs when resizing arrays. In most cases, enlarging an array creates a new copy, as in C, rather than simply making space for the new cells, as in Python. Code that works with data at scale therefore has to manage memory deliberately, for instance by allocating arrays at their final size.

```
>>> A = np.array([[5, 0], [-1, -3]])
>>> B = np.append(A, [[-3], [1]], axis=1) # creates a new object
>>> B[0,0] += 1
>>> A
array([[ 5,  0],
       [-1, -3]]) # not modified
```
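Because each call to `np.append` allocates a fresh array, a common workaround is to allocate the array once at its final size and fill it in place. The sketch below compares the two approaches on a small example; the names `grown` and `filled` are illustrative, and it assumes the final size `n` is known in advance.

```python
import numpy as np

n = 5

# Approach 1: grow step by step; every np.append builds a new array.
grown = np.array([], dtype=np.int64)
for i in range(n):
    grown = np.append(grown, i * i)

# Approach 2: allocate once, then write into the existing cells.
filled = np.empty(n, dtype=np.int64)
for i in range(n):
    filled[i] = i * i

print(grown)   # [ 0  1  4  9 16]
print(filled)  # [ 0  1  4  9 16]
```

Both loops produce the same values, but the second touches only one block of memory, which matters more as `n` grows.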

Although NumPy as a whole is an intricate library, you should be able to grasp its two main features easily: C-like arrays and the functions that work on them.

NumPy is built around its `ndarray` type (short for "N-dimensional array", and usually just called an array).
Unlike those of Python lists, the elements of an `ndarray` instance are stored contiguously in memory,
reducing interpreter overhead and enabling parallel operations.

An array can be initialised from iterables or nested lists through `np.array` (assuming NumPy has been
imported as `np`, as is usually done in production code, and which will be assumed in the rest of this article):

```
a = np.array([7, 3, 9, 3, 1, 3, 3, 7]) # 1D array
b = np.array([[8, 1, 6], [3, 5, 7], [4, 9, 2]]) # 2D array
```
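The contiguous layout can be inspected through a few array attributes. The sketch below rebuilds the 2D array from the example above, with `dtype=np.int64` added as an assumption so that the byte counts are the same on every platform.

```python
import numpy as np

b = np.array([[8, 1, 6], [3, 5, 7], [4, 9, 2]], dtype=np.int64)

print(b.shape)                  # (3, 3)
print(b.ndim)                   # 2
print(b.itemsize)               # 8, bytes per int64 element
print(b.strides)                # (24, 8), bytes to step along each axis
print(b.flags["C_CONTIGUOUS"])  # True
```

The strides show that moving one step along the last axis skips a single element, while moving one step along the first axis skips a whole row of three.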

By default, these arrays are indexed in C order (last index changes fastest), as can be illustrated by
invoking the `reshape` array method:

```
>>> np.array(range(27)).reshape((3,3,3))
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
```
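For comparison, `reshape` also accepts an `order` argument. The following small sketch contrasts the default C order with Fortran order (`order="F"`), where the first index changes fastest; the variable names are illustrative.

```python
import numpy as np

flat = np.arange(6)

c_order = flat.reshape((2, 3))             # default, same as order="C"
f_order = flat.reshape((2, 3), order="F")  # first index changes fastest

print(c_order)
# [[0 1 2]
#  [3 4 5]]
print(f_order)
# [[0 2 4]
#  [1 3 5]]
```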

Otherwise, indexing is mostly as in standard Python, with the convenience that extra square brackets can be
omitted for multi-*axis* (array coordinate) indexing, so `a[0][1][2]` becomes `a[0,1,2]`. Slices can be used
in any position, as in `a[0,1:,2]`.
See this page for a more detailed explanation.
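As a short illustration of multi-axis indexing, the sketch below reuses the 3-by-3-by-3 array of the numbers 0 to 26 from the `reshape` example, bound here to the name `a`.

```python
import numpy as np

a = np.arange(27).reshape((3, 3, 3))

# Chained brackets and a single multi-axis index reach the same element.
print(a[0][1][2])   # 5
print(a[0, 1, 2])   # 5

# Slices and integers can be mixed inside one pair of brackets.
print(a[0, 1:, 2])  # [5 8]
print(a[:, 0, 0])   # [ 0  9 18]
```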

In addition to its array type, NumPy provides a large library for manipulating arrays. These include:

- creating special types of arrays
- sorting and searching
- random number generation
- linear algebra
- other mathematical functions

```
# Example: solve a system of linear equations with NumPy
>>> A = np.array([[9, -9, -10], [8, 5, -4], [0, 7, -9]])
>>> b = np.array([-3, 5, -7])
>>> np.linalg.solve(A, b)
array([0.95958854, 0.22997796, 0.95664952])
```
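The other categories in the list can be sampled in the same way. The sketch below picks one or two functions from each as examples; the selection is illustrative rather than exhaustive, and the seed value is arbitrary.

```python
import numpy as np

# Creating special types of arrays
zeros = np.zeros((2, 3))         # 2-by-3 array filled with 0.0
identity = np.eye(3)             # 3-by-3 identity matrix
grid = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced points from 0 to 1

# Sorting and searching
a = np.array([7, 3, 9, 3, 1, 3, 3, 7])
ordered = np.sort(a)                # [1 3 3 3 3 7 7 9]
first_nine = np.argmax(a == 9)      # index of the first 9, here 2
slot = np.searchsorted(ordered, 5)  # position where 5 would go, here 5

# Random number generation, seeded so that runs are repeatable
rng = np.random.default_rng(seed=0)
samples = rng.random(4)             # 4 floats in the interval [0, 1)

# Other mathematical functions, applied elementwise
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))  # [1. 2. 3.]
```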

Many of these functions work as well on one-dimensional arrays as they do on 11-dimensional ones, since the same
compiled routines handle any number of dimensions. Many can take in standard Python nested lists, automatically
converting them to NumPy's array type before proceeding. Some functions also accept extra arguments that refine the
parts of arrays they work on or the values they return, allowing finer control beyond indexing and slicing. The `axis`
argument in `np.mean` above is an example.
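To make the effect of `axis` concrete, here is a small sketch that calls `np.mean` on a plain nested list. The `keepdims` argument is a further example of such an extra argument, and the name `table` is illustrative.

```python
import numpy as np

table = [[1, 2, 3], [4, 5, 6]]  # a plain nested list, converted automatically

print(np.mean(table))          # 3.5, mean of all six values
print(np.mean(table, axis=0))  # [2.5 3.5 4.5], one mean per column
print(np.mean(table, axis=1))  # [2. 5.], one mean per row

# keepdims=True keeps the reduced axis with length 1.
print(np.mean(table, axis=0, keepdims=True).shape)  # (1, 3)
```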

For even faster programs, it is possible to interact with external C libraries through NumPy's application programming interface, but this is not needed for general use. The primary reason for exposing such functionality is the continuing widespread use of C libraries in mathematics and physics.

Where can you get started with learning NumPy, then? The NumPy User Guide contains several tutorials for programmers of all abilities, including the following:

- NumPy: the absolute basics for beginners introduces the array type.
- NumPy basics goes into slightly more advanced topics, such as broadcasting (the mechanism for interpreting operations between differently sized arrays, or between arrays and scalars).
- The quickstart encompasses both the above tutorials, fits on one page and also has handy links to commonly used functions.
- A tutorial on doing some linear algebra with NumPy.
- A cheat sheet for programmers coming from MATLAB, another array-based programming language.
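As a taste of the broadcasting mentioned in the list above, the following sketch adds a scalar, a row and a column to the same 2-by-3 array. The array names are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)
column = np.array([[100], [200]])          # shape (2, 1)

print(matrix + 1)       # the scalar is applied to every element
# [[2 3 4]
#  [5 6 7]]
print(matrix + row)     # the row is repeated for each row of matrix
# [[11 22 33]
#  [14 25 36]]
print(matrix + column)  # the column is repeated for each column of matrix
# [[101 102 103]
#  [204 205 206]]
```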

In practice, NumPy is not often used alone, but rather as part of what NumPy's documentation calls a *scientific
Python distribution*, which also includes several NumPy dependents. The most widely used such distribution is
Anaconda.

Some of these dependents are listed below, with links to tutorials that themselves use NumPy to drive their high-level routines. Reading them may help you understand both these libraries and the array library underneath them.

- matplotlib: produces plots of all kinds. Input to the plotting functions is typically in the form of NumPy arrays.
- SciPy: essentially an extension of NumPy, with computationally intensive routines for tasks such as optimisation and image processing.
- scikit-learn: like SciPy, but for machine learning.
- pandas: uses NumPy in its `DataFrame` object.
- Numba: relies heavily on the contiguity offered by NumPy objects to enable massively parallel computation.