Introduction to NumPy

Lately, I’ve been studying statistics and data analysis. I already know the Python programming language, so when looking at the two most widely used programming tools in this domain, Pandas and R, I chose Pandas, a software library for Python. Pandas builds its data structures (Series and DataFrames) on top of another library called NumPy. This post goes over some of what can be done with NumPy.

The ndarray

The fundamental data structure in NumPy is an N-dimensional array named ndarray. The zeros, empty, and array functions are used to create an ndarray.

# Import the NumPy library.
import numpy as np

arr = np.zeros((2,3), dtype=np.int64)
print(arr)
[[0 0 0]
 [0 0 0]]

It’s conventional to import numpy as np.

Notice how the zeros function creates an ndarray filled with zeros. Its first (required) parameter is shape, meaning the shape of the ndarray. In this case the shape (2,3) tells NumPy to create an ndarray with 2 rows and 3 columns.

The dtype keyword argument (not required, but it brings up an important concept) tells NumPy what we want the data type of the elements inside of the array to be.

Note: the elements inside of an ndarray are all of the same type, unlike lists which can be of mixed type (e.g. a list containing strings and integers).
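As a rough illustration of that note (the variable names are made up for this example), NumPy picks a single dtype that can hold every element when it is handed a mixed list:

```python
import numpy as np

# A Python list can hold mixed types without complaint.
mixed_list = [1, 2.5, 3]

# NumPy upcasts the integers so that every element is a float.
floats = np.array(mixed_list)
print(floats.dtype)        # float64

# Mixing numbers and strings produces an array of strings.
strings = np.array([1, "two", 3])
print(strings.dtype.kind)  # U (a Unicode string dtype)
```

The exact width of the string dtype can vary by platform, which is why the sketch only checks its kind.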

We can change the data type of the elements with ndarray’s astype method. I’ll demonstrate…

arr = np.empty((2,3), dtype=np.float64)
print(arr)
[[  2.12202817e-314   2.12199579e-314   1.61897901e-318]
 [  1.56307538e-311   0.00000000e+000   0.00000000e+000]]

The empty function creates an ndarray (of the given shape) with uninitialized values, i.e. garbage values that will differ from run to run. Notice how tiny the non-zero numbers are. Let’s cast them to integers.

arr2 = arr.astype(np.int64)
print(arr2)
[[0 0 0]
 [0 0 0]]

The astype method creates a new ndarray by copying the values of the source ndarray and casting them from floating point (float64, a.k.a. double precision) to integers. The cast drops the fractional part, so these tiny numbers all become zero.
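Here is a small sketch, using made-up values, of what that cast does to numbers that are not close to zero. It also shows that the source ndarray is left alone:

```python
import numpy as np

source = np.array([1.9, -1.9, 0.5])

# astype returns a new array; the fractional part is dropped
# (truncation toward zero), not rounded.
as_ints = source.astype(np.int64)
print(as_ints)       # [ 1 -1  0]

# The source array keeps its original dtype and values.
print(source.dtype)  # float64
print(source[0])     # 1.9
```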

The astype method is useful when converting string values to numerical values to perform calculations.

arr = np.array(['1','2','3'], dtype=np.str_)
arr2 = arr.astype(np.int64)
print(arr2)
[1 2 3]

Here we used the array function to first create an ndarray of the strings '1', '2', and '3'. Then we created a new array with those strings converted to integers.
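The same idea works for decimal strings. The sketch below (the prices array is a hypothetical example) parses strings as floats and also shows what happens when a string is not a number:

```python
import numpy as np

prices = np.array(["1.5", "2.25", "10"])

# Parse the strings as floats, then do arithmetic on the result.
values = prices.astype(np.float64)
print(values.sum())  # 13.75

# A string that is not a number cannot be cast.
cast_failed = False
try:
    np.array(["1", "abc"]).astype(np.int64)
except ValueError as err:
    cast_failed = True
    print("cast failed:", err)
```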

We passed a list as the first argument of np.array, but it can be any array-like object, such as a list or tuple. To create an ndarray of shape (2,3) like we did with zeros and empty, simply pass in an array-like object containing nested array-like objects.

arr = np.array([(1,2,3), (4,5,6)])
print(arr)
[[1 2 3]
 [4 5 6]]
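To check that nesting produced the shape we expected, we can inspect a few attributes of the result. This is a minimal sketch; grid is just an illustrative name:

```python
import numpy as np

# A list of tuples and a list of lists produce the same ndarray.
grid = np.array([(1, 2, 3), (4, 5, 6)])
same_grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.shape)  # (2, 3)
print(grid.ndim)   # 2
print(grid.size)   # 6
print(np.array_equal(grid, same_grid))  # True
```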

Array Arithmetic

import random as rand

# Create a list of 10 integers between 1 and 100.
nums = rand.sample(range(1, 101), 10)

# Convert the list to a NumPy array.
nums = np.array(nums)
print(nums)
[42 40 67 65 92 84 22 73 49 98]

We can easily perform arithmetic operations on the elements in a NumPy array. For example, let’s increase each element (integer) in the array by a factor of 10.

nums = nums * 10
print(nums)
[420 400 670 650 920 840 220 730 490 980]

Pretty neat, eh!? We can also perform arithmetic with two ndarrays as operands.

diff = nums - nums
print(diff)
[0 0 0 0 0 0 0 0 0 0]
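Subtracting an array from itself hides the fact that the operation is element by element. Here is a short sketch with two different arrays (a and b are made-up names), contrasted with how a plain Python list behaves:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Each element of a is paired with the element of b at the same position.
print(a + b)  # [11 22 33]
print(a * b)  # [10 40 90]

# Multiplying a plain list repeats it instead of scaling its elements.
repeated = [1, 2, 3] * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]
```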

Selecting ndarray Data

We can select data in an ndarray with indexing and slicing. Using indexing, we can retrieve a specific value; with slicing, we can fetch multiple values from the source ndarray.

Still working with our nums ndarray, let’s grab the first element.

print(nums[0])
420

ndarray indexing, like regular list indexing, is zero based, i.e. the first element is at index 0, the second at index 1, etc. Let’s say we wanted to select the 5th, 6th, and 7th elements. We can use a slice for that.

print(nums[4:7])
[920 840 220]

The slice starts at the index before the colon and returns that value and the ones after it up to, but not including, the value at the index after the colon. Now let’s grab data from an ndarray of a different shape.

nums2 = np.array([nums[:5], nums[5:]])
print(nums2)
print(nums2.shape)
[[420 400 670 650 920]
 [840 220 730 490 980]]
(2, 5)

Notice the shape of the ndarray, (2, 5): 2 rows and 5 columns. We can select the first row.

print(nums2[0])
[420 400 670 650 920]

We could get the second row with nums2[1] and, if there were a third row, with nums2[2], and so on. Let’s fetch the value in the third column of the second row.

print(nums2[1,2])
730

With the comma syntax [x,y] on a two-dimensional ndarray, the first position is the row and the second is the column. We can combine this syntax with slicing.

print(nums2[1,2:])
[730 490 980]
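A few more combinations, sketched on a standalone copy of the same values (table is a name made up for this example), may help make the row and column positions concrete:

```python
import numpy as np

table = np.array([[420, 400, 670, 650, 920],
                  [840, 220, 730, 490, 980]])

# [row, column] gives the same element as chaining two index operations.
print(table[1, 2] == table[1][2])  # True

# A colon on its own selects everything along that axis: here, the third column.
third_column = table[:, 2]
print(third_column)  # [670 730]

# Slices on both axes return a smaller two-dimensional block.
block = table[:, 1:3]
print(block.shape)   # (2, 2)
```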

To demonstrate how indexing and slicing work on 3-dimensional ndarrays, we’ll combine slices of the nums array with 2 more ndarrays created by the np.arange function.

nums3 = np.array([
    [nums[:5], nums[5:]],
    [np.arange(0,5), np.arange(5,10)]
])

print(nums3)
print(nums3.shape)
[[[420 400 670 650 920]
  [840 220 730 490 980]]

 [[  0   1   2   3   4]
  [  5   6   7   8   9]]]
(2, 2, 5)

To select the row with the numbers 0 to 4, we index the outermost dimension first and then move on to the inner dimensions.

print(nums3[1,0])
[0 1 2 3 4]

So if we wanted the last two elements from the second row of the first ndarray of shape (2, 5), we’d do the following.

print(nums3[0,1,-2:])
[490 980]

This is equivalent to nums3[0,1,3:], but the negative index slice (i.e. [-2:]) lets us say: start 2 before the end and go until the end.
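The slice rules used in this section can be summed up on a simple one-dimensional array. This is only a sketch; seq is an illustrative name and np.arange(10) is used so that each value equals its own index:

```python
import numpy as np

seq = np.arange(10)  # the integers 0 through 9

# The start index is included, the stop index is not.
print(seq[4:7])   # [4 5 6]

# Negative indices count back from the end.
print(seq[-1])    # 9
print(seq[-2:])   # [8 9]
print(seq[:-2])   # [0 1 2 3 4 5 6 7]
```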

Boolean Indexing

Not only can we select data from an ndarray using index values, we also have the power to select based on comparisons. In other words, if we want only the data from an ndarray that matches a certain condition, we can put a comparison expression in the square brackets to retrieve only that data.

First, I’ll create an ndarray using NumPy’s random module, populating the array with random data.

rand_data = np.random.randn(4,4)
print(rand_data)
[[ 1.41605716  1.18998145 -0.79101034  1.67810952]
 [-0.66823165  0.35587036 -0.24547048  0.11480275]
 [-0.03796473 -0.09380624  0.071222   -1.73191764]
 [ 0.15127885 -1.15335985  1.97041638  1.7231374 ]]

Like we did with arithmetic, we can use an ndarray as an operand in a comparison operation.

print(rand_data > 0)
[[ True  True False  True]
 [False  True False  True]
 [False False  True False]
 [ True False  True  True]]

The resulting ndarray, in this case, tells us where the positive values are located in the rand_data ndarray. This can be used to extract just those values.

pos_data = rand_data[rand_data > 0]
print(pos_data)
[1.41605716 1.18998145 1.67810952 0.35587036 0.11480275 0.071222
 0.15127885 1.97041638 1.7231374 ]
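Boolean arrays can also be combined and used for assignment. The following is a minimal sketch with fixed, made-up values instead of random data so that the results are predictable:

```python
import numpy as np

data = np.array([[ 1.5, -0.5],
                 [-2.0,  3.0]])

# Combine conditions with & (and) or | (or); each comparison needs parentheses.
mask = (data > 0) & (data < 2)
print(data[mask])  # [1.5]

# Assigning through a boolean index changes only the matching elements.
clipped = data.copy()
clipped[clipped < 0] = 0
print(clipped)
```

Working on a copy keeps the original data array unchanged, which makes the before and after easy to compare.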

There’s a lot more to NumPy and ndarrays, but this is enough of an introduction to get started.