Lately, I’ve been studying statistics and data analysis. I already know the Python programming language, so when looking at the two most widely used programming tools in this domain, Pandas and R, I chose Pandas – a software library for Python. Pandas builds its data structures (Series and DataFrames) on top of another library called NumPy. This post goes over some of what can be done with NumPy.
The fundamental data structure in NumPy is an N-dimensional array named ndarray. NumPy’s array creation methods are used to create an ndarray.
# Import the NumPy library.
import numpy as np

arr = np.zeros((2,3), dtype=np.int64)
print(arr)
[[0 0 0]
 [0 0 0]]
It’s conventional to import numpy as np.
Notice how the zeros method creates an ndarray filled with zeros. The first (required) parameter of the zeros method is shape, meaning the shape of the ndarray. In this case the shape (2,3) tells NumPy to create an ndarray with 2 rows and 3 columns.
The dtype keyword argument (not required, but it brings up an important concept) tells NumPy what data type we want the elements of the array to have.
Note: the elements inside of an ndarray are all of the same type, unlike lists which can be of mixed type (e.g. a list containing strings and integers).
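To see the single-type rule in action, here is a small sketch of my own (the variable names are just for illustration). When the input mixes types, NumPy picks one dtype that can hold every element.

```python
import numpy as np

# A Python list can mix integers and floats freely...
mixed_list = [1, 2.5, 3]

# ...but the ndarray built from it stores every element as a float.
floats = np.array(mixed_list)
print(floats.dtype)         # float64

# Mixing numbers and strings falls back to a string dtype,
# so even the integers are stored as text.
strings = np.array([1, "two", 3])
print(strings.dtype.kind)   # U (a unicode string dtype)
print(strings.tolist())     # ['1', 'two', '3']
```

If that silent conversion isn't what you want, passing dtype explicitly makes the intent clear.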
We can change the data type of the elements with ndarray’s astype method. I’ll demonstrate…
arr = np.empty((2,3), dtype=np.float64)
print(arr)
[[ 2.12202817e-314 2.12199579e-314 1.61897901e-318]
 [ 1.56307538e-311 0.00000000e+000 0.00000000e+000]]
The empty method creates an ndarray (of the given shape) with uninitialized values – i.e. garbage values, which will differ from run to run. In this run the non-zero numbers happen to be tiny. Let’s cast them to integers.
arr2 = arr.astype(np.int64)
print(arr2)
[[0 0 0]
 [0 0 0]]
The astype method created a new ndarray by copying the values of the source ndarray and casting them from 64-bit floating point (a.k.a. double precision) to integers. The cast drops the fractional part, so these tiny numbers all become zero.
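Garbage values make for a murky demonstration, so here is a sketch with values I picked myself. It shows the two points above: the float-to-integer cast drops the fractional part, and astype returns a new array rather than changing the source.

```python
import numpy as np

floats = np.array([1.9, -1.9, 0.5, 3.0])

# Casting to integers drops the fractional part (no rounding).
ints = floats.astype(np.int64)
print(ints.tolist())    # [1, -1, 0, 3]

# The source array keeps its original values and dtype.
print(floats.tolist())  # [1.9, -1.9, 0.5, 3.0]
print(floats.dtype)     # float64
```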
The astype method is also useful for converting string values to numerical values so that we can perform calculations on them.
arr = np.array(['1','2','3'], dtype=np.str_)
arr2 = arr.astype(np.int64)
print(arr2)
[1 2 3]
Here we used the array method to first create an ndarray of the characters 1, 2, and 3. Then we were able to create a new array with those characters translated into integers.
As the first parameter of the np.array method we used a list, but it can be any array-like object, such as a list or tuple. To create an ndarray of the shape (2,3) like we did for empty, simply pass in an array-like object containing nested array-like objects.
arr = np.array([(1,2,3), (4,5,6)])
print(arr)
[[1 2 3]
 [4 5 6]]
import random as rand

# Create a list of 10 integers between 1 and 100.
nums = rand.sample(range(1, 101), 10)
# Convert the list to a NumPy array.
nums = np.array(nums)
print(nums)
[42 40 67 65 92 84 22 73 49 98]
We can easily perform arithmetic operations on the elements in a NumPy array. For example, let’s increase each element (integer) in the array by a factor of 10.
nums = nums * 10
print(nums)
[420 400 670 650 920 840 220 730 490 980]
Pretty neat, eh!? We can also perform arithmetic with two ndarrays as operands.
diff = nums - nums
print(diff)
[0 0 0 0 0 0 0 0 0 0]
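Subtracting an array from itself isn't the most exciting result, so here is a small sketch with two different arrays (a and b are made-up names and values). The operators work element by element, pairing up values at the same position.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Each operation pairs up the elements at the same index.
print((a + b).tolist())   # [11, 22, 33]
print((a * b).tolist())   # [10, 40, 90]
print((b / a).tolist())   # [10.0, 10.0, 10.0]

# A plain number is applied to every element.
print((a * 10).tolist())  # [10, 20, 30]
```

Note that dividing two integer arrays with / gives floats, just like dividing two Python integers.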
Selecting ndarray Data
We can select data in an ndarray with indexing and slicing. Using indexing, we can retrieve a specific value; with slicing, we can fetch multiple values from the source ndarray.
Still working with our nums ndarray, let’s grab the first element with nums[0].
ndarray indexing, like regular list indexing, is zero-based – i.e. the first element is at index 0, the second at 1, etc. Let’s say we wanted to select the 5th, 6th, and 7th elements. We can use a slice for that: nums[4:7].
[920 840 220]
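Since nums was generated randomly, the sketch below rebuilds it from the values printed earlier so the selections can be re-run. The index expressions are my reconstruction of what produces the output shown.

```python
import numpy as np

# Values copied from the printed nums array (the original came from random.sample).
nums = np.array([420, 400, 670, 650, 920, 840, 220, 730, 490, 980])

# Indexing: the first element lives at index 0.
print(nums[0])     # 420

# Slicing: the 5th, 6th, and 7th elements are at indexes 4, 5, and 6.
print(nums[4:7])   # [920 840 220]
```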
The slice starts at the index before the colon and runs up to, but does not include, the index after the colon. Now let’s grab data from an ndarray of a different shape.
nums2 = np.array([nums[:5], nums[5:]])
print(nums2)
print(nums2.shape)
[[420 400 670 650 920]
 [840 220 730 490 980]]
(2, 5)
Notice the shape of the ndarray: (2, 5), meaning 2 rows and 5 columns. We can select the first row with nums2[0].
[420 400 670 650 920]
We could get the second row with nums2[1], and if there was a third, nums2[2], and so on. Let’s fetch the value in the third column of the second row: nums2[1,2], which is 730.
With the comma syntax [x,y], the first position is the row and the second the column – in a two dimensional ndarray. We can combine this syntax with slicing.
[730 490 980]
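Here are the two-dimensional selections gathered into one runnable sketch. The nums values are copied from the output above, and nums2[1, 2:] is my assumption for the slice that yields [730 490 980]; other slices could produce the same result.

```python
import numpy as np

# Values copied from the printed nums array (the original was random).
nums = np.array([420, 400, 670, 650, 920, 840, 220, 730, 490, 980])
nums2 = np.array([nums[:5], nums[5:]])

# A single index selects a whole row.
print(nums2[0])       # [420 400 670 650 920]

# [row, column]: second row, third column.
print(nums2[1, 2])    # 730

# A row index combined with a column slice: second row, third column onward.
print(nums2[1, 2:])   # [730 490 980]
```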
To demonstrate how indexing and slicing work on 3-dimensional ndarrays, we’ll use slices of the nums array and the np.arange method to create 2 more ndarrays.
nums3 = np.array([
    [nums[:5], nums[5:]],
    [np.arange(0,5), np.arange(5,10)]
])
print(nums3)
print(nums3.shape)
[[[420 400 670 650 920]
  [840 220 730 490 980]]

 [[  0   1   2   3   4]
  [  5   6   7   8   9]]]
(2, 2, 5)
To select the row with numbers 0 – 4, we first index the outermost dimension and then move on to the inner ones: nums3[1,0] picks the second 2-dimensional block, then its first row.
[0 1 2 3 4]
So if we wanted the last two elements from the second row in the first ndarray of shape (2,5), we’d use nums3[0,1,-2:].
This is equivalent to nums3[0,1,3:], but with the negative index slice (i.e. [-2:]) we can say start at 2 before the end and go until the end.
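The three-dimensional selections can be checked with the sketch below. As before, nums is rebuilt from the printed values, and the index expressions are the ones the text describes.

```python
import numpy as np

# Values copied from the printed nums array (the original was random).
nums = np.array([420, 400, 670, 650, 920, 840, 220, 730, 490, 980])
nums3 = np.array([
    [nums[:5], nums[5:]],
    [np.arange(0, 5), np.arange(5, 10)],
])

# Second block, first row: the numbers 0 through 4.
print(nums3[1, 0].tolist())        # [0, 1, 2, 3, 4]

# First block, second row, last two elements.
print(nums3[0, 1, -2:].tolist())   # [490, 980]

# The same selection written with a positive start index.
print(nums3[0, 1, 3:].tolist())    # [490, 980]
```

The negative form has the advantage of not depending on the row length: [-2:] still means "the last two" if the rows grow.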
Not only can we select data from an ndarray using index values, we also have the power to select based on comparisons. In other words, if we want only the data from an ndarray that matches a certain condition, we can put a comparison in the square brackets to retrieve just that data.
First, I’ll create an ndarray using NumPy’s random module, populating the array with random data.
rand_data = np.random.randn(4,4)
print(rand_data)
[[ 1.41605716 1.18998145 -0.79101034 1.67810952]
 [-0.66823165 0.35587036 -0.24547048 0.11480275]
 [-0.03796473 -0.09380624 0.071222 -1.73191764]
 [ 0.15127885 -1.15335985 1.97041638 1.7231374 ]]
Like we did with arithmetic, we can use an ndarray as an operand in a comparison operation.
print(rand_data > 0)
[[ True True False True]
 [False True False True]
 [False False True False]
 [ True False True True]]
The resulting ndarray, in this case, tells us where the positive values are located in the rand_data ndarray. This can be used to extract just those values.
pos_data = rand_data[rand_data > 0]
print(pos_data)
[1.41605716 1.18998145 1.67810952 0.35587036 0.11480275 0.071222 0.15127885 1.97041638 1.7231374 ]
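Because random data changes on every run, here is the same idea as a sketch with fixed values of my own choosing. It also makes one detail easier to see: the selection comes back as a flat, 1-dimensional ndarray, no matter the shape of the source.

```python
import numpy as np

data = np.array([[1.5, -0.5],
                 [-2.0, 3.0]])

# The comparison produces a boolean ndarray of the same shape.
mask = data > 0
print(mask.tolist())        # [[True, False], [False, True]]

# Using the mask as an index keeps only the values where it is True.
pos = data[mask]
print(pos.tolist())         # [1.5, 3.0]
print(pos.ndim)             # 1

# True counts as 1, so summing the mask counts the matches.
print(int(mask.sum()))      # 2
```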
There’s a lot more to NumPy and ndarrays, but this is enough of an introduction to get started.