# Statistics – Standard Deviation

Most people are familiar with the concepts of the mean, median, and mode. They are measures of the central tendency of a value that has been measured in a given population. They tell us, in different ways, about the value of an attribute at the heart of the population, rather than at the positive or negative extremes. Other measures tell us about the variation within a population. One frequently used measure of variation is the standard deviation.

Consider the following list of top grossing films as of March 2018 (gross in nominal USD, run-time in minutes):

| Film | Year | Gross (USD) | Run-time (min) |
|---|---|---|---|
| Avatar | 2009 | 2787965087 | 162 |
| Titanic | 1997 | 2187463944 | 194 |
| Star Wars: The Force Awakens | 2015 | 2068223624 | 136 |
| Jurassic World | 2015 | 1671713208 | 124 |
| The Avengers | 2012 | 1518812988 | 143 |
| Furious 7 | 2015 | 1516045911 | 137 |
| Avengers: Age of Ultron | 2015 | 1405403694 | 141 |
| Harry Potter and the Deathly Hallows – Part 2 | 2011 | 1341511219 | 130 |
| Star Wars: The Last Jedi | 2017 | 1332261619 | 152 |
| Frozen | 2013 | 1290000000 | 102 |

We can calculate the average run-time of a top grossing film to be 142.1 minutes, but this tells us nothing about the variation of run-times within the population. One way we can quantify the variation is by looking at how far each film is from the mean.

| Film | Delta from Mean |
|---|---|
| Avatar | 19.9 |
| Titanic | 51.9 |
| Star Wars: The Force Awakens | -6.1 |
| Jurassic World | -18.1 |
| The Avengers | 0.9 |
| Furious 7 | -5.1 |
| Avengers: Age of Ultron | -1.1 |
| Harry Potter and the Deathly Hallows – Part 2 | -12.1 |
| Star Wars: The Last Jedi | 9.9 |
| Frozen | -40.1 |
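For readers who want to reproduce the table, here is a minimal Python sketch of the same calculation. The variable names (`run_times`, `mean`, `deltas`) are my own choices for illustration; the run-times are the ones listed above.

```python
# Run-times in minutes, taken from the table of top grossing films above.
run_times = {
    "Avatar": 162,
    "Titanic": 194,
    "Star Wars: The Force Awakens": 136,
    "Jurassic World": 124,
    "The Avengers": 143,
    "Furious 7": 137,
    "Avengers: Age of Ultron": 141,
    "Harry Potter and the Deathly Hallows – Part 2": 130,
    "Star Wars: The Last Jedi": 152,
    "Frozen": 102,
}

# Arithmetic mean: total minutes divided by the number of films.
mean = sum(run_times.values()) / len(run_times)

# Signed distance of each film's run-time from the mean.
deltas = {film: minutes - mean for film, minutes in run_times.items()}

print(f"mean: {mean:.1f}")
for film, delta in deltas.items():
    print(f"{film}: {delta:+.1f}")
```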

Nice! Now we can clearly see the variation in film run-times. While this does make things clearer, it doesn’t give us a single measure of variation. It makes sense to find the average distance from the mean, no? But look what happens when we try to do that.

| Film | Delta from Mean |
|---|---|
| Avatar | 19.9 |
| Titanic | 51.9 |
| Star Wars: The Force Awakens | -6.1 |
| Jurassic World | -18.1 |
| The Avengers | 0.9 |
| Furious 7 | -5.1 |
| Avengers: Age of Ultron | -1.1 |
| Harry Potter and the Deathly Hallows – Part 2 | -12.1 |
| Star Wars: The Last Jedi | 9.9 |
| Frozen | -40.1 |
| **Sum** | **0** |
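The zero in the last row can be checked with a few lines of Python. This is only a sketch: the names are illustrative, and because floating-point arithmetic can leave a tiny residue, the check compares against a small tolerance (an arbitrary choice of mine) rather than testing for exactly 0.

```python
run_times = [162, 194, 136, 124, 143, 137, 141, 130, 152, 102]

mean = sum(run_times) / len(run_times)
deltas = [minutes - mean for minutes in run_times]

# Positive and negative deltas cancel, so the total is zero
# up to floating-point rounding error.
total = sum(deltas)
print(abs(total) < 1e-9)
```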

The average is useless here because when we add up the deltas, we get 0. This is because the negative values cancel out the positive ones, as one would expect if you think about it: the mean is, by definition, the balance point of the data. But this average distance from the mean seemed like such a good way to provide a single value that represents the variation in the data. Well, we can get around this issue by squaring the deltas (because squaring any number never produces a negative number). So let’s try that.

| Film | Squared Delta from Mean |
|---|---|
| Avatar | 396.01 |
| Titanic | 2693.61 |
| Star Wars: The Force Awakens | 37.21 |
| Jurassic World | 327.61 |
| The Avengers | 0.81 |
| Furious 7 | 26.01 |
| Avengers: Age of Ultron | 1.21 |
| Harry Potter and the Deathly Hallows – Part 2 | 146.41 |
| Star Wars: The Last Jedi | 98.01 |
| Frozen | 1608.01 |
| **Sum** | **5334.90** |
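The squared column and its sum can be reproduced the same way. Again this is a small sketch with names chosen for illustration, not anything prescribed by the method itself.

```python
run_times = [162, 194, 136, 124, 143, 137, 141, 130, 152, 102]

mean = sum(run_times) / len(run_times)

# Squaring removes the sign, so the values can no longer cancel each other out.
squared_deltas = [(minutes - mean) ** 2 for minutes in run_times]
sum_of_squares = sum(squared_deltas)

print(f"{sum_of_squares:.2f}")  # 5334.90
```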

Here’s a visualization of what we just did.

The red dot in the center of the number line is the mean of the run-time values (i.e. `{ 162, 194, 136, 124, 143, 137, 141, 130, 152, 102 }`). The blue dots are the distances of each film’s run-time from the mean (i.e. `{ 19.9, 51.9, -6.1, -18.1, 0.9, -5.1, -1.1, -12.1, 9.9, -40.1 }`, 2 points omitted due to space constraints). The yellow squares are the distances (deltas) squared. Now that we have a non-zero sum, we can divide it by the total number of films to calculate an average. That average is a single number that represents the variation, and it is called the variance.

```
variance = sum of squares / number of squares = 5334.90 / 10 = 533.49
```
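As a cross-check, the same number falls out of a short Python sketch. The hand-rolled calculation mirrors the formula above; `statistics.pvariance` from the standard library is the population variance (it divides by the number of values, as we do here), so the two should agree.

```python
import statistics

run_times = [162, 194, 136, 124, 143, 137, 141, 130, 152, 102]

mean = sum(run_times) / len(run_times)

# Variance as described above: the average of the squared deltas.
variance = sum((minutes - mean) ** 2 for minutes in run_times) / len(run_times)

# Standard-library population variance, for comparison.
library_variance = statistics.pvariance(run_times)

print(f"{variance:.2f}")  # 533.49
print(f"{library_variance:.2f}")  # 533.49
```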

In the graphic above, I tried to illustrate the variance as the average of these squares (shown in orange, not to scale). The variance is measured in squared minutes, so to relate it back to our original values, we simply take its square root. This square root of the variance is called the standard deviation. It is a single value that tells us how dispersed the values are within a data set. It’s shown below as the green line.

```
standard deviation = √variance = √533.49 = 23.10
```

The standard deviation of the run-times of the top 10 grossing films, as of the time of writing, is 23.1 minutes.
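To close the loop, here is a short Python sketch of the last step. It assumes we treat the ten films as the whole population, which is what dividing by 10 above implies, so it uses the population functions `statistics.pvariance` and `statistics.pstdev` from the standard library.

```python
import math
import statistics

run_times = [162, 194, 136, 124, 143, 137, 141, 130, 152, 102]

# Population variance: the average squared distance from the mean.
variance = statistics.pvariance(run_times)

# The standard deviation is the square root of the variance,
# which brings the unit back from squared minutes to minutes.
std_dev = math.sqrt(variance)

print(f"{std_dev:.2f}")  # 23.10

# statistics.pstdev performs both steps in one call.
print(f"{statistics.pstdev(run_times):.2f}")  # 23.10
```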