2 Describing data

When working with data, a summary will often help us understand the data better than just looking at all the data points. To this end, we want to be able to describe the central tendency of a variable and its variation (or dispersion). The central tendency will give us a measure of the center of the variable, while the dispersion will provide us with a measure of the spread.

In this discussion, we will introduce some light mathematical notation. We will work with a variable that we will call \(x\). Sometimes we talk about a single observation of \(x\), usually the \(i\)th observation, denoted \(x_i\). Here \(i\) is the index of the observation, running from the first observation, \(x_1\), to the last observation, \(x_n\), where \(n\) is the size of the sample. The mean of a variable is often written as \(\bar{x}\), with the bar on top of \(x\) indicating the average. Alternatively, “function” notation can be used, \(\text{Mean}(x)\). The same logic applies to the standard deviation, which is sometimes written as \(s\) and sometimes as \(\text{SD}(x)\). Later, we will talk about, for example, a sample mean and a population mean. In statistical notation, the distinction between the sample and the population is made by using different symbols. The sample mean is typically written as \(\bar{x}\), while the population mean is written with the Greek letter “mu”, \(\mu\). The standard deviation in a sample may be written as \(s\), while the population standard deviation is written with the Greek letter “sigma”, \(\sigma\). Be aware, there are many dialects of statistical notation!

2.1 Central tendency

The most common, and maybe most intuitive, measure of the center of a set of numbers is the arithmetic mean (or the average). This measure is the point on the scale that perfectly balances the weight of the measurements. To get the mean, we sum all observations and divide the sum by the number of observations. Let’s say we have \(n\) observations of the variable \(x\). Summing the values of these observations is often described using the notation \(\sum_{i=1}^n x_i\). The \(\sum\) symbol tells us that we are doing a summation, \(i=1\) indicates the first index in the sequence of values, and \(n\) says that the summation goes on until we reach the last observation. \(x_i\) indicates that we are summing over the observations of the variable \(x\). When we have summed all observations, we divide the sum by the number of values. Sometimes this whole expression is written as \(\text{Mean}(x) = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) ¹.

¹ This is the same as writing

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\] only a little bit more compact.
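The formula for the mean can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `mean` and the example values are made up for this sketch, and Python is an assumption here since the text does not prescribe a language.

```python
def mean(xs):
    """Arithmetic mean: the sum of the observations divided by their count."""
    n = len(xs)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(xs) / n


# Made-up observations x_1, ..., x_n with n = 5.
x = [2, 4, 4, 5, 10]
x_bar = mean(x)
print(x_bar)  # 5.0, since 25 / 5 = 5
```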

Another measure of central tendency is the median. Instead of relying on the weight of each observation, the median gives us the value that is placed in the middle of an ordered sequence of numbers. First, let’s order our observations so that \(x_1 \leq x_2 \leq \dots \leq x_n\). Next, if \(n\) is odd, we pick the middle observation \(x_m\) as the median (\(\text{Median}(x)\)). If \(n\) is even, there is no single middle observation, and we take the mean of the two center values.
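The two cases, odd and even \(n\), can be made concrete with a small sketch. The function name `median` and the example values are assumptions chosen for illustration.

```python
def median(xs):
    """Middle value of the sorted observations.

    For an even number of observations, return the mean of the two
    center values.
    """
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 9]))     # 5, the middle of 1, 5, 9
print(median([5, 1, 9, 3]))  # 4.0, the mean of 3 and 5
```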

Finally, we can discuss the most common number in a sequence of values. To arrive at this number, we would count the frequency of each distinct value and name the value with the highest frequency the mode.
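Counting frequencies as described above could look like the following sketch. The text does not say what to do when several values share the highest frequency; returning all of them is an assumption made here, and the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(xs):
    """Return every value that attains the highest frequency, sorted."""
    counts = Counter(xs)
    if not counts:
        raise ValueError("the mode of an empty sample is undefined")
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([3, 1, 3, 2, 1, 3]))  # [3]
print(modes([1, 1, 2, 2, 3]))     # [1, 2], two values tie for most common
```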

The first measure of central tendency, the mean, accounts for the actual value of each observation. The median, on the other hand, only takes into account the order of the sorted values. And last, the mode only takes into account the frequency distribution of values.
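One way to see the difference between the three measures is to replace a single observation with an extreme value. The sketch below uses Python's standard `statistics` module with made-up data: the mean moves because it uses the actual values, while the median and the mode stay put.

```python
import statistics

# Made-up data; the second list replaces the largest value with an outlier.
x = [2, 3, 3, 4, 5]
x_outlier = [2, 3, 3, 4, 50]

for label, data in [("original", x), ("with outlier", x_outlier)]:
    print(
        label,
        statistics.mean(data),
        statistics.median(data),
        statistics.mode(data),
    )

mean_shift = statistics.mean(x_outlier) - statistics.mean(x)
median_shift = statistics.median(x_outlier) - statistics.median(x)
```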

2.2 Measures of dispersion

The standard deviation can be read as, roughly, the typical deviation of the observations from the mean. To get this measure, your computer will start by calculating the variance. The variance is the average squared deviation from the mean. We can write this as a formula, and you will recognize most of it from the formula of the mean ². The difference is that we are doing a summation over all squared differences from the mean, \((x_i - \bar{x})^2\). Because the deviations are squared, the variance is expressed in squared units rather than on the scale of the variable itself, and is therefore, in many instances, hard to interpret. To get a more interpretable number, we take the square root of the variance to get the standard deviation.

² The formula for the variance

\[s^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2\]
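As a small worked example of this formula, the sketch below computes the variance by dividing by \(n\) and then takes the square root to get a standard deviation. The function name `variance_n` and the data are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
import math


def variance_n(xs):
    """Average squared deviation from the mean (dividing by n)."""
    n = len(xs)
    if n == 0:
        raise ValueError("the variance of an empty sample is undefined")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / n


# Made-up data with mean 5; the squared deviations sum to 32.
x = [2, 4, 4, 4, 5, 5, 7, 9]
var_x = variance_n(x)   # 32 / 8 = 4.0
sd_x = math.sqrt(var_x)  # 2.0, back on the scale of the data
print(var_x, sd_x)
```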

However, this is not the whole story. When we only have a small sample from a larger population of possible observations and we wish to estimate the variance or standard deviation of the population, it turns out that the average squared deviation from the sample mean is biased: it tends to underestimate the population variance. To correct this bias, the variance and standard deviation are usually calculated by dividing by \(n-1\) instead of \(n\), giving us the most common formula (for the standard deviation):

\[s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}\]

This correction is called Bessel’s correction³. It is the preferred way of calculating the sample variance and standard deviation in software packages like R.

³ Read more about the correction here
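The effect of dividing by \(n-1\) rather than \(n\) can be checked against Python's standard `statistics` module, where `stdev` divides by \(n-1\) and `pstdev` divides by \(n\). The helper `sample_sd` and the data are made up for this sketch.

```python
import math
import statistics


def sample_sd(xs):
    """Standard deviation with Bessel's correction (dividing by n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are needed")
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


x = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, n = 8

s_corrected = sample_sd(x)             # sqrt(32 / 7)
s_uncorrected = statistics.pstdev(x)   # sqrt(32 / 8), divides by n

# The corrected value is always the larger of the two.
print(s_corrected > s_uncorrected)
```

The difference between the two versions shrinks as \(n\) grows, since \(n-1\) and \(n\) become nearly the same.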

Similar to the description of central tendency, we can use the sorted sequence of observations to find measures of dispersion. If we order all observations from the smallest to the largest, it is easy to find the minimum and the maximum. The distance between the minimum and the maximum is called the range. Furthermore, we can use the order to find the values corresponding to the 25th and 75th percentiles and calculate the distance between them. The \(p\)th percentile is the value at or below which \(p\)% of the observations fall; put differently, it is the largest value among the lowest \(p\)% of the observations. For example, the 25th percentile divides the data into two pieces: 25% of the data is equal to or below the 25th percentile, and 75% of the data is above it. The distance between the 25th and the 75th percentile is called the interquartile range.
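The range and the interquartile range can be sketched as follows. The `percentile` helper uses a simple "nearest rank" rule, which is only one of several conventions; statistical software often interpolates between neighbouring values and may give slightly different percentiles, especially in small samples. The function name and the data are assumptions for illustration.

```python
import math


def percentile(xs, p):
    """Nearest-rank percentile: the smallest sorted value such that at
    least p percent of the observations are at or below it."""
    if not xs:
        raise ValueError("percentiles of an empty sample are undefined")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(xs)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


x = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # made-up data, the integers 1 to 9

value_range = max(x) - min(x)  # 9 - 1 = 8
q1 = percentile(x, 25)         # 3
q3 = percentile(x, 75)         # 7
iqr = q3 - q1                  # 4
print(value_range, q1, q3, iqr)
```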

The median absolute deviation (MAD) borrows the idea behind the standard deviation, summarizing how far the observations lie from the center, but replaces the mean with the median: MAD is the median of the absolute deviations from the median. In the formula⁴, a correction is used to give approximately a standard deviation when the variable is normally distributed. The choice of using MAD is usually motivated by data that are not normally distributed.

⁴ See here for details on MAD. See also the JASP guide for MAD and MAD robust here.
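A minimal sketch of MAD is shown below. The scaling constant 1.4826 is the value commonly used to make MAD comparable to the standard deviation under normality (it is, for example, the default in R's `mad` function); whether a given program applies it is something to check in its documentation. The data are made up and include one extreme value.

```python
import statistics


def mad(xs, constant=1.4826):
    """Median absolute deviation from the median, scaled by `constant`.

    Set constant=1 to get the unscaled median absolute deviation.
    """
    center = statistics.median(xs)
    absolute_deviations = [abs(x - center) for x in xs]
    return constant * statistics.median(absolute_deviations)


x = [2, 3, 3, 4, 50]  # made-up data with one extreme value

mad_raw = mad(x, constant=1)  # median of 1, 0, 0, 1, 47 is 1
mad_scaled = mad(x)           # 1.4826 * 1
sd = statistics.stdev(x)      # pulled far upward by the value 50
print(mad_raw, mad_scaled, sd)
```

In this example the extreme value inflates the standard deviation but leaves MAD almost untouched, which is the usual argument for using it.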

2.2.1 Describing categories

The above discussion is valid for numerical data on the interval or ratio scale. When we are trying to summarize categories, we often need to resort to frequency summaries. A variable that contains three categories can, for example, be summarized by counting the number of observations in each category. This allows us to calculate the relative frequency (the proportion or percentage of observations in each category) and the cumulative percentage.
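A frequency summary of this kind could be built as in the sketch below. The function name `frequency_table`, the category labels, and the alphabetical ordering of the rows are assumptions for illustration; a cumulative percentage is most meaningful when the categories have a natural order.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (category, count, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category in sorted(counts):
        count = counts[category]
        cumulative += count
        rows.append(
            (category, count, 100 * count / total, 100 * cumulative / total)
        )
    return rows


# Made-up variable with three categories and ten observations.
groups = ["a", "b", "a", "c", "b", "a", "a", "b", "c", "a"]
table = frequency_table(groups)
for row in table:
    print(row)
```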