Quantitative Methods - Lectures - Estimating a sampling distribution

Population and sample

When we are interested in continuous data, the sample mean is an unbiased estimate of the corresponding population parameter (the population mean).

The population mean:

\[\mu=\frac{\sum{X_i}}{N}\]

The sample mean:

\[\bar{x}=\frac{\sum{x_i}}{n}\]

Central tendency is captured by the “center of gravity” in the data. However, we might also want to know something about its variation.
The population variance is the average squared difference from the mean.
As the population parameters are unknown, we estimate them from our sample.

\[\sigma^2 = \frac{\sum_{i=1}^{N}{(X_i-\mu)^2}}{N}\]

\[s^2 =\frac{\sum_{i=1}^{n}{(x_i-\bar{x})^2}}{n-1}\]

The sample variance is an unbiased estimate of the population variance if we divide by \(n-1\); if we divide by \(n\), we will tend to underestimate the population value.

\[s^2 =\frac{\sum_{i=1}^{n}{(x_i-\bar{x})^2}}{n-1}\]
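The effect of the denominator can be checked with a small simulation. The sketch below assumes a normally distributed population with a known variance of 4 and repeatedly draws samples of \(n = 5\); the population, sample size and number of repetitions are illustrative choices, not part of the lecture.

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulation is reproducible

SIGMA = 2.0  # assumed population standard deviation (variance = 4)
N = 5        # assumed sample size
REPS = 5000  # number of repeated samples

biased, unbiased = [], []
for _ in range(REPS):
    sample = [random.gauss(0.0, SIGMA) for _ in range(N)]
    biased.append(statistics.pvariance(sample))   # divides by n
    unbiased.append(statistics.variance(sample))  # divides by n - 1

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)

print(f"true variance:       {SIGMA ** 2:.2f}")
print(f"average using n:     {mean_biased:.2f}")
print(f"average using n - 1: {mean_unbiased:.2f}")
```

Under these assumptions the average of the \(n\)-denominator estimates should settle near \(\frac{n-1}{n}\sigma^2 = 3.2\), while the \(n-1\) version should settle near 4.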

The variance is the average squared deviation from the mean.
The standard deviation (\(s\)) is the square root of the variance and is therefore on the same scale as the data and the mean.

\[s = \sqrt{\frac{\sum{(x_i-\bar{x})^2}}{n-1}}\]
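As a worked example, the formula for \(s\) can be followed step by step. The observations below are hypothetical and chosen only to keep the arithmetic simple.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(data)
x_bar = sum(data) / n                     # sample mean
ss = sum((x - x_bar) ** 2 for x in data)  # sum of squared deviations
s2 = ss / (n - 1)                         # sample variance
s = math.sqrt(s2)                         # sample standard deviation

print(f"mean = {x_bar}, variance = {s2:.3f}, sd = {s:.3f}")

# The standard library uses the same n - 1 definition
print(math.isclose(s, statistics.stdev(data)))
```

The variance is expressed in squared units, whereas \(s\) is back in the units of the original observations.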

Any statistic can be calculated from a sample and used as an estimate of the corresponding population parameter.
As an example, the sample mean (\(\bar{x}\)) is an unbiased estimator of the population mean. We know this because the average of the means of repeated samples from a population will be close to the population mean (\(\mu\)).
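The idea of repeated sampling can be sketched in code. The population below is hypothetical (10,000 simulated values), and the sample size and number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed so the simulation is reproducible

# Hypothetical finite population of 10,000 values
population = [random.gauss(50.0, 10.0) for _ in range(10000)]
mu = statistics.fmean(population)  # population mean

# Draw many random samples of size n and store each sample mean
n = 10
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(5000)
]

mean_of_means = statistics.fmean(sample_means)
print(f"population mean:      {mu:.2f}")
print(f"mean of sample means: {mean_of_means:.2f}")
```

Any single sample mean may miss \(\mu\) by a fair margin, but the sample means do not miss systematically in one direction.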

The variation (standard deviation) of a distribution of averages depends on the sample size: larger samples give less variable averages.
This variation can be estimated from samples and is known as the standard error.
The sample standard error is an estimate of the standard deviation of the sampling distribution!

\[SE = \frac{s}{\sqrt{n}}\]
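A simulation can illustrate both points: the spread of sample means shrinks as \(n\) grows, and the standard error from a single sample estimates that spread. The population parameters and sample sizes below are assumptions made for the sketch.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the simulation is reproducible

MU, SIGMA = 50.0, 10.0  # assumed population mean and standard deviation

def sd_of_sample_means(n, reps=4000):
    """Standard deviation of `reps` sample means, each from a sample of size n."""
    means = [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.stdev(means)

results = {}
for n in (5, 20, 80):
    one_sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    se = statistics.stdev(one_sample) / math.sqrt(n)  # SE = s / sqrt(n)
    results[n] = (sd_of_sample_means(n), se)
    print(f"n={n:3d}  SD of sample means={results[n][0]:.2f}  SE from one sample={se:.2f}")
```

The standard error from one sample is itself an estimate, so it will vary around the simulated value, most noticeably for small \(n\).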

Based on the estimate of the sampling distribution, we can devise a test of whether a value lies within a specified range.
In a normal distribution, 95% of all values lie within \(\pm 1.96\times \sigma\) of the mean, which leaves 5% of values outside this range.
However, because a theory or hypothesis cannot be proven directly, we instead test against a null hypothesis.
The null hypothesis \(H_0\) is constructed to contain the scenarios not covered by the alternative hypothesis \(H_A\).
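The 1.96 figure for the normal distribution can be verified directly. This short check uses `statistics.NormalDist` from the Python standard library; the variable names are arbitrary.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability mass between -1.96 and +1.96 standard deviations
central = z.cdf(1.96) - z.cdf(-1.96)

# Going the other way: the value that cuts off the upper 2.5%
critical = z.inv_cdf(0.975)

print(f"P(-1.96 < Z < 1.96) = {central:.4f}")
print(f"97.5th percentile   = {critical:.4f}")
```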

The null hypothesis is that the mean of group 1 is equal to the mean of group 2: \(H_0: \mu_1 - \mu_2 = 0\).
To reject this hypothesis, we need to find support for \(\mu_1 - \mu_2 \neq 0\).
We want to do this with some specified error control, usually 5%. That is, we accept that we will be wrong in a specified proportion of cases.
We can calculate a 95% confidence interval for the difference.

Upper bound:

\[\bar{x} + t_{1-\alpha/2} \times SE\]

Lower bound:

\[\bar{x} - t_{1-\alpha/2} \times SE\]

\(\bar{x}\) is the difference in means between groups.
The standard error (\(SE\)) estimates the standard deviation of the sampling distribution.
\(t_{1-\alpha/2}\) is the critical value of the \(t\)-distribution: with \(\alpha = 0.05\), 95% of the area under the distribution curve lies between \(-t_{1-\alpha/2}\) and \(+t_{1-\alpha/2}\).
The \(t\)-distribution is used instead of the normal distribution because the standard error is estimated from the sample; its heavier tails account for the extra uncertainty in small samples.
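Why the normal critical value is too small for small samples can be seen by simulation. The sketch below assumes a standard normal population and samples of \(n = 5\) (4 degrees of freedom); it standardizes each sample mean with its own estimated standard error and reads off the empirical 97.5th percentile.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the simulation is reproducible

MU, SIGMA = 0.0, 1.0  # assumed normal population
N = 5                 # small sample, so 4 degrees of freedom
REPS = 20000

t_stats = []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    t_stats.append((x_bar - MU) / se)  # standardized with the *estimated* SE

t_stats.sort()
empirical_crit = t_stats[int(0.975 * REPS)]  # empirical 97.5th percentile
normal_crit = statistics.NormalDist().inv_cdf(0.975)

print(f"97.5th percentile of simulated statistic: {empirical_crit:.2f}")
print(f"97.5th percentile of the normal:          {normal_crit:.2f}")
```

Under these assumptions the simulated percentile should land near the tabulated \(t\) value for 4 degrees of freedom (about 2.78), well above 1.96.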

Two groups are compared. \(H_0\) is that there is no difference between the groups: \(H_0: \mu_1 = \mu_2\).
The difference between the groups is estimated to be \(m_2 - m_1 = 0.97\), where \(m_1\) and \(m_2\) are the sample means.
The 95% confidence interval is \(m_2 - m_1 \pm t_{1-\alpha/2} \times SE(m_2 - m_1)\), where \(SE(m_2 - m_1)\) is the standard error of the difference.
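The interval can be computed as in the following sketch. Several things here are assumptions rather than part of the lecture: the two groups are hypothetical data (they do not reproduce the 0.97 estimate), the standard error of the difference uses a pooled variance (equal variances assumed), and the \(t\) quantile is obtained by numerically integrating the \(t\) density so that the example needs only the standard library. In practice a statistics package would supply the quantile, for example `scipy.stats.t.ppf`.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) using Simpson's rule on [0, |x|] and symmetry around 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert the CDF by bisection (helper written for this sketch)."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical measurements from two independent groups
group1 = [10.1, 9.4, 11.2, 10.8, 9.9, 10.5, 9.7, 10.3]
group2 = [11.0, 10.9, 12.1, 11.6, 10.4, 11.8, 11.3, 10.7]

n1, n2 = len(group1), len(group2)
m1, m2 = statistics.fmean(group1), statistics.fmean(group2)
diff = m2 - m1

# Assumption: equal variances, so a pooled variance is used for SE(m2 - m1)
pooled_var = ((n1 - 1) * statistics.variance(group1)
              + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

df = n1 + n2 - 2
t_crit = t_quantile(0.975, df)  # t_{1 - alpha/2} with alpha = 0.05

lower = diff - t_crit * se_diff
upper = diff + t_crit * se_diff
reject_h0 = not (lower <= 0 <= upper)  # does the interval exclude 0?

print(f"difference = {diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
print(f"reject H0: {reject_h0}")
```

If the group variances cannot be assumed equal, a different standard error and degrees of freedom would be needed; the pooled version is used here only because it is the simplest to write out.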

We can estimate population parameters using a random sample from the population.
The calculated sample standard error is an estimate of the standard deviation of a sampling distribution.
Using a probability distribution like the \(t\)- or \(z\)-distribution, we can estimate a range of plausible values for a population parameter (e.g. the mean).
We can test whether an estimated interval contains the value specified by the null hypothesis; if it does not, we can reject \(H_0\).
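The 5% error control behind this decision rule can be checked by simulation. The sketch below assumes that \(H_0\) is true (both groups come from the same normal population), uses 10 observations per group with a pooled standard error, and takes the critical value 2.101 for 18 degrees of freedom from a standard \(t\)-table. All of these are illustrative choices.

```python
import math
import random
import statistics

random.seed(5)  # fixed seed so the simulation is reproducible

N = 10          # observations per group (assumed)
T_CRIT = 2.101  # t_{0.975} for 18 degrees of freedom, from a standard t-table
REPS = 4000

def ci_excludes_zero():
    """Draw two groups from the SAME population and check the 95% CI."""
    g1 = [random.gauss(0.0, 1.0) for _ in range(N)]
    g2 = [random.gauss(0.0, 1.0) for _ in range(N)]
    diff = statistics.fmean(g2) - statistics.fmean(g1)
    # Equal group sizes: the pooled variance is the average of the two variances
    pooled_var = (statistics.variance(g1) + statistics.variance(g2)) / 2
    se = math.sqrt(pooled_var * 2 / N)
    return abs(diff) > T_CRIT * se  # True when 0 lies outside the interval

false_rejections = sum(ci_excludes_zero() for _ in range(REPS))
error_rate = false_rejections / REPS
print(f"H0 true, rejected in {error_rate:.1%} of simulated studies")
```

With \(H_0\) true, every rejection is an error, so the printed rate should be close to the 5% we agreed to accept.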