The purpose of descriptive statistics is to allow us to more easily grasp the significant features of a set of sample data. However, they tell us little about the population from which the sample was taken. Inferential statistics is the branch of statistics that deals with using sample data to make valid judgments (inferences) about the population from which the sample data came.
The table below illustrates some differences between descriptive statistics and inferential statistics. In each example, descriptive statistics are used to tell us something about a sample. Inferential statistics are used to tell us something about the corresponding population.
|Descriptive Statistics||Inferential Statistics|
|60% of the voters responding to a poll favor proposition A.||60% of the voters in the state favor proposition A, with a margin of plus or minus three percentage points.|
|In a trial study, brand A pain medicine resulted in noticeable relief an average of 20 minutes sooner than brand B medicine.||Brand A pain medicine brings noticeable relief significantly faster than brand B medicine.|
|The sample mean is 100.||The 95% confidence interval for the population mean is 97 to 103.|
|A random sample of high school students was selected to take an SAT preparation course. After completing the course, the mean SAT score for this group of students was 25 points higher.||An SAT preparation course will significantly increase students' SAT scores.|
So far, we have focused on descriptive statistics that describe a particular sample from a much larger population. Recall, however, that the ultimate goal is to be able to describe the population from which the sample came. For example, we might calculate the average reaction time of a sample of teen drivers in order to learn something about the reaction times of the population of all teen drivers. In a different context, we might determine what proportion of a sample of voters approves of the President's performance in order to learn something about the proportion of the entire population who approve.
In either case, we use the sample statistic (sample mean or sample proportion) to generate a range of likely values for the corresponding population parameter (population mean or population proportion). This range of likely values is called a confidence interval. A confidence interval has two parts: an interval of likely values and a measure of our confidence that the population parameter lies within the specified interval. The interval is generally of the form

estimate ± margin of error
where the estimate is derived from a simple random sample of the population. The estimate of the population mean is the sample mean, and the estimate of a population proportion is the sample proportion. The margin of error (which is also based on our sample) determines the size of the confidence interval. A small margin of error results in a fairly narrow range of values for the population parameter, making our confidence interval rather precise. A large margin of error results in a broad range of values for the population parameter, making our interval less precise.
Our confidence in this statistical method is given by a confidence level which is the probability that this method will result in a confidence interval that contains the population parameter. For example, a confidence level of 95% means that the method used to calculate a confidence interval will yield a result (i.e., an interval) that actually contains the population parameter 95% of the time (i.e., for 95% of all possible samples). Notice that it is always possible that the particular sample we used to calculate the confidence interval is among the 5% for which the calculated interval does not contain the population parameter.
Unfortunately, the confidence level and the margin of error go hand in hand. That is, as the confidence level increases so, too, does the margin of error. This makes sense intuitively. If you want to be more confident that your interval contains the population parameter, just make the interval bigger!
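The repeated-sampling interpretation of a confidence level can be made concrete with a simulation. The sketch below uses a hypothetical normal population (mean 100, standard deviation 15) and repeatedly builds 95% intervals of the form sample mean ± 1.960·σ/√n; roughly 95% of the intervals should contain the true mean.

```python
import random
import statistics
from math import sqrt

random.seed(1)

MU, SIGMA = 100, 15   # hypothetical population parameters (assumed known)
N = 30                # sample size
Z_STAR = 1.960        # critical z value for 95% confidence
TRIALS = 2000

hits = 0
for _ in range(TRIALS):
    # Draw one sample and build its 95% confidence interval.
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    margin = Z_STAR * SIGMA / sqrt(N)
    if xbar - margin <= MU <= xbar + margin:
        hits += 1

coverage = hits / TRIALS
print(f"fraction of intervals containing the population mean: {coverage:.3f}")
```

The printed fraction hovers near 0.95; the roughly 5% of samples whose intervals miss the population mean are exactly the "unlucky" samples described above.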
In a statistical study, the sample mean is used to estimate the population mean. However, the number of different samples (of a given size) that could be taken from any given population is extremely large and these different samples would have different means. Some would be lower than the mean of the population and some would be higher.
The central limit theorem states that, for samples of size n from a normal population, the distribution of sample means is normal with a mean equal to the mean of the population and a standard deviation equal to the standard deviation of the population divided by the square root of the sample size. (For suitably large sample sizes, the central limit theorem also applies to populations whose distributions are not normal.)
Central Limit Theorem
For samples of size n, the distribution of sample means is normal with

mean = μ  and  standard deviation = σ/√n
where μ and σ represent the mean and the standard deviation of the population from which the sample came.
The practical significance of the central limit theorem is twofold. First, as the sample size increases, the standard deviation of the distribution of sample means decreases (because the sample size is in the denominator of the fraction). Consequently, we can be assured that larger samples tend to yield more accurate estimates of the population mean than smaller samples.
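This shrinking of the standard deviation can be checked by simulation. The sketch below draws many samples of several sizes from a hypothetical normal population (mean 50, standard deviation 12) and compares the observed standard deviation of the sample means with σ/√n.

```python
import random
import statistics
from math import sqrt

random.seed(2)

MU, SIGMA = 50, 12   # hypothetical population parameters
TRIALS = 3000        # number of samples drawn at each sample size

results = {}
for n in (4, 16, 64):
    # Mean of each of TRIALS samples of size n.
    means = [statistics.mean(random.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(TRIALS)]
    results[n] = (statistics.stdev(means), SIGMA / sqrt(n))

for n, (observed, predicted) in results.items():
    print(f"n={n:2d}  observed sd of means={observed:.2f}  sigma/sqrt(n)={predicted:.2f}")
```

Quadrupling the sample size cuts the standard deviation of the sample means in half, just as the √n in the denominator predicts.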
Second, the sample data can be used to create a range of values (a confidence interval) that is likely to contain the population mean. Notice that the confidence interval is only likely to contain the population mean; it is not guaranteed to contain it.
Often, one of the goals of a statistical study is to learn something about the mean value of a population parameter. The sample mean is an estimate of the corresponding population mean. The central limit theorem confirms that means from larger samples tend to be more accurate than means from smaller samples. Nevertheless, a sample mean alone tells us little about the population mean.
A confidence interval for the population mean is a range of values (based on the sample mean, the sample size, and either the sample or the population standard deviation) that is likely to contain the population mean. The confidence level is the proportion of samples that will yield a confidence interval that actually contains the population mean. For example, if the confidence level is 95% (0.95) then for 95% of all possible samples, the confidence interval generated using the techniques described below will contain the population mean. The remaining 5% of the samples will result in confidence intervals that do not contain the population mean.
While we can set the confidence level at any value we wish, the most common confidence levels are 90%, 95%, and 99%. It stands to reason that larger intervals are more likely to include the population mean than smaller ones. Consequently, higher confidence levels are associated with wider intervals.
If the population standard deviation is known, a confidence interval can be derived from the distribution of sample means using the Central Limit Theorem (the derivation itself is not shown here). In this case, the confidence interval is given by

x̄ ± z* · σ/√n

The lower bound of the interval is found by subtracting the margin of error from the sample mean, and the upper bound is found by adding it:

lower bound = x̄ − z* · σ/√n,  upper bound = x̄ + z* · σ/√n

where x̄ is the sample mean, σ is the population standard deviation, n is the sample size, and z* is the critical z value. The critical value z* and its negative delimit a central area under the standard normal curve equal to the desired confidence level.
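The known-sigma interval is simple to compute. The sketch below uses Python's standard-library normal distribution to find z* and builds the interval from hypothetical data (sample mean 100, known population standard deviation 15, n = 36).

```python
import statistics
from math import sqrt

def z_interval(xbar, sigma, n, level):
    """Confidence interval for a population mean when sigma is known."""
    # z* delimits a central area equal to `level`, so its CDF value
    # is (1 + level) / 2.
    z_star = statistics.NormalDist().inv_cdf((1 + level) / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical data, for illustration only.
low, high = z_interval(100, 15, 36, 0.95)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")  # (95.10, 104.90)
```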
|Confidence Level||Critical Value||Central Area Under the Standard Normal Curve|
|90%||z* = 1.645||the area between −1.645 and 1.645 is 0.90|
|95%||z* = 1.960||the area between −1.960 and 1.960 is 0.95|
|99%||z* = 2.576||the area between −2.576 and 2.576 is 0.99|
The table below illustrates a 90%, a 95%, and a 99% confidence interval. Notice that the only thing that changes in the calculation is the critical value z*.
|Confidence Level||Confidence Interval|
|90%||x̄ ± 1.645 · σ/√n|
|95%||x̄ ± 1.960 · σ/√n|
|99%||x̄ ± 2.576 · σ/√n|
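The three critical values come from the inverse of the standard normal CDF, and the widening of the interval with the confidence level is easy to see numerically. The standard deviation and sample size below (σ = 10, n = 75) are hypothetical, chosen only to show the margins of error side by side.

```python
import statistics
from math import sqrt

nd = statistics.NormalDist()   # standard normal distribution
sigma, n = 10, 75              # hypothetical values for illustration

margins = {}
for level in (0.90, 0.95, 0.99):
    # z* and its negative enclose a central area equal to `level`.
    z_star = nd.inv_cdf((1 + level) / 2)
    margins[level] = z_star * sigma / sqrt(n)
    print(f"{level:.0%}: z* = {z_star:.3f}, margin of error = {margins[level]:.2f}")
```

Running this reproduces 1.645, 1.960, and 2.576, and shows the margin of error growing with the confidence level.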
In general, however, the population standard deviation is not known. In such cases, the sample standard deviation is used as an estimate of the population standard deviation. With less information about the population, it turns out that the resulting confidence intervals are a little wider in order to achieve the same degree of confidence.
If the population standard deviation is unknown, the limits of a confidence interval are given by

x̄ ± t* · s/√n

As before, the lower bound is found by subtracting the margin of error from the sample mean and the upper bound is found by adding the margin of error:

lower bound = x̄ − t* · s/√n,  upper bound = x̄ + t* · s/√n
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the critical t value. The critical value t* is based on Student's t-distribution ("Student" was the pen name of the statistician William Sealy Gosset). Unlike the normal distribution, the shape of the t-distribution depends on the sample size. For small sample sizes, the t-distribution is slightly lower at its peak and more spread out than the normal distribution. As the sample size gets larger, the corresponding t-distribution becomes more and more similar to a normal distribution. For sample sizes over about 1,000 there is no practical difference between the two.
The critical t* value and its negative delimit a central area under the t-distribution curve equal to the desired confidence level. Since the shape of the t-distribution depends on the sample size, the critical values also are dependent on sample size. Before computers, statisticians would have to look up the critical values in a table. In the next lab exercise, you will learn how to use Excel to determine critical values.
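Outside of Excel, critical t values can also be computed programmatically. The sketch below assumes the third-party SciPy library is available (this is an assumption; it is not part of the Python standard library) and uses its t-distribution's inverse CDF.

```python
from scipy.stats import t  # assumes SciPy is installed

def t_star(level, n):
    """Two-sided critical t value for a sample of size n.

    The t-distribution has n - 1 degrees of freedom, and t* and its
    negative enclose a central area equal to `level`.
    """
    return t.ppf((1 + level) / 2, df=n - 1)

# For a 90% confidence level and a sample of size 75, t* is slightly
# larger than the corresponding critical z value of 1.645.
print(round(t_star(0.90, 75), 3))
```

For very large samples, the same function returns values nearly identical to z*, reflecting the convergence of the t-distribution to the normal distribution.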
Suppose a sample of size 75 is taken. The sample mean is 98.2 and the sample standard deviation is 10. Find the 90% confidence interval. The critical t value is 1.666 (from a critical t value table or using Excel). The margin of error is

t* · s/√n = 1.666 · 10/√75 ≈ 1.92

so the interval runs from 98.2 − 1.92 = 96.28 to 98.2 + 1.92 = 100.12.
We are 90% confident that the population mean is between 96.28 and 100.12.
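The arithmetic in this example can be reproduced in a few lines:

```python
from math import sqrt

# Values from the worked example above.
xbar, s, n, t_star = 98.2, 10, 75, 1.666

margin = t_star * s / sqrt(n)          # 1.666 * 10 / sqrt(75) ≈ 1.92
low, high = xbar - margin, xbar + margin
print(f"90% confidence interval: ({low:.2f}, {high:.2f})")  # (96.28, 100.12)
```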