Central Limit Theorem and Inferential Statistics

Central Limit Theorem

The central limit theorem forms the basis of inferential statistics and it would be difficult to overestimate its importance. In a statistical study, the sample mean is used to estimate the population mean. However, the number of different samples (of a given size) that could be taken is extremely large and these different samples would have different means. Some would be lower than the mean of the population and some would be higher.

The central limit theorem states that, for samples of size n from a normal population, the distribution of sample means is normal with a mean equal to the mean of the population and a standard deviation equal to the standard deviation of the population divided by the square root of the sample size. (For suitably large sample sizes, the central limit theorem also applies to populations whose distributions are not normal.)

Central Limit Theorem

For samples of size n, the distribution of sample means

  1. is normal.
  2. has a mean of μ.
  3. has a standard deviation of σ/√n.

where μ and σ represent the mean and the standard deviation of the population from which the sample came. 
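The three properties above can be checked with a small simulation. The following is a minimal sketch, not part of the original notes: the population parameters (mean 100, standard deviation 15), the sample sizes, and the helper name simulate_sample_means are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n from a normal population
    and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(num_samples)
    ]


# Hypothetical population parameters, chosen only for illustration.
mu, sigma = 100.0, 15.0

for n in (4, 25, 100):
    means = simulate_sample_means(mu, sigma, n, num_samples=5000)
    observed_mean = statistics.fmean(means)
    observed_sd = statistics.stdev(means)
    predicted_sd = sigma / n ** 0.5  # the theorem's sigma / sqrt(n)
    print(f"n={n:3d}  mean of means={observed_mean:7.2f}  "
          f"sd of means={observed_sd:5.2f}  sigma/sqrt(n)={predicted_sd:5.2f}")
```

For each sample size, the mean of the simulated sample means should land close to the population mean, and their standard deviation should land close to sigma/sqrt(n), shrinking as n grows.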

The practical significance of the central limit theorem is twofold. First, as the sample size increases the standard deviation of the sample means decreases. Consequently, we can be assured that larger samples tend to yield more accurate estimates of the population mean than smaller samples.

Second, it is very unlikely that a sample mean will exactly equal the population mean. Even if it did, we wouldn't know it. Consequently, the sample mean is used to create a range of values (called a confidence interval) that is likely to contain the population mean. Notice that the confidence interval is only likely to contain the population mean; it is not guaranteed to contain it.

Inferential Statistics

The purpose of descriptive statistics is to allow us to more easily grasp the significant features of a set of sample data. However, they tell us little about the population from which the sample was taken. Inferential statistics is the branch of statistics that deals with using sample data to make valid judgments about the population from which the data came.

The table below illustrates some differences between descriptive statistics and inferential statistics. In each example, descriptive statistics are used to tell us something about a sample. Inferential statistics are used to tell us something about the corresponding population.

Descriptive Statistics | Inferential Statistics
60% of the voters responding to a poll favor proposition A. | 60% of the voters in the state favor proposition A, with a margin of plus or minus three percentage points.
In a trial study, brand A pain medicine resulted in noticeable relief an average of 20 minutes sooner than brand B medicine. | Brand A pain medicine brings noticeable relief significantly faster than brand B medicine.
The sample mean is 100. | The 95% confidence interval for the population mean is 97 to 103.
A random sample of high school students was selected to take an SAT preparation course. After completing the course, the mean SAT score for this group of students was 25 points higher. | An SAT preparation course will significantly increase students' SAT scores.

We will take a brief look at confidence intervals for the mean and simple hypothesis testing.

Confidence Intervals

Often, one of the goals of a statistical study is to learn something about the mean of a population. The sample mean is an estimate of the corresponding population mean. The central limit theorem confirms that means from larger samples tend to be more accurate than means from smaller samples. Nevertheless, a sample mean alone tells us little about the population mean.

A confidence interval is a range of values (based on the sample mean, the sample size, and either the sample or the population standard deviation) that is likely to contain the population mean. The confidence level is the proportion of samples that will yield a confidence interval that actually contains the population mean. For example, if the confidence level is 95% (0.95) then for 95% of all possible samples, the confidence interval generated using the techniques described below will contain the population mean. The remaining 5% of the samples will result in confidence intervals that do not contain the population mean.
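The meaning of the confidence level can be illustrated by simulation. The sketch below is an illustration under stated assumptions rather than part of the original notes: it assumes a normal population with a known standard deviation, uses the interval x-bar ± z* · sigma/sqrt(n), and the population values (mean 50, standard deviation 8) and the function name coverage_rate are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def coverage_rate(mu, sigma, n, confidence, trials, seed=0):
    """Return the fraction of simulated samples whose confidence
    interval (known sigma) actually contains the population mean mu."""
    rng = random.Random(seed)
    # Critical value z*: central area under the standard normal = confidence.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


# Hypothetical population; the result should be close to 0.95.
print(coverage_rate(mu=50.0, sigma=8.0, n=30, confidence=0.95, trials=10000))
```

The printed proportion varies slightly with the random seed, but it should stay near the chosen confidence level, with the remaining samples producing intervals that miss the population mean.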

While we can set the confidence level at any value we wish, the most common confidence levels are 90%, 95%, and 99%. It stands to reason that wider intervals are more likely to include the population mean than narrower ones. Consequently, higher confidence levels are associated with wider intervals.

Population Standard Deviation Known

If the population standard deviation is known, a confidence interval can be derived from the distribution of sample means using the central limit theorem. The actual derivation is not shown here; the resulting confidence interval is given by

x-bar - z* · sigma/sqrt(n) <= mu <= x-bar + z* · sigma/sqrt(n)

where x-bar is the sample mean, sigma is the population standard deviation, n is the sample size, and z* is the critical z value. The critical value z* and its negative delimit a central area under the standard normal curve equal to the desired confidence level.
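Although the full derivation is omitted, its key step is a short rearrangement. The sketch below assumes, as the central limit theorem states, that sample means are normally distributed with mean μ and standard deviation σ/√n; the symbol C for the confidence level is introduced here only for illustration.

```latex
\begin{align*}
P\!\left(-z^{*} \le \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \le z^{*}\right) &= C \\
P\!\left(-z^{*}\,\frac{\sigma}{\sqrt{n}} \le \bar{x}-\mu \le z^{*}\,\frac{\sigma}{\sqrt{n}}\right) &= C \\
P\!\left(\bar{x}-z^{*}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x}+z^{*}\,\frac{\sigma}{\sqrt{n}}\right) &= C
\end{align*}
```

The first line standardizes the sample mean, and the remaining lines isolate μ in the middle of the inequality.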

Confidence Level | Critical Value | Central Area
90% | z* = 1.645 | The area between -1.645 and 1.645 is 0.90
95% | z* = 1.960 | The area between -1.960 and 1.960 is 0.95
99% | z* = 2.576 | The area between -2.576 and 2.576 is 0.99
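These critical values can be reproduced from the standard normal distribution. The following is a minimal sketch using Python's standard library; the function names critical_z and central_area are hypothetical helpers, not something defined in these notes.

```python
from statistics import NormalDist


def critical_z(confidence):
    """Return z* such that the area between -z* and z* under the
    standard normal curve equals the given confidence level."""
    # A central area of `confidence` leaves (1 - confidence) / 2 in each tail.
    return NormalDist().inv_cdf((1 + confidence) / 2)


def central_area(z_star):
    """Return the area under the standard normal curve between -z* and z*."""
    std_normal = NormalDist()
    return std_normal.cdf(z_star) - std_normal.cdf(-z_star)


for level in (0.90, 0.95, 0.99):
    z_star = critical_z(level)
    print(f"{level:.0%}: z* = {z_star:.3f}, "
          f"central area = {central_area(z_star):.2f}")
```

Rounded to three decimal places, the computed values agree with the critical values listed above.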

The table below illustrates a 90%, a 95%, and a 99% confidence interval. Notice that the only thing that changes in the calculation is the critical value z*.

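Confidence Level | Confidence Interval
90% | x-bar - 1.645 · sigma/sqrt(n) <= mu <= x-bar + 1.645 · sigma/sqrt(n)
95% | x-bar - 1.960 · sigma/sqrt(n) <= mu <= x-bar + 1.960 · sigma/sqrt(n)
99% | x-bar - 2.576 · sigma/sqrt(n) <= mu <= x-bar + 2.576 · sigma/sqrt(n)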
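To see how the interval changes with the confidence level, the calculation can be sketched in code. The sample summary used here (sample mean 100, population standard deviation 15, sample size 36) is hypothetical and chosen only for illustration, as is the function name z_confidence_interval.

```python
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the population mean when the
    population standard deviation sigma is known."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / n ** 0.5
    return x_bar - margin, x_bar + margin


# Hypothetical sample summary, not taken from the original notes.
x_bar, sigma, n = 100.0, 15.0, 36

for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(x_bar, sigma, n, level)
    print(f"{level:.0%}: {low:.2f} <= mu <= {high:.2f}  "
          f"(width {high - low:.2f})")
```

Only the critical value differs between the three calls, and the interval widens as the confidence level rises.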

Population Standard Deviation Unknown

In general, however, the population standard deviation is not known. In such cases, the sample standard deviation is used as an estimate of the population standard deviation. Because we then have less information about the population, the resulting confidence intervals must be a little wider to achieve the same degree of confidence.

If the population standard deviation is unknown, the limits of a confidence interval are given by

x-bar - t* · s/sqrt(n) <= mu <= x-bar + t* · s/sqrt(n)

where x-bar is the sample mean, s is the sample standard deviation, n is the sample size, and t* is the critical t value. The critical value t* is based on Student's t-distribution ("Student" was the pen name of the statistician William Sealy Gosset). Unlike the standard normal distribution, the shape of the t-distribution depends on the sample size (through its degrees of freedom, n - 1). For small sample sizes, the t-distribution has a slightly lower peak and is more spread out than the normal distribution. As the sample size gets larger, the corresponding t-distribution becomes more and more similar to a normal distribution. For sample sizes over about 1,000 there is no practical difference between the two.

The critical t* value and its negative delimit a central area under the t-distribution curve equal to the desired confidence level. Since the shape of the t-distribution depends on the sample size, the critical values also are dependent on sample size. Before computers, statisticians would have to look up the critical values in a table. In the next lab exercise, you will learn how to use Excel to determine critical values.
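As an alternative to a table lookup, critical t values can be approximated numerically. The sketch below is one possible approach using only Python's standard library, not the method used in the lab exercise: it integrates the t-distribution density with Simpson's rule and searches for t* by bisection. The function names and the numerical settings (step count, search range, iteration count) are assumptions made for illustration; a statistics library would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff - (df + 1) / 2 * math.log1p(x * x / df))


def t_central_area(t_star, df, steps=2000):
    """Area under the t curve between -t_star and t_star,
    approximated with Simpson's rule (steps must be even)."""
    h = t_star / steps
    total = t_pdf(0.0, df) + t_pdf(t_star, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The curve is symmetric, so double the area from 0 to t_star.
    return 2 * total * h / 3


def critical_t(confidence, n):
    """Approximate t* for a sample of size n (df = n - 1) by bisection."""
    df = n - 1
    low, high = 0.0, 1000.0
    for _ in range(60):
        mid = (low + high) / 2
        if t_central_area(mid, df) < confidence:
            low = mid
        else:
            high = mid
    return (low + high) / 2


for n in (5, 30, 75, 1001):
    print(f"n={n:5d}  t* (95%) = {critical_t(0.95, n):.3f}")
```

As the sample size grows, the printed critical values move toward the corresponding normal critical value of 1.960, which matches the description of the t-distribution above.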

Simple Hypothesis Testing

In inferential statistics, a study is often performed to allow the researcher to investigate two possible hypotheses about a population. The null hypothesis states that a population parameter has some specific value that is assumed to be correct. The alternate hypothesis challenges this assumption. The statistical study results in a decision to accept or reject the null hypothesis. (Statisticians do not accept the alternate hypothesis; they reject the null hypothesis. This is primarily because the null hypothesis is specific while the alternate hypothesis is vague.)

One such inference test involves the mean of a population. The null hypothesis asserts that the population mean is some specific value. This value is often based on previous statistical studies. The alternate hypothesis can have one of three forms:

1. The population mean is not equal to this specific value.
2. The population mean is less than this specific value.
3. The population mean is greater than this specific value.

The researcher determines which of these alternate hypotheses is being tested before collecting the sample data and performing the statistical analysis.

We will consider only the first alternate hypothesis. Since it asserts that the true population mean is not equal to the specific value given in the null hypothesis, it is referred to as a two-tailed alternate hypothesis: the true population mean may be either smaller or larger than the value given in the null hypothesis. This kind of test can be represented by listing the null and alternate hypotheses as shown below, where mu_0 denotes the specific value given in the null hypothesis.

H0: mu = mu_0
Ha: mu ≠ mu_0

Even if the null hypothesis were correct, it is unlikely that a given sample would have a mean of exactly mu_0. A sample mean close to mu_0 supports the null hypothesis. A sample mean far from mu_0 leads us to reject the null hypothesis. After all, if the null hypothesis were true, how could we get a sample with a mean so far from mu_0? The question is, how do we know when a sample mean is so far from mu_0 that we should reject the null hypothesis?

It turns out that the answer to that question is quite simple. If the confidence interval (based on the sample data) contains mu_0, then we accept the null hypothesis; otherwise, we reject it. That is, if the confidence interval does not include mu_0, then it is unlikely that the sample came from the population described by the null hypothesis.

Example

For a given population we establish the following null and alternate hypotheses:

H0: mu = 100
Ha: mu ≠ 100

Suppose a sample of size 75 is taken. The sample mean is 98.2 and the sample standard deviation is 10. Based on this sample, should we accept or reject the null hypothesis? For a sample of size 75 and a 90% confidence interval, the critical t value is 1.666 (t* = 1.666). The 90% confidence interval is given by:

x-bar - t* · s/sqrt(n) <= mu <= x-bar + t* · s/sqrt(n)

98.2 - 1.666 · 10/sqrt(75) <= mu <= 98.2 + 1.666 · 10/sqrt(75)

which yields

96.28 <= mu <= 100.12 

Since this confidence interval includes 100 (the mean under the null hypothesis), we accept the null hypothesis. Had the confidence interval not contained the mean given in the null hypothesis, we would have rejected the null hypothesis.
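The arithmetic in this example can be checked with a short script. This is a minimal sketch: the function names t_confidence_interval and two_tailed_decision are hypothetical, and the critical value t* = 1.666 is taken from the example as given rather than computed.

```python
import math


def t_confidence_interval(x_bar, s, n, t_star):
    """Confidence interval for the population mean when the population
    standard deviation is unknown and s is used in its place."""
    margin = t_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


def two_tailed_decision(mu_0, interval):
    """Accept the null hypothesis if mu_0 lies inside the confidence
    interval; otherwise reject it."""
    low, high = interval
    return "accept" if low <= mu_0 <= high else "reject"


# Values from the example: n = 75, sample mean 98.2, s = 10, t* = 1.666.
interval = t_confidence_interval(x_bar=98.2, s=10, n=75, t_star=1.666)
print(tuple(round(limit, 2) for limit in interval))  # (96.28, 100.12)
print(two_tailed_decision(100, interval))            # accept
```

A null-hypothesis mean outside the interval, such as 101, would lead to the opposite decision.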