In this blog we are going to learn about inferential statistics and why every data scientist must learn it.
Imagine you are working in an agricultural sciences lab, where you have been collaborating with a local farmer to develop new varieties of fruits and vegetables. It is your job to analyze the fruits and vegetables that are harvested each year and track any changes that occur from one plant generation to the next. Today you are measuring the sugar content of the latest crop of tomatoes to test how sweet they are. You have 25 tomatoes and find that the mean sugar content is 32 milligrams (mg) of sugar per gram (g) of tomato with a standard deviation of 4 mg/g.
Can you assume that the entire harvest has the same mean sugar content? In fact, it is very unlikely that the mean and standard deviation of your 25-tomato sample are exactly the same as the mean and standard deviation of the entire harvest. Fortunately, you can use techniques from a branch of statistics known as “inferential statistics” to learn something about the sugar content of the entire tomato harvest from your smaller subset of measurements. These and other inferential statistics techniques are invaluable tools for scientists as they analyze and interpret their data.
What are Inferential statistics?
Many statistical techniques have been developed to help scientists make sense of the data they collect. These techniques are typically categorized as either descriptive or inferential. While descriptive statistics allow scientists to quickly summarize the major characteristics of a dataset, inferential statistics go a step further by helping scientists uncover patterns or relationships in a dataset, make judgments about data, or apply information about a small dataset to a larger group. They are part of the process of data analysis used by scientists to interpret and make statements about their results. The inferential statistics toolbox available to scientists is quite large and contains many different methods for analyzing and interpreting data. As an introduction to the topic, we will give a brief overview of some of the more common methods of statistical inference used by scientists. Many of these methods involve using smaller subsets of data to make inferences about larger populations.
Populations versus subsamples
When we use the word “population” in our everyday speech, we are usually talking about the number of people, plants, or animals that live in a particular area. However, to a scientist or statistician this term can mean something very different. In statistics, a population is defined as the complete set of possible observations. Based on this definition, you can see how impractical, or even impossible, it could be for a scientist to collect data about an entire population, so smaller subsets of the population, known as either subsamples or samples, are often studied instead. It is important that such a subsample is representative of the population from which it comes.
Inferential statistics can help scientists make generalizations about a population based on subsample data. Through the process of estimation, subsample data is used to estimate population parameters like the population mean or variance.
Random sampling helps scientists collect a subsample dataset that is representative of the larger population. This is critical for statistical inference, which often involves using subsample datasets to make inferences about entire populations.
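As a rough illustration of random sampling, the sketch below draws a simple random sample from a made-up population and compares the sample mean with the population mean. The population values, the seed, and the sample size are all assumptions chosen for this example, not measurements from a real harvest.

```python
import random
import statistics

random.seed(42)  # arbitrary fixed seed so the sketch is reproducible

# Hypothetical population: sugar content (mg/g) of 10,000 tomatoes.
population = [random.gauss(32, 4) for _ in range(10_000)]

# Simple random sample: every tomato has the same chance of being picked.
sample = random.sample(population, k=25)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is exactly the gap that inferential statistics tries to quantify.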
To make inferences about a whole population from a sample, we use the following concepts from inferential statistics.
Normal distribution and z-statistic:
The normal distribution, also known as the bell curve, has the following properties:
- Mean = median = mode.
- The curve is symmetric with half of the values on the left and half of the values on the right.
- The area under the curve is 1.
[Figure: the normal distribution (bell curve). Image credit: University of Virginia]
In a normal distribution:
- About 68% of the data falls within one standard deviation of the mean.
- About 95% of the data falls within two standard deviations of the mean.
- About 99.7% of the data falls within three standard deviations of the mean.
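These percentages can be checked numerically. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module. The helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The printed values are roughly 0.683, 0.954 and 0.997, which is where the rounded 68%, 95% and 99.7% figures come from.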
To calculate the probability of an event under a normal distribution, we need the z-statistic. The formula for calculating the z-statistic is

$$ z = \frac{x - \mu}{\sigma} $$

where x is the value for which we want to calculate the z-value, and μ and σ are the population mean and standard deviation, respectively.
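To make this concrete, here is a small sketch that computes the z-statistic as (x − μ)/σ and then looks up the probability of observing a value at or below x. The numbers (a population mean of 100 and a standard deviation of 15) are hypothetical, as is the function name `z_statistic`.

```python
from statistics import NormalDist

def z_statistic(x, mu, sigma):
    """Number of standard deviations that x lies above the population mean."""
    return (x - mu) / sigma

# Hypothetical population: mean 100, standard deviation 15.
z = z_statistic(130, mu=100, sigma=15)

# Probability of observing a value at or below x under a normal distribution.
prob_below = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(X <= 130) = {prob_below:.4f}")
```

Here z is 2.0, so about 97.7% of values are expected to fall at or below 130.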
Central limit theorem:
The normal distribution is often used to model random variables whose distributions are unknown, so it is widely used in many fields, including the natural and social sciences. The justification for this is the central limit theorem (CLT).
According to the CLT, the distribution of sample averages tends towards a normal distribution as the sample size grows, regardless of the shape of the population distribution.
Consider a case where we need to learn the distribution of the heights of all 20-year-old people in a country. It is almost impossible, and certainly not practical, to collect this data. So, we take samples of 20-year-old people across the country and calculate the average height of the people in each sample. According to the CLT, if the samples are large enough, the distribution of these sample averages (the sampling distribution) will be close to a normal distribution.
Let's look at another example. Consider the case where we look at the number of tweets a person makes in a week (randomly generated data between 0 and 200). The frequency distribution of the data looks like this:
This does not resemble a bell curve at all.
Now let's take 1000 random samples of size 50 from this data and calculate the mean of each sample. When we plot these means, we get an approximately normal curve, known as the sampling distribution of the mean.
Mean of the sampling distribution = 98.78 (population mean = 98.87)
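The experiment can be reproduced along these lines. This is only a sketch: the data is freshly generated here with an arbitrary seed and an assumed population size of 10,000, so the numbers will not match the 98.78 and 98.87 quoted above.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, for reproducibility only

# Hypothetical population: weekly tweet counts, integers between 0 and 200.
population = [random.randint(0, 200) for _ in range(10_000)]

# Take 1000 random samples of size 50 and keep the mean of each one.
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(1000)
]

population_mean = statistics.mean(population)
mean_of_sample_means = statistics.mean(sample_means)

print(f"population mean:      {population_mean:.2f}")
print(f"mean of sample means: {mean_of_sample_means:.2f}")
```

Plotting a histogram of `sample_means` (for example with a plotting library of your choice) should show the bell shape, even though the raw counts are spread almost evenly between 0 and 200.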
The central limit theorem has some important properties:
- The mean of the sampling distribution is approximately equal to the mean of the population. We can see this in the example above, where the population mean (98.87) is approximately equal to the mean of the sampling distribution (98.78).
- The standard deviation of the sampling distribution, also known as the standard error, is equal to the population standard deviation divided by the square root of the sample size. As a result, the greater the sample size, the lower the standard error and the more accurately the sample mean estimates the population mean.
- The distribution of sample means is approximately normal regardless of the shape of the population distribution. This means that even if our original distribution is skewed, bimodal, or some other shape, the distribution of the sample means is still approximately normal. This is what makes the central limit theorem so powerful.
For the central limit theorem to hold, the sample size should be sufficiently large (a common rule of thumb is greater than 30).
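The effect of sample size on the standard error can be sketched in a few lines. The population standard deviation of 58 is an assumption, roughly what you would get for values spread evenly between 0 and 200, and `standard_error` is a name made up for this example.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

sigma = 58.0  # assumed population standard deviation

for n in (30, 50, 200, 800):
    print(f"n = {n:4d}  standard error = {standard_error(sigma, n):.2f}")
```

Because of the square root, quadrupling the sample size only halves the standard error, so precision gets more expensive as samples grow.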
Confidence interval:
As mentioned earlier, we compute the sample mean in order to estimate the population mean. How well a sample statistic estimates the underlying population parameter is always an issue. A confidence interval addresses this issue by providing a range of values that is likely to contain the population parameter.
Confidence intervals can be one-sided or two-sided. For a two-sided 95% confidence interval, we place 2.5% in each tail of the distribution and then calculate the range. For a one-sided confidence interval, we place the entire 5% in either the left or the right tail of the distribution. The above image displays a two-sided confidence interval. The formula we use to calculate the confidence interval is:

$$ \bar{x} \pm z \frac{\sigma}{\sqrt{n}} $$

where the symbols stand for:
- x̄: the sample mean
- z: the z-value for the chosen confidence level (about 1.96 for a two-sided 95% interval)
- σ: the population standard deviation
- n: the sample size
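As a worked example, the sketch below applies the usual z-based interval (sample mean plus or minus z times σ/√n) to the tomato measurements from the start of this post: a mean of 32 mg/g, a standard deviation of 4 mg/g and 25 tomatoes. Two assumptions are made here: the sample standard deviation is treated as if it were the population value σ, and the normal distribution is used even though the sample is smaller than 30, where a t-based interval is often preferred. The function name `z_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided z-based confidence interval for the population mean."""
    # Put half of the leftover probability in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * sigma / math.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Tomato sample: mean 32 mg/g, standard deviation 4 mg/g, 25 tomatoes.
low, high = z_confidence_interval(sample_mean=32, sigma=4, n=25)
print(f"95% confidence interval: ({low:.2f}, {high:.2f}) mg/g")
```

Under these assumptions the interval is roughly 30.43 to 33.57 mg/g, with a margin of error of about 1.57 mg/g.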
Another important term in the confidence interval concept is the margin of error. It is half the width of a two-sided confidence interval and reflects the sampling error. If the difference between the sample mean and a hypothesized population mean is within the margin of error, the difference can be attributed to chance. Otherwise, the result is considered statistically significant. We will revisit this concept in a later tutorial.
When we take a 95% confidence interval, it does not mean that there is a 95% chance that the population mean lies within the particular range we derive. The confidence level represents the proportion of possible confidence intervals that contain the true value of the unknown population parameter. So if we could take infinitely many samples and compute a confidence interval for each of them, the proportion of intervals that contain the population parameter would equal the confidence level. A 95% confidence level therefore means that the population parameter lies within 95% of all such intervals.
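This interpretation can be checked with a small simulation. The sketch below assumes a normally distributed population with a known mean of 100 and standard deviation of 15 (both invented for the example), draws many samples, builds a 95% z-based interval around each sample mean, and counts how often the interval contains the true mean.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(1)  # arbitrary seed, for reproducibility only

MU, SIGMA = 100, 15    # assumed "true" population parameters
N, TRIALS = 40, 2000   # sample size and number of repeated samples

z = NormalDist().inv_cdf(0.975)       # two-sided 95% critical value
margin = z * SIGMA / math.sqrt(N)     # margin of error, same for every sample

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    # Does this sample's interval contain the true population mean?
    if mean - margin <= MU <= mean + margin:
        hits += 1

coverage = hits / TRIALS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95. Any single interval either contains the true mean or it does not, and the 95% describes the procedure across many repetitions.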