Introduction to Inferential Statistics

In this blog, we are going to learn about inferential statistics and why every data scientist must learn it.

Imagine you are working in an agricultural sciences lab, where you have been collaborating with a local farmer to develop new varieties of fruits and vegetables. It is your job to analyze the fruits and vegetables that are harvested each year and track any changes that occur from one plant generation to the next. Today you are measuring the sugar content of the latest crop of tomatoes to test how sweet they are. You have 25 tomatoes and find that the mean sugar content is 32 milligrams (mg) of sugar per gram (g) of tomato with a standard deviation of 4 mg/g.

By studying this sample of tomatoes, can we say anything about the entire tomato harvest? In situations like this, scientists can use inferential statistics to help them make sense of their data.

In fact, it is very unlikely that the mean and standard deviation of your 25-tomato sample are exactly the same as the mean and standard deviation of the entire harvest. Fortunately, you can use techniques from a branch of statistics known as “inferential statistics” to learn something about the sugar content of the entire tomato harvest from your smaller subset of measurements. These and other inferential statistics techniques are an invaluable tool for scientists as they analyze and interpret their data.

What is inferential statistics?

Populations versus subsamples

Inferential statistics can help scientists make generalizations about a population based on subsample data. Through the process of estimation, subsample data is used to estimate population parameters such as the population mean or variance.

Random sampling helps scientists collect a subsample dataset that is representative of the larger population. This is critical for statistical inference, which often involves using subsample datasets to make inferences about entire populations.
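As a rough illustration of estimation from a random subsample, here is a minimal Python sketch. The population is simulated and purely hypothetical (5,000 tomatoes with sugar content drawn from a normal distribution), and names such as `population` and `estimated_mean` are made up for this example.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: sugar content (mg/g) of 5,000 tomatoes.
# These values are simulated, not real measurements.
population = [rng.gauss(32, 4) for _ in range(5000)]

# Simple random sample of 25 tomatoes, drawn without replacement.
sample = rng.sample(population, 25)

# Point estimates of the population parameters.
estimated_mean = statistics.mean(sample)
estimated_variance = statistics.variance(sample)  # uses the n - 1 denominator

print(f"population mean     = {statistics.mean(population):.2f}")
print(f"estimated mean      = {estimated_mean:.2f}")
print(f"population variance = {statistics.pvariance(population):.2f}")
print(f"estimated variance  = {estimated_variance:.2f}")
```

The estimates will usually be close to the population values but not identical, which is exactly the gap that inferential statistics tries to quantify.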

Sample from a Given Population Statistics

To draw conclusions about a whole population from a sample, inferential statistics relies on the following concepts.

Normal distribution and z-statistic:

A normal distribution has the following properties:

  1. The mean, median, and mode are equal.
  2. The curve is symmetric, with half of the values to the left of the mean and half to the right.
  3. The area under the curve is 1.

Image credit: University of Virginia

In a normal distribution:

  • About 68% of the data falls within one standard deviation of the mean
  • About 95% of the data falls within two standard deviations of the mean
  • About 99.7% of the data falls within three standard deviations of the mean
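These percentages can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```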

To calculate the probability of observing a given value, we need the z-statistic. The formula for calculating the z-statistic is

z = (x - μ) / σ

where x is the value for which we want to calculate the z-value, and μ and σ are the population mean and standard deviation, respectively.
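Here is a small sketch of the z-statistic, z = (x - μ) / σ, in Python. As an assumption for illustration only, it treats the tomato figures from the introduction (32 mg/g and 4 mg/g) as if they were the population mean and standard deviation, and asks how unusual a tomato with 38 mg/g would be. The function name `z_score` is made up.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

# Assumed population parameters (illustrative only).
mu, sigma = 32.0, 4.0
x = 38.0

z = z_score(x, mu, sigma)
# Probability of observing a value at or below x under a normal distribution.
p_below = NormalDist().cdf(z)

print(z)                  # 1.5
print(round(p_below, 4))  # 0.9332
```

Under these assumptions, about 93% of tomatoes would have a sugar content of 38 mg/g or less.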

Central limit theorem:

According to the central limit theorem (CLT), the distribution of sample means tends towards a normal distribution as the sample size grows, regardless of the shape of the population distribution.

Suppose we need to learn about the heights of all 20-year-old people in a country. It is almost impossible, and certainly not practical, to collect this data for everyone. So we take samples of 20-year-old people across the country and calculate the average height of the people in each sample. According to the CLT, if the samples are large enough, the distribution of these sample averages will be close to a normal distribution.

Let's look at another example. Consider the number of tweets a person makes in a week (randomly generated data between 0 and 200). The frequency distribution of this data does not resemble a normal distribution.

Now let's take 1000 random samples of size 50 from this data and calculate the mean of each sample. When we plot these means, we get an approximately normal curve, known as the sampling distribution of the mean.

Mean of the sampling distribution = 98.78 (population mean = 98.87)
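The experiment above can be sketched in Python as follows. The data here is freshly simulated (10,000 hypothetical people, with a fixed seed), so the exact numbers will differ from the 98.78 and 98.87 quoted above. The population size and variable names are assumptions made for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly tweet counts, random integers from 0 to 200.
population = [rng.randint(0, 200) for _ in range(10_000)]

sample_size = 50
num_samples = 1000

# Draw 1000 random samples of size 50 and record the mean of each one.
sample_means = [
    statistics.mean(rng.sample(population, sample_size))
    for _ in range(num_samples)
]

population_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Rough normality check: about 68% of the sample means should lie
# within one standard deviation of their own mean.
share_within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / num_samples

print(f"population mean      = {population_mean:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
print(f"share within 1 sd    = {share_within_one_sd:.2f}")
```

Plotting a histogram of `sample_means` would show the bell shape directly; the 68% check is a plot-free stand-in for that.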

The central limit theorem has some important properties:

  1. The mean of the sampling distribution is approximately equal to the mean of the population. We can see this in the example above, where the population mean (98.87) is approximately equal to the mean of the sampling distribution (98.78).
  2. The standard deviation of the sampling distribution, also known as the standard error, is equal to the population standard deviation divided by the square root of the sample size. As a result, the greater the sample size, the lower the standard error and the more accurately the sample mean estimates the population mean.
  3. The distribution of sample means is approximately normal regardless of the shape of the population distribution. This means that even if the original distribution is skewed, bimodal, or of some other shape, the distribution of the sample means is still approximately normal. This is what makes the central limit theorem so powerful.

For the central limit theorem to give a good approximation, the sample size should be sufficiently large (a common rule of thumb is more than 30).
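To make the standard error property concrete, here is a short sketch of the formula: standard error = σ / √n. As an assumption for illustration, it takes σ = 4 mg/g (the tomato standard deviation from the introduction, treated here as if it were the population value) and shows how the standard error shrinks as the sample size grows. The function name `standard_error` is made up.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

sigma = 4.0  # assumed population standard deviation (illustrative only)

for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
# 25 0.8
# 100 0.4
# 400 0.2
```

Because of the square root, quadrupling the sample size only halves the standard error.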

Confidence Interval:

A confidence interval can be one-sided or two-sided. For a two-sided 95% confidence interval, we place 2.5% in each tail of the distribution and then calculate the range. For a one-sided confidence interval, we place the entire 5% in either the left or the right tail of the distribution. The formula we use to calculate a two-sided confidence interval is:

x̄ ± z(α/2) · σ / √n

where the symbols stand for

x̄: the sample mean
z(α/2): the z value for the desired confidence level
α: the significance level, that is, 1 minus the confidence level
σ: the standard deviation of the population
n: the sample size
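As a worked example, the sketch below computes a two-sided z-based interval, sample mean ± z · σ / √n, for the tomato data from the introduction (mean 32 mg/g, n = 25). It assumes that the sample standard deviation of 4 mg/g can stand in for the population standard deviation, which is a simplification; the function name `z_confidence_interval` is made up.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided z-based confidence interval for a population mean."""
    alpha = 1 - confidence
    # z value that leaves alpha / 2 in each tail of the standard normal.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# sigma = 4.0 is an assumption: the sample value is used as the population value.
low, high = z_confidence_interval(sample_mean=32.0, sigma=4.0, n=25)
print(f"{low:.2f}, {high:.2f}")  # 30.43, 33.57
```

With a sample this small and an unknown population standard deviation, a t-based interval would usually be preferred; the z version is kept here because it matches the formula in this section.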

Another important term in the confidence interval concept is the margin of error. It is half the width of a two-sided confidence interval and quantifies the sampling error. If the sample mean differs from a hypothesized population mean by less than the margin of error, the difference can be attributed to chance. Otherwise, the difference is considered statistically significant. We will revisit this concept in a later tutorial.

A 95% confidence interval does not mean that the population mean lies within the range we derived with a 95% chance. Instead, the confidence level represents the proportion of possible confidence intervals that contain the true value of the unknown population parameter. If we took infinitely many samples and calculated the confidence interval for each of them, the proportion of intervals that contain the population parameter would equal the confidence level. So a 95% confidence level means that the population parameter is contained in 95% of all possible confidence intervals.
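This interpretation can be checked with a small simulation. The sketch below assumes a hypothetical normal population with a known mean and standard deviation, draws many samples, builds a 95% z-based interval (sample mean ± z · σ / √n) from each one, and counts how often the interval contains the true mean. All parameter values are made up for illustration.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters (illustrative only).
true_mean, sigma, n = 32.0, 4.0, 30

z = NormalDist().inv_cdf(0.975)     # two-sided 95% interval
margin = z * sigma / math.sqrt(n)   # margin of error, same for every sample

trials = 5000
hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    # Does this sample's interval contain the true population mean?
    if x_bar - margin <= true_mean <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should come out close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many repetitions.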

Hi, this is Venu. I am here to share my data science knowledge in an effective and easy way.