“Facts are stubborn, but statistics are reliable” –Mark Twain
The study of statistics revolves around the study of data sets. This lesson describes two important types of data sets - populations and samples. Along the way, we'll introduce simple random sampling, the main method used in this tutorial to select samples.
Before discussing the Inferential statistics, let us see the population and sample. Population contains all the data points from a set of data. It is a group from where we collect the data. While a sample consists of some observations selected from the population. The sample from the population should be selected such that it has all the characteristics that a population has. Population’s measurable characteristics such as mean, standard deviation etc. are called as parameters while Sample’s measurable characteristic is known as a statistic.
The role of the population plays a major role in statistics and data science. Moreover, without drawing population and sample, the whole world of building statistics and data science might have gone with no existence.
Population vs Sample
The main difference between a population and sample has to do with how observations are assigned to the data set.
- A population includes all of the elements from a set of data.
- A sample consists one or more observations drawn from the population.
Depending on the sampling method, a sample can have fewer observations than the population, the same number of observations, or more observations. More than one sample can be derived from the same population.
Other differences have to do with nomenclature, notation, and computations. For example,
- A measurable characteristic of a population, such as a mean or standard deviation, is called a parameter; but a measurable characteristic of a sample is called a statistic.
- We will see in future lessons that the mean of a population is denoted by the symbol μ; but the mean of a sample is denoted by the symbol x.
- We will also learn in future lessons that the formula for the standard deviation of a population is different from the formula for the standard deviation of a sample.
It is the collection of a specified group of similar objects, individuals, or entities that have some common observable characteristics in them. Out of which, each object is termed as “Elementary units”.
Example- Let’s consider we have a list consisting of the name of all the employees in a company, It is nothing but a population. Out of which each employee will be considered as an elementary unit.
Types of Population
This is a type of population in which the number of elementary units is exactly quantifiable.
Example- Books in a university library.
In this type of population, The count of elementary units is not quantifiable to at most certainty.
Example- Population of a country. The population of a country is not certainly quantifiable in most of the time while approximation can be done. This is because in each second the number of deaths and births is changing concerning time.
This is such a type of population that is mostly based on real-time data and the information is concrete and reliable. This population does not require approximation or hypothetical data.
Example- Employees working in a company.
This can be a finite or infinite imaginary population designed by a researcher. Here mostly, the researcher will take a real-time scenario and apply his/her common hypothesis or assumptions to draw the structure and information of a population.
Example- Possible outcomes of a die if rolled ’n’ times.
A part of the population drawn according to a rule or plan for concluding characteristics is called a sample.
Example-Imagine an XYZ company that has around 50k employees. To do some analysis based on the information of these employees, It is practically difficult for researchers concerning time and money with all of 50k employees. The best possible way is to select a 5k people (or any random number) from this population and collect the data from these employees to do the analysis. This random count of employees selected from the entire population is called Sample. This data analysis will be done by the researches on a hypothesis that whatever inferences they get from these 5k people will apply to the entire population itself.
The number of items in a sample is called a sample size. In the above example, Out of 50k employees, 5k was selected for analysis and that makes the sample size 5k.
Characteristics of the sample
A sample should follow certain characteristics to make it fit for data analysis. Research done on a wrong sample will result in wrong inferences and these may contradict the behavior of the entire population resulting in dangerous consequences.
A sample should represent the overall behavior of a population. Imagine the situation in the above example in which 5k employees are selected out of 50k employees. If in the original population, there are 30k men and 20k women but in the sample, there were only female employees present (which is the sample size). Any analysis done on this sample will do not represent the overall behavior of the population.
Homogeneity is nothing but the matching of behavior in multiple samples. If we derive multiple samples from a population, It is expected that all samples infer somewhat the same conclusions about the population.
Imagine if we want to calculate the mean salary of the 50 k employees and we have 3 samples each of 5k sample size.
· Sample 1 has a mean salary of $40k
· Sample 2 has a mean salary of 38k
· Sample 3 has a mean salary of $41k
We can say that these samples are homogeneous since all samples are giving approximately equal information regarding the salary of the employees.
What if the result is like this,
· Sample 1 has a mean salary of $40k
· Sample 2 has a mean salary of 15k
· Sample 3 has a mean salary of $100k
Here, the researcher will not able to determine the approximate salary of a person in the company due to data volatility.
The number of sampling units in a sample should be adequate for doing the research.
In the above example, Out of 50k employees, It will be not effective if draw a sample of sample size 5 or 6 for doing research.
Similar regulating conditions
There should be a similar way of selecting samples if there is a need for multiple samples.
In the above example, Out of 50k employees, a sample of 5k employees was chosen at random and if we are selecting another sample it’s should be also chosen randomly. Any kind of pre-conditions for selecting the elementary unit should not be encouraged.
If Sample 1 of sample size 5k is chosen at random but we are creating sample 2 of the same sample size for the same data analysis but we chose only female employees in the sample 2.This will affect the homogeneity of the samples and will end up in incorrect inferences.