# Statistics Research Principles and Terminologies

That’s where inferential statistics comes into the picture.

Inferential statistics refers to the use of sample data to reach some conclusion (i.e., make some inference) about the characteristics of the larger population that the sample is supposed to represent.

To make the leap from sample data to inferences about a population, one must be very clear about whether the sample accurately represents the population.

Thus, an important first step is to clearly define the population that the sample is alleged to represent.

## Sampling Methods

Sampling is the process of drawing a sample from its population.

We have a number of ways at our disposal to select samples.

I’ll intuitively explain three of the most popular methods.

Scenario #1: I want to conduct a survey to determine how satisfied students of a particular university are.

I walk into the university and find an overwhelming number of students all around the campus.

They, along with the absent students and the students enrolled in distance learning, represent my population.

There's absolutely no way I can reach out to all of them, so I obtain the university's enrollment list and randomly select 120 students from it to participate in my survey.

This is an example of random sampling.

In this process, every member of a population has an equal chance of being selected into a sample.

A major benefit of random sampling is that any differences between the sample and the population from which the sample is selected will not be due to systematic bias in the selection process but due to chance.

A larger sample tends to represent a population well.
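As a minimal sketch of the idea, here is simple random sampling in Python. The roster of student IDs is a made-up stand-in for the full university population:

```python
# Simple random sampling: every member of the population has an equal
# chance of landing in the sample. The roster below is illustrative.
import random

random.seed(42)  # fixed seed so the sketch is reproducible

population = [f"student_{i}" for i in range(5000)]  # every enrolled student
sample = random.sample(population, k=120)           # drawn without replacement

print(len(sample))       # 120
print(len(set(sample)))  # 120 (no student picked twice)
```

Because `random.sample` draws without replacement, no student can appear in the sample twice.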

Scenario #2: Suppose 35% of my population of middle-class families in India earn their yearly income through a business. I would then try to match that percentage of business-owning families in my sample.

Similarly, if 10% of the population have more than 6 members in their family, 10% of my sample should have more than 6 members in their family.

This is known as representative sampling, where I purposely select cases so that my sample matches the larger population on specific characteristics.

This is costly and time-consuming but increases my chances of being able to generalize the results from my sample to the population.
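The matching step can be sketched as a quota calculation. The strata and percentages below are the illustrative numbers from the scenario, not real survey data:

```python
# Representative (quota) sampling: allocate sample slots per stratum so
# the sample mirrors the assumed population proportions.
proportions = {"business_income": 0.35, "salaried_income": 0.65}
sample_size = 200

# Quota per stratum = population proportion x desired sample size.
quotas = {stratum: round(p * sample_size) for stratum, p in proportions.items()}

print(quotas)  # {'business_income': 70, 'salaried_income': 130}
```

With the quotas in hand, one would then recruit participants within each stratum until every quota is filled.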

Scenario #3: I want to do a statistical study on the level of fitness of 10th-grade students so I select a sample of 200 students from the nearest high school to my residence.

This method of selecting samples is called convenience sampling.

In this method, the researcher generally selects participants on the basis of proximity, ease-of-access, and willingness to participate.

It is not a bad method at all, provided my sample does not differ from my population of interest in ways that influence the outcome of the study.

It is definitely less time consuming and convenient.

## Types of Variables

Given data on the 10th-grade students of a country, a student's age, height, weight, gender, attitude about school, etc. are all known as variables.

Anything that can be codified and contains more than a single score (value) is a variable.

A constant, in contrast, has only a single score.

For example, “age” will be treated as a constant if it is 15 years for all the students in our sample data.

A variable could be quantitative (continuous) or qualitative (categorical).

For example, “height”, “weight” and “age” are quantitative variables whereas “gender” and “attitude about school” are qualitative variables.

A quantitative variable indicates some sort of amount.

You can perform mathematical operations on them, such as computing a mean.

A qualitative variable, on the other hand, does not indicate more or less of a certain quality.

For example, “gender” variable contains two scores/values — “male” and “female”.

One score is not more or less than the other score.

You cannot apply mathematical operations over them.

The scores only represent a qualitative difference.

A dichotomous variable is a qualitative variable with only two different scores (e.g., the "gender" variable).
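A quick sketch makes the quantitative/qualitative split concrete. The student records below are invented for illustration:

```python
# Quantitative variables support arithmetic; qualitative variables only
# support counting and comparison of categories.
from statistics import mean
from collections import Counter

students = [
    {"height_cm": 170, "gender": "male"},
    {"height_cm": 166, "gender": "female"},
    {"height_cm": 173, "gender": "female"},
]

# Quantitative ("height_cm"): a mean is meaningful.
avg_height = mean(s["height_cm"] for s in students)

# Qualitative ("gender"): we can only tally the categories.
gender_counts = Counter(s["gender"] for s in students)

print(round(avg_height, 2))  # 169.67
print(gender_counts)
```

Trying to "average" the gender column, by contrast, would produce nothing interpretable, which is exactly the distinction the text draws.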

Determining the nature of a variable using the modern approach.

Note: A metric variable is synonymous to a continuous variable.

Image by spss-tutorials.com

## Scales of Measurement

There are four different scales of measurement for variables in statistics.

A nominally scaled variable is one in which the labels used to identify the different levels of the variable carry no weight or numeric value.

From our previous example of sample data, “gender” is a nominal variable.

Even if its scores, "male" and "female", are encoded as 0 and 1 for conducting statistics using computer software, a value of 1 does not indicate a higher score than a value of 0.

They are simply labels assigned to each group.
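This can be demonstrated in a short sketch; the observations and the 0/1 coding are arbitrary choices for illustration:

```python
# Encoding a nominal variable: the numeric codes are pure labels.
# Any arithmetic on them (a "mean gender", say) has no meaning.
codes = {"male": 0, "female": 1}
observations = ["male", "female", "female", "male"]

encoded = [codes[g] for g in observations]

print(encoded)  # [0, 1, 1, 0], and 1 is not "greater than" 0 here
```

Swapping the code assignment (female as 0, male as 1) would change every number yet describe exactly the same data, which is the hallmark of a nominal scale.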

Suppose I collected fan ratings on Avengers Endgame on a scale of 1 to 10, where 1 represents completely dissatisfied and 10 represents completely satisfied.

A score of 9 tells me that a fan enjoyed the movie far more than someone who rated 3.

The scores do have weight.

They just don't tell me the measurable difference in satisfaction between a rating of 9 and a rating of 10.

Such variables are known as ordinal variables.

This type of variable fails to answer "how much more is one score greater or less than another (in terms of a measurable quantity)?".

The third and fourth scales of measurement are interval and ratio scales.

They contain information about both relative value and distance.

For example, “height”.

If one member of my sample is 170 cm tall, another is 173 cm tall, and a third is 166 cm tall, I know who is tallest and how much taller or shorter each member of my sample is in relation to the others.

Whenever a variable is measured using a scale of equal intervals, it falls into one of these two groups.

The difference between interval and ratio scales comes into the picture when we consider how they treat a zero value.

In the case of interval scales, zero does not mean "nothing".

For example, the "year" variable may contain a year zero, but that year does not represent the absence of time; it is simply another point on the scale.

The same goes for temperature in degrees Celsius: zero degrees is not “nothing” with regard to temperature.

Ratio scales also include a zero value which means “nothing”.

For example, the “weight” variable in kilos.

Zero kilos corresponds to “nothing” with regard to weight.

Interval variables, unlike ratio variables, may even hold negative values.

For example, temperature in degrees Celsius can drop below zero, and a "bank account balance" can be overdrawn.
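The practical consequence of the interval/ratio distinction is that ratios of values are only meaningful on a ratio scale. A quick sketch with made-up numbers:

```python
# Ratios make sense only where zero means "nothing" (a ratio scale).
weight_a, weight_b = 40.0, 20.0  # kilograms: a ratio scale
temp_a, temp_b = 40.0, 20.0      # degrees Celsius: an interval scale

# 40 kg really is twice as heavy as 20 kg.
print(weight_a / weight_b)  # 2.0

# Converting to Kelvin (which has a true zero) shows that 40 degrees C
# is nowhere near "twice as hot" as 20 degrees C.
print(round((temp_a + 273.15) / (temp_b + 273.15), 3))  # 1.068
```

This is why the summary table of scales typically lists multiplication and division as valid only for ratio variables.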

Characteristics and available computations on different scales of measurement of a variable.

## Research Designs

Now that we've covered the important terminologies and concepts of statistics, we'll dive into the research designs and methodologies employed by statisticians.

This section will provide a sneak peek of how statistics is actually leveraged in the real world.

Suppose you believe your audience is not opening your mail, let alone reading it, and you suspect the subject line of the mail is to blame.

Thus, you’re looking for a new strategy to reduce your audience’s churn rate.

What you could use to understand your user behavior and response to reduce churn is experimental design.

You can prepare two different subjects for your promotional mail and check which one works better by dividing your mailing list into two samples using random sampling.

You should send a mail with subject A to group A and subject B to group B.

Any differences between the two groups caused by random assignment are due to pure chance.

After experimentation, you finally find out that subject B garnered more customer responses.

With the experimental design, researchers can isolate specific independent variables that may cause variation in dependent variables.

Since customer response depends on the mail subject, it is the dependent variable, whereas the mail subject is the independent variable.

Depiction of experimental design.

This method is also known as A/B testing and is considered a great tool to boost engagement and improve conversion rates.
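The whole experiment can be sketched in a few lines. The subscriber IDs and response outcomes below are simulated stand-ins, not real campaign data:

```python
# A/B test sketch: randomly assign the mailing list to two groups,
# send a different subject line to each, and compare response rates.
import random

random.seed(7)
subscribers = list(range(1000))
random.shuffle(subscribers)                      # random assignment
group_a, group_b = subscribers[:500], subscribers[500:]

# Simulated observations after sending the two subject lines
# (assumed open rates of roughly 5% for A and 8% for B).
responses_a = {s for s in group_a if random.random() < 0.05}
responses_b = {s for s in group_b if random.random() < 0.08}

rate_a = len(responses_a) / len(group_a)
rate_b = len(responses_b) / len(group_b)
print(rate_a, rate_b)
```

In a real campaign, the observed difference in rates would then be checked with a significance test before declaring a winner; that step is beyond this sketch.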

There’s yet another type of research methodology where participants are not divided into groups.

Researchers do not manipulate the data.

They collect the data on several variables and then determine how strongly different variables are related to each other using statistical analyses.

This is known as a correlational research design.

For example, you may be interested in determining whether watching violence on television causes violent behavior in adolescents.

Using a correlational research design, suppose you establish that there is indeed a positive relationship between those two variables.

But correlation is not and cannot be taken to imply causation.

It simply shows the relationship.

It could be that watching violence on television causes violent behavior, but the reverse could be true as well.

It could also be that a third variable, say growing up in a violent neighborhood or home, causes both the television watching and the violent behavior.

Correlational research designs are easier to conduct and allow researchers to examine many variables simultaneously.

The primary drawback is that such research does not allow for the careful controls necessary for drawing conclusions about causal associations between variables.

Visualization of the correlation between two variables.

Image by machinelearningmastery.com

## Summary

We covered a lot of statistical terminologies and research principles in this blog.

We saw the differences between a population and a sample, parameters and statistics, descriptive and inferential statistics.

We understood the process of sampling and how a sample can be drawn from a population using three popular methods: random sampling, representative sampling, and convenience sampling.

Variables are of different types and could be measured using varying scales.

They are broadly classified as quantitative or qualitative based on the scores that they contain.

Finally, we saw, through examples, how statisticians and researchers use different statistical research designs to conduct real-world analyses.

In my next blog post, I’ll dive into our first set of statistics — measures of central tendency.

As we all know, they're the most popular and widely used statistics and do a fairly good job of summarizing data.