The answers to these questions are what shaped the results of my study.
Before conducting my own research, I searched for previous studies on the topic.
Unfortunately, there do not seem to be any.
After investigating the possible sources of data, I realised that scraping different local news sites was the most feasible option, on the assumption that they have articles on all the accidents that ever occurred.
Checking whether each independent accident is a head-on collision constitutes a Bernoulli random variable: it is a yes-or-no question.
Summing the values of all observations gives a count of head-on collisions that follows a binomial distribution.
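As a rough illustration of the Bernoulli and binomial idea, here is a minimal Python sketch. The proportion 0.36 and the count of 140 accidents are placeholder values chosen for illustration only; they are not results.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

P_HEAD_ON = 0.36    # assumed true proportion, for illustration only
N_ACCIDENTS = 140   # assumed number of accidents checked

# Each accident is one Bernoulli trial: 1 = head-on, 0 = not head-on.
trials = [1 if random.random() < P_HEAD_ON else 0 for _ in range(N_ACCIDENTS)]

# The sum of the trials is one draw from Binomial(N_ACCIDENTS, P_HEAD_ON).
head_on_count = sum(trials)
print(f"head-on collisions: {head_on_count} of {N_ACCIDENTS}")
```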
Generally, there are two options for estimating a population parameter.
The parameter can be estimated as a single value (point estimate) and/or an interval (confidence interval estimate).
The point estimate is just the average of the sample, which for a yes-or-no variable is the sample proportion.
The confidence interval, however, provides a range of values within which the true population parameter is likely to lie.
It takes into account the uncertainty of the parameter.
Additionally, it serves as a guideline on how you expect your values to vary.
(Udacity, A/B Testing Course: Overview of A/B Testing, videos 14 & 15).
For example, assuming you have taken a truly representative sample and used it to calculate a 95% confidence interval, if you were to replicate the study with 200 different samples, you would expect the means (the proportions in this case) of about 190 samples to fall in the interval and the remaining 10 to fall outside.
If noticeably fewer than the confidence-level percentage of the samples' means fall in the interval, something may have gone wrong in your study and you might want to investigate.
Similarly, you should expect the intervals constructed around those 190 means to contain the true population parameter (the proportion), and the intervals around the other 10 samples' means not to contain it.
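The "190 out of 200" idea can be checked with a small simulation. This is only a sketch: the true proportion (0.36) and the sample size (300) are hypothetical values picked for the demonstration, and the interval uses the normal approximation.

```python
import math
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_P = 0.36       # hypothetical true proportion
SAMPLE_SIZE = 300   # hypothetical number of accidents per sample
N_SAMPLES = 200     # number of replicated studies
CONFIDENCE = 0.95

# Critical value of the standard normal for a two-sided interval (about 1.96).
z_star = NormalDist().inv_cdf(1 - (1 - CONFIDENCE) / 2)

def draw_proportion(p, n):
    """Simulate n yes/no observations and return the sample proportion."""
    return sum(random.random() < p for _ in range(n)) / n

covered = 0
for _ in range(N_SAMPLES):
    p_hat = draw_proportion(TRUE_P, SAMPLE_SIZE)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    # Count the intervals that contain the true proportion.
    if p_hat - margin <= TRUE_P <= p_hat + margin:
        covered += 1

print(f"{covered} of {N_SAMPLES} intervals contain the true proportion")
```

The count will vary from run to run (change the seed to see this), but it should land close to 190.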
To estimate the interval of the parameter, researchers use parametric or nonparametric statistical procedures, depending on whether certain conditions are satisfied.
The parametric procedure used here is based on the normal distribution.
We move from the binomial distribution (or any other distribution) to the normal distribution thanks to the Central Limit Theorem (CLT).
As you may already know, it is almost never practical to analyse the entire population if it is really large.
Hence, a random sample of the target population is used to make inferences about the population.
The CLT states that if you take many samples, let's say 1000, each of a large enough size (more than 30), and compute the average of each, the distribution of those averages (the sampling distribution of sample means) approaches a normal distribution whose mean equals the true population mean. For sample proportions, the standard deviation (SD) is sqrt(P*(1-P)/N), where P is the true population proportion and N is the sample size.
The distribution of the sample proportions is called the sampling distribution of sample proportions (SDSP).
This SD formula applies only to sample proportions.
The sampling distribution of sample means has a different SD formula.
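To see the CLT at work for proportions, the sketch below draws 1000 simulated samples and compares the spread of their proportions with sqrt(P*(1-P)/N). The values of P and N are hypothetical and chosen only for the demonstration.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_P = 0.36      # hypothetical true proportion P
SAMPLE_SIZE = 100  # hypothetical sample size N (larger than 30)
N_SAMPLES = 1000   # number of samples drawn

# One sample proportion per simulated sample: together they form the SDSP.
proportions = [
    sum(random.random() < TRUE_P for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(N_SAMPLES)
]

theoretical_sd = math.sqrt(TRUE_P * (1 - TRUE_P) / SAMPLE_SIZE)

print(f"mean of sample proportions: {mean(proportions):.4f} (true P = {TRUE_P})")
print(f"SD of sample proportions:   {stdev(proportions):.4f}")
print(f"sqrt(P*(1-P)/N):            {theoretical_sd:.4f}")
```

The simulated mean and SD should sit close to the true proportion and to the formula's value, and a histogram of `proportions` should look roughly bell-shaped.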
To determine the confidence interval of the proportion using the normal distribution, two conditions must hold for the conclusions to be sound:
z > 5 and (N - z) > 5
where z is the number of successes (head-on collisions) and N - z is the number of failures (non head-on collisions).
If those two conditions are not met, it implies that the SDSP is skewed towards the bigger value (successes or failures) and is not normally distributed.
Meeting those conditions signals that the sample is likely to be random and a good representative of the target population.
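The two conditions are easy to check in code. The helper below is a hypothetical convenience function written for this illustration, not part of any library.

```python
def normal_approx_ok(successes, n, threshold=5):
    """Return True when both successes and failures exceed the threshold."""
    failures = n - successes
    return successes > threshold and failures > threshold

# Example with 50 head-on collisions out of 140 accidents.
print(normal_approx_ok(50, 140))
# Too few successes, so the normal approximation would be doubtful.
print(normal_approx_ok(3, 140))
```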
Sample Size
The data ranges from 2000, 5 years after X became an independent country, to 2019.
I could have used all of that data for more information, but to save time I decided to use only the data for the last five years (2014–2018).
With the above in mind, I need to calculate the minimum sample size that keeps cost down and yet achieves an error equal to or less than the maximum allowed most of the time (usually 80% or more).
Generally, the probability that a statistical study achieves its goal is called the statistical power of the study.
Specifically, this study aims to estimate the proportion of head-on collisions in X with a maximum acceptable error of 0.05 points below or above the true proportion.
I chose the 0.05-point error arbitrarily.
There are various ways to calculate a sample size, one of which is an online calculator.
In this article I will use the AUSVET calculator.
Generally, sample size calculation requires the following inputs:
Estimated true proportion: This usually comes from previous studies, experts, or a pilot study. As mentioned in the introduction, there is no previous study on the matter, so to start somewhere I scraped 140 accident reports (10 from each state) for the proportion estimate. Let z = 50 and N = 140. Then p-hat = z/N = 50/140 ≈ 0.36, where z and p-hat are the number of head-on collisions and the proportion of head-on collisions in the sample.
Allowable error: This is the maximum allowed error that needs to be achieved. It determines the width of the confidence interval.
Alpha: Since I will use a 95% confidence level, the corresponding alpha level is 5%.
Estimated target population size: One of the newspapers reports that ~12500 accidents take place annually across the country. Thus, the target population for this study is 12500 * 5 (the last 5 years).
Submitting all the required inputs, the calculated minimum sample size is 353.
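For readers who want to see the arithmetic, here is a minimal sketch of the textbook formula for estimating a proportion, n = z*^2 * p * (1 - p) / e^2. Whether the online calculator uses exactly this formula is an assumption on my part; with the inputs above, the formula does give 353. The function name and the optional finite population correction are illustrative additions, and the formula does not model the 80% power requirement.

```python
import math
from statistics import NormalDist

def min_sample_size(p_estimate, allowable_error, confidence=0.95, population=None):
    """Minimum sample size for estimating a proportion (normal approximation).

    If `population` is given, a finite population correction is applied.
    """
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z_star ** 2 * p_estimate * (1 - p_estimate) / allowable_error ** 2
    if population is not None:
        # Finite population correction (optional, assumed for illustration).
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

p_hat = 50 / 140  # pilot estimate: 50 head-on collisions in 140 reports

n_plain = min_sample_size(p_hat, allowable_error=0.05)
n_corrected = min_sample_size(p_hat, allowable_error=0.05, population=12500 * 5)

print("without population correction:", n_plain)
print("with population correction:   ", n_corrected)
```

Because the target population (62500) is much larger than the sample, the correction changes the result only slightly.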
Now that I have identified where to get the data from, which data makes up my population and sampling frame, and the sample size that will ensure a 5% margin of error 80% of the time at a 95% confidence level, the next step is to collect the data.
Data Collection and Screening
I collected the data through web scraping.
Before you start calculating anything, make sure the data collected is the data you planned for.
Issues found that had to be addressed:
Some states had accident articles dated earlier than the cut-off year (2014). Those articles had to be removed to ensure that all states' articles are from the same period.
Articles about cars or accidents in general, rather than about a specific accident that happened, were removed.
To be able to estimate the proportion at the state level as well as nationally, I ensured that each state had at least 5 head-on collisions and at least 5 non head-on collisions.
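The screening steps above could be sketched as follows. The record layout (`state`, `published`, `is_specific_accident`, `head_on`) and the sample records are hypothetical and exist only to illustrate the checks; the real scraped data will look different.

```python
from collections import Counter
from datetime import date

CUTOFF_YEAR = 2014  # articles published before this year are dropped

# Hypothetical records, for illustration only.
articles = [
    {"state": "A", "published": date(2013, 6, 1), "is_specific_accident": True, "head_on": True},
    {"state": "A", "published": date(2015, 3, 9), "is_specific_accident": True, "head_on": True},
    {"state": "A", "published": date(2016, 1, 20), "is_specific_accident": False, "head_on": False},
    {"state": "B", "published": date(2017, 8, 2), "is_specific_accident": True, "head_on": False},
]

def screen_articles(records, cutoff_year=CUTOFF_YEAR):
    """Keep only articles about a specific accident from the cut-off year onwards."""
    return [
        r for r in records
        if r["published"].year >= cutoff_year and r["is_specific_accident"]
    ]

def states_below_minimum(records, minimum=5):
    """List states with fewer than `minimum` head-on or non head-on collisions."""
    head_on, other = Counter(), Counter()
    for r in records:
        (head_on if r["head_on"] else other)[r["state"]] += 1
    states = {r["state"] for r in records}
    return sorted(s for s in states if head_on[s] < minimum or other[s] < minimum)

cleaned = screen_articles(articles)
print(len(cleaned), "articles kept")
print("states needing more data:", states_below_minimum(cleaned))
```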
Calculating Confidence Interval
Once the right data was cleaned, I calculated the proportion.
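The calculation itself can be sketched as below, using the normal approximation: p-hat ± z* × sqrt(p-hat × (1 - p-hat) / N). The counts in the example (50 head-on collisions out of 140) are the pilot numbers and serve only as an illustration; they are not the final results. The function name is a hypothetical helper.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Point estimate and normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * standard_error
    # Clamp the bounds to [0, 1], since a proportion cannot leave that range.
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

p_hat, lower, upper = proportion_ci(successes=50, n=140)
print(f"point estimate: {p_hat:.3f}")
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```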
Conclusion
Well, that's how I conducted the research.
Most experienced data professionals I have been exposed to say one thing: data collection and cleaning take up the most time, and my study was no exception.
Of the three steps, I noticed that computing the desired estimate takes up the smallest portion of the study's time.
Initially, I started collecting data without defining some of the important points I stated above.
Consequently, I hit a lot of walls.
After reading and watching content on the topic, I learned the importance of taking the time to decide clearly what you plan to measure and how.
So please take the time to plan your study.
It paves a smoother path to a strong, credible study in the most efficient way.
I hope this article was helpful.
References:
Vieira, E. Introduction to Real World Statistics: With Step-by-Step SPSS Instructions.
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan.
Udacity, A/B Testing Course, Class 1: Overview of A/B Testing.