# Data science concepts you need to know! Part 1

There are a number of approaches here; I will present two of the more common methods. Let’s take the following data sets:

```r
## let's make some data
set.seed(122)
dat <- data.frame("val" = rnorm(100, mean = 0.6, sd = 0.15))

# this lets us plot graphs next to each other
par(mfrow = c(1, 2))

# this is a simple histogram
# I call the column named "val" from the dataframe named "dat" with the $val addition
hist(dat$val, main = "Data A", xlab = "values")
hist(log(dat$val), main = "Data B", xlab = "values")
```

Hopefully it is clear that data A might be normal, but data B certainly looks non-normal (note the clear asymmetry)!

To test the normality more thoroughly, we can plot our data against an ideal normal distribution (this is known as a QQ-plot), and then calculate the R-squared of the relationship:

```r
## plot data against ideal normal
par(mfrow = c(1, 2))
qqnorm(dat$val, main = "QQ plot A")
qqline(dat$val)
qqnorm(log(dat$val), main = "QQ plot B")
qqline(log(dat$val))

# get R-squared for B
# note: cor() returns the correlation r; use cor(qn$x, qn$y)^2 for R-squared proper
qn <- qqnorm(log(dat$val), plot.it = FALSE)
rsq_B <- cor(qn$x, qn$y)

# get R-squared for A
qn <- qqnorm(dat$val, plot.it = FALSE)
rsq_A <- cor(qn$x, qn$y)
```

The QQ-plot plots a perfect normal distribution as a solid diagonal line and then our actual data as markers (circles) on top of that. We can see from the plots that our data A follows a normal distribution (the solid diagonal line) well.

However, examining the data in plot B, we can see substantial deviations from the normal line. These deviations are clearly not random but follow a trend (random deviations would occur roughly equally often on either side of the line).

To quantify this we can extract the R-squareds (we’ll learn more on R-squared in later lessons, but it essentially measures how close our points are to that diagonal
line on a scale from 0 to 1). The R-squared values are 0.997 and 0.974 for A and B respectively. It is enough to look at the data to realize plot B is non-normal, but we can also pick a cutoff, such as an R-squared of 0.975, and call anything below this non-normal and anything above this normal.

Another approach is to use the Shapiro test, which will conduct a hypothesis test on whether our data is normal or not:

```r
shapiro.test(dat$val)
```

The key part here is the P-value, which is 0.84 for data A. The null hypothesis of the test is that the data is normal, so a high p-value means we have no evidence against normality, while a low p-value is evidence that the data is non-normal. Often we will use a P-value cutoff of 0.1, so anything with a p-value of <0.1 we would class as non-normal. This approach is complementary to the above, and I would recommend both plotting a QQ-plot (as above) and running a Shapiro test (and reporting both in any report).

Let’s look at the QQ-plot and Shapiro test for some non-normal data (here, the log of normally distributed values, which is skewed; strictly speaking, log-normal data would be `exp()` of normal values):

```r
# note: any negative draws give NaN under log(), with a warning
non_norm <- log(rnorm(75, mean = 5, sd = 3))

par(mfrow = c(1, 2))
hist(non_norm, main = "Histogram of log-transformed data", xlab = "Values")
qqnorm(non_norm, main = "QQ plot for log-transformed data")
qqline(non_norm)
```

The above distribution has a Shapiro test P-value of 2.4e-09 and the QQ-plot shows systematic deviations from the normal line. This strongly suggests that this data is non-normal.

## Hypothesis testing normal distributions

Once we are convinced that our data is normally distributed (we’ll tackle non-normal data later) we can run some hypothesis tests. The T-test is commonly used as a “go-to” hypothesis test, but here I want to convince you to use the Welch test instead (remember, to access the Welch test in R we call the `t.test()` function with the argument `var.equal=FALSE`, which is also the default). The Welch test is better suited to unequal sample sizes and unequal variances, which we will often come across in the real world.

Below, I have run a simulation to demonstrate the danger of using T-tests over Welch tests with unequal variance data:

```r
# set up vars
nSims <- 20000 # number of simulations
p1 <- numeric(nSims) # p-values from the t-test
p2 <- numeric(nSims) # p-values from the Welch test

# create variables for dataframe (not used by the simulation itself)
catx <- rep("x", 38)
caty <- rep("y", 22)
condition <- c(catx, caty)

# run simulations
for (i in 1:nSims) { # for each simulated experiment
  sim_a <- rnorm(n = 38, mean = 0, sd = 1.11) # simulate participants condition a
  sim_b <- rnorm(n = 22, mean = 0, sd = 1.84) # simulate participants condition b
  # perform the t-test and store p-value
  p1[i] <- t.test(sim_a, sim_b, alternative = "two.sided", var.equal = TRUE)$p.value
  # perform the Welch test and store p-value
  p2[i] <- t.test(sim_a, sim_b, alternative = "two.sided", var.equal = FALSE)$p.value
}

par(mfrow = c(1, 2))
hist(p1, main = "Histogram of t-test p-values", xlab = "Observed p-value")
hist(p2, main = "Histogram of Welch's p-values", xlab = "Observed p-value")
```

Here we are plotting histograms of the P-values derived from the two tests after running the simulation many, many times. Remember that the p-value is the probability of observing a difference between groups at least as large as the one we saw if there were no real difference between groups (i.e. purely by chance).
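As a rough numerical companion to the histograms, we can count how often each test gives a "significant" result. Both simulated groups have a true mean of 0, so every p-value below the cutoff is a false positive. The sketch below is my own minimal illustration rather than part of the original simulation: the cutoff of 0.05, the seed, the smaller number of runs and the variable names (`n_sims`, `alpha`, `p_student`, `p_welch`, `false_positive_rate`) are all assumptions chosen for the example.

```r
# Minimal sketch: estimate the false-positive rate of each test
# when there is no real difference between the groups.
set.seed(1)     # arbitrary seed, only so the sketch is reproducible
n_sims <- 2000  # fewer runs than the simulation above, to keep this quick
alpha <- 0.05   # assumed significance cutoff

p_student <- numeric(n_sims) # p-values from the classic t-test
p_welch <- numeric(n_sims)   # p-values from the Welch test

for (i in seq_len(n_sims)) {
  # same group sizes and standard deviations as the simulation above
  sim_a <- rnorm(n = 38, mean = 0, sd = 1.11)
  sim_b <- rnorm(n = 22, mean = 0, sd = 1.84)
  p_student[i] <- t.test(sim_a, sim_b, var.equal = TRUE)$p.value
  p_welch[i] <- t.test(sim_a, sim_b, var.equal = FALSE)$p.value
}

# share of simulated experiments that would wrongly be called significant
false_positive_rate <- c(
  student = mean(p_student < alpha),
  welch = mean(p_welch < alpha)
)
print(false_positive_rate)
```

If a test behaves well, its false-positive rate should sit close to the chosen cutoff. A rate clearly above the cutoff means the test is calling "significant" more often than it should on this kind of data.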