# Business Analysis in R Part 1: Advanced Filtering and Conditional Calculations

Hamza Rafiq · Jun 23

You will find plenty of great articles where people analyze different datasets, but it is difficult to find material related to business analysis in R.
That is one of the main reasons I started this Business Analysis in R series: to share some useful ways you can analyze your data at work more efficiently.
Before we move on, I would recommend that you have a basic understanding of the Tidyverse set of packages in R.
Using this dataset, I will replicate some of the tasks I do at work, and you should find them really useful. You will also get to see how powerful R is compared to Excel.
## Loading Libraries

We load the following libraries for our data munging and analysis:

```r
library(tidyverse)   # tools for data manipulation and visualization
library(lubridate)   # date-time manipulation
library(scales)      # formatting numbers and values
library(hrbrthemes)  # ggplot themes
library(extrafont)   # more font options
```

## Reading Data

I use the `read_csv` function to read in the data.
Just copy the path of the file wherever it lies on your machine and replace the backslashes ('\') with forward slashes ('/'). I always set the `trim_ws` argument to TRUE in case there is any whitespace in the data. If the data is saved as an Excel file, you can use the `read_excel` function from the readxl package and follow the same principles.

```r
supermarket_sales <- read_csv("C:/Users/Hamza/Downloads/supermarket_sales – Sheet1.csv",
                              trim_ws = TRUE)
```

## Viewing the Data

`View()` is a very handy function that lets you look at your data in an Excel-spreadsheet-like format. I am also going to use the `glimpse` function to check the data types.

```r
supermarket_sales %>% View()
glimpse(supermarket_sales)
```

[Screenshots: output of the View and glimpse functions]

Looks like I need to fix my Date and Time columns and convert them to the right format.
I am going to reduce the Time column to just the hour and convert the Date column to a date format.
After that, we will create a new column which shows the day of the week.
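Before applying this to the dataset, here is a minimal sketch of the individual lubridate helpers on a made-up date, so each step of the upcoming pipeline is clear:

```r
library(lubridate)

d <- mdy("6/23/2019")   # parse a "month/day/year" string into a Date
wday(d, label = TRUE)   # day of the week as a labelled factor (e.g. "Sun" in an English locale)

# hour() pulls just the hour out of a date-time
hour(as.POSIXct("13:45:00", format = "%H:%M:%S"))   # 13
```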
```r
supermarket_sales <- supermarket_sales %>%
  mutate(Time = as.POSIXct(Time),
         Time = hour(Time),
         Date = mdy(Date),
         weekday = wday(Date, label = TRUE))
```

I first changed the Time column to a date-time type and extracted just the hour from it. The `mdy` function in the lubridate package allowed me to parse the Date column as a date type, from which I extracted the weekday.

## Analysis

Let's get started with some exploratory data analysis.
While we will be uncovering different insights in the data, the main focus here is to give you an understanding of how to get desired results in R.
Once you have an understanding of how the code works, you can analyze and explore the data any way you want.
## Bar Charts & Faceted Bar Charts

We will plot some bar charts to get insight into total sales relative to the day of the week and the time of day.

```r
# Creating a summary
sales_by_day <- supermarket_sales %>%
  group_by(weekday) %>%
  summarise(Total_Sales = sum(Total)) %>%
  ungroup()

# Visualizing summary data
sales_by_day %>%
  ggplot(aes(reorder(weekday, Total_Sales), Total_Sales, fill = weekday)) +
  geom_col(show.legend = FALSE, color = "black") +
  geom_text(aes(label = comma(Total_Sales)), size = 3, hjust = 1, color = "black") +
  scale_y_comma() +
  scale_fill_brewer(palette = "Paired") +
  coord_flip() +
  theme_classic() +
  labs(title = "Total Sales breakdown by Weekday",
       x = "Day of the week",
       y = "Total sales")
```

[Chart: total sales by day of the week]

```r
# Summarizing by day and hour
sales_by_day_hour <- supermarket_sales %>%
  group_by(weekday, Time) %>%
  summarise(Total_Sales = sum(Total)) %>%
  ungroup()

# Visualizing
sales_by_day_hour %>%
  mutate(Time = fct_reorder(factor(Time), Total_Sales)) %>%
  ggplot(aes(Time, Total_Sales, fill = weekday)) +
  geom_col(show.legend = FALSE, color = "black") +
  geom_text(aes(label = comma(Total_Sales)), size = 3, hjust = 1, color = "black") +
  scale_y_comma() +
  scale_fill_brewer(palette = "Paired") +
  facet_wrap(~weekday, scales = "free_y") +
  coord_flip() +
  theme_classic() +
  labs(title = "Total Sales breakdown by Weekday and Time",
       x = "Hour of the day",
       y = "Total sales")
```

## Line Graphs

We will now explore how sales have changed over time for the different genders.
```r
supermarket_sales %>%
  group_by(Monthly = floor_date(Date, unit = "1 week"), Gender) %>%
  summarise(Total_Sales = sum(Total)) %>%
  ggplot(aes(Monthly, Total_Sales)) +
  geom_line(aes(color = Gender), size = 1) +
  theme_light() +
  scale_y_comma() +
  labs(title = "Total Sales over time by Gender",
       subtitle = "Sales for men increased over time",
       y = "Total Sales",
       x = "")
```

## Advanced Filtering

R has some really powerful data-filtering options that other tools like Excel do not have.
For example, I want to filter for sales that are greater than the average sale made to women.

```r
supermarket_sales %>%
  filter(Total > mean(Total[Gender == "Female"]))
```

[Table: all sales greater than the average female sale]

You can even take this further.
Let's filter for sales that are greater than the average sale of health and beauty products sold to females.

```r
supermarket_sales %>%
  filter(Total > mean(Total[Gender == "Female" & `Product line` == "Health and beauty"]))
```

It is pretty simple: all I had to do was add an additional filter context for Product line when calculating the mean.
Sometimes you do not know the exact strings in your character-based columns, and typing each one out can be a real pain. This is where regular expressions make things easier.
```r
supermarket_sales %>%
  filter(str_detect(`Product line`, regex("sport|beauty", ignore_case = TRUE)))
```

Basically, I am filtering for rows whose Product line contains either of these strings. There is also a `negate` argument which you can set to TRUE, and it will only show results that do not contain these strings.
```r
supermarket_sales %>%
  filter(str_detect(`Product line`, regex("sport|beauty", ignore_case = TRUE), negate = TRUE))
```

## Group By and Summarizing

These are probably the functions you will use the most.
You might have seen many R tutorials where people use these functions, but they usually only show the basics like sum, mean, or count. In reality, you often have to perform conditional calculations that I have rarely seen any tutorial cover.
I will start with a basic calculation then move to conditional calculations.
Let's take a look at our sales by city.
```r
supermarket_sales %>%
  group_by(City) %>%
  summarise(Total_Sales = sum(Total)) %>%
  arrange(desc(Total_Sales))
```

This is a simple calculation: I grouped the data by city, calculated the sum of Total sales, and arranged the results from largest to smallest.
Now, if we wanted to calculate sales in Branch A for each city, this is how we would do it.

```r
supermarket_sales %>%
  group_by(City) %>%
  summarise(Total_Sales = sum(Total[Branch == "A"])) %>%
  arrange(desc(Total_Sales))
```

We use the `[ ]` subsetting operator to add a filter context, the same way we filtered our data previously.
This is how I calculate all my conditional calculations.
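Under the hood this is ordinary vector subsetting: the condition inside the brackets keeps only the matching values before sum sees them. A minimal sketch on made-up vectors:

```r
total  <- c(10, 20, 30, 40)
branch <- c("A", "B", "A", "C")

# Only the values where branch == "A" survive the subsetting
sum(total[branch == "A"])   # 10 + 30 = 40
```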
Let's do another one for different products and payment types.

```r
supermarket_sales %>%
  group_by(City) %>%
  summarise(Total_Sales = sum(Total[Payment %in% c("Cash", "Credit card") &
                                      str_detect(`Product line`,
                                                 regex("sport|beauty", ignore_case = TRUE))])) %>%
  arrange(desc(Total_Sales))
```

This might seem a little complicated, but broken down it is pretty simple. I did a conditional sum where Payment was equal to Cash or Credit card and the product contained "sport" or "beauty" in its name. I used str_detect combined with regex, as before, to match the products I wanted without typing out their long names.
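If you want to see what str_detect with regex actually returns, here is a sketch on made-up product strings (they only resemble the dataset's real categories):

```r
library(stringr)

products <- c("Health and beauty", "Sports and travel", "Food and beverages")

# A logical vector: TRUE wherever the pattern matches, ignoring case
str_detect(products, regex("sport|beauty", ignore_case = TRUE))
#> TRUE TRUE FALSE

# negate = TRUE flips the result
str_detect(products, regex("sport|beauty", ignore_case = TRUE), negate = TRUE)
#> FALSE FALSE TRUE
```

Inside filter(), each TRUE keeps the corresponding row.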
The possibilities with these conditional calculations are endless and I use them at work all the time.
Sometimes you also have to make time-based calculations at work, for example sales year to date and sales over the same period last year. We can do that by following the same principles as above, with the help of the lubridate package, which lets us manipulate date values.
```r
supermarket_sales %>%
  group_by(City) %>%
  summarise(Total_Sales = sum(Total[Date %within% interval("2019-01-01", "2019-03-31")]),
            Same_period_LY = sum(Total[Date %within% interval("2018-01-01", "2018-03-31")]))
```

Here, I used the `%within%` and `interval` functions to specify which dates I want to aggregate. Don't worry that there is no data for 2018; I just wanted you to understand the concept.
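A quick sketch of %within% and interval on made-up dates:

```r
library(lubridate)

q1_2019 <- interval("2019-01-01", "2019-03-31")

as.Date("2019-02-15") %within% q1_2019   # TRUE  -- inside the first quarter of 2019
as.Date("2019-06-01") %within% q1_2019   # FALSE -- outside the interval
```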
I really like these “time intelligence” calculations because these functions allow me to be as specific as I want.
[Table: time intelligence calculation]

## Conclusion

These were some of the functions I use at work for analyzing data.
I will continue my Business Analysis in R series to share more cool and useful ways to analyze your business data efficiently in R.