Top R libraries for Data Science
Kunal Dhariwal · Nov 22, 2018
Image courtesy: https://pixabay.com/en/statistic-survey-website-template-1606951/
Here, let me tell you about some of the awesome libraries that R has.
I consider these to be the top libraries for Data Science. They have a wide range of functions and are quite useful for Data Science operations; I've used them, and still use them, for most of my day-to-day work. Without wasting any further time, let me get you started with some awesome R stuff.
The libraries are mentioned here in no particular order. I wouldn't want to rank them, because they're all useful in their own way and it wouldn't be fair to rate one above another.
Dplyr
Dplyr is mainly used for data manipulation in R. It is built around five core functions, which cover the majority of the data manipulation you tend to do, and you can work with local data frames as well as with remote database tables. You might need to:
Select certain columns of data.
Filter your data to select specific rows.
Arrange the rows of your data into an order.
Mutate your data frame to contain new columns.
Summarize chunks of your data in some way.
It also has functions for sampling and grouping, and re-exports the pipe operator %>%.
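The five verbs chain together naturally with the pipe. A minimal sketch using the built-in mtcars dataset (the km-per-litre conversion is just for illustration):

```r
# The five dplyr verbs, chained with the pipe.
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, hp) %>%        # keep only three columns
  filter(cyl > 4) %>%             # keep 6- and 8-cylinder cars
  arrange(desc(hp)) %>%           # sort by horsepower, descending
  mutate(kpl = mpg * 0.425) %>%   # add a km-per-litre column
  group_by(cyl) %>%
  summarize(mean_hp = mean(hp))   # one row per cylinder count

print(result)
```

The same pipeline also works on a remote database table via dbplyr, which is part of what makes dplyr so handy.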
Ggplot2
Ggplot2 is one of the best libraries for data visualization in R. It implements a “grammar of graphics” (Wilkinson, 2005): an approach that gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation. Ggplot2 has a wide range of functions; to learn about them, read the R documentation here: https://bit.
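In the grammar, you map data attributes to graphical aesthetics and then add geometry layers on top of that mapping. A minimal sketch with the built-in mtcars data:

```r
library(ggplot2)

# Map data attributes (wt, mpg, cyl) to graphical attributes (x, y, colour),
# then add a point geometry on top of that mapping.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders", title = "Fuel economy vs. weight")

print(p)  # draws the plot on the active graphics device
```

Swapping geom_point() for another geom, or adding facets and scales, changes the picture without changing the underlying mapping.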
Esquisse
My favorite package, and the best addition to R. Not liking ggplot2? Having some trouble using ggplot2 and its functions? Then this package is for you. It brings the most important feature of Tableau to R: just drag and drop, and get your visualization done in minutes. It's actually an enhancement to ggplot2: this addin lets you interactively explore your data by visualizing it with the ggplot2 package. You can draw bar graphs, curves, scatter plots, and histograms, then export the graph or retrieve the code that generates it. It's awesome, isn't it?
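Getting started takes only a couple of lines; a sketch assuming the CRAN release of esquisse (pass in whichever data frame you want to explore):

```r
# Launch the drag-and-drop chart builder on a data frame of your choice.
# install.packages("esquisse")  # one-time install from CRAN
library(esquisse)

esquisser(mtcars)  # opens the interactive ggplot2 builder; export the
                   # finished plot, or copy the generated ggplot2 code
```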
BioConductor
When you get into Data Science, you deal with many different kinds of data, and you never know what sort you'll have to handle in the future. If you are in the health industry then, trust me, you'll find this very useful; I consider it especially valuable when you are working on genomic data. Bioconductor is an open-source project that hosts a wide range of tools for analyzing biological data with R. To install Bioconductor packages, you first need to install BiocManager. Some of the package families:
Graphics: geneplotter, hexbin.
Annotation: annotate, AnnBuilder, plus data packages.
Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn.
Pre-processing two-color spotted DNA microarray data: limma, marrayClasses, marrayInput, marrayNorm, marrayPlots, marrayTools, vsn.
Differential gene expression: edd, genefilter, limma, multtest, ROC.
Graphs and networks: graph, RBGL, Rgraphviz.
Analysis of SAGE data: SAGElyzer.
Click here to learn more about installation and other Bioconductor packages: https://bit.
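The install path differs from ordinary CRAN packages: everything goes through BiocManager. A short sketch (limma and graph are example packages from the list above):

```r
# Bioconductor packages are installed through BiocManager,
# not with install.packages() directly.
install.packages("BiocManager")           # one-time setup, from CRAN
BiocManager::install("limma")             # install a Bioconductor package
BiocManager::install(c("graph", "RBGL"))  # or several at once

library(limma)  # then load it like any other package
```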
Shiny
This is a very well-known package in R. When you want to share your work with the people around you and make it easier for them to understand and explore it visually, you can use Shiny. It's a Data Scientist's best friend. Shiny makes it easy to build interactive web apps: you can host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards.
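Every Shiny app is a UI definition plus a server function. A minimal sketch wiring a slider to a histogram of the built-in faithful data:

```r
# A minimal Shiny app: a slider input wired to a histogram output.
library(shiny)

ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

app <- shinyApp(ui, server)
# runApp(app) launches it in your browser
```

The reactive link is automatic: whenever the slider moves, renderPlot re-runs and the histogram updates.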
Lubridate
This library serves its purpose really well. It's mainly used for data wrangling: it makes dealing with dates and times in R much easier. You can do everything you ever wanted to do with date arithmetic using this library, although understanding and using all the available functionality can be somewhat complex. For example, when you are analyzing time-series data and want to aggregate it by month, floor_date() from the lubridate package gets the job done quite easily. It has a wide range of functions; you can read the documentation here: https://bit.
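A small sketch of the date arithmetic and the floor_date() trick mentioned above:

```r
library(lubridate)

d <- ymd("2018-11-22")   # parse a date from year-month-day text
month(d)                 # 11
d + months(2)            # date arithmetic: "2019-01-22"

# Aggregating by month: floor every date down to the first of its month,
# then group on the floored value.
stamps <- ymd("2018-11-03") + days(c(0, 10, 40))
floor_date(stamps, unit = "month")  # "2018-11-01" "2018-11-01" "2018-12-01"
```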
Knitr
This package is used for dynamic report generation in R. The purpose of knitr is to enable reproducible research in R through literate programming. It integrates R code into LaTeX, Markdown, LyX, HTML, AsciiDoc, and reStructuredText documents, so you can add R to a markdown document and easily generate reports in HTML, Word, and other formats. A must-have if you're interested in reproducible research and automating the journey from data analysis to report creation.
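A sketch of the workflow: write a small .Rmd file with a code chunk, then knit it so the chunk is executed and its output woven into the result (file names here are just an illustration):

```r
# Write a tiny .Rmd file from R, then knit it to markdown.
fence <- strrep("`", 3)  # the ``` chunk delimiter, built up to avoid nesting issues
writeLines(c(
  "# My report",
  "",
  paste0(fence, "{r speed-summary}"),
  "summary(cars$speed)",
  fence
), "report.Rmd")

knitr::knit("report.Rmd", output = "report.md")  # executes the chunk, writes markdown
```

Opening report.md shows the chunk's code followed by the summary table it produced.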
Mlr
This package is absolutely incredible for performing machine learning tasks. It has almost all of the important and useful algorithms, and can be described as an extensible framework for classification, regression, clustering, multilabel classification, and survival analysis. It also has filter and wrapper methods for feature selection, and most of its operations can be parallelized. The wide range of functions is covered in the documentation: https://bit.
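The typical mlr workflow is task, learner, train, predict. A sketch assuming the original mlr package (its successor is mlr3), fitting a decision tree on iris:

```r
# The mlr workflow: define a task, pick a learner, train, predict, evaluate.
library(mlr)

task    <- makeClassifTask(data = iris, target = "Species")  # what to learn
learner <- makeLearner("classif.rpart")                      # how to learn it
model   <- train(learner, task)                              # fit the model
pred    <- predict(model, task = task)                       # predict back

performance(pred, measures = acc)  # accuracy on the task data
```

Swapping "classif.rpart" for another learner name reuses the exact same pipeline, which is the point of the framework.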
Quanteda.dictionaries
This package extends the capabilities of the quanteda package. It consists of dictionaries for text analysis. It's mainly designed to work with quanteda, but it also works well with other text analysis libraries such as tm, tidytext, and udpipe. With the liwcalike() function from the quanteda.dictionaries package, you can easily analyze text corpora using existing or custom dictionaries. You can install this package from its GitHub page.
DT
This package is used for data display: it renders R matrices and data frames as interactive HTML tables. You can create a sortable, searchable table with a minimal amount of code; in fact, one line is enough. You can also style your table. The underlying DataTables library provides filtering, pagination, sorting, and many other features.
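The one-liner in action, plus a lightly styled variant (the caption and page length are just examples):

```r
# One line is enough for a sortable, searchable HTML table.
library(DT)

tbl <- datatable(iris)  # interactive table widget

# Options and captions are layered onto the same call:
tbl2 <- datatable(iris, options = list(pageLength = 5),
                  caption = "Edgar Anderson's iris data")
```

Printing either object in RStudio or an R Markdown document renders the interactive table.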
RCrawler
RCrawler is a contributed R package for domain-based web crawling and content scraping. It adds the crawling functionality that the rvest package lacks: RCrawler can crawl, parse, and store pages, extract their contents, and produce data that can be directly employed for web content mining applications. The crawling operation is performed by several concurrent processes or nodes in parallel, so it's recommended to use the 64-bit version of R.
Caret
Caret stands for Classification And REgression Training. One of the primary tools in the package is the train() function, which can be used to evaluate, using resampling, the effect of model tuning parameters on performance. Caret has several functions that attempt to streamline the model building and evaluation process, as well as feature selection and other techniques. This package alone covers almost everything you need to solve nearly any supervised machine learning problem: it provides a uniform interface to many machine learning algorithms and standardizes various other tasks such as data splitting, pre-processing, feature selection, and variable importance estimation.
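A sketch of that uniform interface: split the data, fit with cross-validated train(), and score on the hold-out set (the rpart model and 80/20 split are just example choices):

```r
# The uniform caret interface: the same train() call works across models.
library(caret)

set.seed(42)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)  # data splitting
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

fit <- train(Species ~ ., data = train_set,
             method = "rpart",                    # swap in "rf", "knn", ...
             trControl = trainControl(method = "cv", number = 5))

preds <- predict(fit, newdata = test_set)
mean(preds == test_set$Species)  # hold-out accuracy
```

Changing the method string is all it takes to try a different algorithm; the splitting, resampling, and prediction code stays the same.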
RMarkdown
R Markdown allows you to create documents that serve as a neat record of your analysis. In the world of reproducible research, we want other researchers to easily understand what we did in our analysis; otherwise, nobody can be certain that we analysed the data properly. R Markdown is a variant of Markdown with embedded R code chunks, used together with knitr to make it easy to create reproducible web-based reports. You can turn your analyses into high-quality documents, reports, presentations, and dashboards with R Markdown.
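A complete R Markdown file is plain text: a YAML header that picks the output format, prose, and knitr code chunks. A minimal sketch (the title and chunk are illustrative):

````markdown
---
title: "My analysis"
output: html_document
---

The fuel economy data, summarised:

```{r mpg-summary}
summary(mtcars$mpg)
```
````

Rendering it, for example with rmarkdown::render("analysis.Rmd"), executes the chunk and produces the HTML report.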
Leaflet
Leaflet lets you build interactive maps in R; moreover, you can use these maps directly from the R console. It provides a set of functions for styling and customizing your map, and you can also use different tiles apart from the base maps. Development work on this library is extensive. Make sure to try it out if you want to work with maps.
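A minimal sketch: default tiles plus one marker (the coordinates are just an example):

```r
# A minimal interactive map: default OpenStreetMap tiles plus one marker.
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                                              # default base tiles
  addMarkers(lng = -0.1276, lat = 51.5072, popup = "London") %>%
  setView(lng = -0.1276, lat = 51.5072, zoom = 10)

m  # renders the map widget in the RStudio viewer or browser
```

For alternative tile sets, addProviderTiles() swaps in third-party base maps without changing the rest of the pipeline.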
Janitor
Janitor makes basic data cleaning easy: finding duplicates across multiple columns, making R-friendly column names, and removing empty columns. It also has some nice tabulating tools, like adding a total row and generating tables with percentages and easy cross-tabs. And its get_dupes() function is an elegant way of finding duplicate rows in data frames, based on one column, several columns, or entire rows.
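A short sketch of both features on a deliberately messy frame (the column names are contrived to show the cleaning):

```r
# Cleaning awkward column names and flagging duplicate rows with janitor.
library(janitor)

messy <- data.frame("First Name" = c("Ada", "Ada", "Grace"),
                    "Score"      = c(90, 90, 95),
                    check.names  = FALSE)

clean <- clean_names(messy)  # "First Name" becomes "first_name"
names(clean)

get_dupes(clean, first_name)  # returns the rows sharing a first_name,
                              # with a dupe_count column added
```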
Other R libraries worth mentioning: Ggvis, Plotly, Rcharts, Rbokeh, Broom, StringR, Magrittr, Slidify, Rvest, Future, RMySQL, RSQLite, Prophet, Glmnet, Text2Vec, SnowballC, Quantmod, Rstan, Swirl, DataScienceR.
If I've missed any important library, do let me know down below in the comments section.
So, this is it: these were some of the top libraries you need to know in order to get your day-to-day Data Science operations done.
Show some love if it was helpful! Thanks for reading!