Beginning with R: The uncharted territory

Puneet Sharma
Jul 3

Coming from a non-programming background, with Python as my first exposure to programming and data analysis, getting my hands dirty in R seemed pretty daunting at first.

R can feel a bit peculiar at times, since it was built around data analysis and statistics rather than general-purpose software programming, which is the case with Python.

But as I push myself to learn R's many quirks and its advantages over Python, it gives me a different perspective on doing data analysis.

Plus, R has a strong edge over Python: a vast and up-to-date collection of libraries in which statisticians the world over implement statistical methods.

Quirks aside, RStudio, the most interesting IDE developed so far for R, makes data analysis feel like a fun activity.

Other RStudio features, like writing reports with LaTeX and HTML support and building static websites with Hugo, make life quite easy.

But to be able to do all this cool stuff we need to first grasp the basics of R which are the building blocks of any complex data analysis pipeline.

So let’s start the journey of R.

Table of Contents

Introduction to R
R data types
Handling undefined values
Operators
Logical operators
Data structures: Lists, Vectors, Matrices, Arrays, Factors, Data frames

Introduction to R

R is a dynamic language developed largely for statistical computing and visualization.

It includes various statistical and machine learning packages which are easy to learn and employ for data analysis.

R data types

When a variable is created in the R workspace, R automatically assigns a data type to it.
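A minimal sketch of what this looks like, with x and y as made-up variable names (the values are chosen to match the output shown next):

```r
# Assign a number and a string; R works out the type of each on its own.
x <- 4.2
y <- "Hello!"

print(x)
print(y)
```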

## [1] 4.2
## [1] "Hello!"

To check the type of a variable, use typeof().

## [1] "double"
## [1] "character"

The basic data types are:

String/character
Number
  • Integer
  • Double
  • Complex
Boolean/logical

A number typed without a suffix, whether it looks like an integer or a float, is stored as a double.
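To see this for yourself, here is a small sketch using typeof(); the variable names n_dbl and n_int are just illustrative:

```r
# A bare numeric literal is a double, even when it looks like an integer.
n_dbl <- 5
typeof(n_dbl)    # "double"

# The L suffix asks for an integer explicitly.
n_int <- 5L
typeof(n_int)    # "integer"

# The other basic types report themselves the same way.
typeof(2 + 3i)   # "complex"
typeof("text")   # "character"
typeof(TRUE)     # "logical"
```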

## [1] "double"

To explicitly request an integer, add the suffix L.

## [1] "integer"

Handling undefined values

Handling undefined/missing values is somewhat different from Python.

In Python, data analysis libraries such as pandas mostly rely on NaN (alongside None) to mark undefined/missing values.

In R, undefined values are represented using:

NULL
NA
NaN

All three work differently.

NULL is the null object; it is used when there is no value present at all.

If a value is present in a vector or matrix but is not usable (a placeholder or fill value), we use NA or NaN.

NA and NaN are missing-value indicators.
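A rough sketch of how the three behave, using only base R functions; the vector v is made up for illustration:

```r
# What each of the three reports about itself.
typeof(NULL)   # "NULL"
typeof(NA)     # "logical"
class(NaN)     # "numeric"

# NA and NaN occupy a slot in a vector; NULL does not.
v <- c(1, NA, 0/0)      # 0/0 evaluates to NaN
is.na(v)                # FALSE TRUE TRUE  (is.na() is also TRUE for NaN)
is.nan(v)               # FALSE FALSE TRUE
length(c(1, NULL, 2))   # 2, the NULL simply disappears

# Comparing against NA is indeterminate, so the answer is NA.
NA > 1
```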

## [1] "NULL"
## [1] "logical"
## [1] "numeric"

NA appears when a result is neither TRUE nor FALSE, i.e. logical indeterminacy.

It can also stand for a missing value.

NaN ("not a number") is the result of an undefined numeric operation such as 0/0.

Operators

Multiplication (*)
Division (/)
Addition (+)
Subtraction (-)
Exponent (^)
Modulus (%%)
Integer division (%/%)

Logical operators

NOT (!)
Element-wise AND (&)
AND (&&)
Element-wise OR (|)
OR (||)
In the set (%in%)

Data structures

In R there are 6 types of data structures:

Lists
Vectors (or atomic vectors)
Matrices
Arrays
Factors
Data frames

Lists

A list in R can hold elements of different types.

Unlike in a vector, the elements are not coerced to a common type.

A list can contain numbers, characters, booleans, matrices, vectors, arrays, other lists, etc.

To create a list, use the list() function.
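For instance, a sketch like the following (my_list is an illustrative name, and the values are picked to match the output shown next) builds a list with mixed contents:

```r
# A list holding strings, numbers and a numeric vector side by side.
my_list <- list("green", "yellow", 1, 2, c(4, 5, 6))
print(my_list)

# Each element keeps its own type; nothing is coerced.
sapply(my_list, typeof)
```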

## [[1]]
## [1] "green"
##
## [[2]]
## [1] "yellow"
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 4 5 6

To give a name to each entry of a list, use the names() function.
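A small sketch of naming entries and then pulling them back out; the names A, B and C are made up for this example:

```r
my_list <- list("green", "yellow", 1)
names(my_list) <- c("A", "B", "C")

my_list$A        # "green", access by name with $
my_list[[2]]     # "yellow", access by position
my_list["A"]     # single brackets return a one-element list, still labelled $A
my_list[["A"]]   # double brackets return the value itself, "green"
```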

To access a particular entry in a list, use $.

## [1] "green"
## [1] "yellow"
## $A
## [1] "green"
## [1] "green"

To merge two or more lists, use c().

## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 6
##
## [[7]]
## [1] 7
##
## [[8]]
## [1] 8

R also has some predefined constants (strictly speaking these are character vectors, not lists).
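These built-in constants ship with base R, so a sketch needs nothing more than their names:

```r
# Built-in constants available in every R session.
print(letters)      # lower-case alphabet
print(LETTERS)      # upper-case alphabet
print(month.abb)    # three-letter month abbreviations
print(month.name)   # full month names

# They are character vectors rather than lists.
is.character(letters)   # TRUE
is.list(letters)        # FALSE
```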

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"

Vectors

To create a vector, we use the c() function.

It concatenates its arguments, much like building a list in Python.
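A minimal sketch, with v as an illustrative name and the values chosen to match the output shown next:

```r
# Mix numbers, a string and logicals in a single c() call.
v <- c(1, 2, 3, 4, 5.4, "hello", TRUE, FALSE)
print(v)

# The whole vector ends up with one type.
typeof(v)   # "character"
```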

## [1] "1" "2" "3" "4" "5.4" "hello" "TRUE" "FALSE"

As we can see, c() accepts values of any data type, be it number, character or boolean.

But notice something.

All the elements in the vector have been coerced to character type because the vector contains the string "hello".

This is the effect of implicit coercion: an atomic vector can hold only one data type.

To create a strictly numeric vector, use the vector() function.
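A sketch of the call, assuming a length of 20 to match the output shown next (z is an illustrative name):

```r
# Create a numeric vector of 20 zeros.
z <- vector("numeric", length = 20)
print(z)

# numeric(20) is a shorthand that gives the same result.
identical(z, numeric(20))   # TRUE
```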

## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Such a vector can be used to preallocate storage that a for loop then fills in. This is faster than appending to an empty vector, because every time a value is appended R makes a copy of the vector, which slows the whole process down.
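To make the difference concrete, here is a rough sketch of both patterns; the loop body (squaring i) and the size n are arbitrary choices for illustration:

```r
n <- 1000

# Growing: start empty and append, which copies the vector on each step.
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: reserve all n slots first, then fill them in place.
filled <- vector("numeric", length = n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

# Both approaches give the same values; only the cost differs.
identical(grown, filled)   # TRUE
```

Wrapping each loop in system.time() is one way to compare the two on your own machine.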

Coercion: objects like vectors, data frames etc. can be coerced to different classes using the as.<class> family of functions, such as as.numeric(), as.character() and as.logical().
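A short sketch of explicit coercion; the vector x is made up for the example:

```r
x <- c(1, 0, 2)

class(x)                  # "numeric"
class(as.character(x))    # "character"
class(as.logical(x))      # "logical"

# Zero becomes FALSE, any other number becomes TRUE.
as.logical(x)             # TRUE FALSE TRUE

# A coercion that makes no sense yields NA (R also emits a warning).
suppressWarnings(as.numeric("abc"))   # NA
```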

## [1] "numeric"
## [1] "character"
## [1] "logical"

Matrices

A matrix is the same as a vector except that it has an additional dimension attribute.

It is a two-dimensional data structure.
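A minimal sketch of creating a matrix and inspecting its attributes. The values are my own choice, arranged so that the result matches the output shown next:

```r
# Nine values arranged into a 3 x 3 matrix.
a <- matrix(c(6, 2, 6, 8, 3, 2, 6, 8, 0), nrow = 3, ncol = 3)
print(a)

# The only thing separating this from a plain vector is the dim attribute.
attributes(a)
```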

##      [,1] [,2] [,3]
## [1,]    6    8    6
## [2,]    2    3    8
## [3,]    6    2    0
## $dim
## [1] 3 3

By default, R fills a matrix column-wise (pass byrow = TRUE to fill row-wise).

In Python, by contrast, a NumPy array is filled row-wise by default.

In R, we can also pass names for the rows and columns.
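One way to do this is the dimnames argument of matrix(); the sketch below assumes the same illustrative values as before:

```r
# dimnames takes a list: row names first, then column names.
a <- matrix(c(6, 2, 6, 8, 3, 2, 6, 8, 0), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("x", "y", "z")))
print(a)

colnames(a)   # "x" "y" "z"
rownames(a)   # "a" "b" "c"
```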

##   x y z
## a 6 8 6
## b 2 3 8
## c 6 2 0
## [1] "x" "y" "z"
## [1] "a" "b" "c"

To access the elements of a matrix, use square brackets.

## [1] 3
##   x y
## b 2 3
## c 6 2

But a[2,] (2nd row) or a[,2] (2nd column) gives a vector, i.e. R tries to return the simplest data structure.

To avoid this, i.e. to get a matrix back, use drop = FALSE.

## x y z
## 2 3 8
## NULL
##   x y z
## b 2 3 8
## [1] 1 3

Specific elements can also be selected by index.
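The indexing behaviour described above can be sketched as follows. The matrix values are my own illustration, and the last line assumes "specific indexing" means passing a vector of positions:

```r
a <- matrix(c(6, 2, 6, 8, 3, 2, 6, 8, 0), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("x", "y", "z")))

a[2, 2]          # a single element: 3
a[2:3, 1:2]      # a sub-matrix of rows b, c and columns x, y

row2 <- a[2, ]   # simplified to a named vector
dim(row2)        # NULL, so it is no longer a matrix

row2_m <- a[2, , drop = FALSE]   # stays a 1 x 3 matrix
dim(row2_m)                      # 1 3

# A vector of positions counts down the columns: 6 2 8 2
a[c(1, 2, 4, 6)]
```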

## [1] 6 2 8 2

You can also index using logical vectors.
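A sketch with logical vectors, again on an illustrative matrix; TRUE keeps a row or column and FALSE drops it:

```r
a <- matrix(c(6, 2, 6, 8, 3, 2, 6, 8, 0), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("x", "y", "z")))

# Keep rows 1 and 3, and columns 1 and 2.
sub <- a[c(TRUE, FALSE, TRUE), c(TRUE, TRUE, FALSE)]
print(sub)

# A logical condition works too: all elements greater than 5.
a[a > 5]   # 6 6 8 6 8
```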

##   x y
## a 6 8
## c 6 2

To transpose a matrix, use t(a).

To combine vectors or matrices, use rbind() or cbind().

The dimensions of a matrix can also be changed (reshaped).
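A short sketch of these three operations on an illustrative 3 x 3 matrix; assigning to dim() is one way to reshape, and is assumed here because it yields the dimensions shown next:

```r
a <- matrix(c(6, 2, 6, 8, 3, 2, 6, 8, 0), nrow = 3)

ta <- t(a)                        # transpose: rows become columns
taller <- rbind(a, c(1, 1, 1))    # add a row    -> 4 x 3
wider  <- cbind(a, c(1, 1, 1))    # add a column -> 3 x 4

# Reshape in place by assigning new dimensions (9 elements -> 1 x 9).
dim(a) <- c(1, 9)
dim(a)
```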

## [1] 1 9

Arrays

An array is an object that can hold multidimensional data.

Matrices are a special case of arrays: they are two-dimensional arrays.

So, along with the dimension attribute, i.e. dim, arrays can also carry a dimnames attribute.

An array is simply a multidimensional data structure.
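A rough sketch of working with arrays through array(); all names and values here (arr, lab, big, small) are my own, chosen for illustration:

```r
# 24 values arranged as 3 rows x 4 columns x 2 slices.
arr <- array(1:24, dim = c(3, 4, 2))
print(arr)

# dimnames takes one character vector of labels per dimension.
lab <- array(c(10, 20, 30, 40, 12, 13, 14, 15), dim = c(2, 2, 2),
             dimnames = list(c("a", "b"), c("d", "e"), c("g", "h")))
lab["a", "e", "h"]   # 14

# Subsetting works like a matrix, with one index per dimension.
big <- array(1:27, dim = c(3, 3, 3))
small <- big[1:2, 1:2, ]
dim(small)           # 2 2 3
```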

Its syntax is a <- array(data, dim = c(x, y, z, t, ...))

## , , 1
##
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
##
## , , 2
##
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

## , , 1
##
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40
##
## , , 2
##
##      [,1] [,2]
## [1,]   12   14
## [2,]   13   15

To define labels for the different dimensions, use dimnames.

## , , g
##
##    d  e
## a 10 30
## b 20 40
##
## , , h
##
##    d  e
## a 12 14
## b 13 15

## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
##
## , , 3
##
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27

## $dim
## [1] 2 2 3

## , , 1
##
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
##
## , , 2
##
##      [,1] [,2]
## [1,]   10   13
## [2,]   11   14
##
## , , 3
##
##      [,1] [,2]
## [1,]   19   22
## [2,]   20   23

Factors

To represent categorical data, R has a specific object called a factor.

Factors are basically integers with labels associated with them.

So each integer code is associated with a particular label.

These labels are called levels.

Factors look like characters but are integers underneath.

Factors are also useful for sorting or grouping other data according to a categorical variable.

The factor() function is used to create a factor object.
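A minimal sketch, assuming a made-up vector of fruit names as input:

```r
# Build a factor from a character vector with repeated categories.
fruits <- factor(c("apple", "banana", "orange", "apple"))

attributes(fruits)   # shows the levels and the class
levels(fruits)       # "apple" "banana" "orange"

# Underneath, each value is stored as an integer code pointing at a level.
as.integer(fruits)   # 1 2 3 1
```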

## $levels
## [1] "apple" "banana" "orange"
##
## $class
## [1] "factor"

By default the levels are sorted alphabetically and the factor itself is unordered.

To control their order, define the levels explicitly.
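A sketch of setting the level order by hand through the levels argument. The fruit values are illustrative, and the sizes example with ordered = TRUE is an extra showing how to make the order meaningful for comparisons:

```r
# Same data as before, but with an explicit level order.
fruits <- factor(c("apple", "banana", "orange", "apple"),
                 levels = c("apple", "orange", "banana"))

attributes(fruits)
as.integer(fruits)   # 1 3 2 1, the codes follow the new level order

# ordered = TRUE additionally allows comparisons between levels.
sizes <- factor(c("low", "high"), levels = c("low", "high"), ordered = TRUE)
sizes[1] < sizes[2]  # TRUE
```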

## $levels
## [1] "apple" "orange" "banana"
##
## $class
## [1] "factor"

Dataframes

Dataframes are used to store tabular data.

A dataframe is essentially a list of vectors of equal length, one per column.
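A minimal sketch with data.frame(); df is an illustrative name and the values are chosen to match the output shown next:

```r
# Each argument becomes a column; all columns must have the same length.
df <- data.frame(city = c("Jaipur", "Jammu"), rank = c(2, 3))
print(df)

# Under the hood a dataframe is a list of columns.
is.list(df)   # TRUE
```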

##     city rank
## 1 Jaipur    2
## 2  Jammu    3

The data stored can be of different types.

One column may be character, another may be a factor, and so on.

But all the values within a single column must have the same type.
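To check this on a small example (the data is made up, and stringsAsFactors = FALSE is passed so the result does not depend on the R version):

```r
df <- data.frame(city = c("Jaipur", "Jammu"), rank = c(2, 3),
                 stringsAsFactors = FALSE)

# Different columns can have different classes.
sapply(df, class)   # city: "character", rank: "numeric"

# Within one column, mixed values get coerced to a single type.
df$rank <- c(2, "three")
class(df$rank)      # "character"
```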
