R: rank vs. order

Rebecca Peltz
Jun 12, 2018

If you’re learning R you’ve come across the sort, rank and order functions.
Because there is similarity and even overlap in the semantics, questions come up: what exactly does each do and what are the use cases for each?

All three functions require that the values they operate on are comparable.
Comparisons in R can apply to string, numeric, complex and logical data types.
Rank vs Order Confusion

Sort, Rank, and Order are functions in R.
They can be applied to a vector or a factor.
If you are used to thinking of data in terms of rows and columns, a vector represents a column of data.
A factor is created from a vector and represents discrete labeled values.
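To make the vector and factor distinction concrete, here is a minimal sketch. The sizes data and the level ordering are hypothetical, chosen only for illustration. It shows that a character vector sorts alphabetically, while a factor sorts by the order of its levels.

```r
# Hypothetical data for illustration; these names are not from the article.
sizes <- c("medium", "small", "large", "small")

# A character vector sorts alphabetically.
sorted_chr <- sort(sizes)

# A factor sorts by the order of its levels, which we choose here.
size_factor <- factor(sizes, levels = c("small", "medium", "large"))
sorted_fct <- sort(size_factor)

# order() on a factor returns positions, following the level order.
order_fct <- order(size_factor)

print(sorted_chr)
print(as.character(sorted_fct))
print(order_fct)
```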
In the R code below, X is loaded with data and then sorted, ranked, and ordered.
R reports the results as vectors.

    X = c(1,2,3)
    X
    1 2 3
    sort(X)
    1 2 3
    rank(X)
    1 2 3
    order(X)
    1 2 3

It seems clear enough:

- you load data into a vector using the “c”ombine function
- when you view X it appears arranged as it was loaded
- when you sort X, you see a vector containing values from X arranged in ascending order
- when you rank X, you see a vector containing values from X arranged in ascending order (like sort)
- when you order X, you see a vector containing values from X arranged in ascending order (like sort)

Now, let’s apply a simple permutation when creating the X vector and run these functions.

    X = c(2,3,1)
    X
    2 3 1
    sort(X)
    1 2 3
    rank(X)
    2 3 1
    order(X)
    3 1 2

In the output above the sort function affirms what we stated above, but the rank and order are more difficult to explain.
Now, look at a different vector with a similar permutation on a different range of integers.

    X = c(5,6,4)
    X
    5 6 4
    sort(X)
    4 5 6
    rank(X)
    2 3 1
    order(X)
    3 1 2

In the code above we see the same rank and order for “5, 6, 4” as we did for “2, 3, 1”.
These two sequences have the same rank and order because rank and order report on relative locations as opposed to relative values.
Rank and order are based on the results of an ascending sort of the data in the vector.
Specifically the range of values returned by rank and order is the range of indexes of values in the original sequence.

- Rank references the position of the value in the sorted vector and is in the same order as the original sequence
- Order returns the position of the original value and is in the order of the sorted sequence, that is, smallest value to largest value

The graphic below helps tie together the values reported by rank and order with the positions from which they come.
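The two definitions above can be checked directly in code. This is a minimal sketch using the “5, 6, 4” vector from earlier. The names ord and rnk are my own, and the last check assumes the vector has no ties.

```r
X <- c(5, 6, 4)

ord <- order(X)  # positions in X of the smallest, next smallest, ... value
rnk <- rank(X)   # standing of each element of X, kept in X's own arrangement

# Indexing X by its order vector gives the sorted vector.
stopifnot(all(X[ord] == sort(X)))

# Indexing the sorted vector by the rank vector gives X back.
stopifnot(all(sort(X)[rnk] == X))

# With no ties, the rank vector is the order of the order vector.
stopifnot(all(order(ord) == rnk))

print(ord)
print(rnk)
```

Read this way, order answers “where do I find the next smallest value?” and rank answers “where does this value end up after sorting?”.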
Involutive Cycles

The “1,2,3” sequence first presented that returned the vector “1,2,3” for both Rank and Order is actually a special sequence because these values and several other permutations of “1,2,3” cause rank and order to behave as involutory functions.
An involutory function is a function that is its own inverse.

    X = c(1,2,3)
    # each of the following evaluates to TRUE for this X
    all(rank(X) == X) && all(order(X) == X)
    all(rank(order(X)) == X)
    all(order(rank(X)) == 1:length(X))

In the code below, you can see all six of the permutations of “1,2,3” tested to see if they are involutive.
The two permutations that do not result in involutive functionality can be identified by the cycles which they break down into.
See the article rank vs order in R below for more information on involutive cycles.

    X = c(1,2,3)
    all(order(rank(X)) == X)
    TRUE
    X = c(2,3,1)
    all(order(rank(X)) == X)
    FALSE
    X = c(3,1,2)
    all(order(rank(X)) == X)
    FALSE
    X = c(1,3,2)
    all(order(rank(X)) == X)
    TRUE
    X = c(2,1,3)
    all(order(rank(X)) == X)
    TRUE
    X = c(3,2,1)
    all(order(rank(X)) == X)
    TRUE
    all(order(X)[rank(X)] == rank(X)[order(X)])
    TRUE
    all(order(X)[rank(X)] == 1:length(X))
    TRUE

While it’s tempting when learning to look at simple data sets to help understand the behavior of functions, it can lead to confusing conclusions when the arrangement of the data affects the output of the functions.
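One way to see why exactly two of the six permutations fail: for a permutation of 1..n, rank returns the permutation itself and order returns its inverse, so order(rank(X)) equals X only when the permutation is its own inverse. The sketch below tests that directly. The helper is_involution is a hypothetical name of mine, not an R built-in.

```r
# Hypothetical helper: TRUE when applying the permutation twice returns 1..n.
is_involution <- function(p) all(p[p] == seq_along(p))

perms <- list(c(1, 2, 3), c(2, 3, 1), c(3, 1, 2),
              c(1, 3, 2), c(2, 1, 3), c(3, 2, 1))

for (p in perms) {
  # For a permutation of 1..n, order(p) is the inverse permutation,
  # so order(rank(p)) equals p exactly when p is its own inverse.
  cat(p, ":", is_involution(p), all(order(rank(p)) == p), "\n")
}
```

The two failing cases, “2,3,1” and “3,1,2”, are the ones that cycle all three positions, so applying them twice does not return to the start.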
Sorted Sequences

For any vector sequence in ascending order, the code below demonstrates the relationship between Order and Rank as they interact with each other.
The Order of the Rank will always equal the Rank of the Order.

    X = c(100,200,300)
    all(order(X)[rank(X)] == rank(X)[order(X)])
    TRUE

In addition, the code below verifies that for any sequence in ascending order both the Order of the Rank and the Rank of the Order will always equal a vector made up of the positions of the ordered elements.

    X = c(100,200,300)
    all(order(X)[rank(X)] == 1:length(X))
    TRUE
    all(rank(X)[order(X)] == 1:length(X))
    TRUE
    1:length(X)
    1 2 3

Use Case for the Order Function

You can use the order function to sort a dataframe.
The sort command creates a new vector, arranged in ascending sequence, from any vector of comparable values.
The default sort order is ascending, but there are options to make it descending, as well as options for dealing with undefined values and specifying a sorting method.
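As a minimal sketch of those options, the calls below use the decreasing, na.last and method arguments of base R’s sort. The vector X is illustrative data of my own, with NA standing in for an undefined value.

```r
X <- c(3, NA, 1, 2)

asc     <- sort(X)                     # NA values are dropped by default
desc    <- sort(X, decreasing = TRUE)  # descending order
with_na <- sort(X, na.last = TRUE)     # keep NA and place it at the end
radix   <- sort(X, method = "radix")   # explicitly choose a sorting method

print(asc)
print(desc)
print(with_na)
print(radix)
```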
When you read data from a file system into a data frame or construct the data frame in code, you have a structure that contains rows and columns of data which may be of different types.
In order to “sort” the rows of a data frame by column values, whether it’s a single column or multiple columns, you must use the order command as the sort command only sorts vectors.
To see how this works, the example below builds up a data frame from raw data loaded into vectors.
This data could easily have been read in from a CSV or other formatted text file as well.
Note: enclosing the last instruction in parentheses causes the data frame to be referenced by the test.data variable and displays what’s in test.data.
The first integer in the display is a counter identifier assigned by R to the rows in the data frame.

    size = 5
    sex = sample(c("male","female"), size, replace=T)
    age = sample(21:100, size, replace=T)
    degree = sample(c("BA","BS","MS","MBA"), size, replace=T)
    (test.data = data.frame(sex=sex, age=age, degree=degree))

         sex age degree
    1 female  30     BA
    2   male  49     BA
    3   male  39    MBA
    4   male  27     MS
    5   male  61     MS

We can sort the data by age using the order command.
The order function is passed the name of the column to order by and the order is ascending.
The result of the order command is a vector in which each value is the position of an item in the original data frame, and the value’s own position in the vector is where that item lands in the sorted data.
For example the 1st age in the original data frame is 30 and in the sorted data frame 30 will be in the 2nd position.
Therefore, the value 1 is located in the 2nd position of the order vector.
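To trace that reasoning by hand, here is a small sketch that uses only the five ages from the sample output in this article. The standalone age vector and the name ord are my own, and since the data frame was built with sample(), your own run will show different ages.

```r
# Ages as displayed in the article's sample data frame, rows 1 to 5.
age <- c(30, 49, 39, 27, 61)

ord <- order(age)

print(ord)              # original row positions, youngest to oldest
print(age[ord])         # the ages in ascending order
print(which(ord == 1))  # where original row 1 lands in the sorted result
```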
Once the order vector is obtained, it is used to extract data from the original test.data.
You can see the original counter id in the result and how it matches the order vector used to do the sort.
R extracts data from a data frame (or matrix) using the square brackets with a Row, Column designation.

    order(test.data$age)
    4 1 3 2 5
    test.data[order(test.data$age),]

         sex age degree
    4   male  27     MS
    1 female  30     BA
    3   male  39    MBA
    2   male  49     BA
    5   male  61     MS

The data frame can be sorted in descending order by using the negative sign in front of the column name specified by the order command.

    order(-test.data$age)
    5 2 3 1 4
    test.data[order(-test.data$age),]

         sex age degree
    5   male  61     MS
    2   male  49     BA
    3   male  39    MBA
    1 female  30     BA
    4   male  27     MS

We can also provide multi-column sorts by adding multiple columns to the order command.

    order(test.data$degree, -test.data$age)
    2 1 3 5 4
    test.data[order(test.data$degree, -test.data$age),]

         sex age degree
    2   male  49     BA
    1 female  30     BA
    3   male  39    MBA
    5   male  61     MS
    4   male  27     MS

Use Case for the Rank Function

You can use the rank function to create a value that represents the relative standing of a value within its sequence.
The IEEE provided a list of top 10 programming languages for 2017.
They are stored in a file, in my local file system, sorted in alphabetical order by language name.
The code below will read them into a variable named language.ieee and display the contents.

    …csv"))

        X language score
    1   2        C  99.7
    2   5       C#  87.7
    3   4      C++  97.1
    4   9       Go  75.1
    5   3     Java  99.5
    …
    7   8      PHP  81.2
    8   1   Python 100.0
    9   6        R  87.7
    10 10    Swift  73.1

We can get a vector of the ranked data.
The data in the rank vector appears as float because there is a tie: C# is tied with R for 5th and 6th place.
There are options for dealing with ties in the rank function, but the default is to use the “average” method and assign each the average value.
Each value in the rank vector gives the standing of the score at the same position in the original data set, counted upward from the smallest score.
A higher rank value represents a larger data value.
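To see how the tie handling plays out, here is a minimal sketch. The score vector is a short hypothetical one that I chose to contain a single tie; it is not the IEEE file. It compares the default “average” method with two of the other ties.method options in base R’s rank.

```r
# Hypothetical scores chosen to include one tie (87.7 appears twice).
score <- c(99.7, 87.7, 97.1, 87.7, 100.0)

r_avg   <- rank(score)                         # default ties.method = "average"
r_min   <- rank(score, ties.method = "min")    # tied values share the lower rank
r_first <- rank(score, ties.method = "first")  # ties broken by position

print(r_avg)
print(r_min)
print(r_first)
```

Only the “average” method produces fractional ranks, which is why the rank vector displays as floats when there is a tie.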

    rank(language.ieee$score)
    9.0 5.5 7.0 2.0 8.0 4.0 3.0 10.0 5.5 1.0

I can use the rank vector to order the data by rank, that is, the descending order of scores, by supplying the negative rank to the order command.

    language.ieee[order(-rank(language.ieee$score)),]

        X language score
    8   1   Python 100.0
    1   2        C  99.7
    5   3     Java  99.5
    3   4      C++  97.1
    2   5       C#  87.7
    9   6        R  87.7
    …
    7   8      PHP  81.2
    4   9       Go  75.1
    10 10    Swift  73.1

Calculating rank is not only used for ordering data.
Correlation of rankings can be used to test the null hypothesis that there is no relationship between two variables.
Since variables may differ in type and scale, rank provides a sort of normalization.
For example see studies on the use of Spearman’s Rank Correlation: https://geographyfieldwork.
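As a minimal sketch of that idea, the code below checks that Spearman’s correlation is the ordinary (Pearson) correlation computed on ranks. The x and y vectors are hypothetical measurements on different scales, and cor and cor.test are the base R stats functions.

```r
# Hypothetical paired measurements on different scales.
x <- c(10, 20, 30, 40, 50)
y <- c(1.2, 0.9, 3.4, 2.8, 7.5)

rho_direct <- cor(x, y, method = "spearman")
rho_ranks  <- cor(rank(x), rank(y))  # Pearson correlation of the ranks

print(rho_direct)
print(rho_ranks)

# cor.test reports a p-value for the null hypothesis of no association.
print(cor.test(x, y, method = "spearman")$p.value)
```

Because only the ranks enter the calculation, rescaling either variable by any increasing transformation leaves the result unchanged.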
Conclusion

R is a statistical programming language with many functions that help with formatting and processing data.
Its services are made available through function calls.
In addition to reading the documentation it helps to run data sets through these functions to help figure out what exactly they do.
Sort, Order, and Rank are semantically similar, but in practice they have very different uses and sometimes work with each other to achieve the desired result.