Parallelising (multi-processing) in R: prime number example

LU ZOU
Mar 5

Motivation

This is a practice blog to implement multi-processing in R.

R and Python are the most popular programming languages in data science, and both have their advantages in different aspects.

This blog compares them in terms of data collection, exploration, modelling and visualization.

For multi-processing, I have to say R is much easier and more convenient to use.

I will use the prime number example used in my Python parallelization blog.

Given an integer, the following function determines whether it is a prime number.

It returns a vector of two elements: the input integer and whether it is prime (0 for no, 1 for yes).

    is_prime <- function(n) {
      # Numbers below 2 and even numbers above 2 are not prime
      if ((n < 2) || ((n %% 2 == 0) && (n > 2))) {
        return(c(n, FALSE))
      } else if (n == 2 || n == 3) {
        return(c(n, TRUE))
      } else {
        # Trial division by odd numbers up to about sqrt(n)
        for (i in seq(3, ceiling(sqrt(n) + 1), 2)) {
          if (n %% i == 0) {
            return(c(n, FALSE))
          }
        }
        return(c(n, TRUE))
      }
    }

The R library that implements both socket and forking parallelization is:

    library(parallel)

I will use lapply() from the *apply() functions as the benchmark, because it usually runs faster than a for loop and is tidier to use.

    system.time({
      prime_results1 <- lapply(1:1000000, is_prime)
    })

Time for lapply() is 16.71s.
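Because each call returns a two-element vector, the result of lapply() is a list of small vectors. A minimal sketch of how such a list could be collapsed into a two-column matrix and the primes counted, using a much smaller range so it runs instantly. The function repeats the logic of is_prime() above so the block is self-contained, and the names `results`, `res_mat`, `n_primes` and `primes` are only illustrative:

```r
# Same logic as is_prime() above, repeated so the sketch runs on its own
is_prime <- function(n) {
  if ((n < 2) || ((n %% 2 == 0) && (n > 2))) {
    return(c(n, FALSE))
  } else if (n == 2 || n == 3) {
    return(c(n, TRUE))
  } else {
    for (i in seq(3, ceiling(sqrt(n) + 1), 2)) {
      if (n %% i == 0) return(c(n, FALSE))
    }
    return(c(n, TRUE))
  }
}

results <- lapply(1:100, is_prime)

# Stack the two-element vectors into a 100 x 2 matrix
res_mat <- do.call(rbind, results)
colnames(res_mat) <- c("n", "is_prime")

# The second column holds 0/1, so its sum is the number of primes
n_primes <- sum(res_mat[, "is_prime"])

# Keep only the integers flagged as prime
primes <- res_mat[res_mat[, "is_prime"] == 1, "n"]
```

The same post-processing should apply unchanged to the output of the parallel versions, since they return a list of the same shape.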

There are two methods for parallelization: socket and forking.

Forking is not available on Windows, so it gives no speed-up in running time there, which I will demonstrate later.

Parallel Processing in R describes the differences between socket and forking, as well as many details that are not covered in this blog.

Socket parallelization

Socket parallelization runs a separate process on each node, so there is no cross-contamination between nodes.

As a result, any libraries need to be loaded on each node.

For the prime number task, we don’t need to load any extra libraries.
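For tasks that do depend on a package or on objects from the main session, the workers have to be prepared explicitly. A minimal sketch, assuming a small two-node cluster and using the base package `tools` purely as a stand-in for whatever library a real task needs. The names `offset` and `add_offset` are hypothetical:

```r
library(parallel)

cl <- makeCluster(2)  # two workers, chosen only to keep the example small

# Run library() on every worker; each node is its own R session
clusterEvalQ(cl, library(tools))

# Objects in the main session are not visible on the workers until exported
offset <- 10
add_offset <- function(x) x + offset

# Export from the current environment (the default is the global environment)
clusterExport(cl, c("offset", "add_offset"), envir = environment())

shifted <- parLapply(cl, 1:5, function(x) add_offset(x))

stopCluster(cl)
```

clusterEvalQ() evaluates an expression on every node, while clusterExport() copies named objects to them, so the two cover libraries and data respectively.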

The general steps to apply socket parallelization are:

1. Start a cluster with n nodes (normally the number of cores available);
2. Apply any pre-processing code on each node (e.g. loading libraries);
3. Apply par*apply() functions, here parLapply();
4. Close the cluster.

    system.time({
      cl <- makeCluster(detectCores())
      clusterEvalQ(cl, {})  # load libraries inside {}
      prime_results <- parLapply(cl, 1:1000000, is_prime)
      stopCluster(cl)
    })

Time for socket parallelization is 6.51s.

Forking parallelization

In R, simply replacing the *apply() functions with their mc*apply() counterparts, such as mclapply(), applies forking parallelization.

As mentioned above, this does not achieve any speed gain on Windows, where it simply calls the corresponding *apply() function under the hood.

    system.time({
      prime_results2 <- mclapply(1:1000000, is_prime)
    })

Time for forking parallelization is 16.37s.
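On Unix-alikes, mclapply() takes the number of forked workers from its mc.cores argument, which defaults to getOption("mc.cores", 2L), so it may be worth setting explicitly. A minimal sketch, assuming one core is left free for the rest of the system and using a trivial squaring function as a stand-in for is_prime. The names `n_cores`, `squares_serial` and `squares_forked` are illustrative:

```r
library(parallel)

# Forking is unavailable on Windows, where mc.cores must stay at 1.
# detectCores() can return NA, hence na.rm = TRUE.
n_cores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, detectCores() - 1L, na.rm = TRUE)
}

squares_serial <- lapply(1:20, function(x) x^2)
squares_forked <- mclapply(1:20, function(x) x^2, mc.cores = n_cores)
```

Both calls return the same list. Only the way the work is distributed differs.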

Summary

If you are not on Windows, forking parallelization is the better and more convenient choice.

On Windows, socket parallelization works just fine for the purpose.
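One way to act on this advice in code that has to run on both kinds of system is to choose the backend at run time. A minimal sketch, not part of the parallel package: `portable_lapply` is a hypothetical helper, and the default of two cores is an arbitrary assumption:

```r
library(parallel)

# Hypothetical helper: fork on Unix-alikes, use a socket cluster on Windows
portable_lapply <- function(X, FUN, n_cores = 2L) {
  if (.Platform$OS.type == "windows") {
    cl <- makeCluster(n_cores)
    on.exit(stopCluster(cl))  # close the cluster even if FUN fails
    parLapply(cl, X, FUN)
  } else {
    mclapply(X, FUN, mc.cores = n_cores)
  }
}

is_even <- function(n) n %% 2 == 0
evens <- portable_lapply(1:10, is_even)
```

On the socket branch, any packages or global objects that FUN relies on would still need to be loaded or exported to the workers first.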

Although socket parallelization needs a little more setup code, parallelization in R is much easier and more straightforward than in Python.
