Pythran: Python at C++ speed!

Case Study: Simulated Annealing applied to the TSP

Olivier Borderies, Jan 23

Python is a high-level, versatile language that is almost as easy to read and write as pseudo code.

On top of it, Numpy is a ubiquitous package for scientific computing that eases the use of N-dimensional array objects.

Together they are ideal for implementing algorithms concisely, and Numpy array operations are fast because they are often compiled to C or Fortran.

Pythran helps convert Python+Numpy functions to native code with only a few annotations and a relatively simple installation process.

More precisely its core developer, serge_sans_paille (whose commitment to the project and reactivity are quite impressive, by the way), describes it as “an ahead of time compiler for a subset of the Python language, with a focus on scientific computing.

It takes a Python module annotated with a few interface descriptions and turns it into a native Python module with the same interface, but (hopefully) faster.

It is meant to efficiently compile scientific programs, and takes advantage of multi-cores and SIMD instruction units.

Pythran supports Python 2.7 and also has a decent Python 3 support.” It works on Linux, macOS, and Windows.

In this article we try to show how simply it can be used on a relatively generic case: the search for solutions to the Traveling Salesman Problem using the Simulated Annealing class of algorithms.

All the code demonstrated in this article is available in this GitLab repo: oscar6echo/tsp-pythran.

Traveling Salesman Problem (TSP)

The well-known Traveling Salesman Problem (TSP) asks: “Given a list of N cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?”

It is an NP-hard problem with a huge candidate solution space of (N-1)!/2 tours.
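To get a feel for the size of that space, here is a quick back-of-the-envelope computation (a sketch added for illustration, not from the original repo):

```python
import math

def nb_candidate_tours(n):
    # fix the starting city (tours are rotation-invariant),
    # then divide by 2 because a tour has no direction
    return math.factorial(n - 1) // 2

print(nb_candidate_tours(5))    # 12
print(nb_candidate_tours(100))  # ~4.7e155, far beyond any brute force search
```

For 100 cities the count exceeds the number of atoms in the observable universe by a wide margin, which is why heuristics like Simulated Annealing are used.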

Consequently, attempting to solve it is very computing intensive.

Let us consider the problem of 100 cities randomly located in the [0, 1]x[0, 1] plane, generated by the demo-tsp-pythran notebook.

100 random cities in the [0, 1]x[0, 1] plane

Simulated Annealing (SA)

Simulated Annealing (SA) is a heuristic for approximating the global optimum of a given function.

It is often used when the search space is discrete, and works in a fixed amount of time.

Conveniently it does not impose any prerequisite on the function to optimise.

```
# Objective
# Find state with low, hopefully minimum, energy

# Inputs
# energy(): function state -> energy
# mutate(): function state -> state - typically random
# temperature T reduction profile from T_max down to T_min

# Start
s = init_state
for T from T_max to T_min:
    E_s = energy(s)
    m = mutate(s)  # random
    E_m = energy(m)
    dE = E_m - E_s
    if rand(0, 1) < exp(-dE/T):
        # always true for dE < 0
        # sometimes true for dE > 0, all the more so if T is high
        s = m
return s  # should be a low energy state
```

The general idea is that instead of moving only to a lower energy state, the algorithm is allowed to explore states that are not necessarily lower than the current best, in the hope of reaching regions containing lower optima, a bit like a high energy particle can move outside a local potential well, travel to lower potential regions, and ultimately settle there when its energy (i.e. temperature) subsides.
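The pseudo code above translates almost line for line into runnable Python. Here is a minimal sketch with a geometric cooling profile; the toy quadratic energy and the mutation step are hypothetical, chosen only to illustrate the acceptance rule:

```python
import math
import random

def simulated_annealing(energy, mutate, s0, t_max=10.0, t_min=1e-3, alpha=0.99):
    """Minimal SA loop: geometric cooling from t_max down to t_min."""
    s, t = s0, t_max
    while t > t_min:
        m = mutate(s)
        d_e = energy(m) - energy(s)
        # always accept improvements; sometimes accept degradations,
        # all the more so when the temperature is high
        if d_e < 0 or random.random() < math.exp(-d_e / t):
            s = m
        t *= alpha
    return s

random.seed(0)
best = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,                # hypothetical toy energy
    mutate=lambda x: x + random.uniform(-0.5, 0.5), # small random step
    s0=0.0,
)
```

With roughly 900 cooling steps, the returned state settles near the minimum at x = 3 despite the occasional uphill moves accepted early on.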

This is illustrated in the graph below.

from https://www.researchgate.net/publication/308786233_A_Review_of_Optimization_Techniques_in_Artificial_Networks

As well known as this algorithm is, it may not be as famous as it deserves: its extreme simplicity, its many tunable parameters, and its predictable runtime make it applicable to a wide range of real world problems.

As you can see there is a lot to tailor in this description:

- The energy(state) function
- The mutate(state) function
- The temperature reduction profile

The most important one is probably the mutate() function, which must be designed to generate an interesting candidate state from the current one.

Obviously this function is eminently related to the problem to solve.

You can see it as a kind of genetic mutation, which is either kept or discarded.

In our TSP case, the mutations applied at each step are, for i, j random path positions:

- Reverse the path between the cities in i-th and j-th positions
- Move the city in i-th position to the j-th position
- Swap the city in i-th position with the city in j-th position

Which of these 3 mutations occurs at each step is also random.
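These 3 mutations are straightforward to sketch in Python. The helper names below are hypothetical; the actual implementation lives in tsp_compute_single_threaded.py:

```python
import random

def reverse(path, i, j):
    # reverse the sub-path between positions i and j (inclusive)
    i, j = min(i, j), max(i, j)
    return path[:i] + path[i:j + 1][::-1] + path[j + 1:]

def move(path, i, j):
    # remove the city at position i and re-insert it at position j
    p = path[:i] + path[i + 1:]
    return p[:j] + [path[i]] + p[j:]

def swap(path, i, j):
    # exchange the cities at positions i and j
    p = list(path)
    p[i], p[j] = p[j], p[i]
    return p

def mutate(path):
    # pick two distinct random positions and one of the 3 mutations
    i, j = random.sample(range(len(path)), 2)
    return random.choice([reverse, move, swap])(path, i, j)
```

Each mutation returns a new permutation of the cities, so the tour always remains valid.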

If evaluating the energy of a state is expensive, then it may be more efficient to keep track of the states already visited and avoid re-evaluating the same energy twice.

In the case of the TSP, the unique signature of a state is the tuple of cities (mapped to integers from 0 to N-1), starting with 0 and such that the second element is less than the last, to account for the fact that a city tour has no direction.
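Such a signature can be sketched as follows (a hypothetical helper illustrating the convention just described, not the repo's code):

```python
def signature(path):
    """Canonical signature of a tour: rotate so city 0 comes first,
    then orient so the second element is less than the last."""
    i = path.index(0)
    rotated = path[i:] + path[:i]
    if rotated[1] > rotated[-1]:
        # same tour traversed in the other direction: flip the tail
        rotated = rotated[:1] + rotated[:0:-1]
    return tuple(rotated)
```

Tours that differ only by starting city or direction then map to the same signature, so they are counted once in the store.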

The TSP energy() function is extremely simple and fast to compute, so the overhead of keeping track of signatures is not worth it (as experience shows).

But in other cases the trade off may be in favour of checking signatures before evaluating a state.

For a more precise description see tsp_compute_single_threaded.py. The store_stats dict keeps track of how often signature collisions take place:

```python
if check_signature:
    store_stats = {
        'nb_step': nb_step,
        'store_len': len(store),
        'ratio': len(store)/nb_step,
        'compute_energy': compute_energy,
    }
    return (best_energy, best_path, store_stats)
else:
    return (best_energy, best_path, {})
```

Pythran Annotations

When writing any algorithm it is a good idea to start in pure Python (and Numpy if it can be concisely expressed in matrix form), then benchmark and fine tune it until it works as expected.

Indeed, “premature optimisation is the root of all evil (or at least most of it) in programming” (Donald Knuth).

Well Pythran is totally compatible with this approach: the only tiny addition you have to make in the source code is an annotation (as a regular Python comment line starting with “# pythran”) for the numerical function you want to accelerate.

From this interface information Pythran turns it into C++ source code, compiles it and creates a native Python module.

Here is an example of annotation:

```python
# pythran export search_for_best(int, float list list, int, float, int, float, float)
def search_for_best(seed, cities, nb_step, beta_mult=1.005,
                    accept_nb_step=100, p1=0.2, p2=0.6):
    """exported function"""
    # etc
```

More generally Pythran annotations are allowed anywhere in a module, but most likely at the top of the file or just above the functions to compile.

They have the following syntax:

```
# pythran export function_name(argument_type*)
```

where function_name is the name of a function defined in the module, and argument_type* is a comma separated list of argument types, composed of any combination of basic types and constructed types.

Below is the list of basic types. Constructed types are either tuples, introduced by parentheses, like (int, (float, str)), or lists (resp. sets), introduced by the list (resp. set) keyword. Here is the full grammar:

```
argument_type = basic_type
              | (argument_type+)         # this is a tuple
              | argument_type list       # this is a list
              | argument_type set        # this is a set
              | argument_type []+        # this is a ndarray, C-style
              | argument_type [::]+      # this is a strided ndarray
              | argument_type [:,...,:]+ # this is a ndarray, Cython style
              | argument_type [:,...,3]+ # this is a ndarray, some dimension fixed
              | argument_type:argument_type dict  # this is a dictionary

basic_type = bool | int | float | str | None
           | uint8 | uint16 | uint32 | uint64 | uintp
           | int8 | int16 | int32 | int64 | intp
           | float32 | float64 | float128
           | complex64 | complex128 | complex256
```

It is rich enough to accommodate the vast majority of numerical functions.

A key point is that ndarrays are accepted, which enables regular Python + Numpy users to accelerate their code “as is”.

They only have to isolate the compute intensive operations in some fixed type interfaced functions, which is natural to do anyway.
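For instance, the tour-length computation could be isolated and annotated like this. This is a hypothetical sketch (the repo's actual function differs): the annotation is an ordinary comment, so the file remains a perfectly valid pure-Python module that also runs without Pythran:

```python
import numpy as np

# pythran export total_dist(float64[:,:], int64[:])
def total_dist(cities, path):
    """Length of the closed tour visiting cities in path order.

    cities: (N, 2) array of coordinates; path: permutation of 0..N-1.
    """
    ordered = cities[path]
    # distance of each leg to the next city, including the closing
    # leg from the last city back to the first
    legs = np.sqrt(((ordered - np.roll(ordered, -1, axis=0)) ** 2).sum(axis=1))
    return legs.sum()
```

Compiled or not, callers see the same interface, which is exactly the workflow the article advocates: prototype in Numpy, then hand the hot function to Pythran unchanged.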

Install Pythran

Pythran can be installed on the 3 main OS families: Linux, macOS, Windows.

Below I assume miniconda3 is installed.

Here are the steps that worked for me.

Linux

Here are the steps that worked on a cloud VM. Also check the script_setup_linux.sh.

Install compiler:

```shell
# ubuntu
$ sudo apt-get update
$ sudo apt-get -y install g++
```

Install Pythran in a dedicated conda env:

```shell
$ conda create -n pythran python=3
$ source activate pythran
$ pip install numpy
$ pip install pythran
```

macOS

Here are the steps that worked on macOS (High Sierra and Mojave). Also check the script_setup_macos.sh.

Install compiler with homebrew:

```shell
$ brew install llvm
```

Install Pythran in a dedicated conda env:

```shell
$ conda create -n pythran python=3 -y
$ source activate pythran
$ conda install -c conda-forge pythran -y
```

The alternate way, pip install pythran, misses some low level dependencies like blas.

So I would recommend the conda install for simplicity.

SDK Headers

On macOS Mojave you may have to install the SDK headers for macOS 10.14. To do that:

```shell
# sudo required
$ xcode-select --install
```

Then go to /Library/Developer/CommandLineTools/Packages/. There is the package macOS_SDK_headers_for_macOS_10.14.pkg. Install it (double click or Cmd+O).

Create a ~/.pythranrc file:

```
[compiler]
cflags=-std=c++11 -fno-math-errno -w
ldflags=-L/usr/local/opt/llvm/lib  # from `brew info llvm`
CC=/usr/local/opt/llvm/bin/clang  # brew installed clang path
CXX=/usr/local/opt/llvm/bin/clang++  # brew installed clang++ path
```

Windows7

Here are the steps that worked on a Windows7 desktop.

Install compiler: install Visual Studio Community 2017 with the workload “Desktop Development with C++”.

Install Pythran in a dedicated conda env:

```shell
$ conda create -n pythran python=3
$ source activate pythran
$ pip install numpy
$ pip install pythran
```

Make sure ~/.pythranrc does not exist, is empty, or is as follows:

```
[compiler]
defines=
undefs=
include_dirs=
libs=
library_dirs=
cflags=/std:c++14
ldflags=
blas=blas
CC=
CXX=
ignoreflags=
```

Compile and Run with Pythran

Compile

After the code is annotated as explained above I just compiled the 2 files, each containing one annotated function.

tsp_compute_single_threaded.py:

```shell
# regular compilation - under pythran env
# In my case it's safe to use -Ofast (affects precision)
$ pythran -Ofast -march=native tsp_compute_single_threaded.py
```

tsp_compute_multi_threaded_omp.py:

```shell
# compilation with omp - under pythran env
# the compilation flags activate OMP and vectorization using
# https://github.com/QuantStack/xsimd
$ pythran -DUSE_XSIMD -fopenmp -march=native tsp_compute_multi_threaded_omp.py
```

In each case, if Pythran is properly installed, it silently creates a native module named as follows:

- on macOS: tsp_compute_[xxx].cpython-37m-darwin.so
- on Linux: tsp_compute_[xxx].cpython-37m-x86_64-linux-gnu.so
- on Windows: tsp_compute_[xxx].cp37-win_amd64.pyd

Run

The native module behaves like a regular Python module to the user who imports it, except it runs faster! If both regular and native modules exist, the native module is imported.

ParallelismLooking for speed does not stop at accelerating numerical operations.

Going parallel where possible is also an important source of improvement.

In our TSP example, SA is a probabilistic search, so it can be parallelised.

Here we have tested 2 techniques:

- concurrent.futures at Python level
- OMP (Open Multi Processing) through Pythran

OMP

OMP (Open Multi Processing) allows multi-threaded programming through a relatively simple API.

It enables a developer to describe parallelism in C++ code through OMP directives, lines starting with “#pragma omp”.

Pythran allows a Python developer to write such directives in Python code as comments starting with “# omp” and passes them to the compiler.

OMP is powerful but as for most low level stuff, if anything goes wrong, it may be tricky to know why and correct it.

See the file tsp_compute_multi_threaded_omp.py. The two annotations enabling OMP are:

“parallel for”:

```python
# omp parallel for
for x in range(n_run):
    beta = 1.0
    n_accept = 0
    # etc.
```

“critical”:

```python
# omp critical
```

concurrent.futures

For coarser parallelism across all CPUs of a machine at Python level, using concurrent.futures is a very efficient alternative for this algorithm.

See the search_concurrent() function in tsp_wrapper.py. It uses ProcessPoolExecutor to spawn sub processes and as_completed to collect the results, as demonstrated in this example in the official concurrent.futures documentation.
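The pattern boils down to something like the following sketch. The worker is a hypothetical stand-in for the Pythran-compiled search_for_best(), not the repo's code; what matters is the shape: independent seeded tasks submitted to a process pool, results collected as they complete:

```python
import random
from concurrent.futures import ProcessPoolExecutor, as_completed

def one_search(seed, nb_step=1000):
    # hypothetical stand-in for the compiled SA search: each task
    # runs an independent seeded search and returns its best energy
    rng = random.Random(seed)
    return min(rng.random() for _ in range(nb_step))

def search_concurrent(seeds):
    results = []
    with ProcessPoolExecutor() as executor:
        # one sub process per seed; tasks run on all available cores
        futures = {executor.submit(one_search, s): s for s in seeds}
        for fut in as_completed(futures):
            results.append((futures[fut], fut.result()))
    # keep the best (lowest energy) result across all independent runs
    return min(r for _, r in results)

if __name__ == "__main__":
    best = search_concurrent(range(4))
```

Because each search is fully independent, this coarse-grained parallelism scales almost linearly with the number of cores, which matches the performance figures reported below.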

Demo package and notebook

This demo-tsp-pythran notebook contains the user interface for the tsp-pythran demo package. It allows you to:

- Compile the tsp_compute_(single|multi)_threaded.py functions into native modules. IMPORTANT: Restart the kernel after compiling to load the native modules
- Remove the native modules to go back to Python modules
- Generate a random set of cities and set search params
- Run both search implementations, “concurrent” and “OMP”, with and without signature check (see below)
- Visualise and save the results

Here are the 3 best routes found for several distributions of 100 cities; each search used 32 concurrent runs with 1e6 steps without signature check:

3 good solutions (random seed 54321)
3 good solutions (random seed 12345)
3 good solutions (random seed 33333)
3 good solutions (random seed 11111)

Performance

I ran the TSP search:

- in 2 versions, “concurrent” and “OMP”, with/without “signature check”
- on 3 types of machines: laptop (macOS), desktop (iMac), Linux (VM)

The machines:

- laptop: macOS MacBook Pro 2017, 2.3 GHz Intel Core i5
- desktop: macOS iMac 2014, 4.0 GHz Intel Core i7
- VM GCP: Linux n1-highcpu-64, Ubuntu 18.04

The signature check did not prove efficient for the TSP problem, as anticipated.

So I focused on the simpler version without signature check.

The table below shows the results for a TSP of 100 cities, each task with 1e6 steps. It was created by running the demo-tsp-pythran-perf-measures.ipynb notebook on the 3 machines, and aggregating results with the demo-collect-perfs.ipynb notebook.

Note: The numbers were produced for only one run in each case so expect some variability if you re-run the notebooks.

The number of cores taken into account is:

- for macOS: the physical (not logical) number of cores
- for the Linux VM: the number of CPUs (64) divided by the number of threads per core

With this definition (i.e. half of the “marketing” definition for these 3 machines), several concurrent tasks run in about the same time as one task, provided the number of concurrent tasks does not exceed the number of cores, which validates this “definition” of the number of cores.

The Pythran speed up is ~x16 on macOS and ~x32 on the Linux VM.

Adding concurrency to Pythran, the increase in the number of steps is ~x32 on the laptop, ~x70 on the desktop, and a whopping ~x800 on the Linux VM!

It must be emphasised that this speed up is obtained with:

- minuscule extra work for Pythran!
- a reasonable effort for concurrent futures
- a decent price (e.g. ~$2.3/hour on GCP, $0.8 if preemptible)

The OMP version performance is just below that of the concurrent version on macOS.

But the extra work to use OMP is almost zero with Pythran (2 annotations) while using concurrent futures is more substantial, though not daunting.

What puzzles me is that the promise of OMP crashes catastrophically on the Linux VM.

For some reason (unknown to me), it does not handle OMP as well as macOS… This is a pity because if it did, the ratio speed up / effort would beat any other solution.

Now the remarks above consider raw speed only.

In the case of the SA algorithm, search breadth (number of threads) and depth (number of steps in one single threaded search) are not equivalent.

It turns out that depth is much more critical to finding good solutions.

So practically the benefit of the Linux VM is that it searches more broadly and consequently is more likely to find good quality solutions in a slightly shorter time.

However, for other problems where the search can be split more efficiently, the speed up offered by combining Pythran acceleration with concurrency on many-core single machines can yield great immediate benefits to the practitioner.

Packaging/Distributing

When it comes to packaging/distributing, Pythran’s intelligent design proves particularly convenient.

Depending on your objectives, you can choose to distribute:

- the Python source code (.py)
- the C++ generated code (.cpp)
- the native module (.so/.pyd)

First, Pythran provides a distutils extension that enables standard use in setup.py. Cf. the distutils integration section in the official documentation. For a working example, see the “Distributing Python source code (.py)” section below.

If you distribute the Python source code, your users must have Pythran and a compiler on their computer.

You may not want to share the source code, or your users may not always have a compiler (e.g. Visual Studio takes a long time to download and requires admin rights to install).

If you are only concerned with privacy and do not want to share the source code, you can distribute the C++ generated code, obtained with the `-E` flag as follows. It is heavily templated, cryptic code which is difficult to reverse engineer.

```shell
# generate .cpp and .so files
$ pythran -E tsp_compute_single_threaded.py
```

Finally, if you just want your users to use the native module, you can distribute the .so/.pyd files instead of the corresponding .py files. But then the native module must be compiled on the same platform as your users’, which is often the case in medium sized organisations.

Here is how to package in these 3 cases. First generate the .cpp and .so/.pyd files from the .py files.

Distributing Python source code (.py)

Use this setup.py and this MANIFEST.in files to:

- Exclude the .cpp and .so/.pyd files from the bundle (MANIFEST.in: see recursive-exclude)
- Launch Pythran compilation from .py upon installation (setup.py: see ext_modules)

Distributing C++ generated code (.cpp)

Use this setup_2.py and this MANIFEST_2.in files to:

- Exclude the .py and .so/.pyd files from the bundle (MANIFEST.in: see recursive-exclude)
- Launch Pythran compilation from .cpp upon installation (setup.py: see ext_modules)

Distributing native module (.so/.pyd)

Use this setup_3.py and this MANIFEST_3 files to:

- Exclude the .py and .cpp files from the bundle (MANIFEST.in: see recursive-exclude)

Additionally you can embed compiling instructions in the source files, and also expose them as a module function: see compiler.py.

This may have the benefit of raising the awareness of the end user — if it is relevant.

It is also important to mention that Pythran generated C++ code can be deployed “as is”.

There are always these conflicting requirements: fast prototyping in a high-level, human-friendly language (e.g. Python) and deployment in a fast, low-level, machine-friendly language (e.g. C++), which are difficult to reconcile. See the article Pythran as a bridge between fast prototyping and code deployment for a more thorough development.

Well Pythran helps bridge this eternal chasm!

Note: Distributing native code compiled without an active effort to obfuscate it does not guarantee the confidentiality of the source code.

To achieve that goal you may explore obfuscator-llvm or epona.

Remote Jupyter

To run a Jupyter notebook server on a cloud ☁️ VM and thereby benefit from more computing power (in general: CPU, GPU, #cores, memory; in our TSP case CPU speed and #cores), see these step-by-step setup guides, from zero to playing the demo-tsp-pythran notebook on:

- Google Cloud
- Microsoft Azure

Conclusion

In this article we have:

- installed Pythran on Linux (Ubuntu), macOS, Windows
- run a Jupyter notebook off a cloud VM (GCP, Azure)
- briefly presented the Simulated Annealing algorithm (SA)
- applied the latter to the Traveling Salesman Problem (TSP) as a sample problem
- implemented the SA in Python, with signature checking (useful if evaluating the energy of a state is expensive)
- accelerated it by up to 3 orders of magnitude with Pythran to convert Python to C++, with annotations, incl. OMP directives, and concurrent.futures to run parallel tasks (on local cores) from Python
- reviewed the various ways to distribute Pythran modules

Overall the convenience, performance, and flexibility of Pythran are simply eye-popping! Undoubtedly it is a very valuable productivity booster. Kudos to the developers!!
