The Ultimate Technical Skill in Data Visualization for Data Scientists
What lies beyond ggplot2, matplotlib, seaborn, Shiny, Dash, R or Python.
These are the skills to reach the next level in data visualization.
David-Olivier Pham, Apr 14
Summary
Anyone working with data must be able to visualize it.
With the recent advances in user interface (UI) technologies, expectations about the UI have increased considerably.
I argue that the common frameworks data scientists use to create rich web applications lead to a frustrating development experience, and that learning a few selected basic skills in web development will drastically improve it.
Introduction
It is commonly accepted that data is the new gold for many businesses, as it permits better decisions and better insights.
We always need to look at data: at the exploration phase, at the model diagnostic stage or during production.
Tools come in a wide range of sophistication, from Excel to Tableau via Python's matplotlib.
Different requirements will lead to different types of pain for automating the process.
Nevertheless, I am fairly certain that we all want to present nice, fancy and interactive dashboards and stories and to distribute them to our stakeholders or audience.
The standard process is to package our solutions as web apps (applications that run part of their logic on a client device and the rest on a server).
In my experience, creating web apps using the typical toolkit in data science leads to frustrating and infuriating moments.
Nevertheless, there exists a really elegant solution to this problem, and my hope is it can help you as it did for me.
Spoiler alert: my solution involves learning Lisp (a particular dialect of it), one of the oldest programming languages associated with artificial intelligence.
This article is aimed at anyone who needs to display data for their job, and especially to the community who need to create highly customized and adaptable visualizations typically found in web applications.
That being said, any data scientist who needs to create a customized user interface could benefit from reading it.
Before you continue, if you have never used ggplot2 and plotly, I definitely recommend trying them out first, as they teach some standard logic in data visualization and provide a baseline of what you should expect from a plot.
Story
An argument states that ideas are always better understood with a context.
My background is fairly common: I always loved to solve problems so I followed the typical path of data analyst (or any synonym you like to use for it: scientist, quantitative analyst, statistician).
At university, I studied an impractical field (mathematics) for my first degree, out of passion for it.
In my first job, I started to appreciate statistics, machine learning and programming, which led me to study them in depth during my free time in online nanodegrees and in a second formal master's degree (don't judge me, education is fortunately cheap in Switzerland).
During this time, I invested the time to study R and Python with their standard packages for data analysis.
Luckily, I forced myself to learn and use emacs, which made me a happier professional.
As for my jobs, I was increasingly required to build dynamic analyses, reports and solutions, as people are getting used to interacting with their smartphones or web browsers.
I used R Shiny professionally for two years.
This is where our story begins.
Problem
Among data scientists, R and Python are the most prominent languages, and each has a de facto standard library for creating web apps: Shiny and Dash.
They also handle communication (e.g. data transfer and user interaction) between the browser (the front end) and the server (the back end), where a process in the host language runs until termination.
Moreover, they handle all the asynchronous js interaction, and both offer solutions easing the deployment of the applications.
To be fair, they do a really good job on these points.
However, I argue that merging the front and back end also brings a lot of inconvenience that greatly reduces the value of these frameworks.
Users are accustomed to extremely smooth, fast and dynamic interfaces with many fancy features such as error messages in forms or dynamic UI.
As the goal is to completely avoid js, Shiny and Dash users have reduced expressiveness for customizing the HTML output to meet these expectations.
For example, it would be quite challenging without js to create a button whose color depends on the width of the browser window.
If any solution exists, it will involve a communication cycle between the front and the back end, leading to an unpolished experience.
Truly modular components are hard to write in both frameworks.
The main reason is the tight coupling between the identifiers of the HTML nodes and the back end effects triggered when those nodes fire user events.
Additionally, for Shiny, modular code is hard to achieve as environments, namespaces or modules are not common in R.
Finally, and this is the most important point, both frameworks lead to a frustrating development experience.
Feedback cycles in projects based on these packages are astonishingly and painfully slow: when developing new features, components, views or tabs, the back end usually restarts completely every time the source code changes, the web browser usually needs to be refreshed, and the developer has to restore the particular state of the app to see whether the modification has the desired effect.
In any reasonably sized project, data and packages will be loaded on start and the developer will pay the initialization cost for every single iterative change.
In the worst scenario, the whole application crashes on reload because of a syntax error.
(Note: there is a hack to simulate hot reloading in both frameworks: the back end listens to a text area and can evaluate [in the sense of an eval command] the code written in it.
That being said, this might lead to severe security issues.)
Solution
The solution is to split the front and the back end again, and to learn how to control the front end.
This will solve all the problems listed previously:
You get full control over the front end, and as the UI calculations are performed in the web browser, the need to communicate with the back end for every UI operation disappears.
Most front end technology can deal with modules or namespaces.
So you can have several variables named type or date: as long as they live in different namespaces, there will be no confusion.
Moreover, with the reduced necessity of communication with the server, the code will be more modular, hence more robust.
Hot module replacement (or hot reloading) is achieved by most frameworks as well.
The last point is the single best feature and should convince you to try: once you have used it, you will not want to return to a life without it.
Hard and Inelegant Way
Learn js and ReactJS (around which Dash is wrapped).
This is the hard way, because as said previously, js is a hard language with many quirks.
I am not a js specialist, so I will not make any strong claims, but I think this path requires a substantial investment.
Parasite Languages
Parasite languages are programming languages that leverage existing languages and their tools by compiling their own syntax to the underlying technology.
For example, Scala and TypeScript are parasite languages of Java (more precisely, of the Java Virtual Machine) and js, respectively.
The advantage of parasite languages is they can add certain paradigms and discourage some of the weaknesses of targeted languages.
For example, TypeScript adds static types to js.
I claim that ClojureScript (cljs), a version of Clojure which targets js, is the perfect candidate for most data scientists.
As cljs code ultimately is js code, you enjoy the same benefits for speed.
Hot-reloading is feasible thanks to figwheel or shadow-cljs.
Nonetheless, all the previous points could be achieved with js as well, so why do I favor cljs?
The language is dead simple, extremely stable and designed to yield elegant and simple code.
It eliminates most pitfalls of js, and its best practices encourage a model where data, and not the container, is at the center of your code (no more projects with 1500 classes).
Moreover, it comes with a battle-tested standard library for manipulating your data.
I definitely recommend reading this and watching that to see the benefits of cljs over js.
Back end
As for the back end, the part that does the computation, the code performing business logic will remain the same, but there will be a need to add a wrapper around the code to respond to HTTP(S) requests from the front end.
This is easily achieved with R plumber or Python Flask.
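As a rough illustration of such a wrapper, the sketch below exposes a stand-in business function over HTTP and returns JSON. To stay dependency-free it uses Python's standard-library WSGI tools rather than Flask; the routes (/api/groups, /api/mean), the sample data and every function name are hypothetical and only serve the illustration.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Hypothetical in-memory data; a real app would load its data once at start-up.
ROWS = [
    {"group": "a", "value": 1.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 10.0},
]


def mean_by_group(rows, group):
    """Stand-in for existing business logic: mean value of one group (None if empty)."""
    values = [r["value"] for r in rows if r["group"] == group]
    return sum(values) / len(values) if values else None


def app(environ, start_response):
    """WSGI entry point: map URL paths to existing functions and answer with JSON."""
    path = environ.get("PATH_INFO", "/")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    if path == "/api/groups":
        status, payload = "200 OK", sorted({r["group"] for r in ROWS})
    elif path == "/api/mean":
        group = query.get("group", [""])[0]
        status, payload = "200 OK", {"group": group, "mean": mean_by_group(ROWS, group)}
    else:
        status, payload = "404 Not Found", {"error": "unknown route"}
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def request_locally(path, query=""):
    """Call the WSGI app directly, without opening a socket (handy for quick checks)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path, "QUERY_STRING": query}, start_response))
    return captured["status"], json.loads(body)


def serve(port=8000):
    """Run a local development server; blocks until interrupted."""
    with make_server("127.0.0.1", port, app) as httpd:
        httpd.serve_forever()
```

The business function stays untouched and the wrapper only translates between URLs and function calls, which is the whole point of the split. With Flask, each branch of the if statement would typically become a decorated route function.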
For simple cases where operations could be performed by the client (e.g. filtering and grouping tables), the input data could be stored as CSV or JSON, and a simple HTTP server (in the sense of the python3 -m http.server command run in the root folder) would be enough.
See my personal website for an example.
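For the simple static case, a minimal sketch could look as follows: the table is exported once as JSON and CSV, and the folder is then served as plain files. The file names (table.json, table.csv) and both function names are assumptions made for the example, and the rows are assumed to be a non-empty list of dictionaries sharing the same keys.

```python
import csv
import functools
import json
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def export_table(rows, out_dir):
    """Write the same table as JSON and CSV so the front end can fetch either one."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "table.json").write_text(json.dumps(rows), encoding="utf-8")
    with open(out / "table.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return out


def serve(directory, port=8000):
    """Roughly what `python3 -m http.server` does, but bound to localhost only."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    with ThreadingHTTPServer(("127.0.0.1", port), handler) as httpd:
        httpd.serve_forever()  # blocks until interrupted
```

Calling serve(export_table(rows, "public")) and fetching /table.json from the front end would be one way to use it; all filtering and grouping then happens in the browser.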
Tradeoffs
Learning cljs also teaches its users to be skeptical about any new solution.
What are the disadvantages of adopting cljs?
In terms of features, data scientists are now responsible for the communication between the client and the server, which means we need to learn how to design an elementary API.
That being said, this only means that the back end returns different data for different URL requests from the client.
Deployment to production is different from the solution offered by RStudio and Plotly.
That being said, as the architecture is the standard one for web apps (separation of back and front end) there exist several cloud providers and tutorials that could support you.
There are now two languages in the code base.
Introducing a new language and its tools always entails risks.
However, I strongly believe that the tangible pros by far outweigh the potential risks.
Additional Benefits
There are many additional benefits from learning cljs and using the proposed architecture.
cljs is a Lisp dialect and hence a functional programming language.
As js can target web, desktop and mobile clients, so can you with cljs.
Picking up Clojure will be dead easy, and you will then be able to program on the Java Virtual Machine as well.
Tools from Google Closure like code splitting and dead code elimination are available to cljs and will optimize the code size of your application, leading to a smoother user experience.
You can use any js visualization framework, although plotly is a good default option.
Ever heard about Vega? You can now use it without relying on wrappers for R or Python.
Thanks to the interoperability between cljs and js, you can leverage any ReactJS component written in js or in a wrapper for your language.
You could even reuse dash-html-components.
At some point, you will learn emacs, which is the best productivity tool ever.
Coupled with your knowledge of Lisp, you will be able to configure it at your whim (but do not learn emacs and cljs at the same time).
ConclusionIn this article, I advocated for data scientists to invest time to learn key skills in web development (cljs) to improve their development experience when crafting these applications.
Although R Shiny and python Dash are great tools for their original goals, they lack key features from modern front end technologies such as hot module reloading without browser refresh, leading to a suboptimal development experience.
Arguments for favoring cljs over js have also been mentioned.
But I would like to add one last argument to finally convince anyone to try cljs: creating visualizations and web apps with cljs is extremely fun.
You solve problems as fast as your brain can think, and it lets you stay in that zone where you remain focused and creative.
I really hope you will give it a try!
In a follow-up article, if there is any interest, I will provide a project template and guide you through it.
In the meantime, I would invite you to explore this beautiful language by visiting the following links.
What do you think? Did I miss anything central to the question? Are Shiny and Dash good enough for your use case?
Epilogue
For a while, I tried hard to evade my professional responsibilities concerning UI, because of all the reasons mentioned above (Shiny did traumatize me).
After all, I am a data analyst/scientist, I am more interested in discovering patterns, shaping my team's decisions based on data and ultimately creating better services and products for my clients.
However, an opportunity to shape my team's infrastructure and technology stack for web development presented itself.
As soon as I heard about the project, I knew I was going to be included (whether I liked it or not).
So I did what anyone else would have done: I took my destiny into my own hands.
I already knew about the many advantages and disadvantages of cljs and decided to lobby hard for it.
Somehow, my team was crazy enough to trust me.
There was a lot of pain, sleepless nights and long hours, but we learned a lot and gained valuable skills.
We have been designing our front end with cljs quickly and I could not be happier.
Essentially, this article is my way to exorcise my fear and frustration of UI development as a data scientist and I hope it will help many to rediscover with joy this part of our job.
Sources and Resources
ClojureScript Koans, learn the syntax of the language interactively in the browser.
Clojure for the Brave and True, the best place to start learning the concepts of Clojure and functional programming.
4Clojure, entertaining exercises and learn from others’ solutions.
re-frame documentation, the best source to understand the framework, and really enjoyable to read.
tv, many exceptional blog posts to understand reagent (a wrapper for ReactJS) and re-frame.
Simplicity matters, by the creator of Clojure himself.
ClojureScript: Lisp’s Revenge, by one of the key architects of ClojureScript.