Lessons Learned from Creating a Custom Graph Visualization in React

By Horst Werner and Simon Fishel
Jun 10

Graphs are probably the most powerful and versatile of all data structures.
Graph structures are everywhere: in social media, genealogy, supply and distribution chains, financial networks, security, and business processes, to name just a few.
Visualizations of graphs, in the form of circles connected by lines, have been used for centuries (see, e.g., The Great Stemma, the “Tree of Knowledge”, or any Mind Map).
Computers have enabled us to create such visualizations automatically from data, and research on layout algorithms has been going on for decades.
Open source libraries such as d3.js provide generic graph visualizations that can be used and adapted with minimal effort.
One could assume that by today there is a perfect layout strategy for pretty much every common use case and probably some open source library implementing it.
So open source libraries were our first stop when we set out to build Splunk Business Flow, an application that provides business operations professionals with insights into their actual end-to-end business processes and customer experiences through interactive exploration and visualization.
Oftentimes, the actual (“as-is”) business processes of our customers differ from the ideal (“to-be”) processes they have defined, and surfacing such discrepancies, bottlenecks and delays creates tremendous value.
Fig. 1: Simple process graph generated with Cytoscape + Dagre layout

We evaluated vis.js, some d3.js solutions, and Cytoscape, a Canvas-based visualization library that allows you to select from multiple layout algorithms and provides the option to implement your own.
The Dagre layout algorithm appeared to be the most advanced, but the graphs it creates don’t look quite like a business process (Fig. 1).
Other generic algorithms, such as the force-directed graph layout, don’t produce much better results either (Fig. 2).

Fig. 2: Force-directed layout

This is because generic layout algorithms can’t take the meaning of a graph into account: they can optimize for certain geometric properties (such as keeping the edges as short as possible, or minimizing the number of edge crossings), but they can’t produce a visual pattern that reflects the meaning.
This insight led us to the decision to write a custom layout algorithm to visualize business processes.
We use React to develop our UI; unfortunately, the graph libraries we evaluated don’t properly support the rendering paradigm of React (e.g., in Cytoscape, the graph component is a Canvas, which is updated in a way that completely bypasses React’s virtual DOM optimizations).
So we also developed a new, React-based rendering module, with an abstraction layer that allows us to switch between multiple layout implementations.
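One way such an abstraction layer could look is sketched below. All names here (`GraphData`, `PositionedGraph`, `LayoutStrategy`, `computeLayout`, and the two placeholder layouts) are hypothetical and not taken from the actual product code; the only point is that the renderer consumes coordinates and does not care which algorithm produced them.

```typescript
// Hypothetical types for illustration; not the actual Splunk Business Flow API.
interface GraphData {
  nodes: { id: string }[];
  edges: { source: string; target: string }[];
}

interface PositionedGraph {
  positions: Record<string, { x: number; y: number }>;
}

// A layout is a function from graph data to coordinates,
// so the rendering module can swap implementations freely.
type LayoutStrategy = (graph: GraphData) => PositionedGraph;

// Placeholder layout: stack nodes vertically in input order.
const verticalStackLayout: LayoutStrategy = (graph) => {
  const positions: Record<string, { x: number; y: number }> = {};
  graph.nodes.forEach((node, index) => {
    positions[node.id] = { x: 0, y: index * 100 };
  });
  return { positions };
};

// Placeholder layout: place nodes evenly on a circle of radius 200.
const circleLayout: LayoutStrategy = (graph) => {
  const positions: Record<string, { x: number; y: number }> = {};
  const count = Math.max(graph.nodes.length, 1);
  graph.nodes.forEach((node, index) => {
    const angle = (2 * Math.PI * index) / count;
    positions[node.id] = { x: 200 * Math.cos(angle), y: 200 * Math.sin(angle) };
  });
  return { positions };
};

// Registry keyed by name; the renderer only ever sees LayoutStrategy.
const layoutRegistry = new Map<string, LayoutStrategy>([
  ["verticalStack", verticalStackLayout],
  ["circle", circleLayout],
]);

function computeLayout(layoutName: string, graph: GraphData): PositionedGraph {
  const strategy = layoutRegistry.get(layoutName);
  if (strategy === undefined) {
    throw new Error(`Unknown layout: ${layoutName}`);
  }
  return strategy(graph);
}
```

Under this assumption, a React component would call `computeLayout` with the selected layout name and render one SVG element per entry in `positions`.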
Here is what we learned in the process:

Map the Meaning of the Data to an Intuitive Visual Structure

In our specific case, we expect the graph to represent a process flow, which has a clear direction.
We decided that a vertical downward flow would be the most intuitive representation of this direction.
Of course, there are lots of branches and loops blurring that ideal straightforward structure.
Therefore, one of our core challenges was making the structure of the main process visible through the noise.
We started with the following rules:
- Wherever the process is linear (i.e., each step has exactly one successor), that segment should appear as a strictly vertical structure in the graph.
- Where processes branch and loop, keep the most frequent process flows close to the center, to get as close as possible to a central main flow with peripheral variants.
- Establish a clear sequence of steps, so that loops become visible in the form of backward-pointing edges.
Needless to say, while these rules make sense for process flows, the rules for other types of graphs will be completely different.
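The third rule can be illustrated with a small sketch. The article does not say how the sequence of steps is established, so the approach below is an assumption: each step gets a rank (its row in the vertical flow) by breadth-first search from a known start step, and an edge counts as backward-pointing if it does not lead to a later rank. The names `Transition`, `assignRanks`, and `isBackwardEdge` are hypothetical.

```typescript
// Hypothetical shape of an observed transition between two process steps.
interface Transition {
  source: string;
  target: string;
}

// Assign each reachable step a rank (row) by breadth-first search from the
// start step. This is one possible way to establish a sequence, assumed here
// for illustration only.
function assignRanks(start: string, transitions: Transition[]): Map<string, number> {
  const ranks = new Map<string, number>();
  ranks.set(start, 0);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    const currentRank = ranks.get(current) as number;
    const outgoing = transitions.filter((t) => t.source === current);
    for (const t of outgoing) {
      if (!ranks.has(t.target)) {
        ranks.set(t.target, currentRank + 1);
        queue.push(t.target);
      }
    }
  }
  return ranks;
}

// An edge points backward (a loop) if it does not lead to a later rank.
// Self-loops are treated as backward edges as well.
function isBackwardEdge(t: Transition, ranks: Map<string, number>): boolean {
  const from = ranks.get(t.source);
  const to = ranks.get(t.target);
  if (from === undefined || to === undefined) {
    return false; // step not reachable from the start; leave unclassified
  }
  return to <= from;
}
```

In a repair process where testing can send a device back to repair, the transition from "test" to "repair" would be classified as backward, while "repair" to "test" stays part of the forward flow.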
Fig. 3 shows two of our layout iterations using these rules.

Fig. 3: Early iterations of the layout algorithm

In the first iteration (left side in Fig. 3), we still tried to get away with only straight or slightly curved edges, without collision detection or dedicated edge routing.
Testing with more complex data sets, such as a real-world example involving 36 nodes with highly varying sequences, soon showed the limits of that approach, which brought us to our next insight:

Edge Layout Matters

If you can only render straight edges, the placement of nodes is restricted unless you can live with a mess of overlapping edges and nodes.
The direction of edges can also be hard to decipher because arrow pointers tend to get tiny when you look at a whole graph.
This makes it necessary to have an edge routing algorithm.
The well-established Graphviz library uses Bezier curves, which give the graph an organic look.
However, this didn’t align with our intended visual message.
Business processes are supposed to be systematic and well structured.
As a matter of fact, real business processes reflected in logs usually aren’t, for several reasons, but the process analyst is looking for structure when trying to understand them.
In order to reduce the visual noise created by the edges, we decided to adopt a pattern of mostly orthogonal edge segments and to bundle parts of outgoing and incoming edges of each node as much as possible.
In order to make the direction of edges easier to see, we also established the following rule:
- All incoming edges enter a node in a single bundle from above, and all outgoing edges leave in a single bundle from the bottom of the node.
The result resembles the conductor paths on a circuit board, which is why we call the layout algorithm the “Circuit Board” layout.
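To make the "leave at the bottom, enter at the top, orthogonal segments only" idea concrete, here is a minimal sketch that builds an SVG path string for one edge. It is not the Circuit Board algorithm itself: it does no bundling and no collision detection, and the names `NodeBox` and `orthogonalEdgePath` as well as the `gap` and `laneX` parameters are assumptions made for this example.

```typescript
// Hypothetical node rectangle; x and y are the top-left corner.
interface NodeBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Build an SVG path made only of vertical (V) and horizontal (H) segments.
// The edge leaves the source at its bottom center and enters the target at
// its top center. No collision detection or bundling is attempted here.
function orthogonalEdgePath(
  source: NodeBox,
  target: NodeBox,
  gap: number = 20,
  laneX?: number
): string {
  const startX = source.x + source.width / 2;
  const startY = source.y + source.height; // bottom edge of the source
  const endX = target.x + target.width / 2;
  const endY = target.y; // top edge of the target

  if (endY >= startY + 2 * gap) {
    // Forward edge: down, across, down.
    const midY = (startY + endY) / 2;
    return `M ${startX} ${startY} V ${midY} H ${endX} V ${endY}`;
  }

  // Backward (or too close) edge: detour through a vertical lane beside the
  // two nodes, so the edge can still enter the target from above.
  const lane =
    laneX !== undefined
      ? laneX
      : Math.max(source.x + source.width, target.x + target.width) + gap;
  return `M ${startX} ${startY} V ${startY + gap} H ${lane} V ${endY - gap} H ${endX} V ${endY}`;
}
```

The resulting string can be used directly as the `d` attribute of an SVG `<path>` element.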
In a further step to reduce the visual complexity, we emphasize forward-pointing edges (which make up the main flow) by rendering them darker than backward-pointing edges (which usually represent loops in the process).
Edges that are more frequent are rendered with wider lines than edges that occur in a smaller fraction of the observed processes.
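The two styling rules can be expressed as a small mapping from edge properties to stroke attributes. The sketch below assumes that `frequency` is the fraction (0 to 1) of observed processes that contain the edge; the concrete colors and width range are placeholder values chosen for the example, not the ones used in the product.

```typescript
// Hypothetical stroke attributes for rendering one edge.
interface EdgeStyle {
  stroke: string;
  strokeWidth: number;
}

// Forward edges are drawn darker than backward edges, and the line width
// grows linearly with the edge frequency (a fraction between 0 and 1).
function edgeStyle(
  frequency: number,
  isBackward: boolean,
  minWidth: number = 1,
  maxWidth: number = 8
): EdgeStyle {
  // Clamp defensively so that bad input cannot produce absurd widths.
  const clamped = Math.min(1, Math.max(0, frequency));
  return {
    stroke: isBackward ? "#b0b0b0" : "#333333",
    strokeWidth: minWidth + clamped * (maxWidth - minWidth),
  };
}
```

A linear width scale is the simplest choice; for data sets with a few dominant paths and many rare ones, a logarithmic or square-root scale may separate the rare edges better.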
Fig. 4: A simple repair process visualized with the Graphviz “organic” edge layout (left) and our “Circuit Board” edge layout (right)

Test with Real Data and Worst Case Data

When designing a layout algorithm, there are always implicit assumptions about the nature of the data, mostly that the data is benign, and complies with the desired structure that we envision.
In practice, data tends to be noisy and can have large numbers of nodes and edges.
From about 100 nodes upwards, the value of conventional node-edge graph visualizations decreases quickly; the human brain can’t handle the visual complexity, and it is impossible to read labels at a zoom level that shows the whole graph.
Force-directed layouts can still convey the number and size of clusters, but that information is only useful in a limited number of use cases.
In order to optimize the layout algorithm, we evaluated it in every stage with multiple data sets (including a particularly noisy set of real-world data) and with a synthetic worst-case data set (200 nodes with random, heavily branching transitions).
The iterations with different data sets were essential in tweaking the parameters of the layout algorithms: settings that would produce a beautiful graph for one dataset could have ugly side effects in another.
The main goal of testing with the worst-case data set was to ensure that, despite chaotic data, the sequence, structure, and visual messaging of the business process were preserved, even if the user can no longer grasp the totality of connections.
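A synthetic worst-case data set of this kind is easy to generate. The sketch below is an assumption about how one might do it, not the generator we used: it creates a given number of steps and gives each one a random number of outgoing transitions to random targets. A small seeded pseudo-random generator (in the style of mulberry32) keeps the data reproducible between test runs; `generateWorstCase`, `maxBranches`, and `seed` are hypothetical names.

```typescript
// Small deterministic pseudo-random generator (mulberry32 style), so that
// the same seed always yields the same test graph.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface SyntheticTransition {
  source: string;
  target: string;
}

// Generate nodeCount steps; each step gets between 1 and maxBranches
// outgoing transitions to randomly chosen steps (duplicates are dropped).
function generateWorstCase(
  nodeCount: number,
  maxBranches: number,
  seed: number
): { nodes: string[]; transitions: SyntheticTransition[] } {
  const random = seededRandom(seed);
  const nodes: string[] = [];
  for (let i = 0; i < nodeCount; i++) {
    nodes.push(`step-${i}`);
  }
  const transitions: SyntheticTransition[] = [];
  const seen = new Set<string>();
  nodes.forEach((source) => {
    const branches = 1 + Math.floor(random() * maxBranches);
    for (let b = 0; b < branches; b++) {
      const target = `step-${Math.floor(random() * nodeCount)}`;
      const key = `${source}->${target}`;
      if (!seen.has(key)) {
        seen.add(key);
        transitions.push({ source, target });
      }
    }
  });
  return { nodes, transitions };
}
```

Calling `generateWorstCase(200, 20, 42)` yields 200 steps with heavily branching transitions; the branching factor and seed are arbitrary example values.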
Fig. 5 shows how the Circuit Board layout performs compared to Digraph, which is the best open source library we could find to handle large numbers of edges in a structured manner.
The fact that the Circuit Board layout uses significantly fewer pixels for the rendering of edges in such an extreme case is due to the edge bundling rule explained earlier.
Fig. 5: Circuit Board layout (left) vs. Digraph layout (right) for our worst-case data set

Consider Showing Partial Graphs

As the examples above show, the main drawback of graph visualizations is the visual complexity that comes with a larger number of nodes and edges.
In order to reduce that complexity, we offer a control called “Noise Slider” that allows the user to determine how many nodes and edges are displayed: when the slider is set to 50%, only nodes and edges that occur in at least 50% of the observed processes are rendered in the graph.
In our use case, the graph structure is one-dimensional (i.e., it consists of only one type of edge: the successor relationship), so the main metric for relevance we use is frequency.
We also allow the user to set filters on attribute values and particular sequences.
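The frequency threshold behind the Noise Slider boils down to a simple filter. The sketch below assumes that every node and edge carries a `frequency` field holding the fraction of observed processes in which it occurs; the type and function names are hypothetical. Edges whose endpoints have been filtered out are dropped as well, so no edge is left dangling.

```typescript
// Hypothetical data shapes; frequency is a fraction between 0 and 1.
interface FrequencyNode {
  id: string;
  frequency: number;
}

interface FrequencyEdge {
  source: string;
  target: string;
  frequency: number;
}

// Keep only nodes and edges that occur in at least `threshold` of the
// observed processes (threshold 0.5 corresponds to a slider value of 50%).
function applyNoiseThreshold(
  nodes: FrequencyNode[],
  edges: FrequencyEdge[],
  threshold: number
): { nodes: FrequencyNode[]; edges: FrequencyEdge[] } {
  const visibleNodes = nodes.filter((n) => n.frequency >= threshold);
  const visibleIds = new Set<string>(visibleNodes.map((n) => n.id));
  const visibleEdges = edges.filter(
    (e) =>
      e.frequency >= threshold &&
      visibleIds.has(e.source) &&
      visibleIds.has(e.target)
  );
  return { nodes: visibleNodes, edges: visibleEdges };
}
```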
Often, graphs are multidimensional, containing multiple types of nodes (such as people, organizations, channels and documents) and edges (such as “follows”, “reports to”, “has written”, “has read”), and we find structures (e.g., hierarchies or clusters) embedded into such graphs.
Such structures tend to become buried in the sum of associations, so enabling the user to switch between different visualizations, each of which highlights one particular relationship or structure, helps to surface relevant information, making the graph manageable for the human brain.
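As a minimal illustration of switching between such views, the sketch below extracts the subgraph for a single relationship type from a multidimensional graph. The `TypedEdge` shape and the `selectRelationship` helper are assumptions made for this example; each extracted subgraph could then be handed to the layout that suits it best, such as a tree layout for a “reports to” hierarchy.

```typescript
// Hypothetical edge of a multidimensional graph; `type` names the relationship.
interface TypedEdge {
  source: string;
  target: string;
  type: string;
}

// Return only the edges of one relationship type, plus the ids of the nodes
// they touch (in order of first appearance).
function selectRelationship(
  edges: TypedEdge[],
  relationship: string
): { nodeIds: string[]; edges: TypedEdge[] } {
  const selected = edges.filter((e) => e.type === relationship);
  const nodeIds: string[] = [];
  selected.forEach((e) => {
    [e.source, e.target].forEach((id) => {
      if (nodeIds.indexOf(id) === -1) {
        nodeIds.push(id);
      }
    });
  });
  return { nodeIds, edges: selected };
}
```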
Provide Additional Visualizations

A single visualization can never tell the whole truth: for business processes, the graph shows the essence of thousands, or tens of thousands, of individual process instances.
However, it doesn’t tell us what the individual processes look like, how many times they loop, and how the different steps correlate with each other.
Therefore, we added compressed, color-coded visualizations of individual processes to our list view (Fig. 6).
Also, since we can extract further properties from the processes, apart from their sequence, we can show how the processes are distributed with respect to various attributes, such as the defect type in the repair process example (Fig. 6).
The user can interact with these charts to set filters affecting the graph for interactive exploration.
Fig. 6: Complementary visualizations: List view and Attribute view.
Visualizing Very Large Graphs

The “worst case” data we tested our algorithms with (200 nodes and thousands of edges) is a worst case only in our narrow domain.
Graphs can have hundreds of thousands, or even millions, of nodes.
No conventional layout algorithm can transform such a data set into a succinct diagram that the human brain can process.
However, if we leave behind assumptions that a graph visualization consists of shapes connected by lines and that we are limited to a two-dimensional static medium, there are ways to make graphs with millions of nodes accessible to visual exploration.
The trick is to encode the millions of data points into visual patterns that can morph in response to user interactions.
But that is another story to be told another day.
Contact us if you are curious :)

Some Remarks on Rendering Performance

React’s virtual DOM approach works best with SVG (as opposed to Canvas): in a graph, that means each node and edge is a DOM node that only needs to be regenerated if it changes.
In general, SVG performs better than Canvas if there are relatively few objects with relatively large filled areas.
The term “few objects” must be put into context — we saw no issues with the rendering performance up to several thousands of SVG elements.
With SVG, zooming is free — you only need to manipulate the CSS transform property of the SVG element, whereas in Canvas you have to re-render the whole graph.
For very large numbers of SVG elements, we observed that zooming on Firefox performs much worse than on Chrome.
If you primarily target Firefox, you should try manipulating the SVG viewBox attribute instead of the CSS transform property.
However, that approach performed much worse on Chrome in our early experiments (Spring 2018).
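The two zooming approaches can be driven from the same state. The sketch below is an illustration under stated assumptions, not our actual implementation: `ZoomState` and the helper names are hypothetical, the CSS variant assumes `transform-origin: 0 0` on the SVG element, and the viewBox variant assumes the SVG keeps a fixed pixel size of `width` by `height`. Both map a graph coordinate `p` to the screen position `p * scale + offset`.

```typescript
// Hypothetical zoom state: screen = graph * scale + offset.
interface ZoomState {
  scale: number;
  offsetX: number;
  offsetY: number;
}

// Zoom by `factor` while keeping the graph point under the cursor fixed.
function zoomAt(state: ZoomState, cursorX: number, cursorY: number, factor: number): ZoomState {
  return {
    scale: state.scale * factor,
    offsetX: cursorX - (cursorX - state.offsetX) * factor,
    offsetY: cursorY - (cursorY - state.offsetY) * factor,
  };
}

// Variant 1: value for the CSS transform property
// (assumes transform-origin: 0 0 on the element).
function toCssTransform(state: ZoomState): string {
  return `translate(${state.offsetX}px, ${state.offsetY}px) scale(${state.scale})`;
}

// Variant 2: value for the SVG viewBox attribute, for an SVG element that
// stays `width` x `height` pixels on screen.
function toViewBox(state: ZoomState, width: number, height: number): string {
  const minX = -state.offsetX / state.scale;
  const minY = -state.offsetY / state.scale;
  return `${minX} ${minY} ${width / state.scale} ${height / state.scale}`;
}
```

Keeping the zoom state separate from its serialization makes it cheap to switch between the two variants per browser and measure which one performs better.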
If your graph is very complex, you’ll probably be better off with Canvas.
The tradeoff is that you’ll have to implement logic to calculate what nodes or edges the mouse cursor touches if you want to allow the user to interact with the graph.
We did it anyway, to preserve the option to use Canvas.
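Hit-testing logic of this kind can be quite compact. The following sketch assumes rectangular nodes and edges stored as polylines of already routed points; the names (`Pt`, `HitNode`, `HitEdge`, `hitTest`) and the pixel tolerance are hypothetical. The cursor position is expected in graph coordinates, so any zoom or pan has to be undone before calling it.

```typescript
interface Pt {
  x: number;
  y: number;
}

// Hypothetical shapes: rectangular nodes and polyline edges.
interface HitNode {
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface HitEdge {
  id: string;
  points: Pt[];
}

// Shortest distance from point p to the line segment a-b.
function distanceToSegment(p: Pt, a: Pt, b: Pt): number {
  const dx = b.x - a.x;
  const dy = b.y - a.y;
  const lengthSquared = dx * dx + dy * dy;
  // Projection of p onto the segment, clamped to its end points.
  const t =
    lengthSquared === 0
      ? 0
      : Math.max(0, Math.min(1, ((p.x - a.x) * dx + (p.y - a.y) * dy) / lengthSquared));
  return Math.hypot(p.x - (a.x + t * dx), p.y - (a.y + t * dy));
}

// Return the id of the node or edge under the cursor, or null.
// Nodes are checked first (last drawn first), on the assumption that they
// are painted on top of the edges.
function hitTest(p: Pt, nodes: HitNode[], edges: HitEdge[], tolerance: number = 4): string | null {
  const hitNode = nodes
    .slice()
    .reverse()
    .find((n) => p.x >= n.x && p.x <= n.x + n.width && p.y >= n.y && p.y <= n.y + n.height);
  if (hitNode !== undefined) {
    return hitNode.id;
  }
  for (const edge of edges) {
    let previous: Pt | undefined;
    for (const point of edge.points) {
      if (previous !== undefined && distanceToSegment(p, previous, point) <= tolerance) {
        return edge.id;
      }
      previous = point;
    }
  }
  return null;
}
```

This linear scan is fine for a few thousand elements; for much larger graphs, a spatial index such as a grid or quadtree would be the usual next step.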
Another detail worth mentioning is that collision-free edge routing gets expensive when the edge density is high: the algorithm will spend a lot of time searching for free space to route edges through.
Therefore, the performance of the algorithm turned out to be very sensitive to the spacing of nodes.
However, if the spacing is too large, the user will need to zoom out too far in order to see the whole graph, so individual nodes will be rendered relatively small.
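Why density hurts can be shown with a deliberately simplified toy model, which is not the routing algorithm described in this post. Assume the space between two columns of nodes is divided into parallel tracks, and a new edge segment looks for the free track closest to its preferred position. The function `findFreeTrack` and its parameters are hypothetical; the `probes` counter stands in for the time spent searching.

```typescript
// Toy model: search outward from the preferred track for the nearest free one.
// `occupied` holds the indices of tracks already taken by other edges.
function findFreeTrack(
  preferred: number,
  occupied: Set<number>,
  trackCount: number
): { track: number | null; probes: number } {
  let probes = 0;
  for (let offset = 0; offset < trackCount; offset++) {
    const candidates = offset === 0 ? [preferred] : [preferred - offset, preferred + offset];
    for (const candidate of candidates) {
      if (candidate < 0 || candidate >= trackCount) {
        continue; // outside the available space
      }
      probes++;
      if (!occupied.has(candidate)) {
        return { track: candidate, probes };
      }
    }
  }
  // Every track is taken: the spacing is too tight for this edge.
  return { track: null, probes };
}
```

With an empty channel the search succeeds on the first probe; the more tracks are occupied, the more probes each new edge needs, and with too few tracks (tight node spacing) the search fails outright. Wider spacing means more tracks and shorter searches, at the cost of a larger graph.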
Conclusions

When visualizing complex graph structures, getting the information funneled into the human brain is the biggest challenge.
Our brains are very good at mapping visual patterns to meaning, but we can’t make much sense of relatively uniform patterns consisting of too many similar things.
There are probably hundreds of proven graph visualizations around, but most of them are generic.
If the particular questions you have can be answered with these, good for you: take your pick!

However, if you aim for a competitive advantage by tailoring the visualization to your use-case-specific semantics, developing a custom layout algorithm still makes sense.
Two main takeaways from our journey in developing a custom graph visualization are:
- Rendering an SVG graph visualization (generated by React components) is sufficiently fast for all reasonable graph sizes: by the time rendering performance becomes critical, you’ll have long passed the number of nodes and edges the user can digest. Reducing the number of rendered nodes and edges by the means discussed above is one way to make the information easier to consume.
- It is easy to imagine dozens of use cases for which generic graph visualizations are suboptimal, especially for multidimensional graphs. Therefore, exploring creative ways to generate use-case-specific graph visualizations continues to be a valuable endeavor.
If you enjoyed this post and found the challenges discussed here interesting, we’d love to work with you: apply to one of our many engineering positions here!

Horst Werner holds a Ph.D. in Computer Aided Engineering and has been working with graph data structures since 1998.
In his personal life, he enjoys all sorts of outdoor activities and tinkering with metal and wood.
His answer to “Why I love working at Splunk”: Splunk gives me the opportunity to work on exciting user interfaces for a quantity and variety of data that is available to no other company I’ve worked for so far.
Simon Fishel has been working as a Frontend Engineer at Splunk for 7 years.
His answer to “Why I love working at Splunk”: As an engineer at Splunk, you get to design software that creates “Aha!” moments for your users, which is even more exciting than having those “Aha!” moments yourself.