Darius Foroux, from “Why More Technology Does Not Equal More Productivity”

Why You Need a Unified Analytics Data Fabric for Spark

Mark Palmer
Mar 7

How to optimize Spark for analytics without tanking productivity

Apache Spark has seen broad adoption for big data processing.
It was originally developed to speed up map-reduce operations for data stored in Hadoop.
Today, it’s still best suited for batch-oriented, high throughput operations on data.
Although Spark continues to improve, it’s still at best an incomplete solution for analytics, especially when it comes to real-time interactive workloads on changing data — the kind needed by BI, data science and IoT applications.
Software vendors like Databricks and AWS address this by making it easier to stitch together big data solutions, and in-house IT groups often deploy additional data management tools on top of Spark.
But, as Darius Foroux points out, more technology does not equal more productivity.
The missing link is a way to optimize Spark for BI users, data engineers, and data scientists, without piling more non-Spark based tools on top that drain productivity.
A Unified Analytics Data Fabric (UADF) solves this problem.
It adds support for streaming and transactional data and optimizes Spark for lightning-fast BI, data science and IoT applications.
And because it’s native to Spark, you leverage the people skills, operational processes, and tools that you already have.
A UADF improves productivity by extending Spark into a lightning-fast platform for BI, data science, and IoT applications.
As the SVP of analytics at TIBCO, I see our customers struggle with this challenge.
We think SnappyData, a UADF created by the visionary team behind Gemfire, helps overcome the shortcomings of Spark for analytics.
This article explains how it can help you get more from Spark while increasing your productivity at the same time.
What is Unified Analytics Fabric technology?

A UADF adds support for streaming, transactions, machine learning, and analytics to Spark.
The goal is to augment Spark with four capabilities:

- Data mutability and transactional consistency for Spark data
- Data sharing across users and applications with high concurrency
- Support for low-latency queries (e.g., a key-value read/write operation) and high-latency operations (expensive aggregation queries or ML training jobs)
- Unblocked analytics query access while high-speed transactions update data

Why not just use <insert-database-here>?

Spark developers often ask: “Which database should I use when I want to do BI, data science, and streaming on my Spark data?”

The answer is often “it depends.” Options include columnar stores (Vertica, Teradata, Redshift), cube-type stores (Jethro, AtScale, Kyvos), data science-centric analytics platforms (Databricks), and NoSQL stores (Cassandra, MongoDB).
These options all have their virtues, but, as a group, they suffer from similar drawbacks:

- They add more tools to Spark that increase complexity and reduce productivity.
- It’s hard for BI and data science tools to optimize for every database under the sun, leading to suboptimal performance.
- These data stores specialize in historical data, not streaming data. Real-time streaming data is an increasingly important aspect of digital business, such as IoT-aware applications and automated algorithmic systems.
Why you need a Unified Analytics Data Fabric

Here are five reasons why you should consider using a Unified Analytics Data Fabric:

#1: Transactions and streams go together like peanut butter and jelly

Increasingly, firms want all their data in one place (Spark), including transactional and IoT data.
For example, one energy company uses the SnappyData UADF to store its transactional, streaming, and reference data in Spark.
They store real-time IoT weather forecast updates alongside customer and equipment maintenance records from CRM and operational systems.
A single BI chart can generate queries that span three data domains, like:

“Based on real-time weather forecast data, show me the top 100 customers that are likely to have property damage and which ones have not been recently secured by maintenance. Put that data on a map so I can decide who to service next, and keep it updated in real time.”

Without a UADF, this kind of query would require complex Spark software development, custom data integration, and perhaps the purchase, installation, and configuration of a non-Spark-based data store.
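
To make the shape of that cross-domain query concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for a SQL engine. It is not SnappyData or Spark code. Every table name, column, threshold, and row is a hypothetical illustration: a weather forecast feed, customer reference data, and maintenance records joined in one query.

```python
import sqlite3

# All table names, columns, and rows below are hypothetical illustrations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather_forecast (region TEXT PRIMARY KEY, damage_risk REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE maintenance (customer_id INTEGER, days_since_secured INTEGER);
""")
conn.executemany("INSERT INTO weather_forecast VALUES (?, ?)",
                 [("north", 0.9), ("south", 0.2), ("east", 0.7)])
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Acme", "north"), (2, "Birch", "south"),
                  (3, "Cedar", "east"), (4, "Dune", "north")])
conn.executemany("INSERT INTO maintenance VALUES (?, ?)",
                 [(1, 400), (2, 30), (3, 200), (4, 10)])


def customers_at_risk(conn, limit=100, stale_after_days=90):
    """Rank customers by forecast damage risk, keeping only those not secured recently."""
    query = """
        SELECT c.name, w.damage_risk, m.days_since_secured
        FROM customers AS c
        JOIN weather_forecast AS w ON w.region = c.region
        JOIN maintenance AS m ON m.customer_id = c.customer_id
        WHERE m.days_since_secured > ?
        ORDER BY w.damage_risk DESC, m.days_since_secured DESC
        LIMIT ?
    """
    return conn.execute(query, (stale_after_days, limit)).fetchall()


at_risk = customers_at_risk(conn)
print(at_risk)

# Simulate a forecast update arriving, then re-run the same query.
conn.execute("UPDATE weather_forecast SET damage_risk = 0.95 WHERE region = 'east'")
refreshed = customers_at_risk(conn)
print(refreshed)
```

The point of the sketch is that the dashboard asks one question of one store. When the forecast row changes, re-running the same query reorders the result, with no separate integration step between the three data domains.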
The technical extensions of a UADF add approximate queries, caching, and MVCC to Spark so it can handle all these data domains at the same time.
So Spark and all your data go better together, like peanut butter and jelly.
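
MVCC (multi-version concurrency control) is what lets analytics queries keep reading while transactions update the same data. The toy Python sketch below illustrates the general idea only. It is not SnappyData's implementation, and the class and method names are made up for this example: writers append new versions, and each reader sees the newest version committed at or before its snapshot.

```python
import itertools


class MVCCStore:
    """Toy multi-version key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._versions = {}               # key -> list of (commit_ts, value), oldest first
        self._clock = itertools.count(1)  # monotonically increasing commit timestamps
        self._last_commit = 0

    def write(self, key, value):
        """Commit a new version of key; older versions stay readable."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def snapshot(self):
        """Return the timestamp a reader uses to get a consistent view."""
        return self._last_commit

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts, or None."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("sensor-1", 10)
snap = store.snapshot()        # a long-running analytics query starts here
store.write("sensor-1", 20)    # a transaction updates the value meanwhile

old_view = store.read("sensor-1", snap)
new_view = store.read("sensor-1", store.snapshot())
print(old_view, new_view)
```

The reader that took its snapshot before the second write still sees the old value and is never blocked by the writer. A production system would also need garbage collection of old versions and conflict detection between concurrent writers, both omitted here.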
#2: Leverage what you have

To use a UADF with Spark, you simply download, configure, and go.
You leverage everything you already know and love about Spark — your data, your skills, your infrastructure.
And, because it’s SQL-based, most BI and data science tools “just work” with a UADF.
So, you leverage your investment in the BI and data science tools you already use.
As a result, you increase productivity and declutter your tech fabric at the same time.
#3: Lightning-fast BI

For BI performance alone, the Unified Analytics Data Fabric screams.
Out of the box, simple analytics queries out-perform vanilla Apache Spark by 12–20X.
You can try it for yourself here.
For example, a large conglomerate reduced 120 ERP systems to 45 by unifying financial reporting and tax data, saving millions in infrastructure cost and reduced fines from delays in tax reporting.
One Spotfire pharmaceutical customer uses it just for that: to speed up BI performance on their Spark repository.
#4: Unify streaming BI with streaming history

Streaming BI is a recent innovation in the analytics space.
It provides a continuous, live analytics experience when attached to streaming data.
The results bring BI to operational systems for the first time.
The example below shows Streaming BI in action for a Formula One race car.
Embedded IoT sensors stream data as the car speeds around the track.
Analysts see a real-time, continuous view of the car’s position and data: throttle, RPM, brake pressure — potentially hundreds, or thousands of metrics.
By visualizing some of those metrics, a race strategist can see what static snapshots could never reveal: motion, direction, relationships, the rate of change.
Like an analytics surveillance camera.
Streaming Business Intelligence allows business analysts to query real-time data.
By embedding data science models into the streaming engine, those queries can also include predictions from models scored in real time.
But notice the graph that says: “Gear color by difference from ideal gear.” How would an analyst know what the ideal gear is, given the current weather conditions, the car configuration, and this particular track?

This kind of comparison might require deep streaming history, comparing current conditions to previous races, or even to practice laps from a few minutes ago. That data can ideally be stored in a UADF for comparison with real-time data, and be used later for deep learning and machine learning.
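
As a rough illustration of comparing live readings against streaming history, the Python sketch below derives an “ideal gear” from past laps and then scores each live reading against it. The record layout (track sector, speed in km/h, gear), the bucketing by speed, and the rounded-average rule are all assumptions made for this example, not how any particular product computes it.

```python
from collections import defaultdict
from statistics import mean


def build_ideal_gear(history, speed_bucket=50):
    """Summarise past laps: rounded average gear per (sector, speed bucket)."""
    samples = defaultdict(list)
    for sector, speed_kph, gear in history:
        samples[(sector, speed_kph // speed_bucket)].append(gear)
    return {key: round(mean(gears)) for key, gears in samples.items()}


def gear_differences(stream, ideal, speed_bucket=50):
    """Yield (sector, gear, difference from ideal) for each live reading."""
    for sector, speed_kph, gear in stream:
        target = ideal.get((sector, speed_kph // speed_bucket))
        # No history for this sector and speed: report None instead of guessing.
        diff = None if target is None else gear - target
        yield sector, gear, diff


# Hypothetical history from earlier laps: (sector, speed_kph, gear).
history = [(1, 210, 6), (1, 220, 6), (2, 90, 3), (2, 95, 3), (2, 80, 3)]
ideal = build_ideal_gear(history)

# Hypothetical live readings arriving from the car.
live = [(1, 215, 5), (2, 85, 3), (3, 120, 4)]
results = list(gear_differences(live, ideal))
print(results)
```

A chart could then color each point by the third field. The design point is that the historical summary and the live stream are read from the same place, so the comparison needs no export step.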
#5: New opportunities to get a grip on your data

By unifying master data management, data virtualization, and integration technology, Spark can become the foundation of a complete data management platform.
Often, these tools live on their own islands; by carefully connecting them, you can get a better grip on enterprise data.
The Unified Analytics Data Fabric goes global

If you’ve embraced Spark for BI, data science, and IoT applications, it might be time to check out UADF technology.
We’ve gone all in at TIBCO and look forward to working with the open source community to keep making it better.
We think it’s a key technology to provide lightning-fast performance for analytics and improve productivity, all at once.