4 most used languages in big data projects: Scala

Kiarash Irandoust
Jan 1, 2017

This article is the last part in a series about Programming Languages for Big Data projects.
Other articles that were published in this series can be found here:

#1: 4 most used languages in big data projects: Java
#2: 4 most used languages in big data projects: Python
#3: 4 most used languages in big data projects: R

Java, Python, R, and Scala are commonly used in big data projects.
In a series of articles, I am describing these languages briefly and the reasons for their popularity among data scientists.
Java, Python and R were described in the previous articles.
This article focuses on Scala, and provides an overview of this language and why it is common for big data projects.
Scala

Scala, which stands for “scalable language”, is an open source, multi-paradigm, high-level programming language with a robust static type system.
Its type system supports parameterization and abstraction.
Scala is hailed for integrating functional and object-oriented features.
As a result, every value is an object and every operator is a method, while you can pass functions around as variables.
Scala has a flexible and modular mixin composition that unites the advantages of mixins and traits (it allows programmers to reuse the new definitions of a class, that is, the ones that are not inherited).
It also has a syntax that supports anonymous functions as well as higher-order functions.
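A minimal sketch of these ideas follows. The object name `FunctionalOoSketch` and the helper `applyTwice` are illustrative names chosen for this example and do not come from any library.

```scala
object FunctionalOoSketch {
  // Every value is an object: even a number has methods.
  val three: Int = (1).+(2)        // the operator + is an ordinary method call
  val text: String = 42.toString   // calling a method on an Int

  // An anonymous function stored in a variable.
  val double: Int => Int = x => x * 2

  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Functions are passed around like any other value.
  val eight: Int = applyTwice(double, 2)
  val incremented: List[Int] = List(1, 2, 3).map(_ + 1)
}
```

Writing `1 + 2` and `(1).+(2)` means the same thing; the infix form is simply the more readable notation for the method call.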
Scala also supports other paradigms including imperative and declarative.
However, Scala has an advantage over conventional imperative programming languages when it comes to parallelisation.
Scala makes it possible to describe algorithms at a higher level of abstraction.
As Wilkinson describes, this abstraction allows the exact same algorithm to be run in serial, in parallel across available cores on a single machine, or in parallel across a cluster of machines, without changing any code.
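The idea can be sketched with nothing but the standard library. This is only an illustration under stated assumptions: the names `ParallelSketch`, `Mapper`, `serial`, `parallel` and `sumOfSquares` are made up for this example, and `Future` is used as a stand-in for whatever parallel or distributed collection a real project would use.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // A mapping strategy decides how a function is applied to every element.
  type Mapper = (Vector[Int], Int => Long) => Vector[Long]

  // Serial strategy: an ordinary map on the calling thread.
  val serial: Mapper = (xs, f) => xs.map(f)

  // Parallel strategy: one Future per element, run on the global thread pool.
  val parallel: Mapper = (xs, f) =>
    Await.result(Future.traverse(xs)(x => Future(f(x))), 30.seconds)

  // The algorithm is written once and does not know how it will be executed.
  def sumOfSquares(xs: Vector[Int], mapper: Mapper): Long =
    mapper(xs, x => x.toLong * x).sum
}
```

Calling `ParallelSketch.sumOfSquares(data, ParallelSketch.serial)` or passing `ParallelSketch.parallel` instead gives the same result; only the execution strategy changes.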
Scala runs on the Java Virtual Machine and interoperates seamlessly with Java.
It is possible to directly use Java libraries, call Java code, implement Java interfaces, and subclass Java classes in Scala, and vice versa.
However, some Scala features, including traits with defined methods and Scala’s advanced types, cannot be accessed from Java.
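As a small illustration of the Scala-to-Java direction, the sketch below uses classes from the JDK only. The names `JavaInteropSketch`, `ByLength`, `UpperCaseList` and `sortedByLength` are hypothetical and exist just for this example.

```scala
import java.util.{ArrayList, Collections, Comparator}

object JavaInteropSketch {
  // Implementing a Java interface in Scala.
  class ByLength extends Comparator[String] {
    override def compare(a: String, b: String): Int =
      Integer.compare(a.length, b.length)
  }

  // Subclassing a Java class in Scala and overriding one of its methods.
  class UpperCaseList extends ArrayList[String] {
    override def add(s: String): Boolean = super.add(s.toUpperCase)
  }

  // Using a Java library (java.util.Collections) directly.
  def sortedByLength(words: String*): List[String] = {
    val list = new UpperCaseList
    words.foreach(w => list.add(w))
    Collections.sort(list, new ByLength)

    // Copy the Java list into an immutable Scala List.
    val builder = List.newBuilder[String]
    val it = list.iterator()
    while (it.hasNext) builder += it.next()
    builder.result()
  }
}
```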
Furthermore, the Scala programming language is concise.
Several loops can be replaced by a single word, which makes it significantly less verbose than standard Java.
In addition, its statically typed and functional nature makes it type-safe.
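To illustrate the point about loops, here is a sketch that computes the same value twice: once with an explicit loop and mutable state, and once with collection methods. The object and method names are invented for this example.

```scala
object ConcisenessSketch {
  // Imperative version: an explicit loop with a mutable accumulator.
  def sumEvenSquaresLoop(xs: Seq[Int]): Int = {
    var total = 0
    for (x <- xs) {
      if (x % 2 == 0) total += x * x
    }
    total
  }

  // Functional version: each step of the loop becomes a single method name.
  def sumEvenSquares(xs: Seq[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).sum
}
```

Both versions are statically checked: passing a `Seq[String]` to either method is rejected at compile time.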
It is worth noting that in a comparison, a well-specified, compact algorithm was implemented in four languages: Scala, Java, Go, and C++.
Scala’s concise notation and powerful language features “allowed for the best optimization of code complexity”.
Scala has undeniably become a decisive tool for data science and machine learning at large scale.
Several big names including Twitter, LinkedIn, and The Guardian built their websites with Scala.
This progression is mainly due to:

1. Scala is a concise programming language

Scala strikes a good balance between readability and conciseness.
As a result, the code is easier to understand.
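A small sketch of what such concise yet readable code can look like is given here. It leans on type inference and pattern matching; the `Event` hierarchy and the `describe` function are hypothetical names used only for illustration.

```scala
object ConciseSketch {
  // Case classes give a compact way to model data.
  sealed trait Event
  case class Click(x: Int, y: Int) extends Event
  case class KeyPress(key: Char) extends Event
  case object Quit extends Event

  // Pattern matching: cases are tried top to bottom and the first match wins.
  def describe(e: Event): String = e match {
    case Click(0, 0) => "click at origin"
    case Click(x, y) => s"click at ($x, $y)"
    case KeyPress(k) => s"key $k"
    case Quit        => "quit"
  }

  // Type inference: no type annotations are written here,
  // the compiler works out the types of both lists on its own.
  val events = List(Click(0, 0), Click(3, 4), KeyPress('q'), Quit)
  val descriptions = events.map(describe)
}
```

Because of the first-match policy, `Click(0, 0)` is handled by the first case even though the more general `Click(x, y)` case would also match it.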
Scala’s conciseness is mainly due to:

Its type inference: in contrast to other functional languages, Scala’s type inference is local. Since types can be inferred by the compiler, they get out of the way.
Its pattern matching mechanism: the second most used feature of Scala, which allows matching on any sort of data with a first-match policy. Brian Clapper has a good introduction to pattern matching in Scala.
The ability to use functions as variables and to reuse utility functions.

2. Cutting-edge class composition

As an object-oriented language, Scala allows classes to be extended through subclassing and through a flexible mixin composition, which is a brilliant way to reuse code and a replacement for multiple inheritance that avoids inheritance ambiguity.
Furthermore, modular mixin composition combines the advantages of mixins and traits.
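A minimal sketch of mixin composition with traits follows. The trait and class names (`Greeter`, `Loud`, `Polite`, `Plain`) are made up for this example. When two traits override the same method, Scala resolves the call by the order in which the traits are mixed in, so there is no ambiguity.

```scala
object MixinSketch {
  trait Greeter {
    def greet(name: String): String = s"Hello, $name"
  }

  // Each trait refines greet and delegates to whatever comes before it.
  trait Loud extends Greeter {
    override def greet(name: String): String = super.greet(name).toUpperCase
  }
  trait Polite extends Greeter {
    override def greet(name: String): String = super.greet(name) + ", please"
  }

  class Plain extends Greeter

  // The trait mixed in last is applied last.
  val loudThenPolite = new Plain with Loud with Polite
  val politeThenLoud = new Plain with Polite with Loud
}
```

Here `loudThenPolite.greet("scala")` upper-cases the greeting first and then appends the suffix, while `politeThenLoud.greet("scala")` appends the suffix first and then upper-cases the whole string.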
Streams processing in real-time

While Hadoop MapReduce can process and generate large datasets in parallel, it has been criticized for its inability to handle real-time stream processing.
Spark gives Scala an edge over other programming languages for processing streams in real time.
It has made Scala the computational engine for fast data processing.
Scala’s vast ecosystem due to seamless interoperability with Java

Scala integrates perfectly with the big data ecosystem, which is largely Java based.
Java libraries, IDEs (such as Eclipse and IntelliJ), frameworks (like Spring and Hibernate) and tools all work flawlessly with Scala.
Even popular frameworks contain dual APIs for Java and Scala.
The following frameworks and APIs are commonly used in big data projects by Scala programmers:

Apache Spark: a fast and general-purpose framework for large-scale data processing

Spark is written in Scala and runs on the JVM.
It provides APIs in Java, Scala, and Python, and also supports R and Clojure.
The Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark.
Its common properties are that it is immutable, distributed, lazily evaluated, and cacheable.
Spark also has additional libraries for big data analytics and machine learning, including Spark Streaming, Spark SQL, Spark MLlib (machine learning), and Spark GraphX (graph analytics).
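Three of the RDD properties named above (immutable, lazily evaluated, cacheable) can be illustrated with plain Scala collections. The sketch below is only an analogy and does not use the Spark API; a collection view stands in for a lazy transformation, and the names `LazySketch`, `evaluations` and `cached` are invented for this example.

```scala
object LazySketch {
  // Counts how often the mapping function actually runs.
  var evaluations = 0

  // Immutable source data: transformations never modify it.
  val source: Vector[Int] = Vector(1, 2, 3)

  // Lazy transformation: building the view does no work yet.
  val doubledView = source.view.map { x => evaluations += 1; x * 2 }
  val evaluationsBeforeUse: Int = evaluations

  // Forcing the view runs the computation. Keeping the result in a val
  // plays the role of caching it for later reuse.
  val cached: Vector[Int] = doubledView.toVector
  val evaluationsAfterUse: Int = evaluations
}
```

In Spark the same pattern appears as transformations that are only described, followed by an action that triggers the actual work.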
Apache Flink: a framework for distributed stream and batch data processing

Flink’s core is a hybrid (real-time streaming + batch) distributed data processing engine written in Java and Scala.
Flink contains several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API), as well as domain-specific libraries for machine learning (FlinkML, pure Scala), complex event processing (CEP) and graph processing (Gelly).
Apache Kafka: a distributed streaming platform for handling real-time data feeds

Written in Java and Scala, Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
It works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
The Kafka documentation and Cloudera describe Kafka’s design and implementation aspects.
Note that Kafka Manager (by Yahoo) is an open source web-based tool for managing Apache Kafka.
Apache Samza: a distributed stream-processing framework

Apache Samza uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
Samza is similar to Apache Storm, but it is easier to operate.
For an example of a Samza stream processing job written in Scala, check this GitHub page.
Akka: a concurrency framework for building distributed applications

Akka is an actor-based, message-driven runtime for managing concurrency, elasticity and resilience on the JVM that supports Java and Scala.
Akka uses the actor model, which is an ideal model for highly scalable and concurrent systems.
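The core of the actor model is that an actor owns its state and handles one message at a time from a mailbox. The toy sketch below shows that idea with JDK classes only. It is not Akka’s API; `CounterActor`, `send`, `stopAndGet` and the message types are hypothetical names for this illustration.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object ActorSketch {
  // Messages the toy actor understands.
  sealed trait Message
  case object Increment extends Message
  case class Add(n: Int) extends Message

  final class CounterActor {
    // A single-threaded executor acts as the mailbox:
    // messages are handled one at a time, in the order they were sent.
    private val mailbox = Executors.newSingleThreadExecutor()
    private val count = new AtomicInteger(0)

    // Sending is asynchronous: the caller does not wait for handling.
    def send(msg: Message): Unit = mailbox.execute(new Runnable {
      def run(): Unit = msg match {
        case Increment => count.incrementAndGet(); ()
        case Add(n)    => count.addAndGet(n); ()
      }
    })

    // Stop accepting messages, drain the mailbox, and read the final state.
    def stopAndGet(): Int = {
      mailbox.shutdown()
      mailbox.awaitTermination(5, TimeUnit.SECONDS)
      count.get()
    }
  }

  def demo(): Int = {
    val actor = new CounterActor
    actor.send(Increment)
    actor.send(Add(41))
    actor.stopAndGet()
  }
}
```

Because only the mailbox thread ever changes the counter, callers can send messages from many threads without any locking of their own.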
Summingbird: a framework for integrating batch and online MapReduce computations

Summingbird is a data processing framework for streaming MapReduce in a type-safe way.
It provides a domain-specific language (DSL) implemented in Scala for expressing analytical queries, which generates either Hadoop jobs (batch computations, using Scalding/Cascading) or Storm topologies (online computations) without requiring any changes to the program logic.
It can also operate in a hybrid batch/online processing mode.
Scalding: a Scala API for Cascading, an abstraction of MapReduce

Built on top of Cascading, a Java library that abstracts Hadoop MapReduce, Scalding simplifies writing MapReduce jobs in Scala.
Scalding is comparable to Pig, while offering tight integration with Scala.

Scrunch: a framework for writing, testing, and running MapReduce pipelines

Scrunch is a Scala wrapper for Apache Crunch, which provides a framework for writing, testing, and running MapReduce pipelines.

Furthermore, there are several Java-based data storage solutions that work well with Scala, including Apache Cassandra (phantom and Cassie are Scala Cassandra clients), Apache HBase, Datomic, and Voldemort.
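The MapReduce-oriented frameworks listed above (Summingbird, Scalding, Scrunch) all express a job as map and reduce steps over a collection of records. The shape of such a job can be sketched with ordinary Scala collections on a single machine. This is not the API of any of those frameworks; `WordCountSketch` and `wordCount` are names chosen for this example.

```scala
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // "Map" step: emit a (word, 1) pair for every word in every line.
    val pairs = lines
      .flatMap(_.toLowerCase.split("\\s+").toList)
      .filter(_.nonEmpty)
      .map(word => (word, 1))

    // "Shuffle" step: bring all pairs with the same key together.
    val grouped = pairs.groupBy { case (word, _) => word }

    // "Reduce" step: sum the counts for each word.
    grouped.map { case (word, ps) => word -> ps.map(_._2).sum }
  }
}
```

A framework such as Scalding lets a job of this shape run on a Hadoop cluster instead of on in-memory collections.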
Several libraries for data science and data analysis

Even though Scala’s libraries are not as comprehensive as those of Python or R, they provide a solid foundation for big data projects.
Awesome Machine Learning, a curated list of machine learning frameworks, libraries and software (covering several languages), presents a list of useful Scala libraries and tools for machine learning, data analysis, data visualization, and NLP.
In addition, Typelevel provides several helpful libraries and extensions to Scala.
The following are a few of the most used machine learning and data analysis libraries:

Saddle: a high-performance data manipulation library (strongly influenced by the pandas library for Python)
ScalaNLP: a suite of libraries, including Breeze (a set of libraries for machine learning and numerical computing) and Epic (a high-performance statistical parser and structured prediction library)
Apache Spark MLlib: a machine learning library for Scala, Java, Python, and R
Apache PredictionIO: a machine learning server based on Apache Spark, HBase and Spray that can be installed as a full machine learning stack
DEEPLEARNING4J: a distributed deep-learning library for Java and Scala
Scala-datatable and Framian: libraries for data frames and data tables (Darren Wilkinson’s research blog has a great post on this topic)

Note that Awesome Scala and Scala Wiki categorize and list some useful libraries, frameworks, and tools available for Scala.
Scaladex represents a map of all published Scala libraries.
Updates related to Scala libraries are announced by implicit.ly when the latest versions become available.
A vibrant growing community

Scala has an active community that is expanding rapidly.
According to the KDnuggets Analytics/Data Science 2016 Software Poll, Scala was among the tools with the highest growth.
Scala has an active community on Stack Overflow, in addition to its large community on GitHub and Reddit.
Furthermore, Scala has three Gitter channels for the users of GitHub repositories: scala/scala (for general discussion and questions), scala/contributors (for contributors to discuss work on changes to Scala), and spark-scala/Lobby (for discussions and questions about using Scala for Spark programming).
For the latest trends related to Scala on GitHub, check Trending Scala repositories.
Scala Times (weekly Scala newspaper) and This week in Scala (published weekly by cakesolutions blog) are good sources for the latest information in the Scala world.
In addition, Scala Space and Scala meetup provide information about Scala meetup groups around the world.
Scala also has good community-powered learning resources, including:

Scala Exercises
Scala School
Effective Scala
Scala Puzzlers
Interactive Tour
Learning Scala in Small Bites
Learning Scala
Haoyi’s Programming Blog

To conclude, the advantages and disadvantages of Scala are described below.

Advantages

Scala unifies object-oriented and functional programming
As an object-oriented programming language, all values are objects, and the types and behavior of objects are described by classes and traits.
As a functional programming language, every function is a value.
Scala includes many functional programming features, including currying, type inference, immutability, lazy evaluation, and pattern matching (Java lacks these features).

Conciseness and robustness

Scala is designed for concurrency and parallelism (implicit parallelism in parallel collections)
Several of Hadoop’s high-performance data frameworks are written in Scala or Java.
The main reason for using Scala in these environments is its amazing concurrency support, which is key to parallelizing the processing of large data sets.

Seamless interoperability with Java
Scala runs on the JVM, hence Java classes and libraries may be used directly in Scala code and vice versa.

Besides access to Java’s vast ecosystem, Scala has a wide variety of native libraries for scientific computing and big data projects

Scala has immutability by default, which is built into the standard library

Scala allows imperative programming
While imperative and functional styles are in contrast, Scala is designed to support both functional constructs and imperative constructs.

Scala includes a useful REPL for interactive use

Scala has native tuples

Scala is a type-safe programming language
Scala has a strong and static type system that unifies algebraic data types with class hierarchies.
Scala has built-in type inference that allows certain type annotations to be omitted.
Scala’s type system enables abstract types and path-dependent types, which apply the vObj calculus to a concrete language design.

Packages that enable a bridge between Scala and other programming languages, such as the rscala package, a bidirectional interface between R and Scala with callbacks

Disadvantages

Originally Scala was written to be a better Java. However, especially for beginners, the shift is not easy, and the language is difficult to learn or adopt.
In addition, Scala has weak tool and IDE support compared to Java.
Furthermore, unless you have a superfast processor, the Scala compiler can take a large amount of time compared to Java.
In an attempt to address the criticisms of Scala, the Dotty project was created.