Programming language that rules the Data Intensive (Big Data+Fast Data) frameworks.
Md KamaruzzamanBlockedUnblockFollowFollowingJan 6Back in the good old days, companies primarily used one main programming language (e.
C,C++, Java, C#,…), one type of database (SQL) and two data exchange formats (XML, JSON).
The situation has changed since the beginning of 21st century with the rise of internet and mobile devices.
Every year, the volume and speed of data manifolds and continues to grow exponentially.
Large IT companies like Amazon, Google suddenly found that the existing tools and frameworks were not enough to handle those Big, Fast data.
These companies developed closed source frameworks to work with “Web scale” data and shared their concept/architecture in form of papers.
In 2003, Google published the Google File System paper followed by the Map Reduce paper and Big Table paper.
Also Amazon published the Dynamo paper in 2007.
As the whole Industry was facing the same problem of handling “Web scale” data, developer community and industry adapted the concept of these papers and developed systems like Hadoop ecosystem, NoSQL databases and other frameworks.
With the proliferation of GitHub and open source friendliness, many more Data Intensive frameworks was developed and adapted by industry in the following years.
Recently I was reading the excellent book: “Designing Data-Intensive Applications ” written by Martin Kleppmann which gives a comprehensive and in-depth overview on various Data Intensive frameworks.
While reading that book, one question popped up in my mind: what is the most used Programming Language in Data Intensive frameworks?Here I will try to find the most used programming language among the Open Source Data Intensive frameworks.
As there are many Data Intensive frameworks/libraries, I will mainly focus on top open source frameworks in each category.
Also, if a framework/library is written in polyglot programming languages, I will only pick up only the main language.
Search EnginesIn 1997, Doug Cutting was trying to develop a full text search engine using the then “new kid on the block” programming language Java.
He finished developing basic feature in 3 months and released the Java Library as Lucene for Information search/retrieval in 1999.
Lucene is a Java library and could not be integrated with other programming languages.
Apache Solr is the search Engine which offers a HTTP wrapper around Lucene so that it could be integrated with other programming languages.
Solr also offers scalable, distributed full text search with SolrCloud.
Like Lucene, Solr was also developed in Java and first open sourced in 2006.
To search recipes for his wife, Shay Banon has developed distributed search engine Elasticsearch (former Compass).
Elasticsearch offers a RESTful API around Lucene as well as complete stack (ELK) for distributed full text search.
It was released in 2010 and developed in Java.
Sphinx is the final distributed search engine in this list.
Back in 2001, Andrew Aksyonoff started developing search engine Sphinx to search database driven web sites.
Sphinx offers REST as well as well SQL like (SphinxQL) search client and developed in C++.
Lucene: JavaSolr: JavaElasticsearch: JavaSphinx: C++File SystemsIn 2002, Doug Cutting and Mike Cafarella was working on Project Nutch to crawl and index the whole internet.
In 2003, they hit the limit of one machine and wanted to scale it on four machines but faced the problem of moving data between nodes, space allocation (History of Hadoop).
In 2003, Google published its Google File System paper.
Taking the blueprint of Google’s paper, Doug Cutting and Mike Cafarella has developed the Hadoop Distributed File System HDFS (then named NDFS) which offers schema less, durable, fault tolerant distributed storage system.
HDFS played a key role in the development of Hadoop ecosystem and Big Data frameworks.
Although Hadoop MapReduce is losing its appeal to newer frameworks (Spark/Flank), HDFS is still used as the de facto distributed file system.
Like other Hadoop frameworks, it is developed in Java and open sourced in 2005.
In Gluster Inc.
, Anand Babu Periasamy has developed a software defined distributed storage GlusterFS for very large-scale data.
Open sourced in 2010, GlusterFS is written in C.
Inspired by the Facebook paper Finding a needle in Haystack: Facebook’s photo storage, Chris Lu has implemented a simple and highly scalable distributed file system named SeaweedFS which can store billions of files.
It was open sourced in 2011 and written in Go.
Another powerful distributed storage platform is Ceph which offers Object, Block and File level storage.
Ceph was initially developed by Inktank and currently developed by a group of companies lead by Red Hat.
Although it was announced first in 2006, Ceph has its first stable release in 2012.
Its File Systems, CephFS is written in Rust although Ceph Object storage is written in C++.
Fraunhofer Institute in Germany has developed a Parallel Cluster file system BeeGFS optimized for high performance computing focusing on flexibility, scalability and usability.
The first beta version of BeeGFS was released in 2007.
It is developed in C++ in a very compact way (~40k lines of Code).
Although BeeGFS is a parallel file system and used extensively in the Super Computers, it can be used as a faster alternative to HDFS.
HDFS: JavaGlusterFS: CSeaweedFS: GoCephFS: RustBeeGFS: C++Batch ProcessingAfter developing HDFS, Doug Cutting and Mike Cafarella wanted to run large scale computing on multiple machines for Project Nutch.
Inspired by the Google Map Reduce paper, Doug Cutting and a team at Yahoo started to develop large scale, distributed computing using commodity hardware and Hadoop MapReduce was born.
First open sourced in 2005, Hadoop MapReduce along with HDFS was hugely successful and paved the way for a whole ecosystem of Data Intensive frameworks including Yarn, HBase, Zookeeper and others.
Like Lucene and Nutch, Hadoop MapReduce and almost whole Hadoop ecosystem was developed in Java and arguably set Java the as the de-facto programming language in the Data Intensive domain.
If Hadoop MapReduce has pioneered large scale distributed computing, then Apache Spark is the most dominant Batch processing framework at this moment.
Developed in famous Berkley AMPLab lead by Matei Zaharia, it addresses the limitations of Hadoop MapReduce framework (e.
storing data after every Map Job).
Spark additionally offers a way to compose multiple Job as a data processing pipeline in similar way like Unix and gives developer friendly API to program the distributed tasks using RDD, Dataset, DataFrame.
Spark is developed mainly in Scala and released in 2014.
Apache Flink is another very popular and promising Cluster computing framework.
Like Spark, it also came from academia and originated from TU Berlin Big Data project Stratosphere.
Unlike Spark which is a Batch-first framework, Flink is a Stream first framework and handles Batch Job as a special case of Stream.
Like Spark, Flink also offers developer friendly API to program distributed computing using DataSet, DataStream.
Flink is primarily written in Java and released in 2015.
Hadoop MapReduce: JavaSpark: ScalaFlink: JavaStream ProcessingAfter the release of Hadoop MapReduce, it was widely accepted by the community and companies started to process their large-scale data with Hadoop MapReduce.
Soon after, companies needed Real Time distributed Stream processing frameworks because in reality, data is produced mainly as Stream.
Processing Stream in Real Time for large scale data is more difficult than processing Batch job due to many factors (e.
Fault Tolerance, State Management, Timing issues, Windowing …).
The first open source Real Time distributed Stream processing framework was Apache Storm developed by Nathan Marz and released in 2011 with the slogan “The Hadoop of Real-Time”.
It was developed in Java.
Also Apache Spark offers a Stream processing framework „Spark Streaming“ which uses Micro-Batch approach for Stream processing with near Real Time processing.
If Apache Storm was the framework that popularized the distributed Stream processing, then Apache Flink has taken distributed Stream processing one step further and is arguably the current leader in distributed Real Time Stream processing.
Along with low latency Stream processing, it additionally offers some innovative and advanced features like State Management, exactly once handling, Iterative processing, Windowing.
Another interesting distributed Stream processing framework is Apache NiFi developed by NSA in 2006.
Although it does not offer advanced Stream processing features like Flink, it is a very useful framework for simple use cases with large scale data.
Apache NiFi also developed in Java and open sourced in 2014.
Apache Samza is the final Stream processing framework in this list developed in Linkedin.
Samza offers some advanced Stream processing features but tightly coupled with Apache Kafka and Yarn.
It is also developed in Java and released in 2005.
Storm: JavaSpark: ScalaFlink: JavaNiFi: JavaSamza: JavaMessagingMessaging is one of the preferred ways to communicate between processes/Micro Services/systems because of its asynchronous nature.
Predominantly, messaging is implemented via a Queue where producers write messages in Queue, consumers read messages from Queue and once all consumers read the message, it is deleted from the Queue (fire and forget style).
When Linkedin had broken their very large Monolith application to many Micro Services, they had the issue of handling data flow between the Micro Services in a scalable way without any single point of failure.
A group of Linkedin Engineers (Jey Kreps, Neha Narkhede, Jun Rao) have used distributed Log (like write ahead log in Databases) instead of Pub/Sub Queue for messaging and developed Apache Kafka.
Like File systems or databases in Batch processing, Kafka works as a “single source of truth” in the Stream processing world and used heavily for data flow between Microservices.
Kafka is developed mainly in Java and first open sourced in 2011.
Yahoo has developed Apache Pulsar as a distributed Messaging framework which offers both Log based messaging (like Kafka) and Queue based messaging.
Apache Pulsar was open sourced in 2016 and mainly developed in Java.
Rabbit Technologies Inc.
has developed the traditional Queue based Pub-Sub system RabbitMQ.
It was first released in 2007 and developed using Erlang.
ActiveMQ is another Queue based Pub-Sub system which implements the Java Message Service (JMS).
It was initially developed by the company LogicBlaze and open sourced in 2004.
ActiveMQ is developed in Java.
A relatively new but very promising Queue based distributed pub-sub system is NATS.
It is developed by Derek Collison in CloudFoundry on a weekend using Ruby.
NATS was released in 2016 and later re-implemented in Go.
Kafka: JavaPulsar: JavaRabbitMQ: ErlangActiveMQ: JavaNATS: GoDatabasesI have used the DB-Engines ranking to get the top 5 distributed databases of 21st century.
According to this ranking, MongoDB is ranked 5th and only superseded by the Big Four SQL databases.
MongoDB is a distributed document-oriented database developed by 10gen (currently MongoDB Inc.
) and released in 2009.
It is one of the most popular and mostly used NoSQL databases written mainly in C++.
Another hugely popular NoSQL database is Redis which is a distributed key-value store database.
Back in 2001, an Italian developer Salvatore Sanfilippo started developing Redis to solve the scalability problem of his startup.
Released in 2009, Redis is written in C.
Avinash Laskshman (co-author of Amazon Dynamo paper) and Prashant Malik has developed Cassandra in Facebook to improve Facebook’s inbox search.
Heavily influenced by the Google BigTable and Amazon Dynamo paper, Cassandra was open sourced by Facebook in 2008.
Cassandra is both a distributed key-value store and wide column store which is predominantly used for time series or OLAP data and written in Java.
Part of the Apache Hadoop project, HBase is developed as a distributed wide column database and released in 2008.
Like almost the whole Hadoop ecosystem, HBase is developed in Java.
Neo4j is the final NoSQL database in this list and de-facto standard as Graph Database.
Developed by Neo4j, Inc, it was released in 2007 and written in Java.
MongoDB: C++Redis: CCassandra: JavaHBase: JavaNeo4j: JavaVerdictIf we consider the programming languages used in most dominant Data Intensive frameworks, then there is one clear winner: Java.
Although criticized in the early days as a slow language, Java has clearly surpassed the near Metal languages like C, C++ or some concurrency friendly languages like Erlang, Scala in this field.
Here I am listing five main reasons for the success of Java in the Data Intensive frameworks:JVM: Although process virtual machines exist since 1966, JVM has arguably taken the virtual machine concept in another level and played a huge role in the popularity of Java and use of Java in the Data Intensive frameworks.
With JVM, Java has abstracted the low-level machine from developers and gave the first popular “Write Once, Run anywhere” programming language.
Also, with the support of Generation Garbage Collection, developer does not need to worry about Object lifecycle management (although it has own share of problems) and can fully concentrate on the domain problem.
First Sun and later Oracle has improved JVM over the years and currently JVM is the battle-hardened behemoth which has passed the test of time.
There is a reason why so many languages (e.
Kotlin, Scala, Clojure) also uses JVM instead of developing their own Virtual Machine.
Language Support and Frameworks: As Java was developed 23 years after C and 10 years after C++, James Gosling and co.
at Sun has clearly looked at the pain points of C/C++ and tried to address those issues in Java.
Java has introduced language support for Multithreading, Exception/Error handling, Serialization, RTTI, Reflections from the beginning.
Also, Java offers a large ecosystem of libraries which make the Software development in Java mush easier.
It is true that with similar software architecture and algorithm, C/C++ can vastly outperform Java in most of the time.
But in lots of time, software architecture and programming paradigm plays bigger role on the performance of an application rather than programming language.
This is the reason the Hadoop MapReduce framework developed in Java has outperformed similar framework developed in C++ at Yahoo.
Developer Productivity: Developer productivity is the Holy Grail in the industry as most of the time, developer hour is more expensive resource than CPU/Memory.
When Java first came in 1995, it was much simpler, leaner compared to C/C++ which leads to faster development time.
As Bjarne Stroustrup (creator of C++) has pointed out, Java has evolved later to a heavier, complex language but during 2000–2010 when most of the dominant Data Intensive frameworks were developed, Java was probably the best language which stroke the right balance between developer productivity and performance.
Also, developer productivity leads to shorter development cycle which was significantly important to the early Data Intensive frameworks.
Fault Tolerance: One of the main goal of Data Intensive frameworks is to use cheap commodity hardware instead of costly special hardware.
The downside of this approach is that commodity hardware can fail and Fault tolerance is the cornerstone of Data Intensive applications.
With its in-built language support of Error Handling, Generational Garbage collection, Bound checking and hardware agnostic programming, Java offers better Fault tolerance compared to C/C++.
Hadoop Factor: The use of Java in Hadoop ecosystem played a big role to the general adaptation of Java in the Data Intensive domain and worked as an advertisement for Java.
As HDFS and Hadoop MapReduce was arguably the first disruptive Data Intensive frameworks, the later frameworks always looked at them and has used the Hadoop proven Java to develop their own Data Intensive framework.
FutureEvery year, the volume and speed of data is increasing exponentially and bringing new challenges.
Also, the Data Intensive landscape is so diverse that the age of “one solution for all” is over.
Although with the Advent of GraalVM and newer Garbage collections (ZGC), Java will be dominant in the Big Data landscape, however I believe Java will concede its ground to some other languages.
What are the programming languages which can displace Java in the Data Intensive frameworks in coming years?.I hope to write in a follow up post.