First, there are HDFS clusters with 600+ PB of capacity.
Because HDFS keeps its metadata in memory, you can happily handle 60K metadata operations per second.
AWS S3 broke a lot of what's found in POSIX file systems in order to achieve scalability, so rapid file modifications, like the kind needed when converting CSVs into Parquet files, aren't possible with S3; distributing that workload requires something like HDFS.
If the conversion software were modified to make this an S3-only workload, the data locality trade-offs would likely be significant.
Second, the Hadoop Ozone project aims to provide an S3 API-compatible system that can distribute trillions of objects without the need for a cloud vendor.
The project aims to have native support in Spark and Hive, giving it good integration with the rest of the Hadoop ecosystem.
This software, when released, will be one of the first open-source offerings that can store this many files on a single cluster.
Third, even if you're not working with PBs of data, the APIs available to you in the Hadoop ecosystem provide a consistent interface for handling GBs of data.
Spark is the definitive solution for distributed machine learning.
Once you know its APIs, it doesn't matter whether your workload is GBs or PBs: the code you produce doesn't need to be rewritten, you just need more machines to run it on.
I'd much sooner teach someone to write SQL and PySpark code than teach them to distribute awk commands across multiple machines.
Fourth, a lot of the features of the Hadoop ecosystem are a leading light for commercial vendors.
Every failed sales pitch for a proprietary database results in the sales team learning just how many missing features, compromises and pain points their offering has.
Every failed POC results in the sales team learning just how robust the internal testing of their software really is.
No One Needs Big Data

When you hear "no one needs big data", look over the CV of the speaker.
You might find a lot of internally-hosted web applications in an airline's headquarters, but when it comes to analysing PBs of aircraft telemetry for predictive maintenance, there might not be any PHP developers on that project.
The above projects often aren't advertised in ways that would expose web developers to them.
This is why someone could spend years working on new projects that are at the bottom of their S-curve in terms of both growth and data accumulated and largely never see a need for data processing outside of what could fit in RAM on a single machine.
Web Development was a big driver in the population growth of coders over the past 25 years.
Most people who call themselves coders are building web applications.
I think a lot of the skills they possess overlap well with those needed in data engineering, but distributed computing, statistics and storytelling are often lacking.
Websites often don't produce much load from any one user, and the aim is usually to keep the load generated by a large number of users below the servers' hardware limits.
The data world is made up of workloads where a single query tries its best to make full use of a large number of machines in order to finish as quickly as possible while keeping infrastructure costs down.
Companies producing PBs of data often have a queue of experienced consultants and solutions providers at their door.
I've rarely seen anyone plucked out of web development by their employer and brought into the data platform engineering space; it's almost always a lengthy, self-driven re-training exercise.
That Dataset Can Live in RAM

I hear people argue that "a dataset can fit in memory".
RAM capacity, even on the Cloud, has grown a lot recently.
There are EC2 instances with 2 TB of RAM.
RAM can typically be read at 12-25 GB/s, depending on the architecture of your setup.
Using RAM alone won't provide any failure recovery if the machine suffers a power failure.
To add to this, the cost per GB is tremendous compared to using disks.
Disks are catching up in speed as well.
A 4 x 2 TB NVMe SSD PCIe 4.0 card was announced recently that can read and write at 15 GB/s.
The price point of PCIe 4.0 NVMe drives will be very competitive with RAM while providing non-volatile storage.
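A quick back-of-envelope comparison makes the gap concrete (the figures below come from the text above; real throughput varies with hardware and access patterns):

```python
# Back-of-envelope scan times for a 2 TB dataset, using the figures above.
# Illustrative numbers, not a benchmark.
dataset_gb = 2_000              # a 2 TB EC2 instance's worth of data

ram_gb_per_s = 25               # upper end of the 12-25 GB/s RAM range
nvme_gb_per_s = 15              # the announced PCIe 4.0 NVMe card

ram_scan_s = dataset_gb / ram_gb_per_s    # full scan from RAM
nvme_scan_s = dataset_gb / nvme_gb_per_s  # full scan from NVMe

print(f"RAM: {ram_scan_s:.0f}s, NVMe: {nvme_scan_s:.0f}s")
```

A full scan takes roughly 80 seconds from RAM versus roughly 133 seconds from the NVMe card: less than a 2x gap, with the disk being non-volatile and far cheaper per GB.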
I can't wait to see an HDFS cluster with good networking using those drives, as it'll demonstrate what an in-memory-speed data store with non-volatile storage and the rich existing tooling of the Hadoop ecosystem looks like.
It's Over-Engineered

I wouldn't want to spend 6 or 7 figures designing a data platform and team for a business that couldn't scale beyond what fits on any one developer's laptop.
In terms of workflow, my days mainly consist of using BASH, Python and SQL.
Plenty of new graduates are skilled in the above.
A PB of Parquet data can be nicely spread across one million files on S3.
The planning involved with that isn't much more than considering how to store 100,000 micro-batched files on S3.
Just because a solution scales doesn't mean it's overkill.
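As a back-of-envelope check on the file counts above (decimal units, purely illustrative):

```python
# Average object size when 1 PB of Parquet is spread across one million
# S3 objects, using decimal units for simplicity.
total_bytes = 10 ** 15          # 1 PB
object_count = 1_000_000

avg_object_gb = total_bytes / object_count / 10 ** 9
print(avg_object_gb)
```

That works out to an average of 1 GB per object, a comfortable size for Parquet files, which is why the planning burden stays modest.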
Just use PostgreSQL?

I've also heard arguments that row-oriented systems like MySQL and PostgreSQL can fit the needs of analytical workloads as well as their traditional transactional workloads.
Both of these offerings can do analytics, and if you're looking at less than 20 GB of data, it's probably not worth the effort of having multiple pieces of software running your data platform.
That being said, I've had to work with a system that was feeding tens of billions of rows into MySQL on a daily basis.
There is nothing turnkey about MySQL or PostgreSQL that lends itself to handling this sort of workload.
The infrastructure costs to keep the datasets, even for just a few days, in row-oriented storage eclipsed the staffing costs.
The migration to a columnar storage solution for this client brought those infrastructure costs down by two orders of magnitude and sped up querying times by two orders of magnitude.
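The mechanics behind a gain of that size can be sketched with illustrative numbers (the column counts and compression ratio below are assumptions for illustration, not the client's actual figures):

```python
# Why columnar storage can cut analytical query costs so dramatically.
# Illustrative numbers, not a benchmark.
total_columns = 50        # assumed width of the table
queried_columns = 2       # a typical analytical query touches few columns

# A row store reads every column of every row; a column store reads only
# the columns the query names.
column_pruning_gain = total_columns / queried_columns   # 25x less I/O

# Storing similar values together makes per-column compression effective,
# which often multiplies the saving further (assumed ~5x here).
compression_gain = 5

total_gain = column_pruning_gain * compression_gain
print(total_gain)
```

Under these assumptions the combined effect is 125x, which is how "two orders of magnitude" becomes plausible without any exotic hardware.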
PostgreSQL has a number of add-ons for columnar storage and multi-machine query distribution.
The best examples Ive seen are commercial offerings.
The recently announced Zedstore could go some way toward making columnar storage a standard, built-in feature of PostgreSQL.
It'll be interesting to see if single-query distribution and storage decoupling become standard features as well in the future.
If you have a transactional need for your dataset, it's best to keep that workload isolated in a transactional data store.
This is why I expect MySQL, PostgreSQL, Oracle and MSSQL to be around for a very long time to come.
But would you like to see a 4-hour outage at Uber because one of their Presto queries produced unexpected behaviour? Would you like to be told your company needs to produce this month's invoices, so the website will need to be switched off for the week to leave enough resources for the project? Analytical workloads don't need to be coupled with transactional workloads.
You can lower operational risks and pick better suited hardware by running them on separate infrastructure.
And since you're on separate hardware, you don't need to use the exact same software.
Many of the skills that make a competent PostgreSQL engineer lend themselves well to the analytics-focused data world; it's less of a leap than a web developer moving into the big data space.
What does the future look like?

I expect to continue analysing data and widening my skill set in the data space for the foreseeable future.
In the past 12 months Ive delivered work using Redshift, BigQuery and Presto, almost in even amounts.
I try to spread my bets, as I've yet to find a crystal ball for the data world.
One thing I do expect is more fragmentation and more players to both enter and crash out of this industry.
There is a reason for most databases to exist, but the use cases each can serve well are limited.
That being said, good salespeople can go some way to extending the market demand for any given offering.
I've heard people estimate it would take $10M to produce a commercial-quality database, which means this is probably a sweet spot for venture capital.
There are plenty of offerings and implementations out there that leave customers with a bad taste in their mouth.
There is such a thing as Cloud sticker shock.
There are solutions which are great but very expensive to hire expertise for.
Arguing the trade-offs with the above will keep the sales and marketing people in the industry busy for some time to come.
Cloudera and MapR might be going through hard times right now, but I've heard nothing to make me believe it's anything other than sunshine and roses at AWS EMR, Databricks and Qubole.
Even Oracle is releasing a Spark-driven offering.
It would be good for the industry to see Hadoop as more than just a Cloudera offering and acknowledge that the above firms, as well as Facebook, Uber and Twitter, have all made significant contributions to the Hadoop world.
Hortonworks, which merged with Cloudera this year, is the platform provider for Azure HDInsight, Microsoft's managed Hadoop offering.
The company has the people that can deliver a decent platform for a 3rd-party Cloud provider.
I hope whatever offerings they're working on for the future are centred around this sort of delivery.
I suspect Cloudera's early customers were users of HBase, Oozie, Sqoop and Impala.
It would be good to see these not compete for so much engineering time and for future versions of their platform to come with Airflow, Presto and the latest version of Spark out of the box.
At the end of the day, if your firm is planning on deploying a data platform, there is no replacement for an astute management team that can research diligently, plan carefully and fail quickly.