Is Hadoop Dead? Part 2









Read the first part



Nobody needs Big Data



When you hear "No one needs Big Data," look at the speaker's resume. An African telecom operator experiencing astonishing growth is not going to reach out to a newly minted JavaScript web developer and ask for help building its data platform and optimizing billing calculations. You can find plenty of internal web applications at an airline's headquarters, but when it comes to analyzing petabytes of aircraft telemetry for preventive maintenance, there may not be a single PHP developer on the project.



Projects like these are rarely advertised in a way that web developers would hear about them. That is why someone can spend years working on new projects that sit at the bottom of their S-curve in terms of both growth and data accumulation and, in most cases, never see a need for data processing beyond what fits in the RAM of a single machine.



Over the past 25 years, web development has been a major driver of growth in the number of programmers. Most people who call themselves programmers are, more often than not, building web applications. I think many of the skills they possess align well with those needed for data engineering, but distributed computing, statistics, and storytelling are often missing.



Websites rarely generate a heavy load from any single user; the goal is usually to keep the load from a large number of users below the hardware limits of the servers supporting them. The data world consists of workloads where a single query tries to max out a large number of machines in order to finish as quickly as possible while keeping infrastructure costs down.



Companies with petabytes of data often have experienced consultants and solution providers on hand. I have rarely seen anyone pulled out of web development by their employer and moved into data platform work; it is almost always the result of lengthy self-retraining.



This data set can live in RAM



I have heard people claim that "the data set can fit in memory." The amount of RAM available, even in the cloud, has grown significantly of late; there are EC2 instances with 2 TB of RAM. RAM can typically be read at 12-25 GB/s, depending on the architecture of your setup. But RAM alone offers no recovery if the machine loses power, and the cost per GB is huge compared to drives.



Drives are getting faster too. A recently announced PCIe 4.0 NVMe card holding 4 x 2 TB of SSD storage can read and write at 15 GB/s. PCIe 4.0 NVMe drives will be priced very competitively against RAM while providing non-volatile storage. I cannot wait to see an HDFS cluster on a fast network built with these drives, because it would show what an in-memory-class data archive looks like with non-volatile storage and the rich tooling of the existing Hadoop ecosystem.



It's over-engineered



I would not want to spend six or seven figures on building a data platform and a team for a business that cannot scale beyond what fits on a single developer's laptop.



From a workflow perspective, my days consist mostly of BASH, Python, and SQL, and plenty of new graduates are already proficient in all three.



A petabyte of data stored in Parquet can easily be spread across a million files on S3. The planning involved is not much more complicated than thinking through how 100,000 micro-batch files will be stored on S3. Just because a solution can scale does not mean it is over-engineered.
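To make that concrete, here is a minimal sketch of writing a table to S3 as a partitioned set of Parquet files. It assumes pyarrow with S3 support; the bucket name and column names are hypothetical placeholders, not anything from a real project.

```python
# A rough sketch, not a production pipeline: write one table to S3 as a
# directory of partitioned Parquet files using pyarrow. Bucket and
# column names are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "user_id":    [101, 102, 103],
    "amount":     [9.99, 4.50, 12.00],
})

# One sub-directory per event_date value, each holding one or more
# Parquet files. Recent pyarrow versions resolve s3:// URIs with their
# built-in S3 filesystem; credentials come from the environment.
pq.write_to_dataset(
    table,
    root_path="s3://example-bucket/events",
    partition_cols=["event_date"],
)
```

Partitioning on a date-like column keeps individual files small and lets query engines skip whole directories, which is what makes a million files on S3 manageable rather than over-engineered.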



Just use PostgreSQL?



I have also heard arguments that row-oriented systems like MySQL and PostgreSQL can serve analytic workloads just as well as their traditional transactional ones. Both can do analytics, and if you are looking at less than 20 GB of data, a scale-out setup is probably not worth the effort.



I once worked with a system that loaded 10 billion rows a day into MySQL. There is nothing about MySQL or PostgreSQL that makes such a load impossible, but the cost of the infrastructure needed to keep even a few days of that data set in row-oriented storage dwarfed the staff costs. Switching this client to a columnar storage solution cut infrastructure costs and query times by two orders of magnitude each.
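Much of that speed-up comes from a columnar engine reading only the columns a query touches. Here is a minimal sketch of the idea using pyarrow and Parquet; the file and column names are hypothetical.

```python
# A small illustration of column pruning, the core advantage of columnar
# storage for analytics: only the columns a query needs are read.
# "events.parquet" and its columns are hypothetical.
import pyarrow.parquet as pq

# Read just two columns from a file that may contain dozens, so the
# bytes scanned shrink roughly in proportion.
table = pq.read_table("events.parquet", columns=["event_date", "amount"])

daily_totals = (
    table.to_pandas()
         .groupby("event_date")["amount"]
         .sum()
)
print(daily_totals)
```

A row-oriented store has to pull whole rows off disk for the same aggregation, which is where much of the cost and latency gap comes from.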



PostgreSQL has a number of add-ons for columnar storage and for distributing queries across multiple machines. The best examples I have seen are commercial offerings. The recently announced Zedstore may go some way toward making columnar storage a standard built-in PostgreSQL feature. It will be interesting to see whether distributing individual queries and separating storage from compute become standard features in the future as well.



If you have a transactional workload, it is best to keep it isolated in a transactional data store. This is why I expect MySQL, PostgreSQL, Oracle, and MSSQL to be with us for a very long time.



But would you want a four-hour outage at Uber because one of their Presto queries produced unexpected behavior? Would you want to be told that, because the company needs to run its monthly billing, the website has to be switched off for a week so there are enough resources for the job? Analytical workloads should not be tied to transactional workloads. Running them on separate infrastructure reduces operational risk and lets you pick hardware better suited to each.



And since the two workloads run on separate hardware, they do not need to run the same software either. Many of the skills of a competent PostgreSQL engineer transfer well to the analytics-oriented data world; it is a small step compared to the leap a web developer has to make into the big data space.
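As a sketch of what that separation can look like in practice (and not a prescription), here is a minimal nightly export from a transactional PostgreSQL database into Parquet on S3, where a separate analytic engine such as Presto can query it. It assumes psycopg2, pandas, and pyarrow/s3fs; the connection details, table, and bucket names are hypothetical.

```python
# Hypothetical nightly job: copy yesterday's orders out of the
# transactional PostgreSQL database and land them as Parquet on S3,
# so analytic queries never run against the production box.
from datetime import date, timedelta

import pandas as pd
import psycopg2

yesterday = date.today() - timedelta(days=1)

conn = psycopg2.connect(host="replica-db.internal", dbname="billing",
                        user="readonly", password="...")  # placeholders
df = pd.read_sql(
    "SELECT order_id, customer_id, amount, created_at "
    "FROM orders WHERE created_at::date = %s",
    conn,
    params=(yesterday,),
)
conn.close()

# Writing to s3:// requires s3fs alongside pyarrow. The analytic stack
# (Presto, Spark, Redshift Spectrum, ...) reads these files instead of
# touching the OLTP database.
df.to_parquet(f"s3://example-bucket/orders/dt={yesterday}/orders.parquet",
              index=False)
```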



What does the future look like?



I will keep broadening and expanding my data skills for the foreseeable future. Over the past 12 months I have done roughly equal amounts of work with Redshift, BigQuery, and Presto. I try to spread my bets, because I have yet to find a crystal ball that actually works.



What I really expect is more fragmentation, with more players entering and leaving the industry. Most databases have a reason to exist, but the use cases they serve well may be limited. At the same time, good salespeople can expand the market demand for almost any offering. I have heard people say that building a commercial-grade database takes around $10 million, which probably makes it a sweet spot for venture capital.



There are plenty of offerings and implementations that have left customers with a bad taste in their mouths. There is also such a thing as cloud-bill sticker shock. Some solutions are good but end up too expensive because of the cost of hiring experts. Sales and marketing professionals in this industry will be kept busy for some time working through the trade-offs above.



Cloudera and MapR may be going through a rough patch right now, but I have heard nothing of the sort to make me believe that AWS EMR, Databricks, and Qubole are in similar trouble. Even Oracle is releasing a Spark-based offering. It would be nice if the industry saw Hadoop as more than just a Cloudera product and recognized that these companies, along with Facebook, Uber, and Twitter, have made significant contributions to the Hadoop world.



Hortonworks, which merged with Cloudera this year, supplies the platform behind Azure HDInsight, Microsoft's managed Hadoop offering. There are people at the company who can deliver a decent platform to a third-party cloud provider. I hope whatever offerings they are working on stay focused on that kind of delivery.



I suspect that Cloudera's early customer wins were largely among users of HBase, Oozie, Sqoop, and Impala. It would be good to see those components no longer competing for engineering time going forward, and future versions of their platform shipping with Airflow, Presto, and the latest version of Spark out of the box.



In the end, if your company is planning to deploy a data platform, there is no substitute for an astute management team that researches thoroughly, plans carefully, and recognizes failures quickly.


