Machine Learning Limitations

Hello, Habr! I present to you a translation of the article "The Limitations of Machine Learning" by Matthew Stewart.



Most people reading this article are probably familiar with machine learning and the corresponding algorithms used to classify data or predict outcomes. However, it is important to understand that machine learning is not the solution to all problems. Given how useful machine learning is, it can be hard to accept that it is sometimes not the best solution to a problem.









Machine learning is a branch of artificial intelligence that has revolutionized the world as we know it over the past decade. The information explosion has led to the collection of huge amounts of data, especially by large companies such as Facebook and Google. This amount of data, combined with the rapid growth of processor power and computer parallelization, has made it relatively easy to collect and analyze huge amounts of data.



Nowadays, hype around machine learning and artificial intelligence is ubiquitous. Perhaps this is justified, given that the potential of the field is huge. The number of AI consulting agencies has grown over the past few years, and according to Indeed, the number of AI-related jobs increased by 100% between 2015 and 2018.



As of December 2018, Forbes found that 47% of businesses had embedded at least one AI capability in their business processes, and a Deloitte report says that the penetration rate of enterprise software with integrated AI and of cloud-based AI development services will reach approximately 87% and 83%, respectively. These numbers are impressive: if you plan to change careers in the near future, AI seems like a good field.



Everything sounds great, right? Companies are happy, and consumers are apparently happy too, otherwise companies would not be using AI.



That is great, and I am a big fan of machine learning and artificial intelligence myself. However, there are times when using machine learning is simply unnecessary, makes no sense, or times when its implementation can get you into difficulties.



Limitation 1 - Ethics



It is easy to understand why machine learning has had such a profound impact on the world, but what is less clear is exactly what its capabilities are and, more importantly, what its limitations are. Yuval Noah Harari famously coined the term "Dataism", which refers to a putative new stage of civilization that we enter once we trust algorithms and data more than our own judgment and logic.



Although this idea may seem ridiculous, remember the last time you went on vacation and followed GPS directions rather than your own judgment of the map: do you ever question the GPS's assessment? People have literally driven into lakes because they blindly followed their GPS instructions.



The idea of trusting data and algorithms more than our own judgment has its pros and cons. Obviously, we benefit from these algorithms, otherwise we would not use them in the first place. These algorithms allow us to automate processes by making informed judgments from the available data. Sometimes, however, this means replacing someone's job with an algorithm, which has ethical consequences. Moreover, whom do we blame if something goes wrong?



The most frequently discussed case today is self-driving cars: how do we decide how a vehicle should react in the event of a fatal collision? Will we, in the future, be able to choose at purchase the ethical framework our self-driving car should follow?



Who is to blame if my self-driving car kills someone on the road?



While these are all fascinating questions, they are not the main purpose of this article. It is clear, however, that machine learning cannot tell us anything about which normative values we should adopt, that is, how we ought to act in a given situation.



Limitation 2 - Deterministic Problems



This is a limitation I have had to deal with personally. My area of expertise is environmental science, which relies heavily on computer modeling and on IoT sensors and devices.



Machine learning is incredibly effective for sensors and can be used to calibrate and adjust sensors when they are connected to other sensors that measure environmental variables such as temperature, pressure, and humidity. The correlations between the signals of these sensors can be used to develop self-calibration procedures, and this is a hot topic in my own atmospheric chemistry research.
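To make this concrete, here is a minimal sketch of what such a correlation-based calibration could look like, assuming a low-cost sensor co-located with a reference instrument. All data, coefficients, and variable names are invented for illustration; real calibration pipelines are considerably more involved.

```python
# Sketch: calibrating a low-cost sensor against a reference instrument,
# using temperature and humidity as extra predictors. Synthetic data only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
temperature = rng.uniform(5, 35, n)          # degrees C
humidity = rng.uniform(20, 90, n)            # percent RH
true_concentration = rng.uniform(0, 100, n)  # reference reading

# Simulated raw sensor signal: biased and drifting with T and RH.
raw_signal = (0.8 * true_concentration + 0.5 * temperature
              - 0.1 * humidity + rng.normal(0, 2, n))

# Fit a correction model from (raw signal, T, RH) to the reference.
X = np.column_stack([raw_signal, temperature, humidity])
model = LinearRegression().fit(X, true_concentration)
calibrated = model.predict(X)

print("Raw RMSE:       ", np.sqrt(np.mean((raw_signal - true_concentration) ** 2)))
print("Calibrated RMSE:", np.sqrt(np.mean((calibrated - true_concentration) ** 2)))
```

The point is simply that the environmental covariates carry enough signal to correct the sensor; the same idea scales up to networks of co-located sensors.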



However, things get a little more interesting when it comes to computer modeling.



Running computer models that simulate global weather, the planet's emissions, and the transport of those emissions is very computationally expensive. In fact, it is so computationally demanding that research-level modeling can take several weeks even on a supercomputer.



Good examples of this are MM5 and WRF, numerical weather prediction models used to study climate and to give you the weather forecast on the morning news. Ever wonder what weather forecasters do all day? They run and study these models.



Working with weather models is fine, but now that we have machine learning, can we use it instead to obtain our weather forecasts? Can we take data from satellites and weather stations and feed it to a rudimentary predictive algorithm to tell whether it will rain tomorrow?



The answer, surprisingly, is yes. If we have information about the air pressure around a certain region, the humidity levels, the wind speed, and the same variables for neighboring points, then it becomes possible to train, for example, a neural network. But at what cost?
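As a sketch of the setup (not of real forecast skill), one could train a small network on a few meteorological features. Everything below is synthetic and illustrative; a real system would have vastly more inputs.

```python
# Sketch: a small neural network predicting "rain tomorrow" from basic
# meteorological features. The data is synthetic and purely illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000
pressure = rng.normal(1013, 10, n)   # hPa
humidity = rng.uniform(20, 100, n)   # percent
wind_speed = rng.uniform(0, 25, n)   # m/s

# Toy rule: low pressure and high humidity make rain more likely.
p_rain = 1 / (1 + np.exp(0.3 * (pressure - 1008) - 0.05 * (humidity - 60)))
rain_tomorrow = rng.random(n) < p_rain

X = np.column_stack([pressure, humidity, wind_speed])
X_train, X_test, y_train, y_test = train_test_split(X, rain_tomorrow, random_state=0)

net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0))
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))
```

Such a network can reach reasonable accuracy on data like this, yet it has no notion of the physics that generated the data, which is exactly the point of this section.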



Using a neural network with thousands of inputs can indeed tell you whether it will rain tomorrow in Boston. However, the neural network skips over the entire physics of the weather system.



Machine learning is stochastic, not deterministic.

A neural network does not understand Newton’s second law, or that density cannot be negative — there are no physical limitations.



However, this may not remain a limitation for long. A number of researchers are already looking at adding physical constraints to neural networks and other algorithms so that they can be used for purposes like this.
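One simple flavor of this idea is to add a penalty term to the loss that punishes physically impossible outputs, such as negative densities. Below is a minimal sketch in PyTorch; the penalty form and its weight are my own illustrative assumptions, not the method of any particular paper.

```python
# Sketch: augmenting a regression loss with a "soft" physical constraint
# that penalizes negative density predictions. Purely illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def physics_informed_loss(pred, target, weight=10.0):
    # Standard data-fit term plus a penalty for predictions below zero;
    # relu(-pred) is exactly zero whenever the constraint is satisfied.
    return mse(pred, target) + weight * torch.relu(-pred).mean()

# One illustrative training step on random tensors.
x = torch.randn(64, 4)
y = torch.rand(64, 1)          # densities are non-negative
optimizer.zero_grad()
pred = model(x)
loss = physics_informed_loss(pred, y)
loss.backward()
optimizer.step()
print("loss:", loss.item())
```

The penalty does not give the network an understanding of physics, but it does steer it away from producing outputs that a physicist would immediately reject.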



Limitation 3 - Data



This is the most obvious limitation. If you feed a model poor data, it will give you only poor results. There are two reasons for this: a lack of data and a lack of good data.



Lack of data



Many machine learning algorithms require large amounts of data before they begin to give useful results. A good example of this is a neural network. Neural networks are data-hungry machines that require copious amounts of training data. The larger the architecture, the more data is needed to produce viable results. Reusing data is a bad idea; having more data is always preferable.

If you can get the data, then use it.
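One way to see this appetite for data is a learning curve: the model's score as a function of training-set size. Here is a minimal sketch with scikit-learn on a synthetic classification problem; the exact numbers are illustrative.

```python
# Sketch: test performance typically grows with training-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=3)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> mean CV accuracy {score:.3f}")
```

Typically the curve rises steeply at first and then flattens, which is why "get more data" is the first answer to a struggling model.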



Lack of good data



Despite appearances, this is not the same as the above. Imagine you think you can cheat by generating ten thousand fake data points to feed into your neural network. What happens when you do?



The network will train on them, and then, when you test it on a new data set, it will perform poorly. You had the data, but not the quality.

Just as a lack of good features can lead to poor performance of your algorithm, a lack of good ground-truth data can also limit the capabilities of your model. No company is going to deploy a machine learning model that performs worse than human-level error.



Similarly, a model trained on a data set from one situation may not transfer equally well to a second situation. The best example of this that I have found so far is in breast cancer prediction.



Mammography databases contain many images, but they suffer from one serious problem that has caused significant issues in recent years: almost all of the X-rays are of white women. This may not seem like a big deal, but it has been shown that black women are 42 percent more likely to die from breast cancer due to a wide range of factors that may include differences in detection and access to care. Thus, training the algorithm primarily on white women adversely affects black women in this case.



In this particular case, the training database needs more X-ray images of black patients, more features relevant to the causes of that 42 percent increased likelihood, and a more equitable algorithm obtained by stratifying the data set along the relevant axes.
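Stratification itself is straightforward to express in code. A minimal sketch with scikit-learn follows, where `group` is a hypothetical demographic label of my own invention; in real work you would also stratify on the outcome and report metrics per group.

```python
# Sketch: a stratified split that preserves subgroup proportions, so a
# minority group is not accidentally underrepresented in train or test.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))                       # stand-in image features
group = rng.choice(["A", "B"], size=n, p=[0.9, 0.1])  # hypothetical label

X_train, X_test, g_train, g_test = train_test_split(
    X, group, test_size=0.2, stratify=group, random_state=0)

for name, g in [("train", g_train), ("test", g_test)]:
    print(name, "share of group B:", np.mean(g == "B"))
```

Both splits keep group B at roughly 10 percent, so any per-group performance gap becomes measurable rather than hidden.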



Limitation 4 - Misuse



Related to the second limitation discussed earlier, there is purported to be a "machine learning crisis in academic research," whereby people blindly apply machine learning to try to analyze systems that are either deterministic or stochastic in nature.



For the reasons discussed under the second limitation, applying machine learning to a deterministic system can succeed, but the algorithm is not learning the relationship between the two variables and will not know when it violates physical laws. We simply gave the system some inputs and outputs and told it to learn the relationship; like someone translating word for word from a dictionary, the algorithm will display only a superficial grasp of the underlying physics.



For stochastic (random) systems, everything is a little less obvious. The machine learning crisis for random systems manifests itself in two ways:





p-hacking



When someone has access to big data, with hundreds, thousands, or even millions of variables, it is not hard to find a statistically significant result (given that the level of statistical significance required by most scientific research is p < 0.05). This often leads to the discovery of spurious correlations, usually obtained by p-hacking (combing through mountains of data until a correlation showing statistically significant results is found). These are not true correlations, but merely responses to noise in the measurements.
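This effect is easy to reproduce: with enough pure-noise variables, some will correlate "significantly" with any outcome. A minimal simulation sketch:

```python
# Sketch: "discovering" significant correlations in pure noise.
# At p < 0.05 we expect roughly 5% of noise variables to pass the test.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_variables = 100, 1000
X = rng.normal(size=(n_samples, n_variables))  # pure noise predictors
y = rng.normal(size=n_samples)                 # pure noise outcome

significant = sum(pearsonr(X[:, j], y)[1] < 0.05 for j in range(n_variables))
print(f"{significant} of {n_variables} noise variables look 'significant'")
# Typically around 50, and not one of them reflects a real relationship.
```

Multiple-comparison corrections (Bonferroni, false discovery rate control) exist precisely to guard against this, but only if the researcher applies them.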



This has led individual researchers to "fish" statistically significant correlations out of large data sets and disguise them as true correlations. Sometimes this is an innocent mistake (in which case the scientist should be better trained), but in other cases it is done to increase the number of papers a researcher has published; even in the scientific community competition is strong, and people will do anything to improve their metrics.



Scope of analysis



There are significant differences in the scope of analysis between machine learning and statistical modeling: statistical modeling is inherently confirmatory, while machine learning is inherently exploratory.



We can think of confirmatory analysis and modeling as the kind of work someone does in a Ph.D. program or a research lab. Imagine you are working with an advisor, trying to develop a theoretical framework for studying some real system. This system has a set of predefined features that influence it, and after carefully designing experiments and developing hypotheses, you can run tests to determine the validity of your hypotheses.



Exploratory analysis, on the other hand, lacks a number of the qualities associated with confirmatory analysis. In fact, with truly enormous amounts of data and information, confirmatory approaches break down completely under the sheer volume of data. In other words, it is simply impossible to state a precise, finite set of testable hypotheses in the presence of millions of features.



Therefore, broadly speaking, machine learning algorithms and approaches are best suited to exploratory predictive modeling and classification with huge amounts of data and computationally complex features. Some will argue that they can be used on "small" data, but why do so when classic multivariate statistical methods are so much more informative?



Machine learning is a field that largely addresses problems from information technology, computer science, and so on; these can be both theoretical and applied problems. As such, it is related to fields like physics, mathematics, probability, and statistics, but machine learning is really a field in its own right, one unencumbered by the concerns raised in other disciplines. Many of the solutions that machine learning experts and practitioners come up with are painfully mistaken, but they get the job done.



Limitation 5 - Interpretability



Interpretability is one of the main problems with machine learning. An AI consulting firm pitching to a company that uses only traditional statistical methods can be stopped dead if the client does not see the model as interpretable. If you cannot convince your client that you understand how the algorithm arrived at its decision, how likely are they to trust you and your expertise?



A business manager is more likely to accept machine learning recommendations if the results are explained in business terms.



These models as such can be rendered powerless if they cannot be interpreted, and the process of human interpretation follows rules that go well beyond technical prowess. For this reason, interpretability is a paramount quality that machine learning methods must achieve if they are to be applied in practice.
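Interpretability tooling can close part of this gap. One widely used, model-agnostic technique is permutation importance: shuffle one feature at a time and measure how much the model's score drops. A minimal sketch with scikit-learn on synthetic data:

```python
# Sketch: permutation importance, one way to explain which inputs
# drive a black-box model's predictions. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Features whose shuffling hurts the score most matter most.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```

A ranking like this is not a full explanation of the model, but it is the kind of business-legible summary that the previous paragraph calls for.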



In particular, the burgeoning "-omics" sciences (genomics, proteomics, metabolomics, and the like) have become a prime target for machine learning researchers precisely because of their dependence on large and non-trivial databases. However, despite their apparent successes, these methods suffer from a lack of interpretability.



Conclusion



I hope I have made it clear in this article that machine learning has limitations that, at least for now, prevent it from solving all of humanity's problems. A neural network can never tell us how to be a good person and, at least for now, does not understand Newton's laws of motion or Einstein's theory of relativity.



There are also fundamental limits grounded in the underlying theory of machine learning, known as computational learning theory; these are mainly statistical limitations. We also discussed issues related to the scope of analysis and the dangers of p-hacking, which can lead to false conclusions.

There are also problems with the interpretability of results, which can negatively affect companies that fail to convince clients and investors that their methods are accurate and reliable.



Machine learning and artificial intelligence will continue to revolutionize industry and will only become more common in the coming years. While I recommend that you take full advantage of machine learning and AI, I also recommend keeping in mind the limitations of the tools you use; after all, nothing is perfect.


