Bad advice about introducing Machine Learning into business

Do not rely on artificial intelligence unless you have a deep understanding of the process.



Ray Dalio



At Jet Infosystems, we implement machine learning across a wide range of industries, and from that experience we have identified the components that a successful implementation needs:





In practice, all of these elements rarely come together: according to statistics, only about 7% of ML projects are considered successful. Projects that have all of these components can safely be called breakthroughs! To illustrate, we have formulated several pieces of “harmful advice” about introducing machine learning in business.



Bad advice No. 1: “The task is simply to implement ML”



Customers often formulate the task as “just introduce machine learning for some kind of optimization”, with no connection to business metrics or to the prioritization of business tasks.



Several negative scenarios are possible here. First, the target may change as the work progresses, which means that all preprocessing and the choice of optimization methods change too, because they depend directly on the meaning of the target. Second, a data scientist may pick some machine learning metric, for example AUC, and improve it for its own sake: bringing in every hyped framework and library and, guided by a personal sense of beauty, perfecting the “fifth decimal place” of the chosen metric. For the business, this work may be completely unimportant and may not lead to a successful implementation. Third, the team may start solving a task of minor business importance while a much greater opportunity for machine learning sits right next to it.
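
To make the gap between an ML metric and a business metric concrete, here is a minimal sketch. The data is invented, and the business constraint (only the two highest-scored items can be inspected) is a hypothetical stand-in for whatever the real business metric would be. It shows a model with the higher AUC delivering fewer useful results under that constraint.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs where the positive is scored higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hits_in_top_k(labels, scores, k):
    """Hypothetical business metric: true positives among the k highest scores."""
    ranked = sorted(zip(scores, labels), reverse=True)
    return sum(y for _, y in ranked[:k])


# Invented example: 3 positives, 5 negatives, scores from two candidate models.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [0.90, 0.80, 0.70, 0.95, 0.40, 0.30, 0.20, 0.10]
model_b = [0.90, 0.80, 0.20, 0.70, 0.60, 0.50, 0.40, 0.10]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, f"AUC={auc(labels, scores):.2f}",
          f"hits in top 2={hits_in_top_k(labels, scores, 2)}")
```

Model A wins on AUC, yet model B finds more positives among the two items the business can actually act on. Which metric matters has to come from the business task, not from the data scientist's preferences.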



As a result, you may encounter negative consequences:





Bad advice No. 2: “Any data scientist will do”



There is an opinion that you can hire any data scientist from the market, seat them in isolation with a pile of Excel files, and they will magically figure out what needs to be optimized. In our view, the mindset of data scientists who work on production optimization is extremely important. They must be ready to dive deep into the technological processes (for example, aluminum electrolysis, oxygen-alkaline cellulose treatment, blast furnace production, etc.). Equally important is their willingness to travel to remote sites and talk in person with technologists and operators at the plant in order to understand how everything really works. Without this, they will most likely be doomed to a large number of thoughtless iterations over models, and a useful implementation may never be reached.



Bad advice No. 3: “Work should be patchwork”



We regularly encounter the ideology of organizing work as fragmentedly as possible, with maximum division of labor to minimize costs. For example, there is an analyst who understands the process and communicates with customers and technologists. There is a data engineer who processes the data and generates features. And finally, there is a data scientist who just does import sklearn and fit / predict. As a result, the data scientist works in isolation from real life, in purely laboratory conditions, and there is a high risk of making many errors and missing important aspects of the original task.



Bad advice No. 4: “Don't explain to data scientists how data is collected”



It is not always obvious that data scientists need to understand how and where data is collected. There are even cases when ML implementation contracts are signed before anyone has looked at the data, and under such conditions there is a risk of never reaching the target metric values specified in the contract. With this approach, problems will inevitably arise both in assessing the quality of the models and in applying them in practice.



Many properties of the data influence the choice of methods: averaging and measurement errors, uneven sampling of examples, and time lags in measurements. It is important to clean noise from both the factors and the targets correctly. The causes of noise vary: digitization errors, outliers, duplicated variables, instrument errors, etc.
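
As an illustration of two of the noise sources mentioned above (duplicated variables and digitization errors), here is a minimal sketch using only the Python standard library. The column names and readings are invented, and the outlier rule (modified z-score based on the median absolute deviation, with the commonly used 3.5 cutoff) is one possible choice, not a method prescribed by this article.

```python
from statistics import median


def drop_duplicate_columns(table):
    """table maps column name -> list of values; keep the first of identical columns."""
    seen, kept = set(), {}
    for name, values in table.items():
        key = tuple(values)
        if key not in seen:
            seen.add(key)
            kept[name] = values
    return kept


def mad_outliers(values, threshold=3.5):
    """Indices whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, nothing can be flagged this way
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Invented sensor log: one reading has a shifted decimal point.
temps = [850.1, 851.0, 849.7, 850.4, 8504.0, 850.2, 849.9]
table = {
    "temp_c": temps,
    "temp_copy": list(temps),  # the same signal exported twice
    "pressure": [1.01, 1.02, 1.00, 1.01, 1.03, 1.02, 1.01],
}

print(list(drop_duplicate_columns(table)))  # column names that survive
print(mad_outliers(temps))                  # positions of suspicious readings
```

Whether a flagged value is an error or a real process event is exactly the kind of question that requires knowing how the data was collected.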



It is in the company's interest for data scientists to understand the nature of the data thoroughly; otherwise, data processing will take a long time and will not lead to successful modeling. Without a deep understanding of how the data is collected and stored, you may encounter the following problems:





Bad advice No. 5: “Make data collection a complicated and incomprehensible process so that no one knows how it works. After the models are introduced, be sure to make changes to the process”



Often, in parallel with the development and implementation of the model, the technological processes that affect data collection change. Imagine that a technological process needs to be optimized, and after the model is introduced some units are reconfigured in a way that affects data collection: features will “drift”, distributions will change, and the training sample will no longer be representative. Of course, no one knows about this in advance. As a result, the model stops working and everything has to be redone. For example, tree-based models may run into an out-of-domain problem, because they cannot extrapolate beyond the range of feature values seen during training.



It is important to coordinate in advance with data scientists all changes in technological processes so that they can quickly adapt models to new conditions.
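
A very simple way to notice that live data has left the training domain is to compare each feature against the range seen during training. The sketch below assumes rows are plain dictionaries. The feature names and numbers are invented, and a real monitoring setup would likely compare whole distributions rather than only minimum and maximum values.

```python
def training_ranges(train_rows):
    """Per-feature (min, max) observed in the training rows (list of dicts)."""
    ranges = {}
    for row in train_rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range_share(ranges, live_rows):
    """Fraction of live rows, per feature, that fall outside the training range."""
    counts = {name: 0 for name in ranges}
    for row in live_rows:
        for name, (lo, hi) in ranges.items():
            if not lo <= row[name] <= hi:
                counts[name] += 1
    return {name: count / len(live_rows) for name, count in counts.items()}


# Invented data: a unit is reconfigured and starts running hotter.
train = [
    {"temp_c": 840.0, "flow": 10.0},
    {"temp_c": 850.0, "flow": 11.0},
    {"temp_c": 860.0, "flow": 12.0},
]
live = [
    {"temp_c": 900.0, "flow": 11.0},
    {"temp_c": 905.0, "flow": 11.5},
    {"temp_c": 850.0, "flow": 10.5},
    {"temp_c": 910.0, "flow": 11.0},
]

print(out_of_range_share(training_ranges(train), live))
```

A high share for any feature is a signal to talk to the technologists before trusting the model's predictions.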



Bad advice No. 6: “Average the features”



Some types of averaging lead to problems, for example:





In such cases, the task may not receive an adequate solution until the relevant raw data appears.
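
One way averaging can cause trouble, shown here on invented numbers and not necessarily one of the cases the author had in mind: a short excursion that is obvious in per-minute readings disappears completely in the hourly mean. The limit value is a hypothetical threshold chosen for the illustration.

```python
from statistics import mean


def block_means(values, size):
    """Average consecutive blocks of `size` readings (the last block may be shorter)."""
    return [mean(values[i:i + size]) for i in range(0, len(values), size)]


# Invented per-minute readings for one hour, with a single one-minute spike.
raw = [100.0] * 60
raw[30] = 400.0

hourly = block_means(raw, 60)
limit = 150.0  # hypothetical alarm threshold

print("hourly mean:", hourly)
print("spike visible in raw data:", any(v > limit for v in raw))
print("spike visible in hourly data:", any(v > limit for v in hourly))
```

If the event being predicted is caused by such short excursions, a model trained on the hourly values has no way to see them.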



Bad advice No. 7: “Do not give out additional data”



There are several scenarios where data scientists ask for additional data:





Data scientists ask for additional data when they have experience with similar problems in which that data improved the result. If the request is refused, the quality of the models can be much worse than what is potentially achievable.



Bad advice No. 8: “The accuracy of manual labeling is not important”



Suppose you need to predict product quality from manual labels, i.e. production operators record the target values by hand. If the operators also receive bonuses for good results and penalties for bad ones, then:





Similar problems can arise with crowdsourcing solutions (for example, Yandex.Toloka), where annotators are paid for labeling the data. In this case, the resulting labels need to be validated carefully. There are a number of approaches for this:





Conclusion: if the data is labeled manually, the labels must be checked; otherwise, systematic errors may creep in.
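
One common way to check manual labels, offered here as an illustration rather than as the specific approach used by the author, is to have a second person label a sample independently and measure chance-corrected agreement with Cohen's kappa. The labels below are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same single label
    return (observed - expected) / (1 - expected)


# Invented audit: the operator (who is paid for good results) vs. an independent auditor.
operator = ["ok"] * 9 + ["defect"]
auditor = ["ok"] * 6 + ["defect"] * 4

raw_agreement = sum(a == b for a, b in zip(operator, auditor)) / len(operator)
print(f"raw agreement: {raw_agreement:.2f}")
print(f"Cohen's kappa: {cohen_kappa(operator, auditor):.2f}")
```

Raw agreement looks acceptable because most items are “ok”, while kappa is much lower. A gap like this, where the operator rarely reports defects that the auditor finds, is the kind of systematic error the conclusion above warns about.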



Bad advice No. 9: “Use the most fashionable”



Read popular articles and demand that the solution to the problem be based on a fashionable method.



Today, data science is a fashionable field: many articles are published, conferences are held almost every day, and more and more methods are being created. However, this does not mean that an arbitrarily chosen popular method is optimal for industrial tasks. You usually do not need an LSTM to optimize pig iron production, nor do you need RL on small marketing or mining data sets. In such tasks it is reasonable to start with traditional methods (for example, gradient boosting), although convincing customers of this can be quite difficult. Fashionable ML methods are not always suitable for industrial tasks and often prove costly to implement.



Moral



The list of tips above is not exhaustive, but we regularly encounter all of them in practice. Follow them, and you will most likely convince yourself that ML does not work in industry and is simply a waste of money.



To summarize, the truly breakthrough cases are ML projects that are implemented on time and consistently bring measurable profit to the business. Achieving this requires not only competence in data analysis and machine learning, but also conditions in which data scientists understand the whole picture of the business problem well.



Posted by Irina Pimenova, Head of Mining, Jet Infosystems


