Why we decided to develop an ML testing practice





Predictive and optimization services based on machine learning are of interest to many companies today, from large banks to small online stores. While solving tasks for a variety of clients, we ran into a number of problems that became the starting point for our discussions about the specifics of ML testing. For those who are interested, here is the next post from Sergey Agaltsov, test manager at Jet Infosystems.



Previously, only large companies could take advantage of machine learning: it was expensive and difficult, and very little data was publicly available. Today it has become much easier. ML expertise can be obtained from an integrator, a specialized company, or a dedicated community platform. This benefits everyone, because as expertise grows, new algorithms are developed and the collective pool of machine learning experience is constantly enriched.



The one area where we could not find a ready-made solution is testing ML services. A quick Google search confirms that the role of testers in the development of such services is barely mentioned anywhere. Of course, data science specialists test their own models using various metrics, and by those metrics a service may look as accurate as can be. In reality, though, the model cannot always account for the various nuances and bottlenecks of production, and the machine learning logic starts becoming overgrown with hardcode.



As a result, we began to face a number of problems:





This is where we decided to focus the testing team's efforts.



Our task is to unify our ML testing practice so that we can respond to all of the problems above. We have already reached a number of conclusions, and I will walk through them below.



Testing compliance with production restrictions and verifying that the optimization algorithm takes them into account



In classical testing, every test has an “expected result”: we know exactly how the system should respond to given input data. With ML in a production environment, however, that expected result may be compliance with regulatory documents, such as GOST standards, process instructions, and temporary flow sheets, which constrain both the production processes themselves and the quality criteria of the final product. When testing, we must be certain that all of these restrictions are actually observed and that, however numerous they are, each one is covered by test cases.
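To make this concrete, here is a minimal sketch of what such checks might look like as parametrized pytest cases. The constraint names, bounds, and the `recommend` stub are hypothetical, invented for illustration, not taken from a real project:

```python
import pytest

# Hypothetical restrictions lifted from regulatory documents
# (names and bounds invented for illustration).
CONSTRAINTS = [
    ("furnace_temperature_c", 850.0, 950.0),
    ("feed_rate_t_per_h", 10.0, 45.0),
    ("additive_share_pct", 0.0, 12.5),
]

def recommend(production_plan: dict) -> dict:
    """Stand-in for a call to the optimization service under test."""
    return {"furnace_temperature_c": 900.0,
            "feed_rate_t_per_h": 30.0,
            "additive_share_pct": 5.0}

@pytest.mark.parametrize("name,low,high", CONSTRAINTS)
def test_recommendation_respects_restriction(name, low, high):
    recommendation = recommend({"plan_t": 100.0})
    # Each regulatory restriction gets its own test case,
    # so coverage of the full list stays explicit and auditable.
    assert low <= recommendation[name] <= high
```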



Using a real project for optimizing the production of material N as an example (we have not yet disclosed the case, so we will use anonymized names), we solved this problem as follows:





As a result, we solved the problem of regression testing and gained confidence that changes introduced into the model do not affect the earlier results of our work.
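As an illustration of that kind of regression safety net, here is a sketch under our own assumptions (the file layout, tolerance, and `run_model` stub are not the project's actual setup): fixed inputs are replayed through each new model build and compared against recommendations saved from an accepted version.

```python
import json
import math

TOLERANCE = 1e-6  # assumed acceptable numeric drift between builds

def run_model(inputs: dict) -> dict:
    """Stand-in: replace with a call to the current model build."""
    return dict(inputs)

def test_against_baseline():
    # baseline_recommendations.json holds fixed inputs and the outputs
    # produced on them by an earlier, accepted model version.
    with open("baseline_recommendations.json") as f:
        baseline = json.load(f)
    for case in baseline["cases"]:
        current = run_model(case["input"])
        for key, expected in case["output"].items():
            assert math.isclose(current[key], expected, abs_tol=TOLERANCE), \
                f"{key} drifted from the accepted baseline"
```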



Testing focused on bottlenecks



Optimization models are best at predicting what occurs most often in the historical data; conversely, a model can “turn into a pumpkin” as soon as it encounters something unfamiliar.



In such cases the optimization service often has to be “prompted” toward adequate behavior, which produces the hardcode I mentioned earlier. But how do you identify these bottlenecks and debug the service? I will explain using the example of developing a recommendation service for managing the production of material N (the case cannot be disclosed yet, hence the anonymized names below).



First of all, our architect developed an integration emulator that generated production-like data and filled a data frame, on the basis of which the optimization model issued recommendations for processing material N. Next, the tester analyzed these recommendations and identified the most suspicious ones: those whose recommended parameters stood out from the overall mass. Already at this stage we were able to identify many problems where the model, in one way or another, failed to process the incoming data stream adequately. This allowed us to stabilize the service and move on to the next step.
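One simple way to surface such suspicious recommendations automatically is an outlier filter over the whole batch. The z-score heuristic below is our illustration, not the project's exact method:

```python
import pandas as pd

def flag_suspicious(recommendations: pd.DataFrame,
                    threshold: float = 3.0) -> pd.DataFrame:
    """Return rows whose recommended parameters deviate from the bulk
    of recommendations by more than `threshold` standard deviations."""
    z = (recommendations - recommendations.mean()) / recommendations.std()
    return recommendations[(z.abs() > threshold).any(axis=1)]

# Usage: run the model on emulator-generated data, collect the
# recommendations into a DataFrame, and hand the flagged rows to a tester.
```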



The second stage was “silent” testing. The service was brought up in the production environment in the background: it ran without distracting the operator processing material N from controlling the machine, while we collected the operator's decisions and compared them with what the service recommended. Thanks to this, we found blind spots in the model that could not have been caught at the previous stage.
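In this silent (shadow) mode, the core of the comparison can be as simple as logging paired observations for later review; a sketch with hypothetical field names:

```python
import csv
from datetime import datetime, timezone

def log_shadow_pair(operator_decision: dict, service_recommendation: dict,
                    path: str = "shadow_log.csv") -> None:
    """Append operator vs. service values for each parameter; rows with a
    large gap are later reviewed to locate the model's blind spots."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for param, op_value in operator_decision.items():
            rec_value = service_recommendation.get(param)
            gap = abs(op_value - rec_value) if rec_value is not None else ""
            writer.writerow([datetime.now(timezone.utc).isoformat(),
                             param, op_value, rec_value, gap])
```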



The model must be able to respond to production changes



Our project portfolio includes a service for optimizing the production of fuel materials. The essence of the service is that a technologist passes the stocks of production components to the model, sets the limiting product-quality indicators and the required production plan, and in response receives a recommendation: in what proportions to use which components in order to obtain fuel of the required quantity and quality.
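At its core this resembles a classic blending problem. Purely as a sketch with invented numbers (the real service is an ML model with far richer constraints, not this linear program), the recommendation could be posed like so:

```python
import numpy as np
from scipy.optimize import linprog

# Invented data: three components with a cost, a quality index, and a stock.
cost = np.array([30.0, 45.0, 60.0])      # cost per ton of each component
quality = np.array([80.0, 92.0, 98.0])   # quality index of each component
stock = np.array([50.0, 40.0, 20.0])     # tons available in stock
plan_tons = 60.0                         # required production plan
min_quality = 90.0                       # limiting quality indicator

# Minimize cost subject to: blend-average quality >= min_quality,
# total tonnage equal to the plan, and usage within available stock.
result = linprog(
    c=cost,
    A_ub=[-quality], b_ub=[-min_quality * plan_tons],
    A_eq=[np.ones_like(cost)], b_eq=[plan_tons],
    bounds=list(zip([0.0] * 3, stock)),
)
print("recommended tons per component:", result.x)
```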



While developing this service, we encountered a curious problem that we could not have foreseen in advance.



For several years, the company had produced fuel within a certain range of unit rotation speeds and used roughly the same ratio of components.



Recently, however, the supply of these components changed, and it became possible to compensate by increasing the speed of the units. The customer expected the model to respond to these changes, since from a process-engineering standpoint this is an acceptable solution, but that did not happen. Why? The answer is obvious: the model was trained on a historical sample in which nothing of the kind had ever occurred. One could argue at length about who is right and who is to blame here, but going forward we planned to reduce the likelihood of such situations as follows:



  1. Interact more closely with the customer's representative from the production unit to identify bottlenecks and potential production changes.

  2. Cover such system-behavior scenarios with test cases in advance.

  3. Write automated tests to check compliance with production restrictions and the correlation of features (see the sketch after this list).
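A sketch of item 3, under our own assumptions about the data layout: one check flags incoming inputs that fall outside the range the model saw in training (like the new unit speeds), another watches for drift in pairwise feature correlations.

```python
import pandas as pd

def out_of_training_range(train: pd.DataFrame,
                          incoming: pd.DataFrame) -> pd.DataFrame:
    """Rows whose feature values fall outside the range seen in training,
    i.e. situations the model has never observed."""
    low, high = train.min(), train.max()
    mask = incoming.lt(low) | incoming.gt(high)
    return incoming[mask.any(axis=1)]

def correlation_drift(train: pd.DataFrame, incoming: pd.DataFrame,
                      max_drift: float = 0.2) -> pd.Series:
    """Feature pairs whose pairwise correlation differs between the
    training sample and fresh production data by more than max_drift."""
    drift = (train.corr() - incoming.corr()).abs()
    return drift[drift > max_drift].stack()
```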



A few words about the testing tools we had to use:





Do you test ML?



For us, building an ML testing practice was a challenge: it had to be grown practically from scratch. But testing is necessary; it helps reduce the number of errors before trial operation and shortens implementation time.



Today we all need to share experiences. We should start discussions about testing at specialized venues and professional forums, of which, by the way, there are more and more in the ML field. And if you already have established practices for testing ML, I think everyone would be interested to read about them, so share your experience in the comments.



Sergey Agaltsov, test manager, Jet Infosystems


