Search for similar incidents and claims. Metrics and Optimization

In a previous article, I talked about our search engine for similar incidents. After its launch, we began to receive the first feedback: analysts liked some of the recommendations and disliked others.







To move forward and find better models, we first needed to evaluate the performance of the current model and choose criteria by which two models could be compared with each other.







Under the cut, I’ll talk about:

  1. how we collected feedback on the recommendations;
  2. what the first feedback showed and what the main problems of the first model were;
  3. how we developed quality criteria and a method for comparing models;
  4. the new model, its hyperparameter optimization, and the results.

Feedback collection



Ideally, we would collect explicit feedback from analysts: how relevant the recommendation of each proposed incident is. This would let us understand the current situation and keep improving the system based on quantitative indicators.







It was decided to collect feedback in an extremely simple format: for each recommended incident, the analyst rates it either "good" or "bad", and the tuple "incident", "recommended incident", "rating" is saved.

The "vote" (a small project that accepted GET requests with parameters, and put the information in a file) was placed directly in the recommendations block so that analysts could leave their feedback immediately by simply clicking on one of the links: "good" or "bad".







Additionally, a very simple solution was put together for retrospective review of recommendations:









In this way, we collected data on more than 4000 incident-recommendation pairs.







Initial review analysis



The initial numbers were "so-so": according to our colleagues, the share of "good" recommendations was only about 25%.







The main problems of the first model:







  1. incidents on "new" problems received irrelevant recommendations from the system; It turned out that in the absence of coincidences in the content of the appeal, the system selected incidents close to the department of the contacting employee.
  2. recommendations for an incident on one system hit incidents from other systems. The words used in the appeal were similar, but described the problems of other systems and were different.


Possible ways to improve the quality of recommendations were selected:

  1. add information about the "incident area" (the affected system) to the incident vector;
  2. add information about the department and location of the employee who created the incident as separate vector components;
  3. tune the text vectorization hyperparameters and the weights of the vector components.

Development of quality criteria and assessment methods



To search for an improved version of the model, we first need to define how the quality of a model's results will be assessed. This makes it possible to compare two models quantitatively and choose the better one.







What can be obtained from the collected reviews



We have a set m of tuples of the form: "incident", "recommended incident", "recommendation rating".









Having such data, we can calculate:

  1. n_inc_good — the number of incidents for which at least one "good" recommendation was found;
  2. n_rec_good — the total number of "good" recommendations;
  3. avg_inc_good — the average number of "good" recommendations per incident.

These indicators, calculated from user ratings, can be treated as the "baseline indicators" of the original model. We will compare the corresponding indicators of new versions of the model against them.
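
The article does not spell out how exactly these indicators are computed, so here is a minimal sketch under one plausible interpretation, assuming the collected feedback is loaded into a pandas DataFrame with hypothetical columns incident, recommended and rating:

import pandas as pd

# Hypothetical feedback set m: one row per (incident, recommended incident, rating)
m = pd.DataFrame({
    'incident':    ['INC1', 'INC1', 'INC2', 'INC3'],
    'recommended': ['INC10', 'INC11', 'INC12', 'INC13'],
    'rating':      ['good', 'bad', 'good', 'bad'],
})

good = m[m['rating'] == 'good']
n_rec_good = len(good)                               # total number of "good" recommendations
n_inc_good = good['incident'].nunique()              # incidents with at least one "good" recommendation
avg_inc_good = n_rec_good / m['incident'].nunique()  # average number of "good" recommendations per incident

print(n_inc_good, n_rec_good, avg_inc_good)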







We take all the unique "incidents" from m and run them through the new model.







As a result, we get a set m* of tuples: "incident", "recommended incident", "distance".

Here, "distance" is the metric defined in NearestNeighbors. In our model it is the cosine distance; a value of 0 corresponds to exactly matching vectors.







Selection of "cutoff distance"



Supplementing the set of recommendations m* with the true rating v from the initial set of ratings m, we obtain the correspondence between the distance d and the true rating v for this model.
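
A minimal sketch of this join with pandas, continuing the hypothetical column names used above and encoding the rating as 1 for "good" and -1 for "bad", which matches the TYPE column expected by the cutoff-selection code below:

import pandas as pd

# Hypothetical fragments of the two sets
m_star = pd.DataFrame({'incident':    ['INC1', 'INC1', 'INC2'],
                       'recommended': ['INC10', 'INC11', 'INC12'],
                       'distance':    [0.12, 0.45, 0.30]})
m = pd.DataFrame({'incident':    ['INC1', 'INC1', 'INC2'],
                  'recommended': ['INC10', 'INC11', 'INC12'],
                  'rating':      [1, -1, 1]})  # 1 = "good", -1 = "bad"

# Join m* with the true ratings from m to obtain the (d, v) pairs
z_data_for_t = (m_star.merge(m, on=['incident', 'recommended'], how='inner')
                      .rename(columns={'rating': 'TYPE'})[['distance', 'TYPE']])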







Having the set (d, v), we can choose an optimal cutoff level t such that a recommendation with d <= t is considered "good" and one with d > t "bad". The selection of t can be done by optimizing the simplest binary classifier v = -1 if d > t else 1 with respect to the hyperparameter t, using, for example, ROC AUC as the metric.







import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Binarizer

# A simple threshold "classifier": distance <= threshold -> 1 ("good"), otherwise -1 ("bad")
class BinarizerClassifier(Binarizer):
    def transform(self, x):
        return np.array([-1 if _x > self.threshold else 1
                         for _x in np.array(x, dtype=float)]).reshape(-1, 1)

    def predict_proba(self, x):
        z = self.transform(x)
        return np.array([[0 if _x > 0 else 1, 1 if _x > 0 else 0] for _x in z.ravel()])

    def predict(self, x):
        return self.transform(x)

# Elsewhere in the code:
# - the data is prepared,
# - the recommendations m* are obtained,
# - the pairs (d, v) are collected in z_data_for_t

# Selecting t
b = BinarizerClassifier()
z_x = z_data_for_t[['distance']]
z_y = z_data_for_t['TYPE']
cv = GridSearchCV(b, param_grid={'threshold': np.arange(0.1, 0.7, 0.01)},
                  scoring='roc_auc', cv=5, iid=False, n_jobs=-1)
cv.fit(z_x, z_y)
score = cv.best_score_
t = cv.best_params_['threshold']
best_b = cv.best_estimator_





The obtained value of t can be used to filter recommendations.







Of course, this approach can still let through "bad" recommendations and cut off "good" ones. Therefore, at this stage we always show the "Top 5" recommendations, but specially mark those that are considered "good" given the t we found.

An alternative option: if at least one "good" recommendation is found, show only the "good" ones; otherwise, show everything available (again, "Top N").
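
A minimal sketch of both display strategies, assuming recommendations is a list of hypothetical (incident_id, distance) pairs sorted by distance and t is the cutoff found above:

def mark_good(recommendations, t, top_n=5):
    # Always show the top N, flagging the ones at or below the cutoff as "good"
    return [(inc, dist, dist <= t) for inc, dist in recommendations[:top_n]]

def filter_good(recommendations, t, top_n=5):
    # If anything is "good", show only the "good" ones; otherwise fall back to the top N
    good = [(inc, dist) for inc, dist in recommendations if dist <= t]
    return good if good else recommendations[:top_n]

recs = [('INC10', 0.12), ('INC11', 0.35), ('INC12', 0.61)]
print(mark_good(recs, t=0.4))    # all three, with a "good" flag on the first two
print(filter_good(recs, t=0.4))  # only the first two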







Assumption for comparing models



The same incident corpus is used to train both models.

Suppose that if a "good" recommendation was previously found for an incident, then the new model should also find a "good" recommendation for that incident. In particular, the new model may find the same "good" recommendations as the old one. At the same time, we expect the new model to produce fewer "bad" recommendations.







Then, computing the same indicators for the recommendations m* of the new model, we can compare them with the corresponding indicators for m and choose the better model based on that comparison.







The "good" recommendations for the set m* can be taken into account in one of two ways:







  1. based on the found t: assume that all recommendations from m* with d < t are "good" and take them into account when calculating the metrics;
  2. based on the corresponding true ratings from the set m: from the recommendations m*, select only those for which there is a true rating in m, and discard the rest.


In the first case, the "absolute" indicators (n_inc_good, n_rec_good) of the new model should be greater than those of the base model. In the second case, the indicators should approach those of the base model.

The problem with the second method: if the new model is better than the original and finds something previously unknown, such a recommendation will not be taken into account in the calculation.
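
A hedged sketch of the two accounting methods, reusing the hypothetical column names and indicator definitions from the earlier sketches (they are one reading of the article, not its exact code):

import pandas as pd

# Tiny hypothetical data, in the shape of the earlier sketches
m_star = pd.DataFrame({'incident':    ['INC1', 'INC1', 'INC2', 'INC3'],
                       'recommended': ['INC10', 'INC11', 'INC12', 'INC13'],
                       'distance':    [0.12, 0.45, 0.30, 0.70]})
m = pd.DataFrame({'incident':    ['INC1', 'INC2'],
                  'recommended': ['INC10', 'INC12'],
                  'rating':      [1, 1]})  # true ratings: 1 = "good", -1 = "bad"
t = 0.4                                    # cutoff found earlier

def indicators(good, n_incidents):
    # (n_inc_good, n_rec_good, avg_inc_good) under the interpretation used above
    return good['incident'].nunique(), len(good), len(good) / n_incidents

n_incidents = m_star['incident'].nunique()

# Method 1: every recommendation closer than the cutoff t counts as "good"
print(indicators(m_star[m_star['distance'] < t], n_incidents))

# Method 2: only recommendations with a true "good" rating in m are counted
rated = m_star.merge(m, on=['incident', 'recommended'], how='inner')
print(indicators(rated[rated['rating'] == 1], n_incidents))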







Selecting the model comparison parameters



When choosing a new model, we want the following indicators to improve compared to the existing model:

  1. avg_inc_good — the average number of "good" recommendations per incident;
  2. n_inc_good — the number of incidents with "good" recommendations.

To compare with the original model, we will use the ratios of these parameters between the new model and the original one. Thus, if the ratio of a parameter of the new model to that of the old one is greater than 1, the new model is better.







benchmark_avg_inc_good = avg_inc_good* / avg_inc_good
benchmark_n_inc_good = n_inc_good* / n_inc_good





To simplify the selection, it is better to use a single parameter. We take the harmonic mean of the individual relative indicators and use it as a single composite quality criterion for the new model.







composite = 2 / (1 / benchmark_avg_inc_good + 1 / benchmark_n_inc_good)





New model and its optimization



For the new model, we add components responsible for the "incident area" (one of the several systems serviced by our team) to the final vector representing the incident.

Information about the department and location of the employee who created the incident is also placed in separate vector components. Each component has its own weight in the final vector.







from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, OneHotEncoder
# CommentsTextTransformer, NearestNeighborsTransformer and get_stop_words
# come from the project's own code and helpers (not shown here)

p = Pipeline(
    steps=[
        ('grp', ColumnTransformer(
            transformers=[
                ('text', Pipeline(steps=[
                    ('pp', CommentsTextTransformer(n_jobs=-1)),
                    ("tfidf", TfidfVectorizer(stop_words=get_stop_words(),
                                              ngram_range=(1, 3),
                                              max_features=10000,
                                              min_df=0))
                ]),
                 ['short_description', 'comments']),
                ('area', OneHotEncoder(handle_unknown='ignore'), ['area']),
                ('dept', OneHotEncoder(handle_unknown='ignore'), ['u_impacted_department']),
                ('loc', OneHotEncoder(handle_unknown='ignore'), ['u_impacted_location'])
            ],
            transformer_weights={'text': 1, 'area': 0.5, 'dept': 0.1, 'loc': 0.1},
            n_jobs=-1
        )),
        ('norm', Normalizer()),
        ("nn", NearestNeighborsTransformer(n_neighbors=10, metric='cosine'))
    ],
    memory=None)





Model hyperparameters are expected to influence the target indicators. In the selected model architecture, we will treat the following as hyperparameters:

  1. the text vectorization parameters: ngram_range, max_features, min_df;
  2. the component weights in the final vector: transformer_weights.

The initial values of the text vectorization hyperparameters are taken from the previous model. The initial component weights are selected based on expert judgment.







Parameter selection cycle



How to evaluate recommendations, select the cutoff level, and compare models with each other has already been determined. Now we can proceed to optimization by selecting hyperparameters.







Optimization cycle







from sklearn.model_selection import ParameterGrid

param_grid = {
    'grp__text__tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2)],
    'grp__text__tfidf__max_features': [5000, 10000, 20000],
    'grp__text__tfidf__min_df': [0, 0.0001, 0.0005, 0.001],
    'grp__transformer_weights': [
        {'text': 1, 'area': 0.5, 'dept': 0.1, 'loc': 0.1},
        {'text': 1, 'area': 0.75, 'dept': 0.1, 'loc': 0.1},
        {'text': 1, 'area': 0.5, 'dept': 0.3, 'loc': 0.3},
        {'text': 1, 'area': 0.75, 'dept': 0.3, 'loc': 0.3},
        {'text': 1, 'area': 1, 'dept': 0.1, 'loc': 0.1},
        {'text': 1, 'area': 1, 'dept': 0.3, 'loc': 0.3},
        {'text': 1, 'area': 1, 'dept': 0.5, 'loc': 0.5},
    ],
}

for param in ParameterGrid(param_grid=param_grid):
    p.set_params(**param)
    p.fit(x)
    ...





Optimization Results



The table shows the experiments in which interesting results were achieved: the top 5 best and worst values of the controlled indicators.













The cells with indicators in the table are marked as:









The best composite indicator was obtained for a model with parameters:







ngram_range = (1, 2)
min_df = 0.0001
max_features = 20000
transformer_weights = {'text': 1, 'area': 1, 'dept': 0.1, 'loc': 0.1}





The model with these parameters showed a 24% improvement in the composite indicator compared to the original model.







Some observations and conclusions



According to the optimization results:







  1. Using trigrams (ngram_range = (1, 3)) does not seem to be justified: they inflate the dictionary while only slightly increasing accuracy compared to bigrams.







  2. An interesting behavior appears when the dictionary is built from bigrams only (ngram_range = (2, 2)): the "accuracy" of the recommendations increases while the number of recommendations found decreases, much like the precision/recall balance in classifiers. Similar behavior shows up when selecting the cutoff level t: bigrams are characterized by a narrower cutoff "cone" and a better separation of "good" and "bad" recommendations.







  3. A non-zero min_df, together with bigrams, increases the accuracy of recommendations: they start to be based on terms that occur at least several times. As the parameter grows, the dictionary begins to shrink rapidly. For small samples like ours, it is probably easier to reason in terms of the number of documents containing a term (an integer min_df) rather than the fraction of documents (a fractional min_df); a small sketch of this distinction follows the list.







  4. Good results are obtained when the incident attribute responsible for the "area" is included in the final vector with a weight equal or close to that of the text component. Low weights increase the proportion of "bad" recommendations because similar words are matched in documents from other areas. The attributes of the requester's department and location, by contrast, did not influence the recommendation results nearly as much in our case.
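
To illustrate the min_df distinction from point 3: scikit-learn's TfidfVectorizer treats an integer min_df as an absolute document count and a float as a fraction of documents. A tiny sketch on a toy corpus (illustrative only, not data from the article):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['printer error in office',
        'printer offline again',
        'vpn connection error',
        'vpn token expired']

# Integer: keep terms that appear in at least 2 documents
v_int = TfidfVectorizer(min_df=2).fit(docs)
# Float: keep terms that appear in at least 50% of documents (here the same threshold, 2 of 4)
v_frac = TfidfVectorizer(min_df=0.5).fit(docs)

print(sorted(v_int.vocabulary_))   # ['error', 'printer', 'vpn']
print(sorted(v_frac.vocabulary_))  # the same vocabulary for this toy corpus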









Some new ideas have come up:









