“Hasn’t this happened before?” Searching for similar incidents and requests

Everyone who has spent some time supporting systems knows the déjà vu feeling on receiving a new request: “this has happened before, it was sorted out, but I don’t remember exactly how”. You can spend the time, dig through previous requests, and try to find similar ones. It pays off: the incident gets closed faster, and sometimes you can even detect the root cause and close the problem once and for all.

“Young” employees who have just joined the team have no such history in their heads. Most likely, they do not know that a similar incident occurred, say, six months to a year ago, and that a colleague from the next room resolved it.

Most likely, the “young” employees will not search the incident database for something similar, but will solve the problem from scratch. They will spend more time, gain experience, and cope faster the next time. Or they may immediately forget it all under the stream of new requests, and next time everything will happen again.

We already use ML models to classify incidents. To help our team process requests more efficiently, we have created another ML model that prepares a list of “previously closed similar incidents”. Details are under the cut.

What do we need?



For each incoming incident, we need to find “similar” closed incidents in the history. “Similarity” should be determined at the very beginning of the incident’s life, preferably before support staff have started their analysis.

To compare incidents, we use the information provided by the user in the request: a brief description, a detailed description (if any), and attributes of the user’s record.

The team supports 4 groups of systems. The total number of incidents we want to search for similar ones in is about 10 thousand.

First solution



We have no verified ground-truth information on the “similarity” of incidents. So the state-of-the-art options, such as training Siamese networks, will have to wait.

The first thing that comes to mind is simple clustering of a “bag of words” built from the contents of the requests.







In this case, the incident handling process is as follows:







  1. Extracting the relevant text fragments
  2. Text preprocessing / cleaning
  3. TF-IDF vectorization
  4. Nearest-neighbor search


Clearly, with this approach similarity is based on comparing vocabularies: the use of the same words or n-grams in two different incidents is regarded as “similarity”.

Of course, this is a fairly simplified approach. But remember that we evaluate the texts of user requests: if a problem is described in similar words, the incidents are most likely similar. In addition to the text, we can add the name of the user's department, expecting that users from the same departments in different organizations will have similar problems.
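As a toy illustration of this dictionary-based notion of similarity (not the production model, which uses TF-IDF vectors), plain word-set overlap already captures the idea that two requests sharing more vocabulary score closer to 1:

```python
# Toy illustration only: "similarity" as vocabulary overlap between two
# ticket texts (Jaccard similarity of their word sets).
def word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

print(word_overlap("printer does not print",
                   "the printer does not respond"))  # 0.5
```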







Extracting the relevant text fragments



We get incident data from service-now.com in the simplest way possible: by programmatically launching saved user reports and receiving their results as CSV files.







The messages exchanged between support and the user within an incident come back as one large text field containing the entire correspondence history.







The information about the initial request had to be “cut out” of this field with regular expressions.









It turned out something like this:







import re

def get_first_message(messages):
    res = ""
    if len(messages) > 0:
        # take the first message
        spl = re.split(r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} - ((\w+((\s|-)\w+)?,(\s\w+)+)|\w{9}|guest)\s\(\w+\s\w+\)\n",
                       messages.lower())
        res = spl[-1]
        # cut off "mail footer" with finalization statements
        res = re.split(r"(best|kind)(\s)+regard(s)+", res)[0]
        # cut off "mail footer" with embedded pictures
        res = re.split(r"\[cid:", res)[0]
        # cut off "mail footer" with phone prefix
        res = re.split(r"\+(\d(\s|-)?){7}", res)[0]
    return res
      
      





Preprocessing incident texts



To improve classification quality, the request text is preprocessed.







Using a set of regular expressions, we find characteristic fragments in the incident descriptions: dates, server names, product codes, IP addresses, web addresses, misspelled forms of names, and so on. Such fragments are replaced with the corresponding concept tokens.
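A simplified sketch of this concept-token substitution: concrete IP addresses are collapsed into a single token so that different addresses do not inflate the vocabulary (the pattern mirrors the IPV4 entry in the transformer's replacement list):

```python
import re

# Replace any concrete IPv4 address with the single token #IPV4#,
# so "10.20.30.40" and "192.168.0.1" look identical to the vectorizer.
text = "cannot reach server at 10.20.30.40 since this morning"
cleaned = re.sub(r"(\d{1,3}\.){3}\d{1,3}", "#IPV4#", text)
print(cleaned)  # cannot reach server at #IPV4# since this morning
```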







Finally, stemming was used to reduce words to a common form; this gets rid of plural forms and verb endings. The well-known snowballstemmer was used as a stemmer.







All preprocessing steps are combined into one transformer class, which can be reused in different pipelines.







Incidentally, it turned out (experimentally, of course) that the stemmer.stemWord() method is not thread-safe. So if you try to parallelize text processing within the pipeline, for example using joblib's Parallel / delayed, access to the shared instance of the stemmer must be protected with locks.







import re
import threading

import numpy as np
import pandas as pd
import snowballstemmer
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin

__replacements = [
    (r'(\d{1,3}\.){3}\d{1,3}', 'IPV4'),
    (r'(?<=\W)((\d{2}[-\/ \.]?){2}(19|20)\d{2})|(19|20)\d{2}([-\/ \.]?\d{2}){2}(?=\W)', 'YYYYMMDD'),
    (r'(?<=\W)(19|20)\d{2}(?=\W)', 'YYYY'),
    (r'(?<=\W)(0|1)?\d\s?(am|pm)(?=\W)', 'HOUR'),
    (r'http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', 'SOMEURL')
    # ... further replacement patterns
]

__stemmer_lock = threading.Lock()
__stemmer = snowballstemmer.stemmer('english')

def stem_string(text: str):
    def stem_words(word_list):
        # the shared stemmer instance is not thread-safe
        with __stemmer_lock:
            res = __stemmer.stemWords(word_list)
        return res
    return " ".join(stem_words(text.split()))

def clean_text(text: str):
    res = text
    for p in __replacements:
        res = re.sub(p[0], '#' + p[1] + '#', res)
    return res

def process_record(record):
    txt = ""
    for t in record:
        t = "" if pd.isna(t) else t  # was `t == np.nan`, which is always False
        txt += " " + get_first_message(str(t))
    return stem_string(clean_text(txt.lower()))

class CommentsTextTransformer(BaseEstimator, TransformerMixin):
    _n_jobs = 1

    def __init__(self, n_jobs=1):
        self._n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        features = Parallel(n_jobs=self._n_jobs)(
            delayed(process_record)(rec) for i, rec in enumerate(X.values)
        )
        return np.array(features, dtype=object).reshape(len(X),)
      
      





Vectorization



Vectorization is carried out by the standard TfidfVectorizer; its settings are visible in the pipeline below (custom stop words, 1- to 3-grams, a cap of 10,000 features). TfidfVectorizer itself performs L2 normalization by default, so the incident vectors come out ready for measuring the cosine distance between them.
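Why the L2 normalization matters, in a minimal sketch: for unit-length vectors, cosine similarity is simply the dot product, and squared Euclidean distance becomes a monotonic function of cosine distance, ||a − b||² = 2·(1 − cos(a, b)):

```python
import math

def normalize(v):
    # scale a vector to unit L2 length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_sim = sum(x * y for x, y in zip(a, b))        # dot product = cosine similarity
dist_sq = sum((x - y) ** 2 for x, y in zip(a, b)) # squared Euclidean distance

print(round(cos_sim, 4))  # 0.96
print(round(dist_sq, 4))  # 0.08, i.e. 2 * (1 - 0.96)
```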







Search for similar incidents



The main task of the process is to return a list of the N nearest neighbors, and the sklearn.neighbors.NearestNeighbors class is quite suitable for this. The one problem is that it does not implement the transform method, without which it cannot be used in a Pipeline.







Therefore, we had to wrap it in a TransformerMixin-based class, which could then be placed as the last step of the pipeline:







from sklearn.base import TransformerMixin
from sklearn.neighbors import NearestNeighbors

class NearestNeighborsTransformer(NearestNeighbors, TransformerMixin):
    def __init__(self, n_neighbors=5, radius=1.0,
                 algorithm='auto', leaf_size=30, metric='minkowski',
                 p=2, metric_params=None, n_jobs=None, **kwargs):
        super().__init__(n_neighbors=n_neighbors, radius=radius,
                         algorithm=algorithm, leaf_size=leaf_size,
                         metric=metric, p=p,
                         metric_params=metric_params, n_jobs=n_jobs)

    def transform(self, X, y=None):
        # return both distances and indices of the nearest neighbors
        res = self.kneighbors(X, self.n_neighbors, return_distance=True)
        return res
      
      





The processing pipeline



Putting it all together, we get a compact process:







from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

p = Pipeline(
    steps=[
        ('grp', ColumnTransformer(
            transformers=[
                ('text', Pipeline(steps=[
                    ('pp', CommentsTextTransformer(n_jobs=-1)),
                    ("tfidf", TfidfVectorizer(stop_words=get_stop_words(),
                                              ngram_range=(1, 3),
                                              max_features=10000))
                ]),
                 ['short_description', 'comments', 'u_impacted_department']
                 )
            ]
        )),
        ("nn", NearestNeighborsTransformer(n_neighbors=10, metric='cosine'))
    ],
    memory=None)
      
      





After training, the pipeline can be saved to a file using pickle and then used to handle incoming incidents.

Together with the model, we save the necessary incident fields, so that they can later be shown in the output when the model is applied.
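A minimal sketch of this save/restore step. A plain dict stands in here for the fitted Pipeline `p` and the saved reference fields; in production the pickled bytes go to a file at training time and are read back in the serving process:

```python
import pickle

# Bundle the model with the incident fields needed for the output.
# The dict contents are placeholders for the real fitted objects.
model_bundle = {
    "pipeline": "fitted-pipeline-goes-here",
    "ref_fields": ["number", "short_description"],
}

blob = pickle.dumps(model_bundle)   # write these bytes to a file at train time
restored = pickle.loads(blob)       # read them back in the serving process

print(restored["ref_fields"])  # ['number', 'short_description']
```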







# inc_data - pandas.DataFrame with the incoming incidents.
# ref_data - pandas.DataFrame with the historical incidents saved with the model.
# Add a column that will hold the recommendations:
inc_data["recommendations_json"] = ""
# column_list - the list of incident fields the model was trained on
nn_dist, nn_refs = p.transform(inc_data[column_list])

rec_col = inc_data.columns.get_loc("recommendations_json")
for idx, refs in enumerate(nn_refs):
    nn_data = ref_data.iloc[refs][['number', 'short_description']].copy()
    nn_data['distance'] = nn_dist[idx]
    # was `inc_data.iloc[idx][...] = ...`, which assigns to a copy and is lost
    inc_data.iloc[idx, rec_col] = nn_data.to_json(orient='records')

# Save the result; downstream tooling picks up the JSON file.
inc_data[['number', 'short_description', 'recommendations_json']].to_json(out_file_name, orient='records')
      
      





First application results



The reaction of colleagues to the introduction of the “hint” system was generally very positive: recurring incidents began to be resolved faster, and we started working on eliminating the underlying problems.







However, one could not expect a miracle from an unsupervised system. Colleagues complained that it sometimes offered completely irrelevant links, and sometimes it was hard to even understand where a recommendation came from.







It was clear that there was huge room for improving the model. Some shortcomings could be fixed by including or excluding certain incident attributes, others by choosing an adequate cutoff for the distance between the current incident and a “recommendation”. Other vectorization methods could also be considered.







But the main problem was the lack of quality metrics for the recommendations. Without them it was impossible to understand “what is good and what is bad, and by how much”, or to compare models against each other.







We had no access to HTTP logs, since the service desk system runs remotely (SaaS). We conducted user surveys, but only qualitative ones. We needed to move on to quantitative assessments and build clear quality metrics on top of them.







But more about that in the next part ...







