The chatbot industry is booming. At first bots were fairly primitive: they could hold a dialogue only by leading it themselves and offering the user a set of predefined answers. Then they got a little smarter and started asking for free-text input so they could pull keywords out of the replies. Advances in machine learning made it possible to talk to a bot by voice as well. However, most solutions never moved far beyond the same old scheme: building a dialogue graph and jumping between its nodes on keywords.
Recently at Parallels we decided to optimize a number of internal processes and, as an experiment, build a bot for our own needs. After a short search we decided to try our luck with the open-source RASA project. According to the developers themselves, they have built a third-generation chatbot: one that does not simply walk a state graph but is able to store and use the context of the preceding dialogue. To date, the best illustration of modern chatbots looks something like this:
In other words, chatbots are just a carefully tuned set of rules for moving from one node of the graph to another. If you look at the existing solutions from the market giants, there is in fact nothing there beyond such a set of rules either. Roughly speaking, the set looks something like this:
The dialogue is at node XXX.
If the user entered a sentence containing the words ['buy', 'ticket'], go to node "ASK WHERE".
If the user entered a sentence containing the words ['buy', 'cutlet'], go to node "ASK FROM WHAT".
The flaw is immediately obvious: even if the user types "I would like to buy a ticket to Porto," the bot will still ask, "Where do you want to go?" To make the dialogue more human, you have to add new rules for what to do when the destination is already specified.
Then rules for when both the destination and the time are specified, and so on. This set of rules grows quickly, but that is not even the worst part: all the "right" paths can eventually be described, refined and polished.
The most unpleasant thing is that a person, unlike a bot, is an unpredictable creature and can at any moment start asking about something completely different. That is, just when the bot is ready to book the ticket, the person may ask, "by the way, what's the weather like?" or "actually, no, I'd rather drive my own car; how long would that take?"
They can also ask this after choosing a city but before choosing a departure time, or even before choosing a destination at all. A bot built on state machines will simply jam, its mechanical pseudopods twitching sadly, and the user will be left frustrated.
This is where you can (and should) use machine learning. But then new problems arise: for example, if you use reinforcement learning to predict transitions between graph nodes, where do you get the data for that training, and who will rate the quality of the answers?
Users are unlikely to volunteer to teach your bot, and, as practice shows, a community of users may teach a bot something quite different from what you want and from what society considers decent. In addition, an untrained bot will at first respond almost at random, which will irritate users and put them off dealing with such "support" altogether.
After analyzing the shortcomings of the existing bots, the RASA developers tried to solve these problems as follows:
- Any user input goes through intent classification: the entered text is mapped, using machine learning, to one (or several) intents. If needed, entities are also extracted from the text and added to the bot's memory (a minimal usage sketch follows this list).
- This step is similar to what other bots do, except for the intent model used.
- The bot's next action is predicted with machine learning based on the context, that is, the previous actions, intents and the state of the bot's memory.
- Not much data is needed for the initial training, and the bot can often predict which action to take even without a specific example or rule for it.
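To make the first step more tangible, here is a minimal sketch of what intent classification and entity extraction look like in code for the RASA NLU versions current at the time of writing; the model path here is just an assumption for the example:

import rasa_nlu
from rasa_nlu.model import Interpreter

# Load a previously trained NLU model (the path is an assumption for this sketch).
interpreter = Interpreter.load("./models/nlu/default/current")

# One call maps free-form text to an intent and extracts entities from it.
result = interpreter.parse("i would like to buy a ticket to Porto")

print(result["intent"])          # the most likely intent with its confidence score
print(result["intent_ranking"])  # the full ranked list of intents
print(result["entities"])        # entities found in the text, e.g. the destination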
Let's look at how these mechanisms work in more detail.
RASA NLU
Let's start with the first pillar on which the bot rests: Natural Language Understanding (NLU), which consists of two main parts, intent classification and entity recognition.
Intent detection
Intent classification is based on a modified version of Facebook's StarSpace algorithm, implemented in TensorFlow. Notably, no pre-trained word-embedding models are used, which sidesteps the limitations of such representations.
For example, intent classification in RASA works equally well for any language and for any domain-specific words you put into the training examples. With pre-trained embeddings such as GloVe or word2vec, localizing the bot or applying it in highly specialized areas causes plenty of headaches.
The algorithm vectorizes sentences as bags of words and compares their "similarity". Intent examples and the intents themselves are converted into bag-of-words vectors and fed into the corresponding neural networks. The output of each network is a vector for that particular set of words, i.e. its embedding.
Training minimizes a loss function built from pairwise similarities (cosine similarity or dot products) between a pair of matching vectors and k non-matching ones. As a result, after training, each intent ends up associated with its own vector.
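Schematically (this is only a sketch of the idea, not the exact formula from the StarSpace paper, and the margins are assumptions), the training objective can be written as:

\mathcal{L} = \sum_{i} \Big[ \max\big(0,\ \mu_{+} - \operatorname{sim}(u_i, v_i^{+})\big) + \sum_{j=1}^{k} \max\big(0,\ \mu_{-} + \operatorname{sim}(u_i, v_{ij}^{-})\big) \Big]

where u_i is the embedding of a training sentence, v_i^{+} is the embedding of its correct intent, v_{ij}^{-} are the embeddings of k sampled "wrong" intents, sim is cosine similarity or a dot product, and \mu_{+}, \mu_{-} are margins.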
When user input arrives, the sentence is vectorized in the same way and run through the trained model. The similarity between the resulting vector and every intent vector is then computed, the results are ranked to highlight the most likely intents, and negative values, i.e. completely dissimilar intents, are cut off.
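To make the ranking step concrete, here is a tiny illustration with NumPy; the vectors and intent names are made up, and this is only the idea, not RASA's actual code:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings: one for the user sentence, one per intent.
sentence_vec = np.array([0.9, 0.1, 0.3])
intent_vecs = {
    "request_restaurant": np.array([0.8, 0.2, 0.4]),
    "greet": np.array([-0.5, 0.7, 0.1]),
    "thankyou": np.array([-0.7, -0.6, 0.2]),
}

# Rank intents by similarity and drop the clearly dissimilar (negative) ones.
ranking = sorted(
    ((name, cosine(sentence_vec, vec)) for name, vec in intent_vecs.items()),
    key=lambda item: item[1],
    reverse=True,
)
ranking = [(name, score) for name, score in ranking if score > 0]
print(ranking)  # the most likely intent comes first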
On top of the perks listed above, this approach makes it possible to automatically recognize more than one intent in a sentence. For example, "Yes, I got that. But how do I get home now?" is recognized as "intent_confirm + intent_how_to_drive", which allows for more human-like dialogues with the bot.
By the way, before training you can generate artificial sentences by mixing the existing examples, thereby increasing the amount of training data.
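A naive illustration of what such mixing might look like (this is just a sketch of the idea, not what RASA does internally; the intent names are made up):

import random

examples = {
    "confirm": ["yes", "yeah, right", "exactly"],
    "how_to_drive": ["how do i get home", "how can i drive there"],
}

# Glue random examples of two intents together to get extra training
# sentences for the combined intent "confirm+how_to_drive".
synthetic = [
    f"{random.choice(examples['confirm'])}. {random.choice(examples['how_to_drive'])}"
    for _ in range(5)
]
print(synthetic)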
RASA Entity recognition
The second part of NLU is extracting entities from text. For example, if a user writes "I want to go to a Chinese restaurant with two friends," the bot must pick out not only the intent but also the data that goes with it: it should record in its memory that the cuisine must be Chinese and that there will be three visitors.
For this, an approach based on Conditional Random Fields (CRF) is used. It has already been described on Habr, so I will not repeat it here; those interested can read about the algorithm on the Stanford website.
I will also note that entities can be extracted from text using patterns (regular expressions) and lists of values (for example, city names), and you can additionally hook up Facebook's separate Duckling service, which would also be nice to write about someday.
RASA Stories
The second pillar, on which RASA Core rests, is stories. In essence, stories are examples of real conversations with the bot written down in an intent-response format. A recurrent neural network (LSTM) is trained on these stories to map the previous message history to the desired next action. This means you do not have to define the dialogue graph rigidly or enumerate all possible states and the transitions between them.
Given enough examples, the network adequately predicts the next action even when there is no exactly matching example. Unfortunately, the precise number of stories required is unknown; the only guidance from the developers is "the more, the better."
To train the system on something better than invented dialogues, you can use interactive training. There are two options:
1. Have a number of engineers hold conversations with the bot, correcting wrong intent predictions, wrong entity extraction and mispredicted actions in the stories.
2. Save conversations to a database and then have specially trained engineers review the dialogues where the user failed to solve their problem, i.e. was handed over to a human, or where the bot admitted defeat and could not answer.
The easiest way to understand how stories work is to walk through a simple example, say, booking a table in a restaurant, an example the developers provide in the source code examples. First we define the intents, and then we write a couple of stories.
Intents and their examples:
greet
- Hi
- Hello
- Aloha
- Good morning
...
thankyou
...
request_restaurant
inform
Next, we need to set up the bot's memory, i.e. define the slots in which the information the user provides will be stored. Define the slots:
cuisine:
    type: unfeaturized
    auto_fill: false
num_people:
    type: unfeaturized
    auto_fill: false
And now a few of the examples (a small part) for the intents left out above. The square brackets mark the data for training ner_crf, in the format [entity text](slot name: value to store).
request_restaurant
- im looking for a restaurant
- can i get [swedish](cuisine) food for [six people](num_people:6)
- a restaurant that serves [caribbean](cuisine) food
- id like a restaurant
- im looking for a restaurant that serves [mediterranean](cuisine) food
inform
- [2](num_people) people
- for [three](num_people:3) people
- just [one](num_people:1) person
- how bout [asian oriental](cuisine)
- what about [indian](cuisine) food
- uh how about [turkish](cuisine) type of food
- um [english](cuisine)
Now let's define the story for the main (happy) path:
* greet
  - utter_greet
* request_restaurant
  - restaurant_form
  - form{"name": "restaurant_form"}
  - form{"name": null}
  - action_book_restaurant
* thankyou
  - utter_noworries
That's the whole perfect bot for a perfect world: if the user provides all the required data in the first sentence, the table gets booked. For example, the user writes "i want to book a table in a spanish restaurant for five people"; num_people will then be 5 and cuisine will be spanish, which is enough for the bot to carry on with the reservation.
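The restaurant_form used in the story is a form action, which in RASA is normally implemented in Python on the action-server side. Here is a rough sketch of what it might look like with the rasa_core_sdk of that time; the details are assumptions based on the story above, not the exact code of the formbot example:

from rasa_core_sdk.forms import FormAction

class RestaurantForm(FormAction):
    """Collects what is needed to book a table, asking only for what is missing."""

    def name(self):
        # Must match the action name used in the stories.
        return "restaurant_form"

    @staticmethod
    def required_slots(tracker):
        # The form keeps asking until both slots are filled.
        return ["cuisine", "num_people"]

    def submit(self, dispatcher, tracker, domain):
        # Called once all required slots are filled.
        dispatcher.utter_message("Booking a {} restaurant for {} people.".format(
            tracker.get_slot("cuisine"), tracker.get_slot("num_people")))
        return []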
However, if you look at the examples, you can see that the required data is not always all there, and sometimes is missing entirely. This is where the off-the-happy-path dialogues appear.
Suppose the request contains no data about the cuisine, i.e. a dialogue like this:
Hello
Hi
I want to book a restaurant for five people
...
For this dialogue to complete correctly, you need to define a story of the following form:
* greet
  - utter_greet
* request_restaurant
  - restaurant_form
  - form{"name": "restaurant_form"}
  - slot{"requested_slot": "cuisine"}
  - utter_ask_cuisine
* form: inform{"cuisine": "mexican"}
  - slot{"cuisine": "mexican"}
  - form: restaurant_form
...
And the best part: if you create stories for several cuisines, then, when the bot meets an unfamiliar one, it will still predict the next action on its own, although with less confidence. Likewise, if you add a similar story in which the "cuisine" slot is already filled and "num_people" is not, the bot will not care at all in what order the reservation parameters are provided.
There are two ways to curb any attempts to lead the bot off the right path: either define stories for small talk "about nothing", or respond to every attempt to start such a conversation with a reminder that it is time to get back to business.
Since our company is only at the beginning of an amazing journey into the world of chatbots, there is a good chance there will be more articles about the rakes we stepped on along the way and what came of it. Stay tuned!