Dmitry Muromtsev, Head of the International Laboratory of Intellectual Methods of Information Processing and Semantic Technologies and Head of the IPM Department at ITMO, spoke about the essence of ontological modeling, the use of knowledge graphs in business processes, and the work on creating conversational intelligence.
Interviewer: Anna Angelova (A.A.)
Respondent: Dmitry Muromtsev (D.M.)
A.A.: What is the essence of ontological modeling, and how are knowledge graphs compiled?
D.M.: Ontological modeling is the construction of information models in the form of conceptual descriptions of subject areas that conform to certain standards. There are special languages for ontologies; they are standardized and already used in industry. The main purpose of ontologies is to describe the data and knowledge schemas that may exist in a wide variety of sources. The problem is that there are many such sources, and they differ greatly in the type of data storage, in software architecture, and so on. To link them into a single information space, we need special integration mechanisms, and these mechanisms are ontologies. They are used for integrating databases, for describing poorly structured data on the Internet, and for creating knowledge bases, whether on a specific topic or non-thematic, large knowledge bases built, for example, from Wikipedia information.
The creation process itself involves subject matter experts: experts are always brought in on the topics for which data will be represented in the knowledge graph. These might be, for example, issues related to cultural heritage, medicine, education, or some kind of manufacturing.
These experts identify the key concepts, the objects critical for the subject area. For cultural heritage, for example, these are works of art, the creators of those works, the process of creation, processes of restoration or modification (if it is an architectural object, it may have been rebuilt), and questions of display, storage, and so on. The expert formulates everything that matters for a full description of the subject area. Then the relations and connections between these objects are specified. This formalized description makes it possible to subsequently run queries against the knowledge graph.
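To make this concrete, here is a minimal sketch, not from the interview, of how such a formalized description might look in code, using the Python rdflib library. The vocabulary (Artwork, Creator, createdBy) is invented for illustration; real cultural-heritage projects would typically build on standard ontologies such as CIDOC CRM.

    # A minimal sketch, assuming a hypothetical cultural-heritage vocabulary.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS

    EX = Namespace("http://example.org/heritage/")  # hypothetical namespace

    g = Graph()
    g.bind("ex", EX)

    # Schema: the key concepts a domain expert might identify.
    g.add((EX.Artwork, RDF.type, RDFS.Class))
    g.add((EX.Creator, RDF.type, RDFS.Class))
    g.add((EX.createdBy, RDFS.domain, EX.Artwork))
    g.add((EX.createdBy, RDFS.range, EX.Creator))

    # Data: concrete objects described with those concepts.
    g.add((EX.MonaLisa, RDF.type, EX.Artwork))
    g.add((EX.MonaLisa, RDFS.label, Literal("Mona Lisa")))
    g.add((EX.daVinci, RDF.type, EX.Creator))
    g.add((EX.MonaLisa, EX.createdBy, EX.daVinci))

    # The formalized description can now be queried with SPARQL.
    query = """
        PREFIX ex: <http://example.org/heritage/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {
            ?work a ex:Artwork ;
                  rdfs:label ?label ;
                  ex:createdBy ex:daVinci .
        }
    """
    for row in g.query(query):
        print(row.label)  # -> "Mona Lisa"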
Technologically, the transformation procedure can be quite complex and involve many tools: natural language processing, machine learning, pattern recognition, and a number of others. Ultimately, we get a network, a graph of interconnected objects. The key feature of such a system, unlike a database, is that the network is self-descriptive and self-documenting: it needs no additional explanation from the developer.
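A short sketch of what "self-descriptive" means in practice: the same graph that stores the data can be asked about its own schema. The file name here is hypothetical, standing for an RDFS-described graph like the one above.

    from rdflib import Graph, RDF, RDFS

    g = Graph().parse("heritage.ttl", format="turtle")  # hypothetical file

    # List every class declared in the graph, with its label if one exists;
    # no external documentation from the developer is required.
    for cls in g.subjects(RDF.type, RDFS.Class):
        print(cls, g.value(cls, RDFS.label))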
A.A.: What is the scope of application of knowledge graphs?
D.M.: Virtually any. There are now universal knowledge graphs (the most famous is Google's), and there are Wikidata and DBpedia, which are closer to Wikipedia in breadth of coverage. There are specialized knowledge graphs: in medicine, in cultural heritage, in open government data. And there are corporate knowledge graphs, which are not publicly accessible.
A.A.: Tell us about the project for DataFabric. What did they need, and what results did they achieve?
D.M.: Let me put the question somewhat more broadly. The project for DataFabric is just one example; we have had several. We started our activities about 8 years ago. Much of our time goes into popularizing semantic technologies and holding various scientific and educational events, hackathons, and so on. We regularly meet with industry representatives. Dozens of such meetings take place annually, and some of those representatives take an interest.
In the case of DataFabric, it was mainly their own specialists who did the work; we advised them on methodology and recommended certain technologies and tools. We also reviewed their results, analyzing how correctly everything had been done. The company's project is interesting because it is the first example in Russia of a business investing its own funds in the development of knowledge graphs and related data technologies and managing to prove that this can be profitable. As far as I know, the company continues to use the knowledge graph it created and plans to develop it. From the talks given by its representatives, one can conclude that the knowledge graph allowed them to automate a large amount of manual labor. But for more precise information, it is better to contact the company directly.
Sergey Isaev, CEO of DataFabric:
We wanted to build a smart counterparty verification system and collect information about companies. We were a very small company and wanted a competitive advantage. Our competitors, SPARK-Interfax and Kontur.Focus, are very large and powerful and have been on the market for many years; it is impossible to compete with them head-on.
We collect the same company information as our competitors: data from the Federal Tax Service, Rosstat, and other sources. We load it into a single database. Since it is a graph database, it holds the connections between all the objects in it. The system uses ontological modeling: we describe for it the meaning of absolutely all the data it works with. As a result, it begins to understand the context, the semantic load, of that data. Thanks to this, you can even ask it open-ended questions, for example: “Show me all the companies that are likely to go bankrupt next year.” Since it understands the meaning of each word in the question, it will return a list.
I don't know how much time, money, and energy our competitors spend on solving these problems. But I know they have hundreds of developers, while there are only 12 of us, and we built our system in a year and a half. It now lets us quickly prototype new cases and new services, because it is smarter and more flexible.
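Purely as an illustration of the kind of open-ended query Isaev describes, here is a minimal rdflib sketch. The vocabulary (co:Company, co:bankruptcyRisk) and the precomputed risk score are invented for the example; DataFabric's actual model and scoring logic are not public.

    from rdflib import Graph, Namespace, Literal, RDF
    from rdflib.namespace import XSD

    CO = Namespace("http://example.org/companies/")  # hypothetical namespace

    g = Graph()
    g.add((CO.AcmeLLC, RDF.type, CO.Company))
    g.add((CO.AcmeLLC, CO.bankruptcyRisk, Literal(0.87, datatype=XSD.double)))

    # "Show me all the companies that are likely to go bankrupt next year",
    # reduced to a structured query over the semantic model.
    query = """
        PREFIX co: <http://example.org/companies/>
        SELECT ?company ?risk WHERE {
            ?company a co:Company ;
                     co:bankruptcyRisk ?risk .
            FILTER (?risk > 0.8)
        }
    """
    for row in g.query(query):
        print(row.company, row.risk)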
A.A.: The website of the laboratory you head lists many partners. Which of them are involved in current projects?
D.M.: If we take cooperation in the broadest sense of the word, then by the amount of time allocated, the main partner is the Council for Open Data of the Russian Federation. There we carry out methodological and research work aimed at promoting knowledge graphs to federal authorities and other bodies that are required to publish open data. At present, the legal requirement to publish open data is met rather formally and in a limited way. We are trying to prove that it can be done much more effectively, and that this would bring far greater benefits to the economy. We also cooperate actively on e-learning technologies with various organizations, and there are research projects with several universities in Germany, Finland, and Austria.
A.A.: Which companies in the industry are worth watching?
D.M.: Watch the community as a whole. Of course, large companies are in some sense an indicator of how mature particular technologies are. But at the latest ISWC conference in Austria, the world's largest conference on semantic technologies, a talk from Google raised many questions: the problems they had set for themselves had often already been solved by companies working closer to research.
Characteristically, large players as a rule do not do research from scratch. They pose a certain problem, then find a team that can solve it, and either start cooperating with that team or buy it if it is a startup. In other words, the major players play more of a systemic role.
If you follow the community as a whole, you can see many different interesting research groups, companies, and startups offering the most innovative solutions. For example, there is now a very serious trend in the development of chatbots, voice interfaces, and other systems that over time will become genuinely full-fledged assistants.
A.A.: Your laboratory is also developing speech processing projects. Two of them are listed on the site: one completed, the other ongoing. Tell us about them.
[cf.: “Development of computer morphology for case studies of variable text”, 2015–2016; “Development of a parser for Russian spontaneous speech using data mining methods and semantic knowledge bases”, 2015–2018]
D.M.: The first project, the creation of an intelligent dialogue manager, was initiated by the Speech Technology Center. The solutions that exist now are rather primitive. You encounter them when a customer calls an organization or a bank and has to be switched from one line to another for a long time. More advanced systems, such as Siri or Amazon Alexa, can analyze the text obtained from speech recognition. But the content of that text remains unknown to the machine. In Russia, by the way, the iPavlov project was launched recently, but so far there is little information about its results.
Next, once we have recognized the speech signal, we need to understand what question it contains. The problem is that when people communicate, the speech channel is only one of many, and not the most information-rich. There are channels of non-verbal communication, there is general knowledge about the world, there is context that a person understands, and so on. Without this additional information, it is almost impossible to understand what is being said. If we take transcripts of conversations and give them to someone with the context completely removed, most likely even a human will not be able to understand them. So we are now trying to create analyzers that will efficiently process speech and identify objects and the relations between them, that is, create information models of the message contained in the text. The next planned step is to enrich these models with information from other sources.
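As a rough illustration of this kind of analyzer (the laboratory's own parser for Russian spontaneous speech is not public), here is a sketch using the off-the-shelf spaCy library: named entities stand in for objects, and dependency arcs crudely approximate the relations between them.

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The customer called the bank in Moscow about a blocked card.")

    # Objects: named entities found in the utterance.
    for ent in doc.ents:
        print(ent.text, ent.label_)

    # Relations: crudely approximated by subject/object dependency arcs.
    for token in doc:
        if token.dep_ in ("nsubj", "dobj"):
            print(token.head.lemma_, "->", token.dep_, "->", token.text)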
A.A.: Can you be more specific? How do the research directions of the completed project differ from those of the one underway now?
D.M.: They are interrelated directions. High-quality analysis is impossible without such studies, because the algorithm has to be taught to recognize patterns in text. We did that in the first project. The second studies how objects are formed. A text contains descriptions of certain concepts, and those concepts themselves may carry more information than the text explicitly states about them. Accordingly, one has to turn to other databases and knowledge graphs and try to supplement this information from other sources.
Suppose a customer calls a help desk and describes some kind of problem. They may name the device or describe the process of using the system incorrectly; a user is not required to have complete technical knowledge. A system that understands the context can supplement the user's information with data from its own sources. This greatly simplifies identifying the problem.
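A minimal sketch of such enrichment from a public knowledge graph, using the SPARQLWrapper library against the Wikidata endpoint. It assumes the extracted concept has already been linked to a Wikidata identifier; Q312 (Apple Inc.) is used purely as an example.

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
    endpoint.setQuery("""
        SELECT ?propLabel ?valueLabel WHERE {
            wd:Q312 ?p ?value .
            ?prop wikibase:directClaim ?p .
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    # Print a sample of facts that could supplement the user's description.
    for b in endpoint.query().convert()["results"]["bindings"]:
        print(b["propLabel"]["value"], "=", b["valueLabel"]["value"])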
The first project was small and carried out in collaboration with the Speech Technology Center. In it we proved that the combined use of an ontology, a speech recognition system, and a text parser can produce what is called conversational intelligence, and we successfully demonstrated how it works. The next stage is deeper research in each of these areas. In ontological modeling we moved from speech in general to information from the Internet in the field of cultural heritage: how to model it, how to enrich it, and how to run structured search over it. Work on parsing continues, and we have achieved quite good results in the quality of text processing.
The next stage is to combine these areas and create a system for enriching data from various sources, including non-textual modalities.
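One way to picture the planned combination is as a pipeline. Every function below is a hypothetical placeholder sketching the architecture, not the laboratory's actual implementation.

    def recognize_speech(audio: bytes) -> str:
        """Speech recognition: audio signal -> raw text transcript."""
        ...

    def parse(text: str) -> list:
        """Parsing: text -> (subject, relation, object) triples."""
        ...

    def enrich(triples: list) -> list:
        """Enrichment: supplement triples from external knowledge graphs."""
        ...

    def respond(model: list) -> str:
        """Dialogue management: query the model and formulate a reply."""
        ...

    def handle_utterance(audio: bytes) -> str:
        return respond(enrich(parse(recognize_speech(audio))))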
A.A.: One last question: what does the laboratory plan to work on next year?
D.M.: Two directions have crystallized for us: the Internet of Things and conversational intelligence. The second will become dominant, with the Internet of Things as a supporting area: creating voice and text interfaces (chatbots) for interacting with various devices, robots, and information systems.
All this will make human interaction with information objects more transparent and natural.