The architecture of modern object video analytics systems: growing pains, or flaws entrenched over time?

This year is a race among object detection and recognition systems from a variety of vendors. Hardware developers keep offering new devices for running neural networks: FPGAs, VPUs, multi-core processors with VNNI, and more. In parallel, the number of available topologies keeps growing, along with ready-made pre-trained networks. Incident and accident detection, passenger flow counting, age and gender profiling, emotion recognition, and much more are available to developers today.

And all would be fine were it not for the notorious "time to market" (faster, faster to that market where the money is; if we are not first, we will surely be too late), the result of which is the poorly supported (read: difficult and expensive to support) monstrous all-in-one systems we keep seeing. Yet at the same time there exist architects (the people), virtualization (the approaches), ways to automate processes, and systems for monitoring the state and parameters of one device or many. Under tight deadlines, all of this gets skipped, and the very monsters described above appear. And yes, the "faster to market" goal is often achieved.

The key mistake at the initial stage, however, is forgetting that once today's primary goals are reached, the demands on the speed of further refinement and development will only intensify. The market is growing, and the system is imperfect and needs to evolve, with no room for a step back to rework the proof of concept into an industrial solution. And at that point, the hypothesis test goes straight into production.






What does this lead to, and who suffers?



For the developer, the following consequences can be noted:



  1. Difficulty of support and further development. With no systematic approach and no architecture, the huge amount of copy-paste requires either fixing previously made errors in every copy (which costs time) or letting them steadily accumulate (which is a dead end).
  2. Difficulty expanding the team, and the inability to delegate specific tasks to outsourcers or to other departments within the company.
  3. The larger the system, the harder it is to maintain. Harder means longer, and longer means more expensive. And where there is a bad, expensive solution, sooner or later a good one will appear, built on the correct model from the start and, as a result, far cheaper to develop and support.
  4. Often, the lack of inheritance, repositories, and branches in development makes it technically impossible to create and test alternative hypotheses for similar solutions. For example, the x86 branch of an object video analytics system gets ported to ARM to run in the immediate vicinity of the data source (the camera) and/or directly on the camera. From that camera port branch, further branches may target various vendors, each requiring adaptation (of the interface and other parts, for example). Without structure, each of these branches, which should sit in a clear hierarchy, is created by copying the current state of the project. The result is fragmented development across many projects and parallel solving of the same problems, and solving the same thing in parallel is wasted time and money.
  5. Difficulty localizing products and solutions. Missing localization slows entry into other markets where the solution may be in greater demand than in the current one.


For the owner of the solution (sometimes the same person as the developer, but that is beside the point):



  1. Endlessly growing project costs.
  2. No ability to forecast and budget the project.
  3. An ever-growing risk of the project starting to operate at a loss.
  4. Development processes that grow more complex and more expensive over time; the further refactoring is postponed, the costlier and longer it turns out to be.


And for customers (or for developers building on a vendor's SDK), the result is:



  1. Support gets worse and worse over time.
  2. Changes and bug fixes take longer and longer.
  3. New protocols, devices, neural networks, and other directions of system development are not supported, even though such support is usually assumed from the start and presented as an advantage.
  4. Unstable system operation.
  5. No way to check the status of the systems and/or of the devices they run on, which are an integral part of the overall infrastructure. This usually concerns hybrid inference, where some of the object video analytics runs "at the edge", some on servers, and some directly in the cameras. The end user is left with this "garden", whose performance must be preserved as the client's infrastructure expands, yet exactly the opposite happens. The growth of such systems without built-in mechanisms for verification, debugging, and control degrades their operability over time and complicates monitoring of their components.


And what, in fact, is the problem?



The main problem lies in the approach: everything at once, and as fast as possible. Architecture, design, and prototyping take time. "We are not building a house, just an IT solution, a service, or a module, so why all this?" As soon as you hear those words, there is no turning back.






Consider the example of a system that came to us for support at a certain stage of its development: a complex monolithic system whose main task is to parse an incoming video stream and record events in a database.



To begin, consider how this system was originally built:



  1. Most of the time and effort went into the neural networks at the core: image segmentation, detection of objects by type, and recognition quality/speed. This is the core of the system.
  2. A minimum of time went into the plumbing, i.e., into the architecture as a whole rather than its main module. As a result, everything lived in one environment: receiving the RTSP stream, cutting it into frames, processing frames for further steps, segmentation, detection, recognition, tracker analytics, control of object identity in the frame to prevent duplicate event capture, the DBMS, post-/pre-processing, data import/export, the web interface, the REST API, and more (a minimal sketch of such a monolithic loop follows this list).
  3. The market demanded more and more. Instead of the single network that was originally the foundation, many networks were needed. Events are, after all, universal in principle; from a positioning standpoint, the system is a detector of various events: people, faces, license plates, whatever. But at the heart of the system sits not many networks, but one specific one.
  4. Every extension and refinement of the system meant errors. With many different subsystems sharing one environment, debugging is hard and slow. Lost time is a lost head start and a lost competitive advantage.
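
For illustration only, here is a minimal sketch of what such a monolithic loop tends to look like. OpenCV and SQLite are assumed here purely for the example; the RTSP URL, the table schema, and the detect_objects stub are hypothetical, not the actual system's code:

```python
# A deliberately simplified sketch of the monolithic anti-pattern: one
# process does ingestion, decoding, inference, and persistence.
import sqlite3

import cv2  # OpenCV, assumed here for RTSP ingestion


def detect_objects(frame):
    """Placeholder for the whole segmentation/detection/recognition/
    tracking chain; would yield (timestamp, label) pairs."""
    return []


def run_monolith(rtsp_url: str = "rtsp://camera.local/stream") -> None:
    capture = cv2.VideoCapture(rtsp_url)          # stream ingestion
    db = sqlite3.connect("events.db")             # storage in the same process
    db.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, label TEXT)")
    while True:
        ok, frame = capture.read()                # cutting into frames
        if not ok:
            break
        for ts, label in detect_objects(frame):   # all analytics inline
            db.execute("INSERT INTO events VALUES (?, ?)", (ts, label))
        db.commit()
    # A bug or slowdown in any step stalls the entire loop, and no step
    # can be scaled, restarted, or replaced independently.


if __name__ == "__main__":
    run_monolith()
```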


And what could the architecture look like, to a first approximation?



The first step is container virtualization with Docker: a set of independent, narrow-profile services, each solving its own problem. Each block can contain multiple containers. Conventionally, the blocks can be represented as follows:





[Image: block diagram of the containerized services. Docker as the basis of a correct architecture]
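
To make "narrow-profile services" concrete, here is a hedged sketch of two of the blocks from the diagram: a frame grabber that only publishes frames, and a detector that only consumes them. Redis pub/sub is an assumed broker choice (any message queue would do), and the "frames" channel name and JPEG encoding are illustrative:

```python
# A sketch of two narrow-profile services communicating over a broker.
# Redis pub/sub is an assumed choice; the "frames" channel is illustrative.
import cv2    # OpenCV, assumed for ingestion and JPEG encoding
import redis


def frame_grabber(rtsp_url: str, broker: redis.Redis) -> None:
    """Service 1: only reads the stream and publishes encoded frames."""
    capture = cv2.VideoCapture(rtsp_url)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            broker.publish("frames", jpeg.tobytes())


def detector(broker: redis.Redis) -> None:
    """Service 2: only runs inference on frames it receives."""
    channel = broker.pubsub()
    channel.subscribe("frames")
    for message in channel.listen():
        if message["type"] == "message":
            # decode the JPEG, run the network, publish events onward
            pass

# Each function runs in its own container; either side can be restarted,
# scaled out, or swapped without touching the other.
```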



From the standpoint of detectors and networks, it is logical to use the applicability matrix shown in the diagram. A detector then works only with the networks it is intended for, and one detector can serve one or more networks. For example, we may need to catch cars driving through a red light, or to identify people crossing the road against a red light: the detector is the same, but the detection and recognition networks differ.
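
A minimal sketch of such an applicability matrix, with all detector and network names invented for illustration:

```python
# A sketch of the detector/network applicability matrix. All detector
# and network names below are invented for illustration.
APPLICABILITY = {
    "red_light_vehicle":    ["vehicle_detection", "plate_recognition"],
    "red_light_pedestrian": ["person_detection"],
    "passenger_flow":       ["person_detection", "age_gender_attributes"],
}


def networks_for(detector: str) -> list[str]:
    """Return only the networks this detector is intended to drive."""
    return APPLICABILITY.get(detector, [])


# The same detector can feed several networks, and one network can be
# shared by several detectors, with no pairing hard-coded in either side.
print(networks_for("red_light_vehicle"))
# -> ['vehicle_detection', 'plate_recognition']
```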



This approach yields unification (common blocks serve many networks and detectors), ease of debugging and optimizing each block, as well as scalability and the possibility of developing the system in parallel (delegating tasks to different development groups).



Result



  1. Many containers with different options for processing and storing incoming data, depending on the requirements of a particular deployment.
  2. A unified storage system for different types of events.
  3. A unified bundle of various neural networks and event detectors.
  4. The ability to extend functionality by adding new public networks and detectors to test hypotheses, then training them further to launch production versions.
  5. A built-in system for monitoring the status of devices and of the system as a whole (Zabbix, Chronograf); a sketch of a per-container status endpoint follows this list.
  6. A cross-platform solution that can be deployed on any hardware.
  7. Ease of product development and the ability to delegate work on individual parts to different groups, including outsourced teams.
  8. Ease of debugging problems in operation (clear identification of where the problem is, plus logging inside each container).
  9. The ability to scale the system and run containers both within a single physical machine and across many.
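
One possible shape for the monitoring hook in points 5 and 8 is a small status endpoint inside each container that Zabbix or a similar system can poll. This is a sketch under assumptions: the port, path, and reported fields are illustrative, not a prescribed format:

```python
# A sketch of a per-container status endpoint that an external monitor
# (Zabbix, for instance) could poll. Port, path, and fields are
# illustrative assumptions, not a prescribed format.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/status":
            self.send_error(404)
            return
        body = json.dumps({
            "service": "detector",        # hypothetical service name
            "uptime_seconds": round(time.time() - START_TIME, 1),
            "healthy": True,              # real checks would go here
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Every container exposes the same endpoint, so the monitoring system
    # sees the whole fleet of services uniformly.
    HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()
```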


The main thing during implementation is not to overdo it, and to build the architecture of the project correctly as a whole. Work with professionals, and the result is savings in time, cost, effort, resources, and money. Otherwise, with Docker it turns out like this:



[Image: "Docker drove through Docker, sees Docker in Docker a Docker..."]


