Face recognition at the quick-and-dirty applied level

In general, face recognition and identifying people from its results is like teenage sex: everyone talks about it a lot, but few actually do it. Sure, we are no longer surprised when, after uploading photos from a friendly get-together, Facebook / VK suggests tagging the people found in the picture - but there we intuitively understand that the social networks have a powerful aid in the form of a graph of connections between people. And what if there is no such graph? However, let's start from the beginning.



Some shady character hiding in the entryway with a box



Initially, our face recognition and "friend or foe" identification grew out of purely domestic needs: drug addicts had taken to hanging around in a colleague's stairwell, and he had no desire to constantly monitor the feed from the installed video camera, let alone sort out who was a neighbor and who was a stranger.



Therefore, literally within a week, a prototype was knocked together on the knee from an IP camera, a single-board computer, a motion sensor, and the Python face recognition library face_recognition. Since the Python library on a single-board computer, even a fairly powerful one, was ... let's put it carefully, not very fast, we decided to build the processing around the motion sensor: frames were grabbed and run through recognition only when the sensor fired.
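A rough sketch of what such a motion-gated loop might look like - not our actual code, just a minimal illustration assuming a PIR sensor wired to a GPIO pin and the face_recognition library; the pin number, camera URL, and reference photo are placeholders:

```python
# A rough sketch, not the actual prototype code: a motion-gated recognition
# loop assuming a PIR sensor on a GPIO pin and the face_recognition library.
# The pin number, camera URL, and reference photo are placeholders.
import cv2
import face_recognition
import RPi.GPIO as GPIO

PIR_PIN = 17  # hypothetical GPIO pin the motion sensor hangs on
GPIO.setmode(GPIO.BCM)
GPIO.setup(PIR_PIN, GPIO.IN)

# Precompute the embedding of a known resident from a reference photo
known = face_recognition.face_encodings(
    face_recognition.load_image_file("neighbor.jpg"))[0]

cap = cv2.VideoCapture("rtsp://camera-ip/stream")  # placeholder URL
while True:
    if not GPIO.input(PIR_PIN):
        continue  # no motion - keep the weak CPU idle
    ok, frame = cap.read()
    if not ok:
        continue
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # face_recognition wants RGB
    for encoding in face_recognition.face_encodings(rgb):
        match = face_recognition.compare_faces([known], encoding)[0]
        print("friend" if match else "stranger")
```

Gating on the sensor keeps the single-board computer idle most of the time instead of grinding through an empty stairwell frame by frame.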



The whole thing was glued together with our beloved Erlang, and during testing on colleagues it proved minimally workable.



However, the assembled model never found real-life application - not so much because of its technical imperfections (which it undoubtedly had: as experience shows, things assembled on the knee in greenhouse office conditions have a strong tendency to break in the field, preferably right during the customer demo), but for organizational reasons: the majority of the entryway's residents refused any video surveillance.



The project went on the shelf, to be periodically poked with a stick for sales demos and in occasional urges to reincarnate it for personal household use.



Everything changed when we landed a more specific and quite commercial project on the same topic. Since the problem statement ruled out cutting corners with a motion sensor, we had to dig much deeper into the nuances of finding and recognizing faces directly in the stream, with a team of three heads (okay, two and a half, if you count mine). And then a revelation happened.



The trouble is that most of what you find on this subject consists of purely academic sketches in the spirit of "I had to publish a paper on a fashionable topic and get a checkmark for it." I don't mean to belittle the scientists' merits - among the papers I found there was plenty of useful and interesting material - but, alas, I have to admit that the reproducibility of the code they post on GitHub leaves much to be desired, or looks like a dubious undertaking that ends in wasted time.



The numerous neural network and machine learning frameworks were often hard to get off the ground: face recognition was a narrow side task for them, of little interest against the wide range of problems they solve. In other words, taking a ready-made example and running it on the target hardware, just to see how and whether it works at all, simply didn't happen. Either there was no example, or obtaining one meant a dreary quest of building specific versions of specific libraries for a strictly defined OS. Things you could pick up and fly with on the spot were literally crumbs - like the previously mentioned face_recognition, which we had used for the earlier crafts.



Big companies, as always, saved us. Both Intel and Nvidia have long sensed the growing momentum and commercial appeal of this class of tasks and, being hardware suppliers first and foremost, distribute their frameworks for solving specific applied problems for free.



Our project was experimental rather than research in nature, so we did not analyze and compare individual vendors' solutions; we simply took the first one that would let us assemble a working prototype, test it in battle, and get feedback as quickly as possible. The choice very quickly fell on Intel OpenVINO - a toolkit for the practical application of machine learning in applied tasks.



To get started, we put together a test stand: a traditional nettop kit with an Intel Core i3 processor and an off-the-shelf IP camera from a Chinese vendor. The camera was connected directly to the nettop and fed it an RTSP stream at a modest FPS, on the assumption that people would not be sprinting past it as if at a track meet. The processing time for a single frame (face detection plus recognition) fluctuated between tens and hundreds of milliseconds, which was quite enough to build a mechanism for matching faces against existing reference samples. We also had a backup plan: Intel makes a dedicated coprocessor for accelerating neural network inference, the Neural Compute Stick 2, which we could fall back on if the general-purpose CPU ran out of steam. But so far it hasn't come to that.
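For a feel of the shape of that per-frame loop, here is a minimal sketch, assuming OpenVINO's legacy IECore Python API and the face-detection-retail-0004 model from the Open Model Zoo; the camera URL and confidence threshold are placeholders, not our production values:

```python
# A minimal sketch of the detection loop, assuming OpenVINO's legacy
# IECore Python API and the face-detection-retail-0004 model from the
# Open Model Zoo; the camera URL and threshold are placeholders.
import cv2
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network("face-detection-retail-0004.xml",
                      "face-detection-retail-0004.bin")
# "CPU" for the nettop's Core i3; "MYRIAD" would target the NCS2 instead
exec_net = ie.load_network(net, device_name="CPU")

input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
_, _, h, w = net.input_info[input_name].input_data.shape

cap = cv2.VideoCapture("rtsp://camera-ip/stream")  # placeholder URL
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Resize to the network's input size and reorder HWC -> NCHW
    blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[None, ...]
    detections = exec_net.infer({input_name: blob})[output_name]
    # SSD-style output: [1, 1, N, 7] = image_id, label, conf, x1, y1, x2, y2
    for _, _, conf, x1, y1, x2, y2 in detections[0][0]:
        if conf > 0.5:
            ...  # crop the face and hand it off for identification
```

The nice part of this setup is that moving inference onto the Neural Compute Stick 2 is, roughly, a one-word change of the target device.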



After finishing the assembly and verifying functionality on the basic examples - a distinctive feature of the Intel SDK is its step-by-step and very detailed example guides - we set about building the software.



The main task before us was to detect a face in the camera's field of view, identify it, and promptly notify of its appearance. Accordingly, besides recognizing faces and comparing them against reference samples (how exactly to do that raised, at the time, no fewer questions than everything else), we needed to provide secondary, interface-level things. Namely, the reference frames with the faces of the people of interest had to be captured from the same camera, for subsequent identification. Why from the same camera is, I think, fairly obvious: the camera's mounting point and its lens optics introduce certain distortions which, presumably, can affect recognition quality if the reference data comes from a different source than the surveillance tool itself.



That is, in addition to the stream handler itself, we needed at minimum a video archive and a video file analyzer that would extract all the detected faces from a recording and save the most suitable ones as references.
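As for how matching against references can work, here is a hedged sketch: compare a face embedding from the stream against stored reference embeddings by cosine similarity. The vectors are assumed to come from a face re-identification model (for example, face-reidentification-retail-0095 from the same Open Model Zoo), and the threshold is an illustrative guess, not a tuned value:

```python
# A hedged sketch of reference matching: compare a face embedding from the
# stream against stored references by cosine similarity. The vectors are
# assumed to come from a re-identification model such as
# face-reidentification-retail-0095; 0.6 is an illustrative threshold.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, references, threshold=0.6):
    """Return the best-matching person's name, or None if nobody is close.

    references: dict mapping a person's name to their reference embedding.
    """
    best_name, best_score = None, threshold
    for name, ref in references.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```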



As always, we took the familiar Erlang and PostgreSQL as glue between ffmpeg, the OpenVINO applications, and the Telegram Bot API for notifications. On top of that we needed a web UI providing a minimal set of procedures for managing the complex, which a front-end colleague whipped up in VueJS.
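The notification leg is the simplest of the bunch. The real glue was Erlang, but for uniformity here is a Python sketch of it over plain HTTP, using the Bot API's sendPhoto method; the token and chat ID are placeholders for the values obtained from @BotFather and the target chat:

```python
# A minimal sketch of the notification step over plain HTTP via the
# Telegram Bot API's sendPhoto method; BOT_TOKEN and CHAT_ID are
# placeholders you get from @BotFather and the target chat.
import requests

BOT_TOKEN = "123456:replace-me"  # placeholder
CHAT_ID = "-1001234567890"       # placeholder

def notify(name, face_jpeg):
    """Send a cropped face with a caption to the configured chat."""
    requests.post(
        "https://api.telegram.org/bot{}/sendPhoto".format(BOT_TOKEN),
        data={"chat_id": CHAT_ID, "caption": "Spotted: {}".format(name)},
        files={"photo": ("face.jpg", face_jpeg, "image/jpeg")},
        timeout=10,
    )
```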



The workflow logic was as follows:





It looks something like this:



This is how we can poke around in the archive

View archive



And this is how we poke at the faces that popped up in the video

Video Analysis Results



And here they are, these guys!

List of observed personalities



The resulting solution is in many ways debatable and in places plainly suboptimal, both in performance and in reaction time. But, I repeat, the goal was not to get an efficient system right away, but simply to walk the route, collect the maximum number of bumps along the way, and identify bottlenecks and potential non-obvious problems.



The assembled system was tested in greenhouse office conditions for a week. Over that time, the following observations accumulated:



What lies ahead? Testing the system in battle, optimizing the detection processing cycle, simplifying the search for events in the video archive, adding more data to the analysis (age, gender, emotions), and another zillion small and not-so-small tasks still to be done. But the first step of the journey of a thousand miles has already been taken. If anyone can share experience with solving such problems, or interesting links on the subject, I would be very grateful.


