Active Restore: can disaster recovery be faster? Much faster?

Backing up important data is good. But what if work has to resume immediately and every minute counts? We at Acronis decided to find out how quickly a system can be brought back up. This is the first post in the Active Restore series, in which I describe how we started the project with Innopolis University, what solution we found, and what we are working on today.

Hello! My name is Daulet Tumbaev, and today I want to share my experience of developing a system that speeds up disaster recovery. To cover the whole development path of the project, let me start from a little further back. I currently work at Acronis, but I am also a graduate of Innopolis University, where I completed the Master's program in Software Development Management (known as MSIT-SE). Innopolis is a young university, and its curriculum is even younger. However, it is built on the curriculum of Carnegie Mellon University, which includes such a component as industrial projects.

The purpose of an industrial project is to immerse the student in real development and to consolidate in practice the knowledge gained. To do this, the university cooperates with companies such as Yandex, Acronis, MTS and dozens of others (in total, the university had 144 partners as of 2018). As part of this cooperation, companies propose project topics to the university, and students choose the project that best matches their interests and level of training. Just two years ago I was still “on the other side of the barricades,” working as a student on another Acronis project. This time I became a technical consultant for the students on the company's side and proposed the Active Restore project to Innopolis. The idea of Active Restore was formulated by the Kernel team at Acronis, but development of the solution began at Innopolis University.

Active Restore - why is this needed?

Traditionally, disaster recovery follows a standard scheme. After something goes wrong with the computer, you open the web interface of a backup system, for example Acronis True Image, and click the big “restore” button. Then you wait N minutes, and only after that can you continue working.

The problem is that this number N, also known as the RTO (recovery time objective, that is, the acceptable recovery time), can be quite large. It depends on the connection speed (when restoring from the cloud), on the size of the machine's hard drive, and on a number of other factors. Can it be reduced? Yes, because you do not always need the full contents of the disk in order to resume work. Photos and videos, for example, do not affect the functionality of the device in any way and can be restored later in the background.
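To get a feel for the scale of the gain, here is a rough back-of-the-envelope calculation. It is only a sketch: the disk size, the amount of data the OS needs in order to start, and the link speed are hypothetical numbers chosen for illustration, not measurements from the project.

```python
def restore_seconds(size_gb: float, link_mbit_s: float) -> float:
    """Time to transfer size_gb gigabytes over a link of link_mbit_s megabits per second."""
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return size_megabits / link_mbit_s


# Hypothetical numbers, for illustration only.
full_disk_gb = 500.0      # the whole disk image
boot_critical_gb = 20.0   # data needed before the user can start working
link_mbit_s = 100.0       # download speed from the cloud

full_restore = restore_seconds(full_disk_gb, link_mbit_s)
early_start = restore_seconds(boot_critical_gb, link_mbit_s)

print(f"full restore:  {full_restore / 3600:.1f} h")   # 11.1 h
print(f"early start:   {early_start / 60:.1f} min")    # 26.7 min
```

Under these assumptions the user waits minutes rather than hours, and the remaining data arrives in the background.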

Driver needed ...

The operating system expects to start from a complete disk, so Windows runs a series of disk integrity checks. The system will not start normally if files that the OS expects to find are missing or damaged. To solve this problem, we decided to place so-called redirector files on the disk: files that we create to stand in for missing or damaged files but that are in fact dummies. Creating such redirectors takes little time, because they have no content.
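The idea of placing dummies can be illustrated with a small user-space sketch. The real redirector format is not described in this post, so the assumptions here are mine: a redirector is modeled as an empty file, and the list of expected paths comes from a hypothetical manifest taken from the backup. The function name `create_placeholders` is also hypothetical.

```python
import os
import tempfile


def create_placeholders(root: str, manifest: list[str]) -> list[str]:
    """Create an empty file for every manifest entry that is missing under root.

    Returns the relative paths for which a placeholder was created.
    """
    created = []
    for rel_path in manifest:
        target = os.path.join(root, rel_path)
        if os.path.exists(target):
            continue  # a real file is already there, leave it alone
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "wb").close()  # no content is copied, so this is fast
        created.append(rel_path)
    return created


# Usage with a temporary directory standing in for the disk.
with tempfile.TemporaryDirectory() as disk:
    print(create_placeholders(disk, ["system/example.dll", "photos/cat.jpg"]))
```

Because nothing is copied, the cost grows with the number of files rather than with their size.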

Further recovery proceeds as follows. A background process fills the "dummies" with data in parallel with the operation of the operating system. It takes the disk load into account and does not exceed a set limit. However, the user or the operating system itself may suddenly request a file that has not been restored yet. This is where the second recovery mode comes in: the priority of the requested file is raised to the maximum, and the recovery process restores it to disk immediately. The operating system receives the file it asked for, albeit with a slight delay.
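The two modes described above, background filling plus an urgent jump to the front, can be modeled with a priority queue. This is a minimal sketch and not the actual service code: the class `RestoreQueue`, its methods, and the priority values are hypothetical, and disk-load throttling is left out.

```python
import heapq
import itertools

URGENT = 0       # a file the OS or the user is waiting for
BACKGROUND = 10  # default priority for everything else


class RestoreQueue:
    def __init__(self):
        self._heap = []                # entries: (priority, sequence, path)
        self._priority = {}            # path -> current priority of a pending file
        self._seq = itertools.count()  # tie-breaker: FIFO order within one priority

    def add(self, path, priority=BACKGROUND):
        self._priority[path] = priority
        heapq.heappush(self._heap, (priority, next(self._seq), path))

    def boost(self, path):
        """Move a pending file to the front; do nothing if it is not pending."""
        if path in self._priority and self._priority[path] != URGENT:
            self.add(path, URGENT)  # the old heap entry goes stale and is skipped later

    def pop(self):
        """Return the next path to restore, or None when nothing is pending."""
        while self._heap:
            priority, _, path = heapq.heappop(self._heap)
            if self._priority.get(path) == priority:  # ignore stale entries
                del self._priority[path]
                return path
        return None


queue = RestoreQueue()
for name in ["video.mp4", "photo.jpg", "report.docx"]:
    queue.add(name)
queue.boost("report.docx")  # the user just tried to open this file
print(queue.pop())          # report.docx
```

Re-inserting the file with a higher priority and skipping the stale entry on pop avoids rebuilding the heap on every boost.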

It looks like a perfect picture. In the real world, however, there are a huge number of pitfalls and potential deadlocks. Together with the Innopolis students, we decided to investigate this recovery scenario, measure the gain in RTO, and find out whether the approach is feasible at all. After all, no such solutions existed on the market at the time.

While I decided to hand the service component over to the students from Innopolis, work on a file system mini-filter driver began inside Acronis, carried out by the Windows Kernel team. The plan was as follows:

Start the driver at an early stage of OS startup.
Later, once user space is fully ready, load the service.
The service processes the driver's requests and coordinates its further work.

The subtleties of driver engineering

My colleagues will cover the service in another post; in this one we look at the intricacies of driver development. The mini-filter driver that has been developed has two operating modes: one for when the system has started normally, and one for when the system has just suffered a failure and is being recovered. Until the user libraries and applications, and therefore our service, have been loaded, the driver behaves identically in both cases, because it does not know which state the system is in. As a result, every create, read and write is logged, and all metadata is recorded. When the service comes online, the driver hands this information over to it.

On a normal start, the service sends the driver a “Relax” signal so that it “relaxes” and stops scrupulously logging everything. The driver then switches to logging only changes on the disk and reports them to the service, which, with the help of other Acronis tools, keeps the disk backup as up to date as possible on the media the user has chosen. This can be a cloud, remote, gradual or nightly backup.
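The driver's behavior before and after the service comes online can be pictured as a small state machine. The sketch below is a user-space model, not driver code: the names `Mode`, `DriverModel`, `on_io` and `on_service_signal` are hypothetical, and "changes on the disk" is simplified to write operations.

```python
from enum import Enum, auto


class Mode(Enum):
    EARLY_BOOT = auto()  # service not running yet: log everything
    NORMAL = auto()      # after "Relax": log only changes
    RECOVERY = auto()    # after "Recovery": intercept opens instead of logging

SIGNALS = {"Relax": Mode.NORMAL, "Recovery": Mode.RECOVERY}


class DriverModel:
    def __init__(self):
        self.mode = Mode.EARLY_BOOT
        self.log = []

    def on_io(self, op, path):
        """Record an operation if the current mode calls for it."""
        if self.mode is Mode.EARLY_BOOT:
            self.log.append((op, path))
        elif self.mode is Mode.NORMAL and op == "write":
            self.log.append((op, path))

    def on_service_signal(self, signal):
        """Switch mode and hand the log collected so far over to the service."""
        self.mode = SIGNALS[signal]  # raises KeyError for an unknown signal
        collected, self.log = self.log, []
        return collected


driver = DriverModel()
driver.on_io("create", "kernel.dll")
driver.on_io("read", "kernel.dll")
print(driver.on_service_signal("Relax"))  # both early operations are handed over
driver.on_io("read", "notes.txt")         # ignored in NORMAL mode
driver.on_io("write", "notes.txt")        # logged as a change
print(driver.log)
```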

If recovery mode is activated, the service tells the driver to work in “Recovery” mode. The system has just come back up after a crash, and as soon as it issues a request to open a file on disk, the mini-filter must intercept the operation, issue the request itself, and check whether the file exists on disk and can be opened.

If the file is missing, the mini-filter passes this information to the service, which raises the priority of that file's recovery (recovery is running in the background the whole time), so the file jumps to the front of the queue. The service then restores the file, either itself or by means of other Acronis tools, and tells the driver that everything is in order. The operating system can now access the file, and the driver “releases” the original request from the system to the disk.

If recovery is not possible, the service tells the driver that the file is not in the backup either. Our mini-filter driver then simply passes the system's request on, and the original requestor (the OS itself or an application) receives a “file not found” error. This is perfectly normal if the file really was neither on the disk nor in the backup.
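The whole recovery-mode path for an open request, including the "file not found" outcome, can be sketched in user space as follows. This is a simplified model under my own assumptions: `FakeService` and `intercepted_open` are hypothetical names, the backup is an in-memory dictionary, and a file that is not restored yet is modeled as simply missing from the disk.

```python
import os
import tempfile


class FakeService:
    """Hypothetical stand-in for the user-space service and its backup."""

    def __init__(self, backup):
        self.backup = backup  # relative path -> file contents in the backup

    def restore_now(self, root, path):
        """Restore one file immediately; return False if the backup lacks it."""
        if path not in self.backup:
            return False
        target = os.path.join(root, path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(self.backup[path])
        return True


def intercepted_open(root, path, service):
    """Check the disk, restore on demand, then release the original request."""
    target = os.path.join(root, path)
    if not os.path.exists(target):
        service.restore_now(root, path)  # the request waits until the service answers
    # The original request goes through either way. If the file is neither on
    # disk nor in the backup, the caller gets the usual "file not found" error.
    return open(target, "rb")


with tempfile.TemporaryDirectory() as disk:
    service = FakeService({"lib/example.dll": b"library bytes"})
    with intercepted_open(disk, "lib/example.dll", service) as f:
        print(f.read())
    try:
        intercepted_open(disk, "never_existed.txt", service)
    except FileNotFoundError:
        print("file not found, as expected")
```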

Of course, the operating system will run much more slowly, because reading any file or library takes several steps and may involve access to remote resources. On the other hand, the user can get back to work as soon as possible while recovery is still in progress.

Need lower, even lower ...

The prototype has proven its worth. But we also saw the need to go further, because deadlocks still occur in some cases. For example, the operating system may request various libraries from several threads at once, which can leave our service waiting on itself.

The problem I am working on now is making Active Restore faster and the system more secure. Suppose the system needs only part of a file rather than the whole file. For this case another driver was developed: a disk filter driver. It works at the block level rather than the file level. The principle of operation is similar: in normal operation the driver simply logs the blocks that change on the disk, and in recovery mode it tries to read the block itself and, if that fails, asks the service to raise the block's priority. All other parts of the system remain the same. For example, the OS-level service does not even suspect that it is now talking to a different driver, because the main task is still to give the OS exactly the data it needs to function. This direction needs significant further work, if only because the service does not yet know how to think at the block level.
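The block-level variant can be modeled the same way as the file-level one. The sketch below is only an illustration under my own assumptions: `BlockFilterModel` is a hypothetical name, the disk and the backup are plain Python lists, an unrestored block is represented by `None`, and the service's answer to a priority request is modeled as an immediate copy from the backup.

```python
class BlockFilterModel:
    def __init__(self, disk, backup, recovery):
        self.disk = disk              # list of blocks; None means "not restored yet"
        self.backup = backup          # list of blocks as stored in the backup
        self.recovery = recovery      # True while the disk is being restored
        self.changed = set()          # block indices written in normal mode
        self.priority_requests = []   # block indices the service was asked to hurry

    def write(self, index, data):
        self.disk[index] = data
        if not self.recovery:
            self.changed.add(index)   # normal mode: log the changed block

    def read(self, index):
        if self.recovery and self.disk[index] is None:
            # The block is not on disk yet: ask the service to raise its priority.
            self.priority_requests.append(index)
            self.disk[index] = self.backup[index]  # modeled as an immediate restore
        return self.disk[index]


backup = [b"boot", b"half of a big file", b"other half"]
flt = BlockFilterModel(disk=[None, None, None], backup=backup, recovery=True)
print(flt.read(1))            # only the requested block is restored
print(flt.disk)               # blocks 0 and 2 are still missing
print(flt.priority_requests)
```

The point of the model is the granularity: a read of one block pulls in one block, not the whole file it belongs to.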

As the next step, I decided to run the driver deeper and earlier, going down to the level of UEFI drivers and using Windows Native applications instead of the service. For this, a UEFI boot driver (or DXE driver) was developed, which starts and exits before the OS starts. But the background of UEFI drivers, the details of building and installing them, and the specifics of Windows Native applications will be covered in the next post. So subscribe to our blog, and in the meantime I will prepare the story about the next stage of the work. I would be glad to hear your comments and suggestions.
