Active Restore: Can Disaster Recovery Be Faster? Much faster?

Backing up important data is good practice. But what if work has to resume immediately and every minute counts? At Acronis we decided to find out how fast a system can be brought back up after a failure. This is the first post in the Active Restore series, in which I describe how we started the project together with Innopolis University, what solution we found, and what we are working on today.

Hello! My name is Daulet Tumbaev, and today I would like to share my experience of developing a system that speeds up disaster recovery. To tell the whole story of the project, let me start with some background. I currently work for Acronis, but I am also a graduate of Innopolis University, where I completed my MSc in Software Development Management (known as MSIT-SE). Innopolis is a young university, and its curriculum is even younger. It is, however, built on the curriculum of Carnegie Mellon University, which includes a component called industrial projects.

The purpose of an industrial project is to immerse the student in real development and reinforce the acquired knowledge in practice. To do this, the university cooperates with companies such as Yandex, Acronis, MTS and dozens of others (in 2018 the university had 144 partners in total). As part of this cooperation, companies offer the university their areas of work, and students choose the project that best matches their interests and level of training. Just two years ago I was still "on the other side of the barricades", working as a student on another Acronis project. This time I acted as the company's technical consultant for the students and proposed the Active Restore project to Innopolis. The idea of Active Restore itself was formulated by the Kernel team at Acronis, but development of the solution began together with Innopolis University.

Active Restore - why is it needed?

Traditionally, disaster recovery follows a standard routine. After something goes wrong with your computer, you open the web interface of a backup system, such as Acronis True Image, and click the big "restore" button. Then you wait N minutes, and only after that can you continue working.

The problem is that this number N, which has to fit within the RTO (recovery time objective, the allowable recovery time), can be quite large. It depends on the connection speed (if you are recovering from the cloud), on the size of your machine's hard drive, and on a number of other factors. Can it be reduced? Yes, because you do not always need the full contents of the disk in order to resume work. Photos and videos, for example, do not affect the functionality of the device in any way and can be pulled in later in the background.
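To get a feel for the scale, here is a back-of-the-envelope estimate. The disk size, link speed, and the size of the "needed to start working" set are made-up numbers for illustration, not measurements from the project, and the calculation ignores everything except raw transfer time.

```python
def transfer_minutes(size_gb, link_mbit_s):
    """Minutes needed to move size_gb gigabytes over a link of link_mbit_s megabit/s."""
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return size_megabits / link_mbit_s / 60


# Hypothetical machine: 500 GB disk restored from the cloud over 100 Mbit/s.
full_disk = transfer_minutes(500, 100)

# Hypothetical subset the OS and applications need before work can resume.
boot_set = transfer_minutes(20, 100)

print(f"full restore: about {full_disk:.0f} min")
print(f"minimal set:  about {boot_set:.0f} min")
```

Under these assumptions the full restore takes roughly eleven hours, while the minimal set arrives in under half an hour. The remaining data still has to be transferred, but it no longer blocks the user.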

Driver needed...

The operating system expects to start up with a fully prepared disk, so Windows runs a series of disk integrity checks. The system will not start normally if files that the OS expects to find are missing or corrupted. To solve this problem, we decided to place so-called redirector files on the disk: files created by us that stand in for the missing or damaged ones but are in fact dummies. Creating such redirectors does not take long, because they have no actual content.
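The idea of dummies can be modeled in user space in a few lines. This is only a sketch under my own assumptions: the manifest format, the function name `create_placeholders`, and the use of plain zero-length files are all hypothetical, and the real redirector files on an NTFS volume are not ordinary empty files created this way.

```python
import os
import tempfile


def create_placeholders(root, manifest):
    """Create an empty stand-in file for every (relative_path, size) entry.

    Returns a dict of path -> expected size for files still waiting for data.
    """
    pending = {}
    for rel_path, size in manifest:
        full_path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb"):
            pass  # no content is written, so creation is nearly instant
        pending[rel_path] = size
    return pending


# Hypothetical manifest taken from a backup catalogue.
manifest = [
    ("Windows/System32/kernel32.dll", 700_000),
    ("Users/me/photo.jpg", 4_000_000),
]

with tempfile.TemporaryDirectory() as root:
    pending = create_placeholders(root, manifest)
    sizes_on_disk = [os.path.getsize(os.path.join(root, p)) for p, _ in manifest]
```

Every expected path now exists, yet nothing has been downloaded: `sizes_on_disk` contains only zeros, and `pending` remembers how much data each dummy still needs.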

Recovery then proceeds as follows. A background process fills the dummies with data while the operating system is running. This background process takes the load on the disk into account and does not exceed a set limit. However, the user or the operating system itself may suddenly need a file that has not been restored yet. This is where the second recovery mode comes into play: the priority of the requested file is raised to the maximum, and the recovery process loads it to disk urgently. The operating system gets the file it asked for, albeit with a slight delay.
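The two modes described above fit naturally into a priority queue. The sketch below is my own minimal model, not the project's code: the class `RestoreQueue` and its method names are hypothetical, and the disk-load limit is left out.

```python
import heapq
import itertools


class RestoreQueue:
    """Files waiting to be restored; requested files jump to the front."""

    URGENT = 0
    BACKGROUND = 1

    def __init__(self, paths):
        self._counter = itertools.count()  # keeps FIFO order within one priority
        self._heap = []
        self._priority = {}                # path -> current priority, pending only
        for path in paths:
            self._push(path, self.BACKGROUND)

    def _push(self, path, priority):
        self._priority[path] = priority
        heapq.heappush(self._heap, (priority, next(self._counter), path))

    def bump(self, path):
        """Raise a pending file to maximum priority; no-op if already restored."""
        if self._priority.get(path) == self.BACKGROUND:
            self._push(path, self.URGENT)

    def pop(self):
        """Return the next file to restore, or None when nothing is pending."""
        while self._heap:
            priority, _, path = heapq.heappop(self._heap)
            if self._priority.get(path) == priority:  # skip stale heap entries
                del self._priority[path]
                return path
        return None


queue = RestoreQueue(["a.jpg", "b.mp4", "ntdll.dll"])
queue.bump("ntdll.dll")  # the OS asked for this file right now
order = [queue.pop() for _ in range(4)]
```

Here `order` ends up as `["ntdll.dll", "a.jpg", "b.mp4", None]`: the requested library overtakes the media files, and the old background entry for it is silently discarded.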

This is what the ideal picture looks like. In the real world, however, there are plenty of pitfalls and potential deadlocks. Together with the students from Innopolis, we decided to investigate this recovery scenario, measure the gain in RTO, and find out whether such an approach is feasible at all. After all, there were simply no such solutions on the market at that time.

While I handed the service component over to the students from Innopolis, work on a file system mini-filter driver began inside Acronis. This part was done by the Windows Kernel team. The plan was as follows:

  • Start the driver at an early stage of OS startup.
  • Later, once user space is fully ready, load the service.
  • The service processes the driver's requests and coordinates its further work.

Subtleties of driver building

My colleagues will cover the service in a separate post; this one goes into the details of driver development. The mini-filter driver we built has two modes of operation: one for a normal system start, and one for a system that has just failed and is being restored. Until user libraries and applications are loaded, and with them our service, the driver behaves the same way in both cases, because it does not know which state the system is in. As a result, every create, read, and write is logged, and all metadata is captured. Once the service is online, the driver hands this information over to it.

After a normal start, the service sends the driver the "Relax" signal so that it stops scrupulously logging everything. The driver then switches to logging only changes on the disk and reports them to the service, which uses other Acronis tools to keep the disk backup as up to date as possible on the media chosen by the user. These can be cloud, remote, incremental, or nightly backups.
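The driver's behavior before and after the service comes online can be summarized as a small state machine. This is a user-space model with hypothetical names (`Mode`, `FilterModel`, `attach_service`); the real mini-filter is a Windows kernel driver. Logging while a restore is in progress is not described in this post, so the sketch leaves it out.

```python
from enum import Enum


class Mode(Enum):
    BOOT = "boot"          # service not loaded yet: log everything
    RELAXED = "relaxed"    # normal start: log only changes
    RECOVERY = "recovery"  # restore in progress (open requests are intercepted)


class FilterModel:
    def __init__(self):
        self.mode = Mode.BOOT
        self.log = []

    def on_io(self, op, path):
        """Record an I/O event according to the current mode."""
        if self.mode is Mode.BOOT:
            self.log.append((op, path))
        elif self.mode is Mode.RELAXED and op == "write":
            self.log.append((op, path))

    def attach_service(self, recovering):
        """Service is online: hand over the early log and switch mode."""
        early_log, self.log = self.log, []
        self.mode = Mode.RECOVERY if recovering else Mode.RELAXED
        return early_log


driver = FilterModel()
driver.on_io("create", "pagefile.sys")
driver.on_io("read", "ntdll.dll")
early = driver.attach_service(recovering=False)  # the "Relax" signal
driver.on_io("read", "report.docx")              # no longer logged
driver.on_io("write", "report.docx")             # a change: still logged
```

After the "Relax" signal the service holds the two early events, and the driver's own log contains only the write to `report.docx`.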

If recovery is in progress, the service tells the driver to work in "Recovery" mode. The system has just come back from a crash, and as soon as it issues a request to open a file on the disk, the mini-filter must intercept the operation, issue the request itself, and check whether the file exists on the disk and can be opened.

If the file is missing, the mini-filter passes this information to the service, which raises the priority of restoring that file (the background recovery keeps running all this time). In effect, the file simply jumps to the front of the queue. The service then restores the file, either itself or by means of other Acronis tools, and tells the driver that everything is in order. Now the operating system can access the file, and the driver "releases" the system's original request to the disk.

If recovery is not possible, the service informs the driver that the file is not in the backup either. Our mini-filter driver then simply passes the system's request on, and the original requester (the OS itself or an application) gets a "file not found" error. This is perfectly normal if the file really was neither on the disk nor in the backup.
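The recovery-mode path for a single open request can be sketched as follows. Again, this is a simplified model under my own assumptions: the disk and the backup are plain dictionaries, `ServiceModel` and `intercept_open` are hypothetical names, and the real exchange between the driver and the service is asynchronous.

```python
class ServiceModel:
    """Stands in for the user-space service that owns the backup."""

    def __init__(self, backup):
        self.backup = backup  # path -> content
        self.urgent = []      # files that had to jump the queue

    def restore_now(self, path, disk):
        """Restore one file with top priority; False if the backup lacks it."""
        if path not in self.backup:
            return False
        self.urgent.append(path)
        disk[path] = self.backup[path]
        return True


def intercept_open(path, disk, service):
    """Model of the mini-filter handling an open request in Recovery mode."""
    if path not in disk and not service.restore_now(path, disk):
        # Neither on disk nor in the backup: the requester sees the usual error.
        raise FileNotFoundError(path)
    return disk[path]  # the original request is released


disk = {"boot.ini": b"cfg"}
service = ServiceModel({"boot.ini": b"cfg", "user32.dll": b"lib"})

already_there = intercept_open("boot.ini", disk, service)
restored = intercept_open("user32.dll", disk, service)
try:
    intercept_open("ghost.txt", disk, service)
    ghost_found = True
except FileNotFoundError:
    ghost_found = False
```

The first request goes straight through, the second one triggers an urgent restore of `user32.dll`, and the third one ends with "file not found", exactly as it would on a machine that never had that file.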

Of course, the operating system will run noticeably slower, because reading any file or library now takes several steps, possibly including access to remote resources. On the other hand, the user can get back to work in the shortest possible time while the recovery is still in progress.

We need to go lower, and lower still...

The prototype showed that the approach works. But we also saw that we had to go further, because deadlocks still occur in some cases. For example, the operating system can request several libraries from different threads at once, and our service ends up locked on itself.

The problem I'm currently working on is speeding up Active Restore and improving system security. Suppose the system does not need a whole file, only a part of it. For this case another driver was developed: a disk filter driver. It works at the block level rather than the file level. The principle of operation is similar: in normal operation the driver simply logs the changed blocks on the disk, while in recovery mode it tries to read the block on its own and, if that fails, asks the service for a priority increase. All other parts of the system remain the same. For example, the OS-level service does not even know that it is now talking to a different driver, because the main task is still to give the OS exactly the data it needs to function. This direction requires significant improvements, if only because the service does not yet know how to think at the block level.
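The block-level variant can be modeled the same way as the file-level one. In the sketch below the block size of 4096 bytes, the class `BlockFilterModel`, and the dictionary-based backup are my assumptions for illustration; the actual disk filter driver lives in the kernel.

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Indices of the blocks touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


class BlockFilterModel:
    def __init__(self, backup_blocks):
        self.backup = backup_blocks  # block index -> bytes
        self.on_disk = {}            # blocks already restored
        self.changed = set()         # blocks written in normal operation
        self.urgent = []             # blocks requested out of turn

    def on_write(self, offset, length):
        """Normal mode: only remember which blocks changed."""
        self.changed.update(blocks_for_range(offset, length))

    def read_block(self, index):
        """Recovery mode: serve from disk, or ask for an urgent restore."""
        if index not in self.on_disk:
            self.urgent.append(index)
            self.on_disk[index] = self.backup[index]
        return self.on_disk[index]


model = BlockFilterModel({0: b"A", 1: b"B", 2: b"C"})
model.on_write(4000, 200)                 # crosses the boundary of blocks 0 and 1
needed = blocks_for_range(8192, 100)      # a small read inside a large file
data = [model.read_block(i) for i in needed]
```

A 100-byte read at offset 8192 touches only block 2, so only that block is restored urgently. The rest of the file it belongs to can wait for the background process.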

As the next step, I decided to run the driver deeper and earlier, going down to the level of UEFI drivers and replacing the service with a Windows Native application. For this, a UEFI boot driver (or DXE driver) was developed that starts and finishes before the OS even begins to load. The "history" of UEFI drivers, the details of building and installing them, and the specifics of Windows Native applications will be covered in the next post. So subscribe to our blog, and in the meantime I will prepare a story about the next stage of the work. I will be glad to hear your comments and advice.

Have you ever had a situation where recovery took an excruciatingly long time?

  • Yes: 65.1% (28 votes)
  • No: 23.2% (10 votes)
  • Didn't think: 11.6% (5 votes)

43 users voted. 3 users abstained.

Source: habr.com