
Quickly restoring 1M+ files from backup


How does one implement restoring a file from a backup on Windows? Well, you start with CreateFile, then use WriteFile to reassemble its content, and finally CloseHandle¹.

Oh, the restore operation needs to be durable?

I guess we’ll need to call FlushFileBuffers before closing the file. Or perhaps even use FILE_FLAG_WRITE_THROUGH so all our write operations go straight to disk. Flushing and writing through aren’t cheap, but we’re still talking about a single file.
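To make the idea concrete, here is a minimal single-file sketch of that sequence. It's only an illustration: the function name is made up, the path and buffer come from the caller, and error reporting is reduced to a BOOL.

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

// Hypothetical helper: restore one file durably by writing its content
// and forcing it to disk before reporting success.
BOOL RestoreFileDurably(const WCHAR *path, const BYTE *data, DWORD size)
{
    HANDLE file = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return FALSE;

    DWORD written = 0;
    BOOL ok = WriteFile(file, data, size, &written, NULL)
              && written == size
              && FlushFileBuffers(file);   // the durability barrier

    CloseHandle(file);
    return ok;
}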

Did you just say a million files?

Part 1: Denial

A million files sounds unreasonable, right?

It might be reasonable to assume you won’t regularly have a million or more files in one location on Windows. That many files quickly becomes a burden for both the filesystem and the administrator when they have to perform maintenance tasks. Native Windows applications usually design their storage in ways that rarely produce millions of files².

Unfortunately, this doesn’t pass the reality check.

We were planning a backup feature for VisualSVN Server that had to be fully compatible with all existing Subversion repository formats. Older repository formats did not support packing, which reduces the number of files on disk by combining individual revision files into larger packs. So for a repository with a million commits, you would have roughly 2 million files on disk.

Also, packing a repository is a regular background operation that an administrator may opt out of. It’s not something we have control over, and we absolutely had to be ready to deal with “a couple million files”.

Is there a reason to make restores durable?

In other words, why even bother with restore durability?

The problem is that a non-durable restore is like an uncommitted transaction. You created some directories, wrote some files and indicated that your operation is complete. Then your server has a hard restart, and suddenly some of those directories are missing file entries and some of those files have zeroes in them or wrong metadata.

So before you indicate success and leave the scope of your operation, you have to go the extra mile to ensure that the data you wrote actually hit the disk — and will remain consistent in the event of a hard reboot or a BSOD.

How does one do that? By making things slower and flushing to disk.

Part 2: Anger

Flushing to disk can’t be that slow!

Actually, it is ridiculously slow, even on modern hardware.

When you call an API to write something to disk, the OS copies the data and will happily confirm the write request without waiting for the data to reach the disk. This improves latency and throughput.

But if you want to guarantee that the written data has reached the disk, you can ask the OS to FlushFileBuffers. This breaks the entire asynchronous pipeline and blocks until the data has been written to disk and the device cache has been flushed.

Those costs stack up quickly if you have to do that for every file. For writing a million files, expect a 5-10× slowdown in wall-clock time. It does get worse with HDDs.

If we were on Linux, all that surely would be much faster

No, not really.

Unix I/O calls are generally faster, in part because there aren’t tens of filesystem filters sitting in the I/O path. But there is no silver bullet when it comes to flushing to disk, because you actually need to go all the way and make sure that the data is physically on the device.

There is some variety across fsync, fdatasync, F_FULLFSYNC and F_BARRIERFSYNC, so you may actually make your app faster by persisting less. But if we’re talking about similar guarantee levels, the overall speed is pretty much the same.

It’s trickier to perform an fdatasync on Windows, but there you also have NtFlushBuffersFileEx and FLUSH_FLAGS_FILE_DATA_SYNC_ONLY.
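For illustration, here is a rough user-mode sketch of that Windows counterpart. NtFlushBuffersFileEx is exported by ntdll.dll and isn’t declared in the regular SDK headers, so the helper name is made up and the flag value is taken from the DDK headers; treat the details as an assumption rather than a recipe.

#include <Windows.h>
#include <winternl.h>   // NTSTATUS, IO_STATUS_BLOCK

// Not declared in the user-mode SDK headers; value taken from ntifs.h.
#ifndef FLUSH_FLAGS_FILE_DATA_SYNC_ONLY
#define FLUSH_FLAGS_FILE_DATA_SYNC_ONLY 0x00000004
#endif

typedef NTSTATUS (NTAPI *PFN_NtFlushBuffersFileEx)(
    HANDLE FileHandle, ULONG Flags,
    PVOID Parameters, ULONG ParametersSize,
    PIO_STATUS_BLOCK IoStatusBlock);

// Hypothetical helper, roughly an fdatasync(): flush the file's data
// (plus whatever metadata is needed to read it back), skipping the rest.
BOOL FlushFileDataOnly(HANDLE file)
{
    PFN_NtFlushBuffersFileEx flushEx =
        (PFN_NtFlushBuffersFileEx)GetProcAddress(
            GetModuleHandleW(L"ntdll.dll"), "NtFlushBuffersFileEx");
    if (!flushEx)
        return FALSE;   // older systems may not export it

    IO_STATUS_BLOCK iosb = { 0 };
    return flushEx(file, FLUSH_FLAGS_FILE_DATA_SYNC_ONLY,
                   NULL, 0, &iosb) >= 0;   // NT_SUCCESS
}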

Part 3: Bargaining

So, apparently, both speed and durability are equally important for restoring a repository from backup.

You don’t want a restore to take hours, since that is going to hinder your disaster recovery strategy. But neither do you want to risk having a repository with a hidden corruption if you remotely called Restore-SvnRepository in PowerShell and didn’t notice that the VM got abruptly powered off.

How do we approach this problem if we want to retain both speed and durability?

Maybe we write through?

Let’s say that instead of calling FlushFileBuffers for every file, we open the files with FILE_FLAG_WRITE_THROUGH. It could be faster because the OS knows up front that we want durable writes for the file and has more opportunities to optimize the actual I/O.
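A sketch of that variant, assuming the rest of the restore loop stays the same; only the CreateFile flags change, and the path below is a made-up placeholder:

// Hypothetical example: open the destination file so that every WriteFile
// bypasses lazy write-back and goes straight to the device.
HANDLE file = CreateFileW(
    L"D:\\Repos\\restored\\db\\revs\\0\\1",   // placeholder path
    GENERIC_WRITE,
    0,                  // no sharing while the file is being restored
    NULL,
    CREATE_NEW,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
    NULL);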

Unfortunately, it doesn’t solve the speed problem for a million files. Even though we started writing directly to disk, the OS is still forced to complete a physical flush for every WriteFile. So each file still becomes an unavoidable barrier in terms of writes to the device. If most of the files are small, writing through doesn’t really help.

Maybe we restore multiple files at once?

Building on the previous idea, we could make most of our I/O asynchronous and restore multiple files at once. Theoretically, with N outstanding FlushFileBuffers calls or write-through operations in flight, we could expect up to an N-fold speedup.

In practice, the idea doesn’t work:

  • The speedup is not that great, probably because we didn’t remove flushing entirely; we just queued as many operations as possible and expected the OS to magically coalesce them.
  • In the real workload, there are both files and directories, which creates additional dependencies that limit how many operations you can queue.
  • Asynchronously calling FlushFileBuffers is tough.
  • Juggling a lot of open file handles is also tough and error-prone.
  • All of this makes the restore code asynchronous, and thus far less predictable and more complex.

Maybe we restore straight into a packed form?

We could try reassembling content on the fly when restoring, so that our million files would be restored as a much smaller number of packed files. Then flushing them to disk is no longer a problem.

A nice concept, but it’s impractical for multiple reasons:

  • Restore cannot result in data that is different from what has been backed up. Otherwise, you’ve lost the protection the backup was supposed to provide and turned your performance problem into a much more serious, higher-level problem.

  • If for some reason your backup was taken from a corrupted repository, repacking it on the fly may leave you unable to restore it at all.

  • Repacking requires random access to files in a repository, so that would require building a fairly complex VFS that works over incremental backup snapshots.

Maybe we flush the whole volume?

Documentation for FlushFileBuffers says the following:

To flush all open files on a volume, call FlushFileBuffers with a handle to the volume. The caller must have administrative privileges.

While we don’t really want all open files, we’re looking for a way to ensure that the content of all our cached files hits the device. Luckily, the DDK docs for NtFlushBuffersFileEx say exactly that:

//  If a volume handle is specified:
//      - Write all modified data for all files on the volume from the
//        Windows in-memory cache.
//      - Commit all pending metadata changes for all files on the volume
//        from the Windows in-memory cache.
//      - Send a SYNC command to the underlying storage device to commit
//        all written data in the devices cache to persistent storage.

Theoretically, if we accompanied every restore with a full volume flush, that would solve all problems. Flushing the volume doesn’t depend on the number of restored files. And it only introduces a single flush barrier to the whole operation, allowing all disk writes to happen asynchronously.
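For comparison, the documented volume-level flush would look roughly like this; an illustrative sketch where the volume name is just an example, and opening it this way is exactly what requires elevation:

// The documented way: open the volume itself and flush it.
// Opening a volume handle for writing requires administrative privileges.
HANDLE volume = CreateFileW(
    L"\\\\.\\C:",       // example volume
    GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL,
    OPEN_EXISTING,
    0,
    NULL);

if (volume != INVALID_HANDLE_VALUE)
{
    FlushFileBuffers(volume);   // flushes all cached data on the volume
    CloseHandle(volume);
}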

But the requirement for administrative privileges is a deal-breaker. In our case, the user may not have OS-level administrative privileges but should still be able to restore a repository from backup.

Part 4: Depression

Doing the math makes the results look grim.

If you have two million files and each FlushFileBuffers takes just 2 milliseconds, that’s 2,000,000 × 2 ms = 4,000 seconds, which is more than an hour just for the metadata and hardware synchronization. And that’s before we’ve even accounted for the actual data transfer.

With all known options exhausted, it’s time to accept…

…that we’re ready to start relying on undocumented internals.

Part 5: Acceptance

Out of all the proposed solutions, only flushing the whole volume seems to address the root cause of the problem and scale well. We accepted that we’re okay with undocumented behavior, but what can we really do if flushing the volume normally requires elevated privileges?

Microsoft has always been known for carefully preserving compatibility and carrying undocumented behaviors forward, so if there’s a hint, it would be in the code of the FAT filesystem driver. As it turns out, when this driver handles IRP_MJ_FLUSH_BUFFERS, flushing the volume root directory does exactly the same thing as flushing the volume. But that doesn’t require special privileges!

Did this behavior find its way to NTFS and ReFS? Only one way to find out.

Going practical

Below is a quick snippet we can use to open the mount point as a directory and flush it:

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

// filePath points at any file inside the restored repository.
WCHAR volumePath[MAX_PATH];

// Resolve the volume mount point.
GetVolumePathNameW(filePath, volumePath, _countof(volumePath));

// Open the mount point as a directory with FILE_FLAG_BACKUP_SEMANTICS
// and OPEN_EXISTING flags.
// NtFlushBuffersFileEx() requires FILE_APPEND_DATA or FILE_WRITE_DATA,
// so we try both of them.
HANDLE handle = CreateFileW(
    volumePath,
    FILE_APPEND_DATA,
    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
    NULL,
    OPEN_EXISTING,
    FILE_FLAG_BACKUP_SEMANTICS,
    NULL);

// Simplified error checking; better to check for ERROR_ACCESS_DENIED,
// ERROR_NETWORK_ACCESS_DENIED and ERROR_SHARING_VIOLATION.
if (handle == INVALID_HANDLE_VALUE)
{
    handle = CreateFileW(
        volumePath,
        FILE_WRITE_DATA,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_BACKUP_SEMANTICS,
        NULL);
}

if (handle != INVALID_HANDLE_VALUE)
{
    // Flushing the handle of the volume root directory flushes the volume.
    FlushFileBuffers(handle);
    CloseHandle(handle);
}

Watching this program in Process Monitor confirms that the FlushFileBuffers call worked, and that it took more than a second for the volume. It’s easy to see that the behavior mirrors a volume handle flush, and experiments also confirm that a physical flush is indeed triggered.

This approach works on FAT, FAT32, NTFS and ReFS.

This doesn’t work over the SMB protocol. Fortunately, it’s relatively straightforward to detect this in advance and select an appropriate fallback. The same applies if the user doesn’t have any access to the root directory.
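One way to detect that up front is to check the drive type of the resolved mount point. A hedged sketch with a made-up helper name; the fallback itself (for example, per-file flushing) is left to the caller:

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

// Returns TRUE if the volume-root flush trick is worth attempting.
// For remote (SMB) paths, fall back to per-file flushing instead.
BOOL CanFlushVolumeRoot(const WCHAR *filePath)
{
    WCHAR volumePath[MAX_PATH];
    if (!GetVolumePathNameW(filePath, volumePath, MAX_PATH))
        return FALSE;

    UINT type = GetDriveTypeW(volumePath);
    return type == DRIVE_FIXED || type == DRIVE_REMOVABLE;
}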

The conclusion: undocumented, but okay to use

Sometimes it’s fine to venture into slightly undocumented territory.

Incorporating this trick into our restore implementation allowed us to make the whole operation durable without sacrificing performance. We replaced per-file flushing with a single (costly!) flush for the volume root, but got rid of the stacking costs and allowed our writes to be fully asynchronous.

In this case, examining the FAT driver code indicates that the undocumented behavior is not accidental. Since it’s shared by FAT, FAT32, NTFS and ReFS, we’d assume it is there to stay.


¹ It gets a bit more complex if you also want to guarantee that your half-written content can’t be seen by others. In that case, you can atomically rename it into the final destination or use exclusive file sharing bits.

² Yes, node_modules, I’m looking at you.