PostgreSQL's fsync() surprise

DBMS developers are, of necessity, concerned with getting data safely to persistent storage. So when the PostgreSQL community found out that the way the kernel handles I/O errors can result in data being lost without any error ever being reported to user space, a fair amount of unhappiness resulted. The problem, which is exacerbated by the fact that PostgreSQL performs buffered I/O, turns out not to be unique to Linux, and it will not be easy to solve even there.



Craig Ringer first reported the problem to the pgsql-hackers mailing list at the end of March. In short, PostgreSQL assumes that a successful fsync() call indicates that all data written since the last successful call has been safely transferred to persistent storage. When a buffered write fails due to a hardware error, filesystems respond in different ways, but that behavior usually includes discarding the data in the affected pages and marking those pages as clean. Reading the blocks that were just written will therefore most likely return something other than the data that was written.
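To make that assumption concrete, here is a minimal C sketch (not actual PostgreSQL code; the file name is made up) of the buffered write-then-fsync() pattern being described, with comments marking where the assumption breaks down:

    /* A buffered write followed by fsync(); "datafile" is a made-up name.
     * The assumption in question is that a zero return from fsync() means
     * everything written since the last successful fsync() is on stable
     * storage. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "important database page";
        int fd = open("datafile", O_WRONLY | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }
        /* A successful buffered write() only means the data reached the
         * page cache; it says nothing about persistent storage. */
        if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
            perror("write");
            return EXIT_FAILURE;
        }
        /* The assumption under discussion: a zero return here means the
         * data is safe.  After a failed writeback the kernel may have
         * dropped the data and marked the pages clean, and this call can
         * still return success. */
        if (fsync(fd) != 0) {
            perror("fsync");
            return EXIT_FAILURE;
        }
        close(fd);
        return EXIT_SUCCESS;
    }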



What about error reporting? A year ago, an error-reporting session at the Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) described the whole area as a "mess"; errors could easily be lost so that no application would ever see them. Some patches merged during the 4.13 development cycle improved the situation somewhat (and 4.16 included changes to improve it further), but there are still ways for error notifications to be lost, as described below. If that happens to a PostgreSQL server, the result can be silent corruption of the database.



The PostgreSQL developers were not pleased. Tom Lane described the situation as "kernel brain damage", while Robert Haas called it "100% unreasonable". Early in the discussion, the PostgreSQL developers were fairly clear about how they thought the kernel should behave: pages that fail to be written out should be kept in memory in their dirty state (for later retries), and the relevant file descriptors should be put into a persistent error state so that the PostgreSQL server cannot miss the problem.



Where things go wrong



Even before the kernel community entered the discussion, however, it became clear that the situation was not as simple as it might seem. Thomas Munro reported that Linux is not alone in behaving this way; OpenBSD and NetBSD can also fail to report write errors to user space. And, as it turns out, the way PostgreSQL handles buffered I/O complicates the picture considerably.



This mechanism was described in detail by Haas. The PostgreSQL server runs as a set of processes, many of which can perform I/O to the database files. The task of calling fsync(), however, is handled in a single "checkpointer" process, which is concerned with keeping on-disk storage in a consistent state that can recover from failures. The checkpointer does not normally keep all of the relevant files open, so it often has to open a file before calling fsync() on it.



That is where the problem arises: even on 4.13 and later kernels, the checkpointer will not see any errors that occurred before it opened the file. If something bad happens before the checkpointer's open() call, the subsequent fsync() call will return success. There are several ways in which an I/O error can happen outside of an fsync() call; the kernel may, for example, encounter one while performing background writeback. Another process calling sync() can also encounter an I/O error and absorb the resulting error state.
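A rough sketch of the checkpointer pattern just described may make the failure mode easier to see; the file names and the helper function below are hypothetical, and the real checkpointer is considerably more involved:

    /* Hypothetical sketch of the checkpointer pattern: the process calling
     * fsync() is not the process that issued the writes, and it may have
     * to open each file first. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Flush one database file; returns 0 on success, -1 on error. */
    static int checkpoint_one_file(const char *path)
    {
        int fd = open(path, O_RDWR);   /* opened only now, long after the
                                          writes were issued elsewhere */
        if (fd < 0)
            return -1;

        /* Any I/O error that hit this file before the open() above -- for
         * example, one encountered during background writeback, or one
         * absorbed by another process's sync() -- will not be reported by
         * this call on the kernels discussed here. */
        int ret = fsync(fd);
        close(fd);
        return ret;
    }

    int main(void)
    {
        static const char *files[] = { "base/16384/16385", "base/16384/16388" };

        for (unsigned i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
            if (checkpoint_one_file(files[i]) != 0)
                fprintf(stderr, "checkpoint failed for %s\n", files[i]);
        }
        return 0;
    }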



Haas described this behavior as falling short of what PostgreSQL expects:

What you (or whoever) have essentially done is make an undocumented assumption about which file descriptors might be relevant to a particular error, and it just so happens that PostgreSQL has never conformed to that assumption. You can keep saying that the problem is with our assumptions, but it seems wrong to me to suppose that we are the only program that has ever made them.


Eventually, Joshua Drake moved the conversation over to the ext4 development list, bringing part of the kernel development community into the discussion. Dave Chinner quickly described this behavior as "a recipe for disaster, especially in cross-platform code where every OS platform behaves differently and almost never in the way that was expected". Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error; in short, the most common cause of I/O errors is a user pulling out a USB drive at the wrong time. If a process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system no longer has enough memory for anything else. Those pages cannot be kept, then, if the user wants the system to remain usable after such an event.



Both Chinner and Ts'o, along with others, said that the right solution is for PostgreSQL to switch to direct I/O (DIO) instead. Using DIO gives a greater level of control over writeback and I/O in general, including access to information about exactly which I/O operations have failed. Andres Freund, like several other PostgreSQL developers, acknowledged that DIO is the best long-term solution, but he also noted that nobody should expect the developers to dive headlong into implementing it now. Meanwhile, he said, there are other programs (he mentioned dpkg) that are also exposed to this behavior.
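For illustration, here is a minimal sketch of a direct I/O write on Linux; the 4096-byte alignment and the file name are assumptions, since the real requirements depend on the filesystem and the underlying device:

    /* Sketch of a direct I/O write: the page cache is bypassed, so a
     * failed transfer is reported to this write() call rather than being
     * deferred to writeback. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGNMENT 4096         /* assumed alignment requirement */

    int main(void)
    {
        void *buf;
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }
        if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0) {
            perror("posix_memalign");
            return EXIT_FAILURE;
        }
        memset(buf, 'x', ALIGNMENT);

        /* An I/O error shows up here as a -1 return with errno set,
         * instead of being absorbed by the kernel's writeback machinery. */
        if (write(fd, buf, ALIGNMENT) != ALIGNMENT)
            perror("direct write");

        free(buf);
        close(fd);
        return EXIT_SUCCESS;
    }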



Towards a short-term solution



Much of the discussion focused on the idea that a failed write should cause the affected pages to be kept in memory in their dirty state. The PostgreSQL developers quickly backed away from that idea, though, and no longer insist on it. What they really need, in the end, is a reliable way to find out that something has gone wrong. Given that, the normal PostgreSQL error-handling mechanisms can take it from there; in the absence of such a notification, though, there is little that can be done.



At one point in the discussion, Ts'o mentioned that Google has its own mechanism for handling I/O errors. The kernel is instructed to report I/O errors via a netlink socket; a dedicated process receives those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that it would be "ideal" for PostgreSQL, so it may be released publicly in the near future.



Meanwhile, Jeff Layton was pondering another idea: setting a flag in the filesystem superblock when an I/O error occurs. A call to syncfs() would then clear that flag and return an error if it had been set. The PostgreSQL checkpointer could make periodic syncfs() calls to poll for errors on the filesystem containing the database. Freund agreed that this could be a viable solution to the problem.
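Assuming the semantics Layton proposed (which stock kernels at the time did not provide), a checkpointer-style polling loop might look roughly like the following sketch; the mount point and the polling interval are made up:

    /* Sketch of the proposed polling scheme: periodically call syncfs() on
     * a descriptor for the filesystem holding the database and treat a
     * failure as "an I/O error happened somewhere since the last check".
     * This relies on the proposed semantics, not on what kernels of the
     * time actually did. */
    #define _GNU_SOURCE            /* for syncfs() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/lib/postgresql", O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (;;) {
            if (syncfs(fd) != 0) {
                perror("syncfs");
                /* A real checkpointer would refuse to consider the
                 * checkpoint complete and enter its recovery path here. */
            }
            sleep(30);             /* polling interval chosen arbitrarily */
        }
    }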



Of course, any such mechanism will only appear in new kernels; in the meantime, PostgreSQL installations typically run on older kernels shipped by enterprise distributions. Those kernels lack even the improvements that were merged in 4.13. For such systems, there is little that can be done to help PostgreSQL detect I/O errors. About the best that can be done may be to run a daemon that scans the system log, looking for I/O error reports there. That is not the most elegant solution, and it is complicated by the fact that different block drivers and filesystems tend to report errors differently, but it may be the best option available.
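One possible shape for such a daemon is sketched below; the log path and the matched substrings are assumptions, since, as noted, different drivers and filesystems word their error reports differently:

    /* Sketch of a log-scanning helper: follow the kernel log file and flag
     * lines that look like I/O error reports. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *log = fopen("/var/log/kern.log", "r");   /* hypothetical path */
        char line[4096];

        if (!log) {
            perror("fopen");
            return 1;
        }
        fseek(log, 0, SEEK_END);      /* only report errors from now on */

        for (;;) {
            if (fgets(line, sizeof(line), log)) {
                if (strstr(line, "I/O error") ||
                    strstr(line, "Buffer I/O error"))
                    fprintf(stderr, "possible lost write: %s", line);
            } else {
                clearerr(log);        /* at end of file; wait for more */
                sleep(1);
            }
        }
    }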



The next step is likely to be a discussion at the 2018 LSFMM event, which starts on April 23. With luck, some sort of solution will emerge that works for all of the interested parties. One thing that will not change, though, is the simple fact that error handling is hard to get right.


