Wednesday, April 18, 2018

renameat2(2)

I just recently discovered that Linux kernels from about version 3.15 or so implement a system call called renameat2. (That's pronounced "rename at 2" not "rena meat 2" BTW.) This is a useful thing to know, because a close reading of the renameat2(2) manual page reveals it eliminates two related thorny problems that I haven't found any other way to solve.

I also discovered that despite what the output of the man renameat2 command might imply about the C function prototype, recent versions of the glibc library, at least the one on my Ubuntu 16.04 "xenial" development system, don't provide a function wrapper for the system call. Fortunately, it turns out that you can easily roll your own, something I have never had to do before.

Here is the implementation of the equivalent C function.

(N.B. In these code snippets I've eliminated all the error checking for clarity. In my production code, I am avidly paranoid about checking all error returns. I encourage you to do the same. All of these code snippets are from actual running examples. Since Blogger seems incapable of consistently displaying my code snippets correctly across all browsers - they all look correct in the Blogger editor - these snippets can be found on GitHub.)

#include <stdio.h>
#include <fcntl.h>
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int renameat2(int olddirfd, const char * oldpath, int newdirfd, const char * newpath, unsigned int flags) {
 return syscall(SYS_renameat2, olddirfd, oldpath, newdirfd, newpath, flags);
}

A careful application of strace(1) on some unit tests reveals nothing but goodness in the result.

Problem One

Accidents happen. Mistakes are made. No matter how carefully you write your boot-time start-up scripts, or your interactive restart scripts, for system daemons, sooner or later your product is going to try to start the same daemon twice. This will happen in the development lab, or during quality assurance testing, if no where else. Wackiness may ensue.

A common idiom used to prevent this is for a daemon to initially create a lock file with a unique name like /run/lock/mydaemonname.lck in such a way that if the file already exists, the file creation fails, and the daemon exits assuming that prior version of itself is already running. This lock file is deleted when the daemon exits, and the entire /run/lock directory is cleaned up when the system reboots.

Ensuring that the lock file does not already exist can be easily done by using the combination of the O_CREAT | O_EXCL | O_WRONLY flags on on the open(2) system call; the system call will fail if the file already exists, and the check for it existing, and the subsequent creation of the file, is atomic. There is no way another process running concurrently on the system can see an intermediate result. The Linux kernel and its file system implementation guarantees that the file either exists, or it doesn't exist; there is no possible race condition.

This pattern gets ugly when the daemon wants to implement another commonly used idiom: writing its own process identifier, typically as text, into the lock file. This is really useful, because it allows other applications to verify that the daemon that created the lock file is still running, and to subsequently communicate with it, for example by sending it a signal. The ugliness happens because if the daemon atomically creates the file using open(2), then writes its PID into the file, applications can easily see an intermediate state: an empty lock file, or worse, a lock file with only a partially written PID that may look legitimate, because writing the PID to the lock file isn't necessarily atomic.

What the daemon would like to do is create a unique temporary file, perhaps using mkstemp(3), write its PID into that file, then atomically rename(2) the temporary file to the name of the lock file using the same exclusivity semantics as provided by open(2). We want the lock file to have exactly two visible states:
  • the lock file does not exist;
  • the lock file exists and contains a valid process identifier.
Unfortunately, while the rename(2) system call can be atomic, providing the source and destination files are in the same file system (so that no copying needs to be done),  it does not provide the same optional exclusivity flags as open(2). But that's exactly what renameat2(2) provides with its RENAME_NOREPLACE flag. And even better, renameat2(2) also checks that the source and destination file names are in the same file system, keeping you from shooting yourself in the foot.

Here is the implementation of the lock function.

int my_lock(const char * file)
{
    static const char SUFFIX[] = "-lock-XXXXXX";
    pid_t pid = getpid();
    char * path = (char *)malloc(strlen(file) + sizeof(SUFFIX));
    strcpy(path, file);
    strcat(path, SUFFIX);
    int fd = mkstemp(path);
    FILE * fp = fdopen(fd, "w");
    fprintf(fp, "%d\n", pid);
    fclose(fp);
    int rc = renameat2(AT_FDCWD, path, AT_FDCWD, file, RENAME_NOREPLACE);
    if (rc < 0) { unlink(path); }
    free(path);
    return rc;
}

Problem Two

Another common idiom is for a parent process to atomically create an empty lock file, fork(2) off a child process, the child process daemonifies itself, then it writes its PID into the lock file. In this approach, the lock file has three states instead of two:
  • the lock file does not exist,
  • the lock file exists but is empty,
  • the lock file exists and contains a valid PID.
We want the lock file creation to be atomic as before, and we want the transition from empty to containing a PID to be atomic as well, with no intermediate states visible. The parent can wait for some time out period, then check the lock file for the PID. This not only gives the parent a PID through which it and other applications can signal the child, but also informs the parent that the child has started successfully.

The initial atomic creation of the lock file can be done as described previously using open(2) with the appropriate flags. As a subsequent step, the child process creates a temporary file containing its PID, and, providing the temporary file and the lock file are in the same file system, renameat2(2) can be used with its RENAME_EXCHANGE flag to atomically swap the temporary file and the lock file. renameat2(2) does all the heavy lifting for us by not only atomically swapping the two files, but also by guaranteeing that the two files are in the same file system, and insuring that the lock file already exists.

Here is the implementation of the pre-lock function called by the parent process that atomically exclusively creates the lock file, and the post-lock function called by the child that populates the lock file with the child's PID. (The implementation of the functions to delete the lock file, and to read the PID from the lock file, are left as an exercise for the reader.)

int my_prelock(const char * file)
{
    int fd = open(file, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) { return -1; }
    close(fd);
    return 0;
}

int my_postlock(const char * file)
{
    static const char SUFFIX[] = "-post-XXXXXX";
    pid_t pid = getpid();
    char * path = (char *)malloc(strlen(file) + sizeof(SUFFIX));
    strcpy(path, file);
    strcat(path, SUFFIX);
    int fd = mkstemp(path);
    FILE * fp = fdopen(fd, "w");
    fprintf(fp, "%d\n", pid);
    fclose(fp);
    int rc = renameat2(AT_FDCWD, path, AT_FDCWD, file, RENAME_EXCHANGE);
    unlink(path);
    free(path);
    return rc;
}

Note that once the swap is successful, the post-lock function deletes the temporary file, which will now be empty.

Alternatives

I recently worked on a embedded product development project in which the underlying Linux distribution chosen by the team supported systemd(1), the system and service manager. systemd replaces the bulk of the boot-time init infrastructure, daemon start-up scripts, and even the cron daemon. As you might imagine, its introduction is somewhat controversial in the Linux mainstream, especially given that the traditional UNIX architectural philosophy is to resist centralized management.

But I - who have been doing embedded Linux development since the Linux 2.4 kernel - was a little surprised to find I liked systemd. If your projects supports systemd, you will find that it provides other mechanisms to manage the daemons you might be called upon to write. You should use those mechanisms.

My computer scientist friends will have appreciated that the renameat2 system call implements a kind of semaphore or mutex, or at least a kind of atomic swap that can be used to implement such capabilities, in the file system. That means there are probably lots of other things it can be used for.

Linux (and other UNIX variants) provide other synchronization and mutual-exclusion mechanisms through which something like this could be implemented. But this approach just uses the file system. It requires only a small amount of cooperation between the applications that make use of it. It's simple and straight forward.

Important Safety Tips

renameat2(2) requires support from the underlying file system. Not all file systems may implement it. Some, especially network file systems, may never be able to implement it, because of their semantics.

At least on Ubuntu 16.04 "xenial", the synthetic processor implemented by valgrind(1) doesn't provide the renameat2 system call. This means you can't test applications using that call for memory leaks. This makes me nervous.

Final Remarks

I like renameat2(2) because I haven't found any other way to reliably and simply implement what it provides. In fact, renameat2(2) is so uniquely useful, I suspect that systemd(1) uses the renameat2 system call under the hood to perform its own service management. (Update: I just checked the systemd repository on GitHub and this is indeed the case.)

I came to use renameat2(2) in the implementation of the diminuto_lock.c feature, and in the renametool utility, in my Diminuto C-based systems programming library. Diminuto is a collection of useful C functions and utilities I have written that has found its way into a number of shipping products, not to mention in many of my internal projects.

No comments: