vfork, close_range, and execve for launching processes in Procex

Las Safin

2021-07-20

procex is a Haskell library for launching processes. No other existing Haskell package offers flexible process execution. This post details the C side, which uses vfork, close_range, and execve.


Forking

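If you’re not familiar with how launching processes works on Linux, there are two important syscalls: fork, which creates a copy of the current process, and execve, which replaces the program running in the current process with a new one.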

Instead of fork, however, we will be using a closely related syscall, vfork. vfork is like fork, except that the child shares the parent’s memory instead of getting a copy of it. It was originally made to conserve resources; in our case, however, it is simply much more practical.

Since memory is shared, the parent will not resume until the child has exited or called execve, and for the same reason it is easy to communicate between the child and the parent.
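To make the shared-memory point concrete, here is a minimal sketch of a vfork child handing a value back to its parent through a local variable. The helper name vfork_shared_write is hypothetical and not part of procex. POSIX does not promise that writes in a vfork child are well-defined, so this relies on Linux behaviour, just like the glue code later in this post.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of procex): returns the value that the
// vfork child stored in the parent's stack frame, or -1 on failure.
int vfork_shared_write(int value) {
    // volatile keeps the variable in memory rather than in a register,
    // so the child's store is visible once the parent resumes.
    volatile int shared = 0;
    pid_t pid = vfork();
    if (pid == -1) return -1;
    if (pid == 0) {
        // Child: runs on the parent's memory while the parent is suspended.
        shared = value;
        _exit(0);
    }
    // Parent: resumes only after the child has exited (or called execve).
    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    return shared;
}
```

With a plain fork the child would write to its own copy, and the parent would still see 0.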

It also fixes a rather annoying problem with fork:

Since threads aren’t cloned, if a thread in the parent is holding a lock when the fork happens, that lock will still be held in the child, even though the thread no longer exists, thus it will never be released. This also means we can’t use malloc and any functions that might access locks in the child, because it could possibly attempt to acquire a lock that was held across the fork and thus never released.

In the case of vfork, however, the threads in the parent will have access to the same memory as in the child, so locks will eventually get released (if the program is not buggy), avoiding the above problem entirely.

Of course it would be nice if we didn’t have to use fork at all, because it is a completely nonsensical syscall.

Piping and file descriptors

There is another important part of creating a new process, and that is setting up pipes and file descriptors. As in bash, we want to be able to do cmd < file | othercmd. To do this, we need to set up the file descriptors in our own process before calling execve, since the new process inherits all our file descriptors.

This is another pain point: all file descriptors you don’t close will be passed on to the new process. This means open files, open sockets, and open handles to almost all kernel interfaces you can think of might be leaked into the new process.

There are a couple of solutions on Linux to closing extra file descriptors:

- Loop from 0 to the maximum number of file descriptors you can open, i.e. RLIMIT_NOFILE. (What System.Process.createProcess from process does, completely inane and horribly slow.)
- Walk through the directory /proc/self/fd/, which contains an entry for each open file descriptor.
- Use the new close_range syscall from Linux 5.9.
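The second option, walking /proc/self/fd, could look roughly like the sketch below. The helper name close_fds_from_proc is hypothetical and procex does not use this approach. Note that opendir may allocate memory, which matters given the locking caveat about fork above.

```c
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper (not part of procex): close every open descriptor
// >= first by walking /proc/self/fd.
// Returns 0 on success, -1 if the directory cannot be opened.
int close_fds_from_proc(int first) {
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL) return -1;
    // The directory stream holds a descriptor of its own;
    // skip it so we don't close it while iterating.
    int self = dirfd(dir);
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(entry->d_name, &end, 10);
        // Skip "." and ".." and anything else that is not a plain number.
        if (end == entry->d_name || *end != '\0') continue;
        if (fd >= first && fd != self) close((int)fd);
    }
    closedir(dir);
    return 0;
}
```

This only touches descriptors that are actually open, so its cost depends on the number of open files rather than on RLIMIT_NOFILE, but it needs /proc to be mounted.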

close_range is clearly the best solution if you’re on a recent Linux.
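Since the kernel you compile on is not necessarily the kernel you run on, one option is to try the syscall at run time and fall back to a loop when it is unavailable. The following is a sketch under that assumption. The helper name close_from is hypothetical, and treating every failure as "unavailable" is a simplification that seems reasonable here because flags are 0 and the range cannot be inverted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/select.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper (not part of procex): close all descriptors >= first.
int close_from(unsigned int first) {
#ifdef SYS_close_range
    // Raw syscall: close everything in [first, UINT_MAX], no flags.
    if (syscall(SYS_close_range, first, ~0U, 0U) == 0) return 0;
    // Assumption: any failure means the running kernel (or a sandbox)
    // does not provide close_range, so we fall through to the loop.
#endif
    // Fallback: only covers descriptors below FD_SETSIZE.
    for (unsigned int fd = first; fd < FD_SETSIZE; fd++) close((int)fd);
    return 0;
}
```

The fallback has the same limitation as the one in the glue code below: descriptors at or above FD_SETSIZE stay open.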

An important thing to keep in mind though, is that you must not launch a new process with the RLIMIT_NOFILE soft limit set to anything other than FD_SETSIZE (usually 1024), since select is hard-coded to that limit. We must set it back to FD_SETSIZE before launching a new process, and that also provides a good fallback for older kernels: If we set RLIMIT_NOFILE to FD_SETSIZE in the new process, we only have to close file descriptors below FD_SETSIZE, which is much faster than closing potentially millions of file descriptors.

The code

Now that all these steps are in place, we can start implementing our glue code:

#define _GNU_SOURCE

#include <stddef.h>       // size_t
#include <sys/syscall.h>
#include <linux/version.h>
#include <sys/select.h>   // FD_SETSIZE
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>
#include <sched.h>

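// Note that this checks the kernel headers at compile time, not the running kernel.
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
#include <linux/close_range.h>
// glibc only gained a close_range wrapper in version 2.34,
// so we invoke the syscall directly.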
static int close_range(unsigned int first) {
    return syscall(__NR_close_range, first, ~0U, 0);
}
#else
static int close_range(unsigned int first) {
    // There could be fds above FD_SETSIZE, but this might not be a problem
    // because we reset the soft fd limit to FD_SETSIZE (1024) later.
    for (unsigned int i = first; i < FD_SETSIZE; i++) close(i);
    return 0;
}
#endif

// This contains the current environment.
extern char **environ;

// This is like vfork_close_execve but replaces the current process.
int close_execve(
    const char *path,
    char *const argv[],
    char *const envp[],
    int fds[],
    size_t fd_count
) {
    // We make sure the file descriptors in the array point to what they're
    // supposed to point to, since if e.g. one pointed to stdin (fd 0),
    // we want it to mean the old stdin, not the new stdin.
    for (size_t i = 0; i < fd_count; i++) {
        if (fds[i] != -1) {
            int fd = dup(fds[i]);
            if (fd == -1) return -1;
            fds[i] = fd;
        }
    }

    // Rename the file descriptors as specified,
    // closing the ones we don't want.
    for (size_t i = 0; i < fd_count; i++) {
        if (fds[i] != -1) {
            if (dup2(fds[i], (int)i) == -1) return -1;
        } else {
            if (close((int)i) == -1) return -1;
        }
    }

    // We close all file descriptors that are larger than or equal to fd_count.
    if (close_range(fd_count) == -1) return -1;

    // Reset fd limit for compatibility with select(), see http://0pointer.net/blog/file-descriptor-limits.html.
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
    rl.rlim_cur = FD_SETSIZE;
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;

    return execve(path, argv, envp != NULL ? envp : environ);
}

// Fork, close file descriptors, then execute.
pid_t vfork_close_execve(
    const char *path, // The path to executable, does not look through PATH
    char *const argv[], // Will be passed verbatim to execve
    // Will be passed verbatim to execve if not NULL, otherwise it will be set to the current environment
    char *const envp[],
    // This is an array, fd_count long, of all the file descriptors we want to share.
    // In the new process, the descriptors will be renamed: fds[i] will be renamed to i using dup2.
    // -1 means that descriptor i will be closed.
    int fds[],
    size_t fd_count
) {
    // We mark this volatile so it isn't contained in a register.
    // In the child, we can not write to the parent's registers, only its memory.
    // This is used to pass the result code to the parent.
    volatile int result = 0;
    // We fork. If you change this to a standard fork it would break since we
    // wouldn't be able to write to the parent's memory.
    pid_t pid = vfork();
    // vfork had an error.
    if (pid == -1) {
        return -1;
    // We are in the child.
    } else if (pid == 0) {
        result = close_execve(path, argv, envp, fds, fd_count);
        _exit(1);
    // We are in the parent, the child must have exited or called execve by now.
    } else {
        // The child had an error.
        if (result != 0) {
            return result;
        // Success! Return the new PID.
        } else {
            return pid;
        }
    }
}

This code on GitHub

