2021-07-20
procex is a Haskell library for launching processes. No other existing Haskell package offers flexible process execution. This post details the C side, which uses vfork, close_range, and execve.
If you’re not familiar with how launching processes works on Linux, there are two important syscalls:
- execve: This replaces the current process with a new process. It takes the path to the new executable, the arguments, and the environment.
- fork: This clones the current process, and is the fundamental way of making a new process. Once you’re in the new cloned process, you can replace it with a new process using execve.
Instead of fork, however, we will be using a closely related syscall, vfork. vfork is like fork, except that it doesn’t clone the memory: the child runs in the parent’s address space. It was originally made to conserve resources, but in our case it is actually much more practical.
Since memory is shared, the parent will not resume until the child has either exited or called execve, and it is easy to communicate between the child and the parent: anything the child writes to memory is visible to the parent once it resumes.
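To make the shared-memory point concrete, here is a minimal sketch (not the code procex uses) in which the child reports a failed execve to the parent through a local variable. The helper name try_exec is hypothetical, and writing to ordinary variables in a vfork child is something Linux permits in practice rather than something POSIX guarantees.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run `path` with no extra arguments and an empty
// environment. Returns 0 if execve succeeded in the child, otherwise the
// errno of the failed vfork or execve.
int try_exec(const char *path) {
    // volatile keeps the value in memory, which the child shares with us.
    volatile int child_errno = 0;
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };

    pid_t pid = vfork();
    if (pid == -1) return errno;
    if (pid == 0) {
        // Child: runs in the parent's address space while the parent is suspended.
        execve(path, argv, envp);
        // Only reached if execve failed.
        child_errno = errno;
        _exit(127);
    }
    // Parent: resumed because the child called execve or _exit.
    int status;
    waitpid(pid, &status, 0); // reap the child (waits for it to finish)
    return child_errno;
}
```

With plain fork, child_errno would always read 0 in the parent, because the child would have written to its own copy of the variable.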
It also fixes a rather annoying problem with fork: since threads aren’t cloned, if a thread in the parent is holding a lock when the fork happens, that lock will still be held in the child, even though the thread no longer exists, so it will never be released. This also means we can’t use malloc or any other function that might take a lock in the child, because it could attempt to acquire a lock that was held across the fork and thus never released.
In the case of vfork, however, the threads in the parent have access to the same memory as the child, so locks will eventually get released (if the program is not buggy), avoiding the above problem entirely.
Of course, it would be nice if we didn’t have to use fork at all, because it is a completely nonsensical syscall.
There is another important part of creating a new process, and that is setting up pipes and file descriptors. As in bash, we want to be able to do cmd < file | othercmd.
To do this, we need to set up the file descriptors in our own process before calling execve, since the new process inherits all our file descriptors. This is another pain point: all file descriptors you don’t close will be passed on to the new process. This means open files, open sockets, and open handles to almost any kernel interface you can think of might be leaked into the new process. There are a couple of solutions on Linux for closing extra file descriptors:
- Loop from 0 to the maximum number of file descriptors you can open, i.e. RLIMIT_NOFILE. (This is what System.Process.createProcess from process does, which is completely inane and horribly slow.)
- Walk through the directory /proc/self/fd/, which contains an entry for each open file descriptor.
- Use the new close_range syscall from Linux 5.9.
close_range is clearly the best solution if you’re on a recent Linux.
An important thing to keep in mind, though, is that you must not launch a new process with the RLIMIT_NOFILE soft limit set to anything other than FD_SETSIZE (usually 1024), since select cannot handle file descriptors numbered FD_SETSIZE or higher. We must set it back to FD_SETSIZE before launching a new process, and that also provides a good fallback for older kernels: if we set RLIMIT_NOFILE to FD_SETSIZE in the new process, we only have to close file descriptors below FD_SETSIZE, which is much faster than closing potentially millions of file descriptors.
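Resetting the soft limit on its own is a short getrlimit/setrlimit pair. The sketch below uses a hypothetical helper name, reset_nofile_soft_limit, and adds one thing that is an assumption on my part rather than something from the text above: it clamps the value to the hard limit, since setrlimit rejects a soft limit above the hard one.

```c
#include <sys/resource.h>
#include <sys/select.h>

// Hypothetical helper: set the soft RLIMIT_NOFILE to FD_SETSIZE, or to the
// hard limit if that is lower. Returns 0 on success, -1 on error.
int reset_nofile_soft_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) return -1;
    rlim_t wanted = FD_SETSIZE;
    // The soft limit may never exceed the hard limit.
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max) wanted = rl.rlim_max;
    rl.rlim_cur = wanted;
    // The hard limit (rl.rlim_max) is passed back unchanged.
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Only the soft limit changes, so a program that knows how to deal with high descriptor numbers can still raise it again up to the hard limit.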
Now that all these steps are in place, we can start implementing our glue code:
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <linux/version.h>
#include <sys/select.h> // FD_SETSIZE
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>
#include <sched.h>
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
#include <linux/close_range.h>
// glibc before 2.34 does not wrap close_range, so we need to do it ourselves.
// (Newer glibc declares its own three-argument close_range in <unistd.h>,
// so this helper needs a different name there.)
static int close_range(unsigned int first) {
    return syscall(__NR_close_range, first, ~0U, 0);
}
#else
static int close_range(unsigned int first) {
    // There could be fds at or above FD_SETSIZE, but this might not be a problem
    // because we reset the soft fd limit to FD_SETSIZE (1024) later.
    for (unsigned int i = first; i < FD_SETSIZE; i++) close(i);
    return 0;
}
#endif
// This contains the current environment.
extern char **environ;
// This is like vfork_close_execve but replaces the current process.
int close_execve(
    const char *path,
    char *const argv[],
    char *const envp[],
    int fds[],
    size_t fd_count
) {
    // We make sure the file descriptors in the array point to what they're
    // supposed to point to, since if e.g. one pointed to stdin (fd 0),
    // we want it to mean the old stdin, not the new stdin.
    // Note that dup returns the lowest free descriptor: if one of the
    // descriptors 0 to fd_count - 1 is free in the caller, a duplicate can
    // land on it and be clobbered by the loop below.
    for (size_t i = 0; i < fd_count; i++) {
        if (fds[i] != -1) {
            int fd = dup(fds[i]);
            if (fd == -1) return -1;
            fds[i] = fd;
        }
    }
    // Rename the file descriptors as specified,
    // closing the ones we don't want.
    for (size_t i = 0; i < fd_count; i++) {
        if (fds[i] != -1) {
            if (dup2(fds[i], (int)i) == -1) return -1;
        } else {
            if (close((int)i) == -1) return -1;
        }
    }
    // We close all file descriptors that are larger than or equal to fd_count.
    if (close_range(fd_count) == -1) return -1;
    // Reset fd limit for compatibility with select(), see http://0pointer.net/blog/file-descriptor-limits.html.
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
    rl.rlim_cur = FD_SETSIZE;
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
    return execve(path, argv, envp != NULL ? envp : environ);
}
// Fork, close file descriptors, then execute.
pid_t vfork_close_execve(
    const char *path, // The path to the executable, does not look through PATH
    char *const argv[], // Will be passed verbatim to execve
    // Will be passed verbatim to execve if not NULL, otherwise it will be set to the current environment
    char *const envp[],
    // This is an array, fd_count long, of all the file descriptors we want to share.
    // In the new process, the descriptors will be renamed: fds[i] will be renamed to i using dup2.
    // -1 means it will be closed.
    int fds[],
    size_t fd_count
) {
    // We mark this volatile so it isn't contained in a register.
    // In the child, we can not write to the parent's registers, only its memory.
    // This is used to pass the result code to the parent.
    volatile int result = 0;
    // We fork. If you change this to a standard fork it would break since we
    // wouldn't be able to write to the parent's memory.
    pid_t pid = vfork();
    // vfork had an error.
    if (pid == -1) {
        return -1;
    // We are in the child.
    } else if (pid == 0) {
        result = close_execve(path, argv, envp, fds, fd_count);
        _exit(1);
    // We are in the parent, the child must have called execve or exited by now.
    } else {
        // The child had an error.
        if (result != 0) {
            return result;
        // Success! Return the new PID.
        } else {
            return pid;
        }
    }
}