---
author: Las Safin
date: "2021-07-20"
keywords:
- linux
- vfork
- syscall
- haskell
- guide
title: "vfork, close_range, and execve for launching processes in Procex"
...

[`procex`](https://hackage.haskell.org/package/procex) is a Haskell library for launching processes. No other existing Haskell package offers such flexible process execution. This post details the C side, which uses `vfork`, `close_range`, and `execve`.

-----

# Forking

If you're not familiar with how launching processes works on Linux, there are two important syscalls:

- `execve`: This replaces the current process with a new process. It takes the path to the new executable, the arguments, and the environment.
- `fork`: This clones the current process, and is the fundamental way of making a new process. Once you're in the new cloned process, you can replace it with a new process using `execve`.

Instead of `fork`, however, we will be using a closely related syscall, `vfork`. `vfork` is like `fork`, except it doesn't clone the memory. It was originally made to conserve resources; in our case, however, it is actually much more practical. The parent will not resume until the child has called `execve` or exited, and since memory is shared, it is easy to communicate between the child and the parent.

It also fixes a rather annoying problem with `fork`: since threads aren't cloned, if a thread in the parent is holding a lock when the fork happens, that lock will still be held in the child, even though the thread no longer exists, and thus it will never be released. This also means we can't use `malloc`, or any other function that might take a lock, in the child, because it could attempt to acquire a lock that was held across the fork and thus never released. In the case of `vfork`, however, the threads in the parent have access to the same memory as the child, so locks will eventually get released (if the program is not buggy), avoiding the above problem entirely.

Of course it would be nice if we didn't have to use `fork` at all, because it is [a completely nonsensical syscall](https://drewdevault.com/2018/01/02/The-case-against-fork.html).

# Piping and file descriptors

There is another important part of creating a new process, and that is setting up pipes and file descriptors. We want to be able to do `cmd < file | othercmd`, like in `bash`. To do this, we need to set up the file descriptors in our own process before calling `execve`, since the new process inherits all our file descriptors.

This is another pain point: all file descriptors you don't close will be passed on to the new process. This means open files, open sockets, open handles to almost any kernel interface you can think of might be leaked into the new process.

There are a couple of solutions on Linux to closing extra file descriptors:

- Loop from 0 to the maximum number of file descriptors you can open, i.e. `RLIMIT_NOFILE`. (This is what `System.Process.createProcess` from `process` does; it is completely inane and horribly slow.)
- Walk through the directory `/proc/self/fd/`, which contains an entry for each open file descriptor. (A rough sketch of this approach follows below.)
- Use the new `close_range` syscall from Linux 5.9.

`close_range` is clearly the best solution if you're on a recent Linux. An important thing to keep in mind, though, is that you **must not** launch a new process with the `RLIMIT_NOFILE` soft limit set higher than `FD_SETSIZE` (usually 1024), since [`select`](https://man7.org/linux/man-pages/man2/select.2.html) is hard-coded to that limit.
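Returning to the second option above, here is a rough sketch of what walking `/proc/self/fd/` can look like. This is not code from Procex; the helper name `close_fds_via_proc` is my own, and it simply closes every descriptor at or above `first`:

```c
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

// Sketch: close every open file descriptor >= first by walking /proc/self/fd/.
static int close_fds_via_proc(unsigned int first) {
  DIR *dir = opendir("/proc/self/fd");
  if (dir == NULL)
    return -1;
  int dir_fd = dirfd(dir);
  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    char *end;
    long fd = strtol(entry->d_name, &end, 10);
    // Skip "." and "..", the fd backing this directory stream,
    // and anything below the requested start.
    if (*end != '\0' || fd == dir_fd || fd < (long)first)
      continue;
    close((int)fd);
  }
  closedir(dir);
  return 0;
}
```

One reason not to take this route in the glue code below is that `opendir` and `readdir` allocate memory; with `vfork` that is less dangerous than with `fork`, but `close_range` is both simpler and faster.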
We must set the soft limit back to `FD_SETSIZE` before launching a new process, and that also provides a good fallback for older kernels: if we set `RLIMIT_NOFILE` to `FD_SETSIZE` in the new process, we only have to close the file descriptors below `FD_SETSIZE`, which is much faster than closing potentially millions of file descriptors.

# The code

Now that all these steps are in place, we can start implementing our glue code:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/select.h>
#include <linux/version.h>
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
#include <sys/syscall.h>

// glibc does not wrap close_range so we need to do it ourselves.
static int close_range(unsigned int first) {
  return syscall(__NR_close_range, first, ~0U, 0);
}
#else
static int close_range(unsigned int first) {
  // There could be fds above FD_SETSIZE, but this might not be a problem
  // because we reset the soft fd limit to FD_SETSIZE (1024) later.
  for (unsigned int i = first; i < FD_SETSIZE; i++)
    close(i);
  return 0;
}
#endif

// This contains the current environment.
extern char **environ;

// This is like vfork_close_execve but replaces the current process.
int close_execve(
  const char *path,
  char *const argv[],
  char *const envp[],
  int fds[],
  size_t fd_count
) {
  // We make sure the file descriptors in the array point to what they're
  // supposed to point to, since if e.g. one pointed to stdin (fd 0),
  // we want it to mean the old stdin, not the new stdin.
  for (size_t i = 0; i < fd_count; i++) {
    if (fds[i] != -1) {
      int fd = dup(fds[i]);
      if (fd == -1) return -1;
      fds[i] = fd;
    }
  }
  // Rename the file descriptors as specified,
  // closing the ones we don't want.
  for (size_t i = 0; i < fd_count; i++) {
    if (fds[i] != -1) {
      if (dup2(fds[i], i) == -1) return -1;
    } else {
      if (close(i) == -1) return -1;
    }
  }
  // We close all file descriptors that are larger than or equal to fd_count.
  if (close_range(fd_count) == -1) return -1;
  // Reset fd limit for compatibility with select(), see http://0pointer.net/blog/file-descriptor-limits.html.
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
  rl.rlim_cur = FD_SETSIZE;
  if (setrlimit(RLIMIT_NOFILE, &rl) < 0) return -1;
  return execve(path, argv, envp != NULL ? envp : environ);
}

// Fork, close file descriptors, then execute.
pid_t vfork_close_execve(
  const char *path, // The path to the executable; does not look through PATH.
  char *const argv[], // Will be passed verbatim to execve.
  // Will be passed verbatim to execve if not NULL, otherwise it will be set to the current environment.
  char *const envp[],
  // This is an array, fd_count long, of all the file descriptors we want to share.
  // In the new process the descriptors will be renamed: fds[i] will be renamed to i using dup2.
  // -1 means it will be closed.
  int fds[],
  size_t fd_count
) {
  // We mark this volatile so it isn't contained in a register.
  // In the child, we can not write to the parent's registers, only its memory.
  // This is used to pass the result code to the parent.
  volatile int result = 0;
  // We fork. If you changed this to a standard fork it would break, since we
  // wouldn't be able to write to the parent's memory.
  pid_t pid = vfork();
  // vfork had an error.
  if (pid == -1) {
    return -1;
  // We are in the child.
  } else if (pid == 0) {
    result = close_execve(path, argv, envp, fds, fd_count);
    _exit(1);
  // We are in the parent; the child must have finished by now.
  } else {
    // The child had an error.
    if (result != 0) {
      return result;
    // Success! Return the new PID.
    } else {
      return pid;
    }
  }
}
```

[This code on GitHub](https://github.com/L-as/procex/blob/master/cbits/glue.c)
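To make the interface concrete, here is a small, untested sketch of a caller, not part of Procex, that wires the child's stdin to a file and runs `cat`, roughly the `cat < file` case from earlier. The file name `input.txt` and the `main` wrapper are made up for illustration; the prototype matches the glue code above:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Provided by the glue code above (glue.c).
pid_t vfork_close_execve(const char *path, char *const argv[],
                         char *const envp[], int fds[], size_t fd_count);

int main(void) {
  // Open the file that should become the child's stdin.
  int file = open("input.txt", O_RDONLY);
  if (file == -1) { perror("open"); return 1; }

  // fds[i] becomes fd i in the child: fd 0 is the file,
  // fds 1 and 2 keep our stdout and stderr.
  int fds[] = { file, 1, 2 };
  char *const argv[] = { "cat", NULL };

  // NULL envp means the child inherits our environment.
  pid_t pid = vfork_close_execve("/bin/cat", argv, NULL, fds, 3);
  if (pid < 0) { perror("vfork_close_execve"); return 1; }

  int status;
  waitpid(pid, &status, 0);
  return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```

Note that `fds` only has to be as long as the highest descriptor you care about; everything at or above `fd_count` is closed by `close_range` in the child.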