2021-07-23
Oddly, programmers use one programming language for their shell, yet another one to write programs.
When we need to run a lot of external commands, we use a shell scripting language, and when we need to write algorithms, we use a “real” programming language.
The core difference can be summarized as the presence or absence of data structures. Bash doesn’t support data structures well, unlike Haskell or any imperative language like Python.
When we try to use Bash to handle structured data, it quickly goes wrong. Take a look at this Stack Overflow answer for how to split a string into an array using ", " as a delimiter:
readarray -td '' a < <(awk '{ gsub(/, /,"\0"); print; }' <<<"$string, "); unset 'a[-1]';
declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")
This is the answer buried underneath many other almost-correct yet still incorrect answers. Quite honestly, this is horrifying. What ought to be a simple function call has been mangled into something beyond recognition.
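For contrast, in a language with data structures this really is a simple function call. A sketch in Haskell, hand-rolled here so it needs only base (the split package provides an equivalent Data.List.Split.splitOn ready-made):

```haskell
import Data.List (isPrefixOf)

-- Split a list on a (non-empty) delimiter, e.g. a String on ", ".
splitOn :: Eq a => [a] -> [a] -> [[a]]
splitOn d = go
  where
    go s = case chunk s of
      (pre, Nothing)   -> [pre]
      (pre, Just rest) -> pre : go rest
    -- chunk returns everything before the first delimiter, and
    -- the remainder after it (if the delimiter occurs at all).
    chunk s | d `isPrefixOf` s = ([], Just (drop (length d) s))
    chunk []                   = ([], Nothing)
    chunk (c : cs)             = let (pre, rest) = chunk cs in (c : pre, rest)

main :: IO ()
main = print (splitOn ", " "Paris, France, Europe")
-- ["Paris","France","Europe"]
```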
Yet, we still use Bash. For our shell, the most important thing isn’t how easy it is to split a string; it’s how fast and easily you can run an external command. Relying on Python for that won’t go well, as evidenced by this Stack Overflow question; the clean answers depend on sh!
Imagine if we could use one language for both.
There are many alternatives to Bash, but they are all fundamentally boring shells. Zsh, Fish, Oil, Elvish, Nushell, rc, es, and XS are domain-specific languages, and offer no real value as “real” programming languages. Would you write your IRC client in Elvish?
Instead of making a shell language that can do more than Bash, why don’t we go the other way around, and try making an existing language usable as a shell?
For it to be a good shell, we want to make running external commands as ergonomic as possible. It shouldn’t require multiple lines of code to read a file and pipe it to a process.
There are many languages with a REPL. If we were to make the Python REPL fit for use as a shell, we could make a library that eases running external commands, but due to Python’s syntax, executing external processes, the thing we do most in our shell, would still carry noticeable overhead. We could make a function with syntax similar to this:
r("ls", "-l")
While it’s not a lot of visual noise, it’s much slower to type than ls -l. The quotes, commas, and parentheses take up our valuable time.
Haskell, on the other hand, has a much more lightweight syntax. Throughout this blog post I manage to make the following syntax possible:
ε #ls#_l
Its REPL, GHCi, is also quite featureful, importantly supporting path completion.
While longer than ls -l, for arguments without special characters and capital letters it grows at the same rate: 1 key stroke of overhead per argument.
To do this, I made my own library Procex.
Why did I make another library when shh, shell-conduit, Shelly.hs, etc. already exist? The reason is that these solutions are all designed around createProcess. createProcess doesn’t support all the features you’d expect on a Unix system, notably passing arbitrary file descriptors through to the called process.
In addition, it has the issue that launching processes takes a non-trivial amount of time, around 0.5 seconds on my ODROID N2 (which is unusual hardware I admit).
On POSIX-like systems, you generally need to close all file descriptors you don’t want to pass on to the new process. These could be handles to files, pipes, etc.; notably, stdin, stdout, and stderr are the first three file descriptors (in that order). The limit on file descriptors differs per system; on mine it’s 1048576. createProcess is implemented such that it loops from stderr+1 up to this limit, closing every file descriptor in that range. On my ODROID N2, this takes around a second, meaning I had to wait an extra second for every command to execute. This was not usable.
Procex doesn’t have this problem; the specifics are detailed here, though they are not of great importance to this article.
Let’s start off by loading Procex up in GHCi, after installing the procex package. The invocation depends on your system, but if you’re using cabal it will generally be:
$ cabal update
$ cabal install procex --lib procex
$ cabal install pretty-simple --lib pretty-simple # Heavily recommended, gives us Text.Pretty.Simple.pPrint
$ ghci -Wall -Wno-type-defaults -XExtendedDefaultRules -XOverloadedStrings -interactive-print Text.Pretty.Simple.pPrint
> import Procex.Prelude
> import Procex.Shell
> import Procex.Shell.Labels
GHCi has a couple of problems Procex helps us work around, notably, stdin is not set to line buffering, and changing directories doesn’t affect path completion.
To fix the former, we run:
initInteractive
This is equivalent to hSetBuffering stdin LineBuffering.
The latter can be fixed by doing the following:
:set prompt-function promptFunction
Now we can use cd from Procex.Shell, and path completion will follow your current working directory, whereas before, path completion would always be relative to the directory you started GHCi in. As a side effect, the prompt will also be changed.
Procex has the concept of commands, which represent a process to execute, along with the arguments and file descriptors we want to pass to it.
To create a command, we can use the mq function. After mq you can write the arguments you want to pass, wrapping them in quotes, but without any commas, parentheses, or similar.
Listing the current directory:
mq "ls" "-l"
Or if you want to use the short syntax:
mq #ls#_l
The labels (prefixed with #) are interpreted as strings, where _ is replaced by -, since that character is illegal in labels.
The helpers you’ll likely be interested in are all in Procex.Quick.
diff-ing two strings, then capturing the output:
diff :: ByteString -> ByteString -> IO ByteString
diff x y = captureLazyNoThrow $ mq "diff" (pipeArgStrIn x) (pipeArgStrIn y)
cat-ing a string:
mq "cat" <<< "Hello World!\n"
Piping curl to kak:
mq "kak" <| mq "curl" "-sL" "ipinfo.io" -- The reverse will wait for curl to end instead of kak
stat-ing all the entries in your directory:
import System.Directory
listDirectory "." >>= mq "stat"
Piping curl to a file:
captureLazy (mq "curl" "-sL" "ipinfo.io") >>= B.writeFile "./myip.json"
Piping stdout and stderr to different places:
import qualified Data.ByteString.Lazy as B
mq "nix"
  "eval"
  "nixpkgs#hello.name"
  (pipeHOut 1 $ \_ stdout -> B.hGetContents stdout >>= B.putStr)
  (pipeHOut 2 $ \_ stderr -> B.hGetContents stderr >>= B.writeFile "./log")
pipeHOut gives us the raw handle, allowing us to handle the data in Haskell with all the usual Haskell libraries.
In general, it is a better idea to rely on Haskell alternatives to the tools in coreutils, as those tools are made for Bash and traditional shells:
createDirectory instead of mkdir
removeFile instead of rm
createSymbolicLink instead of ln
replace-megaparsec’s streamEdit, etc. instead of sed, grep, etc.

To set this up permanently with Nix, you need to copy this directory, fix shellrcSrcPath, then refer to the derivation built by default.nix in your environment.systemPackages, or whatever you prefer. The derivation produces a single file bin/s that launches your shell.
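As a sketch of what the coreutils replacements look like in practice (directory ships with GHC, so these are available in the shell out of the box; the file names here are made up):

```haskell
import System.Directory
  (createDirectoryIfMissing, listDirectory, removeDirectory, removeFile)

-- mkdir -p scratch; echo ... > scratch/note.txt; ls scratch; rm; rmdir
demo :: IO [FilePath]
demo = do
  createDirectoryIfMissing True "scratch"   -- mkdir -p scratch
  writeFile "scratch/note.txt" "hello\n"    -- echo hello > scratch/note.txt
  files <- listDirectory "scratch"          -- ls scratch, as a real [FilePath]
  removeFile "scratch/note.txt"             -- rm scratch/note.txt
  removeDirectory "scratch"                 -- rmdir scratch
  pure files

main :: IO ()
main = demo >>= print  -- ["note.txt"]
```

Unlike the shell equivalents, the listing comes back as a real list of FilePaths you can filter, sort, or map over directly.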
The equivalent of your .bashrc will be in the ShellRC.hs file. GHCi commands will have to be put directly into default.nix. All the imports in your ShellRC.hs file will in addition be available in the shell. The :li command will reload the ShellRC.hs file from source instead of using the pre-compiled version from the Nix store.
Let’s make a $HOME/.ghci-shell.hs file, with the same purpose as the .bashrc file.
Let’s for now put this inside:
:set -Wall -Wno-type-defaults -XExtendedDefaultRules -XOverloadedStrings -interactive-print Text.Pretty.Simple.pPrint
import Procex.Prelude
import Procex.Shell
import Procex.Shell.Labels
:set prompt-function promptFunction
initInteractive
You can then launch your shell with:
env GHCRTS="-c" ghci -ignore-dot-ghci -ghci-script "$HOME/.ghci-shell.hs"
This should work fine, but your init script won’t be compiled, whereas it will with Nix.
While the number of characters isn’t very different compared to Bash, there are some tricks to make it faster to type.
I’m using a Japanese keyboard with the UK layout. I don’t use the extra Japanese keys, so I have rebound the Hiragana_Katakana key (2 keys right of space) to ", a valuable trick that is applicable to Bash too and has also saved my fingers from unnecessary pain holding down shift.
I’ve also renamed mq to ε as such:
ε :: (QuickCmd a, ToByteString b) => b -> a
ε = mq
I’ve bound my unused Muhenkan
key (1 key left of space)
to that to save another key stroke.
I recommend omitting extraneous spaces whenever possible, since the code in your shell is write-once-read-never:
ε #nix#build"nixpkgs#hello"#_o#out
Since I need to hold down shift to type _, I’ve mapped my unused Henkan key (1 key right of space) to it to save one more key stroke.
My .XCompose:
<Henkan> : "_"
<Hiragana_Katakana> : "\""
<Muhenkan> : "ε"
You’re likely better off doing this by modifying your XKB layout, but I didn’t want to delve into that mess.
With this we’re down to 34 key strokes on my keyboard.
The equivalent command in Bash:
nix build nixpkgs#hello -o out
This took me 31 key strokes, surprisingly quite close!
You could further save key strokes by renaming functions in Procex to shorter names; however, I am of the belief that the user should choose the names, not just for functions from Procex, but also for other common functions they use. I’ve myself made aliases to Data.ByteString.Lazy.UTF8.toString, Data.ByteString.Lazy.UTF8.fromString, and some other common functions I use a lot.
The first step was making my own glue code in C for interfacing with vfork and execve to create processes, as detailed here. You could do this in Haskell if you’re careful, but file descriptors in the child, which would effectively be another Haskell thread, would point to different things than in the parent. This is problematic since handles from the environment would suddenly point to different things, but only in the child. Because of this, the code that runs in the child before execve is in C.
If you didn’t bother reading the above article, the gist is that the glue code provides functions that combine the forking and execution, in addition to allowing file descriptors to be set up for the child. These are then bound inside Procex.Execve.
We interface with it from Procex.Core, which defines the core Cmd type. Cmd is internally Args -> IO (Async ProcessStatus), where Args is a record of the raw arguments to pass as ByteStrings, the file descriptors to pass (and how to map them), and which “executor” to use (used to allow exec-ing without fork-ing).
This design was chosen as it is easy to compose. The exported functions are:
makeCmd' :: ByteString -> Cmd: Takes the path to an executable and gives you a Cmd.
passArg :: ByteString -> Cmd -> Cmd: Passes an argument.
passFd :: (Fd, Fd) -> Cmd -> Cmd: Passes the second fd to the command, renaming it to the value of the first fd.
passArgFd :: Fd -> Cmd -> Cmd: Passes an argument that points to the fd, while passing the fd too. This allows process substitution, since opening the path (/proc/self/fd/$fd) will open what’s behind the file descriptor.
unIOCmd :: IO Cmd -> Cmd: Embeds the IO action inside the Cmd, executing the IO action when the Cmd is run.
postCmd :: (Either SomeException (Async ProcessStatus) -> IO ()) -> Cmd -> Cmd: Runs an IO action just after the process is launched.
run' :: Cmd -> IO (Async ProcessStatus): Runs the command and gives you a handle to a thread that is waiting for it to finish.
runReplace :: Cmd -> IO (): Replaces the current process with the process launched by the command.

Notably, Procex.Core does not expose any overlapping functionality, since it’s only meant to expose the core interface.
These all internally wrap the original function passed, resulting in a new function that takes Args. When we run a command, we simply pass it an empty Args; each “layer” then adds what it needs to it, finally reaching the root function defined in makeCmd', which calls the functions defined in the glue code (bound in Procex.Execve).
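The layering is easy to model. This is a deliberately simplified sketch with hypothetical stand-in types (the real Cmd runs IO and also carries file descriptors and an executor), just to show how each wrapper adds its part to Args before delegating inward:

```haskell
import Data.Function ((&))

-- Simplified stand-ins: the real Args also holds fd mappings and an executor.
newtype Args = Args { argv :: [String] } deriving Show
newtype Cmd = Cmd (Args -> Args)  -- real: Args -> IO (Async ProcessStatus)

-- The root layer: prepends the executable path.
makeCmd' :: String -> Cmd
makeCmd' path = Cmd (\args -> args { argv = path : argv args })

-- Each wrapper adds its argument, then delegates to the inner layer.
passArg :: String -> Cmd -> Cmd
passArg a (Cmd inner) = Cmd (\args -> inner args { argv = a : argv args })

-- Running starts from an empty Args and lets every layer contribute.
run' :: Cmd -> Args
run' (Cmd f) = f (Args [])

main :: IO ()
main = print (argv (run' (makeCmd' "echo" & passArg "hello" & passArg "world")))
-- ["echo","hello","world"]
```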
Procex.Process provides functionality that is commonly needed when executing processes, and wraps over Procex.Core. It defines a family of pipe* functions, which make pipes, then pass one end of the pipe (as a file descriptor) to the process, and the other end to something else.
In principle, we need nothing more, but this is not very ergonomic to use as a shell. Each argument we want to pass to a process needs a cmd & passArg "myarg", and passArg doesn’t even work when you’re in a shell:
Often, in our shell, we’ll pass paths as arguments, but if you pass non-ASCII paths to passArg as literals, they will get mangled. The IsString implementation of ByteString is not UTF-8 aware: instead of encoding each character as UTF-8, it simply truncates it to a single byte, corrupting anything non-ASCII.
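A quick demonstration of the mangling, using the Char8-style packing that underlies ByteString’s string literals (bytestring ships with GHC; ’パ’ is U+30D1):

```haskell
import Data.Char (chr, ord)
import qualified Data.ByteString.Char8 as BS8

-- Char8-style packing keeps only the low byte of each Char:
-- U+30D1 ('パ') silently becomes the single byte 0xD1.
mangled :: BS8.ByteString
mangled = BS8.pack [chr 0x30D1]

main :: IO ()
main = print (map ord (BS8.unpack mangled))  -- [209], i.e. 0xD1
```

A correct UTF-8 encoding of U+30D1 would be the three bytes E3 83 91, so the path no longer round-trips.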
To avoid this problem, we need a helper function that takes a
String
instead of a ByteString
, so that we
don’t use ByteString
’s IsString
instance.
In Procex.Quick we define a ToByteString class that has a single toByteString member. It has an instance for [a] where a ~ Char (defined this way to aid type defaulting), such that we can define functions that take any a where ToByteString a.
To attain a Bash-like syntax that is more concise, a QuickCmd class is defined, with quickCmd :: QuickCmd a => Cmd -> a.
It has three instances:

QuickCmd Cmd, which uses id for the definition.
(a ~ ()) => QuickCmd (IO a), which uses run for the definition, i.e. it synchronously waits for the command to finish and throws if the exit code is non-zero. The reason this isn’t an instance for IO () is again to aid type defaulting.
(QuickCmdArg a, QuickCmd b) => QuickCmd (a -> b), which means quickCmd cmd can result in another function that takes an a where QuickCmdArg a, then returns a b where QuickCmd b again.

QuickCmdArg
has all the instances you can guess: String, ByteString, etc. We actually can’t use ToByteString for our instances for QuickCmdArg, as that would 1) require UndecidableInstances and 2) make type inference not work in a lot of cases.
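The third instance is what makes mq variadic. The trick can be modeled in miniature with hypothetical names, accumulating a plain [String] instead of building a Cmd:

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- A tiny model of QuickCmd's function instance: `build` either returns
-- the accumulated arguments, or takes one more argument and recurses.
class BuildArgs r where
  build :: [String] -> r

instance BuildArgs [String] where
  build = reverse  -- arguments were accumulated in reverse

instance BuildArgs r => BuildArgs (String -> r) where
  build acc s = build (s : acc)

-- Like mq: takes as many String arguments as the use site demands.
cmd :: BuildArgs r => r
cmd = build []

main :: IO ()
main = print (cmd "ls" "-l" :: [String])  -- ["ls","-l"]
```

The result type at the use site picks the instance, so `cmd "ls" "-l" :: [String]` resolves through the function instance twice before bottoming out at the list instance.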
Wrapping it all up, we have the mq function that wraps makeCmd and quickCmd, as shown in the basic examples.
There are also various operators that wrap over Procex.Process and call Data.ByteString.Lazy.hGetContents for you, e.g. <<<, |>, <!|, and the capture* family of functions.
The capture* functions all wrap captureFdsAsHandles, which simply runs a command and provides the handles to the specified file descriptors. They all output a ByteString, which can be read lazily or strictly.
An important part of running commands is also checking for failures. In Bash, we have set -e. In Procex, run runs commands synchronously and waits for them to exit, throwing if the command failed. This obviously works fine for capture* functions that wait for the command to finish, but what about when we’re using lazy IO?
The answer is more lazy IO. We attach a “finalizer” to the ByteString (GitHub):
attachFinalizer :: IO () -> ByteString -> IO ByteString
attachFinalizer finalizer str = B.fromChunks <$> go (B.toChunks str)
  where
    go' :: [BS.ByteString] -> IO [BS.ByteString]
    go' [] = finalizer >> pure []
    go' (x : xs) = (x :) <$> go xs
    go :: [BS.ByteString] -> IO [BS.ByteString]
    go = unsafeInterleaveIO . go'
A Data.ByteString.Lazy.ByteString is internally isomorphic to a list of Data.ByteString.ByteString. By converting it to and then from a list of such chunks, we can insert lazy IO into it, executing the finalizer when we reach the nil case using unsafeInterleaveIO.
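A self-contained sketch of the finalizer in action, needing only bytestring (which ships with GHC); the finalizer fires only once the last chunk is demanded:

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BSC
import qualified Data.ByteString.Lazy as B

-- Insert lazy IO between the chunks; run `finalizer` at the nil case.
attachFinalizer :: IO () -> B.ByteString -> IO B.ByteString
attachFinalizer finalizer str = B.fromChunks <$> go (B.toChunks str)
  where
    go' :: [BS.ByteString] -> IO [BS.ByteString]
    go' [] = finalizer >> pure []
    go' (x : xs) = (x :) <$> go xs
    go :: [BS.ByteString] -> IO [BS.ByteString]
    go = unsafeInterleaveIO . go'

main :: IO ()
main = do
  done <- newIORef False
  s <- attachFinalizer (writeIORef done True)
                       (B.fromChunks [BSC.pack "hello ", BSC.pack "world"])
  readIORef done >>= print   -- False: nothing has been forced yet
  print (B.length s)         -- 11; this forces every chunk
  readIORef done >>= print   -- True: the finalizer has run
```

In Procex the finalizer is where the exit-code check lives, so forcing the captured output to its end is what triggers the throw for a failed command.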
In practice this works quite well, but sometimes we don’t want it to err, for example when we’re using diff. diff returns a non-zero exit code when the inputs differ, but we want to ignore that, so for each lazy capture* function there is a -NoThrow version. This could be extended to allow filtering which exit codes you want to ignore, but that would complicate the “quick” module, and if you want more advanced behavior, you’d likely be better off using Procex.Core and Procex.Process directly, then passing the resulting Cmd -> Cmd to mq.
Procex.Shell.Labels contains this:
{-# OPTIONS_GHC -Wno-orphans #-}
module Procex.Shell.Labels where
import Data.Functor
import Data.Proxy (Proxy (..))
import GHC.OverloadedLabels (IsLabel (..))
import GHC.TypeLits (KnownSymbol, symbolVal)
instance (a ~ String, KnownSymbol l) => IsLabel l a where
  fromLabel =
    symbolVal (Proxy :: Proxy l) <&> \case
      '_' -> '-'
      x -> x
Labels like #label, when -XOverloadedLabels is enabled, are translated into something like fromLabel @"label".
The reason it’s IsLabel l a where a ~ String instead of IsLabel l String is that with the latter, type inference wouldn’t work properly, meaning something like mq #echo wouldn’t type check.
With this instance, fromLabel @"label" will be inferred to be of the type String, causing it to be evaluated as "label".
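The instance can be exercised standalone; here it is again with the required extensions spelled out (the Procex module elides them), plus a tiny demo of the _ to - substitution:

```haskell
{-# LANGUAGE FlexibleInstances, LambdaCase, MultiParamTypeClasses,
             OverloadedLabels, ScopedTypeVariables, TypeFamilies #-}
{-# OPTIONS_GHC -Wno-orphans #-}

import Data.Functor ((<&>))
import Data.Proxy (Proxy (..))
import GHC.OverloadedLabels (IsLabel (..))
import GHC.TypeLits (KnownSymbol, symbolVal)

-- A label is elaborated to the symbol's text, with '_' swapped for '-'.
instance (a ~ String, KnownSymbol l) => IsLabel l a where
  fromLabel =
    symbolVal (Proxy :: Proxy l) <&> \case
      '_' -> '-'
      x -> x

main :: IO ()
main = putStrLn #nix_build  -- prints "nix-build"
```

TypeFamilies (or GADTs) is needed for the a ~ String constraint, and ScopedTypeVariables lets fromLabel refer to the instance’s l.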
This will likely conflict with other uses of labels, so you might not want it if you use other libraries that use labels.
In the beginning it was certainly painful; it was as if I had to relearn how to talk. Thankfully GHCi provides an escape hatch: :! allows you to shell out to sh easily.
In the process of switching my shell to Haskell, I also got a lot
faster at writing Haskell. Haskell is now the primary interface through
which I use my computers, and it has been very pleasant. I no longer
have to deal with regexes, since I can whip out a full parser combinator
library any time. You could likely also include a PostgreSQL library in
the shell to access databases without going through the
psql
REPL.
I’ve also been removing my scripts one by one completely, replacing
them with simple Haskell functions in my ShellRC.hs
, where
they can interface with structured data rather than raw bytes.
Advanced completion like in Fish would be quite nice, but unfortunately GHCi is a bit hard to customize due to its integration into the GHC source code. Perhaps a GHCi alternative external to GHC could be implemented, or the Idris REPL could be modified instead, since it seems more amenable to customisation.