Last year I wrote the first version of microcosmos, which is a webserver for running simple blogs. Since then I've been mulling over the biggest problem that microcosmos has: it's huge. To be precise, it's 113MB once compiled.
To be clear, this isn't just a problem with microcosmos. It's pervasive among Haskell programs. The venerable Pandoc document translator comes in at 68MB on its own. Cabal is 28MB. This isn't just a problem with compiled binaries, though - some libraries are so cumbersome to use that they literally cannot be compiled on lower-end devices.
Why?
Why are we lumbered with titanic binaries when other programming languages aren't?
Well, there are a few reasons.
Haskell's standard library is considerably smaller than that of most modern, mature languages. (For example, the Data.Map type, which is pervasive in Haskell, comes from the containers package and is not part of base.) This means that people are likely to turn to external libraries much earlier than one might in, say, Python.
There are multiple ways of doing everything. The case that springs to mind is string types. A program that interacts with 3 or more external libraries may well end up containing an implementation of String, Text, ByteString and lazy ByteString, and possibly more besides. The lack of canonical "use this and only this" basic data types has resulted in the patchwork of drop-in replacements that we see today. Sure, some are better suited to some tasks and some to others, but that's true in any programming language - the mantra of "use the best possible option" means that in practice, we end up with every option.
Haskell compiles everything statically, so your binary contains a compiled copy of every module you've used, directly or transitively. I don't actually think this is a problem.
GHC doesn't take many steps to reduce binary sizes. It compiles the code you've got, and then it's done. This means there's a lot of space in the binary which we can squeeze out.
People don't seem to care about binary size in the Haskell community. Now this isn't universal, but many people really don't seem to mind that binaries for Haskell programs are gigantic. Most online threads on the topic seem to conclude with "Yeah, it's because Haskell compiles everything statically, that's just the way it is."
What can I do about it?
If you want to produce smaller Haskell programs, you have a few tools at your disposal.
Dependencies
The first and most important: introduce dependencies with care. The
interpolatedstring-perl6 library gives you Perl-style string
interpolation within QuasiQuotes. This is a relatively simple feature,
so you might expect it to depend on TemplateHaskell and a couple of
other things, right? In fact, it transitively depends on Cabal.
It's surprising (and depressing) how many libraries depend on Cabal, or
lens, or some other behemoth. Avoiding such naughty libraries will not
only reduce your file size, but also reduce first-compilation times and
disk usage for other people who use your code.
If your needs are minimal, you might want to consider looking at
alternative libraries. Compare the dependency lists for lens and
microlens - which features do you need?
Some platform libraries come with a single import ___ line which then
imports a large number of modules on your behalf. Given that you are
compiling in every module which you're transitively using, it might be
worth replacing these catch-all imports with the specific parts of the
library you need.
Dynamic Linking
It is usual to compile your Haskell programs statically. This ensures
that the correct libraries and versions are available to the program,
and makes your binaries more portable. However, some Linux distributions
are moving over to a dynamically-linked model of Haskell binaries, which
means you can lift dependencies at the level of the package manager. If
this is the use case for you, then consider passing the -dynamic flag to
ghc, which will exclude the libraries from your binary. This will make
your binaries much, much smaller - but they will only be able to run on
computers which have the requisite set of libraries.
Stripping
This one's nice and easy! GHC includes symbol names by default. These
are helpful when debugging, but for compiled, distributed programs they
are not necessary. Running the strip command over your binaries will
shrink the file size by about 1/3, with no real downsides! (If you write
programs which crash a lot, though, maybe leave the symbols in!)
Thanks to @tsprlng on Twitter for showing me just how big of a difference this makes!
Compression
Once you've cut out the maximum possible amount of dead code, you
will probably still be left with a fairly large binary. Fortunately some
brilliant people (Markus F.X.J. Oberhumer, Laszlo Molnar, and John F.
Reiser) are on the case. They developed upx, which is a mature
executable packer. It doesn't delete code, it just removes the hot air
from your binary files - and there's a lot of hot air in a Haskell
binary. It's common to see the file size drop by 80-90%.
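The stripping and compression steps can be chained into a small post-build script. What follows is only a sketch, not part of microcosmos: it assumes the strip and upx executables are on your PATH, and the names megabytes, reportSize and shrinkBinary are made up for illustration. It uses getFileSize from the directory package and callProcess from the process package.

```haskell
import System.Directory (getFileSize)
import System.Process (callProcess)

-- | Convert a size in bytes to (decimal) megabytes for display.
megabytes :: Integer -> Double
megabytes bytes = fromIntegral bytes / 1e6

-- | Print a label alongside the current size of the file.
reportSize :: String -> FilePath -> IO ()
reportSize label path = do
  size <- getFileSize path
  putStrLn (label ++ ": " ++ show (megabytes size) ++ " MB")

-- | Strip symbols, then pack with upx, reporting the size after each step.
-- Both tools rewrite the file in place, so run this on a copy.
-- callProcess throws an exception if either command exits with a failure.
shrinkBinary :: FilePath -> IO ()
shrinkBinary path = do
  reportSize "original" path
  callProcess "strip" [path]
  reportSize "stripped" path
  callProcess "upx" [path]
  reportSize "packed" path
```

Calling shrinkBinary on a copy of your executable (from GHCi or a tiny main) prints the three sizes, which makes it easy to see how much each step contributes for your own program.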
Microcosmos Version 2
So, after learning the above, I tried to put it all into practice with the blog program Microcosmos.
The first thing I did was remove the dependency on Yesod, which is a
kitchen-sink-included web framework. Yesod was way, way too heavy for
this very simple task, and I was only using it due to familiarity (and
the nice DSLs it comes with!). I also went through and cut out any
dependencies which I was barely using (e.g. for a single helper
function), instead producing a small number of helper functions as a
replacement.
Microcosmos-Yesod | Microcosmos |
---|---|
containers | containers |
warp | warp |
http-types | http-types |
text | text |
time | time |
friendly-time | friendly-time |
filepath | filepath |
directory | directory |
yesod-static | wai |
yesod-core | bytestring |
shakespeare | file-embed |
blaze-html | split |
pandoc | process |
megaparsec | |
It might not look like I've made much difference here, but almost all of the dependencies on the right were already transitively included on the left. The ones on the right are also smaller, and depend on far fewer libraries transitively.
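To give a flavour of what replacing a barely-used dependency with a helper function can look like, here is a sketch using only base. These are not the actual helpers from microcosmos: trim and slugify are hypothetical examples of the sort of one-off function that might otherwise tempt you into pulling in a whole utility library.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)
import Data.List (dropWhileEnd, intercalate)

-- | Remove leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Turn a post title into a URL-friendly slug: lowercase the letters and
-- digits, and collapse every run of other characters into a single hyphen.
slugify :: String -> String
slugify = intercalate "-" . words . map toSeparator
  where
    toSeparator c
      | isAlphaNum c = toLower c
      | otherwise    = ' '
```

A dozen lines like these cost nothing in transitive dependencies, whereas a library imported for one function brings its whole dependency tree along with it.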
One notable thing is that I removed the dependency on Pandoc. While
microcosmos still uses Pandoc internally, it now calls out to it from
the shell rather than using the Haskell library. Slightly less neat,
perhaps, but much, much more concise.
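As a rough idea of what calling out to Pandoc can look like, here is a minimal sketch using readProcess from the process package. It is not the exact code from microcosmos: the names pandocArgs and renderMarkdown are made up for illustration, and it assumes a pandoc executable is installed and on the PATH of the machine running the server.

```haskell
import System.Process (readProcess)

-- | Arguments asking pandoc to read Markdown on stdin and write HTML to stdout.
pandocArgs :: [String]
pandocArgs = ["--from", "markdown", "--to", "html"]

-- | Render Markdown to HTML by running the external pandoc executable.
-- readProcess feeds the input string to the process on stdin, returns its
-- stdout, and throws an exception if the process exits with a failure code.
renderMarkdown :: String -> IO String
renderMarkdown = readProcess "pandoc" pandocArgs
```

The trade-off is that Pandoc becomes a runtime requirement instead of a compile-time one: the binary no longer carries it around, but the machine it runs on must have it installed.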
 | Microcosmos-Yesod | Microcosmos |
---|---|---|
Before compression | 113MB | 15MB |
After compression | 14MB | 1.84MB |
So following these techniques I was able to get from a 113MB binary down to something less than 2MB. (This is all still statically linked - when linking dynamically the binaries shrink to well under 100 kilobytes.)
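For the curious, the savings can be worked out from the figures in the table. This is just arithmetic wrapped in a sketch; percentSaved, roundTo1 and the labels are names invented here.

```haskell
-- | Percentage by which a size shrank, given before and after in the same unit.
percentSaved :: Double -> Double -> Double
percentSaved before after = 100 * (before - after) / before

-- | Round to one decimal place for display.
roundTo1 :: Double -> Double
roundTo1 x = fromIntegral (round (x * 10) :: Integer) / 10

-- | (label, size before in MB, size after in MB), taken from the table above.
steps :: [(String, Double, Double)]
steps =
  [ ("compressing the Yesod build", 113, 14)
  , ("compressing the new build", 15, 1.84)
  , ("dropping dependencies, uncompressed", 113, 15)
  , ("overall", 113, 1.84)
  ]

-- | Each step paired with its percentage saving.
savings :: [(String, Double)]
savings = [(label, roundTo1 (percentSaved b a)) | (label, b, a) <- steps]
```

Evaluating savings gives roughly 87.6% and 87.7% for the two compression steps, in line with the 80-90% quoted for upx, about 86.7% from trimming dependencies alone, and about 98.4% overall.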
You might not be someone who cares about big executable sizes, but if you are, please consider implementing some of the above! Save a life, drop a dependency :)