Generating small binaries in Haskell


2020-01-10 14:28

Last year I wrote the first version of microcosmos, which is a webserver for running simple blogs. Since then I've been mulling over the biggest problem that microcosmos has: it's huge. To be precise, it's 113MB once compiled.

To be clear, this isn't just a problem with microcosmos. It's pervasive among Haskell programs. The venerable Pandoc document translator comes in at 68MB on its own. Cabal is 28MB. This isn't just a problem with compiled binaries, though - some libraries are so cumbersome to use that they literally cannot be compiled on lower-end devices.

Why?

Why are we lumbered with titanic binaries when other programming languages aren't?

Well, there are a few reasons.

Haskell's standard library is considerably smaller than that of most modern, mature languages. (For example, the Data.Map type, which is pervasive in Haskell, comes from the containers package and is not part of base.) This means that people are likely to turn to external libraries much earlier than one might in, say, Python.
There are multiple ways of doing everything. The case that springs to mind is with string types. A program that interacts with 3 or more external libraries may well end up containing an implementation of String, Text, ByteString and lazy ByteString, and possibly more besides. The lack of canonical "use this and only this" basic data types has resulted in the patchwork of drop-in replacements that we see today. Sure, some are better suited to some tasks and some to others, but that's true in any programming language - the mantra of "use the best possible option" means that in practice, we end up with every option.
Haskell compiles everything statically, so your binary contains a compiled copy of every module you've used, directly or transitively. I don't actually think this is a problem.
GHC doesn't take many steps to reduce binary sizes. It compiles the code you've got, and then it's done. This means there's a lot of space in the binary which we can squeeze out.
People don't seem to care about binary size in the Haskell community. Now this isn't universal, but many people really don't seem to mind that binaries for Haskell programs are gigantic. Most online threads on the topic seem to conclude with "Yeah, it's because Haskell compiles everything statically, that's just the way it is."

What can I do about it?

If you want to produce smaller Haskell programs, you have a few tools at your disposal.

Dependencies

The first and most important: introduce dependencies with care. The interpolatedstring-perl6 library gives you Perl-style string interpolation within QuasiQuotes. This is a relatively simple feature, so you might expect it to depend on TemplateHaskell and a couple of other things, right? In fact, it transitively depends on Cabal. It's surprising (and depressing) how many libraries depend on Cabal, or lens, or some other behemoth. Avoiding such naughty libraries will not only reduce your file size, but also reduce first-compilation times and disk usage for other people who use your code.

If your needs are minimal, you might want to consider looking at alternative libraries. Compare the dependency lists for lens and microlens - which features do you need?

Some platform libraries come with a single "import ___" line which then imports a large number of modules on your behalf. Given that you are compiling in every module which you're transitively using, it might be worth replacing these catch-all imports with the specific parts of the library you need.

Dynamic Linking

It is usual to compile your Haskell programs statically. This ensures that the correct libraries and versions are available to the program, and makes your binaries more portable. However, some Linux distributions are moving over to a dynamically-linked model of Haskell binaries, which means you can lift dependencies at the level of the package manager. If this is the use case for you, then consider passing the -dynamic flag to ghc, which will exclude the libraries from your binary. This will make your binaries much, much smaller - but they will only be able to run on computers which have the requisite set of libraries.

Stripping

This one's nice and easy! GHC includes symbol names by default. These are helpful when debugging, but for compiled, distributed programs they are not necessary. Running the strip command over your binaries will shrink the file size by about a third, with no real downsides! (If you write programs which crash a lot, though, maybe leave the symbols in!)

Thanks to @tsprlng on Twitter for showing me just how big of a difference this makes!

Compression

Once you've cut out the maximum possible amount of dead code, you will probably still be left with a fairly large binary. Fortunately some brilliant people (Markus F.X.J. Oberhumer, Laszlo Molnar, and John F. Reiser) are on the case. They developed upx, which is a mature executable packer. It doesn't delete code, it just removes the hot air from your binary files - and there's a lot of hot air in a Haskell binary. It's common to see the file size drop by 80-90%.
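The stripping and compression steps can be chained into a small post-build script. This is only a sketch, not something from the microcosmos build: it assumes strip and upx are installed and on your PATH, the helper name shrinkWith is made up for illustration, and --best is upx's slowest, strongest compression setting. Both tools rewrite the file in place, so work on a copy if you want to keep the original.

```haskell
import System.Directory (getFileSize)
import System.Environment (getArgs)
import System.Process (callProcess)

-- Run an external tool over the binary (in place) and report the size change.
-- `shrinkWith` is a hypothetical helper, not part of any library.
shrinkWith :: String -> [String] -> FilePath -> IO ()
shrinkWith tool flags binary = do
  before <- getFileSize binary
  callProcess tool (flags ++ [binary])
  after <- getFileSize binary
  putStrLn (tool ++ ": " ++ show before ++ " -> " ++ show after ++ " bytes")

main :: IO ()
main = do
  args <- getArgs
  case args of
    [binary] -> do
      shrinkWith "strip" [] binary       -- drop symbol names
      shrinkWith "upx" ["--best"] binary -- pack the executable
    _ -> putStrLn "usage: shrink PATH_TO_BINARY"
```

Strip first and compress second: upx then has less to pack, and the packed file is not something strip should be run over afterwards.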

Microcosmos Version 2

So, after learning the above, I tried to put it all into practice with the blog program Microcosmos.

The first thing I did was remove the dependency on Yesod, which is a kitchen-sink-included web framework. Yesod was way, way too heavy for this very simple task, and I was only using it due to familiarity (and the nice DSLs it comes with!). I also went through and cut out any dependencies which I was barely using (e.g. for a single helper function), instead producing a small number of helper functions as a replacement.

Microcosmos-Yesod	Microcosmos
containers warp http-types text time friendly-time filepath directory yesod-static yesod-core shakespeare blaze-html pandoc megaparsec	containers warp http-types text time friendly-time filepath directory wai bytestring file-embed split process

It might not look like I've made much difference here, but almost all of the dependencies on the right were already transitively included on the left. The ones on the right are also smaller, and depend on far fewer libraries transitively.
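To give a feel for how little is needed once Yesod is gone, here is a minimal sketch of a web application written directly against wai, warp and http-types, three of the libraries in the new dependency list. It is not the actual microcosmos code: the port number and the response body are placeholders chosen for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp (run)

-- A WAI application is just a function from a request to a response.
-- This one ignores the request and always returns the same plain-text page.
app :: Application
app _request respond =
  respond (responseLBS status200 [("Content-Type", "text/plain")] "Hello from a small binary")

main :: IO ()
main = run 8080 app -- port 8080 is an arbitrary choice
```

Routing, templating and static files all have to be written by hand in this style, which is the trade-off for leaving the framework behind.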

One notable thing is that I removed the dependency on Pandoc. While microcosmos still uses Pandoc internally, it now calls out to it from the shell rather than using the Haskell library. Slightly less neat, perhaps, but much, much more concise.
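Calling out to Pandoc can be done with the process library, which is in the new dependency list. The following is a sketch under assumptions rather than the code microcosmos uses: it assumes a pandoc executable is on the PATH, and that the input is Markdown being converted to HTML (the post does not say which formats are involved). The function name markdownToHtml is made up for the example.

```haskell
import System.Process (readProcess)

-- Convert Markdown to an HTML fragment by piping it through the pandoc
-- executable. The formats are assumptions made for this example.
markdownToHtml :: String -> IO String
markdownToHtml markdown =
  readProcess "pandoc" ["--from=markdown", "--to=html"] markdown

main :: IO ()
main = markdownToHtml "# Hello\n\nSome *emphasis*." >>= putStr
```

readProcess throws an exception if pandoc exits with a non-zero status, so a real server would want to catch that rather than let one bad post take a request down.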

	Microcosmos-Yesod	Microcosmos
Before compression	113MB	15MB
After compression	14MB	1.84MB

So following these techniques I was able to get from a 113MB binary down to something less than 2MB. (This is all still statically linked - when linking dynamically the binaries shrink to well under 100 kilobytes.)
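As a quick check of the figures in the table, the percentage saved at each step can be worked out as follows. The function name percentSaved is just for illustration, and the sizes are the ones reported above, in MB.

```haskell
-- Percentage of the original size removed when going from `before` to `after`.
percentSaved :: Double -> Double -> Double
percentSaved before after = (1 - after / before) * 100

main :: IO ()
main = do
  putStrLn ("upx on Microcosmos-Yesod (113 -> 14):  " ++ show (percentSaved 113 14))
  putStrLn ("upx on Microcosmos (15 -> 1.84):       " ++ show (percentSaved 15 1.84))
  putStrLn ("Overall (113 -> 1.84):                 " ++ show (percentSaved 113 1.84))
```

Both compression steps come out at roughly 88%, inside the 80-90% range quoted for upx, and the overall reduction is a little over 98%.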

You might not be someone who cares about big executable sizes, but if you are, please consider implementing some of the above! Save a life, drop a dependency :)
