NixOS Clusters

Thoughts on Hyper-Converged Infrastructure with NixOS

I fully buy into the Nix way of having your infrastructure configuration as versioned code, ready to test on the local development machine.

My servers have their services compartmentalized into autonomous Linux systems using LXC containers or MicroVMs. With containers this is not so much for security reasons; rather, each one is a unit that I want to back up, update, or even move to another host. With virtual machines, removing much of the attack surface of a shared host kernel is a welcome side effect.

Simply running containers and virtual machines is fine by me and probably for many other small use cases. At a larger scale, however, the need for redundancy arises so that services remain operational in the event of hardware failure. It’s a call to duplicate servers. And because just two servers cannot properly decide which of them is the last one standing, having an odd number of cluster members, at least three, is the first recommendation in most cluster documentation. With Nix Flakes, duplicating servers is a no-brainer because all builds are reproducible and transferable with nix copy.
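To make the point about two servers concrete, here is a minimal sketch of plain majority voting. It assumes every member has exactly one vote and no tie-breaker; real cluster stacks offer additional options, so treat the function names and the model as illustrative only.

```python
def quorum(members: int) -> int:
    """Smallest number of votes that is a strict majority of `members`."""
    return members // 2 + 1


def has_quorum(reachable: int, members: int) -> bool:
    """True if a partition that still sees `reachable` members holds a majority."""
    return reachable >= quorum(members)


def tolerated_failures(members: int) -> int:
    """How many members may fail while the survivors keep a majority."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} members: quorum {quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this model a two-member cluster tolerates no failure at all, because a lone survivor holds one of two votes and cannot tell a dead peer from a broken link. Going from three to four members does not buy an additional tolerated failure either, which is why odd sizes are usually recommended.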

Moving containers and virtual machines across hosts, that is, stopping them on the source host and subsequently starting them on the target host, doesn’t have many strings attached because these virtualized systems are fairly self-contained. To automate that process in the event of hardware failure, I looked around for the standard solution on Linux servers. The popular answer seems to be Pacemaker, for which I discovered a dead pull request to nixpkgs. I revived it along with modules and a test for NixOS.

I got Pacemaker to take care of my systemd services: it starts a container on one server of the cluster and starts it on another if the first goes down. Pacemaker comes with a plethora of its own tools to operate the cluster and its resources. I wonder how much of that can be hidden behind a declarative setup.

Not all services are stateless, so storage must be synchronized. I was happy to discover that nixpkgs already ships three major cluster storage systems.

drbd replicates block devices between hosts. NixOS includes a test. After a proof of concept I’ve had second thoughts about how these devices are identified by bare numerical identifiers. I haven’t yet had a good idea for mapping my declarative configuration to this scheme in a way that is stable enough that configuration changes won’t cause storage chaos.
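The stability concern can be illustrated with a small sketch. It assumes the numerical identifiers are handed out by a generator that walks a list of named resources. The resource names and both helper functions are hypothetical, and nothing here talks to drbd itself.

```python
def numbers_by_position(resources):
    """Naive scheme: derive each identifier from the position in the list."""
    return {name: index for index, name in enumerate(resources)}


def renumbered(before, after):
    """Resources that survive a config change but end up with a new number."""
    return sorted(name for name in before
                  if name in after and before[name] != after[name])


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    old = numbers_by_position(["mail", "wiki", "db"])
    new = numbers_by_position(["mail", "db"])  # "wiki" was removed
    print(renumbered(old, new))  # "db" silently moved from 2 to 1

    # Pinning the number next to the name keeps the survivors where they were.
    pinned_old = {"mail": 0, "wiki": 1, "db": 2}
    pinned_new = {name: n for name, n in pinned_old.items() if name != "wiki"}
    print(renumbered(pinned_old, pinned_new))  # nothing moved
```

Pinning trades convenience for stability: the numbers become part of the configuration that has to be maintained by hand, which is exactly the kind of detail a declarative setup would ideally hide.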

glusterfs looks so easy to use that it seems almost too good to be true. NixOS includes a test. Downside: it only shares directory trees, not block devices.

ceph is several orders of magnitude more complex. NixOS includes three tests. It keeps RADOS block devices synchronized, and directory trees (CephFS), too. Ceph is powerful enough to deal with all sorts of environments but requires a much more carefully thought-out setup.

The cluster storage systems are crucial for keeping your stateful /var instances in sync. They will hopefully uphold consistency and availability in the event of a network partition. The /nix/store, on the other hand, could be synchronized in deployment scripts with a simple nix copy to the other servers of the cluster.
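A deployment script for the /nix/store part could be as small as the following sketch. The host names node2 and node3 and the ./result symlink (as left behind by a local build) are assumptions for illustration. The script defaults to a dry run that only prints the commands it would execute.

```python
import subprocess


def nix_copy_commands(installable, peers):
    """Build one `nix copy --to ssh://<peer>` invocation per peer host."""
    return [["nix", "copy", "--to", f"ssh://{peer}", installable]
            for peer in peers]


def sync_store(installable, peers, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first error."""
    commands = nix_copy_commands(installable, peers)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical peers; replace with the other members of the cluster.
    sync_store("./result", ["node2", "node3"])
```

Since store paths are immutable, rerunning the copy is harmless: paths that already exist on a peer should simply be skipped.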

I am considering what this might look like packed up in a reusable Nix flake. It should provide tooling to set up the various stateful parts. Then again, every setup is different, especially when it comes to the host’s network configuration. Example: availability is improved by distributing servers spatially. In that case I would like all intra-cluster communication to flow exclusively through WireGuard tunnels. It seems unnecessary to burn CPU like that in other cases where cluster machines have their own Ethernet segment because they just sit atop each other.

As with anything reusable, there are a lot of questions surrounding the balance between ease of use by just dictating opinionated defaults and a configuration schema that allows for maximum customizability. I am seeking opinions and general interest on that topic.