This is the story of how we upgraded over 2000 Ubuntu production servers – turning over millions an hour – by installing the operating system in memory, wiping the root disk and reinstalling the OS back on disk from RAM. We did it with zero data loss, and it saved us a lot of time and money in support. It also took months of careful planning and many, many tests.
We run a cloud based hosting platform for Magento, Hypernode. Magento is a somewhat beefy e-commerce framework that in some cases requires a lot of firepower to operate with acceptable performance. This means we don’t control all the software that runs on the servers and we occasionally have to jump through hoops to keep our customers happy.
In this case those hoops had to do with static IPs. We purposely don’t employ dynamic IPs to retain multi-cloud deployment capabilities and prevent vendor lock-in with one platform. Unfortunately, in the e-commerce world it is commonplace to assume that your IP doesn’t change so you can have it whitelisted by external services. If a server IP changes, we have to call up our customers and inform them that they may have to notify their vendors.
A normal upgrade would entail provisioning a new server with the latest Ubuntu version and migrating the customer there. We could save a lot of effort if we could instead upgrade the whole fleet of servers in place, two LTS releases into the future (12.04 Precise to 16.04 Xenial), and keep their IPs, but only if the downtime stayed short.
To release-upgrade a running 12.04 system to 16.04, you have to go through 14.04 (Trusty) first, upgrading the system twice. Upgrading to Trusty first would increase transition time and risk: more variables and moving parts. This is why we wanted to find a way to skip 14.04 and go directly to 16.04.
Previously, we had heard about people performing in-place total reinstall hacks. This technique is popular for installing Linux distributions on cloud instances that do not officially support them, like Arch Linux on DigitalOcean. At first this strategy did not seem feasible to perform safely on this many servers, especially considering that these machines combined do millions and millions in transactions and any catastrophic failure would be very bad for business.
But the idea of completely reinstalling a server from memory, on the fly, was very appealing; it would allow us to upgrade to Xenial in one go. Investigating possible strategies, we also found the very interesting takeover.sh script, which contains this friendly warning:
This is experimental. Do not use this script if you don’t understand exactly how it works. Do not use this script on any system you care about. Do not use this script on any system you expect to be up. Do not run this script unless you can afford to get physical access to fix a botched takeover. If anything goes wrong, your system will most likely panic.
If we still needed encouragement, this did the trick. The game was on!
We decided to spend 2 weeks investigating a safe way to perform an in-place upgrade. If we weren’t confident enough after 2 weeks, we would abandon the plan and do a – boring but proven – regular migration. As it turned out, we needed only 1 week to perform a smooth, manual in-place upgrade and win the bet some more sceptical colleagues had set up.
The process goes like this: we first install a minimal version of the new operating system completely in memory on the server using debootstrap. Then we stop all services and do a pivot_root after which the new operating system is in use (except for the kernel). Up until this point, nothing has changed on the hard disk and there is nothing to worry about. After a reboot everything would be back to normal.
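The reversible first phase described above can be sketched as an ordered list of commands. This is an illustration under stated assumptions, not the script we ran: the mount point `/mnt/ramroot`, the `oldroot` directory name and the 2G tmpfs size are hypothetical, and stopping services is left out because it depends on the node. The code only builds the plan; it does not execute anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    description: str
    argv: List[str]


def build_in_memory_plan(ram_root="/mnt/ramroot", put_old="oldroot",
                         suite="xenial", tmpfs_size="2G"):
    """Return the commands for the phase that leaves the disk untouched.

    All default values are assumptions made for this sketch.
    """
    old_root_mount = f"{ram_root}/{put_old}"
    return [
        Step("create the mount point for the in-memory root",
             ["mkdir", "-p", ram_root]),
        Step("back the new root with RAM",
             ["mount", "-t", "tmpfs", "-o", f"size={tmpfs_size}", "tmpfs", ram_root]),
        Step("install a minimal system into RAM",
             ["debootstrap", "--variant=minbase", suite, ram_root]),
        Step("create the directory that will receive the old root",
             ["mkdir", "-p", old_root_mount]),
        # pivot_root NEW_ROOT PUT_OLD: PUT_OLD must live underneath NEW_ROOT.
        Step("swap the root filesystem; the old root ends up under /" + put_old,
             ["pivot_root", ram_root, old_root_mount]),
    ]


if __name__ == "__main__":
    for step in build_in_memory_plan():
        print(f"# {step.description}")
        print(" ".join(step.argv))
```

Keeping the plan as data rather than a shell script makes it easy to review each step and to attach a check to it before anything destructive happens.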
Then comes the point of no return. We wipe the root except for the customer data and install the new operating system over the old one by debootstrapping Xenial once more, but now on disk. If the process fails after this point of no return, there is no way back without using our backups.
A word of reassurance: even if this worst case scenario were to take place, we were well prepared. Luckily we have a fully automated backup restoration process that has been thoroughly battle-tested. The result of restoring to a backup node in case of emergency would be the equivalent of a normal migration to a new node – new IP but no data loss.
After the reinstall, the server can be rebooted and the completely new operating system will be live, including the new kernel. A fun side effect of the pivot_root was that we had to reboot by writing to the sysrq kernel trigger because the system would be in such a state that a graceful reboot was no longer possible.
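A forced reboot through the SysRq trigger can be sketched like this. The post only says we wrote to the trigger; the sync and remount-read-only keys before the reboot key, the pause between them, and the `proc_root` parameter (there so the function can be pointed at a scratch directory) are assumptions of this example.

```python
import time
from pathlib import Path


def force_reboot(proc_root="/proc", pause=2.0):
    """Reboot without involving init by writing keys to the SysRq trigger.

    's' asks the kernel to sync, 'u' remounts filesystems read-only and
    'b' reboots immediately. Must run as root on a real system.
    """
    trigger = Path(proc_root) / "sysrq-trigger"
    for key in ("s", "u", "b"):
        trigger.write_text(key)
        if key != "b":
            # Sync and remount are started asynchronously; give them a moment.
            time.sleep(pause)
```

Calling `force_reboot()` on a live machine reboots it at once, so it belongs at the very end of the blind script.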
We could manage most of the process from the outside over SSH, but at a certain point we’d have to upload a script and execute it blindly to perform the last steps and reboot. Tweaking this script to work on all types of nodes was quite the exercise in trial and error. If the server did not come back up, we’d be lucky to get an emergency shell or VNC.
At one point we were debugging solely based on time-sensitive Amazon EC2 instance console screenshots to get the bootloader configuration right, which often didn’t give you more than this:
The most difficult parts of this process were the subtle differences in base images at cloud providers and hypervisor environments. There were issues with the absence or presence of things like ‘cloud-init’, custom bootloaders, slightly different kernels and generic configuration drifts like resolvconf using symlinks on older systems but not on newer ones.
After sacrificing many test servers we got it right, and the resulting script for installing Xenial in memory, pivoting to this new Linux and installing Xenial on disk before switching back, looked something like this.
We had what we needed but there was one challenge left. We had to make this reliable to perform over two thousand times without any hiccups.
For Hypernode we have a very elaborate job processing system built on top of OpenStack TaskFlow. This system enables us to create flows which describe processes and their states, including reverts and retries for every step. All workers in this system are completely redundant. If one of the conductors executing a job were to explode, the next one would just pick up the process exactly where it left off.
To make the in-memory-reinstall a success every time, we broke up all steps of the script above in small pieces and came up with various sanity checks for each part of the process. For example, after unmounting all special filesystems, are we sure no process is still holding any related file descriptors open that might throw a spanner in the works?
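A check of that kind can be sketched by walking `/proc` and looking at where each open file descriptor points. This is a minimal example, not our production check; the function name and the `proc_root` parameter are made up for illustration, and it only looks at file descriptors, not at memory-mapped files or working directories.

```python
import os


def pids_holding_files_under(prefix, proc_root="/proc"):
    """Return the sorted PIDs that have an open fd at or below `prefix`."""
    base = prefix.rstrip("/")
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we may not inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were looking
            if target == base or target.startswith(base + "/"):
                holders.add(int(entry))
                break
    return sorted(holders)
```

An empty result from `pids_holding_files_under("/oldroot")` would be one of the conditions to satisfy before moving on; a non-empty one tells you which processes to deal with first.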
If any of the checks failed, our automation would revert any changes to the state of the system and retry if it could do so safely. If the number of safe retries ran out, an engineer would receive a PagerDuty alert about the aborted upgrade.
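The execute, check, revert and retry loop can be illustrated in a few lines of plain Python. This is deliberately not the TaskFlow API and not our production code: `Step`, `run_flow`, `UpgradeAborted` and the `alert` callback are hypothetical names, and the sketch only reverts the failing step rather than the whole flow.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Step:
    name: str
    execute: Callable[[], None]
    check: Callable[[], bool]    # sanity check run after execute
    revert: Callable[[], None]   # undo execute so a retry starts clean


class UpgradeAborted(Exception):
    pass


def run_flow(steps: Iterable[Step], max_retries: int = 2,
             alert: Callable[[str], None] = print) -> None:
    for step in steps:
        for _attempt in range(1 + max_retries):
            step.execute()
            if step.check():
                break  # sanity check passed, move on to the next step
            step.revert()
        else:
            # Retries exhausted: stand in for paging the on-call engineer.
            alert(f"upgrade aborted at step {step.name!r}")
            raise UpgradeAborted(step.name)
```

In this sketch `alert` defaults to `print`; in a real setup it would be whatever hands the message to the paging system.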
The Python code that makes up this logic composes a directed graph which we drew as a dendrogram to give a high level overview of all steps.
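To give an idea of what such a graph looks like in code, here is a small sketch using the standard library's `graphlib` (Python 3.9+). The step names and dependencies are invented for the example and are far coarser than the real flow.

```python
from graphlib import TopologicalSorter

# Each key depends on the steps in its set (hypothetical step names).
DEPENDS_ON = {
    "debootstrap_in_ram": set(),
    "stop_services": set(),
    "pivot_root": {"debootstrap_in_ram", "stop_services"},
    "wipe_old_root": {"pivot_root"},
    "debootstrap_on_disk": {"wipe_old_root"},
    "reboot": {"debootstrap_on_disk"},
}


def execution_order(graph):
    """Return one valid order in which every step follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())
```

`execution_order(DEPENDS_ON)` always places `pivot_root` after both of its prerequisites and `reboot` last; the relative order of the two independent first steps is not fixed.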
We upgraded around 100 to 200 servers per night and the work was fully automated. During the day an engineer would queue up the jobs for the night, and if anything went wrong or looked suspicious, the automation would wake them up to take a look.
In the end only around eight servers in our entire fleet failed at the first attempt. We investigated the problems with those machines the next day and added special cases for them to the automation. Depending on the situation, the jobs were then re-queued for a later run or the servers were restored from backup directly to 16.04.
To conclude, the in-place upgrade was a great success. It was like replacing the wheels on a moving vehicle, and it was only possible to do it safely because we have strict control over every facet of our platform. In the world of distributed systems it’s all about fencing off the influence of external dependencies, and if you are sure about the capabilities and limitations of your domain you can get away with some crazy things.