The Hypernode hosting platform for Magento is entirely cloud-based. Building on top of public infrastructure offers many advantages in terms of scalability and maintenance, but it also presents new challenges, such as building a backup system. In this post we describe how we built an efficient, highly durable and highly available backup system that will effortlessly scale with the rest of our platform.
Deciding how to store the data
One of the biggest considerations when designing a backup solution is deciding what software to use and what type of storage to use. A traditional way of storing backups would be to put them on a file server. Here at Byte we have large file servers and considerable experience managing them, and we have used them for backup storage in the past. Yet for Hypernode we chose a different route, for scalability reasons.
Using a managed backup storage provider offers the benefit of not having to worry about the file server yourself. While managing a file server is not hard, you do have to account for the fact that it will eventually fill up. When that happens, logic needs to be introduced to distribute the data over multiple servers.
An even more scalable solution than a managed file server would be object storage. Unlike most file-based storage systems, the object abstraction layer allows for horizontal scaling. Because we were looking to build a backup system that scales to thousands of nodes without us having to change a thing, object storage started to look like the best solution.
There are many object store providers out there. You could even build your own with projects like Ceph and Swift. We decided against that: pricing is already competitive enough that running our own would not be worth the time, and a public object store can provide higher peak throughput than a smaller self-built one.
Finding the best object storage provider
Now we had to find the best provider. We looked at many providers and, based on price, selected the three most interesting ones with data centers in Europe. We ended up testing with Amazon and Google. They simply had the most competitive rates at the time, and both allowed us to read and write in excess of 500 Mbit/s, which meant that network throughput would not be our bottleneck.
First we implemented Google Cloud Storage because it was marginally cheaper at the time, but we soon discovered that account isolation could become a problem. The Google object store did not offer the tooling to isolate credentials as conveniently as AWS does. Managing access control was a must-have feature for us: for security reasons alone, the thousands of Hypernode accounts must be isolated from each other. Amazon S3 offered the easiest and most complete out-of-the-box solution for this, with IAM and S3 policies.
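To illustrate the kind of isolation IAM makes possible, the sketch below builds a policy document that confines one account's credentials to its own prefix in a shared backup bucket. The bucket and prefix naming is purely illustrative, not our actual layout; in practice such a document would be attached to a per-account IAM user or role.

```python
import json


def account_policy(bucket: str, account: str) -> dict:
    """Build an IAM policy document that restricts one account's
    credentials to its own prefix in the backup bucket.
    Bucket name and prefix layout are illustrative only."""
    prefix = f"backups/{account}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing the bucket, but only within this prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Allow reading and writing objects only under this prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


print(json.dumps(account_policy("example-backups", "node123"), indent=2))
```

With a document like this per account, a compromised node can only touch its own backups, never those of a neighbour.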
We evaluated multiple approaches to pushing the data to the object store. The first options we looked at were full filesystems built on top of the S3 object store, namely S3FS and S3QL. They seemed a good fit at first, since they would allow us to simply rsync data to the mounts. More importantly, they would allow us to simply browse through backups as if they were stored locally. S3QL in particular looked promising, since it offers both deduplication and incremental snapshots.
After building a couple of proofs of concept, we discovered that a full restore of many small files using rsync was too slow. Even when parallelizing it with 10 threads, a full restore took too long. And compared to S3QL, S3FS did not really offer anything to make our backups efficient or incremental.
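The parallel restore experiment can be sketched roughly as follows: split the tree into batches and run one rsync per batch concurrently. The mount point, flags and worker count are illustrative of what we tried, not the exact commands we ran.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def partition(paths, workers=10):
    """Round-robin the restore paths over N workers, so each rsync
    process gets a comparable share of the directory tree."""
    batches = [[] for _ in range(workers)]
    for i, path in enumerate(paths):
        batches[i % workers].append(path)
    return [b for b in batches if b]


def parallel_restore(src, dst, paths, workers=10):
    """Run one rsync per batch in parallel against the S3-backed
    mount at `src`. Illustrative sketch, not the exact invocation."""
    def sync(batch):
        subprocess.check_call(
            ["rsync", "-a", *(f"{src}/{p}" for p in batch), dst]
        )

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so any rsync failure is raised here.
        list(pool.map(sync, partition(paths, workers)))
```

Even with this degree of parallelism, per-file round trips to the object store dominated the restore time for trees with many small files.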
Next we compared S3QL to duplicity, a backup tool that has been around for a while and supports many storage backends: object stores, SSH, and even unconventional backends such as IMAP.
With duplicity, full restores were far faster than with S3QL, and on top of that it also allows for incremental backups. Duplicity has one big downside, though: it works with backup chains, meaning that all incrementals are based on a full backup. Because we want to keep a daily 30-day retention, we have to store multiple full backups to avoid very long backup chains and the performance penalty and risk associated with them.
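Duplicity can bound chain length itself with its `--full-if-older-than` option, which starts a new full backup (and thus a new chain) once the previous full is older than a given age. The sketch below assembles such an invocation; the S3 URL scheme, paths and the 7-day/30-day values are illustrative assumptions (check your duplicity version's backend syntax), not our exact configuration.

```python
def backup_cmd(src, bucket, prefix, full_every="7D"):
    """Build a duplicity command that takes incrementals but starts a
    new chain whenever the last full backup is older than `full_every`,
    keeping chains short within the retention window."""
    target = f"s3://{bucket}/{prefix}"  # newer releases may use boto3+s3://
    return [
        "duplicity", "incremental",
        "--full-if-older-than", full_every,  # cap the chain length
        src, target,
    ]


def cleanup_cmd(bucket, prefix, keep="30D"):
    """Build the matching expiry command: delete chains whose backups
    are all older than the retention window."""
    return [
        "duplicity", "remove-older-than", keep,
        "--force",  # actually delete instead of just listing
        f"s3://{bucket}/{prefix}",
    ]
```

With weekly fulls, a restore never has to replay more than about a week of incrementals, at the cost of storing several full backups inside the 30-day window.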
The restore performance is what made us choose duplicity. The backup creation process is important, but the restore process even more so. Duplicity in combination with object storage was the only solution that gave us acceptable speed for a full restore while maintaining incremental backup functionality.