Your questions and remarks are welcome!
Just the facts, ma’am
Here are a few facts:
We wrote an OCaml daemon that prepares and publishes custom metrics to closely monitor some particularities of our systems. This daemon perfectly illustrates the “when it compiles, it works” phenomenon: it has been deployed on thousands of servers in various settings, with uptimes ranging from a few days to a few months, and it has never failed nor required a bug fix of any sort. Really.
We write custom deployment scripts in OCaml, effectively implementing the immutable server pattern. The scripts are robust, had few bugs (mostly caused by details of the underlying system) and are easy to extend and adapt. Various refactorings never broke functionality or introduced regressions of any sort.
We write custom backup procedures for some of our databases. After a week of development, the script entered the final phase of testing with exactly one (trivial) bug. It now dumps terabytes of data over hours of operation, without a hitch.
After these hopefully teasing statements, I would like to present some aspects of the recent development of this last program as a short testimonial.
Description of the backup procedure
We have some NoSQL databases hosted as a managed service in the AWS cloud, the largest of them several terabytes in size. Unfortunately the managed service does not offer any backup procedure, so we decided to roll our own. It is made of three procedures: dump, which performs a full dump of a database; incremental, which monitors and records the subsequent activity of the database; and, of course, restore, whose inputs are the artefacts produced by the dump and incremental procedures.
The dump procedure must read all the items from the database, write them to a file, compress this file and transfer it to permanent storage where it is archived. To make the procedure more resilient to errors, we decided not to work with one huge archive file holding all the items but with several smaller archive volumes. This choice:
makes it easy to concurrently fetch items, compress and upload volumes;
allows the procedure to work in constant disk space, independently of the size of the database being dumped;
makes it easy to resume reading items or to resume uploading archive volumes if the process gets interrupted.
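The cutting of the item stream into fixed-size volumes can be sketched with the standard Seq module. The function name chunks and the print-based sink are my own illustrative choices, not the program's actual code; the real program plugs in its database stream and its compress-and-upload sinks:

```ocaml
(* Sketch: lazily cut a stream of items into fixed-size archive
   volumes, so each volume can be written, compressed and uploaded
   independently.  [chunks] is a hypothetical name for illustration. *)

(* Cut [seq] into successive lists of at most [n] items, forcing each
   node of the underlying sequence exactly once. *)
let rec chunks n seq () =
  match seq () with
  | Seq.Nil -> Seq.Nil
  | Seq.Cons (x, rest) ->
      let rec take k s acc =
        if k = 0 then (List.rev acc, s)
        else match s () with
          | Seq.Nil -> (List.rev acc, Seq.empty)
          | Seq.Cons (y, r) -> take (k - 1) r (y :: acc)
      in
      let volume, rest' = take (n - 1) rest [x] in
      Seq.Cons (volume, chunks n rest')

let () =
  (* A stand-in for the database stream: seven items, volumes of 3. *)
  chunks 3 (List.to_seq [1; 2; 3; 4; 5; 6; 7])
  |> Seq.iter (fun volume ->
       (* The real sink would write, compress and upload the volume;
          here we only report its size (3, 3 and 1 items). *)
       Printf.printf "volume of %d items\n" (List.length volume))
```

Because the chunking is lazy, only the current volume is materialised in memory, which is what makes the constant-space property above possible.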
Choosing OCaml for the job
Of the 4 remaining days, 2 were spent writing the dump procedure itself and 2 were spent writing a deployment script and performing the first operational tests.
At a higher level, the dump procedure prepares the stream of items to be saved from the database. This stream is a lazy structure which, under the hood, handles all the low-level details of item retrieval (such as pagination of results). The stream is then subdivided into smaller streams of fixed size, each of which is given to a sink that writes its items to a volume file, compresses it and uploads it to permanent storage. There are two small additional logical units. One ensures throttling of item retrieval, so that the dump procedure interacts nicely with normal operation. The other guarantees that no more than a configurable but fixed number of volumes is being compressed and uploaded at any given instant, which implies that the procedure works in constant disk space, independently of the size of the database being dumped.
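The throttling unit can be sketched as a small token bucket. The names and the exact discipline are assumptions on my part (the real daemon may do it differently); the clock is passed in explicitly so the logic stays pure and testable:

```ocaml
(* Sketch of a throttling unit as a token bucket (an assumption, not
   the daemon's actual code).  The caller supplies the current time,
   so the logic involves no real sleeping and is easy to test. *)

type bucket = {
  rate : float;            (* tokens refilled per second *)
  capacity : float;        (* maximum burst size         *)
  mutable tokens : float;  (* tokens currently available *)
  mutable last : float;    (* time of the last refill    *)
}

let make ~rate ~capacity ~now =
  { rate; capacity; tokens = capacity; last = now }

(* Return [true] and debit one token if a read is allowed at time
   [now]; otherwise return [false], and the caller should back off. *)
let allow b ~now =
  let elapsed = now -. b.last in
  b.tokens <- min b.capacity (b.tokens +. (elapsed *. b.rate));
  b.last <- now;
  if b.tokens >= 1.0 then (b.tokens <- b.tokens -. 1.0; true)
  else false

let () =
  let b = make ~rate:1.0 ~capacity:2.0 ~now:0.0 in
  (* Two reads pass immediately (the burst); the third must wait. *)
  assert (allow b ~now:0.0);
  assert (allow b ~now:0.0);
  assert (not (allow b ~now:0.0));
  (* One second later, one token has been refilled. *)
  assert (allow b ~now:1.0)
```

The second unit, which caps the number of volumes being compressed and uploaded at once, would play a similar gatekeeping role: conceptually a counting semaphore acquired before starting a volume and released once its upload completes.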
The first test uncovered a single, trivial bug, which was quickly fixed. After that, the program could be used to do a full dump of a 2.5 TB database in about seven days (the throttling made it last that long).
In the first iteration of development after the completion of the working prototype, the program was expanded to split the dump operation across several machines. This iteration worked perfectly the first time it compiled, meaning that the refactorings performed introduced no regression. This is mostly attributable to the type system of OCaml, which is a powerful ally when refactoring, as it signals through type errors most of the places where a given code modification causes problems and needs special attention. I am now working on the second iteration, which should improve the logging facilities and the monitoring of the dump procedure.
Or rather, was available at the time of writing. ↩