OCaml is a versatile, general-purpose programming language which started its life as a research project led by the Cristal group at INRIA, the renowned French research institute. Yes, you got it: a language with that passport usually meets a lot of disdain outside of academic circles and is the object of many clichés thrown around by (usually misinformed) practitioners. Nevertheless, over the years this language and its cousin F# have made their mark in the industry, as the growing list of consortium members testifies. Some recent events reflecting this adoption are the acquisition of Unikernel Systems (an OCaml shop) by Docker and the release by Bloomberg of an OCaml-to-JavaScript compiler. There is, however, still a domain where one does not expect to meet OCaml: operations. We started to use OCaml for operations at quintly in place of the more obvious Python or JavaScript, and I am writing today to share our experience: it is an excellent choice!

Your questions and remarks are welcome!

Just the facts, ma’am

Here are a few facts:

  • We wrote an OCaml daemon to prepare and publish custom metrics to closely monitor some particularities of our systems. This daemon perfectly illustrates the “when it compiles, it works” phenomenon: it has been deployed on thousands of servers and in various settings, with uptimes ranging from a few days to a few months, and it never failed nor required a bug fix of any sort. Really.

  • We write custom deployment scripts in OCaml, effectively implementing the immutable server pattern. The scripts are robust, had few bugs (mostly caused by details of the underlying system) and are easy to extend and adapt. Various refactorings never broke functionality or introduced regressions of any sort.

  • We write custom backup procedures for some of our databases. After a week of development, the script entered the final phase of testing, and had exactly one (trivial) bug. It now dumps terabytes of data over hours of operation, without a hitch.

After these hopefully teasing statements, I would like to present some aspects of the recent development of this last program as a short testimonial.

Description of the backup procedure

We have some NoSQL databases hosted as a managed service in the AWS cloud, the largest of which are several terabytes in size. Unfortunately, the managed service does not offer any backup facility, so we decided to roll our own. It is made of three procedures: dump, performing a full dump of a database; incremental, monitoring and recording the subsequent activity of the database; and, of course, restore, whose inputs are the artefacts produced by the dump and incremental procedures.

The dump procedure must read all the items from the database, write them to a file, compress this file and transfer it to a permanent storage where it is archived. In order to make the procedure more resilient to errors, we decided not to work with one huge archive file holding all the items but with several smaller archive volumes; a minimal sketch of this scheme follows the list below. This choice:

  • makes it easy to concurrently fetch items, compress and upload volumes;

  • allows the procedure to work in constant disk space, independently of the size of the database being dumped;

  • makes it easy to resume reading items or to resume uploading archive volumes if the process gets interrupted.
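
To make this concrete, here is a minimal, sequential sketch of the volume scheme. It is not the actual program: volume_size is an invented setting, and the helpers fetch_batch, already_uploaded and write_compress_upload are hypothetical placeholders, stubbed here so that the sketch compiles.

    let volume_size = 10_000   (* items per volume; an assumed setting *)

    (* Hypothetical helpers; the real ones talk to the database and to
       the permanent storage. *)
    let fetch_batch _n : string list = []
    let already_uploaded _path = false
    let write_compress_upload path items =
      Printf.printf "volume %s: %d items\n" path (List.length items)

    let rec dump_volumes n =
      match fetch_batch volume_size with
      | [] -> ()                          (* no items left: dump complete *)
      | items ->
          let path = Printf.sprintf "volume-%06d" n in
          (* At most one volume lives on disk at a time, and volumes
             already archived are skipped on restart; a real resume would
             also persist the pagination marker to avoid re-reading. *)
          if not (already_uploaded path) then write_compress_upload path items;
          dump_volumes (n + 1)

    let () = dump_volumes 0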

Choosing OCaml for the job

The AWS cloud services we use are easily accessible through JavaScript (NodeJS) and Python APIs, and no native interface for OCaml is available.1 This means that picking OCaml implies writing a bit of plumbing code, which can take up to 2-3 days of development. This might seem a bit of a drawback, and many developers would prefer to use a language for which an API to the required services is readily available. This kind of decision is usually sealed with the “We don’t need to reinvent the wheel” cliché. During an internship at TÜV Rheinland, I had to process geographic information automatically and decided to write my programs in Python because I could find a library capable of reading and writing ESRI shape files — the de facto standard for storing geographic information. This turned out to be a poor decision, because the library was very brittle and poorly documented, and also because I am not nearly as productive when writing Python programs as when writing OCaml programs (I have been programming in OCaml since 1998). You will recognise these points in the nice list prepared by Dimitri about the disadvantages of code reuse. Of course there are a lot of situations where code reuse is desirable, but it is worth noting that it also has its trade-offs, especially since many people consider it a no-brainer. Based on this experience, I decided to write the dump procedure in OCaml and to take advantage of the JavaScript API to the required services, using the js-of-ocaml compiler, which turns OCaml bytecode into JavaScript.

Development

The development occupied me roughly 8 days, of which about 4 were spent writing OCaml bindings for the JavaScript API. While js-of-ocaml makes this as easy as possible, it took longer than expected because the original JavaScript API is very sloppily documented. For instance, when they write “The member X is empty.”, we cannot determine whether it means “X is the empty string”, “X is set to null” or even “X is not defined in the structure”. This sloppiness implies that many tests need to be conducted. The nice side of using OCaml is that the signatures of the OCaml functions cleanly document, once and for all, the answers to all these ambiguities. The time spent clarifying the original API is time well invested, because it produces an artefact put under version control, and the work does not need to be done a second time. If the API had been used directly from JavaScript, the time spent clarifying it would most probably not have been captured in concrete artefacts: its results would “stay in the air” for a while and, once they vanished, the time needed to produce them would have to be invested again.
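
As an illustration, here is a hedged sketch of how such an ambiguity can be pinned down in a js-of-ocaml class type. The structure and its field names are invented for the example; the point is that the type records, once and for all, whether a member documented as “may be empty” turned out to be nullable (Js.opt) or possibly absent from the structure altogether (Js.optdef). The ##. accessors require the js_of_ocaml ppx syntax extension.

    open Js_of_ocaml

    (* Hypothetical structure, with the findings of our tests encoded in
       the types: [marker] may be null, [region] may be missing. *)
    class type item = object
      method key : Js.js_string Js.t Js.readonly_prop
      method marker : Js.js_string Js.t Js.opt Js.readonly_prop
      method region : Js.js_string Js.t Js.optdef Js.readonly_prop
    end

    let describe (it : item Js.t) =
      let marker = Js.Opt.case it##.marker (fun () -> "<null>") Js.to_string
      and region = Js.Optdef.case it##.region (fun () -> "<absent>") Js.to_string in
      Printf.printf "key=%s marker=%s region=%s\n"
        (Js.to_string it##.key) marker region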

Of the 4 remaining days, 2 were spent writing the dump procedure itself and 2 were spent writing a deployment script and performing the first operational tests.

Writing the dump procedure was really fast, mostly because OCaml has a wonderful threading library called Lwt, which continues to work after the bytecode has been translated to JavaScript. This library offers a monadic interface for composing cooperative threads, and the many useful higher-level abstractions it provides can be leveraged to quickly implement rather complex workflows. Using a monadic interface to write multi-threaded code is a far superior approach to the callbacks used by JavaScript, either in their usual form or in conjunction with promises. The reason for this is that the monadic approach makes it simple to compose treatments, while callbacks are just spaghetti code in disguise — debugging or refactoring any significant application written with this technique should be a convincing experience! To put it shortly, a program structured with callbacks presents an inside-out control flow, where the elementary operations occupy the most visible place in the code while the control operations of the program are nested one level deeper, inside the callbacks. Monadic code is easier to write and understand because it presents a traditional control flow, where the elementary operations are hidden under higher-level abstractions. Promises perform the same salutary reversal; however, many APIs still rely on callbacks, so that using promises with them still requires writing some adapter code.
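
To give a flavour of the difference, here is a tiny sketch in the monadic style; fetch_page and store are hypothetical stand-ins for the bound JavaScript calls, stubbed so that the sketch runs. Each >>= replaces one level of callback nesting, so the steps read top to bottom.

    open Lwt.Infix

    (* Hypothetical stand-ins for the bound JavaScript calls. *)
    let fetch_page page : string list Lwt.t =
      Lwt.return [ Printf.sprintf "item-%d" page ]
    let store items : unit Lwt.t =
      Lwt_list.iter_s Lwt_io.printl items

    (* The control flow reads top to bottom: fetch, store, loop. With
       callbacks, each step would be nested inside the previous one's
       continuation. *)
    let copy_pages first last =
      let rec loop page =
        if page > last then Lwt.return_unit
        else
          fetch_page page >>= fun items ->
          store items >>= fun () ->
          loop (page + 1)
      in
      loop first

    (* Native entry point; under js-of-ocaml the JavaScript event loop
       drives the threads instead. *)
    let () = Lwt_main.run (copy_pages 0 2)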

At a higher level, the dump procedure prepares the stream of items to be saved from the database. This stream is a lazy structure, which under the hood handles all the low-level details of item retrieval (such as pagination of results). The stream is then subdivided into smaller streams of fixed size, each of which is given to a sink that writes its items to a volume file, compresses it and uploads it to a permanent storage. There are two small additional logical units. One ensures the throttling of item retrieval, so that the dump procedure interacts nicely with normal operation. The second guarantees that no more than a configurable but fixed number of volumes is being compressed and uploaded at any given instant, which implies that the procedure works in constant disk space, independently of the size of the database being dumped.
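
Condensed to its shape, the pipeline could look like the sketch below. This assumes a recent Lwt (for Lwt_stream.iter_n); fetch_next_page and compress_and_upload are hypothetical placeholders, the sizes and the 0.1 s pause are invented settings, and under js-of-ocaml the pause would use Lwt_js.sleep rather than Lwt_unix.sleep.

    open Lwt.Infix

    let volume_size = 10_000    (* items per archive volume; assumed *)
    let max_in_flight = 4       (* volumes compressed and uploaded at once *)

    (* Throttled item stream: one page is fetched at a time, with a pause
       between pages so the dump interacts nicely with normal operation. *)
    let items (fetch_next_page : unit -> string list option Lwt.t) =
      let pending = ref [] in
      let rec next () =
        match !pending with
        | x :: rest -> pending := rest; Lwt.return (Some x)
        | [] ->
            Lwt_unix.sleep 0.1 >>= fun () ->        (* throttling *)
            fetch_next_page () >>= function
            | None -> Lwt.return_none               (* end of table *)
            | Some page -> pending := page; next ()
      in
      Lwt_stream.from next

    (* Subdivide the item stream into fixed-size, numbered volumes. *)
    let volumes stream =
      let n = ref (-1) in
      Lwt_stream.from (fun () ->
          Lwt_stream.nget volume_size stream >|= function
          | [] -> None
          | chunk -> incr n; Some (!n, chunk))

    (* [iter_n] pulls a new volume only when one of its slots frees up,
       which bounds both the concurrency and the disk space in use. *)
    let dump fetch_next_page compress_and_upload =
      Lwt_stream.iter_n ~max_concurrency:max_in_flight
        (fun (n, chunk) -> compress_and_upload n chunk)
        (volumes (items fetch_next_page))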

The first test uncovered a single, trivial bug, which was quickly fixed. After that, the program could be used to do a full dump of a 2.5 TB database in about seven days — the throttling made it last that long.

Maintenance

In the first iteration of development after the completion of the working prototype, the program was expanded to split the dump operation across several machines. This iteration worked perfectly the first time it compiled, meaning that the refactorings performed introduced no regression. This is mostly attributable to the type system of OCaml, which is a powerful ally when refactoring, as it signals through type errors most of the places where a given code modification induces problems and requires special attention. I am now working on the second iteration, which should improve the logging facilities and the monitoring of the dump procedure.
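
As a toy illustration of this help, consider the hypothetical variant below, invented for the example: adding a case for the multi-machine split makes every pattern matching on the type non-exhaustive, and the compiler points at each place that must be revisited (as warnings, commonly promoted to errors).

    (* Hypothetical type describing where a dump runs; [Sharded] is the
       case added by the refactoring. *)
    type target =
      | Local of string                 (* dump from a single machine *)
      | Sharded of string * int         (* split across several machines *)

    let describe = function
      | Local host -> Printf.sprintf "full dump on %s" host
      | Sharded (host, n) -> Printf.sprintf "shard %d on %s" n host
      (* Until [Sharded] was handled here, the compiler flagged this
         match as non-exhaustive, pointing at the missing case. *)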

  1. Or rather, none was available at the time of writing.