Context

Today’s software systems are complex modular entities, made up of many interacting components that must be deployed in, and coexist within, the same context.

Modern operating systems provide the basic infrastructure for deploying and handling all the components that are used as building blocks for constructing more complex systems. These infrastructures are the very foundation of any distributed system, from the Internet to those on which Service Oriented Architectures are built.

The world is moving towards an on-line mode of operation where most activities are performed through services accessed over the Net. Since users rely on these services to carry out their daily work, service providers must guarantee a high quality of service. Long downtimes due to maintenance problems could compromise the work of thousands of people and cause significant losses, in terms of both money and reputation, for the parties who manage the service.

With the shift towards an on-line mode of operation, the “minimal acceptable standard” for the quality of service of distributed systems has been raised to a very high level. We almost never see an error page on Google Mail or on the Google search engine; but when it happens, even temporarily and for a single search, we already start to feel disappointed. Being unable to access an online service for several minutes might be considered unacceptable: a Google search page with a banner saying “Down for maintenance from 10am to 12am”, for example, would be inconceivable. And since small and medium enterprises rely more and more on these distributed systems for their daily business, this concern is not restricted to big corporations.

Every player in the modern information infrastructure faces the same dilemma:

  • on the one hand, they must keep their data centers and production machines in a healthy state, in order to guarantee the availability that today’s users expect;
  • on the other hand, they must keep their systems always up to date, not only to follow the evolution of the core technologies, but above all to address the security vulnerabilities that are discovered very frequently and must be fixed immediately to protect the system from malicious attackers who could compromise it with catastrophic consequences.

One of the most challenging environments where these issues arise is that of Free and Open Source Software (F/OSS) distributions. The F/OSS movement has gained, and is still gaining, momentum; applications developed by F/OSS communities are being deployed in many contexts, be it private users, education and research, public administration or professional Internet service providers. It is no surprise that big companies such as Google and Linden Lab, or large public bodies, like the French Ministry of Finance, build an important part of their information technology infrastructures on F/OSS components.

Free and Open Source Software: a complex, decentralised system

In the F/OSS world, components evolve independently from each other. Development projects may be short-lived or long-lived, have a central organizational structure or no organizational structure at all, and may be conducted by a single person, a small homogeneous group of developers, an open group whose contributors join or leave at will, or a large and geographically distributed group of developers. Projects may fall into a state of suspension, and suspended projects may be picked up and brought back to life under a completely different organizational structure. As a consequence, project management, objectives, quality assurance procedures, version control, release processes and contributors’ goals and motivations vary widely.

While it is in principle possible to deploy a complete F/OSS infrastructure by fetching the sources and recompiling each component from scratch and independently, the complexity of this task is daunting, and it has led to the development of what are now called distributions, which hold a privileged place in the F/OSS market. The problem has become so complex that there are now even intermediate agents creating software bundles from different sources, which are then in turn taken up by distribution editors. Examples are the CPAN network of Perl libraries, or bundles of typesetting utilities for the TeX system such as tex-live.

A distribution is a consistent and functional collection of software components comprising all the software that is necessary to set up a system, in other words a complete operating system. Distributions may be general purpose or cater for a specific application domain. There are desktop-oriented distributions for the general public (e.g., Mandriva, Ubuntu, Fedora, etc.), server-oriented distributions for managing and running distributed systems that provide services to users, and even more specific ones targeted at mobile phones, home appliances and so on. Some general-purpose distributions, like for instance Debian, have spawned domain-specific, so-called custom distributions, for instance for educational purposes (Skolelinux) or medical applications.

A distribution contains at least one operating system kernel (today often a GNU/Linux kernel and/or a BSD kernel; in the future we may in addition see the choice of a GNU Hurd kernel), all the essential software components that are necessary to make a basic system operational (usually an implementation of the well-known UNIX tools), and a choice of user applications.

A fundamental challenge: managing the evolution of the Free and Open Source Software complex system

Typically a distribution has an automated mechanism for managing the components it is made of. These components, in fact, are often provided in packaged form, i.e., in a format that can be easily processed by automatic tools, like dpkg and rpm, and that contains additional information useful for handling their installation, removal and update. The most important information concerns the specification of inter-package relationships such as dependencies (i.e., what a package needs in order to be correctly installed and to function correctly) and conflicts (i.e., what must not be present on the system in order to avoid malfunctioning).
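To make the role of this metadata concrete, the following sketch (in Python, purely for illustration; the package names, versions and relationships are hypothetical, and the model is far simpler than the real dpkg or rpm ones, which also handle version constraints, alternative dependencies and virtual packages) shows how dependencies and conflicts can be checked against an already-installed set of packages:

    from dataclasses import dataclass, field

    @dataclass
    class Package:
        # Simplified, hypothetical package metadata.
        name: str
        version: str
        depends: set = field(default_factory=set)    # package names that must be installed
        conflicts: set = field(default_factory=set)  # package names that must be absent

    def can_install(pkg, installed):
        """Naive installability check: every dependency is present and
        no conflict (in either direction) exists with the installed set."""
        names = {p.name for p in installed}
        missing = pkg.depends - names
        clashing = (pkg.conflicts & names) | {p.name for p in installed if pkg.name in p.conflicts}
        return not missing and not clashing

    # Toy scenario: a web server needing a TLS library and conflicting with an obsolete one.
    installed = [Package("libssl", "1.0"), Package("coreutils", "8.5")]
    webserver = Package("webserver", "2.2", depends={"libssl"}, conflicts={"libssl-old"})
    print(can_install(webserver, installed))   # True: dependency met, no conflict

Even this naive check hints at the combinatorial nature of the problem once tens of thousands of interdependent packages are involved.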

Because of the huge number of F/OSS components (and packages) available, it is not easy to manage the life-cycle of a distribution: users are allowed to choose and install a wide variety of alternatives whose consistency cannot be checked a priori to its full extent. It is thus easy to render the system unusable by installing or removing packages that “break” the consistency of what is already installed. In the case of commercial operating systems, where the core components are developed and controlled by a single entity, this problem is either mitigated or partly hidden: it is not possible, for example, to change the boot loader or the graphical subsystem of Windows, as only Microsoft is allowed to do that, consciously and hopefully safely; still, the dreaded DLL hell problem is fully present, even if users are too often left with the impression that any problem necessarily comes from a third-party vendor.

Another problem is ensuring the correct upgrade of such a system. This is even more complex than the installation of single packages, since we might have to preserve additional properties and find an “optimal path” to migrate the system from its current state to the targeted new state. Of course, the basic property that we want to preserve is the consistency of the system: since an upgrade is no more than the removal of some components and the installation of more recent ones, all the issues that apply to the installation of single packages are also valid in this context. There may, however, be additional constraints. For example, it would be reasonable to ask for an upgrade of the whole system that “preserves” the web-server infrastructure in its current state (e.g., we do not want our Apache 1.3 to be upgraded to the latest available version). Alternatively, it would be equally reasonable to ask for upgrades that minimize the total size of the installed packages (e.g., on limited devices or home appliances) or that remove the least number of packages (i.e., that preserve as much as possible of the previous system).
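As a purely illustrative sketch (the candidate plans, versions and figures below are invented, and real upgrade planners work on much richer models), such requirements can be seen as a hard constraint plus ranking criteria over the alternative solutions that a dependency solver might propose:

    # Hypothetical candidate upgrade plans, as a solver might enumerate them:
    # the resulting Apache version, the number of packages each plan removes,
    # and the total installed size in megabytes (all figures are made up).
    plans = [
        {"apache": "1.3", "removed": 2, "size_mb": 950},
        {"apache": "2.2", "removed": 0, "size_mb": 990},
        {"apache": "1.3", "removed": 7, "size_mb": 880},
    ]

    # Hard constraint: preserve the current web-server infrastructure.
    acceptable = [p for p in plans if p["apache"] == "1.3"]

    # Preference A: remove as few packages as possible.
    least_disruptive = min(acceptable, key=lambda p: p["removed"])

    # Preference B (e.g., for a limited device): minimize the installed size.
    smallest = min(acceptable, key=lambda p: p["size_mb"])

The two preferences select different plans from the same acceptable set, which is precisely the kind of trade-off an upgrade tool should expose to its users.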

This highlights the necessity of also taking non-functional aspects into account during package selection and installation. In this respect, non-functional parameters can be partitioned into two categories: (i) environmental parameters, which affect the downloading of packages, such as the availability and throughput of the network connections to the different sites hosting copies of the same package; and (ii) system parameters, which are proper to the running system and are affected by the characteristics of the different package versions, whenever a choice between functionally equivalent packages is possible.

All these constraints complicate the management of the system, and it would be useful to have tools that can handle this complexity by exploring the space of possible solutions and reporting to users, in a succinct but meaningful way, the different possibilities and trade-offs. Of course, these tools must also be able to enact the changes and take the system to its new state correctly. Additional constraints should be formulated to guarantee a certain level of Quality of Service when installing new packages, by taking into account non-functional aspects such as reliability, performance, etc.

Until now, these issues have been dealt with in ad-hoc ways. Maintainers of server machines, anxious to satisfy very high availability expectations, play conservatively and adopt a “do not touch anything” approach, keeping things that work in that state until modifications become unavoidable (e.g., to fix a severe security issue). Moreover, on critical systems, people often maintain and modify the installation by hand, seldom relying on automatic tools. This is reasonable when server installations are small, but becomes unacceptable as their size grows.

Software distributors strive to ensure the existence of safe upgrade paths between successive versions of their distribution, unfortunately with varying degrees of success. Distributions aimed at desktop users are under pressure to provide the latest “cutting-edge” software, and they are sometimes tempted to compromise quality assurance for the perceived benefit of being closest to the edge. Much too often this results in users following a “backup and redo from scratch” approach: they upgrade by actually performing a fresh installation of the new system, often after having unsuccessfully tried an “automatic update”. This may be reasonable for a single desktop system that does not require a heavily customized configuration and for which a downtime of several hours is without consequence, but it is absolutely unacceptable in an enterprise context with hundreds of workstations.

Given the challenging nature of the F/OSS world and its widespread adoption, we focus on this context in order to provide advanced technology for handling the scenarios described above. Our ultimate objective is therefore to improve the state of the art in upgrading and maintaining complex Free and Open Source platforms, with a particular focus on GNU/Linux distributions.