Package Managers Comparison - take 2

One year ago, we (the mancoosi team) published a comparison study of the state of the art of dependency solving in debian. As a few of you noticed, the data presented had a few glitches that I promised to fix. So we have repeated our tests using exactly the same data we used one year ago, but now with the latest versions of all package managers available in debian unstable.

During the last year, three out of the four solvers we evaluated released a major upgrade, so I expected many improvements in performance and accuracy:

  • apt-get 0.8.10 -> 0.9.7
  • aptitude 0.6.3 -> 0.6.7
  • smart 1.3-1 -> 1.4
  • cupt 1.5.14.1 -> 2.5.6

Mpm, our test-bench for new technologies, changed quite a bit under the hood as a consequence of the evolution of apt-cudf and the recent work done in apt-get to integrate external cudf dependency solvers.

Overall, the results of our study have not changed. All solvers but mpm, which is based on aspcud, fail to scale as the number of packages (and alternatives) grows. Smart is the solver that never gives up, but it hits the timeout (fixed at 60 seconds) most of the time. Aptitude always tries to give you a solution, no matter what, and as a result often returns solutions that do not satisfy the user request for one reason or another. Apt-get does surprisingly well, but it gives up pretty often, showing the incomplete nature of its internal solver. Cupt sometimes times out and sometimes gives up, but when it is able to provide an answer it is usually optimal, and it is very fast… Mpm consistently finds an optimal solution, but it sometimes takes a really long time to do it. Since mpm is written in python and not optimized for speed, this is not a big problem for us. The technology used by mpm is now integrated in apt-get, and I hope this will alleviate the problem.

All the details of our study can be found, as usual, on the Mancoosi website. For example, here you can find the results of mixing four major releases: sarge-etch-lenny-squeeze.

Comments are more than welcome.


apt-get with external solvers: call for testers

Last year we invited David to work with us for a few days to add a generic interface to apt for calling external solvers. After a few iterations, this patch finally landed in master and recently (about 3 months ago) in debian unstable.

    [ David Kalnischkies ]
    * [ABI-Break] Implement EDSP in libapt-pkg so that all front-ends which
      use the internal resolver can now be used also with external ones as
      the usage is hidden in between the old API
    * provide two edsp solvers in apt-utils:
      - ‘dump’ to quickly output a complete scenario and
      - ‘apt’ to use the internal as an external resolver

Today the new version of apt-cudf was uploaded to unstable, and with it the latest bug fixes that make it ready for daily use. I have used it quite a lot myself to upgrade my machine, and it seems to be working pretty well so far… The most important difference with the old version is the support for multi-arch enabled machines.

This marks an important milestone in our effort to integrate external solvers, built using different technologies, directly into apt. From a user perspective, this means having the possibility to check whether there exists a better (best?) solution to an installation problem than the one proposed by the internal apt solver. Moreover, even if apt-get gives very satisfactory answers, there are occasions where it fails miserably, leaving the user wondering how to unravel the complex web of dependencies to accomplish their task. The cudf solvers currently available in debian are aspcud, mccs and packup.

From an architectural point of view, this is accomplished by abstracting the installation problem via a simple textual protocol (EDSP) and using an external tool to do the heavy-duty translation. Since none of the solvers now available in debian is debian-specific, using them involves a two-step translation. The EDSP protocol specification is for the moment “hidden” in the apt documentation; I hope to find a better place for it soon. It would be cool if other package managers such as smart or cupt could add an implementation of EDSP to their code, so as to automatically benefit from this technology.

To make this happen, apt first creates an EDSP file that is handed over to apt-cudf, which takes care of the translation to cudf and back into EDSP, which is then read back by apt. Apt-cudf is the bridge between edsp and the external solvers: it does the bookkeeping and selects the right optimization criterion.
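For the curious, the scenario apt writes is a plain deb822-style document. From memory (field names may be slightly off; the authoritative specification ships with the apt documentation), a request looks roughly like this:

    Request: EDSP 0.4
    Architecture: amd64
    Install: gnome

    Package: gnome
    Version: 1:3.0+2
    APT-ID: 1234
    Depends: gnome-core, ...

with one Package stanza per package in the universe. The solver answers with stanzas referencing the same ids (for example “Install: 1234”), or with an Error stanza carrying a human-readable Message when no solution exists.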

Roberto recently wrote a very nice article explaining how to use apt-get with an external solver.

In a nutshell, if you want to try this out you just need to install apt-cudf and one external solver, such as aspcud from the University of Potsdam, and then call apt-get using the --solver option (which is not yet documented, see #67442). For example:

apt-get install -s gnome --solver aspcud

This will install gnome using an optimization criterion that tries to minimize the changes to the system. Different optimization criteria for all the apt-get default actions can be specified in the apt-cudf configuration file /etc/apt-cudf.conf.
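To give an idea, an entry in that file associates a CUDF optimization criterion to each apt-get action for a given solver. A sketch of what such an entry could look like, using the standard CUDF criteria language (the stanza syntax here is my reconstruction; check the file shipped with the package):

    solver: aspcud
    install: -count(removed),-count(changed)
    remove: -count(removed),-count(changed)
    upgrade: -count(removed),-count(notuptodate),-count(changed)

Criteria are lexicographic: the solver first minimizes the number of removed packages, then the number of changed ones, and so on.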

I hope the new release of apt-cudf makes it into testing before the freeze. Time to test!!!


Learning from the Future of Component Repositories - CBSE 2012

Learning from the Future of Component Repositories (Pietro Abate, Roberto Di Cosmo, Ralf Treinen and Stefano Zacchiroli) has been accepted for presentation at CBSE 2012 (26-28 June, Bertinoro, Italy).

Abstract

An important aspect of the quality assurance of large component repositories is the logical coherence of component metadata. We argue that it is possible to identify certain classes of such problems by checking relevant properties of the possible future repositories into which the current repository may evolve. In order to make a complete analysis of all possible futures effective, however, one needs a way to construct a finite set of representatives of this infinite set of potential futures. We define a class of properties for which this can be done.

We illustrate the practical usefulness of the approach with two quality assurance applications: (i) establishing the amount of “forced upgrades” induced by introducing new versions of existing components in a repository, and (ii) identifying outdated components that need to be upgraded in order to ever be installable in the future. For both applications we provide experience reports obtained on the Debian distribution.

The tools presented in this paper (outdated and challenges) are already in Debian as part of the ‘dose-extra’ package.

Update

For the second year in a row, our paper won the Best Paper Award at the CBSE 2012 conference!!!

Update 2

I presented this paper at CBSE 2012. The slides of my presentation are attached.


Mancoosi FOSDEM talks

A small (and probably incomplete) list of videos associated with the mancoosi project @ FOSDEM.

  • QA tools for FOSS distributions (FOSDEM 2012) - Pietro Abate (video)

  • Mancoosi tools for the analysis and quality assurance of FOSS (FOSDEM 2011) - Ralf Treinen (video)

  • Improving Rollback in Linux via DSL approach & distributing (FOSDEM 2011) - John Thomson (video)

  • Cross distro dependency resolution reusing solvers among distros (FOSDEM 2010) - Stefano Zacchiroli (video)

  • The MANCOOSI project (FOSDEM 2008) - Ralf Treinen (video)


Dependency order matters!

Recently I discovered that the order in which dependencies are specified in debian actually matters. While refactoring the code of dose3, I changed the order in which dependencies are considered by our sat solver (of edos fame), and I witnessed a twofold performance loss just by randomizing how variables were presented to the solver. This highlights, on one hand, how strongly our solver depends on the structure of the problem and, on the other hand, the standard practice of debian maintainers of assigning an implicit priority to disjunctive dependencies, where the first alternative is the most preferred (and maybe the most tested, at least dependency-wise) package.

The basic idea of distcheck is to encode the dependency information contained in a Packages file in CNF format and then feed it to a sat solver, in order to find out whether a package has broken dependencies, or whether its dependencies are such that, no matter what, it would be impossible to install the package on a user machine.

Conflicts are encoded as binary clauses. So if package A conflicts with package B, I add a constraint that says “not (A and B)”, that is, A and B cannot be installed together. For dependencies, the encoding associates to each disjunction in the depends field a clause saying that the package implies one of its alternatives. If a package foo depends on A, B | C, D, I'll add “foo implies A”, “foo implies B or C” and “foo implies D”. This encoding is pretty standard and easy to understand.
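A minimal sketch of this encoding in python (toy bookkeeping of mine, purely for illustration; the real dose3/edos code is written in OCaml):

    # one boolean variable per package; clauses are lists of integer
    # literals, a negative literal meaning "not installed" (DIMACS style)
    var = {"foo": 1, "A": 2, "B": 3, "C": 4, "D": 5}

    clauses = []

    # Depends: A, B | C, D  -- one clause per comma-separated disjunction
    clauses.append([-var["foo"], var["A"]])            # foo -> A
    clauses.append([-var["foo"], var["B"], var["C"]])  # foo -> (B or C)
    clauses.append([-var["foo"], var["D"]])            # foo -> D

    # Conflicts: A conflicts with B  --  not (A and B)
    clauses.append([-var["A"], -var["B"]])

    print(clauses)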

The problem is how the sat solver searches for a solution to the question “Is it possible to install package foo in an empty repository?”. The solver we use is very efficient and can easily deal with 100K packages or more, but in general it is not very good at dealing with random CNF instances. The reason edos-debcheck is so efficient lies in the way it exploits the structure of the sat problems.

The goal of a sat solver is to find a model (that is, an assignment of truth values to variables) that is compatible with the given set of constraints. So if my encoding of the debian repository is a set of constraints R, the installability problem boils down to adding to R one constraint imposing that the variable associated to the package foo must be true, and then asking the solver to find a model. An installation, in sat terms, is just the set of variables that must be true in order to satisfy the given constraints.
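Continuing the sketch above, the end-to-end check fits in a few lines (here with the pycosat bindings, again purely for illustration; any CNF sat solver would do):

    import pycosat

    # same toy universe as before: foo=1, A=2, B=3, C=4, D=5
    clauses = [
        [-1, 2],      # foo -> A
        [-1, 3, 4],   # foo -> (B or C)
        [-1, 5],      # foo -> D
        [-2, -3],     # not (A and B)
        [1],          # the request: foo must be installed
    ]

    model = pycosat.solve(clauses)
    if model == "UNSAT":
        print("foo is not installable")
    else:
        # the positive literals of the model are the packages to install
        print("installation:", [v for v in model if v > 0])

On this toy instance the answer is forced: the installation contains foo, A, C and D, since the conflict with A rules out B.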

If you look at the logic problem as a truth table, the idea is to find a row of this table: that row is the solution to your problem. Brute force of course is not an option, and modern sat solvers use a number of strategies and heuristics to guide the search in the most intelligent way possible. Some of them try to learn from previous attempts; some, when they are lost, pick a random variable to proceed.

If we consider problems that have a lot of structure, award-winning sat solvers do not need to backtrack very much. By exploiting the structure of the problem, their algorithms can narrow the search down to only those variables that really matter for finding a solution.

All this long introduction was to talk about the solver currently used in edos-debcheck and distcheck (which is a rewrite of edos-debcheck).

So why does dependency order matter? Even though the policy does not specify any order for dependencies, it is common practice to write disjunctive dependencies with the most probable and most tested alternative first, and all the other, more esoteric choices later. Moreover, real packages are usually listed before virtual packages. Since every developer seems to be doing the same, some kind of structure ends up hidden in the order in which dependencies are specified.

Part of the efficiency of the solver used in our tools is actually due to the fact that its search strategy depends strongly on the order in which literals are specified in each clause. Saying that package foo depends on A and B is “different” from saying that it depends on B and A, even if the two are semantically equivalent.

In my tests, I found about a twofold performance loss when the order of literals is either randomized or inverted. This is clearly specific to our solver; other solvers might not be susceptible to such small structural changes. Sat competitions often employ obfuscation strategies on well-known problems with well-known structure, precisely to make it useless to hard-code a search strategy that exploits the specific structure of a problem.
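If you want to reproduce this kind of experiment with a generic solver, the recipe is simply to permute the literals inside every clause, which leaves the problem semantically unchanged, and compare running times. A sketch (pycosat again as a stand-in; on a toy instance like this the difference is of course not measurable, you need something the size of a real Packages file):

    import random
    import time

    import pycosat

    def timed_solve(clauses):
        t0 = time.time()
        pycosat.solve(clauses)
        return time.time() - t0

    def shuffle_literals(clauses):
        # permute literals within each clause: same problem, different order
        out = []
        for clause in clauses:
            clause = list(clause)
            random.shuffle(clause)
            out.append(clause)
        return out

    clauses = [[-1, 2], [-1, 3, 4], [-1, 5], [-2, -3], [1]]
    print("original :", timed_solve(clauses))
    print("shuffled :", timed_solve(shuffle_literals(clauses)))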

Since here we are not trying to win a sat competition, but to provide a tool that solves a specific problem, we are of course very happy to exploit this structure.