Recently I discovered a subtle consequence of the fact that the order in which dependencies are specified in Debian actually matters. While refactoring the code of dose3, I changed the order in which dependencies are considered by our sat solver (of edos fame). I witnessed a twofold performance loss just by randomizing how variables were presented to the solver. This highlights, on one hand, how strongly our solver depends on the structure of the problem and, on the other hand, the standard practice of Debian maintainers of assigning an implicit priority to disjunctive dependencies, where the first alternative is the most preferred (and maybe the most tested, at least dependency-wise) package.

The basic idea of distcheck is to encode the dependency information contained in a Packages file in CNF format and then to feed it to a sat solver, to find out whether a package has broken dependencies, or whether its dependencies are such that, no matter what, it would be impossible to install this package on a user machine.

Conflicts are encoded as binary clauses. So if package A conflicts with package B, I add a constraint that says “not (A and B)”, that is, A and B cannot be installed together. The dependency encoding associates to each disjunction of the depends field a clause saying that the package implies one of the alternatives. If a package foo depends on A, B | C, D, I’ll add “foo implies A”, “foo implies B or C” and “foo implies D”. This encoding is pretty standard and easy to understand.
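To make the encoding concrete, here is a minimal sketch in Python. The variable numbering, the var helper and the toy depends list are illustrative assumptions; dose3 has its own (OCaml) data structures and its own solver.

```python
from itertools import count

_ids = {}
_counter = count(1)

def var(pkg):
    """Map a package name to a positive integer variable (DIMACS style)."""
    if pkg not in _ids:
        _ids[pkg] = next(_counter)
    return _ids[pkg]

def conflict_clause(a, b):
    # "not (A and B)" is the binary clause (not A or not B).
    return [-var(a), -var(b)]

def depends_clauses(pkg, depends):
    # `depends` is a list of disjunctions, e.g. [["A"], ["B", "C"], ["D"]]
    # for a field "Depends: A, B | C, D". Each disjunction yields one clause:
    # (not pkg or alt1 or alt2 or ...), i.e. pkg implies one of its alternatives.
    return [[-var(pkg)] + [var(alt) for alt in alts] for alts in depends]

clauses = depends_clauses("foo", [["A"], ["B", "C"], ["D"]])
clauses.append(conflict_clause("B", "D"))
```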

The problem is how the sat solver searches for a solution to the question “Is it possible to install package foo in an empty repository?”. The solver we use is very efficient and can easily deal with 100K packages or more, but in general it is not very good at dealing with random CNF instances. The reason edos-debcheck is so efficient lies in the way it exploits the structure of the sat problems it is given.

The goal of a sat solver is to find a model (that is, a variable assignment) compatible with the given set of constraints. So if my encoding of the Debian repository is a set of constraints R, the installability problem boils down to adding to R a constraint imposing that the variable associated with the package foo must be true, and then asking the solver to find a model. An installation, in sat terms, is just the set of variables that must be true in order to satisfy the given set of constraints.
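As a sketch of what this query looks like in practice, the snippet below forces the variable of foo to true with a unit clause and asks a SAT backend for a model. I use pycosat here purely as a stand-in backend; the actual tools use their own solver.

```python
import pycosat

def installable(clauses, package_var):
    # Force the package's variable to true with a unit clause,
    # then ask the solver for a model of R plus that constraint.
    result = pycosat.solve(clauses + [[package_var]])
    if result in ("UNSAT", "UNKNOWN"):
        return None  # no installation exists (or the solver gave up)
    # The model restricted to its true variables is an installation.
    return [v for v in result if v > 0]

# With the clauses and var() mapping from the previous sketch:
# installable(clauses, var("foo"))
```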

If you look at the logic problem as a truth table, the idea is to find a row of this table that satisfies every constraint: that row is the solution to your problem. Brute force of course is not an option (with 100K packages the table has on the order of 2^100000 rows), and modern sat solvers use a number of strategies and heuristics to guide the search as intelligently as possible. Some of them try to learn from previous attempts; some, when they are lost, pick a random variable to proceed.

On problems that have a lot of structure, award-winning sat solvers do not backtrack very much. By exploiting the structure of the problem, their algorithms considerably narrow down the search to those variables that are really important for finding a solution.

All this long introduction was to talk about the solver currently used in edos-debcheck and in distcheck (which is a rewrite of edos-debcheck).

So why does dependency order matter? Even though the policy does not specify any order for dependencies, it is common practice to write disjunctive dependencies with the most probable and best-tested alternative first and the other, more esoteric choices later. Moreover, real packages are listed before virtual packages. Since every developer seems to be doing the same, some kind of structure might be hidden in the order in which dependencies are specified.

Part of the efficiency of the solver used in our tools is actually due to the fact that its search strategy is strongly dependent on the order in which literals are specified in each clause. Saying that the package foo depends on A and B is “different” from saying it depends on B and A, even if the two are semantically equivalent.
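The kind of perturbation I experimented with is nothing more than reordering the literals inside each clause, which leaves the formula semantically unchanged. A rough sketch of such a test harness, reusing the clause lists from the earlier sketches and again with pycosat only as a placeholder backend (a generic solver may well be insensitive to the reordering), could look like this:

```python
import random
import time
import pycosat

def perturb(clauses, seed=42):
    # Shuffle the literals inside each clause: the CNF formula is
    # semantically unchanged, only its presentation differs.
    rng = random.Random(seed)
    out = []
    for clause in clauses:
        c = list(clause)
        rng.shuffle(c)
        out.append(c)
    return out

def timed_solve(clauses):
    t0 = time.perf_counter()
    pycosat.solve(clauses)
    return time.perf_counter() - t0

# Compare timed_solve(clauses) against timed_solve(perturb(clauses)),
# or against clauses with every literal list reversed.
```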

In my tests, I found about a twofold performance loss when the order of literals is either randomized or inverted. This is clearly a problem specific to our solver; other solvers might not be susceptible to such small structural changes. Sat competitions often employ some form of obfuscation of well-known problems with well-known structure, precisely to make it useless to encode a search strategy that exploits the specific structure of a problem.

Since here we’re not trying to win a sat competition, but to provide a tool that solves a specific problem, we are of course very happy to exploit this structure.