Recently I’ve discovered a subtle consequence of how the order in
which dependencies are specified in debian actually matters. While
re-factoring the code of dose3, I changed the order in which
dependencies are considered by our sat solver (of edos-fame) . I
witnessed a twofold performance loss just by randomizing how variables
were presented to our sat solver. This highlights, on one hand, how
our solver is strongly dependent on the structure of the problem and,
on the other hand the standard practice of debian maintainers to
assign an implicit priority in the disjunctive dependencies where the
first is the most preferred packages (and maybe the most tested, at
least dependency-wise).
The basic idea of distcheck is to encode the dependencies information
contained in a Packages file in CNF format and then to feed them to a
sat solver to find out if a package has broken dependencies or if its
dependencies are such that no matter what, it would be impossible to
install this package on a user machine.
Conflicts are encoded as binary clauses. So if package A conflicts
with package B, I add a constraint they says “not (A and B)” , that is
A and B cannot be considered together. The dependencies encoding
associates to each disjunction of the depends field a clause that says
“A implies B”. If a package foo depends on A,B | C,D , I’ll add “foo
implies A and B” & “foo implies C and D” . This encoding is pretty
standard and it is easy to understand.
The problem is how the sat solver will search for a solution to the
problem “Is is possible to install package foo in an empty
repository”. The solver we use is very efficient and can easily deal
with 100K packages or more. But in general is not very good at dealing
with random CNF instances. The reason because edos-debcheck is so
efficient lies in the way it exploits the structure of the sat problems.
The goal of a sat solver is to find a model (that is a variable
assignment list) that is compatible with the given set of constraints.
So if my encoding of the debian repository is a set of constraints R,
the installability problem boils down to add an additional constraint
to R imposing that the variable associated to the package foo must be
true, and then ask the solver to find a model to make this possible.
This installation, in sat terms, would be just an array of variables
that must be true in order to satisfy the given set of constraints.
If you look at the logic problem as a truth table, the idea is to find
a row in this table. This is the solution of your problem. Brute force
of course is not an option and modern sat solvers use a number of
strategies and heuristic to guide the search in the most intelligent
way possible. Some of them try to learn from previous attempts, some
of them, when they are lost try to pick a random variable to proceed.
If we consider problems that have a lot of structure, award winning
sat solver do not back track very much. By exploiting the structure of
the problem, their algorithm allows them to considerably narrow down
the search only to those variables that are really important to find a solution.
All this long introduction was to talk about the solver that is
currently used in edos-debcheck and distcheck (that is a rewrite of
the edos-debcheck).
So why dependency order matters ? If we consider any package, even if
the policy does not specify any order in the dependencies, it’s common
practice to write disjunctive dependencies specifying the most
probable and tested alternative first and all other, more esoteric
choices later. Moreover real packages are considered before virtual
packages. Since every developer seems be doing the same, some kind of
structure might be hidden in the order in which dependencies are specified.
Part of the efficiency of the the solver used in our tools is actually
due to the fact that its search strategy is strongly dependent on the
order in which literal are specified in each clause. Saying the
package foo depends on A and B is “different” then saying it depends
on B and A, even if semantically equivalent.
In my tests, I found about a twofold performance loss if the order of
literals is either randomized or inverted. This is clearly a specific
problem related to our solvers, while other solvers might not be
susceptible to such small structural changes. Sat competitions often
employs some form of obfuscation strategies of well known problems
with well known structures in order to make useless to encode a search
strategy that exploits the specific structure of a problem.
Since here we’re not trying to win a sat competition, but to provide
tool to solve a specific problem, we are of course very happy to
exploit this structure.