edos-distcheck - new YAML output format

During the last two days I spent some time implementing part of the [http://wiki.debian.org/EDOS/ProposalDose3 proposed features for distcheck / edos-distcheck]. Since everybody is at debconf and talk is silver, but code is gold, I hope that a real implementation can get the ball rolling and bring us closer to a stable release of the next generation of edos/mancoosi tools.

In particular this post is about the new YAML output format for distcheck. The rationale for using YAML is to have a data structure that is at the same time human and machine friendly. There are a lot of scripts in debian that rely on distcheck, and we want to provide a grep-friendly output that at the same time doesn't hurt your eyes. The other proposed solution was to use JSON, but it was ditched in favor of YAML. We also removed the XML output.

In order to provide a machine-readable output and to minimize parsing mistakes, I used the schema language proposed here. This is the resulting data structure definition:

type: seq
sequence:
  - type: map
    mapping:
      "package": { type: str, required: true }
      "version": { type: text, required: true }
      "status":  { type: str, enum: [ broken, ok ], required: true }
      "installationset":
         type: seq
         sequence:
           - type: map
             mapping:
               "package": { type: str, required: true }
               "version": { type: text, required: true }
      "reasons":
         type: seq
         sequence:
           - type: map
             mapping:
               "conflict":
                  type: map
                  mapping:
                    "pkg1":
                      type: map
                      required: true
                      mapping:
                        "package": { type: str, required: true }
                        "version": { type: text, required: true }
                    "pkg2":
                      type: map
                      required: true
                      mapping:
                        "package": { type: str, required: true }
                        "version": { type: text, required: true }
                    "paths":
                      type: seq
                      sequence:
                        - type: map
                          mapping:
                            "path":
                              type: seq
                              sequence:
                                - type: map
                                  mapping:
                                    "package": { type: str, required: true }
                                    "version": { type: text, required: true }
                                    "vpkg": { type: str, required: false}
               "missing":
                  type: map
                  mapping:
                    "pkg":
                      type: map
                      required: true
                      mapping:
                        "package": { type: str, required: true }
                        "version": { type: text, required: true }
                        "vpkg": { type: str, required: false}
                    "paths":
                      type: seq
                      sequence:
                        - type: map
                          mapping:
                            "path":
                              type: seq
                              sequence:
                                - type: map
                                  mapping:
                                    "package": { type: str, required: true }
                                    "version": { type: text, required: true }
                                    "vpkg": { type: str, required: false}

There are a few improvements over the old output of edos-distcheck. We are going to discuss them with a real example. The following two snippets are from the output of distcheck on sid/amd64 (04/08/2010).

Distcheck now outputs a list of broken or installable packages depending on the given options (--failures, --success, --explain and combinations thereof). Two quick examples:

-
 package: python-gi-dbg
 version: 0.6.0-1
 status: broken
 reasons:
  -
   conflict:
    pkg1:
     package: python-gobject
     version: 2.21.4-1
    pkg2:
     package: python-gi
     version: 0.6.0-1
    paths:
     -
      path:
       -
        package: python-gi-dbg
        version: 0.6.0-1
        vpkg: python-gi (= 0.6.0-1)
       -
        package: python-gi
        version: 0.6.0-1
     -
      path:
       -
        package: python-gi-dbg
        version: 0.6.0-1
        vpkg: python-gi (= 0.6.0-1)
       -
        package: python-gi
        version: 0.6.0-1
        vpkg: python-gobject (>= 2.20)
       -
        package: python-gobject
        version: 2.21.4-1

In the example above, the package python-gi-dbg is broken because there is a conflict between the packages python-gobject and python-gi. The reason why python-gi-dbg is affected by this conflict is explained by following the dependency chains from python-gi-dbg to the two offending packages. Note that for each package element of each path we specify the vpkg, that is, the dependency (as it is written in the control file) that leads to the conflict. Since a dependency can be a virtual package or a package with a version constraint, it can be expanded to a disjunction of packages (think of a dependency on mta-agent, which can be expanded as postfix, exim or sendmail…). All possible paths to an offending package are reported.
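Since the report is plain YAML, it is easy to consume from scripts. As a quick illustration, here is a stdlib-only Python sketch that scans the flat fields of a small inlined fragment; a real consumer would of course feed the document to a proper YAML parser (e.g. PyYAML) instead.

```python
# Fragment of distcheck output, inlined for illustration.  This naive
# scan relies on "status" following the top-level package/version pair,
# as in the real output (nested path entries carry no status field).
report_text = """\
-
 package: python-gi-dbg
 version: 0.6.0-1
 status: broken
-
 package: python-gobject
 version: 2.21.4-1
 status: ok
"""

broken = []
current = None
for line in report_text.splitlines():
    stripped = line.strip()
    if stripped.startswith("package:"):
        current = stripped.split(":", 1)[1].strip()
    elif stripped == "status: broken":
        broken.append(current)

print(broken)  # ['python-gi-dbg']
```

This is exactly the kind of grep-style consumption the format is meant to keep easy, without ruling out full YAML parsing when the reasons need to be inspected.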

Likewise, if a package is broken because of an unfulfilled dependency, distcheck will show the path leading to the problem. In the following example we show that the package gnash-tools is broken because two of its dependencies depend on the missing package libboost-date-time1.40.0 (>= 1.40.0-1).

-
 package: gnash-tools
 version: 0.8.7-2+b1
 status: broken
 reasons:
  -
   missing:
    pkg:
     package: gnash-common
     version: 0.8.7-2+b1
     vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)
    paths:
     -
      path:
       -
        package: gnash-tools
        version: 0.8.7-2+b1
        vpkg: gnash-common-opengl (= 0.8.7-2+b1) | gnash-common (= 0.8.7-2+b1)
       -
        package: gnash-common
        version: 0.8.7-2+b1
        vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)
  -
   missing:
    pkg:
     package: gnash-common-opengl
     version: 0.8.7-2+b1
     vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)
    paths:
     -
      path:
       -
        package: gnash-tools
        version: 0.8.7-2+b1
        vpkg: gnash-common-opengl (= 0.8.7-2+b1) | gnash-common (= 0.8.7-2+b1)
       -
        package: gnash-common-opengl
        version: 0.8.7-2+b1
        vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)

The code is still in flux and not ready for production yet (everything is in the mancoosi svn). I hope this is a good step in the right direction. Comments on the debian wiki are welcome.

If we compare the output of distcheck with the old edos-debcheck, we get the following:

$cat /var/lib/apt/lists/ftp.debian.org_debian_dists_unstable_main_binary-amd64_Packages | edos-debcheck -failures -explain
[...]
holotz-castle-milanb (= 0.0.20050210-1): FAILED
  holotz-castle-milanb (= 0.0.20050210-1) depends on one of:
  - holotz-castle (= 1.3.14-2)
  holotz-castle-data (= 1.3.14-2) and holotz-castle-milanb (= 0.0.20050210-1) conflict
  holotz-castle (= 1.3.14-2) depends on one of:
  - holotz-castle-data (= 1.3.14-2)
$./debcheck --explain --failures /var/lib/apt/lists/ftp.debian.org_debian_dists_unstable_main_binary-amd64_Packages
[...]
-
 package: holotz-castle-milanb
 version: 0.0.20050210-1
 status: broken
 reasons:
  -
   conflict:
    pkg1:
     package: holotz-castle-milanb
     version: 0.0.20050210-1
    pkg2:
     package: holotz-castle-data
     version: 1.3.14-2
    paths:
     -
      path:
       -
        package: holotz-castle-milanb
        version: 0.0.20050210-1
        vpkg: holotz-castle
       -
        package: holotz-castle
        version: 1.3.14-2
        vpkg: holotz-castle-data (= 1.3.14-2)
       -
        package: holotz-castle-data
        version: 1.3.14-2

apt-get / aptitude test upgrades

After reading this interesting blog post from Petter Reinholdtsen, I've decided to repeat his experiments and save the results with dudf-save. Using Petter's script, I've created a lenny schroot, installed mancoosi-contest and then run apt-get and aptitude in simulation mode to create and upload the dudf to mancoosi.debian.net.

For example: http://mancoosi.debian.net/dudf/file/?uid=adf7b774-9af8-11df-bc37-00163e46d37a is the dudf report for the upgrade of gnome and
http://mancoosi.debian.net/dudf/file/?uid=8222799a-9af8-11df-8b50-00163e46d37a for the upgrade of kde from lenny to squeeze (2010-07-28).

I’ll repeat these tests from time to time. The idea is to find upgrade problems, and in particular to compare apt-get / aptitude results with those of other solvers.


using apt-get and aptitude with cudf

apt-get and aptitude were two missing competitors in the MISC competition. However, it is important and interesting to see how these two tools compete against the other solvers submitted to MISC. In this post I want to present two simple tools to convert cudf documents to something that apt-get based tools can handle.

Cudf and debian share many characteristics but also have important semantic differences. One important difference concerns installing multiple versions of the same package. Since this is allowed in cudf, but not in debian, we can use apt-get and aptitude only to solve cudf problems that respect this constraint, ruling out, for example, all cudfs from the rpm world. Another difference to take care of concerns the request semantics. In cudf, a request can contain version constraints. For example, one can ask to upgrade the package wheel to a version greater than 2. Since it is not possible to translate such a request directly into apt-get syntax, we are forced to add a dummy package encoding the disjunction of all packages that respect the constraint. This problem does not arise with remove requests, as they always refer to the currently installed package.
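The dummy-package trick can be sketched as follows. This is a hypothetical helper written for illustration, not the actual cudftodeb code: given the versions of a package present in the universe, a versioned request such as "wheel > 2" becomes a new package depending on the disjunction of all satisfying versions.

```python
def encode_versioned_request(name, versions, constraint):
    """Encode a versioned CUDF request (e.g. "wheel > 2") as a dummy
    Debian package whose Depends is the disjunction of every version
    in the universe that satisfies the constraint.

    `versions` lists the (integer) CUDF versions of `name` present in
    the universe; `constraint` is a predicate over such versions.
    """
    satisfying = [v for v in versions if constraint(v)]
    depends = " | ".join("%s (= %d)" % (name, v) for v in satisfying)
    return ("Package: dummy_%s\n"
            "Version: 1\n"
            "Architecture: i386\n"
            "Depends: %s\n") % (name, depends)

# The universe contains wheel at versions 1, 2 and 3; request: wheel > 2.
stanza = encode_versioned_request("wheel", [1, 2, 3], lambda v: v > 2)
print(stanza)
```

With only version 3 satisfying the constraint, the generated stanza depends on `wheel (= 3)` alone, matching the dummy_wheel package shown later in this post.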

Apt-get needs two files: the Packages file, which contains the list of all packages known to the meta-installer, and the status file, which contains the list of packages that are currently installed. To generate these files I wrote a small utility using the dose3 framework, imaginatively called cudftodeb. This tool takes a cudf document and produces three files: Packages, status and Request, with the Request file containing the list of packages to install or remove in a syntax compatible with apt-get.

In order to run apt-get/aptitude with these files, you need a simple bash script. You can find the details here for apt-get and here for aptitude. The most important option is -s, used to simulate an installation.

With the -v option of apt-get we can generate a parsable solution. This output is then piped through another tool called aptgetsolutions in order to produce a cudf solution, closing the circle.

For example, this is the trace produced by aptitude when trying to solve the legacy.cudf problem:

Reading package lists...
Building dependency tree...
Reading extended state information...
Initializing package states...
Reading task descriptions...
The following NEW packages will be installed:
  bicycle dummy_wheel electric-engine{b} glass{a} window{a}
The following packages will be upgraded:
  door wheel
2 packages upgraded, 5 newly installed, 0 to remove and 1 not upgraded.
Need to get 0B of archives. After unpacking 0B will be used.
The following packages have unmet dependencies:
  gasoline-engine: Conflicts: engine which is a virtual package.
  electric-engine: Conflicts: engine which is a virtual package.
The following actions will resolve these dependencies:

     Remove the following packages:
1)     gasoline-engine

The following NEW packages will be installed:
  bicycle dummy_wheel electric-engine glass{a} window{a}
The following packages will be REMOVED:
  gasoline-engine{a}
The following packages will be upgraded:
  door wheel
2 packages upgraded, 5 newly installed, 1 to remove and 0 not upgraded.
Need to get 0B of archives. After unpacking 0B will be used.
WARNING: untrusted versions of the following packages will be installed!

Untrusted packages could compromise your system's security.
You should only proceed with the installation if you are certain that
this is what you want to do.

  wheel bicycle dummy_wheel door glass electric-engine window

*** WARNING ***   Ignoring these trust violations because
                  aptitude::CmdLine::Ignore-Trust-Violations is 'true'!
Remv gasoline-engine [1] [car ]
Inst bicycle (7 localhost) [car ]
Inst glass (2 localhost) [car ]
Inst window (3 localhost) [car ]
Inst door [1] (2 localhost) [car ]
Inst wheel [2] (3 localhost) [car ]
Inst dummy_wheel (1 localhost) [car ]
Inst electric-engine (1 localhost)
Conf bicycle (7 localhost)
Conf glass (2 localhost)
Conf window (3 localhost)
Conf door (2 localhost)
Conf wheel (3 localhost)
Conf dummy_wheel (1 localhost)
Conf electric-engine (1 localhost)
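A converter in the spirit of aptgetsolutions (this is a hedged sketch for illustration, not the real tool) only needs to recognise the Inst/Remv lines of the simulated run:

```python
import re

# A fragment of the simulated run shown above, inlined for illustration.
trace = """\
Remv gasoline-engine [1] [car ]
Inst bicycle (7 localhost) [car ]
Inst wheel [2] (3 localhost) [car ]
Conf bicycle (7 localhost)
"""

# "Inst pkg [old] (new source)" carries the new version in parentheses
# (the bracketed old version appears only on upgrades);
# "Remv pkg [old]" carries the removed version in square brackets.
inst_re = re.compile(r"^Inst (\S+)(?: \[\S+\])? \((\S+) ")
remv_re = re.compile(r"^Remv (\S+) \[(\S+)\]")

installed, removed = {}, {}
for line in trace.splitlines():
    m = inst_re.match(line)
    if m:
        installed[m.group(1)] = m.group(2)
        continue
    m = remv_re.match(line)
    if m:
        removed[m.group(1)] = m.group(2)

print(installed, removed)
```

Conf lines are ignored, since the install decision is already recorded by the corresponding Inst line; from the two dictionaries a cudf solution stanza can then be emitted per package.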

Note the package dummy_wheel used to encode the upgrade request wheel >> 2. This dummy package encodes the request as a dependency:

Package: dummy_wheel
Version: 1
Architecture: i386
Depends: wheel (= 3)
Filename: /var/fakedummy_wheel1

One last remark about apt-get. I just ran into this bug today using the old version of apt-get that is shipped with lenny. For our experiments we are using only the latest version of apt-get from debian testing.


misc 2010, how to run a solver competition

One of the goals of the Mancoosi project is to bring together researchers from various disciplines to advance the state of the art of package managers. To this end, we organized a sat-solving competition specifically tailored to upgrade/installation problems. The winner of the competition was announced during the lococo workshop, hosted at the international conference FLOC, on the 10th of july 2010. I spent several hours preparing the infrastructure for the competition and here I'd like to give a brief account of my experience. This work was done together with Ralf Treinen, Roberto Di Cosmo and Stefano Zacchiroli.

Input - cudf documents

The input for the solvers in the competition is a set of documents in the cudf format. In the last year we collected a number of cudf documents from the community (namely, with the help of mandriva, caixa magica and debian). These documents are stored on the mancoosi.org servers and can be used to train meta-installers on particularly difficult problems. We used 20 such documents (from debian) for the misc 2010 competition. Unfortunately, due to problems in the conversion between dudf (a distribution-specific format used to collect upgrade problems) and cudf, we were not able to use the problems from either mandriva or caixa magica.

Other than these real problems, we generated a number of artificial problems built from debian repositories. The utility we used is called randcudf and it is freely available on the mancoosi website as part of the dose3 framework. We took a number of variables into consideration in order to generate problems that are difficult but not too far from the reality of everyday use.

Among these parameters are:

  • the size of the package universe
  • the number of packages that are declared as already installed in the universe (status)
  • the number of packages to install / remove / upgrade
  • the probability of generating install / remove requests with a version constraint
  • the number of packages declared as installed but whose dependencies might not be satisfied
  • the number of packages marked as keep, and the type of keep (version or package)

Playing around with these variables we were able to produce problems of different sizes and different degrees of complexity. During the competition, for example, the three categories had universes of respectively 30K, 50K and 100K packages. Moreover, we discarded all problems that did not have a solution at all.
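To make the role of these parameters concrete, here is a toy request generator driven by a couple of them. This is purely illustrative; the real generator is randcudf from dose3, and every name below is made up.

```python
import random

def random_request(universe, n_install, n_remove, p_versioned, seed=0):
    """Toy sketch: draw a random install/remove request from a package
    universe (a dict mapping package name -> list of available versions).
    A version constraint is attached with probability p_versioned."""
    rng = random.Random(seed)
    # Pick distinct packages for the install and remove parts.
    names = rng.sample(sorted(universe), n_install + n_remove)

    def constrain(name):
        if rng.random() < p_versioned:
            return "%s >= %d" % (name, rng.choice(universe[name]))
        return name

    return {"install": [constrain(n) for n in names[:n_install]],
            "remove": [constrain(n) for n in names[n_install:]]}

universe = {"pkg%02d" % i: [1, 2, 3] for i in range(100)}
request = random_request(universe, n_install=3, n_remove=2, p_versioned=0.2)
print(request)
```

The other parameters listed above (universe size, broken status entries, keep flags) would shape the universe itself rather than the request.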

From our experience during the problem selection, when considering over 30K packages, it is extremely easy to generate cudf problems that have no solution at all. For example, in debian lenny there are 17K packages connected by a kernel of 80 conflicts. This configuration produces around 5K strong conflicts. This means that if we pick two packages among these 17K there is a high probability that these two packages are in conflict. This is because of the high level of inter-dependency in open source distributions. With bigger remove/install requests this probability grows even bigger. Since the goal was to provide random problems as close as possible to reality, our documents have a request to install at most 10 packages and remove 10 packages at the same time.

The five categories used in the competition:

  • cudf_set : Encoding in cudf of 1-in-3-SAT
  • debian-dudf : Real installation problems, collected via dudf-save.
  • easy : Debian unstable / desktop installation from unstable / 10 install - 10 remove
  • difficult : Debian stable + unstable / server installation from stable / 10 install - 10 remove - 1 upgrade all
  • impossible : Debian oldstable, stable, testing, unstable / server installation from oldstable / 10 install - 10 remove - 1 upgrade all

Execution

The execution environment of the competition was set up in a dedicated virtual machine running on Xen. This VM respects the specification given in the rules of the competition. We made two small mistakes (to be fixed in the next edition of the competition). First, we did not specify the OS running in the virtual machine. To reproduce the results, everybody should be able to replicate the execution environment and re-run the competition. For Misc2010, we used a plain debian lenny (plus security updates). We tried to keep additional software to a strict minimum.

Starting from a base system (as generated by debootstrap) we added:

  • subversion (used to update the execution environment from the mancoosi svn)
  • git (used to register the various versions of the submitted solvers)
  • sun-java (from non-free)

The second mistake was not to specify the exact java version. open-java has subtle differences from sun-java, and it seems these differences created a few problems for one of the participants. This problem was quickly rectified.

Running the competition

To run the competition I wrote a few simple bash scripts and test cases. The test cases were meant to test the execution environment and to make sure that all constraints were correctly enforced. The execution environment is available in the mancoosi svn. In practice, we ran the competition in four phases.

Phase One

In the first phase we deployed all solvers in the execution environment. In order to clean up the solver directory and “start fresh” after every invocation, I created an empty git repository for every solver. After each invocation, the repository was cleaned up using:

git clean -f -d
git reset --hard
find /tmp/ -user misc2010 -exec rm -Rf {} \;

The last line makes sure that no temporary files are left in the temporary directory.

Phase Two

In the second phase, we actually ran the competition. The script used is runcomp.sh. It takes 3 arguments: the list of solvers, the list of problems and a timeout in seconds. Since we used the same set of problems for the trendy and paranoid tracks, we ran the competition only once for both tracks. The output of the runcomp.sh script is a directory (i.e. tmp/201007060918) with all the raw results of the competition. All raw data is publicly available here.

Phase Three

In the third phase we compute the aggregate results by track using the script recompute.sh. This script takes 4 arguments: the list of all solvers in one track, the list of problems (the same used before), the timestamp of the last run (e.g. 201007060918) and the name of the track. The output of this script is a file containing all the aggregate results, one per line, of the form category, problem, solver, time, result. For example, a snippet from this file looks like:

cudf_set huge1.cudf apt-pbo-paranoid-1.0.5 - ABORT
cudf_set huge1.cudf p2cudf-paranoid-1.6 2.28814 FAIL
cudf_set huge1.cudf uns-paranoid-0.0002 - ABORT
cudf_set huge1.cudf ucl-cprel-1.0 0 FAIL
cudf_set huge1.cudf aspcud-paranoid-1.0 - ABORT
cudf_set huge1.cudf inescp-1.0 1.94812 FAIL
debian-dudf 103c9978-5408-11df-9bc1-00163e7a6f5e.cudf apt-pbo-paranoid-1.0.5 11.75 SUCCESS -18,-21
debian-dudf 103c9978-5408-11df-9bc1-00163e7a6f5e.cudf p2cudf-paranoid-1.6 6.06838 SUCCESS -16,-34
debian-dudf 103c9978-5408-11df-9bc1-00163e7a6f5e.cudf uns-paranoid-0.0002 - ABORT
debian-dudf 103c9978-5408-11df-9bc1-00163e7a6f5e.cudf ucl-cprel-1.0 0 FAIL
debian-dudf 103c9978-5408-11df-9bc1-00163e7a6f5e.cudf aspcud-paranoid-1.0 14.4089 SUCCESS -15,-29
...

Results are generated using the solution checker.
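Each line of the aggregate file is whitespace-separated (category, problem, solver, time, result, plus a score field on SUCCESS lines, and `-` for the time of aborted runs). A small parsing sketch, with illustrative data inlined:

```python
# Illustrative lines in the aggregate-results format described above.
results_text = """\
cudf_set huge1.cudf p2cudf-paranoid-1.6 2.28814 FAIL
cudf_set huge1.cudf uns-paranoid-0.0002 - ABORT
debian-dudf example.cudf apt-pbo-paranoid-1.0.5 11.75 SUCCESS -18,-21
"""

rows = []
for line in results_text.splitlines():
    fields = line.split()
    category, problem, solver, time, result = fields[:5]
    rows.append({
        "category": category,
        "problem": problem,
        "solver": solver,
        # "-" marks an aborted run with no measured time.
        "time": None if time == "-" else float(time),
        "result": result,
        # Score is present only on SUCCESS lines.
        "score": fields[5] if len(fields) > 5 else None,
    })

winners = [r["solver"] for r in rows if r["result"] == "SUCCESS"]
print(winners)
```

A tool like misc-classifier can then rank solvers per category from such rows.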

Phase Four

The last step is the classification of the solutions. The misc-classifier takes as input the aggregate results and outputs the html tables that will then be published on the web.

Conclusions

Running a solver competition is not as easy as it seems. To get it right we ran an internal competition in january 2009 that helped us to highlight, understand and solve various problems. It is mostly a matter of writing down the rules, specifying a clear and understandable protocol for solver submission (for example, asking participants to version their solver and to provide an md5 hash of the binary is a very good idea in order to avoid mix-ups) and spending some time debugging the scripts. The runsolver utility from Olivier Roussel (available here) is a very nice tool that takes care of many delicate details of process accounting and resource management. I added a small patch to be able to specify a specific signal as the warning signal. The code is in my private git repository: git clone http://mancoosi.org/~abate/repos/runsolver.git/ . This is the actual code we used for the competition. The 32-bit binary is available in the svn. All in all it was a great experience.

The results of the competition are published here.


analyzing drupal dependencies for fun and profit

After a few inspiring talks in the drupal room at fosdem, I decided to spend a few hours figuring out the module dependency system in drupal.

Drupal has a highly modular design. The core is composed of a set of required modules (dependencies) and a set of optional modules (suggests). All contrib modules declare similar dependencies between each other. All dependencies are conjunctive, that is, in order to install a component all its dependencies must be satisfied. There are no conflicts between components, which implies that a module is always installable. The only implicit conflict is that two versions of the same module cannot be installed at the same time. This makes the module installation algorithm trivial, as it is equivalent to a simple visit of the dependency graph (which might have cycles).
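With only conjunctive dependencies and no conflicts, the install set of a module is just the set of modules reachable from it in the dependency graph. A cycle-safe sketch of that visit (toy data, not real Drupal metadata):

```python
def install_set(module, depends):
    """All modules needed to install `module`, given a mapping
    module -> list of direct dependencies.  Cycles in the graph are
    harmless because visited modules are never expanded twice."""
    needed, stack = set(), [module]
    while stack:
        m = stack.pop()
        if m in needed:
            continue
        needed.add(m)
        stack.extend(depends.get(m, []))
    return needed

# Toy graph: tables needs filter; a and b form a dependency cycle.
depends = {"tables": ["filter"], "filter": [], "a": ["b"], "b": ["a"]}
print(sorted(install_set("tables", depends)))
```

The absence of negation is what keeps this polynomial; as discussed at the end of this post, adding conflicts would change the picture entirely.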

There is a nice page on the drupal website explaining the format of the metadata for the next version of drupal.

For example :

name = Tables Filter
description = Provides a filter that converts a [table  ] macro into HTML encoded table.
dependencies[] = filter
package = Input filters
core = 6.x

; Information added by drupal.org packaging script on 2009-09-10
version = "6.x-1.0"
core = "6.x"
project = "tables"
datestamp = "1252563652"
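The .info format is a flat list of key = value lines, with `[]`-suffixed keys for list-valued fields (such as dependencies) and `;` for comments. A minimal parser sketch (written for illustration, not the extraction code I actually used):

```python
def parse_info(text):
    """Parse a Drupal .info file into a dict.  Keys ending in "[]"
    accumulate into lists; lines starting with ";" are comments;
    surrounding double quotes on values are stripped."""
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        value = value.strip('"')
        if key.endswith("[]"):
            data.setdefault(key[:-2], []).append(value)
        else:
            data[key] = value
    return data

info = parse_info("""\
name = Tables Filter
dependencies[] = filter
package = Input filters
core = 6.x
; Information added by drupal.org packaging script
version = "6.x-1.0"
""")
print(info)
```

From a dict like this it is a short step to the 822-style stanzas shown below.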

Note the conversion into the intermediate aggregate format shown below.

In order to analyze all modules’ dependencies, I’ve downloaded all the available modules for release 6 of drupal (15th Feb 2010), extracted all the metadata and transformed it into something that the tools in dose3 can handle. While downloading all the project archives I’ve also found that a significant number of archives cannot be downloaded (403 / 404), and a few mistakes in the metadata … I’ll blog about this in the future maybe.

Numbers and intermediate aggregate module list

From the .info file in each module archive, I extracted all the relevant data and transformed it into an 822 format similar to the one used in debian. There are about 4800 modules in the drupal repository for drupal 6.x.

This is a small snippet representing a few drupal core modules and a meta package (that I created from the metadata) to express the core’s dependencies:

[...]
package: tables
version: 6.x-1.0
depends: filter

package: blogapi
version: 6.15

package: profile
version: 6.15

package: filter
version: 6.15

package: drupal
version: 6.15
depends: system , user , block , node , filter
provides: core = 6.15
suggests: translation , comment , menu , openid , contact , tracker , forum , ping , syslog , help , dblog , search , trigger , poll , update , locale , php , path , taxonomy , color , aggregator , upload , throttle , statistics , blog , book , blogapi , profile
[...]

Since I’m considering only modules for drupal version 6.x, all dependencies on core >= 6.0, core < 7.0 are left implicit.

Dependency graphs

The result is a set of nice graphs showing, for each package, its (deep) dependencies. From the global dependency graph, I’ve extracted the “connected” components, that is, all modules that are related to each other in some way. This gives 375 sub-graphs. This is the top 10 (WARNING: some of the biggest pdfs systematically manage to trash my workstation… handle with care) … and circo didn’t manage to create the pdf for views and taxonomy:

The complete list is here

From these graphs, it seems that apart from a couple of dozen packages, the rest of the drupal components are loosely connected. I don’t think this is a matter of code sharing; it is more likely because the drupal repository has a plethora of small components with very specific functionality that only depend on the drupal core.

Dist check

Distcheck is a small utility that transforms package dependencies into a propositional logic problem and then uses a SAT solver to simulate their installation. Since there are no conflicts, it should always be possible to install a package. The only reason for a package to be broken is a missing dependency in the repository. Periodically performing this analysis could prevent the distribution of broken packages.
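For a conflict-free repository like this one, no SAT solver is even needed: the distcheck question degenerates to checking that every transitive dependency actually exists in the repository. A sketch with toy module names (the missing module is invented for the example):

```python
def broken_modules(depends):
    """Given module -> direct dependencies, return the modules that
    are not installable because some (transitive) dependency is
    missing from the repository.  With conjunctive dependencies and
    no conflicts, simple reachability is enough."""
    broken = set()
    for module in depends:
        stack, seen = [module], set()
        while stack:
            m = stack.pop()
            if m in seen:
                continue
            seen.add(m)
            if m not in depends:
                # Dependency absent from the repository: not installable.
                broken.add(module)
                break
            stack.extend(depends[m])
    return broken

repo = {"tables": ["filter"], "filter": [], "gallery": ["views"]}
print(sorted(broken_modules(repo)))  # "views" is missing from the repo
```

Running such a check before publishing a module release is exactly the periodic analysis suggested above.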

Conclusions

  • Periodic generation of aggregate module metadata information.
  • Distcheck the module repository to avoid releasing a module that is not installable due to a missing dependency.
  • Integrate a developer tool to display all the dependencies of a module (like debtree, or directly using debtree).
  • As the system grows it might be necessary to review the dependency system to include disjunctive dependencies and conflicts between modules. At present this might not be necessary. Adding more expressivity to the dependency system will of course significantly increase the complexity of the installation problem (from polynomial to NP-complete).

I think it is important to spend a few words on this last point. It is clear that not all 4800 packages can be installed at the same time. Just think about the filter modules that manipulate users’ submissions. At the moment the only way a site developer has to discover a conflict is to try out the module and check that it did not break anything else on the site. Given the complexity of many drupal sites this can be a painful and costly task to perform.

Adding conflicts to the metadata would make module integration much easier for site developers, and move the burden of finding potential problems to the module developers and to the module installer. As I said before, if we include conflicts (that is, negation, in logical terms) the problem of installing a new module suddenly becomes NP-complete. Running an NP-complete algorithm on a webserver is of course a bad idea, but using drush offline to run complex install operations should be completely acceptable, just as it is acceptable to wait for apt-get to install the latest program on debian.

If conflicts are indeed needed, it would be fun to have a mod_php_minisat and to implement a small dependency solver in php!