5  The Tools

This section is taken from [MBC+06].

We have developed a framework that can be used by a distribution editor to asses the quality of its distribution and to track problems concerning the underlying package repository (i.e., broken packages in non trimmed repositories). Actually we have two distinct sets of tools (Figure 13): a set of independent elements (i.e.,the toolchain) that executes the analysis by incrementally processing the data collected from a repository; and A very specialized tool that does the analysis in one step.

The rationale for having these two sets of tools is that we want to be able to use the data for other kind of analyses as well. Having a modularized toolchain enables us to reuse part of it even for other purposes. On the other hand, having a specialized, small and efficient tool that performs this particular kind of analysis is good for those distribution editors that do not need other kind of features.

Moreover, having plenty of independent tools that execute the same kind of analysis allowed us to validate the whole approach by comparing the results obtained from these different sources.

Figure 13: The framework

In the following we describe each element of our framework.

5.1  Ceve

Ceve, the first element of the toolchain, is a generalized package parser. It can read several package formats (most importantly RPM and DEB packages and distributions), but also the XML rpm-metadata format [Thec]), and output their metadata in several different formats, of which the most important is the EGraph format described in the next section. Ceve is written in OCaml, and it uses the CDuce [Thea] language for XML input and output.

The function of Ceve in the toolchain is that of parsing package metadata from all sorts of formats into one common format (the EGraph format). Since there are many differences between the various package formats, Ceve cannot simply read metadata and output them; the data has to be manipulated. The most notable example of this is RPM dependencies; RPM cannot declare dependencies on other packages directly, but only by way of features, as explained in section 3.1. Ceve can resolve these indirect dependencies, so that if package A declares a dependency on feature F, and feature F is provided by packages B and C, in the output package A will need either package B or package C. Also, packages that install different files in the same location conflict with each other, but this is not explicitly declared; Ceve can explicitly add these conflicts to the output.

5.2  The EGraph file format

The EGraph file format is a package-format agnostic representation of the information that is encoded in the packages stored in a package repository. It is based on the XML [BPSMY04] language and in particular it complies with the GraphML [BEH+02] specification. This intermediate format has been conceived in order to have a common and uniform representation of the metadata that can be used as input by all the tools of our framework, without having to develop a filter for each package format.

The GraphML format is well suited for this purpose because the dependency relationships among packages are easily represented by using graphs and, moreover, it can be extended in order to accommodate other significant metadata.

Figure 14 shows an excerpt of the EGraph format. node tags specify units (Definition 1) and the associated data explicit the available versions, thus the actual packages and, for each version, the provided virtual packages.

Dependencies are specified using edge tags. source and target attributes defines the units between the dependency is established, while the type attribute defines the dependency type (i.e., run or conflict). A version tag specialize the information by giving, in case, version constraints on the dependency requirement.

<node package="aewm" id="aewm">...
   <version number="1.2.0-1"/>
   <provides version="1.2.0-1" 

<edge source="aewm" target="libc6" type="run">... <version source="1.2.0-1" operator="ge" target="2.2.4-4"/>...

Figure 14: Excerpt of an EGraph file

The EGraph format is an effort for proposing a new metadata format that can complement and extend the existing metadata specified at the package level in order to perform more complicated and effective checking on package repositories [EDO05].

5.3  EDOSLib

EDOSLib is a library which provides the foundation of some tools that are used in the framework. In particular EDOSLib implements: an object model for representing package repositories information and their structure; a set of functionalities to explore manage the package repository struture (e.g., extracting subrepositories and dependency closures); EGraph input/output functionalities.

5.3.1  ProblemGenerator

ProblemGenerator is an EDOSLib-based preprocessor that takes as input the EGraph format representing a package repository and gives an encoding of the installability problem in order to verify if the repository is trimmed or not. In particular ProblemGenerator performs three steps: i) Extract the package subrepository underlying the dependency closure for each package that has to be analyzed. ii) Map the standard package version numbers to integers, as mentioned in Section 3; iii) Expands all the reference to virtual packages by substituting them with actual package names and versions.

The output is in the format suitable to be processed by the solvers that will perform the actual verification.

5.3.2  Explorer, Statistics and Visualizer

These are three complementary tools that can be useful for exploring the package repository, giving a help in navigating through the intrinsic complexity of it. In particular Explorer offers a command line interface to the functionalities provided by EDOSLib (e.g., dependency closures extraction, package metadata exploration, etc.). Statistics extract some useful metric with respect to the complexity of the package repository (e.g., number of dependencies, average closure sizes, etc.). Finally, Visualizer is a graph visualization tool that allows the distribution editor to visually navigate the package repository through its graphical representation. Even though these tools are not really part of the analysis engine, they provide a useful aid when it comes to navigate the package repository in order to track down the problems discovered by the actual analysis.

5.4  SAT/FGrasp Solver

The SAT/FGrasp solver takes as input the encoding of an installability problem as produced by ProblemGenerator and translates it to a boolean formula as outlined in section 4.3. After conversion to conjunctive normal form, the formula is fed through the FGrasp SAT solver from T. U. Lisbon. This solver can return a minimal set of assignments that satisfy the formula, which the SAT/FGrasp tool translates back into minimal lists of packages that need to be installed to enable the installation of the initial package. The SAT solving is reasonably efficient, taking less than 1 second on the hardest installability problems.

5.5  CP/Mozart-Oz Solver

The CP/Mozart-Oz Solver translates the output of ProblemGenerator to a CP problem as described in section 4.4, then solves it using a solver written in the Mozart-Oz language [moz] . The solver uses a custom branch-and-bound strategy programmed in Oz itself. While effective on small to medium-sized installability problems, the solver tends to exhibit exponential divergence on large problems.

5.6  DEB/RPMCheck

The debcheck and rpmcheck utilities are separate from the toolchain. They take as input a package repository and check whether one, several or all packages in the repository are installable with respect to that repository. Both utilities are based on the SAT encoding of section 4.3 and exploit a customized Davis-Putnam SAT solver [ES04]. Since all computations are performed in-memory and some of the encoding work is shared between all packages considered, the debcheck and rpmcheck are significantly faster than the SAT/FGrasp tool for checking the installability of all packages of a repository.