performance tweaking - dose3

Lately I’ve been concerned about the performance of dose3. Soon we will have a package in the official debian archive (containing the new distcheck) and we also plan to use dose3 as the foundation of an upcoming apt-get feature (external solvers!). This week I tackled a couple of problems.

First I wanted to understand the poor performance of my parser for the debian Packages format. The core parser (written by J. Vouillon for dose2) is home-brewed: it uses a Str based tokenizer and is pretty efficient. On top of it I built the rest of the parsing infrastructure. Out of laziness (well, I followed the [http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf “avoid premature optimization”] mantra) I used a lot of regular expressions from the standard library (Str) module to parse various chunks of the file. Since Str has the reputation of not being the fastest regexp library in the world (I know I should use Pcre), I started my journey by removing my calls to this library and replacing them with calls to the String module.

====Lesson n. 1==== If you do not need a regular expression to parse a string, you are better off using String.index, String.sub and friends instead. Your function may be a bit longer, but it will almost certainly be faster. Scanf.sscanf is also your friend.
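
To make this concrete, here is a minimal sketch of splitting a "Key: value" line of a Packages file using String functions only. The helper name split_field is hypothetical and is not taken from the dose3 sources.

```ocaml
(* Split a "Key: value" control-file line without regular expressions.
   [split_field] is a hypothetical helper, not part of dose3. *)
let split_field (line : string) : (string * string) option =
  match String.index_opt line ':' with
  | None -> None (* no colon: not a field line *)
  | Some i ->
      let key = String.sub line 0 i in
      let value = String.sub line (i + 1) (String.length line - i - 1) in
      Some (key, String.trim value)

let () =
  match split_field "Package: libgnuradio-dev" with
  | Some (k, v) -> Printf.printf "%s -> %s\n" k v
  | None -> print_endline "not a field"
```

Only the first colon is used as a separator, so values that contain a colon themselves (a version with an epoch, for instance) are left intact.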

This was only the tip of the iceberg. Second, I noticed I used String.lowercase (I use ExtLib.String) a bit everywhere… I realized I could simply remove all these calls and have a bit more faith in the user input. If the user does not respect the standard it’s their problem, not mine.

====Lesson n. 2==== Calling a String function a zillion times slows you down considerably!

I knew there was something more to do. Following the advice of my colleagues, we decided to take a look at what was really happening under the hood. Using ocamlbuild and gprof, this is easily done.

First you need to rebuild your binary with the debug and profile tags. This can be done as a one-off from the command line:

ocamlbuild -tag debug -tag profile apt-backend.native

Then you have to run the binary as you normally do, to collect profiling information, and in the end you fire up gprof to see what’s going on.

$gprof apt-backend.native | less
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  8.52      0.79     0.79 63008857     0.00     0.00  caml_gc_set
  5.61      1.31     0.52   934076     0.00     0.00  _fini
  5.07      1.78     0.47 55818126     0.00     0.00  caml_MD5Transform
  4.53      2.20     0.42   296139     0.00     0.00  caml_gc_stat
  4.21      2.59     0.39 10910905     0.00     0.00  caml_parse_engine
  2.91      2.86     0.27                             compare_val
  2.48      3.09     0.23 11707884     0.00     0.00  caml_final_register

compare_val !!! This is a bad sign. It means I’m using the generic (polymorphic) comparison function instead of a monomorphic one. After a bit of head scratching I realized that in one of my data structures I was using the generic List.assoc. This function uses compare_val! Bingo.

Rewriting the assoc function for string keys lowered the number of calls to compare_val tenfold, giving me a considerable speed-up:

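(* assoc specialised to string keys: thanks to the type annotation the
   compiler can use string equality instead of the generic compare_val *)
let rec assoc (n : string) = function
  |(k,v)::_ when k = n -> v
  |_::t -> assoc n t
  |[] -> raise Not_found
;;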

====Lesson n.3==== Before using a generic function think twice !

In the same vein, I also replaced a couple of generic hash tables (keyed by integers and strings) with their monomorphic counterparts.

During my tests I also noticed that I was spending a lot of time resizing my hash tables. In my case this was easily avoided by passing a more sensible initial size when creating the hash table. This is not always possible, because sometimes the right size depends on a value that is not known in advance.
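
Both points can be sketched with the Hashtbl.Make functor from the standard library. This is only an illustration: the module name StringHashtbl, the function index_names and the size hint expected_packages are assumptions made up for the example, not the actual dose3 code.

```ocaml
(* A hash table specialised to string keys: equality is plain string
   equality, so lookups do not go through the generic comparison. *)
module StringHashtbl = Hashtbl.Make (struct
  type t = string
  let equal (a : string) (b : string) = String.equal a b
  let hash (s : string) = Hashtbl.hash s
end)

(* Assumed size hint, roughly the number of packages in a distribution.
   Starting close to the final size avoids repeated resizing. *)
let expected_packages = 30_000

(* Map each package name to its position in the input list
   (the last occurrence wins if a name appears twice). *)
let index_names (names : string list) : int StringHashtbl.t =
  let tbl = StringHashtbl.create expected_packages in
  List.iteri (fun i name -> StringHashtbl.replace tbl name i) names;
  tbl
```

The same pattern works for integer keys by changing the type and the equality function.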

The only thing left in the parsing function is to get rid of the last call to Str, which I use to tokenize my stream. I think writing a few lines of ocamllex would give me an additional speedup, but I’ll leave this for next week…

Since I was in the mood for hacking, I decided to understand what was wrong in a different part of dose3, namely the translation from the debian Packages format to propositional logic (which is then used by a SAT solver to perform various installability analyses).
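
As an illustration of this kind of translation, the sketch below uses the usual encoding of dependencies and conflicts as clauses: a package p that depends on "q1 | q2" yields the clause (not p or q1 or q2), and a conflict between p and q yields (not p or not q). The record type and the function encode are made up for this example and are far simpler than what dose3 actually does (no versions, no virtual packages).

```ocaml
(* A literal is a package id: positive means installed, negative means
   not installed. A clause is a disjunction of literals. *)
type clause = int list

(* Hypothetical, simplified package record: every dependency is a list of
   alternatives (a disjunction); conflicts is a list of package ids. *)
type pkg = { id : int; depends : int list list; conflicts : int list }

let encode (universe : pkg list) : clause list =
  List.concat
    (List.map
       (fun p ->
         let not_p = -p.id in
         (* p implies (q1 or q2 or ...) *)
         let deps = List.map (fun alternatives -> not_p :: alternatives) p.depends in
         (* p and q cannot both be installed *)
         let confl = List.map (fun q -> [ not_p; -q ]) p.conflicts in
         deps @ confl)
       universe)
```

For a package 1 that depends on "2 | 3" and conflicts with 4, encode returns the two clauses [-1; 2; 3] and [-1; -4]. Checking whether package 1 is installable then amounts to asking a SAT solver whether these clauses are satisfiable together with the unit clause [1].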

What I immediately noticed looking at my code is that I had a couple of List.unique calls in a very important function. Ah! My first naive solution to this problem was to use the ExtLib List.unique function that forces you to pass a comparison function with it. With this change I noticed a small speed-up (compare_val strikes back), but it was clearly not enough. The obvious solution was to rewrite the routine using a set (of integers in this case) and drop List.unique completely.

====Lesson n.4 ==== List.unique is slow, ExtLib.List.unique is better; if you can, use Sets.
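
A minimal sketch of the set based version, assuming the elements are integers as in my case. The names IntSet and unique_ints are chosen for the example only.

```ocaml
(* A set of integers with a monomorphic comparison function. *)
module IntSet = Set.Make (struct
  type t = int
  let compare (a : int) (b : int) = compare a b
end)

(* Remove duplicates from a list of integers by going through a set.
   The result is sorted in increasing order, so the original order of
   the list is not preserved. *)
let unique_ints (l : int list) : int list =
  IntSet.elements (List.fold_left (fun s x -> IntSet.add x s) IntSet.empty l)
```

Each insertion costs a logarithmic number of comparisons, whereas a list based unique compares every element with all the ones already kept. If the order of the elements matters, this version is not a drop-in replacement.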

The last improvement is related to the SAT solver we use. It’s a very specialized and optimized SAT solver (inherited from dose2), written in ocaml. Using gprof again, I noticed that the Gc overhead was substantial enough to warrant a bit of Gc tweaking.

    Gc.set { (Gc.get()) with
      Gc.minor_heap_size = 4 * 1024 * 1024; (*4M*)
      Gc.major_heap_increment = 32 * 1024 * 1024; (*32M*)
      Gc.max_overhead = 150;
    } ;

This corresponds to CAMLRUNPARAM=s=4M,i=32M,o=150 (both sizes are expressed in words, not bytes).

====Lesson n.5 ==== Gc tweaking can make the difference sometimes !

After all this work I was quite pleased with the result:

Before (r2454) :
abate@zed.fr:~/Projects/git-svn-repos/dose3/applications$ time ./distcheck.native deb://tests/lenny.packages
background-packages: 22311
foreground-packages: 22311
broken-packages: 0

real    0m11.535s
user    0m11.409s
sys     0m0.112s
abate@zed.fr:~/Projects/git-svn-repos/dose3/applications$ time ./distcheck.native deb://tests/sid.packages
background-packages: 29589
foreground-packages: 29589
broken-packages: 143

real    0m19.799s
user    0m19.621s
sys     0m0.152s

After (r2467) :
abate@zed.fr:~/Projects/git-svn-repos/dose3/applications$ time ./distcheck.native deb://tests/lenny.packages
background-packages: 22311
foreground-packages: 22311
broken-packages: 0

real    0m8.738s
user    0m8.589s
sys     0m0.132s
abate@zed.fr:~/Projects/git-svn-repos/dose3/applications$ time ./distcheck.native deb://tests/sid.packages
background-packages: 29589
foreground-packages: 29589
broken-packages: 143

real    0m14.026s
user    0m13.817s
sys     0m0.172s

I shaved roughly 3 seconds off the lenny run and almost 6 seconds off the sid run, that is, about 25 to 30% of the processing time. Considering that these applications are going to be called many times per day on the entire debian archive, or thousands of times during our experiments, a few seconds here and there can save quite a bit of time.

distcheck vs edos-debcheck

This is the second post about distcheck. I want to give a quick overview of the differences between edos-distcheck and the new version. First, despite using the same SAT solver and the same encoding of the problem, distcheck has been rewritten from scratch. Dose2 has several architectural problems and is not very well documented. Adding new features had become too difficult and error-prone, so this was a natural choice (at least for me). Hopefully Dose3 will survive the Mancoosi project and provide a base for dependency reasoning. The framework is well documented and the architecture is pretty modular. It is written in ocaml, so sadly I don’t expect many people to join the development team, but we’ll be very open to it.

These are the main differences with edos-debcheck .

Performance

distcheck is about two times faster than edos-debcheck (from dose2), but it is a “bit” slower than debcheck (the original debcheck, that is), the tool written by Jerome Vouillon that was later superseded in debian by edos-debcheck. The original debcheck was an all-in-one tool that did the parsing, encoding and solving without converting the problem to any intermediate format. distcheck trades a bit of speed for generality. Since it is based on Cudf, it can handle different formats and can easily be adapted to a range of situations just by changing the encoding of the original problem to cudf.

Below are a couple of tests I’ve performed on my machine (debian unstable). The numbers speak for themselves.

$time cat tmp/squeeze.packages | edos-debcheck -failures > /dev/null
Completing conflicts...                                            * 100.0%
Conflicts and dependencies...                                      * 100.0%
Solving                                                            * 100.0%

real    0m19.515s
user    0m19.193s
sys 0m0.276s

$time ./distcheck.native -f deb://tmp/squeeze.packages > /dev/null

real    0m10.859s
user    0m10.669s
sys 0m0.172s

Input

The second big difference is the range of supported input formats. At the moment we have two different tools in debian, edos-debcheck and edos-rpmcheck. Despite using the same underlying library, these two tools have different code bases. distcheck is basically a multiplexer that converts different inputs to a common format and then uses it (agnostically) to solve the installation problem. It can be called in different ways (via symlinks) to behave similarly to its predecessors.

At the moment we are able to handle 5 different formats:

deb:// Packages 822 format for debian based distributions

hdlist:// a binary format used by rpm based distributions

synth:// a simplified format to describe rpm based package repositories

eclipse:// an 822 based format that encodes OSGi plugin metadata

cudf:// the native cudf format
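
The multiplexing step can be pictured as a dispatch on the scheme of the input uri. The sketch below is only an illustration under that assumption: the type input_format and the function parse_uri are hypothetical names and do not come from the dose3 sources.

```ocaml
(* The five input formats listed above. *)
type input_format = Deb | Hdlist | Synth | Eclipse | Cudf

(* Split "scheme://path" into a known format and a path. Returns None if
   there is no "://" separator or the scheme is not one of the five. *)
let parse_uri (uri : string) : (input_format * string) option =
  let n = String.length uri in
  (* position of the first occurrence of "://", if any *)
  let rec find i =
    if i + 3 > n then None
    else if String.sub uri i 3 = "://" then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some i -> (
      let path = String.sub uri (i + 3) (n - i - 3) in
      match String.sub uri 0 i with
      | "deb" -> Some (Deb, path)
      | "hdlist" -> Some (Hdlist, path)
      | "synth" -> Some (Synth, path)
      | "eclipse" -> Some (Eclipse, path)
      | "cudf" -> Some (Cudf, path)
      | _ -> None)
```

With this helper, parse_uri "deb://tests/sid.packages" gives Some (Deb, "tests/sid.packages"), and each format can then be handed to its own parser before everything is converted to cudf.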

distcheck handles gz and bz2 compressed files transparently. However, if you care about performance, you should decompress your input file first and then parse it with distcheck, as it often takes more time to decompress the file on the fly than to run the installability test itself. There is also an experimental database backend that is not compiled by default at the moment.

Output

Regarding the output, I’ve already explained the main differences in an old post. As a quick reminder, the old edos-debcheck had two output options. The first was a human readable, unstructured output that was a handy source of information when running the tool interactively. The second was an xml based format (without a dtd or a schema, I believe) that was used for batch processing.

distcheck has only one output type, in the YAML format, that aims to be both human and machine readable. Hopefully this will cater for both needs. Moreover, I’ve just recently added to the output of distcheck a summary of who is breaking what. The output of edos-debcheck was basically a map from packages to the reasons of the breakage. In addition to this information, distcheck also gives a map from each reason (a missing dependency or a conflict) to the list of packages that are broken by it. This additional info is off by default, but I think it can be nice to know which missing dependency is responsible for the majority of problems in a distribution…

For example, calling distcheck with --summary:

$./distcheck.native --summary deb://tests/sid.packages 
backgroud-packages: 29589
foreground-packages: 29589
broken-packages: 143
missing-packages: 138
conflict-packages: 5
unique-missing-packages: 52
unique-conflict-packages: 5
summary:
 -
  missing:
   missingdep: libevas-svn-05-engines-x (>= 0.9.9.063)
   packages:
    -
     package: enna-dbg
     version: 0.4.0-4
     architecture: amd64
     source: enna (= 0.4.0-4)
    -
     package: enna
     version: 0.4.0-4
     architecture: amd64
     source: enna (= 0.4.0-4)
 -
  missing:
   missingdep: libopenscenegraph56 (>= 2.8.1)
   packages:
    -
     package: libosgal1
     version: 0.6.1-2+b3
     architecture: amd64
     source: osgal (= 0.6.1-2)
    -
     package: libosgal-dev
     version: 0.6.1-2+b3
     architecture: amd64
     source: osgal (= 0.6.1-2)
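
The summary is essentially the per-package report turned inside out. Here is a minimal sketch of that inversion, using simplified types that I made up for the example (the names reason and summarize are hypothetical, and packages are plain strings here).

```ocaml
(* Hypothetical, simplified reasons for a breakage. *)
type reason =
  | Missing of string (* the dependency that cannot be satisfied *)
  | Conflict of string * string (* the two conflicting packages *)

(* Invert a report (package -> reasons) into a summary
   (reason -> packages broken by it). *)
let summarize (report : (string * reason list) list) : (reason * string list) list =
  let tbl : (reason, string list) Hashtbl.t = Hashtbl.create 16 in
  List.iter
    (fun (pkg, reasons) ->
      List.iter
        (fun r ->
          let pkgs = try Hashtbl.find tbl r with Not_found -> [] in
          Hashtbl.replace tbl r (pkg :: pkgs))
        reasons)
    report;
  (* packages were accumulated in reverse order of appearance *)
  Hashtbl.fold (fun r pkgs acc -> (r, List.rev pkgs) :: acc) tbl []
```

Applied to a report where enna-dbg and enna are both missing libevas-svn-05-engines-x (>= 0.9.9.063), this yields a single entry mapping that dependency to the two packages, as in the first stanza above. The order of the entries in the result is not specified.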

Below I give a small example of the edos-debcheck output compared to the new yaml based output.

$cat tests/sid.packages | edos-debcheck -failures -explain
Completing conflicts...                                            * 100.0%
Conflicts and dependencies...                                      * 100.0%
Solving                                                            * 100.0%
zope-zms (= 1:2.11.1-03-1): FAILED
  zope-zms (= 1:2.11.1-03-1) depends on missing:
  - zope2.10
  - zope2.9
zope-tinytableplus (= 0.9-19): FAILED
  zope-tinytableplus (= 0.9-19) depends on missing:
  - zope2.11
  - zope2.10
  - zope2.9
...

And an extract from the distcheck output (the order is different. I cut and pasted parts of the output here…)

$./distcheck.native -f -e deb://tests/sid.packages
report:
 -
  package: zope-zms
  version: 1:2.11.1-03-1
  architecture: all
  source: zope-zms (= 1:2.11.1-03-1)
  status: broken
  reasons:
   -
    missing:
     pkg:
      package: zope-zms
      version: 1:2.11.1-03-1
      architecture: all
      missingdep: zope2.9 | zope2.10
 -
  package: zope-tinytableplus
  version: 0.9-19
  architecture: all
  source: zope-tinytableplus (= 0.9-19)
  status: broken
  reasons:
   -
    missing:
     pkg:
      package: zope-tinytableplus
      version: 0.9-19
      architecture: all
      missingdep: zope2.9 | zope2.10 | zope2.11
...

Future

The roadmap to release version 1.0 of distcheck is as follows:

add background and foreground package selection. This feature will allow the user to specify a larger universe (background packages), but check only a subset of this universe (foreground packages). This should allow users to select packages using grep-dctrl and then pipe them to distcheck. At the moment we can select individual packages on the command line, or we can use an expression like bash (<= 2.7) to check all versions of bash in the universe with a version less than or equal to 2.7.

code cleanup and a bit of refactoring between distcheck and buildcheck (a frontend for distcheck that allows us to report broken build dependencies)

consider essential packages while performing the installation test. There are a few things we still have to understand here, but the idea would be to detect possible problems related to the implicit presence of essential packages in the distribution. At the moment, distcheck performs the installation test in the empty universe, while ideally the universe should contain all essential packages.

finish the documentation. The effort is underway and we hope to finalize it shortly and release the debian package in experimental.

dose3 distcheck

A while ago I wrote about the new distcheck tool upcoming in dose3. I’ve recently updated the proposal on the debian wiki to reflect recent changes in the yaml data structure. The idea was to remove redundant information, to make it easier to read and at the same time include enough details to make it easy to use from a script. I’ll write down a small example to explain the format. A package can be broken because of a missing package or because of a conflict. For a missing package we’ll have a stanza like this :

  package: libgnuradio-dev
  version: 3.2.2.dfsg-1
  architecture: all
  source: gnuradio (= 3.2.2.dfsg-1)
  status: broken
  reasons:
   -
    missing:
     pkg:
      package: libgruel0
      version: 3.2.2.dfsg-1+b1
      architecture: amd64
      missingdep: libboost-thread1.40.0 (>= 1.40.0-1)
     paths:
      -
       depchain:
        -
         package: libgnuradio-dev
         version: 3.2.2.dfsg-1
         architecture: all
         depends: libgnuradio (= 3.2.2.dfsg-1)
        -
         package: libgnuradio
         version: 3.2.2.dfsg-1
         architecture: all
         depends: libgnuradio-core0
        -
         package: libgnuradio-core0
         version: 3.2.2.dfsg-1+b1
         architecture: amd64
         depends: libgruel0 (= 3.2.2.dfsg-1+b1)

The first part gives details about the package libgnuradio-dev, specifying its status, source and architecture. The second part is the reason for the problem. In this case it is a missing package that is needed to install libgnuradio-dev. missingdep is the dependency of the package libgruel0 that cannot be satisfied, in this case libboost-thread1.40.0 (>= 1.40.0-1).

The paths component gives all possible depchains from the root package libgnuradio-dev to libgruel0. Notice that we do not include the last node in the dependency chain, to avoid a useless repetition. Of course there might be more than one path to reach libgruel0; distcheck will unroll all of them. Because of the structure of debian dependencies, there are usually not that many paths.

The other possible cause of a problem is a conflict. Consider the following :

  package: a
  version: 1
  status: broken
  reasons:
   -
    conflict:
     pkg1:
      package: e
      version: 1
     pkg2:
      package: f
      version: 1
     depchain1:
      -
       depchain:
        -
         package: a
         version: 1
         depends: b
        -
         package: b
         version: 1
         depends: e
     depchain2:
      -
       depchain:
        -
         package: a
         version: 1
         depends: d
        -
         package: d
         version: 1
         depends: f

This is the general case of a deep conflict. I use an artificial example here instead of a concrete one since this case is not very common and I was not able to find one. To put everything in context, this is the example I’ve used (it’s in cudf format, but I think you get the gist of it):

package: a
version: 1
depends: b, d

package: b
version: 1
depends: e

package: d
version: 1
depends: f

package: f
version: 1
conflicts: e

package: e
version: 1
conflicts: f

The first part of the distcheck report is as before, with details about the broken package. Since this is a conflict, and all conflicts are binary, we give the two packages involved in the conflict first. Packages f and e are in conflict, but they are not direct dependencies of package a. For this reason, we output the two paths that lead from a to e and to f. All dependency chains for each conflict are grouped together. Again, since there might be more than one way to reach the conflicting packages from a, we can have more than one depchain.
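
To show how such depchains can be computed, here is a small sketch that enumerates them on the toy universe above. It is a simplification under stated assumptions: dependencies are plain package names (no versions, no disjunctions), and the names universe, deps and depchains are made up for the example.

```ocaml
(* The toy universe from the example: package name -> direct dependencies. *)
let universe =
  [ ("a", [ "b"; "d" ]); ("b", [ "e" ]); ("d", [ "f" ]); ("e", []); ("f", []) ]

let deps (p : string) : string list =
  try List.assoc p universe with Not_found -> []

(* All dependency chains from [root] to [target]. The target itself is
   left out of each chain, as in the report. [seen] holds the packages
   already on the current chain and guards against dependency cycles. *)
let depchains (root : string) (target : string) : string list list =
  let rec walk seen p =
    if String.equal p target then [ [] ]
    else if List.mem p seen then []
    else
      List.concat
        (List.map
           (fun d -> List.map (fun chain -> p :: chain) (walk (p :: seen) d))
           (deps p))
  in
  walk [] root
```

On this universe, depchains "a" "e" gives the single chain a, b and depchains "a" "f" gives a, d, which are the two depchains of the report.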

Another important upcoming change in distcheck (to be implemented soon) is the ability to check if a package is in conflict with an Essential package. In the past, edos-debcheck always checked the installability of a package in the empty universe. This assumption does not actually hold for debian, as all essential packages should always be installed. For this reason, distcheck will now check installability not in an empty universe, but in a universe with all essential packages installed.

This check is not going to be foolproof though. Because of the semantics of essential packages, although it is not possible to remove an essential package tout court, an essential package can be replaced by a non essential package via the replace mechanism. For example, poking around with this feature I noticed that the package upstart in sid replaces sysvinit and is in conflict with it. This is perfectly fine, as it gives a mechanism to upgrade and replace essential components of the system. At the same time, it does not fit the edos-debcheck philosophy of checking packages for installation problems in the empty universe (or in a universe with all essential packages installed). At the moment we are still thinking about how to address this problem (in the long term the solution will be to add the replace semantics to distcheck), but for the time being we will just provide an option to check packages w.r.t. essential packages, aware that this can lead to false positives.

This work is of course done in collaboration with the mancoosi team in Paris.

Dose3 is still not ready for prime time. We are preparing debian packages and we plan to upload them to experimental in the near future.

edos-distcheck - new YAML output format

During the last two days I spent some time implementing part of the [http://wiki.debian.org/EDOS/ProposalDose3 proposed features for distcheck/ edos-distcheck]. Since everybody is at debconf and talk is silver but code is gold, I hope that a real implementation can get the ball rolling and bring us closer to a stable release of the next generation of edos/mancoosi tools.

In particular, this post is about the new YAML output format for distcheck. The rationale for using YAML is to have a data structure that is at the same time human and machine friendly. There are a lot of scripts in debian that rely on distcheck, and we want to provide a grep friendly output that at the same time doesn’t hurt your eyes. The other proposed solution was json, but it was ditched in favor of YAML. We also removed the xml output.

In order to provide a machine readable output and to minimize parsing mistakes, I used the schema language proposed here. This is the resulting data structure definition :

type: seq
sequence:
  - type: map
    mapping:
      "package": { type: str, required: true }
      "version": { type: text, required: true }
      "status":  { type: str, enum: [ broken, ok ], required: true }
      "installationset":
         type: seq
         sequence:
           - type: map
             mapping:
               "package": { type: str, required: true }
               "version": { type: text, required: true }
      "reasons":
         type: seq
         sequence:
           - type: map
             mapping:
               "conflict":
                  type: map
                  mapping:
                    "pkg1":
                      type: map
                      required: true
                      mapping:
                        "package": { type: str, required: true }
                        "version": { type: text, required: true }
                    "pkg2":
                      type: map
                      required: true
                      mapping:
                        "package": { type: str, required: true }
                        "version": { type: text, required: true }
                    "paths":
                      type: seq
                      sequence:
                        - type: map
                          mapping:
                            "path":
                              type: seq
                              sequence:
                                - type: map
                                  mapping:
                                    "package": { type: str, required: true }
                                    "version": { type: text, required: true }
                                    "vpkg": { type: str, required: false}
               "missing":
                  type: map
                  mapping:
                    "pkg":
                      type: map
                      required: true
                      mapping:
                        "package": { type: str, required: true }
                        "version": { type: text, required: true }
                        "vpkg": { type: str, required: false}
                    "paths":
                      type: seq
                      sequence:
                        - type: map
                          mapping:
                            "path":
                              type: seq
                              sequence:
                                - type: map
                                  mapping:
                                    "package": { type: str, required: true }
                                    "version": { type: text, required: true }
                                    "vpkg": { type: str, required: false}
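
For readers who prefer types to schemas, the same structure can be mirrored with a few OCaml types. This is only a sketch to help read the schema: all the names below (pkg, path, reason, status, entry, missing_deps) are mine and are not the types used inside distcheck.

```ocaml
(* One package element; vpkg is the optional dependency string. *)
type pkg = { name : string; version : string; vpkg : string option }

(* A path is a sequence of package elements. *)
type path = pkg list

type reason =
  | Conflict of pkg * pkg * path list (* pkg1, pkg2, paths *)
  | Missing of pkg * path list (* pkg, paths *)

type status = Broken | Installable (* "broken" and "ok" in the schema *)

(* One element of the top level sequence. *)
type entry = {
  subject : pkg;
  status : status;
  installationset : pkg list;
  reasons : reason list;
}

(* The unsatisfied dependencies recorded in the "missing" reasons. *)
let missing_deps (e : entry) : string list =
  List.fold_right
    (fun r acc ->
      match r with
      | Missing ({ vpkg = Some d; _ }, _) -> d :: acc
      | _ -> acc)
    e.reasons []
```

A script consuming the YAML output would typically walk the reasons of each broken entry in a similar way, for example to collect the dependencies that are missing.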

There are a few improvements over the old output of edos-distcheck. We are going to discuss these with a real example. The following two snippets are from the output of distcheck on sid/amd64 (04/08/2010).

Distcheck now outputs a list of broken or installable packages depending on the given options (--failures, --success, --explain and combinations thereof). Two quick examples:

-
 package: python-gi-dbg
 version: 0.6.0-1
 status: broken
 reasons:
  -
   conflict:
    pkg1:
     package: python-gobject
     version: 2.21.4-1
    pkg2:
     package: python-gi
     version: 0.6.0-1
    paths:
     -
      path:
       -
        package: python-gi-dbg
        version: 0.6.0-1
        vpkg: python-gi (= 0.6.0-1)
       -
        package: python-gi
        version: 0.6.0-1
     -
      path:
       -
        package: python-gi-dbg
        version: 0.6.0-1
        vpkg: python-gi (= 0.6.0-1)
       -
        package: python-gi
        version: 0.6.0-1
        vpkg: python-gobject (>= 2.20)
       -
        package: python-gobject
        version: 2.21.4-1

In the example above, the package python-gi-dbg is broken because there is a conflict between the packages python-gobject and python-gi. The reason why python-gi-dbg is affected by this conflict is explained by following the dependency chains from python-gi-dbg to the two offending packages. Note that for each package element of each path we specify the vpkg, that is, the dependency (as it is written in the control file) that leads to the conflict. Since a dependency can be a virtual package or a package with a version constraint, it can be expanded to a disjunction of packages (think of a dependency on mta-agent, which can be expanded to postfix, exim or sendmail…). All possible paths to an offending package are reported.

Likewise, if a package is broken because there is an unfulfilled dependency, distcheck will show the path leading to the problem. In the following example, the package gnash-tools is broken because its two alternative dependencies, gnash-common and gnash-common-opengl, both depend on the missing package libboost-date-time1.40.0 (>= 1.40.0-1).

-
 package: gnash-tools
 version: 0.8.7-2+b1
 status: broken
 reasons:
  -
   missing:
    pkg:
     package: gnash-common
     version: 0.8.7-2+b1
     vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)
    paths:
     -
      path:
       -
        package: gnash-tools
        version: 0.8.7-2+b1
        vpkg: gnash-common-opengl (= 0.8.7-2+b1) | gnash-common (= 0.8.7-2+b1)
       -
        package: gnash-common
        version: 0.8.7-2+b1
        vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)
  -
   missing:
    pkg:
     package: gnash-common-opengl
     version: 0.8.7-2+b1
     vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)
    paths:
     -
      path:
       -
        package: gnash-tools
        version: 0.8.7-2+b1
        vpkg: gnash-common-opengl (= 0.8.7-2+b1) | gnash-common (= 0.8.7-2+b1)
       -
        package: gnash-common-opengl
        version: 0.8.7-2+b1
        vpkg: libboost-date-time1.40.0 (>= 1.40.0-1)

The code is still in flux and is not ready for production yet (everything is in the mancoosi svn). I hope this is a good step in the right direction. Comments on the debian wiki are welcome.

If we compare the output of distcheck with the old edos-debcheck we get the following:

$cat /var/lib/apt/lists/ftp.debian.org_debian_dists_unstable_main_binary-amd64_Packages | edos-debcheck -failures -explain
[...]
holotz-castle-milanb (= 0.0.20050210-1): FAILED
  holotz-castle-milanb (= 0.0.20050210-1) depends on one of:
  - holotz-castle (= 1.3.14-2)
  holotz-castle-data (= 1.3.14-2) and holotz-castle-milanb (= 0.0.20050210-1) conflict
  holotz-castle (= 1.3.14-2) depends on one of:
  - holotz-castle-data (= 1.3.14-2)

$./debcheck --explain --failures /var/lib/apt/lists/ftp.debian.org_debian_dists_unstable_main_binary-amd64_Packages
[...]
-
 package: holotz-castle-milanb
 version: 0.0.20050210-1
 status: broken
 reasons:
  -
   conflict:
    pkg1:
     package: holotz-castle-milanb
     version: 0.0.20050210-1
    pkg2:
     package: holotz-castle-data
     version: 1.3.14-2
    paths:
     -
      path:
       -
        package: holotz-castle-milanb
        version: 0.0.20050210-1
        vpkg: holotz-castle
       -
        package: holotz-castle
        version: 1.3.14-2
        vpkg: holotz-castle-data (= 1.3.14-2)
       -
        package: holotz-castle-data
        version: 1.3.14-2