latex tables from csv files

While writing scientific papers we often feel the need to back our claims with evidence and data. This can be done in different ways: tables, graphs, or nice pictures (or something else if you feel creative). The point is that to produce this data I often end up writing ad-hoc scripts to analyze my results, involving a million calls to awk, sed, sort, uniq, etc.

What I want is a more productive workflow to streamline the boring pipeline Producer | Analyzer | latex. First I need a suitable output format to collect data from my experiments. In the past I often collected raw data in an unstructured format, then used some kind of parser to extract the information needed for a particular figure. Printing unstructured data is a plain bad idea and a pain, as it has to be parsed again before it can be used. Moreover, reusing an old parser is often difficult, as the nature of the experiment can be completely different, and so can the format of its output.

The solution to this problem is to print your results in a structured format. This removes the need to write a new parser every time, and it also pushes me to be more consistent across all my experiments. The format itself is not very important. It can be xml, for example, or, if you are not that much of a masochist, something following the json or yaml standards. I’ve chosen yaml, a data serialization language designed to be both human and machine readable. Yaml is fairly easy to produce and very well supported in many programming languages. In particular, yaml is essentially a superset of json, so for simple data structures you can also reuse a json printer if you don’t have a yaml printer.
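To make the "reuse a json printer" trick concrete, here is a minimal sketch, assuming python and only its standard json module. The function name dump_results is made up for illustration; the record is the first row of the sample data shown further down, and the field names are the keys I later pass to \DTLloaddb.

```python
import json


def dump_results(records):
    """Serialize a list of result records as one json document.

    For flat records like these, the output can also be read by a yaml parser.
    """
    return json.dumps(records, indent=2, sort_keys=True)


# One record per measurement; field names are illustrative, not mandated.
records = [
    {
        "source": "gcc-4.3 (= 4.3.2-1.1)",
        "package": "gcc-4.3-base (= 4.3.2-1.1)",
        "target": "> 4.3.2-1.1",
        "brokenpkg": 20079,
        "impactset": 20128,
        "brokensource": 13757,
    },
]

print(dump_results(records))
```

The producer only has to build plain dictionaries; everything about the on-disk syntax lives in one function, so swapping in a real yaml printer later would not touch the experiment code.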

The second step is to parse and analyze the experimental data. I often do this in python. The choice here is quite simple: mangling text with python is very easy, there are a lot of libraries (both native ones and bindings), and there is a very nice parser for yaml. If this is not enough, python-numeric, python-matplotlib and python-stats should convince you to adopt it for this task. Surely perl is another choice, but my sense of aesthetics doesn’t allow me to go that way.
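A rough sketch of what such an analysis step can look like, under a few assumptions: the input is json-compatible text, so the standard json module is enough (with PyYAML installed, yaml.safe_load could be used instead for real yaml files); the function name top_breakers and the "rank by broken packages" analysis are just examples; the two records are taken from the sample data shown further down.

```python
import json

# Structured experiment output, here inlined for the sake of the example.
RAW = """
[
  {"package": "libc6 (= 2.7-18)", "target": "> 2.7-18", "brokenpkg": 567},
  {"package": "gcc-4.3-base (= 4.3.2-1.1)", "target": "> 4.3.2-1.1", "brokenpkg": 20079}
]
"""


def top_breakers(text, n):
    """Return the n records with the highest number of broken packages."""
    records = json.loads(text)
    ranked = sorted(records, key=lambda r: r["brokenpkg"], reverse=True)
    return ranked[:n]


for rec in top_breakers(RAW, 1):
    print(rec["package"], rec["brokenpkg"])
```

No regular expressions and no awk: the parser hands back lists and dictionaries, and the analysis is ordinary python on top of them.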

The third and final step is to convert everything to latex. Yes, it is true that I could generate latex-compatible output directly with python, but this would make the pipeline a bit less flexible: I might want to use the same data in a web page, for example, without having to write a second printer for html. The solution is to have a generic csv printer and then perform the final conversion with an off-the-shelf tool.
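A generic csv printer can be very small. The following is only a sketch, assuming python's standard csv module; the names to_csv and COLUMNS are made up for illustration, and the record is the first row of the sample data shown further down. It writes ‘|’ separated rows without a header.

```python
import csv
import io

# Column order of the output file; illustrative names.
COLUMNS = ["source", "package", "target", "brokenpkg", "impactset", "brokensource"]


def to_csv(records, delimiter="|"):
    """Render records as delimiter-separated text, one row per record, no header."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for rec in records:
        writer.writerow([rec[col] for col in COLUMNS])
    return buf.getvalue()


records = [
    {
        "source": "gcc-4.3 (= 4.3.2-1.1)",
        "package": "gcc-4.3-base (= 4.3.2-1.1)",
        "target": "> 4.3.2-1.1",
        "brokenpkg": 20079,
        "impactset": 20128,
        "brokensource": 13757,
    },
]

print(to_csv(records), end="")
```

Since the delimiter is a parameter, the same printer can feed a latex document, a spreadsheet or an html generator without knowing anything about them.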

For latex, which is what this entire post is about, I’ve discovered the datatool package. It is a pretty neat way to embed csv tables (and I think it supports other formats as well) directly into your latex document, while taking care of the formatting in the document itself.

For example, consider this sample data in csv format:

gcc-4.3 (= 4.3.2-1.1) | gcc-4.3-base (= 4.3.2-1.1) | > 4.3.2-1.1 | 20079 | 20128 | 13757
gcc-4.3 (= 4.3.2-1.1) | libstdc++6 (= 4.3.2-1.1) | > 4.3.2-1.1 | 14951 | 14964 | 10573
gcc-4.3 (= 4.3.2-1.1) | cpp-4.3 (= 4.3.2-1.1) | > 4.3.2-1.1 | 2200 | 2226 | 1566
perl (= 5.10.0-19) | perl-modules (= 5.10.0-19) | > 5.10.0-19 | 1678 | 7898 | 1488
perl (= 5.10.0-19) | perl (= 5.10.0-19) | > 6 | 1678 | 7898 | 1488
perl (= 5.10.0-19) | perl (= 5.10.0-19) | 5.10.0-19 < . < 6 | 1678 | 7898 | 1488
python-defaults (= 2.5.2-3) | python (= 2.5.2-3) | > 3 | 1079 | 2367 | 897
python-defaults (= 2.5.2-3) | python (= 2.5.2-3) | 2.06 < . < 3 | 1075 | 2367 | 894
gtk+2.0 (= 2.12.11-4) | libgtk2.0-0 (= 2.12.11-4) | > 2.12.11-4 | 796 | 2694 | 624
glibc (= 2.7-18) | libc6 (= 2.7-18) | > 2.7-18 | 567 | 20126 | 471

It is a simple ‘|’ separated file with six columns and no header (I don’t like commas). The datatool latex package is part of texlive-latex-extra in debian, and to use it you just need to add \usepackage{datatool} to your preamble. To produce a nice looking latex table, you first need to set the proper separator with \DTLsetseparator and then load the file with the \DTLloaddb command. Without further ado, you can then just use the \DTLdisplaydb{table1} command to produce the table. Awesome!

\DTLsetseparator{|}
\DTLloaddb[noheader,keys={source,package,target,brokenpkg,impactset,brokensource}]{table1}{table1.csv}

\begin{table}[htbp]
  \caption{}
  \centering
  \DTLdisplaydb{table1}
\end{table}

But this is not very nice, as there are fields that you don’t want to display. The datatool package is actually pretty flexible, and this is how you print a table with only three columns:

\begin{table}[htbp]
  \caption{}
  \centering
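  \begin{tabular}{lll}
    % header row (\textbf replaces the deprecated \bf)
    \textbf{Package} & \textbf{Target Version} & \textbf{Broken}%
    % bind one macro per displayed column to its key, then emit one row per record
    \DTLforeach{table1}{%
      \package=package,\target=target,\brokensource=brokensource}{%
      \\
      \package & $\target$ & \brokensource}
  \end{tabular}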
\end{table}

There are a lot of nice shortcuts to print your table. Judging from the documentation, it looks like a very powerful tool to have. This made my day.