The approach used for collecting the data presented in this paper is, in summary, as follows:
Which source code makes a Debian release?
Fortunately, the source code for current and past Debian releases is archived and available to everyone on the Internet. The only problem is to determine the list of source packages for any given release, and where to access them.
Downloading and collecting data
Once we know which files to download, we have to download all of them before we can gather data. Since the unpacked sources for a Debian release are very large, we chose to work on a per-package basis, gathering all the relevant data from each package (mainly, the number of lines of code) before deleting it and downloading the next one.
Final analysis
Analyze the collected data and obtain statistics such as the total number of SLOC in the release, the SLOC for each package, the SLOC for each of the programming languages considered, etc.
In the following sections these three steps are described in more detail.
The Debian packaging system considers two kinds of packages: source and binary. One or more binary packages can be built automatically from each source package. For this paper, only source packages are relevant, and therefore we will no longer refer to binary packages.
When building a source package, a Debian developer starts with the "original" source directory for the piece of software. In Debian parlance, that source is called "upstream". The Debian developer patches the upstream sources if needed, and creates a debian directory with all the Debian configuration files (including the data needed to build the binary package). Then the source package is built, usually (but not always) consisting of three files: the upstream sources (a tar.gz file), the patches needed to get the Debian source directory (a diff.gz file, including both the patches to the upstream sources and the debian directory), and a description file (with extension dsc). dsc files are present only in the latest releases. Patch files are not present for "native" source packages (those developed for Debian, with no upstream sources).
Source packages of current Debian releases are part of the Debian archive. For every release, they reside in the source directory. There are sites on the Internet that host the source packages for every official Debian release to date (usually, mirrors of archive.debian.org). Since Debian 2.0, a Sources.gz file is present in the source directory of every release, with information about the source packages for the release, including the files that compose each package. This is the information we use to determine which source packages, and which files, have to be considered for Debian 2.2.
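As an illustration of how the package and file lists can be read from a Sources.gz file, the following is a minimal Python sketch (not the scripts used for this paper). It assumes the usual stanza layout of a Sources index: stanzas separated by blank lines, a `Package:` field, and a `Files:` field whose indented continuation lines hold a checksum, a size and a file name. The function names, the sample package names (`foo`, `bar`) and the checksums are hypothetical.

```python
import gzip


def parse_sources(text):
    """Parse Sources-style text into {package name: [file names]}."""
    packages = {}
    for stanza in text.strip().split("\n\n"):
        name = None
        files = []
        in_files = False
        for line in stanza.splitlines():
            if line.startswith("Package:"):
                name = line.split(":", 1)[1].strip()
                in_files = False
            elif line.startswith("Files:"):
                in_files = True
            elif in_files and line.startswith(" "):
                # Continuation line: "<checksum> <size> <file name>"
                fields = line.split()
                if len(fields) == 3:
                    files.append(fields[2])
            else:
                in_files = False
        if name is not None:
            packages[name] = files
    return packages


def read_sources_gz(path):
    """Read a gzip-compressed Sources file from a local path."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return parse_sources(handle.read())


# Invented sample data: package names, versions and checksums are not real.
SAMPLE = """Package: foo
Version: 1.0-1
Files:
 00000000000000000000000000000000 600 foo_1.0-1.dsc
 11111111111111111111111111111111 50000 foo_1.0.orig.tar.gz
 22222222222222222222222222222222 3000 foo_1.0-1.diff.gz

Package: bar
Version: 0.3
Files:
 33333333333333333333333333333333 500 bar_0.3.dsc
 44444444444444444444444444444444 20000 bar_0.3.tar.gz
"""

if __name__ == "__main__":
    for package, files in parse_sources(SAMPLE).items():
        print(package, files)
```

In this sample, `bar` has no diff.gz file, which is how a "native" package would appear in the list.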
However, not all packages in Sources.gz should be analyzed when counting lines of code. The main reason not to do so is the existence, in some cases, of several versions of the same piece of software. For instance, in Debian 2.2 we can find the source packages emacs19 (for emacs-19.34) and emacs20 (for emacs-20.7). Counting both packages would imply counting Emacs twice, which is not the intended procedure. Therefore, a manual inspection of the list of packages is needed for every release, detecting those which are essentially versions of the same software, and choosing one "representative" for each family of versions.
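The selection itself is manual, but a simple script can point out candidates for inspection. The sketch below is one possible heuristic, assumed here for illustration only: it groups package names that differ only in a trailing version-like suffix. The function name and the lowercase names `php3` and `php4` are assumptions.

```python
import re
from collections import defaultdict


def version_families(names):
    """Group names that differ only in a trailing numeric suffix.

    Returns only the groups with more than one member, since those are
    the ones that need a manual decision.
    """
    families = defaultdict(list)
    for name in names:
        # Strip a trailing suffix such as "19", "-2" or "2.95"
        stem = re.sub(r"[-.]?\d+(\.\d+)*$", "", name)
        families[stem or name].append(name)
    return {
        stem: sorted(members)
        for stem, members in families.items()
        if len(members) > 1
    }


if __name__ == "__main__":
    candidates = version_families(
        ["emacs19", "emacs20", "php3", "php4", "gcc", "gnat"]
    )
    for stem, members in candidates.items():
        print(stem, members)
```

A heuristic like this only flags candidates: it cannot tell whether two versions are really the same software, and it would not group related packages with unrelated names, so the final choice still has to be made by hand.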
Choosing a single representative may cause an underestimation of the number of lines in the release, since different versions of the same package may share much of their code, but not all of it (consider, for instance, PHP4 and PHP3, the former being an almost complete rewrite of the latter). However, we think this effect is negligible, and is compensated by some overestimations (see below).
In other cases, we have decided to analyze packages which may have significant quantities of code in common. This is the case, for instance, of emacs and xemacs. Since the latter is a code fork of the former, both share a good quantity of lines which, even when not exactly equal, are evolutions of the same "ancestors". Another similar case is gcc and gnat. The latter, an Ada compiler, is built upon the former (a C compiler), adding many patches and lots of new code. In those cases, we have considered the code different enough to treat them as separate packages. This probably leads to some overestimation of the number of lines of code in the release.
The final result of this step is the list of packages (and the files composing them) that we consider when analyzing the size of a Debian release. This list is compiled by hand (with the help of some very simple scripts) for each release.
Once the packages and files composing Debian 2.2 are determined, they are downloaded from a server in the network of Debian mirrors. Some simple Perl scripts were used to automate this process, which (for each package) consists of the following phases:
Downloading of the files composing the package
Extraction of the source directory corresponding to the upstream package (by untarring the tar.gz file). After extraction, data about this upstream source is gathered.
Patching of the upstream directory with the diff.gz file, to get the Debian source directory. After patching, data about it is gathered.
Deletion of the debian directory, to avoid counting maintainer scripts (stored in this directory), and gathering of data about this sans-debian Debian source package.
Not all packages have an upstream version. Therefore, during this process, some care has to be taken to differentiate these situations.
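The phases above, and the distinction between packages with and without an upstream version, can be sketched as follows. This is an illustrative Python outline, not the Perl scripts actually used. The function names and phase labels are hypothetical, and the handling of native packages (skipping the separate upstream count) is an assumption, since the exact treatment is not detailed here.

```python
def classify_files(files):
    """Identify the tarball, the Debian diff and the dsc file of a package."""
    parts = {"tarball": None, "diff": None, "dsc": None}
    for name in files:
        if name.endswith(".tar.gz"):
            parts["tarball"] = name
        elif name.endswith(".diff.gz"):
            parts["diff"] = name
        elif name.endswith(".dsc"):
            parts["dsc"] = name
    return parts


def measurement_plan(files):
    """Return the ordered phases to run for one source package."""
    parts = classify_files(files)
    if parts["tarball"] is None:
        raise ValueError("source package has no tar.gz file")
    if parts["diff"] is None:
        # Assumption: a native package has no upstream sources, so there
        # is nothing to patch and no separate upstream count to take.
        return [
            "extract",
            "count:debian",
            "remove-debian-dir",
            "count:sans-debian",
        ]
    return [
        "extract",
        "count:upstream",
        "apply-diff",
        "count:debian",
        "remove-debian-dir",
        "count:sans-debian",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(measurement_plan(["foo_1.0.orig.tar.gz", "foo_1.0-1.diff.gz"]))
    print(measurement_plan(["bar_0.3.tar.gz"]))
```

Separating the plan from its execution keeps the per-package decision easy to check before any files are downloaded or deleted.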
Data is gathered with the sloccount scripts, which are run three times for each package (once in each phase, see above) and store the line counts for each package in a separate directory, ready for later inspection and reporting.
The reason for fetching data three times for every package is to analyze the impact of the Debian developer on the source package. This impact can be in the form of patches to the source (usually to make it more stable and secure, to conform to Debian installation policy, or to add some functionality to it) or in installation scripts (which can be singled out when counting sans-debian source packages).
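To make the role of the three counts concrete, the following Python sketch shows one way they could be combined. It reflects our reading of the three phases rather than a documented formula: the difference between the sans-debian and upstream counts is attributed to patches, and the difference between the full Debian count and the sans-debian count to the debian directory. The function name and the figures in the example are invented.

```python
def maintainer_impact(upstream, debian, sans_debian):
    """Split the Debian developer's net contribution, in SLOC.

    upstream    -- count for the unpatched upstream sources
    debian      -- count for the patched Debian source directory
    sans_debian -- count for the patched sources without the debian directory
    """
    return {
        # Net lines added (or removed, if negative) by patches to upstream
        "patches": sans_debian - upstream,
        # Lines counted inside the debian directory
        "debian_dir": debian - sans_debian,
        # Overall net difference between the Debian and upstream sources
        "total": debian - upstream,
    }


if __name__ == "__main__":
    # Invented figures, for illustration only.
    print(maintainer_impact(upstream=10000, debian=10450, sans_debian=10300))
```

Note that these are net differences: a patch that removes as many lines as it adds would contribute zero under this accounting.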
The final result of this step is the collection of all the data fetched from the downloaded packages, organized by package, and ready to be analyzed. These data consist mainly of lists of files and line counts for them, split by language.
The last step is the generation of reports, using sloccount and some scripts, to study the gathered data. Since in this step all the fetched data is available locally, and in an easy-to-parse form, the analysis can be done quickly and repeated easily, looking for different kinds of information.
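As a rough illustration of this kind of report, the Python sketch below aggregates per-package, per-language counts into release totals. The input layout (a dictionary mapping each package to its counts by language), the function name, the language labels and all the figures are assumptions made for the example. They do not reproduce sloccount's own data files.

```python
from collections import Counter


def summarize(counts):
    """Aggregate {package: {language: sloc}} into release-level figures."""
    per_package = {
        package: sum(languages.values())
        for package, languages in counts.items()
    }
    per_language = Counter()
    for languages in counts.values():
        # Counter.update adds the counts for each language
        per_language.update(languages)
    return {
        "total": sum(per_package.values()),
        "per_package": per_package,
        "per_language": dict(per_language),
    }


if __name__ == "__main__":
    # Invented figures, for illustration only.
    sample = {
        "foo": {"c": 1200, "sh": 300},
        "bar": {"c": 800, "lisp": 500},
    }
    report = summarize(sample)
    print("total:", report["total"])
    print("per package:", report["per_package"])
    print("per language:", report["per_language"])
```

Because the input is already local and small, a summary like this can be recomputed at will with different groupings.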
The final result of this step is a set of reports and statistical analyses, using the data fetched in the previous step and considering them from different points of view. These results are presented in the following section.