Internals/Index

History

The index (_darcs/index) should not be confused with the patch index. It was introduced in Petr Ročkai’s 2009 Google Summer of Code project and made its way into Darcs around Darcs 2.3.1 or so (TODO: check history). José Neder added information about the inode numbers of the files in the 2013 Google Summer of Code project. The on-disk format underwent several minor revisions and is currently (2024-08-06) at 7 (the magic word is “HSI7”). One of the major changes has been to switch to high resolution time stamps.

Motivation

Many operations require that we know things about the working tree, and especially how it differs from the pristine tree (which represents the recorded state of the repository). This can be a costly affair if there are many files in either the working or the pristine tree. The index helps to make these operations faster. It has been designed (this is my personal inference, bf) after a close look at the index in Git and serves similar, though not quite the same purpose. Furthermore, later additions have allowed it to be used for other purposes, such as detecting file and directory moves (renames).

How it works

The index, located at _darcs/index, is a persistent datastructure that caches information about files in the working tree, namely

  • the latest known file modification timestamp and size
  • the (SHA256) hash of its content
  • the latest known file ID (inode number on Posix)

Timestamps are 8 byte values with a (nominal) resolution of one nanosecond; the actual resolution is OS-dependent but always high enough to reliably detect changes. All entries can be zero if they are unknown. An entry exists only if it is currently tracked by Darcs, i.e. it is a member of the pristine tree (the recorded state) plus the changes in pending (such as file additions, renames, etc).

The way we use this cache is by reading it into a Tree and then /overlaying/ that Tree over the actual working tree. See Darcs.Util.Tree.overlay and its documentation for details.

The tricky point to note about the index is that it is read and written using mmap. And when we read it, we are at the same time transparently modifying it. In particular, when reading a file entry, we check the actual status of the file on disk to see if modification time or size have changed, and then update the entry accordingly. In that case (and only in that case) we also read the file content to re-calculate its hash and store that as well.

This means that to make use of the index it must be writable by the user, even if the operation we perform does not change any of the actual repository data. For a long time Darcs could not perform any operations in a read-only repository, even read-only operations like whatsnew, log, or show files, because it ran into unhandled exceptions when trying to mmap the index file for reading and writing. These bugs have been fixed and nowadays Darcs will avoid using the index in such cases, outputting a warning that the index cannot be used (unless you pass –ignore-times which instructs Darcs to ignore the index completely).

Whenever we modify the repository (that is, the recorded state or the pending patch) the index must be /updated/ as well. This is different from (transparently) updating individual entries because it may involve adding new entries or removing existing ones. This cannot be done in-place, so we rename the existing index and write a new one, still using the data from the old index. Nowadays this is done as part of finalizing repository transactions.

UI

You can take a look at what the index contains in a human readable format with the darcs show index command.

The –ignore-times option disables use of the index. Mostly there for testing and perhaps a few other special situations.

Format

There are two entry types, a file entry and a directory entry. Both have a common binary format.

For each file, the index has a copy of the file’s last modification timestamp taken at the instant when the hash has been computed. This means that when file size and timestamp of a file in working copy matches those in the index, we assume that the hash stored in the index for given file is valid. These hashes are then exposed in the resulting ‘Tree’ object, and can be leveraged by eg. ‘diffTrees’ to compare many files quickly.

You may have noticed that we also keep hashes of directories. These are assumed to be valid whenever the complete subtree has been valid. At any point, as soon as a size or timestamp mismatch is found, the working file in question is opened, its hash (and timestamp and size) is recomputed and updated in-place in the index file (everything lives at a fixed offset and is fixed size, so this isn’t an issue). This is also true of directories: when a file in a directory changes hash, this triggers recomputation of all of its parent directory hashes; moreover this is done efficiently – each directory is updated at most once during an update run.

On-disk Format

This is extensively documented in the module header for Darcs.Util.Index.