The InterMezzo High Availability File System HOWTO

by the project members

v1.0.5, August 2001


This document explains the configuration and operation of the InterMezzo file system on Linux.

1. Acknowledgements

Many individuals have contributed to this HOWTO. Among the authors are Peter J. Braam, Rob Simmonds, Gordon Matzigkeit, Christopher Li and Shirish Phatak.

2. Disclaimer and License

InterMezzo is an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with InterMezzo.

InterMezzo is covered by the GPL. The GPL describes the warranties made to you, and can be found in the file COPYING.

Copyright on InterMezzo is held by Peter J. Braam, Stelias Computing, Carnegie Mellon University, Phil Schwan, Los Alamos National Laboratory, Red Hat, Inc., TurboLinux, Inc., Tacitus Systems, Inc. and Mountain View Data, Inc.

InterMezzo is a trademark of Stelias Computing. It may be used freely to refer to the software on the InterMezzo Web Site.

3. Introduction

3.1 What is InterMezzo?

InterMezzo is a file system that maintains replicas of collections of directories, known as filesets, residing on multiple computers. It keeps these replicas in sync by building a log of modifications and propagating that log to other nodes. The computers that express an interest in the replica are called the replicators of the fileset. InterMezzo has one server for the fileset, which plays an organizing role in exchanging the updates with replicators.

InterMezzo supports disconnected operation, i.e. it maintains a log to remember all updates that need to be forwarded when a failed communication channel comes back. This is a best-effort synchronization, since conflicting updates are possible during disconnected operation unless the configuration parameters are set to avoid this.

InterMezzo uses an existing disk file system as the storage location for all data. At present we support ext3; ReiserFS and XFS might be supported soon. When an ext3 formatted disk volume is mounted with file system type InterMezzo instead of ext3, the InterMezzo software starts managing all access to the file system. It keeps the logs of modification records and negotiates permits to modify the disk file system, to avoid conflicting updates during connected operation.

InterMezzo can use a basic internal file transfer mechanism or rely on the rsync protocol (see the Rsync web site).

3.2 Limitations in version 1.0.4

Security

Currently you should run InterMezzo only on trusted networks; the root users on the replicating systems need to be equally trusted. Only rudimentary security is built into the system so far, similar to NFS security (but without root squash). A good way to get a trusted network is to use IPSEC (see FreeSwan http://www.freeswan.org), CIPE (see http://sites.inka.de/sites/bigred/devel/cipe.html), or SSH tunnels. The SSL utility stunnel is somewhat harder to use since it spawns many daemons trying to reconnect. Support for POSIX ACL replication is available for the 2.2 kernel and forthcoming for 2.4. Some security improvements will be made as time progresses.

Recovery

The system currently has journal recovery in combination with ext3. After a system crash, the local disk file system, together with the KML, LML and last_rcvd files that hold the distributed state, recovers automatically. Recovery with peers is normally seamless as well.

Conflict Handling

The system does not currently have conflict handlers, only pessimistic, rigorous conflict detection. More extensive conflict resolution tools are being developed and should be available with the next major release. The design of the system means that conflicts can only occur when reconnecting after a period of disconnected operation and that conflicts can only occur on a client.

Fetch on demand

At the moment InterMezzo replicates an entire filesystem. However, a fetch on demand system will appear in a future version, which will allow partial replication of a filesystem. The first versions of this will fetch file data on demand but replicate metadata (directories and inodes) fully. Partial metadata caching may be implemented in future versions.

4. Installing InterMezzo

4.1 Overview

InterMezzo depends on a kernel that has the InterMezzo file system. There is also a user level file server and cache manager which are currently written in Perl. Finally there are some utilities to make InterMezzo file systems.

4.2 Getting the packages

The packages for version 1.0.4 are available from ftp://ftp.inter-mezzo.org:/pub/intermezzo/1.0.4/rh7.1/RPMS. These packages should install cleanly on a RedHat 7.1 system. You want to install either the 2.2 kernel package or the 2.4 kernel package.

4.3 Configuring the 2.4 kernel for booting

In order to boot the 2.4 kernel, you need to generate an initial ramdisk with mkinitrd as follows:

mkinitrd /boot/initrd-2.4.7-ext3_0.9.5-presto_1.0.4 2.4.7-ac9

To have LILO boot this kernel, add an entry like the following to your /etc/lilo.conf file:

image=/boot/vmlinuz-2.4.7_ext3_0.9.5_presto_1.0.4
        label=InterMezzo
        read-only
        root=/dev/hda1
        initrd=/boot/initrd-2.4.7-ext3_0.9.5-presto_1.0.4

4.4 Building the InterMezzo file system for a custom kernel

In order to get a kernel module for your kernel, you need to have the .config file and the kernel sources for your kernel. Proceed by first preparing your kernel sources, and then building the module:

cd /your/source/linux
make distclean
cp your.config  .config
make oldconfig dep
cd /usr/src/presto24-1.0.04
./configure --enable-linuxdir=/your/source/linux
make install

The same procedure works for Linux 2.2 kernels.

5. Configuring InterMezzo

5.1 Config files

Your default config directory is /etc/intermezzo. You may use the interactive inconfig command to generate the following configuration files, or manually create them.

The config files in versions 1.0 and later use the XML format instead of the Perl formats found in older versions.

/etc/intermezzo/sysid

Holds the name of your system, the presto device name and the IP bind address. Suppose your server has the name muskox, with IP address 192.168.0.3, and your clients are clientA and clientB. The sysid file on each host contains that host's name, presto device and IP bind address. For example, on muskox the file would contain:

<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />

Note that in early versions of InterMezzo, this file did not contain the name of the presto device; this field is now required.
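
As an illustration only, the attributes of a sysid file can be read with a few lines of Python. This is a minimal sketch and not part of InterMezzo: the function name parse_sysid is hypothetical, and treating name and psdev as mandatory while leaving bindaddr optional is an assumption based on the note above.

```python
import xml.etree.ElementTree as ET

# Sample content taken from the muskox example above.
SYSID_XML = '<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />'

def parse_sysid(text):
    """Return the name, psdev and bindaddr attributes of a sysid element.

    Assumption: name and psdev must be present; bindaddr may be absent.
    """
    elem = ET.fromstring(text)
    if elem.tag != "sysid":
        raise ValueError("expected a <sysid> element, got <%s>" % elem.tag)
    for key in ("name", "psdev"):
        if key not in elem.attrib:
            raise ValueError("sysid is missing the %s attribute" % key)
    return {
        "name": elem.attrib["name"],
        "psdev": elem.attrib["psdev"],
        "bindaddr": elem.get("bindaddr"),  # None when not given
    }

sysid = parse_sysid(SYSID_XML)
print(sysid["name"], sysid["psdev"], sysid["bindaddr"])
```

A sysid file written in the old style, without psdev, would be rejected by this sketch with a ValueError.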

/etc/intermezzo/serverdb

Holds a database of servers. The server structure is an XML server element, as follows:

<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" port="2222" 
    bindaddr="192.168.0.3" />
</serverdb>

The above contains a single server description for the server muskox with IP address 192.168.0.3. The port and bindaddr attributes are optional; the default port is 2222. Without a bindaddr the server listens on all interfaces for requests; with it, the server listens only on the bindaddr address. If you are running both a client and a server on the same system, you need to specify a different bindaddr for the server and the client(s).
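
The defaults described above can be made concrete with a short Python sketch. It is not how Lento reads the file; load_servers is a hypothetical helper, and the second server, bison, is invented here purely to show an entry that omits the optional attributes.

```python
import xml.etree.ElementTree as ET

DEFAULT_PORT = 2222  # default port stated in the text above

SERVERDB_XML = """
<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" port="2222"
    bindaddr="192.168.0.3" />
  <server name="bison" ipaddr="192.168.0.4" />
</serverdb>
"""

def load_servers(text):
    """Map each server name to its address, port and bind address."""
    servers = {}
    for elem in ET.fromstring(text.strip()).findall("server"):
        servers[elem.attrib["name"]] = {
            "ipaddr": elem.attrib["ipaddr"],
            # port is optional and falls back to the default
            "port": int(elem.get("port", DEFAULT_PORT)),
            # bindaddr is optional; None stands for "listen on all interfaces"
            "bindaddr": elem.get("bindaddr"),
        }
    return servers

servers = load_servers(SERVERDB_XML)
for name, info in servers.items():
    print(name, info)
```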

/etc/intermezzo/fsetdb

Holds a database of filesets. The fsetdb structure is an XML fileset element, as follows:

<fsetdb>
<fileset name="yourfsetname" servername="muskox" fetchtype="bulktype" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>

The above contains a single fileset description for a fileset called yourfsetname which is served by muskox. The fileset is replicated on hosts clientA and clientB.

The fetchtype can be the class name of a supported bulk mover. The default is "Rsync"; the simpler InterMezzo-managed bulk mover is called "Desc".
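
To show how the fileset description relates to the host names in the sysid files, the following Python sketch reads an fsetdb and reports the role a given host plays for each fileset. The helper names load_filesets and role_of are hypothetical, and the logic is only a reading of the description above, not Lento's actual code.

```python
import xml.etree.ElementTree as ET

FSETDB_XML = """
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>
"""

def load_filesets(text):
    """Map each fileset name to its server, fetchtype and replicators."""
    filesets = {}
    for elem in ET.fromstring(text.strip()).findall("fileset"):
        filesets[elem.attrib["name"]] = {
            "server": elem.attrib["servername"],
            # "Rsync" is the default bulk mover according to the text
            "fetchtype": elem.get("fetchtype", "Rsync"),
            "replicators": [(r.text or "").strip() for r in elem.findall("replicator")],
        }
    return filesets

def role_of(host, fileset):
    """Return 'server', 'replicator' or None for a host name."""
    if host == fileset["server"]:
        return "server"
    if host in fileset["replicators"]:
        return "replicator"
    return None

filesets = load_filesets(FSETDB_XML)
for host in ("muskox", "clientA", "clientC"):
    print(host, role_of(host, filesets["shared"]))
```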

/etc/fstab

To ease the mounting of InterMezzo filesets, add one of the following to the /etc/fstab file. For testing and development, using a loop device as the cache is easiest:

/tmp/cache  /izo0  intermezzo loop,fileset=fsetname,mtpt=/izo0,
      data=journal,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0

where /tmp/cache is a file associated with a loop device, /izo0 is a mount point (a directory), fsetname is the name of the fileset and /dev/intermezzo0 is the name of the presto device. The creation of the cache file and the presto device is explained in the examples at the end of this section. The kernel must be configured with loopback device support enabled to do this.

NOTE: The mount option data=journal is important for 2.4 kernels pending a bug fix in ext3.

Using a genuine block device is a little easier, because you do not need to set up a loop device. To use the block device /dev/hda9, the /etc/fstab file should contain:

/dev/hda9  /izo0 intermezzo fileset=fsetname,mtpt=/izo0,
prestodev=/dev/intermezzo0,cache_type=ext3,data=journal,noauto 0 0
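
The entries above are wrapped for display; in a standard /etc/fstab each entry occupies a single line and the option list contains no whitespace. The following Python sketch assembles such a line from its parts. The function fstab_entry and its parameter names are hypothetical; the option names themselves are the ones used in the examples above.

```python
def fstab_entry(cache, mount_point, fileset, prestodev="/dev/intermezzo0",
                cache_type="ext3", loop=False, journal_data=True):
    """Build a single-line /etc/fstab entry for an InterMezzo cache."""
    options = []
    if loop:
        options.append("loop")  # the cache is a regular file, not a block device
    options += [
        "fileset=" + fileset,
        "mtpt=" + mount_point,
        "prestodev=" + prestodev,
        "cache_type=" + cache_type,
    ]
    if journal_data:
        options.append("data=journal")  # see the NOTE above about 2.4 kernels
    options.append("noauto")
    # fstab fields: device, mount point, type, options, dump, pass
    return "%s %s intermezzo %s 0 0" % (cache, mount_point, ",".join(options))

print(fstab_entry("/dev/hda9", "/izo0", "fsetname"))
print(fstab_entry("/tmp/cache", "/izo0", "fsetname", loop=True))
```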

Other files

The file /izo0/.intermezzo/fsetname/kml contains the kernel modification log (the KML), which keeps track of all of the changes made in an InterMezzo filesystem. The file /izo0/.intermezzo/fsetname/last_rcvd is the last_rcvd file, which keeps track of the distributed synchronization state. In the current release of InterMezzo, the KML and last_rcvd files need to be created (usually by running mkizofs) before first mounting an InterMezzo filesystem.

5.2 Formatting an InterMezzo file system

Use the mkizofs tool to format an InterMezzo file system:

mkizofs -r fsetname -j /tmp/cache
mkizofs -r fsetname -j /dev/hdaX

The argument to the -r option gives the root fileset name for which an InterMezzo replication log will be created; the -j option causes an ext3 journal to be created. Please note that this requires e2fsprogs version 1.22 or later (see http://e2fsprogs.sourceforge.net). See mkizofs -h for further options, such as specifying the filesystem type.

5.3 Converting ext2/3 file systems to InterMezzo

If you have already initialized your cache filesystem, then you must manually create the needed InterMezzo metadata files:

mount -t ext2 -o loop /tmp/cache /izo0
mkdir -p /izo0/.intermezzo/fsetname/db
chgrp -R InterMezzo /izo0/.intermezzo
chmod 700 /izo0/.intermezzo
touch /izo0/.intermezzo/fsetname/{kml,lml,last_rcvd}
tune2fs -j /tmp/cache # if file system was ext2
umount /izo0

This example assumes that we are using the loopback device with the /tmp/cache filesystem, and that the fileset will be called fsetname.

Before you can mount these as InterMezzo you should manually replicate them to the replicators, so that the file systems are identical.

5.4 Three common configurations

Let's consider three common system configurations. For each we will give the config files and the correct invocations to start the server/cache manager.

One client and one server (typical use: laptop/desktop synchronization, backup, and synchronization of two web servers)

In this case we assume that the host muskox is serving the fileset shared and the host clientA is replicating the fileset. The following files are placed on both muskox and clientA.

/etc/intermezzo/serverdb

<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" />
</serverdb>

/etc/intermezzo/fsetdb

<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
</fileset>
</fsetdb>

/etc/intermezzo/sysid

On muskox this contains:

<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />

On clientA this contains:

<sysid name="clientA" psdev="/dev/intermezzo0" bindaddr="192.168.0.20" />

/etc/fstab

The following line is added on both muskox and clientA:

/tmp/fs0 /izo0 intermezzo loop,fileset=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,cache_type=ext3,noauto 0 0

/tmp/fs0

This file and the filesystem are created using the following commands:

dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
mkizofs -F /tmp/fs0

/izo0/.intermezzo/shared/kml

If we didn't run mkizofs above, we create the KML and last_rcvd files by first mounting the filesystem as ext3:

mkdir /izo0
mount -o loop /tmp/fs0 /izo0
mkdir -p /izo0/.intermezzo/shared
touch /izo0/.intermezzo/shared/{kml,last_rcvd}
umount /izo0

/dev/intermezzo0

This is created using the following commands:

mknod /dev/intermezzo0 c 185 0
chmod 700 /dev/intermezzo0

/etc/conf.modules

Your modules configuration file may also be called /etc/modules.conf. Add the line:

alias char-major-185 intermezzo

Before starting lento, mount the cache:

mkdir /izo0; mount /izo0

Now lento can be started on both muskox and clientA by typing:

lento

Two clients and one server (typical use: two remote offices)

/etc/intermezzo/serverdb

This can be the same as for the one client and one server case above.

/etc/intermezzo/fsetdb

<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>

This is the same as in the first example, but clientB is added to the replicators list.

/etc/intermezzo/sysid

This is the same as in the first example for muskox and clientA, and on clientB contains the following:

<sysid name="clientB" psdev="/dev/intermezzo0" bindaddr="192.168.0.21" />

/etc/fstab

This is the same as used with the one client and one server case above.

Using IPSec and ssh tunnels

Running over an encrypted tunnel:

ssh -f -x -L 3333:localhost:2222 -R 3333:localhost:2222

One client and one server on the same host (typical use: testing InterMezzo)

Suppose that we are running on the host muskox. To run multiple lentos on one host we need to use IP aliasing; the IP aliasing option must be compiled into your kernel (CONFIG_IP_ALIAS). This allows one interface to have more than one IP address associated with it. Suppose the name muskoxA1 and the IP address 192.168.0.100 are available.

/etc/hosts

Add the line:

192.168.0.100   muskoxA1        

Then add the ip-alias by typing:

    ifconfig eth0:1 muskoxA1 up

Then create two configuration files containing the following:

/etc/intermezzo/sysid

<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />

/etc/intermezzo/sysid.muskoxA1

<sysid name="muskoxA1" psdev="/dev/intermezzo1" bindaddr="192.168.0.100" />

The latter file will act as a sysid file for the lento running on the aliased IP address. Note that because we are running both the client and the server on the same system, we have to specify different devices for each, namely /dev/intermezzo0 and /dev/intermezzo1.

/etc/intermezzo/fsetdb

<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>muskoxA1</replicator>
</fileset>
</fsetdb>

To run the second lento, a second presto device and loopback cache are required. These are made as follows:

mknod /dev/intermezzo1 c 185 1
chmod 700 /dev/intermezzo1
dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
mkizofs -F /tmp/fs1

/etc/fstab

Note that two entries are needed here:

/tmp/fs0  /izo0      intermezzo loop,fileset=shared,prestodev=/dev/intermezzo0,
mtpt=/izo0,cache_type=ext3,noauto 0 0
/tmp/fs1  /izo1      intermezzo loop,fileset=shared,prestodev=/dev/intermezzo1,
mtpt=/izo1,cache_type=ext3,noauto 0 0

Now mount the two InterMezzo filesystems:

mount /izo0
mount /izo1

The lento acting as the server can be started as before:

lento

The lento acting as the replicator has to be told which sysid file to read (which tells it which presto device to use). The second lento is started as follows:

lento.pl --idfile=sysid.muskoxA1

5.5 Configuration Checking

Currently the config_check tool is not working, because the XML version of the config check is not ready yet.

A script is provided to perform simple checks on the configuration files. The script is called config_check and can be found in the .../intermezzo/tools directory.

If Lento is using the standard system id file, /etc/intermezzo/sysid, the script can be run without arguments. If a different system id file is being used, the --idfile=my_idfile flag can be used to indicate this.

It is also possible to use a configuration directory other than /etc/intermezzo by using the --configdir=my_confdir flag.
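
The kind of consistency check such a script might perform can be sketched in Python. This is not the shipped config_check script, whose exact checks are not described here; the function check_config and both of its rules (every fileset names a server listed in serverdb, and the local host appears in at least one fileset) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

def check_config(sysid_xml, serverdb_xml, fsetdb_xml):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    host = ET.fromstring(sysid_xml.strip()).get("name")
    servers = {s.get("name")
               for s in ET.fromstring(serverdb_xml.strip()).findall("server")}
    known_here = False
    for fset in ET.fromstring(fsetdb_xml.strip()).findall("fileset"):
        name, server = fset.get("name"), fset.get("servername")
        if server not in servers:
            problems.append("fileset %s: server %s is not in serverdb" % (name, server))
        replicators = [(r.text or "").strip() for r in fset.findall("replicator")]
        if host == server or host in replicators:
            known_here = True
    if not known_here:
        problems.append("host %s neither serves nor replicates any fileset" % host)
    return problems

# Files from the one client and one server example in section 5.4.
SERVERDB = '<serverdb><server name="muskox" ipaddr="192.168.0.3" /></serverdb>'
FSETDB = ('<fsetdb><fileset name="shared" servername="muskox" >'
          '<replicator>clientA</replicator></fileset></fsetdb>')
SYSID_A = '<sysid name="clientA" psdev="/dev/intermezzo0" bindaddr="192.168.0.20" />'
SYSID_B = '<sysid name="clientB" psdev="/dev/intermezzo0" bindaddr="192.168.0.21" />'

print(check_config(SYSID_A, SERVERDB, FSETDB))
print(check_config(SYSID_B, SERVERDB, FSETDB))
```

With these inputs the first call finds nothing to report, while the second flags clientB, which is not listed as a replicator in this fsetdb.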

6. Recovery from conflicts

The current version of InterMezzo has a built-in recovery mechanism to deal with most system crashes. Through configuration choices, conflicts (i.e. inconsistent updates to client and server caches) can be avoided.

However, during disconnected operation, conflicts can be generated if the configuration does not explicitly avoid them by forcing the file system to be read-only. Where the client and server have inconsistent caches, only manual recovery can restore the system.

The system can be recovered manually as follows:

  1. When a conflict happens, the Lento which is reintegrating changes will die. This Lento is receiving updates from its peer, and typically the peer will have the latest updates. So we are going to synchronize from the Lento that survived to the Lento that died.
  2. Shut down the server and client(s), unmount the caches, and remove the presto module from the kernel:
      umountizo ; rmmod presto
  3. Mount each cache as an ext3 filesystem:
      mount -o loop /tmp/fs0 /izo0
  4. Use rsync, tar, or another tool to synchronize the caches on the clients and server. Make sure to remove files from the client that you don't have on the server; the caches need to be identical.
  5. Set the synced flag on the clients; this prevents the system from resyncing on startup. This is done using the command below, where SYSID is replaced with the client's sysid and FSETNAME is replaced with the name of the fileset:
      touch /var/intermezzo/SYSID/FSETNAME-synced
     e.g. on client iclientA with fileset shared use:
      touch /var/intermezzo/iclientA/shared-synced
  6. The persistent databases will be out of sync at this point, so you must clear the KML and last_rcvd records on both the client and the server:
      cp /dev/null /izo0/.intermezzo/shared/kml ; cp /dev/null /izo0/.intermezzo/shared/last_rcvd
  7. Unmount the caches and mount them again as InterMezzo file systems. Restart Lento on the server and client.

This is cumbersome, but journaled recovery is on its way.

7. Debugging

To help us find bugs we need logging information. The logs come from two places: from the kernel in /var/log/messages, and from lento on stdout and stderr.

The kernel debugging log slows things down enormously and is activated with:

echo 4095 > /proc/sys/intermezzo/debug
echo 1 > /proc/sys/intermezzo/trace

The lento log can be captured from the terminal, and is activated using the --debuglevel=N option. With N=1 you get a lot of output; with N=100, all of it.

Mailing us the logs as well as a precise description of what you did to produce the bug might be enough to see what's happening.

8. Using the test framework for testing and debugging

Read the README file in the ../intermezzo/tests directory. The test framework can conveniently save all of the logging information for you, and it runs the client(s) and server on a single system.

9. How does InterMezzo work?

InterMezzo was heavily inspired by Coda, and its current cache synchronization protocol is one of the many protocols that Coda supports. It is likely not the best for every situation but it is as simple as we could make it.

The InterMezzo filesystem keeps sets of files on multiple hosts synchronized. It sits on top of the native filesystems on each host and keeps track of updates to the filesystems in such a way that it can synchronize the changes between multiple hosts. In this document we describe the architectures and protocols that InterMezzo uses to keep files synchronized.

9.1 Coherence and Granularity

InterMezzo guarantees only very loose coherence between the filesystems. Files are only ever handled as complete units, changes are not propagated until the file is closed for writing, and changes on one system are not necessarily reflected on another immediately. In InterMezzo 1.0 whole filesystems are replicated and only one host may have the write lock for that filesystem at any one time.

9.2 intermezzo.o, the kernel module

Presto is the kernel module for InterMezzo. It implements the various operations associated with the InterMezzo file system under VFS and creates pseudo devices for communication with Lento.

9.3 Lento

Lento is a user-space daemon which handles file transfers and other caching issues on behalf of presto. There is one Lento per mounted InterMezzo file system.

9.4 The KML File

There is one KML file per mounted InterMezzo filesystem. The KML file contains records of changes to the filesystem, and taken as a whole the KML file can provide a script for building a replica of the whole filesystem.

The KML file is a series of binary records, each of which represents a single modification to the filesystem. Each record is self-contained in that it does not have references to other records, a property which makes the records easy to move around. The records are of variable length, and the length of the record is stored at the beginning and end of each record to facilitate moving forward or backward through the file. A complete description of the allowed KML record formats doesn't exist yet.
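
The framing idea, a length stored at both ends of each record, can be illustrated with a small Python sketch. The layout used here is hypothetical, since no description of the real KML record formats exists yet: a 4-byte little-endian total length before and after the payload, with readable strings standing in for the binary record contents.

```python
import struct

LEN = struct.Struct("<I")  # hypothetical: 4-byte little-endian length field

def pack_record(payload):
    """Frame a payload with the total record length at both ends."""
    total = LEN.size + len(payload) + LEN.size
    return LEN.pack(total) + payload + LEN.pack(total)

def iter_forward(log):
    """Yield payloads from the first record to the last."""
    pos = 0
    while pos < len(log):
        (total,) = LEN.unpack_from(log, pos)  # leading length
        yield log[pos + LEN.size : pos + total - LEN.size]
        pos += total

def iter_backward(log):
    """Yield payloads from the last record to the first."""
    pos = len(log)
    while pos > 0:
        (total,) = LEN.unpack_from(log, pos - LEN.size)  # trailing length
        yield log[pos - total + LEN.size : pos - LEN.size]
        pos -= total

# Placeholder payloads; real KML records are binary.
payloads = [b"mkdir /a", b"create /a/b", b"unlink /a/b"]
log = b"".join(pack_record(p) for p in payloads)
print(list(iter_forward(log)))
print(list(iter_backward(log)))
```

Because each record carries its own length at both ends and refers to no other record, a reader can start at either end of the file and step record by record without an index.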

9.5 The Last_rcvd File

There is one last_rcvd file per mounted InterMezzo filesystem; this section and the next refer to it as the Expect file. The Expect file contains information about how this host is synchronized with the other hosts by holding pointers into this and other hosts' KML files. This information is stored in the filesystem so that it will be persistent across reboots.

The Expect file has four pieces of information for each remote host.

  1. next_to_expect. A pointer to the next record in the remote host's KML file that we expect it to send to us. If we get a set of records that does not start at this value then a message has been dropped somewhere and we need to renegotiate with that host. This is NOT a hint.
  2. next_to_send. A pointer to the next record in our KML file that we intend to send to the remote host. This is just a hint because we advance next_to_send as soon as we've sent data to another host, not when we've gotten confirmation that it has been received and processed. When we send KML records to the remote host we send the value of next_to_send (plus the gap, below) to tell the remote host where the records come from in our KML.
  3. confirmed. A pointer to the beginning of the next record in our KML file that has not yet been confirmed as received and processed by the remote host. This is NOT a hint.
  4. gap. An adjustment to add to next_to_send before sending it to the remote host. This lets us move records forward or back in our local KML file while preserving the externally visible file locations. This is NOT a hint.
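
The four values kept for each remote host can be pictured as a small record. The Python sketch below is only a model of the description above: the class name PeerState and its methods are hypothetical, and the on-disk layout of the file is not shown because it is not described here.

```python
from dataclasses import dataclass

@dataclass
class PeerState:
    """Per-remote-host synchronization state, all values are byte offsets."""
    next_to_expect: int = 0  # offset in the remote KML we expect next
    next_to_send: int = 0    # offset in our KML to send next (a hint)
    confirmed: int = 0       # offset in our KML of first unconfirmed record
    gap: int = 0             # adjustment between local and external offsets

    def wire_offset(self):
        """Offset announced to the remote host when sending records."""
        return self.next_to_send + self.gap

    def accepts(self, remote_offset):
        """True if incoming records start exactly where we expect them."""
        return remote_offset == self.next_to_expect

    def unconfirmed_bytes(self):
        """Bytes already sent but not yet confirmed by the remote host."""
        return self.next_to_send - self.confirmed

peer = PeerState(next_to_send=4096, confirmed=1024, gap=512)
print(peer.wire_offset(), peer.unconfirmed_bytes(), peer.accepts(0))
```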

9.6 Legal Transformations of the KML and Expect Files

In order to maintain consistency, only certain kinds of transformations to the KML and Expect files are allowed, and generally they have to be done together using transactions to make sure the system remains in a coherent state.

  1. Append a Record to the KML File. This is the operation that the normal VFS file operations end up using. The record is appended to the KML file, and no modifications are made to the Expect file.
  2. Incorporate a Remote KML Record. In addition to performing the operation and appending the record to the local KML file, increment the next_to_expect for that host. Modifies the KML and Expect files.
  3. Send KML Records to a Remote Host. A block of KML records is read from the KML file starting at next_to_send, and is transmitted to the remote machine. next_to_send is incremented by the number of bytes read. We effectively get a read lock on this section of the KML file. KML is read but not modified, and the Expect file is modified.
  4. Receive Confirmation of KML Processing. We receive confirmation from a remote host that a set of records starting at a given point and with a certain byte length has been received and processed. These offsets are from a remote host so we have to subtract off the gap for that host, then compare with what we think the confirmed pointer is, then move the confirmed pointer. There are no KML modifications.
  5. Optimize a Section of the KML File. We obtain a write lock on the section to be optimized, then read the section in, perform whatever optimizations we desire, then write it out again. The newly written section must be no larger than the previous, and if it is smaller a NOP block is inserted to fill out the space either before or after the new section. If this section is at the end of the KML file then the KML file can be truncated to remove the NOP block at the end. Then the write lock is relinquished. The KML file is modified and the Expect file is not.
  6. Punch Out a Section of the KML File. The section to be removed must not have outstanding read or write locks, and it can only have NOP records in it. File system magic is then performed to release the appropriate file blocks to produce a sparse file. The Expect file is not changed.
  7. Front-Truncation of the KML File. Instead of producing a sparse file you can remove the beginning of the KML file. Like the punch operation, the section to be removed should have no outstanding read or write locks and should have only NOP operations. The file should then be truncated and the gap values for all of the remote hosts adjusted in one transaction.
  8. Skip a NOP Block in the KML File. The next_to_send and confirmed pointers must both be pointing to the beginning of a NOP block. Then next_to_send, confirmed, and gap can all be incremented by the size of the NOP block. No changes are made in the KML file.
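
The first four transformations above can be sketched as operations on an in-memory model. This is a simplified illustration under several assumptions, not the InterMezzo implementation: the names Replica and Peer are hypothetical, records are opaque byte strings, the caller is trusted to send whole records, transactions and locks are not modeled, and nothing prevents incorporated records from being sent back to their origin.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    next_to_expect: int = 0
    next_to_send: int = 0
    confirmed: int = 0
    gap: int = 0

class Replica:
    def __init__(self):
        self.kml = bytearray()  # stands in for the KML file
        self.peers = {}         # stands in for the Expect file

    def append(self, record):
        """1. Append a record; the Expect state is untouched."""
        self.kml += record

    def incorporate(self, host, remote_offset, records):
        """2. Apply remote records and advance next_to_expect."""
        peer = self.peers[host]
        if remote_offset != peer.next_to_expect:
            raise ValueError("records dropped; renegotiate with " + host)
        self.kml += records
        peer.next_to_expect += len(records)

    def send(self, host, max_bytes):
        """3. Read from next_to_send; return (announced offset, data)."""
        peer = self.peers[host]
        data = bytes(self.kml[peer.next_to_send:peer.next_to_send + max_bytes])
        announced = peer.next_to_send + peer.gap
        peer.next_to_send += len(data)
        return announced, data

    def confirm(self, host, remote_offset, nbytes):
        """4. Subtract the gap, compare, then move the confirmed pointer."""
        peer = self.peers[host]
        local = remote_offset - peer.gap
        if local != peer.confirmed:
            raise ValueError("unexpected confirmation offset")
        peer.confirmed = local + nbytes

server, client = Replica(), Replica()
server.peers["clientA"] = Peer()
client.peers["muskox"] = Peer()

server.append(b"record-1")
server.append(b"record-2")
offset, data = server.send("clientA", 64)
client.incorporate("muskox", offset, data)
server.confirm("clientA", offset, len(data))
print(server.peers["clientA"])
print(client.peers["muskox"])
```

The remaining transformations (optimizing, punching out, front-truncating and skipping NOP blocks) are left out of the sketch because they depend on details of NOP records and locking that are not specified here.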

10. Contact Information

The InterMezzo web site is http://www.inter-mezzo.org.

General questions about InterMezzo can be sent to intermezzo-discuss@lists.sourceforge.net. This list, along with other InterMezzo-related mailing lists, is archived on the InterMezzo web site, so it may be worth checking there to see if your question has already been answered.

Bug reports should be filed on SourceForge. Please include the version of InterMezzo you are using and a description of your system configuration and the problem observed.

Also, please include all relevant logs: /var/log/messages, and the output of Lento (run with debugging) on server and clients.