Many individuals have contributed to this HOWTO. Among the authors are Peter J. Braam, Rob Simmonds, Gordon Matzigkeit, Christopher Li and Shirish Phatak.
InterMezzo is an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with InterMezzo.
InterMezzo is covered by the GPL. The GPL describes the warranties made to you, and can be found in the file COPYING.
Copyright on InterMezzo is held by Peter J. Braam, Stelias Computing, Carnegie Mellon University, Phil Schwan, Los Alamos National Laboratory, Red Hat, Inc., TurboLinux, Inc., Tacitus Systems, Inc. and Mountain View Data, Inc.
InterMezzo is a trademark of Stelias Computing. It may be used freely to refer to the software on the InterMezzo Web Site.
InterMezzo is a file system that maintains replicas of folder collections, a.k.a. filesets, residing on multiple computers. It keeps these replicas in sync by building a log of modifications and propagating that log to other nodes. The computers that express an interest in the replica are called the replicators of the fileset. InterMezzo has one server for the fileset, which plays an organizing role in exchanging the updates with replicators.
InterMezzo supports disconnected operation, i.e. it maintains a log to remember all updates that need to be forwarded when a failed communication channel comes back. This is best-effort synchronization, since conflicting updates are possible during disconnected operation unless the configuration parameters are set to avoid them.
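The log-and-forward idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names, the tuple record layout and the in-memory "filesystem" are all hypothetical and are not InterMezzo's actual data structures or protocol.

```python
# Toy model of log-based replication. All names and the record layout are
# hypothetical illustrations, not InterMezzo's real implementation.
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy fileset: a mapping from path to file contents."""
    files: dict = field(default_factory=dict)

    def apply(self, record):
        op, path, data = record
        if op == "write":
            self.files[path] = data
        elif op == "unlink":
            self.files.pop(path, None)


@dataclass
class LoggingReplica(Replica):
    """A replica that logs each modification and forwards the log to a peer."""
    log: list = field(default_factory=list)
    sent: int = 0  # number of log records the peer has already received

    def modify(self, op, path, data=None):
        record = (op, path, data)
        self.apply(record)
        self.log.append(record)  # remembered even while disconnected

    def sync(self, peer, connected=True):
        """Forward all not-yet-sent records; do nothing while disconnected."""
        if not connected:
            return 0
        pending = self.log[self.sent:]
        for record in pending:
            peer.apply(record)
        self.sent = len(self.log)
        return len(pending)


server = LoggingReplica()
client = Replica()
server.modify("write", "/a.txt", "hello")
server.sync(client, connected=False)   # channel is down: nothing is sent
server.modify("write", "/b.txt", "world")
server.modify("unlink", "/a.txt")
forwarded = server.sync(client)        # channel is back: replay the backlog
```

In this model the backlog accumulated while the channel was down is replayed in order on reconnection, which is why the replica ends up identical to the source as long as only one side was modified in the meantime.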
InterMezzo uses an existing disk file system as the storage location for all data. At present we support ext3, but soon also ReiserFS and XFS might be supported. When an ext3 formatted disk volume is mounted with file system type InterMezzo instead of ext3, the InterMezzo software starts managing all access to the file system. It keeps the logs of modification records and negotiates permits to modify the disk file system, to avoid conflicting updates during connected operation.
InterMezzo can use a basic internal file transfer mechanism or rely on the rsync protocol (see the Rsync web site).
Currently you should run InterMezzo only on trusted networks: the root users on the replicating systems need to be equally trusted. There is only some rudimentary security built into the system so far, which is similar to NFS security (but without root squash). A good way to get a trusted network is to use IPSEC (see FreeSwan http://www.freeswan.org), CIPE (see http://sites.inka.de/sites/bigred/devel/cipe.html), or SSH tunnels. The SSL utility stunnel is somewhat harder to use since it spawns many daemons trying to reconnect. Support for POSIX ACL replication is available for the 2.2 kernel and forthcoming for 2.4. Some security improvements will be made as time progresses.
The system currently has journal recovery in combination with Ext3. After a system crash, the local disk file system, including the KML, LML and last_rcvd files that hold the distributed state, recovers automatically. Recovery with peers is normally seamless as well.
The system does not currently have conflict handlers, only pessimistic, rigorous conflict detection. More extensive conflict resolution tools are being developed and should be available with the next major release. The design of the system means that conflicts can only occur when reconnecting after a period of disconnected operation and that conflicts can only occur on a client.
At the moment InterMezzo replicates an entire filesystem. However, a fetch on demand system will appear in a future version, which will allow partial replication of a filesystem. The first versions of this will fetch file data on demand but replicate metadata (directories and inodes) fully. Partial metadata caching may be implemented in future versions.
InterMezzo depends on a kernel that has the InterMezzo file system. There is also a user level file server and cache manager which are currently written in Perl. Finally there are some utilities to make InterMezzo file systems.
The packages for version 1.0.4 are available from ftp://ftp.inter-mezzo.org:/pub/intermezzo/1.0.4/rh7.1/RPMS. These packages should install cleanly on a Red Hat 7.1 system. You want to install either the 2.2 kernel package or the 2.4 kernel package.
In order to boot the 2.4 kernel, you need to generate an initial ramdisk with mkinitrd as follows:
mkinitrd /boot/initrd-2.4.7-ext3_0.9.5-presto_1.0.4 2.4.7-ac9
For LILO to boot this kernel, add an entry like the following to your /etc/lilo.conf file:
image=/boot/vmlinuz-2.4.7_ext3_0.9.5_presto_1.0.4
label=InterMezzo
read-only
root=/dev/hda1
initrd=/boot/initrd-2.4.7-ext3_0.9.5-presto_1.0.4
In order to build a kernel module for your kernel, you need to have the .config file and the kernel sources for your kernel.
Proceed by first preparing your kernel sources, and then building the module:
cd /your/source/linux
make distclean
cp your.config .config
make oldconfig dep
cd /usr/src/presto24-1.0.04
./configure --enable-linuxdir=/your/source/linux
make install
The same mechanism works for Linux 2.2 kernels.
Your default config directory is /etc/intermezzo. You may use the interactive inconfig command to generate the following configuration files, or create them manually.
The config files in versions 1.0 and later use the XML format instead of the Perl formats found in older versions.
The sysid file holds the name of your system, the presto device name and the IP bind address. Suppose your server has the name muskox, with IP address 192.168.0.3, and your clients are clientA and clientB. The sysid file on each host would contain the host name, the presto device and the IP bind address. For example, on muskox the file would contain:
<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />
Note that in early versions of InterMezzo, this file did not contain the name of the presto device; this field is now required.
The serverdb file holds a database of servers. Each server is described by an XML server element, as follows:
<serverdb>
<server name="muskox" ipaddr="192.168.0.3" port="2222"
bindaddr="192.168.0.3" />
</serverdb>
The above contains a single server description for the server muskox with IP address "192.168.0.3". The port and bindaddr are optional; the default port is 2222. Without a bindaddr the server listens on all interfaces for requests; with it, the server listens only on the bindaddr address. If you are running both a client and a server on the same system, you need to specify a different bindaddr for the server and the client(s).
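The defaulting rules for port and bindaddr can be made concrete with a short sketch. It is not how Lento reads its configuration; it only parses a serverdb of the form shown above with Python's standard XML library. The function name load_servers and the use of None to mean "listen on all interfaces" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DEFAULT_PORT = 2222  # default port stated in this HOWTO

SERVERDB = """
<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" />
</serverdb>
"""


def load_servers(xml_text):
    """Return {server name: settings}, filling in the optional fields.

    Hypothetical helper: a missing port falls back to DEFAULT_PORT, and a
    missing bindaddr is reported as None, meaning "all interfaces".
    """
    servers = {}
    for elem in ET.fromstring(xml_text).findall("server"):
        servers[elem.get("name")] = {
            "ipaddr": elem.get("ipaddr"),
            "port": int(elem.get("port", DEFAULT_PORT)),
            "bindaddr": elem.get("bindaddr"),
        }
    return servers


servers = load_servers(SERVERDB)
```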
The fsetdb file holds a database of filesets. Each fileset is described by an XML fileset element, as follows:
<fsetdb>
<fileset name="yourfsetname" servername="muskox" fetchtype="bulktype" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>
The above contains a single fileset description for a fileset called yourfsetname, which is served by muskox. The fileset is replicated on hosts clientA and clientB.
The fetchtype can be the class name of a supported bulk mover. The default is "Rsync"; the simpler InterMezzo-managed bulk mover is called "Desc".
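As an illustration of how a fileset description maps hosts to roles, the following sketch parses an fsetdb and reports whether a given host is the server or a replicator. The helper names load_filesets and role_of are hypothetical, and the code is not taken from InterMezzo; only the element names, attribute names and the "Rsync" default come from the text above.

```python
import xml.etree.ElementTree as ET

FSETDB = """
<fsetdb>
  <fileset name="shared" servername="muskox">
    <replicator>clientA</replicator>
    <replicator>clientB</replicator>
  </fileset>
</fsetdb>
"""


def load_filesets(xml_text):
    """Return {fileset name: description} for every fileset element."""
    filesets = {}
    for fs in ET.fromstring(xml_text).findall("fileset"):
        filesets[fs.get("name")] = {
            "server": fs.get("servername"),
            # "Rsync" is the default bulk mover according to this HOWTO
            "fetchtype": fs.get("fetchtype", "Rsync"),
            "replicators": [(r.text or "").strip()
                            for r in fs.findall("replicator")],
        }
    return filesets


def role_of(host, fileset):
    """Hypothetical helper: classify a host with respect to one fileset."""
    if host == fileset["server"]:
        return "server"
    if host in fileset["replicators"]:
        return "replicator"
    return None


filesets = load_filesets(FSETDB)
shared = filesets["shared"]
```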
To ease the mounting of InterMezzo filesets, add one of the following to the /etc/fstab file. For testing and development, using a loop device as the cache is easiest:
/tmp/cache /izo0 intermezzo loop,fileset=fsetname,mtpt=/izo0,
data=journal,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0
where /tmp/cache is a file associated with a loop device, /izo0 is a mount point (a directory), fsetname is the name of the fileset and /dev/intermezzo0 is the name of the presto device. The creation of the cache file and the presto device is explained in the examples at the end of this section.
The kernel must be configured with loopback device support enabled to do this.
NOTE: The mount option data=journal is important for 2.4 kernels pending a bug fix in ext3.
Using a genuine block device is a little easier, because you do not need to set up a loop device. To use the block device /dev/hda9, the /etc/fstab file should contain:
/dev/hda9 /izo0 intermezzo fileset=fsetname,mtpt=/izo0,
prestodev=/dev/intermezzo0,cache_type=ext3,data=journal,noauto 0 0
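Because the entries are shown wrapped here but must be a single line in /etc/fstab, it can help to generate them. The sketch below composes an entry from its parts using only the option names that appear in the examples above; the function fstab_entry and its parameters are hypothetical conveniences, not part of InterMezzo.

```python
def fstab_entry(device, mountpoint, fileset, prestodev,
                cache_type="ext3", loop=False, journal_data=True):
    """Compose one single-line /etc/fstab entry for an InterMezzo mount.

    Hypothetical helper: it only strings together the mount options used
    in this HOWTO and does not validate them.
    """
    options = []
    if loop:
        options.append("loop")  # cache is a file on a loop device
    options += [
        f"fileset={fileset}",
        f"mtpt={mountpoint}",
        f"prestodev={prestodev}",
        f"cache_type={cache_type}",
    ]
    if journal_data:
        options.append("data=journal")
    options.append("noauto")
    return f"{device} {mountpoint} intermezzo {','.join(options)} 0 0"


entry = fstab_entry("/dev/hda9", "/izo0", "fsetname", "/dev/intermezzo0")
print(entry)
```

Passing loop=True and a cache file such as /tmp/cache as the device gives the loop-device variant.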
NOTE: Each /etc/fstab entry should be a single line. The same holds for the following examples.
The file /izo0/.intermezzo/fsetname/kml contains the kernel modification log (aka the KML), which keeps track of all of the changes made in an InterMezzo filesystem. The file /izo0/.intermezzo/fsetname/last_rcvd is the last_rcvd file, which keeps track of the distributed synchronization state. In the current release of InterMezzo, the KML and last_rcvd files need to be created (usually by running mkizofs) before first mounting an InterMezzo filesystem.
For this one uses the mkizofs tool:
mkizofs -r fsetname -j /tmp/cache
mkizofs -r fsetname -j /dev/hdaX
The argument to the -r option gives the root fileset name for which an InterMezzo replication log will be created; the -j option causes an Ext3 journal to be created. Please note that this requires e2fsprogs version 1.22 or later (see http://e2fsprogs.sourceforge.net). There are further options, such as specifying the filesystem type; see mkizofs -h.
If you have already initialized your cache filesystem, then you must manually create the needed InterMezzo metadata files:
mount -t ext2 -o loop /tmp/cache /izo0
mkdir -p /izo0/.intermezzo/fsetname/db
chgrp -R InterMezzo /izo0/.intermezzo
chmod 700 /izo0/.intermezzo
touch /izo0/.intermezzo/fsetname/{kml,lml,last_rcvd}
tune2fs -j /tmp/cache # if file system was ext2
umount /izo0
This example assumes that we are using the loopback device with the /tmp/cache filesystem, and that the fileset will be called fsetname.
Before you can mount these as InterMezzo you should manually replicate them to the replicators, so that the file systems are identical.
Let's consider three common system configurations. For each we will give the config files and the correct invocations to start the server/cache manager.
In this case we assume that the host muskox is serving the fileset shared and the host clientA is replicating the fileset.
The following serverdb and fsetdb files are placed on both muskox and clientA.
<serverdb>
<server name="muskox" ipaddr="192.168.0.3" />
</serverdb>
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
</fileset>
</fsetdb>
On muskox the sysid file contains:
<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />
On clientA the sysid file contains:
<sysid name="clientA" psdev="/dev/intermezzo0" bindaddr="192.168.0.20" />
The following line is added to /etc/fstab on both muskox and clientA:
/tmp/fs0 /izo0 intermezzo loop,fileset=shared,prestodev=/dev/intermezzo0,
mtpt=/izo0,cache_type=ext3,noauto 0 0
The cache file /tmp/fs0 and its filesystem are created using the following commands:
dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
mkizofs -F /tmp/fs0
If we didn't run mkizofs above, we create the KML and last_rcvd files by first mounting the filesystem as ext3:
mkdir /izo0
mount -o loop /tmp/fs0 /izo0
mkdir -p /izo0/.intermezzo/shared
touch /izo0/.intermezzo/shared/{kml,last_rcvd}
umount /izo0
The presto device is created using the following commands:
mknod /dev/intermezzo0 c 185 0
chmod 700 /dev/intermezzo0
Your modules configuration file may also be called /etc/modules.conf. Add the line:
alias char-major-185 intermezzo
Before starting lento, mount the cache:
mkdir /izo0; mount /izo0
Now lento can be started on both muskox and clientA by typing
lento
The serverdb file can be the same as for the one client and one server case above.
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>
This fsetdb file is the same as in the first example, but clientB is added to the replicators list.
The sysid file is the same as in the first example for muskox and clientA; on clientB it contains the following:
<sysid name="clientB" psdev="/dev/intermezzo0" bindaddr="192.168.0.21" />
The remaining setup (the /etc/fstab entry, the cache and the presto device) is the same as used with the one client and one server case above.
Running over an encrypted tunnel:
ssh -f -x -L 3333:localhost:2222 -R 3333:localhost:2222
Suppose that we are running on the host muskox. To run multiple lentos on one host we need to use ip-aliasing; the ip-aliasing option must be compiled into your kernel (CONFIG_IP_ALIAS). This allows one interface to have more than one IP address associated with it. Suppose the name muskoxA1 and the IP address 192.168.0.100 are available. Add the following line to /etc/hosts:
192.168.0.100 muskoxA1
Then add the ip-alias by typing:
ifconfig eth0:1 muskoxA1 up
Then create two configuration files containing the following:
<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />
<sysid name="muskoxA1" psdev="/dev/intermezzo1" bindaddr="192.168.0.100" />
The latter file will act as a sysid file for the lento running on the aliased IP address. Note that because we are running both the client and the server on the same system, we have to specify different devices for each, namely /dev/intermezzo0 and /dev/intermezzo1.
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>muskoxA1</replicator>
</fileset>
</fsetdb>
To run the second lento, a second presto device and loopback cache are required. These are made as follows:
mknod /dev/intermezzo1 c 185 1
dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
mkizofs -F /tmp/fs1
chmod 700 /dev/intermezzo1
Note that two /etc/fstab entries are needed here:
/tmp/fs0 /izo0 intermezzo loop,fileset=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,cache_type=ext3,noauto 0 0
/tmp/fs1 /izo1 intermezzo loop,fileset=shared,prestodev=/dev/intermezzo1,mtpt=/izo1,cache_type=ext3,noauto 0 0
Now mount the two InterMezzo filesystems:
mount /izo0
mount /izo1
The lento acting as the server can be started as before:
lento
The lento acting as the replicator has to be told which sysid file to read (which tells it which presto device to use). The second lento is started as follows:
lento.pl --idfile=sysid.muskoxA1
Currently the checkconfig tool is not working. The XML version of the config check is not ready yet.
A script is provided to perform simple checks on the configuration files. The script is called config_check and can be found in the .../intermezzo/tools directory.
If Lento is using the standard system id file, /etc/intermezzo/sysid, the script can be run without arguments. If a different system id file is being used, the --idfile=my_idfile flag can be used to indicate this.
It is also possible to use a configuration directory other than /etc/intermezzo by using the --configdir=my_confdir flag.
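To give an idea of what a simple consistency check over the three XML files might look like, here is a minimal sketch. It does not reproduce what config_check does: the function check_config and the two rules it applies (sysid must carry a name and a psdev, and every fileset must name a server listed in serverdb) are assumptions derived from the file descriptions earlier in this HOWTO.

```python
import xml.etree.ElementTree as ET


def check_config(sysid_xml, serverdb_xml, fsetdb_xml):
    """Return a list of human-readable problems (empty if none found).

    Hypothetical checks only; the real config_check script may verify
    different things.
    """
    problems = []

    sysid = ET.fromstring(sysid_xml)
    for attr in ("name", "psdev"):  # psdev is required in current versions
        if not sysid.get(attr):
            problems.append(f"sysid: missing attribute {attr!r}")

    known_servers = {s.get("name")
                     for s in ET.fromstring(serverdb_xml).findall("server")}
    for fs in ET.fromstring(fsetdb_xml).findall("fileset"):
        server = fs.get("servername")
        if server not in known_servers:
            problems.append(
                f"fileset {fs.get('name')}: server {server!r} not in serverdb")
    return problems


good = check_config(
    '<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />',
    '<serverdb><server name="muskox" ipaddr="192.168.0.3" /></serverdb>',
    '<fsetdb><fileset name="shared" servername="muskox">'
    '<replicator>clientA</replicator></fileset></fsetdb>',
)
bad = check_config(
    '<sysid name="clientA" bindaddr="192.168.0.20" />',
    '<serverdb><server name="muskox" ipaddr="192.168.0.3" /></serverdb>',
    '<fsetdb><fileset name="shared" servername="yak" /></fsetdb>',
)
```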
The current version of InterMezzo has a built-in recovery mechanism to deal with most system crashes. Through configuration choices, conflicts (i.e. inconsistent updates to client and server caches) can be avoided.
However, during disconnected operation, conflicts can be generated if the configuration does not explicitly avoid them by forcing the file system to be read-only. Where the client and server have inconsistent caches, only manual recovery can restore the system.
The system can be recovered manually as follows:
umountizo ; rmmod presto
mount -o loop /tmp/fs0 /izo0
touch /var/intermezzo/SYSID/FSETNAME-synced
e.g. on client iclientA with fileset shared use:
touch /var/intermezzo/iclientA/shared-synced
cp /dev/null /izo0/.intermezzo/shared/kml ;
cp /dev/null /izo0/.intermezzo/shared/last_rcvd
This is cumbersome, but journaled recovery is on its way.
To help us find bugs we need logging information. The logs come from two places: from the kernel, in /var/log/messages, and from lento, on stdout and stderr.
The kernel debugging log slows things down enormously and is activated with:
echo 4095 > /proc/sys/intermezzo/debug
echo 1 > /proc/sys/intermezzo/trace
The lento log can be captured from the terminal, and is activated using the --debuglevel=N option. With N=1 you get many things; with N=100, all of it.
Mailing us the logs as well as a precise description of what you did to produce the bug might be enough to see what's happening.
Read the README file in the ../intermezzo/tests directory. This can save all information for you conveniently and runs the client(s) and server on a single system.
InterMezzo was heavily inspired by Coda, and its current cache synchronization protocol is one of the many protocols that Coda supports. It is likely not the best for every situation but it is as simple as we could make it.
The InterMezzo filesystem keeps sets of files on multiple hosts synchronized. It sits on top of the native filesystems on each host and keeps track of updates to the filesystems in such a way that it can synchronize the changes between multiple hosts. In this document we describe the architectures and protocols that InterMezzo uses to keep files synchronized.
InterMezzo guarantees only very loose coherence between the filesystems. Files are only ever handled as complete units, changes are not propagated until the file is closed for writing, and changes on one system are not necessarily reflected on another immediately. In InterMezzo 1.0 whole filesystems are replicated and only one host may have the write lock for that filesystem at any one time.
Presto is the kernel module for InterMezzo. It implements the various operations associated with the InterMezzo file system under VFS and creates pseudo devices for communication with Lento.
Lento is a user-space daemon which handles file transfers and other caching issues on behalf of presto. There is one Lento per mounted InterMezzo file system.
There is one KML file per mounted InterMezzo filesystem. The KML file contains records of changes to the filesystem, and taken as a whole the KML file can provide a script for building a replica of the whole filesystem.
The KML file is a series of binary records, each of which represents a single modification to the filesystem. Each record is self-contained in that it does not have references to other records, a property which makes the records easy to move around. The records are of variable length, and the length of the record is stored at the beginning and end of each record to facilitate moving forward or backward through the file. A complete description of the allowed KML record formats doesn't exist yet.
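Storing the length at both ends of each record is what makes traversal in either direction possible. The sketch below demonstrates the technique on an invented framing; since the real KML record formats are not documented here, the 4-byte little-endian length field, the helper names and the textual payloads are all assumptions made purely for illustration.

```python
import struct

# Assumed framing, NOT the real KML layout:
#   [4-byte little-endian length][payload][same 4-byte length]
LEN = struct.Struct("<I")


def pack_record(payload: bytes) -> bytes:
    """Wrap a payload with its length at the beginning and at the end."""
    size = LEN.pack(len(payload))
    return size + payload + size


def iter_forward(buf: bytes):
    """Yield payloads from first to last using the leading length."""
    pos = 0
    while pos < len(buf):
        (n,) = LEN.unpack_from(buf, pos)
        start = pos + LEN.size
        yield buf[start:start + n]
        pos = start + n + LEN.size  # skip payload and trailing length


def iter_backward(buf: bytes):
    """Yield payloads from last to first using the trailing length."""
    pos = len(buf)
    while pos > 0:
        (n,) = LEN.unpack_from(buf, pos - LEN.size)
        start = pos - LEN.size - n
        yield buf[start:start + n]
        pos = start - LEN.size  # skip the leading length of this record


payloads = [b"mkdir /a", b"create /a/f", b"unlink /a/f"]
log = b"".join(pack_record(p) for p in payloads)
forward = list(iter_forward(log))
backward = list(iter_backward(log))
```

Because each record carries its own length and no references to other records, a reader can start at either end of the file and step record by record without an index.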
There is one Expect file per mounted InterMezzo filesystem. The Expect file contains information about how this host is synchronized with the other hosts by holding pointers into this and other hosts' KML files. This information is stored in the filesystem so that it will be persistent across reboots.
The Expect file has four pieces of information for each remote host.
In order to maintain consistency, only certain kinds of transformations to the KML and Expect files are allowed, and generally they have to be done together using transactions to make sure the system remains in a coherent state.
The InterMezzo web site is http://www.inter-mezzo.org.
General questions about InterMezzo can be sent to intermezzo-discuss@lists.sourceforge.net. This list, along with other InterMezzo-related mailing lists, is archived on the InterMezzo web site, so it may be worth checking there to see if your question has already been answered.
Bug reports should be filed on SourceForge. Please include the version of InterMezzo you are using and a description of your system configuration and the problem observed.
Also, please include all relevant logs: /var/log/messages, and the output of Lento (run with debugging) on server and clients.