
2. Setup & Installation Considerations

  1. Q: I must soon install Linux on a new system, and one requirement is to have RAID-1. What is the easiest way to set it up?
    A: I keep rediscovering that file-system planning is one of the more difficult Unix configuration tasks. To answer your question, I can describe what we did. We planned the following setup:
    • two EIDE disks, 2.1 GB each.
      disk partition mount pt.  size    device
        1      1       /        300M   /dev/hda1
        1      2       swap      64M   /dev/hda2
        1      3       /home    800M   /dev/hda3
        1      4       /var     900M   /dev/hda4
      
        2      1       /root    300M   /dev/hdc1
        2      2       swap      64M   /dev/hdc2
        2      3       /home    800M   /dev/hdc3
        2      4       /var     900M   /dev/hdc4
                          
      
    • each disk is on a separate controller (and ribbon cable). The theory is that a controller failure and/or ribbon failure won't disable both disks. We might also get a performance boost from parallel operations.
    • Install Linux on / in /dev/hda1; this will allow booting and the subsequent installation of the RAID patches, etc.
    • /dev/hdc1 will contain a ``cold'' copy of /dev/hda1. This is NOT a RAID copy, just a copy-copy. It's there just in case disk 1 fails completely: we can use a rescue disk, mark /dev/hdc1 as bootable, and use it to keep going without having to reinstall the system. The theory here is that in case of severe failure, I can still boot the system without worrying about RAID superblock corruption or other RAID failure modes and gotchas that I don't understand.
    • /dev/hda3 and /dev/hdc3 will be mirrored as /dev/md0.
    • /dev/hda4 and /dev/hdc4 will be mirrored as /dev/md1.
    • we picked /var and /home to be mirrored, each in a separate partition, using the following (admittedly convoluted) logic:
      • / will contain non-changing data — for all practical purposes, it will be read-only without actually being read-only.
      • /home will contain slowly changing data — an almost-read-only system.
      • /var will contain rapidly changing data, including mail spools, database contents and web server logs.
      The theory is that if, for some bizarre reason, the operating system goes wild, corruption is limited to one partition. Thus, if for some unlikely, hypothetical reason the database starts scribbling everywhere, it might clobber mail and log files, but not /home. I am not entirely satisfied with my logic and reasoning, but it was the best I could do on short notice. I would like to have some scheme that verifies that files in /usr and /home are not changed, e.g. some MD5 signature scheme that is run regularly. The idea is to detect hacker intrusion as well as corruption. Similarly, the database contents are quite valuable, and I don't yet have a fault-tolerance plan for them that will let me sleep well at night.
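    As a rough illustration of the kind of MD5 verification scheme mentioned above (a sketch only: the manifest location and the directory list are placeholders, and a dedicated integrity checker such as Tripwire would keep its manifest on read-only media):

        # build a baseline manifest of checksums; store it on media the
        # running system cannot modify (a floppy, another host, ...)
        find /usr /home -type f | xargs md5sum > /root/md5.baseline

        # later, e.g. nightly from cron: verify every file against the
        # baseline; any line not ending in ": OK" indicates a change
        md5sum -c /root/md5.baseline | grep -v ': OK$'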
    So, to complete the answer to your question:
    • Install Red Hat on disk 1, partition 1 (/dev/hda1). Do NOT mount any of the other partitions.
    • Install the RAID patches and tools per the instructions.
    • Configure /dev/md0 and /dev/md1.
    • Convince yourself that you know what to do in case of a disk failure! Discover sysadmin mistakes now, not during an actual crisis. Experiment! (We turned off the power during disk activity; this proved to be ugly but informative.)
    • Do the somewhat ugly mount/copy/unmount/rename/reboot dance to move /var over to /dev/md1. Done carefully, this is not dangerous; a sketch of the procedure follows this list.
    • Enjoy!
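    A minimal sketch of that /var move, assuming /dev/md1 is already running, the system is in single-user mode so nothing is writing to /var, and /mnt is free as a temporary mount point:

        mke2fs /dev/md1                  # create a filesystem on the mirror
        mount /dev/md1 /mnt              # mount it somewhere temporary
        cp -a /var/. /mnt                # copy everything, preserving ownership and links
        umount /mnt
        mv /var /var.old                 # keep the old copy until the new one is trusted
        mkdir /var
        # add a line such as "/dev/md1  /var  ext2  defaults  0 2" to /etc/fstab
        mount /dev/md1 /var
        # reboot and verify everything before removing /var.old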
  2. Q: Can I stripe/mirror the root partition (/)? Why can't I boot Linux directly from the md disks?
    A: Both LILO and Loadlin need a non-striped, non-mirrored partition to read the kernel image from. If you want to stripe or mirror the root partition (/), then create a separate plain partition to hold the kernel; typically, this is /boot. Then you either use the initial ramdisk (initrd) support, or some old patches that were posted a while back, to allow your root device to be striped or mirrored. There are several approaches that can be used. One approach is documented in detail in the Bootable RAID mini-HOWTO: ftp://ftp.bizsystems.com/pub/raid/bootable-raid. Alternately, use mkinitrd to build the ramdisk image; see below.
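    As a rough sketch of the mkinitrd route (the kernel version and file names are placeholders, and the --with= option assumes Red Hat's mkinitrd), one would build an initrd that carries the RAID module:

        mkinitrd --with=raid1 /boot/initrd-2.0.36.img 2.0.36

    and point LILO at it with an /etc/lilo.conf fragment along these lines, where /boot lives on a plain partition while the root filesystem is the mirror /dev/md0:

        image=/boot/vmlinuz-2.0.36
            label=linux
            initrd=/boot/initrd-2.0.36.img
            root=/dev/md0
            read-only

    Remember to rerun lilo after changing the kernel or the initrd image.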

    Edward Welbon <welbon@bga.com> writes:

    • ... all that is needed is a script to manage the boot setup. To mount an md filesystem as root, the main thing is to build an initial file system image that has the needed modules and md tools to start md. I have a simple script that does this.
    • For boot media, I have a small, cheap SCSI disk (170 MB; I got it used for $20). This disk runs on an AHA1452, but it could just as well be an inexpensive IDE disk on the native IDE interface. The disk need not be very fast, since it is mainly for booting.
    • This disk has a small file system which contains the kernel and the file system image for initrd. The initial file system image has just enough stuff to allow me to load the raid SCSI device driver module and start the raid partition that will become root. I then do an
      echo 0x900 > /proc/sys/kernel/real-root-dev
                    
      
      (0x900 encodes major 9, minor 0, i.e. /dev/md0; /dev/md1 would be 0x901, and so on) and exit linuxrc. The boot proceeds normally from there. A sketch of such a linuxrc appears after this list.
    • I have built most support as modules, except for the AHA1452 driver that brings in the initrd filesystem, so I have a fairly small kernel. The method is perfectly reliable; I have been doing this since before 2.1.26 and have never had a problem that I could not easily recover from. The file systems even survived several 2.1.44/2.1.45 hard crashes with no real problems.
    • At one time I had partitioned the RAID disks so that the initial cylinders of the first RAID disk held the kernel and the initial cylinders of the second RAID disk held the initial file system image. Instead, I now make the initial cylinders of the RAID disks swap, since they are the fastest cylinders (why waste them on boot?).
    • The nice thing about having an inexpensive device dedicated to boot is that it is easy to boot from and can also serve as a rescue disk if necessary. If you are interested, you can take a look at the script that builds my initial ram disk image and then runs lilo.
      http://www.realtime.net/~welbon/initrd.md.tar.gz
      It is current enough to show the picture. It isn't especially pretty, and it could certainly build a much smaller filesystem image for the initial ram disk. It would be easy to make it more efficient. But it uses lilo as is. If you make any improvements, please forward a copy to me. 8-)
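    To make the linuxrc step above concrete, a minimal sketch might look like the following. The module and partition names are placeholders, and only the mdadd/mdrun/echo invocations shown elsewhere in this FAQ are used:

        #!/bin/sh
        # minimal linuxrc on the initial ramdisk (names are placeholders)
        /sbin/insmod /lib/scsi-driver.o           # low-level driver for the RAID member disks
        /sbin/insmod /lib/raid1.o                 # RAID-1 personality module
        /sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1  # register the mirror members
        /sbin/mdrun -p1 /dev/md0                  # start the RAID-1 array
        echo 0x900 > /proc/sys/kernel/real-root-dev   # /dev/md0 will become the real root
        exit 0                                    # exiting linuxrc lets the kernel mount root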

  3. Q: I have heard that I can run mirroring over striping. Is this true? Can I run mirroring over the loopback device?
    A: Yes, but not the reverse. That is, you can put a stripe over several disks, and then build a mirror on top of this. However, striping cannot be put on top of mirroring. A brief technical explanation is that the linear and stripe personalities use the ll_rw_blk routine for access. The ll_rw_blk routine maps disk devices and sectors, not blocks. Block devices can be layered one on top of the other; but devices that do raw, low-level disk accesses through ll_rw_blk cannot. Currently (November 1997) RAID cannot be run over the loopback devices, although this should be fixed shortly.
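    As an illustration of the layering (a sketch only, written in the newer raidtools /etc/raidtab format rather than the older mdadd-style configuration files used elsewhere in this FAQ; all device names are placeholders), two RAID-0 stripe sets are defined first and a RAID-1 mirror is then built from the resulting md devices:

        # first stripe set
        raiddev /dev/md0
            raid-level      0
            nr-raid-disks   2
            chunk-size      32
            device          /dev/sda1
            raid-disk       0
            device          /dev/sdb1
            raid-disk       1

        # second stripe set
        raiddev /dev/md1
            raid-level      0
            nr-raid-disks   2
            chunk-size      32
            device          /dev/sdc1
            raid-disk       0
            device          /dev/sdd1
            raid-disk       1

        # mirror built on top of the two stripes
        raiddev /dev/md2
            raid-level      1
            nr-raid-disks   2
            device          /dev/md0
            raid-disk       0
            device          /dev/md1
            raid-disk       1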
  4. Q: I have two small disks and three larger disks. Can I concatenate the two smaller disks with RAID-0, and then create a RAID-5 out of that and the larger disks?
    A: Currently (November 1997), for a RAID-5 array, no. Currently, one can do this only for a RAID-1 on top of the concatenated drives.
  5. Q: What is the difference between RAID-1 and RAID-5 for a two-disk configuration (i.e. the difference between a RAID-1 array built out of two disks, and a RAID-5 array built out of two disks)?
    A: There is no difference in storage capacity. Nor can disks be added to either array to increase capacity (see the question below for details). RAID-1 offers a performance advantage for reads: the RAID-1 driver uses distributed-read technology to simultaneously read two sectors, one from each drive, thus doubling read performance. The RAID-5 driver, although it contains many optimizations, does not currently (September 1997) realize that the parity disk is actually a mirrored copy of the data disk. Thus, it serializes data reads.
  6. Q: How can I guard against a two-disk failure?
    A: Some of the RAID algorithms do guard against multiple disk failures, but these are not currently implemented for Linux. However, Linux Software RAID can guard against multiple disk failures by layering an array on top of an array. For example, nine disks can be used to create three RAID-5 arrays. Then these three arrays can in turn be hooked together into a single RAID-5 array on top. In fact, this kind of configuration will guard against a three-disk failure. Note that a large amount of disk space is ''wasted'' on the redundancy information.
        For an NxN RAID-5 array,
        N=3, 5 out of 9 disks are used for parity (=~56%)
        N=4, 7 out of 16 disks (=~44%)
        N=5, 9 out of 25 disks (=36%)
        ...
        N=9, 17 out of 81 disks (=~21%)
                
    
    In general, an MxN array (M RAID-5 arrays of N disks each, combined by an outer RAID-5) will use M+N-1 disks for parity: each of the M inner arrays gives up one disk, and the outer array gives up one of its M members, which costs another N-1 disks. The least amount of space is "wasted" when M=N. Another alternative is to create a RAID-1 array with three disks. Note that since all three disks contain identical data, two-thirds of the space is ''wasted''.
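    The overhead figures above can be reproduced with a quick calculation; a throwaway shell loop such as the following (awk is used only for the arithmetic) prints the parity fraction for a few values of N:

        for N in 3 4 5 9; do
            awk -v n=$N 'BEGIN {
                parity = 2*n - 1        # M+N-1 with M = N
                total  = n*n
                printf "N=%d: %d of %d disks used for parity (%.0f%%)\n", n, parity, total, 100*parity/total
            }'
        done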
  7. Q: I'd like to understand how it'd be possible to have something like fsck: if the partition hasn't been cleanly unmounted, fsck runs and fixes the filesystem by itself more than 90% of the time. Since the machine is capable of fixing the array by itself with ckraid --fix, why not make that automatic?
    A: Brian Candler <B.Candler@pobox.com> responds: You just put ckraid into your system initialization scripts, the same way fsck is run from them. After the root partition is mounted, add the following to /etc/rc.d/rc.sysinit:
        # if assembling an array fails, repair it with ckraid and retry once
        mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
            ckraid --fix /etc/raid.usr.conf
            mdadd /dev/md0 /dev/hda1 /dev/hdc1
        }
        mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {
            ckraid --fix /etc/raid.var.conf
            mdadd /dev/md1 /dev/hda2 /dev/hdc2
        }
                
    
    (Modify the above to suit your system.) Gadi Oxman explains the operation: In an unclean shutdown, Linux might be in one of the following states:
    • The in-memory disk cache was in sync with the RAID set when the unclean shutdown occurred; no data was lost.
    • The in-memory disk cache was newer than the RAID set contents when the crash occurred; this results in a corrupted filesystem and potentially in data loss. This state can be further divided into the following two sub-states:
      • (a) Linux was writing data when the unclean shutdown occurred.
      • (b) Linux was not writing data when the crash occurred.
    Suppose we were using a RAID-1 array. In case (a) above, it might happen that before the crash, a small number of data blocks were successfully written only to some of the mirrors, so that on the next reboot the mirrors will no longer contain the same data. If we ignore the mirror differences, the 0.36.3 read-balancing code might choose to read the above data blocks from any of the mirrors, which will result in inconsistent behavior (for example, the output of e2fsck -n /dev/md0 can differ from run to run). Since RAID doesn't protect against unclean shutdowns, usually there isn't any ''obviously correct'' way to fix the mirror differences and the filesystem corruption. For example, by default ckraid --fix will choose the first operational mirror and update the other mirrors with its contents. However, depending on the exact timing of the crash, the data on another mirror might be more recent, and we might want to use it as the source mirror instead, or perhaps use another method of recovery. If you wish to run ckraid --fix automatically, you can check the return code of mdrun for errors. For example:
        mdrun -p1 /dev/md0
        if [ $? -gt 0 ] ; then
                # mdrun failed: repair the array before trying to start it again
                ckraid --fix /etc/raid1.conf
                mdrun -p1 /dev/md0
        fi
                
    

