Startup Problems When You Boot the System
There are many things that can go wrong and keep a Unix system from
successfully booting. The most common problems are the following:
- Hardware problems;
- Defective boot media;
- Network problems (diskless booting);
- Damaged filesystems;
- Improperly configured kernels;
- Errors in the system startup scripts.
Hardware problems
Hardware problems are a fact of life and there is not much you can do
to circumvent them. That said, it is always in your best interest to
rationally examine the symptoms and document exactly what is (or isn't)
happening. Most maintenance services carry a limited number of
spare parts, so you can greatly improve your service time by
doing some simple diagnosis before you call. In addition, many groups
have maintenance contracts with different vendors for different
products. Calling your system vendor when the problem is with a
different vendor's disk drive doesn't help.
The goal of attempting to diagnose the problems should be to eliminate
components known to be working and isolate the problem to one area.
Things to examine when booting:
- Before a problem develops, get a good understanding of the hardware
you have. Before ever powering the system up, remove the cover and look
at (but don't touch) the components. Write down information such as how
much memory is in the machine, the number of SCSI controllers,
peripherals, etc.
If your machine is being set up by a technician, watch the technician set
up the machine and ask questions.
- Do you have power?
Verify the fan is running or that the LED is lit. If not, check the
power circuit by plugging in something known to work (e.g. a radio). If
the circuit has power then the power supply on the machine has probably
failed.
- Does the system pass the self-test?
Many systems have a power-on self test. That self test will check the
components of the system and usually give some indication of what failed.
Typically the system reports something like "test 2 failed", and you then
look up what test 2 checks in the hardware manual for your machine.
- Does the root disk get accessed?
Verify the disk is powered up; usually this can be done by listening for
the fan.
Most disk drives have an access light (or at least make a sound when you
access them). Does this happen? If not, you may have a bad disk. Booting
from an alternate root partition on a different disk may allow you to test
this. At a minimum, check the cabling on the disk and verify it is
connected properly.
- Many systems have the ability to probe the SCSI bus from the ROM
monitor. If your system has this ability, see if you can see the devices
on your SCSI bus. If you cannot see any devices on the SCSI bus, try
removing one device and trying again. Make sure the SCSI terminator is
properly attached, and if possible try a different terminator. If you still
can't see the devices on the SCSI bus you may have a bad CPU board or
SCSI controller.
- Finally, when all else has failed, try power cycling everything.
Occasionally machines will get locked up in an unknown state. Power
cycling the machine and all peripherals may clear this problem. Do not turn
equipment off and back on without waiting at least 15 seconds. Doing so
can cause problems with the power supply.
Defective boot media
Most modern workstations now boot initially from ROM or PROM; however,
you may have a machine that boots from floppy or tape. If so, insist
on having a backup copy of the boot media. Most systems
that support this method of booting have utilities for creating duplicate
copies of the boot media. Insist that backup copies are made whenever
the boot media changes. When you have down time, test your backup
copies and verify they work!
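As a complement to an actual boot test, a backup copy can also be compared
against the original byte for byte. The following is only a sketch, and it
assumes that both the original and the backup media have already been read
into ordinary image files with whatever utility your system provides. The
function names and file paths are hypothetical. Matching checksums show that
the copy is identical, not that it boots, so the boot test is still needed.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large media images need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_match(original_path, backup_path):
    """True if the two image files have identical contents."""
    return file_digest(original_path) == file_digest(backup_path)
```

For example, images_match("boot_original.img", "boot_backup.img") would
return True only when the two (hypothetical) image files are identical.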
With ROM or PROM based systems, sit down and read the manual and understand
what the ROM monitor is capable of doing. Many of these monitors have
functions for performing hardware diagnostics.
Damaged filesystems
This has historically been the most common type of problem. However,
disk drives have become dramatically more reliable over the last
few years. Many disk manufacturers now give a 5 year warranty on SCSI
disk drives. Still, the mechanical nature of disk drives makes them more
susceptible to failure than components with no moving parts.
Under Unix, the command fsck is used to do a file system consistency
check. The manual page for fsck describes the
options and functions of the command. The fsck command checks the following:
- Inode block addressing checks: Too many direct or indirect
extents, extents out of order, bad magic number in extents, blocks
that are not in a legal data area of the file system, blocks that
are claimed by more than one inode.
- Size checks: Number of blocks claimed by inode inconsistent with
inode size, directory size not block aligned.
- Directory checks: Illegal number of entries in a directory block,
bad freespace pointer in directory block, entry pointing to
unallocated or out-of-range inode, overlapping entries, missing or
incorrect dot and dotdot entries.
- Pathname checks: Files or directories not referenced by a pathname
starting from the file system root.
- Link count checks: Link counts that do not agree with the number
of directory references to the inode.
- Freemap checks: Blocks claimed free by the freemap but also
claimed by an inode, blocks unclaimed by any inode but not
appearing in the freemap.
- Super Block checks: Total free block and/or free i-node count
incorrect.
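To make two of these checks concrete, here is a toy model of the link count
check and the freemap check. It is only a sketch over plain Python
dictionaries and sets, not how fsck reads a real disk; the data layout and
function names are assumptions made for illustration.

```python
from collections import Counter


def check_link_counts(inodes, directories):
    """Compare each inode's recorded link count with the number of
    directory entries that actually refer to it.

    inodes:      {inode_number: recorded_link_count}
    directories: {directory_inode: {entry_name: inode_number}}
    Returns {inode_number: (recorded, found)} for every mismatch.
    """
    found = Counter()
    for entries in directories.values():
        for target in entries.values():
            found[target] += 1
    return {ino: (recorded, found[ino])
            for ino, recorded in inodes.items()
            if found[ino] != recorded}


def check_freemap(all_blocks, free_blocks, claimed_by_inode):
    """Find blocks that are both free and claimed, and blocks that are
    neither claimed by an inode nor listed in the freemap.

    all_blocks, free_blocks: sets of block numbers
    claimed_by_inode:        {inode_number: [block numbers]}
    """
    claimed = set()
    for blocks in claimed_by_inode.values():
        claimed.update(blocks)
    free_but_claimed = sorted(free_blocks & claimed)
    unaccounted = sorted(all_blocks - claimed - free_blocks)
    return free_but_claimed, unaccounted


# Toy file system: inode 2 is the root directory.
inodes = {2: 2, 3: 1, 4: 2, 5: 1}
directories = {2: {".": 2, "..": 2, "a": 3, "b": 4}}
print(check_link_counts(inodes, directories))
# -> {4: (2, 1), 5: (1, 0)}

print(check_freemap(set(range(10, 16)), {13, 14}, {3: [10, 11], 4: [12, 13]}))
# -> ([13], [15])
```

In the toy data, inode 4 records two links but only one directory entry
refers to it, and inode 5 is allocated but unreferenced. Block 13 is marked
free yet claimed by inode 4, and block 15 is neither claimed nor free.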
Orphaned files and directories (allocated but unreferenced) are, with the
user's concurrence, reconnected by placing them in the lost+found
directory, if the files are nonempty. The user will be notified if the
file or directory is empty or not. Empty files or directories are
removed, as long as the -n option is not specified. fsck will force the
reconnection of nonempty directories. The name assigned is the i-node
number. The directory lost+found must preexist in the root of the file
system being checked and must have empty slots in which entries can be
made. This directory is always created by mkfs(1m) when a file system is
first created.
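The reconnection rule described above can be sketched as follows. This is a
toy model only: it ignores the user's concurrence and the -n option, and the
data layout and function name are hypothetical. The exact form of the name
placed in lost+found may differ between fsck implementations; here it is
simply the i-node number as a string.

```python
def reconnect_orphans(inodes, referenced, lost_found):
    """Handle allocated but unreferenced inodes.

    inodes:     {inode_number: size_in_bytes} for all allocated inodes
    referenced: set of inode numbers reachable from some directory
    lost_found: {entry_name: inode_number}, updated in place
    Returns the list of empty orphans that would be removed.
    """
    removed = []
    for ino in sorted(inodes):
        if ino in referenced:
            continue                      # not an orphan
        if inodes[ino] == 0:
            removed.append(ino)           # empty orphans are removed
        else:
            lost_found[str(ino)] = ino    # nonempty: named by inode number
    return removed


lost_found = {}
removed = reconnect_orphans({2: 512, 3: 100, 7: 0, 9: 4096}, {2, 3}, lost_found)
print(removed)      # -> [7]
print(lost_found)   # -> {'9': 9}
```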
The fsck command can usually
recover from minor errors. There may be some file system corruption, but
usually the file system will be usable.
For file systems other than root, if fsck cannot repair your disk or file
system you can generally boot in single-user mode
and recover the file system from archival backups. To do this you would
use whatever restore command is associated with the backup software you used.
For new system administrators it is a very good idea to practice this
during test time, preferably on a new disk without any critical data.
It is much better to become comfortable with file system restoration when
it is not a crisis situation.
The root file system is obviously the most critical file system; without it
you cannot boot the machine. If the root file system is damaged, the fsck
command, which resides on the root file system, may not be loadable, so it
cannot be used to repair the damage and other things must be done. One
option, available on many
workstations, is to boot from a CD-ROM or some alternate device (tape)
and then use the utilities available there to recover your file system. It
is critical that as a system manager you understand what utilities are
available on the alternate device and use one of those utilities to
back up your root file system. For example, on SGI systems we can boot
from CD-ROM and have access to the utility bru. Because of this
we use the bru command to back up our root and usr file systems. If
a problem occurs we can recover from backup tapes.
If you don't have a means of booting from alternate media it is
probably a very good idea to create an alternate root partition on a
different disk. Then if your primary root disk is corrupted you can
boot from the alternate drive and restore the true root partition from
tape. As with all of this, it is a very good idea for system
administrators to try this ahead of time.
Improperly configured kernels
We will spend more time discussing kernels later in the semester.
However, it is possible to build a kernel that won't boot. For
example, we worked with SGI to build a kernel containing a very
large number of streams buffers. When we went to load the kernel
we had to found that it was too large and conflicted with some
limitations imposed by SGI's SASH utility. The key thing when building
a kernel is to have a consistent mechanism for naming the new and old
kernels. If a new kernel won't boot it is often impossible at that to
go back and see what the previous kernel was named. Given a kernel
that won't boot each OS has some means of specifying an alternate
kernel file to load.
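One simple way to get a consistent naming mechanism is to always save the
running kernel under the same predictable name before installing a new one.
The sketch below assumes a fixed ".old" suffix; that suffix, the function
name, and any kernel path passed in are hypothetical conventions, not
something your OS requires.

```python
import shutil
from pathlib import Path


def preserve_kernel(kernel_path, suffix=".old"):
    """Copy the current kernel file to <name><suffix> in the same directory
    and return the path of the saved copy.

    Because the saved name is always the same, you know which file to
    give the boot loader even when the new kernel will not boot.
    """
    kernel = Path(kernel_path)
    saved = kernel.with_name(kernel.name + suffix)
    shutil.copy2(kernel, saved)   # copy2 also preserves timestamps
    return saved
```

Calling preserve_kernel on a kernel file named unix, for instance, would
leave a copy named unix.old next to it. Note that each call overwrites the
previous saved copy, so only one earlier kernel is kept.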
Errors in the system startup scripts
Errors in the startup scripts (either /etc/rc or /etc/init.d) are
usually not fatal. Sometimes the error is minor and can be fixed
after the machine has come up by re-running a set of commands as root.
If the problem is more severe, you will have to boot the system in
single-user mode and fix the problem by editing the startup scripts.
Remember that usually only the vi or ed editors are
available in single-user mode, so make sure you have a basic understanding
of those editors.
Jack Suess/jack@umbc.edu