Bug 72808

Summary: disk corruption experienced on NFS & build server with LVM
Product: [Retired] Red Hat Public Beta
Component: kernel
Version: null
Hardware: athlon
OS: Linux
Severity: high
Priority: medium
Status: CLOSED NOTABUG
Reporter: Alexandre Oliva <aoliva>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: sct
Last Closed: 2002-08-28 08:55:24 UTC

Attachments:
Snippet from /var/log/messages containing all the relevant messages

Description Alexandre Oliva 2002-08-28 00:20:25 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020809

Description of problem:
I've had this Athlon box with an Asus A7V133 motherboard for almost a year.  It
has run Red Hat Linux 7.1, 7.2, 7.3 and, lately, limbo (2) and null.  It had no
hardware problems for the first several months, until I swapped its CD-ROM for
a CD-RW; after that, the IDE0 bus (where the CD-RW and one of the hard disks
sit) would occasionally hang.  Most often, the system recovered by itself.
Sometimes it didn't, even though I had RAID1 for the root filesystem and RAID5
for all other filesystems, using 4 disks, one on each IDE channel (2 VIA
channels, 2 Promise channels, all built into the motherboard).  Since I
noticed, after 1 or 2 hangs over a period of a few weeks, that the CD-RW's LED
had started blinking for no apparent reason, I figured it might have something
to do with cabling or a bad contact in the IDE cables.  I replaced the cable,
to no avail.  I learned to live with this problem: whenever I see the CD-RW LED
blinking while I'm not using the drive, I reach for the IDE cable and press the
connectors; this always gets the system back into a working state, though I'm
not sure it makes any real difference.

When I installed null for the first time, the day it debuted, I decided I
wanted to play with LVM, so I removed all my RAID devices and created one PV
per disk (in addition to some swap space on every disk, and 100MB for /boot on
hda and hde).  I created striped logical volumes with 16MB extents (x4, i.e.
64MB strides) for / and /l (my local data filesystem).  The machine worked fine
with kernel 2.4.18-11, and then with 2.4.18-12.
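
For reference, a layout like that would be created with something along these
lines.  This is only a sketch in current LVM2 syntax, not the commands actually
used back then; the volume group name "all", the 16MB extent size and the disks
come from this report, while the LV names, LV sizes and the 64KB stripe size
are illustrative assumptions:

# sketch only: one PV per disk, a VG with 16MB extents, LVs striped over all 4 PVs
pvcreate /dev/hd{a,c,e,g}1
vgcreate -s 16M all /dev/hd{a,c,e,g}1
lvcreate -i 4 -I 64 -L 5G -n root all    # -i: number of stripes, -I: stripe size in KB
lvcreate -i 4 -I 64 -L 150G -n l all
mke2fs -j /dev/all/root
mke2fs -j /dev/all/l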

The other day, at sct's advice, I decided to reconfigure my LVM setup so as to
optimize the disk layout for the expected kinds of workload and to be able to
grow the logical volumes, which I couldn't do on this system unless I added a
multiple of 4 disks, due to the LVM striping.

So I reinstalled it from scratch, doing what I called `poor man's striping',
i.e., allocating consecutive extents on different disks, using a script like this:

# LV "null" (the root filesystem) in VG "all", starting with one extent on hda1:
lvcreate -l 1 -n null all /dev/hda1
# grow it one extent at a time, round-robin across the four disks (abbreviated
# here; the loop was presumably repeated until the LV was large enough):
for f in /dev/hd{c,e,g,a}1; do lvextend -l +1 /dev/all/null $f; done
# trim the root LV back to exactly 5GB:
lvreduce -L 5G /dev/all/null
# LV "l" (the local data filesystem), built the same way:
lvcreate -l 1 -n l all /dev/hda1
for f in /dev/hd{c,e,g,a}1; do lvextend -l +1 /dev/all/l $f; done
# ext3 filesystems with a RAID stride hint (-R) and volume labels (-L):
mke2fs -j -R stride=4096 -L null/ /dev/all/null
mke2fs -j -R stride=4096 -L /l /dev/all/l

I then ran anaconda and told it not to format /dev/all/null or /dev/all/l,
using the former as the root filesystem.  The installation completed
successfully, and I restored the 150GB in /l from another machine overnight.
This was from Saturday to Sunday.

On Sunday and Monday, I used the system normally, i.e., I built an entire
toolchain+glibc several times while experimenting with some patches I've been
working on to enable gcc to use distcc for bootstrapping, with the earlier
stages of the bootstrap mounted over NFS.  Formerly, I didn't use NFS very
often (in fact, back in limbo2, I didn't use it at all, since most of my
machines were still running 7.3, which would hang the nfsd on limbo 2).
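
For context, a plain distcc-assisted build is normally driven by pointing CC at
the distcc wrapper and listing the volunteer hosts.  The sketch below uses
hypothetical host names and job counts; the gcc patches mentioned above (which
additionally let the remote hosts see the freshly built stage compilers over
NFS) are not part of this report:

export DISTCC_HOSTS='localhost build1 build2 build3 build4 build5'  # hypothetical hosts
make -j18 CC='distcc gcc'   # compile jobs are farmed out to the listed hosts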

Already on Monday morning, I experienced some odd behavior: there were some
compile errors during the GCC bootstrap using distcc that would go away if the
build was repeated.  To convince myself I didn't have a problem with my
desktop, which holds the primary copy of all my data, I ran a local bootstrap,
and that worked fine.  Only when I brought more machines into the build, using
distcc, did the errors show up.

This may be related to the increased load on the local machine, since at the
very least it has to preprocess the files it ships for the remote distccds to
compile, and serve every stage compiler over NFS to the other 5 build machines.

This was the last noticeable problem I had on Monday.  This Tuesday, I got back
to work in the morning and resumed building a cross toolchain+glibc, also using
distcc.  During the afternoon, I ran into some build problems: some object
files in ~/.ccache were corrupted, to the point that the linker no longer
recognized them as valid object files.  I found that odd, and figured it was one
of the other machines in my build farm that was giving me headaches again.  So I
decided to stop using distcc, to clear ~/.ccache since I didn't know how much of
it was corrupted, and start the build over.  But rm -rf ~/.ccache failed with a
segmentation fault.  Oops.  Something was very wrong.

At this point, the CD-RW LED started blinking.  It hadn't done so for days, if
not weeks.  I pushed the IDE cable on the back of the CD-RW and of hda, and the
system got back to a normal state, at least normal enough for me to become root
and look at /var/log/messages.  Only to find I was doomed: there were a lot of
messages indicating filesystem corruption since early morning, ending with a
kernel oops, all of them contained in the attached snippet from
/var/log/messages.

After a reset (the disk subsystem appears to have died, since I couldn't get
the machine to reboot, not even with reboot -f -i -h, even though I could log
in as root on the text terminals several times), an fsck on /l was forced.  It
found 4 inodes with invalid blocks and refused to proceed, asking me to run
fsck by hand.  I did so (without -y) and told it to clear the invalid blocks,
and then the 4 affected inodes.  While it scanned for duplicate blocks, the
power failed, and the UPS revealed it's not worth the power it demands itself
(not my day, eh? :-/)

When power came back, I turned the machine on, and it found and reported the
duplicate blocks again (this time I started syslog first, so the messages made
it to /var/log/messages), and it required me to run fsck by hand again.  I did
so, and saved the fsck output in a file so that I could check which files had
been affected and fetch newer copies of them.
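
For the record, saving the output of such a manual check amounts to something
like the following (the log file name is arbitrary; /dev/all/l is the LV from
the layout above, and it must not be mounted while e2fsck runs):

# force a full check and keep a copy of everything it prints
e2fsck -f /dev/all/l 2>&1 | tee /root/fsck-l.log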

After fsck completed, I looked at its output carefully and realized that some
of the affected object files were from Monday morning's builds (they hadn't
been touched since then), while most of the others were from Tuesday
afternoon's builds.

None of the affected files were sources or Mail files, apparently (the cvs
updates I ran in the morning were OK, and I have no indication of oddities in
the e-mail I've been reading in the afternoon).  Comparing older files with
backups revealed no corruption of old files, fortunately.  But now I'm very
concerned, because it appears that Monday's builds somehow corrupted the
filesystem (since the bootstrap was already failing randomly back then), and
today I ran into problems when the corrupted spots were reached.

I'm a bit wary of trying to duplicate this problem, since, well, I depend on
these machines for my work, and verifying 150GB of data takes a long time.
However, I'll keep using this machine the same way, and hope the problem
doesn't happen again.  I thought this report might be useful, though...

The only changes from the earlier (believed to be stable) configuration are:

- kernel upgraded from 2.4.18-12 to 2.4.18-12.2

- disk layout changed from striped LVs over 4 PVs with 16MB extents to
non-striped LVs over 4 PVs with 4MB extents, with consecutive extents scattered
across the 4 disks (see the sketch after this list)

- use of this box as an NFS server has increased, but only for read access

- the load on this machine has increased, from 2-4 up to peaks of 15 (with 3
concurrent builds at -j6, shipping as many as 15 of the 18 concurrent compile
jobs to remote machines)
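
One way to double-check how the extents of the new layout are actually
scattered is to inspect the segment-to-PV mapping.  In current LVM2 syntax (the
LVM1 tools in use at the time had different but equivalent verbose listings)
that would be something like:

lvs -o +devices all     # which PVs each LV segment lives on
lvdisplay -m /dev/all/l # per-segment extent mapping for one LV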

I've fsck'ed the other 4 machines running null with kernel 2.4.18-12.2, and one
of them had an unexpected inode cleared by fsck.  It was my laptop (a Dell
Inspiron 8000 with a 1.0GHz PIII), which I had converted to a similar
hand-striped LVM layout over its two disks on Saturday; it was being used as
the master build machine for bootstrapping gcc with distcc while my desktop was
reinstalled.  I found this very suspicious.

Version-Release number of selected component (if applicable):
kernel-2.4.18-12.2

How reproducible:
Didn't try, but will keep trying :-)

Comment 1 Alexandre Oliva 2002-08-28 00:21:07 UTC
Created attachment 73425 [details]
Snippet from /var/log/messages containing all the relevant messages

Comment 2 Alexandre Oliva 2002-08-30 06:16:14 UTC
I just found out something that convinced me that the problem I experienced was
NOT caused by a kernel bug.

The cable coming out of the power supply was plugged into a Y cable that fed
both hda and the CD-RW.  It turned out that the plug was defective: the contact
of the +12V (yellow) wire was loose.

As a result of vibration, the power was sometimes cut, resetting the CD-RW and
getting its LED blinking.  Merely reaching for the IDE cable was enough to
re-establish the contact, since I couldn't help touching the power cable that
was in the way.

I wonder how many times the disk went nuts because its power was cut or nearly
cut, and I avoided running into a problem out of pure luck.  Probably what
happened on Monday was that something bad happened at just the wrong time and
caused incorrect information to be stored on disk, or something like that.  It
was probably just a coincidence that it happened right after I switched to a
totally different disk layout.

I apologize for the noise and insistence, and thank you for your attention,
kindness and cooperation.