Bug 146974 - /boot data corruption
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium
Severity: low
Assigned To: Tom Coughlan
QA Contact: Brian Brock
Reported: 2005-02-03 01:22 EST by lok
Modified: 2007-11-30 17:07 EST
Doc Type: Bug Fix
Last Closed: 2007-10-19 15:07:46 EDT


Description lok 2005-02-03 01:22:25 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
We have recently upgraded from the 2.4.21-15.0.4.ELsmp kernel to the
2.4.21-27.0.2.ELsmp kernel on two identical servers. We are now seeing
data corruption problems with /boot on these servers. For example,
running md5sum on the contents of /boot gives:

[root@swampoak boot]# md5sum /boot/*
<snip>
585e2f966737c0749a390c4e5fa1de21  /boot/System.map
585e2f966737c0749a390c4e5fa1de21  /boot/System.map-2.4.21-15.0.4.ELsmp
b3e98aa629f0a41677782504e1a9c7b7  /boot/System.map-2.4.21-27.0.2.ELsmp
8245a7ff14eb6a8a6dc7f1aaf80c6020  /boot/System.map-2.4.21-9.0.3.ELsmp
f1af8bab34dec37945d4b332f6f81805  /boot/vmlinux-2.4.21-15.0.4.ELsmp
md5sum: /boot/vmlinux-2.4.21-27.0.2.ELsmp: Input/output error
<snip>

We have tried unmounting /boot and running fsck -y (which produces a
lot of errors); after remounting, the filesystem is empty apart from
the lost+found directory. We have also tried running mke2fs on the
partition and restoring the data after mounting the newly formatted
filesystem, but we still get I/O errors like the above.

From /var/log/messages:

Feb  3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb  3 16:22:33 swampoak kernel: 08:01: rw=0, want=1049345, limit=48163
Feb  3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb  3 16:22:33 swampoak kernel: 08:01: rw=0, want=579731457, limit=48163
Feb  3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb  3 16:22:33 swampoak kernel: 08:01: rw=0, want=191049850, limit=48163
Feb  3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb  3 16:22:33 swampoak kernel: 08:01: rw=0, want=828716830, limit=48163
Feb  3 16:22:33 swampoak kernel: attempt to access beyond end of device
<snip>
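
As a rough illustration of how far out of range these requests are, the following Python sketch picks the want/limit pairs out of log lines like the ones above. The function name and the regular expression are hypothetical helpers written for this example, not part of any kernel or Red Hat tooling. It assumes the "want" and "limit" values share the same unit (probably 1 KB blocks on a 2.4 kernel, which would be consistent with the df output for /boot, but that is an assumption).

```python
import re

# Matches the detail line that follows "attempt to access beyond end of device".
# The pattern is an assumption based only on the log excerpt in this report.
DETAIL = re.compile(
    r"(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}): rw=(?P<rw>\d+), "
    r"want=(?P<want>\d+), limit=(?P<limit>\d+)"
)


def out_of_range_requests(lines):
    """Yield (device, want, limit) for each request past the end of the device."""
    for line in lines:
        m = DETAIL.search(line)
        if m and int(m["want"]) > int(m["limit"]):
            yield m["dev"], int(m["want"]), int(m["limit"])


# Sample lines copied from the /var/log/messages excerpt in this report.
sample = [
    "Feb  3 16:22:33 swampoak kernel: attempt to access beyond end of device",
    "Feb  3 16:22:33 swampoak kernel: 08:01: rw=0, want=1049345, limit=48163",
    "Feb  3 16:22:33 swampoak kernel: 08:01: rw=0, want=579731457, limit=48163",
]

for dev, want, limit in out_of_range_requests(sample):
    print(f"{dev}: want={want} exceeds limit={limit} by a factor of {want / limit:.0f}")
```

Requests that overshoot the device size by several orders of magnitude look like garbage block numbers read from filesystem metadata rather than near-miss reads at the end of the partition.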

df of /boot:
/dev/sda1                46636     21720     22508  50% /boot

We are using the aacraid driver.


Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-27.0.2.EL

How reproducible:
Always

Steps to Reproduce:
1. Try to access files under /boot (e.g. with md5sum)
2. Get I/O errors

Actual Results:  I/O errors

Expected Results:  md5sum should have printed a checksum for all files.
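
The reproduction above can also be scripted. The following is a minimal Python sketch, not the reporter's actual procedure: it checksums every regular file directly under a directory and records any file that fails with an I/O error instead of stopping at the first one. The helper names md5_of_file and check_directory are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_directory(directory):
    """Return (checksums, errors) for the regular files directly in directory."""
    checksums, errors = {}, {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            checksums[path] = md5_of_file(path)
        except OSError as exc:
            # Read failures such as "Input/output error" end up here.
            errors[path] = exc.strerror
    return checksums, errors


if __name__ == "__main__" and os.path.isdir("/boot"):
    sums, errs = check_directory("/boot")
    for path, digest in sums.items():
        print(f"{digest}  {path}")
    for path, reason in errs.items():
        print(f"ERROR: {path}: {reason}")
```

Running this before and after a kernel upgrade and comparing the two result sets would show which files became unreadable or changed content.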

Additional info:
Comment 1 Tom Coughlan 2005-02-03 09:51:47 EST
How much memory is in these servers?

Are you able to try the aacraid_10102 driver that is included in
2.4.21-27.0.2.ELsmp?
Comment 2 lok 2005-02-04 03:38:54 EST
The first server has 2GB RAM, the second has 1GB.

We'll be taking the server with 2GB RAM out of production this weekend, and will
be able to try the older version of the driver next week.
Comment 3 lok 2005-02-15 18:29:16 EST
We rebooted the server and it didn't come up.  We tried running an upgrade off
the AS3U4 CD, but it doesn't detect a Red Hat install on the disk.  We've also
tried booting from Knoppix, copying the files from /boot on a working AS3U4
server across, and reinstalling GRUB, but GRUB won't proceed past stage 1.  In
short, we've tried everything short of making a new filesystem on /boot and
using it.  As the server is no longer in production, we're tempted to just
rebuild it from scratch.  Any other suggestions before we do this?
Comment 4 Tom Coughlan 2005-02-15 18:54:50 EST
Not really. I don't recognize this problem at all. Mark?
Comment 5 Mark Salyzyn 2005-02-16 07:39:03 EST
These I/O errors are actually the result of the file system referencing
data beyond the boundaries of the device, so I view this as a data corruption
issue.

There has never been a data corruption issue associated with the driver.
Corruption has thus far been attributed to the hardware (power supply,
motherboard and memory), the source of the data, the media, or the firmware
(drive and card). I have worried that when the driver causes a panic for one
reason or another I would end up with file system corruption, but I have yet
to be bitten by this. Could this not be a result of the bug 146630 panic?

Comment 6 Tom Coughlan 2005-05-18 17:20:28 EDT
Sorry for the delay in getting back to this. Can you please provide your current
status?  Has this problem persisted? Do you think it may have been a result of
the bug id 146630 panic, as Mark suggests?

We have not had other data corruption reports that match this scenario exactly.
There is one data corruption fix in U5. See BZ 147969. It is not clear whether
that problem applies in this case. If you are still seeing the problem, please
try a test with U5.
Comment 7 lok 2005-05-20 02:15:52 EDT
I don't think it's the result of bug 146630. We had three servers with
NetRAID 4M cards which we upgraded to kernel 2.4.21-27.0.2.ELsmp: one hung as
per bug 146630, and the other two booted with the newer kernel but then had
data corruption issues.  None of the three are in production use anymore.  We
tried a clean rebuild, from our kickstart configuration, of one of the two
servers that suffered data corruption, followed by an up2date -fu to get the
latest kernel, and had the same data corruption issues.  It could well be the
firmware on the cards that produces different symptoms on the three otherwise
identical boxes.

I have to say, since the boxes aren't in production anymore (and we have no
other servers with NetRAID 4M cards) I wouldn't be too worried if this bug ID
and also 146630 were both closed off.
Comment 8 RHEL Product and Program Management 2007-10-19 15:07:46 EDT
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
