From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
We have recently upgraded from the 2.4.21-15.0.4.ELsmp kernel to the 2.4.21-27.0.2.ELsmp kernel on two identical servers. We are now seeing data corruption problems with /boot on these servers. E.g. if we do md5sum on the contents of /boot, we get:

[root@swampoak boot]# md5sum /boot/*
<snip>
585e2f966737c0749a390c4e5fa1de21  /boot/System.map
585e2f966737c0749a390c4e5fa1de21  /boot/System.map-2.4.21-15.0.4.ELsmp
b3e98aa629f0a41677782504e1a9c7b7  /boot/System.map-2.4.21-27.0.2.ELsmp
8245a7ff14eb6a8a6dc7f1aaf80c6020  /boot/System.map-2.4.21-9.0.3.ELsmp
f1af8bab34dec37945d4b332f6f81805  /boot/vmlinux-2.4.21-15.0.4.ELsmp
md5sum: /boot/vmlinux-2.4.21-27.0.2.ELsmp: Input/output error
<snip>

We have tried unmounting /boot and running fsck -y (which produces a lot of errors), and after mounting, the filesystem is empty apart from the lost+found directory. We have also tried running mke2fs on the partition and restoring the data after mounting the newly formatted filesystem, but we still get I/O errors like the above.

From /var/log/messages:

Feb 3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=1049345, limit=48163
Feb 3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=579731457, limit=48163
Feb 3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=191049850, limit=48163
Feb 3 16:22:33 swampoak kernel: attempt to access beyond end of device
Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=828716830, limit=48163
Feb 3 16:22:33 swampoak kernel: attempt to access beyond end of device
<snip>

df of /boot:

/dev/sda1 46636 21720 22508 50% /boot

We are using the aacraid driver.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-27.0.2.EL

How reproducible:
Always

Steps to Reproduce:
1. Try to access files under /boot (e.g. md5sum)
2. Get I/O errors
3.

Actual Results:
I/O errors

Expected Results:
md5sum should have printed a checksum for all files.

Additional info:
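The reproduction steps above amount to reading every file under /boot and noting which reads fail. A minimal sketch of that check follows, written in present-day Python 3 as an illustration only; it is not part of the original report, and `checksum_dir` is a hypothetical helper name. It records an MD5 for each readable file and the error string for each file whose read raises an I/O error.

```python
import hashlib
import os


def checksum_dir(directory):
    """Return (sums, errors) for the regular files directly inside directory.

    sums maps path -> MD5 hex digest; errors maps path -> error message
    for files that could not be read (for example "Input/output error").
    """
    sums, errors = {}, {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories such as lost+found
        digest = hashlib.md5()
        try:
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large kernel images are not
                # loaded into memory in one go.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
        except OSError as exc:
            errors[path] = exc.strerror or str(exc)
            continue
        sums[path] = digest.hexdigest()
    return sums, errors
```

Calling `checksum_dir("/boot")` on an affected host would be expected to put the unreadable files in the second dictionary instead of stopping at the first failure.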
How much memory is in these servers? Are you able to try the aacraid_10102 driver that is included in 2.4.21-27.0.2.ELsmp?
The first server has 2GB RAM, the second has 1GB. We'll be taking the server with 2GB RAM out of production this weekend, and will be able to try the older version of the driver next week.
We rebooted the server and it didn't come up. We tried running an upgrade from the AS3U4 CD, but it doesn't detect a Red Hat install on the disk. We've also tried booting from Knoppix, copying across the files from /boot on a working AS3U4 server, and reinstalling GRUB, but GRUB won't proceed past stage 1. In short, we've tried everything short of making a new filesystem on /boot and using it. As the server is no longer in production, we're tempted to just rebuild it from scratch. Any other suggestions before we do this?
Not really. I don't recognize this problem at all. Mark?
These I/O errors are actually the result of the file system making references to data beyond the boundaries of the device, so I view this as a data corruption issue. There has never been a data corruption issue associated with the driver. Corruption has thus far been attributed to the hardware (power supply, motherboard, and memory), the source of the data, the media, or the firmware (drive and card). I have worried that when the driver causes a panic for one reason or another I would end up with file system corruption, but I have yet to be bitten by this. Could this not be a result of the bug id 146630 panic?
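To make the out-of-bounds point concrete, here is a minimal sketch that parses the "want=/limit=" lines quoted in the original report and lists the requests that fall past the end of the device. The regular expression is an assumption based only on the lines quoted in this bug, `out_of_range_requests` is a hypothetical name, and the comparison assumes `want` and `limit` are expressed in the same unit.

```python
import re

# Matches the message format quoted in this report, e.g.
# "08:01: rw=0, want=1049345, limit=48163" (device given as major:minor).
PATTERN = re.compile(
    r"([0-9a-fA-F]+:[0-9a-fA-F]+): rw=\d+, want=(\d+), limit=(\d+)"
)


def out_of_range_requests(lines):
    """Return (device, want, limit) for each request beyond the device limit."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # e.g. the "attempt to access beyond end of device" line
        device = match.group(1)
        want = int(match.group(2))
        limit = int(match.group(3))
        if want > limit:
            hits.append((device, want, limit))
    return hits


# Lines copied from the /var/log/messages excerpt in the report.
SAMPLE = [
    "Feb 3 16:22:33 swampoak kernel: attempt to access beyond end of device",
    "Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=1049345, limit=48163",
    "Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=579731457, limit=48163",
    "Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=191049850, limit=48163",
    "Feb 3 16:22:33 swampoak kernel: 08:01: rw=0, want=828716830, limit=48163",
]

for device, want, limit in out_of_range_requests(SAMPLE):
    print(f"{device}: want={want} is {want - limit} past limit={limit}")
```

In the quoted log every requested value exceeds the limit by a wide margin, which fits the reading that the file system is following bad block references instead of narrowly overrunning the partition.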
Sorry for the delay in getting back to this. Can you please provide your current status? Has this problem persisted? Do you think it may have been a result of the bug id 146630 panic, as Mark suggests? We have not had other data corruption reports that match this scenario exactly. There is one data corruption fix in U5. See BZ 147969. It is not clear whether that problem applies in this case. If you are still seeing the problem, please try a test with U5.
I don't think it's the result of bug id 146630. We had three servers with NetRAID 4M cards which we upgraded to kernel 2.4.21-27.0.2.ELsmp: one hung as per bug id 146630, and the other two booted with the newer kernel but then had data corruption issues. None of the three are in production use anymore. On one of the two which suffered data corruption, we tried a clean rebuild from our kickstart configuration, then an update -fu to get the latest kernel, and had the same data corruption issues. It could well be the firmware on the cards that causes the different symptoms on the three otherwise identical boxes. I have to say, since the boxes aren't in production anymore (and we have no other servers with NetRAID 4M cards), I wouldn't be too worried if this bug id and also 146630 were both closed off.
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.