Description of problem:
After applying all Update 5 packages, including the kernel, to EL4 and
rebooting, LVM breaks, as does raw ext3 on bare metal partitions. (Data on the
first SATA drive is accessible and stable; the second SATA drive is not. Note,
however, that one out of 6 machines corrupted its primary sda device, so it
seems this bug can cause issues on both devices.)
At first glance this appears to be LVM; however, I eliminated LVM as the cause
by reformatting the /dev/sdb device with a standard ext3 partition. mkfs.ext3
works fine. However, the following error occurs when attempting to mount the
device:
[root@grid63-11x ~]# mount /dev/sdb1 /local_disk
mount: /dev/sdb1 already mounted or /local_disk busy
Definitely an issue on stock updated kernel 2.6.9-55.ELsmp_x86_64.
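One quick way to tell "already mounted" apart from "busy" is to look at the mount table directly. Below is a minimal sketch, assuming the standard /proc/mounts layout (device, mount point, filesystem type, options, dump, pass). The helper name mount_points and the sample text are hypothetical and only for illustration.

```python
from typing import List


def mount_points(mounts_text: str, device: str) -> List[str]:
    """Return the mount points listed for `device` in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each entry: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points


# Hypothetical sample text, not taken from the affected machines.
sample = "/dev/sda1 / ext3 rw 0 0\nproc /proc proc rw 0 0\n"
print(mount_points(sample, "/dev/sdb1"))  # prints []
```

On a live system the text would come from open("/proc/mounts").read(). An empty result while mount still reports the device as busy would suggest that something other than a mount is holding the device.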
Version-Release number of selected component (if applicable):
EL4 Update 5
Every time on a subset of 4 machines with identical hardware configuration by:
Steps to Reproduce:
2. apply update 5 patches with kernel
4. mounting fails on boot
Everything works FINE until booted into new kernel. (Note key detail: rebooting
on old kernel multiple times works fine BEFORE booting into -55ELsmp kernel)
When the new (-55ELsmp) kernel boots, if LVM is used, VolGroup01 is unable to
initialize logical volumes on boot, resulting in an fsck failure and a prompt
for the root password. If direct ext3 is used, the mount fails but boot
continues on to subsequent
After this point, the LogVolXX devices are reported as zero size, and for direct
ext3 on the device, mount says the device is busy.
Rebooting to the older kernel also does not fix the issue. It seems that the new
kernel damages the partition on the device in some way, not allowing it
to boot. Destroying the disk, re-running mkfs, and/or setting up LVM again fixes
the problem, as long as the -55ELsmp kernel is not booted.
Boot with sdb1 device mounted.
Is there anything in /var/log/messages that would steer us in the right direction?
What was the last known kernel that was ok? If it's not too much trouble, we
might want to binary search the kernels to find the one that introduced this
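For reference, a minimal sketch of that binary search, assuming the kernels can be listed oldest to newest, the first one is known good, the last one is known bad, and every kernel after the first bad one is also bad. The kernel names are hypothetical placeholders, and is_bad stands in for the manual "install, reboot, try to mount" check.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is true.

    Assumes versions[0] is known good and versions[-1] is known bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical placeholder names, not real kernel releases.
kernels = ["kernel-A", "kernel-B", "kernel-C", "kernel-D", "kernel-E"]
# Stand-in for the manual reboot-and-mount test on each kernel.
broken = {"kernel-D", "kernel-E"}
print(first_bad(kernels, lambda k: k in broken))  # prints kernel-D
```

With N candidate kernels this needs roughly log2(N) reboots instead of N.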
Not a smoking gun that would point to exactly what the culprit would be. I am
going to take one of my grid nodes offline and do some simple trial and error to
find the exact [set of] package[s] causing the issue. I will be doing this over
the next two days and will post results ASAP.
Created attachment 158438 [details]
bootlog with working dmraid and LVM
This bootlog contains the kernel messages on boot with working dmraid.
Created attachment 158439 [details]
bootlog with broken dmraid and LVM
Bootlog with broken dmraid installed on the machine. Note the CRC errors.
I've narrowed the problem down to the new dmraid-1.0.0.rc14-5_RHEL4_U5 package.
Downgrading to dmraid-1.0.0.rc11-3_RHEL4_U4 resolves the issue. I've attached
bootlog messages from the serial console. sdiff points out the major
differences; specifically, note the CRC errors on the devices.
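For anyone who wants to repeat the comparison without sdiff, here is a minimal sketch using Python's difflib that lists the lines present only in the broken bootlog. The helper name added_lines and the placeholder lines are hypothetical. The ddf1 line is one of the messages from the broken bootlog.

```python
import difflib
from typing import List


def added_lines(good_log: str, bad_log: str) -> List[str]:
    """Return lines that ndiff marks as present only in bad_log."""
    diff = difflib.ndiff(good_log.splitlines(), bad_log.splitlines())
    # ndiff prefixes lines unique to the second input with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]


# Placeholder lines stand in for the unchanged parts of the real bootlogs.
good = "placeholder line 1\nplaceholder line 2\n"
bad = (
    "placeholder line 1\n"
    "ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sda\n"
    "placeholder line 2\n"
)
for line in added_lines(good, bad):
    print(line)
```

With the real attachments the two strings would be read from the saved serial console logs.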
This seems to me to be some kind of hardware/dmraid interaction issue. We are
running a 74GB drive and a 300GB drive. The Adaptec HostRAID controller
(AIC-8130: (Marvell 88SX6041)) does NOT have array devices configured, therefore
we are doing direct JBOD mode. The specific motherboard is a SuperMicro
H8DAR-T. Specs here:
Please let me know if you need any more info.
Here are the specific CRC errors.
ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sda
ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sdb
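To check the other machines for the same symptom, the messages above can be pulled out of a bootlog with a regular expression. This is a minimal sketch: the pattern is inferred only from the two lines quoted here, so other dmraid versions may word the message differently, and the names CrcMismatch and find_crc_mismatches are hypothetical.

```python
import re
from typing import List, NamedTuple


class CrcMismatch(NamedTuple):
    found: str
    expected: str
    device: str


# Pattern inferred from the two messages quoted in this report only.
CRC_RE = re.compile(
    r"ddf1: physical drives with CRC (?P<found>[0-9A-Fa-f]{8}), "
    r"expected (?P<expected>[0-9A-Fa-f]{8}) on (?P<device>\S+)"
)


def find_crc_mismatches(log_text: str) -> List[CrcMismatch]:
    """Return one record per ddf1 CRC message found in log_text."""
    return [
        CrcMismatch(m["found"], m["expected"], m["device"])
        for m in CRC_RE.finditer(log_text)
    ]


log = (
    "ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sda\n"
    "ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sdb\n"
)
print([m.device for m in find_crc_mismatches(log)])  # prints ['/dev/sda', '/dev/sdb']
```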
Sorry, last update. Fixed title.
Closing, because we can't see a solution from the DDF1 spec.