Description of problem:
After applying all Update 5 packages, including the kernel, to EL4, LVM breaks on reboot, as does raw ext3 on bare-metal partitions. (Data on the first SATA drive is accessible and stable; the second SATA drive is not. Note, however, that one out of our 6 machines corrupted its primary sda device, so this bug can apparently affect both devices.)

At first glance this appears to be an LVM problem; however, LVM was eliminated as the cause by reformatting the /dev/sdb device with a standard ext3 partition. mkfs.ext3 works fine, but the following error occurs when attempting to mount the device:

[root@grid63-11x ~]# mount /dev/sdb1 /local_disk
mount: /dev/sdb1 already mounted or /local_disk busy

Definitely an issue on the stock updated kernel 2.6.9-55.ELsmp x86_64.

Version-Release number of selected component (if applicable):
EL4 Update 5

How reproducible:
Every time, on a subset of 4 machines with identical hardware configuration.

Steps to Reproduce:
1. Re-kickstart
2. Apply the Update 5 patches, including the kernel
3. Reboot
4. Mounting fails on boot

Actual results:
Everything works fine until booted into the new kernel. (Key detail: rebooting on the old kernel multiple times works fine BEFORE ever booting into the -55.ELsmp kernel.) When the new (-55.ELsmp) kernel boots, if LVM is used, VolGroup01 is unable to initialize its logical volumes, resulting in an fsck failure and a prompt for the root password. With direct ext3, the mount fails but the boot continues on to subsequent runlevels. After this point, the LogVolXX devices are reported as zero size, and with direct ext3 on the device, mount says the device is busy. Rebooting into the older kernel does not fix the issue either: the new kernel appears to damage the partition on the device in some way that prevents it from mounting. Destroying the disk, re-running mkfs, and/or setting up LVM fixes the problem, as long as the -55.ELsmp kernel is not booted.

Expected results:
Boot with the sdb1 device mounted.

Additional info:
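The "already mounted or /local_disk busy" error, when the device is not actually in the mount table, suggests something else is holding the block device (as later comments confirm: a dmraid device-mapper mapping). A minimal sketch of the first check to make; the `busy_reason` helper and the sample mounts table are hypothetical illustrations, not part of any RHEL tool:

```shell
# Hypothetical helper: given a mounts table in /proc/mounts format on stdin,
# report whether a mount of $1 would fail because it is already mounted.
busy_reason() {
  dev=$1
  if grep -q "^$dev " ; then
    echo "already-mounted"
  else
    echo "not-in-mount-table"   # busy for another reason, e.g. device-mapper
  fi
}

# Sample table: sdb1 is absent, yet a device-mapper node sits on /local_disk,
# so the "busy" error must come from something holding the underlying disk.
sample='/dev/sda1 / ext3 rw 0 0
/dev/mapper/ddf1_array1 /local_disk ext3 rw 0 0'

reason=$(printf '%s\n' "$sample" | busy_reason /dev/sdb1)
echo "$reason"

# On the affected machine itself, the real checks would be:
#   grep sdb /proc/mounts      # is it actually mounted?
#   dmsetup ls                 # has device-mapper (e.g. dmraid) claimed the disk?
#   fuser -vm /local_disk      # is a process holding the mountpoint?
```

If `dmsetup ls` shows a ddf1_* mapping over sdb, that mapping is what makes the raw partition busy.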
Yikes! Is there anything in /var/log/messages that would steer us in the right direction? What was the last known kernel that was OK? If it's not too much trouble, we might want to binary-search the kernels to find the one that introduced this problem. Thanks.
There's no smoking gun that would point to the exact culprit. I am going to take one of my grid nodes offline and do some simple trial and error to find the exact [set of] package[s] causing the issue. I will be doing this over the next two days and will post results ASAP.
Created attachment 158438 [details]
bootlog with working dmraid and LVM

This bootlog contains the kernel messages on boot with the working dmraid.
Created attachment 158439 [details]
bootlog with broken dmraid and LVM

Bootlog with the broken dmraid installed on the machine. Note the CRC errors.
I've nailed the problem down to the new dmraid-1.0.0.rc14-5_RHEL4_U5 package; downgrading to dmraid-1.0.0.rc11-3_RHEL4_U4 resolves the issue. I've attached the bootlog messages from the serial console. sdiff points out the major differences; specifically, note the CRC errors on the devices. This seems to me to be some kind of hardware/dmraid interaction issue. We are running a 74GB drive and a 300GB drive. The Adaptec HostRAID controller (AIC-8130: Marvell 88SX6041) does NOT have array devices configured, so we are running in direct JBOD mode. The specific motherboard is a SuperMicro H8DAR-T. Specs here: http://www.supermicro.com/Aplus/motherboard/Opteron/8132/H8DAR-T.cfm Please let me know if you need any more info. Thanks!
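A sketch of the version check and downgrade described above. The `installed` value is hard-coded here as a stand-in for the real `rpm -q` query, and the package file name is taken from this report; `rpm --oldpackage` is the standard way to install an older version over a newer one:

```shell
# Known-good and suspect dmraid builds, as identified in this bug.
good="1.0.0.rc11-3_RHEL4_U4"
installed="1.0.0.rc14-5_RHEL4_U5"  # stand-in for: rpm -q --qf '%{VERSION}-%{RELEASE}\n' dmraid

if [ "$installed" = "$good" ]; then
  status="known-good"
else
  status="suspect"
fi
echo "dmraid $installed: $status"

# To downgrade (rpm refuses to install an older version without --oldpackage):
# rpm -Uvh --oldpackage dmraid-1.0.0.rc11-3_RHEL4_U4.x86_64.rpm
```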
[append] Here are the specific CRC errors:

ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sda
ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sdb
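For context on what that message means: the DDF1 on-disk metadata stores a CRC alongside the header, and dmraid recomputes the checksum over the bytes it reads and compares the two; a mismatch means the metadata no longer matches the checksum written with it. A toy illustration of that compare step only (`cksum`'s CRC is NOT the DDF1 algorithm, and the file contents are made up):

```shell
# Simulate metadata whose bytes change after its CRC was recorded.
f=$(mktemp)

printf 'ddf1 anchor header, version A' > "$f"
stored=$(cksum "$f" | awk '{print $1}')        # CRC recorded with the metadata

printf 'ddf1 anchor header, version B' > "$f"  # bytes later differ on disk
computed=$(cksum "$f" | awk '{print $1}')      # CRC recomputed at read time

[ "$computed" != "$stored" ] && echo "CRC mismatch: got $computed, expected $stored"
rm -f "$f"
```

Notably, both drives above report the same computed CRC against the same expected value, which points at how the new dmraid reads/interprets the metadata rather than at random on-disk corruption.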
Sorry, one last update: fixed the title.
Closing, because we can't see a solution from the DDF1 spec.