Description of problem:
After applying all Update 5 packages, including the kernel, to EL4, LVM breaks on reboot, as does raw ext3 on bare-metal partitions. (Data on the first SATA drive is accessible and stable; the second SATA drive is not. Note, however, that one out of our 6 machines corrupted its primary sda device, so this bug can apparently affect both devices.)

At first glance this appears to be an LVM problem; however, LVM was eliminated as the cause by reformatting the /dev/sdb device with a standard ext3 partition. mkfs.ext3 works fine, but the following error occurs when attempting to mount the device:

[root@grid63-11x ~]# mount /dev/sdb1 /local_disk
mount: /dev/sdb1 already mounted or /local_disk busy

Definitely an issue on the stock updated kernel 2.6.9-55.ELsmp x86_64.

Version-Release number of selected component (if applicable):
EL4 Update 5

How reproducible:
Every time, on a subset of 4 machines with identical hardware configuration.

Steps to Reproduce:
1. Re-kickstart
2. Apply the Update 5 patches, including the kernel
3. Reboot
4. Mounting fails on boot

Actual results:
Everything works fine until booted into the new kernel. (Key detail: rebooting on the old kernel multiple times works fine BEFORE ever booting into the -55.ELsmp kernel.) When the new (-55.ELsmp) kernel boots, if LVM is used, VolGroup01 is unable to initialize its logical volumes, resulting in an fsck failure and a prompt for the root password. With direct ext3, the mount fails but the boot continues on to subsequent runlevels. After this point, the LogVolXX devices are reported as zero size, and with direct ext3 on the device, mount says the device is busy. Rebooting into the older kernel does not fix the issue either: the new kernel appears to damage the partition on the device in some way that prevents it from mounting. Destroying the disk, re-running mkfs, and/or setting up LVM fixes the problem, as long as the -55.ELsmp kernel is not booted.

Expected results:
Boot with the sdb1 device mounted.

Additional info:
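The "already mounted or /local_disk busy" error, when the device is not actually in the mount table, suggests something else is holding the block device (as later comments confirm: a dmraid device-mapper mapping). A minimal sketch of the first check to make; the `busy_reason` helper and the sample mounts table are hypothetical illustrations, not part of any RHEL tool:

```shell
# Hypothetical helper: given a mounts table in /proc/mounts format on stdin,
# report whether a mount of $1 would fail because it is already mounted.
busy_reason() {
  dev=$1
  if grep -q "^$dev " ; then
    echo "already-mounted"
  else
    echo "not-in-mount-table"   # busy for another reason, e.g. device-mapper
  fi
}

# Sample table: sdb1 is absent, yet a device-mapper node sits on /local_disk,
# so the "busy" error must come from something holding the underlying disk.
sample='/dev/sda1 / ext3 rw 0 0
/dev/mapper/ddf1_array1 /local_disk ext3 rw 0 0'

reason=$(printf '%s\n' "$sample" | busy_reason /dev/sdb1)
echo "$reason"

# On the affected machine itself, the real checks would be:
#   grep sdb /proc/mounts      # is it actually mounted?
#   dmsetup ls                 # has device-mapper (e.g. dmraid) claimed the disk?
#   fuser -vm /local_disk      # is a process holding the mountpoint?
```

If `dmsetup ls` shows a ddf1_* mapping over sdb, that mapping is what makes the raw partition busy.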
Yikes! Is there anything in /var/log/messages that would steer us in the right direction? What was the last known kernel that was OK? If it's not too much trouble, we might want to binary-search the kernels to find the one that introduced this problem. Thanks.
There's no smoking gun that would point to the exact culprit. I am going to take one of my grid nodes offline and do some simple trial and error to find the exact [set of] package[s] causing the issue. I will be doing this over the next two days and will post results ASAP.
Created attachment 158438 [details]
bootlog with working dmraid and LVM

This bootlog contains the kernel messages on boot with the working dmraid.
Created attachment 158439 [details]
bootlog with broken dmraid and LVM

Bootlog with the broken dmraid installed on the machine. Note the CRC errors.
I've nailed the problem down to the new dmraid-1.0.0.rc14-5_RHEL4_U5 package; downgrading to dmraid-1.0.0.rc11-3_RHEL4_U4 resolves the issue. I've attached the bootlog messages from the serial console. sdiff points out the major differences; specifically, note the CRC errors on the devices. This seems to me to be some kind of hardware/dmraid interaction issue. We are running a 74GB drive and a 300GB drive. The Adaptec HostRAID controller (AIC-8130: Marvell 88SX6041) does NOT have array devices configured, so we are running in direct JBOD mode. The specific motherboard is a SuperMicro H8DAR-T. Specs here: http://www.supermicro.com/Aplus/motherboard/Opteron/8132/H8DAR-T.cfm Please let me know if you need any more info. Thanks!
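A sketch of the version check and downgrade described above. The `installed` value is hard-coded here as a stand-in for the real `rpm -q` query, and the package file name is taken from this report; `rpm --oldpackage` is the standard way to install an older version over a newer one:

```shell
# Known-good and suspect dmraid builds, as identified in this bug.
good="1.0.0.rc11-3_RHEL4_U4"
installed="1.0.0.rc14-5_RHEL4_U5"  # stand-in for: rpm -q --qf '%{VERSION}-%{RELEASE}\n' dmraid

if [ "$installed" = "$good" ]; then
  status="known-good"
else
  status="suspect"
fi
echo "dmraid $installed: $status"

# To downgrade (rpm refuses to install an older version without --oldpackage):
# rpm -Uvh --oldpackage dmraid-1.0.0.rc11-3_RHEL4_U4.x86_64.rpm
```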
[append] Here are the specific CRC errors:

ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sda
ddf1: physical drives with CRC 9513BC8E, expected 8C26D753 on /dev/sdb
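For context on what that message means: the DDF1 on-disk metadata stores a CRC alongside the header, and dmraid recomputes the checksum over the bytes it reads and compares the two; a mismatch means the metadata no longer matches the checksum written with it. A toy illustration of that compare step only (`cksum`'s CRC is NOT the DDF1 algorithm, and the file contents are made up):

```shell
# Simulate metadata whose bytes change after its CRC was recorded.
f=$(mktemp)

printf 'ddf1 anchor header, version A' > "$f"
stored=$(cksum "$f" | awk '{print $1}')        # CRC recorded with the metadata

printf 'ddf1 anchor header, version B' > "$f"  # bytes later differ on disk
computed=$(cksum "$f" | awk '{print $1}')      # CRC recomputed at read time

[ "$computed" != "$stored" ] && echo "CRC mismatch: got $computed, expected $stored"
rm -f "$f"
```

Notably, both drives above report the same computed CRC against the same expected value, which points at how the new dmraid reads/interprets the metadata rather than at random on-disk corruption.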
Sorry, one last update: fixed the title.
Closing, because we can't see a solution from the DDF1 spec.