Bug 514124

Summary: F11 mdadm caused array elements to be kicked out of existing arrays on system shutdown
Product: Fedora    Reporter: ed leaver <eleaver>
Component: mdadm    Assignee: Doug Ledford <dledford>
Status: CLOSED NOTABUG    QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 11    CC: dledford
Hardware: x86_64
OS: Linux
Last Closed: 2009-09-15 19:42:47 UTC
Attachments:
  /var/log/boot.log and /var/log/messages from F11 session that apparently broke the raid.
  /var/log/messages from subsequent CentOS session that recovered the broken raid.

Description ed leaver 2009-07-28 05:30:38 UTC
Description of problem:
I dual-boot an AMD-64 system with CentOS-5 as the primary production environment and Fedora as a test environment. The primary disks are two Seagate SATA 320GB drives in software RAID (md) raid1:
md0  BootMD   /boot
md1  FedoraMD /
md2  CentOSMD /
md3  swap
md4  HomeMD   CentOS /home (will be shared with Fedora once Fedora becomes stable)
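
For reference, this layout can be confirmed from a running system with mdadm's scan mode; a minimal sketch (run as root; the output format varies with the mdadm version, and the UUIDs it prints are system-specific):

cat /proc/mdstat             # quick view of which partitions back each mdN
/sbin/mdadm --detail --scan  # one ARRAY line per device, suitable for /etc/mdadm.conf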

I also have a secondary 80GB Samsung disk used for OS testing, on which I have also installed F11. It is a simple SATA disk, no RAID. The problem(s) appear with either F11 installation.

1. Upon encountering /dev/md3 at boot, F11 issues an "unrecognized partition table" warning and kicks one of the swap partitions, usually /dev/sda5, out of the array.

2. Upon system shutdown F11 frequently-but-not-always kicks a drive out of *every* raid1 partition -- even those that are not mounted or do not appear in /etc/fstab. It may kick either of the two drives out of an array. The raid1 partitions can be recovered by re-adding the removed drive partitions, e.g.

$> /sbin/mdadm -a /dev/md0 /dev/sda1

from CentOS. But it takes a few hours to restore all partitions. I think the above re-add also works from F11.
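
For completeness, a hedged sketch of that recovery pass across all five arrays. Only the md0/sda1 and md3/sda5 pairings are taken from this report; the others are guesses from the layout above, and the kicked member may just as well be an sdbN partition, so confirm the missing member of each array with mdadm -D first:

/sbin/mdadm -a /dev/md0 /dev/sda1    # /boot mirror (pairing taken from the example above)
/sbin/mdadm -a /dev/md1 /dev/sda2    # Fedora /     (pairing assumed)
/sbin/mdadm -a /dev/md2 /dev/sda3    # CentOS /     (pairing assumed)
/sbin/mdadm -a /dev/md3 /dev/sda5    # swap         (pairing taken from the description)
/sbin/mdadm -a /dev/md4 /dev/sda6    # /home        (pairing assumed)
watch cat /proc/mdstat               # mirrors resync one after another, hence the hours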

Version-Release number of selected component (if applicable):

CentOS and F11 are both up to date as of this posting (7/27/09).

How reproducible:

Most of the time on system shutdown from either F11 installation. But sometimes a clean shutdown will leave the raid arrays intact.

Steps to Reproduce:
1. Assemble / add the arrays under CentOS-5.3
2. mdadm -D /dev/md[0-4] shows the arrays clean
3. reboot into F11
4. reboot into CentOS (or hardware-reset and boot into CentOS)
5. mdadm -D /dev/md[0-4] shows the arrays degraded (see the check sketched below).
6. mdadm -a /dev/md[0-4] /dev/sd[ab]n (n=1,2,3,5,6)
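
The check behind steps 2 and 5 looks roughly like this (device names as above; in /proc/mdstat, [UU] means both mirrors are present, while [U_] or [_U] means one member has been kicked out):

cat /proc/mdstat
/sbin/mdadm -D /dev/md0 | grep -E 'State :|Active Devices|Failed Devices'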
 
Actual results:
as described

Expected results:
F11 should not degrade my raids! It should not do anything to partitions that are not mounted!

Additional info:

These symptoms are similar to those reported in (now closed) bug 496186, but anaconda/kickstart are not involved. My system runs an Opteron 175 on an Abit AT8 motherboard with a ULi 1575 southbridge, a PowerColor ATI X700 graphics card, and 4GB ECC RAM. Today I have been issuing xrandr commands in two F11 sessions, in a thus-far futile attempt to get the desktop to span dual monitors. The md bug has manifested upon exit from each installation: the non-raid installation was hardware-reset when X became unusable and couldn't be reset; the md installation shut down cleanly.

fsck reports no errors on the raid disks, nor does smartd. 
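
For reference, the kind of per-disk checks referred to here, as a hedged sketch (smartctl comes from smartmontools and may not be installed by default; run as root, and fsck only unmounted filesystems or use the read-only -n flag as shown):

smartctl -H /dev/sda        # overall SMART health verdict for the first disk
smartctl -l error /dev/sda  # SMART error log
fsck -n /dev/md0            # read-only filesystem check of the /boot array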

The problem also manifested under F10, from its release until an update in February 2009. 

There is a lot in F11 I'd like to help test in addition to RadeonHD, but this md/raid problem is a real show-stopper. Thanks!

Comment 1 ed leaver 2009-07-28 05:51:03 UTC
The mdadm problem occurs under both fc11 kernels 2.6.29.4-167 and 2.6.29.6-213, but it occurred under fc10 as well. I installed a similar raid setup on my mother's PC running Fedora 9; that one never had any problems.

Comment 2 ed leaver 2009-07-28 17:38:53 UTC
Created attachment 355449 [details]
/var/log/boot.log and /var/log/messages from F11 session that apparently broke the raid.

I forgot to include /Fedora/var/log/boot.log and /Fedora/var/log/messages; the latter does report (many) disk errors on shutdown. Comparing with /CentOS/var/log/messages from the immediately following CentOS session, we see the disk partitions were actually kicked out of the raids by CentOS, although I suspect Fedora would have kicked them out on reboot as well. Is it possible there is a problem with the disks that CentOS fsck does not report? What other diagnostics can I run?

Comment 3 ed leaver 2009-07-28 17:41:42 UTC
Created attachment 355450 [details]
/var/log/messages from subsequent CentOS session that recovered the broken raid.

Made this a separate attachment for clarity.

Comment 4 ed leaver 2009-09-04 08:38:43 UTC
The snd_hda_intel driver appears to write outside its allocated memory.

I ran Palimpsest after one of the SATA disks had been kicked out of the raid. Palimpsest couldn't tell me much, not even whether the disk was SMART-enabled.

At first it looked like a SATA driver problem. Oops.

The problem moved (natch) after I updated to the 2.6.29.6-217.2.16.fc11.x86_64 kernel. Now the system locks up during boot (at "Starting udev") if nomodeset is given, but finishes booting to runlevel 3 if it isn't.

But it then locks up upon startx.

These troubles go away if snd_hda_intel is blacklisted in /etc/modprobe.d/blacklist.conf. 
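
The blacklist entry itself is one line; a minimal sketch of the file mentioned above:

# /etc/modprobe.d/blacklist.conf (excerpt) -- keep the HDA audio driver from loading
blacklist snd_hda_intel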

The ULi 1575 SB has a ULi M5288 SATA controller and Realtek 883D sound chip, for which snd_hda_intel is the correct driver.
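
A hedged way to confirm which audio device is present and whether the module is loaded (output varies by system):

lspci -nn | grep -i audio     # should list the HDA controller that drives the Realtek codec
lsmod | grep snd_hda_intel    # shows whether the driver is currently loaded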

But snd_hda_intel appears to write outside its allocated memory.

There is a similar -- perhaps duplicate -- bug against Rawhide: #521004.

Thanks.

Comment 5 Doug Ledford 2009-09-15 19:42:47 UTC
Since this is not an mdadm or md raid issue as originally thought, I'm closing this bug out.