Bug 645293 - kernel does not recognize the partitions on an mdraid array (Intel BIOS RAID)
Summary: kernel does not recognize the partitions on an mdraid array (Intel BIOS RAID)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker https://fedoraproject...
Depends On:
Blocks:
 
Reported: 2010-10-21 09:07 UTC by Hans de Goede
Modified: 2011-04-24 03:35 UTC (History)
14 users (show)

Fixed In Version: kernel-2.6.35.12-88.fc14
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-24 03:35:48 UTC
Type: ---
Embargoed:


Attachments
dmesg (55.09 KB, text/plain)
2010-10-21 09:07 UTC, Hans de Goede

Description Hans de Goede 2010-10-21 09:07:16 UTC
Description of problem:
- Take a system with an Intel BIOS RAID array (a 2-disk striped set in my tests)
- This system runs fine with RHEL-6
- Boot the F14 live CD with an overlay which includes a fix for bug 645283, so
  that the RAID set gets activated automatically on boot
- Notice there are no /dev/md126p? files; check dmesg and find:

[   19.621662] md126: detected capacity change from 0 to 160047300608
[   19.621673]  md126:
[   19.621686] Buffer I/O error on device md126, logical block 0
[   19.621691] Buffer I/O error on device md126, logical block 0
[   19.621695] Buffer I/O error on device md126, logical block 0
[   19.621700] Buffer I/O error on device md126, logical block 0
[   19.621704] Buffer I/O error on device md126, logical block 0
[   19.621708] Buffer I/O error on device md126, logical block 0
[   19.621711] Buffer I/O error on device md126, logical block 0
[   19.621714] Dev md126: unable to read RDB block 0
[   19.621717] Buffer I/O error on device md126, logical block 0
[   19.621721]  unable to read partition table
[   19.661531] md: raid0 personality registered for level 0
[   19.661656] md/raid0:md126: looking at sda
[   19.661660] md/raid0:md126:   comparing sda(156296448) with sda(156296448)
[   19.661663] md/raid0:md126:   END
[   19.661664] md/raid0:md126:   ==> UNIQUE
[   19.661666] md/raid0:md126: 1 zones
[   19.661668] md/raid0:md126: looking at sdb
[   19.661670] md/raid0:md126:   comparing sdb(156296448) with sda(156296448)
[   19.661672] md/raid0:md126:   EQUAL
[   19.661674] md/raid0:md126: FINAL 1 zones
[   19.661679] md/raid0:md126: done.
[   19.661682] md/raid0:md126: md_size is 312592384 sectors.
[   19.661683] ******* md126 configuration *********
[   19.661685] zone0=[sda/sdb/]
[   19.661688]         zone offset=0kb device offset=0kb size=156296448kb
[   19.661690] **********************************
[   19.661691] 

Notice how:
1) the partition table reading fails
2) the partition table reading seems to happen before mdraid has
   examined the member disks, which seems like the wrong order to me

This could be specific to my system, but since I'm not sure, I'm proposing this as an F14Blocker for now.

Comment 1 Hans de Goede 2010-10-21 09:07:43 UTC
Created attachment 454753 [details]
dmesg

Comment 2 Adam Williamson 2010-10-21 19:12:30 UTC
I have this issue too, but the system also has an installed copy of F14, which works fine. I don't think this is a blocker issue, given my experience.




Comment 3 Adam Williamson 2010-10-21 19:16:21 UTC
Note that red_alert (Sandro Mathys) has, we think, a similar issue, and he tested that it is possible to install to such an array both from the live and the regular installer, so this is solely to do with the live environment not setting the array up properly; I don't think that's really a huge problem.




Comment 4 Sandro Mathys 2010-10-21 19:31:30 UTC
That's right, I see the exact same behaviour with the live CD (not using the mentioned fix, though). But once anaconda examines devices, the partitions are activated as well, and the installation went smoothly. I can boot both the pre-installed Win7 and the newly installed F14 just fine, so the array was assembled correctly.

IMHO this is really no blocker; if anything, it's NTH.

Comment 5 James Laska 2010-10-21 19:41:26 UTC
Given my understanding of the issue, I vote for fixing this in rawhide/F15 and noting this on Common_F14_Bugs.  While it's not ideal that the live image doesn't properly assemble the raidset on boot, it sounds like we have several results that show running liveinst does assemble and install to the bios raid set without failure.

Comment 6 Hans de Goede 2010-10-22 07:28:34 UTC
(In reply to comment #5)
> Given my understanding of the issue, I vote for fixing this in rawhide/F15 and
> noting this on Common_F14_Bugs.  While it's not ideal that the live image
> doesn't properly assemble the raidset on boot, it sounds like we have several
> results that show running liveinst does assemble and install to the bios raid
> set without failure.

Assemble, yes, but the partitions on it are not accessible, which is not nice. Still, having it assembled avoids any possible direct use of the member disks, which is what worried me most.

So I agree with CommonBug-ing this.

Comment 7 James Laska 2010-10-22 16:17:33 UTC
Thanks for the feedback, Hans. This issue was discussed at the 2010-10-22 F-14-Final blocker review meeting, where the group determined that, based on previous comments, this issue does not meet the blocker or nice-to-have criteria.

I'm adding CommonBugs keyword so we can document this issue to alert users.

Comment 8 Nick Wiltshire 2010-11-06 00:36:46 UTC
Hi guys,

I have experienced the same problem as discussed, although my experience is slightly different. I have 2 Intel BIOS RAID arrays on an x58a-ud3r motherboard: 3 disks in a RAID 5 for the OS (Win7 and Fedora 14), and 2 disks in a RAID 1 for shared data (formatted NTFS). I can confirm that booting the live CD results in the RAID arrays being detected but no partitions being shown. I can also confirm that Anaconda correctly detects both RAID arrays and allowed me to install on the RAID 5 with the choice to mount the RAID 1. So at this point I have successfully installed on the RAID 5, and the Windows NTFS partition, also on the RAID 5, shows up correctly.

My problem is that the RAID 1 array is detected as /dev/md125, but, the same as on the live CD, it shows no partitions, so I cannot mount it. I should also note that it is only detected 50% of the time; other times it does not show up at all, even though the disks themselves show as healthy in Disk Utility. So I believe this to be a substantial problem, as anyone using more than one RAID array without installing on it will not be able to mount their drives.

Additionally, I upgraded from Fedora 13. Upon reboot, install.img was not found, as the installer could no longer mount the arrays. At that point I pointed to an install image on the net, and once in Anaconda I was able to continue the install successfully. Both RAID arrays mounted properly in Fedora 13, Ubuntu 10.10 and Windows 7.

Any feedback would be much appreciated, as without my data drives using Fedora 14 seems rather pointless. I am happy to provide any logs or other information, though with anything too technical I may need assistance. I hope to hear from someone soon and to see this problem resolved.

Thanks,

Nick.

Comment 9 Ziemowit Pierzycki 2010-11-15 00:16:25 UTC
Hi,

I have a different problem, but it sounds related. I have a system with Intel ICH10R based RAID01. It was running F12 just fine for more than a year.

Today I decided to upgrade to F14 by installing on top of the already existing RAID and LVM setup. After the new installation, the system restarted and everything was fine. I turned off SELinux, rebooted, and the BIOS RAID marked a few drives as offline. Then the system got stuck at the screen just prior to booting the kernel, as if the RAID was broken.

When I booted the F14 DVD into rescue mode, the RAID was found but in a degraded state. I was able to back up all my files to another drive, so at least I didn't lose anything.

It seems the installer works fine, but as soon as the newly installed system comes up and reboots, something happens that corrupts the RAID. I was able to repeat the problem three times, and I did not see a single error message.

Comment 10 Kevin Paetzold 2010-11-17 21:29:34 UTC
I am also having problems getting my software RAID recognized reliably with F14, with most of the same symptoms described by the other posters. This is the same RAID array that I use successfully with F13 (and previously used with F12 and F11 and ....). I finally went as far as the following (a rough command sketch follows the list):

- deleted the RAID
- zeroed the superblocks via mdadm
- wrote zeroes over the partitions via dd
- recreated the RAID using F14 (mdadm)
- formatted it as ext4 and successfully used the RAID
- rebooted F14, and then the RAID was not recognized
- rebooted immediately back to F13 and successfully used the RAID recreated above with F14
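
Roughly, in command form (only a sketch of what I did; the device names, md node, RAID level and dd count here are placeholders, not my exact values):

# stop and delete the existing array
mdadm --stop /dev/md0
# zero the mdadm superblocks on the former member partitions
mdadm --zero-superblock /dev/sdX1 /dev/sdY1
# write zeroes over the start of each partition
dd if=/dev/zero of=/dev/sdX1 bs=1M count=100
dd if=/dev/zero of=/dev/sdY1 bs=1M count=100
# recreate the array with the F14 mdadm and format it as ext4
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
mkfs.ext4 /dev/md0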

At one point I seemed to be able to make this issue go away by making my own initrd without dmraid.  In subsequent attempts (after a reinstall of F14) I was not able to get things to work by removing dmraid (so I am not 100% sure on this).

Comment 11 Ziemowit Pierzycki 2010-11-17 22:17:56 UTC
I re-created the RAID, installed the system with F14, and right away upgraded everything.  Five reboots later the system is still running... strange.  Maybe the upgrade fixed something?

Comment 12 Adam Williamson 2010-11-17 23:23:17 UTC
Ziemowit and Kevin do not have the bug reported here, by my reading. This is specific to the case of the live CD constructing the array incorrectly while Anaconda and the installed system construct it correctly. Please file new bugs for your cases (if you wish to). Thanks.




Comment 13 Nick Wiltshire 2010-11-18 01:55:55 UTC
Adam, I assume that your excluding them from this bug topic means you agree with my post about the partitions on my RAID 1 array not being detected correctly. That being the case, are there any further developments on this bug? Is it actively being investigated, or is it not likely to be resolved soon?

Any help would be great.

Comment 14 Adam Williamson 2010-11-18 04:37:11 UTC
I don't really know. I'm just triaging. Hans is interested but I don't think he's 'officially' in charge of the RAID stuff any more.




Comment 15 Hans de Goede 2010-11-18 07:32:01 UTC
(In reply to comment #14)
> I don't really know. I'm just triaging. Hans is interested but I don't think
> he's 'officially' in charge of the RAID stuff any more.
> 

Besides my no longer being 'officially' in charge of the RAID stuff, this is a kernel problem, and I never did RAID work at the kernel level.

The correct person to look into this bug is Doug Ledford.

Regards,

Hans

Comment 16 Nick Wiltshire 2010-11-18 09:57:35 UTC
OK, thanks for the feedback, Hans. I have forwarded the details to Doug, hopefully to the correct address. If you're in contact with him, it would be great if you could check that he is aware of the issue.

Thanks again!

Nick.

Comment 17 Doug Ledford 2010-11-18 16:12:06 UTC
I'll be back around my test setup Monday and will take a look at this then.

Comment 18 Kevin Paetzold 2010-11-19 12:54:36 UTC
I entered 654864 (as suggested) to describe the issue I am encountering.
https://bugzilla.redhat.com/show_bug.cgi?id=654864

Comment 19 Adam Williamson 2010-11-22 18:26:07 UTC
Assigning, then.




Comment 20 Nick Wiltshire 2010-11-29 22:30:54 UTC
Hi Doug,

Have you had a chance to look at this as yet?  Any progress?  Let me know if you need to see any logs or need any more info from my system.

Thanks for your help,

Nick.

Comment 21 Eduardo 2010-12-24 05:43:46 UTC
Hello,

I want to report that my system is also experiencing the same problem as Nick's. I just upgraded from F13 to F14 using preupgrade. I have an Intel BIOS RAID 1 that gets assembled correctly (/proc/mdstat shows the container and the RAID correctly), but accessing the partition table fails. These are the versions I have:
kernel-2.6.35.10-72.fc14.x86_64
dracut-006-5.fc14.noarch
mdadm-3.1.3-0.git20100804.2.fc14.x86_64

Preupgrade left my old F13 kernels behind, so I tried the latest one I had and the RAID worked fine (kernel-2.6.34.7-63.fc13.x86_64). Please let me know if you know of any potential issues with running this kernel on an F14 system.

I also tested using dmraid on kernel-2.6.35.10-72.fc14.x86_64 and the RAID partitions worked fine. I followed these steps to enable it manually (a consolidated sketch follows the list):
1. Stop the lvm group: vgchange -a n VG_DATA
2. Stop mdadm raid: mdadm -S --scan
3. Activate dmraid: dmraid -a y
4. Activate lvm group: vgchange -a y
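
Consolidated, this is just the four steps above in one place (VG_DATA is my volume group name, adjust for yours):

vgchange -a n VG_DATA   # deactivate the LVM volume group sitting on the array
mdadm -S --scan         # stop all mdadm-assembled arrays
dmraid -a y             # activate the BIOS RAID set via dmraid instead
vgchange -a y           # reactivate the volume group on the dmraid device
ls /dev/mapper/         # the dmraid set (isw_* for Intel, I believe) and its partitions should appear here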

This bug seems to have been prioritized based on the idea that it only happens with live media, but it also happens on installed systems. Should we file a different bug for this instead?

Also, I think the Common F14 Bugs page should be updated. Currently it states that the bug only happens when booting F14 live media.

Please let me know if you need any other information or logs, or if there is another kernel or other package version that you would like me to test.

Thank you

Comment 22 Eduardo 2011-02-12 05:21:27 UTC
Just to keep track of this issue: I tried kernel-2.6.35.11-83.fc14.x86_64 and the problem still persists.

Comment 23 Eduardo 2011-02-12 14:34:47 UTC
With this new kernel, partx is able to read the partitions on the RAID and create the devices for them. I had tried partx with a previous kernel and it had not worked. There are still errors during boot, but I can bring the RAID partitions up manually. Thank you.

Comment 24 Eduardo 2011-02-13 20:36:52 UTC
I think this might be relevant. I noticed that in the initialization scripts kpartx is used instead of partx. kpartx fails to create the devices when I invoked it as follows:
#kpartx -a -p p -v /dev/md126

The output of the command is:
add map md126p1 (253:10): 0 524288000 linear /dev/md126 2048
add map md126p2 (253:11): 0 2405982017 linear /dev/md126 524290048

It shows the partitions I expect but the devices are not created.

For now I made a small change to rc.sysinit to get my RAID working with the latest kernel. The initial buffer error still happens, but at least it recovers and creates the devices for the partitions before enabling LVM:

--- /etc/rc.d/rc.sysinit.bak    2010-12-24 10:51:17.408970720 -0500
+++ /etc/rc.d/rc.sysinit        2011-02-13 15:11:26.837371007 -0500
@@ -196,6 +196,15 @@
 # Start any MD RAID arrays that haven't been started yet
 [ -r /proc/mdstat -a -r /dev/md/md-device-map ] && /sbin/mdadm -IRs
 
+# Create device maps from md* partition tables that have not been initialized
+if [ -r /proc/mdstat ]; then
+       for mdname in $(grep ' active' /proc/mdstat | awk '{ print $1 }'); do
+               if [ -e /dev/${mdname} -a ! -e /dev/${mdname}p1 ]; then
+                       /sbin/partx -a "/dev/${mdname}" 
+               fi
+       done
+fi
+
 if [ -x /sbin/lvm ]; then
        action $"Setting up Logical Volume Management:" /sbin/lvm vgchange -a y --sysinit
 fi
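
For a one-off test, the same thing can be done by hand after boot, which is effectively what the rc.sysinit change above does (md126 being the array from my logs):

partx -a /dev/md126     # ask the kernel to add the partitions it can read from md126
ls /dev/md126p*         # md126p1, md126p2, ... should now exist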

Comment 25 Hans de Goede 2011-02-14 08:55:55 UTC
(In reply to comment #24)
> I think this might be relevant. I noticed that on the initialization scripts
> kpartx is used instead of partx. kpartx fails to create the devices when I
> invoked it as follows:
> #kpartx -a -p p -v /dev/md126
> 
> The output of the command is:
> add map md126p1 (253:10): 0 524288000 linear /dev/md126 2048
> add map md126p2 (253:11): 0 2405982017 linear /dev/md126 524290048
> 
> It shows the partitions I expect but the devices are not created.
> 
> For now I made a small change to rc.sysinit to get my raid working with the
> latest kernel.

This is not the right solution. kpartx is only needed / used for partitions on top of dmraid devices; for mdraid, the in-kernel partitioning support should be used. You are using the in-kernel support with the partx call, but there should be no need to call it: the kernel should find the partitions automatically, like it does on regular disks.

It would be interesting to try this with newer kernels, like:
http://koji.fedoraproject.org/koji/buildinfo?buildID=213137

Comment 26 Nick Wiltshire 2011-02-14 09:35:28 UTC
Hi All,

I couldn't help but notice this has had some attention of late. My original issue was that my RAID 5 (OS array) was detected and auto-mounted correctly, but my RAID 1 data array was not detected and hence did not auto-mount. I overcame this a little while ago with the following steps (a consolidated run-as-root form is sketched after my notes below), and I have had two kernel updates since; both RAID devices have continued to work correctly since making these changes.

Step 1.
sudo mv /etc/mdadm.conf /etc/mdadm.conf.orig

Step 2.
sudo mdadm --detail --scan > /etc/mdadm.conf

Step 3.
sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).old.img

Step 4.
sudo dracut --mdadmconf /boot/initramfs-$(uname -r).img $(uname -r)

Step 5.
Restart computer and hope it works.

During testing of this I actually skipped step 3 and made a different temp.img in step 4, then created a boot entry in menu.lst that pointed to it. That way my working initramfs.img and boot entry remained unchanged. Once I was convinced the process worked, I completed steps 1 through 5, and my RAID has been working properly since.
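
Restated as a single run-as-root sequence (just my steps above consolidated; note that with plain sudo the redirect in step 2 runs in your own shell, so you need a root shell, or tee, for it to be able to write /etc/mdadm.conf):

mv /etc/mdadm.conf /etc/mdadm.conf.orig
mdadm --detail --scan > /etc/mdadm.conf          # regenerate the config from the currently assembled arrays
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).old.img
dracut --mdadmconf /boot/initramfs-$(uname -r).img $(uname -r)   # rebuild the initramfs with the mdadm config included
# then reboot and check that both arrays and their partitions come up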

I agree with Hans that the kernel should have auto-detected this, but as a workaround it seems fairly simple and has worked solidly since I made the changes. I hope this helps.

Nick Wiltshire.

Comment 27 Eduardo 2011-02-16 14:08:04 UTC
Thank you for the explanation about kpartx and the workarounds. I just tried the new kernel provided by Hans, but it is still unable to read the partitions until partx is run:
Feb 16 08:53:19 asgard1 kernel: [   16.445208] md126: detected capacity change from 0 to 1500299395072
Feb 16 08:53:19 asgard1 kernel: [   16.445225] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445229] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445234] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445239] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445244] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445248] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445252] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445254] Dev md126: unable to read RDB block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445256] Buffer I/O error on device md126, logical block 0
Feb 16 08:53:19 asgard1 kernel: [   16.445259]  md126: unable to read partition table
Feb 16 08:53:19 asgard1 kernel: [   16.460668] md: raid1 personality registered for level 1
Feb 16 08:53:19 asgard1 kernel: [   16.622785] ALSA sound/pci/ctxfi/cttimer.c:424: ctxfi: Use xfi-native timer
Feb 16 08:53:19 asgard1 kernel: [   16.632987] bio: create slab <bio-1> at 1
Feb 16 08:53:19 asgard1 kernel: [   16.633286] md/raid1:md126: active with 2 out of 2 mirrors

Before I try having dracut initialize the RAID, I want to confirm that I have the correct kernel parameters:
kernel /vmlinuz-2.6.37-2.fc15.x86_64 ro root=/dev/mapper/vg_asgard1-LogVol_root rd_LVM_LV=vg_asgard1/LogVol_root rd_LVM_LV=vg_asgard1/LogVol_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rhgb vga=0x361

Do I need to remove rd_NO_MD?
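
My guess is that I do, since as far as I understand rd_NO_MD tells dracut to skip MD RAID assembly in the initramfs; in that case the line would become (everything else unchanged):

kernel /vmlinuz-2.6.37-2.fc15.x86_64 ro root=/dev/mapper/vg_asgard1-LogVol_root rd_LVM_LV=vg_asgard1/LogVol_root rd_LVM_LV=vg_asgard1/LogVol_swap rd_NO_LUKS rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rhgb vga=0x361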

Thank you,
Eduardo

Comment 28 Eduardo 2011-04-23 16:47:52 UTC
Hello,

After upgrading to kernel-2.6.35.12-88.fc14.x86_64 about a week ago I have not seen these errors. Everything seems to be working fine now. Thank you!

