Description of problem:

Running an x86_64 F17.TC6 system updated with updates-testing and https://admin.fedoraproject.org/updates/mdadm-3.2.5-1.fc17 fails to properly initialize a raid10 array at boot with incremental assembly. This leaves the system with either a degraded array or, depending on which disks are activated "in time", an array that fails to start with "not enough members."

I have tried the suggestion in https://bugzilla.redhat.com/show_bug.cgi?id=808774#c3 -- adding rd.md.uuid for all of my arrays to the kernel boot line and rebuilding the initrd -- to no avail. The raid1 arrays in this same machine have no issue.

The default /lib/systemd/fedora-storage-init contains the lines:

# Start any MD RAID arrays that haven't been started yet
[ -r /proc/mdstat ] && [ -r /dev/md/md-device-map ] && /sbin/mdadm -IRs

If I change -IRs to -As, the raid10 array comes up as expected and works -- I'm not sure this is the proper fix, however. Also, if I re-add the missing device(s) on a degraded raid10 after boot, they are added without issue, but then rebuilding begins.

[root@ds ~]# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=476d0070:6556c3a3:8c4616af:d34ab831
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=413c1502:22f94fdd:cd4da7e5:91675691
ARRAY /dev/md2 level=raid10 num-devices=6 UUID=b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
ARRAY /dev/md3 level=raid10 num-devices=6 UUID=79b75c03:e7c45579:8ab1ee93:94c002cd
ARRAY /dev/md4 level=raid1 num-devices=2 UUID=ab595bee:e8dfa157:dd6963fd:f9c0ba5f

Please let me know if there is any additional information required.
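The workaround described above (swapping -IRs for -As in fedora-storage-init) amounts to a one-line edit. Here is a sketch of that change, applied to a copy of the relevant line so nothing real is modified; it is shown only for clarity, not as a recommended permanent fix:

```shell
# Sketch of the workaround: replace the incremental-assembly call (-IRs)
# in fedora-storage-init with a full scan (-As). Done against a temp
# file holding a copy of the shipped line.
f=$(mktemp)
echo '[ -r /proc/mdstat ] && [ -r /dev/md/md-device-map ] && /sbin/mdadm -IRs' > "$f"
sed -i 's|/sbin/mdadm -IRs|/sbin/mdadm -As|' "$f"
cat "$f"
```

On the real system the same sed would target /lib/systemd/fedora-storage-init, at the cost of losing incremental assembly for all arrays.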
Anthony,

Could you post mdadm --detail of the arrays that fail, and also the output of /proc/mdstat from the time when it isn't assembled correctly?

Thanks,
Jes
This is from an earlier failure:

Personalities : [raid1] [raid10]
md3 : active raid10 sdi2[5] sdd2[0] sde2[1] sdf2[3] sdg2[2]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/5] [UUUU_U]
      bitmap: 3/43 pages [12KB], 65536KB chunk

md2 : active raid10 sdh1[4] sdi1[5] sdd1[0] sde1[1] sdf1[3] sdg1[2]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md4 : active raid1 sdc1[0] sdj1[1]
      586058616 blocks super 1.2 [2/2] [UU]
      bitmap: 2/5 pages [8KB], 65536KB chunk

md0 : active raid1 sda2[0] sdb2[1]
      511988 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sda3[0] sdb3[1]
      104856504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

------

And another:

Personalities : [raid1] [raid10]
md3 : active raid10 sdi2[5] sdd2[0] sdh2[4] sde2[3] sdg2[1]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/5] [UU_UUU]
      bitmap: 3/43 pages [12KB], 65536KB chunk

md2 : active raid10 sdf1[2] sdh1[4] sdd1[0] sdi1[5] sde1[3] sdg1[1]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md4 : active raid1 sdj1[1] sdc1[0]
      586058616 blocks super 1.2 [2/2] [UU]
      bitmap: 2/5 pages [8KB], 65536KB chunk

md0 : active raid1 sda2[0] sdb2[1]
      511988 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb3[1] sda3[0]
      104856504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

------

This is the detail when I use mdadm -As in fedora-storage-init (will you need to see one of these when it's degraded?)
[root@ds ~]# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri May 18 17:57:48 2012
     Raid Level : raid10
     Array Size : 104855040 (100.00 GiB 107.37 GB)
  Used Dev Size : 34951680 (33.33 GiB 35.79 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:13:43 2012
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:2  (local to host ds.messinet.com)
           UUID : b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
         Events : 104

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       97        2      active sync   /dev/sdg1
       3       8       81        3      active sync   /dev/sdf1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1

[root@ds ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri May 18 17:57:18 2012
     Raid Level : raid10
     Array Size : 5755676160 (5489.04 GiB 5893.81 GB)
  Used Dev Size : 1918558720 (1829.68 GiB 1964.60 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:14:08 2012
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:3  (local to host ds.messinet.com)
           UUID : 79b75c03:e7c45579:8ab1ee93:94c002cd
         Events : 15584

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync   /dev/sdd2
       1       8       66        1      active sync   /dev/sde2
       2       8       98        2      active sync   /dev/sdg2
       3       8       82        3      active sync   /dev/sdf2
       4       8      114        4      active sync   /dev/sdh2
       5       8      130        5      active sync   /dev/sdi2
Now when reverting to mdadm -IRs in fedora-storage-init:

[root@ds ~]# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri May 18 17:57:48 2012
     Raid Level : raid10
     Array Size : 104855040 (100.00 GiB 107.37 GB)
  Used Dev Size : 34951680 (33.33 GiB 35.79 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:21:18 2012
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:2  (local to host ds.messinet.com)
           UUID : b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
         Events : 104

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       97        1      active sync   /dev/sdg1
       2       8       81        2      active sync   /dev/sdf1
       3       8       65        3      active sync   /dev/sde1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1

[root@ds ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri May 18 17:57:18 2012
     Raid Level : raid10
     Array Size : 5755676160 (5489.04 GiB 5893.81 GB)
  Used Dev Size : 1918558720 (1829.68 GiB 1964.60 GB)
   Raid Devices : 6
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:22:37 2012
          State : active, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:3  (local to host ds.messinet.com)
           UUID : 79b75c03:e7c45579:8ab1ee93:94c002cd
         Events : 15803

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       98        1      active sync   /dev/sdg2
       2       8       82        2      active sync   /dev/sdf2
       3       0        0        3      removed
       4       8      114        4      active sync   /dev/sdh2
       5       8      130        5      active sync   /dev/sdi2
Then after a re-add:

[root@ds ~]# mdadm --manage /dev/md3 --re-add /dev/sde2 /dev/sdd2
mdadm: re-added /dev/sde2
mdadm: re-added /dev/sdd2

[root@ds ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri May 18 17:57:18 2012
     Raid Level : raid10
     Array Size : 5755676160 (5489.04 GiB 5893.81 GB)
  Used Dev Size : 1918558720 (1829.68 GiB 1964.60 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:26:45 2012
          State : active, degraded, recovering
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : ds.messinet.com:3  (local to host ds.messinet.com)
           UUID : 79b75c03:e7c45579:8ab1ee93:94c002cd
         Events : 16498

    Number   Major   Minor   RaidDevice State
       0       8       50        0      spare rebuilding   /dev/sdd2
       1       8       98        1      active sync   /dev/sdg2
       2       8       82        2      active sync   /dev/sdf2
       3       8       66        3      spare rebuilding   /dev/sde2
       4       8      114        4      active sync   /dev/sdh2
       5       8      130        5      active sync   /dev/sdi2
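In the /proc/mdstat listings above, a degraded array is marked by an underscore in its status string (e.g. [6/5] [UUUU_U]). As a small aside, such arrays can be flagged mechanically; this sketch runs against a sample abbreviated from the failure above rather than the live /proc/mdstat:

```shell
# A '_' inside the trailing [..] status string marks a missing member.
# Sample input abbreviated from the degraded mdstat pasted above.
mdstat='md3 : active raid10 sdi2[5] sdd2[0] sde2[1] sdf2[3] sdg2[2]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/5] [UUUU_U]
md2 : active raid10 sdh1[4] sdi1[5] sdd1[0] sde1[1] sdf1[3] sdg1[2]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]'

degraded=$(printf '%s\n' "$mdstat" | awk '
  /^md/                     { name = $1 }   # remember the current array name
  $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }' # last field is a status with "_"
)
echo "$degraded"
```

Pointing the same pipeline at the real /proc/mdstat would list exactly the arrays that came up degraded after a bad boot.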
Anthony,

I am curious if your arrays fail to start because systemd isn't quite ready when the script is run. In the case where an array fails, what happens if you do the following:

mdadm -S <device>
mdadm -IRs

Will it then get started correctly?

Neil just pushed a fix into upstream to handle the case of the incremental code running before systemd is ready, and I was curious if this is what we are seeing here. Note the upstream udev md rules file differs quite a lot from the Fedora one, so it isn't a straight patch apply, but if the above works for you, I will look into porting it over for testing.

Cheers,
Jes
(In reply to comment #5)
> Anthony,
>
> I am curious if your arrays fail to start because systemd isn't quite
> ready when the script is run. In the case where an array fails, what
> happens if you do the following
>
> mdadm -S <device>
> mdadm -IRs
>
> Will it then get started correctly?

I do believe you are on the right track in that systemd isn't ready, however...

After booting with a failed raid10 on /dev/md2 (and with the default mdadm -IRs in /lib/systemd/fedora-storage-init), I did the following:

mdadm -S /dev/md2
mdadm -IRs

Then nothing happens -- mdadm -IRs has no effect. Nothing in the logs, and /dev/md2 is not started. If I then issue an mdadm -As, everything works as expected and the 6-drive /dev/md2 array is initialized with no degradation.

> Neil just pushed a fix into upstream to handle the case of the incremental
> code running before systemd is ready, and I was curious if this is what we
> are seeing here.

I still think this is related to systemd. In my early testing, I added a sleep command after the mdadm -IRs command in /lib/systemd/fedora-storage-init and that _subjectively seemed_ to help, but it also proved unreliable in the end. The only thing that seems to ensure that raid10 starts properly at this point in F17 is mdadm -As in /lib/systemd/fedora-storage-init (instead of mdadm -IRs).

> Note the upstream udev md rules file differ quite a lot from the Fedora one
> so it isn't a straight patch apply, but if the above works for you, I will
> look into porting it over for testing.

If you are able to build some F17 x86_64 RPMs with your changes, I'd be willing to test them today. I am about to put this server into "real service" in a day or so, but until then, I'd be able to do some more testing before I wipe everything and do another fresh F17 install.

> Cheers,
> Jes
I did see this:
http://neil.brown.name/git?p=mdadm;a=commitdiff;h=339b77467a7ad5e5354b5ab0542f8148b18c898c

Perhaps there's more to it, but applying this change did not resolve the issue at boot with raid10.

Also, a question: https://bugzilla.redhat.com/show_bug.cgi?id=808774#c3 suggests adding rd.md.uuid for all of my arrays to the kernel boot line. Is this still necessary? It seems to have no effect on raid10 in my case, but is it needed for raid1, for example? Sorry for the question, but I'm trying to narrow the requirements down a bit.
Anthony,

Just to be sure: you tried applying this patch to /lib/udev/rules.d/64-md-raid.rules but it had no effect?

I am not sure about the rd.md.uuid rule in this case; however, did you try to update your initramfs after applying the patch? dracut puts 64-md-raid.rules into the initramfs. If you didn't, try running 'dracut -f'.

Cheers,
Jes
(In reply to comment #8)
> Anthony,
>
> Just to be sure, you tried applying this patch to
> /lib/udev/rules.d/64-md-raid.rules
> but it had no effect?

I had originally done this, yes. It had no effect.

> I am not sure about the rd.md.uuid rule in this case, however, did you try
> to update your initramfs after applying the patch? dracut puts
> 64-md-raid.rules into the initramfs.

I did not originally do this before comment #7, however...

> If you didn't, try running 'dracut -f'

I have now done this with the patch from comment #7 applied correctly, and it does not fix the problem. I am dropped into emergency mode, where I am easily able to do mdadm -S on the failed devices, then issue an mdadm -As, which properly starts the raid10 devices with all available drives.

Perhaps a short-term hack would be a conditional in /lib/systemd/fedora-storage-init that does mdadm -IRs as long as no raid10 arrays are configured, but mdadm -As if there are? I'm not sure if this is possible, though.

> Cheers,
> Jes
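The conditional suggested above would be straightforward in shell. A minimal sketch, run here against a sample config written to a temp file (the real one lives at /etc/mdadm.conf), that picks the assembly mode based on whether any raid10 array is configured:

```shell
# Hypothetical conditional for fedora-storage-init: use full-scan
# assembly (-As) when any raid10 array is configured, incremental
# (-IRs) otherwise. Sample entries modeled on the mdadm.conf in this
# report.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=476d0070:6556c3a3:8c4616af:d34ab831
ARRAY /dev/md2 level=raid10 num-devices=6 UUID=b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
EOF

if grep -q 'level=raid10' "$conf"; then
    mode='-As'
else
    mode='-IRs'
fi
echo "would run: mdadm $mode"
```

As noted in the following comment, though, this would only mask the underlying race rather than fix it.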
I am not keen on hacking fedora-storage-init like that; it would simply mask the problem. It is puzzling that you are the only one reporting the problem so far, so I wonder what is special about your setup. I will have to see if I can create a setup that will reproduce this.

Is it a high-CPU-count system? Are any of the raids used for the boot process: /, /boot, /usr, etc.?

Cheers,
Jes
~]# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=476d0070:6556c3a3:8c4616af:d34ab831
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=413c1502:22f94fdd:cd4da7e5:91675691
ARRAY /dev/md2 level=raid10 num-devices=6 UUID=b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
ARRAY /dev/md3 level=raid10 num-devices=6 UUID=79b75c03:e7c45579:8ab1ee93:94c002cd
ARRAY /dev/md4 level=raid1 num-devices=2 UUID=ab595bee:e8dfa157:dd6963fd:f9c0ba5f

~]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Fri May 18 18:01:58 2012
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=d1ba2322-3320-4c3a-afb4-fc22f08cff79 /        ext4 defaults 1 1
UUID=9b15a1b8-6fde-4c52-bc19-59a4fa4eee76 /boot    ext4 defaults 1 2
UUID=1ce20db0-e5be-49db-ae6b-8ec5bcd76b0a /home    ext4 defaults 1 2
UUID=fb7cc378-d30e-47a1-8de8-77a1ef794ff3 /srv     ext4 defaults 1 2
UUID=ad94366c-fec4-435c-b5cd-0ada0b638c11 /var/lib ext4 defaults 1 2
UUID=ac421f62-58b0-41b3-976f-fff3fbd791e8 swap     swap defaults 0 0
UUID=1809567f-9f01-4909-9625-11875230ecd8 swap     swap defaults 0 0

~]# cat /proc/partitions
major minor  #blocks  name

   8        0  976762584 sda
   8        1       1024 sda1
   8        2     512000 sda2
   8        3  104857600 sda3
   8        4    5136384 sda4
   8       16  976762584 sdb
   8       17       1024 sdb1
   8       18     512000 sdb2
   8       19  104857600 sdb3
   8       20    5136384 sdb4
  11        0    1048575 sr0
   9        1  104856504 md1
   9        0     511988 md0
   8       32  586061784 sdc
   8       33  586059776 sdc1
   8      144  586061784 sdj
   8      145  586059776 sdj1
   9        4  586058616 md4
   8       80 1953514584 sdf
   8       81   34953216 sdf1
   8       82 1918560256 sdf2
   8       48 1953514584 sdd
   8       49   34953216 sdd1
   8       50 1918560256 sdd2
   8      112 1953514584 sdh
   8      113   34953216 sdh1
   8      114 1918560256 sdh2
   8       64 1953514584 sde
   8       65   34953216 sde1
   8       66 1918560256 sde2
   8       96 1953514584 sdg
   8       97   34953216 sdg1
   8       98 1918560256 sdg2
   8      128 1953514584 sdi
   8      129   34953216 sdi1
   8      130 1918560256 sdi2
   9        2  104855040 md2
   9        3 5755676160 md3

~]# cat /proc/mdstat
Personalities : [raid1] [raid10]
md3 : active raid10 sdd2[0] sdi2[5] sdh2[4] sde2[3] sdf2[2] sdg2[1]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 2/43 pages [8KB], 65536KB chunk

md2 : active raid10 sde1[3] sdi1[5] sdg1[1] sdd1[0] sdh1[4] sdf1[2]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md4 : active raid1 sdj1[1] sdc1[0]
      586058616 blocks super 1.2 [2/2] [UU]
      bitmap: 0/5 pages [0KB], 65536KB chunk

md0 : active raid1 sdb2[1] sda2[0]
      511988 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb3[1] sda3[0]
      104856504 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
Created attachment 587672 [details] cat /proc/cpuinfo
Created attachment 587677 [details] dmesg output
Based on our IRC discussion, I am noting the following points for the record here:

1) The problem seems to happen only with drives sitting on the mpt controller.
2) The mpt controller fires much later in the boot process than the AHCI controller, and systemd is already up and running.

My guess here is that we hit issues with the probe order of the drives in the raid10, and try to launch it before the required drives are added. I am not quite sure how to get around this yet, but it is worth trying to force the mpt2sas driver to be loaded before systemd is set free. rd.driver.pre=mpt2sas ought to do it.

Jes
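For readers trying the rd.driver.pre=mpt2sas suggestion above: on a grub2-based install this would be appended to the kernel command line, roughly as below. The surrounding options are elided placeholders, not real values, and the initramfs would need regenerating (dracut -f) afterwards for the setting to matter in early boot:

```
# /etc/default/grub -- illustrative; '...' stands for whatever options
# the boot line already carries
GRUB_CMDLINE_LINUX="... rd.driver.pre=mpt2sas"
```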
Created attachment 587729 [details] dmesg output with / and /boot on drives in SAS box
Jes,

This is functionally the same as the problem HPA was having with his raid6. The fact that it's a raid10 here is largely irrelevant. It's the two different types of controllers, two different controller drivers, and the system starting to bring things up before all the drives have completed their device scan that's the problem.

I had initially thought that scsi_wait_scan was broken. Now I'm thinking the issue is just slightly different. I think maybe scsi_wait_scan is doing what it's supposed to do, but we are attempting to bring up the arrays as soon as scsi_wait_scan completes, and that's the problem. I think scsi_wait_scan is waiting until the low-level scsi scan is complete, but that's no guarantee that the udev add/change events have all been processed. Instead, scsi_wait_scan scans the drives, that creates the device entries in the kernel, and that causes udev add/change events to get *queued*, but there is no mechanism in place to guarantee that A) udev has received the events and put them into its own internal queue, or B) udev has completed processing the events. And since bringing up the raid devices relies on the udev event queue having been processed, not the kernel scan queue, we have a race condition -- one that's normally not hit unless you have two different drivers that complete their events at vastly different times.

I think what may be needed here is for dracut to do a scsi_wait_scan followed by a udev settle (and maybe even a slight pause between the scsi_wait_scan and the udev settle to make sure that udev has registered all of the events and got them in the queue, or two subsequent udev settle calls -- whatever it takes to make sure that every event created by scsi_wait_scan is done before we proceed).
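The sequencing proposed above can be sketched as a shell fragment. The commands are stubbed out here (the genuine modprobe/udevadm need root and real hardware), so this only illustrates the proposed ordering and is not a drop-in dracut patch:

```shell
# Stubs stand in for the real commands so the ordering can be shown
# without root or hardware.
modprobe() { echo "modprobe $*"; }
udevadm()  { echo "udevadm $*"; }
sleep()    { :; }   # skip the real pause in this sketch

wait_for_disks() {
    modprobe scsi_wait_scan   # returns once low-level SCSI scans finish
    udevadm settle            # drain the udev events queued by the scan
    sleep 1                   # grace period for late event registration
    udevadm settle            # second settle to catch stragglers
}
wait_for_disks
```

The key point is the settle after (not instead of) the scan wait: the kernel finishing its scan and udev finishing its queue are two separate completion points.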
Harold, can you comment on my assessment in comment #16? Is dracut already doing the right thing? We have several cases now that are pointing to some sort of race between scsi_wait_scan ending and device assembly, and my best educated guess at the moment is that we are waiting for scsi_wait_scan to complete, but not for all the queued udev events that scsi_wait_scan created.
Doug,

That sounds very likely; it was my conclusion as well from discussing this with Anthony that something goes wrong in the order things are processed. It also explains why I haven't seen any of it in my own testing, since so far I haven't had any non-AHCI controllers locally (which will change next week when the one I ordered should arrive). At that point I should be able to test/experiment here too.

Adding HPA to this bug.

Cheers,
Jes
At this point, after a reinstall of this F17 system, I can confirm the following between the two interfaces, AHCI and MPT2SAS:

1. If I install with md0 (raid1 /boot) and md1 (raid1 /) on AHCI, and md2 (raid10 /home) and md3 (raid10 /srv) on MPT2SAS, I DO get the failures noted in the initial bug report.

2. If I install with md0 (raid1 /boot), md1 (raid1 /), md2 (raid10 /home) and md3 (raid10 /srv) all on MPT2SAS, I DO NOT get the failures.

This leads me (the non-expert tester) to lean further towards Doug's comments in comment #16 as being the cause of the problem. I have not yet been able to test whether or not enabling the udev-settle.service has any effect, as I will need to reinstall with option 1 above to try that.
(In reply to comment #17)
> Harold, can you comment on my assessment in comment #16? Is dracut already
> doing the right thing? We have several cases now that are pointing to some
> sort of race between scsi_wait_scan ending and device assembly, and my best
> educated guess at the moment is that we are waiting for scsi_wait_scan to
> complete, but not for all the queued udev events that scsi_wait_scan created.

1. We do not do scsi_wait_scan anymore.
2. "mdadm" is run from udev rules, so it's done immediately when the device is registered and the uevent is processed.
If you don't do scsi_wait_scan any more, then how do you know when the scan is complete and it is safe to proceed with mounting the root fs?
(In reply to comment #21)
> If you don't do scsi_wait_scan any more, then how do you know when the scan
> is complete and it is safe to proceed with mounting the root fs?

We are safe if the root device is present and mountable.

For md devices, dracut waits until all udev events have been processed, the system is idle, and the boot process has lasted at least some seconds (which can be specified by "rd.retry=<seconds>") before it tries to start the raid device in degraded mode.
(In reply to comment #22)
> (In reply to comment #21)
> > If you don't do scsi_wait_scan any more, then how do you know when the scan
> > is complete and it is safe to proceed with mounting the root fs?
>
> We are safe, if the root device is present and mountable.
>
> For md devices, dracut waits until all udev events have been processed and
> the system is idle and the boot process lasts at least some seconds, which
> can be specified by "rd.retry=<seconds>", before dracut tries to start the
> raid device in degraded mode.

How do you know all udev events have been processed? The problem we have seen here is when someone has two SATA/SAS controllers in the box, and the second controller is much slower doing the probing. Per the log posted previously, the mpt2sas controller could take over 10 seconds to complete its probing. In the example attached here, the mpt2sas controller completed its probing 16 seconds after the initial AHCI controller completed the first round. See https://bugzilla.redhat.com/attachment.cgi?id=587729

Asking people with multiple controllers to set a timeout on the command line might be a workaround, but we need to find a way for this to happen automatically.

Cheers,
Jes
(In reply to comment #23)
> How do you know all udev events have been processed? The problem we have seen
> here is when someone has two SATA/SAS controllers in the box, and the second
> controller is much slower doing the probing. Per the log posted previously,
> the mpt2sas controller could take over 10 seconds to complete it's probing.
> In the example attached here, the mpt2sas controller completed it's probing
> 16 seconds after the initial AHCI controller completed the first round.
> See https://bugzilla.redhat.com/attachment.cgi?id=587729
>
> Asking people with multiple controllers to set a timeout on the commandline
> might be a workaround, but we need to find a way for this to happen
> automatically.

Then maybe the driver should not return from modprobe until it has initialized. That would keep "udevadm settle" blocked.

In dracut we have to draw a line somewhere. No uevents pending, plus some time passed, should mean that dracut may execute some fallback strategies on raids.

Another idea might be a /sys file with the information that disk probing is still in progress, which dracut can check.
Note, I don't know whether the mpt2sas driver returns from modprobe before it is finished or not; I just read the probe logs in the dmesg output. However, there may be drivers doing async probing to avoid locking up the bus for too long, in particular if they can have a large number of LUNs attached.

A /sys file might be an approach.

Jes
(In reply to comment #25)
> Note I don't know of the mpt2sas driver returns from modprobe before it is
> finished or not. I just read the probe logs in the dmesg output.

The mpt2sas driver uses async scan and returns from modprobe immediately; the scan can then take a longer time afterwards.
Tomas,

Thank you for the clarification -- is there any method for userland to detect that probing is still ongoing?

Thanks,
Jes
(In reply to comment #27)
> Tomas,
>
> Thank you for the clarification - is there any method for userland to detect
> that probing is still ongoing?

I'm not sure. 'modprobe scsi_wait_scan' used to work, but there were some changes to it, and I don't know if it still works with the latest kernels.

A possible workaround is to switch to sync scan:
http://lxr.linux.no/#linux+v3.5/drivers/scsi/scsi_scan.c#L102

> Thanks,
> Jes
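For reference, the switch Tomas points at is the "scan" module parameter of scsi_mod (see the linked scsi_scan.c); as a kernel command-line fragment it would look like the line below, with its placement in the boot line otherwise unchanged:

```
scsi_mod.scan=sync
```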
Tomas,

It sounds like we need to investigate a smarter API for querying whether we have completed probing. I don't think switching to sync scan is the way to go.

Thanks,
Jes
(In reply to comment #29)
> Tomas,
>
> It sounds like we need to investigate a smarter API for querying whether we
> have completed probing. I don't think switching to sync scan is the way to
> go.

Sure, I mentioned this as a workaround. What is wrong with scsi_wait_scan?

> Thanks,
> Jes
For informational purposes (tainted kernel): the same problem occurs here on Fedora 16:

* kernel-3.6.5-2.fc16.x86_64
* mdadm-3.2.5-4.fc16.x86_64

This is the first reboot after upgrading to these packages; I'm not sure if there is a relation between these packages and the issue. After reboot I ran into this mess:

[root@kvm01 RAID10]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10] [raid1]
md4 : active raid10 sdd6[2] sda6[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/2] [U_U_]

md8 : active (auto-read-only) raid10 sde10[3] sdb10[1] sda10[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/3] [UU_U]

md5 : active raid10 sde7[3] sda7[0] sdb7[1]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/3] [UU_U]

md9 : active (auto-read-only) raid10 sda11[0] sdb11[1] sde11[3]
      1289924608 blocks super 1.2 512K chunks 2 far-copies [4/3] [UU_U]

md2 : inactive sde4[3] sda4[0]
      204797952 blocks super 1.2

md0 : active raid1 sda2[0]
      511988 blocks super 1.0 [4/1] [U___]

md6 : active raid10 sdd8[2] sde8[3] sdb8[1] sda8[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md3 : active raid10 sdd5[2] sdb5[1] sde5[3] sda5[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md7 : active raid10 sdb9[1] sdd9[2] sde9[3] sda9[0]
      205307904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md1 : active raid6 sda3[0] sdd3[2] sdb3[1] sde3[3]
      204796928 blocks super 1.1 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

Actually, none of the supposedly missing devices was missing, but apparently during boot they were. I was lucky that the system was able to boot (the root FS was OK). After the boot I was able to make all these RAID devices active again (although the resync is still running).

Doing tests on this system is possible, although it requires careful planning because of the users of the system.
Created attachment 640740 [details] dmesg output of F16 system This is the dmesg output of my F16 system showing a failing incremental RAID assembly during boot.
Rolf,

Just to be sure I got the data right: your sda4/sde4/sda2 are supposed to be part of the other arrays? Could you post the output of a 'correct' /proc/mdstat as well?

Your problem is somewhat different from the other reports of this problem, since you seem to have all the disks attached to a single AHCI controller in the system. The other reports were mostly caused by having raid arrays spanning two controllers (AHCI and MPT SAS) which got probed at different times.

What version of dracut do you have installed?

Thanks,
Jes
Hi Jes,

Here's my mdstat, with all arrays operational again:

[rolf@kvm01 ~]$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10] [raid1]
md2 : active raid10 sda4[0] sde4[3] sdd4[2] sdb4[1]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md4 : active raid10 sde6[5] sdb6[4] sdd6[2] sda6[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md8 : active raid10 sdd10[2] sde10[3] sdb10[1] sda10[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md5 : active raid10 sdd7[4] sde7[3] sda7[0] sdb7[1]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md9 : active raid10 sdd11[2] sda11[0] sdb11[1] sde11[3]
      1289924608 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md0 : active raid1 sde2[6] sdd2[5] sdb2[4] sda2[0]
      511988 blocks super 1.0 [4/4] [UUUU]

md6 : active raid10 sdd8[2] sde8[3] sdb8[1] sda8[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md3 : active raid10 sdd5[2] sdb5[1] sde5[3] sda5[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md7 : active raid10 sdb9[1] sdd9[2] sde9[3] sda9[0]
      205307904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]

md1 : active raid6 sda3[0] sdd3[2] sdb3[1] sde3[3]
      204796928 blocks super 1.1 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
[rolf@kvm01 ~]$

dracut is dracut-018-60.git20120927.fc16.noarch.

One observation I made is that the topmost arrays in /proc/mdstat were affected, and not the bottom half. I think the arrays show up in /proc/mdstat in the reverse of the order in which they were built during boot. This suggests that at a certain point not all partitions required for the arrays were recognized at the time the respective arrays were built.

If you need more information, let me know.

Rolf
Reinstalled the system with fewer RAID partitions per hard disk. For me the issue is gone now, though I think the true cause hasn't been found.
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
(In reply to Rolf Fokkens from comment #35)
> Reinstalled the system with less RAID partitions per hard disk. For me the
> issue is gone now, though I think the true cause hasn't been found.

I have since yum upgraded through Fedora 18 and am now using Fedora 19. I no longer see this issue, though like Rolf, I'm not sure it was ever actually figured out. Since I reported it, I'll close it. Feel free to reopen it if there are related issues.