Bug 823597

Summary: mdadm raid10 incremental assembly fails at boot, creating a degraded array
Product: Fedora
Reporter: Anthony Messina <amessina>
Component: mdadm
Assignee: Jes Sorensen <Jes.Sorensen>
Status: CLOSED WORKSFORME
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 18
CC: agk, bruno, dledford, gmazyland, harald, hhoyer, hpa, Jes.Sorensen, kzak, mschmidt, rolf, thenzl
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-09-04 06:38:15 UTC
Attachments:
  cat /proc/cpuinfo
  dmesg output
  dmesg output with / and /boot on drives in SAS box
  dmesg output of F16 system

Description Anthony Messina 2012-05-21 15:36:19 UTC
Description of problem:
Running an x86_64 F17.TC6 system updated with updates-testing and https://admin.fedoraproject.org/updates/mdadm-3.2.5-1.fc17 fails to properly initialize a raid10 array at boot with incremental assembly. The system comes up with either a degraded array or, depending on which disks are activated "in time", may fail to start the array at all with "not enough members".

I have tried the suggestion in https://bugzilla.redhat.com/show_bug.cgi?id=808774#c3 (adding rd.md.uuid for each of my arrays to the kernel boot line) and rebuilt the initrd, to no avail.
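
For reference, this is roughly what that change looked like (a sketch only; one rd.md.uuid entry per array, with the two raid10 UUIDs from /etc/mdadm.conf below shown as examples, and the grub2 paths assumed for a stock F17 install):

# appended to the kernel command line (GRUB_CMDLINE_LINUX in /etc/default/grub)
rd.md.uuid=b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a rd.md.uuid=79b75c03:e7c45579:8ab1ee93:94c002cd

# regenerate the grub configuration and the initramfs
grub2-mkconfig -o /boot/grub2/grub.cfg
dracut -f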

The raid1 arrays in this same machine have no issue.

The default /lib/systemd/fedora-storage-init contains the lines:
# Start any MD RAID arrays that haven't been started yet
[ -r /proc/mdstat ] && [ -r /dev/md/md-device-map ] && /sbin/mdadm -IRs

If I change from -IRs to -As, the raid10 array comes up as expected and works; I'm not sure this is the proper fix, however.
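
For clarity, the change I am testing is just this one line (a local workaround sketch, not a proposed patch):

# /lib/systemd/fedora-storage-init, with -IRs swapped for -As
# Start any MD RAID arrays that haven't been started yet
[ -r /proc/mdstat ] && [ -r /dev/md/md-device-map ] && /sbin/mdadm -As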

Also, if I re-add the missing device(s) to a degraded raid10 after boot, they are added without issue, but a rebuild then begins.

[root@ds ~]# cat /etc/mdadm.conf 
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=476d0070:6556c3a3:8c4616af:d34ab831
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=413c1502:22f94fdd:cd4da7e5:91675691
ARRAY /dev/md2 level=raid10 num-devices=6 UUID=b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
ARRAY /dev/md3 level=raid10 num-devices=6 UUID=79b75c03:e7c45579:8ab1ee93:94c002cd
ARRAY /dev/md4 level=raid1 num-devices=2 UUID=ab595bee:e8dfa157:dd6963fd:f9c0ba5f

Please let me know if there is any additional information required.

Comment 1 Jes Sorensen 2012-05-21 19:06:00 UTC
Anthony,

Could you post the mdadm --detail output for the arrays that fail, and also the
output of /proc/mdstat from a boot where the array isn't assembled correctly?

Thanks,
Jes

Comment 2 Anthony Messina 2012-05-21 19:15:40 UTC
This is from an earlier failure:
Personalities : [raid1] [raid10] 
md3 : active raid10 sdi2[5] sdd2[0] sde2[1] sdf2[3] sdg2[2]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/5] [UUUU_U]
      bitmap: 3/43 pages [12KB], 65536KB chunk

md2 : active raid10 sdh1[4] sdi1[5] sdd1[0] sde1[1] sdf1[3] sdg1[2]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md4 : active raid1 sdc1[0] sdj1[1]
      586058616 blocks super 1.2 [2/2] [UU]
      bitmap: 2/5 pages [8KB], 65536KB chunk

md0 : active raid1 sda2[0] sdb2[1]
      511988 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sda3[0] sdb3[1]
      104856504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

------
And another:
Personalities : [raid1] [raid10] 
md3 : active raid10 sdi2[5] sdd2[0] sdh2[4] sde2[3] sdg2[1]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/5] [UU_UUU]
      bitmap: 3/43 pages [12KB], 65536KB chunk

md2 : active raid10 sdf1[2] sdh1[4] sdd1[0] sdi1[5] sde1[3] sdg1[1]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md4 : active raid1 sdj1[1] sdc1[0]
      586058616 blocks super 1.2 [2/2] [UU]
      bitmap: 2/5 pages [8KB], 65536KB chunk

md0 : active raid1 sda2[0] sdb2[1]
      511988 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sdb3[1] sda3[0]
      104856504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

------
This is the detail when I use mdadm -As in fedora-storage-init (do you need to see one of these when it's degraded?)

[root@ds ~]# mdadm --detail /dev/md2
/dev/md2:                                                                                                                                                                                                                                    
        Version : 1.2                                                                                                                                                                                                                        
  Creation Time : Fri May 18 17:57:48 2012                                                                                                                                                                                                   
     Raid Level : raid10                                                                                                                                                                                                                     
     Array Size : 104855040 (100.00 GiB 107.37 GB)                                                                                                                                                                                           
  Used Dev Size : 34951680 (33.33 GiB 35.79 GB)                                                                                                                                                                                              
   Raid Devices : 6                                                                                                                                                                                                                          
  Total Devices : 6                                                                                                                                                                                                                          
    Persistence : Superblock is persistent                                                                                                                                                                                                   
                                                                                                                                                                                                                                             
  Intent Bitmap : Internal                                                                                                                                                                                                                   

    Update Time : Mon May 21 14:13:43 2012
          State : active 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:2  (local to host ds.messinet.com)
           UUID : b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
         Events : 104

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       97        2      active sync   /dev/sdg1
       3       8       81        3      active sync   /dev/sdf1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1


[root@ds ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri May 18 17:57:18 2012
     Raid Level : raid10
     Array Size : 5755676160 (5489.04 GiB 5893.81 GB)
  Used Dev Size : 1918558720 (1829.68 GiB 1964.60 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:14:08 2012                                                                                                                                                                                                   
          State : active                                                                                                                                                                                                                     
 Active Devices : 6                                                                                                                                                                                                                          
Working Devices : 6                                                                                                                                                                                                                          
 Failed Devices : 0                                                                                                                                                                                                                          
  Spare Devices : 0                                                                                                                                                                                                                          
                                                                                                                                                                                                                                             
         Layout : near=2                                                                                                                                                                                                                     
     Chunk Size : 512K                                                                                                                                                                                                                       
                                                                                                                                                                                                                                             
           Name : ds.messinet.com:3  (local to host ds.messinet.com)                                                                                                                                                                         
           UUID : 79b75c03:e7c45579:8ab1ee93:94c002cd                                                                                                                                                                                        
         Events : 15584                                                                                                                                                                                                                      
                                                                                                                                                                                                                                             
    Number   Major   Minor   RaidDevice State                                                                                                                                                                                                
       0       8       50        0      active sync   /dev/sdd2                                                                                                                                                                              
       1       8       66        1      active sync   /dev/sde2                                                                                                                                                                              
       2       8       98        2      active sync   /dev/sdg2                                                                                                                                                                              
       3       8       82        3      active sync   /dev/sdf2                                                                                                                                                                              
       4       8      114        4      active sync   /dev/sdh2                                                                                                                                                                              
       5       8      130        5      active sync   /dev/sdi2

Comment 3 Anthony Messina 2012-05-21 19:22:49 UTC
Now when reverting to mdadm -IRs in fedora-storage-init:

[root@ds ~]# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri May 18 17:57:48 2012
     Raid Level : raid10
     Array Size : 104855040 (100.00 GiB 107.37 GB)
  Used Dev Size : 34951680 (33.33 GiB 35.79 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:21:18 2012
          State : active 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:2  (local to host ds.messinet.com)
           UUID : b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
         Events : 104

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       97        1      active sync   /dev/sdg1
       2       8       81        2      active sync   /dev/sdf1
       3       8       65        3      active sync   /dev/sde1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1


[root@ds ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri May 18 17:57:18 2012
     Raid Level : raid10
     Array Size : 5755676160 (5489.04 GiB 5893.81 GB)
  Used Dev Size : 1918558720 (1829.68 GiB 1964.60 GB)
   Raid Devices : 6
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:22:37 2012
          State : active, degraded 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ds.messinet.com:3  (local to host ds.messinet.com)
           UUID : 79b75c03:e7c45579:8ab1ee93:94c002cd
         Events : 15803

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       98        1      active sync   /dev/sdg2
       2       8       82        2      active sync   /dev/sdf2
       3       0        0        3      removed
       4       8      114        4      active sync   /dev/sdh2
       5       8      130        5      active sync   /dev/sdi2

Comment 4 Anthony Messina 2012-05-21 19:27:14 UTC
Then after a re-add:

[root@ds ~]# mdadm --manage /dev/md3 --re-add /dev/sde2 /dev/sdd2
mdadm: re-added /dev/sde2
mdadm: re-added /dev/sdd2
[root@ds ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri May 18 17:57:18 2012
     Raid Level : raid10
     Array Size : 5755676160 (5489.04 GiB 5893.81 GB)
  Used Dev Size : 1918558720 (1829.68 GiB 1964.60 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 21 14:26:45 2012
          State : active, degraded, recovering 
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : ds.messinet.com:3  (local to host ds.messinet.com)
           UUID : 79b75c03:e7c45579:8ab1ee93:94c002cd
         Events : 16498

    Number   Major   Minor   RaidDevice State
       0       8       50        0      spare rebuilding   /dev/sdd2
       1       8       98        1      active sync   /dev/sdg2
       2       8       82        2      active sync   /dev/sdf2
       3       8       66        3      spare rebuilding   /dev/sde2
       4       8      114        4      active sync   /dev/sdh2
       5       8      130        5      active sync   /dev/sdi2

Comment 5 Jes Sorensen 2012-05-29 06:59:44 UTC
Anthony,

I am curious if your arrays fail to start because systemd isn't quite
ready when the script is run. In the case where an array fails, what
happens if you do the following

mdadm -S <device>
mdadm -IRs

Will it then get started correctly?

Neil just pushed a fix into upstream to handle the case of the incremental
code running before systemd is ready, and I was curious if this is what we
are seeing here.

Note the upstream udev md rules file differs quite a lot from the Fedora one,
so the patch doesn't apply directly, but if the above works for you, I will
look into porting it over for testing.

Cheers,
Jes

Comment 6 Anthony Messina 2012-05-30 08:20:22 UTC
(In reply to comment #5)
> Anthony,
> 
> I am curious if your arrays fail to start because systemd isn't quite
> ready when the script is run. In the case where an array fails, what
> happens if you do the following
> 
> mdadm -S <device>
> mdadm -IRs
> 
> Will it then get started correctly?

I do believe you are on the right track in that systemd isn't ready, however...

After booting with a failed raid10 on /dev/md2 (and with the default: mdadm -IRs in /lib/systemd/fedora-storage-init), I did the following:

mdadm -S /dev/md2
mdadm -IRs

Then nothing happens -- mdadm -IRs has no effect.  Nothing in the logs.  And /dev/md2 is not started.

If I then issue an mdadm -As, everything works as expected and the 6-drive /dev/md2 array is initialized with no degradation.

> Neil just pushed a fix into upstream to handle the case of the incremental
> code running before systemd is ready, and I was curious if this is what we
> are seeing here.

I still think this is related to systemd.  In my early testing, I added a sleep command after the mdadm -IRs command in /lib/systemd/fedora-storage-init and that _subjectively seemed_ to help, but it also proved unreliable in the end.

The only thing that seems to ensure that raid10 starts properly at this point in F17 is mdadm -As in /lib/systemd/fedora-storage-init (instead of mdadm -IRs).

> Note the upstream udev md rules file differ quite a lot from the Fedora one
> so it isn't a straight patch apply, but if the above works for you, I will
> look into porting it over for testing.

If you are able to build some F17 x86_64 RPMs with your changes, I'd be willing to test them today.  I am about to put this server into "real service" in a day or so, but until then, I'd be able to do some more testing before I wipe everything and do another fresh F17 install.

> Cheers,
> Jes

Comment 7 Anthony Messina 2012-05-30 08:33:17 UTC
I did see this:
http://neil.brown.name/git?p=mdadm;a=commitdiff;h=339b77467a7ad5e5354b5ab0542f8148b18c898c

Perhaps there's more to it, but applying this change did not resolve the issue at boot with raid10.

Also, a question -- https://bugzilla.redhat.com/show_bug.cgi?id=808774#c3 suggests adding rd.md.uuid for all of my arrays to the kernel boot line.  Is this still necessary?  It seems to have no effect on raid10 in my case, but is it needed for raid1, for example?

Sorry for the question, but I'm trying to narrow the requirements down a bit.

Comment 8 Jes Sorensen 2012-05-30 09:35:11 UTC
Anthony,

Just to be sure, you tried applying this patch to
/lib/udev/rules.d/64-md-raid.rules
but it had no effect?

I am not sure about the rd.md.uuid rule in this case; however, did you try
to update your initramfs after applying the patch? dracut puts
64-md-raid.rules into the initramfs.

If you didn't, try running 'dracut -f'

Cheers,
Jes

Comment 9 Anthony Messina 2012-05-30 10:11:14 UTC
(In reply to comment #8)
> Anthony,
> 
> Just to be sure, you tried applying this patch to
> /lib/udev/rules.d/64-md-raid.rules
> but it had no effect?

I had originally done this, yes.  It had no effect.
 
> I am not sure about the rd.md.uuid rule in this case, however, did you try
> to update your initramfs after applying the patch? dracut puts
> 64-md-raid.rules into the initramfs.

I did not originally do this before comment #7, however...

> If you didn't, try running 'dracut -f'

I have now done this with the patch from comment #7 applied correctly and it does not fix the problem.  I am dropped into emergency mode, where I am easily able to run mdadm -S on the failed devices and then issue an mdadm -As, which properly starts the raid10 devices with all available drives.

Perhaps a short-term hack would be a conditional in /lib/systemd/fedora-storage-init that does mdadm -IRs as long as no raid10 arrays are configured, but mdadm -As if raid10 arrays are configured?  I'm not sure if this is possible, though.
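
Something like this is what I have in mind (untested sketch only; it just greps /etc/mdadm.conf for a raid10 array and falls back to full assembly in that case):

# untested sketch for /lib/systemd/fedora-storage-init
if grep -q 'level=raid10' /etc/mdadm.conf 2>/dev/null; then
    # raid10 arrays configured: do a full assembly
    [ -r /proc/mdstat ] && /sbin/mdadm -As
else
    # otherwise keep the current incremental assembly
    [ -r /proc/mdstat ] && [ -r /dev/md/md-device-map ] && /sbin/mdadm -IRs
fi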

> Cheers,
> Jes

Comment 10 Jes Sorensen 2012-05-30 10:17:42 UTC
I am not keen on hacking fedora-storage-init like that; it would simply mask
the problem. It is puzzling that you are the only one reporting the problem
so far, so I wonder what is special about your setup.

I will have to see if I can create a setup that will reproduce this.

Is it a high CPU count system?

Are any of the raids used for the boot process: /, /boot, /usr, etc.?

Cheers,
Jes

Comment 11 Anthony Messina 2012-05-30 10:28:15 UTC
~]# cat /etc/mdadm.conf 
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=476d0070:6556c3a3:8c4616af:d34ab831
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=413c1502:22f94fdd:cd4da7e5:91675691
ARRAY /dev/md2 level=raid10 num-devices=6 UUID=b6ef9d68:e3dd261f:3ba5961c:0cfd1a2a
ARRAY /dev/md3 level=raid10 num-devices=6 UUID=79b75c03:e7c45579:8ab1ee93:94c002cd
ARRAY /dev/md4 level=raid1 num-devices=2 UUID=ab595bee:e8dfa157:dd6963fd:f9c0ba5f

~]# cat /etc/fstab

#
# /etc/fstab
# Created by anaconda on Fri May 18 18:01:58 2012
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=d1ba2322-3320-4c3a-afb4-fc22f08cff79 /                       ext4    defaults        1 1
UUID=9b15a1b8-6fde-4c52-bc19-59a4fa4eee76 /boot                   ext4    defaults        1 2
UUID=1ce20db0-e5be-49db-ae6b-8ec5bcd76b0a /home                   ext4    defaults        1 2
UUID=fb7cc378-d30e-47a1-8de8-77a1ef794ff3 /srv                    ext4    defaults        1 2
UUID=ad94366c-fec4-435c-b5cd-0ada0b638c11 /var/lib                ext4    defaults        1 2
UUID=ac421f62-58b0-41b3-976f-fff3fbd791e8 swap                    swap    defaults        0 0
UUID=1809567f-9f01-4909-9625-11875230ecd8 swap                    swap    defaults        0 0


~]# cat /proc/partitions 
major minor  #blocks  name

   8        0  976762584 sda
   8        1       1024 sda1
   8        2     512000 sda2
   8        3  104857600 sda3
   8        4    5136384 sda4
   8       16  976762584 sdb
   8       17       1024 sdb1
   8       18     512000 sdb2
   8       19  104857600 sdb3
   8       20    5136384 sdb4
  11        0    1048575 sr0
   9        1  104856504 md1
   9        0     511988 md0
   8       32  586061784 sdc
   8       33  586059776 sdc1
   8      144  586061784 sdj
   8      145  586059776 sdj1
   9        4  586058616 md4
   8       80 1953514584 sdf
   8       81   34953216 sdf1
   8       82 1918560256 sdf2
   8       48 1953514584 sdd
   8       49   34953216 sdd1
   8       50 1918560256 sdd2
   8      112 1953514584 sdh
   8      113   34953216 sdh1
   8      114 1918560256 sdh2
   8       64 1953514584 sde
   8       65   34953216 sde1
   8       66 1918560256 sde2
   8       96 1953514584 sdg
   8       97   34953216 sdg1
   8       98 1918560256 sdg2
   8      128 1953514584 sdi
   8      129   34953216 sdi1
   8      130 1918560256 sdi2
   9        2  104855040 md2
   9        3 5755676160 md3

~]# cat /proc/mdstat
Personalities : [raid1] [raid10] 
md3 : active raid10 sdd2[0] sdi2[5] sdh2[4] sde2[3] sdf2[2] sdg2[1]
      5755676160 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 2/43 pages [8KB], 65536KB chunk

md2 : active raid10 sde1[3] sdi1[5] sdg1[1] sdd1[0] sdh1[4] sdf1[2]
      104855040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md4 : active raid1 sdj1[1] sdc1[0]
      586058616 blocks super 1.2 [2/2] [UU]
      bitmap: 0/5 pages [0KB], 65536KB chunk

md0 : active raid1 sdb2[1] sda2[0]
      511988 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sdb3[1] sda3[0]
      104856504 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

Comment 12 Anthony Messina 2012-05-30 10:31:33 UTC
Created attachment 587672 [details]
cat /proc/cpuinfo

Comment 13 Anthony Messina 2012-05-30 10:46:51 UTC
Created attachment 587677 [details]
dmesg output

Comment 14 Jes Sorensen 2012-05-30 11:13:47 UTC
Based on our IRC discussion, I am noting the following points for the
record here:

1) The problem seems to happen only with drives sitting on the mpt controller
2) The mpt controller fires much later in the boot process than the AHCI
   controller, by which point systemd is already up and running

My guess here is that we hit issues with the probe order of the drives in
the raid10, and try to launch it before the required drives are added.
I am not quite sure how to get around this yet, but it is worth trying to
force the mpt2sas driver to be loaded before systemd is set free.

rd.driver.pre=mpt2sas ought to do it.
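
Something along these lines (a sketch; the grub2 file locations assume a
standard F17 install):

# add to the kernel command line, e.g. GRUB_CMDLINE_LINUX in /etc/default/grub
rd.driver.pre=mpt2sas

# then regenerate the grub configuration and the initramfs
grub2-mkconfig -o /boot/grub2/grub.cfg
dracut -f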

Jes

Comment 15 Anthony Messina 2012-05-30 13:05:22 UTC
Created attachment 587729 [details]
dmesg output with / and /boot on drives in SAS box

Comment 16 Doug Ledford 2012-05-30 15:31:41 UTC
Jes,

This is functionally the same as the problem HPA was having with his raid6.  The fact that it's a raid10 here is largely irrelevant.  The problem is the two different types of controllers, with two different controller drivers, and the system starting to bring things up before all the drives have completed their device scan.

I had initially thought that scsi_wait_scan was broken.  Now I'm thinking the issue is just slightly different.  I think scsi_wait_scan may be doing what it's supposed to do, but we are attempting to bring up the arrays as soon as scsi_wait_scan completes, and that's the problem.  scsi_wait_scan waits until the low-level scsi scan is complete, but that's no guarantee that the udev add/change events have all been processed.  Instead, scsi_wait_scan scans the drives, that creates the device entries in the kernel, and that causes udev add/change events to get *queued*, but there is no mechanism in place to guarantee that A) udev has received the events and put them into its own internal queue or B) udev has completed processing the events.  And since bringing up the raid devices relies on the udev event queue having been processed, not the kernel scan queue, we have a race condition -- one that's normally not hit unless you have two different drivers that complete their events at vastly different times.

I think what may be needed here is for dracut to do a scsi_wait_scan followed by a udev settle (and maybe even a slight pause between the scsi_wait_scan and the udev settle to make sure that udev has registered all of the events and got them in its queue, or two subsequent udev settle calls -- whatever it takes to make sure that every event created by scsi_wait_scan is done before we proceed).
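
Roughly, the ordering I have in mind (a sketch only, not what dracut currently does):

modprobe scsi_wait_scan   # returns once the low-level SCSI scans have completed
udevadm settle            # wait for the uevents queued by the scan to be processed
sleep 1                   # brief pause so any late events get queued
udevadm settle            # settle again before assembling arrays / mounting root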

Comment 17 Doug Ledford 2012-05-30 15:38:55 UTC
Harald, can you comment on my assessment in comment #16?  Is dracut already doing the right thing?  We have several cases now that are pointing to some sort of race between scsi_wait_scan ending and device assembly, and my best educated guess at the moment is that we are waiting for scsi_wait_scan to complete, but not for all the queued udev events that scsi_wait_scan created.

Comment 18 Jes Sorensen 2012-05-30 15:46:55 UTC
Doug,

That sounds very likely; it was my conclusion as well, from discussing this
with Anthony, that something goes wrong in the order things are processed.
It also explains why I haven't seen any of this in my own testing, since so
far I haven't had any non-AHCI controllers locally (which will change next
week when the one I ordered should arrive). At that point I should be able
to test/experiment here too.

Adding HPA to this bug.

Cheers,
Jes

Comment 19 Anthony Messina 2012-05-30 18:46:02 UTC
At this point, after a reinstall of this F17 system, I can confirm the following about the two interfaces, AHCI and MPT2SAS:

1. If I install with md0: raid1 /boot and md1: raid1 / on AHCI, and md2: raid10 /home and md3: raid10 /srv on MPT2SAS, I DO get the failures noted in the initial bug report.

2. If I install with md0: raid1 /boot, md1: raid1 /, md2: raid10 /home, and md3: raid10 /srv all on MPT2SAS, I DO NOT get the failures.

This leads me (a non-expert tester) to lean further toward Doug's comment #16 as the cause of the problem.

I have not yet been able to test whether enabling the udev-settle service has any effect, as I will need to reinstall with option 1 above to try that.
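
For the record, that test would just be something like the following (the exact unit name is an assumption; it may be udev-settle.service or systemd-udev-settle.service depending on the systemd version):

systemctl enable systemd-udev-settle.service
systemctl reboot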

Comment 20 Harald Hoyer 2012-07-30 10:21:28 UTC
(In reply to comment #17)
> Harold, can you comment on my assessment in comment #16?  Is dracut already
> doing the right thing?  We have several cases now that are pointing to some
> sort of race between scsi_wait_scan ending and device assembly, and my best
> educated guess at the moment is that we are waiting for scsi_wait_scan to
> complete, but not for all the queued udev events that scsi_wait_scan created.

1. We do not do scsi_wait_scan anymore.
2. mdadm is run from udev rules, so it is done immediately when the device is registered and the uevent is processed.

Comment 21 Doug Ledford 2012-07-30 15:24:41 UTC
If you don't do scsi_wait_scan any more, then how do you know when the scan is complete and it is safe to proceed with mounting the root fs?

Comment 22 Harald Hoyer 2012-07-30 19:46:32 UTC
(In reply to comment #21)
> If you don't do scsi_wait_scan any more, then how do you know when the scan
> is complete and it is safe to proceed with mounting the root fs?

We are safe if the root device is present and mountable.

For md devices, dracut waits until all udev events have been processed, the system is idle, and the boot process has lasted at least some number of seconds (which can be specified with "rd.retry=<seconds>") before it tries to start the raid device in degraded mode.
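
For example, to give a slow controller more time before that degraded-mode fallback kicks in (the value is illustrative only):

# added to the kernel command line
rd.retry=40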

Comment 23 Jes Sorensen 2012-08-02 07:07:43 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > If you don't do scsi_wait_scan any more, then how do you know when the scan
> > is complete and it is safe to proceed with mounting the root fs?
> 
> We are safe, if the root device is present and mountable.
> 
> For md devices, dracut waits until all udev events have been processed and
> the system is idle and the boot process lasts at least some seconds, which
> can be specified by "rd.retry=<seconds>", before dracut tries to start the
> raid device in degraded mode.

How do you know all udev events have been processed? The problem we have seen
here is when someone has two SATA/SAS controllers in the box, and the second
controller is much slower doing the probing. Per the log posted previously,
the mpt2sas controller could take over 10 seconds to complete its probing.
In the example attached here, the mpt2sas controller completed its probing
16 seconds after the initial AHCI controller completed the first round.
See https://bugzilla.redhat.com/attachment.cgi?id=587729

Asking people with multiple controllers to set a timeout on the command line
might be a workaround, but we need to find a way for this to happen
automatically.

Cheers,
Jes

Comment 24 Harald Hoyer 2012-08-02 10:43:37 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > > If you don't do scsi_wait_scan any more, then how do you know when the scan
> > > is complete and it is safe to proceed with mounting the root fs?
> > 
> > We are safe, if the root device is present and mountable.
> > 
> > For md devices, dracut waits until all udev events have been processed and
> > the system is idle and the boot process lasts at least some seconds, which
> > can be specified by "rd.retry=<seconds>", before dracut tries to start the
> > raid device in degraded mode.
> 
> How do you know all udev events have been processed? The problem we have seen
> here is when someone has two SATA/SAS controllers in the box, and the second
> controller is much slower doing the probing. Per the log posted previously,
> the mpt2sas controller could take over 10 seconds to complete it's probing.
> In the example attached here, the mpt2sas controller completed it's probing
> 16 seconds after the initial AHCI controller completed the first round.
> See https://bugzilla.redhat.com/attachment.cgi?id=587729
> 
> Asking people with multiple controllers to set a timeout on the commandline
> might be a workaround, but we need to find a way for this to happen
> automatically.
> 
> Cheers,
> Jes

Then maybe the driver should not return from modprobe until it has initialized.
That would keep "udevadm settle" blocked.

In dracut we have to draw a line somewhere. No uevents pending, plus some amount of time passed, should mean that dracut may execute some fallback strategies on raids.

Another idea might be a /sys file with the information that disk probing is still in progress, which dracut could check.
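
To illustrate where that line currently gets drawn, the fallback logic is conceptually something like the sketch below (a shell pseudocode sketch, not the actual dracut initqueue scripts; /dev/md3 and RD_RETRY are stand-ins for whichever array is still missing and for the rd.retry value):

# conceptual sketch only
start=$SECONDS
rd_retry=${RD_RETRY:-30}        # taken from rd.retry=<seconds> on the kernel command line

udevadm settle                  # no uevents pending right now
while [ $(( SECONDS - start )) -lt "$rd_retry" ]; do
    [ -b /dev/md3 ] && exit 0   # the wanted array appeared; no fallback needed
    sleep 1
    udevadm settle              # keep settling as late controllers finish probing
done
mdadm -R /dev/md3               # last resort: start the array degraded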

Comment 25 Jes Sorensen 2012-08-02 10:52:56 UTC
Note I don't know whether the mpt2sas driver returns from modprobe before it
has finished probing or not; I just read the probe logs in the dmesg output.

However, there may be drivers doing async probing to avoid locking up the bus
for too long, in particular if they can have a large number of LUNs attached.

A /sys file might be an approach.

Jes

Comment 26 Tomas Henzl 2012-08-02 14:47:22 UTC
(In reply to comment #25)
> Note I don't know of the mpt2sas driver returns from modprobe before it is
> finished or not. I just read the probe logs in the dmesg output.

The mpt2sas driver uses an async scan and returns from modprobe immediately; the scan can then take a longer time afterwards.

Comment 27 Jes Sorensen 2012-08-02 15:10:55 UTC
Tomas,

Thank you for the clarification - is there any method for userland to detect
that probing is still ongoing?

Thanks,
Jes

Comment 28 Tomas Henzl 2012-08-02 15:32:18 UTC
(In reply to comment #27)
> Tomas,
> 
> Thank you for the clarification - is there any method for userland to detect
> that probing is still ongoing?
I'm not sure; 'modprobe scsi_wait_scan' used to work, but there have been some changes to it and I don't know if it still works with the latest kernels.
A possible workaround is to switch to sync scan: http://lxr.linux.no/#linux+v3.5/drivers/scsi/scsi_scan.c#L102
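
That is, booting with the scan mode forced to synchronous (a workaround sketch only):

# kernel command line addition
scsi_mod.scan=sync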
> 
> Thanks,
> Jes

Comment 29 Jes Sorensen 2012-08-03 06:55:10 UTC
Tomas,

It sounds like we need to investigate a smarter API for querying whether we
have completed probing. I don't think switching to sync scan is the way to
go.

Thanks,
Jes

Comment 30 Tomas Henzl 2012-08-03 10:29:53 UTC
(In reply to comment #29)
> Tomas,
> 
> It sounds like we need to investigate a smarter API for querying whether we
> have completed probing. I don't think switching to sync scan is the way to
> go.
Sure, I mentioned this as a workaround. What is wrong with scsi_wait_scan?
> 
> Thanks,
> Jes

Comment 31 Rolf Fokkens 2012-11-08 11:30:16 UTC
For informational purposes (Tainted Kernel): Same problem here on Fedora 16:

* kernel-3.6.5-2.fc16.x86_64
* mdadm-3.2.5-4.fc16.x86_64

This is the first reboot after upgrading to these packages; I'm not sure if there is a relation between these packages and the issue.

After reboot I ran into this mess:
[root@kvm01 RAID10]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] [raid10] [raid1] 
md4 : active raid10 sdd6[2] sda6[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/2] [U_U_]
      
md8 : active (auto-read-only) raid10 sde10[3] sdb10[1] sda10[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/3] [UU_U]
      
md5 : active raid10 sde7[3] sda7[0] sdb7[1]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/3] [UU_U]
      
md9 : active (auto-read-only) raid10 sda11[0] sdb11[1] sde11[3]
      1289924608 blocks super 1.2 512K chunks 2 far-copies [4/3] [UU_U]
      
md2 : inactive sde4[3] sda4[0]
      204797952 blocks super 1.2
       
md0 : active raid1 sda2[0]
      511988 blocks super 1.0 [4/1] [U___]
      
md6 : active raid10 sdd8[2] sde8[3] sdb8[1] sda8[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md3 : active raid10 sdd5[2] sdb5[1] sde5[3] sda5[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md7 : active raid10 sdb9[1] sdd9[2] sde9[3] sda9[0]
      205307904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md1 : active raid6 sda3[0] sdd3[2] sdb3[1] sde3[3]
      204796928 blocks super 1.1 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

None of the supposedly missing devices was actually missing, but apparently during boot they were.

I was lucky that the system was able to boot (the root FS was OK). After the boot I was able to make all of these RAID devices active again (although the resync is still running).

Doing tests on this system is possible, although it requires careful planning because of the users of the system.

Comment 32 Rolf Fokkens 2012-11-08 11:32:34 UTC
Created attachment 640740 [details]
dmesg output of F16 system

This is the dmesg output of my F16 system showing a failing incremental RAID assembly during boot.

Comment 33 Jes Sorensen 2012-11-09 10:34:32 UTC
Rolf,

Just to be sure I got the data right, your sda4/sde4/sda2 are supposed to be
part of the other arrays? Could you post the output of a 'correct'
/proc/mdstat as well?

Your problem is somewhat different from the other reports of this problem, since
you seem to have all the disks attached to a single AHCI controller in the
system. The other reports were mostly caused by having raid arrays spanning two
controllers (AHCI and MPT SAS) which got probed at different times.

What version of dracut do you have installed?

Thanks,
Jes

Comment 34 Rolf Fokkens 2012-11-09 23:35:59 UTC
Hi Jes,

Here's my mdstat, with all arrays operational again:

[rolf@kvm01 ~]$ cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] [raid10] [raid1] 
md2 : active raid10 sda4[0] sde4[3] sdd4[2] sdb4[1]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md4 : active raid10 sde6[5] sdb6[4] sdd6[2] sda6[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md8 : active raid10 sdd10[2] sde10[3] sdb10[1] sda10[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md5 : active raid10 sdd7[4] sde7[3] sda7[0] sdb7[1]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md9 : active raid10 sdd11[2] sda11[0] sdb11[1] sde11[3]
      1289924608 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md0 : active raid1 sde2[6] sdd2[5] sdb2[4] sda2[0]
      511988 blocks super 1.0 [4/4] [UUUU]
      
md6 : active raid10 sdd8[2] sde8[3] sdb8[1] sda8[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md3 : active raid10 sdd5[2] sdb5[1] sde5[3] sda5[0]
      204795904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md7 : active raid10 sdb9[1] sdd9[2] sde9[3] sda9[0]
      205307904 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
      
md1 : active raid6 sda3[0] sdd3[2] sdb3[1] sde3[3]
      204796928 blocks super 1.1 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
[rolf@kvm01 ~]$

dracut is dracut-018-60.git20120927.fc16.noarch

One observation I made was that the topmost arrays in /proc/mdstat were affected, and not the bottom half.  I think the arrays show up in /proc/mdstat in the reverse order in which they were built during boot.  This suggests that at a certain point not all of the partitions required for those arrays had been recognized by the time the respective arrays were assembled.

If you need more information, let me know.

Rolf

Comment 35 Rolf Fokkens 2013-06-14 18:42:22 UTC
Reinstalled the system with fewer RAID partitions per hard disk. For me the issue is gone now, though I think the true cause hasn't been found.

Comment 36 Fedora End Of Life 2013-07-03 22:55:02 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 37 Anthony Messina 2013-09-04 06:38:15 UTC
(In reply to Rolf Fokkens from comment #35)
> Reinstalled the system with less RAID partitions per hard disk. For me the
> issue is gone now, though I think the true cause hasn't been found.

I have since yum upgraded through Fedora 18 and am now using Fedora 19.  I do not see this issue any longer, though like Rolf, I'm not sure the cause was ever actually identified.

Since I reported it, I'll close it.  Feel free to reopen it if there are related issues.