Bug 790310

Summary: Devices added to degraded md RAID10 array with o2 layout do not become active
Product: Fedora
Version: 16
Component: kernel
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED NOTABUG
Reporter: Alexander Murashkin <alexandermurashkin>
Assignee: Jes Sorensen <Jes.Sorensen>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Doc Type: Bug Fix
Bug Blocks: 802691
Last Closed: 2012-03-19 14:29:24 UTC

Description Alexander Murashkin 2012-02-14 08:08:06 UTC
Description of problem:

If an md RAID10 array with the o2 (offset=2) layout is degraded, there is no way to recover it. Devices added to such an array become spare devices and are not used by the kernel for recovery.

Right after the devices are added, the kernel logs the syslog message "md/raid10:mdNN: insufficient working devices for recovery".

Version-Release number of selected component (if applicable):

kernel-3.2.5-3.fc16.x86_64

How reproducible:

Steps to Reproduce:
1. Create 4 identical partitions, for example, /dev/sd[cdef]1
2. mdadm --create /dev/md25 --raid-devices=4 --chunk=512 --level=raid10 --layout=o2 --assume-clean  /dev/sdc1 missing missing /dev/sdf1
3. mdadm /dev/md25 --add /dev/sdd1
4. mdadm /dev/md25 --add /dev/sde1
5. mdadm --detail /dev/md25
....      
0       8       33        0      active sync   /dev/sdc1
4       8       49        -      spare         /dev/sdd1
5       8       65        -      spare         /dev/sde1
3       8       81        3      active sync   /dev/sdf1
  
Actual results:

Added devices do not become active. Recovery starts but fails immediately with the syslog message "insufficient working devices for recovery".

Expected results:

Added devices become active after a successful recovery:

       0       8       33        0      active sync   /dev/sdc1
       4       8       49        1      active sync   /dev/sdd1
       5       8       65        2      spare rebuilding   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

Additional info:

I checked that this problem does not happen for the default RAID10 layout (near=2).

The md25 array is a test array. I also have a degraded production array with the o2 layout that cannot be recovered.

-------- using layout offset=2 ----------------------

Syslog messages are at the bottom.

# mdadm --create /dev/md25 --raid-devices=4 --chunk=512 --level=raid10 --layout=o2 --assume-clean  /dev/sdc1 missing missing /dev/sdf1
# mdadm /dev/md25 --add /dev/sdd1
# mdadm /dev/md25 --add /dev/sde1
# mdadm --detail /dev/md25
/dev/md25:
        Version : 1.2
  Creation Time : Tue Feb 14 01:38:52 2012
     Raid Level : raid10
     Array Size : 1054720 (1030.17 MiB 1080.03 MB)
  Used Dev Size : 527360 (515.09 MiB 540.02 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Feb 14 01:39:55 2012
          State : clean, degraded 
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2

         Layout : offset=2
     Chunk Size : 512K

           Name : glaive.castle.aimk.com:25  (local to host glaive.castle.aimk.com)
           UUID : 72e4ed21:5ba59fbc:a4402111:62aa08db
         Events : 21

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       0        0        1      removed
       2       0        0        2      removed
       3       8       81        3      active sync   /dev/sdf1

       4       8       49        -      spare   /dev/sdd1
       5       8       65        -      spare   /dev/sde1

-------- using default layout near=2 ----------------------

# mdadm --create /dev/md25 --raid-devices=4 --chunk=512 --level=raid10 --assume-clean  /dev/sdc1 missing missing /dev/sdf1
# mdadm /dev/md25 --add /dev/sdd1
# mdadm /dev/md25 --add /dev/sde1
# mdadm --detail /dev/md25
/dev/md25:
        Version : 1.2
  Creation Time : Tue Feb 14 01:37:17 2012
     Raid Level : raid10
     Array Size : 1055744 (1031.17 MiB 1081.08 MB)
  Used Dev Size : 527872 (515.59 MiB 540.54 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Feb 14 01:37:37 2012
          State : clean, degraded, recovering 
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 35% complete

           Name : glaive.castle.aimk.com:25  (local to host glaive.castle.aimk.com)
           UUID : ccaca5de:69982fad:d64d233b:436c5618
         Events : 11

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       4       8       49        1      spare rebuilding   /dev/sdd1
       2       0        0        2      removed
       3       8       81        3      active sync   /dev/sdf1

       5       8       65        -      spare   /dev/sde1

[root@glaive md]# mdadm --detail /dev/md25
/dev/md25:
        Version : 1.2
  Creation Time : Tue Feb 14 01:37:17 2012
     Raid Level : raid10
     Array Size : 1055744 (1031.17 MiB 1081.08 MB)
  Used Dev Size : 527872 (515.59 MiB 540.54 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Feb 14 01:37:53 2012
          State : clean, degraded, recovering 
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 73% complete

           Name : glaive.castle.aimk.com:25  (local to host glaive.castle.aimk.com)
           UUID : ccaca5de:69982fad:d64d233b:436c5618
         Events : 40

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       4       8       49        1      active sync   /dev/sdd1
       5       8       65        2      spare rebuilding   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

[root@glaive md]# mdadm --detail /dev/md25
/dev/md25:
        Version : 1.2
  Creation Time : Tue Feb 14 01:37:17 2012
     Raid Level : raid10
     Array Size : 1055744 (1031.17 MiB 1081.08 MB)
  Used Dev Size : 527872 (515.59 MiB 540.54 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Feb 14 01:37:53 2012
          State : clean, degraded, recovering 
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

           Name : glaive.castle.aimk.com:25  (local to host glaive.castle.aimk.com)
           UUID : ccaca5de:69982fad:d64d233b:436c5618
         Events : 40

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       4       8       49        1      active sync   /dev/sdd1
       5       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

--- syslog for offset=2 ------------------------------------------------


Feb 14 01:39:50 glaive kernel: [ 5378.754962] md: bind<sde1>
Feb 14 01:39:50 glaive kernel: [ 5378.782013] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.782015]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.782017]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.782018]  disk 1, wo:1, o:1, dev:sde1
Feb 14 01:39:50 glaive kernel: [ 5378.782020]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.782026] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.782027]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.782029]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.782030]  disk 1, wo:1, o:1, dev:sde1
Feb 14 01:39:50 glaive kernel: [ 5378.782032]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:50 glaive kernel: [ 5378.782033]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.786470] md: recovery of RAID array md25
Feb 14 01:39:50 glaive kernel: [ 5378.786472] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Feb 14 01:39:50 glaive kernel: [ 5378.786474] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb 14 01:39:50 glaive kernel: [ 5378.786477] md: using 128k window, over a total of 527360k.
Feb 14 01:39:50 glaive kernel: [ 5378.786560] md/raid10:md25: insufficient working devices for recovery.
Feb 14 01:39:50 glaive kernel: [ 5378.786573] md: md25: recovery done.
Feb 14 01:39:50 glaive kernel: [ 5378.869515] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.869518]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.869521]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.869523]  disk 1, wo:1, o:1, dev:sde1
Feb 14 01:39:50 glaive kernel: [ 5378.869526]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:50 glaive kernel: [ 5378.869528]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.869570] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.869573]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.869575]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.869578]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:50 glaive kernel: [ 5378.869580]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.869585] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.869586]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.869588]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.869590]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:50 glaive kernel: [ 5378.869593]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.869594] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.869596]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.869598]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.869605]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:50 glaive kernel: [ 5378.869606]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.869608] RAID10 conf printout:
Feb 14 01:39:50 glaive kernel: [ 5378.869609]  --- wd:2 rd:4
Feb 14 01:39:50 glaive kernel: [ 5378.869610]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:50 glaive kernel: [ 5378.869611]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:50 glaive kernel: [ 5378.869613]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:50 glaive kernel: [ 5378.869639] md: recovery of RAID array md25
Feb 14 01:39:50 glaive kernel: [ 5378.869641] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Feb 14 01:39:50 glaive kernel: [ 5378.869642] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb 14 01:39:50 glaive kernel: [ 5378.869645] md: using 128k window, over a total of 527360k.
Feb 14 01:39:50 glaive kernel: [ 5378.869796] md/raid10:md25: insufficient working devices for recovery.
Feb 14 01:39:50 glaive kernel: [ 5378.869809] md: md25: recovery done.
Feb 14 01:39:51 glaive kernel: [ 5378.907833] RAID10 conf printout:
Feb 14 01:39:51 glaive kernel: [ 5378.907836]  --- wd:2 rd:4
Feb 14 01:39:51 glaive kernel: [ 5378.907839]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:51 glaive kernel: [ 5378.907841]  disk 2, wo:1, o:1, dev:sdd1
Feb 14 01:39:51 glaive kernel: [ 5378.907843]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:51 glaive kernel: [ 5378.911009] RAID10 conf printout:
Feb 14 01:39:51 glaive kernel: [ 5378.911012]  --- wd:2 rd:4
Feb 14 01:39:51 glaive kernel: [ 5378.911014]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:51 glaive kernel: [ 5378.911016]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:51 glaive kernel: [ 5378.911022] RAID10 conf printout:
Feb 14 01:39:51 glaive kernel: [ 5378.911023]  --- wd:2 rd:4
Feb 14 01:39:51 glaive kernel: [ 5378.911025]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:51 glaive kernel: [ 5378.911027]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:51 glaive kernel: [ 5378.911029] RAID10 conf printout:
Feb 14 01:39:51 glaive kernel: [ 5378.911030]  --- wd:2 rd:4
Feb 14 01:39:51 glaive kernel: [ 5378.911032]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:51 glaive kernel: [ 5378.911035]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:51 glaive md: RebuildFinished /dev/md25 [ clean, degraded ]
Feb 14 01:39:51 glaive kernel: [ 5378.960666] RAID10 conf printout:
Feb 14 01:39:51 glaive kernel: [ 5378.960668]  --- wd:2 rd:4
Feb 14 01:39:51 glaive kernel: [ 5378.960670]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:51 glaive kernel: [ 5378.960672]  disk 3, wo:0, o:1, dev:sdf1
Feb 14 01:39:51 glaive kernel: [ 5378.960673] RAID10 conf printout:
Feb 14 01:39:51 glaive kernel: [ 5378.960674]  --- wd:2 rd:4
Feb 14 01:39:51 glaive kernel: [ 5378.960676]  disk 0, wo:0, o:1, dev:sdc1
Feb 14 01:39:51 glaive kernel: [ 5378.960677]  disk 3, wo:0, o:1, dev:sdf1

Comment 3 Jes Sorensen 2012-03-19 14:29:24 UTC
I spent a fair amount of time looking through this one. After consulting Neil,
who is the upstream mdadm maintainer, I received the following explanation:

"o2 places data thus:

  A  B  C  D
  D  A  B  C

where columns are devices.

You've created an array with no place to store B.
mdadm really shouldn't let you do that.  That is the bug."

In other words, this is not a kernel bug; the real issue is that mdadm shouldn't
allow you to create such a setup in the first place.
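
To make this concrete, here is a small Python sketch (my own illustration of the placement rule in the diagram above, not mdadm code) showing which raid devices hold the two copies of each chunk in this 4-device o2 array:

n = 4                      # raid devices
present = {0, 3}           # array was created with raid devices 1 and 2 missing

# offset=2 per the diagram: data row A B C D, copy row D A B C,
# i.e. chunk i lives on device i and its copy on device (i + 1) % n
for chunk, name in enumerate("ABCD"):
    copies = {chunk % n, (chunk + 1) % n}
    status = "readable" if copies & present else "LOST"
    print("chunk %s: devices %s -> %s" % (name, sorted(copies), status))

Chunk B lives only on raid devices 1 and 2, both missing, so there is no surviving copy to rebuild the new spares from -- hence "insufficient working devices for recovery".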

Cheers,
Jes

Comment 4 Alexander Murashkin 2012-03-20 01:41:10 UTC
> You've created an array with no place to store B.
> mdadm really shouldn't let you do that.  That is the bug.

I can see that for the example below:

mdadm --create /dev/md25 --raid-devices=4 --chunk=512 --level=raid10 --layout=o2 --assume-clean /dev/sdc1 missing missing /dev/sdf1

But in real life I created a RAID10 array using 4 disks. After 2 disks were disconnected for whatever reason, I was not able to add them back.

So, as I understand it, RAID10 with the o2 layout does not survive the loss of any two adjacent disks (1+2, 2+3, 3+4, 4+1), whereas with the n2 layout only 2 combinations are fatal (1+2, 3+4); the sketch below enumerates both cases.
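
Here is a small Python sketch (my own model of the copy placement, not mdadm code) that enumerates the fatal two-disk losses for both layouts on a 4-device array (devices numbered 0-3 rather than 1-4):

from itertools import combinations

n = 4
# devices holding the two copies of chunk i, over one repeating period
layouts = {
    "n2": lambda i: {(2 * i) % n, (2 * i + 1) % n},  # rows: A A B B / C C D D
    "o2": lambda i: {i % n, (i + 1) % n},            # rows: A B C D / D A B C
}

for name, copies in layouts.items():
    # a two-disk loss is fatal if some chunk has both copies on the lost pair
    fatal = [pair for pair in combinations(range(n), 2)
             if any(copies(i) <= set(pair) for i in range(n))]
    print(name, "fatal two-disk losses:", fatal)

This prints 2 fatal pairs for n2 ((0,1) and (2,3)) and 4 for o2 ((0,1), (1,2), (2,3), (0,3)), matching the combinations above.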

So o2 is less reliable than n2. In my opinion, it shall be mentioned in the documentation.