Bug 510486

Summary: mdadm should prefer multipath devices over raw devices
Product: [Fedora] Fedora Reporter: Tomasz Torcz <tomek>
Component: mdadm Assignee: Doug Ledford <dledford>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 11 CC: dledford, notting
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-02-25 12:55:30 UTC Type: ---

Description Tomasz Torcz 2009-07-09 13:11:14 UTC
Description of problem:
I have 2 SCSI HBAs in a server; both are connected to the same disk array. The array exports 4 LUNs. Each LUN is accessible through both HBAs, and each pair of paths is combined into one multipath device. The resulting multipath devices are then striped with mdadm. After a reboot, or when I run "mdadm --assemble --scan", mdadm finds the superblocks and starts the array. Problem: the /dev/sdX devices are checked before /dev/mapper/mpath*, so the array is assembled from single-path devices instead of the multipath devices built on top of them. Manual assembly works OK.

System configuration is:
# cat /proc/partitions 
major minor  #blocks  name

   8        0   20010312 sda
   8        1     204800 sda1
   8        2   19803795 sda2
   8       16    1048576 sdb
   8       32 1708417024 sdc
   8       64 1708417024 sde
   8       80 1708418048 sdf
   8       48 1708418048 sdd
   8       96 1708417024 sdg
   8      112 1708418048 sdh
   8      144 1708418048 sdj
   8      128 1708417024 sdi

# multipath -l
mpathe (3600d02300070e177015396558a446f01) dm-3 VW,VRU1610
size=1.6T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- 5:0:0:3 sdf 8:80  active undef running
  `- 6:0:0:3 sdj 8:144 active undef running
mpathd (3600d02300070e1770153966a4476c601) dm-4 VW,VRU1610
size=1.6T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- 5:0:0:1 sdd 8:48  active undef running
  `- 6:0:0:1 sdh 8:112 active undef running
mpathc (3600d02300070e1770153966a4476c600) dm-1 VW,VRU1610
size=1.6T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- 5:0:0:0 sdc 8:32  active undef running
  `- 6:0:0:0 sdg 8:96  active undef running
mpathb (3600144f04a3f3f3b00080027490cc100) dm-0 SUN,SOLARIS
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-1 status=active
  `- 4:0:0:0 sdb 8:16  active undef running
mpathf (3600d02300070e177015396558a446f00) dm-2 VW,VRU1610
size=1.6T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- 5:0:0:2 sde 8:64  active undef running
  `- 6:0:0:2 sdi 8:128 active undef running

As can be seen, "sdf" and "sdj" are the exact same LUN as seen through the two HBAs. Likewise sdd/sdh, sdc/sdg, sde/sdi.
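The pairing can also be read off mechanically. Below is a minimal Python sketch that groups path devices by multipath map, assuming the text layout of the "multipath -l" output quoted above; the function name parse_multipath and the regular expressions are illustrative, not part of any multipath tooling, and other multipath-tools versions may print a different layout.

```python
import re

# Excerpt of the `multipath -l` output quoted in this report (two of five maps).
SAMPLE = """\
mpathe (3600d02300070e177015396558a446f01) dm-3 VW,VRU1610
size=1.6T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- 5:0:0:3 sdf 8:80  active undef running
  `- 6:0:0:3 sdj 8:144 active undef running
mpathb (3600144f04a3f3f3b00080027490cc100) dm-0 SUN,SOLARIS
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-1 status=active
  `- 4:0:0:0 sdb 8:16  active undef running
"""

# Map header line: "<name> (<wwid>) dm-N <vendor,product>"
MAP_RE = re.compile(r"^(\S+) \((\w+)\) (dm-\d+) ")
# Path line: "|- H:C:T:L sdX MAJ:MIN ..." or "`- H:C:T:L sdX MAJ:MIN ..."
PATH_RE = re.compile(r"^\s*[|`]-\s+\d+:\d+:\d+:\d+\s+(\S+)\s+\d+:\d+")


def parse_multipath(text):
    """Return {map name: [path devices]} from `multipath -l` style text."""
    maps = {}
    current = None
    for line in text.splitlines():
        header = MAP_RE.match(line)
        if header:
            current = header.group(1)
            maps[current] = []
            continue
        path = PATH_RE.match(line)
        if path and current is not None:
            maps[current].append(path.group(1))
    return maps


if __name__ == "__main__":
    for name, paths in parse_multipath(SAMPLE).items():
        print(name, "->", ", ".join(paths))
```

A map with a single path, such as mpathb here, comes out with one entry, so real two-path LUNs are easy to tell apart from it.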


After autoconfiguration (WRONG):
# mdadm -Q --detail /dev/md127 | tail -5
    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       2       8       80        2      active sync   /dev/sdf
       3       8       64        3      active sync   /dev/sde
md created on raw devices.

Manual configuration:
# mdadm --assemble auto /dev/mapper/mpath{c,d,e,f}
mdadm: auto has been started with 4 drives.
# mdadm -Q --detail /dev/md127 | tail -5
    Number   Major   Minor   RaidDevice State
       0     253        1        0      active sync   /dev/dm-1
       1     253        4        1      active sync   /dev/dm-4
       2     253        3        2      active sync   /dev/dm-3
       3     253        2        3      active sync   /dev/dm-2

md created over multipath devices.

/dev/sda1: UUID="57f8b94e-dc62-42bf-96fd-b17c40bc0b53" SEC_TYPE="ext2" TYPE="ext3" 
/dev/sda2: LABEL="lnx_test" UUID="0af52868-6378-4511-ac3f-8076caca907d" TYPE="btrfs" UUID_SUB="c1cbea8e-f38d-4f2e-abef-3da6a9239346" 
/dev/sdb: UUID="ef5d9562-d047-4351-afbd-02f6ad8faecf" TYPE="swap" 
/dev/sdc: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sdd: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sdf: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sde: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sdg: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sdh: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sdi: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/mapper/mpathc: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/sdj: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/mapper/mpathd: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/mapper/mpathe: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/mapper/mpathf: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member" 
/dev/mapper/mpathb: UUID="ef5d9562-d047-4351-afbd-02f6ad8faecf" TYPE="swap" 
/dev/md127: LABEL="visioraid" UUID="12502f37-1061-4884-9c8b-a9b4a53167b8" TYPE="ext4" 
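The listing above looks like blkid output (the report does not name the command). It shows why the raw devices and the multipath devices are equally valid candidates: every member line carries the same UUID. A minimal Python sketch that groups such lines by UUID, assuming the DEVICE: KEY="value" layout shown above; the name raid_members is illustrative only.

```python
import re
from collections import defaultdict

# A few lines copied from the listing in this report.
SAMPLE = """\
/dev/sdb: UUID="ef5d9562-d047-4351-afbd-02f6ad8faecf" TYPE="swap"
/dev/sdc: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member"
/dev/sdg: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member"
/dev/mapper/mpathc: UUID="f309cd10-08e0-d4ab-3b59-0416b272506e" TYPE="linux_raid_member"
/dev/mapper/mpathb: UUID="ef5d9562-d047-4351-afbd-02f6ad8faecf" TYPE="swap"
/dev/md127: LABEL="visioraid" UUID="12502f37-1061-4884-9c8b-a9b4a53167b8" TYPE="ext4"
"""

LINE_RE = re.compile(r"^(\S+):\s+(.*)$")   # "<device>: <tags>"
TAG_RE = re.compile(r'(\w+)="([^"]*)"')    # KEY="value"


def raid_members(text):
    """Return {UUID: [devices]} for lines tagged TYPE="linux_raid_member"."""
    members = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        tags = dict(TAG_RE.findall(match.group(2)))
        if tags.get("TYPE") == "linux_raid_member":
            members[tags.get("UUID", "?")].append(match.group(1))
    return dict(members)


if __name__ == "__main__":
    for uuid, devices in raid_members(SAMPLE).items():
        print(uuid, devices)
```

On the sample, sdc, sdg and mpathc all land under one UUID, which is the ambiguity this report is about.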


Versions:
mdadm-3.0-1.fc12.x86_64
initscripts-8.95-1.x86_64
device-mapper-multipath-0.4.9-1.fc12.x86_64


How reproducible:
Always.

Steps to Reproduce:
1. Create a LUN visible as two devices and use multipath to combine them.
2. Create an md array on the /dev/mapper/mpathX device.
3. Stop the md array.
4. Run mdadm --assemble --scan
  
Actual results:
mdadm checks the raw LUN devices and builds the array from them. The multipath configuration is ignored.


Expected results:
mdadm should scan multipath devices first and try to build the array from them. Raw sdX devices should be checked only afterwards.

Additional info:
I also believe that early md initialization prevents multipath from initializing properly, as half of the LUN devices are already claimed by md.

Comment 1 Doug Ledford 2010-02-19 19:44:51 UTC
This is not likely something that can be solved in mdadm (at least not entirely).  There are two problems:

1) mdadm device assembly is done before multipath device assembly in rc.sysinit.  This means that md assembly will always grab one of a multipath device's constituent paths before device-mapper can claim it.

2) mdadm assembles devices incrementally these days. That means it doesn't look at all available devices and then decide what to assemble; it looks only at one specific device, the one udev tells it to look at, and then decides whether or not to assemble it.  As such, it's not possible to make mdadm do "when given the choice between sda, sde, or dm-1, choose dm-1 first".
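The difference between the two approaches can be illustrated with a toy model in Python. This is not mdadm's code, and its real handling of a second device claiming the same slot is more involved. The slot numbers come from the "mdadm -Q --detail" output earlier in this report, the event order (sd devices before dm devices) is an assumption based on point 1, and batch_prefer_dm is a hypothetical policy that mdadm does not implement.

```python
# (device, RaidDevice slot) pairs. Slots are taken from the `mdadm -Q --detail`
# output in this report; the order (sdX first, dm-N last) is an assumption.
CANDIDATES = [
    ("sdc", 0), ("sdd", 1), ("sde", 3), ("sdf", 2),
    ("sdg", 0), ("sdh", 1), ("sdi", 3), ("sdj", 2),
    ("dm-1", 0), ("dm-4", 1), ("dm-2", 3), ("dm-3", 2),
]


def incremental(events):
    """One device at a time: the first device seen for a slot keeps it."""
    array = {}
    for dev, slot in events:
        array.setdefault(slot, dev)
    return [array[slot] for slot in sorted(array)]


def batch_prefer_dm(candidates):
    """Hypothetical whole-set policy: see everything, prefer dm-* per slot."""
    array = {}
    for dev, slot in candidates:
        taken = array.get(slot)
        if taken is None or (dev.startswith("dm-") and not taken.startswith("dm-")):
            array[slot] = dev
    return [array[slot] for slot in sorted(array)]


if __name__ == "__main__":
    print("incremental:", incremental(CANDIDATES))
    print("batch, dm preferred:", batch_prefer_dm(CANDIDATES))
```

In this model the incremental result depends only on arrival order, so a preference rule has nothing to act on unless the unwanted devices are never offered in the first place.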

So, mainly due to #2 above, I think the only way you are going to be able to run md raid over dm-multipath is to edit your /etc/mdadm.conf file and change the DEVICE option from partitions to /dev/dm*. At that point mdadm should refuse to create a raid array from the /dev/sd?? devices even when udev calls mdadm's incremental assembly on them, and should instead perform the assembly only when the multipath devices come online.
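As a rough sketch of that edit, the Python function below rewrites the DEVICE line in the text of an mdadm.conf. The name restrict_devices and the sample file contents are illustrative assumptions; the sketch ignores continuation lines and abbreviated keywords, and the workaround itself was never tested in this report.

```python
def restrict_devices(conf_text, pattern="/dev/dm*"):
    """Return conf_text with its DEVICE setting replaced by `DEVICE <pattern>`.

    Simplifications (assumptions of this sketch): only the first DEVICE line
    is rewritten, later DEVICE lines are dropped, and indented continuation
    lines are left untouched.
    """
    new_line = "DEVICE " + pattern
    out = []
    done = False
    for line in conf_text.splitlines():
        words = line.split()
        is_device = bool(words) and words[0].upper() == "DEVICE" and not line[0].isspace()
        if is_device:
            if not done:
                out.append(new_line)
                done = True
            continue
        out.append(line)
    if not done:
        # No DEVICE line present: add one at the end.
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical minimal config, for illustration only.
    before = "DEVICE partitions\nMAILADDR root\n"
    print(restrict_devices(before), end="")
```

Working on the text rather than the file in place makes it easy to inspect the result before writing it back to /etc/mdadm.conf.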

I'm adding Bill Nottingham to the Cc list on this so he can confirm my suspicion. If I'm right and the workaround fixes your issue, then that's likely the best that can be done in mdadm, at least. Making this happen automatically would need anaconda changes, so that on install anaconda knew to modify mdadm.conf to ignore multipath constituent devices.

Comment 2 Doug Ledford 2010-02-19 19:45:19 UTC
Please let us know if the workaround does in fact work.

Comment 3 Tomasz Torcz 2010-02-25 12:55:30 UTC
I'm sorry, I don't have that hardware anymore, so I cannot test. Your workaround seems plausible.