Bug 509396

Summary: kpartx hangs when virtual disks are created/destroyed on a Dell MD3000i
Product: Red Hat Enterprise Linux 5
Reporter: Adam Huffman <bloch>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: low
Version: 5.3
CC: agk, bmarzins, bmr, christophe.varoqui, dwysocha, egoggin, eric, flakrat, heinzm, iannis, junichi.nomura, kueda, lmb, prockai, tranlan
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-06 17:19:40 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description         Flags
Udevmonitor dump    none
Message log         none

Description Adam Huffman 2009-07-02 15:29:27 UTC
Description of problem:

I've been running some tests using a Dell MD3000i connected to a server via iSCSI, using kernels provided by Don Zickus which include the MD3000i in the RDAC driver.

A couple of times now I have deleted a virtual disk on the array, flushed the multipath map, created a new disk and then attempted to reload the device map.
At this point, a kpartx process becomes stuck:

6626 ?        S<     0:00 /sbin/dmsetup ls --target multipath --exec /sbin/kpartx -a -p p -j 253 -m 0
6627 ?        D<     0:00 /sbin/kpartx -a -p p /dev/mapper/mpath3

and there are lots of messages like these:

Jul  2 16:16:36 marsantes multipathd: dm-0: add map (uevent) 
Jul  2 16:16:36 marsantes multipathd: mpath3: event checker started 
Jul  2 16:16:37 marsantes kernel: sd 7:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:37 marsantes kernel: sd 5:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:38 marsantes kernel: sd 5:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:38 marsantes kernel: sd 4:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:38 marsantes kernel: sd 4:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:38 marsantes kernel: sd 6:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:39 marsantes kernel: sd 6:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdb, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:16.
Jul  2 16:16:39 marsantes multipathd: 8:16: mark as failed
Jul  2 16:16:39 marsantes multipathd: mpath3: remaining active paths: 3
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdc, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:32.
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdd, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:48.
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdf, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:80.
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:40 marsantes multipathd: 8:32: mark as failed
Jul  2 16:16:40 marsantes multipathd: mpath3: remaining active paths: 2
Jul  2 16:16:40 marsantes multipathd: 8:48: mark as failed
Jul  2 16:16:40 marsantes multipathd: mpath3: remaining active paths: 1
Jul  2 16:16:40 marsantes multipathd: 8:80: mark as failed
Jul  2 16:16:40 marsantes multipathd: mpath3: remaining active paths: 0
Jul  2 16:16:43 marsantes multipathd: 8:16: reinstated
Jul  2 16:16:43 marsantes multipathd: mpath3: remaining active paths: 1
Jul  2 16:16:43 marsantes kernel: sd 7:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:43 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:43 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:43 marsantes kernel: sd 5:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:43 marsantes kernel: sd 5:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:43 marsantes kernel: sd 4:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:44 marsantes multipathd: 8:32: reinstated
Jul  2 16:16:44 marsantes multipathd: mpath3: remaining active paths: 2
Jul  2 16:16:44 marsantes multipathd: 8:48: reinstated
Jul  2 16:16:44 marsantes multipathd: mpath3: remaining active paths: 3
Jul  2 16:16:44 marsantes multipathd: 8:80: reinstated
Jul  2 16:16:44 marsantes multipathd: mpath3: remaining active paths: 4 
Jul  2 16:16:44 marsantes multipathd: dm-0: add map (uevent) 
Jul  2 16:16:44 marsantes multipathd: dm-0: devmap already registered 
Jul  2 16:16:44 marsantes multipathd: dm-0: add map (uevent) 
Jul  2 16:16:44 marsantes multipathd: dm-0: devmap already registered 
Jul  2 16:16:44 marsantes multipathd: dm-0: add map (uevent) 
Jul  2 16:16:44 marsantes multipathd: dm-0: devmap already registered 
Jul  2 16:16:44 marsantes kernel: sd 4:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:44 marsantes kernel: sd 6:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:44 marsantes kernel: sd 6:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:44 marsantes kernel: sd 5:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:45 marsantes kernel: sd 5:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:45 marsantes kernel: sd 4:0:0:1: queueing MODE_SELECT command.
Jul  2 16:16:46 marsantes kernel: sd 4:0:0:1: retrying MODE_SELECT command.
Jul  2 16:16:46 marsantes kernel: end_request: I/O error, dev sdb, sector 0
Jul  2 16:16:46 marsantes kernel: device-mapper: multipath: Failing path 8:16.
Jul  2 16:16:46 marsantes kernel: end_request: I/O error, dev sdb, sector 8
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:16.
Jul  2 16:16:39 marsantes multipathd: 8:16: mark as failed 
Jul  2 16:16:39 marsantes multipathd: mpath3: remaining active paths: 3 
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdc, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:32.
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdd, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:48.
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:39 marsantes kernel: end_request: I/O error, dev sdf, sector 0
Jul  2 16:16:39 marsantes kernel: device-mapper: multipath: Failing path 8:80.
Jul  2 16:16:39 marsantes multipathd: dm-0: add map (uevent)
Jul  2 16:16:39 marsantes multipathd: dm-0: devmap already registered
Jul  2 16:16:40 marsantes multipathd: 8:32: mark as failed
Jul  2 16:16:40 marsantes multipathd: mpath3: remaining active paths: 2
Jul  2 16:16:40 marsantes multipathd: 8:48: mark as failed
Jul  2 16:16:40 marsantes multipathd: mpath3: remaining active paths: 1
Jul  2 16:16:40 marsantes multipathd: 8:80: mark as failed
Jul  2 16:16:40 marsantes multipathd: mpath3: remaining active paths: 0
Jul  2 16:16:43 marsantes multipathd: 8:16: reinstated
Jul  2 16:16:43 marsantes multipathd: mpath3: remaining active paths: 1

running continuously and the only remedy is a reboot.
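
The fail/reinstate cycle in the log above can be summarised per path. The following is a minimal sketch, assuming the syslog format shown in this report; the function name `count_path_flaps` is hypothetical and not part of any multipath tooling. It counts how often multipathd marked each path (major:minor) as failed or reinstated.

```python
import re
from collections import Counter

# Matches multipathd lines such as "multipathd: 8:16: mark as failed"
# and "multipathd: 8:16: reinstated" (format taken from the log above).
PATH_EVENT = re.compile(r"multipathd: (\d+:\d+): (mark as failed|reinstated)")


def count_path_flaps(log_text):
    """Count failed/reinstated events per path (major:minor)."""
    counts = {}
    for match in PATH_EVENT.finditer(log_text):
        path, event = match.groups()
        counts.setdefault(path, Counter())[event] += 1
    return counts


sample = """\
Jul  2 16:16:39 marsantes multipathd: 8:16: mark as failed
Jul  2 16:16:39 marsantes multipathd: mpath3: remaining active paths: 3
Jul  2 16:16:40 marsantes multipathd: 8:32: mark as failed
Jul  2 16:16:43 marsantes multipathd: 8:16: reinstated
Jul  2 16:16:44 marsantes multipathd: 8:32: reinstated
"""

for path, events in sorted(count_path_flaps(sample).items()):
    print(path, dict(events))
```

A path whose failed and reinstated counts keep climbing together is flapping rather than permanently down, which matches the behaviour described here.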

I've tried commenting out the line in the udev rules as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=497041 but that hasn't had any effect.

Version-Release number of selected component (if applicable):
device-mapper-1.02.28-2.el5
device-mapper-multipath-0.4.7-23.el5_3.4
kpartx-0.4.7-23.el5_3.4


How reproducible:


Steps to Reproduce:
1. delete a disk on the MD3000i
2. create a disk on the MD3000i
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Eric 2009-08-11 14:31:24 UTC
Did you have iscsid rescan the array? (from console: iscsiadm -m node -R).

You have to make iscsid update its device nodes before you reload multipath.

Finally, I must say multipath seems to go berserk when a device is removed while it still has commands in the queue. Perhaps someone should look into that.
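
The ordering suggested in this comment (rescan iSCSI first, then touch multipath) could be scripted roughly as follows. This is only a sketch: `iscsiadm -m node -R` is the command quoted above, while `multipath -v2` is an assumption standing in for whatever command was actually used to rebuild the maps, and `rescan_and_reload` is a hypothetical helper name. It defaults to a dry run so nothing is executed unless requested.

```python
import subprocess

# Order suggested in the comment above: rescan iSCSI sessions first,
# then let multipath rebuild its maps.
RESCAN_THEN_RELOAD = [
    ["iscsiadm", "-m", "node", "-R"],  # quoted in the comment above
    ["multipath", "-v2"],              # assumption: plain multipath run, verbosity 2
]


def rescan_and_reload(dry_run=True, runner=subprocess.check_call):
    """Run the steps in order; with dry_run, only return the command lines."""
    executed = []
    for command in RESCAN_THEN_RELOAD:
        if not dry_run:
            runner(command)  # check_call raises CalledProcessError on non-zero exit
        executed.append(" ".join(command))
    return executed


print(rescan_and_reload(dry_run=True))
```

Stopping at the first failing command (the default behaviour of `check_call`) keeps multipath from being run against device nodes that were never refreshed.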

Comment 2 Ben Marzinski 2009-08-19 20:46:07 UTC
Could you also run udevmonitor, to see if the kernel is really throwing out all those uevents?

Comment 3 Eric 2009-08-25 09:19:07 UTC
Created attachment 358541
Udevmonitor dump

Udevmonitor dump while removing and re-adding a disk over iscsi.

Comment 4 Eric 2009-08-25 09:19:50 UTC
Created attachment 358542
Message log

Partial message log while removing and re-adding a disk over iscsi

Comment 5 Adam Huffman 2009-08-25 09:36:35 UTC
It seems much happier now, running kernel 2.6.18-164.el5.  I deleted a virtual disk yesterday, then added a new one.  When I rescanned the array and then rebuilt the multipath device map, the new disk appeared and there was no kpartx hang.

I can't play around with this particular device any more as it's going into production.  However, another one will be installed fairly soon and I can run more tests on that.

Comment 6 Eric 2009-08-25 09:44:56 UTC
Above is the requested udevmonitor dump. Please note that the addition and removal of the disk is done on the iSCSI target side.

I also noticed that iscsid did not remove the disk device nodes (in use by multipath?). Then again, I should check whether iscsid still removes them properly when multipath is not in use.

Finally, toying around with the iSCSI disks really screwed up multipath again:

36001c23000dd034d000009ca4795fe16 dm-6 DELL,MD3000i
[size=1.0G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][enabled]
 \_ 2:0:0:10 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=100][enabled]
 \_ 5:0:0:10 sdk 8:160 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 3:0:0:10 sdr 65:16 [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:10 sds 65:32 [active][ghost]
36001c23000dd030e000007534795b6af dm-4 DELL,MD3000i
[size=50G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:7  sdf 8:80  [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 5:0:0:7  sdg 8:96  [active][ghost]
\_ round-robin 0 [prio=100][active]
 \_ 3:0:0:7  sdn 8:208 [active][ready]
\_ round-robin 0 [prio=100][enabled]
 \_ 4:0:0:7  sdo 8:224 [active][ready]
1_ dm-19 DELL,MD3000i
[size=1.0G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:1  sdb 8:16  [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 5:0:0:1  sdc 8:32  [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 3:0:0:1  sdd 8:48  [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:1  sde 8:64  [active][ghost]
\_ round-robin 0 [prio=100][active]
 \_ 2:0:0:10 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=100][enabled]
 \_ 5:0:0:10 sdk 8:160 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 3:0:0:10 sdr 65:16 [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:10 sds 65:32 [active][ghost]
36001c23000dd034d0000098a4795ecb8 dm-5 DELL,MD3000i
[size=50G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]
 \_ 2:0:0:8  sdh 8:112 [active][ready]
\_ round-robin 0 [prio=100][enabled]
 \_ 5:0:0:8  sdi 8:128 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 3:0:0:8  sdp 8:240 [active][ghost]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:8  sdq 65:0  [active][ghost]

Note the designation "1_ dm-19 DELL,MD3000i". The paths of 2 disks that have been removed are grouped there. After re-adding one of those disks, "36001c23000dd034d000009ca4795fe16 dm-6 DELL,MD3000i" appears using the same paths as before. So now they're listed twice by multipath. 
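
The double listing can be detected mechanically. Below is a minimal sketch, assuming the `multipath -ll` layout shown above (a map header line containing a `dm-N` name, followed by `\_ H:C:T:L dev major:minor` path lines); the function name `paths_in_multiple_maps` is hypothetical. It reports every path device that appears under more than one map.

```python
import re
from collections import defaultdict

# Map header, e.g. "1_ dm-19 DELL,MD3000i" (first field is the wwid or alias).
MAP_HEADER = re.compile(r"^(\S+)\s+(dm-\d+)\s")
# Path line, e.g. " \_ 2:0:0:10 sdj 8:144 [active][ready]".
PATH_LINE = re.compile(r"^\s*\\_\s+(\d+:\d+:\d+:\d+)\s+(\S+)\s+(\d+:\d+)")


def paths_in_multiple_maps(listing):
    """Return {device: [dm names]} for devices listed under more than one map."""
    owners = defaultdict(list)
    current = None
    for line in listing.splitlines():
        header = MAP_HEADER.match(line)
        if header:
            current = header.group(2)
            continue
        path = PATH_LINE.match(line)
        if path and current is not None:
            owners[path.group(2)].append(current)
    return {dev: maps for dev, maps in owners.items() if len(maps) > 1}


# Abbreviated excerpt of the listing above.
sample = r"""36001c23000dd034d000009ca4795fe16 dm-6 DELL,MD3000i
\_ round-robin 0 [prio=100][enabled]
 \_ 2:0:0:10 sdj 8:144 [active][ready]
1_ dm-19 DELL,MD3000i
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:1  sdb 8:16  [active][ghost]
\_ round-robin 0 [prio=100][active]
 \_ 2:0:0:10 sdj 8:144 [active][ready]
"""

print(paths_in_multiple_maps(sample))
```

On the excerpt, `sdj` is reported under both `dm-6` and `dm-19`, which is the duplication described in the note.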

PS.

This may be worth another bug report or a change to the manual for the MD3000i (from Red Hat): I don't like the fact that the kernel tries to read the iSCSI disk partition tables, as the specific path might not be accessible. It causes a lot of read errors and, I assume, slows down boot dramatically.

Comment 9 Ben Marzinski 2010-12-13 23:11:03 UTC
Having a device change wwids can really mess with multipath.  It's quite possible that some of the recent iSCSI changes have fixed this.  Are you able to reproduce this on a recent version?

Comment 10 RHEL Program Management 2014-01-29 10:39:53 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 11 Adam Huffman 2014-06-10 11:35:36 UTC
I've changed jobs since I reported this and no longer have access to this hardware.