Bug 1379902 - multipath device is busy following reboot of gateway node during I/O load
Summary: multipath device is busy following reboot of gateway node during I/O load
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 2.1
Assignee: Mike Christie
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1379890
 
Reported: 2016-09-28 03:38 UTC by Paul Cuzner
Modified: 2017-07-30 15:33 UTC
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-06 17:17:49 UTC
Embargoed:


Attachments
rbd map worked and multipath -ll shows the devices (901 bytes, text/plain)
2016-09-28 03:38 UTC, Paul Cuzner
no flags Details
syslog from the node - ceph-1 (214.07 KB, text/plain)
2016-09-28 20:42 UTC, Paul Cuzner
no flags Details

Description Paul Cuzner 2016-09-28 03:38:57 UTC
Created attachment 1205384 [details]
rbd map worked and multipath -ll shows the devices

Description of problem:
In a multi-gw configuration, with device-mapper-multipath providing the path layer to LIO, some devices report as busy after a node is killed during active I/O and then restarted. Because the device is in a busy state, LIO cannot access it, so the client is never able to recover its path(s).

Version-Release number of selected component (if applicable):
RHEL 7.3 beta - 3.10.0-506.el7.x86_64
device-mapper-multipath-0.4.9-99.el7.x86_64

I'm also using the rbd-target-gw script in the following hierarchy:
rbd-target-gw -> rbdmap -> target
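
For reference, that chain can be inspected on a gateway with something like the following (assuming systemd unit names matching the script names above):

systemctl list-dependencies rbd-target-gw   # should pull in rbdmap and target
systemctl status rbdmap target              # confirm both units came up after boot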

How reproducible:
3 tests, all with the same outcome

Steps to Reproduce:
1. Create a 2-gateway configuration, with a LUN exported to a Windows 2012 client
2. Connect the Windows client to both gateways with the iSCSI initiator
3. Run iometer on the Windows box; a 100% random read workload is fine
4. Determine which gateway is active for this LUN, then power it off, forcing the alternate node to be used
5. Restart the gateway
6. Check that the device is added back to LIO and becomes available to the Windows client again (see the verification sketch below)
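
A sketch of the step 6 checks on the restarted gateway (backstore name ansible3 taken from the error output below; targetcli assumed to be installed):

targetcli ls /backstores/block   # the ansible3 storage object should be present and activated
targetcli ls /iscsi              # the TPG should show the LUN mapped again
multipath -ll                    # the backing /dev/mapper/0-* device should be up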

Actual results:

systemctl status target shows the issue when attempting to add the LUN to LIO:
Sep 28 15:55:49 ceph-1.test.lab systemd[1]: Starting Restore LIO kernel target configuration...
Sep 28 15:55:49 ceph-1.test.lab target[2777]: Could not create StorageObject ansible3: Cannot configure StorageObject because device /dev/mapper/0-7a924515f007c is...e, skipped
Sep 28 15:55:49 ceph-1.test.lab target[2777]: Could not find matching StorageObject for LUN 2, skipped
Sep 28 15:55:49 ceph-1.test.lab target[2777]: Could not find matching StorageObject for LUN 2, skipped
Sep 28 15:55:49 ceph-1.test.lab target[2777]: Could not find matching TPG LUN 2 for MappedLUN 0, skipped


Expected results:
The restarted node should be able to access all prior LUNs, and LIO should be restored to its prior state.

Additional info:

Comment 2 Paul Cuzner 2016-09-28 03:59:55 UTC
I removed the device with multipath -f, unmapped the rbd device, then re-ran rbd map rbd/ansible3 -o noshare.
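Roughly (map name and rbd device number assumed from the listing below):

multipath -f 0-7a924515f007c      # flush the stale multipath map
rbd unmap /dev/rbd2               # drop the old kernel mapping
rbd map rbd/ansible3 -o noshare   # re-map the image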
multipath -ll now shows the device:
[root@ceph-1 system]# multipath -ll 
0-7a924515f007c dm-4 Ceph,RBD
size=30G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- #:#:#:# rbd2 251:32 active ready running
0-7ab55515f007c dm-6 Ceph,RBD
size=50G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- #:#:#:# rbd3 251:48 active ready running
0-7aafe79e2a9e3 dm-3 Ceph,RBD
size=15G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- #:#:#:# rbd1 251:16 active ready running
0-937d7515f007c dm-2 Ceph,RBD
size=30G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- #:#:#:# rbd0 251:0  active ready running



However, a targetctl restore still fails with the same message.
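
For reference, the restore invocation is just (default saveconfig path assumed):

targetctl restore /etc/target/saveconfig.json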

Checking OSD state:
[root@ceph-1 system]# ceph osd blacklist ls
listed 0 entries

Comment 3 Paul Cuzner 2016-09-28 20:42:11 UTC
Created attachment 1205676 [details]
syslog from the node - ceph-1

Comment 4 Paul Cuzner 2016-09-28 20:43:41 UTC
Added syslog from the node encountering these issues.

In this case the device that LIO wants to add is rbd2, which is rbd/ansible3.

Comment 5 Paul Cuzner 2016-09-30 04:14:41 UTC
More info, but no solution

It appears that LVM is claiming the dm devices created for the RBDs. I added this to lvm.conf:
global_filter = [ "r|^/dev/mapper/[0-9]+-.*|" ]

Now lvmdiskscan does NOT show the /dev/mapper/0-<bla> devices

I also noticed this in the boot log:

Sep 30 15:48:24 ceph-1.test.lab systemd-udevd[2623]: inotify_add_watch(7, /dev/rbd2p1, 10) failed: No such file or directory

So I updated the udev rules directly (just to test!) to exclude rbd devices in /lib/udev/rules.d/60-persistent-storage.rules
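
For reference, an exclusion of this kind is typically a single rule near the top of the rules chain, e.g. (illustrative only; the GOTO label has to match the one defined in that rules file):

KERNEL=="rbd*", GOTO="persistent_storage_end"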

and now the inotify_add_watch error no longer appears:
Sep 30 15:50:53 ceph-1.test.lab kernel:  rbd2: p1
Sep 30 15:50:53 ceph-1.test.lab kernel: rbd: rbd2: capacity 32212254720 features 0x5
Sep 30 15:50:53 ceph-1.test.lab multipathd[513]: rbd2: add path (uevent)
Sep 30 15:50:53 ceph-1.test.lab multipathd[513]: rbd2: HDIO_GETGEO failed with 25
Sep 30 15:50:53 ceph-1.test.lab rbdmap[2502]: Mapped 'rbd/ansible3' to '/dev/rbd2'
Sep 30 15:50:53 ceph-1.test.lab multipathd[513]: 0-7a924515f007c: load table [0 62914560 multipath 0 0 1 1 service-time 0 1 1 251:32 1]
Sep 30 15:50:53 ceph-1.test.lab multipathd[513]: 0-7a924515f007c: event checker started
Sep 30 15:50:53 ceph-1.test.lab multipathd[513]: rbd2 [251:32]: path added to devmap 0-7a924515f007c

However, the problem remains. A device exported and initialised by a client is getting locked on the gateway, preventing it from being used by LIO.
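
For reference, commands like these show what is holding the map open (map and dm names assumed from the earlier multipath -ll listing):

dmsetup info -c 0-7a924515f007c         # a non-zero Open count means something still holds the map
fuser -v /dev/mapper/0-7a924515f007c    # userspace processes with the device open
ls /sys/block/dm-4/holders/             # kernel-level holders (e.g. LVM or another dm device)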

