Bug 1368191 - [RFE] os-brick should use the -R retry switch when flushing multipath devices
Summary: [RFE] os-brick should use the -R retry switch when flushing multipath devices
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-os-brick
Version: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Upstream M2
Target Release: 12.0 (Pike)
Assignee: Gorka Eguileor
QA Contact: Prasanth Anbalagan
URL:
Whiteboard:
Depends On:
Blocks: 1368211 1442136
Reported: 2016-08-18 15:33 UTC by Rodrigo A B Freire
Modified: 2019-12-16 06:24 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1368211
Environment:
Last Closed: 2017-04-25 15:10:35 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1592520 0 None None None 2016-08-18 15:33:31 UTC
Launchpad 1663936 0 None None None 2017-04-25 15:06:29 UTC
OpenStack gerrit 433103 0 None MERGED Retry multipath flush when map is in use 2020-06-24 09:45:45 UTC
Red Hat Knowledge Base (Solution) 2387621 0 None None None 2016-08-18 15:36:45 UTC
Red Hat Knowledge Base (Solution) 2490251 0 None None None 2016-08-18 15:39:06 UTC

Description Rodrigo A B Freire 2016-08-18 15:33:31 UTC
Description of problem:
* Sometimes multipath has problems when detaching a LUN. This is a quite common scenario and shows up in environments with 3 or more multipath devices.
* The OpenStack code traps multipath flush errors instead of acting on them, which results in device-less orphaned multipath devices and/or D-state hung processes.

Version-Release number of selected component (if applicable):
* RHOSP7

How reproducible:
* Easily

Steps to Reproduce:
1. Configure a compute node to export multipath devices to cinder
2. Configure OpenStack to make use of the multipath devices as cinder back-end
3. Create an instance with 3 or more multipath devices
4. Remove the instance

* The multipath failure errors can also be reproduced on a RHEL 7 system with 3 or more multipath LUNs using the following script:

while true; do
    for MPATH in <WWID 1> <WWID 2> <WWID 3>; do
        # Remember the SCSI addresses of the map's paths before flushing it
        DEVICES=$(multipath -l $MPATH | grep runnin | awk '{print substr($0,6,8)}')
        echo "Flushing: multipath -f $MPATH"
        if ! multipath -f $MPATH; then
            echo "Failed! Trying again."
            sleep 1
            echo "multipath -f $MPATH"
            multipath -f $MPATH
            echo "The 2nd multipath -f returned $?"
            exit 1
        fi
        # Delete every underlying SCSI device of the flushed map
        for DEVICE in $DEVICES; do
            echo "Deleting: echo 1 > /sys/bus/scsi/drivers/sd/$DEVICE/delete"
            echo 1 > /sys/bus/scsi/drivers/sd/$DEVICE/delete
        done
    done
    # Stop if any multipath map is left over, otherwise rescan and start again
    LC=$(multipath -ll | wc -l)
    multipath -ll
    if [ "$LC" != "0" ]; then exit 1; fi
    sleep 10
    rescan-scsi-bus.sh -i
    sleep 2
    multipath -r
done

Actual results:
* The LUN removal failure is only seen in debug mode, with the following error signature:

2016-08-16 11:55:34.186 5652 DEBUG nova.storage.linuxscsi [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Found multipath device = /dev/mapper/111111111111111111111111100000016 find_multipath_device /usr/lib/python2.7/site-packages/nova/storage/linuxscsi.py:136
2016-08-16 11:55:34.187 5652 DEBUG nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Removing multipath device 111111111111111111111111100000016 with paths [{'device': '/dev/sdaa', 'host': '2', 'id': '1', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdz', 'host': '2', 'id': '0', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdac', 'host': '5', 'id': '1', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdab', 'host': '5', 'id': '0', 'channel': '0', 'lun': '2'}] disconnect_volume /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume.py:1491
2016-08-16 11:55:34.187 5652 DEBUG oslo_concurrency.processutils [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf multipath -f 111111111111111111111111100000016 execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:223
2016-08-16 11:55:34.482 5652 DEBUG oslo_concurrency.processutils [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf multipath -f 111111111111111111111111100000016" returned: 1 in 0.295s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:254
2016-08-16 11:55:34.482 5652 DEBUG nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] multipath ['-f', '111111111111111111111111100000016']: stdout=Aug 16 11:55:34 | 111111111111111111111111100000016: map in use
Aug 16 11:55:34 | failed to remove multipath map 111111111111111111111111100000016
 stderr= _run_multipath /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume.py:1465
2016-08-16 11:55:34.483 5652 INFO nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Removed multipath device 111111111111111111111111100000016

* It is not detected as an error condition, because the OpenStack code treats exit code 1 from multipath as an expected condition, as per nova/virt/libvirt/volume.py:

  self._run_multipath(['-f', mdev], check_exit_code=[0, 1])
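
To illustrate why the failure goes unnoticed, here is a minimal sketch of how an allow-list of exit codes behaves. It is not the nova or oslo.concurrency code: run_checked, CommandFailed and FAKE_FLUSH are hypothetical names, and a small child process that prints "map in use" and exits 1 stands in for the failing "multipath -f" call.

```python
import subprocess
import sys


class CommandFailed(Exception):
    """Raised when a command exits with a code outside the accepted list."""


def run_checked(cmd, check_exit_code=(0,)):
    """Run cmd and raise unless its exit code is in check_exit_code.

    Hypothetical stand-in for the command wrapper; not the real API.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    if proc.returncode not in check_exit_code:
        raise CommandFailed("%r exited with %d" % (cmd, proc.returncode))
    return proc.returncode, proc.stdout


# Assumption: this child process imitates a failing "multipath -f <WWID>".
FAKE_FLUSH = [sys.executable, "-c",
              "import sys; print('map in use'); sys.exit(1)"]
```

With check_exit_code=[0, 1] the call returns normally even though the flush failed, so the caller goes on to log "Removed multipath device". With check_exit_code=[0] the same call raises and the caller gets a chance to retry or abort.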


Expected results:
* OpenStack should detect the multipath failure and try again. If the retry also fails, it should fail the operation and not delete the underlying devices. Deleting them can cause D-state hung processes, which can only be cleared by rebooting the compute node.
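
The expected behaviour could be sketched roughly as follows. This is an illustration under stated assumptions, not the fix that was merged upstream: flush_with_retry and FlushFailed are hypothetical names, and the flush and delete steps are passed in as callables so the logic can be exercised without a real multipath setup.

```python
import time


class FlushFailed(Exception):
    """Raised when the multipath map could not be flushed after all attempts."""


def flush_with_retry(flush, delete_paths, attempts=2, delay=1.0,
                     sleep=time.sleep):
    """Flush a multipath map, retrying on failure, then delete its paths.

    flush: callable returning the exit code of "multipath -f <WWID>".
    delete_paths: callable that removes the underlying SCSI devices.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        if flush() == 0:
            # Only touch the underlying devices once the map is really gone.
            delete_paths()
            return attempt
        if attempt < attempts:
            sleep(delay)
    # The paths are deliberately left alone so no I/O ends up stuck.
    raise FlushFailed("multipath flush still failing after %d attempts"
                      % attempts)
```

On a real node, flush might be something like lambda: subprocess.call(["multipath", "-f", wwid]), with delete_paths writing to the sysfs delete files of each path.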


Additional info:
* The presence or absence of LVM volumes is not relevant to this problem.
* A mere retry is sufficient for an exit code of 1 (map in use).
* queue_if_no_path has no influence on this error. It happens with or without queue_if_no_path.
* This problem is also related to Launchpad Bug https://bugs.launchpad.net/os-brick/+bug/1592520

Comment 2 Rodrigo A B Freire 2016-08-18 15:43:12 UTC
A very important note here:

As per Red Hat Documentation [1], the canonical way to remove a multipath device is:

1. Close all files
2. Unmount the device
3. Remove the lvm part (not relevant in our issue here)
4. multipath -l to enumerate the underlying devices that are part of a multipath
4.1 multipath -f <WWID>
5. flush devices (blockdev --flushbufs /dev/sd)
6. remove any existing references to the /dev/sd devices
7. echo delete for each /dev/sd device.

This bug's error happens in step 4.1.
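
As a rough sketch of steps 4.1 to 7, the helper below only builds the command strings in the documented order and executes nothing. removal_commands is a hypothetical name; the path dictionaries use the keys seen in the nova debug log above, and the sysfs delete path is the one used in the reproducer script. Mapping host, channel, id and lun to a SCSI H:C:T:L address is an assumption of this sketch.

```python
def removal_commands(wwid, paths):
    """Return the commands for steps 4.1, 5 and 7, in order.

    wwid: name of the multipath map.
    paths: list of dicts with 'device', 'host', 'channel', 'id', 'lun' keys.
    """
    # Step 4.1: flush the multipath map first (this is where the bug hits).
    commands = ["multipath -f %s" % wwid]
    # Step 5: flush outstanding buffers on every underlying path.
    for path in paths:
        commands.append("blockdev --flushbufs %s" % path["device"])
    # Step 7: delete each path through sysfs, addressed as host:channel:id:lun.
    for path in paths:
        hctl = "%(host)s:%(channel)s:%(id)s:%(lun)s" % path
        commands.append("echo 1 > /sys/bus/scsi/drivers/sd/%s/delete" % hctl)
    return commands
```

If the first command fails with "map in use", none of the later commands should be run.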

--
[1] - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/removing_devices.html

Comment 5 Rodrigo A B Freire 2016-08-18 17:07:19 UTC
RHEL-side bug: https://bugzilla.redhat.com/show_bug.cgi?id=1368211

Comment 15 Gorka Eguileor 2017-04-25 15:10:35 UTC
A retry mechanism for the "map in use" case has already been added upstream and we are also working on refactoring OS-Brick's iSCSI code to make it more robust.

