Bug 1368191

Summary: [RFE] os-brick should use the -R retry switch when flushing multipath devices
Product: Red Hat OpenStack
Reporter: Rodrigo A B Freire <rfreire>
Component: python-os-brick
Assignee: Gorka Eguileor <geguileo>
Status: CLOSED UPSTREAM
QA Contact: Prasanth Anbalagan <panbalag>
Severity: high
Priority: high
Version: 7.0 (Kilo)
CC: acanan, apevec, berrange, dasmith, eglynn, geguileo, jschluet, kchamart, lhh, lruzicka, lyarwood, sbauza, sclewis, sferdjao, sgordon, srevivo, vromanso
Target Milestone: Upstream M2
Keywords: FutureFeature, Triaged, ZStream
Target Release: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Last Closed: 2017-04-25 15:10:35 UTC
Type: Bug
Bug Blocks: 1368211, 1442136    

Description Rodrigo A B Freire 2016-08-18 15:33:31 UTC
Description of problem:
* Multipath sometimes has problems when detaching a LUN. This is a fairly common scenario and shows up in environments with 3 or more multipath devices.
* The OpenStack code currently catches and ignores multipath flush errors, which results in orphaned, device-less multipath maps and/or processes hung in D state.

Version-Release number of selected component (if applicable):
* RHOSP7

How reproducible:
* Easily

Steps to Reproduce:
1. Configure a compute node to export multipath devices to cinder
2. Configure OpenStack to make use of the multipath devices as the cinder back-end (see the configuration sketch after this list)
3. Create an instance with 3 or more multipath devices
4. Remove the instance
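
A note on step 2: the exact back-end configuration varies, but a minimal compute-side sketch (assuming the nova libvirt driver's iscsi_use_multipath option of this era; verify the option name against the deployed release) is to enable multipath for volume attachments in nova.conf:

  [libvirt]
  iscsi_use_multipath = True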

* The multipath failure errors can also be reproduced on a RHEL 7 system with 3 or more multipath LUNs using the following script:

while true; do
  for MPATH in <WWID 1> <WWID 2> <WWID 3>; do
    DEVICES=`multipath -l $MPATH | grep runnin | awk '{print substr ($_,6,8)}'`
    echo "Flushing: multipath -f $MPATH"
    if ! multipath -f $MPATH; then
      echo "Failed! Trying again."
      sleep 1
      echo "multipath -f $MPATH"
      multipath -f $MPATH
      echo "The 2nd multipath -f returned $?"
      exit 1
    fi
    for DEVICE in $DEVICES; do
      echo "Deleting: echo 1 > /sys/bus/scsi/drivers/sd/$DEVICE/delete"
      echo 1 > /sys/bus/scsi/drivers/sd/$DEVICE/delete
    done
  done
  LC=`multipath -ll | wc -l`
  multipath -ll
  if [ "$LC" != "0" ]; then exit 1; fi
  sleep 10
  rescan-scsi-bus.sh -i
  sleep 2
  multipath -r
done

Actual results:
* The LUN removal failure is only visible with debug logging enabled, with the following error signature:

2016-08-16 11:55:34.186 5652 DEBUG nova.storage.linuxscsi [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Found multipath device = /dev/mapper/111111111111111111111111100000016 find_multipath_device /usr/lib/python2.7/site-packages/nova/storage/linuxscsi.py:136
2016-08-16 11:55:34.187 5652 DEBUG nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Removing multipath device 111111111111111111111111100000016 with paths [{'device': '/dev/sdaa', 'host': '2', 'id': '1', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdz', 'host': '2', 'id': '0', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdac', 'host': '5', 'id': '1', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdab', 'host': '5', 'id': '0', 'channel': '0', 'lun': '2'}] disconnect_volume /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume.py:1491
2016-08-16 11:55:34.187 5652 DEBUG oslo_concurrency.processutils [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf multipath -f 111111111111111111111111100000016 execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:223
2016-08-16 11:55:34.482 5652 DEBUG oslo_concurrency.processutils [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf multipath -f 111111111111111111111111100000016" returned: 1 in 0.295s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:254
2016-08-16 11:55:34.482 5652 DEBUG nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] multipath ['-f', '111111111111111111111111100000016']: stdout=Aug 16 11:55:34 | 111111111111111111111111100000016: map in use
Aug 16 11:55:34 | failed to remove multipath map 111111111111111111111111100000016
 stderr= _run_multipath /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume.py:1465
2016-08-16 11:55:34.483 5652 INFO nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Removed multipath device 111111111111111111111111100000016

* It is not detected as an error condition because the OpenStack code treats exit code 1 from multipath as an expected result, as per nova/virt/libvirt/volume.py:

  self._run_multipath(['-f', mdev], check_exit_code=[0, 1])
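
For illustration only (not the actual nova code; the helper name below is hypothetical): oslo's processutils.execute() only raises ProcessExecutionError for exit codes outside the allowed list, so with check_exit_code=[0, 1] a "map in use" failure (exit 1) is indistinguishable from success unless the command output is inspected:

  from oslo_concurrency import processutils

  def flush_map_unsafe(mdev):
      # exit code 1 is whitelisted, so "map in use" never raises here
      out, err = processutils.execute(
          'multipath', '-f', mdev,
          run_as_root=True,
          root_helper='sudo nova-rootwrap /etc/nova/rootwrap.conf',
          check_exit_code=[0, 1])
      # the failure only becomes visible by inspecting the output
      if 'map in use' in out:
          raise RuntimeError('multipath map %s is still in use' % mdev)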


Expected results:
* OpenStack should detect the multipath flush failure and retry. If the retry also fails, it should fail the operation and not delete the underlying devices; deleting them while the map is still in use can leave processes hung in D state that can only be cleared by rebooting the compute node.
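
A rough sketch of that expected behaviour (function name, retry count and interval are illustrative, not the actual os-brick change; the -R retry switch requested in the summary would address the same problem on the multipath side):

  import time

  from oslo_concurrency import processutils

  ROOT_HELPER = 'sudo nova-rootwrap /etc/nova/rootwrap.conf'

  def flush_multipath_map(mdev, attempts=3, interval=1):
      """Flush a multipath map, retrying while it is reported as in use."""
      for attempt in range(1, attempts + 1):
          try:
              # only exit code 0 is accepted, so "map in use" raises
              processutils.execute('multipath', '-f', mdev,
                                   run_as_root=True, root_helper=ROOT_HELPER,
                                   check_exit_code=[0])
              return
          except processutils.ProcessExecutionError:
              if attempt == attempts:
                  # give up; the caller should abort instead of deleting
                  # the underlying /dev/sd devices
                  raise
              time.sleep(interval)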


Additional info:
* The presence or absence of LVM volumes is not relevant to this problem.
* A simple retry is usually sufficient when multipath -f exits with code 1 (map in use).
* queue_if_no_path has no influence on this error; it happens with or without queue_if_no_path.
* This problem is also related to Launchpad Bug https://bugs.launchpad.net/os-brick/+bug/1592520

Comment 2 Rodrigo A B Freire 2016-08-18 15:43:12 UTC
A very important note here:

As per Red Hat Documentation [1], the canonical way to remove a multipath device is:

1. Close all files
2. Unmount the device
3. Remove the LVM part (not relevant to the issue here)
4. Run multipath -l to enumerate the underlying devices that are part of the multipath map
4.1 Run multipath -f <WWID>
5. Flush the devices (blockdev --flushbufs /dev/sd)
6. Remove any existing references to the /dev/sd devices
7. Echo 1 to the sysfs delete node for each /dev/sd device

This bug's error happens in step 4.1.
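
For illustration, the same order of operations as a rough Python sketch (the helper name, device list handling and rootwrap helper are hypothetical; the point is that nothing after step 4.1 may run if the flush fails):

  from oslo_concurrency import processutils

  ROOT_HELPER = 'sudo nova-rootwrap /etc/nova/rootwrap.conf'

  def remove_multipath_device(wwid, sd_devices):
      """sd_devices: e.g. ['sdz', 'sdaa'], as enumerated in step 4."""
      # step 4.1: flush the map; raises if it is still in use
      processutils.execute('multipath', '-f', wwid,
                           run_as_root=True, root_helper=ROOT_HELPER,
                           check_exit_code=[0])
      for dev in sd_devices:
          # step 5: flush the block device buffers
          processutils.execute('blockdev', '--flushbufs', '/dev/%s' % dev,
                               run_as_root=True, root_helper=ROOT_HELPER)
          # step 7: delete the SCSI device through sysfs
          with open('/sys/block/%s/device/delete' % dev, 'w') as f:
              f.write('1')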

--
[1] - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/removing_devices.html

Comment 5 Rodrigo A B Freire 2016-08-18 17:07:19 UTC
RHEL-side bug: https://bugzilla.redhat.com/show_bug.cgi?id=1368211

Comment 15 Gorka Eguileor 2017-04-25 15:10:35 UTC
A retry mechanism for the "map in use" case has already been added upstream and we are also working on refactoring OS-Brick's iSCSI code to make it more robust.