Description of problem:

* multipath sometimes fails when detaching a LUN. This is a quite common scenario and manifests in environments with 3 or more multipath devices.
* The OpenStack code currently traps and ignores multipath flush errors, which results in device-less orphaned multipath devices and/or D-state hung processes.

Version-Release number of selected component (if applicable):

* RHOSP7

How reproducible:

* Easily

Steps to Reproduce:

1. Configure a compute node to export multipath devices to cinder
2. Configure OpenStack to use the multipath devices as the cinder back-end
3. Create an instance with 3 or more multipath devices
4. Remove the instance

* The multipath flush failures can also be reproduced on a RHEL7 system with 3 or more multipath LUNs using the following script:

    while true; do
        for MPATH in <WWID 1> <WWID 2> <WWID 3>; do
            DEVICES=`multipath -l $MPATH | grep runnin | awk '{print substr ($_,6,8)}'`
            echo "Flushing: multipath -f $MPATH"
            if ! multipath -f $MPATH; then
                echo "Failed! Trying again."
                sleep 1
                echo "multipath -f $MPATH"
                multipath -f $MPATH
                echo "The 2nd multipath -f returned $?"
                exit 1
            fi
            for DEVICE in $DEVICES; do
                echo "Deleting: echo 1 > /sys/bus/scsi/drivers/sd/$DEVICE/delete"
                echo 1 > /sys/bus/scsi/drivers/sd/$DEVICE/delete
            done
        done
        LC=`multipath -ll | wc -l`
        multipath -ll
        if [ "$LC" != "0" ]; then exit 1; fi
        sleep 10
        rescan-scsi-bus.sh -i
        sleep 2
        multipath -r
    done

Actual results:

* The LUN removal failure is only visible in debug mode, with the following error signature:

2016-08-16 11:55:34.186 5652 DEBUG nova.storage.linuxscsi [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Found multipath device = /dev/mapper/111111111111111111111111100000016 find_multipath_device /usr/lib/python2.7/site-packages/nova/storage/linuxscsi.py:136
2016-08-16 11:55:34.187 5652 DEBUG nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Removing multipath device 111111111111111111111111100000016 with paths [{'device': '/dev/sdaa', 'host': '2', 'id': '1', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdz', 'host': '2', 'id': '0', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdac', 'host': '5', 'id': '1', 'channel': '0', 'lun': '2'}, {'device': '/dev/sdab', 'host': '5', 'id': '0', 'channel': '0', 'lun': '2'}] disconnect_volume /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume.py:1491
2016-08-16 11:55:34.187 5652 DEBUG oslo_concurrency.processutils [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf multipath -f 111111111111111111111111100000016 execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:223
2016-08-16 11:55:34.482 5652 DEBUG oslo_concurrency.processutils [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf multipath -f 111111111111111111111111100000016" returned: 1 in 0.295s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:254
2016-08-16 11:55:34.482 5652 DEBUG nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] multipath ['-f', '111111111111111111111111100000016']: stdout=Aug 16 11:55:34 | 111111111111111111111111100000016: map in use Aug 16 11:55:34 | failed to remove multipath map 111111111111111111111111100000016 stderr= _run_multipath /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume.py:1465
2016-08-16 11:55:34.483 5652 INFO nova.virt.libvirt.volume [req-e07ae6e4-61f6-419a-aac1-69d51b8012c9 00e2c2eded244109a3dd28794a1a235e d32b33d7780c4522993a7f32f938d647 - - -] Removed multipath device 111111111111111111111111100000016

* This is not detected as an error condition, because the OpenStack code treats exit code 1 from multipath as an expected condition, as per nova/virt/libvirt/volume.py:

    self._run_multipath(['-f', mdev], check_exit_code=[0, 1])

Expected results:

* OpenStack should detect the multipath failure and retry. If the retry also fails, it should fail the operation and not delete the underlying devices; deleting them can leave D-state hung processes that are only cleared by rebooting the compute node.

Additional info:

* The presence or absence of LVM volumes is not relevant to this problem.
* A mere retry is sufficient for exit code 1 (map in use).
* queue_if_no_path has no influence on this error; it happens with or without queue_if_no_path.
* This problem is also related to Launchpad Bug https://bugs.launchpad.net/os-brick/+bug/1592520
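The expected behavior above (retry instead of accepting exit code 1, and never delete the underlying paths on failure) can be sketched as follows. This is not nova's actual code: `flush_with_retry` and `run_cmd` are hypothetical names, with `run_cmd` standing in for whatever executes the rootwrapped command and returns its exit code.

```python
import time


def flush_with_retry(wwid, run_cmd, retries=3, delay=1):
    """Flush a multipath map, retrying on failure (e.g. 'map in use').

    run_cmd(args) is a hypothetical helper that runs the command and
    returns its exit code. Returns True on success. On False, the caller
    must NOT delete the underlying /dev/sd* paths, or it risks orphaned
    maps and D-state processes.
    """
    for attempt in range(retries):
        if run_cmd(['multipath', '-f', wwid]) == 0:
            return True
        time.sleep(delay)
    return False
```

A caller would fail the disconnect operation outright when this returns False, rather than proceeding to delete the SCSI devices.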
A very important note here: as per Red Hat documentation [1], the canonical way to remove a multipath device is:

1. Close all files
2. Unmount the device
3. Remove the LVM part (not relevant in our issue here)
4. multipath -l to enumerate the underlying devices that are part of the multipath map
4.1. multipath -f <WWID>
5. Flush the devices (blockdev --flushbufs /dev/sd)
6. Remove any existing references to the /dev/sd devices
7. echo delete for each /dev/sd device

This bug's error happens in step 4.1.

--
[1] - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/removing_devices.html
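The ordering of steps 4.1 through 7 can be sketched as a command sequence. This is a simplified illustration, assuming steps 1-3 are already done; `removal_commands` is a hypothetical helper, and `/sys/block/<dev>/device/delete` is one equivalent form of the echo-delete step (the reproducer script uses the /sys/bus/scsi path instead).

```python
def removal_commands(wwid, devices):
    """Build the command sequence for the canonical multipath removal
    order (steps 4.1-7 above). `devices` are the underlying sd names
    enumerated via `multipath -l <wwid>`, e.g. ['sdz', 'sdaa'].
    """
    # Step 4.1: flush the multipath map first -- this is the step that
    # fails with 'map in use' in this bug, so its exit code must be
    # checked before touching the underlying paths.
    cmds = [['multipath', '-f', wwid]]
    for dev in devices:
        # Step 5: flush the buffers of each underlying path device.
        cmds.append(['blockdev', '--flushbufs', '/dev/%s' % dev])
        # Step 7: delete the SCSI device via sysfs.
        cmds.append(['sh', '-c', 'echo 1 > /sys/block/%s/device/delete' % dev])
    return cmds
```

The key point the bug violates: if the first command fails, none of the later commands should run.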
RHEL-side bug: https://bugzilla.redhat.com/show_bug.cgi?id=1368211
A retry mechanism for the "map in use" case has already been added upstream, and we are also working on refactoring OS-Brick's iSCSI code to make it more robust.