Description of problem:
Load testing with the following commands eventually produces a situation where the controller node shows a faulty mpio device that is also shown as the backing mpio device for a launched instance on the compute node.

lsof | grep dm-12   # for the faulting mpio path

This showed blkid and kpartx still holding the device at a low level. Attempts were made to kill these off, first with kill -15 (which worked for blkid); kill -9 was needed for kpartx, but we were still not able to delete the faulty path.

This is using RHOS 5 on RHEL 6 with an HP 3PAR backend.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Proper (online) cleanup procedure for a failed path (multipath -F is intrusive!!!):
======
MPATHDEV="/dev/dm-58"
multipath -ll $MPATHDEV
for i in $( multipath -ll $MPATHDEV | awk '/ failed / { print $3 }' ); do
    echo "Removing: $i"
    echo 1 > /sys/block/${i}/device/delete
done
multipath -ll $MPATHDEV
multipath -f $MPATHDEV
======

This seems like a different issue but is worth noting here: multipath errors being parsed as device names.
https://bugzilla.redhat.com/show_bug.cgi?id=1235786
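As a rough illustration of the manual cleanup described above, the following sketch finds processes still holding a given dm device and escalates from SIGTERM to SIGKILL, mirroring the blkid/kpartx handling described in the problem statement. The lsof output parsing (PID in the second column) and the grep -w match are assumptions and may need adjusting for your environment.
======
#!/bin/bash
# Sketch: kill processes (e.g. blkid, kpartx) still holding a stale dm device.
# DEV is the faulting mpio path from the description above; adjust as needed.
DEV="dm-12"

# Assumes lsof output where PID is the second column; -w avoids matching dm-120 etc.
PIDS=$( lsof 2>/dev/null | grep -w "$DEV" | awk '{ print $2 }' | sort -u )

for pid in $PIDS; do
    echo "Sending SIGTERM to $pid"
    kill -15 "$pid"
done

sleep 5

# Anything still holding the device (kpartx in the case above) gets SIGKILL.
for pid in $( lsof 2>/dev/null | grep -w "$DEV" | awk '{ print $2 }' | sort -u ); do
    echo "Sending SIGKILL to $pid"
    kill -9 "$pid"
done
======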
mushroom configured to do the following eventually runs into this mpio problem:
==================================================================
step 1: create a volume via cinder
cinder create --image-id 32144d4b-cfaf-4038-8fb8-feb037d8723d --volume-type LocalLVM --display-name test1001 40

step 2: boot an instance from that volume
nova --debug boot vm-test1001 --flavor 3 --block-device source=volume,id=da49ff15-f914-4bf3-8a8c-b3cc3f27d734,dest=volume,size=40,shutdown=remove,bootindex=0
==================================================================
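The two steps can be chained into a small driver loop similar to the load mushroom generates. The sketch below simply reuses the commands above; extracting the volume ID from the cinder create table output and the poll-for-available loop are my assumptions, not part of the actual mushroom configuration.
======
#!/bin/bash
# Sketch: chain the two reproducer steps in a loop.
# Image ID and flavor come from the commands above; the awk ID extraction and the
# "available" poll are assumptions about the cinder CLI table output.
IMAGE="32144d4b-cfaf-4038-8fb8-feb037d8723d"

for n in $(seq 1 10); do
    NAME="test$(printf '%04d' "$n")"
    VOL_ID=$( cinder create --image-id "$IMAGE" --volume-type LocalLVM \
              --display-name "$NAME" 40 | awk '/ id / { print $4 }' )
    # Wait for the volume to become available before booting (simple poll).
    until cinder show "$VOL_ID" | grep -q ' available '; do sleep 5; done
    nova boot "vm-$NAME" --flavor 3 \
        --block-device source=volume,id=$VOL_ID,dest=volume,size=40,shutdown=remove,bootindex=0
done
======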
The customer indicates that this problem shows up more quickly with the Case 1 approach.

(1)
1.1 nova create volume
1.2 nova boot from volume (has delete_on_terminate)
    nova delete instance
    cinder delete volume
    ..... a faulty multipath hangs around (it is gone after a few minutes)

(2)
2.1 cinder create volume
2.2 nova boot from volume (does not have delete_on_terminate)
    nova delete instance
    cinder delete volume
    ..... no faulty multipath hangs around

The volume-creation command was changed from:
nova volume-create 40 --volume-type=premium --image-id=2302496c-7384-495d-96bc-03a26a28dd41 --display-name=just-premium-vol-606
to:
cinder create --image-id 2302496c-7384-495d-96bc-03a26a28dd41 --volume-type premium --display-name test707 40
Other notes: this seems like a race/load-type condition; it requires the mushroom-driven load of comment #5.
This looks like a Cinder configuration issue. Please set use_multipath_for_image_xfer = true in /etc/cinder/cinder.conf. Today it is commented out, so Cinder uses the default value, which is false. In that case Cinder removes a single path instead of the multipath device. We have a call with the customer later today to confirm that theory.
After adding the following to /etc/cinder/cinder.conf:

[HP3PARFC]
use_multipath_for_image_xfer=true

(as opposed to just adding use_multipath_for_image_xfer=true globally), we no longer saw faulty paths. The customer is going to run the test at scale and let me know the results.

Should I open a separate bug to address the need to set this per backend? An upstream bug already seems to exist:
https://bugs.launchpad.net/cinder/+bug/1326571
The issue is much improved after the xfer setting in comment 9, but it has recurred after further testing. The customer ran a non-concurrent test like the following:

for x in range(0, 128):
    vol = cinder_create()
    vm = nova_boot(vol)
    nova_delete(vm)
    # cinder_delete is implicit because of nova boot's shutdown=remove option

After 128 loops we saw one faulty path remaining. I will follow up with more details.
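For checking the state after such a run, the following reuses the same `multipath -ll` filtering as the cleanup procedure in the original description; the output format assumptions are the same as there.
======
# Sketch: after the run, look for multipath maps that still contain failed paths.
multipath -ll | grep -c " failed "                # count of failed path entries
multipath -ll | awk '/ failed / { print $3 }'     # the failed path devices (e.g. sdX)
======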
Cinder decides whether to delete a single path or the whole multipath device based on the multipath_id, which is presumably created during the attach_volume process. After reviewing logs from the last debugging session, I see that in many cases we are trying to delete a single path because the multipath_id is missing. We still have to understand why: if it is missing because the multipath device doesn't exist, that makes sense, but if it is missing only in Cinder's metadata, that would explain why Cinder leaves faulty devices behind after detach. We need to run another debug session.
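To make the difference between the two code paths concrete, the sketch below contrasts the underlying device operations: removing a single SCSI path versus flushing the whole multipath map. This is not Cinder code, just the same low-level operations used in the cleanup procedure in the original description; the device names are placeholders.
======
# Single-path removal (what Cinder falls back to when multipath_id is missing):
# deletes one SCSI device and leaves the rest of the multipath map behind.
echo 1 > /sys/block/sdX/device/delete

# Multipath removal (what Cinder should do when multipath_id is known):
# flush the whole map once its paths are gone.
multipath -f /dev/mapper/<multipath_id>
======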
The reproducer test was modified as follows:
- create a volume from a glance image (nova is not involved)
- four threads, each creating two volumes as described above
- after 8 volumes are created, delete them
- using a different 3PAR array
- using the same OpenStack install (3 controllers in HA + N compute nodes)

The result was one faulty path as per `multipath -ll`.
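A rough shell sketch of that modified reproducer is below. It runs four background subshells, each creating two volumes with the cinder create command used earlier in this bug (image ID and volume type reused from the earlier comment), then deletes all eight. The name-based deletion, the awk parsing of `cinder list`, and the sleep-based waiting are assumptions, not the customer's actual harness.
======
#!/bin/bash
# Sketch: four threads (background subshells), two volumes each, then delete all 8.
IMAGE="2302496c-7384-495d-96bc-03a26a28dd41"

for t in 1 2 3 4; do
    (
        for n in 1 2; do
            cinder create --image-id "$IMAGE" --volume-type premium \
                   --display-name "repro-t${t}-v${n}" 40
        done
    ) &
done
wait

# Give the volumes time to finish creating, then delete them all by name prefix.
# Assumes the cinder list table format with the volume ID in the second column.
sleep 60
for id in $( cinder list | awk '/repro-t/ { print $2 }' ); do
    cinder delete "$id"
done
======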
It seems that the issue is gone after the latest multipath reconfiguration.
Not all patches are merged; moving this bug back to POST.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2686.html