Bug 1280401 - Controller node does not fully detach multipath device and the device can not be removed by manual means
Summary: Controller node does not fully detach multipath device and the device can not...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: async
Target Release: 6.0 (Juno)
Assignee: Gorka Eguileor
QA Contact: nlevinki
URL:
Whiteboard:
Depends On: 1255523 1280409
Blocks: 1278590
Reported: 2015-11-11 16:39 UTC by Gorka Eguileor
Modified: 2023-02-22 23:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When systems are under heavy load, multipath detection may take longer than expected. In addition, the volume was not detached when device validation failed.
Consequence: Cinder does not detect the multipath device in time and works in single-path mode, even if the underlying OS later detects the multipath. On disconnect, the device is then treated as a single-path device rather than a multipath one, which leaves leftovers in the system. Separately, when the multipath paths are detected properly but all of them are in a failed state, the read performed during validation fails and the device is not disconnected.
Fix: When Cinder is configured for multipath, it retries multipath detection several times on connect, and on disconnect it checks whether what was initially a single path is now a multipath. The patch also tries to detach the volume if validating the device fails after the volume has been attached.
Result: There are no longer leftovers, and devices are not left attached on validation failure.
Clone Of: 1255523
Environment:
Last Closed: 2018-01-25 16:38:32 UTC
Target Upstream Version:
Embargoed:


Comment 1 Gorka Eguileor 2015-11-11 16:57:50 UTC
+++ This bug was initially created as a clone of Bug #1255523 +++

Description of problem:

Load testing of the following commands eventually reaches a situation where the controller node shows a faulty mpio device that also appears as the backing mpio device of a launched instance on the compute node.

lsof | grep dm-12 #for the faulting mpio path

showed blkid and kpartx with the device open.

Low-level attempts were made to kill these off, first with kill -15 (worked for blkid), but kill -9 was needed for kpartx.

We were still not able to delete the faulty path.

This is using RHOS 5 on RHEL 6 with an HP 3PAR backend.




Version-Release number of selected component (if applicable):




How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


Proper (online) cleanup procedure for failed path (multipath -F is intrusive!!!):

======

MPATHDEV="/dev/dm-58"
multipath -ll $MPATHDEV

for i in $( multipath -ll $MPATHDEV | awk '/ failed / { print $3 }' )
do 
  echo "Removing: $i"; echo 1 > /sys/block/${i}/device/delete
done

multipath -ll $MPATHDEV
multipath -f $MPATHDEV

=====
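
Before running the removal loop above, it may help to list the failed paths without deleting anything. The following is a minimal dry-run sketch, not part of the original procedure. The function name failed_paths is hypothetical, and it assumes that path lines in the multipath -ll output carry the kernel device name in the field right after the H:C:T:L tuple (for example "|- 3:0:0:1 sdc 8:32 failed faulty running"). Matching on the tuple rather than on a fixed field number is meant to tolerate the varying tree prefixes in that output.

```shell
# Sketch only: print the failed path devices read from `multipath -ll`
# output on stdin. Nothing is removed.
# Assumption: a path line contains an H:C:T:L tuple (e.g. 3:0:0:1)
# immediately followed by the kernel device name (e.g. sdc).
failed_paths() {
    awk '/ failed / {
        for (i = 1; i < NF; i++) {
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) {
                print $(i + 1)   # device name follows the H:C:T:L tuple
                break
            }
        }
    }'
}
```

Possible usage: multipath -ll $MPATHDEV | failed_paths, then review the printed device names before writing to /sys/block/<name>/device/delete.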

This seems like a different issue but worth noting here.

multipath errors being parsed as device names.
https://bugzilla.redhat.com/show_bug.cgi?id=1235786

Comment 2 Mike McCune 2016-03-28 22:38:41 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

