Bug 1370598

Summary: multipathd segfault during volume attach
Product: Red Hat OpenStack Reporter: Jack Waterworth <jwaterwo>
Component: openstack-nova Assignee: Lee Yarwood <lyarwood>
Status: CLOSED ERRATA QA Contact: Prasanth Anbalagan <panbalag>
Severity: urgent Docs Contact:
Priority: high    
Version: 7.0 (Kilo) CC: aludwar, awaugama, berrange, bmarzins, dasmith, eglynn, eharney, fdinitto, geguileo, gszasz, jraju, jschluet, jwaterwo, kchamart, lyarwood, panbalag, pgrist, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone: async Keywords: ZStream
Target Release: 7.0 (Kilo)   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-nova-2015.1.4-18.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-15 22:56:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1367850    
Bug Blocks:    

Description Jack Waterworth 2016-08-26 17:54:10 UTC
Description of problem:
nova fails during volume attach. Upon further inspection, it appears that multipathd has segfaulted, and nova fails when it attempts to read the multipath output.

Version-Release number of selected component (if applicable):
openstack-nova-compute-2015.1.4-1.el7ost.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Attach volume to instance

Actual results:
volume attach fails because multipathd is not running

Expected results:
multipathd should not be in a stopped state due to segfault

Additional info:

I have a bugzilla opened with the device-mapper team for this issue.

https://bugzilla.redhat.com/show_bug.cgi?id=1367850

The team has suggested adding further checks to device-mapper-multipath to avoid the segfault, but states that SAN-side shuffling is what causes multipathing to get into this bad state.

I suspect this is either caused by nova not correctly cleaning up paths on attach/detach, or by cinder when devices are created and deleted.

Comment 5 Paul Grist 2016-08-30 15:20:14 UTC
Looking to confirm, but I think the next step here is to get the updates to the 2 customers discussed in: 

https://bugzilla.redhat.com/show_bug.cgi?id=1367850#c13

Comment 6 Ben Marzinski 2016-08-30 17:25:10 UTC
I'm also going to try to trigger some LUN reassignments on my machines to see if I can recreate this, both with and without the latest multipath code.

Comment 8 Jack Waterworth 2016-09-06 19:29:56 UTC
The customer has updated the rpms from the other bz. They haven't had the issue occur again, but they are seeing some messages from mpath:

# multipath -ll 36005076802810b39780000000000012f
Sep 02 15:12:40 | 65:80: path wwid appears to have changed. Using old wwid.
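
For anyone who wants to watch for this message across hosts, here is a minimal sketch in Python that picks the warning out of captured `multipath -ll` output. The pattern is modeled only on the single line quoted above; the function name `find_wwid_changes` is hypothetical, and other multipath versions may word or format the message differently.

```python
import re

# Pattern modeled on the one message quoted in this comment (assumption:
# the "65:80" field is a major:minor device number and the timestamp
# keeps this "Mon DD HH:MM:SS" shape).
WWID_CHANGED = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<dev>\d+:\d+): path wwid appears to have changed\."
)


def find_wwid_changes(lines):
    """Return (timestamp, device) pairs for each matching warning line."""
    hits = []
    for line in lines:
        match = WWID_CHANGED.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("dev")))
    return hits
```

Feeding it the lines of saved command output returns one tuple per warning, which can then be compared between runs to see whether new paths start reporting the change.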

Comment 9 Ben Marzinski 2016-09-06 20:00:19 UTC
This is what multipathd prints when it catches the issue and keeps itself from crashing. However, I wrote that fix to deal with a bug where the LUN itself wasn't changing, just its WWID (because of user error). In the current case, the LUN is changing. Probably the best thing for multipathd to do is to disable and then remove any path when we detect that its WWID has changed (and possibly re-add the path again, so multipath can continue to use it with the new information). That way multipath will do the best that it can to save users from themselves (like I said, we still do not support remapping LUNs while they are in use, and currently, there is no way to ).

The ideal solution would be to not remap in-use LUNs, since nothing supports this.
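
The handling proposed in this comment (disable, remove, optionally re-add a path whose WWID changed) can be sketched as a toy model in Python. This is not multipathd code: the `known_wwids` dictionary, the `reconcile_path` function, and the action names are all hypothetical and only illustrate the order of steps.

```python
def reconcile_path(known_wwids, dev, current_wwid, readd=True):
    """Toy model of the proposed handling.

    known_wwids maps a device name to the WWID recorded when the path
    was first added (hypothetical bookkeeping, not multipathd's own).
    Returns the list of actions that would be taken for this path.
    """
    recorded = known_wwids.get(dev)
    if recorded is None:
        # First time this path is seen: record it.
        known_wwids[dev] = current_wwid
        return ["add"]
    if recorded == current_wwid:
        # WWID unchanged: nothing to do.
        return []
    # WWID changed underneath us: stop using the path, then drop it.
    actions = ["disable", "remove"]
    del known_wwids[dev]
    if readd:
        # Optionally bring the path back with the new information.
        known_wwids[dev] = current_wwid
        actions.append("add")
    return actions
```

With `readd=False` the path simply disappears after the change is detected; with the default it comes back under its new WWID, which corresponds to the "possibly re-add" option mentioned above.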

Comment 24 errata-xmlrpc 2017-02-15 22:56:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0282.html