Description of problem:
Nova fails during volume attach. On further inspection, it appears that multipathd has segfaulted, and nova fails when it attempts to read the multipath output.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Attach volume to instance
Actual results:
Volume attach fails because multipathd is not running.
Expected results:
multipathd should not be left in a stopped state by a segfault.
I have a bugzilla opened with the device-mapper team for this issue.
The team has suggested adding further checks to device-mapper-multipath to avoid the segfault, but states that SAN-side shuffling is what puts multipath into this bad state.
I suspect this is either caused by nova not correctly cleaning up paths on attach/detach, or by cinder when devices are created and deleted.
Looking to confirm, but I think the next step here is to get the updates to the 2 customers discussed in:
I'm also going to try to trigger some LUN reassignments on my machines to see if I can recreate this, both with and without the latest multipath code.
The customer has updated the rpms from the other bz. They haven't had the issue occur again, but they are seeing some messages from mpath:
# multipath -ll 36005076802810b39780000000000012f
Sep 02 15:12:40 | 65:80: path wwid appears to have changed. Using old wwid.
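A message like the one above can be picked out of captured multipath output mechanically. The following is a minimal Python sketch, assuming the message format matches the single line quoted above. The function name find_wwid_change_warnings is hypothetical and is not part of multipath-tools or nova.

```python
import re

# Pattern modelled on the single log line quoted above. Other multipath
# versions may word or format the message differently (assumption).
WWID_CHANGED = re.compile(
    r"\|\s*(?P<devt>\d+:\d+): path wwid appears to have changed"
)


def find_wwid_change_warnings(output):
    """Return the device numbers (e.g. '65:80') of paths flagged by the warning."""
    return [m.group("devt") for m in WWID_CHANGED.finditer(output)]


sample = (
    "Sep 02 15:12:40 | 65:80: path wwid appears to have changed. "
    "Using old wwid."
)
print(find_wwid_change_warnings(sample))
```

An empty list means no such warning was found in the captured text. It does not prove that no LUN was remapped.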
This is what multipathd prints when it catches the issue and keeps itself from crashing. However, I wrote that fix to deal with a bug where the LUN itself wasn't changing, just its WWID (because of user error). In the current case, the LUN is changing. Probably the best thing for multipathd to do is to disable and then remove any path when we detect that its WWID has changed (and possibly re-add the path again, so multipath can continue to use it with the new information). That way multipath will do the best that it can to save users from themselves (like I said, we still do not support remapping LUNs while they are in use, and currently, there is no way to ).
The ideal solution would be to not remap in-use LUNs, since nothing supports this.
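The handling proposed above (drop a path whose WWID has changed, then re-add it under the new WWID) can be illustrated with a toy model. This is only a sketch of the idea, not multipathd code: the PathTable class, its methods, and the dictionary layout are all hypothetical, and a real implementation would also have to deal with in-flight I/O and map reloads.

```python
class PathTable:
    """Toy model of multipath maps (hypothetical, for illustration only)."""

    def __init__(self):
        self.maps = {}       # wwid -> set of path device numbers
        self.path_wwid = {}  # device number -> wwid recorded at add time

    def add_path(self, devt, wwid):
        self.maps.setdefault(wwid, set()).add(devt)
        self.path_wwid[devt] = wwid

    def remove_path(self, devt):
        wwid = self.path_wwid.pop(devt)
        self.maps[wwid].discard(devt)
        if not self.maps[wwid]:
            del self.maps[wwid]  # drop a map that has no paths left

    def check_path(self, devt, current_wwid):
        """Re-home a path whose reported WWID no longer matches the record.

        Returns True if the path was moved, False if nothing changed.
        """
        if self.path_wwid.get(devt) == current_wwid:
            return False
        if devt in self.path_wwid:
            self.remove_path(devt)         # stop using the stale mapping
        self.add_path(devt, current_wwid)  # re-add under the new WWID
        return True


table = PathTable()
table.add_path("65:80", "wwid-old")
table.add_path("8:16", "wwid-old")
moved = table.check_path("65:80", "wwid-new")
```

In this model the remapped path ends up in a separate map keyed by the new WWID, so the old map never mixes paths that point at two different LUNs.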
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.