Bug 1386469 - Multipath device loses failed paths during migration
Summary: Multipath device loses failed paths during migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: zstream
Target Release: 8.0 (Liberty)
Assignee: Gorka Eguileor
QA Contact: Avi Avraham
URL:
Whiteboard:
Depends On: 1418856 1506277
Blocks:
 
Reported: 2016-10-19 04:14 UTC by VIKRANT
Modified: 2020-09-10 09:52 UTC (History)
CC: 13 users

Fixed In Version: openstack-cinder-7.0.3-8.el7ost
Doc Type: Bug Fix
Doc Text:
This update improves iSCSI connections using the latest `os-brick` functionality to force the detachment of volumes, when appropriate. For optimal results, use with iscsi-initiator-utils >= 6.2.0.874-2.
Clone Of:
Environment:
Last Closed: 2017-11-29 15:58:36 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
nova node 0 log file (186.87 KB, text/plain)
2017-10-23 10:12 UTC, Avi Avraham
nova node 1 log file (181.76 KB, text/plain)
2017-10-23 10:13 UTC, Avi Avraham


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 459453 0 'None' MERGED Do proper cleanup if connect volume fails 2020-07-28 06:32:16 UTC
OpenStack gerrit 459454 0 'None' MERGED Add support for OS-Brick force disconnect 2020-07-28 06:32:13 UTC
Red Hat Knowledge Base (Solution) 2715071 0 None None None 2016-10-19 04:42:53 UTC
Red Hat Product Errata RHBA-2017:3283 0 normal SHIPPED_LIVE openstack-cinder bug fix advisory 2017-11-29 20:58:18 UTC

Description VIKRANT 2016-10-19 04:14:35 UTC
Description of problem:

The multipath device loses its failed paths during live migration.

Version-Release number of selected component (if applicable):
RHEL OSP 8

How reproducible:
Every time for the customer.

Steps to Reproduce:
1. Attach an iSCSI multipath Cinder volume to an instance.
2. Simulate failure of two of the four paths, then live-migrate the instance.
3. On the destination compute node, the failed paths disappear for the migrated instance, and also for an instance already running on that node.
4. Re-enable the failed paths; the multipath output does not change.
5. After a rescan, the paths reappear only for the migrated instance; the instance already on the destination node is left with just two paths.
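The report does not spell out how the path failures were simulated; comment 39 indicates iptables was used to block iSCSI traffic. A hedged sketch of commands an operator might run to reproduce step 2 (the portal addresses are placeholders, not taken from this environment):

```shell
# Block two of the four iSCSI paths by dropping traffic to their portals.
# 192.168.1.10 and 192.168.2.10 are placeholder portal IPs; 3260 is the
# standard iSCSI target port.
iptables -A OUTPUT -d 192.168.1.10 -p tcp --dport 3260 -j DROP
iptables -A OUTPUT -d 192.168.2.10 -p tcp --dport 3260 -j DROP

# Confirm the affected paths transition to "failed faulty" state.
multipath -ll

# ... live-migrate the instance, then remove the blocks to restore the paths ...
iptables -D OUTPUT -d 192.168.1.10 -p tcp --dport 3260 -j DROP
iptables -D OUTPUT -d 192.168.2.10 -p tcp --dport 3260 -j DROP

# Rescan the iSCSI sessions so the restored paths can reappear (step 5).
iscsiadm -m session --rescan
multipath -ll
```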


Actual results:
The failed paths are missing from the multipath maps after the migration.

Expected results:
The failed paths should still be listed on the destination compute node after the migration.

Additional info:
More info is coming in an internal comment.

Comment 24 Avi Avraham 2017-10-23 10:12:23 UTC
Created attachment 1342055 [details]
nova node 0 log file

Comment 25 Avi Avraham 2017-10-23 10:13:08 UTC
Created attachment 1342057 [details]
nova node 1 log file

Comment 26 Avi Avraham 2017-10-23 10:18:23 UTC
While trying to verify this bug I am getting a lot of errors from Nova.
The log files from both nodes are attached here.
The command used for the live migration:
server migrate --block-migration --live compute-1.localdomain inst2
Running the same scenario without a volume attached passes.

Comment 30 Avi Avraham 2017-10-29 14:30:24 UTC
After disabling multipath in nova.conf, the migration passed successfully.

Comment 31 Matthew Booth 2017-10-31 14:08:08 UTC
I've spent a bunch of time looking into this, and the only Nova errors I've seen so far have been in response to Cinder errors. I'm satisfied there's no specific Nova issue to address here. We could probably do better in response to Cinder errors, but that would require a redesign which wouldn't be applicable to OSP8.

Specifically, the most common failure to migrate I see manifests like this in the libvirt logs:

2017-10-31 13:56:24.555+0000: 1726: error : virNetClientProgramDispatchError:177 : internal error: qemu unexpectedly closed the monitor: 2017-10-31T13:56:24.146532Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/2 (label charserial1)
2017-10-31T13:56:24.342623Z qemu-kvm: load of migration failed: Input/output error
2017-10-31 13:56:24.555+0000: 1726: debug : qemuDomainObjExitRemote:4151 : Exited remote (vm=0x7feea4017790 name=instance-00000011)

This appears to relate to a real failure of the multipath block device:

# multipath -ll
3514f0c5a51600fd3 dm-0 XtremIO ,XtremApp
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 10:0:0:1 sda 8:0   failed faulty running
  |- 11:0:0:1 sdb 8:16  failed faulty running
  |- 12:0:0:1 sdc 8:32  failed faulty running
  `- 13:0:0:1 sdd 8:48  failed faulty running
3514f0c5a51600fd5 dm-2 XtremIO ,XtremApp
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 10:0:0:2 sdf 8:80  active ready running
  |- 11:0:0:2 sde 8:64  active ready running
  |- 12:0:0:2 sdg 8:96  active ready running
  `- 13:0:0:2 sdh 8:112 active ready running

[root@compute-0 ~]# cat /dev/dm-0 > /dev/null
cat: /dev/dm-0: Input/output error

My assumption here is that qemu is failing with the same I/O error I get from cat.

I've also seen 500 responses from nova-api, but in every case these originated in the Cinder client, with a corresponding error in Cinder's volume.log.

Unfortunately I don't have any useful insight into the cause of the multipath or Cinder failures, so I can't say whether or not they're expected. However, Nova failures are expected if the volume is inaccessible to the host.

Comment 39 Avi Avraham 2017-11-26 13:39:20 UTC
Verified.
Package version: openstack-cinder-7.0.3-10.el7ost.noarch
Server migration executed successfully with a multipath configuration. Traffic on one of the two multipath interfaces was blocked with iptables on all compute servers during the instance migration.

Comment 42 errata-xmlrpc 2017-11-29 15:58:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3283

