Bug 1270125 - LibvirtFibreChannelVolumeDriver might miss multipath ids during detection if system is under load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 5.0 (RHEL 6)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 5.0 (RHEL 6)
Assignee: Lee Yarwood
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1273472 1273473
Reported: 2015-10-09 03:06 UTC by John Fulton
Modified: 2023-02-22 23:02 UTC (History)
CC List: 16 users

Fixed In Version: openstack-nova-2014.1.5-5.el6ost, openstack-nova-2014.1.5-6.el7ost
Doc Type: Bug Fix
Doc Text:
Nova can often overload multipathd with multiple calls when attaching fibre channel volumes to instances, and later calls can miss recently attached LUNs as multipathd struggles to create multipath devices in time. As a consequence, LUNs are not found by Nova and volume attachment fails. Upstream the fix is to avoid querying multipathd at all within the new os-brick library. As this is not possible downstream, the updated package retries queries where a LUN is not found initially. As a result, nova now retries multipathd queries allowing time for multipath devices to be created and for volume attachment to complete successfully.
Clone Of:
: 1273472 1273473 (view as bug list)
Environment:
Last Closed: 2015-11-18 12:46:19 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1255523 | 0 | urgent | CLOSED | Controller node does not fully detach multipath device and the device can not be removed by manual means | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2015:2075 | 0 | normal | SHIPPED_LIVE | openstack-nova bug fix advisory | 2015-11-18 17:45:41 UTC

Description John Fulton 2015-10-09 03:06:05 UTC
- Description of problem:

There is a Nova multipath issue related to the cleanup of multipath devices
in a Fibre Channel environment.

The root cause seems to be the way LibvirtFibreChannelVolumeDriver detects multipath ids in connect_volume: it runs "multipath -l <dev>" only once and parses the response to get a multipath id [1].

The problem is that in a slow environment it takes time to create a multipath device, so Nova gets back an empty response and assumes that the volume is connected via a single path. Later, disconnect_volume reports that something went wrong with the multipath tools [2] and deletes only a single path, so the multipath device remains stale.

We examined connection info of 300 volumes in Nova's block_device_mapping table and found that some of them are missing a multipath id. 
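
A check of this kind could be sketched roughly as below. It assumes each row's connection_info is a JSON string whose "data" dictionary holds a "multipath_id" key when a multipath device was detected; the helper name volumes_missing_multipath_id and the sample rows are hypothetical and only illustrate the idea.

```python
import json


def volumes_missing_multipath_id(rows):
    """Return the volume ids whose connection_info has no multipath_id.

    `rows` is an iterable of (volume_id, connection_info_json) pairs, for
    example the two columns selected from block_device_mapping.
    Assumption: when a multipath device was detected, its id is stored
    under connection_info['data']['multipath_id'].
    """
    missing = []
    for volume_id, raw in rows:
        if not raw:
            continue  # no connection_info stored for this row, skip it
        data = json.loads(raw).get("data", {})
        if not data.get("multipath_id"):
            missing.append(volume_id)
    return missing


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ("vol-a", json.dumps({"driver_volume_type": "fibre_channel",
                          "data": {"multipath_id": "30000000000000000"}})),
    ("vol-b", json.dumps({"driver_volume_type": "fibre_channel",
                          "data": {}})),
]
print(volumes_missing_multipath_id(sample_rows))  # ['vol-b']
```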

[1] https://github.com/openstack/nova/blob/icehouse-eol/nova/virt/libvirt/volume.py#L1006
[2] https://github.com/openstack/nova/blob/icehouse-eol/nova/virt/libvirt/volume.py#L1043
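
To illustrate the difference between a single query and a retried one, here is a minimal sketch. It is not the Nova code and not the shipped patch: find_multipath_id, parse_multipath_id, the run_query callable, and the attempts/delay values are all hypothetical, and the parser assumes the map line has the form shown in this report ("<id> dm-N <vendor>,<product>", with no user-friendly alias in front).

```python
import time


def parse_multipath_id(output):
    """Return the id from the first line whose second token is a dm-N name.

    Returns None when no such line exists, e.g. when the multipath device
    has not been created yet and the command printed nothing.
    """
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[1].startswith("dm-"):
            return tokens[0]
    return None


def find_multipath_id(device, run_query, attempts=5, delay=1.0,
                      sleep=time.sleep):
    """Query up to `attempts` times, pausing `delay` seconds between tries.

    `run_query(device)` is a caller-supplied function that returns the text
    printed by the multipath query for `device` (hypothetical hook, so the
    sketch does not depend on any particular way of running the command).
    """
    for attempt in range(attempts):
        mpath_id = parse_multipath_id(run_query(device))
        if mpath_id is not None:
            return mpath_id
        if attempt < attempts - 1:
            sleep(delay)  # give multipathd time to create the device
    return None


# Simulated slow system: the first two queries come back empty.
responses = iter(["", "", "30000000000000000 dm-11 3PARdata,VV\nsize=38G"])
print(find_multipath_id("/dev/sdj", lambda dev: next(responses), delay=0))
```

With a single attempt the simulated query above would return None, which corresponds to the case where the volume is treated as single path.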

- Version-Release number of selected component (if applicable):

openstack-nova-common-2014.1.4-4.el6ost.noarch
openstack-nova-compute-2014.1.4-4.el6ost.noarch

- How reproducible:

Intermittent

- Steps to Reproduce:  (In pseudo code)

Spawn 4 concurrent threads of the following to create/delete 128 VMs. 

for x in range(0, 32):
    vol = cinder_create()
    vm = nova_boot(vol)
    nova_delete(vm)
    cinder_delete(vol)
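
The threading part of the loop above could be sketched as follows. This is only an outline under assumptions: the `api` object and its four methods mirror the pseudo code names and are hypothetical, and FakeApi is a stand-in that just counts calls so the sketch runs without an OpenStack deployment.

```python
import threading

THREADS = 4       # concurrent workers, as in the steps above
ITERATIONS = 32   # create/delete cycles per worker (4 * 32 = 128 VMs)


def run_cycles(api, iterations, errors):
    """Run the create/boot/delete cycle `iterations` times on one thread."""
    for _ in range(iterations):
        try:
            vol = api.cinder_create()
            vm = api.nova_boot(vol)
            api.nova_delete(vm)
            api.cinder_delete(vol)
        except Exception as exc:  # record the failure and keep stressing
            errors.append(exc)


def stress(api, threads=THREADS, iterations=ITERATIONS):
    """Start the workers, wait for all of them, return collected errors."""
    errors = []
    workers = [
        threading.Thread(target=run_cycles, args=(api, iterations, errors))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return errors


class FakeApi:
    """Hypothetical stand-in; replace with wrappers around real clients."""

    def __init__(self):
        self._lock = threading.Lock()
        self.booted = 0

    def cinder_create(self):
        return object()

    def nova_boot(self, vol):
        with self._lock:
            self.booted += 1
        return object()

    def nova_delete(self, vm):
        pass

    def cinder_delete(self, vol):
        pass


api = FakeApi()
errors = stress(api)
print(api.booted, len(errors))
```

A real harness would also need to wait for each volume and instance to reach the expected state before moving to the next call; that part is left out here.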

- Actual results:

After all instances are deleted there are faulty paths left on the controller. 

# multipath -ll
30000000000000000 dm-11 3PARdata,VV
size=38G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 8:0:2:1 sdj 8:144 failed faulty running
  |- 7:0:2:1 sdh 8:112 failed faulty running
  `- 8:0:2:4 sdp 8:240 failed faulty running

#
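
For checking a node after the test, output like the above can be scanned for leftover faulty paths. The sketch below is an assumption-laden helper, not part of any multipath tooling: find_faulty_paths is a hypothetical name, and it relies on the layout shown in this report (a map line "<id> dm-N ...", and path lines where the device name sits two tokens before "failed").

```python
def find_faulty_paths(output):
    """Map each multipath id to the device names of its failed faulty paths."""
    faulty = {}
    current = None
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[1].startswith("dm-"):
            current = tokens[0]  # start of a new multipath map
        elif current and "failed" in tokens and "faulty" in tokens:
            # Path line, e.g. "|- 8:0:2:1 sdj 8:144 failed faulty running":
            # the device name is two tokens before "failed".
            idx = tokens.index("failed")
            if idx >= 2:
                faulty.setdefault(current, []).append(tokens[idx - 2])
    return faulty


# The output from this report, used as sample input.
SAMPLE = """30000000000000000 dm-11 3PARdata,VV
size=38G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 8:0:2:1 sdj 8:144 failed faulty running
  |- 7:0:2:1 sdh 8:112 failed faulty running
  `- 8:0:2:4 sdp 8:240 failed faulty running
"""
print(find_faulty_paths(SAMPLE))  # {'30000000000000000': ['sdj', 'sdh', 'sdp']}
```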

- Expected results:

`multipath -ll` will return no paths in a failed faulty running state.

- Additional info:

A similar problem occurs on the controller node during the same stress test as documented in externally linked Red Hat BZ 1255523.

Comment 2 Sergey Gotliv 2015-10-12 13:46:17 UTC
It looks very similar to BZ#1115375. I think that this attachment [1] from Cinder's BZ#1093416 is what you need here, at least for 5.0. Unfortunately, the more complete solution that was recently introduced in the os-brick library [2] is not backportable to 5.0 and has to be properly tested.

[1] https://bugzilla.redhat.com/attachment.cgi?id=892577
[2] https://review.openstack.org/#/c/213389/

Comment 20 errata-xmlrpc 2015-11-18 12:46:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2075.html

