Bug 1270125 - LibvirtFibreChannelVolumeDriver might miss multipath ids during detection if system is under load
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
5.0 (RHEL 6)
x86_64 Linux
urgent Severity urgent
: async
: 5.0 (RHEL 6)
Assigned To: Lee Yarwood
nlevinki
: ZStream
Depends On:
Blocks: 1273472 1273473
Reported: 2015-10-08 23:06 EDT by John Fulton
Modified: 2016-04-26 10:14 EDT (History)
18 users

See Also:
Fixed In Version: openstack-nova-2014.1.5-5.el6ost, openstack-nova-2014.1.5-6.el7ost
Doc Type: Bug Fix
Doc Text:
Nova can often overload multipathd with multiple calls when attaching fibre channel volumes to instances, and later calls can miss recently attached LUNs as multipathd struggles to create multipath devices in time. As a consequence, LUNs are not found by Nova and volume attachment fails. Upstream the fix is to avoid querying multipathd at all within the new os-brick library. As this is not possible downstream, the updated package retries queries where a LUN is not found initially. As a result, nova now retries multipathd queries allowing time for multipath devices to be created and for volume attachment to complete successfully.
Story Points: ---
Clone Of:
: 1273472 1273473
Environment:
Last Closed: 2015-11-18 07:46:19 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1255523 None None None Never

Description John Fulton 2015-10-08 23:06:05 EDT
- Description of problem:

There is a Nova multipath issue related to the cleanup of the multipath devices
on a Fibre Channel environment. 

The root cause seems to be in the way LibvirtFibreChannelVolumeDriver detects multipath ids (connect_volume). It runs "multipath -l <dev>" only once and parses the response to get a multipath id [1].

The problem is that if the environment is slow, it takes time to create the multipath device, so Nova gets back an empty response and assumes the volume is connected via a single path. Later, in disconnect_volume, it reports that something went wrong with the multipath tools [2] and deletes only a single path, so the multipath device remains stale.

We examined connection info of 300 volumes in Nova's block_device_mapping table and found that some of them are missing a multipath id. 

[1] https://github.com/openstack/nova/blob/icehouse-eol/nova/virt/libvirt/volume.py#L1006
[2] https://github.com/openstack/nova/blob/icehouse-eol/nova/virt/libvirt/volume.py#L1043
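To illustrate the failure mode and the retry-based workaround described in the Doc Text, here is a minimal sketch (function names and the `run_multipath` callable are hypothetical; the actual code lives in nova/virt/libvirt/volume.py). The single-shot upstream behavior is equivalent to retries=1, which returns None whenever multipathd has not created the device yet:

```python
import time


def parse_multipath_id(output):
    """Extract the multipath WWID from `multipath -l <dev>` output.

    The WWID is the first token of the first line, e.g.
    '30000000000000000 dm-11 3PARdata,VV' -> '30000000000000000'.
    Returns None when the output is empty (device not created yet).
    """
    if not output.strip():
        return None
    return output.splitlines()[0].split()[0]


def find_multipath_id(run_multipath, device, retries=3, delay=1.0):
    """Retry the query so a loaded multipathd has time to create the device.

    `run_multipath` is a hypothetical callable wrapping `multipath -l <dev>`.
    """
    for attempt in range(retries):
        mpath_id = parse_multipath_id(run_multipath(device))
        if mpath_id is not None:
            return mpath_id
        if attempt < retries - 1:
            time.sleep(delay)
    # Caller falls back to treating the volume as single-path.
    return None
```

With retries, a first empty response (multipathd still busy) no longer causes Nova to misclassify the volume as single-path.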

- Version-Release number of selected component (if applicable):

openstack-nova-common-2014.1.4-4.el6ost.noarch
openstack-nova-compute-2014.1.4-4.el6ost.noarch

- How reproducible:

Intermittent

- Steps to Reproduce:  (In pseudocode)

Spawn 4 concurrent threads of the following to create/delete 128 VMs. 

for x in range(0, 32):
    vol = cinder_create()
    vm = nova_boot(vol)
    nova_delete(vm)
    cinder_delete(vol)
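A runnable shape of that stress test, with the OpenStack calls stubbed out (cinder_create, nova_boot, nova_delete, and cinder_delete are placeholders for the real python-cinderclient/python-novaclient calls), might look like:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholders standing in for the real Cinder/Nova client calls.
def cinder_create():
    return "vol-00000000"

def nova_boot(vol):
    return "vm-00000000"

def nova_delete(vm):
    pass

def cinder_delete(vol):
    pass

def churn(iterations=32):
    """Create and delete `iterations` volume-backed VMs back to back."""
    for _ in range(iterations):
        vol = cinder_create()
        vm = nova_boot(vol)
        nova_delete(vm)
        cinder_delete(vol)
    return iterations

# Four concurrent workers, 32 iterations each: 128 create/delete cycles total.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(churn) for _ in range(4)]
    total = sum(f.result() for f in futures)
```

The rapid attach/detach churn across threads is what loads multipathd enough for connect_volume's single query to come back empty.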

- Actual results:

After all instances are deleted there are faulty paths left on the controller. 

# multipath -ll
30000000000000000 dm-11 3PARdata,VV
size=38G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 8:0:2:1 sdj 8:144 failed faulty running
  |- 7:0:2:1 sdh 8:112 failed faulty running
  `- 8:0:2:4 sdp 8:240 failed faulty running

#

- Expected results:

`multipath -ll` will not return paths in a failed faulty running state. 

- Additional info:

A similar problem occurs on the controller node during the same stress test as documented in externally linked Red Hat BZ 1255523.
Comment 2 Sergey Gotliv 2015-10-12 09:46:17 EDT
It looks very similar to BZ#1115375. I think that this attachment [1] from Cinder's BZ#1093416 is what you need here, at least for 5.0. Unfortunately, the more complete solution that was recently introduced in the os-brick library [2] is not backportable to 5.0 and has to be properly tested. 

[1] https://bugzilla.redhat.com/attachment.cgi?id=892577
[2] https://review.openstack.org/#/c/213389/
Comment 20 errata-xmlrpc 2015-11-18 07:46:19 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2075.html
