Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1347447

Summary: High Availability to Protect Instances in Red Hat Enterprise Linux OpenStack Platform 7 with cinder boot volumes on cinder.volume.drivers.ibm.storwize_svc.StorwizeSVCDriver and no-shared-storage option - instance in error state
Product: Red Hat OpenStack
Component: openstack-cinder
Version: 7.0 (Kilo)
Reporter: Andreas Karis <akaris>
Assignee: Eric Harney <eharney>
QA Contact: nlevinki <nlevinki>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Keywords: ZStream
Target Milestone: async
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
CC: aludwar, berrange, dasmith, egafford, eglynn, eharney, kchamart, lixqin, sbauza, scohen, sferdjao, sgordon, srevivo, vromanso
Cloned to: 1371963, 1371969 (view as bug list)
Bug Blocks: 1371963, 1371969
Last Closed: 2016-11-16 14:52:49 UTC
Type: Bug
Attachments: full description of this bug with full analysis

Description Andreas Karis 2016-06-16 22:43:28 UTC
Created attachment 1168857 [details]
full description of this bug with full analysis

When following the instance high availability guide...
> https://access.redhat.com/articles/1544823
... in Red Hat Enterprise Linux OpenStack Platform 7 with Cinder boot volumes on cinder.volume.drivers.ibm.storwize_svc.StorwizeSVCDriver and the no-shared-storage option, instances go into ERROR state, even though they otherwise transition correctly to the new host and remain pingable and functional.

The customer and a Red Hat engineer followed this guide closely in two setups: a) the customer's environment, with physical hardware and IBM storage for Cinder, and b) a virtual lab environment, with NFS storage for Cinder.
In both cases, no shared storage was used for Nova ephemeral disks. Environment b), however, had previously been tested and worked with Nova shared storage on NFS, and was then converted to Cinder storage only.

Scenario a) was tested only with Cinder block storage and without shared ephemeral storage; it throws an error message quite similar to the one from scenario b), but here the error sends the instance into ERROR state. Its Cinder backend configuration:
~~~
enabled_backends=iscsi_storage

[iscsi_storage]
volume_driver=cinder.volume.drivers.ibm.storwize_svc.StorwizeSVCDriver
san_ip=x.x.x.x
san_login=<san-login>
san_password=*********
ssh_max_pool_conn=5
ssh_conn_timeout=30
ssh_min_pool_conn=1
san_thin_provision=True
volume_backend_name=iscsi_storage
san_ssh_port=22
san_clustername=Cinder
san_is_local=False
storwize_svc_volpool_name=NFV-pool
storwize_svc_iscsi_chap_enabled=False
~~~

This scenario migrates from overcloud-controller-0 to overcloud-controller-1

Scenario b) works both with shared Nova storage and with Cinder block storage without shared ephemeral storage; however, with Cinder block storage it throws an error message in the logs which seems to be silently ignored. Its Cinder backend configuration:
~~~
[root@overcloud-controller-0 ~]# grep enabled_back /etc/cinder/cinder.conf 
enabled_backends=nfs
[root@overcloud-controller-0 ~]# cat /etc/cinder/cinder.conf | tail
(...)
[nfs]
nfs_shares_config = /etc/cinder/nfs_share
volume_driver = cinder.volume.drivers.nfs.NfsDriver
volume_backend_name = nfs
nfs_mount_options = rw,noatime,nodiratime,async
~~~

This scenario migrates from overcloud-controller-1 to overcloud-controller-0

1) This looks like an issue with the detach operation, which either should never happen at all or tries to detach the volume on the wrong hypervisor; e.g., in the case of IBM storage:
~~~
2016-06-14 06:17:30.313 1366 ERROR cinder.api.middleware.fault [req-c3372c54-1f62-43bb-b317-342bf6d9e3c0 ee48cfd97e8647979479f28f2b88d8e1 e9b29cc2cfc44d9ab2105cba6821a0dc - - -] Caught error: Remote error: Remote error: VolumeBackendAPIException Bad or unexpected response from the storage volume backend API: Unable to terminate volume connection: Bad or unexpected response from the storage volume backend API: CLI Exception output:
 command: ['svctask', 'rmvdiskhostmap', '-host', u'"overcloud-compute-1.localdomain-85294885"', u'volume-96e8f4a7-4e4c-489b-b034-36a91e13a241']
 stdout:
 stderr: CMMVC5842E The action failed because an object that was specified in the command does not exist.
~~~
It is trying to detach from overcloud-compute-1, but it should detach from the old hypervisor, overcloud-compute-0.
2) The issue shows with both NFS-backed and IBM-backed storage, but it only becomes a real ERROR with IBM storage; it seems to be silently ignored with NFS storage.
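The failure mode and the eventual fix can be illustrated with a minimal, self-contained sketch. The `StorwizeSSHStub` class and the short volume/host names below are hypothetical stand-ins for the driver's real SSH helper, which runs `svcinfo lsvdiskhostmap` / `svctask rmvdiskhostmap` against the array: unguarded, `rmvdiskhostmap` fails with CMMVC5842E when the requested mapping does not exist, while the fixed logic first confirms the mapping via `lsvdiskhostmap` and skips the unmap otherwise.

```python
class StorwizeSSHStub:
    """Hypothetical stand-in for the driver's SSH helper."""

    def __init__(self, mappings):
        # mappings: volume name -> set of hosts with a vdisk-host mapping
        self.mappings = mappings

    def lsvdiskhostmap(self, volume_name):
        # List current host mappings for a volume
        return [{'host_name': h}
                for h in sorted(self.mappings.get(volume_name, set()))]

    def rmvdiskhostmap(self, host_name, volume_name):
        # Mirrors the array's behavior: removing a nonexistent
        # mapping fails (CMMVC5842E in the log above)
        if host_name not in self.mappings.get(volume_name, set()):
            raise RuntimeError('CMMVC5842E The action failed because an '
                               'object that was specified in the command '
                               'does not exist.')
        self.mappings[volume_name].discard(host_name)


def unmap_guarded(ssh, volume_name, host_name):
    """Unmap only if a mapping to host_name actually exists (the fix)."""
    found = any(m['host_name'] == host_name
                for m in ssh.lsvdiskhostmap(volume_name))
    if found:
        ssh.rmvdiskhostmap(host_name, volume_name)
    return found
```

With a volume mapped only to compute-0, a guarded unmap request naming compute-1 (the wrong hypervisor, as in the log above) simply returns False instead of raising, which is the behavior the upstream fix in comment 13 introduces.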


Please see attached .txt for a full analysis of the issue.

Comment 1 Andreas Karis 2016-06-22 21:07:59 UTC
Hi,

Any news regarding this BZ? Is this a known issue?

Thanks,

Andreas

Comment 2 Andrew Ludwar 2016-06-23 13:57:31 UTC
I'm running into this issue with a customer as well. Customer is doing a suite of nova evacuate tests with the compute HA configured.

Comment 10 xiaoqin 2016-07-20 02:20:15 UTC
This issue has been fixed in the master branch and will be included in the Newton release. Refer to the fix in https://review.openstack.org/#/c/299673/

Comment 11 Andreas Karis 2016-07-20 15:26:02 UTC
Can we get a backport for this?

Comment 13 xiaoqin 2016-07-21 02:39:26 UTC
Theoretically we can backport the fix to the community stable branch following the community process, but the community requires that the backport match the master commit. The files changed a lot from Kilo to Mitaka, even the file names, so the backport is not recommended. Instead, you can update the function unmap_vol_from_host in your file cinder/volume/drivers/ibm/storwize_svc/helpers.py according to the fix in https://review.openstack.org/#/c/299673/, like this:

    def unmap_vol_from_host(self, volume_name, host_name):
        """Unmap the volume and delete the host if it has no more mappings."""

        LOG.debug('enter: unmap_vol_from_host: volume %(volume_name)s from '
                  'host %(host_name)s'
                  % {'volume_name': volume_name, 'host_name': host_name})

        # Check if the mapping exists
        resp = self.ssh.lsvdiskhostmap(volume_name)
        if not len(resp):
            LOG.warning(_LW('unmap_vol_from_host: No mapping of volume '
                            '%(vol_name)s to any host found.') %
                        {'vol_name': volume_name})
            return
        found = False
        if host_name is None:
            if len(resp) > 1:
                LOG.warning(_LW('unmap_vol_from_host: Multiple mappings of '
                                'volume %(vol_name)s found, no host '
                                'specified.') % {'vol_name': volume_name})
                return
            else:
                host_name = resp[0]['host_name']
                found = True
        else:
            for h in resp.select('host_name'):
                if h == host_name:
                    found = True
            if not found:
                LOG.warning(_LW('unmap_vol_from_host: No mapping of volume '
                                '%(vol_name)s to host %(host)s found.') %
                            {'vol_name': volume_name, 'host': host_name})

        # We now know that the mapping exists
        if found:
            self.ssh.rmvdiskhostmap(host_name, volume_name)

        # If this host has no more mappings, delete it
        resp = self.ssh.lshostvdiskmap(host_name)
        if not len(resp):
            self.delete_host(host_name)

        LOG.debug('leave: unmap_vol_from_host: volume %(volume_name)s from '
                  'host %(host_name)s'
                  % {'volume_name': volume_name, 'host_name': host_name})
Or you can send us your files under volume/drivers/ibm/storwize_svc/ and we can provide you a private patch.

Comment 14 Andreas Karis 2016-07-21 16:43:59 UTC
Hello,

Thanks for the info! Just to be clear: the patched version of this function looks like this ...

~~~
    def unmap_vol_from_host(self, volume_name, host_name):
        """Unmap the volume and delete the host if it has no more mappings."""

        LOG.debug('Enter: unmap_vol_from_host: volume %(volume_name)s from '
                  'host %(host_name)s.',
                  {'volume_name': volume_name, 'host_name': host_name})

        # Check if the mapping exists
        resp = self.ssh.lsvdiskhostmap(volume_name)
        if not len(resp):
            LOG.warning(_LW('unmap_vol_from_host: No mapping of volume '
                            '%(vol_name)s to any host found.'),
                        {'vol_name': volume_name})
            return host_name
        if host_name is None:
            if len(resp) > 1:
                LOG.warning(_LW('unmap_vol_from_host: Multiple mappings of '
                                'volume %(vol_name)s found, no host '
                                'specified.'), {'vol_name': volume_name})
                return
            else:
                host_name = resp[0]['host_name']
        else:
            found = False
            for h in resp.select('host_name'):
                if h == host_name:
                    found = True
            if not found:
                LOG.warning(_LW('unmap_vol_from_host: No mapping of volume '
                                '%(vol_name)s to host %(host)s found.'),
                            {'vol_name': volume_name, 'host': host_name})
                return host_name
        # We now know that the mapping exists
        self.ssh.rmvdiskhostmap(host_name, volume_name)

        LOG.debug('Leave: unmap_vol_from_host: volume %(volume_name)s from '
                  'host %(host_name)s.',
                  {'volume_name': volume_name, 'host_name': host_name})
        return host_name
~~~

However, as this is not directly compatible with the code we run, we applied a slightly different change (the same principle, applied differently):

~~~
[root@overcloud-controller-0 site-packages]# diff cinder/volume/drivers/ibm/storwize_svc/helpers.py.back cinder/volume/drivers/ibm/storwize_svc/helpers.py
337a338
> 
351a353
>         found = False
359a362
>                 found = True
361d363
<             found = False
371c373,374
<         self.ssh.rmvdiskhostmap(host_name, volume_name)
---
>         if found:
>             self.ssh.rmvdiskhostmap(host_name, volume_name)
~~~

Do I correctly understand?

Thanks!

Comment 15 xiaoqin 2016-07-22 01:56:26 UTC
Yes, you're correct.

Comment 16 Andreas Karis 2016-07-27 00:59:32 UTC
OK, we just tested this out with the customer; the fix works.

Comment 17 Andreas Karis 2016-07-28 00:41:07 UTC
Thanks, xiaoqin

@Red Hat engineering: we need a backport of this fix (comment 14) for OSP 7.

Comment 18 Sean Cohen 2016-11-16 14:52:49 UTC
Since the problem described in this bug report should be resolved in a
recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2030.html