Bug 1387590 - fence_compute - Fixes for fix_plug/domain_name and nova force_down functionality.
Summary: fence_compute - Fixes for fix_plug/domain_name and nova force_down functionality.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: fence-agents
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Andrew Beekhof
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1430393 (view as bug list)
Depends On:
Blocks: 1393789 1440487
 
Reported: 2016-10-21 10:40 UTC by Marian Krcmarik
Modified: 2020-09-10 09:52 UTC
CC List: 11 users

Fixed In Version: fence-agents-4.0.11-53.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1393789 1440487 (view as bug list)
Environment:
Last Closed: 2017-08-01 16:10:32 UTC
Target Upstream Version:
Embargoed:


Attachments
fix (7.40 KB, patch)
2016-11-10 01:57 UTC, Andrew Beekhof


Links
System ID: Red Hat Product Errata RHBA-2017:1874
Priority: normal
Status: SHIPPED_LIVE
Summary: fence-agents bug fix and enhancement update
Last Updated: 2017-08-01 17:53:05 UTC

Description Marian Krcmarik 2016-10-21 10:40:13 UTC
Description of problem:
I noticed several problems while playing with Instance HA on RHOSP10 with RHEL 7.3; some of them prevent the agent from functioning correctly, and some would be good to have to improve functionality.

The problems which may prevent the agent from functioning correctly:
1. There is one condition in fix_plug_name(options) which is effectively always true:
elif options["--plug"].find(options["--domain"]):
The find() method returns the index of the substring if found, otherwise -1, so this expression is truthy in every case except when the domain sits at index 0. The correction could be:
elif options["--domain"] in options["--plug"]:
or:
elif options["--plug"].find(options["--domain"]) > -1:
A quick demonstration of the pitfall follows.

2. fix_plug_name(options) is called in main(), and it in turn calls fix_domain(), which tries to get the hypervisor list via the nova client. The problem is that the nova client object is created only after fix_plug_name() is called, so fix_plug_name() should be called after the nova client is created (or the client should be created earlier); a sketch of the corrected ordering follows.
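A minimal sketch of the corrected ordering in main(), assuming the fencing library's usual process_input()/check_input() helpers and the create_nova_connection() function visible in the patches later in this bug:

def main():
    # ... option parsing as in the agent ...
    options = check_input(device_opt, process_input(device_opt))

    # Create the nova client first, so that fix_domain() (called from
    # fix_plug_name()) can actually list the hypervisors.
    create_nova_connection(options)

    # Only normalize the plug name once the client exists.
    fix_plug_name(options)

    # ... dispatch the requested fence action ...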

3. There is a condition in set_power_status(_, options) which determines when the status of the node should be set to "on":
        if options["--action"] == "on":
               if get_power_status(_, options) == "on":
I believe the second condition should have the opposite logic:
               if get_power_status(_, options) != "on":
because we want to set the compute node to "on" when it is not actually on.

The problems to improve the force_down functionality:
1. One of the main problems is that once the nova-compute service is marked as force_down, it will not switch back to the "up" status by itself after a reboot; the service must be explicitly unset from the force_down status. Otherwise, even though nova-compute is running and functioning well, the status of the service stays down and the nova scheduler won't use the compute node for booting VMs. It would be nice to place this step in nova-compute-wait; a sketch of the idea follows.
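A minimal sketch of what unsetting force_down at service start-up might look like (an illustration, not the agent's code; it assumes an already-authenticated novaclient instance negotiated at microversion 2.11 or higher):

import socket

def clear_force_down(nova):
    # Clear the force_down flag for the local nova-compute service once
    # it is healthy again; force_down() exists from microversion 2.11 on.
    nova.services.force_down(socket.gethostname(), "nova-compute",
            force_down=False)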

2. Nova API microversioning is somewhat strange to me, but if we want the following call to succeed:
nova.services.force_down(options["--plug"], "nova-compute", force_down=False)
we need to request microversion 2.11 or higher when creating the nova client; otherwise we get a VersionNotFoundForAPIMethod exception. I did not find any way for the nova client to fall back when the client requests a higher nova API version than the server supports; in that case we get a NotAcceptable exception together with the range of supported nova API versions.

The only workaround I was able to come up with is to query the server for its supported microversions, something like this:

# Assumes the agent's usual import, e.g.: from novaclient import client as nova_client
def get_max_api_version():

    max_version = None

    # "options" is the fence agent's parsed option dictionary.
    nova = nova_client.Client('2',
            options["--username"],
            options["--password"],
            options["--tenant-name"],
            options["--auth-url"],
            insecure=options["--insecure"],
            region_name=options["--region-name"],
            endpoint_type=options["--endpoint-type"])

    # Ask the server which API versions it exposes and pick the CURRENT one.
    versions = nova.versions.list()
    for version in versions:
        if version.status == "CURRENT":
            max_version = version.version

    # Fall back to the base "2" version if nothing was reported.
    if max_version:
        return max_version
    else:
        return "2"

The output of this method would then be used as the version for the nova client instance that fence_compute uses to query OpenStack; a usage sketch follows. The method would return 2.3 for RHOS7 (so force_down not supported), 2.12 for RHOS8 (force_down supported), 2.27 for RHOS9, and so on.
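A minimal usage sketch under the same assumptions as the function above (the nova_client import and the agent's options dictionary); the wiring shown here is illustrative, not the agent's actual code:

# Build the real client with the highest server-supported version.
version = get_max_api_version()
nova = nova_client.Client(version,
        options["--username"],
        options["--password"],
        options["--tenant-name"],
        options["--auth-url"],
        insecure=options["--insecure"],
        region_name=options["--region-name"],
        endpoint_type=options["--endpoint-type"])

# force_down() is only usable when the negotiated version is >= 2.11.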

Version-Release number of selected component (if applicable):
fence-agents-compute-4.0.11-47.el7.x86_64

How reproducible:
Always


Additional info:
If needed, we can break the valid problems into separate bugs and use this one as a tracker; up to the assignee.

Comment 3 Andrew Beekhof 2016-11-09 05:56:39 UTC
For every version (2 -> 2.27) I get:

{u'version': {u'status': u'CURRENT', u'updated': u'2013-07-23T11:33:21Z', u'links': [{u'href': u'https://192.168.24.2:13774/v2.1/', u'rel': u'self'}, {u'href': u'http://docs.openstack.org/', u'type': u'text/html', u'rel': u'describedby'}], u'min_version': u'2.1', u'version': u'2.38', u'media-types': [{u'base': u'application/json', u'type': u'application/vnd.openstack.compute+json;version=2.1'}], u'id': u'v2.1'}}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/novaclient/v2/versions.py", line 104, in list
    return self._list(version_url, "versions")
  File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 255, in _list
    data = body[response_key]
KeyError: 'versions'

when calling nova.versions.list()

Comment 4 Marian Krcmarik 2016-11-09 16:33:54 UTC
(In reply to Andrew Beekhof from comment #3)
> For every version (2 -> 2.27) I get: [...traceback snipped; see comment #3...]

I am getting the same reply (traceback) from RHOSP10. It works without problems on older releases (even on some older RHOSP10 puddles), so I am not sure whether it is a bug, and if so, where exactly.
The nova server now returns a dictionary with one element whose key is "version"; it used to return a dictionary with one element called "versions" whose value was a list of version dictionaries. So I guess this is either a bug in nova or a change of behaviour?

Comment 5 Marian Krcmarik 2016-11-09 17:04:50 UTC
(In reply to Marian Krcmarik from comment #4)
> [...same traceback and analysis as in comment #4...]

Maybe let's just create a new nova client instance with version 2.11 specified (the first API version where force_down was introduced), to be used only for calling nova.services.force_down(). If the call of nova.services.force_down() raises novaclient.exceptions.NotAcceptable, then the fence agent would assume force_down is not supported on that version and skip it. A sketch of this idea follows.
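A minimal sketch of that fallback, assuming the agent's options dictionary; make_nova_client() is a hypothetical helper standing in for however the dedicated 2.11 client would be built:

import logging

from novaclient import exceptions as nova_exceptions

def set_service_force_down(options, force_down):
    # Dedicated client at 2.11, the first microversion offering force_down.
    nova = make_nova_client("2.11", options)  # hypothetical helper
    try:
        nova.services.force_down(
                options["--plug"], "nova-compute", force_down=force_down)
        return True
    except nova_exceptions.NotAcceptable:
        # Server older than 2.11: force_down unsupported, skip it.
        logging.warning("force_down not supported by this nova API, skipping")
        return False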

Comment 6 Andrew Beekhof 2016-11-09 23:28:07 UTC
(In reply to Marian Krcmarik from comment #4)
> [...same traceback and analysis as in comment #4...]

After much investigation, the reason is that this API call requires a session.
One can verify this by changing use_session to 'False' in /usr/lib/python2.7/site-packages/novaclient/shell.py:

        # Do not use Keystone session for cases with no session support. The
        # presence of auth_plugin means os_auth_system is present and is not
        # keystone.
        use_session = True

And re-running: nova version-list

Comment 7 Andrew Beekhof 2016-11-10 00:19:45 UTC
It is possible to use the versions.list() call if the client is created as:

from novaclient import client
from novaclient import api_versions

from keystoneauth1 import loading
from novaclient.shell import OpenStackComputeShell

shell = OpenStackComputeShell()
parser = shell.get_base_parser([])
(args, args_list) = parser.parse_known_args([])

keystone_session = loading.load_session_from_argparse_arguments(args)
keystone_auth = loading.load_auth_from_argparse_arguments(args)

nova = client.Client(api_versions.APIVersion("2.0"), 'admin', None, 'admin',
        'https://192.168.24.2:13000/v2.0',
        session=keystone_session, auth=keystone_auth)


But that seems like it would be more fragile, not less.

Comment 8 Andrew Beekhof 2016-11-10 01:57:53 UTC
Created attachment 1219127 [details]
fix

This patch appears to do the trick.

Comment 9 Andrew Beekhof 2016-11-10 02:02:04 UTC
Scratch build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12069910

Comment 11 Marian Krcmarik 2016-11-11 23:19:46 UTC
It seems based on the testing of the build with included patch that all the problems were solved except for one when nova compute service remains to be marked as down even though compute node is up and running again after fencing. I created a separate bug for that as agreed with Andrew - https://bugzilla.redhat.com/show_bug.cgi?id=1394418

Comment 14 Chen 2017-02-16 02:47:11 UTC
Hi Andrew,

The patch in comment #8 seems to have a problem. Please forgive me if I'm wrong.

The version I'm using is fence-agents-compute-4.0.11-47.el7_3.2.x86_64.

The create_nova_connection() function talks to the overcloud nova when fence-nova starts, but fence-nova cannot start and emits the following error. This output appears after running "pcs cluster stop --all" followed by "pcs cluster start --all":

Feb 16 02:14:17 [12351] overcloud-controller-0.localdomain stonith-ng:  warning: log_action:    fence_compute[13428] stderr: [ Nova connection failed. ConnectionError: ('Connection aborted.', error(111, 'Connection refused')) ]
Feb 16 02:14:17 [12351] overcloud-controller-0.localdomain stonith-ng:  warning: log_action:    fence_compute[13428] stderr: [ Nova connection failed. ConnectionError: ('Connection aborted.', error(111, 'Connection refused')) ]
Feb 16 02:14:17 [12351] overcloud-controller-0.localdomain stonith-ng:  warning: log_action:    fence_compute[13428] stderr: [ Couldn't obtain a supported connection to nova, tried: ['2.11', '2'] ]
Feb 16 02:14:17 [12351] overcloud-controller-0.localdomain stonith-ng:  warning: log_action:    fence_compute[13428] stderr: [  ]
Feb 16 02:14:17 [12351] overcloud-controller-0.localdomain stonith-ng:  warning: log_action:    fence_compute[13428] stderr: [ Please use '-h' for usage ]
Feb 16 02:14:17 [12351] overcloud-controller-0.localdomain stonith-ng:  warning: log_action:    fence_compute[13428] stderr: [  ]

So I think that when fence-nova starts, the OpenStack cluster is not yet ready to provide the nova service (or the floating IP is not ready, etc.). As a result, fence_compute cannot talk to nova and fence-nova cannot start.

I confirmed that after the cluster starts up, the following script produces no error output.

from novaclient import client

# Try the microversion needed for force_down first, then plain "2".
versions = [ "2.11", "2" ]
for version in versions:
    nova = client.Client(version, "admin", "jFFG4PzWPmqUaTCVc9FEJTWkJ", "admin",
            "http://10.0.0.4:5000/v2.0")
    try:
        # Any authenticated call is enough to prove connectivity.
        nova.hypervisors.list()
        print "ok"
    except Exception as e:
        print "Nova connection failed. %s: %s" % (e.__class__.__name__, e)

I tried "pcs stonith cleanup fence-nova", but it seems that the whole cluster gets cleaned up, not only fence-nova, so this does not help to solve the issue.

Best Regards,
Chen

Comment 15 Manabu Ori 2017-02-16 06:17:13 UTC
Hi,

With Andrew's and Chen's support, I commented out the fail_usage() line in /sbin/fence_compute and then succeeded in starting fence-nova, on OSP8 with fence-agents-compute-4.0.11-47.el7_3.2.x86_64.

# diff -u /sbin/fence_compute.orig /sbin/fence_compute
--- /sbin/fence_compute.orig    2017-02-16 14:37:50.256058816 +0900
+++ /sbin/fence_compute 2017-02-16 14:39:24.897601432 +0900
@@ -332,7 +332,7 @@
                except Exception as e:
                        logging.warning("Nova connection failed. %s: %s" % (e.__class__.__name__, e))

-       fail_usage("Couldn't obtain a supported connection to nova, tried: %s" % repr(versions))
+       #fail_usage("Couldn't obtain a supported connection to nova, tried: %s" % repr(versions))

 def define_new_opts():
        all_opt["endpoint-type"] = {

Comment 16 Udi Shkalim 2017-02-20 12:20:21 UTC
Verified based on Comment #11
fence-agents-4.0.11-51.el7

Comment 17 Andrew Beekhof 2017-02-20 22:18:44 UTC
(In reply to Udi Shkalim from comment #16)
> Verified based on Comment #11
> fence-agents-4.0.11-51.el7

Udi, we may need to create an additional test, as we didn't notice that the agent breaks when nova isn't up.

Moving back to modified :-(

Comment 18 Andrew Beekhof 2017-02-20 22:31:07 UTC
This is the patch we want: we still want to log the failure, but it shouldn't be fatal on its own, since all uses of nova include a check for it being set first.

diff --git a/fence/agents/compute/fence_compute.py b/fence/agents/compute/fence_compute.py
index 0a238b6..bc4cb5b 100644
--- a/fence/agents/compute/fence_compute.py
+++ b/fence/agents/compute/fence_compute.py
@@ -329,7 +329,7 @@ def create_nova_connection(options):
                except Exception as e:
                        logging.warning("Nova connection failed. %s: %s" % (e.__class__.__name__, e))
                        
-       fail_usage("Couldn't obtain a supported connection to nova, tried: %s" % repr(versions))
+       logging.warning("Couldn't obtain a supported connection to nova, tried: %s\n" % repr(versions))
 
 def define_new_opts():
        all_opt["endpoint-type"] = {

Comment 19 Oyvind Albrigtsen 2017-02-21 13:53:27 UTC
(In reply to Andrew Beekhof from comment #18)
> This is the patch we want...
New build with the new patch.

Comment 21 Andrew Beekhof 2017-03-15 03:33:13 UTC
*** Bug 1430393 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2017-08-01 16:10:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1874

