Bug 1760213 - [OSP16] fence_compute: IHA is not working any longer due to a change in behaviour of nova's service disabling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: fence-agents
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Oyvind Albrigtsen
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks: 1762432
 
Reported: 2019-10-10 07:27 UTC by Michele Baldessari
Modified: 2020-04-28 15:30 UTC
CC List: 14 users

Fixed In Version: fence-agents-4.2.1-32.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1762432
Environment:
Last Closed: 2020-04-28 15:30:15 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Links
Red Hat Product Errata RHBA-2020:1580 (last updated 2020-04-28 15:30:36 UTC)

Description Michele Baldessari 2019-10-10 07:27:45 UTC
Description of problem:
The fence_compute agent, which gets invoked right after a node is IPMI-fenced, times out
after 60s in OSP16, because it calls nova service-disable <nova-compute> and that call now
seems to require the compute node to be up.

The fence_compute agent roughly does the following after a node is fenced (sketched below):
A. nova service-disable <nova-compute-X>
B. nova service-force-down <nova-compute-X>
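
For reference, steps A and B map onto the nova compute-services API roughly as in the sketch below. This uses python-novaclient with a keystoneauth1 session and is only illustrative, not the actual fence_compute.py code; the auth URL, credentials and host name are placeholders:

from keystoneauth1 import loading, session
from novaclient import client

# Build an authenticated nova client (credentials are placeholders).
loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="http://keystone.example:5000/v3",
    username="admin", password="secret", project_name="admin",
    user_domain_name="Default", project_domain_name="Default")
nova = client.Client("2.11", session=session.Session(auth=auth))  # 2.11 adds force-down

host = "compute-1.localdomain"  # the fenced compute node

# A. Disable the nova-compute service so the scheduler stops placing new
#    instances on the fenced host. In OSP16 the API also tries to sync the
#    COMPUTE_STATUS_DISABLED trait to the compute service, which is what
#    blocks when the node is down.
nova.services.disable(host, "nova-compute")

# B. Mark the service as forced down so evacuation can proceed without
#    waiting for nova to notice on its own that the service is gone.
nova.services.force_down(host, "nova-compute", force_down=True)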

This all worked more or less fine up to and including OSP15, but we noticed that with OSP16
nova service-disable (step A above) *requires* the compute node to be up.
When it is not up, we get the following traceback:
2019-10-09 12:54:17.123 19 DEBUG nova.api.openstack.wsgi [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Action: 'update', calling method: <function Controller.__getattribute__.<locals>.version_select at 0x7f98b0eb1488>, body: {"host": "compute-1.localdomain", "binary": "nova-compute"} _process_stack /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:520
2019-10-09 12:54:17.164 19 DEBUG oslo_concurrency.lockutils [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Lock "24116fad-1e1e-498b-a507-f03e12778ea9" acquired by "nova.context.set_target_cell.<locals>.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2019-10-09 12:54:17.165 19 DEBUG oslo_concurrency.lockutils [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Lock "24116fad-1e1e-498b-a507-f03e12778ea9" released by "nova.context.set_target_cell.<locals>.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-10-09 12:54:17.214 19 DEBUG nova.compute.api [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Calling the compute service on host compute-1.localdomain to sync the COMPUTE_STATUS_DISABLED trait. _update_compute_provider_status /usr/lib/python3.6/site-packages/nova/compute/api.py:5173

...timeout...

2019-10-09 12:55:17.222 19 ERROR nova.compute.api [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host compute-1.localdomain. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 44659cb2e09744fcb1fc4ddc75dac1d1
2019-10-09 12:55:17.222 19 ERROR nova.compute.api Traceback (most recent call last):
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     return self._queues[msg_id].get(block=True, timeout=timeout)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     return waiter.wait()
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     return get_hub().switch()
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 297, in switch
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     return self.greenlet.switch()
2019-10-09 12:55:17.222 19 ERROR nova.compute.api queue.Empty
2019-10-09 12:55:17.222 19 ERROR nova.compute.api
2019-10-09 12:55:17.222 19 ERROR nova.compute.api During handling of the above exception, another exception occurred:
2019-10-09 12:55:17.222 19 ERROR nova.compute.api
2019-10-09 12:55:17.222 19 ERROR nova.compute.api Traceback (most recent call last):
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 5174, in _update_compute_provider_status
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     self.rpcapi.set_host_enabled(context, service.host, enabled)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/nova/compute/rpcapi.py", line 1017, in set_host_enabled
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     transport_options=self.transport_options)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     transport_options=transport_options)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     transport_options=transport_options)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     call_monitor_timeout)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     message = self.waiters.get(msg_id, timeout=timeout)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
2019-10-09 12:55:17.222 19 ERROR nova.compute.api     'to message ID %s' % msg_id)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 44659cb2e09744fcb1fc4ddc75dac1d1
2019-10-09 12:55:17.222 19 ERROR nova.compute.api
2019-10-09 12:55:17.227 19 INFO nova.api.openstack.requestlog [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] 172.17.1.112 "PUT /v2.1/os-services/disable" status: 200 len: 94 microversion: 2.11 time: 60.112760

(When compute-0 is up, nova service-disable works just fine.)
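
To see the timeout outside the agent, the same API call can simply be timed against the powered-off compute, roughly as follows (a sketch that reuses the authenticated `nova` client from the snippet above; the roughly 60s it takes to return matches the messaging timeout in the log):

import time

start = time.time()
# With compute-1 powered off, this is the call that blocks in OSP16: the API
# RPCs the (dead) compute service to sync the COMPUTE_STATUS_DISABLED trait
# and only returns once that messaging call times out.
nova.services.disable("compute-1.localdomain", "nova-compute")
print("nova service-disable returned after %.1fs" % (time.time() - start))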

The change that implemented this in nova is:
https://review.opendev.org/#/c/654596/ 

This BZ is to figure out/discuss whether we are okay to simply skip disabling the nova-compute service for the fenced node in the fence_compute agent (https://github.com/ClusterLabs/fence-agents/blob/master/agents/compute/fence_compute.py#L120), or whether we should amend nova so that the call does not time out after 60 seconds.
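
To make the first option concrete, skipping (or gating) the disable step could look roughly like the sketch below. This is hypothetical only; the helper name and the disable_service switch are made up here and are not the patch that eventually shipped in fence-agents-4.2.1-32.el8:

def _mark_compute_fenced(nova, host, disable_service=False):
    """Mark a fenced compute node out of service in nova.

    With disable_service=False the nova service-disable call is skipped,
    because in OSP16 it blocks for ~60s when the compute is unreachable;
    force_down alone is enough for the evacuation to proceed.
    """
    if disable_service:
        nova.services.disable(host, "nova-compute")
    # Always mark the service forced down so instances can be evacuated.
    nova.services.force_down(host, "nova-compute", force_down=True)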

Comment 5 pkomarov 2019-11-05 12:26:44 UTC
Verified,

the IHA process is working as expected:

(undercloud) [stack@undercloud-0 ~]$ ansible computeInstanceHA -b -mshell -a'rpm -q fence-agents-compute'


overcloud-novacomputeiha-0 | CHANGED | rc=0 >>
fence-agents-compute-4.2.1-30.el8_1.1.noarch

overcloud-novacomputeiha-1 | CHANGED | rc=0 >>
fence-agents-compute-4.2.1-30.el8_1.1.noarch

(overcloud) [stack@undercloud-0 ~]$ openstack server show osvm2|grep hypervisor_hostnam
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-novacomputeiha-0.redhat.local    


(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-0 -b -mshell -a'virsh list'
[WARNING]: Found both group and host with same name: undercloud

overcloud-novacomputeiha-0 | CHANGED | rc=0 >>
 Id    Name                           State
----------------------------------------------------
 1     instance-0000000b              running

(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-0 -b -mshell -a'echo b >/proc/sysrq-trigger'
[WARNING]: Found both group and host with same name: undercloud

overcloud-novacomputeiha-0 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: Shared connection to 192.168.24.30 closed.",
    "unreachable": true
}


# From the logs:
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (stage6) 	warning: Scheduling Node overcloud-novacomputeiha-0 for STONITH
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (native_stop_constraints) 	info: compute-unfence-trigger:0_stop_0 is implicit after overcloud-novacomputeiha-0 is fenced
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (LogNodeActions) 	notice:  * Fence (reboot) overcloud-novacomputeiha-0 'remote connection is unrecoverable'
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (LogAction) 	notice:  * Recover    overcloud-novacomputeiha-0           (      

Nov 05 12:16:00 controller-2 pacemaker-fenced    [25356] (call_remote_stonith) 	info: Requesting that 'controller-0' perform op 'overcloud-novacomputeiha-0 off' with 'stonith-fence_ipmilan-5254000cae5a' for pacemaker-controld.25360 (72s)


Nov 05 12:16:05 controller-2 pacemaker-attrd     [25358] (attrd_peer_update) 	info: Setting evacuate[overcloud-novacomputeiha-0.redhat.local]: (null) -> yes from controller-0

(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-0 -b -mshell -a'virsh list'
[WARNING]: Found both group and host with same name: undercloud

overcloud-novacomputeiha-0 | CHANGED | rc=0 >>
 Id    Name                           State
----------------------------------------------------

(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-1 -b -mshell -a'virsh list'
[WARNING]: Found both group and host with same name: undercloud

overcloud-novacomputeiha-1 | CHANGED | rc=0 >>
 Id    Name                           State
----------------------------------------------------
 1     instance-0000000b              running

(overcloud) [stack@undercloud-0 ~]$ openstack server show osvm2|grep hypervisor_hostnam
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-novacomputeiha-1.redhat.local

Comment 7 errata-xmlrpc 2020-04-28 15:30:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1580

