Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
DescriptionMichele Baldessari
2019-10-10 07:27:45 UTC
Description of problem:
the fence_compute agent which gets invoked right after a node is IPMI-fenced, times out
after 60s in OSP16. Because it calls nova service-disable <nova-compute> and that now seems to
require the compute node to be up
The fence_compute agent roughly does this after a node is fenced:
A. nova service-disable <nova-compute-X>
B. nova service-force-down <nova-compute-X>
This all worked more or less fine osp15 included but we noted that with osp16
(1.) nova service-disable *requires* the compute to be up for this to work.
When it is not up we get the following traceback:
10-09 12:54:17.123 19 DEBUG nova.api.openstack.wsgi [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Action: 'update', calling method: <function Controller.__getattribute__.<locals>.version_sel
ect at 0x7f98b0eb1488>, body: {"host": "compute-1.localdomain", "binary": "nova-compute"} _process_stack /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:520
2019-10-09 12:54:17.164 19 DEBUG oslo_concurrency.lockutils [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Lock "24116fad-1e1e-498b-a507-f03e12778ea9" acquired by "nova.context.set_target_cel
l.<locals>.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2019-10-09 12:54:17.165 19 DEBUG oslo_concurrency.lockutils [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Lock "24116fad-1e1e-498b-a507-f03e12778ea9" released by "nova.context.set_target_cel
l.<locals>.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-10-09 12:54:17.214 19 DEBUG nova.compute.api [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Calling the compute service on host compute-1.localdomain to sync the COMPUTE_STATUS_DISABLED
trait. _update_compute_provider_status /usr/lib/python3.6/site-packages/nova/compute/api.py:5173
...timeout...
2019-10-09 12:55:17.222 19 ERROR nova.compute.api [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource pr
oviders managed by host compute-1.localdomain. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 44659cb2e097
44fcb1fc4ddc75dac1d1
2019-10-09 12:55:17.222 19 ERROR nova.compute.api Traceback (most recent call last):
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2019-10-09 12:55:17.222 19 ERROR nova.compute.api return self._queues[msg_id].get(block=True, timeout=timeout)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2019-10-09 12:55:17.222 19 ERROR nova.compute.api return waiter.wait()
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2019-10-09 12:55:17.222 19 ERROR nova.compute.api return get_hub().switch()
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 297, in switch
2019-10-09 12:55:17.222 19 ERROR nova.compute.api return self.greenlet.switch()
2019-10-09 12:55:17.222 19 ERROR nova.compute.api queue.Empty
2019-10-09 12:55:17.222 19 ERROR nova.compute.api
2019-10-09 12:55:17.222 19 ERROR nova.compute.api During handling of the above exception, another exception occurred:
2019-10-09 12:55:17.222 19 ERROR nova.compute.api
2019-10-09 12:55:17.222 19 ERROR nova.compute.api Traceback (most recent call last):
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 5174, in _update_compute_provider_status
2019-10-09 12:55:17.222 19 ERROR nova.compute.api self.rpcapi.set_host_enabled(context, service.host, enabled)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/nova/compute/rpcapi.py", line 1017, in set_host_enabled
2019-10-09 12:55:17.222 19 ERROR nova.compute.api return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2019-10-09 12:55:17.222 19 ERROR nova.compute.api transport_options=self.transport_options)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2019-10-09 12:55:17.222 19 ERROR nova.compute.api transport_options=transport_options)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
2019-10-09 12:55:17.222 19 ERROR nova.compute.api transport_options=transport_options)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2019-10-09 12:55:17.222 19 ERROR nova.compute.api call_monitor_timeout)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
2019-10-09 12:55:17.222 19 ERROR nova.compute.api message = self.waiters.get(msg_id, timeout=timeout)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
2019-10-09 12:55:17.222 19 ERROR nova.compute.api 'to message ID %s' % msg_id)
2019-10-09 12:55:17.222 19 ERROR nova.compute.api oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 44659cb2e09744fcb1fc4ddc75dac1d1
2019-10-09 12:55:17.222 19 ERROR nova.compute.api
2019-10-09 12:55:17.227 19 INFO nova.api.openstack.requestlog [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] 172.17.1.112 "PUT /v2.1/os-services/disable" status: 200 len: 94 microversion: 2.1
1 time: 60.112760
(When compute-0 is up, then nova service-disable works just fine)
The change that implemented this in nova is:
https://review.opendev.org/#/c/654596/
This BZ is to figure out/discuss if we're okay to simply skip disabling the nova-compute service for the node in the fence_compute agent (https://github.com/ClusterLabs/fence-agents/blob/master/agents/compute/fence_compute.py#L120) or if we should see to amend nova to somehow not time out after 60seconds.
Verified ,
the IHA process is working as excpected:
(undercloud) [stack@undercloud-0 ~]$ ansible computeInstanceHA -b -mshell -a'rpm -q fence-agents-compute'
overcloud-novacomputeiha-0 | CHANGED | rc=0 >>
fence-agents-compute-4.2.1-30.el8_1.1.noarch
overcloud-novacomputeiha-1 | CHANGED | rc=0 >>
fence-agents-compute-4.2.1-30.el8_1.1.noarch
(overcloud) [stack@undercloud-0 ~]$ openstack server show osvm2|grep hypervisor_hostnam
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-novacomputeiha-0.redhat.local
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-0 -b -mshell -a'virsh list'
[WARNING]: Found both group and host with same name: undercloud
overcloud-novacomputeiha-0 | CHANGED | rc=0 >>
Id Name State
----------------------------------------------------
1 instance-0000000b running
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-0 -b -mshell -a'echo b >/proc/sysrq-trigger'
[WARNING]: Found both group and host with same name: undercloud
overcloud-novacomputeiha-0 | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: Shared connection to 192.168.24.30 closed.",
"unreachable": true
}
#from the logs
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (stage6) warning: Scheduling Node overcloud-novacomputeiha-0 for STONITH
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (native_stop_constraints) info: compute-unfence-trigger:0_stop_0 is implicit after overcloud-novacomputeiha-0 is fenced
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (LogNodeActions) notice: * Fence (reboot) overcloud-novacomputeiha-0 'remote connection is unrecoverable'
Nov 05 12:15:58 controller-2 pacemaker-schedulerd[25359] (LogAction) notice: * Recover overcloud-novacomputeiha-0 (
Nov 05 12:16:00 controller-2 pacemaker-fenced [25356] (call_remote_stonith) info: Requesting that 'controller-0' perform op 'overcloud-novacomputeiha-0 off' with 'stonith-fence_ipmilan-5254000cae5a' for pacemaker-controld.25360 (72s)
Nov 05 12:16:05 controller-2 pacemaker-attrd [25358] (attrd_peer_update) info: Setting evacuate[overcloud-novacomputeiha-0.redhat.local]: (null) -> yes from controller-0
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-0 -b -mshell -a'virsh list'
[WARNING]: Found both group and host with same name: undercloud
overcloud-novacomputeiha-0 | CHANGED | rc=0 >>
Id Name State
----------------------------------------------------
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud-novacomputeiha-1 -b -mshell -a'virsh list'
[WARNING]: Found both group and host with same name: undercloud
overcloud-novacomputeiha-1 | CHANGED | rc=0 >>
Id Name State
----------------------------------------------------
1 instance-0000000b running
(overcloud) [stack@undercloud-0 ~]$ openstack server show osvm2|grep hypervisor_hostnam
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-novacomputeiha-1.redhat.local
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:1580
Description of problem: the fence_compute agent which gets invoked right after a node is IPMI-fenced, times out after 60s in OSP16. Because it calls nova service-disable <nova-compute> and that now seems to require the compute node to be up The fence_compute agent roughly does this after a node is fenced: A. nova service-disable <nova-compute-X> B. nova service-force-down <nova-compute-X> This all worked more or less fine osp15 included but we noted that with osp16 (1.) nova service-disable *requires* the compute to be up for this to work. When it is not up we get the following traceback: 10-09 12:54:17.123 19 DEBUG nova.api.openstack.wsgi [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Action: 'update', calling method: <function Controller.__getattribute__.<locals>.version_sel ect at 0x7f98b0eb1488>, body: {"host": "compute-1.localdomain", "binary": "nova-compute"} _process_stack /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:520 2019-10-09 12:54:17.164 19 DEBUG oslo_concurrency.lockutils [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Lock "24116fad-1e1e-498b-a507-f03e12778ea9" acquired by "nova.context.set_target_cel l.<locals>.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327 2019-10-09 12:54:17.165 19 DEBUG oslo_concurrency.lockutils [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Lock "24116fad-1e1e-498b-a507-f03e12778ea9" released by "nova.context.set_target_cel l.<locals>.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339 2019-10-09 12:54:17.214 19 DEBUG nova.compute.api [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] Calling the compute service on host compute-1.localdomain to sync the COMPUTE_STATUS_DISABLED trait. _update_compute_provider_status /usr/lib/python3.6/site-packages/nova/compute/api.py:5173 ...timeout... 2019-10-09 12:55:17.222 19 ERROR nova.compute.api [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource pr oviders managed by host compute-1.localdomain. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 44659cb2e097 44fcb1fc4ddc75dac1d1 2019-10-09 12:55:17.222 19 ERROR nova.compute.api Traceback (most recent call last): 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get 2019-10-09 12:55:17.222 19 ERROR nova.compute.api return self._queues[msg_id].get(block=True, timeout=timeout) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get 2019-10-09 12:55:17.222 19 ERROR nova.compute.api return waiter.wait() 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait 2019-10-09 12:55:17.222 19 ERROR nova.compute.api return get_hub().switch() 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 297, in switch 2019-10-09 12:55:17.222 19 ERROR nova.compute.api return self.greenlet.switch() 2019-10-09 12:55:17.222 19 ERROR nova.compute.api queue.Empty 2019-10-09 12:55:17.222 19 ERROR nova.compute.api 2019-10-09 12:55:17.222 19 ERROR nova.compute.api During handling of the above exception, another exception occurred: 2019-10-09 12:55:17.222 19 ERROR nova.compute.api 2019-10-09 12:55:17.222 19 ERROR nova.compute.api Traceback (most recent call last): 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 5174, in _update_compute_provider_status 2019-10-09 12:55:17.222 19 ERROR nova.compute.api self.rpcapi.set_host_enabled(context, service.host, enabled) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/nova/compute/rpcapi.py", line 1017, in set_host_enabled 2019-10-09 12:55:17.222 19 ERROR nova.compute.api return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call 2019-10-09 12:55:17.222 19 ERROR nova.compute.api transport_options=self.transport_options) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send 2019-10-09 12:55:17.222 19 ERROR nova.compute.api transport_options=transport_options) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send 2019-10-09 12:55:17.222 19 ERROR nova.compute.api transport_options=transport_options) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send 2019-10-09 12:55:17.222 19 ERROR nova.compute.api call_monitor_timeout) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait 2019-10-09 12:55:17.222 19 ERROR nova.compute.api message = self.waiters.get(msg_id, timeout=timeout) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get 2019-10-09 12:55:17.222 19 ERROR nova.compute.api 'to message ID %s' % msg_id) 2019-10-09 12:55:17.222 19 ERROR nova.compute.api oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 44659cb2e09744fcb1fc4ddc75dac1d1 2019-10-09 12:55:17.222 19 ERROR nova.compute.api 2019-10-09 12:55:17.227 19 INFO nova.api.openstack.requestlog [req-614d4e81-a935-437d-a343-473dbacec331 cae75ef33c1a4b31b4406c99fdca6695 bb4f0969a4f34680b48c0e74622c41c5 - default default] 172.17.1.112 "PUT /v2.1/os-services/disable" status: 200 len: 94 microversion: 2.1 1 time: 60.112760 (When compute-0 is up, then nova service-disable works just fine) The change that implemented this in nova is: https://review.opendev.org/#/c/654596/ This BZ is to figure out/discuss if we're okay to simply skip disabling the nova-compute service for the node in the fence_compute agent (https://github.com/ClusterLabs/fence-agents/blob/master/agents/compute/fence_compute.py#L120) or if we should see to amend nova to somehow not time out after 60seconds.