Bug 1972774 - [OSP16.2] Timeout reached while cleaning the ironic nodes due to network issue: ovn-db-bundle hasn't been started properly in pacemaker
Summary: [OSP16.2] Timeout reached while cleaning the ironic nodes due to network issue: ovn-db-bundle hasn't been started properly in pacemaker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Lucas Alvares Gomes
QA Contact: Roman Safronov
URL:
Whiteboard:
Duplicates: 1983958
Depends On:
Blocks: 1880120 1970051
 
Reported: 2021-06-16 15:25 UTC by Jason Paroly
Modified: 2022-03-24 02:20 UTC
CC List: 23 users

Fixed In Version: python-networking-ovn-7.4.2-2.20210601204820.el8ost.7
Doc Type: Bug Fix
Doc Text:
This update fixes an issue that caused Networking service (neutron) agents, such as the DHCP agent, to fail when they tried to create resources in OVN because ML2/OVN prevented RPC workers from connecting to the OVN southbound database.
Clone Of:
Environment:
Last Closed: 2021-09-15 07:16:23 UTC
Target Upstream Version:
Embargoed:
jparoly: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1935983 0 None None None 2021-07-13 15:55:20 UTC
OpenStack gerrit 800679 0 None NEW [OVN] Fix ML2/OVN + Neutron DHCP agent use case 2021-07-13 16:11:23 UTC
Red Hat Issue Tracker OSP-5528 0 None None None 2022-03-24 02:20:07 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:17:00 UTC

Comment 2 Julia Kreger 2021-06-17 14:33:39 UTC
It appears that the Neutron DHCP agent is failing to operate... well... at all. It is not launching dnsmasq, and it is reporting a fairly gnarly exception:

2021-06-17 14:01:11.487 124511 INFO neutron.agent.dhcp.agent [req-1e24eaf3-2d99-45de-9625-0330b17d9563 - - - - -] Synchronizing state
2021-06-17 14:01:11.765 124511 INFO neutron.agent.dhcp.agent [req-0db2c9f9-7c40-4645-93a2-2ed4a58b309e - - - - -] All active networks have been fetched through RPC.
2021-06-17 14:01:11.765 124511 INFO neutron.agent.dhcp.agent [-] Starting network c6114048-d279-4c8e-abb6-c77930bb785e dhcp configuration
2021-06-17 14:01:11.766 124511 DEBUG neutron.agent.dhcp.agent [-] Calling driver for network: c6114048-d279-4c8e-abb6-c77930bb785e action: enable call_driver /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:194
2021-06-17 14:01:11.767 124511 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/c6114048-d279-4c8e-abb6-c77930bb785e/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:262
2021-06-17 14:01:11.767 124511 DEBUG neutron.agent.linux.dhcp [-] DHCP port dhcpaebe17f8-776d-5ab6-9a5f-e9bdeeaca66f-c6114048-d279-4c8e-abb6-c77930bb785e on network c6114048-d279-4c8e-abb6-c77930bb785e does not yet exist. Checking for a reserved port. _setup_reserved_dhcp_port /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:1492
2021-06-17 14:01:11.767 124511 DEBUG neutron.agent.linux.dhcp [-] DHCP port dhcpaebe17f8-776d-5ab6-9a5f-e9bdeeaca66f-c6114048-d279-4c8e-abb6-c77930bb785e on network c6114048-d279-4c8e-abb6-c77930bb785e does not yet exist. Creating new one. _setup_new_dhcp_port /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:1506
2021-06-17 14:01:12.335 124511 DEBUG neutron.agent.linux.dhcp [req-60392d32-dca5-439b-af33-64cf2e8ad91e - - - - -] Cleaning stale devices for network c6114048-d279-4c8e-abb6-c77930bb785e _cleanup_stale_devices /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:1607
2021-06-17 14:01:12.337 161507 DEBUG oslo.privsep.daemon [-] privsep: reply[140577667895368]: (4, []) _call_back /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:511
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent [req-60392d32-dca5-439b-af33-64cf2e8ad91e - - - - -] Unable to enable dhcp for c6114048-d279-4c8e-abb6-c77930bb785e.: oslo_messaging.rpc.client.RemoteError: Remote error: MechanismDriverError 
['Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming\n    res = self.dispatcher.dispatch(message)\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch\n    return self._do_dispatch(endpoint, method, ctxt, args)\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch\n    result = func(ctxt, **new_args)\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner\n    return func(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n    setattr(e, \'_RETRY_EXCEEDED\', True)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n    ectxt.value = e.inner_exc\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n    LOG.debug("Retry wrapper got retriable exception: %s", e)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n    return f(*dup_args, **dup_kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron/quota/resource_registry.py", line 94, in wrapper\n    ret_val = f(_self, context, *args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 274, in create_dhcp_port\n    return self._port_action(plugin, context, port, \'create_port\')\n', '  File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 99, in _port_action\n    return p_utils.create_port(plugin, context, port)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/plugins/utils.py", line 336, in create_port\n    context, {port_apidef.RESOURCE_NAME: port_data})\n', '  File "/usr/lib/python3.6/site-packages/neutron/common/utils.py", line 687, in inner\n    return f(self, context, *args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 233, in wrapped\n    return method(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n    setattr(e, \'_RETRY_EXCEEDED\', True)\n', '  File 
"/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n    ectxt.value = e.inner_exc\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n    LOG.debug("Retry wrapper got retriable exception: %s", e)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n    return f(*dup_args, **dup_kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1435, in create_port\n    result, mech_context = self._create_port_db(context, port)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1425, in _create_port_db\n    self.mechanism_manager.create_port_precommit(mech_context)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 701, in create_port_precommit\n    raise_db_retriable=True)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 495, in _call_on_drivers\n    errors=errors\n', 'neutron.plugins.ml2.common.exceptions.MechanismDriverError\n'].
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 209, in call_driver
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 235, in enable
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     common_utils.wait_until_true(self._enable, timeout=300)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/common/utils.py", line 703, in wait_until_true
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     while not predicate():
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 246, in _enable
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     interface_name = self.device_manager.setup(self.network)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1640, in setup
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     self._cleanup_stale_devices(network, dhcp_port=None)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     self.force_reraise()
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     six.reraise(self.type_, self.value, self.tb)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     raise value
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1635, in setup
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     port = self.setup_dhcp_port(network)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1574, in setup_dhcp_port
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     dhcp_port = setup_method(network, device_id, dhcp_subnets)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1522, in _setup_new_dhcp_port
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     return self.plugin.create_dhcp_port({'port': port_dict})
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 837, in create_dhcp_port
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     port=port, host=self.host)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron_lib/rpc.py", line 157, in call
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     return self._original_context.call(ctxt, method, **kwargs)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     transport_options=self.transport_options)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 664, in _send
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent     raise result
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent oslo_messaging.rpc.client.RemoteError: Remote error: MechanismDriverError
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent ['Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming\n    res = self.dispatcher.dispatch(message)\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch\n    return self._do_dispatch(endpoint, method, ctxt, args)\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch\n    result = func(ctxt, **new_args)\n', '  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner\n    return func(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n    setattr(e, \'_RETRY_EXCEEDED\', True)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n    ectxt.value = e.inner_exc\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n    LOG.debug("Retry wrapper got retriable exception: %s", e)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n    return f(*dup_args, **dup_kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron/quota/resource_registry.py", line 94, in wrapper\n    ret_val = f(_self, context, *args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 274, in create_dhcp_port\n    return self._port_action(plugin, context, port, \'create_port\')\n', '  File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 99, in _port_action\n    return p_utils.create_port(plugin, context, port)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/plugins/utils.py", line 336, in create_port\n    context, {port_apidef.RESOURCE_NAME: port_data})\n', '  File "/usr/lib/python3.6/site-packages/neutron/common/utils.py", line 687, in inner\n    return f(self, context, *args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 233, in wrapped\n    return method(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n 
   setattr(e, \'_RETRY_EXCEEDED\', True)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n    ectxt.value = e.inner_exc\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n    LOG.debug("Retry wrapper got retriable exception: %s", e)\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n    return f(*dup_args, **dup_kwargs)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1435, in create_port\n    result, mech_context = self._create_port_db(context, port)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1425, in _create_port_db\n    self.mechanism_manager.create_port_precommit(mech_context)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 701, in create_port_precommit\n    raise_db_retriable=True)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 495, in _call_on_drivers\n    errors=errors\n', 'neutron.plugins.ml2.common.exceptions.MechanismDriverError\n'].
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent 
2021-06-17 14:01:12.339 124511 DEBUG oslo_concurrency.lockutils [req-60392d32-dca5-439b-af33-64cf2e8ad91e - - - - -] Lock "resize_greenpool" acquired by "neutron.agent.dhcp.agent.DhcpAgent._resize_process_pool" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327

I can see the DHCP packets coming in on br-baremetal to controller-0, but within the container the agent is just flailing, reporting this error over and over.
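
For anyone reproducing this, a quick way to confirm that DHCP requests are actually reaching the bridge is to sniff it directly. This is only a minimal sketch, assuming scapy is available on the controller and the script runs as root; tcpdump -i br-baremetal 'port 67 or port 68' is the more usual one-liner and shows the same thing.

# Minimal sketch: confirm DHCP/BOOTP traffic is arriving on br-baremetal.
# Assumes scapy is installed and the script runs as root on the controller;
# "br-baremetal" is the bridge mentioned above.
from scapy.all import sniff

def show(pkt):
    # Print a one-line summary of each DHCP/BOOTP packet seen.
    print(pkt.summary())

# BPF filter for BOOTP/DHCP; stop after 10 packets or 60 seconds.
sniff(iface="br-baremetal", filter="udp and (port 67 or port 68)",
      prn=show, count=10, timeout=60)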

Comment 3 Julia Kreger 2021-06-17 14:38:27 UTC
We're going to need eyes on this from the networking team, since they have context on the changes that have taken place in the overcloud which may be breaking the bare-metal-as-a-service use case.

Comment 4 Lucas Alvares Gomes 2021-06-17 15:22:05 UTC
Following up on comment #2: Julia rightly pointed out that the DHCP agent is failing when it tries to create the DHCP port. Here's a snippet of the relevant traceback from her comment:

<snippet>
File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 701, in create_port_precommit\n    raise_db_retriable=True)\n', '  File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 495, in _call_on_drivers\n    errors=errors\n', 'neutron.plugins.ml2.common.exceptions.MechanismDriverError\n']
</snippet>

The failure is coming from the "create_port_precommit" method of the ML2/OVN driver.

Now, in the Neutron server logs I can see the following error repeated throughout [0][1]:

2021-06-16 10:22:02.028 17 DEBUG neutron_lib.callbacks.manager [req-df2f0e7f-bafa-4245-8e78-c171a1937274 - - - - -] Notify callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_create_port_callback-11019623'] for port, precommit_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers [req-df2f0e7f-bafa-4245-8e78-c171a1937274 - - - - -] Mechanism driver 'ovn' failed in create_port_precommit: AttributeError: 'NoneType' object has no attribute 'chassis_exists'
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 477, in _call_on_drivers
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers     getattr(driver.obj, method_name)(context)
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py", line 551, in create_port_precommit
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers     if self._is_port_provisioning_required(port, context.host):
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py", line 589, in _is_port_provisioning_required
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers     if not self._sb_ovn.chassis_exists(host):
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers AttributeError: 'NoneType' object has no attribute 'chassis_exists'

The self._sb_ovn attribute holds the connection to the OVN Southbound database; the fact that it's None means that the driver has not connected to the OVN SB DB.
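
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual networking_ovn code) of the shape of _is_port_provisioning_required from the traceback: when the southbound connection was never established, self._sb_ovn stays None and any attribute access on it raises exactly the AttributeError seen in the server log. The guard shown is purely hypothetical, just to distinguish "not connected" from the opaque NoneType error.

# Simplified stand-in for the ML2/OVN mechanism driver (illustration only).
class SketchOVNMechanismDriver:

    def __init__(self, sb_connection=None):
        # In the real driver this is set when the worker connects to the
        # OVN southbound DB; leaving it None mimics the bug described above.
        self._sb_ovn = sb_connection

    def _is_port_provisioning_required(self, port, host):
        if self._sb_ovn is None:
            # Hypothetical guard: surface a clear error (or trigger a retry)
            # instead of "'NoneType' object has no attribute 'chassis_exists'".
            raise RuntimeError("OVN southbound connection not established")
        return not self._sb_ovn.chassis_exists(host)

# Reproducing the failure: with no SB connection, the create_port_precommit
# path cannot even check whether the chassis exists.
driver = SketchOVNMechanismDriver()
try:
    driver._is_port_provisioning_required(port={}, host="controller-0")
except RuntimeError as exc:
    print(exc)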

Looking at the OVN DB logs, it seems like they are not running. For example, the timestamps for the OVN NB DB on all 3 controllers are before 10:00:00; there are no logs after 9:58.

Last log messages from Controller-0 [2]:
2021-06-16T09:57:59.547Z|00019|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)
2021-06-16T09:57:59.558Z|00020|jsonrpc|WARN|unix#7: receive error: Connection reset by peer
2021-06-16T09:57:59.558Z|00021|reconnect|WARN|unix#7: connection dropped (Connection reset by peer)

Last log messages from Controller-1 [3]:
2021-06-16T09:57:59.408Z|00014|reconnect|INFO|tcp:172.17.1.120:6642: connecting...
2021-06-16T09:57:59.903Z|00015|reconnect|INFO|tcp:172.17.1.120:6642: connected
2021-06-16T09:57:59.910Z|00016|replication|INFO|Monitor request received. Resetting the database

Last log messages from Controller-2 [4]:
2021-06-16T09:57:59.909Z|00011|replication|INFO|Monitor request received. Resetting the database
2021-06-16T09:58:02.770Z|00012|memory|INFO|7312 kB peak resident set size after 10.0 seconds
2021-06-16T09:58:02.770Z|00013|memory|INFO|cells:48 monitors:0

So I think the problem is that the OVN DBs haven't started properly. I looked at the pacemaker logs [5] and I see many occurrences of the following error (on all 3 controllers) that look related:

Jun 16 09:26:21 controller-2 pacemaker-based     [28878] (cib_perform_op) 	info: +  /cib/status/node_state[@id='ovn-dbs-bundle-0']/lrm[@id='ovn-dbs-bundle-0']/lrm_resources/lrm_resource[@id='ovndb_servers']/lrm_rsc_op[@id='ovndb_servers_monitor_10000']:  @transition-magic=0:8;213:72:8:f638d9b3-c348-4ca8-bcd6-8b31c55f1141, @call-id=39, @rc-code=8, @op-status=0, @exec-time=396
Jun 16 09:26:21 controller-2 pacemaker-based     [28878] (cib_process_request) 	info: Completed cib_modify operation for section status: OK (rc=0, origin=controller-0/crmd/193, version=0.78.7)
Jun 16 09:26:21 controller-2 pacemaker-controld  [28883] (services_os_action_execute) 	warning: Cannot execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory | stat rc=2
Jun 16 09:26:21 controller-2 pacemaker-controld  [28883] (lrmd_api_get_metadata_params) 	error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Jun 16 09:26:21 controller-2 pacemaker-controld  [28883] (build_operation_update) 	warning: Failed to get metadata for ovndb_servers (ocf:ovn:ovndb-servers)
Jun 16 09:26:21 controller-2 pacemaker-controld  [28883] (process_lrm_event) 	notice: Result of monitor operation for ovndb_servers on ovn-dbs-bundle-2: ok | rc=0 call=23 key=ovndb_servers_monitor_30000 confirmed=false cib-update=185

I believe this is the root cause: the ovn-dbs-bundle hasn't been started properly in pacemaker, but I do not know why that happened as I am not familiar with the error above.


[0] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-0/var/log/containers/neutron/server.log.gz
[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-0/var/log/containers/neutron/server.log.1.gz
[2] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-0/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-1/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-2/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
[5] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-2/var/log/pacemaker/pacemaker.log.gz

Comment 5 Michele Baldessari 2021-06-18 07:44:17 UTC
(In reply to Lucas Alvares Gomes from comment #4)
> So I think the problem is that the OVN DBs haven't started properly. I
> looked at the pacemaker logs [5] and I see many of the following error (in
> all 3 controllers) that looks related:
> 
> Jun 16 09:26:21 controller-2 pacemaker-based     [28878] (cib_perform_op) 
> info: + 
> /cib/status/node_state[@id='ovn-dbs-bundle-0']/lrm[@id='ovn-dbs-bundle-0']/
> lrm_resources/lrm_resource[@id='ovndb_servers']/
> lrm_rsc_op[@id='ovndb_servers_monitor_10000']: 
> @transition-magic=0:8;213:72:8:f638d9b3-c348-4ca8-bcd6-8b31c55f1141,
> @call-id=39, @rc-code=8, @op-status=0, @exec-time=396
> Jun 16 09:26:21 controller-2 pacemaker-based     [28878]
> (cib_process_request) 	info: Completed cib_modify operation for section
> status: OK (rc=0, origin=controller-0/crmd/193, version=0.78.7)
> Jun 16 09:26:21 controller-2 pacemaker-controld  [28883]
> (services_os_action_execute) 	warning: Cannot execute
> '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory |
> stat rc=2
> Jun 16 09:26:21 controller-2 pacemaker-controld  [28883]
> (lrmd_api_get_metadata_params) 	error: Failed to retrieve meta-data for
> ocf:ovn:ovndb-servers
> Jun 16 09:26:21 controller-2 pacemaker-controld  [28883]
> (build_operation_update) 	warning: Failed to get metadata for ovndb_servers
> (ocf:ovn:ovndb-servers)
> Jun 16 09:26:21 controller-2 pacemaker-controld  [28883] (process_lrm_event)
> notice: Result of monitor operation for ovndb_servers on ovn-dbs-bundle-2:
> ok | rc=0 call=23 key=ovndb_servers_monitor_30000 confirmed=false
> cib-update=185

Just for the record: the above errors are totally benign and are due to the fact that pacemaker checks
for the OCF resource agent on the host, but ovn-dbs only exists inside the ovn-dbs container.
We've asked to get rid of this error via https://bugzilla.redhat.com/show_bug.cgi?id=1850506

> I believe the root cause is this, the ovn-db-bundle hasn't been started
> properly in pacemaker but I do not know why that happened as I am not
> familiar with the error above.

Something is off in this environment. Namely, something kills all VMs on the hypervisor at 09:41:
Jun 16 09:41:29 controller-0 healthcheck_swift_container_server[176352]: 200 172.17.4.99:6001 0.002797 seconds
Jun 16 09:41:29 controller-0 systemd[1]: tripleo_swift_container_server_healthcheck.service: Succeeded.
Jun 16 09:41:29 controller-0 systemd[1]: Started swift_container_server healthcheck.
Jun 16 08:54:37 controller-0 kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021


Jun 16 09:41:33 controller-1 conmon[45724]: cluster 2021-06-16 09:41:30.299899 mgr.controller-0 (mgr.14231) 752 : cluster [DBG] pgmap v797: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail
Jun 16 09:41:33 controller-1 conmon[45724]: cluster 2021-06-16 09:41:32.300400 mgr.controller-0 (mgr.14231) 753 : cluster [DBG] pgmap v798: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail
Jun 16 08:54:08 controller-1 kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
Jun 16 08:54:08 controller-1 kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=UUID=da9c707f-f199-43fd-b9e4-c4508098ac9f ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet

Jun 16 09:41:33 controller-2 conmon[45675]: cluster 2021-06-16 09:41:30.299899 mgr.controller-0 (mgr.14231) 752 : cluster [DBG] pgmap v797: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail
Jun 16 09:41:33 controller-2 conmon[45675]: cluster 2021-06-16 09:41:32.300400 mgr.controller-0 (mgr.14231) 753 : cluster [DBG] pgmap v798: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail
Jun 16 08:52:30 controller-2 kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
Jun 16 08:52:30 controller-2 kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=UUID=da9c707f-f199-43fd-b9e4-c4508098ac9f ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet


And in the hypervisor journal we see the following:
Jun 16 09:41:52 titan67.lab.eng.tlv2.redhat.com kvm[1994438]: 1 guest now active
Jun 16 09:41:52 titan67.lab.eng.tlv2.redhat.com systemd-machined[26128]: Machine qemu-29-controller-1 terminated.
-- Subject: A virtual machine or container has been terminated
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The virtual machine qemu-29-controller-1 with its leader PID 1920097 has been
-- shut down.


As to why that is, I am not entirely sure, to be honest. A little before the controller-1 VM was shut down
by the hypervisor, I see the following:
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994344]: Received disconnect from 172.16.0.19 port 60896:11: disconnected by user
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994344]: Disconnected from user root 172.16.0.19 port 60896
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994343]: Received disconnect from 172.16.0.19 port 60894:11: disconnected by user
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994343]: Disconnected from user root 172.16.0.19 port 60894
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994335]: pam_unix(sshd:session): session closed for user root
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994332]: pam_unix(sshd:session): session closed for user root
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Session 2968461 logged out. Waiting for processes to exit.
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Session 2968460 logged out. Waiting for processes to exit.
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Removed session 2968461.
-- Subject: Session 2968461 has been terminated
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- Documentation: https://www.freedesktop.org/wiki/Software/systemd/multiseat
--
-- A session with the ID 2968461 has been terminated.
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Removed session 2968460.
-- Subject: Session 2968460 has been terminated
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- Documentation: https://www.freedesktop.org/wiki/Software/systemd/multiseat
--
-- A session with the ID 2968460 has been terminated.


I am not sure if it is related. What is certain is that the VMs hosting all three controllers were killed during this test.

Comment 6 Michele Baldessari 2021-06-18 07:46:54 UTC
Also note that the cluster is in a good state at the time of log collection (10:59:31):
Full List of Resources:
  * ip-192.168.24.44	(ocf::heartbeat:IPaddr2):	 Started controller-0
  * ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	 Started controller-1
  * ip-172.17.1.62	(ocf::heartbeat:IPaddr2):	 Started controller-2
  * ip-172.17.1.145	(ocf::heartbeat:IPaddr2):	 Started controller-0
  * ip-172.17.3.101	(ocf::heartbeat:IPaddr2):	 Started controller-1
  * ip-172.17.4.57	(ocf::heartbeat:IPaddr2):	 Started controller-2
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0	(ocf::heartbeat:podman):	 Started controller-0
    * haproxy-bundle-podman-1	(ocf::heartbeat:podman):	 Started controller-1
    * haproxy-bundle-podman-2	(ocf::heartbeat:podman):	 Started controller-2
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0	(ocf::heartbeat:galera):	 Master controller-0
    * galera-bundle-1	(ocf::heartbeat:galera):	 Master controller-1
    * galera-bundle-2	(ocf::heartbeat:galera):	 Master controller-2
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	 Started controller-0
    * rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	 Started controller-1
    * rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	 Started controller-2
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0	(ocf::heartbeat:redis):	 Master controller-0
    * redis-bundle-1	(ocf::heartbeat:redis):	 Slave controller-1
    * redis-bundle-2	(ocf::heartbeat:redis):	 Slave controller-2
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	 Master controller-0
    * ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	 Slave controller-1
    * ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	 Slave controller-2
  * ip-172.17.1.120	(ocf::heartbeat:IPaddr2):	 Started controller-0
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0	(ocf::heartbeat:podman):	 Started controller-0

So the cluster did manage to recover, but the running test obviously barfed as the three controllers were reset.

Comment 16 spower 2021-06-28 16:51:28 UTC
This bug was approved as a Blocker+; however, the bug is not triaged correctly: no Triaged keyword, no dev/QE/PM acks, no release flag. Please complete the triage or it will not get included in the release.

Comment 35 Rodolfo Alonso 2021-07-23 12:17:18 UTC
*** Bug 1983958 has been marked as a duplicate of this bug. ***

Comment 41 Roman Safronov 2021-07-27 10:58:06 UTC
Verified that the issue does not occur on RHOS-16.2-RHEL-8-20210722.n.0

Comment 43 errata-xmlrpc 2021-09-15 07:16:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

