It appears that the Neutron DHCP agent is failing to operate... well... at all. It is not launching dnsmasq and it is reporting a fairly gnarly exception:

2021-06-17 14:01:11.487 124511 INFO neutron.agent.dhcp.agent [req-1e24eaf3-2d99-45de-9625-0330b17d9563 - - - - -] Synchronizing state
2021-06-17 14:01:11.765 124511 INFO neutron.agent.dhcp.agent [req-0db2c9f9-7c40-4645-93a2-2ed4a58b309e - - - - -] All active networks have been fetched through RPC.
2021-06-17 14:01:11.765 124511 INFO neutron.agent.dhcp.agent [-] Starting network c6114048-d279-4c8e-abb6-c77930bb785e dhcp configuration
2021-06-17 14:01:11.766 124511 DEBUG neutron.agent.dhcp.agent [-] Calling driver for network: c6114048-d279-4c8e-abb6-c77930bb785e action: enable call_driver /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:194
2021-06-17 14:01:11.767 124511 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/c6114048-d279-4c8e-abb6-c77930bb785e/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:262
2021-06-17 14:01:11.767 124511 DEBUG neutron.agent.linux.dhcp [-] DHCP port dhcpaebe17f8-776d-5ab6-9a5f-e9bdeeaca66f-c6114048-d279-4c8e-abb6-c77930bb785e on network c6114048-d279-4c8e-abb6-c77930bb785e does not yet exist. Checking for a reserved port. _setup_reserved_dhcp_port /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:1492
2021-06-17 14:01:11.767 124511 DEBUG neutron.agent.linux.dhcp [-] DHCP port dhcpaebe17f8-776d-5ab6-9a5f-e9bdeeaca66f-c6114048-d279-4c8e-abb6-c77930bb785e on network c6114048-d279-4c8e-abb6-c77930bb785e does not yet exist. Creating new one.
_setup_new_dhcp_port /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:1506 2021-06-17 14:01:12.335 124511 DEBUG neutron.agent.linux.dhcp [req-60392d32-dca5-439b-af33-64cf2e8ad91e - - - - -] Cleaning stale devices for network c6114048-d279-4c8e-abb6-c77930bb785e _cleanup_stale_devices /usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py:1607 2021-06-17 14:01:12.337 161507 DEBUG oslo.privsep.daemon [-] privsep: reply[140577667895368]: (4, []) _call_back /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:511 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent [req-60392d32-dca5-439b-af33-64cf2e8ad91e - - - - -] Unable to enable dhcp for c6114048-d279-4c8e-abb6-c77930bb785e.: oslo_messaging.rpc.client.RemoteError: Remote error: MechanismDriverError ['Traceback (most recent call last):\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming\n res = self.dispatcher.dispatch(message)\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner\n return func(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n setattr(e, \'_RETRY_EXCEEDED\', True)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n return f(*args, 
**kwargs)\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n ectxt.value = e.inner_exc\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n LOG.debug("Retry wrapper got retriable exception: %s", e)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n return f(*dup_args, **dup_kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron/quota/resource_registry.py", line 94, in wrapper\n ret_val = f(_self, context, *args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 274, in create_dhcp_port\n return self._port_action(plugin, context, port, \'create_port\')\n', ' File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 99, in _port_action\n return p_utils.create_port(plugin, context, port)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/plugins/utils.py", line 336, in create_port\n context, {port_apidef.RESOURCE_NAME: port_data})\n', ' File "/usr/lib/python3.6/site-packages/neutron/common/utils.py", line 687, in inner\n return f(self, context, *args, **kwargs)\n', ' File 
"/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 233, in wrapped\n return method(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n setattr(e, \'_RETRY_EXCEEDED\', True)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n ectxt.value = e.inner_exc\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n LOG.debug("Retry wrapper got retriable exception: %s", e)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n return f(*dup_args, **dup_kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1435, in 
create_port\n result, mech_context = self._create_port_db(context, port)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1425, in _create_port_db\n self.mechanism_manager.create_port_precommit(mech_context)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 701, in create_port_precommit\n raise_db_retriable=True)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 495, in _call_on_drivers\n errors=errors\n', 'neutron.plugins.ml2.common.exceptions.MechanismDriverError\n']. 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent Traceback (most recent call last): 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 209, in call_driver 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 235, in enable 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent common_utils.wait_until_true(self._enable, timeout=300) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/common/utils.py", line 703, in wait_until_true 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent while not predicate(): 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 246, in _enable 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent interface_name = self.device_manager.setup(self.network) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1640, in setup 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent self._cleanup_stale_devices(network, dhcp_port=None) 2021-06-17 
14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__ 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent self.force_reraise() 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent six.reraise(self.type_, self.value, self.tb) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent raise value 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1635, in setup 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent port = self.setup_dhcp_port(network) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1574, in setup_dhcp_port 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent dhcp_port = setup_method(network, device_id, dhcp_subnets) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 1522, in _setup_new_dhcp_port 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent return self.plugin.create_dhcp_port({'port': port_dict}) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 837, in create_dhcp_port 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent port=port, host=self.host) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron_lib/rpc.py", line 157, in call 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent return self._original_context.call(ctxt, 
method, **kwargs) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent transport_options=self.transport_options) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent transport_options=transport_options) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent transport_options=transport_options) 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 664, in _send 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent raise result 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent oslo_messaging.rpc.client.RemoteError: Remote error: MechanismDriverError 2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent ['Traceback (most recent call last):\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming\n res = self.dispatcher.dispatch(message)\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner\n return func(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n setattr(e, \'_RETRY_EXCEEDED\', True)\n', ' File 
"/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n ectxt.value = e.inner_exc\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n LOG.debug("Retry wrapper got retriable exception: %s", e)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n return f(*dup_args, **dup_kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron/quota/resource_registry.py", line 94, in wrapper\n ret_val = f(_self, context, *args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 274, in create_dhcp_port\n return self._port_action(plugin, context, port, \'create_port\')\n', ' File 
"/usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py", line 99, in _port_action\n return p_utils.create_port(plugin, context, port)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/plugins/utils.py", line 336, in create_port\n context, {port_apidef.RESOURCE_NAME: port_data})\n', ' File "/usr/lib/python3.6/site-packages/neutron/common/utils.py", line 687, in inner\n return f(self, context, *args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 233, in wrapped\n return method(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 139, in wrapped\n setattr(e, \'_RETRY_EXCEEDED\', True)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 135, in wrapped\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper\n ectxt.value = e.inner_exc\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped\n LOG.debug("Retry wrapper got retriable exception: %s", e)\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File 
"/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 179, in wrapped\n return f(*dup_args, **dup_kwargs)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1435, in create_port\n result, mech_context = self._create_port_db(context, port)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/plugin.py", line 1425, in _create_port_db\n self.mechanism_manager.create_port_precommit(mech_context)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 701, in create_port_precommit\n raise_db_retriable=True)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 495, in _call_on_drivers\n errors=errors\n', 'neutron.plugins.ml2.common.exceptions.MechanismDriverError\n'].
2021-06-17 14:01:12.338 124511 ERROR neutron.agent.dhcp.agent
2021-06-17 14:01:12.339 124511 DEBUG oslo_concurrency.lockutils [req-60392d32-dca5-439b-af33-64cf2e8ad91e - - - - -] Lock "resize_greenpool" acquired by "neutron.agent.dhcp.agent.DhcpAgent._resize_process_pool" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327

I can see the DHCP packets coming in on br-baremetal to controller-0, but within the container the agent is just flailing, reporting this error over and over.
We're going to need eyes on this from the networking team, since they have context on the changes that have taken place in the overcloud which may be breaking the bare-metal-as-a-service use case.
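For reference, the "not launching dnsmasq" symptom above can be double-checked directly on the controller. A minimal sketch (assumes access to an affected controller; the network UUID is taken from the log above, and paths may differ per deployment):

```python
# Sketch: confirm that no dnsmasq was spawned for the affected network,
# matching the agent's "Unable to access /var/lib/neutron/dhcp/<net>/pid"
# debug message. Pure stdlib; scans /proc instead of shelling out.
import os

NET_ID = "c6114048-d279-4c8e-abb6-c77930bb785e"
pid_file = f"/var/lib/neutron/dhcp/{NET_ID}/pid"

# Look for a dnsmasq process whose command line references this network.
matches = []
for pid in os.listdir("/proc"):
    if not pid.isdigit() or pid == str(os.getpid()):
        continue
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmd = f.read().decode(errors="replace")
    except OSError:
        continue  # process exited while scanning
    if "dnsmasq" in cmd and NET_ID in cmd:
        matches.append(pid)

print(matches or f"no dnsmasq process for {NET_ID}")
print(pid_file if os.path.exists(pid_file) else f"no pid file for {NET_ID}")
```

On an affected controller both checks come up empty, which is consistent with the agent never getting past port creation.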
Following up on comment #2, Julia rightly pointed out that the DHCP agent is failing when it tries to create the DHCP port; here's a snippet of the relevant traceback from her comment:

<snippet>
File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 701, in create_port_precommit\n raise_db_retriable=True)\n', ' File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 495, in _call_on_drivers\n errors=errors\n', 'neutron.plugins.ml2.common.exceptions.MechanismDriverError\n']
</snippet>

The failure is coming from the "create_port_precommit" method of the ML2/OVN driver. Now, in the Neutron server log I can see the following error all over the logs [0][1]:

2021-06-16 10:22:02.028 17 DEBUG neutron_lib.callbacks.manager [req-df2f0e7f-bafa-4245-8e78-c171a1937274 - - - - -] Notify callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_create_port_callback-11019623'] for port, precommit_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers [req-df2f0e7f-bafa-4245-8e78-c171a1937274 - - - - -] Mechanism driver 'ovn' failed in create_port_precommit: AttributeError: 'NoneType' object has no attribute 'chassis_exists'
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 477, in _call_on_drivers
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers getattr(driver.obj, method_name)(context)
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py", line 551, in create_port_precommit
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers if self._is_port_provisioning_required(port, context.host):
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers
File "/usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py", line 589, in _is_port_provisioning_required
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers if not self._sb_ovn.chassis_exists(host):
2021-06-16 10:22:02.078 17 ERROR neutron.plugins.ml2.managers AttributeError: 'NoneType' object has no attribute 'chassis_exists'

This self._sb_ovn attribute holds the connection to the OVN Southbound database; the fact that it's None means the driver has never connected to the OVN SB DB.

Looking at the OVN DB logs, it seems like they are not running? For example, the timestamps for the OVN NB DB on all 3 controllers are before 10:00:00; there are no logs after 9:58:...

Last log messages from Controller-0 [2]:
2021-06-16T09:57:59.547Z|00019|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)
2021-06-16T09:57:59.558Z|00020|jsonrpc|WARN|unix#7: receive error: Connection reset by peer
2021-06-16T09:57:59.558Z|00021|reconnect|WARN|unix#7: connection dropped (Connection reset by peer)

Last log messages from Controller-1 [3]:
2021-06-16T09:57:59.408Z|00014|reconnect|INFO|tcp:172.17.1.120:6642: connecting...
2021-06-16T09:57:59.903Z|00015|reconnect|INFO|tcp:172.17.1.120:6642: connected
2021-06-16T09:57:59.910Z|00016|replication|INFO|Monitor request received. Resetting the database

Last log messages from Controller-2 [4]:
2021-06-16T09:57:59.909Z|00011|replication|INFO|Monitor request received. Resetting the database
2021-06-16T09:58:02.770Z|00012|memory|INFO|7312 kB peak resident set size after 10.0 seconds
2021-06-16T09:58:02.770Z|00013|memory|INFO|cells:48 monitors:0

So I think the problem is that the OVN DBs haven't started properly.
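The failure mode is easy to reproduce in isolation. A minimal sketch (a hypothetical, simplified stand-in mirroring the traceback, not the actual networking_ovn code): the driver caches its Southbound connection in self._sb_ovn, and if that connection was never established the attribute is still None, so the chassis_exists() call raises exactly the AttributeError seen in the server log.

```python
class OvnMechDriverSketch:
    """Hypothetical, simplified stand-in for the ML2/OVN mechanism driver."""

    def __init__(self):
        # The real driver sets this after connecting to the OVN SB DB;
        # if the OVN DBs are down at startup, it remains None.
        self._sb_ovn = None

    def _is_port_provisioning_required(self, host):
        # The failing line from the traceback: self._sb_ovn is None here.
        return self._sb_ovn.chassis_exists(host)


driver = OvnMechDriverSketch()
try:
    driver._is_port_provisioning_required("controller-0")
except AttributeError as exc:
    # Matches the server log:
    # 'NoneType' object has no attribute 'chassis_exists'
    print(exc)
```

So the AttributeError is a symptom, not the disease: any precommit call made before the SB DB connection exists will blow up this way.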
I looked at the pacemaker logs [5] and I see many instances of the following error (on all 3 controllers) that look related:

Jun 16 09:26:21 controller-2 pacemaker-based [28878] (cib_perform_op) info: + /cib/status/node_state[@id='ovn-dbs-bundle-0']/lrm[@id='ovn-dbs-bundle-0']/lrm_resources/lrm_resource[@id='ovndb_servers']/lrm_rsc_op[@id='ovndb_servers_monitor_10000']: @transition-magic=0:8;213:72:8:f638d9b3-c348-4ca8-bcd6-8b31c55f1141, @call-id=39, @rc-code=8, @op-status=0, @exec-time=396
Jun 16 09:26:21 controller-2 pacemaker-based [28878] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=controller-0/crmd/193, version=0.78.7)
Jun 16 09:26:21 controller-2 pacemaker-controld [28883] (services_os_action_execute) warning: Cannot execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory | stat rc=2
Jun 16 09:26:21 controller-2 pacemaker-controld [28883] (lrmd_api_get_metadata_params) error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Jun 16 09:26:21 controller-2 pacemaker-controld [28883] (build_operation_update) warning: Failed to get metadata for ovndb_servers (ocf:ovn:ovndb-servers)
Jun 16 09:26:21 controller-2 pacemaker-controld [28883] (process_lrm_event) notice: Result of monitor operation for ovndb_servers on ovn-dbs-bundle-2: ok | rc=0 call=23 key=ovndb_servers_monitor_30000 confirmed=false cib-update=185

I believe this is the root cause: the ovn-dbs-bundle hasn't been started properly in pacemaker, but I do not know why that happened, as I am not familiar with the error above.
[0] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-0/var/log/containers/neutron/server.log.gz [1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-0/var/log/containers/neutron/server.log.1.gz [2] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-0/var/log/containers/openvswitch/ovsdb-server-nb.log.gz [3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-1/var/log/containers/openvswitch/ovsdb-server-nb.log.gz [4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-2/var/log/containers/openvswitch/ovsdb-server-nb.log.gz [5] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-hardware_provisioning-rqci-16.2_director-virthost-3cont_1comp_3ceph_2ironic-ipv4-geneve-IR-reboot-RHELOSP-60959/23/controller-2/var/log/pacemaker/pacemaker.log.gz
(In reply to Lucas Alvares Gomes from comment #4) > So I think the problem is that the OVN DBs haven't started properly. I > looked at the pacemaker logs [5] and I see many of the following error (in > all 3 controllers) that looks related: > > Jun 16 09:26:21 controller-2 pacemaker-based [28878] (cib_perform_op) > info: + > /cib/status/node_state[@id='ovn-dbs-bundle-0']/lrm[@id='ovn-dbs-bundle-0']/ > lrm_resources/lrm_resource[@id='ovndb_servers']/ > lrm_rsc_op[@id='ovndb_servers_monitor_10000']: > @transition-magic=0:8;213:72:8:f638d9b3-c348-4ca8-bcd6-8b31c55f1141, > @call-id=39, @rc-code=8, @op-status=0, @exec-time=396 > Jun 16 09:26:21 controller-2 pacemaker-based [28878] > (cib_process_request) info: Completed cib_modify operation for section > status: OK (rc=0, origin=controller-0/crmd/193, version=0.78.7) > Jun 16 09:26:21 controller-2 pacemaker-controld [28883] > (services_os_action_execute) warning: Cannot execute > '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory | > stat rc=2 > Jun 16 09:26:21 controller-2 pacemaker-controld [28883] > (lrmd_api_get_metadata_params) error: Failed to retrieve meta-data for > ocf:ovn:ovndb-servers > Jun 16 09:26:21 controller-2 pacemaker-controld [28883] > (build_operation_update) warning: Failed to get metadata for ovndb_servers > (ocf:ovn:ovndb-servers) > Jun 16 09:26:21 controller-2 pacemaker-controld [28883] (process_lrm_event) > notice: Result of monitor operation for ovndb_servers on ovn-dbs-bundle-2: > ok | rc=0 call=23 key=ovndb_servers_monitor_30000 confirmed=false > cib-update=185 Just for the record. The above are totally benign and are due to the fact that pacemaker checks for the OCF resource agent on the host but ovn-dbs only exists inside the ovn-dbs container. 
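The benign nature of that warning can be illustrated with a trivial check: pacemaker on the host simply stats the OCF agent path, and the path only exists inside the container, so the metadata lookup fails there by design. A minimal sketch (the container name in the comment is hypothetical and may differ per deployment):

```python
# The host-side failure is just a missing file: pacemaker stats the OCF
# agent path on the host, but the agent ships only inside the container.
import os

AGENT = "/usr/lib/ocf/resource.d/ovn/ovndb-servers"
if not os.path.exists(AGENT):
    print("agent not on host (expected: it lives inside the ovn-dbs container)")
# Inside the container the same path exists, e.g. (hypothetical container name):
#   podman exec ovn-dbs-bundle-podman-0 test -x /usr/lib/ocf/resource.d/ovn/ovndb-servers
```

In other words, the "Cannot execute ... No such file or directory" lines say nothing about whether the bundle itself is healthy.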
We've asked to get rid of this error via https://bugzilla.redhat.com/show_bug.cgi?id=1850506 > I believe the root cause is this, the ovn-db-bundle hasn't been started > properly in pacemaker but I do not know why that happened as I am not > familiar with the error above. Something is off in this environment. Namely something kills all VMs on the hypervisor at 09:41 Jun 16 09:41:29 controller-0 healthcheck_swift_container_server[176352]: 200 172.17.4.99:6001 0.002797 seconds Jun 16 09:41:29 controller-0 systemd[1]: tripleo_swift_container_server_healthcheck.service: Succeeded. Jun 16 09:41:29 controller-0 systemd[1]: Started swift_container_server healthcheck. Jun 16 08:54:37 controller-0 kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021 Jun 16 09:41:33 controller-1 conmon[45724]: cluster 2021-06-16 09:41:30.299899 mgr.controller-0 (mgr.14231) 752 : cluster [DBG] pgmap v797: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail Jun 16 09:41:33 controller-1 conmon[45724]: cluster 2021-06-16 09:41:32.300400 mgr.controller-0 (mgr.14231) 753 : cluster [DBG] pgmap v798: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail Jun 16 08:54:08 controller-1 kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021 Jun 16 08:54:08 controller-1 kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=UUID=da9c707f-f199-43fd-b9e4-c4508098ac9f ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet Jun 16 09:41:33 controller-2 conmon[45675]: cluster 2021-06-16 09:41:30.299899 mgr.controller-0 (mgr.14231) 752 : cluster [DBG] pgmap v797: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail Jun 16 09:41:33 controller-2 conmon[45675]: cluster 2021-06-16 
09:41:32.300400 mgr.controller-0 (mgr.14231) 753 : cluster [DBG] pgmap v798: 96 pgs: 96 active+clean; 0 B data, 87 MiB used, 165 GiB / 180 GiB avail Jun 16 08:52:30 controller-2 kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021 Jun 16 08:52:30 controller-2 kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=UUID=da9c707f-f199-43fd-b9e4-c4508098ac9f ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet And on the hypervisor journal we see the following: Jun 16 09:41:52 titan67.lab.eng.tlv2.redhat.com kvm[1994438]: 1 guest now active Jun 16 09:41:52 titan67.lab.eng.tlv2.redhat.com systemd-machined[26128]: Machine qemu-29-controller-1 terminated. -- Subject: A virtual machine or container has been terminated -- Defined-By: systemd -- Support: https://access.redhat.com/support -- -- The virtual machine qemu-29-controller-1 with its leader PID 1920097 has been -- shut down. As to why that is, I am not entirely sure tbh. 
A little bit before the controller-1 VM was shut down by the hypervisor, I see the following:

Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994344]: Received disconnect from 172.16.0.19 port 60896:11: disconnected by user
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994344]: Disconnected from user root 172.16.0.19 port 60896
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994343]: Received disconnect from 172.16.0.19 port 60894:11: disconnected by user
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994343]: Disconnected from user root 172.16.0.19 port 60894
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994335]: pam_unix(sshd:session): session closed for user root
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com sshd[1994332]: pam_unix(sshd:session): session closed for user root
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Session 2968461 logged out. Waiting for processes to exit.
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Session 2968460 logged out. Waiting for processes to exit.
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Removed session 2968461.
-- Subject: Session 2968461 has been terminated
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- Documentation: https://www.freedesktop.org/wiki/Software/systemd/multiseat
--
-- A session with the ID 2968461 has been terminated.
Jun 16 09:41:51 titan67.lab.eng.tlv2.redhat.com systemd-logind[1466]: Removed session 2968460.
-- Subject: Session 2968460 has been terminated
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- Documentation: https://www.freedesktop.org/wiki/Software/systemd/multiseat
--
-- A session with the ID 2968460 has been terminated.

I am not sure if it is related. What is certain is that we killed the VMs hosting all three controllers during this test.
Also note that the cluster is in a good state at the time of log collection (10:59:31):

Full List of Resources:
  * ip-192.168.24.44 (ocf::heartbeat:IPaddr2): Started controller-0
  * ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-1
  * ip-172.17.1.62 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-172.17.1.145 (ocf::heartbeat:IPaddr2): Started controller-0
  * ip-172.17.3.101 (ocf::heartbeat:IPaddr2): Started controller-1
  * ip-172.17.4.57 (ocf::heartbeat:IPaddr2): Started controller-2
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
    * galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
    * galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
    * redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
    * redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
  * ip-172.17.1.120 (ocf::heartbeat:IPaddr2): Started controller-0
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-0

So the cluster did manage to recover, but the test run obviously barfed as the three controllers were reset.
This bug was approved as a Blocker+; however, the bug is not triaged correctly: no Triaged keyword, no dev/QE/PM acks, no release flag. Please complete the triage or it will not get included in the release.
*** Bug 1983958 has been marked as a duplicate of this bug. ***
Verified that the issue does not occur on RHOS-16.2-RHEL-8-20210722.n.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483