Description of problem:

The neutron DHCP agent repeatedly fails to report state with a MessagingTimeout. From /var/log/containers/neutron/dhcp-agent.log (every line carries the prefix "2020-12-21 01:41:08.491 135708 ERROR neutron.agent.dhcp.agent"; the prefix is trimmed below for readability):

Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID d2091dd162de48e692189910f57d3492
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
    return waiter.wait()
  File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
    return get_hub().switch()
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 976, in _report_state
    ctx, self.agent_state, True)
  File "/usr/lib/python3.6/site-packages/neutron/agent/rpc.py", line 101, in report_state
    return method(context, 'report_state', **kwargs)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
    'to message ID %s' % msg_id)
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID d2091dd162de48e692189910f57d3492

Full log:
https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/errors.txt.txt

Started to fail on 12/16/2020:
https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2

We do not see the same issue in train:
https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-stable3&job_name=periodic-tripleo-ci-centos-8-scenario012-standalone-train

This results in private networks failing to be created. The error was:

keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://192.168.24.1:9696/v2.0/networks.json: HTTPConnectionPool(host='192.168.24.1', port=9696): Max retries exceeded with url: /v2.0/networks.json?tenant_id=5420954cc98447fcbac4681bc4127118 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f54f5da5198>: Failed to establish a new connection: [Errno 111] Connection refused',))
There is an error in neutron-server:

2020-12-22 01:45:37.282 8 CRITICAL neutron.plugins.ml2.managers [-] The 'ansible = networking_ansible.ml2.mech_driver:AnsibleMechanismDriver' entrypoint could not be loaded for the following reason: 'No module named 'network_runner.resources''.: ModuleNotFoundError: No module named 'network_runner.resources'

Because of that, neutron-server fails to start. This "ansible" mechanism driver is enabled in that specific scenario. I don't know whether we should remove it or whether it is really needed; most likely the networking-ansible package is missing from the neutron container.
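For illustration, this is the general failure mode: ml2 imports the driver module named by the entrypoint, and a missing transitive dependency surfaces as a ModuleNotFoundError that kills server startup. In this hedged sketch, "no_such_dependency.resources" is a placeholder standing in for network_runner.resources, which the real networking_ansible.ml2.mech_driver imports; the helper below is illustrative, not neutron's actual loader.

```python
# Sketch of how an entrypoint-style driver load turns a missing transitive
# dependency into a fatal startup error. The module path below is a
# placeholder for network_runner.resources (assumed missing here).
import importlib


def load_mech_driver(module_path, class_name):
    """Import the driver module and return the named class, or an error string."""
    try:
        module = importlib.import_module(module_path)
        return getattr(module, class_name)
    except ModuleNotFoundError as exc:
        # neutron-server logs the equivalent of this as CRITICAL and exits
        return "entrypoint could not be loaded: %s" % exc


result = load_mech_driver("no_such_dependency.resources", "AnsibleMechanismDriver")
print(result)
```

The real fix is not to catch the error but to install the right dependency version; the sketch only shows where the ModuleNotFoundError originates in the load path.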
Thanks to ysandeep I got an environment with a reproducer, and I found out that we are missing the patch https://opendev.org/x/networking-ansible/commit/28e22565070bfde3e2b476bd3cfd323dbe2d34c4 in the networking-ansible module installed in the container. It is included in networking-ansible 5.0.0, so we should either update to that version or backport this patch. The best person to ask about that is probably @dradez, who is still the networking-ansible maintainer.
[08:23:54] <slaweq> weshay|ruck: for train-release I see in rdo https://github.com/redhat-openstack/rdoinfo/blob/master/buildsys-tags/cloud8-openstack-train-release.yml#L1444
[08:24:02] <slaweq> so 0.1.7 is used there
[08:24:20] <slaweq> and that works fine without this patch in networking-ansible
[08:25:27] <weshay|ruck> python3-ansible-runner-1.4.0-1.el8ost.noarch
[08:25:35] <weshay|ruck> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/rpm-list.txt
[08:25:54] <slaweq> weshay|ruck: You should check python3-network-runner package AFAIK
[08:26:17] <weshay|ruck> python3-network-runner-0:0.2.2-2.el8ost.noarch
[08:26:28] <weshay|ruck> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/all_available_packages.txt
[08:26:50] <slaweq> weshay|ruck: yes, that is version which needs that patch basically
This looks like a case of a newer version of network-runner being installed than should be provided with 16.2. The code changes that required net-runner 0.2 weren't made until Victoria. We haven't tested the train code base with net-runner 0.2 at all AFAIK, only 0.1.x. If the installed net-ansible code is asking for net-runner.models, we should verify that the net-ansible version installed is the supported version for train. If net-ansible can't find the resources module in net-runner, we should verify that net-runner 0.1.7 is installed. This looks suspect to me:

[08:26:17] <weshay|ruck> python3-network-runner-0:0.2.2-2.el8ost.noarch
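To make the pairing rule in this comment concrete, here is a tiny hypothetical check (the helper name and the bare version-string parsing are illustrative assumptions; the rule itself — train-era networking-ansible was only tested with network-runner 0.1.x, while 0.2.x targets Victoria and later — comes from the discussion above):

```python
# Hypothetical guard encoding the compatibility rule from this comment:
# network-runner 0.1.x is the tested pairing for train; 0.2.x is not.
def runner_ok_for_train(version: str) -> bool:
    # Compare only the (major, minor) components of an X.Y.Z version string.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (0, 2)


assert runner_ok_for_train("0.1.7")        # the version that works in train
assert not runner_ok_for_train("0.2.2")    # the suspect package from the logs
```

A check like this is only a sketch; in practice the packaging fix (pinning the right RPM for the release) is what resolves the mismatch.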
We have just hit this issue after upgrading from OSP 16.1.1 to 16.1.3. Is there any known workaround? We are actively using the net-ansible functionality.
I just want to follow up on my previous comment and leave an artifact in case anyone else hits this issue. I was able to work around it by manually rolling python-network-runner back to 0.1.7:

[root@hextupleo-controller-2 ~]# podman exec -tiu root neutron_api /bin/bash
()[root@hextupleo-controller-2 /]# rpm -qa | grep network-runner
ansible-role-network-runner-0.1.7-3.el8ost.noarch
python3-network-runner-0.1.7-3.el8ost.noarch

Now my net-ansible functionality is functional again.
Update testing passed:

[root@rhel8-test yum.repos.d]# dnf update python3-network-runner
overrides-lhh                                    51 kB/s | 2.7 kB     00:00
Dependencies resolved.
================================================================================
 Package                         Arch    Version         Repository        Size
================================================================================
Installing:
 python3-network-runner017       noarch  0.1.7-4.el8ost  overrides         40 k
     replacing  python3-network-runner.noarch 0.2.2-2.el8ost
Installing dependencies:
 ansible-role-network-runner017  noarch  0.1.7-4.el8ost  overrides         24 k
     replacing  ansible-role-network-runner.noarch 0.2.2-2.el8ost
...

and switching to the new -4 build:

 ansible-role-network-runner     noarch  0.2.2-4.el8ost  overrides         26 k
     replacing  ansible-role-network-runner017.noarch 0.1.7-4.el8ost
 python3-network-runner          noarch  0.2.2-4.el8ost  overrides         35 k
     replacing  python3-network-runner017.noarch 0.1.7-4.el8ost
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817