Bug 1909795 - network-runner 0.2.2 breaks rhos 16.1/16.2 neutron: dhcp agent failing due to amq timouts, unable to sync network state
Summary: network-runner 0.2.2 breaks rhos 16.1/16.2 neutron: dhcp agent failing due to...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-network-runner
Version: 16.2 (Train)
Hardware: x86_64
OS: Unspecified
urgent
urgent
Target Milestone: z4
: 16.1 (Train on RHEL 8.2)
Assignee: Lon Hohberger
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On: 1866606
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-12-21 16:51 UTC by wes hayutin
Modified: 2021-03-17 15:37 UTC (History)
16 users (show)

Fixed In Version: python-network-runner017-0.1.7-4.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, a rebase in python-network-runner from 0.1.7 to 0.2.2 in OSP 16.1.3 caused ML2 Networking using Ansible to no longer function. With this update, python-networking-ansible is reverted to 0.1.7, and Ansible networking returns to a functioning state. For more information, see https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/bare_metal_provisioning/ml2-networking-ansible.
Clone Of:
Environment:
Last Closed: 2021-03-17 15:36:33 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 729085 0 None MERGED Use runner model instead of resource 2021-02-19 10:19:13 UTC
OpenStack gerrit 768222 0 None NEW Use runner model instead of resource 2021-02-19 10:19:13 UTC
Red Hat Product Errata RHBA-2021:0817 0 None None None 2021-03-17 15:37:12 UTC

Description wes hayutin 2020-12-21 16:51:16 UTC
Description of problem:

2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent [req-c4d71c06-ef3e-4a9f-929c-20fca987b6a2 - - - - -] Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID d2091dd162de48e692189910f57d3492
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return self._queues[msg_id].get(block=True, timeout=timeout)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return waiter.wait()
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return get_hub().switch()
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return self.greenlet.switch()
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent queue.Empty
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent 
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent 
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 976, in _report_state
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     ctx, self.agent_state, True)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/rpc.py", line 101, in report_state
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     transport_options=self.transport_options)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     call_monitor_timeout)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID d2091dd162de48e692189910f57d3492


https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/errors.txt.txt

Started to fail on 12/16/2020:
https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2

We do not see the same issue in train:
https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-stable3&job_name=periodic-tripleo-ci-centos-8-scenario012-standalone-train


This results in the private networks failing to be created.

The error was: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://192.168.24.1:9696/v2.0/networks.json: HTTPConnectionPool(host='192.168.24.1', port=9696): Max retries exceeded with url: /v2.0/networks.json?tenant_id=5420954cc98447fcbac4681bc4127118 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f54f5da5198>: Failed to establish a new connection: [Errno 111] Connection refused',))

Comment 1 Slawek Kaplonski 2020-12-22 08:05:22 UTC
There is error in neutron-server:

2020-12-22 01:45:37.282 8 CRITICAL neutron.plugins.ml2.managers [-] The 'ansible = networking_ansible.ml2.mech_driver:AnsibleMechanismDriver' entrypoint could not be loaded for the following reason: 'No module named 'network_runner.resources''.: ModuleNotFoundError: No module named 'network_runner.resources'

And due to that neutron server fails to start.

This "ansible" mechanism driver is enabled in that specific scenario. I don't know if we should remove it or if it is really needed, we are probably missing networking-ansible to be installed in the neutron container.

Comment 2 Slawek Kaplonski 2020-12-22 15:09:45 UTC
Thx to ysandeep I got env with reproducer and I found out that we are missing patch https://opendev.org/x/networking-ansible/commit/28e22565070bfde3e2b476bd3cfd323dbe2d34c4 in the networking-ansible module installed in the container.
It is included in networking-ansible 5.0.0 so we should update it to that version or maybe backport this patch.
The best guy to ask about that is probably @dradez who is networking-ansible maintainer still.

Comment 3 wes hayutin 2020-12-22 15:28:28 UTC
[08:23:54] <slaweq> weshay|ruck: for train-release I see in rdo https://github.com/redhat-openstack/rdoinfo/blob/master/buildsys-tags/cloud8-openstack-train-release.yml#L1444
[08:23:54] * udesale_ has quit (Ping timeout: 256 seconds)
[08:24:02] <slaweq> so 0.1.7 is used there
[08:24:20] <slaweq> and that works fine without this patch in networking-ansible
[08:25:27] <weshay|ruck> python3-ansible-runner-1.4.0-1.el8ost.noarch
[08:25:35] <weshay|ruck> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/rpm-list.txt
[08:25:54] <slaweq> weshay|ruck: You should check python3-network-runner package AFAIK
[08:26:17] <weshay|ruck> python3-network-runner-0:0.2.2-2.el8ost.noarch
[08:26:28] <weshay|ruck> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/all_available_packages.txt
[08:26:37] * derekh has quit (Ping timeout: 264 seconds)
[08:26:50] <slaweq> weshay|ruck: yes, that is version which needs that patch basically

Comment 4 Dan Radez 2020-12-23 01:54:36 UTC
This looks like a case of a newer version of network-runner being install than should be provided with 16.2?

The code changes that required net-runner 0.2 wasn't made untill Victoria.
We haven't tested the train code base with net-runner 0.2 at all afaik, only 0.1.x.

If the installed net-ansible code is asking for the net-runner.models we should verify that the net-ansible version installed is the supported version for train.
If net-ansible can't find the resources module in net-runner we should veify that net-runner 0.1.7 is installed.


This looks suspect to me:
[08:26:17] <weshay|ruck> python3-network-runner-0:0.2.2-2.el8ost.noarch

Comment 13 Chris Janiszewski 2020-12-28 19:52:44 UTC
We have just hit this issue after upgrading OSP 16.1.1 to 16.1.3 .. is there any known workaround ? We are actively using net-ansible functionality.

Comment 17 Chris Janiszewski 2021-01-18 20:28:37 UTC
I just want to follow up to my previous comment and leave artifact in case anyone else hits this issue.
I was able to work it around by manually rolling python-network-runner down to 0.1.7

[root@hextupleo-controller-2 ~]# podman exec -tiu root neutron_api /bin/bash
()[root@hextupleo-controller-2 /]# rpm -qa | grep network-runner
ansible-role-network-runner-0.1.7-3.el8ost.noarch
python3-network-runner-0.1.7-3.el8ost.noarch

now my net-ansible functionality is functional again.

Comment 19 Lon Hohberger 2021-01-20 15:49:20 UTC
Update testing passed:

[root@rhel8-test yum.repos.d]# dnf update python3-network-runner
overrides-lhh                                    51 kB/s | 2.7 kB     00:00    
Dependencies resolved.
================================================================================
 Package                          Arch     Version            Repository   Size
================================================================================
Installing:
 python3-network-runner017        noarch   0.1.7-4.el8ost     overrides    40 k
     replacing  python3-network-runner.noarch 0.2.2-2.el8ost
Installing dependencies:
 ansible-role-network-runner017   noarch   0.1.7-4.el8ost     overrides    24 k
     replacing  ansible-role-network-runner.noarch 0.2.2-2.el8ost

... and switching to the new -4 build:

 ansible-role-network-runner
                   noarch 0.2.2-4.el8ost    overrides                      26 k
     replacing  ansible-role-network-runner017.noarch 0.1.7-4.el8ost
 python3-network-runner
                   noarch 0.2.2-4.el8ost    overrides                      35 k
     replacing  python3-network-runner017.noarch 0.1.7-4.el8ost

Comment 35 errata-xmlrpc 2021-03-17 15:36:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0817


Note You need to log in before you can comment on or make changes to this bug.