Bug 1909795

Summary: network-runner 0.2.2 breaks rhos 16.1/16.2 neutron: dhcp agent failing due to amq timouts, unable to sync network state
Product: Red Hat OpenStack Reporter: wes hayutin <whayutin>
Component: python-network-runnerAssignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Candido Campos <ccamposr>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.2 (Train)CC: amuller, chrisw, cjanisze, dradez, igallagh, jjoyce, lhh, mburns, mgarciac, ralonsoh, scohen, shrjoshi, skaplons, slinaber, spower, tvignaud
Target Milestone: z4Keywords: Regression, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-network-runner017-0.1.7-4.el8ost Doc Type: Bug Fix
Doc Text:
Before this update, a rebase in python-network-runner from 0.1.7 to 0.2.2 in OSP 16.1.3 caused ML2 Networking using Ansible to no longer function. With this update, python-networking-ansible is reverted to 0.1.7, and Ansible networking returns to a functioning state. For more information, see https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/bare_metal_provisioning/ml2-networking-ansible.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-17 15:36:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1866606    
Bug Blocks:    

Description wes hayutin 2020-12-21 16:51:16 UTC
Description of problem:

2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent [req-c4d71c06-ef3e-4a9f-929c-20fca987b6a2 - - - - -] Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID d2091dd162de48e692189910f57d3492
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return self._queues[msg_id].get(block=True, timeout=timeout)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return waiter.wait()
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return get_hub().switch()
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return self.greenlet.switch()
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent queue.Empty
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent 
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent 
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 976, in _report_state
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     ctx, self.agent_state, True)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/rpc.py", line 101, in report_state
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     transport_options=self.transport_options)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     call_monitor_timeout)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
2020-12-21 01:41:08.491 ERROR /var/log/containers/neutron/dhcp-agent.log: 135708 ERROR neutron.agent.dhcp.agent oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID d2091dd162de48e692189910f57d3492


https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/errors.txt.txt

Started to fail on 12/16/2020:
https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2

We do not see the same issue in train:
https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-stable3&job_name=periodic-tripleo-ci-centos-8-scenario012-standalone-train


This results in the private networks failing to be created.

The error was: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://192.168.24.1:9696/v2.0/networks.json: HTTPConnectionPool(host='192.168.24.1', port=9696): Max retries exceeded with url: /v2.0/networks.json?tenant_id=5420954cc98447fcbac4681bc4127118 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f54f5da5198>: Failed to establish a new connection: [Errno 111] Connection refused',))

Comment 1 Slawek Kaplonski 2020-12-22 08:05:22 UTC
There is error in neutron-server:

2020-12-22 01:45:37.282 8 CRITICAL neutron.plugins.ml2.managers [-] The 'ansible = networking_ansible.ml2.mech_driver:AnsibleMechanismDriver' entrypoint could not be loaded for the following reason: 'No module named 'network_runner.resources''.: ModuleNotFoundError: No module named 'network_runner.resources'

And due to that neutron server fails to start.

This "ansible" mechanism driver is enabled in that specific scenario. I don't know if we should remove it or if it is really needed, we are probably missing networking-ansible to be installed in the neutron container.

Comment 2 Slawek Kaplonski 2020-12-22 15:09:45 UTC
Thx to ysandeep I got env with reproducer and I found out that we are missing patch https://opendev.org/x/networking-ansible/commit/28e22565070bfde3e2b476bd3cfd323dbe2d34c4 in the networking-ansible module installed in the container.
It is included in networking-ansible 5.0.0 so we should update it to that version or maybe backport this patch.
The best guy to ask about that is probably @dradez who is networking-ansible maintainer still.

Comment 3 wes hayutin 2020-12-22 15:28:28 UTC
[08:23:54] <slaweq> weshay|ruck: for train-release I see in rdo https://github.com/redhat-openstack/rdoinfo/blob/master/buildsys-tags/cloud8-openstack-train-release.yml#L1444
[08:23:54] * udesale_ has quit (Ping timeout: 256 seconds)
[08:24:02] <slaweq> so 0.1.7 is used there
[08:24:20] <slaweq> and that works fine without this patch in networking-ansible
[08:25:27] <weshay|ruck> python3-ansible-runner-1.4.0-1.el8ost.noarch
[08:25:35] <weshay|ruck> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/rpm-list.txt
[08:25:54] <slaweq> weshay|ruck: You should check python3-network-runner package AFAIK
[08:26:17] <weshay|ruck> python3-network-runner-0:0.2.2-2.el8ost.noarch
[08:26:28] <weshay|ruck> https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-16.2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario012-standalone-rhos-16.2/9eeccd7/logs/undercloud/var/log/extra/all_available_packages.txt
[08:26:37] * derekh has quit (Ping timeout: 264 seconds)
[08:26:50] <slaweq> weshay|ruck: yes, that is version which needs that patch basically

Comment 4 Dan Radez 2020-12-23 01:54:36 UTC
This looks like a case of a newer version of network-runner being install than should be provided with 16.2?

The code changes that required net-runner 0.2 wasn't made untill Victoria.
We haven't tested the train code base with net-runner 0.2 at all afaik, only 0.1.x.

If the installed net-ansible code is asking for the net-runner.models we should verify that the net-ansible version installed is the supported version for train.
If net-ansible can't find the resources module in net-runner we should veify that net-runner 0.1.7 is installed.


This looks suspect to me:
[08:26:17] <weshay|ruck> python3-network-runner-0:0.2.2-2.el8ost.noarch

Comment 13 Chris Janiszewski 2020-12-28 19:52:44 UTC
We have just hit this issue after upgrading OSP 16.1.1 to 16.1.3 .. is there any known workaround ? We are actively using net-ansible functionality.

Comment 17 Chris Janiszewski 2021-01-18 20:28:37 UTC
I just want to follow up to my previous comment and leave artifact in case anyone else hits this issue.
I was able to work it around by manually rolling python-network-runner down to 0.1.7

[root@hextupleo-controller-2 ~]# podman exec -tiu root neutron_api /bin/bash
()[root@hextupleo-controller-2 /]# rpm -qa | grep network-runner
ansible-role-network-runner-0.1.7-3.el8ost.noarch
python3-network-runner-0.1.7-3.el8ost.noarch

now my net-ansible functionality is functional again.

Comment 19 Lon Hohberger 2021-01-20 15:49:20 UTC
Update testing passed:

[root@rhel8-test yum.repos.d]# dnf update python3-network-runner
overrides-lhh                                    51 kB/s | 2.7 kB     00:00    
Dependencies resolved.
================================================================================
 Package                          Arch     Version            Repository   Size
================================================================================
Installing:
 python3-network-runner017        noarch   0.1.7-4.el8ost     overrides    40 k
     replacing  python3-network-runner.noarch 0.2.2-2.el8ost
Installing dependencies:
 ansible-role-network-runner017   noarch   0.1.7-4.el8ost     overrides    24 k
     replacing  ansible-role-network-runner.noarch 0.2.2-2.el8ost

... and switching to the new -4 build:

 ansible-role-network-runner
                   noarch 0.2.2-4.el8ost    overrides                      26 k
     replacing  ansible-role-network-runner017.noarch 0.1.7-4.el8ost
 python3-network-runner
                   noarch 0.2.2-4.el8ost    overrides                      35 k
     replacing  python3-network-runner017.noarch 0.1.7-4.el8ost

Comment 35 errata-xmlrpc 2021-03-17 15:36:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0817