Bug 1885342
| Summary: | [update][osp13] 67% ping loss on composable deployment (with networker) for osp13<=z7 | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sofer Athlan-Guyot <sathlang> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 13.0 (Queens) | CC: | bcafarel, mburns, ralonsoh, skaplons |
| Target Milestone: | --- | Keywords: | TestOnly, Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-12-03 15:22:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Sofer Athlan-Guyot
2020-10-05 16:28:48 UTC
Hi, we get a complete floating IP (fip) loss as soon as the third controller is updated, before the networker is updated. The fip doesn't return to normal operation, i.e. it stays blocked. This doesn't happen when the deployment has no networkers, and it doesn't happen on osp13 versions z8 and above.

I couldn't see anything relevant in the openvswitch logs, so I compared the openvswitch versions that were working against those that were not:

| State | Name | Starting point | openvswitch version | RHEL |
|---|---|---|---|---|
| fail | ga | 2018-06-21.2 | openvswitch-2.9.0-19.el7fdp.1.x86_64.rpm | 7.6 |
| fail | z3 | 2018-11-07.3 | openvswitch-2.9.0-56.el7fdp.x86_64.rpm | 7.6 |
| fail | z5 | 2019-03-18.1 | openvswitch-2.9.0-97.el7fdp.x86_64.rpm | 7.6 |
| fail | z7 | 2019-06-28.1 | openvswitch-2.9.0-103.el7fdp.x86_64.rpm | 7.6 |
| working | z8 | 2019-08-27.2 | openvswitch-2.9.0-110.el7fdp.x86_64.rpm | 7.7 |
| working | z11 | 2020-03-04.1 | openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64.rpm | 7.7 |
| | | | openvswitch2.11-2.11.0-35.el7fdp.x86_64.rpm | |

updated to rhosp-openvswitch 2.11-0.7.el7ost,,openvswitch2.11.x86_64 0:2.11.3-64.el7fdp

As I said, we don't have ping loss on non-composable deployments. The composable deployment includes 2 networkers.
One thing I notice in a live reproducer is that the networkers have their protocols set:

[root@networker-1 ~]# ovs-vsctl list bridge| grep -E '(name|protoc)'
name : br-tun
protocols : ["OpenFlow10", "OpenFlow13"]
name : br-isolated
protocols : ["OpenFlow10", "OpenFlow13"]
name : br-int
protocols : ["OpenFlow10", "OpenFlow13"]
name : br-ex
protocols : ["OpenFlow10", "OpenFlow13"]

while the controllers don't:

[heat-admin@controller-1 ~]$ sudo ovs-vsctl list bridge| grep -E '(name|protoc)'
name : br-ex
protocols : []
name : br-isolated
protocols : []

Could this be some kind of error related to https://bugzilla.redhat.com/show_bug.cgi?id=1843811 or https://bugzilla.redhat.com/show_bug.cgi?id=1863024 ?

So, the actual error happens on the networker role (thanks slaweq):
2020-10-15 12:46:24.473 43605 DEBUG oslo.privsep.daemon [-] privsep: reply[140493428170416]: (4, ['qdhcp-ba395bdf-a349-49f5-882d-bf624d3c602d', 'qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb']) loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:456
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb" failed: Invalid argument
: ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb" failed: Invalid argument
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 161, in call
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 1177, in process
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info self._process_internal_ports()
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 625, in _process_internal_ports
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info existing_devices = self._get_existing_devices()
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 484, in _get_existing_devices
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info ip_devs = ip_wrapper.get_devices()
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 139, in get_devices
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info return []
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info self.force_reraise()
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info six.reraise(self.type_, self.value, self.tb)
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 132, in get_devices
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info log_fail_as_error=self.log_fail_as_error).split()
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 147, in execute
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info returncode=returncode)
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb" failed: Invalid argument
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info
2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info
2020-10-15 12:46:24.476 42960 ERROR neutron.agent.l3.agent [-] Failed to process compatible router: 4d2a994b-e7dc-44fe-8fcd-5a65409d49bb: ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a6540
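To see how many routers on a networker hit this error, the log can be scanned for the stderr string shown in the traceback above. The following is only a minimal sketch: the function name `failing_namespaces` is made up for illustration, and the regular expression assumes the message text is exactly as logged here.

```python
import re
from collections import Counter

# Matches the stderr text seen in the traceback above and captures the
# namespace name (for example "qrouter-<uuid>").
NETNS_ERROR = re.compile(
    r'setting the network namespace "(?P<ns>[^"]+)" failed: Invalid argument'
)


def failing_namespaces(lines):
    """Count occurrences of the netns error per namespace name.

    `lines` is any iterable of log lines (a list or an open file).
    Hypothetical helper, not part of neutron.
    """
    counts = Counter()
    for line in lines:
        for match in NETNS_ERROR.finditer(line):
            counts[match.group("ns")] += 1
    return counts
```

It could be used as `failing_namespaces(open(path))`, where `path` points at the L3 agent log on the networker. The log location depends on the deployment and is not stated in this report.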
The problem doesn't happen on non-composable deployments, so something in the version difference between the controller and the networker must be causing the issue.
Should we try starting the update with the networker role?
Hi, further investigation showed that updating the networker first leads to the same situation, so we don't need the rabbitmq disconnection. A plain update of the networker leads to the same error. The only way out of this is to reboot the networker.

So, first a question that would require more analysis:
- why doesn't this show up when we don't use the networker role? With a simple HA deployment with the l3 agent on ctlr-0,1,2 we don't have the cut. How can this be?

Then we need a way out of this. Currently the most likely way out is a knowledge base article, as the necessary procedure won't fit in the update workflow. A currently working procedure seems to be (needs more tests):
- after update prepare:
  - evacuate the routers present on networker-X, for X=0;
  - update networker-X: run openstack overcloud update run --limit networker-X --playbook all
  - reboot networker-X
  - move to the next networker, X+1

Or, as described in c#55:
- mount --rbind /run/netns /run/netns
- mount --make-shared /run/netns

This needs to be confirmed and tested.

The root cause seems to be the old version of iproute and pyroute in rhel-7.6, jumping all the way to rhel-7.9. Early testing seems to show that a good workaround is:
tripleo-ansible-inventory --plan "qe-Cloud-0" --ansible_ssh_user heat-admin --static-yaml-inventory inventory.yaml
cat workaround.yaml
---
- hosts: Networker
  gather_facts: false
  become: true
  tasks:
    - name: Workaround for flawed iproute and pyroute
      shell: |
        mount --rbind /run/netns /run/netns
        mount --make-shared /run/netns
ansible-playbook -i inventory.yaml workaround.yaml
This needs more extensive testing, but it seems to show good results.
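One way to check whether the workaround took effect on a networker is to look at the propagation flags of /run/netns in /proc/self/mountinfo, where a shared mount carries a `shared:N` tag among its optional fields. The sketch below only parses that text. The helper names `netns_propagation` and `is_shared` are made up for illustration, and whether a shared /run/netns is sufficient to avoid the error still needs the testing mentioned above.

```python
def netns_propagation(mountinfo_text, mount_point="/run/netns"):
    """Return the optional propagation fields (e.g. ['shared:23']) of the
    last mountinfo entry for mount_point, or None if there is no entry.

    Hypothetical helper. mountinfo lines look like:
    ID PARENT MAJ:MIN ROOT MOUNTPOINT OPTIONS [optional...] - FSTYPE SRC SUPEROPTS
    """
    result = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[4] != mount_point:
            continue
        try:
            # The optional fields end at the lone "-" separator.
            separator = fields.index("-", 6)
        except ValueError:
            continue  # malformed line, skip it
        result = fields[6:separator]
    return result


def is_shared(mountinfo_text, mount_point="/run/netns"):
    """True if mount_point is mounted with shared propagation."""
    fields = netns_propagation(mountinfo_text, mount_point)
    return bool(fields) and any(f.startswith("shared:") for f in fields)
```

On a live node this could be called as `is_shared(open("/proc/self/mountinfo").read())`. A result of `False` after the playbook has run would suggest the bind mount or the `--make-shared` step did not apply.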
|