
Bug 1885342

Summary: [update][osp13] 67% ping loss on composable deployment (with networker) for osp13<=z7
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Sofer Athlan-Guyot <sathlang>
Assignee: Sofer Athlan-Guyot <sathlang>
QA Contact: David Rosenfeld <drosenfe>
CC: bcafarel, mburns, ralonsoh, skaplons
Keywords: TestOnly, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-12-03 15:22:01 UTC

Description Sofer Athlan-Guyot 2020-10-05 16:28:48 UTC
Description of problem:

Hi,

during a composable update from GA to 2020-09-16.1, we get an error: 67% ping loss during overcloud update run.

Comment 2 Sofer Athlan-Guyot 2020-10-07 08:21:50 UTC
Hi,

so we get a complete loss of floating IP (fip) connectivity as soon as
the third controller is updated, before the networker is updated.  The
fip doesn't return to normal operation, i.e. it stays blocked.  This
doesn't happen when the deployment has no networkers, and it doesn't
happen on osp13 versions at or above z8.

I couldn't see anything relevant in the openvswitch logs.

So I went on to compare the openvswitch versions that were working
against those that were not:

| State   | Name | Starting point | openvswitch version                                 | RHEL    |
|---------+------+----------------+-----------------------------------------------------+---------|
| fail    | ga   |   2018-06-21.2 | openvswitch-2.9.0-19.el7fdp.1.x86_64.rpm            |     7.6 |
| fail    | z3   |   2018-11-07.3 | openvswitch-2.9.0-56.el7fdp.x86_64.rpm              |     7.6 |
| fail    | z5   |   2019-03-18.1 | openvswitch-2.9.0-97.el7fdp.x86_64.rpm              |     7.6 |
| fail    | z7   |   2019-06-28.1 | openvswitch-2.9.0-103.el7fdp.x86_64.rpm             |     7.6 |
| working | z8   |   2019-08-27.2 | openvswitch-2.9.0-110.el7fdp.x86_64.rpm             |     7.7 |
| working | z11  |   2020-03-04.1 | openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64.rpm |     7.7 |
|         |      |                | openvswitch2.11-2.11.0-35.el7fdp.x86_64.rpm         |         |

Updated to rhosp-openvswitch 2.11-0.7.el7ost, openvswitch2.11.x86_64 0:2.11.3-64.el7fdp

As I said, we don't have ping loss on a non-composable deployment.

The composable deployment includes 2 networkers.

One thing I noticed in a live reproducer is that the networkers have
their bridge protocols set:

[root@networker-1 ~]# ovs-vsctl list bridge| grep -E '(name|protoc)'
name                : br-tun
protocols           : ["OpenFlow10", "OpenFlow13"]
name                : br-isolated
protocols           : ["OpenFlow10", "OpenFlow13"]
name                : br-int
protocols           : ["OpenFlow10", "OpenFlow13"]
name                : br-ex
protocols           : ["OpenFlow10", "OpenFlow13"]

while the controllers don't:

[heat-admin@controller-1 ~]$ sudo ovs-vsctl list bridge| grep -E '(name|protoc)'
name                : br-ex
protocols           : []
name                : br-isolated
protocols           : []

Could this be some kind of error related to
https://bugzilla.redhat.com/show_bug.cgi?id=1843811 or
https://bugzilla.redhat.com/show_bug.cgi?id=1863024 ?
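
To make the networker/controller comparison above easier to repeat across nodes, here is a minimal sketch that condenses the `ovs-vsctl list bridge` output into one line per bridge. The function name `summarize_bridge_protocols` is made up for illustration, and it assumes the `name` line comes before the `protocols` line for each bridge, as in the listings above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: condense "ovs-vsctl list bridge" output into
# one "<bridge><TAB><protocols>" line per bridge.
# Reads the raw listing on stdin, so it works on a live host or on a saved capture.
summarize_bridge_protocols() {
    awk -F' *: *' '
        $1 == "name"      { name = $2 }                    # remember the current bridge
        $1 == "protocols" { printf "%s\t%s\n", name, $2 }  # emit once protocols is seen
    '
}
```

On a node this would be run as `sudo ovs-vsctl list bridge | summarize_bridge_protocols`, and the results from a networker and a controller can then be compared with `diff`.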

Comment 6 Sofer Athlan-Guyot 2020-10-15 13:49:50 UTC
So,

the actual error happens on the networker role (thanks slaweq):

    2020-10-15 12:46:24.473 43605 DEBUG oslo.privsep.daemon [-] privsep: reply[140493428170416]: (4, ['qdhcp-ba395bdf-a349-49f5-882d-bf624d3c602d', 'qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb']) loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:456              
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb" failed: Invalid argument                                                            
    : ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb" failed: Invalid argument                                                                                                        
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info Traceback (most recent call last):                                                                                                                                                                            
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 161, in call                                                                                                                          
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     return func(*args, **kwargs)                                                                                                                                                                              
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 1177, in process                                                                                                              
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     self._process_internal_ports()                                                                                                                                                                            
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 625, in _process_internal_ports                                                                                              
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     existing_devices = self._get_existing_devices()                                                                                                                                                          
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 484, in _get_existing_devices                                                                                                
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     ip_devs = ip_wrapper.get_devices()                                                                                                                                                                        
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 139, in get_devices                                                                                                            
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     return []                                                                                                                                                                                                
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__                                                                                                                      
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     self.force_reraise()                                                                                                                                                                                      
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise                                                                                                                  
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     six.reraise(self.type_, self.value, self.tb)                                                                                                                                                              
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 132, in get_devices                                                                                                            
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     log_fail_as_error=self.log_fail_as_error).split()                                                                                                                                                        
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 147, in execute                                                                                                                  
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info     returncode=returncode)                                                                                                                                                                                    
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a65409d49bb" failed: Invalid argument                                        
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info                                                                                                                                                                                                              
    2020-10-15 12:46:24.475 42960 ERROR neutron.agent.l3.router_info                                                                                                                                                                                                              
    2020-10-15 12:46:24.476 42960 ERROR neutron.agent.l3.agent [-] Failed to process compatible router: 4d2a994b-e7dc-44fe-8fcd-5a65409d49bb: ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: setting the network namespace "qrouter-4d2a994b-e7dc-44fe-8fcd-5a6540


The problem doesn't happen on a non-composable deployment, so something in the version difference between the controller and the networker must be causing the issue.

Should we try starting the update with the networker role?
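
The "setting the network namespace ... failed: Invalid argument" error in the traceback can probably be checked outside of the l3 agent as well. The sketch below tries to enter every namespace with `ip netns exec` and prints the ones that fail. The function name `broken_netns` is hypothetical, and it assumes the namespace name is the first field of each `ip netns list` line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every network namespace that "ip netns exec"
# cannot enter. Needs root on the node being checked.
broken_netns() {
    local ns
    # Newer iproute prints "name (id: N)"; keep only the name.
    ip netns list | awk '{ print $1 }' | while read -r ns; do
        # "true" does nothing; only the ability to enter the namespace matters.
        ip netns exec "$ns" true </dev/null 2>/dev/null || echo "$ns"
    done
}
```

An empty output means every namespace could be entered. On an affected networker the qrouter namespace from the log would be expected to show up here, though that has not been verified.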

Comment 8 Sofer Athlan-Guyot 2020-10-21 08:03:46 UTC
Hi,

further investigation showed that updating the networker first leads to the same situation, so the rabbitmq disconnection is not needed to trigger it.  A plain update of the networker leads to the same error.

The only way out of this is to reboot the networker.

So, first a question that requires more analysis:

 - why doesn't this show up when we don't use the networker role?  With a simple HA deployment, where the l3 agent runs on ctlr-0,1,2, we don't get the cut.  How can this be?

Then we need a way out of this.  Currently the most likely way out is a knowledge base article, as the necessary procedure won't fit in the update workflow.

A procedure that currently seems to work (needs more tests):
 - after update prepare, starting with X=0:
   - evacuate the routers present on networker-X;
   - update networker-X: run openstack overcloud update run --limit networker-X --playbook all
   - reboot networker-X
   - move on to the next networker, X+1
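
The per-networker steps above can be sketched as a small wrapper. This is only an outline under assumptions: the helper names `run` and `update_networker` are made up, router evacuation is left as a manual step because the procedure does not say how it is done, and the reboot over ssh assumes the node name resolves from the undercloud and that heat-admin has sudo. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the per-networker procedure.
# Prints each command unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

update_networker() {
    local node="$1"
    # Evacuation is not automated here: how to do it is not specified above.
    echo "# evacuate the routers hosted on ${node} before continuing"
    # Update command taken from the procedure above.
    run openstack overcloud update run --limit "${node}" --playbook all
    # Assumption: the node is reachable by this name as heat-admin.
    run ssh "heat-admin@${node}" sudo reboot
}
```

Calling `update_networker networker-0` and then `update_networker networker-1` prints the plan for both nodes. A real run would also need to wait for each node to come back after the reboot before moving to the next one.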

Or, as described in c#55:

  - mount --rbind /run/netns /run/netns
  - mount --make-shared /run/netns

This needs to be confirmed and tested.
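
One way to confirm that the two mount commands had the intended effect is to look for a `shared:` tag on the /run/netns entry in /proc/self/mountinfo. The sketch below does that; the function name `mount_is_shared` is hypothetical, and it reports success if any mount entry at the given path is marked shared.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the mount point given as $1 has shared
# propagation. mountinfo content is read from stdin.
mount_is_shared() {
    awk -v mp="$1" '
        $5 == mp {
            # Optional fields start at column 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^shared:/) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage on a networker: `mount_is_shared /run/netns < /proc/self/mountinfo && echo shared || echo "not shared"`. Whether a shared /run/netns is enough to avoid the namespace error is exactly what still needs testing.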

In the end, the root cause seems to be the old versions of iproute and pyroute in rhel-7.6, combined with the jump all the way to rhel-7.9.

Comment 10 Sofer Athlan-Guyot 2020-10-21 09:18:07 UTC
So early testing seems to show that a good workaround is:


tripleo-ansible-inventory --plan "qe-Cloud-0" --ansible_ssh_user heat-admin --static-yaml-inventory inventory.yaml

cat workaround.yaml
---
- hosts: Networker
  gather_facts: false
  become: true
  tasks:
    - name: Workaround for flawed iproute and pyroute
      shell: |
        mount --rbind /run/netns /run/netns
        mount --make-shared /run/netns

ansible-playbook -i inventory.yaml workaround.yaml

More extensive testing is needed, but this seems to show good results.