Description of problem:

During the update from OSP16 to OSP16.1 there is a 2 minute traffic loss because the openvswitch package is updated on compute nodes:

  Installed: openvswitch2.13-2.13.0-39.el8fdp.x86_64
  Installed: rhosp-openvswitch-2.13-8.el8ost.noarch
  Removed: openvswitch2.11-2.11.0-35.el8fdp.x86_64
  Removed: rhosp-openvswitch-2.11-0.5.el8ost.noarch

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Can you please attach sosreports from the affected nodes + update logs from the Undercloud?
Hi, one more note about the possible solution. The upgrade code that prevents the openvswitch restart is rather involved[1], which is why using some migration mechanism would be better IMHO.

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/tripleo-packages/tripleo-packages-baremetal-puppet.yaml#L359-L495
Hi,

So I made a mistake in the TZ used for the different logs, especially for the ping test. Those are in Unix time and my conversion was wrong. Using no TZ when doing the conversion fixed it:

cat undercloud-0/home/stack/ping_results_202007192119.log | TZ= perl -pe 's/([\d]{10}\.[\d]{3})/localtime $1/eg;' | head -n2
PING 10.0.0.246 (10.0.0.246) 56(84) bytes of data.
[Sun Jul 19 21:19:52 2020130] 64 bytes from 10.0.0.246: icmp_seq=1 ttl=63 time=8.13 ms

which is in the same TZ as the Controller log:

head -1 undercloud-0/home/stack//overcloud_update_run_Controller.log
2020-07-19 21:19:53 | Running minor update all playbooks for Controller role

Now the cut happens at:

[Sun Jul 19 23:32:13 2020343] 64 bytes from 10.0.0.246: icmp_seq=7930 ttl=63 time=1.96 ms
[Sun Jul 19 23:32:27 2020673] From 10.0.0.246 icmp_seq=7941 Destination Host Unreachable
[Sun Jul 19 23:32:27 2020743] From 10.0.0.246 icmp_seq=7942 Destination Host Unreachable
...
[Sun Jul 19 23:34:14 2020864] From 10.0.0.246 icmp_seq=8046 Destination Host Unreachable
[Sun Jul 19 23:34:14 2020872] From 10.0.0.246 icmp_seq=8047 Destination Host Unreachable
[Sun Jul 19 23:34:15 2020672] 64 bytes from 10.0.0.246: icmp_seq=8048 ttl=63 time=1059 ms
[Sun Jul 19 23:34:15 2020808] 64 bytes from 10.0.0.246: icmp_seq=8049 ttl=63 time=19.0 ms

And this is during the Compute update in undercloud-0/home/stack//overcloud_update_run_Compute.log:

"Removed: network-scripts-10.00.4-1.el8.x86_64",
"Removed: unbound-libs-1.7.3-8.el8.x86_64",
"Removed: network-scripts-openvswitch2.11-2.11.0-35.el8fdp.x86_64",
"Removed: nftables-1:0.9.0-14.el8.x86_64"]}
2020-07-19 23:34:38 | 2020-07-19 23:34:38 | TASK [Ensure openvswitch is running after update] ******************************
2020-07-19 23:34:38 | Sunday 19 July 2020 23:34:10 +0000 (0:05:49.926) 0:06:26.658 ***********
2020-07-19 23:34:38 | changed: [compute-1] => {"changed": true, "enabled": true, "name": "openvswitch", "state": "started", "status": {"ActiveEnterTimestamp": "Sun 2020-07-19 16:10:05 UTC", "ActiveEnterTimestampMonotonic": "10155462", "ActiveExitTimestamp": "Sun 2020-07-19 23:32:15 UTC", "ActiveExitTimestampMonotonic": "26539527478", "ActiveState": "inactive", "After": "network-pre.targe

In compute-0/var/log/openvswitch/ovs-vswitchd.log we can see that all interfaces are deleted:

2020-07-19T23:32:14.188Z|00410|bridge|INFO|bridge br-tun: deleted interface patch-int on port 1
2020-07-19T23:32:14.188Z|00411|bridge|INFO|bridge br-tun: deleted interface br-tun on port 65534
2020-07-19T23:32:14.188Z|00412|bridge|INFO|bridge br-tun: deleted interface vxlan-ac110239 on port 2
2020-07-19T23:32:14.188Z|00413|bridge|INFO|bridge br-tun: deleted interface vxlan-ac110228 on port 4
2020-07-19T23:32:14.188Z|00414|bridge|INFO|bridge br-tun: deleted interface vxlan-ac110258 on port 7
2020-07-19T23:32:14.188Z|00415|bridge|INFO|bridge br-tun: deleted interface vxlan-ac110235 on port 5
2020-07-19T23:32:14.188Z|00416|bridge|INFO|bridge br-tun: deleted interface vxlan-ac110220 on port 6
2020-07-19T23:32:14.188Z|00417|bridge|INFO|bridge br-tun: deleted interface vxlan-ac11023e on port 3
2020-07-19T23:32:14.194Z|00418|bridge|INFO|bridge br-int: deleted interface qvob4b75086-38 on port 71
2020-07-19T23:32:14.194Z|00419|bridge|INFO|bridge br-int: deleted interface int-br-isolated on port 2
2020-07-19T23:32:14.194Z|00420|bridge|INFO|bridge br-int: deleted interface patch-tun on port 3
2020-07-19T23:32:14.195Z|00421|bridge|INFO|bridge br-int: deleted interface br-int on port 65534
2020-07-19T23:32:14.197Z|00422|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2020-07-19T23:32:14.200Z|00423|bridge|INFO|bridge br-ex: deleted interface ens5 on port 1
2020-07-19T23:32:14.200Z|00424|bridge|INFO|bridge br-ex: deleted interface br-ex on port 65534
2020-07-19T23:32:14.200Z|00425|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2020-07-19T23:32:14.204Z|00426|bridge|INFO|bridge br-isolated: deleted interface phy-br-isolated on port 5
2020-07-19T23:32:14.205Z|00427|bridge|INFO|bridge br-isolated: deleted interface vlan20 on port 2
2020-07-19T23:32:14.205Z|00428|bridge|INFO|bridge br-isolated: deleted interface vlan30 on port 4
2020-07-19T23:32:14.205Z|00429|bridge|INFO|bridge br-isolated: deleted interface vlan50 on port 3
2020-07-19T23:32:14.205Z|00430|bridge|INFO|bridge br-isolated: deleted interface br-isolated on port 65534
2020-07-19T23:32:14.205Z|00431|bridge|INFO|bridge br-isolated: deleted interface ens4 on port 1

and recreated later:

2020-07-19T23:34:12.185Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-07-19T23:34:12.199Z|00002|ovs_numa|INFO|Discovered 8 CPU cores on NUMA node 0
2020-07-19T23:34:12.199Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 8 CPU cores
2020-07-19T23:34:12.201Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-07-19T23:34:12.201Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-07-19T23:34:12.206Z|00006|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2020-07-19T23:34:12.216Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2020-07-19T23:34:12.217Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2020-07-19T23:34:12.217Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 1
2020-07-19T23:34:12.217Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2020-07-19T23:34:12.217Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids
2020-07-19T23:34:12.217Z|00012|ofproto_dpif|INFO|system@ovs-system: Datapath supports clone action
2020-07-19T23:34:12.217Z|00013|ofproto_dpif|INFO|system@ovs-system: Max sample nesting level probed as 10
2020-07-19T23:34:12.217Z|00014|ofproto_dpif|INFO|system@ovs-system: Datapath supports eventmask in conntrack action
2020-07-19T23:34:12.217Z|00015|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_clear action
2020-07-19T23:34:12.217Z|00016|ofproto_dpif|INFO|system@ovs-system: Max dp_hash algorithm probed to be 0
2020-07-19T23:34:12.217Z|00017|ofproto_dpif|INFO|system@ovs-system: Datapath does not support check_pkt_len action
2020-07-19T23:34:12.217Z|00018|ofproto_dpif|INFO|system@ovs-system: Datapath does not support timeout policy in conntrack action
2020-07-19T23:34:12.217Z|00019|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state
2020-07-19T23:34:12.217Z|00020|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_zone
2020-07-19T23:34:12.217Z|00021|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_mark
2020-07-19T23:34:12.217Z|00022|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_label
2020-07-19T23:34:12.217Z|00023|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state_nat
2020-07-19T23:34:12.217Z|00024|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple
2020-07-19T23:34:12.217Z|00025|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple6
2020-07-19T23:34:12.217Z|00026|ofproto_dpif|INFO|system@ovs-system: Datapath does not support IPv6 ND Extensions
2020-07-19T23:34:12.253Z|00027|bridge|INFO|bridge br-ex: added interface ens5 on port 1
2020-07-19T23:34:12.253Z|00028|bridge|INFO|bridge br-ex: added interface br-ex on port 65534
2020-07-19T23:34:12.253Z|00029|bridge|INFO|bridge br-ex: added interface phy-br-ex on port 2
2020-07-19T23:34:12.253Z|00030|bridge|INFO|bridge br-tun: added interface patch-int on port 1
2020-07-19T23:34:12.253Z|00031|bridge|INFO|bridge br-tun: added interface br-tun on port 65534
2020-07-19T23:34:12.254Z|00032|bridge|INFO|bridge br-tun: added interface vxlan-ac110228 on port 4
2020-07-19T23:34:12.254Z|00033|bridge|INFO|bridge br-tun: added interface vxlan-ac110239 on port 2
2020-07-19T23:34:12.255Z|00034|bridge|INFO|bridge br-tun: added interface vxlan-ac110258 on port 7
2020-07-19T23:34:12.255Z|00035|bridge|INFO|bridge br-tun: added interface vxlan-ac110235 on port 5
2020-07-19T23:34:12.255Z|00036|bridge|INFO|bridge br-tun: added interface vxlan-ac110220 on port 6
2020-07-19T23:34:12.255Z|00037|bridge|INFO|bridge br-tun: added interface vxlan-ac11023e on port 3
2020-07-19T23:34:12.255Z|00038|bridge|INFO|bridge br-int: added interface qvob4b75086-38 on port 71

And then in compute-0/var/log/messages we can see that openvswitch is stopped:

Jul 19 23:32:14 compute-0 systemd[1]: Stopping Open vSwitch...
Jul 19 23:32:14 compute-0 systemd[1]: Stopped Open vSwitch.
Jul 19 23:32:14 compute-0 systemd[1]: Stopping Open vSwitch Forwarding Unit...
Jul 19 23:32:14 compute-0 ovs-ctl[141379]: Exiting ovs-vswitchd (2731) [  OK  ]
Jul 19 23:32:14 compute-0 kernel: device vxlan_sys_4789 left promiscuous mode
Jul 19 23:32:14 compute-0 NetworkManager[2523]: <info> [1595201534.2180] device (vxlan_sys_4789): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Jul 19 23:32:14 compute-0 systemd[1]: Stopped Open vSwitch Forwarding Unit.
Jul 19 23:32:14 compute-0 systemd[1]: Stopping Open vSwitch Database Unit...
Jul 19 23:32:14 compute-0 ovs-ctl[141473]: Exiting ovsdb-server (2600) [  OK  ]
Jul 19 23:32:14 compute-0 systemd[1]: Stopped Open vSwitch Database Unit.

and then started again 2 min later:

Jul 19 23:34:11 compute-0 systemd[1]: Reloading.
Jul 19 23:34:11 compute-0 systemd[1]: Starting Open vSwitch Database Unit...
...
Jul 19 23:34:12 compute-0 systemd[1]: Starting Open vSwitch Forwarding Unit...
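Side note on the TZ confusion from earlier in this comment: converting the epoch timestamps explicitly in UTC avoids the ambiguity altogether, since the systemd/ovs logs are in UTC. A minimal sketch, assuming GNU date; the epoch value 1595201534 is taken from the NetworkManager line above:

```shell
# Convert the epoch timestamp from the NetworkManager log entry
# ([1595201534.2180]) to UTC, so it lines up with the ovs-vswitchd
# and systemd timestamps without any TZ guesswork.
ts=1595201534
date -u -d "@${ts}" '+%Y-%m-%d %H:%M:%S UTC'   # → 2020-07-19 23:32:14 UTC
```

This matches the "Jul 19 23:32:14" systemd lines above, confirming both logs are on the same clock.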
Bottom line, all this happens during the yum update, and the recovery is an explicit task:

2020-07-19 23:34:07 | TASK [Update all packages] *****************************************************
2020-07-19 23:34:07 | Sunday 19 July 2020 23:28:20 +0000 (0:00:00.179) 0:00:36.732 ***********

and we get the service started again when we run this task:

2020-07-19 23:34:38 | TASK [Ensure openvswitch is running after update] ******************************
2020-07-19 23:34:38 | Sunday 19 July 2020 23:34:10 +0000 (0:05:49.926) 0:06:26.658 ***********

I need Networking help to understand why updating packages would stop the service and why we need an explicit restart task to have it working again.

In the template this matches these tasks: https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/tripleo-packages/tripleo-packages-baremetal-puppet.yaml#L582-L596
Openvswitch packaging is known to possibly disrupt the data plane by restarting services, which is why the aforementioned template (starting at around https://github.com/openstack/tripleo-heat-templates/blob/3c49cc8281196829882b1342501c6ba78213a40c/deployment/tripleo-packages/tripleo-packages-baremetal-puppet.yaml#L386) handles openvswitch-related packaging differently to avoid outages on upgrade. Since this seems to be happening on an update, these tasks should be referenced in the update_tasks as well.
> I need Networking help to understand why updating packages would stop the
> service and why we need an explicit restart task to have it working again.

This is the result of openvswitch package versioning: the actual package name is openvswitch2.13 (for example), and there is a wrapper package, rhosp-openvswitch, to handle that. So updating the major version means removing the previous major version package (stopping the service) and installing the newer version (which needs an explicit start to re-enable it). This is why the upgrade has specific code to remove the old version without stopping ovs itself, and to enable the service from the new package.

(hopefully I remember it correctly, Brent feel free to correct if details were wrong)
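For illustration, the mechanism described above can be sketched as a rough command sequence. This is a hedged sketch only: the flags shown are illustrative, and the authoritative logic lives in the tripleo-heat-templates upgrade tasks linked earlier in this bug. It is not meant to be run as-is on a production node.

```shell
# Illustrative sketch of a non-disruptive OVS major-version swap
# (NOT the exact tripleo implementation; see the linked
# tripleo-packages-baremetal-puppet.yaml for the real tasks).

# 1. Erase the old versioned package without running its scriptlets,
#    so the running ovs daemons (and the data plane) are left alone:
rpm -e --nodeps --noscripts openvswitch2.11

# 2. Install the wrapper package, which pulls in the new versioned package:
dnf install -y rhosp-openvswitch

# 3. Nothing has restarted the service yet; explicitly (re)enable and
#    start it from the new package at a controlled moment:
systemctl daemon-reload
systemctl enable --now openvswitch
```

Without the --noscripts erase in step 1, removing openvswitch2.11 runs its preun scriptlet, which stops the service and takes the data plane down until step 3, which is the 2-minute cut seen in this bug.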
(In reply to Bernard Cafarelli from comment #7) > > I need Networking help to understand why updating packages would stop the > > service and why we need an explicit restart task to have it working again. > > This is the result of openvswitch package versioning, actual package name is > openvswitch2.13 (for example), and there is a wrapper package > rhosp-openvswitch to handle that. So updating major version means removing > the previous major version package (stopping the service) and installing the > newer version (need an explicit start to re-enable it). This is why upgrade > has specific code to remove old version without stopping ovs itself, and > enable on new package. > > (hopefully I remember it correctly, Brent feel free to correct if details > were wrong) I think this is correct. While we initially thought that Y versions of OVS would be tied to major versions of OSP, this is not true anymore and we'll keep seeing it so looks like the fix is accounting for it during the update task as Brent pointed out. We faced this in the past and possibly was left unnoticed. For example in OSP 13 we moved from OVS 2.9 to 2.11 at some point (z10 IIRC). Did we experience the same downtime during that update? It looks like the answer is yes but the way we measure the downtime in % of the total job duration possibly hid it. IMO, we must change the downtime SLA from a % of the total job duration to an absolute value in seconds. nit.: I think that the BZ title might be misleading as it only happens on updates that upgrades OVS (which should be rare compared to all possible updates). Is this right Sofer?
(In reply to Daniel Alvarez Sanchez from comment #8)
> IMO, we must change the downtime SLA from a % of the total
> job duration to an absolute value in seconds.

Here is the review for tripleo-upgrade to change from % to seconds[1]. We're aiming at 0 seconds of ping loss and will see how it goes.

[1] https://review.opendev.org/742626

> nit.: I think that the BZ title might be misleading as it only happens on
> updates that upgrades OVS (which should be rare compared to all possible
> updates). Is this right Sofer?

The ping loss is linked to the update of ovs, and this happens when we're coming from 16.0. For a 16.1-to-16.1 update we don't really update anything, so there should be no cut. So maybe what should be pointed out is that:
- this happens when coming from 16.0
- this may happen in osp13 as well: we need to check the jobs there.
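To make the %-vs-seconds argument above concrete, here is a small sketch showing why a percentage threshold hides a multi-minute cut in a long update job (the job duration is a rough figure taken from the timestamps in this bug; the 2-minute cut is the one reported in the description):

```shell
# A 2-minute dataplane cut inside a ~3.5-hour update job is under 1%
# of the job duration, so a %-based SLA would never flag it.
job_seconds=$((3 * 3600 + 30 * 60))   # ~3.5 h update job
cut_seconds=120                        # the 2-minute ovs outage
pct=$(awk -v c="$cut_seconds" -v j="$job_seconds" 'BEGIN { printf "%.2f", 100 * c / j }')
echo "${cut_seconds}s loss = ${pct}% of the job"   # → 120s loss = 0.95% of the job
```

An absolute threshold in seconds catches this immediately, which is what the tripleo-upgrade review above switches to.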
To round up the osp13 question: we have checked it, and the workaround is still there for the osp13 update, so nothing has to be done there.
Hi,

According to the last run it looks like we have a regression and the process failed due to packet loss:

core_puddle: RHOS_TRUNK-16.0-RHEL-8-20200204.n.1
core_puddle: RHOS-16.1-RHEL-8-20200813.n.0

TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
task path: /home/rhos-ci/jenkins/workspace/DFG-network-neutron-16-to-16.1-from-GA-composable-ipv4/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
Monday 17 August 2020  17:42:07 +0000 (0:28:50.778)       3:26:29.490 *********
fatal: [undercloud-0]: FAILED! => {
    "changed": true,
    "cmd": "source /home/stack/overcloudrc\n /home/stack/l3_agent_stop_ping.sh 0",
    "delta": "0:00:00.116762",
    "end": "2020-08-17 17:42:08.244176",
    "rc": 1,
    "start": "2020-08-17 17:42:08.127414"
}

STDOUT:

11183 packets transmitted, 11176 received, +3 errors, 0.062595% packet loss, time 11818ms
rtt min/avg/max/mdev = 0.613/1.432/150.010/1.570 ms, pipe 4
Ping loss higher than 0 seconds detected (7 seconds)

MSG:

non-zero return code

to retry, use: --limit @/home/rhos-ci/jenkins/workspace/DFG-network-neutron-16-to-16.1-from-GA-composable-ipv4/infrared/plugins/tripleo-upgrade/infrared_plugin/main.retry

PLAY RECAP *********************************************************************
undercloud-0               : ok=93   changed=38   unreachable=0    failed=1
All 7 packets were lost in the same timeframe, putting it here to ease the troubleshooting:

[1597677871.932560] 64 bytes from 10.0.0.243: icmp_seq=2940 ttl=63 time=1.22 ms
[1597677872.934097] 64 bytes from 10.0.0.243: icmp_seq=2941 ttl=63 time=1.29 ms
[1597677881.118841] From 10.0.0.28 icmp_seq=2946 Destination Host Unreachable
[1597677881.118957] From 10.0.0.28 icmp_seq=2947 Destination Host Unreachable
[1597677881.118962] From 10.0.0.28 icmp_seq=2948 Destination Host Unreachable
[1597677881.121360] 64 bytes from 10.0.0.243: icmp_seq=2949 ttl=63 time=2.56 ms
[1597677882.121932] 64 bytes from 10.0.0.243: icmp_seq=2950 ttl=63 time=1.62 ms
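With ping's default 1-second interval, lost packets translate directly into seconds of outage, which is how the 7-second figure in the job failure above is derived. A small sketch of that conversion (the summary line is copied from the job output above):

```shell
# Derive seconds of dataplane loss from a ping summary line,
# assuming the default 1 packet/second interval.
summary="11183 packets transmitted, 11176 received, +3 errors, 0.062595% packet loss, time 11818ms"
sent=$(echo "$summary" | awk '{print $1}')   # transmitted count (field 1)
recv=$(echo "$summary" | awk '{print $4}')   # received count (field 4)
echo "$((sent - recv)) seconds of ping loss"   # → 7 seconds of ping loss
```

Those 7 lost packets match the icmp_seq gap in the excerpt above (2942-2948, with 2946-2948 reported as Destination Host Unreachable).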
OK, we spent some time looking at a job with similar errors; we do see a ~9 second dataplane downtime. On one specific test setup, it also caused connectivity loss to the node, which required a reboot.

It is specific to ML2/OVS deployments and happens when we start the new neutron_ovs_agent container. At that time in ovs-vswitchd.log we get:

2020-08-19T19:36:25.866Z|00976|vconn|WARN|unix#2: version negotiation failed (we support versions 0x04, 0x06, peer supports version 0x01)
2020-08-19T19:36:25.866Z|00977|rconn|WARN|br-int<->unix#2: connection dropped (Protocol error)

followed by:

2020-08-19T19:36:33.969Z|00991|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2020-08-19T19:36:33.973Z|00992|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2020-08-19T19:36:33.976Z|00993|bridge|INFO|bridge br-int: deleted interface int-br-isolated on port 2
2020-08-19T19:36:33.979Z|00994|bridge|INFO|bridge br-isolated: deleted interface phy-br-isolated on port 6
[...]
2020-08-19T19:36:39.076Z|01003|bridge|INFO|bridge br-int: added interface int-br-ex on port 346
2020-08-19T19:36:39.081Z|01004|bridge|INFO|bridge br-ex: added interface phy-br-ex on port 3

This is apparently caused by the destroy_patch_ports.py script, which is run on container start. Usually it should not do anything when just restarting a container, as ovs is up and a canary check is performed, but for updates from versions with ovs 2.11 (so 16 to 16.1) this check seems to fail, which causes the patch ports to be recreated.

As we can see from the version errors in the logs, this comes from a workaround needed by ovs 2.11: https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/tripleo_ovs_upgrade.py#L162
We set the OpenFlow versions to 1.3 and 1.5 only, while the simple script uses 1.0.
Manually testing with OVS 2.11 shows that after running "ovs-vsctl set bridge br-int protocols=OpenFlow13,OpenFlow15" and restarting neutron_ovs_agent, we see similar logs in ovs-vswitchd.log. After adding OpenFlow10 to the list, we do not see them.

The fix therefore seems to be to also set OpenFlow10 in the tripleo_ovs_upgrade.py workaround. Any potential side effects here? (I think it was not added in the initial workaround as OVN does not need this old version.)
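Based on the manual test above, the proposed change amounts to keeping OpenFlow10 in the bridge's protocol list. As a configuration fragment (to be run on a node with a live OVS; the bridge name matches the logs above):

```shell
# Keep OpenFlow 1.0 alongside 1.3/1.5 so clients that only speak
# OF 1.0 (such as the destroy_patch_ports.py canary check) can
# still negotiate a connection after the upgrade workaround.
ovs-vsctl set bridge br-int protocols=OpenFlow10,OpenFlow13,OpenFlow15

# Confirm what the bridge now advertises:
ovs-vsctl get bridge br-int protocols
```

With OpenFlow10 missing, the canary check fails with the "version negotiation failed (we support versions 0x04, 0x06, peer supports version 0x01)" error quoted above, and the script falls back to recreating the patch ports, causing the downtime.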
Submitted https://review.opendev.org/#/c/747270/ in case this is the correct fix.
The bug is fixed and verified:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-neutron-16-to-16.1-from-GA-composable-ipv4/12/

core_puddle: RHOS-16.1-RHEL-8-20200821.n.0

[stack@undercloud-0 ~]$ rpm -qa | grep tripleo-ansible-0.5.1-0.202
tripleo-ansible-0.5.1-0.20200611113659.34b8fcc.el8ost.noarch
[stack@undercloud-0 ~]$ rpm -qa | grep openstack-tripleo-heat-templates-11.3.2-0.
openstack-tripleo-heat-templates-11.3.2-0.20200616081539.396affd.el8ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3542