Description of problem:

During an update from 16.1 to 16.2, we see ping loss to a VM created on the overcloud after the undercloud update but before the overcloud update.

TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA_no_ceph-ipv4-minimal/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
Thursday 03 February 2022  05:50:27 +0000 (1:46:16.798)       2:00:39.385 *****
fatal: [undercloud-0]: FAILED! => {
    "changed": true,
    "cmd": "source /home/stack/qe-Cloud-0rc\n/home/stack/l3_agent_stop_ping.sh 0\n",
    "delta": "0:00:00.071014",
    "end": "2022-02-03 05:50:27.773144",
    "rc": 1,
    "start": "2022-02-03 05:50:27.702130"
}

STDOUT:

6238 packets transmitted, 2411 received, 61.3498% packet loss, time 6377528ms
rtt min/avg/max/mdev = 0.429/0.899/185.890/3.788 ms
Ping loss higher than 0 seconds detected (1509 seconds)

This loss of connectivity happens during the update of the Controllers.
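For context, the failing check is driven by l3_agent_stop_ping.sh, which pings the FIP once per second and fails when any loss is detected. The following is a minimal, hypothetical sketch of that kind of check; the function name and the accounting are assumptions (the real script evidently reports a contiguous outage window, e.g. 1509 s, rather than the total number of lost probes):

```shell
#!/bin/sh
# Hypothetical sketch of a ping-loss check in the spirit of
# l3_agent_stop_ping.sh. Given the summary line from `ping`, estimate
# the outage in seconds (assuming one probe per second) and fail when
# it exceeds a threshold. This is an illustration, not the real script.
check_ping_loss() {
    summary=$1          # e.g. "6238 packets transmitted, 2411 received, ..."
    threshold=${2:-0}   # maximum tolerated loss, in seconds

    transmitted=$(printf '%s\n' "$summary" | awk '{print $1}')
    received=$(printf '%s\n' "$summary" | awk '{print $4}')
    lost=$((transmitted - received))

    if [ "$lost" -gt "$threshold" ]; then
        echo "Ping loss higher than $threshold seconds detected ($lost seconds)"
        return 1
    fi
    echo "Ping loss within threshold ($lost seconds)"
}
```

With the summary from this job (6238 transmitted, 2411 received), such a check would report 3827 lost probes and exit non-zero, which matches the task failing with rc=1.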
Version-Release number of selected component (if applicable):

16.1 puddle: RHOS-16.1-RHEL-8-20211126.n.1
16.2 puddle: RHOS-16.2-RHEL-8-20220201.n.1

OVN:
rg -zi ovn controller-0/var/log/extra/podman/containers/ovn_controller/log/dnf.rpm.log.gz
ovn-2021-21.12.0-11.el8fdp.x86_64
rhosp-ovn-2021-4.el8ost.1.noarch
ovn-2021-host-21.12.0-11.el8fdp.x86_64
rhosp-ovn-host-2021-4.el8ost.1.noarch

OVS:
rg -zi openvswitch controller-0/var/log/dnf.rpm.log.gz
network-scripts-openvswitch2.15-2.15.0-57.el8fdp.x86_64
rhosp-network-scripts-openvswitch-2.15-4.el8ost.1.noarch
openvswitch2.15-2.15.0-57.el8fdp.x86_64
rhosp-openvswitch-2.15-4.el8ost.1.noarch
rhosp-openvswitch-2.13-12.el8ost.noarch
network-scripts-openvswitch2.13-2.13.0-124.el8fdp.x86_64

How reproducible:

All jobs jumping from 16.1 to 16.2 failed:
- DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA-ipv4
- DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-composable-ipv6
- DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA_no_ceph-ipv4-minimal

The last one failed twice, so the failure is consistent.

Steps to Reproduce:
1. Install 16.1 RHOS-16.1-RHEL-8-20211126.n.1.
2. Update the undercloud.
3. Create a VM with a FIP.
4. Ping that FIP.
5. Run update prepare for RHOS-16.2-RHEL-8-20220201.n.1.
6. Run the update on the controllers.
7. All controllers get updated.
8. Check the ping log.

Actual results:

6238 packets transmitted, 2411 received, 61.3498% packet loss, time 6377528ms
rtt min/avg/max/mdev = 0.429/0.899/185.890/3.788 ms
Ping loss higher than 0 seconds detected (1509 seconds)

Expected results:

0 packet loss.
Hi,

Requesting blocker status for 16.2, as we cannot update from 16.1 to 16.2 without a disconnection of the data plane.

Regards,
Hi,

So we just got the result for 16.2->16.2, and it is also impacted (the problem is not limited to the 16.1->16.2 update).

2022-02-10 18:05:57.872 | TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
2022-02-10 18:05:57.877 | task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16.2-from-ga-composable-ipv6/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
2022-02-10 18:05:57.881 | Thursday 10 February 2022  18:05:57 +0000 (1:10:09.770)       1:25:18.427 *****
2022-02-10 18:05:58.193 | fatal: [undercloud-0]: FAILED! => {
2022-02-10 18:05:58.197 |     "changed": true,
2022-02-10 18:05:58.203 |     "cmd": "source /home/stack/qe-Cloud-0rc\n/home/stack/l3_agent_stop_ping.sh 0\n",
2022-02-10 18:05:58.208 |     "delta": "0:00:00.113254",
2022-02-10 18:05:58.212 |     "end": "2022-02-10 18:05:58.160053",
2022-02-10 18:05:58.216 |     "rc": 1,
2022-02-10 18:05:58.220 |     "start": "2022-02-10 18:05:58.046799"
2022-02-10 18:05:58.224 | }
2022-02-10 18:05:58.229 |
2022-02-10 18:05:58.234 | STDOUT:
2022-02-10 18:05:58.239 |
2022-02-10 18:05:58.243 | 4147 packets transmitted, 1617 received, 61.008% packet loss, time 4210147ms
2022-02-10 18:05:58.247 | rtt min/avg/max/mdev = 0.532/1.277/16.031/0.713 ms
2022-02-10 18:05:58.251 | Ping loss higher than 0 seconds detected (989 seconds)
2022-02-10 18:05:58.255 |
2022-02-10 18:05:58.261 |
2022-02-10 18:05:58.265 | MSG:
This didn't work because of the way the OSP update framework works: we deliver the patch on the Controller nodes first and then on the Compute nodes. So triggering

ovs-vsctl set open . external_ids:ovn-match-northd-version=true

on the OSP Controller role [1] is not enough to prevent the issue from happening.

This shows the parameter being taken into account on the controllers:

DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA_no_ceph-ipv4-minimal/10
$ rg -z external_ids:ovn-match-northd-version

undercloud-0/home/stack/overcloud_update_run-Controller.log.gz
8591:2022-02-11 21:22:58 | "<13>Feb 11 21:22:18 puppet-user: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created",
46046:2022-02-11 22:08:36 | "<13>Feb 11 22:07:53 puppet-user: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created",
80883:2022-02-11 22:49:15 | "<13>Feb 11 22:48:39 puppet-user: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created",

controller-2/var/log/extra/journal.txt.gz
69463:Feb 11 22:48:39 controller-2 ovs-vsctl[303431]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true
69471:Feb 11 22:48:39 controller-2 puppet-user[301964]: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created

controller-1/var/log/extra/journal.txt.gz
60384:Feb 11 22:07:53 controller-1 ovs-vsctl[105854]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true
60385:Feb 11 22:07:53 controller-1 puppet-user[104641]: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created

controller-0/var/log/extra/journal.txt.gz
52737:Feb 11 21:22:18 controller-0 ovs-vsctl[955074]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true

But it results in:

7430 packets transmitted, 2721 received, 63.3782% packet loss, time 7596893ms
rtt min/avg/max/mdev = 0.409/0.863/22.616/0.804 ms
Ping loss higher than 0 seconds detected (1760 seconds)

[1] Which is what the patch does; the order of delivery is the update order, i.e. Controllers first.
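For reference, the flag discussed above can be set and inspected on a node directly with ovs-vsctl. This is a hypothetical operator snippet for checking a node by hand, not part of the update framework itself:

```shell
# Tell ovn-controller to only act on southbound DB contents written by
# a matching ovn-northd version (what the attempted fix configures):
ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true

# Verify the current value on the node:
ovs-vsctl get Open_vSwitch . external_ids:ovn-match-northd-version
```

As the logs above show, the puppet-driven equivalent of the `set` command did run on all three controllers, yet the data plane cut still occurred because the Compute nodes had not received the patch yet.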
*** Bug 2052576 has been marked as a duplicate of this bug. ***
Hi,

Following the recommendation in https://bugzilla.redhat.com/show_bug.cgi?id=2052494#c12, we have to add a new step in the update process to update ovn-controller before the ovn-northd database. This will result in a new stage before "3.3. Updating all Controller nodes" in [1]:

Update all ovn-controllers:

openstack overcloud external-update run --stack qe-Cloud-0 --tags ovn --no-workflow

At least that is the idea; the patch is still under development.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-all-controller-nodes_updating-overcloud
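To make the intended ordering concrete, a sketch of the updated sequence might look like the following (the stack name is an example from this CI job, and the exact `update run` invocation is an assumption based on the documented procedure):

```shell
# Hypothetical outline of the revised overcloud minor-update sequence.
STACK=qe-Cloud-0   # example stack name from this job

# New step: update all ovn-controller containers first, so that
# ovn-controller is on the new version before ovn-northd (which lives
# on the Controller nodes) is touched:
openstack overcloud external-update run --stack "$STACK" --tags ovn --no-workflow

# Then proceed with the usual Controller node update:
openstack overcloud update run --stack "$STACK" --limit Controller
```

This mirrors the upstream OVN upgrade guidance: ovn-controller first, ovn-northd and the databases afterwards.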
This is implemented in https://review.opendev.org/c/openstack/tripleo-heat-templates/+/829393 . It should be noted that the other patches, while perhaps still good to have (they would trade a data plane cut for a control plane cut), do not solve the issue.
Hi @kgilliga,

We're going to need a 16.2 documentation update for the update procedure. This is going to be a new step between 3.2 and 3.3:

"""
* Running the ovn-controller update.

Log in to the undercloud as the stack user.

Source the stackrc file:

$ source ~/stackrc

Run:

$ openstack overcloud external-update run --stack <stack_name> --tags ovn

This updates all ovn-controller containers to the new version. This is in accordance with the OVN upgrade procedure, where ovn-controller needs to be updated *before* the ovn-northd service. Note that ovn-controller is usually colocated on the OSP Compute role servers, while ovn-northd runs on the OSP Controller role servers.
"""

Something like this. We will need the same for 16.1 eventually, but we should start the review process for 16.2 first. Do you need/want another bz for this documentation issue?

Thanks,
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenStack Platform 16.2 (openstack-tripleo-heat-templates) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0995
Hi,

Provided the cut was not permanent but temporary (please confirm), you're likely hitting https://bugzilla.redhat.com/show_bug.cgi?id=2094265 . It's an issue where OVN takes more time than expected to flush and recreate the OVS flows because of a schema modification. The issue does not happen all the time, and we didn't hit it in CI before release.

In any case, using the OVN procedure is still mandatory: without it, the cut persists until all Compute nodes are updated [1]. That is what this bugzilla was about: making sure we follow the OVN upgrade procedure in director.

Hope it helps.

[1] This is actually another issue entirely, where OVN requires that ovn-controller be updated before the OVN north db.