Bug 2050154 - [update] 16.1->16.2 experience a connectivity cut (ping loss) to FIP during update of the controllers.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Sofer Athlan-Guyot
QA Contact: Jason Grosso
URL:
Whiteboard:
Duplicates: 2052576
Depends On: 2052494
Blocks: 2058379
 
Reported: 2022-02-03 11:26 UTC by Sofer Athlan-Guyot
Modified: 2022-08-01 11:32 UTC
CC: 20 users

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20220116004912.el8ost
Doc Type: Enhancement
Doc Text:
Red Hat OpenStack Platform (RHOSP) now supports the correct method of updating OVN. For more information, see https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud
Clone Of:
Environment:
Last Closed: 2022-03-23 22:30:16 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 829393 0 None MERGED Update of OVN controllers as an external task. 2022-03-07 15:29:33 UTC
Red Hat Issue Tracker OSP-12454 0 None None None 2022-02-03 11:43:55 UTC
Red Hat Issue Tracker UPG-4966 0 None None None 2022-02-03 11:43:51 UTC
Red Hat Product Errata RHSA-2022:0995 0 None None None 2022-03-23 22:30:41 UTC

Description Sofer Athlan-Guyot 2022-02-03 11:26:43 UTC
Description of problem:

While updating from 16.1 to 16.2, we see ping loss to a VM that was created on the overcloud after the undercloud update but before the overcloud update.


TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA_no_ceph-ipv4-minimal/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
Thursday 03 February 2022  05:50:27 +0000 (1:46:16.798)       2:00:39.385 ***** 
fatal: [undercloud-0]: FAILED! => {
    "changed": true,
    "cmd": "source /home/stack/qe-Cloud-0rc\n/home/stack/l3_agent_stop_ping.sh 0\n",
    "delta": "0:00:00.071014",
    "end": "2022-02-03 05:50:27.773144",
    "rc": 1,
    "start": "2022-02-03 05:50:27.702130"
}

STDOUT:

6238 packets transmitted, 2411 received, 61.3498% packet loss, time 6377528ms
rtt min/avg/max/mdev = 0.429/0.899/185.890/3.788 ms
Ping loss higher than 0 seconds detected (1509 seconds)

This loss of connectivity happens during the update of the Controller nodes.



Version-Release number of selected component (if applicable):

16.1 puddle: RHOS-16.1-RHEL-8-20211126.n.1
16.2 puddle: RHOS-16.2-RHEL-8-20220201.n.1

OVN:

rg -zi ovn    controller-0/var/log/extra/podman/containers/ovn_controller/log/dnf.rpm.log.gz
ovn-2021-21.12.0-11.el8fdp.x86_64
rhosp-ovn-2021-4.el8ost.1.noarch
ovn-2021-host-21.12.0-11.el8fdp.x86_64
rhosp-ovn-host-2021-4.el8ost.1.noarch

OVS:

rg -zi openvswitch    controller-0/var/log/dnf.rpm.log.gz
network-scripts-openvswitch2.15-2.15.0-57.el8fdp.x86_64
rhosp-network-scripts-openvswitch-2.15-4.el8ost.1.noarch
openvswitch2.15-2.15.0-57.el8fdp.x86_64
rhosp-openvswitch-2.15-4.el8ost.1.noarch
rhosp-openvswitch-2.13-12.el8ost.noarch
network-scripts-openvswitch2.13-2.13.0-124.el8fdp.x86_64


How reproducible: all jobs updating from 16.1 to 16.2 failed:

- DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA-ipv4
- DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-composable-ipv6
- DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA_no_ceph-ipv4-minimal

The last one failed twice, so the failure is consistent.


Steps to Reproduce:

1. install 16.1 RHOS-16.1-RHEL-8-20211126.n.1
2. update undercloud
3. create a vm with a FIP
4. ping that FIP
5. update prepare for RHOS-16.2-RHEL-8-20220201.n.1
6. update run the controllers
7. all controllers get updated
8. check the ping log

Actual results:

6238 packets transmitted, 2411 received, 61.3498% packet loss, time 6377528ms
rtt min/avg/max/mdev = 0.429/0.899/185.890/3.788 ms
Ping loss higher than 0 seconds detected (1509 seconds)

Expected results:

0 packet loss.
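The gate at step 8 can be sketched as follows. This is a hypothetical reconstruction (the shipped l3_agent_stop_ping.sh is not included in this report): it parses the ping(8) summary line shown in the logs above and fails when more packets were lost than the allowed threshold.

```shell
#!/bin/bash
# Hypothetical sketch of the loss gate used at step 8; the real
# l3_agent_stop_ping.sh is not shown in this report. Parses a ping(8)
# summary line such as:
#   "6238 packets transmitted, 2411 received, 61.3498% packet loss, ..."
check_loss() {
    local summary="$1" allowed="${2:-0}"
    local tx rx lost
    tx=$(echo "$summary" | awk '{print $1}')   # packets transmitted
    rx=$(echo "$summary" | awk '{print $4}')   # packets received
    lost=$(( tx - rx ))
    if [ "$lost" -gt "$allowed" ]; then
        echo "Ping loss higher than ${allowed} detected (${lost} packets)"
        return 1
    fi
    echo "Ping loss within threshold (${lost} packets)"
}
```

Fed the summary line from the failing run, a check like this returns nonzero, which is what makes the Ansible task above fatal.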

Comment 3 Sofer Athlan-Guyot 2022-02-10 17:02:59 UTC
Hi,

Requesting blocker status here for 16.2, as we cannot update from 16.1 to 16.2 without a data plane disconnection.

Regards,

Comment 6 Sofer Athlan-Guyot 2022-02-11 13:00:37 UTC
Hi,

so we just got the result for the 16.2->16.2 update and it is also impacted (it is not just the 16.1->16.2 update).

2022-02-10 18:05:57.872 | TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
2022-02-10 18:05:57.877 | task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16.2-from-ga-composable-ipv6/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
2022-02-10 18:05:57.881 | Thursday 10 February 2022  18:05:57 +0000 (1:10:09.770)       1:25:18.427 ***** 
2022-02-10 18:05:58.193 | fatal: [undercloud-0]: FAILED! => {
2022-02-10 18:05:58.197 |     "changed": true,
2022-02-10 18:05:58.203 |     "cmd": "source /home/stack/qe-Cloud-0rc\n/home/stack/l3_agent_stop_ping.sh 0\n",
2022-02-10 18:05:58.208 |     "delta": "0:00:00.113254",
2022-02-10 18:05:58.212 |     "end": "2022-02-10 18:05:58.160053",
2022-02-10 18:05:58.216 |     "rc": 1,
2022-02-10 18:05:58.220 |     "start": "2022-02-10 18:05:58.046799"
2022-02-10 18:05:58.224 | }
2022-02-10 18:05:58.229 | 
2022-02-10 18:05:58.234 | STDOUT:
2022-02-10 18:05:58.239 | 
2022-02-10 18:05:58.243 | 4147 packets transmitted, 1617 received, 61.008% packet loss, time 4210147ms
2022-02-10 18:05:58.247 | rtt min/avg/max/mdev = 0.532/1.277/16.031/0.713 ms
2022-02-10 18:05:58.251 | Ping loss higher than 0 seconds detected (989 seconds)
2022-02-10 18:05:58.255 | 
2022-02-10 18:05:58.261 | 
2022-02-10 18:05:58.265 | MSG:

Comment 10 Sofer Athlan-Guyot 2022-02-14 13:02:06 UTC
This didn't work because of the way the OSP update framework works: we deliver the patch on the Controller nodes first and then on the Compute nodes.

So triggering

  ovs-vsctl set open . external_ids:ovn-match-northd-version=true

on the OSP Controller role[1] is not enough to prevent the issue from happening.

This shows the parameter being taken into account on the controllers.

DFG-upgrades-updates-16.1-to-16.2-from-passed_phase2-HA_no_ceph-ipv4-minimal/10 $ rg -z external_ids:ovn-match-northd-version
undercloud-0/home/stack/overcloud_update_run-Controller.log.gz
8591:2022-02-11 21:22:58 |         "<13>Feb 11 21:22:18 puppet-user: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created",
46046:2022-02-11 22:08:36 |         "<13>Feb 11 22:07:53 puppet-user: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created",
80883:2022-02-11 22:49:15 |         "<13>Feb 11 22:48:39 puppet-user: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created",

controller-2/var/log/extra/journal.txt.gz
69463:Feb 11 22:48:39 controller-2 ovs-vsctl[303431]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true
69471:Feb 11 22:48:39 controller-2 puppet-user[301964]: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created

controller-1/var/log/extra/journal.txt.gz
60384:Feb 11 22:07:53 controller-1 ovs-vsctl[105854]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true
60385:Feb 11 22:07:53 controller-1 puppet-user[104641]: Notice: /Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-match-northd-version]/ensure: created

controller-0/var/log/extra/journal.txt.gz
52737:Feb 11 21:22:18 controller-0 ovs-vsctl[955074]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-match-northd-version=true

but the result is:

7430 packets transmitted, 2721 received, 63.3782% packet loss, time 7596893ms
rtt min/avg/max/mdev = 0.409/0.863/22.616/0.804 ms
Ping loss higher than 0 seconds detected (1760 seconds)


[1] which is what the patch does; the order of delivery is the update order (i.e., Controller first)
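The semantics of the knob can be illustrated with a toy function (this is not OVN source code, just the decision logic as documented): with the knob enabled, ovn-controller keeps its existing flows and skips recomputation while its internal version differs from the one ovn-northd advertises in the southbound database.

```shell
# Toy illustration (not OVN source code) of what the
# ovn-match-northd-version knob gates: when enabled, ovn-controller
# holds off on reprocessing the southbound DB while its internal
# version string differs from the one ovn-northd advertises.
should_recompute() {
    local controller_ver="$1" northd_ver="$2" match_enabled="$3"
    if [ "$match_enabled" = "true" ] && [ "$controller_ver" != "$northd_ver" ]; then
        return 1    # version mismatch: keep current flows, skip recompute
    fi
    return 0        # versions match (or knob disabled): process updates
}
```

This also illustrates why setting the knob on the Controller role alone does not help here: the ovn-controller instances on the Compute nodes still run the old version when northd gets updated, so their flows are disrupted anyway.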

Comment 11 Carlos Camacho 2022-02-14 14:11:55 UTC
*** Bug 2052576 has been marked as a duplicate of this bug. ***

Comment 14 Sofer Athlan-Guyot 2022-02-16 13:35:39 UTC
Hi,

Following the recommendation in https://bugzilla.redhat.com/show_bug.cgi?id=2052494#c12,
we have to add a new step to the update process to update ovn-controller before the
ovn-northd database.

This will result in a new stage before section 3.3, "Updating all Controller nodes", in [1]:

 Update all ovn-controllers.

   openstack overcloud external-update run --stack qe-Cloud-0 --tags ovn --no-workflow

At least that's the idea; the patch is still under development.


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-all-controller-nodes_updating-overcloud
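The resulting ordering can be sketched as a dry run; the `run` wrapper only echoes each command here (drop the echo to actually execute), and the stack name qe-Cloud-0 is the one from this job, so substitute your own. The `--limit Controller` form of the second command is illustrative.

```shell
#!/bin/bash
# Dry-run sketch of the update ordering the fix introduces. The 'run'
# wrapper only echoes the commands; remove the echo to execute them.
run() { echo "+ $*"; }

# New stage: update every ovn-controller container first.
run openstack overcloud external-update run --stack qe-Cloud-0 --tags ovn --no-workflow

# Then the existing stage 3.3: update the Controller nodes, which
# brings in the new ovn-northd.
run openstack overcloud update run --stack qe-Cloud-0 --limit Controller
```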

Comment 15 Sofer Athlan-Guyot 2022-02-16 13:38:30 UTC
This is implemented in https://review.opendev.org/c/openstack/tripleo-heat-templates/+/829393 . 

Note that the other patches, while perhaps still good to have (they would trade a
data plane cut for a control plane cut), do not solve this issue.

Comment 16 Sofer Athlan-Guyot 2022-02-22 14:41:54 UTC
Hi @kgilliga ,

we're going to need a 16.2 documentation update for the update ... (sic)

This is going to be a new step between 3.2 and 3.3.

"""
* Running the OVN-controller update.

Log in to the undercloud as the stack user.

Source the stackrc file:

$ source ~/stackrc

Run the OVN update:

$ openstack overcloud external-update run --stack <stack_name> --tags ovn

This updates all ovn-controller containers to the new version. This is in
accordance with the OVN upgrade procedure, where ovn-controller must be
updated *before* the ovn-northd service.

Note that the ovn-controller services usually run on the OSP Compute role
servers, while ovn-northd runs on the OSP Controller role servers.

"""

Something like this.

We will eventually need the same for 16.1, but we should start the review process
for 16.2 first. Do you need/want another BZ for this documentation issue?

Thanks,

Comment 37 errata-xmlrpc 2022-03-23 22:30:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack Platform 16.2 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0995

Comment 40 Sofer Athlan-Guyot 2022-08-01 11:18:39 UTC
Hi,

Provided the cut was temporary rather than permanent (please confirm), you are likely hitting bug https://bugzilla.redhat.com/show_bug.cgi?id=2094265. That is an issue where OVN takes more time than expected to flush and recreate the OVS flows because of a schema modification. The issue does not happen every time, and we did not hit it in CI before release. In any case, using the OVN procedure is still mandatory: without it, the cut persists until all Compute nodes are updated [1]. That is what this bugzilla was about: making sure we follow the OVN upgrade procedure in director.

Hope it helps.

[1] This is actually an entirely different issue, where OVN requires that ovn-controller be updated before the OVN north database.

