Bug 2052494
| Summary: | Ping loss on VMs started on a previous version of OVN/OVS when updating the OVN DB. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Sofer Athlan-Guyot <sathlang> |
| Component: | ovn-2021 | Assignee: | OVN Team <ovnteam> |
| Status: | CLOSED WONTFIX | QA Contact: | Jianlin Shi <jishi> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | FDP 22.A | CC: | ctrautma, dceara, egarciar, ekuris, i.maximets, jiji, jpretori, mmichels, nusiddiq, smooney |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-21 08:26:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2050154 | ||
Description
Sofer Athlan-Guyot
2022-02-09 11:55:51 UTC
Hi, I would like to share more about the OSP update process and better understand the consequences of setting external_ids:ovn-match-northd-version to true. I would also like to emphasize that we didn't need that parameter to have successful updates until recently, for both 16.1->16.2 and 16.2->16.2 OSP updates.

The OSP/tripleo upgrade process is:

- First, the OSP controllers are upgraded. They run these services:

```
8d6548183046 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-server-ovn:16.2_20220201.1 kolla_start 8 days ago Up 8 days ago neutron_api
e1a7b08f0724 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-novncproxy:16.2_20220201.1 kolla_start 8 days ago Up 8 days ago nova_vnc_proxy
bf7a8d4d09ae undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:16.2_20220201.1 kolla_start 8 days ago Up 8 days ago ovn_controller
56d8a7505b33 cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest /bin/bash /usr/lo... 7 minutes ago Up 7 minutes ago ovn-dbs-bundle-podman-0
```

- Later (possibly days later on very big OSP cloud environments with hundreds of compute nodes, or following the cloud operator's policy), the computes get updated. They run these services (for instance):

```
5abb4b9d066c undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:16.1_20211111.1 /bin/bash -c HAPR... 8 days ago Up 8 days ago neutron-haproxy-ovnmeta-3cce3842-61c3-4cc3-abd3-aea6c3434313
3ad2093dcc0a undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:16.1_20211111.1 kolla_start 9 days ago Up 9 days ago ovn_metadata_agent
46624d9b2125 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:16.1_20211111.1 kolla_start 9 days ago Up 9 days ago ovn_controller
```

I understand that the OVN manual says to update ovn_controller before updating the ovn-dbs, but that is not practical/possible given the current OSP architecture.

Until recently, the "backward" compatibility of OVN was maintained for everything that we call OSP 16:

- 16.1->16.2 was working (no ping loss);
- 16.2 GA -> 16.2 latest was working (no ping loss);

and the parameter external_ids:ovn-match-northd-version wasn't set (so false by default). So maybe we could look at which OVN change triggered the "backward" incompatibility; perhaps it could/should be possible to keep the current flows working within an OSP 16.X update.

I'm asking because I'm afraid of the exact consequences of setting that parameter to true in OSP from a customer perspective. It sounds to me that, when set, it prevents any new network-related object from being created: until all of the OSP computes are updated, the customer cannot create new VMs on the cloud. Could someone elaborate on the consequences setting this to true has, knowing that the OSP compute nodes can be updated days after the OSP controller nodes?

In parallel I've triggered tests of an update using that parameter; details in https://bugzilla.redhat.com/show_bug.cgi?id=2050154#c8 and https://bugzilla.redhat.com/show_bug.cgi?id=2050154#c9.

Hi @dceara,
the test of the patch delivering external_ids:ovn-match-northd-version=true doesn't work correctly in the OSP framework because of the way the update works. (It would certainly work otherwise, with the caveat mentioned earlier: a cut in control plane connectivity during the OSP update.)
We're going to look into other ways to deliver this, but in the meantime I've made a table of what was working, so maybe you can find what patch caused such a change:
- updating from ovn-2021-21.06.0-24/openvswitch2.15.x86_64 2.15.0-26 to ovn-2021-21.09.0-20/openvswitch2.15.x86_64 2.15.0-55 was working (i.e. updating ovn-northd before ovn-controller didn't cause any trouble);
- when updating to ovn-2021-21.12.0-11/openvswitch2.15.x86_64 2.15.0-26, updating ovn-northd before ovn-controller causes a cut in the flows.
So something between ovn-2021-21.06.0-24 and ovn-2021-21.12.0-11 causes this. Identifying it could maybe provide a short term solution while we're working out the details of how to proceed in the future.
> Does connectivity restore once you update the compute nodes?
Yes. It has to be noted, though, that on a real site this could be days later.
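For reference, the parameter discussed in this thread lives in the external_ids column of the local Open_vSwitch table on each compute node. Below is a minimal sketch of what rolling it out could look like; the inventory and the ssh wrapping are hypothetical, and only the ovs-vsctl invocation itself is the real interface:

```python
# Hypothetical inventory of compute nodes (placeholder names).
COMPUTES = ["compute-0", "compute-1", "compute-2"]

# The external_id is set in the local Open_vSwitch table on each
# compute node; once true, ovn-controller declines to install flows
# produced by a mismatched ovn-northd and keeps the existing flows.
CMD = ("sudo ovs-vsctl set open_vswitch . "
      "external_ids:ovn-match-northd-version=true")

# Print the command that would be run on each compute node.
for host in COMPUTES:
    print(f"ssh {host} {CMD}")
```

Because the value is stored in the local OVSDB, this is the "one-time operational change" mentioned later in the thread: it persists across OVS/OVN upgrades unless the databases are removed.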
(In reply to Sofer Athlan-Guyot from comment #9)

> Hi @dceara,

Hi Sofer, CC: Numan.

> the test of the patch delivering external_ids:ovn-match-northd-version=true
> doesn't work correctly in the OSP framework because of the way the update
> works. This would certainly work (with the caveats mentioned earlier, cut in
> control plane connectivity during OSP update).

The only supported and recommended OVN upgrade procedure is to upgrade ovn-controller (computes) first and then the central components (ovn-northd and the NB/SB databases on controller nodes). If the CMS doesn't ensure this, there might be packet drops, because OVN cannot ensure forward-compatibility in ovn-controller. To *partially alleviate* the impact, OVN recently started providing ovn-match-northd-version. Obviously, the best way to upgrade is still to follow the only supported and recommended procedure, i.e., ovn-controller (computes) first.

Quoting from the commit that introduced "ovn-match-northd-version":

> OVN recommends updating/upgrading ovn-controllers first and then ovn-northd
> and OVN DB ovsdb-servers. This is to ensure that any new functionality
> specified by the database or logical flows created by ovn-northd is
> understood by ovn-controller.

This doesn't change in any way the recommended upgrade procedure.

> We're going to look into other ways to deliver this, but in the meantime
> I've made a table of what was working, so maybe you could find what patch
> caused such a change:
>
> - updating from ovn-2021-21.06.0-24/openvswitch2.15.x86_64 2.15.0-26 to
>   ovn-2021-21.09.0-20/openvswitch2.15.x86_64 2.15.0-55 was working (ie
>   updating ovn-northd before ovn-controller didn't cause any trouble)
> - when updating to ovn-2021-21.12.0-11/openvswitch2.15.x86_64 2.15.0-26
>   updating ovn-northd before ovn-controller cause a cut in the flows.
>
> So something between ovn-2021-21.06.0-24 and ovn-2021-21.12.0-11 causes
> this. Identifying this could maybe provide a short term solution while we're
> working out the detail of how to proceed in the future.

The commit that added a new OVN action (internal action) in that interval is:
https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a

Just to stress this: it is not a regression, and that commit doesn't do anything wrong; I still think it's the CMS's responsibility to ensure that components are upgraded in the correct order. Wrt. a short term solution, I don't see a way to ensure that this doesn't happen in the future: OVN just cannot provide forward-compatibility. Numan, do you maybe have alternative suggestions?

> > Does connectivity restore once you update the compute nodes?
>
> Yes. It has to be noted though that on real site, this could be days later.

Ack.

Regards, Dumitru

There might be ways to remove the current restriction that ovn-controllers must be upgraded before the central components. That, however, is not something that can be implemented as a bug fix and should be treated as an RFE. One thing that comes to mind is having northd check what feature set is supported by the running ovn-controllers. This needs detailed scoping, though, because it opens up different scenarios (e.g., stale chassis records might force northd to not use new features; development complexity increases).

In the meantime, the CMS will have to:

- upgrade ovn-controller first, OR (at least)
- set ovn-match-northd-version=true on all compute nodes before performing the upgrade. This can be a one-time operational change; it will be persisted across OVS/OVN upgrades (unless the DBs are removed).

The only supported way to upgrade OpenStack is to upgrade the controllers first and the computes second, so this is a major problem for using OVN with OpenStack.
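Conceptually, the guard that ovn-match-northd-version enables is an equality check between the internal version ovn-northd advertises through the Southbound database and the one compiled into ovn-controller; on a mismatch, ovn-controller leaves its previously installed flows alone. A rough Python sketch of that decision follows — the function and argument names are hypothetical, not OVN's actual code, and the version strings shown are the package versions from this thread, not OVN's internal version format:

```python
def may_install_flows(northd_internal_version: str,
                      controller_internal_version: str,
                      match_northd_version: bool) -> bool:
    """Decide whether ovn-controller should translate logical flows.

    With matching disabled (the default), flows are always translated,
    even when ovn-controller may not understand actions produced by a
    newer ovn-northd.  With matching enabled, a mismatch freezes the
    currently installed flows until the versions align again.
    """
    if not match_northd_version:
        return True
    return northd_internal_version == controller_internal_version


# The problematic upgrade window: northd already on 21.12, while
# ovn-controller on a not-yet-updated compute is still on 21.06.
print(may_install_flows("21.12.0-11", "21.06.0-24", False))  # True
print(may_install_flows("21.12.0-11", "21.06.0-24", True))   # False
```

This also illustrates the trade-off discussed above: with matching enabled, a mismatched compute simply stops picking up new logical flows, which is why new network objects don't take effect there until the compute is updated.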
Within a minor z-stream release you are technically allowed to upgrade computes first, but that is untested, and you are never allowed to do a major release upgrade of the computes first: the controllers must always be upgraded before any other host in a major release. So unless this is addressed, you cannot do a major upgrade sanely. You would have to first upgrade the ovn container on the computes with a container from a newer RHEL (which RHEL does not actually support), then upgrade the controllers, then upgrade all the other containers on the computes, then do the RHEL upgrade on the computes. That is not tested or supported by our tooling today.

The supported upgrade path for major upgrades is:

1. upgrade all the controllers
2. upgrade the containers on all the computes
3. upgrade the RHEL on all the computes

The OVN restriction would force a step 0 to upgrade just the ovn container, and would also require that OVN work with the older OVS supplied by the un-upgraded RHEL. For OSP 16 to OSP 17 that is a RHEL 8.4 to RHEL 9.0 version split; for 17 and 18 they will at least be the same major version of RHEL. I see the current requirement to upgrade OVN on the computes as a potential OSP 17.0 release blocker, since it will block 17 upgrades, and a blocker for supporting OSP 18 controllers with OSP 17 computes as part of the OSP 18 roadmap.

(In reply to smooney from comment #13)

> you would have to first upgrade the ovn container on the compute with a
> container that will be from a newer rhel which rhel does not actually support

Don't steps 2 and 3 below already imply that we're running newer containers on older RHEL for some period of time?

> then you have upgraded the controllers. then you have to upgrade all the
> other containers on the computes then you have to do the rhel upgrade on
> the computes.
>
> that is not tested or supported by our tooling today.
> the supported upgrade path for major upgrades is
> 1 upgrade all the controllers
> 2 upgrade the containers on all the computes
> 3 upgrade the rhel on all the computes
>
> the ovn restriction would force a step 0 to upgrade just the ovn container
> and would also require that ovn work with the older ovs supplied by the
> un-upgraded rhel.

OVN is not tied to a specific version of OVS, so that should not be a problem.

Yes, it does imply that: we already do that today for the OSP 13 (RHEL 7) to OSP 16 (RHEL 8) FFU. In the hybrid state, which is defined as running fully upgraded OSP 16 controllers on RHEL 8 with OSP 16 containers on RHEL 7 computes, we already run the OSP 16 (RHEL 8) nova compute container on a RHEL 7 host. In this hybrid mode the nova-libvirt container is kept back to the OSP 13/RHEL 7 container, and it's just the Python nova-compute agent container that is from the newer RHEL. We only package one version of the container per release today.

We do not currently allow that hybrid state to be used for extended periods, and the only operation that is supported without a support exception is live migration. So this transitional state must be entered and exited in the same maintenance window, which typically does not last more than 48 hours.

So steps 2 and 3, in the context of the OSP 16 to 17 upgrade, do imply running the OVN RHEL 9 container on a RHEL 8 host for a short period of time, but in 16->17 all hosts will be running the same version of OVN once they are in the hybrid state. There is a goal to support the hybrid state effectively indefinitely in 17.1: https://bugzilla.redhat.com/show_bug.cgi?id=2006966

For the OSP 17 to 18 upgrade (RHEL 9.0 to 9.x), the intent is to allow you to upgrade the OpenStack controllers without touching the computes at all, and then later upgrade one compute at a time.
In practice we expect that customers would upgrade their central components in one maintenance window, then in successive short maintenance windows upgrade batches of compute nodes. Part of the 18 goal, however, is to not need to touch every host in the deployment to do that first step of upgrading the controllers. OVN's current design would require that all the agents on the distributed compute nodes be upgraded before the OSP upgrade could start, or we would have to not upgrade OVN on the controllers as part of the initial step and only upgrade it at the very end. That would effectively be the following:

* pin OVN to the OSP 17 version and upgrade the controllers;
* upgrade computes in batches;
* after all computes are fully upgraded to 18, upgrade OVN on the controllers.

Since we expect the above to take place over a period of 6-12 months or more, that basically means Neutron in OSP 18 would have to be fully functional with the OSP 17 version of OVN for an extended period of time. It might be possible to do that, but it would significantly increase the testing burden for the network DFG.

So ideally, if OVN could provide a way to upgrade the centralised part first and the distributed part second, that would simplify integration with layered products like OpenStack, or possibly OpenShift; I'm not familiar with their upgrade mechanisms, but I would guess they also want to upgrade the centralised components first, as that tends to be how distributed systems choose to address this (e.g. the server supports older clients rather than the clients supporting older servers, though both can be done). Ceph also allows the centralised parts to be upgraded before the Ceph clients are updated.

Since this issue was closed for 17 and 18 (BZ#2057568), I don't think it makes sense to have it filed in ovn-2021, which is equivalent to the older OSP 16. Please reach back or reopen the other BZ as needed.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days